Sunday, August 24, 2008

Don't Use StringBuffer!

One thing I have noticed among Java API users is that some don't seem to know that there is a difference between StringBuffer and StringBuilder. There is: StringBuffer is synchronized, and StringBuilder isn't. (The same is true for Vector and ArrayList, as well as Hashtable and HashMap). StringBuilder avoids extra locking operations, and therefore you should use it where possible. That's not the meat of this post, though.

Among those who know that StringBuffer is synchronized, they sometimes think that it has magical thread-safety properties. Here's a simplified version of some code I wrote recently

final StringBuilder sb = new StringBuilder();
Runnable runner = new Runnable() {
@Override public void run() {
sb.append("1");
}
};
Thread t = new Thread(runner);
t.start();
try { t.join() } catch (InterruptedException e) {}
// use sb.
A fellow looked at this code, and said that he thought I should use StringBuffer, because it is safer with multiple threads. Well, no. Using StringBuffer here would actually be bad coding practice.

The issue here is that it isn't actually thread-safe to use StringBuffer. It is just pretend thread-safe. Yes, that's an official term. It is completely useless in practice to have multiple threads pounding on a StringBuffer, if at least one of them is mutating it, because you then can't make any guarantees about the StringBuffer's contents. Let's see what would happen if I added another thread to the example we were examining, and replaced the Builder with a Buffer:

Thread 1:
sb.append("a");

Thread 2:
sb.append("b");

Thread 3:
join 1,2
print(sb.toString());
Sure, it is "thread-safe", in the sense that there are no data races (which are basically concurrent accesses without adequate synchronization). But you have no idea what thread 3 will print: "ab" or "ba". I would have to introduce more synchronization to make this produce a sensible result. The locks that come with the StringBuffer haven't helped anything.

What I did in that original code was a perfectly legitimate handoff, and a widely used Java idiom. One thread hands off an object to another thread in a properly synchronized way, and then stops using it. The second thread uses it for a while, stops using it, and hands it back to the first.

Using Java memory model words: The initialization of the StringBuilder happens-before the spawn of the thread, which happens-before the append operation, which happens-before the join. Thread starts and joins induce happens-before edges, which is something that several people have managed to miss recently.

The additional locking provided by the StringBuffer class is completely superfluous: the guarantees are provided entirely by the Thread.start and Thread.join invocations. Using StringBuffer is even harmful, in the sense that you need to perform a whole bunch of (relatively) expensive synchronization, and prevent the compiler from aggressively optimizing the code.

To go back to the original point, I would even argue that using the StringBuffer class would be bad coding style: it would imply that that extra locking is providing something of benefit in the presence of multiple threads, which would be misleading.

As it turns out, I have yet to see a practical use case for the StringBuffer class over the StringBuilder class. A quick poll of Josh Bloch (who also works at Google), indicates that he has yet to see one, too. There are similar problems in some of the synchronized OutputStream classes in Java, too -- who writes to an OutputStream in multiple threads?

In short: a locked version of an API isn't not automatically better than an unlocked one just because you happen to be using it in multiple threads.

I've often thought that a short discussion of this should be added to the StringBuffer JavaDoc, since everyone who maintains that JavaDoc believes this, too.


A final note. Some people think that there is no performance impact to using StringBuffer instead of StringBuilder, because of all of the clever things JVMs do to eliminate synchronization (I can blog about this, too, if you want). This is a little unclear. Whether it can even perform these optimizing transformations definitely depends wildly on which JDK you use; I wrote a microbenchmark to test it, and the results differed on different JDKs -- but all that microbenchmarks really demonstrate is the performance of the microbenchmark.


ETA One of the comments points out that javac often replaces Strings with StringBuilders when constructing them. This is correct, and this is because javac realizes that when you are performing changes to the object, it makes more sense to using StringBuilders than Strings. If you have code like this:
while (guard) {
string += value;
}
You are almost always better off writing that like this:
while (guard) {
stringBuilder.append(value);
}
string = stringBuilder.toString();
The first way will allocate a new String object on every iteration, and copy the values of the previous String object. The second way will allocate a single array and append the characters to it. I have improved the performance of some pieces of code several orders of magnitude by switching from the first form to the second. Don't use String concatenation in a loop!

Monday, August 18, 2008

Top 3 Reasons Why Constructors are Worthless

In my last post, I made an offhand reference to the fact that object constructors are worthless in Java. I was asked why this is, so I thought I'd fill in the details.

This topic is a little more well-understood by the developer community than some of the concurrency tidbits that I usually discuss. People who use dependency injection toolkits like Guice and Spring's inversion of control tend to have very strong feelings on the advanced suckitude of constructors. I know quite a few developers who claim that use of Guice changed the way they code (for the better).

ETA: I've had a few people tell me just to use SPI. The point of this post is not that DI is wonderful, it is that constructors are awful. SPI doesn't stop constructors from being awful. In fact, it is yet another horrible solution, what with all of those Foo and FooImpl classes. If you like SPI, go ahead and use it. Furthermore, neither DI nor SPI addresses two of the three issues I mentioned.

So, without further ado:
  1. They are the one kind of instance method that can't be overridden. This is the major pain point for most people using constructors. For testing purposes, programmers like to create mock objects, which duplicate the functionality of the objects you are not testing. They are usually created because the objects they represent aren't easy to construct properly in a unit test, either because they are non-deterministic, or require interaction with a client, or are simply too complex to create easily.

    One of the major difficulties of creating mock instances in Java is that the way we create objects is to call the constructor:
    GiantNetworkedDatabaseConnector connector = 
    new GiantNetworkedDatabaseConnector(); // connects to the giant networked database.

    You don't really want to connect to the giant networked database when you are unittesting your code. The typical approach is to abstract the method into a factory:
    GiantNetworkedDatabaseConnector connector = 
    GiantNetworkedDatabaseConnector.newConnector();
    The newConnector code returns an object. But the method is a static one, which means that you can't replace that at runtime, either. You actually need a static field to hold the factory that makes the connector:
    GiantNetworkedDatabaseConnector connector = 
    GiantNetworkedDatabaseConnector.getFactory().newConnector();
    and then you need to set the factory in some other part of the program:
    GiantNetworkedDatabaseConnector.setFactory(mockFactory);
    and then you need to hope and pray that there aren't multiple threads trying to inject their own factories.

    Getting around this is, of course, the stock-in-trade of the dependency injection frameworks mentioned above. Guice and Spring tend to do what they do very well. It is just completely awful that they have to do it.

    In short: the constructor is just like a static method. You can't override it properly. You can't test it properly. Constructors bring pain.

  2. All immutable state must be set up front. I spent the last few blog entries describing how and why immutable state is good. But all immutable state must be set in the constructor. The degree to which this is terrible doesn't necessarily occur to people who are used to programming in Java. You just have to stop, take a step back, and realize that having separate StringBuilder and String classes is violently counterintuitive.

    Why do these two classes exist? Because there are actually two parts to String creation:

    1. Generate the String, and

    2. Publish it.

    The StringBuilder class exists solely because there is no way to have a clean break between step 1 and step 2. You need a way to say "I will be constructing this object for a while, and then I will be done." For those familiar with Ruby, you are probably thinking of the "freeze" operation at this point, which makes any mutable object into an immutable one. That is closer to what I am imagining, but the hard part of that is enforcing the fact that you call it.

    This problem isn't limited to Strings, of course. A minute's hard thought will make you think of a dozen other possible uses. Code bases tend to have lots of other Builder classes; I probably write a couple a month.

    In short: immutable objects can't be constructed properly when you can't just shove all of the built state into the constructor. Constructors bring pain.

  3. They make custom serialization and deserialization more painful. I'm not sure that I can do this topic justice in a single blog post. For a good overview of why serialization is awful, go purchase and read Effective Java.

    A brief precis: when you do deserialization, you need an instance of the object so that you can deserialize into it. The way this works is that the default constructor (the one with no arguments) is invoked for the non-serializable classes. This means that ETA: I phrased this incorrectly here, so I'm editing it. a) you need a visible no-arg constructor if something ever might get serialized (which is hard to predict), and b) that constructor might get called.

    That's just for the default mechanism, of course. For anything that requires custom deserialization (i.e., anything non-trivial), you have to use the the magic readObject method, which is not a constructor. That method, of course, doesn't have the semantics of constructors, so can't set final fields (as discussed in my post on deserialization and reflection).

    Serialization is, of course, in general, completely awful. It is no surprise that people turned to XML. At Google, we use a more compact system called protocol buffers, which are still rather hacktackular, but definitely better than serialization. Serialization is awful, and constructors make it worse. Constructors bring pain.

At one level removed, a big part of the problem is that object construction and initialization are tied together. Imagine if you could construct an object with all of its fields set to null or zero, and then call as many initializer methods as you wanted:
  1. Client code could create the object and could override the initializer methods, thereby neatly avoiding the problem of not being able to override the initializer. Even better, you could specify the type you wanted created at runtime:
    GiantNetworkedDatabaseConnector buildConnector(
    Class<? extends GiantNetworkedDatabaseConnector> c) {
    // This is impossible with standard Java bytecode.
    return new c.init();
    }

    // MockGiantNetworkedDatabaseConnector overrides the init() method
    GiantNetworkedDatabaseConnector c =
    buildConnector(MockGiantNetworkedDatabaseConnector.class);
  2. You could call initializer methods, and have them set immutable state (possibly using the notion of freeze to make an object immutable):
    String s = new String.append("foo").append("bar");
    assert s.equals("foobar");
    freeze s;

  3. You wouldn't have to worry about serialization calling the no-argument constructor, because the no-argument constructor wouldn't do anything. readObject could have initializer semantics, and be able to set final fields.
The point isn't really the solution: this solution is broken because the object doesn't always maintain its invariants (until the init() methods are done being called). You can continue to brainstorm to fix this, and come up with as many straw man approaches as you want, but the fact remains that Java programmers are still going to be stuck with constructors.

There isn't much of a moral to this story, other than that constructors bring pain.