Java Concurrency (&c)

Friday, December 18, 2015

Why I don't like System.gc()

I wrote this for internal-to-my-company consumption, but I couldn't find a reasonable place to put it. And then I remembered I had a blog.

The question to me was whether is ever made sense to do a System.gc(), and, more specifically, whether it made sense to do System.gc() followed by System.runFinalization().

I can't claim it is always wrong. It suffers from what some people call the Politician's syllogism: Something must be done, this is something, therefore we must do it. There's no control over GC other than a big red button labeled "System.gc()", so this is always the something that must be done.

In practice, the biggest problem with System.gc() is that what it does is rather unpredictable. In fact, there's no guarantee that it will do anything. In fact, Hotspot has a -XX:DisableExplicitGC flag to stop it from doing anything. So, it can sometimes solve an immediate problem in a system-dependent way, but you are introducing a significant source of overhead with ill-defined behavior into your code, and that behavior can change with a change of flags or a change of runtime.

Using the concurrent collector (our default at G), System.gc() forces a Full, stop-the-world GC instead of a concurrent GC. By calling System.gc(), you are stopping the world for potentially much longer than just letting a concurrent collection happen would. At G, our Full STW GC under CMS is parallel, and the rest of the world's is serial, so it will finish faster for us than it will for everyone else. Except the rest of the world has the parallel GC by default, so it will be parallel. Except that they're changing it to G1 for Java 9, which has a serial full STW GC, so it won't be parallel.

Of course, if you are using the concurrent collector, it matters if the user passes -XX:+ExplicitGCInvokesConcurrent, in which case System.gc() will happen concurrently, which would mean that the application will continue executing, and you will probably call runFinalization() before the GC has enqueued the finalizable objects.

It should be relatively clear at this point that you can't depend on it for performance, because you can't guarantee its performance, and you can't depend on it for correctness, because you have no idea what it is going to do.

If you really want to wait for a GC to occur and for some percentage of objects to be finalized, you probably to do something like:

Create a CountDownLatch with a count of 1.
Create an object whose finalizer calls countDown().
Make the object garbage immediately (bearing in mind that object lifetimes aren't what you think they might be).
Call System.gc().
Call await() on the CountDownLatch.
Call runFinalization to make sure all of the pending finalizers are finished (but do it in a loop with a timeout!)

Which will probably work most of the time, unless System.gc() really doesn't do anything and a GC is never triggered (or, worse, only young-gen GC is ever triggered, so that objects in the old gen aren't collected or finalized, and you think something has been finalized when it hasn't).

In short, usage should be carefully considered.

As for runFinalization: it suffers from the same "I have one button for this" issue. What it does (in Hotspot) is add a thread as a worker for processing finalization, and then return when that thread is done (i.e., the finalization queue is empty, although note that the finalization queue being empty is not the same as all finalizers having been run, since there are two other threads processing them).

How much are you buying by blocking until the finalization queue is empty? Do you really want to stop the application for this? Is it going to do what you think it is going to do? Are you sure the finalizer you care about is enqueued?

Hey, if you have lots of GCs, and expensive finalizers, one could even imagine the finalization queue not ever being empty. This is fairly unlikely, in practice, but would be really, really nasty if it happened.

Anyway, let the buyer beware. Google tells me that the first couple of hits for "System.gc()" are also nasty warnings, but I had already written this, so there is no harm in posting it anyway.

Tuesday, July 9, 2013

Lightweight Asynchronous Sampling Profiler

I finally got off of my lazy elbows and wrote a proof-of-concept sampling profiler for public consumption based on the Hotspot JVM's AsyncGetCallTrace interface.

The benefit of using this interface to write a profiler is that it can gather stack traces asynchronously. This avoids the inaccuracies of only being able to profile code at safe points, and the overhead of having to stop the JVM to gather a stack trace. The result is a more accurate profiler that avoids the 10-20% overhead of something like hprof. For a more thorough explanation as to why this is interesting, see my 2010 writeup on the subject.

Another feature of the profiler is that it will tell you what stack traces are being executed, rather than, like many sampling Java profilers, what stack traces could be executed. That is, with traditional profilers, if you stop the VM and ask it for stack traces, it will tell you all of the stack traces from all of the threads that are not currently blocked. This profiler tells you the stack trace from the thread that got interrupted when you asked for the stack trace. The tradeoff is reasonably important if you are trying to think about thread starvation and scheduling; with this approach, you get to see what isn't being scheduled. With the "get everything" approach, you get more samples, so there is a meaningful benefit to both approaches.

Code is here. I'm not likely to invest a great deal of time in making it wonderful, but patches from interested parties are welcome.

Edited to add: Some of the commenters below wonder why the stack trace grabber can't print out the stack traces when it gets them. The answer to this lies in what I meant when I called this function asynchronous. I was a little (deliberately) vague on what "asynchronous" means. When I say that the timer that goes off to grab a stack trace interrupts a running thread asynchronously, I mean that it can interrupt the thread at any point. The thread can be in the middle of GC, or the middle of adding a frame to the stack, or in the middle of acquiring a lock. The reader can imagine that writing something asynchronous-safe is somewhat more tricky than writing something thread-safe: not only do you have to worry about what other threads are doing, you also have to worry about what the current thread is doing.

Needless to say, there are a lot of things you can't (or shouldn't) do asynchronously. For example, you can't call malloc, because if the thread is already in the middle of executing malloc when you interrupt it, you can destroy malloc's internal data structures. In fact, for the standard C library functions, the POSIX standard specifies a list of functions that have to be async-safe (see man 7 signal for the list, if you are curious). Note that functions like printf are not async-safe, so you can't call them asynchronously. As far as I know, AsyncGetCallTrace is pretty much the only non-trivial async-safe JNI/JVMTI call. Others, including the standard JVMTI stack trace gathering function and the functions that make sense out of what AsyncGetCallTrace returns, are not async-safe, and are likely to crash your program if you try to call them asynchronously.

For this programming exercise, I was lazy, and printed the stack traces at VM exit. That doesn't mean that a sufficiently clever approach can't get the stack traces incrementally. A separate, synchronous thread can read the data that was written out asynchronously. You can't use a standard blocking lock to protect the data, though, because standard lock implementations aren't async-safe. You could use a spin lock, but you would have to be careful about the async handler interrupting a thread that holds the spin lock. In that case, the async handler won't be able to acquire the lock, so it would have to do a trylock and give up if it can't complete the acquire. There are also clever things you can do with atomic operations if you apply some creativity; I'll leave those as an exercise for the reader.

Tuesday, February 7, 2012

Transactional Hardware on x86

While I'm posting anyway, I might as well mention the big concurrency news of the day: the new transactional memory specification from Intel. Transactional memory sounds a lot like what it is. Your code begins a transaction and continues to execute it until it ends, performing some memory writes along the way. If there are no conflicting transactions (i.e., transactions that write to the same memory that your code wrote) executing at the same time as yours, your transaction will end normally and commit your memory writes. If there are conflicting transactions, your transaction will abort and roll back your writes.

Transactional memory on x86 will come in two flavors:

Hardware Lock Elision (HLE), which consists of XACQUIRE and XRELEASE instruction prefixes. These can optimistically turn lock regions into transactions. What's the advantage? Well, transactions can execute concurrently, as long as they aren't writing to the same memory. Locks are serialized: only one thread can be in a lock region at a time. So this mechanism can (hypothetically) execute more code concurrently. Also - and perhaps more importantly - it should be much cheaper to acquire a lock using XACQUIRE than it is to acquire it without, since with XACQUIRE you don't have to do anything other than tell the system to start looking out for conflicts. Between the two advantages, HLE should be able to provide a nice performance boost. These are backwards compatible, so that software written to use them can be run on earlier hardware.
Restricted Transactional Memory (RTM), which consists of XBEGIN, XEND and XABORT instructions. These have the traditional transactional memory semantics: you begin a transaction, and if you abort before you end it, your writes are undone. These are very flexible, but are not backwards compatible (i.e., you can't use them on older hardware).

I like the idea of hardware-level transactions / atomicity. Compilers can be written to take advantage of them, and they can be a win for people implementing synchronization primitives and doing very low-level multithreading. They have the potential of making synchronization very cheap when there isn't much contention. AFAICT, Azul has claiming ~~very large~~ benefits from their HLE-style synchronization for years (edited: They claim <10%: see this presentation on hardware transactional memory from Cliff Click).

There is also likely to be a renaissance in lock-free data structures. Currently, lock-free data structures are pretty much all built on top of one non-blocking atomic instruction: the compare-and-swap (LOCK: CMPXCHG on x86, or AtomicReference.compareAndSet() to Java programmers). Trying to shoehorn everything you might want to do atomically into a single instruction that operates on one word has led to some very interesting data structures over the last 20 years or so, but is really the wrong way to solve the problem.

The flip side of this is that I don't think these will have much of an impact on most user-level synchronization. The HLE is really just an optimization for current lock implementations. The RTM aborts if you execute any of a large number of instructions (like PAUSE or CPUID, or many operations that affect control flow), or if there is a conflicting write to the cache line (which may not have any logical connection to the data the programmer cares about): the average programmer who deals with multithreaded code can't reasonably be expected to think about what is going on at the microarchitectural level.

Also, as with most transactional memory systems, RTM only deals with memory writes to a limited number of cache lines. If you have to deal with, for example, I/O, or if you touch too many cache lines, you have to write an abort handler that deals with rollback correctly. My feeling is that this will probably be far too difficult for most programmers.

Transactions' usefulness also depends enormously on the quality of their implementation. I'd love to see some hardware, but it isn't expected to be released until 2013.

The full specification is here.

Monday, February 6, 2012

Update to the Allocation Instrumenter

A couple of years ago, I open-sourced a tool that lets you instrument all of the allocations in your Java program. Some quick updates today:

It now supports JDK7.
You can now optionally use it to instrument constructors of a particular type, instead of allocations. So, if you wanted to find out where your Threads were being instantiated, you could instrument them with a callback that invokes Thread.dumpStack().

I know that most of the readers of this blog are more interested in concurrency, and that it has been a very, very long time since I posted. The spirit is willing, but the flesh is insanely busy... I have it in mind to do a post that discusses a bunch of strategies for doing efficient non-blocking concurrency in C++, because the attempts are very amusing, but I haven't had the chance to sit down and do it.

Friday, June 3, 2011

Scala / Java Shootout

I should probably comment on this paper from my colleague Robert Hundt, where he writes the same benchmark in Java, Scala, Go, Python and C++ (TL;DR here). Short version: Initially, C++ was faster than Scala, and Scala was faster than Java. I volunteered to make some improvements, which ended up with the C++ and Java implementations being about the same speed. Some C++ hackers then went and improved the C++ benchmark by another ~6x, and I didn't think I should go down that road.

Imagine my surprise when I got this email from a friend of mine this morning:

Date: Fri, 3 Jun 2011 03:20:48 -0700
Subject: Java Tunings
To: Jeremy

"Note that Jeremy deliberately refused to optimize the code further,
many of the C++ optimizations would apply to the Java version as
well."

Slacker.

Gee, thanks for making me look good, Robert. (Actually, I had approved that language, but I didn't realize anyone would actually read the paper. :) )

Anyway, I was not just being obdurate here; there was a thought process involved. I should probably document it.

The benchmark did not follow the rules of Java benchmarking. At some point, I should write a post about the rules of Java benchmarking (maybe I have? I don't know). Basically, it timed the full execution of the program, including startup time, JIT compilation time, and GC time. This isn't what you are supposed to do in Java or Scala, so I was very uncomfortable with the experimental setup. In 2011, if you care about startup time, then you probably shouldn't be using Java. The program ran very briefly, so the effect was measurable (once I got it to run fast enough).ETA: I am informed that later versions of this benchmark took this into account. Apologies for the confusion.

It's kind of silly to compare C++ with Java. The paper is about Scala, of course, but the immediate internal reaction at Google was a discussion of how much worse Java was than C++. As it happens, they really do target two different niches. Programmers are more productive in Java (if they know both languages well). It's easier to test Java code. It scales to large code sizes better. The tradeoff is that you are only going to get gcc -O2 level performance, and you have to be careful about garbage collection. In many, many cases, this tradeoff isn't a big deal. A single-threaded performance arms race between the two doesn't make sense. This is the primary reason I stopped trying to make improvements.

The benchmark did not use Java collections effectively. It was generating millions of unnecessary objects per second, and the runtime was swamped by GC overhead. In addition to the collections-related fixes I made (as described in the paper), the benchmark also used a HashMap<Type, Integer> when it should have just stored a primitive int in Type; it also used a LinkedList where it should have used an ArrayDeque. In fact, after being pressed on the matter by other Googlers (and seeing the response the C++ version got), I did make these changes to the benchmark, which brought the numbers down by another factor of 2-3x. The mail I sent Robert about it got lost in the shuffle (he's a busy guy), but he later said he would update the web site.

It is this last point that made the difference between Java and Scala. They use the same runtime, so the compiled performance is going to be roughly the same. However, the choice of data structure matters! From the perspective of Scala vs. Java performance, the takeaway isn't that "Java is slow", it is that "you can write bad code in any language" (with apologies to Robert, who is one of the smartest guys around, but isn't a Java programmer).

It is also "you can write bad code in any language" that inspired me to make any improvements. Robert was not a C++ newbie, but he was a Java newbie, so I thought that the comparisons that this was inspiring between C++ and Java performance basically made no sense. I didn't make the changes to demonstrate that Java could do better than C++, as I know that's not true. I made them to show that the playing field wasn't quite as uneven as people tend to think, and that algorithmic changes tend to make more difference than language changes.

These observations don't actually contradict the point that Robert was trying to make. In fact, they underline it. He was trying to say that Scala is very newbie-friendly when you want performance, as it balances the two well. That was clearly true in this case: as a newbie in both Java and Scala, Robert chose more performance-friendly data structures in Scala than in Java. In fact, in large measure, Scala chose the data structures for him, which is probably even better.

The other languages strike terrible balances in this area. Python and Go aren't on the performance radar at all (Python isn't compiled, and Go is young, yet). C++ isn't newbie friendly at all, ever.

Robert, of course, is a very odd choice for newbie, as he is an experienced compiler writer. Havlak is also an odd choice for newbie program, as it is a relatively sophisticated compiler benchmark. Perhaps the moral turns out to be that newbies who are also experiened compiler writers should look carefully at Scala?

Friday, May 13, 2011

Java Puzzlers@Google I/O

Josh Bloch and I gave one of his Java Puzzlers talk at Google I/O this year. If you hate Java, you can waste a perfectly good hour listening to us make unfunny jokes at its expense.

Tuesday, April 5, 2011

Cliff Click in "A JVM Does What?"

Cliff Click gave a very good talk at Google last week. For your viewing enjoyment:

The slides are available here.

(For those who watched it: The inventors of bytecode were in the audience. I think he was just used to making that joke elsewhere.)

Java Concurrency (&c)

Friday, December 18, 2015

Why I don't like System.gc()

Tuesday, July 9, 2013

Lightweight Asynchronous Sampling Profiler

Tuesday, February 7, 2012

Transactional Hardware on x86

Monday, February 6, 2012

Update to the Allocation Instrumenter

Friday, June 3, 2011

Scala / Java Shootout

Friday, May 13, 2011

Java Puzzlers@Google I/O

Tuesday, April 5, 2011

Cliff Click in "A JVM Does What?"

Blog Archive

About Me

Java Concurrency (&c)

Friday, December 18, 2015

Why I don't like System.gc()

Tuesday, July 9, 2013

Lightweight Asynchronous Sampling Profiler

Tuesday, February 7, 2012

Transactional Hardware on x86

Monday, February 6, 2012

Update to the Allocation Instrumenter

Friday, June 3, 2011

Scala / Java Shootout

Friday, May 13, 2011

Java Puzzlers@Google I/O

Tuesday, April 5, 2011

Cliff Click in "A JVM Does What?"

Blog Archive

About Me

Subscribe To This Blog