The nature of this server was that it handled raw TCP connections with a number of threads, multiplexing data in and out of the streams and using locking to control access in various ways. In the normal case the server used around 10% CPU, with periods of higher CPU time during 'failover' (switching all connections to a secondary server) or of high concurrency (lots of new connections coming in).
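As a rough sketch of that shape (illustrative only - the names and structure here are mine, not the real server's): a thread per connection, with a shared lock guarding the routing state.

    import java.io.IOException;
    import java.net.ServerSocket;
    import java.net.Socket;

    // Illustrative sketch of a thread-per-connection TCP multiplexer.
    public class MultiplexingServer {
        private final Object stateLock = new Object(); // guards shared routing state

        public void serve(int port) throws IOException {
            try (ServerSocket listener = new ServerSocket(port)) {
                while (true) {
                    Socket connection = listener.accept();
                    new Thread(() -> handle(connection)).start();
                }
            }
        }

        private void handle(Socket connection) {
            try (Socket s = connection) {
                byte[] buffer = new byte[8192];
                int read;
                while ((read = s.getInputStream().read(buffer)) != -1) {
                    synchronized (stateLock) {
                        route(buffer, read); // multiplex to the right outbound stream
                    }
                }
            } catch (IOException e) {
                // connection dropped; clean up
            }
        }

        private void route(byte[] data, int length) {
            /* forward to the appropriate destination stream */
        }
    }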
During these times of high load, though, we sometimes saw a strange phenomenon:
High CPU usage is expected for a short time during these periods, but this was something quite different. The CPU in the chart above is pegged at 100% for long periods - sometimes minutes - and the proportion of privileged (kernel) time is very high.
Stranger still, during this time the server seemed to be doing very little: few if any printouts to the log, and connections to the server appeared to be starved and would report an error. What could be happening during this time?
Garbage collection - not always the culprit
In this case at least, GC wasn't the issue.
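If you want to rule GC out yourself, pause logging along these lines (standard flags on the Java 7/8 JVMs of the time) will show whether the stalls line up with collections:

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime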
Safepoints and sampling profilers that can't sample
Unfortunately these profilers can sometimes have a hard time getting snapshots at the point where they want them. One profiler we ran over a one-minute high-CPU period, for example, managed to gain all of 2 snapshots, then happily presented results showing that BigInteger was the cause of all our high CPU. Sadly both snapshots happened after the actual pause and gave only a brief look into what the server was doing while recovering connections after the hang.
For a more complete rundown of the problems with sampling profilers take a look here. A better sampling profiler can use OS signals to interrupt and break into a JVM at a given point, capturing information without having to wait to be given time within the JVM - but in our case even this didn't work, because the kernel time was so high that the OS wasn't even responsive to signals.
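As an aside, the JVM itself can report how long threads spend stopped at safepoints; on Java 7/8 these flags print a summary per safepoint operation, which at least shows whether a hang is a safepoint pause:

    -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
    -XX:+PrintGCApplicationStoppedTime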
Privileged time and kernel APIs
Drivers and other low-level OS code run in privileged mode, and a buggy driver is a common candidate for high privileged CPU time, so this was our next port of call. Installing driver updates didn't help, so we switched to kernrate, which helped us narrow things down some more - specifically to one module: NTOSKRNL.
Kernrate also showed some CPU usage in the network driver module, but the vast majority was in NTOSKRNL - unfortunately about the broadest target possible, since NTOSKRNL is the entire Windows NT OS kernel. Kernrate did report more specific information on the functions where the time was spent, but they were generic enough not to point the finger at any specific bit of code: KeSynchronizeExecution and ExReleaseRundownProtectionCacheAwareEx.
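For reference, kernrate comes from the Windows Server resource kit tools; you point it at the modules you want per-function buckets for, let it run through the problem period, and stop it with Ctrl-C to dump the results. From memory the invocation looked roughly like this (treat the exact options as illustrative):

    kernrate.exe -z ntoskrnl -z tcpip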
The kernel only offers a certain set of APIs though, so the issue has to be in one of those. Typically this might be something like:
- Disk I/O
- Network I/O
- Interrupts and exceptions
- Other functions that you might find in the Windows API
A typical Java program will do most of its work outside of these, but a server in particular might be expected to use some of them quite heavily. The functions kernrate identified didn't tell us much on their own, but they did point to something to do with interrupts.
At this point we started monitoring a bunch of metrics during the pauses. Perfmon also suffered from starvation during the pause, but running it at high priority allowed us to grab some metrics, which showed that paging and disk I/O weren't to blame - but that we were getting a spike in packets per second under TCP/IP v4.
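perfmon's counters can also be captured from the command line with typeperf, which is handy when the GUI itself is being starved. Something along these lines would sample the kind of metrics we were watching once a second (exact counter names vary slightly between Windows versions):

    typeperf -si 1 "\Processor(_Total)\% Privileged Time" ^
        "\Memory\Pages/sec" "\PhysicalDisk(_Total)\Disk Transfers/sec" ^
        "\TCPv4\Segments Received/sec"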
Packets per second
Modern OSes can handle a lot of data flowing through them, and in our case the amount of data being moved around was relatively small - under 2MB/sec - but packet processing isn't the same as bandwidth. Each packet arriving at your network adapter produces an interrupt: it blocks the CPU from whatever it's doing and asks it to process the packet that just came in. The same mechanism serves your keyboard and mouse, which is why they can stay responsive even when your computer is working hard.
We found and fixed an issue behind the packet spike, re-ran our test, saw that packets per second remained low - and then saw the same huge privileged time and long pause. Packets per second weren't the cause of our problems after all.
Exceptions and JDI
Java profilers are often sampling profilers because sampling limits the impact they have on the running program, allowing it to run much as before. Measuring profilers take a different approach, recording the entry and exit times of at least a subset of method calls occurring in the JVM. This is vastly more computationally expensive, and although it produced better information, the slowdown caused by monitoring methods with either JVMTI or JDI made the issue impossible to reproduce. Both are great APIs for creating your own finely tuned profiler - and JDI even lets you do so in pure Java.
That said, JDI did at least allow us to monitor Java exceptions without a massive slowdown. Although we did see Java exceptions around the problem period, they didn't correlate: the number of Windows exceptions being generated was many orders of magnitude higher than ours - hundreds of thousands per second.
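As a sketch of that JDI approach (not our exact tool - this assumes the target JVM was started with a JDWP agent listening on port 5005, and tools.jar on the classpath for Java 8 and earlier): attach, register a non-suspending exception request, and count the events.

    import com.sun.jdi.Bootstrap;
    import com.sun.jdi.VirtualMachine;
    import com.sun.jdi.connect.AttachingConnector;
    import com.sun.jdi.connect.Connector;
    import com.sun.jdi.event.Event;
    import com.sun.jdi.event.EventSet;
    import com.sun.jdi.event.ExceptionEvent;
    import com.sun.jdi.event.VMDisconnectEvent;
    import com.sun.jdi.request.EventRequest;
    import com.sun.jdi.request.ExceptionRequest;
    import java.util.Map;

    public class ExceptionCounter {
        public static void main(String[] args) throws Exception {
            // Attach over a socket to a JVM started with
            // -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005
            AttachingConnector connector = Bootstrap.virtualMachineManager()
                    .attachingConnectors().stream()
                    .filter(c -> "com.sun.jdi.SocketAttach".equals(c.name()))
                    .findFirst().orElseThrow(IllegalStateException::new);
            Map<String, Connector.Argument> arguments = connector.defaultArguments();
            arguments.get("hostname").setValue("localhost");
            arguments.get("port").setValue("5005");
            VirtualMachine vm = connector.attach(arguments);

            // Ask for every caught and uncaught exception, but never suspend
            // the target - this keeps the overhead low enough to leave running.
            ExceptionRequest request = vm.eventRequestManager()
                    .createExceptionRequest(null, true, true);
            request.setSuspendPolicy(EventRequest.SUSPEND_NONE);
            request.enable();

            long count = 0;
            while (true) {
                EventSet events = vm.eventQueue().remove();
                for (Event event : events) {
                    if (event instanceof VMDisconnectEvent) {
                        return; // target JVM went away
                    }
                    if (event instanceof ExceptionEvent) {
                        count++;
                        if (count % 1000 == 0) {
                            System.out.println(count + " exceptions observed");
                        }
                    }
                }
                events.resume();
            }
        }
    }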
JVM bugs - don't forget to -XX:+UseMembar
We eventually worked our way through to this JVM bug and this duplicate.
Closed as fixed in Java 6 but apparently back with a vengeance in Java 7 and 8, Java's 'pseudo memory barrier' optimisation somehow erroneously results in an endless loop of generated Windows exceptions. This can lock up the computer in useless high-priority exception handling for minutes at a time, preventing almost anything else from running - yet it can be disabled with a single simple JVM option:
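    -XX:+UseMembar

As I understand the bug, the optimisation avoids issuing real memory barrier instructions during thread state transitions by having threads touch a shared 'serialisation' page whose protection the VM flips; each touch of a protected page raises an access violation that Windows must handle in kernel mode, which would explain both the enormous exception rate and all that privileged time. +UseMembar simply makes the JVM use real memory barrier instructions instead.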
While I can't say I enjoyed our extensive debugging travels in this case, it was an interesting look into the depths of Windows and the JVM, and it ultimately resulted in a much smoother-running server.
Do you run a Java server or app and have a list of JVM options you always make sure to use?