Cassandra Freezes on CentOS 6.6 and Haswell Processors

October 21, 2015 by Robert Birnie

This is an FYI and warning, be very careful with haswell processors with RHEL/CentOS 6.6. There is a futex wait() bug that can cause processes which wait to never resume agian. A good description is on InfoQ.

“The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.”

I recently saw this with Dell R630's and cassandra. A thread dump shows the threads in a BLOCKED state and the stack trace shows them as parked.

Thread 104823: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=226 (Compiled frame)
 - java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(long) @bci=68, line=2082 (Compiled frame)
 - java.util.concurrent.LinkedBlockingQueue.poll(long, java.util.concurrent.TimeUnit) @bci=62, line=467 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=26, line=1130 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)

Cassandra logs go completely blank and CPU utilization will stay constant at some level (sometimes high / sometimes none). Interestingly you can revive the process with a kill -STOP <jvm_pid> and kill -CONT <jvm_pid> (which is much faster than a service restart).

Update Centos 6.6 to the newest kernel in the updates repository to fix this, version 2.6.32-504.30.3.el6.x86_64.

Big thanks to Adam Hattrell, Simon Ashley and Erick Ramirez from Datastax for the help to figure this out.

Uber
Robert