Once in a while the CPU usage on one of our development servers rails at 100% and never lets up. A couple of times it actually hit 200% (2 threads burning 2 cores at 100% each) or more before we noticed. This is not a good situation, and it's something you want to get to the bottom of and fix before it starts happening in your production environment.
The lazy thing to do is restart the entire server, or restart the process that's railed. But this is just ignoring the problem, and avoidance is not what you should be doing. Here's how you get to the bottom of what's going on:
1. Get the PID causing the problem. Run top to get the list of processes using the most CPU; it'll be the first entry. As you can probably guess, it was my JVM.
2. Get the IDs of the threads in the process causing the problem (my PID was 2089): top -H -p 2089
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
11405 liferay 20 0 11.7g 6.8g 14m R 99.4 88.5 23074:07 java
10000 liferay 20 0 11.7g 6.8g 14m R 99.4 88.5 23079:00 java
3390 liferay 20 0 11.7g 6.8g 14m S 3.9 88.5 22:42.06 java
2089 liferay 20 0 11.7g 6.8g 14m S 0.0 88.5 0:00.02 java
2098 liferay 20 0 11.7g 6.8g 14m S 0.0 88.5 0:01.85 java
2111 liferay 20 0 11.7g 6.8g 14m S 0.0 88.5 3:15.03 java
2112 liferay 20 0 11.7g 6.8g 14m S 0.0 88.5 3:11.32 java
From the top listing, it's easy to see the first two threads are the problem: 11405 and 10000 are each using almost 100% of a CPU. Convert the thread IDs to hex for step 4, because the thread dump lists each thread's native ID (nid) in hex.
11405 -> 0x2C8D
10000 -> 0x2710
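If you don't want to do the conversion by hand, the shell's printf can do it. This is just a small sketch: tid_to_nid is a name I made up for illustration, and it prints lowercase hex because that's how the nid shows up in the thread dump.

```shell
# Convert a decimal thread ID (as shown by top -H) into the lowercase
# hex form that jstack prints as nid=0x...
# tid_to_nid is a hypothetical helper name, not part of any tool.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

tid_to_nid 11405   # prints 0x2c8d
tid_to_nid 10000   # prints 0x2710
```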
3. Take a thread dump of your JVM: /usr/java/jdk1.6.10/bin/jstack 2089 > /tmp/td.log
4. Search your thread dump for the threads causing your problem (found in step 2). For example, I found the following:
"pool-400-thread-53" prio=10 tid=0x00007fde4801efd0 nid=0x2c8d runnable [0x00007fdde3ba4000]
java.lang.Thread.State: RUNNABLE
at java.util.HashMap.getEntry(HashMap.java:347)
at java.util.HashMap.containsKey(HashMap.java:335)
at java.util.HashSet.contains(HashSet.java:184)
at com.vaadin.ui.CustomTable.unregisterPropertiesAndComponents(CustomTable.java:2322)
at com.vaadin.ui.CustomTable.getVisibleCellsNoCache(CustomTable.java:2219)
...
Looking through the stack trace for 0x2c8d, I was eventually able to find a custom class and a line number that allowed me to narrow down the source of the problem.
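Rather than scrolling through the whole dump, you can pull out just the thread you care about. Here's a rough sketch with awk. It assumes the dump separates threads with blank lines (which is what my jstack output looked like), and show_thread is a hypothetical helper name, not a real tool.

```shell
# Print one thread's stack from a jstack dump, selected by its hex nid.
# Assumption: each thread's block in the dump ends with a blank line.
# show_thread is a hypothetical helper name.
# Usage: show_thread 0x2c8d /tmp/td.log
show_thread() {
    nid=$1
    dump=$2
    # The trailing space in the pattern keeps nid=0x2c8d from also
    # matching a longer ID such as nid=0x2c8d1.
    awk -v pat="nid=$nid " '
        index($0, pat)               { printing = 1 }  # header line of our thread
        printing && /^[[:space:]]*$/ { exit }          # blank line ends the block
        printing                     { print }
    ' "$dump"
}
```

With the dump from step 3, show_thread 0x2c8d /tmp/td.log prints the "pool-400-thread-53" header and its stack frames, and nothing else.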
Finding the general area of the problem is usually the easy part. The hard part is eliminating the source of the problem, which I'll leave up to you.
Good luck.