CPU Usage vs Load

Don’t confuse the two!

Benny Ou
ESTL Lab Notes

--

Consider two load-test runs, where the exact same system was exposed to two different loads. The listing below shows the statistics measured during the two runs.

Run 1
-----
BUSY_TIME          1,159,299   # CPU times are in hundredths of a second
IDLE_TIME            286,806
OS_CPU_WAIT_TIME   1,621,100
%busy                  80.17
response time            0.2   # seconds

Run 2
-----
BUSY_TIME          1,305,666
IDLE_TIME            125,475
OS_CPU_WAIT_TIME   5,071,200
%busy                  91.23
response time            1.5   # seconds

Although CPU usage was high in both runs, their performance was very different. Run 1 had a response time of 200 ms, whereas run 2 had a response time of 1,500 ms (!), which is 7.5 times as long.
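The %busy figures in the listing can be reproduced from the raw counters. The sketch below assumes that %busy is BUSY_TIME divided by the sum of BUSY_TIME and IDLE_TIME; the listing does not state the formula, but this one matches both runs. The function name busy_percent is made up for illustration.

```python
def busy_percent(busy_time, idle_time):
    """Share of total CPU time spent busy, as a percentage.

    Assumes total CPU time = busy + idle (an assumption, not stated
    in the listing). The unit cancels out, so hundredths of a second
    can be passed in directly.
    """
    return 100.0 * busy_time / (busy_time + idle_time)


run1_busy = busy_percent(1_159_299, 286_806)
run2_busy = busy_percent(1_305_666, 125_475)

print(f"run 1: {run1_busy:.2f}%")  # run 1: 80.17%
print(f"run 2: {run2_busy:.2f}%")  # run 2: 91.23%
```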

It is also known that with no load at all, a single transaction takes 200 ms. Hence run 1’s response time matches the baseline even though CPU usage was as high as 80%.

Is 100% CPU Always a Bad Thing?

In a way, if a system is running at 20% CPU, you are wasting the other 80%. In business terms, that means you are wasting precious money.

In fact, it’s been said that CPU time is not something that you can save up in the bank to use later.

Back to our load test. In run 2, the OS_CPU_WAIT_TIME of 5,071,200 vs the BUSY_TIME of 1,305,666 means that processes spent almost 4 times as long waiting for a CPU to become available as they spent actually running on one. That’s because there were more processes ready to run than there were CPUs available.
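The "almost 4 times" figure is simply the wait time divided by the busy time. A minimal sketch of that arithmetic, using the counters from the listing (the function name wait_to_busy_ratio is made up for illustration):

```python
def wait_to_busy_ratio(os_cpu_wait_time, busy_time):
    """Time spent waiting for a CPU per unit of time spent running on one."""
    return os_cpu_wait_time / busy_time


run1_ratio = wait_to_busy_ratio(1_621_100, 1_159_299)
run2_ratio = wait_to_busy_ratio(5_071_200, 1_305_666)

print(f"run 1: {run1_ratio:.2f}x")  # run 1: 1.40x
print(f"run 2: {run2_ratio:.2f}x")  # run 2: 3.88x
```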

That brings us to the concept of load. Load is simply a count of the number of processes using or waiting for the CPU at a single point in time. Because that is an instantaneous quantity, utilities such as uptime do not display it directly. Instead, they display exponentially weighted moving averages over the past one, five and fifteen minutes.
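To make the exponentially weighted moving average concrete, here is a rough sketch of how such an average can be maintained from periodic samples of the instantaneous load. The 5-second sampling interval and the function names are assumptions for illustration; real kernels differ in the details.

```python
import math


def update_load_average(previous, runnable_now, sample_interval_s, window_s):
    """Fold one instantaneous sample into an exponentially weighted average.

    Older samples are weighted down by exp(-interval / window), so a
    60-second window forgets the past faster than a 900-second one.
    """
    decay = math.exp(-sample_interval_s / window_s)
    return previous * decay + runnable_now * (1.0 - decay)


def simulate(samples, sample_interval_s=5.0, window_s=60.0):
    """Run a list of instantaneous load samples through the average."""
    load = 0.0
    for runnable_now in samples:
        load = update_load_average(load, runnable_now, sample_interval_s, window_s)
    return load


# Two runnable processes, sampled every 5 s (assumed interval).
print(f"after 1 minute:   {simulate([2] * 12):.2f}")   # after 1 minute:   1.26
print(f"after 10 minutes: {simulate([2] * 120):.2f}")  # after 10 minutes: 2.00
```

Note how the one-minute average still reads only about 1.26 after a full minute of steady load 2: the reported number lags behind the instantaneous one.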

> sysctl -n hw.ncpu
8
> uptime
18:09 up 19 days, 9:06, 3 users, load averages: 2.49 2.04 1.98

For a single-CPU system, a load of less than 1 means that, on average, every process that needed the CPU could use it immediately without having to wait. Conversely, a load greater than 1 means that, on average, there were processes that were ready to run but could not, because the CPU was unavailable.

Therefore, there’s a world of difference between 100% CPU usage and load = 1, and 100% CPU usage and load = 10.

For a multi-processor system, the load number must be interpreted together with the number of CPUs. For example, if the system has 4 CPUs, then there would be contention if the load is greater than 4. (In run 2 of our load test, the system had 4 CPUs whereas the load was 19.22! No wonder the performance was so bad.)

For the rest of this article, when we talk about load, we mean the load divided by the number of CPUs (to get a comparable measure).
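As a small sketch of that normalisation, the snippet below divides a load average by the CPU count. It uses Python's os.getloadavg() and os.cpu_count(), which are available on Unix-like systems; the function name normalised_load is made up for illustration.

```python
import os


def normalised_load(load, cpu_count):
    """Load per CPU, so that 1.0 means 'exactly as much work as CPUs'."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count


# The figures quoted in the text.
print(normalised_load(19.22, 4))  # run 2: 4 CPUs, load 19.22
print(normalised_load(2.49, 8))   # the uptime example: 8 CPUs, load 2.49

if __name__ == "__main__":
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
    except (AttributeError, OSError):
        # Load averages are not available on every platform.
        pass
    else:
        cpus = os.cpu_count() or 1
        print(f"this machine, 1-minute load per CPU: "
              f"{normalised_load(one_min, cpus):.2f}")
```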

Numbers to Aim For

In theory, if the instantaneous load is perfectly steady, then having load = 1 means that the CPU resources are optimally used. There are two problems with that. First, real-world loads are never perfectly steady. Second, the load numbers generated by the system are averages, so even if the value is 1, you cannot tell whether the instantaneous load went above 1 at times.
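The second problem is easy to see with a few numbers. The samples below are invented for illustration (they are not measurements from the load test): their average is exactly 1, yet the system was over capacity for part of the time.

```python
from statistics import mean

# Hypothetical instantaneous per-CPU load samples, made up for this example.
samples = [0, 0, 3, 0, 2, 1, 0, 2, 1, 1]


def fraction_over_capacity(load_samples, capacity=1.0):
    """Fraction of samples in which more work was queued than capacity allows."""
    over = sum(1 for value in load_samples if value > capacity)
    return over / len(load_samples)


print(mean(samples))                    # 1
print(fraction_over_capacity(samples))  # 0.3
```

An average of 1 looks ideal, but in this made-up series processes had to wait in 3 out of 10 samples.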

Based on the above, many people tend to aim for a load number of about 0.7 to leave headroom for the spikes.
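A 0.7 target translates into simple arithmetic. The sketch below is a back-of-the-envelope calculation only: it assumes that the raw load stays the same when CPUs are added and that work spreads evenly across them, which real workloads may not do. The function names are made up for illustration.

```python
import math


def max_raw_load(cpu_count, target=0.7):
    """Highest raw load average that keeps load per CPU at or below target."""
    return target * cpu_count


def cpus_needed(raw_load, target=0.7):
    """Smallest CPU count that brings load per CPU down to the target.

    Assumes the raw load is unchanged by adding CPUs (a simplification).
    """
    return math.ceil(raw_load / target)


print(max_raw_load(4))     # the 4-CPU system from run 2
print(cpus_needed(19.22))  # 28
```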

You might also want to provision spare capacity for planned, or even sudden, increases in business needs.

Caveat for Linux

Unlike most other systems, Linux also counts processes in uninterruptible sleep, which usually means they are blocked on I/O, in the load calculation. So when you see high load on Linux, you also need to consider the possibility that the problem is with I/O rather than with CPU, whereas on other systems, it is more straightforwardly a CPU issue.
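One rough way to tell the two cases apart on Linux is to count processes by state. The sketch below reads the state letter from /proc/<pid>/stat, where R means running or runnable and D means uninterruptible sleep. It is an approximation: it looks at processes only, not individual threads, and takes a single snapshot. The function names are made up for illustration.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state letter in one snapshot of /proc."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            path = os.path.join(proc_root, entry, "stat")
            with open(path, errors="replace") as stat_file:
                counts[parse_state(stat_file.read())] += 1
        except (OSError, IndexError):
            continue  # the process exited, or the file is not readable
    return counts


if __name__ == "__main__" and os.path.isdir("/proc"):
    states = count_states()
    print("runnable (R):", states["R"])
    print("uninterruptible sleep (D):", states["D"])
```

If the load is high but most of the contributing processes are in state D rather than R, I/O is the more likely suspect.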

--
