Meaning of Linux Load Average

If you came here, you probably tried to check your Linux CPU usage with the uptime or top command and saw a line of numbers at the top of the output.

You noticed %Cpu(s), a bunch of numbers, and "load average" at the top, right? But you don't have the slightest clue what those numbers mean, or at least you don't know everything about them.

QUICK ANSWER
The three numbers after "load average" are, in order, the average load of the whole system over the last one, five and fifteen minutes. Each one is a total across all your CPUs, not a per-CPU value.

When should you be worried about those numbers?
To find out, run this command:

grep 'model name' /proc/cpuinfo | wc -l

and you'll find out how many CPUs your machine has. (On systems with GNU coreutils, the nproc command usually prints the same count.)

If the 15-minute average (the last of the three numbers) is greater than your number of CPUs, then you probably have a problem: your Linux box needed more CPUs than it had for a "long" time.
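
The check just described can be sketched in a few lines of Python. This is only a minimal sketch: it assumes the usual /proc/loadavg layout (the three load averages come first on the line), the helper names parse_loadavg and is_overloaded are made up for this post, and the sample line is invented rather than taken from a real machine.

```python
def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def is_overloaded(load15, cpus):
    """Rule of thumb from this post: 15 minute load above the CPU count."""
    return load15 > cpus


# Invented sample in the /proc/loadavg format, for a hypothetical 4-CPU box.
sample = "5.10 4.95 4.79 3/412 20187\n"
_, _, load15 = parse_loadavg(sample)
print(is_overloaded(load15, cpus=4))  # True
```

On a real machine you would read the text from /proc/loadavg and take the CPU count from os.cpu_count() instead of hard-coding them.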

LONG ANSWER
Before explaining how to interpret those numbers correctly, we have to define what CPU load is.

First of all, load average is NOT CPU percentage. CPU percentage is the fraction of the sampling time (the time between two refreshes of top, for example) during which the CPU was not idle or, for a single process, the fraction of that time the process spent using the CPU. Since the sampling time is usually very short, CPU percentage is more like a snapshot of the situation at a given moment. If all the resources of your machine happened to be busy for just one second and your report captured that moment, you may suddenly worry that your production process is stuck for some mysterious reason when nothing is actually wrong.
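
To make the "snapshot" idea concrete, here is a rough sketch of how a CPU percentage can be derived from two samples of the first line of /proc/stat. It assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, ...) and treats iowait as idle time, which is my simplification and not necessarily what top does. The function names and the two sample lines are made up.

```python
def cpu_times(stat_line):
    """Split a 'cpu ...' line from /proc/stat into (busy, total) ticks."""
    fields = [int(x) for x in stat_line.split()[1:]]
    # Assumed order: user nice system idle iowait irq softirq steal ...
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    total = sum(fields[:8])  # guest time is already counted in user/nice
    return total - idle, total


def cpu_percent(before, after):
    """Percentage of the sampling interval during which the CPU was busy."""
    busy0, total0 = cpu_times(before)
    busy1, total1 = cpu_times(after)
    elapsed = total1 - total0
    return 100.0 * (busy1 - busy0) / elapsed if elapsed else 0.0


# Two invented samples taken a short time apart.
before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
after = "cpu  1300 0 600 8500 600 0 0 0 0 0"
print(cpu_percent(before, after))  # 40.0
```

The result only describes the interval between the two samples, which is exactly why a single reading can look alarming without meaning much.
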
Load average, instead, is (in Linux) approximately an exponentially weighted moving average of the number of processes that are running, runnable (waiting for a CPU) or in uninterruptible sleep (see Wikipedia - Uninterruptible Sleep). What does that mean?

Well, you have to know how the kernel handles processes. There's also queueing theory involved, but for that side of the story you'd better read this Linux Journal article. It's very precise, and if you don't want the exact mathematical details I think you can skip them without losing any clarity.
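
If you want a feel for the "particular" kind of average without the full math, here is a small simulation. It is a sketch under assumptions: the load is sampled roughly every 5 seconds and each sample is blended into the old value with a decay factor of exp(-5/window), which is my understanding of what the kernel approximates with fixed-point arithmetic. The function names and the scenario are invented.

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between samples (assumed, roughly what Linux uses)


def update_load(load, active, window_seconds):
    """One step of an exponentially damped moving average."""
    decay = math.exp(-SAMPLE_INTERVAL / window_seconds)
    return load * decay + active * (1.0 - decay)


def simulate(active_counts, window_seconds):
    """Feed a series of 'active process' counts into the average, from zero."""
    load = 0.0
    for active in active_counts:
        load = update_load(load, active, window_seconds)
    return load


# Invented scenario: an idle box gets 2 always-runnable processes for one minute.
one_minute = [2] * 12  # 12 samples, 5 seconds apart
print(round(simulate(one_minute, 60), 2))   # 1.26 (1 minute average)
print(round(simulate(one_minute, 900), 2))  # 0.13 (15 minute average)
```

Notice that after a full minute of constant load 2, the 1-minute figure has only reached about 1.26, and the 15-minute figure has barely moved. Old samples fade out gradually instead of dropping off a cliff.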

A CPU (one core of your dual- or quad-core processor, for example) can execute only one process at a time. And there's no such thing as a CPU working at 45%: at any given instant, it either works (100%) or is idle (0%).
For this reason, let's say that the kernel keeps a queue of all the processes that need to be executed. They are lined up and ready to enter the CPU. The kernel then assigns a time slot of 10 ms to the first process in line (the actual length is configurable and depends on the kernel version and scheduler, so take 10 ms as an illustrative value). Bear in mind that on a 2.5 GHz CPU 10 ms is about 25 million clock cycles, which is more than enough for most processes. If the process finishes everything it has to do within its time slot, it hands control back to the kernel, typically through a system call (for example to wait for I/O or to exit). The kernel then removes the process from the queue (if I'm not greatly mistaken) and assigns another 10 ms to the next process in the queue, and so on. If the process doesn't finish its work in time (as described in the Linux Journal article mentioned above), a hardware timer interrupt gives control back to the kernel, which puts the process back in the queue and moves on to the next one.
The kernel does this continuously for every available CPU, so if you have more than one CPU you can obviously execute more than one process at once.
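
Here is a toy simulation of that queue for a single CPU. It is nothing like a real kernel scheduler: it just assumes a fixed 10 ms slice and processes that only need CPU time, and the round_robin name and the workload are invented for illustration.

```python
from collections import deque


def round_robin(work_ms, time_slice=10):
    """Simulate one CPU handing out fixed time slices.

    work_ms maps a process name to the CPU time (ms) it needs.
    Returns a dict: process name -> time (ms) at which it finished.
    """
    queue = deque(work_ms.items())
    clock = 0
    finished = {}
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)  # a process may finish early
        clock += ran
        if remaining > ran:
            queue.append((name, remaining - ran))  # preempted, back in line
        else:
            finished[name] = clock
    return finished


print(round_robin({"a": 25, "b": 10, "c": 5}))
# {'b': 20, 'c': 25, 'a': 40}
```

Process "a" needs three turns, so the short processes "b" and "c" finish before it even though they were behind it in line. With more CPUs, each one would pull from the queue in the same way, in parallel.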

OK, now you roughly understand how process management in the kernel works. So what is load average? It's the average (a particular kind of average, keep in mind) number of processes in the queue at every instant during the last 1, 5 and 15 minutes. Remember that, for a process, being in the queue means that it's either waiting for a CPU or being executed (or, in Linux, that it's in uninterruptible sleep).
And that's the reason for my very simple rule of thumb above (load above the number of available CPUs over the last 15 minutes). If the load numbers are below the number of CPUs, it means that, on average, some CPUs were idle, or idle for part of the time. If the load numbers are above the number of CPUs (e.g. you have 4 CPUs and your load is 4.79), it means that for everything to run smoothly without any waiting (on average, remember) you would have needed 0.79 more CPUs.
If that happened over the last 15 minutes, you probably have an undersized machine: most of the time it needed more CPU than it had, and you should probably do something about it before your PC/server locks up completely or its performance degrades seriously.
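
The arithmetic from the example (4 CPUs, load 4.79) is trivial, but written out it looks like this. The two function names are mine, and dividing the load by the CPU count is just a convenient way to compare machines of different sizes.

```python
def cpu_shortfall(load, cpus):
    """CPUs missing on average (positive) or spare on average (negative)."""
    return load - cpus


def load_per_cpu(load, cpus):
    """Load normalised by CPU count; above 1.0 means processes had to wait."""
    return load / cpus


print(round(cpu_shortfall(4.79, 4), 2))  # 0.79
print(round(load_per_cpu(4.79, 4), 2))   # 1.2
```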
