Thursday, September 07, 2006

"Basement" processes in Solaris

All the documentation about the Solaris scheduler says that the highest priority runnable thread is chosen for execution (see, for example, section 3.8.4 of Solaris Internals 2/e). At first glance that might seem to mean that the same thread will always get the CPU, if it is runnable.

That is indeed what would happen if thread priorities were static, but in fact for most threads (those in the TS, IA, and FSS classes) the priority changes based on their usage of the CPU.

On the other hand, the FX (fixed priority) scheduling class does not change the priority of a thread, so that we can use it to experiment with the scheduler's behaviour.
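
To make the difference concrete, here is a toy model of the "highest priority runnable thread wins" rule, with and without priority feedback. It is only a sketch: the function name `run`, the thread names, and the +5/-10 adjustments are made up for illustration, and the real TS class uses a dispatch table rather than this arithmetic.

```python
def run(threads, ticks, dynamic):
    """threads: {name: priority}. Returns the thread chosen at each tick."""
    prio = dict(threads)
    history = []
    for _ in range(ticks):
        # Dispatcher rule: the highest priority runnable thread is chosen.
        # Ties go to the alphabetically first name, purely for determinism.
        chosen = max(sorted(prio), key=lambda name: prio[name])
        history.append(chosen)
        if dynamic:
            # Hypothetical TS-like feedback: using the CPU costs priority,
            # waiting earns some back. Values are clamped to 0..59.
            for name in prio:
                delta = -10 if name == chosen else 5
                prio[name] = max(0, min(59, prio[name] + delta))
    return history

print(run({"a": 30, "b": 20}, 6, dynamic=False))  # ['a', 'a', 'a', 'a', 'a', 'a']
print(run({"a": 30, "b": 20}, 6, dynamic=True))   # ['a', 'b', 'a', 'b', 'a', 'b']
```

With static priorities the same thread wins every time; with feedback the two take turns, which is roughly why TS processes don't starve each other.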

First of all, let's get ourselves some privileges. Note that we don't need this for plain priority 0 processes, but we will need it to set any other priority or quantum later.


$ ppriv $$
449: -zsh
flags = <none>
E: basic
I: basic
P: basic
L: all
$ su root -c "ppriv -s EIP+proc_priocntl $$"
Password:
$ ppriv $$
449: -zsh
flags = <none>
E: basic,proc_priocntl
I: basic,proc_priocntl
P: basic,proc_priocntl
L: all


OK, and we'll need something that will use lots of CPU and not make system calls that cause it to sleep. This will make the behaviour easier to observe.


$ cat spin.c
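#include <stdlib.h>   /* for exit() */

int main(void)
{
    /* unsigned so wrap-around is well defined; volatile so the loop survives optimisation */
    volatile unsigned int i = 0;
    for (;;)
        i++;
    exit(0);   /* never reached */
}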
$ gcc -o spin spin.c


Now, let's look at the current processes that we're running.


$ ps -o sid -p $$
SID 449
$ priocntl -d -i sid 449
TIME SHARING PROCESSES:
PID TSUPRILIM TSUPRI
449 0 0
593 0 0


So, only TS processes with no fancy characteristics.

Let's now start our test program. The FX class provides user priorities that range from 0 to 60 (numerically higher is higher priority). We want our test program to be low priority.


$ priocntl -e -c FX -m 0 -p 0 ./spin &
[1] 652
$ priocntl -d -i sid 449
TIME SHARING PROCESSES:
PID TSUPRILIM TSUPRI
449 0 0
653 0 0
FIXED PRIORITY PROCESSES:
PID FXUPRILIM FXUPRI FXTQNTM
652 0 0 200


Good, so it's running at low priority, but on this system it has very little competition. In fact it's using close to 100% of this box's single CPU. Let's allow some time for the stats to catch up.


$ prstat -c -p 652 15 5 | sed -n -e 1p -e /spin/p
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
652 boyd 996K 560K run 0 0 0:00:21 63% spin/1
652 boyd 996K 560K run 0 0 0:00:36 81% spin/1
652 boyd 996K 560K run 0 0 0:00:51 91% spin/1
652 boyd 996K 560K run 0 0 0:01:06 95% spin/1
652 boyd 996K 560K run 0 0 0:01:21 97% spin/1
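
The way the CPU column creeps up towards 100% looks like a decaying average. The sketch below is only a guess at the shape, not prstat's actual formula: the function name `reported_cpu` and the 15 second half-life are assumptions chosen because they happen to fit the figures above, and it treats the TIME column as elapsed time since the process has been spinning flat out.

```python
def reported_cpu(busy_seconds, half_life=15.0):
    """CPU % a decaying average would show after busy_seconds of 100% load.

    half_life is an assumed value fitted to the output above.
    """
    return 100.0 * (1.0 - 0.5 ** (busy_seconds / half_life))

# TIME column (in seconds) -> CPU % column, copied from the prstat output above.
observed = {21: 63, 36: 81, 51: 91, 66: 95, 81: 97}

for seconds, percent in observed.items():
    print(f"{seconds:3d}s  model {reported_cpu(seconds):5.1f}%  prstat {percent}%")
```

The model stays within about a percentage point of each sample, which is enough to explain why a freshly started process takes a minute or so to show its real usage.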


Now, we start another job at the same priority.


$ priocntl -e -c FX -m 0 -p 0 ./spin &
[2] 660
$ priocntl -d -i sid 449
TIME SHARING PROCESSES:
PID TSUPRILIM TSUPRI
449 0 0
661 0 0
FIXED PRIORITY PROCESSES:
PID FXUPRILIM FXUPRI FXTQNTM
652 0 0 200
660 0 0 200
$ prstat -c -p 652,660 60 2 | sed -n -e 1p -e /spin/p -e 's/^Total.*//p'
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
652 boyd 996K 560K run 0 0 0:01:45 71% spin/1
660 boyd 996K 560K run 0 0 0:00:08 27% spin/1

652 boyd 996K 560K run 0 0 0:02:15 51% spin/1
660 boyd 996K 560K run 0 0 0:00:37 48% spin/1


And we see that the two jobs are sharing the CPU nearly equally.

Now, let's tweak things a little. First, notice that the two jobs have the same quantum, which means that they'll have the CPU for the same amount of time each time they are scheduled (assuming that no higher priority job preempts them).

Let's experiment with that quantum by halving the time for one process.


$ priocntl -s -t 100 -i pid 660
$ priocntl -d -i sid 449
TIME SHARING PROCESSES:
PID TSUPRILIM TSUPRI
449 0 0
669 0 0
FIXED PRIORITY PROCESSES:
PID FXUPRILIM FXUPRI FXTQNTM
652 0 0 200
660 0 0 100
$ prstat -c -p 652,660 60 2 | sed -n -e 1p -e /spin/p -e 's/^Total.*//p'
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
652 boyd 996K 560K run 0 0 0:02:36 54% spin/1
660 boyd 996K 560K run 0 0 0:00:55 44% spin/1

652 boyd 996K 560K run 0 0 0:03:15 64% spin/1
660 boyd 996K 560K run 0 0 0:01:16 35% spin/1


As we might expect, the adjusted process now gets roughly half as much CPU time as the other one (35% against 64% in the second sample).
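
The arithmetic behind that expectation is simple: if CPU-bound processes at the same priority take turns and each runs for its full quantum, each one's share is its quantum divided by the sum of the quanta. A small sketch (the helper name `expected_shares` is mine, and it assumes nothing else on the box competes for the CPU):

```python
def expected_shares(quanta):
    """quanta: {pid: time quantum}. Returns {pid: expected CPU share in %}."""
    total = sum(quanta.values())
    return {pid: 100.0 * q / total for pid, q in quanta.items()}

for pid, share in expected_shares({652: 200, 660: 100}).items():
    print(pid, f"{share:.1f}%")
# 652 66.7%
# 660 33.3%
```

The measured 64% and 35% are close to that, with the remainder presumably going to the shell, prstat and the rest of the system.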

Next, let's set the quantum back to its default value and bump the priority up by one.


$ priocntl -s -t 200 -m 1 -p 1 -i pid 660
$ priocntl -d -i sid 449
TIME SHARING PROCESSES:
PID TSUPRILIM TSUPRI
449 0 0
677 0 0
FIXED PRIORITY PROCESSES:
PID FXUPRILIM FXUPRI FXTQNTM
652 0 0 200
660 1 1 200
$ prstat -c -p 652,660 120 2 | sed -n -e 1p -e /spin/p -e 's/^Total.*//p'
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
660 boyd 996K 560K run 1 0 0:02:00 67% spin/1
652 boyd 996K 560K run 0 0 0:04:07 30% spin/1

660 boyd 996K 560K run 1 0 0:04:40 99% spin/1
652 boyd 996K 560K run 0 0 0:04:07 0.0% spin/1


Wow! That's really made a difference. Process 660 is getting a lot of CPU. That makes sense, since it has a higher priority and so, based on our initial premise, we'd assume it gets chosen over the lower priority process every time.

Let's see if that's really the case. First we need some extra privileges so that we can use DTrace.


$ su root -c "ppriv -s EIP+dtrace_kernel,dtrace_proc,dtrace_user $$"
Password:
$ ppriv $$
449: -zsh
flags = <none>
E: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_priocntl
I: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_priocntl
P: basic,dtrace_kernel,dtrace_proc,dtrace_user,proc_priocntl
L: all
$ dtrace -q -n 'sched:::on-cpu /execname == "spin"/ {@[pid] = count()} tick-5sec { exit(0) }'

660 103


Yep, just as we expected, process 652 has not been scheduled even once in our sampling period of 5 seconds. It's getting absolutely no CPU time at all.
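
For anyone who wants to play with the idea away from a Solaris box, here is a minimal simulation of the behaviour seen so far: one queue per priority level, the head of the highest non-empty queue runs for its quantum, then goes to the back of its queue. It is a simplified sketch, not the real dispatcher. The name `simulate` is made up, every process is assumed to be a spinner that never sleeps, and nothing else competes for the CPU.

```python
from collections import deque

def simulate(procs, duration_ms):
    """procs: {pid: (priority, quantum_ms)}. Returns {pid: cpu_ms}."""
    queues = {}
    for pid, (prio, _quantum) in procs.items():
        queues.setdefault(prio, deque()).append(pid)
    cpu = {pid: 0 for pid in procs}
    clock = 0
    while clock < duration_ms:
        # Pick the highest priority level that has a runnable process.
        level = max(p for p, q in queues.items() if q)
        pid = queues[level].popleft()
        slice_ms = min(procs[pid][1], duration_ms - clock)
        cpu[pid] += slice_ms
        clock += slice_ms
        queues[level].append(pid)  # quantum used up: back of the same queue
    return cpu

print(simulate({652: (0, 200), 660: (1, 200)}, 5000))  # {652: 0, 660: 5000}
print(simulate({652: (0, 200), 660: (0, 200)}, 6000))  # {652: 3000, 660: 3000}
```

Because the spinner at priority 1 is always runnable, the queue at priority 0 is never even looked at, which matches the DTrace output above.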

Just to be sure, let's make the two priorities equal again and check again with DTrace to see that they are being scheduled more evenly.


$ priocntl -s -m 0 -p 0 -i pid 660
$ dtrace -q -n 'sched:::on-cpu /execname == "spin"/ {@[pid] = count()} tick-5sec { exit(0) }'

660 50
652 57


So, in summary, a process at the lowest priority level (0 in FX) will be starved of CPU time by anything on the system at a higher priority that stays runnable. Processes at the same priority level can have time apportioned between them using mechanisms such as the quantum.

The interaction between FX and the other scheduling classes is more complicated, because global priorities enter the equation, but that's a subject for another post. :)