The Traditional UNIX Way
Traditional Unix systems provide the select(2) and/or poll(2) system calls. With both of these
you pass an array of FDs to the kernel, with an optional timeout. When there is activity, or when
the call times out, the system call will return. The application must then scan the array to see
which FDs are active. This scheme works well with small numbers of FDs, and is simple to use.
Unfortunately, for thousands of FDs, this does not work so well.
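To make the pattern concrete, here is a minimal sketch of the traditional approach using poll(2). The wait_for_activity() helper, the array size of 1000 FDs and the 10 ms timeout are arbitrary choices made for the example; only poll() and struct pollfd come from the actual interface.

/* Sketch of the traditional poll(2) pattern: pass an array of FDs
 * to the kernel with a timeout; the kernel scans the whole array. */
#include <poll.h>
#include <stdio.h>

#define NFDS 1000                     /* arbitrary number of connections */

int wait_for_activity(const int fds[], int nfds)
{
    struct pollfd pfds[NFDS];         /* assumes nfds <= NFDS */
    int ready;

    /* Build the array the kernel will have to scan on every call. */
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;      /* wake us when data is readable */
        pfds[i].revents = 0;
    }

    /* Block until there is activity or the 10 ms timeout expires. */
    ready = poll(pfds, nfds, 10);
    if (ready < 0)
        perror("poll");

    return ready;                     /* number of FDs with activity */
}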
The kernel has to scan your array of FDs and check which ones are active. This takes approximately
3 microseconds (3 us) per FD on a Pentium 100 running Linux 2.1.x. Now you might think that
3 us is quite fast, but consider if you have an array of 1000 FDs. This is now 3 milliseconds (3
ms), which is 30% of your timeslice (each timeslice is 10 ms). If it happens that there is initially
no activity and you specified a timeout, the kernel will have to perform a second scan after some
activity occurs or the syscall times out. Ouch! If you have an even bigger application (like a large
http server), you can easily have 10000 FDs. Scanning will then take 30 ms, which is three
timeslices! This is just way too much.
You might say that 3 ms for 1000 FDs is not such a big deal: a user will hardly notice that. The
problem is that the entire array of FDs is scanned each time you want to go back to your polling
loop. The way these applications work is that after checking for activity on FDs, the application
processes the activity (for example, reading data from active FDs). When all the activity has been
processed, the application goes back to polling the OS for more FD activity. In many cases, only
a small number of FDs are active at any one time (say during each timeslice), so it may only
take a few milliseconds to process all the activity. High performance http servers can process
hundreds or thousands of transactions per second. A server that takes 2 ms to process each active
FD can process 500 transactions per second. If you add 3 ms for FD scanning in the kernel, you
now have 5 ms per transaction. That only gives 200 transactions per second, a massive drop in
performance.
There is another problem, and that is that the application needs to scan the “returned” FD array
that the kernel has updated to see which FDs are active. This is yet another scan of a large array.
This isn’t as costly as the kernel scan, for reasons I’ll get to later, but it is still a finite cost.
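Assuming a pollfd array like the one in the earlier sketch, this second, userspace scan looks roughly like the following: after poll() returns, the whole array is walked again just to find the few entries whose revents field is non-zero. The process_activity() helper is invented for illustration.

/* Scan the array the kernel handed back: even if only a handful of
 * FDs are active, every entry has to be examined. */
#include <poll.h>
#include <unistd.h>

void process_activity(struct pollfd pfds[], int nfds)
{
    char buf[4096];

    for (int i = 0; i < nfds; i++) {
        if (pfds[i].revents & POLLIN) {
            /* Placeholder for real work: a server would parse the
             * request and generate a response here. */
            read(pfds[i].fd, buf, sizeof(buf));
        }
        if (pfds[i].revents & (POLLERR | POLLHUP))
            close(pfds[i].fd);        /* drop dead connections */
    }
}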
New POSIX Interfaces
A fairly simple proposal is to use the POSIX.4 Asynchronous I/O (AIO) interface (aio_read()
and friends). Here we would call aio_read() for each FD. This would then queue thousands of
asynchronous I/O requests. This model looks appealing, until we look under the hood of some
aio_*() implementations. The Linux glibc implementation is a case in point: there is no kernel
support. Instead, the C library (glibc 2.1) launches a thread per FD for which there are outstanding
AIO requests (up to the maximum number of configured threads). In general, implementing this
facility in the C library is reasonable, as it avoids kernel bloat. However, if you use this facility to
start thousands of AIO requests, you may end up creating thousands of threads. This is no good,
since threads are costly. The “obvious” solution, then, is to implement AIO in the Linux kernel.
Another solution is to use userspace tricks to avoid the scalability problems (see the description
of migrating FDs below). These solutions may be fine if you only want to run under Linux, but
are not much help if you want to run under another OS which also implements AIO using threads
(and for which you don’t have the source code so you can change the implementation). The point
here is that there appears to be no guarantee that aio_*() implementations are scalable across
platforms which support it.
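For concreteness, a rough sketch of what queuing one of these requests with aio_read() might look like is given below. The queue_async_read() helper, the table of 1000 control blocks and the 4096-byte buffers are invented for the example; struct aiocb and aio_read() are the real POSIX.4 interface (on Linux, link with -lrt).

/* Sketch of queuing a POSIX.4 AIO read: one aiocb per FD.  With the
 * glibc 2.1 implementation, each outstanding request may end up being
 * serviced by its own thread. */
#include <aio.h>
#include <string.h>
#include <stdio.h>

#define MAXCONN 1000

static char buffers[MAXCONN][4096];     /* one buffer per connection */
static struct aiocb requests[MAXCONN];  /* one control block per FD */

int queue_async_read(int slot, int fd)
{
    struct aiocb *cb = &requests[slot];

    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;                /* FD to read from */
    cb->aio_buf    = buffers[slot];     /* where the data should land */
    cb->aio_nbytes = sizeof(buffers[slot]);
    cb->aio_offset = 0;

    /* Returns immediately; completion is reported asynchronously and
     * can be polled with aio_error()/aio_return(). */
    if (aio_read(cb) < 0) {
        perror("aio_read");
        return -1;
    }
    return 0;
}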
It is also worth noting that POSIX.4 Asynchronous I/O is not necessarily available on all POSIX.4
compliant systems (facilities defined by POSIX.4 are optional). So even if you were prepared to
limit your application to POSIX.4 systems, there is still no guarantee that AIO is available. Many
or most implementations will be scalable, but we can’t be sure all are scalable, so we need an
alternative.
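A cautious application can at least test for the option before relying on it. The sketch below checks for AIO support using the standard _POSIX_ASYNCHRONOUS_IO feature-test macro at compile time and sysconf(_SC_ASYNCHRONOUS_IO) at run time; the aio_available() helper itself is just an illustration.

/* Check whether the (optional) POSIX.4 AIO facility is present, both
 * at compile time and at run time. */
#include <unistd.h>
#include <stdio.h>

static int aio_available(void)
{
#ifdef _POSIX_ASYNCHRONOUS_IO
    /* sysconf() returns -1 if the option is not supported at run time. */
    if (sysconf(_SC_ASYNCHRONOUS_IO) != -1)
        return 1;
#endif
    return 0;
}

int main(void)
{
    printf("POSIX AIO is %savailable\n", aio_available() ? "" : "not ");
    return 0;
}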