The libc unit testing project, organized jointly by musl libc and Openwall, aims to apply rigorous testing to implementations of the standard library functions, in order to:
The recent discovery of the longstanding fnmatch/alloca vulnerability in glibc should serve as motivation.
This project focuses on testing rather than direct code audits. This way, the tests we produce can be applied to multiple implementations, used as regression tests during musl's rapid development, and run against other, less widely used libcs. Nonetheless, the project entails some level of libc source audit to identify the components most likely to contain errors, which demand the most attention in testing.
Below is an outline of the proposed categories we aim to test.
Test that the interfaces (type definitions, macros, and prototypes) defined in the headers are basically correct. The idea is to catch incorrect or inconsistent definitions when porting to new platforms.
Testing entails making simple calls to at least one function using each interface structure (stat, sigaction, flock, msghdr, …) and sanity-checking the results. A reasonable attempt should also be made to use as many of the bitflag or enum-like macros as possible (things like O_APPEND, etc.), testing their behavior to ensure that they were defined correctly.
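As an illustration of the kind of check described above, the following is a minimal sketch that exercises O_APPEND and struct stat together. The helper name check_o_append_and_stat, the /tmp path template, and the step numbers it returns are assumptions made for this example, not part of any existing test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if O_APPEND and struct stat behave sanely,
 * a positive step number on failure, or -1 if the test could not be set up. */
int check_o_append_and_stat(void)
{
    char path[] = "/tmp/libctest-XXXXXX";   /* assumed scratch location */
    struct stat st;
    int result = 0;

    int fd = mkstemp(path);
    if (fd < 0) return -1;
    if (write(fd, "abcd", 4) != 4) { result = -1; goto out; }
    close(fd);

    fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) { unlink(path); return -1; }

    /* With O_APPEND set, a write must land at end of file even after
     * the offset has been moved back to the start. */
    if (lseek(fd, 0, SEEK_SET) != 0) { result = 1; goto out; }
    if (write(fd, "ef", 2) != 2) { result = 2; goto out; }

    /* If the flag value or the struct layout were wrong for this platform,
     * these fields would likely come back as nonsense. */
    if (fstat(fd, &st) != 0) { result = 3; goto out; }
    if (st.st_size != 6) result = 4;
    else if (!S_ISREG(st.st_mode)) result = 5;
    else if (st.st_nlink != 1) result = 6;
    else if (!(fcntl(fd, F_GETFL) & O_APPEND)) result = 7;
out:
    close(fd);
    unlink(path);
    return result;
}
```

A test driver would call check_o_append_and_stat() and report the returned step number; a real test would repeat this pattern for each structure and flag family.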
Test all functions defined in string.h for:
under all combinations, for both source and dest if applicable, of:
And additionally, for strstr, possibly test needles with various periodicity properties that affect the way the two-way algorithm decomposes the needle.
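One way to approach the strstr part is to compare the library function against a trivially correct reference on needles with a known period. The sketch below does that; the helper names naive_strstr and check_strstr_periodic, the alphabet, and the size limits are all choices made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Obviously-correct (quadratic) reference implementation. */
static const char *naive_strstr(const char *h, const char *n)
{
    size_t nl = strlen(n);
    for (;; h++) {
        if (strncmp(h, n, nl) == 0) return h;
        if (!*h) return NULL;
    }
}

/* Hypothetical helper: returns the number of mismatches between strstr
 * and the reference over needles of period 1..8. */
int check_strstr_periodic(void)
{
    char needle[65], hay[257];
    int failures = 0;

    for (size_t period = 1; period <= 8; period++) {
        for (size_t nlen = period; nlen <= 64; nlen += 7) {
            /* Needle: (period-1) 'a' characters then a 'b', repeated. */
            for (size_t i = 0; i < nlen; i++)
                needle[i] = (i % period == period - 1) ? 'b' : 'a';
            needle[nlen] = '\0';

            for (size_t flip = 0; flip < 256; flip += 5) {
                /* Haystack with the same period, broken at one position,
                 * so partial matches occur on both sides of the break. */
                for (size_t i = 0; i < 256; i++)
                    hay[i] = (i % period == period - 1) ? 'b' : 'a';
                hay[flip] = 'c';
                hay[256] = '\0';
                if (strstr(hay, needle) != naive_strstr(hay, needle))
                    failures++;
            }
        }
    }
    return failures;
}
```

The same compare-against-reference structure extends to the other string.h functions, with the buffers placed at varying alignments.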
Test various patterns of memory allocation, aiming for various levels of fragmentation. Perform the tests both in single-threaded and multi-threaded environments.
Checks for correctness:
Further implementation-specific correctness checks: checking consistency of bookkeeping information before and after each allocated block.
Possible quality-of-implementation checks: Attempting to obtain pathological fragmentation and allocation failure where it should not happen.
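A single-threaded sketch of such a test might fill every live block with a tag byte and verify the tag before freeing, so that overlapping allocations show up as corrupted tags. The helper names, the block count, the size range, and the small linear congruential generator below are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 512

struct block { unsigned char *p; size_t len; unsigned char tag; };

/* A block is intact if every byte still holds the tag written into it. */
static int block_intact(const struct block *b)
{
    for (size_t i = 0; i < b->len; i++)
        if (b->p[i] != b->tag) return 0;
    return 1;
}

/* Hypothetical helper: randomly allocates and frees blocks, returning the
 * number of corrupted or misaligned blocks observed. */
int check_malloc_no_overlap(unsigned seed)
{
    static struct block blocks[NBLOCKS];
    int failures = 0;
    unsigned state = seed;

    memset(blocks, 0, sizeof blocks);
    for (int round = 0; round < 20000; round++) {
        state = state * 1103515245u + 12345u;   /* small LCG, not rand() */
        struct block *b = &blocks[(state >> 16) % NBLOCKS];
        if (b->p) {
            /* Slot in use: verify its contents, then free it. */
            if (!block_intact(b)) failures++;
            free(b->p);
            b->p = NULL;
        } else {
            /* Slot empty: allocate a block of pseudo-random size. */
            state = state * 1103515245u + 12345u;
            b->len = 1 + (state >> 16) % 4096;
            b->tag = (unsigned char)round;
            b->p = malloc(b->len);
            if (!b->p) continue;
            /* Only check alignment for blocks large enough to need it. */
            if (b->len >= _Alignof(max_align_t) &&
                (uintptr_t)b->p % _Alignof(max_align_t))
                failures++;
            memset(b->p, b->tag, b->len);
        }
    }
    for (int i = 0; i < NBLOCKS; i++) {
        if (blocks[i].p && !block_intact(&blocks[i])) failures++;
        free(blocks[i].p);
    }
    return failures;
}
```

Because tags are only one byte, two overlapping blocks with equal tags would go unnoticed; a real test would use a stronger per-block pattern and run the same loop from several threads.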
Numeric parsing with the following functions should be tested: (str|wcs)to(umax|imax|u?l?l|ld|d|f), sscanf, fscanf, swscanf, fwscanf.
Working from the specifications for these functions, develop a number of corner case strings likely to be wrongly accepted or wrongly rejected. Especially worth testing are strings which are initial substrings of valid numeric strings, but which are not themselves valid numeric strings, the most basic example of which is “0x”.
For the scanf-family functions, encountering an initial subsequence of a valid numeric string followed by junk should result in a matching failure. For the (str|wcs)to… functions, a shorter prefix of that subsequence is sometimes still a valid number (for “0x”, the “0”), and it should be converted. Tests should check the end pointer these functions store to confirm that the right number of characters was accepted.
Also test overflow behavior: the value of errno, the “fake” values returned on overflow (ULONG_MAX, etc.) and so on. Some of the specified behavior, especially “negative” values for unsigned conversions, is a bit unintuitive so a careful reading of the specs is needed.
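The end-pointer and overflow checks for the strto… functions lend themselves to a table-driven test. The following is a sketch only: the struct layout, the helper name check_strto_corner_cases, and the particular cases are chosen for this example from a reading of the C standard, and errno is deliberately not checked when no conversion takes place, since implementations may or may not set EINVAL there.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

struct strtol_case {
    const char *s;      /* input string */
    int base;
    long want;          /* expected return value */
    const char *rest;   /* expected unconsumed tail */
    int want_errno;
};

static const struct strtol_case cases[] = {
    { "0x",   16, 0,  "x",   0 },   /* only the "0" is a valid number */
    { "0xg",   0, 0,  "xg",  0 },
    { "-0x",   0, 0,  "x",   0 },
    { "  +",  10, 0,  "  +", 0 },   /* no conversion at all */
    { " 12abc", 10, 12, "abc", 0 },
    { "99999999999999999999999",  10, LONG_MAX, "", ERANGE },
    { "-99999999999999999999999", 10, LONG_MIN, "", ERANGE },
};

/* Hypothetical helper: returns the number of failed checks. */
int check_strto_corner_cases(void)
{
    int failures = 0;
    char *end;

    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        const struct strtol_case *c = &cases[i];
        const char *want_end = c->s + strlen(c->s) - strlen(c->rest);
        errno = 0;
        long got = strtol(c->s, &end, c->base);
        int err = errno;
        if (got != c->want || end != want_end) failures++;
        else if (want_end != c->s && err != c->want_errno) failures++;
    }

    /* "Negative" unsigned input: negated in the unsigned type, no error. */
    const char *neg = "-1";
    errno = 0;
    unsigned long u = strtoul(neg, &end, 10);
    if (u != ULONG_MAX || errno != 0 || end != neg + 2) failures++;

    /* Unsigned overflow: ULONG_MAX plus ERANGE. */
    errno = 0;
    u = strtoul("99999999999999999999999", &end, 10);
    if (u != ULONG_MAX || errno != ERANGE || *end != '\0') failures++;

    /* Incomplete exponent: only the "1" is converted. */
    const char *fp = "1e+";
    double d = strtod(fp, &end);
    if (d != 1.0 || end != fp + 1) failures++;

    return failures;
}
```

The scanf-family and wide-character variants would need their own tables, since their expected results for the same inputs differ.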
Check the return value of snprintf, looking for uncaught overflows. For example, check (on 64-bit) that attempting to format a single string of 4 GB + 1 bytes with %s results in -1/EOVERFLOW rather than a length of 1.
[This section incomplete.]
To begin, research the gnulib/autoconf tests to determine which tests are actually correct and which ones are looking for GNU-specific behavior. I believe they are already very thorough, so just setting them up to build and run as part of the larger test package may be sufficient. Of course, reading them may also give ideas for further testing of stdio.
[This section incomplete.]
Make a list of such functions, and design tests which arrange for each function to return a string just longer than the nominally-available buffer space. Check that the function has not written past the end of the buffer, and that it returns a failure code or indication of truncation if specified to do so.
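As a sketch of the technique, the example below uses getcwd as a stand-in for one such function. It places guard bytes after a deliberately undersized buffer, then checks both the failure indication and the guard. The helper name, the guard size, and the 4096-byte scratch buffer are assumptions made for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define GUARD 64

/* Hypothetical helper: returns the number of failed checks, or -1 if the
 * test could not be set up. */
int check_getcwd_truncation(void)
{
    char full[4096];
    if (!getcwd(full, sizeof full)) return -1;
    size_t need = strlen(full) + 1;     /* bytes required, including NUL */
    int failures = 0;

    /* Every size that is too small must fail with ERANGE and must leave
     * the guard region after the buffer untouched. */
    for (size_t size = 1; size < need; size++) {
        unsigned char *mem = malloc(size + GUARD);
        if (!mem) return -1;
        memset(mem, 0xA5, size + GUARD);
        errno = 0;
        char *r = getcwd((char *)mem, size);
        if (r != NULL || errno != ERANGE) failures++;
        for (size_t i = 0; i < GUARD; i++)
            if (mem[size + i] != 0xA5) { failures++; break; }
        free(mem);
    }

    /* An exactly sized buffer must succeed and hold the full path. */
    char *exact = malloc(need);
    if (!exact) return -1;
    if (getcwd(exact, need) == NULL || strcmp(exact, full) != 0) failures++;
    free(exact);

    return failures;
}
```

Guard bytes only catch writes that land within the guard region; running the same test under a memory checker, or with the buffer placed directly before an unmapped page, would catch more.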
Make a list of such functions, which would include things like fnmatch, glob, regcomp, and anything that might need to precompute a case-altered version of its argument string. Then attempt to pass argument strings so long that allocation via malloc fails, or that on-stack allocation (alloca or VLA) wraps the stack pointer or moves it to point on top of other program data. Test that the function returns an error or works without allocation rather than crashing or clobbering memory.
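Since a failure here is typically a crash, one option is to run each call in a forked child and classify how the child ended. The sketch below does this for fnmatch; the helper name, the return-code convention, and the choice of a literal pattern matched against itself are assumptions for this example, and the length needed to provoke a real failure depends on the implementation and on stack limits.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fnmatch.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns:
 *   0  fnmatch returned a match (the correct answer),
 *   1  the child was killed by a signal (crash),
 *   2  fnmatch wrongly reported no match,
 *   3  fnmatch returned some other error value,
 *  -1  the test could not be set up. */
int check_fnmatch_long_pattern(size_t len)
{
    char *s = malloc(len + 1);
    if (!s) return -1;
    memset(s, 'a', len);
    s[len] = '\0';

    pid_t pid = fork();
    if (pid < 0) { free(s); return -1; }
    if (pid == 0) {
        /* Child: a literal pattern must match the identical string. */
        int r = fnmatch(s, s, 0);
        _exit(r == 0 ? 0 : r == FNM_NOMATCH ? 2 : 3);
    }
    free(s);

    int status;
    if (waitpid(pid, &status, 0) != pid) return -1;
    if (WIFSIGNALED(status)) return 1;
    return WEXITSTATUS(status);
}
```

A driver could call this with increasing lengths, for example check_fnmatch_long_pattern((size_t)1 << 20), treating results 0 and 3 as acceptable. A crash-free run does not prove that memory was not clobbered, so the child could additionally verify canary data of its own before exiting.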
These tests will need root and/or a setuid-somebody binary to do anything useful.
Test 1: Set up RLIMIT_NPROC and fork a number of child processes which each change to the same uid to exhaust this limit. Create a bunch of threads in the parent process (with the original uid), then call setuid. At the same moment, have the child processes that were exhausting the process limit begin terminating. Test for setuid returning 0 (success) but failing to change the uid for some of the threads (you may need to have each thread call getuid, or read /proc/self/task, to evaluate the results).
Other tests in this group will involve creation or termination of threads within the process at the same time setuid is called, forking at the same time, etc. to look for race conditions around synchronization of threaded uid changes. Understanding the potential races and designing tests is most of the difficulty here.
A number of pthread functions are specified to disallow returning EINTR. Attempt to have them blocked and interrupted by a non-restarting signal handler, and check that they do not return EINTR.
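A minimal sketch of this test for pthread_mutex_lock follows. The helper names, the choice of SIGUSR1, and the timing (twenty signals at 10 ms intervals) are assumptions for this example; the program needs to be built with thread support (for example with -pthread on many toolchains).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile sig_atomic_t signals_seen;
static int lock_result;

static void on_usr1(int sig) { (void)sig; signals_seen++; }

static void *waiter(void *arg)
{
    (void)arg;
    /* Blocks here until the main thread releases the mutex. */
    lock_result = pthread_mutex_lock(&lock);
    if (lock_result == 0) pthread_mutex_unlock(&lock);
    return NULL;
}

/* Hypothetical helper: returns 0 if pthread_mutex_lock did not return
 * EINTR, 1 if it did, and -1 if the test was inconclusive. */
int check_mutex_lock_no_eintr(void)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_usr1;            /* sa_flags == 0: no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, &old) != 0) return -1;

    pthread_t t;
    signals_seen = 0;
    lock_result = -1;
    pthread_mutex_lock(&lock);
    if (pthread_create(&t, NULL, waiter, NULL) != 0) {
        pthread_mutex_unlock(&lock);
        sigaction(SIGUSR1, &old, NULL);
        return -1;
    }

    /* Repeatedly interrupt the waiter while it is blocked on the mutex. */
    struct timespec delay = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    for (int i = 0; i < 20; i++) {
        nanosleep(&delay, NULL);
        pthread_kill(t, SIGUSR1);
    }

    pthread_mutex_unlock(&lock);
    pthread_join(t, NULL);
    sigaction(SIGUSR1, &old, NULL);

    if (signals_seen == 0) return -1;   /* handler never ran */
    return lock_result == EINTR ? 1 : 0;
}
```

The same skeleton applies to the other functions on the list by swapping the blocking call inside waiter and the operation that eventually unblocks it.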
Test the behavior of sigsuspend, sigtimedwait, sigwait, and sigwaitinfo with regard to cancellation and being interrupted by an unmasked signal.
[This section incomplete.]
Identify library functions which are likely to use recursion and try to make them recurse unboundedly, looking for stack overflow.
Test the multibyte/wide character conversion functions in a UTF-8 locale, and the iconv functions with various character sets, to ensure:
Most importantly, “over-long sequences” must not be accepted.
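The over-long sequence check can be sketched with mbrtowc as below. The locale names tried are guesses at what a given system may provide, and the helper name and return convention are assumptions for this example; the test reports itself as skipped if no UTF-8 locale can be selected.

```c
#include <assert.h>
#include <errno.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: returns the number of over-long sequences that were
 * wrongly accepted, or -1 if no UTF-8 locale was available. */
int check_overlong_utf8_rejected(void)
{
    /* Candidate locale names: assumptions, not guaranteed to exist. */
    static const char *const locales[] = { "C.UTF-8", "en_US.UTF-8", "" };
    static const char *const overlong[] = {
        "\xC0\x80",          /* over-long U+0000 */
        "\xC1\xBF",          /* over-long U+007F */
        "\xE0\x80\xAF",      /* over-long '/' */
        "\xE0\x9F\xBF",      /* over-long U+07FF */
        "\xF0\x80\x80\xAF",  /* over-long '/' */
        "\xF0\x8F\xBF\xBF",  /* over-long U+FFFF */
    };
    char saved[256];
    int have_utf8 = 0, failures = 0;
    mbstate_t st;
    wchar_t wc;

    /* Remember the current LC_CTYPE setting so it can be restored. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    if (!cur || strlen(cur) >= sizeof saved) return -1;
    strcpy(saved, cur);

    for (size_t i = 0; i < sizeof locales / sizeof locales[0]; i++) {
        if (!setlocale(LC_CTYPE, locales[i])) continue;
        /* Probe: a valid two-byte sequence must decode as one character. */
        memset(&st, 0, sizeof st);
        if (mbrtowc(&wc, "\xC3\xA9", 2, &st) == 2) { have_utf8 = 1; break; }
    }
    if (!have_utf8) { setlocale(LC_CTYPE, saved); return -1; }

    for (size_t i = 0; i < sizeof overlong / sizeof overlong[0]; i++) {
        memset(&st, 0, sizeof st);
        errno = 0;
        size_t r = mbrtowc(&wc, overlong[i], strlen(overlong[i]), &st);
        if (r != (size_t)-1 || errno != EILSEQ) failures++;
    }

    setlocale(LC_CTYPE, saved);
    return failures;
}
```

Equivalent tables would be needed for mbstowcs, mbsrtowcs, and iconv, and for the other invalid classes such as surrogates and values above U+10FFFF.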
Perhaps some punycode DNS tests could also be added to this task, but that would be hard to test without dropping in a custom nameserver to serve bogus data.
[This section incomplete.]
The following operations all potentially deal with resources (counters, cached pid/tid, cached thread structures, etc.) shared by the whole process, and synchronization between them is sufficiently complex that it should be tested:
In addition, fork() is async-signal-safe, which means it might be called from a signal handler while another fork, or any of the other above operations, is taking place. POSIX is rather unclear and contradictory on what restrictions, if any, are placed on such use.
Designing these tests requires reading implementation source to understand the potential race and deadlock issues.