How to create a user-local build of recent GCC

These days, GCC stands for the GNU Compiler Collection, but here we're only interested in the C and maybe C++ compilers.

The following applies to gcc versions 4.x (last tested with gcc 4.5.0 on Owl-current installed from the 2010/03/23 OpenVZ container template; it should also work on other Linux systems with the essential “development” tools installed).

These instructions show how to build the prerequisite libraries separately from gcc. Another valid approach would be to place their source code in properly-named subdirectories under the gcc source tree and let everything get built along with gcc itself (this approach, and the required directory names, are described on the gcc prerequisites web page).

Download gmp-4.3.2.tar.bz2, mpfr-2.4.2.tar.bz2, and mpc-0.8.1.tar.gz from ftp://gcc.gnu.org/pub/gcc/infrastructure/. Download gcc-core-4.5.0.tar.bz2 and optionally gcc-g++-4.5.0.tar.bz2 from ftp://gcc.gnu.org/pub/gcc/releases/ (there are also newer development snapshots in a nearby directory; perhaps the same approach will work for them as well). Make sure that you have the bison and flex packages installed (on Ubuntu, run “sudo apt-get install bison flex”).

Build gmp:

mkdir ~/gmp-4.3.2 ~/build
cd ~/build
tar xjf gmp-4.3.2.tar.bz2
cd gmp-4.3.2
./configure --prefix=/home/user/gmp-4.3.2 --enable-cxx
nice -n 19 time make -j8
make install
# Optionally, test our build of gmp:
make check
echo $?

The --enable-cxx option is needed in case we later decide to build ppl, which will use this build of gmp. ppl is in turn needed for building gcc with Graphite (which we're not doing yet).

Build mpfr:

mkdir ~/mpfr-2.4.2
cd ~/build
tar xjf mpfr-2.4.2.tar.bz2
cd mpfr-2.4.2
./configure --prefix=/home/user/mpfr-2.4.2 --with-gmp=/home/user/gmp-4.3.2
nice -n 19 time make -j8
make install

Build mpc:

mkdir ~/mpc-0.8.1
cd ~/build
tar xzf mpc-0.8.1.tar.gz
cd mpc-0.8.1
LD_LIBRARY_PATH=/home/user/gmp-4.3.2/lib:/home/user/mpfr-2.4.2/lib ./configure --prefix=/home/user/mpc-0.8.1 --with-gmp=/home/user/gmp-4.3.2 --with-mpfr=/home/user/mpfr-2.4.2
LD_LIBRARY_PATH=/home/user/gmp-4.3.2/lib:/home/user/mpfr-2.4.2/lib nice -n 19 time make -j8
make install

Build gcc:

mkdir ~/gcc-4.5.0
cd ~/build
tar xjf gcc-core-4.5.0.tar.bz2
tar xjf gcc-g++-4.5.0.tar.bz2 # optional
cd gcc-4.5.0
LD_LIBRARY_PATH=/home/user/gmp-4.3.2/lib:/home/user/mpfr-2.4.2/lib:/home/user/mpc-0.8.1/lib ./configure --prefix=/home/user/gcc-4.5.0 --with-gmp=/home/user/gmp-4.3.2 --with-mpfr=/home/user/mpfr-2.4.2 --with-mpc=/home/user/mpc-0.8.1 --disable-multilib
LD_LIBRARY_PATH=/home/user/gmp-4.3.2/lib:/home/user/mpfr-2.4.2/lib:/home/user/mpc-0.8.1/lib nice -n 19 time make -j8
make install

Beware: if you happen to omit or mistype the LD_LIBRARY_PATH setting (or if it needs to be revised for newer versions of the software components involved), and/or omit the --disable-multilib option while on x86_64 or a “similar” 64-bit system without a 32-bit compatibility compiler and libraries installed, then the build will likely fail with an error message that does not indicate the root cause of the failure. The error message will provide no hint about LD_LIBRARY_PATH or --disable-multilib.

If you downloaded the full gcc tarball including all language frontends, but only need support for C and C++, use the --enable-languages=c,c++ option to configure to speed up the build and reduce disk space usage.

Create a script to conveniently and transparently run the new gcc. It may look like this:

host!user:~$ cat newgcc.sh 
export LD_LIBRARY_PATH=~/gmp-4.3.2/lib:~/mpfr-2.4.2/lib:~/mpc-0.8.1/lib:~/gcc-4.5.0/lib64
export PATH=~/gcc-4.5.0/bin:$PATH

~/gcc-4.5.0/lib64 is correct on x86_64; it may need to be replaced with ~/gcc-4.5.0/lib on other platforms.

“Source” the script into the shell and check that simply running gcc now refers to the newly built version:

host!user:~$ . newgcc.sh
host!user:~$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/home/user/gcc-4.5.0/libexec/gcc/x86_64-unknown-linux-gnu/4.5.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ./configure --prefix=/home/user/gcc-4.5.0 --with-gmp=/home/user/gmp-4.3.2 --with-mpfr=/home/user/mpfr-2.4.2 --with-mpc=/home/user/mpc-0.8.1 --disable-multilib
Thread model: posix
gcc version 4.5.0 (GCC) 

Now it is possible to build arbitrary programs with the new gcc, which will be used by default within this shell session.

How to build GCC with the Graphite loop optimizations

Additionally download ppl-0.10.2.tar.gz and cloog-ppl-0.15.9.tar.gz from ftp://gcc.gnu.org/pub/gcc/infrastructure/.

ppl fails to build with the gcc 3.4.5 that we have in Owl, triggering an internal compiler error. So we will be using our brand new builds of gcc and g++ from this point on. Also, we'll assume that our LD_LIBRARY_PATH already includes gmp, mpfr, and mpc, which it does after sourcing the newgcc.sh script shown above. Let's source the script:

. newgcc.sh

Build ppl:

mkdir ~/ppl-0.10.2
cd ~/build
tar xzf ppl-0.10.2.tar.gz
cd ppl-0.10.2
CXXFLAGS=-I/home/user/gmp-4.3.2/include LDFLAGS=-L/home/user/gmp-4.3.2/lib ./configure --prefix=/home/user/ppl-0.10.2 --with-libgmp-prefix=~/gmp-4.3.2
nice -n 19 time make -j8
make install

The explicit specification of include and library paths via CXXFLAGS and LDFLAGS is a workaround for the incomplete implementation of --with-libgmp-prefix in the configure script.

Build cloog-ppl:

mkdir ~/cloog-ppl-0.15.9
cd ~/build
tar xzf cloog-ppl-0.15.9.tar.gz
cd cloog-ppl-0.15.9
CFLAGS='-I/home/user/gmp-4.3.2/include -I/home/user/ppl-0.10.2/include' LDFLAGS=-L/home/user/gmp-4.3.2/lib ./configure --prefix=/home/user/cloog-ppl-0.15.9 --with-gmp=/home/user/gmp-4.3.2 --with-ppl=/home/user/ppl-0.10.2
nice -n 19 time make -j8
make install

The explicit include and library paths specified via CFLAGS and LDFLAGS are needed for the same reason as above - a deficient configure script.

Now let's rebuild gcc:

mkdir ~/gcc-4.5.0-g
cd ~/build/gcc-4.5.0
make distclean
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/user/ppl-0.10.2/lib:/home/user/cloog-ppl-0.15.9/lib ./configure --prefix=/home/user/gcc-4.5.0-g --with-gmp=/home/user/gmp-4.3.2 --with-mpfr=/home/user/mpfr-2.4.2 --with-mpc=/home/user/mpc-0.8.1 --with-ppl=/home/user/ppl-0.10.2 --with-cloog=/home/user/cloog-ppl-0.15.9 --disable-multilib
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/user/ppl-0.10.2/lib:/home/user/cloog-ppl-0.15.9/lib nice -n 19 time make -j8
make install

Create a revised script for invoking this build of gcc:

host!user:~$ cat newgcc-g.sh 
export LD_LIBRARY_PATH=~/gmp-4.3.2/lib:~/mpfr-2.4.2/lib:~/mpc-0.8.1/lib:~/ppl-0.10.2/lib:~/cloog-ppl-0.15.9/lib:~/gcc-4.5.0-g/lib64
export PATH=~/gcc-4.5.0-g/bin:$PATH

Once the above script is sourced, gcc and g++ will correspond to the new gcc build with Graphite. This build recognizes options such as -ftree-parallelize-loops=N (which needs to be passed during linkage as well) and -floop-parallelize-all.

Parallel processing with GCC

Automatic parallelization example

Let's try the following program:

#include <stdio.h>

#define N 1000000
#define M 10000

int main(void)
{
        unsigned int x[N], y;
        int i, j;

        for (j = 0; j < N; j++)
                x[j] = j * j;

        for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                        x[j] += j;

        y = 0;
        for (j = 0; j < N; j++)
                y -= x[j];

        printf("%08x\n", y);
        return 0;
}

Notice how individual iterations of the innermost loop are independent of each other (and may thus be computed in parallel).

First, let's compile it without parallelization, and run it:

$ gcc loop.c -o loop -s -Wall -O3 -funroll-loops
$ time ./loop
cf5419a0

real    0m5.659s
user    0m5.655s
sys     0m0.004s

Now let's try automatic parallelization:

$ gcc loop.c -o loop -s -Wall -O3 -funroll-loops -floop-parallelize-all -ftree-parallelize-loops=8
$ time ./loop
cf5419a0

real    0m2.474s
user    0m19.752s
sys     0m0.006s

The program completed in less than half the time, producing the same result, yet it consumed almost four times as much CPU time. This is on a Core i7, which is quad-core with 2 threads per core, so all 8 threads were in fact running on the CPU at the same time. Ideally, the speedup would be at least close to 4x, if not approaching 8x. So we see that the automatic parallelization has worked, and actually provided an advantage (as long as we don't care about other tasks that could have been running on the system), but the efficiency is poor. This is explained by the x[] array not fitting in the L1 data cache because of the large value of N, thereby making the task “memory-bound” (or likely L3-cache-bound on this CPU).

Unfortunately, reducing N such that the array would fit in the L1 data cache prevented automatic parallelization from working. What could the reason for this be? I actually tried forcing the compiler into parallelizing the loop even for smaller values of N by using OpenMP directives, and the performance was poor - all 8 threads running, but so slowly that there was no advantage overall. Apparently, this slowness was caused by cache coherence overhead. I expected this problem to occur with multiple threads accessing the same cache line. My test results demonstrated that it was worse than that: if a thread wrote to a given page of memory, accesses to that entire page by other threads would be slow (entry thrown out of TLBs maybe?) I did not spend time to confirm (or disprove) this guess via the CPU's performance counters, though.

The problem did not occur for read-only accesses, which we'll use for the OpenMP demo below. Unfortunately, I was unable to construct a “read-only” program that would trigger gcc's automatic parallelization.

Another curious detail: we're using “unsigned int”, not plain “int” (which is signed), for x[] elements and for y. This is because the behavior on integer overflow is undefined for signed integer types (only unsigned types are defined to wrap around). However, what if we try to use plain int anyway? Let's change just one line:

        unsigned int x[N], y;

to:

        int x[N], y;

Now time for a surprise:

$ gcc loop.c -o loop -s -Wall -O3 -funroll-loops
$ time ./loop
cf5419a0

real    0m3.022s
user    0m3.021s
sys     0m0.002s
$ gcc loop.c -o loop -s -Wall -O3 -funroll-loops -floop-parallelize-all -ftree-parallelize-loops=8
$ time ./loop
cf5419a0

real    0m2.291s
user    0m18.275s
sys     0m0.008s

Both benchmarks became much faster. (Also, the advantage from parallelization shrank.)

Why is that? The answer is SSE2. It turns out that gcc does not dare to use SSE2 integer operations on unsigned integers. However, when we actually declare our variables as signed, it does use SSE2, thereby processing four 32-bit integer additions in parallel. Apparently, even though the task is memory-bound (or L3-cache-bound), the use of SSE2 instructions does provide further improvement, probably due to wider bus accesses. It is unfortunate that in order to use the CPU optimally we “have to” write formally-incorrect code in this case (of course, there are other ways around the problem, such as through explicit use of vectorized data types, but the drawback is that it is no longer “fully” portable C source code then).

BTW, both revisions of the program have another problem - because N is so large, the multiplication j * j overflows signed int. Correcting this bug (which happens not to affect the computation result with this version of gcc) somehow inhibits automatic parallelization of the innermost loop (where we reuse the j variable). Oops.

OpenMP example

As stated above, I failed to construct a program where automatic parallelization would exploit the CPU's full performance. (Perhaps I did not try hard enough, but this does illustrate the limitations of gcc's automatic parallelization for practical purposes, at least on this system.) Let's try with OpenMP, which is also supported by this version of gcc (including by our initial/simpler build of it).

Here's the new program:

#include <stdio.h>

#define N 1000
#define M 10000000

int main(void)
{
        unsigned int x[N], y;
        int i, j;

        for (j = 0; j < N; j++)
                x[j] = j * j;
        y = 0;

#pragma omp parallel for default(none) shared(x) private(i,j) reduction(+:y)
        for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                        y += x[j];

        printf("%08x\n", y);
        return 0;
}

First, let's compile it without OpenMP, and run it:

$ gcc loop.c -o loop -Wall -s -O3 -funroll-loops
loop.c: In function 'main':
loop.c:15:0: warning: ignoring #pragma omp parallel
$ time ./loop
615e5600

real    0m1.202s
user    0m1.201s
sys     0m0.001s

Now let's try with OpenMP:

$ gcc loop.c -o loop -Wall -s -O3 -funroll-loops -fopenmp
$ time ./loop
615e5600

real    0m0.240s
user    0m1.901s
sys     0m0.000s

Now we have a 5x performance improvement, which is about right for a quad-core with SMT.

Switching to signed integers changes the compiled code somewhat, but does not affect performance dramatically this time.

BTW, with both of our examples, OpenMP runtime parameters may be controlled via environment variables such as OMP_NUM_THREADS (only works with explicit use of OpenMP) and OMP_WAIT_POLICY (works with automatic parallelization as well).

Application to John the Ripper

Although my opinion is that John the Ripper should be parallelized at a higher level, I've briefly tried both gcc's automatic parallelization and OpenMP on JtR's implementation of bitslice DES. To do this, I created a custom architecture-specific parameters file (modifying ia64.h to be correct for x86_64, but avoiding use of any assembly code) and a custom make target that would use the file. In the file, I increased DES_BS_VECTOR, thereby making the bitslice DES code work with vectors of size in excess of the CPU's native word size (BTW, JtR also does that when it is built for PA-RISC, to take advantage of unusual properties of caches on some implementations of that architecture).

Long story short, I was only able to get gcc's automatic parallelizer to kick in with extremely large values of DES_BS_VECTOR. The threshold appeared to be around 420 for 2 threads and at around 1700 for 8 threads. Unfortunately, even with values this large, different threads would still read and write(!) to the same pages of memory, because of the way the B[] array is defined (the vector “depth” is the last index, which is as desired in many other cases - e.g., to have SSE2 assembly code co-exist with non-vectorized C code). As I mentioned above, this memory access pattern is a major problem at least on this system. The resulting performance is thus poor - it is worse than that of a single thread (yet with 8 threads running busily on the CPU). Tweaking the parameters and the code much further, as well as using OpenMP directives, I was finally able to achieve a 2x speedup over the performance of a single thread (that's with 8 threads running!), but that speedup was smaller than that achieved by simply going with SSE2 for a single thread. (I kept SSE2 out of the picture for this parallelization test.)

A way around this problem would be to introduce a third dimension to the arrays such that the memory accesses by different threads would be to locations that are much further apart. However, this was beyond the level of complexity that I was willing to introduce into the code, given that proper parallelization needs to be implemented at a higher level anyway.

Update (2010/05/08): I ended up actually implementing OpenMP support for bcrypt hashes, and the efficiency (for those hashes) is good.

Update (2010/06/16): JtR 1.7.6+ integrates OpenMP support for bcrypt hashes and more.

Update (2010/06/27): I've implemented reasonably efficient parallelization of JtR's bitslice DES code with OpenMP directives, placing different threads' read-write data at sufficient distances, which did the trick. The current quick and really dirty patch (maybe to be improved, cleaned up, and integrated later) is found on the JtR patches page. The patch is capable of making use of SSE2 (as well as likely MMX and AltiVec, although this is untested with this specific patch and may require minor adjustments) along with OpenMP parallelization, due to the use of intrinsics instead of the bundled non-thread-safe pieces of assembly code for SSE2 and MMX. gcc 4.5 produces really good SSE2 code (in this specific case), gcc 4.4 produces decent code as well - both are comparable to the bundled assembly code. Older versions of gcc might not work so well.

internal/gcc-local-build.txt · Last modified: 2013/07/13 11:33 by solar
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported