This page covers a few advanced topics that are asked about occasionally.
It is not an in-depth discussion of any topic, nor a tutorial. It assumes that you already know why you need the answers, and it gives most of what you need to get started.
CPU Architecture Performance Guidelines
The AMD "Bulldozer" CPUs are available in RHEL6 queues on 64-core nodes.
Each Bulldozer node has:
- 4 sockets
- Each socket has two silicon dies
- Each die has 4 modules that share an L3 cache
- Each module has 2 integer units (reported as distinct cores), one 256-bit FMAC4 FPU, and an L2 cache, with the FPU and L2 cache shared by both cores.
- Each integer core has a private L1 cache.
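As a quick check on the layout above, the per-node totals can be worked out by multiplying the counts together. This is only a minimal sketch of the arithmetic; the constant and function names are illustrative and not part of any tool on the system.

```python
# Counts taken directly from the node description above.
SOCKETS_PER_NODE = 4
DIES_PER_SOCKET = 2
MODULES_PER_DIE = 4
CORES_PER_MODULE = 2  # two integer units, reported as distinct cores


def topology_totals():
    """Return derived per-node counts as a dict."""
    dies = SOCKETS_PER_NODE * DIES_PER_SOCKET
    modules = dies * MODULES_PER_DIE
    cores = modules * CORES_PER_MODULE
    return {
        "dies": dies,            # one shared L3 cache per die
        "modules": modules,      # one FPU and one L2 cache per module
        "cores": cores,          # one private L1 cache per core
        "cores_per_die": MODULES_PER_DIE * CORES_PER_MODULE,
    }


if __name__ == "__main__":
    for name, count in topology_totals().items():
        print(f"{name}: {count}")
```

The core total matches the 64-core node size quoted above, and the module count is also the number of FPUs available to floating-point-heavy threads.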
Floating Point Performance Recommendations:
- Each module (a reported pair of cores) shares a single FMAC4 FPU.
- A single floating-point-heavy thread assigned to a module can extract most of the performance from the FPU (assuming that the application is written to take advantage of it, for example by using ACML BLAS).
- However, for maximum performance AMD recommends assigning two floating-point-heavy threads to each module. A good single-thread run gets approximately 80-90% of the possible performance from the FPU (HPL shows this to be true); the second thread extracts the remainder.
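The two placements described above (one thread per module, or two) can be sketched as lists of core IDs. This sketch assumes that cores are numbered linearly so that cores 2m and 2m+1 belong to module m; that numbering is an assumption, not something stated on this page, so check it on the node (for example with lscpu or hwloc's lstopo) before using the result. The function name is hypothetical.

```python
CORES_PER_NODE = 64
CORES_PER_MODULE = 2
MODULES_PER_NODE = CORES_PER_NODE // CORES_PER_MODULE


def cores_for_fp_threads(threads_per_module, modules=MODULES_PER_NODE):
    """List the core IDs to use for floating-point-heavy threads.

    ASSUMPTION: cores 2m and 2m+1 are the two cores of module m.
    """
    if threads_per_module not in (1, 2):
        raise ValueError("threads_per_module must be 1 or 2")
    cores = []
    for module in range(modules):
        first = module * CORES_PER_MODULE
        # Take the first core of the module, or both cores.
        cores.extend(range(first, first + threads_per_module))
    return cores


if __name__ == "__main__":
    one_per_module = cores_for_fp_threads(1)
    two_per_module = cores_for_fp_threads(2)
    print(len(one_per_module), "threads, one per FPU")
    print(len(two_per_module), "threads, two per FPU")
```

With one thread per module each thread has an FPU to itself; with two, every core is used and each pair of threads shares an FPU.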
Memory Performance Recommendations:
- For memory-bandwidth-heavy applications, use numactl to pin memory
- Use 3 threads per die (6 threads per socket), pinned to every other core pair and scattered by socket first, e.g. pin threads to sockets 0, 1, 2, 3, 0, 1, 2, 3 (in that order)
- Pin threads to cores 0, 2, 4 of socket [0,1,2,3] die 0
- If more threads are needed, pin threads to cores 0, 2, 4 of socket [0,1,2,3] die 1
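The placement order above can be sketched as a small generator that also builds a numactl command line. This is one reading of the recommendation (sockets vary fastest, then cores 0, 2, 4, then die 0 before die 1), not a definitive mapping. It assumes logical CPU IDs are numbered linearly as socket * 16 + die * 8 + core, which is not stated on this page and should be verified on the node. The function names and ./my_app are placeholders; --physcpubind and --localalloc are standard numactl options.

```python
# ASSUMED layout: 16 cores per socket, 8 per die, numbered linearly.
SOCKETS = 4
DIES_PER_SOCKET = 2
CORES_PER_DIE = 8
CORES_PER_SOCKET = DIES_PER_SOCKET * CORES_PER_DIE
CORES_USED_PER_DIE = (0, 2, 4)  # from the recommendation above


def pin_order(num_threads):
    """Return (socket, die, core) placements in the recommended order."""
    max_threads = SOCKETS * DIES_PER_SOCKET * len(CORES_USED_PER_DIE)
    if not 0 <= num_threads <= max_threads:
        raise ValueError(f"num_threads must be between 0 and {max_threads}")
    order = []
    for die in range(DIES_PER_SOCKET):        # use die 0 of every socket first
        for core in CORES_USED_PER_DIE:       # then step through cores 0, 2, 4
            for socket in range(SOCKETS):     # sockets vary fastest
                order.append((socket, die, core))
    return order[:num_threads]


def logical_cpu(socket, die, core):
    """Map a placement to a logical CPU ID under the assumed numbering."""
    return socket * CORES_PER_SOCKET + die * CORES_PER_DIE + core


def numactl_command(num_threads, program):
    """Build an argument list that restricts a program to the chosen CPUs."""
    cpus = ",".join(str(logical_cpu(*p)) for p in pin_order(num_threads))
    # --localalloc asks for memory on the NUMA node of the allocating CPU.
    return ["numactl", f"--physcpubind={cpus}", "--localalloc", program]


if __name__ == "__main__":
    print(" ".join(numactl_command(8, "./my_app")))
```

Note that numactl only restricts the process to the listed CPUs; binding each thread to its own core within that set is still up to the application or its threading runtime.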
MVAPICH2 and OpenMPI both provide CPU affinity settings.
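As one illustration, MVAPICH2 reads an explicit per-rank core list from the MV2_CPU_MAPPING environment variable, a colon-separated list with one entry per local rank. The sketch below only builds that string from a list of core IDs; the helper names and the example core IDs are hypothetical, and the exact behavior depends on the MVAPICH2 version installed, so treat this as a starting point and check the user guide. Recent Open MPI releases expose similar control through mpirun options such as --bind-to core and --map-by socket, and --report-bindings prints the resulting placement.

```python
def mv2_cpu_mapping(cpu_ids):
    """Format core IDs as a colon-separated MV2_CPU_MAPPING value."""
    cpu_ids = list(cpu_ids)
    if len(set(cpu_ids)) != len(cpu_ids):
        raise ValueError("each local rank needs a distinct core")
    return ":".join(str(cpu) for cpu in cpu_ids)


def mvapich2_env(cpu_ids):
    """Return the environment variables to export before launching the job."""
    return {
        "MV2_ENABLE_AFFINITY": "1",
        "MV2_CPU_MAPPING": mv2_cpu_mapping(cpu_ids),
    }


if __name__ == "__main__":
    # Example core IDs only; substitute the cores chosen for your layout.
    for name, value in mvapich2_env([0, 16, 32, 48]).items():
        print(f"export {name}={value}")
```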