Profiling and optimization
In general, to reach performance close to the theoretical peak, you need to write your algorithms in a form that allows the use of scientific library routines, such as BLACS/LAPACK.
Arm Performance Reports
Arm Performance Reports offers a convenient way to get an overview profile of your run very quickly. It typically introduces negligible runtime overhead, and all you need to do is load the perf-reports module and launch your “normal” execution using perf-report.
Here is an example script:
    #!/usr/bin/env bash

    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=20
    #SBATCH --time=0-00:10:00

    module load perf-reports/5.1

    # create temporary scratch area for this job on the global file system
    SCRATCH_DIRECTORY=/global/work/$USER/$SLURM_JOBID
    mkdir -p $SCRATCH_DIRECTORY

    # run the performance report
    # all you need to do is to launch your "normal" execution
    # with "perf-report"
    cd $SCRATCH_DIRECTORY
    perf-report mpiexec -n 20 $SLURM_SUBMIT_DIR/example.x

    # perf-report generates summary files in html and txt format
    # we copy result files to submit dir
    cp *.html *.txt $SLURM_SUBMIT_DIR

    # clean up the scratch directory
    cd /tmp
    rm -rf $SCRATCH_DIRECTORY
The script profiles an example binary, example.x, located in the job's submit directory.
The profiler generates summary files in HTML and text format. Open the HTML summary in your browser to view the report.
Performance tuning by Compiler flags
Quick and dirty
We usually recommend the ifort/icc compilers, as they give superior performance on Stallo. Using -O3 is a quick way to get reasonable performance for most applications. Unfortunately, the compiler sometimes breaks the code with -O3, making it crash or give incorrect results. In that case, try a lower optimization level, -O1. If this doesn’t help, let us know and we will try to solve the problem or report a compiler bug to Intel. If you need to use -O1 instead of -O3, please remember to add the -ftz flag too; it flushes small (denormal) values to zero. Doing this can have a huge impact on the performance of your application.
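As a rough illustration of the advice above, the following sketch prints the two compile lines rather than running them, so it works on a machine without the Intel compilers. The helper name opt_flags and the source file example.f90 are placeholders, not part of the Stallo setup; only -O3, -O1 and -ftz come from the text.

```shell
#!/usr/bin/env bash

# Hypothetical helper: map an optimization level to the flags suggested above.
opt_flags() {
    case "$1" in
        O3) echo "-O3" ;;
        O1) echo "-O1 -ftz" ;;   # at the lower level, add -ftz explicitly
        *)  echo "unknown optimization level: $1" >&2; return 1 ;;
    esac
}

# Print the commands instead of executing them (example.f90 is a placeholder).
echo "ifort $(opt_flags O3) -o example.x example.f90"
echo "ifort $(opt_flags O1) -o example.x example.f90"
```

If the -O3 build crashes or gives incorrect results, the second line is the fallback to try before reporting the problem.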
Profile based optimization
The Intel compilers can do something called profile based optimization. This uses information from the execution of the application to create more effective code. It is important that you run the application with a typical input set, or else the compiler will tune the application for a different usage profile than the one you are interested in. A typical input set means, for instance, a full spatial input set run for just a few time-stepping iterations.
- Compile with
- Run the app (might take a long time as optimization is turned off in this stage).
- Recompile with -prof-use. The simplest case is to compile, run, and recompile in the same directory; otherwise you need to use the -prof-dir flag, see the manual for details.
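The three steps above can be sketched as a small script. This is only an outline under stated assumptions: the instrumentation flag -prof-gen is assumed here (it is Intel's usual counterpart to -prof-use, but is not named in the list above), the file names and the profile directory are placeholders, and the exact -prof-dir syntax should be checked in the compiler manual. When the compiler is not found, the commands are printed instead of executed.

```shell
#!/usr/bin/env bash

COMPILER=${COMPILER:-ifort}           # compiler named in the text
SRC=${SRC:-example.f90}               # hypothetical source file
PROF_DIR=${PROF_DIR:-$PWD/profdata}   # hypothetical directory for profile data

# Execute a command if the compiler is available; otherwise only print it.
run() {
    if command -v "$COMPILER" >/dev/null 2>&1; then
        "$@"
    else
        echo "[dry run] $*"
    fi
}

pgo_build() {
    run mkdir -p "$PROF_DIR"
    # 1. Instrumented build (assumed flag: -prof-gen).
    run "$COMPILER" -prof-gen -prof-dir "$PROF_DIR" -o example.x "$SRC"
    # 2. Run with a typical input set; this may be slow.
    run ./example.x
    # 3. Recompile using the collected profile.
    run "$COMPILER" -O3 -prof-use -prof-dir "$PROF_DIR" -o example.x "$SRC"
}

pgo_build
```

Keeping the profile data in one explicit directory is a design choice for the sketch; if you compile, run, and recompile in the same directory, the -prof-dir arguments can be dropped.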
Intel VTune Amplifier is a versatile serial and parallel profiler, with features such as stack sampling, thread profiling and hardware event sampling.