[llvm-mca][docs] Add Timeline and How MCA works.

For the most part, these changes were from the RFC.  I made a few minor
word/structure changes, but nothing significant.  I also regenerated the
example output, and adjusted the text accordingly.

Differential Revision: https://reviews.llvm.org/D49527

llvm-svn: 337496
This commit is contained in:
Matt Davis 2018-07-19 20:33:59 +00:00
parent 056904599b
commit bc093ea003
1 changed files with 223 additions and 3 deletions

View File

@ -21,9 +21,9 @@ The main goal of this tool is not just to predict the performance of the code
when run on the target, but also help with diagnosing potential performance
issues.
Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per
Cycle), as well as hardware resource pressure. The analysis and reporting style
were inspired by the IACA tool from Intel.
Given an assembly code sequence, llvm-mca estimates the IPC, as well as
hardware resource pressure. The analysis and reporting style were inspired by
the IACA tool from Intel.
:program:`llvm-mca` allows the usage of special code comments to mark regions of
the assembly code to be analyzed. A comment starting with substring
@ -207,3 +207,223 @@ EXIT STATUS
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.
HOW MCA WORKS
-------------
MCA takes assembly code as input. The assembly code is parsed into a sequence
of MCInst with the help of the existing LLVM target assembly parsers. The
parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
a performance report.
The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.
Here is an example of a performance report generated by MCA for a dot-product
of two packed float vectors of four elements. The analysis is conducted for
target x86, cpu btver2. The following result can be produced via the following
command using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
.. code-block:: bash
$ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s
.. code-block:: none
Iterations: 300
Instructions: 900
Total Cycles: 610
Dispatch Width: 2
IPC: 1.48
Block RThroughput: 2.0
Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)
[1] [2] [3] [4] [5] [6] Instructions:
1 2 1.00 vmulps %xmm0, %xmm1, %xmm2
1 3 1.00 vhaddps %xmm2, %xmm2, %xmm3
1 3 1.00 vhaddps %xmm3, %xmm3, %xmm4
Resources:
[0] - JALU0
[1] - JALU1
[2] - JDiv
[3] - JFPA
[4] - JFPM
[5] - JFPU0
[6] - JFPU1
[7] - JLAGU
[8] - JMul
[9] - JSAGU
[10] - JSTC
[11] - JVALU0
[12] - JVALU1
[13] - JVIMUL
Resource pressure per iteration:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13]
- - - 2.00 1.00 2.00 1.00 - - - - - - -
Resource pressure by instruction:
[0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] Instructions:
- - - - 1.00 - 1.00 - - - - - - - vmulps %xmm0, %xmm1, %xmm2
- - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm2, %xmm2, %xmm3
- - - 1.00 - 1.00 - - - - - - - - vhaddps %xmm3, %xmm3, %xmm4
According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 dynamically executed instructions.
The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are the predicted total number of cycles, and the
Instructions Per Cycle (IPC). IPC is probably the most important throughput
indicator. A big delta between the Dispatch Width and the computed IPC is an
indicator of potential performance issues.
The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').
The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instruction in the sequence. For example, every iteration
of the instruction vmulps always executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on Jaguar, vector floating-point multiply can only be
issued to pipeline JFPU1, while horizontal floating-point additions can only be
issued to pipeline JFPU0.
The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.
Timeline View
^^^^^^^^^^^^^
MCA's timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:
* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.
Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
MCA using the following command:
.. code-block:: bash
$ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s
.. code-block:: none
Timeline view:
012345
Index 0123456789
[0,0] DeeER. . . vmulps %xmm0, %xmm1, %xmm2
[0,1] D==eeeER . . vhaddps %xmm2, %xmm2, %xmm3
[0,2] .D====eeeER . vhaddps %xmm3, %xmm3, %xmm4
[1,0] .DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
[1,1] . D=eeeE---R . vhaddps %xmm2, %xmm2, %xmm3
[1,2] . D====eeeER . vhaddps %xmm3, %xmm3, %xmm4
[2,0] . DeeE-----R . vmulps %xmm0, %xmm1, %xmm2
[2,1] . D====eeeER . vhaddps %xmm2, %xmm2, %xmm3
[2,2] . D======eeeER vhaddps %xmm3, %xmm3, %xmm4
Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage
[0] [1] [2] [3]
0. 3 1.0 1.0 3.3 vmulps %xmm0, %xmm1, %xmm2
1. 3 3.3 0.7 1.0 vhaddps %xmm2, %xmm2, %xmm3
2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how MCA processes instructions
executed on the target, and how their timing information might be calculated.
The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.
An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations: ``-iterations=3``, the iteration
indices range from 0-2 inclusively.
Excluding the first and last column, the remaining columns are in cycles.
Cycles are numbered sequentially starting from 0.
From the example output above, we know the following:
* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.
Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.
There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).
In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).
In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at register renaming stage (at the cost of allocating register aliases,
and therefore consuming temporary registers).
Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Note that MCA, by default, assumes at least 1cy
between the dispatch event and the issue event.
When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large of an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.