[llvm-mca][docs] Add Timeline and How MCA works.
For the most part, these changes were from the RFC. I made a few minor word/structure changes, but nothing significant. I also regenerated the example output, and adjusted the text accordingly.

Differential Revision: https://reviews.llvm.org/D49527

llvm-svn: 337496
parent 056904599b
commit bc093ea003
@ -21,9 +21,9 @@ The main goal of this tool is not just to predict the performance of the code
when run on the target, but also help with diagnosing potential performance
issues.

-Given an assembly code sequence, llvm-mca estimates the IPC (Instructions Per
-Cycle), as well as hardware resource pressure. The analysis and reporting style
-were inspired by the IACA tool from Intel.
+Given an assembly code sequence, llvm-mca estimates the IPC, as well as
+hardware resource pressure. The analysis and reporting style were inspired by
+the IACA tool from Intel.

:program:`llvm-mca` allows the usage of special code comments to mark regions of
the assembly code to be analyzed. A comment starting with substring
@ -207,3 +207,223 @@ EXIT STATUS
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1.

HOW MCA WORKS
-------------

MCA takes assembly code as input. The assembly code is parsed into a sequence
of MCInst with the help of the existing LLVM target assembly parsers. The
parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
a performance report.

The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects
a number of execution-related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics.

Here is an example of a performance report generated by MCA for a dot-product
of two packed float vectors of four elements. The analysis is conducted for
target x86, cpu btver2. The report can be reproduced with the following
command, using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=300 dot-product.s

.. code-block:: none

  Iterations:        300
  Instructions:      900
  Total Cycles:      610
  Dispatch Width:    2
  IPC:               1.48
  Block RThroughput: 2.0


  Instruction Info:
  [1]: #uOps
  [2]: Latency
  [3]: RThroughput
  [4]: MayLoad
  [5]: MayStore
  [6]: HasSideEffects (U)

  [1]    [2]    [3]    [4]    [5]    [6]    Instructions:
   1      2     1.00                        vmulps  %xmm0, %xmm1, %xmm2
   1      3     1.00                        vhaddps %xmm2, %xmm2, %xmm3
   1      3     1.00                        vhaddps %xmm3, %xmm3, %xmm4


  Resources:
  [0]   - JALU0
  [1]   - JALU1
  [2]   - JDiv
  [3]   - JFPA
  [4]   - JFPM
  [5]   - JFPU0
  [6]   - JFPU1
  [7]   - JLAGU
  [8]   - JMul
  [9]   - JSAGU
  [10]  - JSTC
  [11]  - JVALU0
  [12]  - JVALU1
  [13]  - JVIMUL


  Resource pressure per iteration:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
   -      -      -     2.00   1.00   2.00   1.00    -      -      -      -      -      -      -

  Resource pressure by instruction:
  [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
   -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmulps  %xmm0, %xmm1, %xmm2
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm2, %xmm2, %xmm3
   -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vhaddps %xmm3, %xmm3, %xmm4

According to this report, the dot-product kernel has been executed 300 times,
for a total of 900 dynamically executed instructions.

The report is structured in three main sections. The first section collects a
few performance numbers; the goal of this section is to give a very quick
overview of the performance throughput. In this example, the two important
performance indicators are the predicted total number of cycles, and the
Instructions Per Cycle (IPC). IPC is probably the most important throughput
indicator. A big delta between the Dispatch Width and the computed IPC is an
indicator of potential performance issues.
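As a quick sanity check, the summary numbers can be recomputed by hand. The sketch below is not part of MCA; it only redoes the arithmetic with values copied from the report above, and the variable names are illustrative.

```python
# Values copied from the summary section of the example report.
iterations = 300
instructions_per_iteration = 3  # vmulps, vhaddps, vhaddps
total_cycles = 610
dispatch_width = 2

# Dynamically executed instructions across all iterations.
instructions = iterations * instructions_per_iteration

# IPC is the number of executed instructions divided by the cycle count.
ipc = instructions / total_cycles

print(f"Instructions: {instructions}")
print(f"IPC: {ipc:.2f}")
print(f"Gap to dispatch width: {dispatch_width - ipc:.2f}")
```

With these inputs the computed IPC rounds to the 1.48 shown in the report, about half an instruction per cycle below the dispatch width of 2.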
|
||||
|
||||
The second section of the report shows the latency and reciprocal
throughput of every instruction in the sequence. That section also reports
extra information related to the number of micro opcodes, and opcode properties
(i.e., 'MayLoad', 'MayStore', and 'HasSideEffects').

The third section is the *Resource pressure view*. This view reports
the average number of resource cycles consumed every iteration by instructions
for every processor resource unit available on the target. Information is
structured in two tables. The first table reports the number of resource cycles
spent on average every iteration. The second table correlates the resource
cycles to the machine instructions in the sequence. For example, in every
iteration the instruction vmulps executes on resource unit [6]
(JFPU1 - floating point pipeline #1), consuming an average of 1 resource cycle
per iteration. Note that on Jaguar, vector floating-point multiply can only be
issued to pipeline JFPU1, while horizontal floating-point additions can only be
issued to pipeline JFPU0.

The resource pressure view helps with identifying bottlenecks caused by high
usage of specific hardware resources. Situations with resource pressure mainly
concentrated on a few resources should, in general, be avoided. Ideally,
pressure should be uniformly distributed between multiple resources.
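The relationship between the two resource pressure tables can be sketched in a few lines: summing the per-instruction rows yields the per-iteration row, and the largest entries point at the most heavily used units. The data below is transcribed from the example report; the dictionary layout and names are assumptions made for illustration, not an MCA data structure.

```python
# Resource names in the order listed under "Resources:" in the report.
resources = ["JALU0", "JALU1", "JDiv", "JFPA", "JFPM", "JFPU0", "JFPU1",
             "JLAGU", "JMul", "JSAGU", "JSTC", "JVALU0", "JVALU1", "JVIMUL"]

# "Resource pressure by instruction", keeping only the non-empty cells.
by_instruction = {
    "vmulps  %xmm0, %xmm1, %xmm2": {"JFPM": 1.00, "JFPU1": 1.00},
    "vhaddps %xmm2, %xmm2, %xmm3": {"JFPA": 1.00, "JFPU0": 1.00},
    "vhaddps %xmm3, %xmm3, %xmm4": {"JFPA": 1.00, "JFPU0": 1.00},
}

# Summing the rows reproduces "Resource pressure per iteration".
per_iteration = {name: 0.0 for name in resources}
for usage in by_instruction.values():
    for name, cycles in usage.items():
        per_iteration[name] += cycles

# The most pressured units are the first place to look for a bottleneck.
busiest = max(per_iteration.values())
hot = [name for name in resources if per_iteration[name] == busiest]

print({name: c for name, c in per_iteration.items() if c > 0})
print("Most pressured:", hot, "at", busiest, "cycles per iteration")
```

In this example the pressure concentrates on JFPA and JFPU0 (2 cycles each per iteration), because both vhaddps instructions compete for the same units.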
|
||||
|
||||
Timeline View
^^^^^^^^^^^^^
MCA's timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report.
These states are represented by the following characters:

* D : Instruction dispatched.
* e : Instruction executing.
* E : Instruction executed.
* R : Instruction retired.
* = : Instruction already dispatched, waiting to be executed.
* \- : Instruction executed, waiting to be retired.

Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
MCA using the following command:

.. code-block:: bash

  $ llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -iterations=3 -timeline dot-product.s

.. code-block:: none

  Timeline view:
                      012345
  Index     0123456789

  [0,0]     DeeER.    .    .   vmulps  %xmm0, %xmm1, %xmm2
  [0,1]     D==eeeER  .    .   vhaddps %xmm2, %xmm2, %xmm3
  [0,2]     .D====eeeER    .   vhaddps %xmm3, %xmm3, %xmm4
  [1,0]     .DeeE-----R    .   vmulps  %xmm0, %xmm1, %xmm2
  [1,1]     . D=eeeE---R   .   vhaddps %xmm2, %xmm2, %xmm3
  [1,2]     . D====eeeER   .   vhaddps %xmm3, %xmm3, %xmm4
  [2,0]     .  DeeE-----R  .   vmulps  %xmm0, %xmm1, %xmm2
  [2,1]     .  D====eeeER  .   vhaddps %xmm2, %xmm2, %xmm3
  [2,2]     .   D======eeeER   vhaddps %xmm3, %xmm3, %xmm4


  Average Wait times (based on the timeline view):
  [0]: Executions
  [1]: Average time spent waiting in a scheduler's queue
  [2]: Average time spent waiting in a scheduler's queue while ready
  [3]: Average time elapsed from WB until retire stage

        [0]    [1]    [2]    [3]
  0.     3     1.0    1.0    3.3       vmulps  %xmm0, %xmm1, %xmm2
  1.     3     3.3    0.7    1.0       vhaddps %xmm2, %xmm2, %xmm3
  2.     3     5.7    0.0    0.0       vhaddps %xmm3, %xmm3, %xmm4

The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how MCA processes instructions
executed on the target, and how their timing information might be calculated.

The timeline view is structured in two tables. The first table shows
instructions changing state over time (measured in cycles); the second table
(named *Average Wait times*) reports useful timing statistics, which should
help diagnose performance bottlenecks caused by long data dependencies and
sub-optimal usage of hardware resources.

An instruction in the timeline view is identified by a pair of indices, where
the first index identifies an iteration, and the second index is the
instruction index (i.e., where it appears in the code sequence). Since this
example was generated using 3 iterations (``-iterations=3``), the iteration
indices range from 0 to 2 inclusive.

Excluding the first and last columns, the remaining columns represent cycles.
Cycles are numbered sequentially starting from 0.

From the example output above, we know the following:

* Instruction [1,0] was dispatched at cycle 1.
* Instruction [1,0] started executing at cycle 2.
* Instruction [1,0] reached the write back stage at cycle 4.
* Instruction [1,0] was retired at cycle 10.
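The events listed above can be read mechanically from a timeline row, because the character position within the row is the cycle number. The helper below is a hypothetical illustration (``events`` is not an MCA function); it assumes the row string keeps its original column alignment, starting at cycle 0.

```python
def events(timeline):
    """Map each state marker in one timeline row to the cycle it occurs in."""
    return {
        "dispatched": timeline.index("D"),
        "started executing": timeline.index("e"),  # first 'e' in the row
        "write back": timeline.index("E"),
        "retired": timeline.index("R"),
    }

# Row for instruction [1,0], copied from the timeline view (columns only).
row_1_0 = ".DeeE-----R    ."

for name, cycle in events(row_1_0).items():
    print(f"[1,0] {name} at cycle {cycle}")

# Cycles spent waiting between write back and retirement.
print("cycles waiting to retire:", row_1_0.count("-"))
```

The same helper applied to the row for [0,2] (``.D====eeeER    .``) places its retirement at cycle 10 as well.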
|
||||
|
||||
Instruction [1,0] (i.e., vmulps from iteration #1) does not have to wait in the
scheduler's queue for the operands to become available. By the time vmulps is
dispatched, operands are already available, and pipeline JFPU1 is ready to
serve another instruction. So the instruction can be immediately issued on the
JFPU1 pipeline. That is demonstrated by the fact that the instruction only
spent 1cy in the scheduler's queue.

There is a gap of 5 cycles between the write-back stage and the retire event.
That is because instructions must retire in program order, so [1,0] has to wait
for [0,2] to be retired first (i.e., it has to wait until cycle 10).

In the example, all instructions are in a RAW (Read After Write) dependency
chain. Register %xmm2 written by vmulps is immediately used by the first
vhaddps, and register %xmm3 written by the first vhaddps is used by the second
vhaddps. Long data dependencies negatively impact the ILP (Instruction Level
Parallelism).
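The RAW chain described above can be found with a simple scan over the instruction sequence. This is a minimal sketch, not how MCA tracks dependencies: it assumes AT&T syntax with register-only operands and the destination written last, as in the example listing, and the function names are made up for illustration.

```python
def parse(line):
    """Split 'mnemonic src, src, dst' into (mnemonic, sources, destination)."""
    mnemonic, operands = line.split(None, 1)
    regs = [op.strip() for op in operands.split(",")]
    return mnemonic, regs[:-1], regs[-1]

def raw_dependencies(code):
    """Return (writer index, reader index, register) for each RAW dependency."""
    deps = []
    last_writer = {}  # register -> index of the most recent instruction writing it
    for i, line in enumerate(code):
        _, sources, dest = parse(line)
        for reg in dict.fromkeys(sources):  # drop repeated source registers
            if reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        last_writer[dest] = i
    return deps

code = [
    "vmulps  %xmm0, %xmm1, %xmm2",
    "vhaddps %xmm2, %xmm2, %xmm3",
    "vhaddps %xmm3, %xmm3, %xmm4",
]

for writer, reader, reg in raw_dependencies(code):
    print(f"instruction {reader} reads {reg} written by instruction {writer}")
```

For the dot-product kernel this reports two dependencies, 0 to 1 through %xmm2 and 1 to 2 through %xmm3, which is the serial chain visible in the timeline.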
|
||||
|
||||
In the dot-product example, there are anti-dependencies introduced by
instructions from different iterations. However, those dependencies can be
removed at the register renaming stage (at the cost of allocating register
aliases, and therefore consuming temporary registers).

Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Note that MCA, by default, assumes at least 1cy
between the dispatch event and the issue event.

When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected
to be very small when compared with the total number of cycles spent in the
scheduler's queue. The difference between the two counters is a good indicator
of how large an impact data dependencies had on the execution of the
instructions. When performance is mostly limited by the lack of hardware
resources, the delta between the two counters is small. However, the number of
cycles spent in the queue tends to be larger (i.e., more than 1-3cy),
especially when compared to other low latency instructions.
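Columns [1] and [3] of the *Average Wait times* table can be approximated from the timeline rows alone. The sketch below is an inference from the sample output, not MCA's implementation: it assumes time in the scheduler's queue equals the number of ``=`` cycles plus the 1cy that MCA assumes between dispatch and issue, and that time from write-back to retire equals the number of ``-`` cycles. Column [2] cannot be derived from the timeline, so the reported values are copied in to compute the delta discussed above.

```python
from collections import defaultdict

# (instruction index, state characters) for every row of the timeline view.
timeline = [
    (0, "DeeER"),      (1, "D==eeeER"),   (2, "D====eeeER"),
    (0, "DeeE-----R"), (1, "D=eeeE---R"), (2, "D====eeeER"),
    (0, "DeeE-----R"), (1, "D====eeeER"), (2, "D======eeeER"),
]

# Column [2] as printed in the report (not recomputed here).
reported_ready = {0: 1.0, 1: 0.7, 2: 0.0}

in_queue = defaultdict(list)
after_wb = defaultdict(list)
for index, states in timeline:
    in_queue[index].append(states.count("=") + 1)  # assumed 1cy dispatch-to-issue
    after_wb[index].append(states.count("-"))

avg_queue = {i: sum(v) / len(v) for i, v in in_queue.items()}
avg_retire = {i: sum(v) / len(v) for i, v in after_wb.items()}

for i in sorted(avg_queue):
    delta = avg_queue[i] - reported_ready[i]
    print(f"{i}. queue={avg_queue[i]:.1f} retire={avg_retire[i]:.1f} "
          f"queue-ready delta={delta:.1f}")
```

Under these assumptions the computed averages match columns [1] and [3] of the report. The delta between queue time and ready time grows along the chain (roughly 0, 2.6, and 5.7 cycles), which is the signature of a kernel limited by data dependencies rather than by hardware resources.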