[llvm-mca][docs] Always use `llvm-mca` in place of `MCA`.
llvm-svn: 338394
parent 0b8fdd2847
commit bdcf6ad60d
|
@ -207,23 +207,23 @@ EXIT STATUS
|
||||||
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
|
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
|
||||||
to standard error, and the tool returns 1.
|
to standard error, and the tool returns 1.
|
||||||
|
|
||||||
HOW MCA WORKS
|
HOW LLVM-MCA WORKS
|
||||||
-------------
|
------------------
|
||||||
|
|
||||||
MCA takes assembly code as input. The assembly code is parsed into a sequence
|
:program:`llvm-mca` takes assembly code as input. The assembly code is parsed
|
||||||
of MCInst with the help of the existing LLVM target assembly parsers. The
|
into a sequence of MCInst with the help of the existing LLVM target assembly
|
||||||
parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate
|
parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
|
||||||
a performance report.
|
to generate a performance report.
|
||||||
|
|
||||||
The Pipeline module simulates the execution of the machine code sequence in a
|
The Pipeline module simulates the execution of the machine code sequence in a
|
||||||
loop of iterations (default is 100). During this process, the pipeline collects
|
loop of iterations (default is 100). During this process, the pipeline collects
|
||||||
a number of execution related statistics. At the end of this process, the
|
a number of execution related statistics. At the end of this process, the
|
||||||
pipeline generates and prints a report from the collected statistics.
|
pipeline generates and prints a report from the collected statistics.
|
||||||
|
|
||||||
Here is an example of a performance report generated by MCA for a dot-product
|
Here is an example of a performance report generated by the tool for a
|
||||||
of two packed float vectors of four elements. The analysis is conducted for
|
dot-product of two packed float vectors of four elements. The analysis is
|
||||||
target x86, cpu btver2. The following result can be produced via the following
|
conducted for target x86, cpu btver2. The following result can be produced via
|
||||||
command using the example located at
|
the following command using the example located at
|
||||||
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
|
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
@ -316,7 +316,7 @@ pressure should be uniformly distributed between multiple resources.
|
||||||
|
|
||||||
Timeline View
|
Timeline View
|
||||||
^^^^^^^^^^^^^
|
^^^^^^^^^^^^^
|
||||||
MCA's timeline view produces a detailed report of each instruction's state
|
The timeline view produces a detailed report of each instruction's state
|
||||||
transitions through an instruction pipeline. This view is enabled by the
|
transitions through an instruction pipeline. This view is enabled by the
|
||||||
command line option ``-timeline``. As instructions transition through the
|
command line option ``-timeline``. As instructions transition through the
|
||||||
various stages of the pipeline, their states are depicted in the view report.
|
various stages of the pipeline, their states are depicted in the view report.
|
||||||
|
@ -331,7 +331,7 @@ These states are represented by the following characters:
|
||||||
|
|
||||||
Below is the timeline view for a subset of the dot-product example located in
|
Below is the timeline view for a subset of the dot-product example located in
|
||||||
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
|
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
|
||||||
MCA using the following command:
|
:program:`llvm-mca` using the following command:
|
||||||
|
|
||||||
.. code-block:: bash
|
.. code-block:: bash
|
||||||
|
|
||||||
|
@ -366,7 +366,7 @@ MCA using the following command:
|
||||||
2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
|
2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
|
||||||
|
|
||||||
The timeline view is interesting because it shows instruction state changes
|
The timeline view is interesting because it shows instruction state changes
|
||||||
during execution. It also gives an idea of how MCA processes instructions
|
during execution. It also gives an idea of how the tool processes instructions
|
||||||
executed on the target, and how their timing information might be calculated.
|
executed on the target, and how their timing information might be calculated.
|
||||||
|
|
||||||
The timeline view is structured in two tables. The first table shows
|
The timeline view is structured in two tables. The first table shows
|
||||||
|
@ -415,8 +415,8 @@ and therefore consuming temporary registers).
|
||||||
|
|
||||||
Table *Average Wait times* helps diagnose performance issues that are caused by
|
Table *Average Wait times* helps diagnose performance issues that are caused by
|
||||||
the presence of long latency instructions and potentially long data dependencies
|
the presence of long latency instructions and potentially long data dependencies
|
||||||
which may limit the ILP. Note that MCA, by default, assumes at least 1cy
|
which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
|
||||||
between the dispatch event and the issue event.
|
least 1cy between the dispatch event and the issue event.
|
||||||
|
|
||||||
When the performance is limited by data dependencies and/or long latency
|
When the performance is limited by data dependencies and/or long latency
|
||||||
instructions, the number of cycles spent while in the *ready* state is expected
|
instructions, the number of cycles spent while in the *ready* state is expected
|
||||||
|
@ -602,9 +602,9 @@ entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
|
||||||
the target scheduling model.
|
the target scheduling model.
|
||||||
|
|
||||||
Instructions that are dispatched to the schedulers consume scheduler buffer
|
Instructions that are dispatched to the schedulers consume scheduler buffer
|
||||||
entries. MCA queries the scheduling model to determine the set of
|
entries. :program:`llvm-mca` queries the scheduling model to determine the set
|
||||||
buffered resources consumed by an instruction. Buffered resources are treated
|
of buffered resources consumed by an instruction. Buffered resources are
|
||||||
like scheduler resources.
|
treated like scheduler resources.
|
||||||
|
|
||||||
Instruction Issue
|
Instruction Issue
|
||||||
"""""""""""""""""
|
"""""""""""""""""
|
||||||
|
@ -612,22 +612,21 @@ Each processor scheduler implements a buffer of instructions. An instruction
|
||||||
has to wait in the scheduler's buffer until input register operands become
|
has to wait in the scheduler's buffer until input register operands become
|
||||||
available. Only at that point, does the instruction becomes eligible for
|
available. Only at that point, does the instruction becomes eligible for
|
||||||
execution and may be issued (potentially out-of-order) for execution.
|
execution and may be issued (potentially out-of-order) for execution.
|
||||||
Instruction latencies are computed by MCA with the help of the scheduling
|
Instruction latencies are computed by :program:`llvm-mca` with the help of the
|
||||||
model.
|
scheduling model.
|
||||||
|
|
||||||
MCA's scheduler is designed to simulate multiple processor schedulers. The
|
:program:`llvm-mca`'s scheduler is designed to simulate multiple processor
|
||||||
scheduler is responsible for tracking data dependencies, and dynamically
|
schedulers. The scheduler is responsible for tracking data dependencies, and
|
||||||
selecting which processor resources are consumed by instructions.
|
dynamically selecting which processor resources are consumed by instructions.
|
||||||
|
It delegates the management of processor resource units and resource groups to a
|
||||||
The scheduler delegates the management of processor resource units and resource
|
resource manager. The resource manager is responsible for selecting resource
|
||||||
groups to a resource manager. The resource manager is responsible for
|
units that are consumed by instructions. For example, if an instruction
|
||||||
selecting resource units that are consumed by instructions. For example, if an
|
consumes 1cy of a resource group, the resource manager selects one of the
|
||||||
instruction consumes 1cy of a resource group, the resource manager selects one
|
available units from the group; by default, the resource manager uses a
|
||||||
of the available units from the group; by default, the resource manager uses a
|
|
||||||
round-robin selector to guarantee that resource usage is uniformly distributed
|
round-robin selector to guarantee that resource usage is uniformly distributed
|
||||||
between all units of a group.
|
between all units of a group.
|
||||||
|
|
||||||
MCA's scheduler implements three instruction queues:
|
:program:`llvm-mca`'s scheduler implements three instruction queues:
|
||||||
|
|
||||||
* WaitQueue: a queue of instructions whose operands are not ready.
|
* WaitQueue: a queue of instructions whose operands are not ready.
|
||||||
* ReadyQueue: a queue of instructions ready to execute.
|
* ReadyQueue: a queue of instructions ready to execute.
|
||||||
|
@ -638,8 +637,8 @@ scheduler are either placed into the WaitQueue or into the ReadyQueue.
|
||||||
|
|
||||||
Every cycle, the scheduler checks if instructions can be moved from the
|
Every cycle, the scheduler checks if instructions can be moved from the
|
||||||
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
|
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
|
||||||
issued. The algorithm prioritizes older instructions over younger
|
issued to the underlying pipelines. The algorithm prioritizes older instructions
|
||||||
instructions.
|
over younger instructions.
|
||||||
|
|
||||||
Write-Back and Retire Stage
|
Write-Back and Retire Stage
|
||||||
"""""""""""""""""""""""""""
|
"""""""""""""""""""""""""""
|
||||||
|
@ -656,15 +655,13 @@ for the instruction during the register renaming stage.
|
||||||
|
|
||||||
Load/Store Unit and Memory Consistency Model
|
Load/Store Unit and Memory Consistency Model
|
||||||
""""""""""""""""""""""""""""""""""""""""""""
|
""""""""""""""""""""""""""""""""""""""""""""
|
||||||
To simulate an out-of-order execution of memory operations, MCA utilizes a
|
To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
|
||||||
simulated load/store unit (LSUnit) to simulate the speculative execution of
|
utilizes a simulated load/store unit (LSUnit) to simulate the speculative
|
||||||
loads and stores.
|
execution of loads and stores.
|
||||||
|
|
||||||
Each load (or store) consumes an entry in the load (or store) queue. The
|
Each load (or store) consumes an entry in the load (or store) queue. Users can
|
||||||
number of slots in the load/store queues is unknown by MCA, since there is no
|
specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
|
||||||
mention of it in the scheduling model. In practice, users can specify flags
|
load and store queues respectively. The queues are unbounded by default.
|
||||||
``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
|
|
||||||
store queues respectively. The queues are unbounded by default.
|
|
||||||
|
|
||||||
The LSUnit implements a relaxed consistency model for memory loads and stores.
|
The LSUnit implements a relaxed consistency model for memory loads and stores.
|
||||||
The rules are:
|
The rules are:
|
||||||
|
@ -701,15 +698,15 @@ cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
|
||||||
loads, the scheduling model provides an "optimistic" load-to-use latency (which
|
loads, the scheduling model provides an "optimistic" load-to-use latency (which
|
||||||
usually matches the load-to-use latency for when there is a hit in the L1D).
|
usually matches the load-to-use latency for when there is a hit in the L1D).
|
||||||
|
|
||||||
MCA does not know about serializing operations or memory-barrier like
|
:program:`llvm-mca` does not know about serializing operations or memory-barrier
|
||||||
instructions. The LSUnit conservatively assumes that an instruction which has
|
like instructions. The LSUnit conservatively assumes that an instruction which
|
||||||
both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier.
|
has both "MayLoad" and unmodeled side effects behaves like a "soft"
|
||||||
That means, it serializes loads without forcing a flush of the load queue.
|
load-barrier. That means, it serializes loads without forcing a flush of the
|
||||||
Similarly, instructions that "MayStore" and have unmodeled side effects are
|
load queue. Similarly, instructions that "MayStore" and have unmodeled side
|
||||||
treated like store barriers. A full memory barrier is a "MayLoad" and
|
effects are treated like store barriers. A full memory barrier is a "MayLoad"
|
||||||
"MayStore" instruction with unmodeled side effects. This is inaccurate, but it
|
and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
|
||||||
is the best that we can do at the moment with the current information available
|
it is the best that we can do at the moment with the current information
|
||||||
in LLVM.
|
available in LLVM.
|
||||||
|
|
||||||
A load/store barrier consumes one entry of the load/store queue. A load/store
|
A load/store barrier consumes one entry of the load/store queue. A load/store
|
||||||
barrier enforces ordering of loads/stores. A younger load cannot pass a load
|
barrier enforces ordering of loads/stores. A younger load cannot pass a load
|
||||||
|
|
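Note on the "Instruction Issue" hunk: the round-robin selection it describes for resource groups can be illustrated with a short sketch. This is not the llvm-mca implementation. The class name, the `busy` set, and the unit names are hypothetical and exist only for illustration. The only behavior taken from the text is that, when an instruction consumes a resource group, one available unit is picked in rotation so that usage is spread uniformly across the group.

```python
class RoundRobinGroup:
    """Hypothetical model of a resource group with a round-robin selector."""

    def __init__(self, units):
        self.units = list(units)   # resource units that belong to the group
        self.next_index = 0        # position where the next search starts
        self.busy = set()          # units that are currently unavailable

    def select(self):
        """Return the next available unit in rotation, or None if all are busy."""
        count = len(self.units)
        for offset in range(count):
            index = (self.next_index + offset) % count
            unit = self.units[index]
            if unit not in self.busy:
                # Start the next search just past the unit we picked.
                self.next_index = (index + 1) % count
                return unit
        return None


group = RoundRobinGroup(["unit0", "unit1"])
picks = [group.select() for _ in range(4)]
```

With two free units, consecutive selections alternate between them. If a unit is marked busy, the selector skips it and continues the rotation from the unit it actually picked.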