[llvm-mca][docs] Always use `llvm-mca` in place of `MCA`.

llvm-svn: 338394
Andrea Di Biagio 2018-07-31 15:29:10 +00:00
parent 0b8fdd2847
commit bdcf6ad60d
1 changed files with 46 additions and 49 deletions

@ -207,23 +207,23 @@ EXIT STATUS
:program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed :program:`llvm-mca` returns 0 on success. Otherwise, an error message is printed
to standard error, and the tool returns 1. to standard error, and the tool returns 1.
HOW MCA WORKS HOW LLVM-MCA WORKS
------------- ------------------
MCA takes assembly code as input. The assembly code is parsed into a sequence :program:`llvm-mca` takes assembly code as input. The assembly code is parsed
of MCInst with the help of the existing LLVM target assembly parsers. The into a sequence of MCInst with the help of the existing LLVM target assembly
parsed sequence of MCInst is then analyzed by a ``Pipeline`` module to generate parsers. The parsed sequence of MCInst is then analyzed by a ``Pipeline`` module
a performance report. to generate a performance report.
The Pipeline module simulates the execution of the machine code sequence in a The Pipeline module simulates the execution of the machine code sequence in a
loop of iterations (default is 100). During this process, the pipeline collects loop of iterations (default is 100). During this process, the pipeline collects
a number of execution related statistics. At the end of this process, the a number of execution related statistics. At the end of this process, the
pipeline generates and prints a report from the collected statistics. pipeline generates and prints a report from the collected statistics.
Here is an example of a performance report generated by MCA for a dot-product Here is an example of a performance report generated by the tool for a
of two packed float vectors of four elements. The analysis is conducted for dot-product of two packed float vectors of four elements. The analysis is
target x86, cpu btver2. The following result can be produced via the following conducted for target x86, cpu btver2. The following result can be produced via
command using the example located at the following command using the example located at
``test/tools/llvm-mca/X86/BtVer2/dot-product.s``: ``test/tools/llvm-mca/X86/BtVer2/dot-product.s``:
.. code-block:: bash .. code-block:: bash
@ -316,7 +316,7 @@ pressure should be uniformly distributed between multiple resources.
Timeline View Timeline View
^^^^^^^^^^^^^ ^^^^^^^^^^^^^
MCA's timeline view produces a detailed report of each instruction's state The timeline view produces a detailed report of each instruction's state
transitions through an instruction pipeline. This view is enabled by the transitions through an instruction pipeline. This view is enabled by the
command line option ``-timeline``. As instructions transition through the command line option ``-timeline``. As instructions transition through the
various stages of the pipeline, their states are depicted in the view report. various stages of the pipeline, their states are depicted in the view report.
@ -331,7 +331,7 @@ These states are represented by the following characters:
Below is the timeline view for a subset of the dot-product example located in Below is the timeline view for a subset of the dot-product example located in
``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by ``test/tools/llvm-mca/X86/BtVer2/dot-product.s`` and processed by
MCA using the following command: :program:`llvm-mca` using the following command:
.. code-block:: bash .. code-block:: bash
@ -366,7 +366,7 @@ MCA using the following command:
2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4 2. 3 5.7 0.0 0.0 vhaddps %xmm3, %xmm3, %xmm4
The timeline view is interesting because it shows instruction state changes The timeline view is interesting because it shows instruction state changes
during execution. It also gives an idea of how MCA processes instructions during execution. It also gives an idea of how the tool processes instructions
executed on the target, and how their timing information might be calculated. executed on the target, and how their timing information might be calculated.
The timeline view is structured in two tables. The first table shows The timeline view is structured in two tables. The first table shows
@ -415,8 +415,8 @@ and therefore consuming temporary registers).
Table *Average Wait times* helps diagnose performance issues that are caused by Table *Average Wait times* helps diagnose performance issues that are caused by
the presence of long latency instructions and potentially long data dependencies the presence of long latency instructions and potentially long data dependencies
which may limit the ILP. Note that MCA, by default, assumes at least 1cy which may limit the ILP. Note that :program:`llvm-mca`, by default, assumes at
between the dispatch event and the issue event. least 1cy between the dispatch event and the issue event.
When the performance is limited by data dependencies and/or long latency When the performance is limited by data dependencies and/or long latency
instructions, the number of cycles spent while in the *ready* state is expected instructions, the number of cycles spent while in the *ready* state is expected
@ -602,9 +602,9 @@ entries in the reorder buffer defaults to the `MicroOpBufferSize` provided by
the target scheduling model. the target scheduling model.
Instructions that are dispatched to the schedulers consume scheduler buffer Instructions that are dispatched to the schedulers consume scheduler buffer
entries. MCA queries the scheduling model to determine the set of entries. :program:`llvm-mca` queries the scheduling model to determine the set
buffered resources consumed by an instruction. Buffered resources are treated of buffered resources consumed by an instruction. Buffered resources are
like scheduler resources. treated like scheduler resources.
Instruction Issue Instruction Issue
""""""""""""""""" """""""""""""""""
@ -612,22 +612,21 @@ Each processor scheduler implements a buffer of instructions. An instruction
has to wait in the scheduler's buffer until input register operands become has to wait in the scheduler's buffer until input register operands become
available. Only at that point, does the instruction becomes eligible for available. Only at that point, does the instruction becomes eligible for
execution and may be issued (potentially out-of-order) for execution. execution and may be issued (potentially out-of-order) for execution.
Instruction latencies are computed by MCA with the help of the scheduling Instruction latencies are computed by :program:`llvm-mca` with the help of the
model. scheduling model.
MCA's scheduler is designed to simulate multiple processor schedulers. The :program:`llvm-mca`'s scheduler is designed to simulate multiple processor
scheduler is responsible for tracking data dependencies, and dynamically schedulers. The scheduler is responsible for tracking data dependencies, and
selecting which processor resources are consumed by instructions. dynamically selecting which processor resources are consumed by instructions.
It delegates the management of processor resource units and resource groups to a
The scheduler delegates the management of processor resource units and resource resource manager. The resource manager is responsible for selecting resource
groups to a resource manager. The resource manager is responsible for units that are consumed by instructions. For example, if an instruction
selecting resource units that are consumed by instructions. For example, if an consumes 1cy of a resource group, the resource manager selects one of the
instruction consumes 1cy of a resource group, the resource manager selects one available units from the group; by default, the resource manager uses a
of the available units from the group; by default, the resource manager uses a
round-robin selector to guarantee that resource usage is uniformly distributed round-robin selector to guarantee that resource usage is uniformly distributed
between all units of a group. between all units of a group.
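
As a rough illustration of the round-robin selection described in the paragraph above, here is a minimal Python sketch. It is not the code used by :program:`llvm-mca`: the class ``ResourceGroup``, its ``select_unit`` method, and the unit names are hypothetical, and unit availability is reduced to a simple busy flag.

```python
class ResourceGroup:
    """Hypothetical model of a resource group with round-robin unit selection."""

    def __init__(self, units):
        self.units = list(units)
        # Assumption: a unit is either free or busy; no per-cycle bookkeeping.
        self.busy = {unit: False for unit in self.units}
        self.next_index = 0  # position where the next search starts

    def select_unit(self):
        """Return the next available unit in round-robin order, or None."""
        count = len(self.units)
        for offset in range(count):
            i = (self.next_index + offset) % count
            unit = self.units[i]
            if not self.busy[unit]:
                # Start the next search just after the unit picked now, so
                # that usage is spread evenly across the group.
                self.next_index = (i + 1) % count
                return unit
        return None  # every unit in the group is busy


# Illustrative unit names only.
group = ResourceGroup(["JALU0", "JALU1"])
picks = [group.select_unit() for _ in range(4)]
```

With both units free, successive selections alternate between the two units. If one unit is marked busy, the selector skips it and returns the other.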
MCA's scheduler implements three instruction queues: :program:`llvm-mca`'s scheduler implements three instruction queues:
* WaitQueue: a queue of instructions whose operands are not ready. * WaitQueue: a queue of instructions whose operands are not ready.
* ReadyQueue: a queue of instructions ready to execute. * ReadyQueue: a queue of instructions ready to execute.
@ -638,8 +637,8 @@ scheduler are either placed into the WaitQueue or into the ReadyQueue.
Every cycle, the scheduler checks if instructions can be moved from the Every cycle, the scheduler checks if instructions can be moved from the
WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be WaitQueue to the ReadyQueue, and if instructions from the ReadyQueue can be
issued. The algorithm prioritizes older instructions over younger issued to the underlying pipelines. The algorithm prioritizes older instructions
instructions. over younger instructions.
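
The per-cycle behaviour described above can be sketched as follows. This is a toy model under stated assumptions, not the scheduler implemented by :program:`llvm-mca`: the names ``Instr``, ``ToyScheduler``, and ``issue_width`` are hypothetical, only the wait and ready queues are modelled, and operand readiness is reduced to a counter.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    index: int             # program order; a lower index means an older instruction
    pending_operands: int  # number of input operands that are not yet available


@dataclass
class ToyScheduler:
    issue_width: int = 1   # assumed limit on instructions issued per cycle
    wait_queue: list = field(default_factory=list)
    ready_queue: list = field(default_factory=list)

    def dispatch(self, instr):
        """Place a dispatched instruction in the wait queue or the ready queue."""
        if instr.pending_operands == 0:
            self.ready_queue.append(instr)
        else:
            self.wait_queue.append(instr)

    def cycle(self):
        """Simulate one cycle and return the indices of the issued instructions."""
        # Step 1: move instructions whose operands became ready.
        still_waiting = []
        for instr in self.wait_queue:
            if instr.pending_operands == 0:
                self.ready_queue.append(instr)
            else:
                still_waiting.append(instr)
        self.wait_queue = still_waiting

        # Step 2: issue from the ready queue, oldest instructions first.
        self.ready_queue.sort(key=lambda instr: instr.index)
        issued = self.ready_queue[:self.issue_width]
        self.ready_queue = self.ready_queue[self.issue_width:]
        return [instr.index for instr in issued]


sched = ToyScheduler(issue_width=1)
a, b, c = Instr(0, 1), Instr(1, 0), Instr(2, 0)
for instr in (a, b, c):
    sched.dispatch(instr)

first = sched.cycle()    # instruction 0 is still waiting on an operand
a.pending_operands = 0   # its operand becomes available
second = sched.cycle()   # instruction 0 is now ready and older than instruction 2
third = sched.cycle()
```

In this run the oldest instruction is issued second rather than first, because it only becomes ready after the first cycle; once it is ready it is issued ahead of the younger instruction.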
Write-Back and Retire Stage Write-Back and Retire Stage
""""""""""""""""""""""""""" """""""""""""""""""""""""""
@ -656,15 +655,13 @@ for the instruction during the register renaming stage.
Load/Store Unit and Memory Consistency Model Load/Store Unit and Memory Consistency Model
"""""""""""""""""""""""""""""""""""""""""""" """"""""""""""""""""""""""""""""""""""""""""
To simulate an out-of-order execution of memory operations, MCA utilizes a To simulate an out-of-order execution of memory operations, :program:`llvm-mca`
simulated load/store unit (LSUnit) to simulate the speculative execution of utilizes a simulated load/store unit (LSUnit) to simulate the speculative
loads and stores. execution of loads and stores.
Each load (or store) consumes an entry in the load (or store) queue. The Each load (or store) consumes an entry in the load (or store) queue. Users can
number of slots in the load/store queues is unknown by MCA, since there is no specify flags ``-lqueue`` and ``-squeue`` to limit the number of entries in the
mention of it in the scheduling model. In practice, users can specify flags load and store queues respectively. The queues are unbounded by default.
``-lqueue`` and ``-squeue`` to limit the number of entries in the load and
store queues respectively. The queues are unbounded by default.
The LSUnit implements a relaxed consistency model for memory loads and stores. The LSUnit implements a relaxed consistency model for memory loads and stores.
The rules are: The rules are:
@ -701,15 +698,15 @@ cache. It only knows if an instruction "MayLoad" and/or "MayStore." For
loads, the scheduling model provides an "optimistic" load-to-use latency (which loads, the scheduling model provides an "optimistic" load-to-use latency (which
usually matches the load-to-use latency for when there is a hit in the L1D). usually matches the load-to-use latency for when there is a hit in the L1D).
MCA does not know about serializing operations or memory-barrier like :program:`llvm-mca` does not know about serializing operations or memory-barrier
instructions. The LSUnit conservatively assumes that an instruction which has like instructions. The LSUnit conservatively assumes that an instruction which
both "MayLoad" and unmodeled side effects behaves like a "soft" load-barrier. has both "MayLoad" and unmodeled side effects behaves like a "soft"
That means, it serializes loads without forcing a flush of the load queue. load-barrier. That means, it serializes loads without forcing a flush of the
Similarly, instructions that "MayStore" and have unmodeled side effects are load queue. Similarly, instructions that "MayStore" and have unmodeled side
treated like store barriers. A full memory barrier is a "MayLoad" and effects are treated like store barriers. A full memory barrier is a "MayLoad"
"MayStore" instruction with unmodeled side effects. This is inaccurate, but it and "MayStore" instruction with unmodeled side effects. This is inaccurate, but
is the best that we can do at the moment with the current information available it is the best that we can do at the moment with the current information
in LLVM. available in LLVM.
A load/store barrier consumes one entry of the load/store queue. A load/store A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load barrier enforces ordering of loads/stores. A younger load cannot pass a load