Because the VentusRegExtInsertion pass breaks the def-use chain, it should
only run without -verify-machineinstrs, and it should be run at the very end
of the codegen pipeline.
Initially, two stacks were implemented: an sGPR spill/restore stack and a per-thread stack,
but the total stack size was computed as the sum of the two stacks (this works but wastes
a lot of space).
Now the TP register is used as the per-thread stack pointer, and the SP register is used for sGPR spill/restore.
Also clean up RVV-related stack frame code, etc.
Init captures were added in processBlock() to avoid capturing structured bindings,
which caused build problems with clang.
RISCV has this disabled for now until problems relating to post-RA pseudo
expansions are resolved.
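For illustration, a minimal standalone sketch of the workaround (hypothetical code, not from the patch): pre-C++20 lambdas cannot capture structured bindings directly, so init captures copy them into ordinary captures first.
  #include <cstdio>
  #include <map>

  int main() {
    std::map<int, int> M{{1, 2}};
    for (const auto &[Key, Value] : M) {
      // Capturing Key/Value directly would capture structured bindings,
      // which pre-C++20 compilers (including clang) reject; init captures
      // copy them into plain lambda members instead.
      auto Print = [K = Key, V = Value] { std::printf("%d -> %d\n", K, V); };
      Print();
    }
  }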
This enables the select optimize pass for out-of-order AArch64
cores. It is trying to solve a problem that is difficult for the
compiler to get right: the criteria for when a csel is better or worse than a
branch depend heavily on whether the branch is well predicted and the
amount of ILP in the loop (as well as other criteria, like the core in
question and the relative performance of its branch predictor). The
pass seems to do a decent job, though, with the inner-loop heuristics
being well implemented and doing a better job than I had expected in
general, even without PGO information.
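As an illustration (example of mine, not from the patch), this is the kind of inner loop where the csel-vs-branch decision is hard:
  // If a[i] > t is well predicted, a branch that skips the add can beat a
  // csel; if it is unpredictable, the csel usually wins. Profitability also
  // depends on how much ILP the surrounding loop body has.
  long sum_above(const int *a, int n, int t) {
    long sum = 0;
    for (int i = 0; i < n; ++i)
      sum += (a[i] > t) ? a[i] : 0; // may lower to csel or to a branch
    return sum;
  }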
I've been doing quite a bit of benchmarking. The headline numbers are
these for SPEC2017 on a Neoverse N1:
500.perlbench_r -0.12%
502.gcc_r 0.02%
505.mcf_r 6.02%
520.omnetpp_r 0.32%
523.xalancbmk_r 0.20%
525.x264_r 0.02%
531.deepsjeng_r 0.00%
541.leela_r -0.09%
548.exchange2_r 0.00%
557.xz_r -0.20%
Running benchmarks with a combination of the llvm-test-suite plus
several versions of SPEC gave between a 0.2% and 0.4% geomean
improvement depending on the core/run. The instruction count went down
by 0.1% too, which is a good sign, but the results can be a little
noisy. Some issues from other benchmarks I had run were improved in
rGca78b5601466f8515f5f958ef8e63d787d9d812e. In summary, well-predicted
branches will see an improvement, badly predicted branches may get
worse, and on average performance seems to be a little better overall.
This patch enables the pass for AArch64 under -O3 for cores that will
benefit from it, i.e. not in-order cores that do not fit into the "Assume
infinite resources that allow to fully exploit the available
instruction-level parallelism" cost model. It uses a subtarget feature
to specify when the pass is enabled, which I have enabled under
cpu=generic, as the performance increases for out-of-order cores seem
larger than any decreases for in-order cores, which were minor.
Differential Revision: https://reviews.llvm.org/D138990
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
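A toy example of the mechanical rewrite (illustrative only):
  #include <optional>

  // Before: llvm::Optional<int> f(bool Known) { ... return None; ... }
  // After the mechanical replacement:
  std::optional<int> f(bool Known) {
    if (!Known)
      return std::nullopt; // was: return None;
    return 42;
  }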
With D135642 ignoring unregistered symbols, isTransitiveUsedByMetadataOnly added
by D101512 is no longer needed (the operation is potentially slow). There is a
`.addrsig_sym` directive for an only-used-by-metadata symbol but it does not
emit an entry.
Reviewed By: rnk
Differential Revision: https://reviews.llvm.org/D138362
This fixes the miscompile in issue #58883.
The test demonstrates that we gave up on store merging in that example.
This change should be strictly safe (just adds another clause
to avoid the transform), and it does not prohibit any existing
valid optimizations based on regression tests. I want to believe
that it's also a sufficient fix (possibly overkill), but I'm not
sure how to prove that.
Differential Revision: https://reviews.llvm.org/D137791
This patch is the Part-2 (BE LLVM) implementation of HW Exception handling.
Part-1 (FE Clang) was committed in 797ad70152.
This new feature adds the support of Hardware Exception for Microsoft Windows
SEH (Structured Exception Handling).
Compiler options:
For clang-cl.exe, the option is -EHa, the same as MSVC.
For clang.exe, the extra option is -fasync-exceptions,
plus -triple x86_64-windows -fexceptions and -fcxx-exceptions as usual.
NOTE: Without -EHa or -fasync-exceptions, this patch is a NO-DIFF change.
The rules for C code:
For C code, one way (the MSVC approach) to achieve the SEH -EHa semantics is to follow
three rules (illustrated in the sketch after this list):
First, no exception can move in or out of a _try region, i.e., no potentially-faulting
instruction can be moved across a _try boundary.
Second, the order of exceptions for instructions 'directly' under a _try must be preserved
(this does not apply to those in callees).
Finally, global state (local/global/heap variables) that can be read outside of the _try region
must be updated in memory (not just in a register) before the subsequent exception occurs.
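A minimal sketch of the three rules (hypothetical example of mine, assuming a Windows target built with clang-cl -EHa; not from the patch):
  #include <windows.h>
  #include <cstdio>

  int g_state = 0;

  void demo(int *p) {
    __try {
      g_state = 1; // rule 3: must reach memory before the next line can fault
      *p = 42;     // rules 1 & 2: this potentially-faulting store must stay
                   // inside __try and keep its order relative to other faults
    } __except (EXCEPTION_EXECUTE_HANDLER) {
      std::printf("handler sees g_state = %d\n", g_state); // guaranteed 1
    }
  }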
The impact on C++ code:
Although SEH is a feature for C code, -EHa does have a profound effect on the C++
side. When a C++ function (in the same compilation unit, compiled with -EHa) is
called by a SEH C function, a hardware exception that occurs in the C++ code can also
be handled properly by an upstream SEH _try-handler or a C++ catch(...).
As such, when that happens in the middle of an object's life scope, the dtor
must be invoked the same way as for a C++ synchronous exception during the unwinding process.
Design:
A natural way to achieve the rules above in LLVM today is to allow an EH edge
to be added on memory/computation instructions (the previous iload/istore idea) so that
the exception path is modeled precisely in the flow graph. However, tracking every
single memory instruction and potentially-faulting instruction can create many
invokes, complicate the flow graph, and possibly result in a negative performance
impact on downstream optimization and code generation. Making all
optimizations aware of the new semantics would also be a substantial effort.
This design does not intend to model the exception path at the instruction level.
Instead, the proposed design tracks and reports EH state at the BLOCK level to
reduce the complexity of the flow graph and minimize the performance impact on C++
code under the -EHa option.
One key element of this design is the ability to compute the state number at
block level. Our algorithm is based on the following rationale:
A _try scope is always a SEME (Single Entry Multiple Exits) region as jumping
into a _try is not allowed. The single entry must start with a seh_try_begin()
invoke with a correct State number that is the initial state of the SEME.
Through control-flow, state number is propagated into all blocks. Side exits
marked by seh_try_end() will unwind to parent state based on existing SEHUnwindMap[].
Note side exits can ONLY jump into parent scopes (lower state number).
Thus, when a block inherits different states from its predecessors, the lowest
state wins. If some exits flow to unreachable, propagation on those
paths terminates without affecting the remaining blocks.
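A toy model of the propagation (assumed names and a simplified CFG of mine, not the WinEHPrepare code):
  #include <queue>
  #include <vector>

  struct Block {
    int State = -1;          // -1 = not yet reached
    std::vector<int> Succs;  // successor block indices
  };

  // Propagate the entry state of a SEME region through the CFG; when a
  // block is reachable with several states, the lowest (outermost) wins.
  void propagateStates(std::vector<Block> &Blocks, int Entry, int EntryState) {
    Blocks[Entry].State = EntryState;
    std::queue<int> Work;
    Work.push(Entry);
    while (!Work.empty()) {
      int B = Work.front();
      Work.pop();
      for (int S : Blocks[B].Succs)
        if (Blocks[S].State == -1 || Blocks[B].State < Blocks[S].State) {
          Blocks[S].State = Blocks[B].State; // lowest state wins
          Work.push(S);
        }
    }
  }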
For C++ code, an object-lifetime region is usually a SEME, like a SEH _try.
However, there is one rare exception: jumping into a lifetime that has a dtor but
no ctor is warned about, but allowed:
Warning: jump bypasses variable with a non-trivial destructor
In that case, the region is actually a MEME (Multiple Entry Multiple Exits).
Our solution is to inject an eha_scope_begin() invoke in the side-entry block to
ensure a correct state.
Implementation:
Part-1: Clang implementation (already in):
Please see commit 797ad70152.
Part-2: LLVM implementation, described below.
For both C++ and C code, the state of each block is computed at the same place in the
BE (WinEHPrepare pass) where all other EH tables/maps are calculated.
In addition to _scope_begin & _scope_end, the computation of block state also
relies on the existing state-tracking code (UnwindMap and InvokeStateMap).
For both C++ and C code, the state of each block with a potential trap instruction
is marked and reported in the DAG instruction selection pass, the same place where
the state for -EHsc (synchronous exceptions) is handled.
If the first instruction in a reported block scope can trap, a Nop is injected
before this instruction. This Nop is needed to accommodate the LLVM Windows EH
implementation, in which the addresses in the IPToState table are offset by +1.
(Note that the purpose of this offset is to ensure that the return address of a
call is in the same scope as the call address.)
The handler for catch(...) under -EHa must handle HW exceptions, so its
'adjective' flag is reset (it cannot be IsStdDotDot (0x40), which only catches
C++ exceptions).
Suppress the push/popTerminate() scope (from noexcept/nothrow) so that HW
exceptions can be passed through.
Original llvm-dev [RFC] discussions can be found in these two threads:
https://lists.llvm.org/pipermail/llvm-dev/2020-March/140541.html
https://lists.llvm.org/pipermail/llvm-dev/2020-April/141338.html
Differential Revision: https://reviews.llvm.org/D102817
Extends the Asm reader/writer to support reading and writing the
'.memtag' directive (including allowing it on internal global
variables). Also add some extra tooling support, including objdump and
yaml2obj/obj2yaml.
Test that the sanitize_memtag IR attribute produces the expected asm
directive.
Uses the new AArch64 MemtagABI specification
(https://github.com/ARM-software/abi-aa/blob/main/memtagabielf64/memtagabielf64.rst)
to identify symbols as tagged in object files. This is done using a
R_AARCH64_NONE relocation that identifies each tagged symbol, and these
relocations are tagged in a special SHT_AARCH64_MEMTAG_GLOBALS_STATIC
section. This signals to the linker that the global variable should be
tagged.
Reviewed By: fmayer, MaskRay, peter.smith
Differential Revision: https://reviews.llvm.org/D128958
In the branch relaxation pass, restore blocks are created and placed before
the jump destination if indirect branches are required. For example:
    foo
    sd   s11, 0(sp)
    jump .restore, s11
    bar
    bar
    bar
    j    .dest
.restore:
    ld   s11, 0(sp)
.dest:
    baz
The BasicBlock information of the restore MachineBasicBlock should be
identical to that of the dest MachineBasicBlock.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D131863
A new pass, MachineLateInstrsCleanup, is added to be run after PEI.
This is a simple pass that removes redundant and identical instructions
whenever they are found, by scanning the MF once while keeping track of register
definitions in a map. These instructions are typically immediate loads
resulting from rematerialization and address loads emitted by the target in
eliminateFrameIndex().
This is enabled by default, but a target could easily disable it by means of
'disablePass(&MachineLateInstrsCleanupID);'.
This late cleanup is naturally not "optimal" in removing instructions as it
is done by looking at phys-regs, but still quite effective. It would be
desirable to improve other parts of CodeGen and avoid these redundant
instructions in the first place, but there are no ideas for this yet.
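As a hedged, self-contained toy model of the idea (assumed types of mine, not the real pass; it also ignores clobbers of source registers, which the real pass must handle):
  #include <map>
  #include <string>
  #include <vector>

  struct Instr {
    unsigned DefReg;   // physical register this instruction defines
    std::string Body;  // opcode and operands, e.g. "li 16384"
    bool operator==(const Instr &O) const {
      return DefReg == O.DefReg && Body == O.Body;
    }
  };

  // Scan the function once; drop an instruction when an identical
  // definition of the same register is already live.
  std::vector<Instr> cleanup(const std::vector<Instr> &MF) {
    std::vector<Instr> Out;
    std::map<unsigned, Instr> LastDef;
    for (const Instr &MI : MF) {
      auto It = LastDef.find(MI.DefReg);
      if (It != LastDef.end() && It->second == MI)
        continue; // redundant redefinition, remove it
      LastDef.insert_or_assign(MI.DefReg, MI);
      Out.push_back(MI);
    }
    return Out;
  }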
Differential Revision: https://reviews.llvm.org/D123394
Reviewed By: RKSimon, foad, craig.topper, arsenm, asb
As stated in
https://discourse.llvm.org/t/rfc-llc-add-expandlargeintfpconvert-pass-for-fp-int-conversion-of-large-bitint/65528,
this implementation is very similar to ExpandLargeDivRem: it expands
'fptoui .. to', 'fptosi .. to', 'uitofp .. to', and 'sitofp .. to' instructions
with a bitwidth above a threshold into auto-generated functions. This is
useful for targets like x86_64 that cannot lower fp conversions with more
than 128 bits. The expanded code is modeled on the IR generated by
`compiler-rt/lib/builtins/floattidf.c`, `compiler-rt/lib/builtins/fixdfti.c`,
etc.
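As a rough sketch of what such an auto-generated conversion does (modeled loosely on compiler-rt's fixdfti.c; the function name is mine, the saturation is a simplification, and GCC/Clang's unsigned __int128 extension stands in for a wider iN type):
  #include <cstdint>
  #include <cstring>

  unsigned __int128 fptoui_f64_u128(double X) {
    uint64_t Bits;
    std::memcpy(&Bits, &X, sizeof(Bits));
    int Exp = (int)((Bits >> 52) & 0x7ff) - 1023;   // unbiased exponent
    if (Exp < 0)
      return 0;                                     // |X| < 1
    unsigned __int128 Mant =
        (Bits & ((1ULL << 52) - 1)) | (1ULL << 52); // implicit leading 1
    if (Exp >= 128)
      return ~(unsigned __int128)0;                 // out of range: saturate
    return Exp <= 52 ? Mant >> (52 - Exp) : Mant << (Exp - 52);
  }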
Corner cases:
1. fp16: there are no related builtins in compiler-rt, so the implementation
mainly uses the fp32 <-> fp16 lib calls.
2. fp80: this pass is soft-fp emulation and no fp80 instructions can help
with this problem, so users are advised to avoid this usage. For now, the
implementation uses fp128 as the temporary conversion type and inserts
fptrunc/ext at the top/end of the function.
3. bf16: the clang FE currently doesn't support bf16 arithmetic operations
(convert to int, float, +, -, *, ...), so this patch doesn't consider bf16 for
now.
4. Unsigned FPToI: both the default hardware behavior and libgcc ignore the
"returns 0 for negative input" spec, so this pass follows the same old way
and ignores unsigned FPToI. See this example:
https://gcc.godbolt.org/z/bnv3jqW1M
The end-to-end tests are uploaded at https://reviews.llvm.org/D138261
Reviewed By: LuoYuanke, mgehre-amd
Differential Revision: https://reviews.llvm.org/D137241
It's an artifact very specific to using TFAgents during training, so it
belongs with ModelUnderTrainingRunner.
Differential Revision: https://reviews.llvm.org/D139031
Currently per-function metadata consists of:
(start-pc, size, features)
This adds a new UAR feature and, if it is set, an additional element:
(start-pc, size, features, stack-args-size)
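A hypothetical C++ view of the record layout (field names are mine, not from the patch):
  #include <cstddef>
  #include <cstdint>

  struct FunctionMetadata {
    uintptr_t StartPC;      // start-pc
    size_t Size;            // size
    uint32_t Features;      // feature bits, including the new UAR bit
    uint32_t StackArgsSize; // only present when the UAR feature bit is set
  };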
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D136078
This patch fixes crashes related to how PatchableFunction selects the instruction to make patchable:
- Ensure PatchableFunction skips all instructions that don't generate actual machine instructions.
- Handle the case where the first MachineBasicBlock is empty.
- Remove support for 16-bit x86 architectures.
Note: another issue related to PatchableFunction remains in the lowering part.
See https://github.com/llvm/llvm-project/issues/59039
Differential Revision: https://reviews.llvm.org/D137642
Currently per-function metadata consists of:
(start-pc, size, features)
This adds a new UAR feature and, if it is set, an additional element:
(start-pc, size, features, stack-args-size)
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D136078
The use of a PSV for buffer intrinsics is misleading because it may be
misinterpreted as all buffer intrinsics accessing the same address in
memory, which is clearly not true.
Instead, build MachineMemOperands without a pointer value but with an
address space, so that address space-based alias analysis can still
work.
There is a lot of test churn because previously address space 4
(constant address space) was used as an address space for buffer
intrinsics. This doesn't make much sense and seems to have been an
accident -- see the change in
AMDGPUTargetMachine::getAddressSpaceForPseudoSourceKind.
Differential Revision: https://reviews.llvm.org/D138711
This reverts commit a1255dc467.
This patch results in:
llvm/lib/CodeGen/SanitizerBinaryMetadata.cpp:57:17: error: no member
named 'size' in 'llvm::MDTuple'
Currently per-function metadata consists of:
(start-pc, size, features)
This adds a new UAR feature and, if it is set, an additional element:
(start-pc, size, features, stack-args-size)
Reviewed By: melver
Differential Revision: https://reviews.llvm.org/D136078
I had reverted this before the holiday week because a problem was reported with a related change (D137140 - scalable vector known bits in DAG). I had initially confused the two patches, and then decided to leave this reverted out of an abundance of caution. Now that we're through the holiday week, I am reapplying it.
I also rolled in fixes for several post-commit review comments that hadn't landed with the original change.
Original commit message:
This is a continuation of the series of patches adding lane-wise support for scalable vectors in various knownbits-like routines.
The basic idea here is that we track a single lane for scalable vectors, which corresponds to an unknown number of lanes at runtime. This is enough for us to perform lane-wise reasoning on many arithmetic operations.
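A minimal model of the idea (plain C++, not the SelectionDAG API; names are mine): one representative lane's known bits stand in for every lane of the scalable vector.
  #include <cstdint>
  #include <cstdio>

  struct KnownBits {
    uint64_t Zero = 0, One = 0; // bits known to be 0 / known to be 1
  };

  // AND: a result bit is known 1 only if known 1 in both operands, and
  // known 0 if known 0 in either. This holds per lane, so reasoning about
  // the single tracked lane is valid for all lanes, however many exist
  // at runtime.
  KnownBits knownAnd(KnownBits A, KnownBits B) {
    return {A.Zero | B.Zero, A.One & B.One};
  }

  int main() {
    KnownBits Splat4{~0x4ull, 0x4}; // every lane is exactly 4
    KnownBits Unknown{};            // nothing known about the other operand
    KnownBits R = knownAnd(Splat4, Unknown);
    std::printf("known-zero mask: 0x%llx\n", (unsigned long long)R.Zero);
  }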
Differential Revision: https://reviews.llvm.org/D137141
As discussed on Issue #59217, under certain circumstances the DAG can generate duplicate MUL and MUL_LOHI nodes, often during MULO legalization.
This patch attempts to replace such MUL nodes with additional uses of the LO result of the MUL_LOHI node.
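For illustration (assumed example using the GCC/Clang __builtin_mul_overflow, not from the patch), a source pattern whose MULO legalization can produce both node types:
  #include <cstdint>

  bool mulWithOverflow(uint64_t A, uint64_t B, uint64_t &Out) {
    // Legalizing the overflow check can create a MUL_LOHI (for the high
    // half) alongside a plain MUL (for the product); the patch dedups
    // these by reusing the MUL_LOHI's LO result.
    return __builtin_mul_overflow(A, B, &Out);
  }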
Differential Revision: https://reviews.llvm.org/D138790