Commit Graph

269 Commits

Author SHA1 Message Date
Kazu Hirata 7a47ee51a1 [llvm] Don't use Optional::getValue (NFC) 2022-06-20 22:45:45 -07:00
Kazu Hirata e0e687a615 [llvm] Don't use Optional::hasValue (NFC) 2022-06-20 10:38:12 -07:00
Adrian Tong 7c13ae6490 Give option to use isCopyInstr to determine which MI is treated as a copy instruction in MCP.

This is then used in AArch64 to remove copy instructions after tail duplication
runs in machine block placement.

Differential Revision: https://reviews.llvm.org/D125335
2022-05-26 18:43:16 +00:00
Andre Vieira 572fc7d2fd [AArch64] Order STP Q's by ascending address
This patch adds an AArch64 specific PostRA MachineScheduler to try to schedule
STP Q's to the same base address in ascending order of offsets. We have found
this to improve performance on Neoverse N1, and it should not hurt other
AArch64 cores.

Differential Revision: https://reviews.llvm.org/D125377
2022-05-23 09:50:44 +01:00
Momchil Velikov b4ad28da19 [CodeGen] Async unwind - add a pass to fix CFI information
This pass inserts the necessary CFI instructions to compensate for the
inconsistency of the call-frame information caused by the linear (non-CFG
aware) nature of the unwind tables.

Unlike the `CFIInstrInserter` pass, this one almost always emits only
`.cfi_remember_state`/`.cfi_restore_state`, which results in smaller
unwind tables and also transparently handles custom unwind info
extensions like CFA offset adjustment and save locations of SVE
registers.

This pass takes advantage of the constraints that LLVM imposes on the
placement of save/restore points (cf. `ShrinkWrap.cpp`):

  * there is a single basic block, containing the function prologue

  * possibly multiple epilogue blocks, where each epilogue block is
    complete and self-contained, i.e. CSR restore instructions (and the
    corresponding CFI instructions) are not split across two or more
    blocks.

  * prologue and epilogue blocks are outside of any loops

Thus, during execution, at the beginning and at the end of each basic
block the function can be in one of two states:

  - "has a call frame", if the function has executed the prologue, or
     has not executed any epilogue

  - "does not have a call frame", if the function has not executed the
    prologue, or has executed an epilogue

These properties can be computed for each basic block by a single RPO
traversal.
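
A minimal, self-contained sketch of that computation (a conceptual model under assumed block/flag names, not the pass's actual code):

```cpp
#include <cassert>
#include <vector>

struct Block {
  bool HasPrologue = false; // block contains the function prologue
  bool HasEpilogue = false; // block is a complete, self-contained epilogue block
  std::vector<int> Preds;   // CFG predecessor indices
};

// Blocks are assumed to be indexed in reverse post-order, entry block at 0, so
// every non-back-edge predecessor of block I has already been processed.
std::vector<bool> hasFrameOnEntry(const std::vector<Block> &Blocks) {
  std::vector<bool> Entry(Blocks.size(), false), Exit(Blocks.size(), false);
  for (size_t I = 0; I < Blocks.size(); ++I) {
    bool In = false, Seen = false;
    for (int P : Blocks[I].Preds) {
      if (static_cast<size_t>(P) >= I)
        continue; // back edge: prologue/epilogue placement keeps it consistent
      if (!Seen) {
        In = Exit[P];
        Seen = true;
      } else {
        assert(Exit[P] == In && "predecessors disagree on call-frame state");
      }
    }
    Entry[I] = In;
    // A prologue establishes the call frame; an epilogue tears it down.
    Exit[I] = Blocks[I].HasPrologue ? true : Blocks[I].HasEpilogue ? false : In;
  }
  return Entry;
}
```

With these per-block states, the compensating CFI instructions described below go exactly where the exit state of the layout-previous block disagrees with the entry state of the current block.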

From the point of view of the unwind tables, the "has/does not have
call frame" state at beginning of each block is determined by the
state at the end of the previous block, in layout order.

Where these states differ, we insert compensating CFI instructions,
which come in two flavours:

- CFI instructions, which reset the unwind table state to the
    initial one.  This is done by a target specific hook and is
    expected to be trivial to implement, for example it could be:
```
     .cfi_def_cfa <sp>, 0
     .cfi_same_value <rN>
     .cfi_same_value <rN-1>
     ...
```
where `<rN>` are the callee-saved registers.

- CFI instructions, which reset the unwind table state to the one
    created by the function prologue. These are the sequence:
```
       .cfi_restore_state
       .cfi_remember_state
```
In this case we also insert a `.cfi_remember_state` after the
last CFI instruction in the function prologue.

Reviewed By: MaskRay, danielkiss, chill

Differential Revision: https://reviews.llvm.org/D114545
2022-04-11 13:27:26 +01:00
Craig Topper 1235aaefbd [AArch64][AMDGPU][WebAssembly] Use static_cast instead of a reinterpret_cast to downcast in parseMachineFunctionInfo. NFC
static_cast is a little safer here since the compiler will
ensure we're casting to a class derived from
yaml::MachineFunctionInfo.

I believe this first appeared on AMDGPU and was copied to the
other two targets.

Spotted when it was being copied to RISCV in D123178.

Differential Revision: https://reviews.llvm.org/D123260
2022-04-06 15:09:18 -07:00
Muhammad Omair Javaid 0320115c16 Revert "[CodeGen] Async unwind - add a pass to fix CFI information"
This reverts commit 980c3e6dd2.

This commit had failing tests, with clang crashing across various
AArch64/Linux buildbots.

https://lab.llvm.org/buildbot/#/builders/179/builds/3346

Differential Revision: https://reviews.llvm.org/D114545
2022-04-05 13:12:30 +05:00
Momchil Velikov 980c3e6dd2 [CodeGen] Async unwind - add a pass to fix CFI information
This pass inserts the necessary CFI instructions to compensate for the
inconsistency of the call-frame information caused by linear (non-CFG
aware) nature of the unwind tables.

Unlike the `CFIInstrInserter` pass, this one almost always emits only
`.cfi_remember_state`/`.cfi_restore_state`, which results in smaller
unwind tables and also transparently handles custom unwind info
extensions like CFA offset adjustment and save locations of SVE
registers.

This pass takes advantage of the constraints that LLVM imposes on the
placement of save/restore points (cf. `ShrinkWrap.cpp`):

  * there is a single basic block, containing the function prologue

  * possibly multiple epilogue blocks, where each epilogue block is
    complete and self-contained, i.e. CSR restore instructions (and the
    corresponding CFI instructions) are not split across two or more
    blocks.

  * prologue and epilogue blocks are outside of any loops

Thus, during execution, at the beginning and at the end of each basic
block the function can be in one of two states:

  - "has a call frame", if the function has executed the prologue, or
     has not executed any epilogue

  - "does not have a call frame", if the function has not executed the
    prologue, or has executed an epilogue

These properties can be computed for each basic block by a single RPO
traversal.

In order to accommodate backends which do not generate unwind info in
epilogues we compute an additional property "strong no call frame on
entry" which is set for the entry point of the function and for every
block reachable from the entry along a path that does not execute the
prologue. If this property holds, it takes precedence over the "has a
call frame" property.

From the point of view of the unwind tables, the "has/does not have
call frame" state at beginning of each block is determined by the
state at the end of the previous block, in layout order.

Where these states differ, we insert compensating CFI instructions,
which come in two flavours:

- CFI instructions, which reset the unwind table state to the
    initial one.  This is done by a target specific hook and is
    expected to be trivial to implement, for example it could be:
```
     .cfi_def_cfa <sp>, 0
     .cfi_same_value <rN>
     .cfi_same_value <rN-1>
     ...
```
where `<rN>` are the callee-saved registers.

- CFI instructions, which reset the unwind table state to the one
    created by the function prologue. These are the sequence:
```
       .cfi_restore_state
       .cfi_remember_state
```
In this case we also insert a `.cfi_remember_state` after the
last CFI instruction in the function prologue.

Reviewed By: MaskRay, danielkiss, chill

Differential Revision: https://reviews.llvm.org/D114545
2022-04-04 14:38:22 +01:00
serge-sans-paille ed98c1b376 Cleanup includes: DebugInfo & CodeGen
Discourse thread: https://discourse.llvm.org/t/include-what-you-use-include-cleanup
Differential Revision: https://reviews.llvm.org/D121332
2022-03-12 17:26:40 +01:00
Jameson Nash c4b1a63a1b mark getTargetTransformInfo and getTargetIRAnalysis as const
Seems like these can be const, since passes shouldn't modify them.

Reviewed By: wsmoses

Differential Revision: https://reviews.llvm.org/D120518
2022-02-25 14:30:44 -05:00
Roman Lebedev 371fcb720e
[SimplifyCFG][PhaseOrdering] Defer lowering switch into an integer range comparison and branch until after at least the IPSCCP
That transformation is lossy, as discussed in
https://github.com/llvm/llvm-project/issues/53853
and https://github.com/rust-lang/rust/issues/85133#issuecomment-904185574

This is an alternative to D119839,
which would add a limited IPSCCP into SimplifyCFG.

Unlike lowering a switch to a lookup table, we still want this transformation
to happen relatively early, but only after passes like CVP have had a chance
to do their thing. It seems like deferring it just until after IPSCCP is
enough for the tests at hand, but perhaps we need to be more aggressive
and disable it until after CVP.

Fixes https://github.com/llvm/llvm-project/issues/53853
Refs. https://github.com/rust-lang/rust/issues/85133

Reviewed By: nikic

Differential Revision: https://reviews.llvm.org/D119854
2022-02-17 12:13:55 +03:00
Yuanfang Chen f927021410 Reland "[clang-cl] Support the /JMC flag"
This relands commit b380a31de0.

Restrict the tests to Windows only since the flag symbol hash depends on
system-dependent path normalization.
2022-02-10 15:16:17 -08:00
Yuanfang Chen b380a31de0 Revert "[clang-cl] Support the /JMC flag"
This reverts commit bd3a1de683.

Breaks bots:
https://luci-milo.appspot.com/ui/p/fuchsia/builders/toolchain.ci/clang-windows-x64/b8822587673277278177/overview
2022-02-10 14:17:37 -08:00
Yuanfang Chen bd3a1de683 [clang-cl] Support the /JMC flag
The introduction and some examples are on this page:
https://devblogs.microsoft.com/cppblog/announcing-jmc-stepping-in-visual-studio/

The `/JMC` flag enables these instrumentations:
- Insert, at the beginning of every function immediately after the prologue,
  a call to `void __fastcall __CheckForDebuggerJustMyCode(unsigned char *JMC_flag)`.
  The argument for `__CheckForDebuggerJustMyCode` is the address of a boolean
  global variable (the global variable is initialized to 1) with the name
  convention `__<hash>_<filename>`. All such global variables are placed in
  the `.msvcjmc` section.
- The `<hash>` part of `__<hash>_<filename>` has a one-to-one mapping
  with a directory path. MSVC uses some unknown hashing function. Here I
  used DJB.
- Add a dummy/empty COMDAT function `__JustMyCode_Default`.
- Add `/alternatename:__CheckForDebuggerJustMyCode=__JustMyCode_Default` link
  option via ".drectve" section. This is to prevent failure in
  case `__CheckForDebuggerJustMyCode` is not provided during linking.

Implementation:
All the instrumentation is implemented in an IR codegen pass. The pass is placed immediately before the CodeGenPrepare pass. This is to not interfere with mid-end optimizations and to keep the instrumentation target-independent (I'm still working on an ELF port in a separate patch).
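
As a rough illustration of the result (a hand-written sketch, not what the IR pass emits; the hash value and the portability simplifications here are assumptions):

```cpp
extern "C" void __CheckForDebuggerJustMyCode(unsigned char *JMC_flag) {
  // Stand-in for the empty __JustMyCode_Default body; the real link-time
  // /alternatename trick lets the debug runtime substitute its own version.
  (void)JMC_flag;
}

// One flag per source file, initialized to 1; the real pass names it
// __<hash>_<filename> and places it in the .msvcjmc section.
static unsigned char __E8F1A2B3_example_cpp = 1; // hash value is made up

int add(int a, int b) {
  // Inserted immediately after the prologue of every instrumented function.
  __CheckForDebuggerJustMyCode(&__E8F1A2B3_example_cpp);
  return a + b;
}
```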

Reviewed By: hans

Differential Revision: https://reviews.llvm.org/D118428
2022-02-10 10:26:30 -08:00
Archibald Elliott 52faad83c9 [AArch64] Use Feature for A53 Erratum 835769 Fix
When this pass was originally implemented, the fix pass was enabled
using an LLVM command-line flag. This works fine, except in the case of
LTO, where the flag is not passed into the linker plugin in order to
enable the function pass in the LTO backend.

Now that LTO exists, the expectation is to use target features rather
than command-line arguments to control code generation, as this ensures
that different command-line arguments in different files are correctly
represented, and target-features always get to the LTO plugin as they
are encoded into LLVM IR.

The fall-out of this change is that the fix pass always has to be added
to the backend pass pipeline, so now it makes no changes if the function
does not have the right target feature to enable it. This should make a
minimal difference to compile time.

One advantage is it's now much easier to enable when compiling for a
Cortex-A53, as CPUs imply their own individual sets of target-features,
in a more fine-grained way. I haven't done this yet, but it is an
option, if the fix should be enabled in more places.

Existing tests of the user interface are unaffected, the changes are to
reflect that the argument is now turned into a target feature.

Reviewed By: tmatheson

Differential Revision: https://reviews.llvm.org/D114703
2021-12-10 15:09:59 +00:00
Cullen Rhodes 0395e01583 [IR] Split vscale_range interface
Interface is split from:

  std::pair<unsigned, unsigned> getVScaleRangeArgs()

into separate functions for min/max:

  unsigned getVScaleRangeMin();
  Optional<unsigned> getVScaleRangeMax();
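
A small usage sketch of the split interface (assuming an `llvm::Function &F` that already carries the `vscale_range` attribute):

```cpp
#include "llvm/ADT/Optional.h"
#include "llvm/IR/Attributes.h"
#include "llvm/IR/Function.h"
using namespace llvm;

static void queryVScaleRange(const Function &F) {
  Attribute Attr = F.getFnAttribute(Attribute::VScaleRange);
  unsigned MinVScale = Attr.getVScaleRangeMin();           // always present
  Optional<unsigned> MaxVScale = Attr.getVScaleRangeMax(); // may be unset
  (void)MinVScale;
  (void)MaxVScale;
}
```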

Reviewed By: sdesmalen, paulwalker-arm

Differential Revision: https://reviews.llvm.org/D114075
2021-12-07 10:38:26 +00:00
Amara Emerson dc84770d55 [GlobalISel] Add a store-merging optimization pass and enable for AArch64.
This is a first attempt at a constant value consecutive store merging pass,
a counterpart to the DAGCombiner's store merging optimization.

The high level goals of this pass:

* Have a simple and efficient algorithm. As close to linear time as we can get.
  Thus, prioritizing scalability of the algorithm over merging every corner case
  we can find. The DAGCombiner's store merging code has been the source of
  compile time and complexity issues in the past and I wanted to avoid that.
* Don't introduce any new data structures for ordering memory operations. In MIR,
  we don't have the concept of chains like we do in the DAG, and the instruction
  order is stricter than enforcing ordering with graph edges. Although I
  considered adding something similar, I couldn't justify the overhead.

The pass is currently split into 3 main parts. The main store merging code focuses
on identifying candidate stores and managing the candidate group that's under
consideration for merging. Analyzing addressing of stores is a potentially
complex part and for now there's just a basic implementation to identify easy
cases. Finally, the other main bit of complexity is the alias analysis, which
tries to follow the same logic as the DAG's AA.

Currently this implementation only supports merging of constant stores. Stores
of arbitrary variables are technically possible with a very small change, but
the DAG chooses not to do this. Doing so here makes most code worse since
there's extra overhead in merging values into wider registers.
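
As a small, self-contained sketch of the value-combining step for a little-endian target (the struct and helper here are illustrative assumptions, not the pass's actual data structures):

```cpp
#include <cstdint>
#include <vector>

struct ConstStore {
  uint64_t Offset; // byte offset from the common base pointer
  uint64_t Value;  // constant being stored
  unsigned Bytes;  // store width in bytes (1, 2, 4 or 8)
};

// Combine contiguous, non-overlapping constant stores (sorted by offset) into
// the single wide constant a merged store would write, little-endian layout.
uint64_t combineLE(const std::vector<ConstStore> &Stores) {
  uint64_t Wide = 0;
  uint64_t Base = Stores.front().Offset;
  for (const ConstStore &S : Stores) {
    uint64_t Mask = S.Bytes >= 8 ? ~0ULL : ((1ULL << (S.Bytes * 8)) - 1);
    Wide |= (S.Value & Mask) << ((S.Offset - Base) * 8);
  }
  return Wide;
}
```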

On AArch64 -Os, this optimization results in very minor savings on CTMark.

Differential Revision: https://reviews.llvm.org/D109131
2021-11-15 21:10:39 -08:00
David Sherwood 607fb1bb8c [AArch64] Always add -tune-cpu argument to -cc1 driver
This patch ensures that we always tune for a given CPU on AArch64
targets when the user specifies the "-mtune=xyz" flag. In the
AArch64Subtarget if the tune flag is unset we use the CPU value
instead.

I've updated the release notes here:

  llvm/docs/ReleaseNotes.rst

and added tests here:

  clang/test/Driver/aarch64-mtune.c

Differential Revision: https://reviews.llvm.org/D110258
2021-10-19 14:57:51 +01:00
Reid Kleckner 89b57061f7 Move TargetRegistry.(h|cpp) from Support to MC
This moves the registry higher in the LLVM library dependency stack.
Every client of the target registry needs to link against MC anyway to
actually use the target, so we might as well move this out of Support.

This allows us to ensure that Support doesn't have includes from MC/*.

Differential Revision: https://reviews.llvm.org/D111454
2021-10-08 14:51:48 -07:00
Jingu Kang 30caca39f4 Third Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts the revert commit fc36fb4d23 with
bug fixes.

Differential Revision: https://reviews.llvm.org/D109963
2021-10-08 11:28:49 +01:00
David Spickett fc36fb4d23 Revert "Second Recommit "[AArch64] Split bitmask immediate of bitwise AND operation""
This reverts commit 13f3c39f36.

Due to test failures in stage 2 clang tests on AArch64 bots.
2021-10-06 08:39:48 +00:00
Jingu Kang 13f3c39f36 Second Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts the revert commit c07f709969 with
bug fixes.

Differential Revision: https://reviews.llvm.org/D109963
2021-09-30 09:27:08 +01:00
David Green 8a645fc44b [AArch64] Enable type promotion for AArch64
This enables the type promotion pass for AArch64, which acts as a
CodeGenPrepare pass to promote illegal integers to legal ones,
especially useful for removing extends that would otherwise require
cross-basic-block analysis.

I have enabled this generally, for both ISel and GlobalISel. In some
quick experiments it appeared to help GlobalISel remove extra extends in
places too, but that might just be missing optimizations that are better
left for later. We can disable it again if required.

In my experiments, this can improve performance in some cases, and
code size was a small improvement. SPEC was a very small improvement,
within the noise. Some of the test cases show extends being moved out of
loops, often when the extend would be part of a cmp operand, which
should reduce the latency of the instruction in the loop on many CPUs.
The signed-truncation-check tests are increasing as they are no longer
matching specific DAG combines.

We also hope to add some additional improvements to the pass in the near
future, to capture more cases of promoting extends through phis that
have come up in a few places lately.

Differential Revision: https://reviews.llvm.org/D110239
2021-09-29 15:13:12 +01:00
Sterling Augustine c07f709969 Revert "Recommit "[AArch64] Split bitmask immediate of bitwise AND operation""
This reverts commit 73a196a11c.

Causes crashes as reported in https://reviews.llvm.org/D109963
2021-09-28 18:02:06 -07:00
Jingu Kang 73a196a11c Recommit "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts the revert commit f85d8a5bed
with bug fixes.

Original message:

    MOVi32imm + ANDWrr ==> ANDWri + ANDWri
    MOVi64imm + ANDXrr ==> ANDXri + ANDXri

    The mov pseudo instruction could be expanded to multiple mov instructions later.
    In this case, try to split the constant operand of the mov instruction into two
    bitmask immediates. It makes only two AND instructions instead of multiple
    mov + and instructions.

    Added a peephole optimization pass on MIR level to implement it.

    Differential Revision: https://reviews.llvm.org/D109963
2021-09-28 15:26:29 +01:00
Jingu Kang f85d8a5bed Revert "[AArch64] Split bitmask immediate of bitwise AND operation"
This reverts commit 864b206796.

Reverting due to error on buildbots.
2021-09-28 13:28:09 +01:00
Jingu Kang 864b206796 [AArch64] Split bitmask immediate of bitwise AND operation
MOVi32imm + ANDWrr ==> ANDWri + ANDWri
MOVi64imm + ANDXrr ==> ANDXri + ANDXri

The mov pseudo instruction could be expanded to multiple mov instructions later.
In this case, try to split the constant operand of the mov instruction into two
bitmask immediates. It makes only two AND instructions instead of multiple
mov + and instructions.

Added a peephole optimization pass on MIR level to implement it.
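
A sketch of the splitting check itself (simplified; the real pass also verifies that the mov would otherwise expand to more than one instruction, and the helper name here is hypothetical):

```cpp
#include "MCTargetDesc/AArch64AddressingModes.h"
#include "llvm/Support/MathExtras.h"
using namespace llvm;

// Try to split Imm into two 64-bit bitmask immediates whose AND equals Imm.
static bool splitAndImm64(uint64_t Imm, uint64_t &Enc1, uint64_t &Enc2) {
  if (Imm == 0 || AArch64_AM::isLogicalImmediate(Imm, 64))
    return false; // nothing to gain: zero, or a single ANDXri already works

  unsigned Lo = countTrailingZeros(Imm);
  unsigned Hi = Log2_64(Imm);
  // NewImm1: contiguous ones covering bit positions Lo..Hi.
  uint64_t NewImm1 = (uint64_t(2) << Hi) - (uint64_t(1) << Lo);
  // NewImm2: the original bits plus everything outside that span, so that
  // NewImm1 & NewImm2 == Imm.
  uint64_t NewImm2 = Imm | ~NewImm1;

  if (!AArch64_AM::isLogicalImmediate(NewImm1, 64) ||
      !AArch64_AM::isLogicalImmediate(NewImm2, 64))
    return false;

  Enc1 = AArch64_AM::encodeLogicalImmediate(NewImm1, 64);
  Enc2 = AArch64_AM::encodeLogicalImmediate(NewImm2, 64);
  return true;
}
```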

Differential Revision: https://reviews.llvm.org/D109963
2021-09-28 11:57:43 +01:00
Paul Walker 88522455c0 Fix typo in help text for -aarch64-enable-branch-targets. 2021-07-05 16:15:40 +01:00
Bradley Smith 9e7329e37e [AArch64][SVE] Wire up vscale_range attribute to SVE min/max vector queries
Differential Revision: https://reviews.llvm.org/D103702
2021-06-21 13:00:36 +01:00
Amara Emerson 5b158093e2 [AArch64][GlobalISel] Create a new minimal combiner pass just for -O0.
We never bothered to have a separate set of combines for -O0 in the prelegalizer
before. This results in some minor performance hits for a mode where performance
isn't a concern (although not regressing code size significantly is still preferable).

This also removes the CSE option since we don't need it for -O0.

Through experiments, I've arrived at a set of combines that gets the most code
size improvement at -O0, while reducing the amount of time spent in the combiner
by around 35% give or take.

Differential Revision: https://reviews.llvm.org/D102038
2021-05-07 17:01:27 -07:00
Amara Emerson 8a316045ed [AArch64][GlobalISel] Enable use of the optsize predicate in the selector.
To do this while supporting the existing functionality in SelectionDAG of using
PGO info, we add the ProfileSummaryInfo and LazyBlockFrequencyInfo analysis
dependencies to the instruction selector pass.

Then, use the predicate to generate constant pool loads for f32 materialization,
if we're targeting optsize/minsize.

Differential Revision: https://reviews.llvm.org/D97732
2021-03-02 12:55:51 -08:00
Florian Hahn ca23b2c8ed
[AArch64] Move machine bundle unpacking to PreEmit2 phase.
This patch adjusts the placement of the bundle unpacking to just before
code emission. In particular, this means bundle unpacking happens AFTER
the machine outliner. With the previous position, the machine outliner
may outline parts of a bundle, which breaks them up.

This is an issue for BLR_RVMARKER handling, as illustrated by the
rvmarker-pseudo-expansion-and-outlining.mir test case. The machine
outliner should not break up the bundles created during pseudo
expansion.

This should fix PR49082.

Reviewed By: SjoerdMeijer

Differential Revision: https://reviews.llvm.org/D96294
2021-02-15 16:10:43 +00:00
Arlo Siemsen 080866470d Add ehcont section support
In the future Windows will enable Control-flow Enforcement Technology (CET aka shadow stacks). To protect the path where the context is updated during exception handling, the binary is required to enumerate valid unwind entrypoints in a dedicated section which is validated when the context is being set during exception handling.

This change allows llvm to generate the section that contains the appropriate symbol references in the form expected by the msvc linker.

This feature is enabled through a new module flag, ehcontguard, which was modelled on the cfguard flag.

The change includes a test that when the module flag is enabled the section is correctly generated.

The set of exception continuation information includes returns from exceptional control flow (catchret in llvm).

In order to collect catchret we:
1) Include an additional flag on machine basic blocks to indicate that the given block is the target of a catchret operation,
2) Introduce a new machine function pass to insert and collect symbols at the start of each block, and
3) Combine these targets with the other EHCont targets that were already being collected.

Change originally authored by Daniel Frampton <dframpto@microsoft.com>

For more details, see MSVC documentation for `/guard:ehcont`
  https://docs.microsoft.com/en-us/cpp/build/reference/guard-enable-eh-continuation-metadata

Reviewed By: pengfei

Differential Revision: https://reviews.llvm.org/D94835
2021-02-15 14:27:12 +08:00
Kyungwoo Lee 4f58b1bd29 [AArch64] Homogeneous Prolog and Epilog Size Optimization
Second land attempt. MachineVerifier DefRegState expensive check errors fixed.

Prologs and epilogs handle callee-save registers and tend to be irregular with
different immediate offsets that are not often handled by the MachineOutliner.
Commit D18619/a5335647d5e8 (combining stack operations) stretched irregularity
further.

This patch tries to emit homogeneous stores and loads with the same offset for
prologs and epilogs respectively. We have observed that this canonicalizes
(homogenizes) prologs and epilogs significantly and results in a greatly
increased chance of outlining, resulting in a code size reduction.

Despite the above results, there are still size wins to be had that the
MachineOutliner does not provide due to the special handling of X30/LR. To handle
the LR case, this patch custom-outlines prologs and epilogs in place. It does
this by doing the following:

  * Injects HOM_Prolog and HOM_Epilog pseudo instructions during a Prolog and
    Epilog Injection Pass.
  * Lowers and optimizes said pseudos in an AArch64LowerHomogeneousPrologEpilog pass.
  * Outlined helpers are created on demand. Identical helpers are merged by the linker.
  * An opt-in flag is introduced to enable this feature. Another threshold flag
    is also introduced to control the aggressiveness of outlining for an application's needs.

This reduced code size by an average of 4% on LLVM-TestSuite/CTMark targeting arm64/-Oz.

Differential Revision: https://reviews.llvm.org/D76570
2021-02-02 14:57:26 -08:00
Puyan Lotfi 8f7f2c4211 Revert "[AArch64] Homogeneous Prolog and Epilog Size Optimization"
This reverts commit 0426be3df6.

Reverting due to some expensive-checks failures in tests.
2021-02-02 02:33:44 -05:00
Kyungwoo Lee 0426be3df6 [AArch64] Homogeneous Prolog and Epilog Size Optimization
Prologs and epilogs handle callee-save registers and tend to be irregular with
different immediate offsets that are not often handled by the MachineOutliner.
Commit D18619/a5335647d5e8 (combining stack operations) stretched irregularity
further.

This patch tries to emit homogeneous stores and loads with the same offset for
prologs and epilogs respectively. We have observed that this canonicalizes
(homogenizes) prologs and epilogs significantly and results in a greatly
increased chance of outlining, resulting in a code size reduction.

Despite the above results, there are still size wins to be had that the
MachineOutliner does not provide due to the special handling of X30/LR. To handle
the LR case, this patch custom-outlines prologs and epilogs in place. It does
this by doing the following:

  * Injects HOM_Prolog and HOM_Epilog pseudo instructions during a Prolog and
    Epilog Injection Pass.
  * Lowers and optimizes said pseudos in an AArch64LowerHomogeneousPrologEpilog pass.
  * Outlined helpers are created on demand. Identical helpers are merged by the linker.
  * An opt-in flag is introduced to enable this feature. Another threshold flag
    is also introduced to control the aggressiveness of outlining for an application's needs.

This reduced code size by an average of 4% on LLVM-TestSuite/CTMark targeting arm64/-Oz.

Differential Revision: https://reviews.llvm.org/D76570
2021-02-02 00:26:51 -05:00
Amanieu d'Antras 21bfd068b3 [AArch64] Add support for the GNU ILP32 ABI
Add the aarch64[_be]-*-gnu_ilp32 targets to support the GNU ILP32 ABI for AArch64.

The needed codegen changes were mostly already implemented in D61259, which added support for the watchOS ILP32 ABI. The main changes are:
- Wiring up the new target to enable ILP32 codegen and MC.
- ILP32 va_list support.
- ILP32 TLSDESC relocation support.

There was existing MC support for ELF ILP32 relocations from D25159 which could be enabled by passing "-target-abi ilp32" to llvm-mc. This was changed to check for "gnu_ilp32" in the target triple instead. This shouldn't cause any issues since the existing support was slightly broken: it was generating ELF64 objects instead of the ELF32 object files expected by the GNU ILP32 toolchain.

This target has been tested by running the full rustc testsuite on a big-endian ILP32 system based on the GCC ILP32 toolchain.

Reviewed By: kristof.beyls

Differential Revision: https://reviews.llvm.org/D94143
2021-01-20 13:34:47 +00:00
Ahmed Bougacha f77c948d56 [Triple][MachO] Define "arm64e", an AArch64 subarch for Pointer Auth.
This also teaches MachO writers/readers about the MachO cpu subtype,
beyond the minimal subtype reader support present at the moment.

This also defines a preprocessor macro to allow users to distinguish
__arm64__ from __arm64e__.

arm64e defaults to an "apple-a12" CPU, which supports v8.3a, allowing
pointer-authentication codegen.
It also currently defaults to ios14 and macos11.

Differential Revision: https://reviews.llvm.org/D87095
2020-12-03 07:53:59 -08:00
Amara Emerson 0f0fd383b4 [AArch64][GlobalISel] Introduce a new post-isel optimization pass.
There are two optimizations here:

1. Consider the following code:
 FCMPSrr %0, %1, implicit-def $nzcv
 %sel1:gpr32 = CSELWr %_, %_, 12, implicit $nzcv
 %sub:gpr32 = SUBSWrr %_, %_, implicit-def $nzcv
 FCMPSrr %0, %1, implicit-def $nzcv
 %sel2:gpr32 = CSELWr %_, %_, 12, implicit $nzcv
This kind of code where we have 2 FCMPs each feeding a CSEL can happen
when we have a single IR fcmp being used by two selects. During selection,
to ensure that there can be no clobbering of nzcv between the fcmp and the
csel, we have to generate an fcmp immediately before each csel is
selected.

However, often we can essentially CSE these together later in MachineCSE.
This doesn't work though if there are unrelated flag-setting instructions
in between the two FCMPs. In this case, the SUBS defines NZCV
but it doesn't have any users, being overwritten by the second FCMP.

Our solution here is to try to convert flag-setting operations within an
interval of identical FCMPs, so that CSE will be able to eliminate one.

2. SelectionDAG imported patterns for arithmetic ops currently select the
flag-setting ops for CSE reasons, and add the implicit-def $nzcv operand
to those instructions. However if those impdef operands are not marked as
dead, the peephole optimizations are not able to optimize them into non-flag
setting variants. The optimization here is to find these dead imp-defs and
mark them as such.

This pass is only enabled when optimizations are enabled.

Differential Revision: https://reviews.llvm.org/D89415
2020-10-23 10:18:36 -07:00
Jessica Paquette 147b9497e7 [AArch64][GlobalISel] Split post-legalizer combiner to allow for lowering at -O0
There are a lot of combines in AArch64PostLegalizerCombiner which exist to
facilitate instruction matching in the selector. (E.g. matching for G_ZIP and
other shuffle vector pseudos)

It still makes sense to select these instructions at -O0.

Matching earlier in a combiner can reduce complexity in the selector
significantly. For example, a good portion of our selection code for compares
would be a lot easier to represent in a combine.

This patch moves matching combines into a "AArch64PostLegalizerLowering"
combiner which runs at all optimization levels.

Also, while we're here, improve the documentation for the
AArch64PostLegalizerCombiner, and fix up the filepath in its file comment.

Also add an 'r' that somehow got dropped from a bunch of function names.

https://reviews.llvm.org/D89820
2020-10-22 14:43:25 -07:00
Amara Emerson e5784ef8f6 [GlobalISel] Enable usage of BranchProbabilityInfo in IRTranslator.
We weren't using this before, so none of the MachineFunction CFG edges had the
branch probability information added. As a result, block placement later in the
pipeline was flying blind.

This is enabled only with optimizations enabled like SelectionDAG.

Differential Revision: https://reviews.llvm.org/D86824
2020-09-09 14:31:12 -07:00
Roman Lebedev bb7d3af113
Reland [SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline
This was reverted in 503deec218
because it caused a gigantic increase (3x) in branch mispredictions
in certain benchmarks on certain CPUs;
see https://reviews.llvm.org/D84108#2227365.

It has since been investigated and here are the results:
https://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20200907/827578.html
> It's an amazingly severe regression, but it's also all due to branch
> mispredicts (about 3x without this). The code layout looks ok so there's
> probably something else to deal with. I'm not sure there's anything we can
> reasonably do so we'll just have to take the hit for now and wait for
> another code reorganization to make the branch predictor a bit more happy :)
>
> Thanks for giving us some time to investigate and feel free to recommit
> whenever you'd like.
>
> -eric

So let's just reland this.
Original commit message:


I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.

After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.

Surprisingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default and is always run, unlike its friend, the common code sinking
transform `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once, very late in the pipeline.

I'm proposing to harmonize this, and disable common code hoisting
until //late// in the pipeline. The definition of //late// may vary;
here I've currently picked the same one as for code sinking,
but I suppose we could enable it right after loop rotation happens.

Experimentation shows that this does indeed, unsurprisingly, help:
more loops got rotated, although other issues remain elsewhere.

Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run-time performance and code size. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, may enable all the transforms
that require the canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting; some of them will be caught late.

As per the benchmarks I've run {F12360204}, this is mostly within the noise;
there are some small improvements, some small regressions.
One big regression I saw I fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but I'm sure
this will expose many more pre-existing missed optimizations, as usual :S

llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprisingly)
* size impact varies; for ThinLTO it's actually an improvement

The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.

If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
  * -14 (-73.68%) loops not rotated due to the header size (yay)
  * -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
  * -3937 (-64.19%) common instructions hoisted
  * +561 (+0.06%) x86 asm instructions
  * -2 basic blocks
  * +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable  {F12360201}
  * -36396 (-65.29%) common instructions hoisted
  * +1676 (+0.02%) x86 asm instructions
  * +662 (+0.06%) basic blocks
  * +4395 (+0.04%) IR instructions

It is likely to be sub-optimal when optimizing for code size,
so one might want to tune the pipeline by enabling sinking/hoisting
when optimizing for size.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D84108

This reverts commit 503deec218.
2020-09-08 00:24:03 +03:00
Craig Topper aab90384a3 [Attributes] Add a method to check if an Attribute has AttrKind None. Use instead of hasAttribute(Attribute::None)
There's a special case in hasAttribute for None when pImpl is null. If pImpl is not null, we dispatch to pImpl->hasAttribute, which will always return false for Attribute::None.

So if we just want to check for None, it's sufficient to just check that pImpl is null, which can even be done inline.

This patch adds a helper for that case which I hope will speed up our getSubtargetImpl implementations.

Differential Revision: https://reviews.llvm.org/D86744
2020-08-28 13:23:45 -07:00
Roman Lebedev 503deec218
Temporarily revert "[SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline"
As discussed in the post-commit review starting with
	https://reviews.llvm.org/D84108#2227365
while this appears to be mostly a win overall, especially code-size-wise,
it appears to shake //certain// code patterns in a way that is extremely
unfavorable for performance (+30% runtime regression)
on certain CPUs (I personally can't reproduce it).

So until the behaviour is better understood, and a path forward is mapped,
let's back this out for now.

This reverts commit 1d51dc38d8.
2020-08-22 00:33:22 +03:00
Roman Lebedev 1d51dc38d8
[SimplifyCFG][LoopRotate] SimplifyCFG: disable common instruction hoisting by default, enable late in pipeline
I've been looking at missed vectorizations in one codebase.
One particular thing that stands out is that some of the loops
reach vectorizer in a rather mangled form, with weird PHI's,
and some of the loops aren't even in a rotated form.

After taking a more detailed look, that happened because
the loop's headers were too big by then. It is evident that
SimplifyCFG's common code hoisting transform is at fault there,
because the pattern it handles is precisely the unrotated
loop basic block structure.

Surprisingly, `SimplifyCFGOpt::HoistThenElseCodeToIf()` is enabled
by default and is always run, unlike its friend, the common code sinking
transform `SinkCommonCodeFromPredecessors()`, which is not enabled
by default and is only run once, very late in the pipeline.

I'm proposing to harmonize this, and disable common code hoisting
until //late// in the pipeline. The definition of //late// may vary;
here I've currently picked the same one as for code sinking,
but I suppose we could enable it right after loop rotation happens.

Experimentation shows that this does indeed, unsurprisingly, help:
more loops got rotated, although other issues remain elsewhere.

Now, this undoubtedly seriously shakes phase ordering.
This will undoubtedly be a mixed bag in terms of both compile- and
run-time performance and code size. Since we no longer aggressively
hoist+deduplicate common code, we don't pay the price of said hoisting
(which wasn't big). That may allow more loops to be rotated,
so we pay that price. That, in turn, may enable all the transforms
that require the canonical (rotated) loop form, including but not limited to
vectorization, so we pay that too. And in general, no deduplication means
more [duplicate] instructions going through the optimizations. But there's still
late hoisting; some of them will be caught late.

As per the benchmarks I've run {F12360204}, this is mostly within the noise;
there are some small improvements, some small regressions.
One big regression I saw I fixed in rG8d487668d09fb0e4e54f36207f07c1480ffabbfd, but I'm sure
this will expose many more pre-existing missed optimizations, as usual :S

llvm-compile-time-tracker.com thoughts on this:
http://llvm-compile-time-tracker.com/compare.php?from=e40315d2b4ed1e38962a8f33ff151693ed4ada63&to=c8289c0ecbf235da9fb0e3bc052e3c0d6bff5cf9&stat=instructions
* this does regress compile-time by +0.5% geomean (unsurprisingly)
* size impact varies; for ThinLTO it's actually an improvement

The largest fallout appears to be in GVN's load partial redundancy
elimination, it spends *much* more time in
`MemoryDependenceResults::getNonLocalPointerDependency()`.
Non-local `MemoryDependenceResults` is widely-known to be, uh, costly.
There does not appear to be a proper solution to this issue,
other than silencing the compile-time performance regression
by tuning cut-off thresholds in `MemoryDependenceResults`,
at the cost of potentially regressing run-time performance.
D84609 attempts to move in that direction, but the path is unclear
and is going to take some time.

If we look at stats before/after diffs, some excerpts:
* RawSpeed (the target) {F12360200}
  * -14 (-73.68%) loops not rotated due to the header size (yay)
  * -272 (-0.67%) `"Number of live out of a loop variables"` - good for vectorizer
  * -3937 (-64.19%) common instructions hoisted
  * +561 (+0.06%) x86 asm instructions
  * -2 basic blocks
  * +2418 (+0.11%) IR instructions
* vanilla test-suite + RawSpeed + darktable  {F12360201}
  * -36396 (-65.29%) common instructions hoisted
  * +1676 (+0.02%) x86 asm instructions
  * +662 (+0.06%) basic blocks
  * +4395 (+0.04%) IR instructions

It is likely to be sub-optimal when optimizing for code size,
so one might want to tune the pipeline by enabling sinking/hoisting
when optimizing for size.

Reviewed By: mkazantsev

Differential Revision: https://reviews.llvm.org/D84108
2020-07-29 20:05:30 +03:00
Arthur Eubanks 2a672767cc Prefix some AArch64/ARM passes with "aarch64-"/"arm-"
For consistency with other target specific passes.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D84560
2020-07-27 11:00:39 -07:00
Roman Lebedev fb432a51f4
Reland "[NFCI] createCFGSimplificationPass(): migrate to also take SimplifyCFGOptions"
This reverts commit 1067d3e176,
which reverted commit b2018198c3,
because it introduced a Dependency Cycle between Transforms/Scalar and
Transforms/Utils.

So let's just move SimplifyCFGOptions.h into Utils/, thus avoiding
the cycle.
2020-07-16 13:40:01 +03:00
Adrian Kuegel 1067d3e176 Revert "[NFCI] createCFGSimplificationPass(): migrate to also take SimplifyCFGOptions"
This reverts commit b2018198c3.
This commit introduced a Dependency Cycle between Transforms/Scalar and
Transforms/Utils. Transforms/Scalar already depends on Transforms/Utils,
so if SimplifyCFGOptions.h is moved to Scalar, and Utils/Local.h still
depends on it, we have a cycle.
2020-07-16 10:54:10 +02:00
Roman Lebedev b2018198c3
[NFCI] createCFGSimplificationPass(): migrate to also take SimplifyCFGOptions
Taking so many parameters is simply unmaintainable.

We don't want to include the entire llvm/Transforms/Utils/Local.h into
llvm/Transforms/Scalar.h, so I've split SimplifyCFGOptions into
its own header.
2020-07-16 01:27:54 +03:00
Kristof Beyls c35ed40f4f [AArch64] Extend AArch64SLSHardeningPass to harden BLR instructions.
To make sure that no barrier gets placed on the architectural execution
path, each
  BLR x<N>
instruction gets transformed to a
  BL __llvm_slsblr_thunk_x<N>
instruction, with __llvm_slsblr_thunk_x<N> a thunk that contains
__llvm_slsblr_thunk_x<N>:
  BR x<N>
  <speculation barrier>

Therefore, the BLR instruction gets split into two: one BL and one BR.
This transformation results in not inserting a speculation barrier on
the architectural execution path.

The mitigation is off by default and can be enabled by the
harden-sls-blr subtarget feature.

As a linker is allowed to clobber X16 and X17 on function calls, the
above code transformation would not be correct in case a linker does so
when N=16 or N=17. Therefore, when the mitigation is enabled, generation
of BLR x16 or BLR x17 is avoided.

As BLRA* indirect calls are not produced by LLVM currently, this does
not aim to implement support for those.

Differential Revision:  https://reviews.llvm.org/D81402
2020-06-12 07:34:33 +01:00