This was disabled to prevent regressions, which appear to occur only on AMDGPU (at least in our current lit tests); I've addressed these by adding AMDGPUTargetLowering::isDesirableToCommuteWithShift overrides.
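For reference, the transform this hook gates is the usual commuting of a shift through a binary op such as add; a tiny standalone illustration of the identity involved (plain integer arithmetic, not the code in this patch):

  #include <cassert>
  #include <cstdint>

  // (x + c1) << c2 == (x << c2) + (c1 << c2): commuting the shift past the
  // add materializes c1 << c2 as a new constant, which the target hook can
  // veto when that is not profitable.
  int main() {
    uint32_t X = 0x1234, C1 = 7, C2 = 3;
    assert(((X + C1) << C2) == ((X << C2) + (C1 << C2)));
    return 0;
  }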
Fixes #57872
Differential Revision: https://reviews.llvm.org/D136042
SimplifyDemandedBits currently early-outs for multi-use values beyond the root node (just returning the knownbits), which is missing a number of optimizations as there are plenty of cases where we can still simplify when initially demanding all elements/bits.
@lenary has confirmed that the test cases in aea-erratum-fix.ll need refactoring and the current codegen increase is not a major concern.
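As a toy illustration of the missed cases (plain integers, not the DAG code): when a user only demands some of the bits, it can often look through an operation even though that operation has other uses.

  #include <cassert>
  #include <cstdint>

  // Only the low byte of V is demanded here, so the user can look through
  // the OR with purely high bits, regardless of how many other users V has.
  int main() {
    uint32_t Base = 0x12, HighOnly = 0xABCD00u;
    uint32_t V = Base | HighOnly;   // V itself has multiple uses in the DAG
    assert((V & 0xFF) == (Base & 0xFF));
    return 0;
  }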
Differential Revision: https://reviews.llvm.org/D129765
Add GFX11 test coverage to a bunch of tests where it was easy to do so,
mostly because the checks are autogenerated and/or GFX11 can share the
same checks as GFX10.
Differential Revision: https://reviews.llvm.org/D129295
Fold immediates regardless of how many uses they have. This is expected
to increase overall code size, but decrease register usage.
Differential Revision: https://reviews.llvm.org/D114644
Previously SIFoldOperands::foldInstOperand would only fold a
non-inlinable immediate into a single user, so as not to increase code
size by adding the same 32-bit literal operand to many instructions.
This patch removes that restriction, so that a non-inlinable immediate
will be folded into any number of users. The rationale is:
- It reduces the number of registers used for holding constant values,
which might increase occupancy. (On the other hand, many of these
registers are SGPRs which no longer affect occupancy on GFX10+.)
- It reduces ALU stalls between the instruction that loads a constant
into a register, and the instruction that uses it.
- The above benefits are expected to outweigh any increase in code size.
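To make the change concrete, a toy sketch of the old vs. new decision (names are illustrative, not the SIFoldOperands code):

  // Old rule: only fold a non-inlinable 32-bit literal into a single user.
  bool shouldFoldImmOld(bool IsInlinable, unsigned NumUses) {
    return IsInlinable || NumUses == 1;
  }

  // New rule: fold regardless of the number of users.
  bool shouldFoldImmNew(bool /*IsInlinable*/, unsigned /*NumUses*/) {
    return true;
  }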
Differential Revision: https://reviews.llvm.org/D114643
Currently metadata is inserted in a late pass which is lowered
to an AssertZext. The metadata would be more useful if it was
inserted earlier after inlining, but before codegen.
Probably shouldn't change anything now. Just replacing the
late metadata annotation needs more work, since we lose
out on optimizations after these are lowered to CopyFromReg.
Seems to be slightly better than relying on the AssertZext from the
metadata. The test change in cvt_f32_ubyte.ll is a quirk from it using
-start-before=amdgpu-isel instead of running the usual codegen
pipeline.
Always set uniform metadata on the pointer if it is an instruction, but
otherwise do not bother to create a trivial getelementptr instruction,
because AMDGPUInstrInfo::isUniformMMO can already detect that various
non-instruction pointers are uniform.
Most of the test case churn is from tests that used undef as a pointer,
which AMDGPUInstrInfo::isUniformMMO treats as uniform.
Differential Revision: https://reviews.llvm.org/D118909
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retrieved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.
Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.
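A toy illustration of why a blocking data edge shows up as stall cycles (not the scheduler code; the cycle accounting is simplified):

  // With the edge from the load treated as blocking, its consumer cannot
  // issue before the load's latency has elapsed; the gap is an explicit stall.
  unsigned consumerReadyCycle(unsigned LoadIssueCycle, unsigned LoadLatency) {
    return LoadIssueCycle + LoadLatency;
  }
  unsigned stallCycles(unsigned CurCycle, unsigned ReadyCycle) {
    return ReadyCycle > CurCycle ? ReadyCycle - CurCycle : 0;
  }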
This type of change needs to be evaluated experimentally. Preliminary
results are positive.
Fixes: SWDEV-282962
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D114777
The shift node is still needed to check whether the shift is shr or shl in order to increment/decrement the offset, so do not override the node.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D112733
This simple heuristic uses the estimated live range length combined
with the number of registers in the class to switch which heuristic to
use. This was taking the raw number of registers in the class, even
though not all of them may be available. AMDGPU heavily relies on
dynamically reserved numbers of registers based on user attributes to
satisfy occupancy constraints, so the raw number is highly misleading.
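A toy sketch of the comparison in question (threshold and names are illustrative, not the RegAllocGreedy code):

  // A live range only counts as "small" relative to the register supply if
  // it is measured against the registers that can actually be allocated;
  // the raw class size overestimates the supply when many registers are
  // dynamically reserved.
  bool isSmallRelativeToSupply(unsigned RangeSizeInstrs, unsigned NumRegsInClass,
                               unsigned NumReserved) {
    unsigned NumAllocatable = NumRegsInClass - NumReserved;
    return RangeSizeInstrs <= NumAllocatable;  // threshold purely illustrative
  }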
There are still a few problems here. In the original testcase that
made me notice this, the live range size is incorrect after the
scheduler rearranges instructions, since the instructions don't have
the original InstrDist offsets. Additionally, I think it would be more
appropriate to use the number of disjointly allocatable registers in
the class. For the AMDGPU register tuples, there are a large number of
registers in each tuple class, but only a small fraction can actually
be allocated at the same time since they all overlap with each
other. It seems we do not have a query that corresponds to the number
of independently allocatable registers. Relatedly, I'm still debugging
some allocation failures where overlapping tuples seem to not be
handled correctly.
The test changes are mostly noise. There are a handful of x86 tests
that look like regressions with an additional spill, and a handful
that now avoid a spill. The worst looking regression is likely
test/Thumb2/mve-vld4.ll which introduces a few additional
spills. test/CodeGen/AMDGPU/soft-clause-exceeds-register-budget.ll
shows a massive improvement by completely eliminating a large number
of spills inside a loop.
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D109536
Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
Add SReg_224, VReg_224, AReg_224, etc.
Link 224-bit types with v7i32/v7f32.
Link existing 192-bit types to newly added v3i64/v3f64/v6i32/v6f32.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D104622
the compilation time and there is no case for which we see any improvement in
performance. This patch removes this pass and its associated test cases from
the tree.
Differential Revision: https://reviews.llvm.org/D101313
Change-Id: I0599169a7609c19a887f8d847a71e664030cc141
Change waitcnt insertion to check the memory operand tokens to see whether flat memory operations access VMEM, in the same way it already checks whether they access LDS. This avoids adding waitcnts for counters of address spaces that are not accessed.
In addition, only generate the pessimistic waitcnt 0 if a flat memory
operation appears to access both VMEM and LDS.
This benefits flat memory operations that explicitly specify the
address space as GLOBAL or LOCAL.
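A toy sketch of the decision (names are illustrative, not the SIInsertWaitcnts code):

  #include <cstdint>

  enum : uint8_t { MayAccessVMEM = 1, MayAccessLDS = 2 };

  struct Waits {
    bool VmCnt;    // wait for VMEM accesses
    bool LgkmCnt;  // wait for LDS accesses
  };

  // Derive the required waits from what the memory operands say the flat
  // operation can touch, instead of pessimistically waiting on everything;
  // only an op whose mask includes both spaces needs both counters.
  Waits waitsForFlatOp(uint8_t AccessMask) {
    return {(AccessMask & MayAccessVMEM) != 0, (AccessMask & MayAccessLDS) != 0};
  }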
Differential Revision: https://reviews.llvm.org/D89618
This reverts commit ca907bfb57.
According to michel.daenzer,
> This completely broke the Mesa radeonsi driver on Navi 14. Xorg +
> xterm come up with major corruption & psychedelic colours.
When memory operations are outstanding on function calls, either the
caller or the callee can insert a waitcnt to ensure that all reads are
finished.
Calls need some time to be executed, so if the callee inserts the
waitcnt, filling the instruction buffer and waiting for memory will be
interleaved, hiding some latency. This comes at the cost of having a
waitcnt inside functions that may not be needed as no memory operations
are outstanding.
For function calls, this is already implemented. The same principle
applies to returns: If the caller inserts a waitcnt after the call, the
callee does not have to wait and the return and memory operation can be
run in parallel.
This commit implements waiting in the caller after returning from a
function call.
Differential Revision: https://reviews.llvm.org/D87674
Get rid of all FIXMEs and base the heuristic on `num-clustered-dwords`. The main intuition behind this is as follows. The existing heuristic can be roughly summarized as:
* Assume all the mem op instructions participating in the clustering process load/store the same number of bytes.
* If each mem op loads/stores 4 bytes, then cluster at most 5 mem ops, that is at most 20 bytes.
* If each mem op loads/stores 8 bytes, then cluster at most 3 mem ops, that is at most 24 bytes.
* If each mem op loads/stores 16 bytes, then cluster at most 2 mem ops, that is at most 32 bytes.
So we need to make sure that the new heuristic does not completely deviate from the above one, and that it properly handles both sub-word loads and wide loads.
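For illustration, a standalone sketch that reproduces the table above with a dword budget (the budget of 4 dwords is assumed just to match the numbers; the in-tree formula based on `num-clustered-dwords` and its handling of sub-word loads may differ):

  #include <cassert>

  // Max mem ops to cluster: one op plus however many more fit in the budget.
  unsigned maxClusterSize(unsigned BytesPerMemOp, unsigned NumClusteredDwords = 4) {
    unsigned BudgetBytes = NumClusteredDwords * 4;
    return 1 + BudgetBytes / BytesPerMemOp;
  }

  int main() {
    assert(maxClusterSize(4) == 5);   // at most 20 bytes
    assert(maxClusterSize(8) == 3);   // at most 24 bytes
    assert(maxClusterSize(16) == 2);  // at most 32 bytes
    return 0;
  }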
Reviewed By: arsenm, rampitec
Differential Revision: https://reviews.llvm.org/D84354
tryLatency compares two sched candidates. For the top zone it prefers
the one with lesser depth, but only if that depth is greater than the
total latency of the instructions we've already scheduled -- otherwise
its latency would be hidden and there would be no stall.
Unfortunately it only tests the depth of one of the candidates. This can
lead to situations where the TopDepthReduce heuristic does not kick in,
but a lower priority heuristic chooses the other candidate, whose depth
*is* greater than the already scheduled latency, which causes a stall.
The fix is to apply the heuristic if the depth of *either* candidate is
greater than the already scheduled latency.
All this also applies to the BotHeightReduce heuristic in the bottom
zone.
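A standalone sketch of the corrected condition (names approximate, not the MachineScheduler source):

  #include <algorithm>

  // The depth-reduction heuristic should apply when either candidate's depth
  // exceeds the latency already scheduled in the zone; checking only one
  // candidate lets the other slip through and cause a stall.
  bool depthHeuristicApplies(unsigned TryCandDepth, unsigned BestCandDepth,
                             unsigned ScheduledLatency) {
    return std::max(TryCandDepth, BestCandDepth) > ScheduledLatency;
  }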
Differential Revision: https://reviews.llvm.org/D72392
Summary: This change enables all kinds of carry-out ISD opcodes to be selected according to node divergence.
Reviewers: rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78091
If SimplifyDemandedBits succeeds in simplifying the byte src, add the CVT_F32_UBYTE node back to the worklist as we might be able to simplify further.
Yet another step towards removing SelectionDAG::GetDemandedBits.
Summary:
pickNodeBidirectional tried to compare the best top candidate and the
best bottom candidate by examining TopCand.Reason and BotCand.Reason.
This is unsound because, after calling pickNodeFromQueue, Cand.Reason
does not reflect the most important reason why Cand was chosen. Rather
it reflects the most recent reason why it beat some other potential
candidate, which could have been for some low priority tie breaker
reason.
I have seen this cause problems where TopCand is a good candidate, but
because TopCand.Reason is ORDER (which is very low priority) it is
repeatedly ignored in favour of a mediocre BotCand. This is not how
bidirectional scheduling is supposed to work.
To fix this I changed the code to always compare TopCand and BotCand
directly, like the generic implementation of pickNodeBidirectional does.
This removes some uncommented AMDGPU-specific logic; if this logic turns
out to be important then perhaps it could be moved into an override of
tryCandidate instead.
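A toy sketch of the difference (types and names are illustrative, not the scheduler code):

  // A candidate's recorded Reason only says why it beat its most recent
  // rival inside its own zone, so ranking Top vs Bot by Reason is unsound.
  enum Reason { NoCand, Stall, Latency, Order };  // illustrative priorities

  struct SchedCand { Reason Why; /* chosen node omitted */ };

  // Old: compare the recorded reasons.
  bool preferTopByReason(const SchedCand &Top, const SchedCand &Bot) {
    return Top.Why <= Bot.Why;
  }

  // New: run the full candidate comparison directly between the two picks,
  // as the generic pickNodeBidirectional does.
  template <typename Compare>
  bool preferTopDirect(const SchedCand &Top, const SchedCand &Bot, Compare Cmp) {
    return Cmp(Top, Bot);
  }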
Graphics shader benchmarking on gfx10 shows a lot more positive than
negative effects from this change.
Reviewers: arsenm, tstellar, rampitec, kzhuravl, vpykhtin, dstuttard, tpr, atrick, MatzeB
Subscribers: jvesely, wdng, nhaehnle, yaxunl, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D68338
This is part of the work to remove SelectionDAG::GetDemandedBits and just use SimplifyMultipleUseDemandedBits.
Recent experiments raised some v_cvt_f32_ubyte*_e32 regressions, so I've added some additional abilities to performCvtF32UByteNCombine to help unpack byte data more aggressively.
We still don't remove all OR(SHL,SRL) patterns as some of the regenerated nodes don't get combined again, but we are getting closer.
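For context, a standalone model of the byte-extract semantics these combines reason about (not the combine itself):

  #include <cassert>
  #include <cstdint>

  // v_cvt_f32_ubyteN converts unsigned byte N of a 32-bit source to float,
  // so shifts/ors that merely reposition bytes can often be folded into a
  // different byte index on the original source.
  float cvtF32UByteN(uint32_t Src, unsigned N) {
    return static_cast<float>((Src >> (8 * N)) & 0xFF);
  }

  int main() {
    uint32_t X = 0xAABBCCDDu;
    // Taking byte 1 of (X >> 8) is the same as taking byte 2 of X.
    assert(cvtF32UByteN(X >> 8, 1) == cvtF32UByteN(X, 2));
    return 0;
  }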
Differential Revision: https://reviews.llvm.org/D74786
We are relying on artificial DAG edges inserted by the MemOpClusterMutation to keep loads and stores together in the post-RA scheduler. This does not work all the time since it still allows a completely independent instruction to be scheduled in the middle of the cluster.
Removed the DAG mutation and added a pass to bundle already clustered instructions. These bundles are unpacked before the memory legalizer because it does not work with bundles, but also because unbundling allows waitcnts to be inserted in the middle of a store cluster.
Removing artificial edges also allows a more relaxed scheduling.
Differential Revision: https://reviews.llvm.org/D72737
This was unconditionally folding this to the source operand, even if the access was out of bounds. Use undef instead if the extract is not the first element.
This helps with some cases where 3-vectors are legalized and avoids processing the 4th component.
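A toy model of the corrected fold, with std::optional standing in for undef (names are illustrative; the exact DAG pattern is not shown):

  #include <optional>

  // Only lane 0 is known to hold the scalar source, so extracting any other
  // lane must fold to undef rather than to the source operand.
  std::optional<int> foldExtractOfScalarFill(int ScalarSrc, unsigned Lane) {
    if (Lane == 0)
      return ScalarSrc;   // the defined element
    return std::nullopt;  // undef
  }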
Original Patch by: arsenm (Matt Arsenault)
Differential Revision: https://reviews.llvm.org/D51589
Summary:
Mem op clustering adds a weak edge in the DAG between two loads or
stores that should be clustered, but the direction of this edge is
pretty arbitrary (it depends on the sort order of MemOpInfo, which
represents the operands of a load or store). This often means that two
loads or stores will get reordered even if they would naturally have
been scheduled together anyway, which leads to test case churn and goes
against the scheduler's "do no harm" philosophy.
The fix makes sure that the direction of the edge always matches the
original code order of the instructions.
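A toy sketch of the fix (not the MachineScheduler code): orient the weak edge by the instructions' original order rather than by the sort order used to pair them up.

  #include <utility>

  struct MemOp {
    unsigned OrigOrder;  // position of the instruction in the original code
    // base/offset fields used for sorting omitted
  };

  // Returns (pred, succ) for the weak cluster edge: the op that came first
  // in the original code stays first after clustering.
  std::pair<const MemOp *, const MemOp *> clusterEdge(const MemOp &A,
                                                      const MemOp &B) {
    return A.OrigOrder < B.OrigOrder ? std::make_pair(&A, &B)
                                     : std::make_pair(&B, &A);
  }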
Reviewers: atrick, MatzeB, arsenm, rampitec, t.p.northover
Subscribers: jvesely, wdng, nhaehnle, kristof.beyls, hiraditya, javed.absar, arphaman, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72706
The VOP3 form should always be the preferred selection, to be shrunk
later. This should only be an optimization issue, but this partially
works around a problem from clobbering VCC when SIFixSGPRCopies
rewrites an SCC defining operation directly to VCC.
3 of the testcases are regressions from failing to fold the immediate
in cases it should. These can be avoided by improving the VCC liveness
handling in SIFoldOperands. Simply increasing the threshold to
computeRegisterLiveness works, although this is common enough that VCC
liveness should probably be tracked throughout the pass. The hack of
leaving behind an implicit_def instruction to avoid breaking the iterator wastes instruction count, which inhibits finding the VCC def in long chains of adds. Doing this, however, exposes different, worse-looking regressions from poor scheduling behavior. This could probably be worked around by forcing the shrink of the addc here, but the
scheduler should probably be fixed.
The r600 add test needs to be split out because it asserts on the
arguments in the new test during the calling convention lowering.
llvm-svn: 360293
Summary:
Extract the logic for doing reassociations
from DAGCombiner::reassociateOps into a helper
function DAGCombiner::reassociateOpsCommutative,
and use that helper to trigger reassociation
on the original operand order, or the commuted
operand order.
Codegen is not identical since the operand order will
be different when doing the reassociations for the
commuted case. That causes some unfortunate churn in
some test cases. Apart from that this should be NFC.
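A toy model of the reassociation being attempted, using plain integers rather than SDValues:

  #include <cassert>

  // (X + C1) + C2 reassociates to X + (C1 + C2); the helper is tried on the
  // original operand order and, if that does not match, on the commuted
  // order C2 + (X + C1).
  int main() {
    int X = 41, C1 = 3, C2 = 5;
    assert((X + C1) + C2 == X + (C1 + C2));  // original order
    assert(C2 + (X + C1) == X + (C1 + C2));  // commuted order
    return 0;
  }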
Reviewers: spatel, craig.topper, tstellar
Reviewed By: spatel
Subscribers: dmgreen, dschuff, jvesely, nhaehnle, javed.absar, sbc100, jgravelle-google, hiraditya, aheejin, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D61199
llvm-svn: 359476
Summary:
The DAGCombiner is rewriting (canonicalizing) an ISD::ADD
with no common bits set in the operands as an ISD::OR node.
This could sometimes result in "missing out" on some
combines that normally are performed for ADD. To be more
specific this could happen if we already have rewritten an
ADD into OR, and later (after legalizations or combines)
we expose patterns that could have been optimized if we
had seen the OR as an ADD (e.g. reassociations based on ADD).
To make the DAG combiner less sensitive to if ADD or OR is
used for these "no common bits set" ADD/OR operations we
now apply most of the ADD combines also to an OR operation,
when value tracking indicates that the operands have no
common bits set.
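A standalone check of the equivalence the combiner relies on:

  #include <cassert>
  #include <cstdint>

  // When the operands share no set bits there is no carry, so OR computes
  // the same value as ADD and ADD-style combines remain valid on the OR form.
  bool haveNoCommonBits(uint32_t A, uint32_t B) { return (A & B) == 0; }

  int main() {
    uint32_t A = 0xFF00F000u, B = 0x00AB000Cu;
    assert(haveNoCommonBits(A, B));
    assert((A | B) == (A + B));
    return 0;
  }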
Reviewers: spatel, RKSimon, craig.topper, kparzysz
Reviewed By: spatel
Subscribers: arsenm, rampitec, lebedev.ri, jvesely, nhaehnle, hiraditya, javed.absar, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D59758
llvm-svn: 358965
This commit fixes the dwordx3/southern-islands failures that were found
in bugzilla https://bugs.llvm.org/show_bug.cgi?id=40129, by not
generating the dwordx3 variants of load/store instructions that were
added to the ISA after southern islands.
Differential Revision: https://reviews.llvm.org/D56434
llvm-svn: 350838
Fixes the cvt_f32_ubyte combine. performCvtF32UByteNCombine() could shrink the source node to its demanded bits even if it had other uses.
Differential Revision: https://reviews.llvm.org/D56289
llvm-svn: 350475
I've extended the load/store optimizer to be able to produce dwordx3 loads and stores. This change allows many more loads/stores to be combined, and results in much more optimal code for our hardware.
Differential Revision: https://reviews.llvm.org/D54042
llvm-svn: 348937
See https://reviews.llvm.org/D47106 for details.
Reviewed By: probinson
Differential Revision: https://reviews.llvm.org/D47171
This commit drops that patch's changes to:
llvm/test/CodeGen/NVPTX/f16x2-instructions.ll
llvm/test/CodeGen/NVPTX/param-load-store.ll
For some reason, the DOS line endings there prevent me from committing
via the monorepo. A follow-up commit (not via the monorepo) will
finish the patch.
llvm-svn: 336843
Try to avoid mutually exclusive features. Don't use
a real default GPU, and use a fake "generic". The goal
is to make it easier to see which set of features are
incompatible between feature strings.
Most of the test changes are due to random scheduling changes
from not having a default fullspeed model.
llvm-svn: 310258