This patch adds RVV codegen support for vp.fptrunc. The lowering of fp_round and vp.fptrunc shares most code, so use a common lowering function to handle both, similar to vp.trunc.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D123841
Materializing constants on RISCV is simpler if the constant is sign
extended from i32. By default i32 constant operands of phis are
zero extended.
This patch adds a hook to allow RISCV to override this for i32. We
have an existing isSExtCheaperThanZExt, but it operates on EVT which
we don't have at these places in the code.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D122951
This was added before the Zve extensions were defined. I think users
should use Zve32x or Zve32f now. We will lose support for limiting
ELEN to 16 or 8, but I hope no one was using that.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D123418
This patch adds the minimum required to successfully lower vp.icmp via
the new ISD::VP_SETCC node to RVV instructions.
Regular ISD::SETCC goes through a lot of canonicalization, which targets
may rely on and which has not yet been ported to VP_SETCC. It also
supports expansion of individual condition codes and a non-boolean
return type. Support for all of that will follow in later patches.
In the case of RVV this largely isn't a problem as the vector integer
comparison instructions are plentiful enough that it can lower all
VP_SETCC nodes on legal integer vectors except for boolean vectors,
which regular SETCC folds away immediately into logical operations.
Floating-point VP_SETCC operations aren't as well supported in RVV and
the backend relies on condition code expansion, so support for those
operations will come in later patches.
Portions of this code were taken from the VP reference patches.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D122743
We can do this conversion by converting to the same-sized integer type, then comparing the result with 0. The conversion is undefined if the converted FP value doesn't fit in an i1.
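For illustration, a minimal scalar sketch of the idea in standalone C++ (hypothetical helper name, not code from this patch; shown for the signed case where only -1.0 and 0.0 are in range):
```
// fp -> i1 by converting through a same-sized integer type, then comparing the
// result with 0.  Inputs outside the i1 range are undefined, so only 0.0 and
// -1.0 (signed case) need to round-trip correctly.
#include <cassert>
#include <cstdint>

bool fptosiToI1(float x) {
  int32_t wide = static_cast<int32_t>(x); // same-sized integer conversion
  return wide != 0;                       // the i1 result
}

int main() {
  assert(fptosiToI1(-1.0f) == true);
  assert(fptosiToI1(0.0f) == false);
}
```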
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D122678
If we expand (uaddo X, 1) we previously expanded the overflow calculation
as (X + 1) <u X. This potentially increases the live range of X and
can prevent X+1 from reusing the register that previously held X.
Since we're adding 1, overflow only occurs if X was UINT_MAX in which
case (X+1) would be 0. So this patch adds a special case to expand
the overflow calculation to (X+1) == 0.
This seems to help with uaddo intrinsics that get introduced by
CodeGenPrepare after LSR. Alternatively, we could block the uaddo
transform in CodeGenPrepare for this case.
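For illustration, a standalone C++ sketch of the equivalence (not code from this patch):
```
// (uaddo X, 1): overflow happens only when X == UINT_MAX, so the carry can be
// computed as (X + 1) == 0 instead of (X + 1) <u X, which no longer keeps the
// original X live just for the compare.
#include <cassert>
#include <cstdint>
#include <limits>

int main() {
  const uint32_t samples[] = {0u, 1u, 1234u,
                              std::numeric_limits<uint32_t>::max() - 1,
                              std::numeric_limits<uint32_t>::max()};
  for (uint32_t x : samples) {
    uint32_t sum = x + 1;               // wraps to 0 only for UINT_MAX
    assert((sum < x) == (sum == 0));    // old check == new check
  }
}
```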
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D122933
The splat_vector will be legalized to build_vector eventually
anyway. This patch makes it take fewer steps.
Unfortunately, this results in some codegen changes. It looks
like it comes down to how the nodes were ordered in the topological
sort for isel. Because the build_vector is created earlier we end up
with a different ordering of nodes.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D122185
This function now takes a uint64_t instead of an APInt. The caller
is responsible for masking the shift amount, extracting and inserting
into the KnownBits APInts, and inverting to compute zeros.
This is less code and cleaner division of responsibilities.
Modified DAGCombiner to pass the bittest input and the shift amount
to hasBitTest. This matches the other call to hasBitTest in TargetLowering.h.
This is an alternative to D122454.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D122458
Don't call EltVT.getSizeInBits() or SrcEltVT.getSizeInBits() a second
time. They are already in EltSize or SrcEltSize variables.
Refactor some comparisons to use multiply instead of division.
On RV32, we need to type legalize i64 scalar arguments to intrinsics.
We usually do this by splatting the value into a vector separately.
If the scalar happens to be sign extended, we can continue using a .vx
intrinsic.
We already special cased sign extended constants, this extends it
to any sign extended value.
I've only added tests for one case of vadd. Most intrinsics go
through the same check.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D122186
On RV32, we need to type legalize i64 scalar arguments to intrinsics.
We usually do this by splatting the value into a vector separately.
If the scalar happens to be sign extended, we can continue using a .vx
intrinsic.
We already special cased sign extended constants, this extends it
to any sign extended value.
I've only added tests for one case of vadd. Most intrinsics go
through the same check. I can add more tests if we're concerned.
Differential Revision: https://reviews.llvm.org/D122186
RISCVISelDAGToDAG's selectImm uses RISCVTargetLowering::getAddr
(specifically the ConstantPoolSDNode) as of 41454ab256 ("[RISCV] Use
constant pool for large integers"), but nothing explicitly instantiates
any of the templates, the only reason they exist is because of the
various lowering methods in RISCVISelLowering.cpp that themselves use
the methods. However, with inlining, those can end up not existing as
real functions and thus not be exported, leading to link errors. Up
until now this hasn't happened, but for whatever reason D121654 has
triggered this on the sanitizer-ppc64be-linux buildbot, giving:
../../../../lib/libLLVMRISCVCodeGen.a(RISCVISelDAGToDAG.cpp.o): In function `selectImm(llvm::SelectionDAG*, llvm::SDLoc const&, llvm::MVT, long, llvm::RISCVSubtarget const&)':
RISCVISelDAGToDAG.cpp:(.text._ZL9selectImmPN4llvm12SelectionDAGERKNS_5SDLocENS_3MVTElRKNS_14RISCVSubtargetE+0x3d8): undefined reference to `llvm::SDValue llvm::RISCVTargetLowering::getAddr<llvm::ConstantPoolSDNode>(llvm::ConstantPoolSDNode*, llvm::SelectionDAG&, bool) const'
collect2: error: ld returned 1 exit status
Fix this by explicitly instantiating getAddr in its four different forms
so separate translation units can reliably use it.
Fixes: 41454ab256 ("[RISCV] Use constant pool for large integers")
Since we have SPLAT_VECTOR_PARTS these days, I don't think we need
to go through extra lengths to avoid introducing an illegal scalar type.
We can just call getConstant using the scalable vector type and let
it create either a SPLAT_VECTOR or a SPLAT_VECTOR_PARTS.
Reviewed By: frasercrmck, rogfer01
Differential Revision: https://reviews.llvm.org/D121645
Since we mark the pseudos as mayLoad but do not provide any MMOs,
isSafeToMove conservatively returns false, stopping MachineLICM from
hoisting the instructions. PseudoLA_TLS_GD does not actually expand to a
load, so stop marking that as mayLoad to allow it to be hoisted, and for
the others make sure to add MMOs during lowering to indicate they're GOT
loads and thus can be freely moved.
Fixes https://github.com/llvm/llvm-project/issues/54372
Reviewed By: MaskRay, arichardson
Differential Revision: https://reviews.llvm.org/D121654
This code handles fixed vector SPLAT_VECTOR, but is never called in
any tests.
We only form fixed vector splat vectors for vXi64 on RV32 as part
of DAGCombine. This will be type legalized to SPLAT_VECTOR_PARTS.
So the Custom handling for SPLAT_VECTOR is never needed.
This patch makes SPLAT_VECTOR for vXi64 'Legal' on RV32 so that
DAGCombine will create it, but there's no need for Custom handler.
It will still be type legalized to SPLAT_VECTOR_PARTS.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D121673
If the type is less than XLenVT, type legalization will turn this
into (srl (bitreverse (bswap (srl (bswap X), C))), C). We can't
completely recover from these shifts. They introduce zeros into
the upper bits of the result and we can't easily tell if they are
needed. By doing a DAG combine early, we avoid introducing these
shifts.
Type legalize narrow RISCVISD::GREV/GORC with constant to a larger
type without switching to W. Detect sext_inreg+gorci/grevi with a
uimm5 immediate during isel to emit GREVIW/GORCIW.
This allows us to better propagate known bits information through
extended bits after type legalization. It will also simplify a
change I'm considering for BREV8 with Zbkb.
A future patch will add computeKnownBits support for GORC.
A further improvement here would be to use hasAllWUsers and
doPeepholeSExtW like we do for SLLIW, but I don't think we have
the test coverage for that yet.
We know the shift amount is a constant with bit 31 clear. anyext
of constant will be either zext or sext which will produce the
same result here. But we really shouldn't rely on that. It would
be valid to put a random number in the upper bits. Our isel patterns
expect the upper bits to be 0 so we should ask for it explicitly.
This doesn't appear to be needed any more. I did some inspecting
of the gcc torture suite and SPEC2006 with this removed and didn't
find any meaningful changes.
I think we're more aggressive about forming ADDIW now using
sign_extend_inreg during type legalization and hasAllWUsers in isel.
This probably helps catch the cases this helped with before.
Similar to what we do for other loads/stores, use the intrinsic
version that we already have custom isel for.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D121166
vslide1up/down have this flag set, but the value isn't a splat.
Rename for clarity.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D121037
With Zbb, abs is expanded to (max X, neg) by default. If X has 33 or
more sign bits, we can expand it a little early using negw instead of
neg to save a sext_inreg. If X started as a 32 bit value, type
legalization would have inserted a sext before the abs so X having
33 sign bits should always be true.
Note: I've used ISD::FREEZE here since we increase the number of uses.
Our default expansion for ABS doesn't do that, but I think that's a bug.
We can't do this with custom type legalization because ISD::FREEZE
doesn't propagate sign bits, so a later DAG combine won't be able to
see the opportunity and optimize it.
Alive2: https://alive2.llvm.org/ce/z/Gx3RNe
Reviewed By: spatel
Differential Revision: https://reviews.llvm.org/D120597
Until Zfinx is supported in CodeGen we need to convert all Zfinx
register classes to GPR.
Remove the zfinx-types.ll test which didn't test anything meaningful
since -mattr=zfinx isn't implemented completely in llc.
Follow up to D93298.
This miscompile was introduced in D119527.
This was a special pattern for rotate+bswap on RV32. It doesn't
work for RV64 since the rotate needs to be half the bitwidth. The
equivalent pattern for RV64 is ROTR ((GREV x, 56), 32) so match
that instead.
This could be generalized further as noted in the new FIXME.
Reviewed By: Chenbing.Zheng
Differential Revision: https://reviews.llvm.org/D120686
This patch adds MC layer support for the Zfinx extension.
Authored-by: StephenFan
Co-Authored-by: Shao-Ce Sun
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D93298
This lowers VECTOR_SPLICE of scalable vectors to a slidedown followed by a slideup.
Fixed vectors are encouraged to use the shufflevector instruction. The equivalent patch
for fixed vectors is D119039.
I've used a tail agnostic slidedown and limited the VL to only the
elements that will not be overwritten by the slideup. The slideup
uses VLMax for its VL. It unfortunately uses tail undisturbed policy
but it isn't required as there is no tail. We just need the merge
operand to carry the bits for the lower portion of the result.
Care was taken to ensure that either the slideup or slidedown will
be able to use a .vi instruction when the immediate is small. Which
one uses the immediate depends on the sign of the immediate.
Reviewed By: frasercrmck, ABataev
Differential Revision: https://reviews.llvm.org/D119303
Default type legalization will create sext_inreg+abs, but we may
not be able to remove the sext_inreg.
Instead, this patch expands abs during type legalization to
Y = (sraiw X, 31); (subw (xor X, Y), Y), which doesn't require the input
to be sign extended.
This gives a big improvement for some neg-abs tests where the
abs is used more than the neg. Previously the abs was expanded
a different way before and after type legalization. Now they are
expanded in a similar way enabling more CSE.
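For illustration, a standalone C++ sketch of the expanded form for i32 (assumes arithmetic right shift of a negative value; not code from this patch):
```
// abs(X) without needing a sign-extended input:
//   Y = X >> 31 (arithmetic)   -> sraiw
//   abs = (X ^ Y) - Y          -> xor + subw
#include <cassert>
#include <cstdint>

int32_t absExpansion(int32_t x) {
  int32_t y = x >> 31;  // all ones if x is negative, all zeros otherwise
  return (x ^ y) - y;   // conditional negate
}

int main() {
  assert(absExpansion(5) == 5);
  assert(absExpansion(-7) == 7);
  assert(absExpansion(0) == 0);
}
```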
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D120636
vcpop and vfirst are still useful when VL=0.
vcpop is equivalent to li 0 and vfirst is equivalent to li -1,
since no mask elements are active.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D120302
Add a new ISD opcode to represent the sign extending behavior of
vmv.x.h. Keep the previous anyext opcode to allow the existing
(fmv_x_anyexth (fmv_h_x X)) combine to keep working without needing
to generate a sign extend.
For fmv.x.w we are able to match the sext_inreg in an isel pattern,
but a 16-bit sext_inreg is lowered to a shift pair before isel. This
seemed like a larger match than we should do in isel.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118974
Internally to DAGCombiner the SDValues were passed by non-const
reference despite not being modified. They were then passed by
const reference to TLI.
This patch passes them by value which is consistent with the vast
majority of code.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D120420
See https://github.com/llvm/llvm-project/issues/53831 for a full discussion.
The basic issue is that DAGCombiner::visitMUL and
RISCVISelLowering::transformAddImmMulImm get stuck in a loop, as the
current checks in transformAddImmMulImm aren't sufficient to avoid all
cases where DAGCombiner::isMulAddWithConstProfitable might trigger a
transformation. This patch makes transformAddImmMulImm bail out if C0
(the constant used for multiplication) has more than one use.
Differential Revision: https://reviews.llvm.org/D120332
This generalizes isElementRotate to work when there's only a single
slide needed. I've removed matchShuffleAsSlideDown which is now
redundant.
Reviewed By: frasercrmck, khchen
Differential Revision: https://reviews.llvm.org/D119759
Add the passthru operand for
VMV_V_X_VL, VFMV_V_F_VL and SPLAT_VECTOR_SPLIT_I64_VL also.
The goal is to support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D119688
Part of the shift lowering creates a (sub XLEN-1, ShAmt). When this
value is used we know that ShAmt is [0..XLEN-1]. Since XLEN is a power
of 2 we can replace the sub with an xor. This allows us to use XORI
instead of LI+SUB.
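For illustration, a standalone C++ check of the identity (not code from this patch):
```
// With 0 <= ShAmt <= XLEN-1 and XLEN a power of 2, XLEN-1 is all ones in the
// low bits, so there are no borrows and (XLEN-1) - ShAmt == (XLEN-1) ^ ShAmt.
// That is what lets XORI replace LI+SUB.
#include <cassert>

int main() {
  const unsigned XLen = 64;
  for (unsigned shamt = 0; shamt < XLen; ++shamt)
    assert((XLen - 1) - shamt == ((XLen - 1) ^ shamt));
}
```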
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D119411
The goal is to support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Add passthru operand for VSLIDE1UP_VL and VSLIDE1DOWN_VL to support
i64 scalar in rv32.
The masked VSLIDE1 only emits the mask-undisturbed policy, even when a
mask-agnostic policy is given, until InsertVSETVLI supports mask agnostic.
Reviewed by: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D117989
This is a more generic version of D119110 that uses MaskedValueIsZero
to do the matching and SimplifyDemandedBits to remove any unneeded
AND instructions.
Tests were taken from D119110.
Reviewed By: Chenbing.Zheng
Differential Revision: https://reviews.llvm.org/D119622
While matching widening multiply, if we matched an extend from i8->i32,
i16->i64 or i8->i64, we need to reintroduce a narrower extend. If we're
matching a vwmulsu we need to use a sext for op0 and a zext for op1.
This bug exists in LLVM 14 and will need to be backported.
Differential Revision: https://reviews.llvm.org/D119618
Move some combine patterns to DAG combine, and
address the FIXME left in RISCVInstrInfoZb.td.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D119527
We can lower a vector splice to a vslidedown and a vslideup.
The majority of the matching code here came from X86's code for matching
PALIGNR and VPALIGND/Q.
The slidedown and slideup lowering don't really require it to be concatenation,
but it happened to be an interesting pattern with existing analysis code I
could use.
This helps with cases where the scalar loop optimizer forwarded a load
result from a previous loop iteration. For example, this happens if the
loop uses x[i] and x[i+1] on the same iteration. The scalar optimizer
will forward the x[i+1] load from the previous iteration to satisfy x[i] on this
iteration. When this gets vectorized it results in one element of a vector
being forwarded from the previous loop to be concatenated with elements
loaded on this iteration.
Whether that's more efficient than doing a shifted load or reloading
the single scalar and using vslide1up is an interesting question.
But that's not something the backend can help with.
Reviewed By: khchen
Differential Revision: https://reviews.llvm.org/D119039
The VLMaxSentinel is represented as TargetConstant, but that's included
in isa<ConstantSDNode>. To keep constant VLs and VLMax separate as long
as possible, use the X0 register during lowering and only convert to
VLMaxSentinel during isel.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118845
We already had FMA_VL node, but we didn't have masked patterns.
I have not added the fneg variations. I'll do those after I add
llvm.vp.fneg.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D119196
This patch adds an optimization to splat-like operations where the
splatted value is extracted from an identically-sized vector. On RVV we
can splat that via vrgather.vx/vrgather.vi without dropping to scalar
beforehand.
We do have a similar VECTOR_SHUFFLE-specific optimization but that only
works on fixed-length vector types and for those with a constant splat
lane. This patch extends this optimization to make it work on
scalable-vector types and on unknown extract indices.
It is performed during fixed-vector BUILD_VECTOR lowering and during a
new DAGCombine on SPLAT_VECTOR for scalable vectors.
Reviewed By: craig.topper, khchen
Differential Revision: https://reviews.llvm.org/D118456
Add a new ISD opcode to represent the sign extending behavior of
vmv.x.h. Keep the previous anyext opcode to allow the existing
(fmv_x_anyexth (fmv_h_x X)) combine to keep working without needing
to generate a sign extend.
For fmv.x.w we are able to match the sext_inreg in an isel pattern,
but a 16-bit sext_inreg is lowered to a shift pair before isel. This
seemed like a larger match than we should do in isel.
Differential Revision: https://reviews.llvm.org/D118974
Add the vslidedown and interleave patterns that I recently implemented.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118952
This avoids a crash for scalable vectors and/or scalarization for
fixed vectors.
The algorithm is different enough that I don't think it makes sense
to merge with ceil/floor/trunc. Algorithm is adapted from gcc's X86
SSE2 output.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117247
SPLAT_VECTOR_I64 has the same semantics as RISCVISD::VMV_V_X_VL, it
just assumed VLMax instead of carrying a VL operand.
Include order of RISCVInstrInfoVSDPatterns.td and RISCVInstrInfoVVLPatterns.td
has been swapped to avoid moving riscv_vmv_v_x_vl into
RISCVInstrInfoVSDPatterns.td and to allow moving other "_vl" SDNodes back to
RISCVInstrInfoVVLPatterns.td
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118841
VLMaxSentinel happens to be represented as a -1 TargetConstant. A user
provided -1 would be an ISD::Constant. We shouldn't assume that they
are the same thing. I'm still not entirely convinced that we should be
treating -1 from the user as VLMAX.
Also fix one place that failed to use XLenVT for the VLMaxSentinel,
using MVT::i64 in code that only executes on RV32.
This adds or reuses ISD opcodes for vwadd.wv, vwaddu.wv, vwadd.vv, vwaddu.vv
and a similar set for sub.
I've included support for narrowing scalar splats that have known
sign/zero bits similar to what was done for MUL_VL.
The conversion to vwadd.vv proceeds in two phases. First we'll form
a vwadd.wv by narrowing one of the operands. Then we'll visit the
vwadd.wv to try to narrow the other operand. This turned out to be
simpler than catching all the cases in one step. The forming of
vwadd.wv can happen for either operand for add, but only the right
hand side for sub since sub isn't commutable.
An interesting quirk is that ADD_VL and VZEXT_VL/VSEXT_VL are formed
during vector op legalization, but VMV_V_X_VL isn't usually formed
until op legalization when BUILD_VECTORS are handled. This leads to
VWADD_W_VL forming in one DAG combine round, and then a later DAG combine
round sees the VMV_V_X_VL and needs to commute the operands to get the
splat in position. This alone necessitated a VWADD_W_VL combine function
which made forming vwadd.vv in two stages an easy choice.
I've left out trying hard to form vwadd.wx instructions for now. It would
only save an extend in the scalar domain which isn't as interesting.
Might need to review the test coverage a bit. Most of the vwadd.wv
instructions are coming from vXi64 tests on rv64. The tests were
copy pasted from the existing multiply tests.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117954
We convert VLEN to vscale by dividing by RVVBitsPerBlock which is
currently 64. This is only correct if VLEN is evenly divisible by
64. With only Zvl32b we can't assume that.
This patch adds a fatal_error to prevent generating code that may
be broken.
We probably need to look at how we size stack frame objects too.
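A sketch of the relationship being guarded (helper name is illustrative; the RVVBitsPerBlock value of 64 is taken from the text above):
```
// vscale is VLEN / RVVBitsPerBlock.  With only Zvl32b, VLEN may be 32 and the
// division would be wrong, which is why the backend now emits a fatal error
// instead of silently producing bad code.
#include <cassert>

constexpr unsigned RVVBitsPerBlock = 64;

unsigned getVScaleForVLEN(unsigned VLEN) {
  assert(VLEN % RVVBitsPerBlock == 0 &&
         "VLEN not evenly divisible by RVVBitsPerBlock");
  return VLEN / RVVBitsPerBlock;
}

int main() {
  assert(getVScaleForVLEN(128) == 2); // e.g. Zvl128b gives a vscale of 2
}
```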
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118583
We had previously hardcoded this to assume that vector registers
are 128 bits. This was true when only V existed, but after Zve
extensions were added this became incorrect.
This patch adjusts it to support 128, 64, or 32 bit vectors depending
on Zvl. The 128-bit limit is artificial, but we don't have any test
coverage showing that we handle larger values, so I was being conservative.
None of our lit tests depend on this code today due to the custom
lowering of ISD::VSCALE that inserts the appropriate left or right
shift to convert from VLENB to VSCALE. That code was added after
this code in computeKnownBitsForTargetNode.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118582
masked.atomicrmw.*.i32 intrinsics access an i32 (and then possibly
mask it), so hardcode MVT::i32 as the access type here, rather than
determining it from the pointer element type.
Differential Revision: https://reviews.llvm.org/D118336
This is a slight change because I'm using the ANY_EXTEND result
instead of the original operand, but getNode should constant fold.
While there, add a comment about why the code specifically checks
for a ConstantSDNode.
We can use the RISCVISD::GREV encoding that swaps the bits in
each byte. This allows it to use the existing computeKnownBits
support for RISCVISD::GREV.
We already have an ISD opcode for the more general GREV/GREVI
instruction. We can just use it with the encoding that corresponds
to the behavior of brev8. This is similar to what we do for orc.b
where we use the GORC ISD opcode.
Currently the backend promotes the mask vector to an i8 vector and extracts an element from that. Instead, we could bitcast to a wider-element vector, extract an element from it into a GPR, and then use a scalar (base I) instruction to extract the desired bit.
Differential Revision: https://reviews.llvm.org/D117389
We were creating a truncate with the default for the type, but for
VP intrinsics we have a VL that we should use.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D118406
`__gnu_h2f_ieee` and `__gnu_f2h_ieee` were introduced by ARM and set as the
default names for fp16/fp32 conversions in LLVM.
However, RISC-V GCC uses the default naming scheme for these,
`__extendhfsf2` and `__truncsfhf2`, which causes a runtime ABI
incompatibility.
Although we don't have a formal runtime ABI spec to specify this naming
convention yet, I think it would be great to fix the incompatibility
first.
I plan to create a runtime ABI spec under the psABI spec this year.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D118207
According to riscv-v-spec-1.0, widening signed(vs2)-unsigned integer multiply
vwmulsu.vv vd, vs2, vs1, vm # vector-vector
vwmulsu.vx vd, vs2, rs1, vm # vector-scalar
It is worth noting that the signed operand is only vs2.
For vwmulsu.vv we can swap the two operands and it doesn't matter which one is the sign extension,
but for vwmulsu.vx the sign-extended operand cannot be the vector extended from the scalar (rs1).
I specifically added two functions ending with _swap in the test case.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118215
This patch adds support for expanding VP_MERGE through a sequence of
vector operations producing a full-length mask setting up the elements
past EVL/pivot to be false, combining this with the original mask, and
culminating in a full-length vector select.
This expansion should work for any data type, though the only use for
RVV is for boolean vectors, which themselves rely on an expansion for
the VSELECT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D118058
This reverts commit ef82063207.
- It conflicts with the existing llvm::size in STLExtras, which will now
never be called.
- Calling it without llvm:: breaks C++17 compat
The goal is to support tail and mask policy in RVV builtins.
We focus on IR part first.
If the passthru operand is undef, we use tail agnostic, otherwise
use tail undisturbed.
Co-Authored-by: Hsiangkai Wang <Hsiangkai@gmail.com>
Reviewers: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D117647
Instead use either Type::getPointerElementType() or
Type::getNonOpaquePointerElementType().
This is part of D117885, in preparation for deprecating the API.
This patch introduces new intrinsics that enable the use of vsetvli in
contexts where only the returned vector length is of interest. The
pre-existing intrinsics are marked with side-effects, which prevents
even trivial optimizations on/across them.
These intrinsics are intended to be used in situations where the vector
length is fed in turn to RVV intrinsics or to vector-predication
intrinsics during loop vectorization, for example. Those codegen paths
ensure that instructions are generated with their own implicit vsetvli,
so the vector length and vtype can be relied upon to be correct.
No corresponding C builtins are planned at this stage, though that is a
possibility for the future if the need arises.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117910
This patch adds lowering of the llvm.vp.merge.* intrinsic
(ISD::VP_MERGE) to RVV vmerge/vfmerge instructions. It introduces a
special pseudo form of vmerge which allows a tied merge operand,
allowing us to specify the tail elements as being equal to the "on
false" operand, using a tied-def constraint and a "tail undisturbed"
policy.
While this strategy allows us to often lower the intrinsic to just one
instruction, it may be less efficient in fixed-vector types as the
number of tail elements may extend far beyond the length of the fixed
vector. Another strategy could be to use a vmerge/vfmerge instruction
with an AVL equal to the length of the vector type, and manipulate the
condition operand such that mask elements greater than the operation's
EVL are false.
I've also observed inefficient codegen in which our 'VF' patterns don't
match raw floating-point SPLAT_VECTORs, which occur in scalable-vector
code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117561
This patch follows up on D117697 to help the simple binary operations
behave similarly in the presence of masks.
It also enables CGP sinking support for vp.fdiv and vp.fsub intrinsics,
now that VFRDIV and VFRSUB are consistently matched with a LHS splat for
masked and unmasked variants.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117783
Instead of passing both the SDNode* and 2 of the operands
in two different orders, just pass the SDNode * and a bool to
indicate which operand order to test.
While there rename to combineMUL_VLToVWMUL_VL.
This patch brings better splat-matching to our VP support, by sinking
splat operands of VP intrinsics back into the same block as the VP
operation. The list of VP intrinsics we are interested in matches that
of the regular instructions.
Some optimization is still lacking. For instance, our VL nodes aren't
recognized as commutative, so splats must be on the RHS. Because of
this, we limit our sinking of splats to just the RHS operand for now.
Improvement in this regard can come in another patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117703
RISCV only has a unary shuffle, which requires placing indices in a
register. For interleaving two vectors this means we need at least
two vrgathers and a vmerge to do a shuffle of two vectors.
This patch teaches shuffle lowering to use a widening addu followed
by a widening vmaccu to implement the interleave. First we extract
the low half of both V1 and V2. Then we implement
(zext(V1) + zext(V2)) + (zext(V2) * zext(2^eltbits - 1)) which
simplifies to (zext(V1) + zext(V2) * 2^eltbits). This further
simplifies to (zext(V1) + zext(V2) << eltbits). Then we bitcast the
result back to the original type splitting the wide elements in half.
We can only do this if we have a type with wider elements available.
Because we're using extends we also have to be careful with fractional
lmuls. Floating point types are supported by bitcasting to/from integer.
The tests test a varied combination of LMULs split across VLEN>=128 and
VLEN>=512 tests. There are a few tests with shuffle indices commuted as well
as tests for undef indices. There's one test for a vXi64/vXf64 vector which
we can't optimize, but verifies we don't crash.
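For illustration, a scalar C++ model of the arithmetic for 8-bit elements widened to 16 bits (assumes a little-endian host, matching RISC-V; not the actual lowering code):
```
// Interleaving v1[i], v2[i] into adjacent narrow elements is the same as
// forming the wide element zext(v1) + (zext(v2) << eltbits) and bitcasting the
// wide result back to the narrow element type.
#include <cassert>
#include <cstdint>
#include <cstring>

int main() {
  const unsigned eltbits = 8;
  uint8_t v1 = 0x12, v2 = 0x34;
  // (zext(v1) + zext(v2)) + zext(v2) * (2^eltbits - 1) == zext(v1) + (zext(v2) << eltbits)
  uint16_t wide = uint16_t(uint16_t(v1) + uint16_t(v2) +
                           uint16_t(v2) * ((1u << eltbits) - 1));
  uint8_t narrow[2];
  std::memcpy(narrow, &wide, sizeof(wide)); // bitcast back to narrow elements
  assert(narrow[0] == v1 && narrow[1] == v2);
}
```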
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D117743
Similar for ceil, trunc, round, and roundeven. This allows us to use
static rounding modes to avoid a libcall.
This is similar to D116771, but for the saturating conversions.
This optimization is done for AArch64 as isel patterns.
RISCV doesn't have instructions for ceil/floor/trunc/round/roundeven
so the operations don't stick around until isel to enable a pattern
match. Thus I've implemented a DAG combine.
I'm only handling saturating to i64 or i32. This could be extended
to other sizes in the future.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116864
This idea has come up in several reviews -- D115978 and D105902 -- so I
can't take any credit for the idea. Instead of using a constant pool to
lower -0.0, we can emit a sequence of two instructions:
fmv.[hwd].x freg, zero
fsgnjn.[hsd] freg, freg, freg
This is only done when the floating-point type is legal.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117687
We may not be allowed to use vXiXLen vectors. Consult ELEN to
determine what is allowed. This will become even more important
when Zve32 is added.
Reviewed By: frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D117518
Remove fshl/fshr with constant shift amount isel patterns. Replace
with fsr/fsl with constant isel patterns.
This hack was trying to preserve as much optimization opportunity
for fshl/fshr by constant as possible, but the conversion to
RISCVISD::FSR/FSL happens so late it probably isn't worth much.
The new isel patterns are needed by D117468 anyway.
This reverts the revert commit e328385739.
Accidental demanded bits change has been removed. The demanded bits
code itself was removed in a pre-commit since it isn't tested.
Original commit message:
Previous we used the fshl/fshr operand ordering for simplicity. This
made things confusing when D117468 proposed adding intrinsics for
the instructions. We can't just use the generic funnel shifting
intrinsics because fsl/fsr have different functionality that should
be exposed to software.
Now we use rs1, rs3, rs2/shamt order which matches the instruction
printing order and the order used in this intrinsic header
https://github.com/riscv/riscv-bitmanip/blob/main-history/cproofs/rvintrin.h
Testing may be easier after D117468. Right now we get demanded bits
optimizations done on ISD::FSHL/FSHR before they become FSR/FSL. This
makes it hard to test.
Previous we used the fshl/fshr operand ordering for simplicity. This
made things confusing when D117468 proposed adding intrinsics for
the instructions. We can't just use the generic funnel shifting
intrinsics because fsl/fsr have different functionality that should
be exposed to software.
Now we use rs1, rs3, rs2/shamt order which matches the instruction
printing order and the order used in this intrinsic header
https://github.com/riscv/riscv-bitmanip/blob/main-history/cproofs/rvintrin.h
Currently, users expect VL to be the last operand. However, since some
intrinsics have a tail policy in the last operand, this rule can no longer
be used.
Reviewed By: craig.topper, frasercrmck
Differential Revision: https://reviews.llvm.org/D117452
Currently, SplatOperand starts from 1 because operand 0 (or 1) is the
intrinsic ID in SelectionDAG.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117453
For fixed vectors, the undef will get expanded to an all zeros
build_vector. We don't want that so suppress creating the
insert_subvector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117379
We were considering this legal, but later the undef would become an all
zeros vector. This would cause us to need to re-legalize the insert later
into a vslideup with a zero vector.
This patch catches the case and directly legalizes it to a scalable
insert.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117377
When we know the value we're extending is a negative constant then it
makes sense to use SIGN_EXTEND because this may improve code quality in
some cases, particularly when doing a constant splat of an unpacked vector
type. For example, for SVE when splatting the value -1 into all elements
of a vector of type <vscale x 2 x i32> the element type will get promoted
from i32 -> i64. In this case we want the splat value to sign-extend from
(i32 -1) -> (i64 -1), whereas currently it zero-extends from
(i32 -1) -> (i64 0xFFFFFFFF). Sign-extending the constant means we can use
a single mov immediate instruction.
New tests added here:
CodeGen/AArch64/sve-vector-splat.ll
I believe we see some code quality improvements in these existing
tests too:
CodeGen/AArch64/reduce-and.ll
CodeGen/AArch64/unfold-masked-merge-vector-variablemask.ll
The apparent regressions in CodeGen/AArch64/fast-isel-cmp-vec.ll only
occur because the test disables codegen prepare and branch folding.
Differential Revision: https://reviews.llvm.org/D114357
Original patch by @hussainjk.
This patch was split off from D109377 to keep vector legalization
(widening/splitting) separate from vector element legalization
(promoting).
While the original patch added a third overload of
SelectionDAG::getVPStore, this patch takes the liberty of collapsing
those all down to 1, as three overloads seems excessive for a
little-used node.
The original patch also used ModifyToType in places, but that method
still crashes on scalable vector types. Seeing as the other VP
legalization methods only work when all operands need identical
widening, this patch follows in that vein.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117235
These cases follow the same pattern, so they can be combined
into a unique function.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D117378
Specifically the unary shuffle case where the elements being
shifted in are undef. This handles the shuffles produced by expanding
llvm.reduce.mul.
I did not reduce the VL which would increase the number of vsetvlis,
but may improve the execution speed. We'd also want to narrow the
multiplies so we could share vsetvlis between the vslidedown.vi and
the next multiply.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117239
It appears the code here was written for the inline asm clobbering
a specific register, but it also gets used for named input and
output registers.
For the input and output case, we should honor the VT so we
don't insert conversion instructions around the inline assembly.
For the clobber case, we need to pick the largest register class.
Reviewed By: asb, jrtc27
Differential Revision: https://reviews.llvm.org/D117279
We could use vmv.v.i/vmv.v.x with an EEW of 32 to lower an i64 splat vector if the i64 constant scalar can be split into two identical i32 scalars.
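For illustration, a sketch of the check involved (hypothetical helper name; the real predicate lives in the lowering code):
```
// An i64 splat constant can be built with a 32-bit-EEW vmv.v.x/vmv.v.i when
// both 32-bit halves of the constant are identical: broadcasting the i32 half
// with SEW=32 reproduces the same bit pattern as the i64 splat.
#include <cassert>
#include <cstdint>

bool canSplatI64WithEEW32(uint64_t imm) {
  return static_cast<uint32_t>(imm) == static_cast<uint32_t>(imm >> 32);
}

int main() {
  assert(canSplatI64WithEEW32(0x1234567812345678ull));
  assert(!canSplatI64WithEEW32(0x0000000112345678ull));
}
```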
Differential Revision: https://reviews.llvm.org/D117079
When we know the value we're extending is a negative constant then it
makes sense to use SIGN_EXTEND because this may improve code quality in
some cases, particularly when doing a constant splat of an unpacked vector
type. For example, for SVE when splatting the value -1 into all elements
of a vector of type <vscale x 2 x i32> the element type will get promoted
from i32 -> i64. In this case we want the splat value to sign-extend from
(i32 -1) -> (i64 -1), whereas currently it zero-extends from
(i32 -1) -> (i64 0xFFFFFFFF). Sign-extending the constant means we can use
a single mov immediate instruction.
New tests added here:
CodeGen/AArch64/sve-vector-splat.ll
I believe we see some code quality improvements in these existing
tests too:
CodeGen/AArch64/dag-numsignbits.ll
CodeGen/AArch64/reduce-and.ll
CodeGen/AArch64/unfold-masked-merge-vector-variablemask.ll
The apparent regressions in CodeGen/AArch64/fast-isel-cmp-vec.ll only
occur because the test disables codegen prepare and branch folding.
Differential Revision: https://reviews.llvm.org/D114357
This adds support for STRICT_FSETCC(quiet) and STRICT_FSETCCS(signaling).
FEQ matches well to STRICT_FSETCC oeq.
FLT/FLE matches well to STRICT_FSETCCS olt/ole.
Others require commuting operands or multiple instructions.
STRICT_FSETCC olt/ole/ogt/oge/ult/ule/ugt/uge uses FLT/FLE,
but we need to save/restore FFLAGS around them to avoid spurious
exceptions. I've implemented pseudo instructions with a
CustomInserter to insert the save/restore CSR instructions.
Unfortunately, this doesn't honor exceptions for signaling NaNs,
but I'm not sure if signaling NaNs are really supported by the
constrained intrinsics.
STRICT_FSETCC one and ueq expand to a pair of FLT instructions
with a save/restore of fflags around each. This could be improved
in the future.
There may be some opportunities to generate better code for strict
comparisons mixed with nonans fast math flags. I've left FIXMEs in
the .td files for that.
Co-Authored-by: ShihPo Hung <shihpo.hung@sifive.com>
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D116694
Similar for ceil, trunc, round, and roundeven. This allows us to use
static rounding modes to avoid a libcall.
This optimization is done for AArch64 as isel patterns.
RISCV doesn't have instructions for ceil/floor/trunc/round/roundeven
so the operations don't stick around until isel to enable a pattern
match. Thus I've implemented a DAG combine.
We only handle XLen types except i32 on RV64. i32 will be type
legalized to a RISCVISD node. All other types will be type legalized
to XLen and maintain the FP_TO_SINT/UINT ISD opcode.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116771
The code can only address the whole RV32 address space or the lower 2 GiB
of the RV64 address space in the small code model, so a 32-bit entry is enough.
Cache hit ratio and code size both improve somewhat.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116435
When `Zbt` is enabled, we can generate SELECT for division by power
of 2, so that there is no data dependency.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D114856
When we want to create a splat vector where only the first element is initialized, we can use vmv.s.x or vfmv.s.f to build it.
Differential Revision: https://reviews.llvm.org/D116277
The zextload hook is only used to determine whether to insert a
zero_extend or any_extend for narrow types leaving a basic block.
Returning true from this hook tends to cause any load whose output
leaves the basic block to become an LWU instead of an LW.
Since we tend to prefer sexts for i32 compares on RV64, this can
cause extra sext.w instructions to be created in other basic blocks.
If we use LW instead of LWU this gives the MIR pass from D116397
a better chance of removing them.
Another option might be to teach getPreferredExtendForValue in
FunctionLoweringInfo.cpp about our preference for sign_extend of
i32 compares. That would cause SIGN_EXTEND to be chosen for any
value used by a compare instead of using the isZExtFree heuristic.
That will require code to convert from the llvm::Type* to EVT/MVT
as well as querying the type legalization actions to get the
promoted type in order to call TargetLowering::isSExtCheaperThanZExt.
That seemed like many extra steps when no other target wants it.
Though it would avoid us needing to lean on the MIR pass in some cases.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116567
This patch adds isel support for STRICT_LRINT/LLRINT/LROUND/LLROUND.
It also adds test cases for f32 and f64 constrained intrinsics that
correspond to the intrinsics in float-intrinsics.ll and
double-intrinsics.ll. Support for promoting the integer argument of
STRICT_FPOWI was added.
I've skipped adding tests for f16 intrinsics, since we don't have libcalls
for them and we have inconsistent support for promoting them in LegalizeDAG.
This will need to be examined more closely.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D116323
After consuming all vector registers, the scalable vector values will be
passed indirectly. The pointer values will be saved in general
registers. Previously, if all general registers were used up, we reported an
error to notify users that the compiler did not support passing scalable
vector values through the stack. This patch removes that restriction: after
all general registers are used up, we use the stack to save the
pointers which point to the indirectly passed scalable vector values.
Differential Revision: https://reviews.llvm.org/D116310
The 'r' constraint uses the GPR class. There is generic support
for bitcasting and extending/truncating non-integer VTs to the
required integer VT. This doesn't work for scalable vectors and
instead crashes.
To prevent this, explicitly reject vectors. Fixed vectors might
work without crashing, but it doesn't seem worthwhile to allow.
While there remove an unnecessary level of indentation in the
"vr" and "vm" constraint handling.
Differential Revision: https://reviews.llvm.org/D115810
For fixed and scalable vectors, each intrinsic x is lowered to vmx.mm,
dropping the mask, which is safe to do as masked-off elements are
undef anyway.
Differential Revision: https://reviews.llvm.org/D115339
-0.0 requires a constant pool. +0.0 can be made with vmv.v.x x0.
Not doing this in getNeutralElement for fear of changing other targets.
Differential Revision: https://reviews.llvm.org/D115978
This adds support for strict conversions between fp types and between
integer and fp.
NOTE: RISCV has static rounding mode instructions, but the constrained
intrinsic metadata is not used to select static rounding modes. Dynamic
rounding mode is always used.
Differential Revision: https://reviews.llvm.org/D115997
Enable the transforms (X & Y) == Y ---> (~X & Y) == 0 and (X & Y) != Y ---> (~X & Y) != 0 when the Zbb extension is available, so that more andn instructions are used.
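For illustration, an exhaustive scalar check of the identity over small values (standalone C++, not the DAG combine itself):
```
// (X & Y) == Y  <=>  (~X & Y) == 0: Y's bits are all covered by X exactly when
// no bit of Y survives under ~X.  The right-hand form maps onto andn from Zbb.
#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t x = 0; x < 256; ++x)
    for (uint32_t y = 0; y < 256; ++y)
      assert(((x & y) == y) == ((~x & y) == 0));
}
```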
Differential Revision: https://reviews.llvm.org/D115922
Our Zfhmin support is only MC layer, but these are CodeGen layer
interfaces. If f16 isn't a Legal type for CodeGen with Zfhmin, then
these interfaces should keep their non-Zfh behavior.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D115822
Test that STRICT_FMINNUM/FMAXNUM are lowered to libcalls for f32/f64.
The RISC-V instructions don't match the behavior of fmin/fmax libcalls
with respect to SNaN.
Promoting FMINNUM/FMAXNUM for f16 needs more work outside of the
RISC-V backend.
Reviewed By: asb, arcbbb
Differential Revision: https://reviews.llvm.org/D115680
In order to support constrained FP intrinsics we need to model FRM
dependency. Whether or not an instruction uses FRM is based on a 3
bit field in the instruction. Because of this we can't add
'Uses = [FRM]' to the tablegen descriptions.
This patch examines the immediate after isel and adds an implicit
use of FRM. This idea came from Roger Ferrer Ibanez.
Other ideas:
We could be overly conservative and just pretend all instructions with
frm field read the FRM register. Or we could have pseudoinstructions
for CodeGen with rounding mode.
Reviewed By: asb, frasercrmck, arcbbb
Differential Revision: https://reviews.llvm.org/D115555
The reduction instructions only read the first element. The
execution time for a splat may take longer with a larger VL.
We should use the smallest VL we can.
Reviewed By: frasercrmck, HsiangKai
Differential Revision: https://reviews.llvm.org/D115536
- `vm` constraint is used for the masking operand, which is always v0.
- Update testcase, only masking operand should use `vm`, vector mask operations
should just use `vr` for any vector register.
- Revise the description of `vm` constraint.
- This patch also fixes issues in RISCVRegisterInfo.td and RISCVISelLowering.cpp.
RISCVRegisterInfo.td:
- The first VT in the list must be the largest total size since the
SelectionDAGBuilder uses the first register in the list as the canonical
type for the register.
RISCVISelLowering.cpp:
- Fix RISCVTargetLowering::splitValueIntoRegisterParts and
RISCVTargetLowering::joinRegisterPartsIntoValue for handling vectors
with different total sizes, which happens with fractional LMUL since
a fractional LMUL always occupies one whole vector register.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D112599
The immediate size check on StepNumerator did not take into account
that vmul.vi does not exist. It also did not account for power of 2
constants that can be done with vshl.vi.
This patch fixes this by moving the conversion from mul to shift
further up. Then we can consider the immediates separately for MUL
vs SHL. For MUL I've allowed simm12 which requires a single addi
before a vmul.vx. For SHL I've allowed any uimm5 which works with
vshl.vi. We could relax these further in the future. This is a
starting point that allows us to emit the same number of instructions
we were already using for smaller numerators.
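For illustration, a scalar sketch of the power-of-2 case (names are illustrative; the simm12/uimm5 thresholds come from the text above):
```
// A step vector scaled by a power-of-2 numerator can use vshl.vi with a uimm5
// shift amount; other numerators in the simm12 range fall back to an addi
// feeding vmul.vx.  Below is just the scalar identity behind the shift form.
#include <cassert>
#include <cstdint>

int main() {
  const uint64_t numerator = 8; // power of 2, so the shift form applies
  const unsigned shift = 3;     // log2(numerator); must fit in uimm5
  for (uint64_t step = 0; step < 16; ++step)
    assert(step * numerator == (step << shift));
}
```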
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D115081
This prevents scalarization of fixed vector operations or crashes
on scalable vectors.
We don't have direct support for these operations. To emulate
ftrunc we can convert to the same sized integer and back to fp using
round to zero. We don't need to do a convert if the value is large
enough to have no fractional bits or is a nan.
The ceil and floor lowering would be better if we changed FRM, but
we don't model FRM correctly yet. So I've used the trunc lowering
with a conditional add or subtract with 1.0 if the truncate rounded
in the wrong direction.
There are also missed opportunities to use masked instructions.
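For illustration, a scalar C++ model of the trunc emulation for f32/i32 (a sketch only; it does not preserve the sign of zero or use the masked-instruction structure of the real lowering):
```
// Emulate trunc(x): if |x| >= 2^23 there are no fractional bits (and NaN must
// pass through unchanged); otherwise convert to a same-sized integer with
// round-to-zero and convert back.
#include <cassert>
#include <cmath>
#include <cstdint>

float truncEmulation(float x) {
  const float noFractionBound = 8388608.0f; // 2^23
  if (std::isnan(x) || std::fabs(x) >= noFractionBound)
    return x;                                           // already integral or NaN
  return static_cast<float>(static_cast<int32_t>(x));   // round-to-zero round trip
}

int main() {
  assert(truncEmulation(3.7f) == 3.0f);
  assert(truncEmulation(-3.7f) == -3.0f);
  assert(truncEmulation(16777216.0f) == 16777216.0f);
}
```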
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113543
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have less strict semantics
than the fptosi_sat, with fewer values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
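For illustration, a scalar C++ sketch of the pattern being matched, for f64 to i32 (names are illustrative; the actual fold operates on DAG nodes and only fires when the inner fptosi is defined):
```
// smin(smax(fptosi(x), INT32_MIN), INT32_MAX) computed in a wider integer
// behaves like a saturating f64 -> i32 conversion for every input where the
// inner fptosi is defined, which is what allows forming fptosi_sat.
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

int32_t clampedConvert(double x) {
  int64_t wide = static_cast<int64_t>(x); // the original fptosi
  wide = std::max<int64_t>(wide, std::numeric_limits<int32_t>::min()); // smax
  wide = std::min<int64_t>(wide, std::numeric_limits<int32_t>::max()); // smin
  return static_cast<int32_t>(wide);
}

int main() {
  assert(clampedConvert(12.9) == 12);
  assert(clampedConvert(3e9) == std::numeric_limits<int32_t>::max());
  assert(clampedConvert(-3e9) == std::numeric_limits<int32_t>::min());
}
```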
Differential Revision: https://reviews.llvm.org/D111976
It causes builds to fail with this assert:
llvm/include/llvm/ADT/APInt.h:990:
bool llvm::APInt::operator==(const llvm::APInt &) const:
Assertion `BitWidth == RHS.BitWidth && "Comparison requires equal bit widths"' failed.
See comment on the code review.
> This adds a fold in DAGCombine to create fptosi_sat from sequences for
> smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
> the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
> it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
> ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
> to be handled similarly.
>
> A shouldConvertFpToSat method was added to control when converting may
> be profitable. The original fptosi will have a less strict semantics
> than the fptosisat, with less values that need to produce defined
> behaviour.
>
> This especially helps on ARM/AArch64 where the vcvt instructions
> naturally saturate the result.
>
> Differential Revision: https://reviews.llvm.org/D111976
This reverts commit 52ff3b0093.
This adds a fold in DAGCombine to create fptosi_sat from sequences for
smin(smax(fptosi(x))) nodes, where the min/max saturate the output of
the fp convert to a specific bitwidth (say INT_MIN and INT_MAX). Because
it is dealing with smin(/smax) in DAG they may currently be ISD::SMIN,
ISD::SETCC/ISD::SELECT, ISD::VSELECT or ISD::SELECT_CC nodes which need
to be handled similarly.
A shouldConvertFpToSat method was added to control when converting may
be profitable. The original fptosi will have less strict semantics
than the fptosi_sat, with fewer values that need to produce defined
behaviour.
This especially helps on ARM/AArch64 where the vcvt instructions
naturally saturate the result.
Differential Revision: https://reviews.llvm.org/D111976
On RISC-V, icmp is not sunk (as the following snippet shows) which
generates the following suboptimal branch pattern:
```
core_list_find:
lh a2, 2(a1)
seqz a3, a0 <<
bltz a2, .LBB0_5
bnez a3, .LBB0_9 << should sink the seqz
[...]
j .LBB0_9
.LBB0_5:
bnez a3, .LBB0_9 << should sink the seqz
lh a1, 0(a1)
[...]
```
due to an icmp not being sunk.
The blocks after `codegenprepare` look as follows:
```
define dso_local %struct.list_head_s* @core_list_find(%struct.list_head_s* readonly %list, %struct.list_data_s* nocapture readonly %info) local_unnamed_addr #0 {
entry:
%idx = getelementptr inbounds %struct.list_data_s, %struct.list_data_s* %info, i64 0, i32 1
%0 = load i16, i16* %idx, align 2, !tbaa !4
%cmp = icmp sgt i16 %0, -1
%tobool.not37 = icmp eq %struct.list_head_s* %list, null
br i1 %cmp, label %while.cond.preheader, label %while.cond9.preheader
while.cond9.preheader: ; preds = %entry
br i1 %tobool.not37, label %return, label %land.rhs11.lr.ph
```
where the `%tobool.not37` is the result of the icmp that is not sunk.
Note that it is computed in the basic-block up until what becomes the
`bltz` instruction and the `bnez` is a basic-block of its own.
Compare this to what happens on AArch64 (where the icmp is correctly sunk):
```
define dso_local %struct.list_head_s* @core_list_find(%struct.list_head_s* readonly %list, %struct.list_data_s* nocapture readonly %info) local_unnamed_addr #0 {
entry:
%idx = getelementptr inbounds %struct.list_data_s, %struct.list_data_s* %info, i64 0, i32 1
%0 = load i16, i16* %idx, align 2, !tbaa !6
%cmp = icmp sgt i16 %0, -1
br i1 %cmp, label %while.cond.preheader, label %while.cond9.preheader
while.cond9.preheader: ; preds = %entry
%1 = icmp eq %struct.list_head_s* %list, null
br i1 %1, label %return, label %land.rhs11.lr.ph
```
This is caused by sinkCmpExpression() being skipped if multiple
condition registers are supported.
Given that the check for multiple condition registers affect only
sinkCmpExpression() and shouldNormalizeToSelectSequence(), this change
adjusts the RISC-V target as follows:
* we no longer signal multiple condition registers (thus changing
the behaviour of sinkCmpExpression() back to sinking the icmp)
* we override shouldNormalizeToSelectSequence() to always select
the preferred normalisation strategy for our backend
With both changes, the test results remain unchanged. Note that without
the target-specific override to shouldNormalizeToSelectSequence(), there
is worse code (more branches) generated for select-and.ll and select-or.ll.
The original test case changes as expected:
```
core_list_find:
lh a2, 2(a1)
bltz a2, .LBB0_5
beqz a0, .LBB0_9 <<
[...]
j .LBB0_9
.LBB0_5:
beqz a0, .LBB0_9 <<
lh a1, 0(a1)
[...]
```
Differential Revision: https://reviews.llvm.org/D98932
If we have a large enough floating point type that can exactly
represent the integer value, we can convert the value to FP and
use the exponent to calculate the leading/trailing zeros.
The exponent will contain log2 of the value plus the exponent bias.
We can then remove the bias and convert from log2 to leading/trailing
zeros.
This doesn't work for zero since the exponent of zero is zero so we
can only do this for CTLZ_ZERO_UNDEF/CTTZ_ZERO_UNDEF. If we need
a value for zero we can use a vmseq and a vmerge to handle it.
We need to be careful to make sure the floating point type is legal.
If it isn't we'll continue using the integer expansion. We could split the vector
and concatenate the results but that needs some additional work and evaluation.
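For illustration, a standalone C++ sketch of the trick for cttz of a 32-bit value via f32 (standard IEEE-754 single-precision layout assumed; not the vector lowering itself):
```
// Isolate the lowest set bit (a power of two), convert it to float exactly,
// and read log2 of the value out of the exponent field: exponent minus bias
// gives the trailing-zero count.  Zero must be handled separately, which is
// why the combine only applies to CTTZ_ZERO_UNDEF/CTLZ_ZERO_UNDEF.
#include <cassert>
#include <cstdint>
#include <cstring>

unsigned cttzViaFloat(uint32_t x) {
  assert(x != 0 && "zero is handled separately");
  uint32_t lowBit = x & -x;             // 2^k where k == cttz(x)
  float f = static_cast<float>(lowBit); // exact: a power of two
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return ((bits >> 23) & 0xFF) - 127;   // exponent minus bias == k
}

int main() {
  for (unsigned k = 0; k < 32; ++k)
    assert(cttzViaFloat(1u << k) == k);
  assert(cttzViaFloat(0x00F0u) == 4);
}
```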
Differential Revision: https://reviews.llvm.org/D111904
Previously these would crash. I don't think these can be generated
directly from C. Not sure if any optimizations can introduce them.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D113527
Not all scalar element types are allowed in vectors so we may not
be able to bitcast to a 1 element vector to use insert/extract.
This will become a bigger issue when the Zve extensions are committed.
For now, I'm using the ELEN limit to limit the element types.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113219
This is consistent with what we do for other operands that are required
to be constants.
I don't think this results in any real changes. The pattern match
code for isel treats ConstantSDNode and TargetConstantSDNode the same.
These and MULHS/MULHU both default to Legal. Targets need to set
the ones they don't support to Expand.
I think MULHS/MULHU likely has priority in most places so this
change probably isn't directly testable. I found it while looking
at disabling MULHS/MULHU for nxvXi64 as required for Zve64x.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D113325
Similar to D110206, this patch optimizes unmasked vp.load intrinsics to
avoid the need of a vmset instruction to set the mask. It does so by
selecting a riscv_vle intrinsic rather than a riscv_vle_mask intrinsic.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D113022
Add new hasVInstructions() which is currently equivalent.
Replace vector uses of hasStdExtZfh/F/D with new vector specific
versions. The vector spec no longer requires that the vectors implement the
same types as scalar. It only requires that the scalar type is
the maximum size the vectors can support. This is currently
implemented using the scalar rule we were using before.
Add new hasVInstructionsI64() and begin using it to qualify code that
requires i64 vector elements.
This is all NFC for now, but we can start using this to better
implement D112408 which introduces the Zve extensions.
Reviewed By: frasercrmck, eopXD
Differential Revision: https://reviews.llvm.org/D112496
This can avoid a loss of decoupling with the scalar unit on cores
with decoupled scalar and vector units.
We should support FP too, but those use extract_element and not a
custom ISD node so it is a little different. I also left a FIXME
in the test for i64 extract and store on RV32.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109482
If one input of a fixed vector multiply is a sign/zero extend and
the other operand is a splat of a scalar, we can use a widening
multiply if the scalar value has sufficient sign/zero bits.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110028
This particular case was creating a `VMSET_VL` using the old
fixed-length type in order to pass a mask to other custom nodes
operating on the scalable container type. This kind of mismatch wasn't
being caught; I only noticed it when experimenting with odd-length
vectors, where it was trying to generate an invalid `v3i1` MVT.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D110420
Add the tail policy argument to LLVM IR intrinsics. There are two policies for tail elements. Tail agnostic means users do not care about the values in the tail elements and tail undisturbed means the values in the tail elements need to be kept after the operation. In order to let users control the tail policy, we add an additional argument at the end of the argument list.
For unmasked operations, we have no maskedoff and the tail policy is always tail agnostic. If users want to keep the tail elements under unmasked operations, they could use an all-ones mask in the masked operations to do it. So, we only add the additional argument for masked operations in most cases. There are exceptions, listed below.
In this patch, we do not handle the following cases to reduce the complexity of the patch. There could be two separate patches for them.
* Use dest argument to control tail policy
vmerge.vvm/vmerge.vxm/vmerge.vim (add _t builtins with additional dest argument)
vfmerge.vfm (add _t builtins with additional dest argument)
vmv.v.v (add _t builtins with additional dest argument)
vmv.v.x (add _t builtins with additional dest argument)
vmv.v.i (add _t builtins with additional dest argument)
vfmv.v.f (add _t builtins with additional dest argument)
vadc.vvm/vadc.vxm/vadc.vim (add _t builtins with additional dest argument)
vsbc.vvm/vsbc.vxm (add _t builtins with additional dest argument)
* Always has tail argument for masked/unmasked intrinsics
Vector Single-Width Integer Multiply-Add Instructions (add _t and _mt builtins)
Vector Widening Integer Multiply-Add Instructions (add _t and _mt builtins)
Vector Single-Width Floating-Point Fused Multiply-Add Instructions (add _t and _mt builtins)
Vector Widening Floating-Point Fused Multiply-Add Instructions (add _t and _mt builtins)
Vector Reduction Operations (add _t and _mt builtins)
Vector Slideup Instructions (add _t and _mt builtins)
Vector Slidedown Instructions (add _t and _mt builtins)
Discussion: https://github.com/riscv/rvv-intrinsic-doc/pull/101
Differential Revision: https://reviews.llvm.org/D105092
This patch adds codegen support for lowering the vector-predicated
reduction intrinsics to RVV instructions. The process is similar to that
of the other reduction intrinsics, save for the fact that every VP
reduction has a start value. We reuse the existing custom "VL" nodes,
adding extra patterns where required to handle non-true masks.
To support these nodes, the `RISCVISD::VECREDUCE_*_VL` nodes have been
given an explicit "merge" operand. This is to facilitate the VP
reductions, where we must be careful to ensure that even if no operation
is performed (when VL=0) we still produce the start value. The RVV
reductions don't update the destination register under these conditions,
so we tie the splatted start value to the output register.
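For reference, a plain-C++ model of that VL=0 requirement (masking omitted, purely illustrative):
```
#include <cstdint>

// A VP add reduction must return the start value when the explicit vector
// length is zero, which is what tying the splatted start value to the
// destination guarantees.
int32_t vpReduceAdd(int32_t Start, const int32_t *Vec, unsigned EVL) {
  int32_t Acc = Start;           // the entire result when EVL == 0
  for (unsigned I = 0; I < EVL; ++I)
    Acc += Vec[I];
  return Acc;
}
```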
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D107657
We can use riscv_vse intrinsic instead of riscv_vse_mask. The code here
is based on similar code for handling masked.scatter and vp.scatter.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110206
This requires a minor change to CodeGenPrepare to ensure that
shouldSinkOperands will be called for And.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D110106
Optimize (add (mul x, c0), c1) -> (ADDI (MUL (ADDI, c1/c0), c0), c1%c0),
if c1/c0 and c1%c0 are simm12, while c1 is not.
Optimize (add (mul x, c0), c1) -> (MUL (ADDI, c1/c0), c0),
if c1%c0 is zero, and c1/c0 is simm12 while c1 is not.
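A small sketch of the profitability check behind those folds (hypothetical helper names, not the in-tree DAG combine):
```
#include <cstdint>

static bool isSimm12(int64_t V) { return V >= -2048 && V <= 2047; }

// x*c0 + c1 == (x + c1/c0)*c0 + c1%c0, so the rewrite pays off when c1
// itself does not fit in simm12 but both c1/c0 and c1%c0 do (the second
// form above is simply the c1%c0 == 0 case).
bool mulAddConstFoldProfitable(int64_t C0, int64_t C1) {
  if (C0 == 0 || isSimm12(C1))
    return false;                // c1 is already cheap, or division by zero
  return isSimm12(C1 / C0) && isSimm12(C1 % C0);
}
```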
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108607
For strided accesses the loop vectorizer seems to prefer creating a
vector induction variable with a start value of the form
<i32 0, i32 1, i32 2, ...>. This value will be incremented each
loop iteration by a splat constant equal to the length of the vector.
Within the loop, arithmetic using splat values will be done on this
vector induction variable to produce indices for a vector GEP.
This pass attempts to dig through the arithmetic back to the phi
to create a new scalar induction variable and a stride. We push
all of the arithmetic out of the loop by folding it into the start,
step, and stride values. Then we create a scalar GEP to use as the
base pointer for a strided load or store using the computed stride.
Loop strength reduce will run after this pass and can do some
cleanups to the scalar GEP and induction variable.
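A hypothetical source loop of the shape this pass targets (illustrative, not taken from the tests):
```
// After vectorization, A[I * Stride] becomes a gather whose indices come
// from a <0,1,2,...> vector induction variable scaled by Stride; this pass
// recovers the scalar base pointer and stride so a strided load is used.
void sumStrided(const int *A, int *Out, int N, int Stride) {
  int Sum = 0;
  for (int I = 0; I < N; ++I)
    Sum += A[I * Stride];
  *Out = Sum;
}
```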
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D107790
LICM may have pulled out a splat, but with .vx instructions we
can fold it into an operation.
This patch enables CGP to reverse the LICM transform and move the
splat back into the loop.
I've started with the commutable integer operations and shifts, but we can
extend this with more operations in future patches.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109394
Since i128 isn't a legal C type on RV32, I don't believe
libgcc implements these functions for RV32. compiler-rt
does implement them because i128 support is enabled
in order to handle long double.
This is consistent with 32-bit X86 and ARM.
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D109383
This patch adds support for the vector-predicated `VP_STORE` and
`VP_LOAD` nodes. We do this in the same way we lower `MSTORE` and
`MLOAD`: to regular load/store instructions via intrinsics.
One necessary change was made to `SelectionDAGLegalize` so that
`VP_STORE` nodes' operation actions are taken from the stored "value"
operands, in the same vein as `STORE` or `MSTORE`.
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108999
This patch adds support for the `VP_SCATTER` and `VP_GATHER` nodes by
lowering them to RVV's `vsox`/`vlux` instructions, respectively. This
process is almost identical to the existing `MSCATTER`/`MGATHER` support.
One extra change was made to `SelectionDAGLegalize` so that
`VP_SCATTER`'s operation action is derived from its stored "value"
operand rather than its return type (which is always the chain).
Reviewed By: craig.topper, rogfer01
Differential Revision: https://reviews.llvm.org/D108987
This patch changes the register class to avoid accidentally setting
the AVL operand to X0 through MachineIR optimizations.
There are cases where we really want to use X0, but we can't get that
past the MachineVerifier with the register class as GPRNoX0. So I've
used a 64-bit -1 as a sentinel for X0. All other immediate values should
be uimm5. I convert it to X0 at the earliest possible point in the VSETVLI
insertion pass to avoid touching the rest of the algorithm. In
SelectionDAG lowering I'm using a -1 TargetConstant to hide it from
instruction selection and treat it differently than if the user
used -1. A user -1 should be selected to a register since it doesn't
fit in uimm5.
This is the rest of the changes started in D109110. As mentioned there,
I don't have a failing test from MachineIR optimizations anymore.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D109116
If the true and false values are the same, we don't need a SELECT_CC.
This would normally be folded before a select is legalized to
select_cc. The test case exploits the late legalization of vscale
to trigger a case where they become identical after legalization.
This works around an issue found on a test case in D107957. In that
case the true/false values were both eventually 0 and the select was
used by a vector AVL operand. The select_cc got expanded to control
flow and a phi, but the phi inputs were both copies from X0. MachineIR
optimizations simplified this to a single copy from X0 going into the
vector instruction. This became the input of a vsetvli after vsetvli
insertion. Then register coalescing folded the copy into the vsetvli.
X0 as the source of a vsetvli is a special encoding and should not be
created by coalescing. We need to fix our vsetvli handling to make sure
this can never happen any other way, but removing the unneeded select
is still a worthwhile optimization.
Similar to D108842, D108844, D108926, D108928, and D108936.
__has_builtin(builtin_mul_overflow) returns true for 32b RISCV targets,
but Clang is deferring to compiler-rt when encountering long long types.
If the semantics of __has_builtin mean "the compiler resolves these,
always" then we shouldn't conditionally emit a libcall.
Link: https://bugs.llvm.org/show_bug.cgi?id=28629
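The affected pattern looks roughly like this (a generic example, not the reporter's code):
```
// __has_builtin(__builtin_mul_overflow) reports true on rv32, so callers
// reasonably expect this to compile without a libcall, even for long long.
bool mulOverflows(long long A, long long B, long long *Out) {
  return __builtin_mul_overflow(A, B, Out);
}
```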
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D108939
This adds an ELEN limit for fixed length vectors. This will scalarize
any elements larger than this. It will also disable some fractional
LMULs. For example, if ELEN=32 then mf8 becomes illegal, i32/f32
vectors can't use any fractional LMULs, i16/f16 can only use mf2,
and i8 can use mf2 and mf4.
We may also need something for the scalable vectors, but that has
interactions with the intrinsics and we can't scalarize a scalable
vector.
Longer term this should come from one of the Zve* features.
Similar to what we do for add/sub/mul.
This can help remove some sext.w. There are some regressions on
some bswap tests, but I have an idea how to fix that for a follow up.
A new PACKW pattern is added to handle the new sext_inreg placement.
Differential Revision: https://reviews.llvm.org/D108663
This encapsulates the APInt creation and worklist management into
a helper function.
To keep one common interface I've used Log2_32 in places that
previously created a mask by subtracting 1 from a power of 2.
Differential Revision: https://reviews.llvm.org/D108324
We already do this for a non-constant RHS. This just removes the
special case. I believe the special case may have been needed
because the ANY_EXTEND of a constant used to create zero extended
constants, but we recently changed that to produce sign extended
constants.
D107658 is needed to prevent some regressions.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107697
Similar for sub except sub isn't commutative.
Modify the existing and/or/xor folds to also work on ISD::SELECT
and not just RISCVISD::SELECT_CC. This is needed to make sure
we do this transform before type legalization turns i32 add/sub
into add/sub+sign_extend_inreg on RV64. If we don't do this before
that, the sign_extend_inreg will still be after the select.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D107603
Shuffles which are broken into separate halves reveal splats in which
a half is accessed via one index; such operations can be optimized to
use "vrgather.vi".
This optimization could be achieved by adding extra patterns to match
`vrgather_vv_vl` which uses a splat as an index operand, but this patch
instead identifies the splat earlier. This way, future optimizations can
build on top of the data gathered here, e.g., to splat-gather dominant
indices and insert any leftovers.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D107449
Previously we converted ISD condition codes to integers and stored
them directly in our MIR instructions. The ISD enum kind of belongs
to SelectionDAG so that seems like incorrect layering.
This patch instead uses a CondCode node on RISCV::SELECT_CC until
isel and then converts it from ISD encoding to a RISCV specific value.
This value can be converted to/from the RISCV branch opcodes in the
RISCV namespace.
My larger motivation is to possibly support a microarchitectural
feature of some CPUs where a short forward branch over a single
instruction can be predicated internally. This will require a new
pseudo instruction for select that needs to carry a branch condition
and live probably until RISCVExpandPseudos. At that point it can be
expanded to control flow without other instructions ending up in the
predicated basic block. Using an ISD encoding in RISCVExpandPseudos
doesn't seem like correct layering.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107400
The fcvt fp to integer instructions saturate if their input is
infinity or out of range, but the instructions produce a maximum
integer for nan instead of the 0 required by the ISD opcodes.
This means we can use the instructions to do the saturating
conversion, but we'll need to fix up the nan case at the end.
We can probably improve the i8 and i16 default codegen as well,
but I'll leave that for a follow up.
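The required semantics, modeled in plain C++ for the f32 to i32 case (illustration only, not the ISel code):
```
#include <cmath>
#include <cstdint>
#include <limits>

int32_t fptosiSatI32(float X) {
  if (std::isnan(X))
    return 0;                                        // the extra NaN fixup
  if (X <= (float)std::numeric_limits<int32_t>::min())
    return std::numeric_limits<int32_t>::min();      // fcvt already saturates here
  if (X >= (float)std::numeric_limits<int32_t>::max())
    return std::numeric_limits<int32_t>::max();      // ...and here
  return (int32_t)X;
}
```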
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107230
This patch extends the optimization of VID-sequence BUILD_VECTORs
introduced in D104921 to include simple fractional steps composed of a
separated integer numerator and denominator.
A notable limitation in this sequence detection is that only sequences
with steps N/1 or 1/D are found, meaning that the step between elements
and the frequency with which it changes is consistent across the whole
sequence. Fractional steps such as 2/3 won't be matched as those would
involve more complex tracking of state or some level of backtracking.
As it stands, however, this patch is sufficient to match common
interleave-type shuffle indices, for example matching `<0,0,1,1>` (or
commonly `<0,u,1,u>` or `<u,0,u,1>`) to an index sequence divided by 2.
While the optimization is relatively `undef`-tolerant, due to greedy
pattern-matching there are even some simple patterns which confuse the
sequence detection into identifying either a suboptimal sequence or no
sequence at all.
Currently only fractional-step sequences identified as having a
power-of-two denominator are actually lowered to RVV instructions. This
is to avoid introducing divisions into the generated code.
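For instance (illustrative), the interleave index sequence above is just `vid` with a step of 1/2, which a shift realizes because the denominator is a power of two:
```
#include <cstdio>

int main() {
  // Prints 0 0 1 1 2 2 3 3: the <0,0,1,1,...> indices as vid >> 1.
  for (unsigned Vid = 0; Vid < 8; ++Vid)
    std::printf("%u ", Vid >> 1);
  std::printf("\n");
  return 0;
}
```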
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106533
This patch aims to improve the performance of BUILD_VECTORs which are
identified as containing a dominant element. Given that most
floating-point constants themselves require a load from the constant
pool, it was possible for the optimization to actually increase the
number of individual loads on small vectors. The exception is the zero
constant -- +0.0 -- which can be materialized efficiently.
While this optimization could do with a proper cost model to weigh the
benefits of a single vector load vs. the manipulation of individual
elements -- even for integer vectors which often require several
instructions to materialize -- without a concrete RVV implementation to
work with, any heuristic is likely to be both more obtuse and inaccurate.
Until then, this patch fixes at least one known obvious deficiency.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106963
The sign_extend we insert here can get turned into a zero_extend if
the sign bit is known zero. This can enable a setcc combine that
shrinks compares with zero_extend. This reduces the use count of
the zero_extend allowing other combines to turn it back into an
any_extend.
This restricts the combine to only cases where the result is used
by a CopyToReg. This works for my original motivating case. I
hope the CopyToReg use will prevent any converted extends from
turning back into an any_extend.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106754
This patch adds support for lowering the saturating vector add/sub
intrinsics to RVV instructions, for both fixed-length and
scalable-vector forms alike.
Note that some of the DAG combines are still not triggering for the
scalable-vector tests. These require a bit more work in the DAGCombiner
itself.
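For reference, the element-wise semantics being lowered (a plain-C++ model of signed saturating add on i32, not RVV code):
```
#include <cstdint>
#include <limits>

int32_t saddSat(int32_t A, int32_t B) {
  int64_t Wide = (int64_t)A + (int64_t)B;            // cannot overflow in 64 bits
  if (Wide > std::numeric_limits<int32_t>::max())
    return std::numeric_limits<int32_t>::max();      // clamp instead of wrapping
  if (Wide < std::numeric_limits<int32_t>::min())
    return std::numeric_limits<int32_t>::min();
  return (int32_t)Wide;
}
```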
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D106651
I stumbled onto a case where our (sext_inreg (assertzexti32 (fptoui X)), i32)
isel pattern can cause an fcvt.wu and fcvt.lu to be emitted if
the assertzexti32 has an additional user. If we add a one use check
it would just cause a fcvt.lu followed by a sext.w when we only need
a fcvt.wu to satisfy both users.
To mitigate this I've added custom isel and new ISD opcodes for
fcvt.wu. This allows us to know that it started life as a conversion
to i32 without needing to match multiple nodes. ComputeNumSignBits
has been taught that these new nodes produce 33 sign bits. To
prevent regressions when we need to zero extend the result of an
(i32 (fptoui X)), I've added a DAG combine to convert it to an
(i64 (fptoui X)) before type legalization. In most cases this would
happen in InstCombine, but a zero_extend can be created for function
returns or arguments.
To keep everything consistent I've added new nodes for fptosi as well.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D106346
Lowering certain float vectors without legal vector types could cause a
crash due to a bad interaction between passing floats via GPRs and
argument splitting. Split vector floats appear just like scalar floats.
Under certain situations we choose to pass these float arguments via
GPRs and use an XLenVT location and set the 'BCvt' info to track how
they must be converted back to floating-point values. However, later
logic for handling split arguments may take over, in which case we lose
the previous information and set the 'Indirect' info, thus incorrectly
lowering to integer types.
I don't believe that we would have come across the notion of split
floating-point arguments before. This patch addresses the issue by
updating the lowering so that split arguments are only passed indirectly
when they are scalar integer types.
This changes how we lower some larger illegal float vectors,
as can be seen in 'fastcc-float.ll' where the vector is now passed
partly in registers and partly on the stack.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D102852
This relands a6ca88e908 which was originally
reverted due to overflow bugs in e3fa2b1eab.
This patch teaches the compiler to identify a wider variety of
`BUILD_VECTOR`s which form integer arithmetic sequences, and to lower
them to `vid.v` with modifications for non-unit steps and non-zero
addends.
The sequences handled by this optimization must either be monotonically
increasing or decreasing. Consecutive elements holding the same value
indicate a fractional step which, while simple mathematically,
becomes more complex to handle both in the realm of lossy integer
division and in the presence of `undef`s.
For example, a common "interleaving" shuffle index will be lowered by
LLVM to both `<0,u,1,u,2,...>` and `<u,0,u,1,u,...>` `BUILD_VECTOR`
nodes. Either of these would ideally be lowered to `vid.v` shifted right
by 1. Detection of this sequence in the presence of general `undef` values
is more complicated, however: `<0,u,u,1,>` could match either
`<0,0,0,1,>` or `<0,0,1,1,>` depending on later values in the sequence.
Both are possible, so backtracking or multiple passes is inevitable.
Sticking to monotonic sequences keeps the logic simpler as it can be
done in one pass. Fractional steps will likely be a separate
optimization in a future patch.
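As a concrete example (illustrative values), the sequence `<3, 5, 7, 9>` is recognized as `vid.v` times a step of 2 plus an addend of 3:
```
#include <cassert>

int main() {
  const int Expected[] = {3, 5, 7, 9};
  for (int Vid = 0; Vid < 4; ++Vid)
    assert(Expected[Vid] == 3 + Vid * 2);   // addend + vid * step
  return 0;
}
```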
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104921
The existing rule about the operand type is strange. Instead, just say
the operand is a TargetConstant with the right width. (Legalization
ignores TargetConstants, so it doesn't matter if that width is legal.)
Highlights:
1. I had to substantially rewrite the AArch64 isel patterns to expect a
TargetConstant. Nothing too exotic, but maybe a little hairy. Maybe
worth considering a target-specific node with some dagcombines instead
of this complicated nest of isel patterns.
2. Our behavior on RV32 for vectors of i64 has changed slightly. In
particular, we correctly preserve the width of the arithmetic through
legalization. This changes the DAG a bit. Maybe room for
improvement here.
3. I explicitly defined the behavior around overflow. This is necessary
to make the DAGCombine transforms legal, and I don't think it causes any
practical issues.
Differential Revision: https://reviews.llvm.org/D105673
If we need to shift left anyway we might be able to take advantage
of LUI implicitly shifting its immediate left by 12 to cover part
of the shift. This allows us to use more bits of the LUI immediate
to avoid an ADDI.
isDesirableToCommuteWithShift now considers compressed instruction
opportunities when deciding if commuting should be allowed.
I believe this is the same or similar to one of the optimizations
from D79492.
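A worked numeric example (values chosen only for illustration): 0x30001 alone needs LUI plus ADDI, but when it is about to be shifted left by 12 anyway, a single LUI of 0x30001 produces the shifted result directly.
```
#include <cassert>
#include <cstdint>

int main() {
  uint32_t C = 0x30001;            // on its own: lui 0x30 + addi 1
  uint32_t Lui = 0x30001u << 12;   // what a single `lui a0, 0x30001` materializes
  assert((C << 12) == Lui);        // the addi and the explicit shift both vanish
  return 0;
}
```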
Reviewed By: luismarques, arcbbb
Differential Revision: https://reviews.llvm.org/D105417
I don't think the semantics of the llvm masked gather intrinsic care
about the order the elements are loaded. For example, type legalization
by splitting will chain them in parallel. This is different than
scatter which we do chain in order.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D106025
RISCV would prefer a sign extended constant since that works better
with our constant materialization. We have an existing TLI hook we
use to control sign extension of setcc operands in type legalization.
That hook happens to do the right check we need here, but might be
straying from its original purpose. With only RISCV defining this
hook in tree, I wasn't sure if it was worth adding another hook
with identical behavior.
This is an alternative to D105785 where I tried to handle this in
the RISCV backend by not creating ANY_EXTENDs in some places.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D105918
We assume VLENB is a multiple of 8 and previously relied on shift
pairs being optimized to an AND+SHL/SHR and computeKnownBits
removing the AND. This doesn't happen if (vlenb >> 3) gets CSEd
to have multiple uses. This patch manually emits the best shift
to work around this.
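A sketch of the arithmetic (plain C++, assumed helper name): because VLENB is a multiple of 8, vscale times any power of two can always be emitted as a single shift of VLENB.
```
#include <cassert>
#include <cstdint>

uint64_t vscaleTimesPow2(uint64_t Vlenb, unsigned Log2N) {
  // (Vlenb >> 3) << Log2N folded into one shift; exact because the low
  // three bits of VLENB are known to be zero.
  return Log2N >= 3 ? Vlenb << (Log2N - 3) : Vlenb >> (3 - Log2N);
}

int main() {
  assert(vscaleTimesPow2(/*Vlenb=*/16, /*Log2N=*/4) == 32);  // vscale 2 * 16
  assert(vscaleTimesPow2(/*Vlenb=*/16, /*Log2N=*/1) == 4);   // vscale 2 * 2
  return 0;
}
```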
If the upper 32 bits are zero and bit 31 is set, we might be able to
use zext.w to fill in the zeros after using an lui and/or addi.
Most of this patch is plumbing the subtarget features into the constant
materialization.
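A worked example of the kind of constant this helps (illustrative): 0x80000001 has all-zero upper bits and bit 31 set, so a sign-extending lui/addi pair followed by zext.w reaches it cheaply.
```
#include <cassert>
#include <cstdint>

int main() {
  int64_t SExt = (int64_t)(int32_t)0x80000001u;  // roughly: lui 0x80000; addi 1
  uint64_t Final = (uint64_t)(uint32_t)SExt;     // zext.w clears bits 63:32
  assert(Final == 0x80000001u);
  return 0;
}
```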
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105509
This patch teaches the compiler to identify a wider variety of
`BUILD_VECTOR`s which form integer arithmetic sequences, and to lower
them to `vid.v` with modifications for non-unit steps and non-zero
addends.
The sequences handled by this optimization must either be monotonically
increasing or decreasing. Consecutive elements holding the same value
indicate a fractional step which, while simple mathematically,
becomes more complex to handle both in the realm of lossy integer
division and in the presence of `undef`s.
For example, a common "interleaving" shuffle index will be lowered by
LLVM to both `<0,u,1,u,2,...>` and `<u,0,u,1,u,...>` `BUILD_VECTOR`
nodes. Either of these would ideally be lowered to `vid.v` shifted right
by 1. Detection of this sequence in the presence of general `undef` values
is more complicated, however: `<0,u,u,1,>` could match either
`<0,0,0,1,>` or `<0,0,1,1,>` depending on later values in the sequence.
Both are possible, so backtracking or multiple passes is inevitable.
Sticking to monotonic sequences keeps the logic simpler as it can be
done in one pass. Fractional steps will likely be a separate
optimization in a future patch.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104921
Using positive zero as the neutral element in 'fadd' reductions, while
it generates better code, is incorrect. The correct neutral element is
negative zero: 0.0 + -0.0 = 0.0, whereas -0.0 + -0.0 = -0.0.
There are perhaps more optimal lowerings of negative zero avoiding
constant-pool loads which could be left as future work.
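A quick plain-C++ demonstration of the sign issue:
```
#include <cassert>
#include <cmath>

int main() {
  // Reducing the one-element vector {-0.0}:
  double FromPosZero = 0.0 + -0.0;   // == +0.0, wrong sign
  double FromNegZero = -0.0 + -0.0;  // == -0.0, matches a sequential reduction
  assert(!std::signbit(FromPosZero));
  assert(std::signbit(FromNegZero));
  return 0;
}
```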
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D105902
We don't really have optimizations for division with a constant
LHS. If we don't use a W instruction we end up needing to sign
or zero extend the RHS to use the 64-bit instruction.
I had to sign_extend i32 constants on the LHS instead of using
any_extend which becomes zero_extend. If we don't do this, constants
that were originally negative become harder to materialize. I think
this problem exists for more of our W instruction cases. For example
(i32 (shl -1, X)), but we don't have lit tests. I'll work on that
as a follow up.
I also left a FIXME for enabling W instruction for RHS constants
under -Oz.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105769
Similar to D46745, "S" represents an absolute symbolic operand, which
can be used to specify the access models, e.g.
```
extern int var;
void *addr_via_asm() {
  void *ret;
  asm("lui %0, %%hi(%1)\naddi %0,%0,%%lo(%1)" : "=r"(ret) : "S"(&var));
  return ret;
}
```
'S' is documented in trunk GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101275
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105254
Often when lowering vector shuffles, we split the shuffle into two
LHS/RHS shuffles which are then blended together. To do so we split the
original indices into two, indexed into each respective vector. These
two index vectors are then separately lowered as BUILD_VECTORs.
This patch forwards on any undef indices to the BUILD_VECTOR, rather
than having the VECTOR_SHUFFLE lowering decide on an optimal concrete
index. The motivation for this change is so that we don't duplicate
optimization logic between the two lowering methods and let BUILD_VECTOR
do what it does best.
Propagating undef in this way allows us, for example, to generate
`vid.v` to produce the LHS indices of commonly-used interleave-type
shuffles. I have designs on further optimizing interleave-type and other
common shuffle patterns in the near future.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104789
These are fp->int conversions using either RMM or dynamic rounding modes.
The lround and lrint opcodes have a return type of either i32 or
i64 depending on sizeof(long) in the frontend which should follow
xlen. llround/llrint should always return i64 so we'll need a libcall
for those on rv32.
The frontend will only emit the intrinsics if -fno-math-errno is in
effect otherwise a libcall will be emitted which will not use
these ISD opcodes.
gcc also does this optimization.
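For context, the source-level forms involved look like this (whether they become intrinsics or libcalls depends on -fno-math-errno, as noted above):
```
#include <cmath>

long viaLrint(double X) { return std::lrint(X); }    // uses the dynamic rounding mode
long viaLround(double X) { return std::lround(X); }  // round to nearest, ties away from zero
```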
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D105206
This adds a DAG combine to detect sext/zext inputs and emit a
new ISD opcode. The extends will either be removed or replaced
with narrower extends.
Isel patterns are used to match add and widening mul to vwmacc
similar to the recently added vmacc patterns.
There's still some work to be done to match vmulsu.
We should also rewrite splats that were extended as scalars and
then splatted.
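A hypothetical source loop of the shape this targets (not taken from the tests): sign-extended i16 inputs feed a multiply that accumulates into i32, which can now select to vwmacc.
```
#include <cstdint>

void widenMulAcc(int32_t *Acc, const int16_t *A, const int16_t *B, int N) {
  for (int I = 0; I < N; ++I)
    Acc[I] += (int32_t)A[I] * (int32_t)B[I];   // sext + widening mul + add
}
```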
Reviewed By: arcbbb
Differential Revision: https://reviews.llvm.org/D104802
It seems it is possible for DAG combine to create a shl with an
i64 result type and an i32 shift amount. This is ok before type
legalization since the types don't need to match in SelectionDAG.
This results in type legalization calling LowerOperation to
legalize just the amount. We weren't expecting this so we
asserted for not finding a fixed vector shift.
To fix this, I've added a check for the fixed vector case and
returned SDValue() to get the default type legalizer. I've
factored all shifts together and added a fixed vector specific
handler to avoid repeating similar code for each in
LowerOperation.
The particular case I found was exposed by D104581, but the bad
shift is created after that patch triggers.
If type legalization is going to insert a sign_extend for other users
of X and we can fold the sign_extend into ADDW/MULW/SUBW, it is
better to replace the ANY_EXTEND so we don't end up with a separate
ADD/MUL/SUB instruction for the users of the ANY_EXTEND.
I'm only handling setcc uses right now, but there are other
instructions that force sign_extends like ashr.
There are probably other *W instructions we could use in addition
to ADDW/SUBW/MULW.
My motivating case was a loop terminating compare and a phi use
as seen in the new test file.
Reviewed By: asb
Differential Revision: https://reviews.llvm.org/D104581
This patch optimizes the code generation of vector-type SELECTs (LLVM
select instructions with scalar conditions) by custom-lowering to
VSELECTs (LLVM select instructions with vector conditions) by splatting
the condition to a vector. This avoids the default expansion path which
would either introduce control flow or fully scalarize.
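A hypothetical source shape that produces such a SELECT (illustrative only): a loop-invariant scalar condition choosing between whole vectors, now splatted and lowered as a VSELECT instead of being expanded.
```
void selectByScalarCond(int *Out, const int *A, const int *B, int N, bool Cond) {
  for (int I = 0; I < N; ++I)
    Out[I] = Cond ? A[I] : B[I];   // Cond is uniform, so the vectorized form is
                                   // a vector select with a scalar condition
}
```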
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104772
With the exception of `frem`, this patch supports the current set of VP
floating-point binary intrinsics by lowering them to RVV instructions. It
does so by using the existing `RISCVISD::*_VL` custom nodes as an intermediate
layer. Both scalable and fixed-length vectors are supported by using this
method.
The `frem` node is unsupported due to a lack of available instructions. For
fixed-length vectors we could scalarize but that option is not (currently)
available for scalable-vector types. The support is intentionally left out so
it is equivalent for both vector types.
The matching of vector/scalar forms is currently lacking, as scalable vector
types do not lower to the custom `VFMV_V_F_VL` node. We could either make
floating-point scalable vector splats lower to this node, or support the
matching of multiple kinds of splat via a `ComplexPattern`, much like we do for
integer types.
Reviewed By: rogfer01
Differential Revision: https://reviews.llvm.org/D104237
This patch adds support for loading and storing unaligned vectors via an
equivalently-sized i8 vector type, which has support in the RVV
specification for byte-aligned access.
This offers a more optimal path for handling of unaligned fixed-length
vector accesses, which are currently scalarized. It also prevents
crashing when `LegalizeDAG` sees an unaligned scalable-vector load/store
operation.
Future work could be to investigate loading/storing via the largest
vector element type for the given alignment, in case that would be more
optimal on hardware. For instance, a 4-byte-aligned nxv2i64 vector load
could be loaded as nxv4i32 instead of as nxv16i8.
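Conceptually (a plain-C++ model, not the lowering itself), the access is performed byte-wise, which only requires byte alignment:
```
#include <cstdint>
#include <cstring>

// Loading a 4 x i32 vector from a possibly unaligned address through bytes,
// mirroring the lowering via an equivalently-sized i8 vector.
void loadUnalignedV4i32(uint32_t Dst[4], const void *Src) {
  std::memcpy(Dst, Src, 4 * sizeof(uint32_t));
}
```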
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D104032
This patch changes RVV's policy for its supported list of fixed-length
vector types by capping by vector size rather than element count. Now
all 1024-byte vectors (of supported element types) are supported, rather
than all 256-element vectors.
This is a more natural fit for the architecture, and allows us to, for
example, improve the support for vector bitcasts.
This change necessitated adding some new simple types to avoid
"regressing" on the number of currently-supported vectors. We round out
the 1024-byte types by adding `v512i8`, `v1024i8`, `v512i16` and
`v512f16`.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103884
This patch is a simple fix which registers CONCAT_VECTORS as
custom-lowered for scalable mask vectors. This follows the pattern of
all other scalable-vector types, as the default expansion of
CONCAT_VECTORS cannot handle scalable types, and even if it did it'd go
through the stack and generate worse code.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D103896
Include known bits support so we know we don't need to zext the
output if the input was already zero extended.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D103757
We should be exiting when the shift amount is greater than
the bit width regardless of whether it is a power of 2.
Reported by Simon Pilgrim here https://reviews.llvm.org/D96661
This requires getting an out-of-bounds shift amount that wasn't
already optimized by SelectionDAG. This would be pretty
tricky to construct a test for.
Or it would require a non-power of 2 shift amount and a mask
that has runs of ones and zeros of the next lowest power of 2 from
that shift amount. I tried a little to produce a test for this,
but didn't get it to work.