Commit Graph

38 Commits

Author SHA1 Message Date
Simon Pilgrim 4455c5cdea [CostModel][X86] Update RUN -passes=* to double quotes to appease update scripts on windows 2022-03-18 11:44:18 +00:00
Arthur Eubanks 15ba588d6d [test] Migrate '-analyze -cost-model' to '-passes=print<cost-model>' 2022-02-09 15:42:16 -08:00
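
For reference, a RUN line combining both of these updates would look
roughly like this (the triple is illustrative, not from a specific test):

  ; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -passes="print<cost-model>" 2>&1 -disable-output | FileCheck %s
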
Haohai Wen d2c093e79d [CostModel][X86] Add i64 mul cost for avx512 as 1cy
i64 mul cost is 1cy for all CPUs that support AVX512. Currently
all X86 CPUs use the i64 mul cost from the X64 cost table, which
is not accurate for CPUs that support AVX512 (skx, icx).

Reviewed By: pengfei, RKSimon

Differential Revision: https://reviews.llvm.org/D115016
2021-12-08 11:29:08 +08:00
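
A minimal sketch of a probe for this change, assuming an AVX512 CPU
such as skx (the file and function names are hypothetical):

  ; $ opt -mtriple=x86_64-- -mcpu=skx -passes="print<cost-model>" -disable-output mul.ll
  define i64 @mul_i64(i64 %a, i64 %b) {
    ; Expected to be reported with a cost of 1 on AVX512 CPUs.
    %r = mul i64 %a, %b
    ret i64 %r
  }
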
Roman Lebedev a468c39c90
[X86][Costmodel] `trunc v32i16 to v64i8` can appear after legalization, cost is same as for `trunc v32i16 to v32i8`
Some of the costs get larger here, but I suppose that makes sense,
since we previously queried scalarization costs that may not be
truly representative of reality.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113852
2021-11-15 19:04:02 +03:00
Roman Lebedev a5f2fdca99
[X86][Costmodel] `trunc v16i32 to v32i16` can appear after legalization, cost is same as for `trunc v16i32 to v16i16`
This was noticed in D113609; hopefully it unblocks that patch.
There are likely other similar problems.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D113842
2021-11-14 18:41:37 +03:00
Simon Pilgrim 2ced9a42be [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_NE for generic smulo/umulo cost expansion
Match the predicate used in TargetLowering::expandMULO to detect overflow
2021-10-06 19:11:33 +01:00
Simon Pilgrim 7bd097fd1e [CostModel][TTI] Fix ops used for generic smulo/umulo cost expansion
Fix a copy+paste error that was checking for smul_fix instead of smul_with_overflow to detect signed values.

The LShr is performed on the extended type as we use it to truncate+extract the upper/hi bits of the extended multiply.

More closely matches the default expansion from TargetLowering::expandMULO
2021-10-06 19:11:32 +01:00
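
As a rough IR sketch (hand-written here to mirror the shape of
TargetLowering::expandMULO, not the exact emitted code), the costed
expansion for i32 smul.with.overflow is something like:

  define i1 @smulo_expansion(i32 %a, i32 %b) {
    %lhs = sext i32 %a to i64      ; extend both operands to 2x width
    %rhs = sext i32 %b to i64
    %mul = mul i64 %lhs, %rhs      ; full-width multiply
    %lo  = trunc i64 %mul to i32   ; low half is the result value
    %sh  = lshr i64 %mul, 32       ; LShr on the extended type...
    %hi  = trunc i64 %sh to i32    ; ...to extract the upper/hi bits
    %sgn = ashr i32 %lo, 31        ; sign-replicate the low half
    %ov  = icmp ne i32 %hi, %sgn   ; overflow iff hi != sign bits
    ret i1 %ov
  }
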
Simon Pilgrim 81b5da8c97 [CostModel][TTI] Replace BAD_ICMP_PREDICATE with ICMP_ULT/UGT for generic uadd/usubo cost expansion
Match the predicates used in TargetLowering::expandUADDSUBO
2021-10-06 19:11:32 +01:00
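
Likewise, a hand-written sketch of the expansions being modelled
(mirroring TargetLowering::expandUADDSUBO, illustrative only):

  define i1 @uaddo_expansion(i32 %a, i32 %b) {
    %sum = add i32 %a, %b
    %ov  = icmp ult i32 %sum, %a   ; overflow iff the sum wrapped below %a
    ret i1 %ov
  }
  define i1 @usubo_expansion(i32 %a, i32 %b) {
    %dif = sub i32 %a, %b
    %ov  = icmp ugt i32 %dif, %a   ; overflow iff the difference wrapped above %a
    ret i1 %ov
  }
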
Simon Pilgrim 0776924a17 [CostModel][X86] getCmpSelInstrCost - treat BAD_PREDICATEs the same as the worst case cost predicates for ICMP/FCMP instructions
As suggested on D111024, we should treat getCmpSelInstrCost calls without a specific predicate as matching the worst case predicate cost.

These regressions will be addressed with a mixture of D111024 and fixing other specific getCmpSelInstrCost calls to have realistic predicates.
2021-10-06 10:14:56 +01:00
Craig Topper 765348298c [CostModel] Update default cost model for sadd/ssub overflow to match TargetLowering
The expansion for these was updated in https://reviews.llvm.org/D47927 but the cost model was not adjusted.

I believe the cost model was also incorrect for the old expansion.
The expansion prior to D47927 used 3 icmps on LHS, RHS, and Result
to calculate their signs, then 2 icmps to compare the signs, followed
by an And. The previous cost model used 3 icmps and 2 selects.
Digging back through git blame, those 2 selects in the cost model
used to be 2 icmps, but were changed in https://reviews.llvm.org/D90681.

Differential Revision: https://reviews.llvm.org/D110739
2021-09-30 09:41:14 -07:00
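
For reference, a hand-written sketch of the post-D47927 saddo
expansion this cost now models (ssubo is analogous, with the first
compare flipped to icmp sgt):

  define i1 @saddo_expansion(i32 %a, i32 %b) {
    %sum = add i32 %a, %b
    %c1  = icmp slt i32 %b, 0      ; is the RHS negative?
    %c2  = icmp slt i32 %sum, %a   ; did the sum move downward?
    %ov  = xor i1 %c1, %c2         ; overflow iff the answers disagree
    ret i1 %ov
  }
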
Simon Pilgrim 3538ee763d [CostModel][X86] Improve AVX1/AVX2 v16i32->v16i16/v16i8 truncation costs (PR51972)
Based on worst-case btver2 (AVX1) and haswell (AVX2) llvm-mca reports.
2021-09-26 13:43:46 +01:00
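
Measurements like these can be reproduced along the following lines
(file names are hypothetical; the exact llvm-mca output format may vary):

  $ llc -o trunc.s trunc.ll -mcpu=btver2
  $ llvm-mca -mcpu=btver2 trunc.s | grep "Block RThroughput"
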
Simon Pilgrim c931d35216 [CostModel][X86] Increase i64 mul cost from 1 to 2
Only the most recent CPUs really support 1cy 64-bit multiplies, and the X64 cost table is meant to represent a realistic worst case. The 1cy value was also discouraging vectorization, even though most vXi64 PMULDQ expansions aren't actually slower than scalarization.

Noticed while investigating PR51436.
2021-09-23 14:48:21 +01:00
Simon Pilgrim 96b4117d51 [CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca reports.
Update truncation costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-12 13:50:43 +01:00
Simon Pilgrim 4c7e9a3852 [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca reports.
Update costs based on the worst case costs from the script in D103695.

Move to using legalized types wherever possible, which allows us to prune the cost tables.
2021-07-07 13:58:27 +01:00
Simon Pilgrim 49d3a367c0 [CostModel][X86] Improve AVX1/AVX2 truncation costs
Based on the worst-case numbers generated by D103695, we were overestimating the cost of a number of vector truncations:

AVX2: v2i32->v2i8, v2i64->v2i16 + v4i64->v4i32
AVX1: v2i32->v2i8, v4i64->v4i16 + v16i16->v16i8

Once we have a working set of conversion costs, the intention is to clean up the tables and use legalized types a lot more to reduce the number of entries we currently have.
2021-06-08 10:41:03 +01:00
Simon Pilgrim 90d25808c4 [CostModel][X86] Improve accuracy of sext/zext to 256-bit vector costs on AVX1 targets
Determined from llvm-mca analysis (btver2 vs bdver2 vs sandybridge): the split+extend+concat sequence on AVX1-capable targets is cheaper than the instruction count that the cost was previously based on.
2021-05-27 18:17:50 +01:00
Simon Pilgrim 243e588681 [CostModel][X86] Improve accuracy of vXi64 MUL costs on AVX2/AVX512 targets
By llvm-mca analysis, Haswell/Broadwell has the worst v4i64 recip-throughput cost of the AVX2 targets at 6 (vs the currently used cost of 8). Similarly, SkylakeServer (our only AVX512 target model) implements PMULLQ with an average cost of 1.5 (rounded up to 2.0), and the PMULUDQ sequence (without AVX512DQ) with a cost of 6.
2021-05-24 09:48:32 +01:00
Simon Pilgrim e4ec5cc8eb [CostModel][X86] Align v2i64 MUL costs on SSE42+ targets with worst case
Based on worst-case llvm-mca analysis of sandybridge (which seems to match nehalem for this SSE sequence) vs btver2 + bdver2.
2021-05-23 16:20:57 +01:00
Simon Pilgrim fc01b9bdf8 [CostModel][X86] Align v4i64 MUL costs on AVX1 targets with worst case
Based on worst-case llvm-mca analysis of sandybridge (vs btver2 + bdver2), which gives a cost a lot lower than what we were predicting (previously based, I think, on total uop count).
2021-05-22 20:07:55 +01:00
Simon Pilgrim 9bd0dc83b5 [CostModel][X86] Improve v8i32 MUL costs on AVX1 targets to account for slower btver2
BTVER2 has a 2-cycle throughput for v4i32 multiplies (the same as SSE41 targets), which is only partially hidden by the subvector extracts/inserts when splitting v8i32.
2021-05-22 11:13:07 +01:00
Simon Pilgrim e11195d0a9 [CostModel][X86] Remove unused CHECK prefixes
Allows us to remove the "CHECK: {{^}}" hack and helps simplify D91275.
2020-11-13 17:31:48 +00:00
Sanjay Patel 3c050a597c [CostModel] fix cost calc bug for sadd/ssub with overflow
As noted in D90554, there's an opcode typo in using an easily
misused cost model API: getCmpSelInstrCost(). Beyond that, the
assumed sequence of ops is questionable, but that would be
another patch.

My guess is that the x86 test diffs show that we are probably
wrong both before and after this change, so there will be no
practical difference.
As an example, I tried this test which shows a cost of '7'
either way:

  define <4 x i32> @sadd(<4 x i32> %va, <4 x i32> %vb) {
    %V4I32  = call {<4 x i32>, <4 x i1>}  @llvm.sadd.with.overflow.v4i32(<4 x i32> %va, <4 x i32> %vb)
    %ov = extractvalue {<4 x i32>, <4 x i1>} %V4I32, 1
    %r = extractvalue {<4 x i32>, <4 x i1>} %V4I32, 0
    %z = select <4 x i1> %ov, <4 x i32> <i32 42, i32 42, i32 42, i32 42>, <4 x i32> %r
    ret <4 x i32> %z
  }

  $ llc -o - sadd.ll -mattr=avx
        vpaddd  %xmm1, %xmm0, %xmm2
        vpcmpgtd        %xmm2, %xmm0, %xmm0
        vpxor   %xmm0, %xmm1, %xmm0
        vblendvps       %xmm0, LCPI0_0(%rip), %xmm2, %xmm0

Differential Revision: https://reviews.llvm.org/D90681
2020-11-03 11:03:47 -05:00
Fangrui Song 7979f24954 [test] Fix some unused check prefixes in test/Analysis/CostModel/X86 2020-10-31 23:29:57 -07:00
Sanjay Patel 251dd7c0f9 [x86] add cost overrides for mul with overflow
I'm assuming the standard-size integer instructions for this end up as something like:

  mulq    %rsi
  seto    %al

And the 'mul' generally has reciprocal throughput of 1 on typical implementations
(higher latency, but that's not handled here).
The default costs may end up much higher than that, and that's what we see in the test diffs.

Vector types are left as a 'TODO'.

Differential Revision: https://reviews.llvm.org/D90431
2020-10-30 12:38:16 -04:00
Craig Topper e39c7ab2b9 [CostModel][X86][ARM] Teach default implementation of getCastInstrCost to not add a split/join cost if source type and the destination type both have a SplitVector action
If both the source and the destination need to be split, then the two halves of the split operation are completely independent and never need to be joined back together, so we don't need to assess a cost for the split or join.

Differential Revision: https://reviews.llvm.org/D79111
2020-05-01 18:55:23 -07:00
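
A hypothetical example of the case this covers: on a 128-bit-only
SSE2 target, both types below have a split-vector legalization
action, so the extend decomposes into independent halves:

  define <16 x i32> @zext_both_split(<16 x i16> %x) {
    ; Source and destination both split; the halves are costed
    ; recursively with no extra split/join cost added.
    %e = zext <16 x i16> %x to <16 x i32>
    ret <16 x i32> %e
  }
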
Craig Topper b938168aef [X86] Lower the cost of v4i64->v4i32 truncate with avx512.
We use the vpmovqd instruction, which is a single uop, so the cost should be 1.
2020-05-01 11:09:37 -07:00
Craig Topper cff6686532 [X86] Lower the cost of v4i64->v4i32 and v8i64->v8i32 truncate with AVX
We generate much better code these days than we used to, and we use the same sequence for AVX1 and AVX2 for these.

For v4i64->v4i32 we generate:
vextractf128    xmm1, ymm0, 1
vshufps xmm0, xmm0, xmm1, 136   # xmm0 = xmm0[0,2],xmm1[0,2]

And for v8i64->v8i32 we generate:
vperm2f128      ymm2, ymm0, ymm1, 49 # ymm2 = ymm0[2,3],ymm1[2,3]
vinsertf128     ymm0, ymm0, xmm1, 1
vshufps ymm0, ymm0, ymm2, 136   # ymm0 = ymm0[0,2],ymm2[0,2],ymm0[4,6],ymm2[4,6]

Differential Revision: https://reviews.llvm.org/D79109
2020-04-29 13:21:44 -07:00
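
The kind of IR that exercises the first sequence is simply
(the function name is illustrative):

  define <4 x i32> @trunc_v4i64(<4 x i64> %x) {
    %t = trunc <4 x i64> %x to <4 x i32>
    ret <4 x i32> %t
  }
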
Craig Topper bdbbed115f [X86][CostModel] Update costs for vector truncate with avx512f/avx512bw.
All avx512 truncate instructions except vXi64->vXi32 are 2 uops
on port 5, so raise their costs to 2, except when we have an
earlier, faster sequence like pshufb for 128-bit input vectors.

Add a lower cost of 3 for v16i16->v16i8 with avx512f, where we can
extend to v16i32 and then truncate, and a cost of 2 for avx512bw
with and without avx512vl, where we can use vpmovwb with either a
ymm or zmm input. Both of these beat masking, splitting, and using
packuswb, which is our avx/avx2 codegen.
2020-04-27 12:00:24 -07:00
Craig Topper 8dfb9627b7 [X86] Make v32i16/v64i8 legal types without avx512bw. Use custom splitting instead.
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.

This does change the ABI for vXi16/vXi8 vector types larger than
512 bits: they now pass in multiple zmms instead of multiple ymms.
We'd already hacked some code to make v64i8/v32i16 pass in zmm.

The cost model is still a bit of a mess. In some places I tried to
match existing behavior, but really we need to account for
splitting and concatenating costs. The cost model for shuffles is
especially pessimistic.

Differential Revision: https://reviews.llvm.org/D76212
2020-04-15 12:17:18 -07:00
Craig Topper b2da1ddaef [X86] Add a non-zero cost for truncating v32i16->v32i8 on avx512bw. 2020-03-15 17:18:46 -07:00
Simon Pilgrim eaa41e103c [CostModel][X86] Try to check against common prefixes before using target-specific cpu checks
SLM/GLM are still a mess, so not all of them have been updated yet.
2020-02-24 11:59:07 +00:00
Craig Topper 35625464c6 [X86] Fix the cost model for v16i16->v16i32 zero_extend/sign_extend with AVX2
We seem to be inheriting the cost from sse4.1, but if we have 256-bit registers we should be able to do this with just one extract to split the v16i16 and two v8i16->v8i32 operations, so our cost should be 3, not 4.

Differential Revision: https://reviews.llvm.org/D73646
2020-01-29 15:52:10 -08:00
Simon Pilgrim d7f0207d73 [CostModel][X86] Fix SLM <2 x i64> icmp costs
SLM is 2x slower for <2 x i64> comparison ops than for other vector types; we should account for this like we do for the SLM <2 x i64> add/sub/mul costs.

This should remove some of the SLM codegen diffs in D43582

llvm-svn: 372954
2019-09-26 10:14:38 +00:00
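
A minimal probe for the SLM penalty, assuming -mcpu=slm and using
the -passes spelling the tests later migrated to (file name hypothetical):

  ; $ opt -mtriple=x86_64-- -mcpu=slm -passes="print<cost-model>" -disable-output cmp.ll
  define <2 x i1> @cmp_v2i64(<2 x i64> %a, <2 x i64> %b) {
    %c = icmp sgt <2 x i64> %a, %b
    ret <2 x i1> %c
  }
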
Simon Pilgrim 665ccbff60 [Cost][X86] Add v2i64 truncation costs
We are missing costs for a lot of truncation cases; I'm hoping to address all the 'zero cost' cases in trunc.ll.

I thought this was a vector widening side effect, but even before this we had some interesting LV decisions (notably over indvars) being made due to these zero costs.

llvm-svn: 372498
2019-09-22 12:04:38 +00:00
Simon Pilgrim 42bf2dd629 [TTI] Add generic cost model for smul/umul overflow intrinsics
Based on the smul/umul fixed costs and the implementation in TargetLowering::expandMULO.

llvm-svn: 354784
2019-02-25 13:30:23 +00:00
Simon Pilgrim 47ca8606ba [TTI] Add generic SADDO/SSUBO costs
Added x86 scalar sadd_with_overflow/ssub_with_overflow costs.

llvm-svn: 352045
2019-01-24 13:36:45 +00:00
Simon Pilgrim 2d1964b90f [TTI] Add generic UADDO/USUBO costs
Added x86 scalar uadd_with_overflow/usub_with_overflow costs.

Differential Revision: https://reviews.llvm.org/D56907

llvm-svn: 352043
2019-01-24 12:10:20 +00:00
Simon Pilgrim f87226eb70 [IR] Match intrinsic parameter by scalar/vectorwidth
This patch replaces the existing LLVMVectorSameWidth matcher with LLVMScalarOrSameVectorWidth.

The matching args must be either scalars or vectors with the same number of elements, but in either case the scalar/element type can differ, as specified by LLVMScalarOrSameVectorWidth.

I've updated the _overflow intrinsics to demonstrate this, allowing them to return an i1 or <N x i1> overflow result that matches the scalar/vector width of the other (add/sub/mul) result type.

The masked load/store/gather/scatter intrinsics have also been updated to use this, although since we specify the reference type to be llvm_anyvector_ty, the mask is guaranteed to be <N x i1>, so there is no change in behaviour.

Differential Revision: https://reviews.llvm.org/D57090

llvm-svn: 351957
2019-01-23 16:00:22 +00:00
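
After this change, an overflow intrinsic's i1 result scales with the
vector width of its other result, as in these declarations:

  declare { i32, i1 } @llvm.sadd.with.overflow.i32(i32, i32)
  declare { <4 x i32>, <4 x i1> } @llvm.sadd.with.overflow.v4i32(<4 x i32>, <4 x i32>)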