1. Implement the barrier and work_group_barrier functions with intrinsics.
2. Implement the work_group_all and work_group_any functions; both pass the corresponding OpenCL-CTS tests.
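For reference, the semantics that work_group_all and work_group_any are expected to provide can be modelled on the host in plain C. This is only a sketch of the OpenCL-defined behaviour, not the intrinsic-based device implementation from this change; the names model_work_group_all and model_work_group_any and the array-of-predicates interface are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model (hypothetical helper, not the device code):
 * returns 1 only if every work-item's predicate is non-zero. */
int model_work_group_all(const int *predicates, size_t n)
{
    assert(predicates != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (predicates[i] == 0)
            return 0;
    }
    return 1;
}

/* Host-side model (hypothetical helper, not the device code):
 * returns 1 if at least one work-item's predicate is non-zero. */
int model_work_group_any(const int *predicates, size_t n)
{
    assert(predicates != NULL || n == 0);
    for (size_t i = 0; i < n; ++i) {
        if (predicates[i] != 0)
            return 1;
    }
    return 0;
}
```

Each array element stands for the predicate value supplied by one work-item of the work-group.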
1. Add VMV_V_X in emitEpilogue.
2. In LowerCall, change all positive offsets added to TP into negative offsets.
3. Fix the LowerCall function to generate the correct store instructions for passing function parameters.
4. Fix the hasReservedCallFrame function to return false.
5. Align the caller and callee conventions for parameters passed on the stack.
6. Change how the TP stack offset is calculated.
7. Unify the offset calculation for the TP stack and the SP stack.
8. Note that the sp offset calculation in workitem.S must be modified manually. Because the stack grows in the opposite direction from traditional RISC-V, it is now stipulated that, for both the SP stack and the TP stack, data is stored at the location the stack pointer points to, without an offset.
9. The eliminateFrameIndex function contains an SPAdj check, but we do not need this value at all, so add a getSPAdjust function that returns zero.
10. V33 holds a wrong value when parameters are pushed to the TP stack, so an MV instruction is needed to refresh V33 after ADJCALLSTACKDOWN.
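One way to read the stack convention in items 2 and 8 is sketched below in plain C: a push stores at the current, un-offset pointer and then moves the pointer upward, so previously pushed values are found at negative offsets. This is an assumption-laden model, not the backend code; model_stack, model_push, and model_load are hypothetical names, and the word-indexed layout is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STACK_WORDS 64

/* Hypothetical model of an upward-growing stack. */
typedef struct {
    uint32_t mem[MODEL_STACK_WORDS];
    uint32_t sp; /* index of the next free word */
} model_stack;

/* Store at the un-offset pointer, then grow upward. */
void model_push(model_stack *s, uint32_t value)
{
    assert(s->sp < MODEL_STACK_WORDS);
    s->mem[s->sp] = value;
    s->sp += 1;
}

/* Read a value that sits words_below words under the pointer,
 * i.e. at a negative offset from it. */
uint32_t model_load(const model_stack *s, uint32_t words_below)
{
    assert(words_below >= 1 && words_below <= s->sp);
    return s->mem[s->sp - words_below];
}
```

Under this reading, a callee that receives the pointer after the caller's pushes reaches its stack parameters through negative offsets, which matches the sign change described in item 2.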
In the current libclc library, builtin functions are neither overloaded nor
implemented for vec3 parameters, so we need to add the related
declarations.
For the CTS test cases:
* prefetch
* async_copy_global_to_local
* async_copy_local_to_global
In the previous sp initialization, sp pointed to the same base address for all warps.
When warps finish at different times, the sp of a warp that finishes later is
changed by a warp that finished earlier, so we need to initialize sp separately for each warp.
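A minimal sketch of per-warp sp initialization, assuming each warp gets a fixed-size slice of one stack region. The actual layout used by the startup code is not stated here; warp_stack_base, shared_base, and bytes_per_warp are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: give each warp its own stack base by
 * offsetting a common base address by the warp index. */
uint32_t warp_stack_base(uint32_t shared_base,
                         uint32_t warp_id,
                         uint32_t bytes_per_warp)
{
    assert(bytes_per_warp > 0);
    return shared_base + warp_id * bytes_per_warp;
}
```

With distinct bases, a warp that finishes early no longer modifies the stack region that a later warp is still using.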
Stack space is shared between warps. If two warps are executing
different functions, their accesses to the return address conflict,
so the faster warp cannot find its return address.
We therefore add a barrier instruction after the lw and before the ret
to ensure that the warps have the same scope of the sp pointer.
In our previous design, the libclc library was built as a static library, which made the generated
ELF file large. We have now changed the compiler and linker options so that the generated ELF file is much smaller; details are in this pull request: https://github.com/THU-DSP-LAB/pocl/pull/11