The domestic DCU (Deep Computing Unit) processor is a GPU-like accelerator. To address the DCU's lack of effective compilation and optimization tools, a DCU code generation and optimization method based on the polyhedral model is proposed. First, a source-to-source compiler, C2HIPC, is implemented on the polyhedral model. C2HIPC takes a serial C program as input and outputs a HIP (Heterogeneous-compute Interface for Portability) program that can run on the DCU. Second, to further improve the heterogeneous performance of the program, a kernel partitioning method based on compute-unit occupancy is proposed, which partitions kernels by limiting loop fusion during the polyhedral schedule computation. In addition, a global memory access optimization method based on data reuse is proposed: by modeling the two kinds of data reuse present in DCU kernels together with their global memory access costs, the global memory access cost under different loop orders is calculated. The effectiveness of C2HIPC is verified on the PolyBench test suite, where the HIP code automatically generated and optimized by C2HIPC reaches 92.4% of the performance of PolyBench-GPU.
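As a rough illustration of why loop order affects global memory access cost, the C sketch below counts how many memory segments one group of adjacent threads touches under two thread-to-loop mappings of a row-major array. It is not the cost model from the paper. The group width of 64 threads, the 16 floats per segment, and the function name `segments_touched` are all assumptions chosen for illustration.

```c
#include <stddef.h>

/* Illustrative assumptions, not values taken from the paper. */
#define GROUP_THREADS 64  /* threads issuing one memory instruction together */
#define SEGMENT_ELEMS 16  /* float elements per memory segment */

/*
 * Count the distinct memory segments touched when GROUP_THREADS adjacent
 * threads each read one float from a row-major array with n_cols columns.
 *
 *   thread_on_column != 0: thread t reads A[0][t]  (stride 1)
 *   thread_on_column == 0: thread t reads A[t][0]  (stride n_cols)
 *
 * Element indices grow with t, so counting segment changes gives the
 * number of distinct segments.
 */
size_t segments_touched(size_t n_cols, int thread_on_column)
{
    size_t count = 0;
    size_t last = (size_t)-1; /* sentinel: no segment seen yet */

    for (size_t t = 0; t < GROUP_THREADS; ++t) {
        size_t index = thread_on_column ? t : t * n_cols;
        size_t segment = index / SEGMENT_ELEMS;
        if (segment != last) {
            ++count;
            last = segment;
        }
    }
    return count;
}
```

Under these assumptions, calling `segments_touched(1024, 1)` models threads mapped to the contiguous dimension, while `segments_touched(1024, 0)` models threads mapped to the strided dimension. The second mapping touches many more segments, which is the kind of difference a loop-order cost model has to capture.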
OpenMP is a mainstream heterogeneous programming model, so research on its offload performance has important practical significance. The Songshan supercomputer system installed at the Zhengzhou Supercomputing Center is a new-generation E-class high-performance computer cluster developed independently in China, and the DCU chip installed in it is also domestically produced. To improve the offload performance of OpenMP on this platform and make full use of hardware resources such as registers, a redundant-loop optimization for thread iteration on the DCU is proposed, so that a thread can release its register resources promptly after completing its computation task, relieving back-end register allocation pressure and improving program performance. In addition, building on the loop unrolling optimization algorithm in LLVM and taking into account the hardware and instruction set characteristics of the domestic platform, a better algorithm for calculating the loop unrolling factor is proposed to improve the effect of loop unrolling. In tests with SPEC ACCEL and PolyBench, the thread iteration optimization reduced the overall register count by 33.7% on average, and the loop unrolling optimization improved performance by 37% on average.
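To make the loop unrolling trade-off concrete, here is a minimal C sketch of a reduction written once as a plain loop and once unrolled by a factor of 4 with a remainder loop. It shows the general technique only. It is not the factor-selection algorithm from the paper, and the factor 4 and the function names are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Baseline: one element per iteration. */
long sum_rolled(const int *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Manually unrolled by 4 (an assumed factor), with a remainder loop. */
long sum_unrolled4(const int *a, size_t n)
{
    /* Four independent accumulators: fewer loop-control instructions,
       but more live values that need registers. */
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    /* Main loop: runs while at least 4 elements remain. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }

    /* Remainder loop: handles the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += a[i];

    return s0 + s1 + s2 + s3;
}
```

Both functions return the same sum for any input. The unrolled version trades extra live values for less loop overhead, which is why the choice of unrolling factor interacts with register pressure.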