The OpenMP-based automatic parallelization in the current GCC compiler adopts the fork-join model, in which frequent creation and joining of thread teams incur significant management overhead. This paper studies how to reduce thread team creation and joining in order to improve the efficiency of automatically parallelized OpenMP programs. A general optimization method for merging parallel regions is proposed to address the cost of the fork-join model. Through variable attribute modification, handling of intervening serial statements, and optimization of synchronization operations, adjacent parallel regions are merged into a single larger parallel region, which reduces parameter passing into parallel regions and lowers the overhead caused by their creation and destruction. The method is implemented in the GCC 10.3.0 compiler and experimentally validated on the NPB 3.4-OMP benchmark suite, achieving an average overall performance improvement of 20%. The experimental results demonstrate the effectiveness and generality of the proposed method. It can effectively improve the runtime efficiency of OpenMP programs, serve as a reference for optimizing OpenMP implementations, and provide support for dynamic thread merging techniques in AI compilation.
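The core transformation can be illustrated at the source level (the paper performs it inside GCC's intermediate representation, so the C code below is only a conceptual sketch): two adjacent parallel loops become two work-sharing loops inside a single parallel region, so the thread team is forked and joined once instead of twice.

#include <stdio.h>
#include <omp.h>

#define N 1000000
static double a[N], b[N], c[N];

/* Unmerged form: each loop forks and joins its own thread team. */
static void two_parallel_regions(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i];

    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = b[i] + a[i];
}

/* Merged form: one enclosing parallel region, two work-sharing loops.
   The implicit barrier after the first loop preserves the b -> c
   dependence, but the thread team is created and destroyed only once. */
static void merged_parallel_region(void)
{
    #pragma omp parallel
    {
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i];

        #pragma omp for
        for (int i = 0; i < N; i++)
            c[i] = b[i] + a[i];
    }
}

int main(void)
{
    for (int i = 0; i < N; i++) a[i] = i * 0.001;
    two_parallel_regions();
    merged_parallel_region();
    printf("c[10] = %f\n", c[10]);
    return 0;
}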
The numerical solution of large sparse linear systems is fundamental to many large-scale scientific and engineering computations. The multifrontal method, a prevalent direct solver, currently lacks an implementation on domestic digital signal processor platforms. With the development of the domestic FT-series high-performance digital signal processors, there is an urgent need for efficient numerical solution of large sparse linear systems in practical engineering applications. To address this, we implement and optimize the multifrontal method for the FT-M6678 platform. Based on an analysis of the FT-M6678 hardware architecture and the characteristics of the multifrontal algorithm, we employ compiler optimization, loop unrolling, and single-instruction multiple-data (SIMD) vectorization, fully exploiting the platform's independent functional units and register resources to achieve instruction-level and data-level parallelism. Considering the storage hierarchy of the FT-M6678, we configure the first- and second-level caches, set cacheability attributes for external memory, and design the memory layout according to memory bandwidth, allocating different data and code segments to distinct storage areas to optimize data storage and access. Test matrices are taken from the University of Florida Sparse Matrix Collection. After optimization, the algorithm achieves speedups of 16.0 to 38.9 on the FT-M6678 platform, and up to 2.3x higher performance than on the TMS320C6678 platform.
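As a rough illustration of the instruction-level-parallelism part of this work (the FT-M6678 SIMD intrinsics themselves are not reproduced here), the following plain-C sketch shows 4-way manual unrolling of a column update of the kind that dominates frontal matrix factorization; the unrolled body exposes independent operations to the DSP's parallel functional units.

#include <stdio.h>

/* Generic 4-way manual unrolling of a column update (y += alpha * x).
   The paper targets FT-M6678 SIMD intrinsics; plain C is shown here
   because the platform intrinsics are not reproduced from the paper. */
void axpy_unrolled(int n, double alpha, const double *x, double *y)
{
    int i = 0;
    /* Unrolled body: four independent updates per iteration expose
       instruction-level parallelism to the functional units. */
    for (; i + 4 <= n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
    /* Remainder loop for n not divisible by 4. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    double x[10], y[10];
    for (int i = 0; i < 10; i++) { x[i] = i; y[i] = 1.0; }
    axpy_unrolled(10, 2.0, x, y);
    printf("y[9] = %f\n", y[9]);   /* expect 19.0 */
    return 0;
}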
This paper addresses the low parallel efficiency caused by fixed thread allocation in automatic parallelization. The authors employ a genetic algorithm to determine the optimal number of threads for individual parallelizable loops, and then use iterative compilation to assign a suitable thread count to each parallelizable loop structure, thereby improving the efficiency of automatically parallelized code. The proposed method achieves an average performance improvement of 26% across ten benchmarks from the SPEC CPU2006 suite and an overall improvement of 3.7% on the NPB3.4.2 suite, indicating the viability and efficacy of the approach. The approach can serve as a reference for improving the effectiveness of automatic parallelization and for advancing automatic parallelization technology.
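Conceptually, the outcome of the search is that each parallelized loop is emitted with its own thread count rather than one global setting; the sketch below shows what such per-loop thread assignment looks like in OpenMP source form (the thread counts 4 and 8 are placeholders, not values from the paper).

#include <stdio.h>
#include <omp.h>

#define N 1000
static double a[N], b[N][N], x[N], y[N];

/* Sketch of the tuned result: each parallel loop carries its own
   num_threads clause, chosen offline by the genetic/iterative search. */
void tuned_loops(void)
{
    /* Memory-bound loop: the search may settle on fewer threads. */
    #pragma omp parallel for num_threads(4)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * x[i];

    /* Compute-bound loop nest: more threads pay off. */
    #pragma omp parallel for num_threads(8)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            y[i] += b[i][j] * x[j];
}

int main(void)
{
    for (int i = 0; i < N; i++) x[i] = 1.0;
    tuned_loops();
    printf("a[1] = %f, y[0] = %f\n", a[1], y[0]);
    return 0;
}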
KEYWORDS: Video surveillance, Information security, Network security, Video, Computer security, Data transmission, Telecommunications, Data processing, Data conversion, Data communications
In recent years, with the continuous development and maturation of video surveillance technology and the growing complexity of the network environment, identity security issues in video surveillance have become increasingly prominent and increasingly discussed. Ensuring that front-end devices access the video surveillance system with a secure, verified identity has become an urgent problem. Following the Class A requirements of GB35114, "Technical requirements for information security of public security video surveillance networking", this paper designs a secure and feasible scheme based on an embedded MCU chip to ensure that front-end devices access the video surveillance system legitimately. The scheme uses the MCU chip for data signature processing, schedules data through DMA, and authenticates the front-end device's credentials against a SIP authentication server with security features. The method has low complexity and can adapt to a variety of signature-verification scenarios. It promotes the further development of video surveillance security technology and applications, improves the security of video content, and thereby protects the rights and interests of users.
Depression is a common psychological disorder. To detect depressive tendencies in text as early and accurately as possible, a detection method based on multi-feature fusion of BERT word embeddings is proposed. An emotion classification model is used to produce vector representations of depressive text, extracting vectors that fuse multi-level semantic and emotional features. TextCNN and BiLSTM-Attention modules are built to extract local and contextual features of the text. Finally, the multiple features are fused to determine whether the text exhibits a depressive tendency. Experimental results show that the proposed multi-feature fusion method improves various evaluation metrics by approximately 5% compared with traditional deep learning methods.
The Schrödinger equation is widely used in theoretical research in atomic and molecular physics, particle and nuclear physics, solid-state physics, and photonics, where it plays an important role as a first principle. However, solving it numerically requires considerable storage and computation. Modern graphics processing units (GPUs) provide an opportunity to solve the equation efficiently. Starting from a finite-difference algorithm for the three-dimensional time-dependent Schrödinger equation and using a Tesla V100 GPU as the computing platform, this paper applies, through theoretical analysis and numerical simulation, optimizations including data-structure layout and organization, data merging and reduction, synchronization elimination, and kernel fusion, so as to fully exploit the hardware characteristics of the GPU. Experimental results show that, compared with the original GPU-based implementation, the optimizations achieve a speedup of 1.476x for the same number of loop iterations.
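For orientation, the sketch below shows a 1D CPU reference of an explicit finite-difference update for the time-dependent Schrödinger equation, with the real and imaginary parts stored in separate arrays, i.e. the structure-of-arrays layout that favors coalesced GPU loads; the paper's solver is three-dimensional and runs on the Tesla V100, so this is only an illustrative analogue.

#include <stdio.h>
#include <math.h>

#define NX 512

/* Structure-of-arrays layout: real and imaginary parts of the wave
   function in separate contiguous arrays. */
static double psi_re[NX], psi_im[NX], V[NX];

/* One explicit finite-difference step of the time-dependent Schrödinger
   equation in atomic units, using the split real/imaginary update. */
void step(double dt, double dx)
{
    double c = 0.5 / (dx * dx);
    for (int i = 1; i < NX - 1; i++) {
        double lap_im = psi_im[i - 1] - 2.0 * psi_im[i] + psi_im[i + 1];
        psi_re[i] += dt * (-c * lap_im + V[i] * psi_im[i]);
    }
    for (int i = 1; i < NX - 1; i++) {
        double lap_re = psi_re[i - 1] - 2.0 * psi_re[i] + psi_re[i + 1];
        psi_im[i] += dt * (c * lap_re - V[i] * psi_re[i]);
    }
}

int main(void)
{
    double dx = 0.1, dt = 0.001;
    for (int i = 0; i < NX; i++) {          /* Gaussian wave packet */
        double x = (i - NX / 2) * dx;
        psi_re[i] = exp(-x * x);
        psi_im[i] = 0.0;
        V[i] = 0.5 * x * x;                 /* harmonic potential */
    }
    for (int n = 0; n < 1000; n++)
        step(dt, dx);
    printf("psi_re at center: %f\n", psi_re[NX / 2]);
    return 0;
}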
Modular inversion over prime fields plays an important role in elliptic curve cryptosystems, and its speed directly determines the execution efficiency of the whole cryptosystem. The binary extended Euclidean algorithm used in existing general-purpose modular inversion is improved to obtain an optimized new algorithm, which raises the shift efficiency of the original algorithm and reduces hardware resource costs. The 256-bit adder used in the new algorithm is split and reassembled, with carry-lookahead logic between groups, and the addition and subtraction units are pipelined to improve data throughput. The algorithm is implemented in the Verilog hardware description language and successfully verified on a Virtex-7 FPGA development board. The verification results show that the design correctly handles 256-bit modular inversion, with a single computation taking only 1.12 μs.
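The baseline algorithm that the hardware design refines is binary extended Euclidean modular inversion; a software reference model in C is sketched below (64-bit operands and a small prime are used purely for illustration, whereas the FPGA design works on 256-bit operands).

#include <stdio.h>
#include <stdint.h>

/* Reference C model of the binary extended Euclidean algorithm,
   computing a^(-1) mod p for an odd prime p, with x1 and x2 kept
   in the range [0, p). */
uint64_t mod_inverse(uint64_t a, uint64_t p)
{
    uint64_t u = a, v = p;
    uint64_t x1 = 1, x2 = 0;

    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {              /* u even: halve u and x1 mod p */
            u >>= 1;
            x1 = (x1 & 1) ? (x1 + p) >> 1 : x1 >> 1;
        }
        while ((v & 1) == 0) {              /* v even: halve v and x2 mod p */
            v >>= 1;
            x2 = (x2 & 1) ? (x2 + p) >> 1 : x2 >> 1;
        }
        if (u >= v) {                       /* subtract the smaller pair */
            u -= v;
            x1 = (x1 >= x2) ? x1 - x2 : x1 + p - x2;
        } else {
            v -= u;
            x2 = (x2 >= x1) ? x2 - x1 : x2 + p - x1;
        }
    }
    return (u == 1) ? x1 : x2;
}

int main(void)
{
    uint64_t p = 1000000007ULL;             /* small prime for the demo */
    uint64_t a = 123456789ULL;
    uint64_t inv = mod_inverse(a, p);
    printf("inv = %llu, check a*inv mod p = %llu\n",
           (unsigned long long)inv,
           (unsigned long long)((a * inv) % p));   /* expect 1 */
    return 0;
}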
KEYWORDS: Design and modelling, Embedded systems, Field programmable gate arrays, Power consumption, Clocks, Logic devices, Speech recognition, Parallel computing, Internet of things, Information operations
To address the growing demand for vector operations in embedded systems, this paper designs a 128-bit vector operation unit based on the open-source RISC-V instruction set. To verify its correctness, the unit is verified experimentally in the ModelSim simulation environment, and all modules meet the correctness requirements. In addition, the vector unit is integrated with the open-source Hummingbird E203 processor and tested on an FPGA development board running at 16 MHz. The results show a 1.2x performance improvement for the demo application compared with the scalar processor.
The domestic processor DCU (Deep Computing Unit) is a GPU-like accelerator. To address the lack of effective compilation and optimization tools for the DCU, a DCU code generation and optimization method based on the polyhedral model is proposed. First, a source-to-source compiler, C2HIPC, is implemented on top of the polyhedral model; its input is a serial C program and its output is a HIP (Heterogeneous-compute Interface for Portability) program executable on the DCU. Second, to further improve heterogeneous performance, a kernel partitioning method based on compute-unit occupancy is proposed, which partitions kernels by restricting loop fusion during polyhedral schedule computation. In addition, a global memory access optimization method based on data reuse is proposed: by modeling the two kinds of data reuse present in DCU kernels together with global memory access costs, the global memory access cost under different loop orders is computed. The effectiveness of C2HIPC is verified on the PolyBench test suite; the automatically generated HIP code, after C2HIPC optimization, reaches 92.4% of the performance of PolyBench-GPU.
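For context, the kind of input C2HIPC accepts is an ordinary serial affine loop nest such as the one below (an illustrative example, not taken from the paper); the compiler builds a polyhedral model of its iteration space, applies the scheduling and partitioning described above, and emits HIP code for the DCU.

#include <stdio.h>

#define N 256
static float A[N][N], B[N][N], C[N][N];

/* A serial affine loop nest of the kind a polyhedral source-to-source
   compiler can analyze, tile, and map to accelerator thread blocks. */
void matmul(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            float acc = 0.0f;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0f;
            B[i][j] = (i == j) ? 1.0f : 0.0f;   /* identity matrix */
        }
    matmul();
    printf("C[0][0] = %f\n", C[0][0]);          /* expect 1.0 */
    return 0;
}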
As a mainstream heterogeneous programming model, OpenMP makes research on its offload performance of significant practical importance. The Songshan supercomputer installed at the Zhengzhou Supercomputing Center is a new-generation exascale (E-class) high-performance computer cluster independently developed in China, and the DCU chips it uses are also domestically produced. To improve OpenMP offload performance on this platform and make full use of hardware resources such as registers, a redundant-loop optimization for thread iteration on the DCU is proposed, so that a thread releases its register resources promptly after completing its computation, relieving back-end register allocation pressure and improving program performance. In addition, building on LLVM's loop unrolling algorithm and taking into account the hardware and instruction-set characteristics of the domestic platform, an improved algorithm for computing the loop unrolling factor is proposed to strengthen the effect of loop unrolling. On SPEC ACCEL and PolyBench, the thread iteration optimization reduces overall register usage by an average of 33.7%, and the loop unrolling optimization yields an average performance improvement of 37%.
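The flavor of an unroll-factor computation bounded by register pressure can be sketched as follows; this is a simplified illustration under assumed inputs, not the algorithm from the paper, which also folds in DCU instruction-set details inside LLVM.

#include <stdio.h>

/* Illustrative heuristic: pick the largest unroll factor that fits the
   register budget, is no larger than the caller's cap and the trip
   count, and (preferably) divides the trip count so no remainder loop
   is needed. All inputs are assumed values, not taken from the paper. */
int choose_unroll_factor(int trip_count, int regs_per_iteration,
                         int regs_available, int max_factor)
{
    int per_iter = regs_per_iteration > 0 ? regs_per_iteration : 1;
    int factor = regs_available / per_iter;
    if (factor > max_factor)
        factor = max_factor;
    if (factor > trip_count)
        factor = trip_count;
    while (factor > 1 && trip_count % factor != 0)
        factor--;
    return factor < 1 ? 1 : factor;
}

int main(void)
{
    printf("factor = %d\n", choose_unroll_factor(128, 6, 64, 8));  /* expect 8 */
    return 0;
}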
Large-scale network models based on the Transformer architecture are highly versatile across many fields. Because such models are computation-intensive and very large, large-scale training on domestic heterogeneous accelerators is constrained by computing and communication efficiency, resulting in poor training performance. To address this, the hot functions and performance bottlenecks of the training process are studied and analyzed, and corresponding performance optimizations are proposed based on the hardware characteristics of domestic heterogeneous accelerators. To solve the problem of poor performance in low-precision training, a low-precision packing optimization is applied to the underlying matrix multiplication operator. To address the significant kernel launch latency caused by fine-grained operators, the LightSeq framework is ported to the domestic heterogeneous platform for the first time, and its core fine-grained operators are specially optimized for the hardware according to the characteristics of the network structure to accelerate training. For large-scale training, to overcome the low bandwidth of cross-node communication, distributed communication is optimized at both the data-transmission and hardware-topology levels, improving communication efficiency by reducing communication frequency and increasing communication bandwidth. Experimental results on the WMT'14 English-German translation dataset show that, without loss of training accuracy, performance on a single node is doubled after optimization. The computation is then scaled up to 128 nodes (512 accelerator cards) for large-scale distributed training and verification; while maintaining the performance improvement, scalability exceeds 90% on 256 accelerator cards.
Thread parallelism and single-thread performance are two important factors affecting kernel performance, and both are closely related to register allocation. Adjusting thread parallelism to optimize GPU register allocation can effectively improve the performance of heterogeneous programs. We obtain the required number of vector registers by counting virtual registers during kernel compilation, and combine this with the number of wavefronts used to launch the kernel for an overall performance analysis, proposing RAW, a compilation method for the collaborative optimization of register allocation and thread management on AMDGPU, implemented in the LLVM compiler. Experiments show that the method yields a speedup of about 1.12x on the Rodinia test suite and about 1.4x on the QUDA application.
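The trade-off RAW navigates can be seen in a simplified occupancy model: the more vector registers each thread uses, the fewer wavefronts can be resident on a SIMD unit. The register-file size of 256, allocation granularity of 4, and cap of 10 waves below are typical GCN-class values assumed for illustration; the paper reads the actual counts out of the LLVM backend.

#include <stdio.h>

/* Simplified model relating per-thread vector register (VGPR) usage to
   the number of wavefronts resident on one SIMD. Constants are assumed
   typical values, not figures from the paper. */
int waves_per_simd(int vgprs_per_thread)
{
    const int vgpr_file   = 256;  /* VGPRs per SIMD lane (assumed) */
    const int granularity = 4;    /* allocation granularity (assumed) */
    const int max_waves   = 10;   /* hardware wave limit (assumed) */

    int granted = ((vgprs_per_thread + granularity - 1) / granularity) * granularity;
    if (granted < granularity)
        granted = granularity;
    int waves = vgpr_file / granted;
    return waves > max_waves ? max_waves : waves;
}

int main(void)
{
    /* Fewer registers per thread -> more concurrent wavefronts, but
       possibly more spills; RAW searches for the balance between the two. */
    for (int v = 24; v <= 128; v += 24)
        printf("%3d VGPRs/thread -> %d waves/SIMD\n", v, waves_per_simd(v));
    return 0;
}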
When the random forest algorithm is used to classify remote sensing images for each target year in the study area, the number of decision trees and the maximum number of features used to build each tree strongly influence the accuracy of the classification results. This paper therefore proposes an adaptive parameter-tuning strategy based on GridSearchCV to improve the random forest algorithm; the method selects the best parameters according to the sample data and the conditions of the study area. Comparison with the unoptimized random forest, decision tree, and support vector machine algorithms shows that the optimized random forest achieves good classification accuracy, with both overall accuracy and the Kappa coefficient above 0.90.
To exploit the advantages of the DCU accelerator and address the limited bandwidth, load imbalance, and non-coalesced memory accesses of SpMV (sparse matrix-vector multiplication), an SCSR (Static Compressed Sparse Row) algorithm based on the CSR (Compressed Sparse Row) storage format is proposed for the DCU accelerator. The algorithm statically assigns the same number of rows to each thread block according to the average number of non-zero elements per row, avoiding unnecessary computation; the demand for storage resources is reduced by reusing the on-chip high-speed Local Data Share (LDS), which improves compute unit (CU) occupancy. Experiments on 15 sparse matrices from different application domains show that, compared with the SpMV routine in the hipSPARSE library, the SCSR algorithm achieves an average speedup of 4.83x.
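The baseline computation is standard CSR SpMV; the sketch below shows it in C with a static, even distribution of rows across threads, which mirrors on the CPU (for illustration only) the SCSR idea of assigning a fixed number of rows to each DCU thread block up front. The LDS-reuse part of the algorithm is hardware-specific and not shown.

#include <stdio.h>

/* CSR sparse matrix-vector multiply with a static row partition:
   each thread receives a fixed, contiguous block of rows. */
void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
              const double *vals, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [[4,0,1],[0,3,0],[2,0,5]] in CSR form. */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col_idx[] = {0, 2, 1, 0, 2};
    double vals[]    = {4, 1, 3, 2, 5};
    double x[]       = {1, 1, 1};
    double y[3];

    spmv_csr(3, row_ptr, col_idx, vals, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);   /* expect 5 3 7 */
    return 0;
}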