Domain texts typically exhibit strong domain-specific features and are long. Text classification models designed for general-purpose use do not perform well on classification tasks in specific domains. This paper therefore proposes a domain text classification method based on BERT. First, the long domain text is segmented into a sequence of segments, which are input to the pre-trained BERT model to obtain word vectors. The vectors are then compressed and encoded to obtain pooled sequence feature vectors. Finally, these feature vectors are fed sequentially into an Encoder layer for domain feature extraction, and the output layer assigns the text to a category. The experimental results show that the proposed BERT_VCA model improves the F1 score by 1.12% on average over the BERT_BASE model on the domain text classification task.
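The segmentation and pooling steps described above can be illustrated with a minimal sketch. It is not the BERT_VCA implementation: the window length, the overlap between windows, and the use of mean pooling are assumptions made for illustration, and the names `segment_tokens` and `mean_pool` are hypothetical. The BERT encoder itself is omitted; the sketch only shows how a long token list might be split into model-sized segments and how per-token vectors might be compressed into one vector per segment.

```python
from typing import List


def segment_tokens(tokens: List[str], max_len: int = 510, overlap: int = 128) -> List[List[str]]:
    """Split a long token list into windows of at most max_len tokens.

    Consecutive windows share `overlap` tokens (an assumption; the abstract
    does not say whether segments overlap).
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("require max_len > 0 and 0 <= overlap < max_len")
    if not tokens:
        return []
    step = max_len - overlap
    segments = []
    start = 0
    while True:
        segments.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window reached the end of the text
        start += step
    return segments


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Compress per-token vectors into one vector by averaging each dimension.

    Mean pooling is an assumed stand-in for the paper's compression step.
    """
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


if __name__ == "__main__":
    toy_tokens = [f"t{i}" for i in range(10)]
    for seg in segment_tokens(toy_tokens, max_len=4, overlap=1):
        print(seg)
    print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

In a full pipeline, each segment would be encoded by the pre-trained model, pooled to a single vector, and the resulting sequence of segment vectors passed to the downstream encoder layer.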
The user interest model is the core of personalized information services, and its quality directly affects their effectiveness. However, traditional interest models often consider only the user's interest in the text and ignore the influence of contextual information. Therefore, to avoid interaction between context features and interest features, this paper proposes a multi-level vector space representation method for the interest model. The user's interest is then weighted along several dimensions using the user's contextual information. Finally, a personalized retrieval experiment is carried out with the constructed interest model. The experimental results show that the retrieval precision based on the model reaches 83.72%, which demonstrates the validity and reliability of the model.
In addition to the body text, web pages contain a large amount of noise, such as advertisements and navigation bars. Accurately extracting the text content of web pages is a key technology for improving the quality of web page analysis. A web page is a highly heterogeneous kind of text, and different types of web pages have different structures, which makes text extraction more difficult. After extensive analysis, we found a potential correlation between the body text and both the tag path and the text block density, so we propose a web page text extraction method based on the tag path feature and the text block density feature. Taking into account the respective advantages and disadvantages of the two features, we design a fusion strategy to address the low accuracy of web page text extraction. The method does not require training and improves the efficiency of web page text extraction. The experimental results on the dataset constructed in this paper show that the classification accuracy of the method reaches 81.11% and the recall rate reaches 83.15%. The average accuracy over all datasets is 17.7% higher than that of the BDF algorithm and 6.21% higher than that of the CEPF algorithm. The experiments show that the method has strong generalization ability.
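The idea of fusing a tag path feature with a text block density feature can be sketched as follows. This is an illustrative approximation, not the method evaluated in the abstract: the density definition (text characters per tag), the per-path average, the linear fusion weight `alpha`, and the threshold are all assumptions, and the names `Block`, `block_density`, and `extract_body` are hypothetical. Blocks are assumed to be already extracted from the page, so no HTML parsing is shown.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    tag_path: str   # e.g. "html/body/div/p"
    text: str       # visible text of the block
    tag_count: int  # number of tags inside the block


def block_density(block: Block) -> float:
    """Text characters per tag; an assumed definition of text block density."""
    return len(block.text.strip()) / max(block.tag_count, 1)


def extract_body(blocks: List[Block], alpha: float = 0.5, threshold: float = 0.5) -> str:
    """Keep blocks whose fused score reaches the threshold.

    fused = alpha * normalized block density
          + (1 - alpha) * normalized average density of the block's tag path
    """
    if not blocks:
        return ""
    densities = [block_density(b) for b in blocks]
    max_density = max(densities) or 1.0  # avoid division by zero

    # Tag path feature: average density of all blocks sharing the same path.
    by_path: Dict[str, List[float]] = defaultdict(list)
    for b, d in zip(blocks, densities):
        by_path[b.tag_path].append(d)
    path_avg = {p: sum(ds) / len(ds) for p, ds in by_path.items()}
    max_path = max(path_avg.values()) or 1.0

    kept = []
    for b, d in zip(blocks, densities):
        fused = alpha * d / max_density + (1 - alpha) * path_avg[b.tag_path] / max_path
        if fused >= threshold:
            kept.append(b.text.strip())
    return "\n".join(kept)


if __name__ == "__main__":
    page = [
        Block("html/body/ul/li", "Home", 2),
        Block("html/body/ul/li", "About", 2),
        Block("html/body/div/p", "x" * 200, 1),
        Block("html/body/div/p", "y" * 150, 1),
        Block("html/body/div/span", "Buy now", 3),
    ]
    print(extract_body(page))
```

Because the score is computed directly from page structure, a sketch like this needs no training data, which is consistent with the training-free property stated above.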