In fashion recommendation systems, as the number of clothing items grows, the number of possible outfit combinations grows exponentially, making problems such as slow training and excessive memory usage increasingly prominent. Hashing techniques can effectively reduce memory usage and improve recommendation speed. As more and more product images are accompanied by text descriptions, multi-modal modeling has become a research hotspot in recommendation systems. Most prior work processes each modality separately and then combines the results by weighted averaging, which fails to fully exploit the correlation between the two modalities. We argue that the visual and textual features of the same product are semantically consistent and share the same aesthetic characteristics, and we therefore explore new modeling approaches to mine the higher-order connections between the two modalities. Another issue is that visual information is far more important than textual information in fashion recommendation, so it is unreasonable to treat the two modalities equally. We model the hashed binary representations, optimize the network structure so that visual features are included in the ranking loss, and use textual features to assist the image features during modeling. Experiments on two large Polyvore datasets show that our model significantly outperforms state-of-the-art models on key evaluation metrics for personalized recommendation and compatibility modeling.
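To illustrate why hashing reduces memory usage and speeds up matching, the following is a minimal sketch of sign-based binarization and Hamming-distance comparison. This is our own illustration of the general technique, not the paper's architecture; the function names and dimensions are hypothetical.

```python
import numpy as np

def binarize(embedding: np.ndarray) -> np.ndarray:
    """Hash a real-valued embedding to a binary code via the sign function.

    A 64-dim float32 embedding (256 bytes) packs into a 64-bit code
    (8 bytes), a 32x memory reduction; comparison reduces to bit ops.
    """
    return (embedding >= 0).astype(np.uint8)

def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

# Hypothetical 8-dim embeddings for two clothing items
x = np.array([0.3, -1.2, 0.8, -0.1, 0.5, 0.9, -0.4, 0.2], dtype=np.float32)
y = np.array([0.1, -0.7, -0.6, 0.4, 0.3, 1.1, -0.2, 0.6], dtype=np.float32)

# Codes differ only where the embedding signs differ (indices 2 and 3)
print(hamming_distance(binarize(x), binarize(y)))  # -> 2
```

In a real system the binary codes would be packed into machine words so that Hamming distance is a single XOR plus popcount, which is what makes large-scale compatibility search fast.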