Poster 114: Optimizing Recommendation System Inference Performance Based on GPU
Time: Thursday, 21 November 2019, 8:30am - 5pm
Description: Neural network-based recommendation models are widely applied to personalization and recommendation tasks at large Internet companies such as e-commerce and social media companies. Alibaba's recommendation system deploys wide and deep learning (WDL) models for product recommendation. A WDL model consists of two main parts: an embedding lookup and a neural network-based feature ranking model that ranks products for each user. As the number of products and users the model must rank grows, the feature length and batch size of the models increase, and so does the computation, to the point where a traditional CPU inference implementation cannot meet the QPS (queries per second) and latency requirements of recommendation tasks. In this poster, we develop a GPU-based system to speed up recommendation system inference. Through model quantization and graph transformation, we achieve a 3.9x speedup over a baseline GPU implementation.
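To make the two-part WDL structure concrete, here is a minimal NumPy sketch of the inference path described above: a sparse-ID embedding lookup followed by a small neural ranking network. All sizes, weights, and function names are illustrative assumptions, not the actual Alibaba model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
num_ids, emb_dim = 1000, 8     # embedding table: 1000 sparse IDs, 8-dim vectors
num_dense = 4                  # dense features per example
batch = 32                     # inference batch size

# Part 1: embedding table for the sparse features (the "embedding lookup").
emb_table = rng.standard_normal((num_ids, emb_dim)).astype(np.float32)

# Part 2: a tiny two-layer MLP standing in for the neural ranking model.
W1 = rng.standard_normal((emb_dim + num_dense, 16)).astype(np.float32)
b1 = np.zeros(16, dtype=np.float32)
W2 = rng.standard_normal((16, 1)).astype(np.float32)
b2 = np.zeros(1, dtype=np.float32)

def wdl_score(sparse_ids, dense_feats):
    """Score a batch: embedding lookup, concat with dense features, MLP."""
    emb = emb_table[sparse_ids]                     # (batch, emb_dim) gather
    x = np.concatenate([emb, dense_feats], axis=1)  # combine sparse + dense
    h = np.maximum(x @ W1 + b1, 0.0)                # ReLU hidden layer
    return h @ W2 + b2                              # one ranking score each

scores = wdl_score(rng.integers(0, num_ids, size=batch),
                   rng.standard_normal((batch, num_dense)).astype(np.float32))
print(scores.shape)  # (32, 1)
```

On a GPU, the gather in the lookup and the matrix multiplies in the MLP are the operations that quantization and graph transformation target; this sketch only shows the dataflow, not those optimizations.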