Abstract
Discovering hot topics in social networks such as Twitter and Weibo has received much attention in recent years. While topic models such as Latent Dirichlet Allocation (LDA) have been applied successfully to topic discovery, the topics they produce are often less coherent when the models are applied to microblog content, known as “posts”. In this paper, we propose a time-series based aggregation scheme for topic modeling in Weibo. Because Weibo topics are coherent within a time slice, we divide the Weibo dataset into groups by time slice. Under this scheme, the posts in each group are aggregated into several longer pseudo-documents using paragraph-vector based similarity algorithms. Applying this scheme to the LDA model dramatically decreases topic model perplexity and increases clustering quality, which in turn allows for better discovery of the underlying topics in Weibo. Furthermore, the scheme allows other topic models that extend LDA to be used directly on such short texts.
• Divide Weibo dataset into groups by fixed time slice.
• Aggregate tweets into several longer pseudo-documents.
• Propose a time-series based aggregation scheme for topic modeling.
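
The two steps of the scheme (grouping posts by fixed time slice, then aggregating the posts in each group into pseudo-documents) can be illustrated with a minimal sketch. This is not the implementation used in the paper. It assumes posts arrive as (timestamp, text) pairs, substitutes a simple bag-of-words cosine similarity for the paragraph-vector similarity, and uses a greedy merge rule. The function names, the `slice_seconds` width, and the `threshold` value are all hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def bow_vector(text):
    """Stand-in embedding: bag-of-words counts (an assumption, NOT a paragraph vector)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_by_time_slice(posts, slice_seconds):
    """Bucket (timestamp, text) pairs into fixed-width time slices."""
    groups = defaultdict(list)
    for timestamp, text in posts:
        groups[timestamp // slice_seconds].append(text)
    return groups


def aggregate(texts, threshold):
    """Greedily merge each post into the most similar pseudo-document so far.

    The greedy rule and the threshold are illustrative assumptions.
    """
    pseudo_docs = []  # each entry: [merged_vector, member_texts]
    for text in texts:
        vec = bow_vector(text)
        best = max(pseudo_docs, key=lambda d: cosine(vec, d[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[0].update(vec)   # add the post's counts to the merged vector
            best[1].append(text)
        else:
            pseudo_docs.append([vec, [text]])
    return [" ".join(members) for _, members in pseudo_docs]


def build_pseudo_documents(posts, slice_seconds=3600, threshold=0.3):
    """Return {slice_index: [pseudo_document, ...]} for a list of posts."""
    groups = group_by_time_slice(posts, slice_seconds)
    return {key: aggregate(texts, threshold) for key, texts in sorted(groups.items())}
```

The resulting pseudo-documents, rather than the individual posts, would then be passed to LDA or to an LDA-based model as its input documents.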