Tuesday, March 31, 2009

[Reading] Probabilistic Latent Semantic Indexing

pLSA is a novel approach to automated document indexing and information retrieval. It models each word in a document as a sample from a mixture model. Each word is generated from a single topic, but different words in the same document may be generated from different topics. Each document is therefore represented by its mixing proportions over the mixture components (the topics).
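The mixture described above can be written compactly. The notation here is an assumption of this note rather than something quoted from the paper: d is a document, w a word, z a latent topic, P(z | d) the document's mixing proportions, and P(w | z) the word distribution of a topic.

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad \sum_{z} P(z \mid d) = 1,
\qquad \sum_{w} P(w \mid z) = 1
```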

pLSA is based on the likelihood principle and uses a statistical model called the aspect model to define a proper generative model of the data. Because it directly minimizes word perplexity, it has a sounder statistical foundation than LSA, and it also outperforms LSA in the reported experiments. pLSA uses the EM algorithm to identify the latent classes, and it is capable of dealing with both polysemy and synonymy.
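To make the EM step concrete, here is a minimal sketch in pure Python. It is only an illustration under stated assumptions, not the paper's implementation: it runs plain EM with a random start and no refinements, and the names `plsa_em`, `perplexity`, `counts`, `p_z_d` and `p_w_z` are hypothetical. Maximizing the likelihood of the counts is the same as minimizing perplexity on the training data, which is what the second function reports.

```python
import math
import random


def normalize(row):
    """Scale a non-negative row to sum to 1 (uniform if the row is all zeros)."""
    total = sum(row)
    if total == 0:
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa_em(counts, n_topics, n_iter=50, seed=0):
    """Fit P(z|d) and P(w|z) to counts[d][w] with plain EM (illustrative only)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    # Random starting point; every row is a probability distribution.
    p_z_d = [normalize([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() + 0.1 for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                for z in range(n_topics):
                    expected = n * joint[z] / total
                    new_z_d[d][z] += expected
                    new_w_z[z][w] += expected
        # M-step: renormalize the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


def perplexity(counts, p_z_d, p_w_z):
    """Training perplexity: exp of the negative mean log-likelihood per token."""
    log_lik, n_tokens = 0.0, 0
    for d, row in enumerate(counts):
        for w, n in enumerate(row):
            if n:
                p_w_d = sum(pz * p_w_z[z][w] for z, pz in enumerate(p_z_d[d]))
                log_lik += n * math.log(p_w_d)
                n_tokens += n
    return math.exp(-log_lik / n_tokens)


# Toy corpus (made up): documents 0-1 use words 0-1, documents 2-3 use words 2-3.
toy_counts = [[4, 3, 0, 0], [3, 5, 0, 0], [0, 0, 4, 4], [0, 0, 2, 5]]
doc_topics, topic_words = plsa_em(toy_counts, n_topics=2)
print(perplexity(toy_counts, doc_topics, topic_words))
```

Each row of `doc_topics` is the list of mixing proportions for one document, which is the representation a retrieval system would index. Because EM only finds a local optimum, a different `seed` can give a different solution.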
