Distributed representations of sentences and documents

3 minute read


Today’s post describes Paragraph Vector, an unsupervised learning algorithm that can be quite useful for text classification, information retrieval and sentiment analysis tasks.

Paragraph Vector, also known as doc2vec, belongs to a wider family of embedding techniques that transform variable-length text into fixed-length, dense feature vectors. This is done in such a way that semantically similar pieces of text have similar vector representations. Paragraph Vector can produce state-of-the-art results regardless of the input text length, which can vary from phrases to large documents.


Paragraph Vector is an extension of word2vec, a model that finds vector representations of words and works as shown in the figure above. In both models, the word vectors are used to predict the next word in the sentence. The main difference is that Paragraph Vector uses an additional token in the training and prediction steps. This token, called the Paragraph ID, supplements the content of a paragraph: it represents what is missing from the current context and can be thought of as the topic of the accompanying text.
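To make the mechanics concrete, here is a minimal PV-DM sketch in NumPy. It uses a toy two-document corpus and a plain softmax instead of the hierarchical softmax used in the paper; all names and hyperparameters are illustrative, not the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: one paragraph vector will be learned per document.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "log"]]
vocab = sorted({w for doc in corpus for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}

dim, window, lr, epochs = 16, 2, 0.05, 200
W = rng.normal(0, 0.1, (len(vocab), dim))   # word vectors, shared across the corpus
D = rng.normal(0, 0.1, (len(corpus), dim))  # one paragraph vector per document
U = rng.normal(0, 0.1, (dim, len(vocab)))   # softmax output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

losses = []
for _ in range(epochs):
    total = 0.0
    for d, doc in enumerate(corpus):
        for t in range(window, len(doc)):
            ctx = [w2i[w] for w in doc[t - window:t]]   # preceding context words
            target = w2i[doc[t]]                        # next word to predict
            # Input: average of the paragraph vector and the context word vectors.
            h = (D[d] + W[ctx].sum(axis=0)) / (1 + len(ctx))
            p = softmax(h @ U)
            total += -np.log(p[target])
            # Cross-entropy gradient, propagated back to all inputs.
            err = p.copy()
            err[target] -= 1.0
            grad_h = U @ err
            U -= lr * np.outer(h, err)
            D[d] -= lr * grad_h / (1 + len(ctx))
            np.subtract.at(W, ctx, lr * grad_h / (1 + len(ctx)))
    losses.append(total)
```

After training, each row of `D` is the paragraph vector of the corresponding document and can be fed to a downstream classifier; for an unseen text, the paper freezes the word vectors and softmax weights and runs the same gradient steps on a fresh paragraph vector only.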


When doc2vec is trained as shown above, it is called the Distributed Memory Model of Paragraph Vectors (PV-DM) and provides two outputs:

  • Word vectors, which are shared across the corpus
  • Paragraph vectors, which are unique to each paragraph and characterise only the contexts sampled from it.

Using the words of the figure above as an example, the Paragraph ID vector is shared across the contexts containing the, cat and sat, while the vector of cat is the same across all paragraphs.

Another way to train Paragraph Vector is to ignore the context words (the, cat, sat) entirely and use only the paragraph IDs as input. This model is called Paragraph Vector - Distributed Bag of Words (PV-DBOW) and is very similar to word2vec's skip-gram model. This version is much faster to train and requires less memory, because no word vectors need to be stored.
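A corresponding PV-DBOW sketch drops the word vectors entirely: each paragraph vector alone is trained to predict the words of its document. Again this uses a plain softmax on a toy corpus with illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "dog", "sat", "on", "the", "log"]]
vocab = sorted({w for doc in corpus for w in doc})
w2i = {w: i for i, w in enumerate(vocab)}

dim, lr, epochs = 16, 0.1, 200
D = rng.normal(0, 0.1, (len(corpus), dim))  # only paragraph vectors are stored
U = rng.normal(0, 0.1, (dim, len(vocab)))   # softmax output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

losses = []
for _ in range(epochs):
    total = 0.0
    for d, doc in enumerate(corpus):
        for w in doc:  # predict every word of the document from the paragraph vector alone
            target = w2i[w]
            p = softmax(D[d] @ U)
            total += -np.log(p[target])
            err = p.copy()
            err[target] -= 1.0
            grad_d = U @ err              # gradient w.r.t. the paragraph vector
            U -= lr * np.outer(D[d], err)
            D[d] -= lr * grad_d
    losses.append(total)
```

Compared to the PV-DM sketch, there is no `W` matrix at all, which is where the memory saving comes from.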


In most problems, PV-DM alone works well; however, its combination with PV-DBOW gives more consistent results and is therefore recommended by the authors.
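In practice the combination simply means training the two models separately and concatenating their paragraph vectors into one feature vector per document. A sketch with stand-in matrices (the real ones would come out of training):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the paragraph-vector matrices learned by PV-DM and PV-DBOW
# (the paper uses 400 dimensions each; 16 here for illustration).
D_dm = rng.normal(size=(2, 16))
D_dbow = rng.normal(size=(2, 16))

# One combined feature vector per paragraph, fed to the downstream classifier.
features = np.concatenate([D_dm, D_dbow], axis=1)
```

Each paragraph thus ends up with a single vector twice the original dimensionality.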

Experimental results

The authors tested the model on three tasks:

  1. Sentiment analysis using the Stanford sentiment treebank dataset
  2. Sentiment analysis using the IMDB dataset
  3. Information retrieval

The length of the input text differs significantly between the first two datasets: the former consists of single sentences, while each document in the latter contains multiple sentences. In both cases, Paragraph Vector outperformed the other methods, achieving relative improvements of 16% and 15% respectively. In the last scenario, it scored a relative improvement of 32%.

Advantages of Paragraph Vector

Paragraph Vector offers a couple of advantages when compared to Bag of Words models:

  • The produced vectors capture word semantics. For example, the vector of dog lies closer to that of cat than to that of London.
  • PV-DM takes word order into account, up to the selected window size, much as an n-gram model does. Moreover, it outperforms n-gram models because the dense vectors it produces have a much lower dimensionality than sparse n-gram representations.
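The "closer to" claim is usually measured with cosine similarity. A toy check with made-up 4-dimensional vectors, purely to illustrate the measure (real embeddings come out of a trained model):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings, chosen by hand for illustration only.
dog = np.array([0.9, 0.8, 0.1, 0.0])
cat = np.array([0.8, 0.9, 0.2, 0.1])
london = np.array([0.1, 0.0, 0.9, 0.8])

assert cosine(dog, cat) > cosine(dog, london)
```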

To see how Paragraph Vector was applied to real-world, messy web data, check out one of my papers!


Le, Q. and Mikolov, T., 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1188-1196).