plsa.algorithms.result module

class plsa.algorithms.result.PlsaResult(topic_given_doc: numpy.ndarray, word_given_topic: numpy.ndarray, topic_given_word: numpy.ndarray, topic: numpy.ndarray, kl_divergences: List[float], corpus: plsa.corpus.Corpus, tf_idf: bool)

Bases: object

Container for the results generated by a (conditional) PLSA run.

Parameters:
  • topic_given_doc (ndarray) – The conditional probability p(t|d) as \(n_{topics}\times n_{docs}\) array.
  • word_given_topic (ndarray) – The conditional probability p(w|t) as \(n_{words}\times n_{topics}\) array.
  • topic_given_word (ndarray) – The conditional probability p(t|w) as \(n_{topics}\times n_{words}\) array.
  • topic (ndarray) – The marginal probability p(t).
  • kl_divergences (list of float) – The Kullback-Leibler divergences between the original document-word probability p(d, w) and its approximation, one value per iteration.
  • corpus (Corpus) – The original corpus the PLSA model was trained on.
  • tf_idf (bool) – Whether to weight the document-word matrix with the inverse document frequencies.

convergence

The convergence of the Kullback-Leibler divergence.

kl_divergence

KL-divergence of approximate and true document-word probability.
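
As a reminder of what this quantity measures, the following is a minimal sketch of the Kullback-Leibler divergence between two discrete distributions in plain Python. The function name and the example values are illustrative assumptions. How the library flattens or normalizes p(d, w) internally is not documented here, so this is not a reproduction of its implementation.

```python
import math


def kl_divergence(p, q):
    """Return D(p || q) for two discrete distributions given as flat sequences.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical flattened joint probabilities p(d, w) and an approximation of them.
true_joint = [0.4, 0.1, 0.1, 0.4]
approx_joint = [0.25, 0.25, 0.25, 0.25]

print(kl_divergence(true_joint, approx_joint))  # strictly positive
print(kl_divergence(true_joint, true_joint))    # 0.0 for identical distributions
```

A value closer to zero indicates that the approximation reproduces the document-word probability more faithfully.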

n_topics

The number of latent topics identified.

predict(doc: str) → Tuple[numpy.ndarray, int, Tuple[str, ...]]

Predict the relative importance of latent topics in a new document.

Parameters: doc (str) – A new document given as a single string.
Returns:
  • ndarray – A 1-D array with the relative importance of latent topics.
  • int – The number of words in the new document that were not present in the corpus the PLSA model was trained on.
  • tuple of str – Those words in the new document that were not present in the corpus the PLSA model was trained on.
Raises: ValueError – If the document to predict on is an empty string, if there are no words left after preprocessing the document, or if there are no known words in the document.
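
The three return values and the documented ValueError can be handled as in the sketch below. The helper describe_prediction and the class StubResult are hypothetical names introduced only so the snippet runs without a trained model. In practice the object passed in would be a PlsaResult, and the stub's return values are made up.

```python
from typing import Sequence, Tuple


def describe_prediction(result, doc: str) -> str:
    """Call result.predict(doc) and summarize the three documented return values."""
    try:
        weights, n_unknown, unknown_words = result.predict(doc)
    except ValueError as err:
        # Raised for empty documents or documents without any known words.
        return f"cannot predict: {err}"
    # Index of the topic with the largest relative importance.
    best = max(range(len(weights)), key=lambda i: weights[i])
    return f"dominant topic {best}, {n_unknown} unknown word(s): {', '.join(unknown_words)}"


class StubResult:
    """Hypothetical stand-in for a trained PlsaResult (not part of the library)."""

    def predict(self, doc: str) -> Tuple[Sequence[float], int, Tuple[str, ...]]:
        if not doc:
            raise ValueError("empty document")
        return [0.2, 0.7, 0.1], 1, ("zeppelin",)


print(describe_prediction(StubResult(), "a zeppelin over the city"))
print(describe_prediction(StubResult(), ""))
```

Checking the count of unknown words before trusting the topic weights is a reasonable habit, since a document made up mostly of unseen vocabulary gives the model little to work with.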

tf_idf

Whether the inverse document frequency was used to weight the document-word counts.

topic

The relative importance of latent topics.

topic_given_doc

The relative importance of latent topics in each document.

Dimensions are \(n_{docs} \times n_{topics}\), that is, the transpose of the array passed to the constructor.
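
With one row per document and one column per topic, the dominant topic of each document is the column index of the row maximum. The sketch below uses a nested list with made-up values in place of the real array. On an actual ndarray, numpy.argmax with axis=1 gives the same indices.

```python
# Hypothetical values laid out like topic_given_doc:
# one row per document, one column per topic, each row summing to 1.
topic_given_doc = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

# For every document, pick the index of the topic with the highest p(t|d).
dominant = [max(range(len(row)), key=row.__getitem__) for row in topic_given_doc]

print(dominant)  # [0, 2]
```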

word_given_topic

The words in each latent topic and their relative importance.

Results are presented as a tuple of 2-tuples (word, word importance).
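
A tuple of (word, word importance) pairs can be ranked with ordinary sorting. The sketch below assumes one such tuple for a single topic. The data and the helper top_words are hypothetical and only illustrate the shape described above.

```python
# Hypothetical data shaped like one topic's entry: a tuple of (word, importance) pairs.
topic_words = (("engine", 0.05), ("wing", 0.12), ("pilot", 0.08))


def top_words(pairs, n=2):
    """Return the n words with the highest importance, most important first."""
    ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n]]


print(top_words(topic_words))  # ['wing', 'pilot']
```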