plsa.algorithms.result module

class plsa.algorithms.result.PlsaResult(topic_given_doc: numpy.ndarray, word_given_topic: numpy.ndarray, topic_given_word: numpy.ndarray, topic: numpy.ndarray, kl_divergences: List[float], corpus: plsa.corpus.Corpus, tf_idf: bool)
Bases: object
Container for the results generated by a (conditional) PLSA run.
Parameters:
- topic_given_doc (ndarray) – The conditional probability p(t|d) as \(n_{topics}\times n_{docs}\) array.
- word_given_topic (ndarray) – The conditional probability p(w|t) as \(n_{words}\times n_{topics}\) array.
- topic_given_word (ndarray) – The conditional probability p(t|w) as \(n_{topics}\times n_{words}\) array.
- topic (ndarray) – The marginal probability p(t).
- kl_divergences (list of float) – The Kullback-Leibler divergences between the original document-word probability p(d, w) and its approximation for each iteration.
- corpus (Corpus) – The original corpus the PLSA model was trained on.
- tf_idf (bool) – Whether to weight the document-word matrix with the inverse document frequencies or not.
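The array shapes and normalizations listed above can be illustrated with plain NumPy. This is a minimal sketch on random data, not the library's own code: the document prior `doc` (standing in for p(d)) and the helper `normalize_columns` are hypothetical names, and deriving p(t|w) from p(w|t) and p(t) via Bayes' rule is an assumption about how the arrays relate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics, n_docs, n_words = 3, 5, 7


def normalize_columns(a):
    """Scale each column so that it sums to one (hypothetical helper)."""
    return a / a.sum(axis=0, keepdims=True)


# p(t|d): one distribution over topics per document (one per column).
topic_given_doc = normalize_columns(rng.random((n_topics, n_docs)))
# p(w|t): one distribution over words per topic (one per column).
word_given_topic = normalize_columns(rng.random((n_words, n_topics)))
# p(d): hypothetical document prior, not an argument of PlsaResult.
doc = rng.random(n_docs)
doc /= doc.sum()

# p(t) = sum_d p(t|d) p(d), a vector of length n_topics.
topic = topic_given_doc @ doc
# p(w, t) = p(w|t) p(t), shape (n_words, n_topics).
joint = word_given_topic * topic
# p(t|w) = p(w, t) / p(w), transposed to (n_topics, n_words).
topic_given_word = (joint / joint.sum(axis=1, keepdims=True)).T
```

In this sketch every column of each conditional array is a probability distribution, which is why `topic` also sums to one.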
-
convergence
The convergence of the Kullback-Leibler divergence.
-
kl_divergence
The KL divergence between the approximate and the true document-word probability.
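For readers unfamiliar with the quantity, the following is a minimal sketch of a Kullback-Leibler divergence between two discrete distributions. It assumes the direction KL(p || q) with p the true probability and q the approximation, and the natural logarithm; the library may use a different convention. The function name `kl_divergence_sketch` is hypothetical.

```python
import numpy as np


def kl_divergence_sketch(p, q):
    """Return KL(p || q) = sum p * log(p / q) over entries where p > 0.

    Assumes q > 0 wherever p > 0; works on arrays of any (equal) shape,
    for example a joint document-word probability matrix.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


true_probability = np.array([0.5, 0.5])
approximation = np.array([0.25, 0.75])
divergence = kl_divergence_sketch(true_probability, approximation)
```

The divergence is zero when the approximation equals the true distribution and positive otherwise, so a sequence of values that decreases and levels off indicates convergence.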
-
n_topics
The number of latent topics identified.
-
predict(doc: str) → Tuple[numpy.ndarray, int, Tuple[str, ...]]
Predict the relative importance of latent topics in a new document.
Parameters: doc (str) – A new document given as a single string.
Returns:
- ndarray – A 1-D array with the relative importance of latent topics.
- int – The number of words in the new document that were not present in the corpus the PLSA model was trained on.
- tuple of str – Those words in the new document that were not present in the corpus the PLSA model was trained on.
Raises: ValueError – If the document to predict on is an empty string, if there are no words left after preprocessing the document, or if there are no known words in the document.
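The bookkeeping around unknown words and the three error conditions can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `split_known_unknown` and `vocabulary` are hypothetical names, lowercasing plus whitespace splitting stands in for the real preprocessing, and unknown words are counted per occurrence.

```python
from typing import Dict, List, Tuple


def split_known_unknown(
    doc: str, vocabulary: Dict[str, int]
) -> Tuple[List[int], int, Tuple[str, ...]]:
    """Separate the words of a new document into known and unknown ones."""
    if not doc:
        raise ValueError("The document is an empty string.")
    # Stand-in preprocessing (assumption): lowercase, split on whitespace.
    words = doc.lower().split()
    if not words:
        raise ValueError("No words left after preprocessing.")
    # Indices of words that occur in the training vocabulary.
    known = [vocabulary[word] for word in words if word in vocabulary]
    # Words the trained model has never seen.
    unknown = tuple(word for word in words if word not in vocabulary)
    if not known:
        raise ValueError("No known words in the document.")
    return known, len(unknown), unknown


vocabulary = {"cat": 0, "dog": 1}
known, n_unknown, unknown = split_known_unknown("The cat saw a dog", vocabulary)
```

Only the known words can contribute to the topic prediction; the count and the tuple of unknown words correspond to the second and third return values described above.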
-
tf_idf
Whether the inverse document frequency was used to weight the document-word counts.
-
topic
The relative importance of latent topics.
-
topic_given_doc
The relative importance of latent topics in each document.
Dimensions are \(n_{docs} \times n_{topics}\).
-
word_given_topic
The words in each latent topic and their relative importance.
Results are presented as a tuple of 2-tuples (word, word importance).
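A presentation as (word, word importance) pairs can be derived from a p(w|t) array as sketched below. The sketch assumes one tuple of pairs per topic, sorted by descending importance; the names `rank_words` and `vocabulary` and the toy numbers are hypothetical and not taken from the library.

```python
import numpy as np

vocabulary = ("apple", "banana", "cherry")
# p(w|t) as an n_words x n_topics array; each column sums to one.
word_given_topic = np.array([
    [0.2, 0.7],
    [0.5, 0.1],
    [0.3, 0.2],
])


def rank_words(word_given_topic, vocabulary):
    """Return, per topic, a tuple of (word, importance) sorted descending."""
    ranked = []
    for t in range(word_given_topic.shape[1]):
        column = word_given_topic[:, t]
        order = np.argsort(-column)  # indices from most to least important
        ranked.append(tuple((vocabulary[i], float(column[i])) for i in order))
    return tuple(ranked)


ranked = rank_words(word_given_topic, vocabulary)
top_word_per_topic = [topic_words[0][0] for topic_words in ranked]
```

With this layout, `ranked[t][:k]` gives the k most important words of topic t together with their probabilities.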