plsa.corpus module

class plsa.corpus.Corpus(corpus: Iterable[str], pipeline: plsa.pipeline.Pipeline)

Bases: object

Processes raw document collections and provides numeric representations.

Parameters:
  • corpus (iterable of str) – An iterable over documents given as a single string each.
  • pipeline (Pipeline) – The preprocessing pipeline.

See also

plsa.pipeline

classmethod from_csv(path: str, pipeline: plsa.pipeline.Pipeline, col: int = -1, encoding: str = 'latin_1', max_docs: int = 1000, **kwargs) → plsa.corpus.Corpus

Instantiate a corpus from documents in a column of a CSV file.

Parameters:
  • path (str) – Full path (incl. file name) to a CSV file with one column containing documents.
  • pipeline (Pipeline) – The preprocessing pipeline.
  • col (int) – Which column contains the documents. Numbering starts with 0 for the first column. Negative numbers count back from the last column (e.g., -1 for last, -2 just before the last, etc.).
  • encoding (str) – A valid Python encoding used to read the documents.
  • max_docs (int) – The maximum number of documents to read from file.
  • **kwargs – Keyword arguments are passed on to Python’s own csv.reader function.
Raises:
  • StopIteration – If the CSV file does not contain at least two lines.

Notes

If you set col to a value outside the range of columns present in the CSV file, it is silently reset to the first or last column, depending on which end of the permitted range you exceed.

A list of available encodings can be found at https://docs.python.org/3/library/codecs.html

Formatting parameters for Python’s csv.reader can be found at https://docs.python.org/3/library/csv.html#csv-fmt-params
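The column handling described in the notes above can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the helper names clamp_column and read_column are hypothetical, the sample data is made up, and the sketch reads from a string and does not try to reproduce how from_csv treats the first lines of a file. It only shows an out-of-range col being reset to the nearest valid column and extra keyword arguments being forwarded to csv.reader.

```python
import csv
import io


def clamp_column(col, n_cols):
    """Clamp a possibly negative column index into the valid range.

    Hypothetical helper: too-large values fall back to the last column,
    too-negative values fall back to the first column.
    """
    return max(-n_cols, min(col, n_cols - 1))


def read_column(text, col=-1, max_docs=1000, **kwargs):
    """Collect up to max_docs entries from one column of CSV text."""
    # Extra keyword arguments (e.g. delimiter) go straight to csv.reader.
    reader = csv.reader(io.StringIO(text), **kwargs)
    docs = []
    for row in reader:
        if len(docs) >= max_docs:
            break
        if row:  # skip completely empty lines
            docs.append(row[clamp_column(col, len(row))])
    return docs


sample = "1;first document\n2;second document\n3;third document\n"

# col=5 exceeds the two available columns and is reset to the last one.
docs = read_column(sample, col=5, max_docs=2, delimiter=";")
```

Here docs holds the first two entries of the last column, because max_docs caps the number of documents read.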

classmethod from_xml(directory: str, pipeline: plsa.pipeline.Pipeline, tag: str = 'post', encoding: str = 'latin_1', max_files: int = 100) → plsa.corpus.Corpus

Instantiate a corpus from elements of XML files in a directory.

Parameters:
  • directory (str) – Path to the directory with the XML files.
  • pipeline (Pipeline) – The preprocessing pipeline.
  • tag (str) – The XML tag that opens (<…>) and closes (</…>) the elements containing documents.
  • encoding (str) – A valid Python encoding used to read the documents.
  • max_files (int) – The maximum number of XML files to read.

Notes

A list of available encodings can be found at https://docs.python.org/3/library/codecs.html
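To make the role of tag concrete, the following sketch pulls the text of every matching element out of a single XML string. It assumes well-formed XML and uses xml.etree.ElementTree from the standard library; whether from_xml parses files this way is not stated here, and the function name extract_documents as well as the sample markup are purely illustrative.

```python
import xml.etree.ElementTree as ET


def extract_documents(xml_text, tag="post"):
    """Return the stripped text of every <tag> element, skipping empty ones."""
    root = ET.fromstring(xml_text)
    return [
        element.text.strip()
        for element in root.iter(tag)
        if element.text and element.text.strip()
    ]


sample = (
    "<blog>"
    "<date>01,May,2004</date>"
    "<post> First post. </post>"
    "<post>Second post.</post>"
    "<post>   </post>"
    "</blog>"
)

# Only non-empty <post> elements are kept; <date> is ignored.
docs = extract_documents(sample, tag="post")
```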

get_doc(tf_idf: bool) → numpy.ndarray

The marginal probability that any word comes from a given document.

This probability p(d) is obtained by summing the joint document-word probability p(d, w) over all words.

Parameters:
  • tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The document probability p(d).
Return type: ndarray

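As a worked illustration of this marginalization, the sketch below builds a joint probability p(d, w) from a made-up count matrix and sums each row. Plain Python lists stand in for the ndarray the method returns, and the counts are invented for the example.

```python
# Toy document-word counts: 2 documents (rows) x 3 words (columns).
counts = [
    [1, 2, 1],
    [3, 0, 3],
]

# Normalize so that all entries sum to one: this is p(d, w).
total = sum(sum(row) for row in counts)
joint = [[count / total for count in row] for row in counts]

# Summing over all words (columns) leaves one probability per document.
p_d = [sum(row) for row in joint]
```

The first document contributes 4 of the 10 word occurrences, so its entry in p_d is approximately 0.4, and the entries of p_d sum to one.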
get_doc_given_word(tf_idf: bool) → numpy.ndarray

The conditional probability of a particular document given a word.

This probability p(d|w) is obtained by dividing the joint document-word probability p(d, w) by the marginal word probability p(w).

Parameters:
  • tf_idf (bool) – Whether to base the conditional probability on the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The conditional document probability p(d|w).
Return type: ndarray

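The division by p(w) can be sketched on a small made-up example. Plain Python lists are used instead of ndarrays, and every word is assumed to occur at least once so that p(w) is never zero.

```python
# Toy document-word counts: 2 documents (rows) x 3 words (columns).
counts = [
    [1, 2, 1],
    [3, 0, 3],
]
total = sum(sum(row) for row in counts)
joint = [[count / total for count in row] for row in counts]  # p(d, w)

# Marginal word probability p(w): sum each column over all documents.
p_w = [sum(column) for column in zip(*joint)]

# Divide every column of p(d, w) by the matching p(w) to get p(d|w).
p_d_given_w = [
    [p_dw / p_word for p_dw, p_word in zip(row, p_w)]
    for row in joint
]
```

Each column of p_d_given_w sums to one. For example, the second word occurs only in the first document, so its column is approximately [1.0, 0.0].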
get_doc_word(tf_idf: bool) → numpy.ndarray

The normalized document-word counts matrix.

Also referred to as the term-frequency matrix. Because words (or terms) that occur in the majority of documents are the least helpful in discriminating types of documents, each column of this matrix can be multiplied by the logarithm of the total number of documents divided by the number of documents containing the given word. The result is then referred to as the term-frequency inverse-document-frequency or TF-IDF matrix.

Either way, the returned matrix is always normalized such that it can be interpreted as the joint document-word probability p(d, w).

Parameters:
  • tf_idf (bool) – Whether to return the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The normalized document (rows) - word (columns) matrix, based either on pure counts (if tf_idf is False) or on counts weighted by the inverse document frequency (if tf_idf is True).
Return type: ndarray

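The weighting described above can be followed step by step on a toy count matrix. This is a minimal sketch of the textbook recipe using plain Python lists and the natural logarithm. The counts are invented, and details such as the logarithm base or any intermediate normalization inside the library are assumptions.

```python
import math

# Toy document-word counts: 3 documents (rows) x 3 words (columns).
counts = [
    [2, 1, 0],
    [1, 0, 3],
    [1, 0, 0],
]
n_docs = len(counts)
n_words = len(counts[0])

# Number of documents containing each word.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]

# Inverse document frequency: log(total documents / documents with word).
idf = [math.log(n_docs / df) for df in doc_freq]

# Multiply each column by its IDF weight, then normalize to sum to one.
weighted = [[count * w for count, w in zip(row, idf)] for row in counts]
total = sum(sum(row) for row in weighted)
tf_idf = [[value / total for value in row] for row in weighted]
```

The first word occurs in every document, so its IDF is log(1) = 0 and its column vanishes from tf_idf. In this toy case that also leaves the third document with an all-zero row, which shows how strongly the weighting can suppress ubiquitous words.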
get_word(tf_idf: bool) → numpy.ndarray

The marginal probability of a particular word.

This probability p(w) is obtained by summing the joint document-word probability p(d, w) over all documents.

Parameters:
  • tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The word probability p(w).
Return type: ndarray

idf

Logarithm of the inverse of the fraction of documents in which each word occurs.

index

Mapping from actual word to numeric word index.

n_docs

The number of non-empty documents.

n_occurrences

Total number of times any word occurred in any document.

n_words

The number of unique words retained after preprocessing.

pipeline

The pipeline of preprocessors for each document.

raw

The raw documents as they were read from the source.

vocabulary

Mapping from numeric word index to actual word.
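The index and vocabulary attributes describe mappings in opposite directions. The sketch below shows that relationship with plain dictionaries. The container types, the alphabetical ordering, and the sample words are assumptions made for illustration only.

```python
# Hypothetical set of unique words left after preprocessing.
words = ["topic", "model", "latent"]

# Numeric word index -> actual word (ordering chosen arbitrarily here).
vocabulary = dict(enumerate(sorted(words)))

# Actual word -> numeric word index: the inverse mapping.
index = {word: number for number, word in vocabulary.items()}

# Looking a word up in one mapping and then the other returns the original.
round_trip = [vocabulary[index[word]] for word in words]
```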