plsa.corpus module
-
class plsa.corpus.Corpus(corpus: Iterable[str], pipeline: plsa.pipeline.Pipeline)
Bases: object
Processes raw document collections and provides numeric representations.
Parameters:
- corpus (iterable of str) – An iterable over documents, each given as a single string.
- pipeline (Pipeline) – The preprocessing pipeline.
-
classmethod from_csv(path: str, pipeline: plsa.pipeline.Pipeline, col: int = -1, encoding: str = 'latin_1', max_docs: int = 1000, **kwargs) → plsa.corpus.Corpus
Instantiate a corpus from documents in a column of a CSV file.
Parameters:
- path (str) – Full path (including the file name) to a CSV file with one column containing documents.
- pipeline (Pipeline) – The preprocessing pipeline.
- col (int) – Which column contains the documents. Numbering starts with 0 for the first column. Negative numbers count back from the last column (e.g., -1 for the last, -2 for the one just before the last, etc.).
- encoding (str) – A valid Python encoding used to read the documents.
- max_docs (int) – The maximum number of documents to read from the file.
- **kwargs – Keyword arguments are passed on to Python’s own csv.reader function.
Raises: StopIteration – If the CSV file does not have at least two lines.
Notes
If you set col to a value outside the range present in the CSV file, it will be silently reset to the first or last column, depending on which side of the permitted range you exceed.
A list of available encodings can be found at https://docs.python.org/3/library/codecs.html
Formatting parameters for Python’s csv.reader can be found at https://docs.python.org/3/library/csv.html#csv-fmt-params
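The column handling described in the notes above can be sketched with the standard library alone. This is not the library's implementation: the helper name read_column is hypothetical, and skipping the first line as a header is an assumption made for the sake of the example.

```python
import csv
import io


def read_column(lines, col=-1, max_docs=1000, **kwargs):
    """Collect up to max_docs entries from one column of CSV input."""
    reader = csv.reader(lines, **kwargs)
    next(reader)  # assumption: the first line is a header and is skipped
    docs = []
    for row in reader:
        if len(docs) >= max_docs:
            break
        if not row:
            continue  # ignore blank lines
        # Clamp the requested column into the valid range for this row,
        # so out-of-range values fall back to the first or last column.
        idx = max(-len(row), min(col, len(row) - 1))
        docs.append(row[idx])
    return docs


sample = io.StringIO("id;text\n1;first document\n2;second document\n")
# col=7 exceeds the two available columns and is clamped to the last one;
# delimiter is forwarded to csv.reader, as **kwargs would be.
docs = read_column(sample, col=7, delimiter=";")
```

Here the extra keyword argument plays the role of the formatting parameters that from_csv forwards to csv.reader.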
-
classmethod from_xml(directory: str, pipeline: plsa.pipeline.Pipeline, tag: str = 'post', encoding: str = 'latin_1', max_files: int = 100) → plsa.corpus.Corpus
Instantiate a corpus from elements of XML files in a directory.
Parameters:
- directory (str) – Path to the directory with the XML files.
- pipeline (Pipeline) – The preprocessing pipeline.
- tag (str) – The XML tag that opens (<…>) and closes (</…>) the elements containing documents.
- encoding (str) – A valid Python encoding used to read the documents.
- max_files (int) – The maximum number of XML files to read.
Notes
A list of available encodings can be found at https://docs.python.org/3/library/codecs.html
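To illustrate what "elements containing documents" means for the tag parameter, here is a minimal sketch that pulls the text of every matching element out of one XML string. It uses xml.etree.ElementTree from the standard library; the helper name docs_from_xml and the sample markup are hypothetical, and the library itself may extract and clean elements differently.

```python
import xml.etree.ElementTree as ET


def docs_from_xml(xml_text, tag="post"):
    """Return the stripped text of every non-empty <tag> element."""
    root = ET.fromstring(xml_text)
    return [
        elem.text.strip()
        for elem in root.iter(tag)
        if elem.text and elem.text.strip()
    ]


# Made-up file content with two documents wrapped in <post> elements.
sample = (
    "<Blog><date>01,May,2004</date>"
    "<post> First entry. </post>"
    "<post>Second entry.</post></Blog>"
)
posts = docs_from_xml(sample)
```

Each string in posts would then be one document of the corpus.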
-
get_doc(tf_idf: bool) → numpy.ndarray
The marginal probability that any word comes from a given document.
This probability p(d) is obtained by summing the joint document-word probability p(d, w) over all words.
Parameters: tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The document probability p(d).
Return type: ndarray
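As a small illustration of this marginalization, the sketch below builds p(d, w) from a made-up count matrix and sums over words. It uses plain Python lists rather than the ndarray the method returns, and the counts are hypothetical.

```python
# Hypothetical raw counts: 2 documents (rows) x 3 words (columns).
counts = [
    [2, 1, 0],
    [1, 0, 4],
]

# Normalize by the total number of occurrences to get the joint p(d, w).
n_occurrences = sum(sum(row) for row in counts)
joint = [[count / n_occurrences for count in row] for row in counts]

# Sum over all words (columns) to get the marginal p(d).
p_doc = [sum(row) for row in joint]
```

The entries of p_doc add up to one, since they are obtained from a normalized joint distribution.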
-
get_doc_given_word(tf_idf: bool) → numpy.ndarray
The conditional probability of a document given a particular word.
This probability p(d|w) is obtained by dividing the joint document-word probability p(d, w) by the marginal word probability p(w).
Parameters: tf_idf (bool) – Whether to base the conditional probability on the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The conditional document probability p(d|w).
Return type: ndarray
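The division by p(w) can be sketched on a small, made-up joint matrix. The values below are hypothetical and plain lists stand in for the returned ndarray; every word is assumed to occur at least once, so p(w) is never zero.

```python
# Hypothetical joint p(d, w): 2 documents (rows) x 3 words (columns).
joint = [
    [0.25, 0.125, 0.0],
    [0.125, 0.0, 0.5],
]
n_words = len(joint[0])

# Marginal word probability p(w): sum each column over all documents.
p_word = [sum(row[w] for row in joint) for w in range(n_words)]

# Conditional p(d|w): divide every column of the joint by its p(w).
doc_given_word = [
    [row[w] / p_word[w] for w in range(n_words)]
    for row in joint
]
```

Each column of doc_given_word sums to one, because for a fixed word the probabilities over documents form a distribution.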
-
get_doc_word(tf_idf: bool) → numpy.ndarray
The normalized document-word counts matrix.
Also referred to as the term-frequency matrix. Because words (or terms) that occur in the majority of documents are the least helpful in discriminating types of documents, each column of this matrix can be multiplied by the logarithm of the total number of documents divided by the number of documents containing the given word. The result is then referred to as the term-frequency inverse-document-frequency or TF-IDF matrix.
Either way, the returned matrix is always normalized such that it can be interpreted as the joint document-word probability p(d, w).
Parameters: tf_idf (bool) – Whether to return the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The normalized document (rows) - word (columns) matrix, either as pure counts (if tf_idf is False) or weighted by the inverse document frequency (if tf_idf is True).
Return type: ndarray
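The weighting and normalization described for this method can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: the counts are made up, the helper normalize is hypothetical, and plain lists are used instead of an ndarray.

```python
import math


def normalize(matrix):
    """Scale all entries so that they sum to one."""
    total = sum(sum(row) for row in matrix)
    return [[value / total for value in row] for row in matrix]


# Hypothetical raw counts: 2 documents (rows) x 3 words (columns).
counts = [
    [2, 1, 0],
    [1, 0, 4],
]
n_docs = len(counts)
n_words = len(counts[0])

# Number of documents that contain each word at least once.
doc_freq = [sum(1 for row in counts if row[w] > 0) for w in range(n_words)]

# Logarithm of (total documents / documents containing the word).
idf = [math.log(n_docs / df) for df in doc_freq]

# Term-frequency matrix, normalized to a joint probability p(d, w).
tf = normalize(counts)

# TF-IDF: weight each column by its idf, then normalize again.
tf_idf = normalize([[c * idf[w] for w, c in enumerate(row)] for row in counts])
```

In this toy example the first word occurs in both documents, so its idf is zero and it drops out of the TF-IDF matrix entirely, which is the discriminating effect the description refers to.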
-
get_word(tf_idf: bool) → numpy.ndarray
The marginal probability of a particular word.
This probability p(w) is obtained by summing the joint document-word probability p(d, w) over all documents.
Parameters: tf_idf (bool) – Whether to marginalize the term-frequency inverse-document-frequency or just the term-frequency matrix.
Returns: The word probability p(w).
Return type: ndarray
-
idf
Logarithm of inverse fraction of documents each word occurs in.
-
index
Mapping from actual word to numeric word index.
-
n_docs
The number of non-empty documents.
-
n_occurrences
Total number of times any word occurred in any document.
-
n_words
The number of unique words retained after preprocessing.
-
pipeline
The pipeline of preprocessors for each document.
-
raw
The raw documents as they were read from the source.
-
vocabulary
Mapping from numeric word index to actual word.
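The index and vocabulary attributes are inverse mappings of each other. A minimal sketch of that relationship, using plain dictionaries and a made-up word list (the alphabetical ordering and the dict type are assumptions, not taken from the library):

```python
# Hypothetical preprocessed words pooled from all documents.
words = ["topic", "model", "corpus", "model", "topic", "word"]

# Numeric word index -> actual word (ordering here is an assumption).
vocabulary = dict(enumerate(sorted(set(words))))

# Actual word -> numeric word index, the inverse mapping.
index = {word: i for i, word in vocabulary.items()}

# Number of unique words, analogous to n_words.
n_words = len(vocabulary)
```

Looking a word up in index and feeding the result back into vocabulary returns the original word.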