plsa.preprocessors module

Preprocessors for documents and words.

These preprocessors come in three flavours: functions, closures that return functions, and classes that define callable objects. Which flavour is used depends on the complexity of the preprocessor. If it needs no parameters, a simple function will do. If it needs one or more parameters but does not need to be manipulated interactively, a closure is fine. If it would be convenient to alter its parameters interactively, a class is a good choice.
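
The three flavours can be sketched as follows. All names in this block (`strip_whitespace`, `replace_char`, `KeepLongWords`) are hypothetical stand-ins invented for illustration; none of them is part of `plsa.preprocessors`.

```python
from typing import Callable, Iterable, Tuple


# Flavour 1: a plain function, for a preprocessor without parameters.
def strip_whitespace(doc: str) -> str:
    return doc.strip()


# Flavour 2: a closure, for a preprocessor with fixed parameters.
def replace_char(old: str, new: str) -> Callable[[str], str]:
    def replacer(doc: str) -> str:
        return doc.replace(old, new)
    return replacer


# Flavour 3: a class defining __call__, for parameters that should
# remain adjustable after the preprocessor has been created.
class KeepLongWords:
    def __init__(self, min_len: int) -> None:
        self.min_len = min_len

    def __call__(self, words: Iterable[str]) -> Tuple[str, ...]:
        return tuple(word for word in words if len(word) >= self.min_len)


dashes_to_spaces = replace_char('-', ' ')
keep_long = KeepLongWords(4)

cleaned = dashes_to_spaces(strip_whitespace('  well-known fact '))
long_words = keep_long(cleaned.split())    # ('well', 'known', 'fact')

keep_long.min_len = 5                      # altered interactively
longer_words = keep_long(cleaned.split())  # ('known',)
```

Only the class-based flavour lets the threshold be changed without creating a new preprocessor.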

Preprocessors act either on an entire document string or, after documents have been split into individual words, on an iterable over the words contained in a single document. They therefore cannot be combined in arbitrary order: the return value of each preprocessor must match the call signature of the next.
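
The ordering constraint can be illustrated with a minimal sketch. The helpers `lower`, `split_words`, `drop_short`, and `run_pipeline` are hypothetical and merely mimic the two kinds of call signature described above; they are not the library's own functions.

```python
from functools import reduce
from typing import Iterable, Tuple


def lower(doc: str) -> str:                    # str -> str
    return doc.lower()


def split_words(doc: str) -> Tuple[str, ...]:  # str -> tuple of str
    return tuple(doc.split())


def drop_short(words: Iterable[str]) -> Tuple[str, ...]:  # words -> words
    return tuple(word for word in words if len(word) > 2)


def run_pipeline(doc, steps):
    # Feed the return value of each step into the next one.
    return reduce(lambda value, step: step(value), steps, doc)


# Valid order: string-level steps first, then the split, then word-level steps.
result = run_pipeline('An Ordered Pipeline', (lower, split_words, drop_short))

# Invalid order: `lower` expects a string but receives a tuple of words.
order_matters = False
try:
    run_pipeline('An Ordered Pipeline', (split_words, lower))
except AttributeError:
    order_matters = True
```

Here `result` is `('ordered', 'pipeline')`, while the second call fails because a tuple has no `lower` method.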

plsa.preprocessors.remove_non_ascii(doc: str) → str

Removes non-ASCII characters (i.e., those with a Unicode code point > 127) from a string.

Parameters:	doc (str) – A document given as a single string.
Returns:	The document as a single string with all characters with a Unicode code point > 127 removed.
Return type:	str
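
The described behaviour can be approximated as below. This is a sketch under the assumption that filtering on `ord(char) <= 127` matches the description; `drop_non_ascii` is a hypothetical name and not necessarily how the library implements it.

```python
def drop_non_ascii(doc: str) -> str:
    # Keep only characters whose code point lies in the ASCII range 0..127.
    return ''.join(char for char in doc if ord(char) <= 127)


# '\u00ef' is i with diaeresis and '\u00e9' is e with acute accent.
ascii_only = drop_non_ascii('na\u00efve caf\u00e9')
```

Note that the offending characters are dropped rather than transliterated, so `ascii_only` is `'nave caf'`.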

plsa.preprocessors.to_lower(doc: str) → str

Converts a string to all-lowercase.

Parameters:	doc (str) – A document given as a single string.
Returns:	The document as a single string with all characters converted to lowercase.
Return type:	str

plsa.preprocessors.remove_numbers(doc: str) → str

Removes digit/number characters from a string.

Parameters:	doc (str) – A document given as a single string.
Returns:	The document as a single string with all number/digit characters removed.
Return type:	str
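
A rough equivalent, assuming that "digit/number characters" means every character for which `str.isdigit` is true, might look like this. The name `drop_digits` is hypothetical.

```python
def drop_digits(doc: str) -> str:
    # Remove every digit character; surrounding text is left untouched.
    return ''.join(char for char in doc if not char.isdigit())


no_digits = drop_digits('Room 101, floor 3')
```

Only the digits themselves disappear, so whitespace and punctuation around them remain and `no_digits` is `'Room , floor '`.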

plsa.preprocessors.remove_tags(exclude_regex: str) → Callable[[str], str]

Returns a callable that removes matches of the given regular expression.

Parameters:	exclude_regex (str) – A regular expression specifying the patterns to remove from a document.
Returns:	A callable that removes patterns matching the given regular expression from a string.
Return type:	function
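
The closure pattern behind such a preprocessor can be sketched with the standard `re` module. The name `make_pattern_remover` is hypothetical, and replacing matches with the empty string (rather than, say, a space) is an assumption about the behaviour.

```python
import re
from typing import Callable


def make_pattern_remover(exclude_regex: str) -> Callable[[str], str]:
    pattern = re.compile(exclude_regex)  # compiled once, reused on each call

    def remover(doc: str) -> str:
        # Assumption: every match is replaced by the empty string.
        return pattern.sub('', doc)

    return remover


strip_html = make_pattern_remover(r'<[^>]*>')
text = strip_html('<p>Hello <b>world</b></p>')
```

With this pattern, `text` is `'Hello world'`.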

plsa.preprocessors.remove_punctuation(punctuation: Iterable[str]) → Callable[[str], str]

Returns a callable that removes punctuation characters from a string.

Parameters:	punctuation (iterable of str) – An iterable over single-character strings specifying punctuation characters to remove from a document.
Returns:	A callable that removes the given punctuation characters from a string.
Return type:	function
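
One way to build such a callable is a translation table, as in the sketch below. `make_punctuation_remover` is a hypothetical name, and `string.punctuation` is used only as a convenient example input; the library may be implemented differently.

```python
import string
from typing import Callable, Iterable


def make_punctuation_remover(punctuation: Iterable[str]) -> Callable[[str], str]:
    # The third argument of str.maketrans lists characters to delete.
    table = str.maketrans('', '', ''.join(punctuation))

    def remover(doc: str) -> str:
        return doc.translate(table)

    return remover


strip_punct = make_punctuation_remover(string.punctuation)
stripped = strip_punct("Hello, world! It's 5 o'clock.")
```

Because apostrophes count as punctuation here, `stripped` is `'Hello world Its 5 oclock'`.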

plsa.preprocessors.tokenize(doc: str) → Tuple[str, ...]

Splits a string into individual words.

Parameters:	doc (str) – A document given as a single string.
Returns:	The document as a tuple of individual words.
Return type:	tuple of str

class plsa.preprocessors.RemoveStopwords(stopwords: Union[str, Iterable[str]])

Bases: object

Instantiate callable objects that remove stopwords from a document.

Parameters:	stopwords (str or iterable of str) – Stopword(s) to remove from a document given as an iterable over words.

Examples

>>> from plsa.preprocessors import RemoveStopwords
>>> remover = RemoveStopwords('is')
>>> remover.words
('is',)

>>> remover.words = 'the', 'are'
>>> remover.words
('the', 'are')

>>> remover += 'is', 'we'
>>> remover.words
('is', 'we', 'the', 'are')

>>> new_instance = remover + 'do'
>>> new_instance.words
('are', 'we', 'is', 'do', 'the')

words: The current stopwords.

class plsa.preprocessors.LemmatizeWords(*incl_pos)

Bases: object

Instantiate callable objects that find the root form of words.

Parameters:	incl_pos (str*) – One or more part-of-speech (POS) tag(s) indicating the type(s) of words to retain and to find the root form of. Each must be one of ‘JJ’ (adjectives), ‘NN’ (nouns), ‘VB’ (verbs), or ‘RB’ (adverbs).
Raises:	`KeyError` – If any of the given part-of-speech tags is not among the allowed ones.

Examples

>>> from plsa.preprocessors import LemmatizeWords
>>> lemmatizer = LemmatizeWords('VB')
>>> lemmatizer.types
('VB',)

>>> lemmatizer.types = 'jj', 'nn'
>>> lemmatizer.types
('JJ', 'NN')

>>> lemmatizer += 'VB', 'NN'
>>> lemmatizer.types
('JJ', 'NN', 'VB')

>>> new_instance = lemmatizer + 'RB'
>>> new_instance.types
('JJ', 'RB', 'NN', 'VB')

types: The current type(s) of words to retain.

plsa.preprocessors.remove_short_words(min_word_len: int) → Callable[[Iterable[str]], Tuple[str, ...]]

Returns a callable that removes short words from an iterable of strings.

Parameters:	min_word_len (int) – Minimum number of characters in a word for it to be retained.
Returns:	A callable that removes words shorter than the given threshold from an iterable over strings.
Return type:	function
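
A minimal sketch of a closure with this signature is shown below. `make_short_word_remover` is a hypothetical name; the comparison `>=` follows the description of `min_word_len` as the minimum length for a word to be retained.

```python
from typing import Callable, Iterable, Tuple


def make_short_word_remover(
    min_word_len: int
) -> Callable[[Iterable[str]], Tuple[str, ...]]:
    def remover(words: Iterable[str]) -> Tuple[str, ...]:
        # Retain words with at least `min_word_len` characters.
        return tuple(word for word in words if len(word) >= min_word_len)

    return remover


keep_three_plus = make_short_word_remover(3)
kept = keep_three_plus(('a', 'to', 'the', 'word'))
```

Here `kept` is `('the', 'word')`. Since the callable takes an iterable over words, it belongs after the tokenizing step in a chain of preprocessors.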