Text Operations API

Cleaning Operations

This module contains several text cleaning operations.

class jange.ops.text.clean.CaseChangeOperation(mode: str = 'lower', name: str = 'case_change')[source]

Operation for changing case of the texts.

Parameters:
  • mode (str) – one of lower, upper or capitalize
  • name (str) – name of this operation

Example

>>> ds = DataStream(["AAA", "Bbb"])
>>> list(ds.apply(CaseChangeOperation(mode="lower")))
['aaa', 'bbb']
Variables:
  • mode (str) – one of [‘lower’, ‘capitalize’, ‘upper’]
  • name (str) – name of this operation
exception jange.ops.text.clean.EmptyTextError[source]
class jange.ops.text.clean.TokenFilterOperation(patterns: List[List[Dict[KT, VT]]], nlp: Optional[spacy.language.Language] = None, keep_matching_tokens=False, name: Optional[str] = 'token_filter')[source]

Operation for filtering individual tokens.

Spacy’s token pattern matching is used for matching tokens in the document. Tokens matching the filter can either be discarded, or kept while discarding the non-matching ones.

Parameters:
  • patterns (List[List[Dict]]) – a list of patterns where each pattern is a List[Dict]. The patterns are passed to spacy’s Token Matcher. See https://spacy.io/usage/rule-based-matching for more details on how to define patterns.
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • keep_matching_tokens (bool) – if true then any non-matching tokens are discarded from the document (e.g. extracting only nouns); if false then any matching tokens are discarded (e.g. stopword removal)
  • name (Optional[str]) – name of this operation

Example

>>> nlp = spacy.load("en_core_web_sm")
>>> # define patterns to match [a, an, the] tokens
>>> patterns = [
    [{"LOWER": "a"}],
    [{"LOWER": "an"}],
    [{"LOWER": "the"}]
]
>>> # define the token filter operation to match the patterns and discard them
>>> op = TokenFilterOperation(patterns=patterns, nlp=nlp, keep_matching_tokens=False)
>>> ds = stream.DataStream(["that is an orange"])
>>> print(list(ds.apply(op)))
[that is orange]

See https://spacy.io/usage/rule-based-matching#adding-patterns-attributes for more details on what token patterns can be used.

Variables:
  • nlp (spacy.language.Language) – spacy’s language model
  • keep_matching_tokens (bool) – whether to discard the tokens matched by the filter from the document or to keep them
  • patterns (List[List[Dict]]) – patterns to pass to spacy’s Matcher
  • name (str) – name of this operation
jange.ops.text.clean.lemmatize(nlp: Optional[spacy.language.Language] = None, name='lemmatize') → jange.ops.base.SpacyBasedOperation[source]

Helper function that returns a SpacyBasedOperation for lemmatizing. Applying this operation produces a stream.DataStream where each item is the lemmatized text as a string.

Parameters:
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  SpacyBasedOperation
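
Example

A minimal usage sketch (output is illustrative; the exact lemmas depend on the loaded spacy model):

>>> ds = stream.DataStream(["oranges are good"])
>>> print(list(ds.apply(ops.text.lemmatize())))
['orange be good']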

jange.ops.text.clean.lowercase(name='lowercase') → jange.ops.text.clean.CaseChangeOperation[source]

Helper function to create CaseChangeOperation with mode="lower"
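
Example

A minimal sketch mirroring the CaseChangeOperation example above:

>>> ds = stream.DataStream(["AAA", "Bbb"])
>>> print(list(ds.apply(ops.text.lowercase())))
['aaa', 'bbb']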

jange.ops.text.clean.pos_filter(pos_tags: Union[str, List[str]], keep_matching_tokens: bool = False, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'filter_pos') → jange.ops.text.clean.TokenFilterOperation[source]

TokenFilterOperation to filter tokens based on Part of Speech

Parameters:
  • pos_tags (Union[str, List[str]]) – a single POS tag or a list of POS tags to search for. See https://spacy.io/api/annotation#pos-tagging for more details on what tags can be used. These depend on the language model used.
  • keep_matching_tokens (bool) – if true then tokens having the given part of speech are kept and others are discarded from the text. Otherwise, tokens not having the given part of speech tags are kept
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation

Example

>>> ds = stream.DataStream(["Python is a programming language"])
>>> print(list(ds.apply(ops.text.pos_filter("NOUN", keep_matching_tokens=True))))
[programming language]
jange.ops.text.clean.remove_emails(nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_emails') → jange.ops.text.clean.TokenFilterOperation[source]

TokenFilterOperation to remove emails

Parameters:
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation
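
Example

A minimal sketch (output is illustrative; spacy’s tokenizer keeps email addresses as single tokens):

>>> ds = stream.DataStream(["contact us at support@example.com"])
>>> print(list(ds.apply(ops.text.remove_emails())))
[contact us at]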

jange.ops.text.clean.remove_links(nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_links') → jange.ops.text.clean.TokenFilterOperation[source]

TokenFilterOperation to remove hyperlinks

Parameters:
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation
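
Example

A minimal sketch (assuming the helper is named remove_links, matching the signature above; output is illustrative):

>>> ds = stream.DataStream(["visit https://example.com for more info"])
>>> print(list(ds.apply(ops.text.remove_links())))
[visit for more info]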

jange.ops.text.clean.remove_numbers(nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_numbers') → jange.ops.text.clean.TokenFilterOperation[source]

TokenFilterOperation to remove numbers

Parameters:
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation
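
Example

A minimal sketch (output is illustrative):

>>> ds = stream.DataStream(["there are 500 students"])
>>> print(list(ds.apply(ops.text.remove_numbers())))
[there are students]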

jange.ops.text.clean.remove_short_words(length: int, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_short_words') → jange.ops.text.clean.TokenFilterOperation[source]

TokenFilterOperation to remove tokens that have fewer characters than specified

Parameters:
  • length (int) – a token must have at least this many characters; otherwise it is discarded
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation
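
Example

A minimal sketch (output is illustrative; tokens with fewer than length characters are dropped):

>>> ds = stream.DataStream(["an apple a day"])
>>> print(list(ds.apply(ops.text.remove_short_words(length=3))))
[apple day]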

jange.ops.text.clean.remove_stopwords(words: List[str] = None, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_stopwords') → jange.ops.text.clean.TokenFilterOperation[source]

TokenFilterOperation to remove stopwords

Parameters:
  • words (List[str]) – a list of words to remove from the text; if None, spacy’s default stopwords are removed (see the example below)
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation

Example

>>> ds = stream.DataStream(["Python is a programming language"])
>>> print(list(ds.apply(ops.text.remove_stopwords())))
[Python programming language]
>>> print(list(ds.apply(ops.text.remove_stopwords(words=["programming"]))))
[Python is a language]
jange.ops.text.clean.token_filter(patterns: List[List[Dict[KT, VT]]], keep_matching_tokens, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'token_filter') → jange.ops.text.clean.TokenFilterOperation[source]

Helper function to create TokenFilterOperation

Parameters:
  • patterns (List[List[Dict]]) – a list of patterns where each pattern is a List[Dict]. The patterns are passed to spacy’s Token Matcher. See https://spacy.io/usage/rule-based-matching for more details on how to define patterns.
  • nlp (Optional[spacy.language.Language]) – spacy’s language model or None. If None, the en_core_web_sm model is loaded by default
  • keep_matching_tokens (bool) – if true then any non-matching tokens are discarded from the document (e.g. extracting only nouns); if false then any matching tokens are discarded (e.g. stopword removal)
  • name (Optional[str]) – name of this operation
Return type:
  TokenFilterOperation
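
Example

A minimal sketch using a LIKE_NUM token pattern (output is illustrative):

>>> patterns = [[{"LIKE_NUM": True}]]
>>> ds = stream.DataStream(["room 101 is empty"])
>>> print(list(ds.apply(ops.text.token_filter(patterns, keep_matching_tokens=False))))
[room is empty]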

jange.ops.text.clean.uppercase(name='uppercase') → jange.ops.text.clean.CaseChangeOperation[source]

Helper function to create CaseChangeOperation with mode="upper"
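
Example

A minimal sketch mirroring the lowercase helper:

>>> ds = stream.DataStream(["AAA", "Bbb"])
>>> print(list(ds.apply(ops.text.uppercase())))
['AAA', 'BBB']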

Encoding Operations

This module contains several text encoding algorithms, including one-hot (binary) encoding, count-based encoding and tf-idf.

jange.ops.text.encode.count(max_features: Optional[int] = None, max_df: Union[int, float] = 1.0, min_df: Union[int, float] = 1, ngram_range: Tuple[int, int] = (1, 1), name: Optional[str] = 'count', **kwargs) → jange.ops.base.ScikitBasedOperation[source]

Returns a count-based feature vector extraction operation. Uses sklearn’s CountVectorizer as the underlying model.

Parameters:
  • max_features (Optional[int]) – If a value is provided then only the top max_features words, ordered by their count frequency, are considered in the vocabulary
  • max_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency higher than the given value. If the value is a float, it is treated as a ratio.
  • min_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency lower than the given value. If the value is a float, it is treated as a ratio.
  • ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
  • name (str) – name of this operation
  • **kwargs – Keyword parameters that will be passed to the initializer of CountVectorizer
Returns:
  ScikitBasedOperation

See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for details on the parameters and more examples.
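
Example

A minimal sketch (assumes the resulting stream’s items expose the vectorizer’s document-term matrix as a scipy sparse matrix):

>>> ds = stream.DataStream(["cat sat", "cat cat sat mat"])
>>> features_ds = ds.apply(ops.text.encode.count())
>>> print(features_ds.items.shape)
(2, 3)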

jange.ops.text.encode.one_hot(max_features: Optional[int] = None, max_df: Union[int, float] = 1.0, min_df: Union[int, float] = 1, ngram_range: Tuple[int, int] = (1, 1), name: Optional[str] = 'one_hot', **kwargs) → jange.ops.base.ScikitBasedOperation[source]

Returns an operation for performing one-hot encoding of texts.

Uses the sklearn.feature_extraction.text.CountVectorizer class with binary=True

Parameters:
  • max_features (Optional[int]) – If a value is provided then only the top max_features words, ordered by their count frequency, are considered in the vocabulary
  • max_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency higher than the given value. If the value is a float, it is treated as a ratio.
  • min_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency lower than the given value. If the value is a float, it is treated as a ratio.
  • ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
  • name (str) – name of this operation
  • **kwargs – Keyword parameters that will be passed to the initializer of CountVectorizer
Returns:
  ScikitBasedOperation

See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for details on the parameters and more examples.
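
Example

A minimal sketch (assumes the resulting stream’s items hold a scipy sparse matrix; with binary=True the counts are clipped to 0/1):

>>> ds = stream.DataStream(["good good movie", "bad movie"])
>>> binary_ds = ds.apply(ops.text.encode.one_hot())
>>> print(binary_ds.items.todense())
[[0 1 1]
 [1 0 1]]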

jange.ops.text.encode.tfidf(max_features: Optional[int] = None, max_df: Union[int, float] = 1.0, min_df: Union[int, float] = 1, ngram_range: Tuple[int, int] = (1, 1), norm: str = 'l2', use_idf: bool = True, name: str = 'tfidf', **kwargs) → jange.ops.base.ScikitBasedOperation[source]

Returns a tfidf-based feature vector extraction operation. Uses sklearn’s TfidfVectorizer as the underlying model.

Parameters:
  • max_features (Optional[int]) – If a value is provided then only the top max_features words, ordered by their count frequency, are considered in the vocabulary
  • max_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency higher than the given value. If the value is a float, it is treated as a ratio.
  • min_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency lower than the given value. If the value is a float, it is treated as a ratio.
  • ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
  • norm (str) – Each output row will have unit norm, either ‘l2’ (the sum of squares of vector elements is 1; the cosine similarity between two vectors is their dot product when the l2 norm has been applied) or ‘l1’ (the sum of absolute values of vector elements is 1).
  • use_idf (bool) – Enable inverse-document-frequency reweighting.
  • name (str) – name of this operation
  • **kwargs – Keyword parameters that will be passed to the initializer of TfidfVectorizer
Returns:
  ScikitBasedOperation

See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html for details on the parameters and more examples.
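
Example

A minimal sketch (output shape is illustrative; with the default l2 norm each row of the matrix has unit length):

>>> ds = stream.DataStream(["cat sat", "dog sat"])
>>> tfidf_ds = ds.apply(ops.text.encode.tfidf())
>>> print(tfidf_ds.items.shape)
(2, 3)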

Embedding Operations

This module contains operations for extracting word/document embeddings using a language model.

class jange.ops.text.embedding.DocumentEmbeddingOperation(nlp: Optional[spacy.language.Language] = None, name: str = 'doc_embedding')[source]

Operation to calculate a document’s vector using word embeddings. The word embeddings of the tokens are collected and averaged.

Parameters:
  • nlp (Optional[Language]) – a spacy model
  • name (str) – name of this operation

Example

>>> ds = DataStream(["this is text 1", "this is text 2"])
>>> vector_ds = ds.apply(DocumentEmbeddingOperation())
>>> print(vector_ds.items)
Variables:
  • nlp (Language) – spacy model
  • name (str) – name of this operation
jange.ops.text.embedding.doc_embedding(nlp: Optional[spacy.language.Language] = None, name: str = 'doc_embedding') → jange.ops.text.embedding.DocumentEmbeddingOperation[source]

Helper function to return DocumentEmbeddingOperation

Parameters:
  • nlp (Optional[Language]) – a spacy model
  • name (str) – name of this operation
Return type:
  DocumentEmbeddingOperation
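
Example

A minimal sketch mirroring the class example above (assuming the helper is exposed as ops.text.doc_embedding; a model with word vectors, e.g. en_core_web_md, gives more meaningful embeddings):

>>> ds = stream.DataStream(["this is text 1", "this is text 2"])
>>> vector_ds = ds.apply(ops.text.doc_embedding())
>>> print(vector_ds.items)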