Text Operations API¶
Cleaning Operations¶
This module contains several text cleaning operations.
class jange.ops.text.clean.CaseChangeOperation(mode: str = 'lower', name: str = 'case_change')[source]¶
Operation for changing the case of texts.
Parameters:
- mode (str) – one of lower, upper or capitalize
- name (str) – name of this operation
Example
>>> ds = DataStream(["AAA", "Bbb"])
>>> list(ds.apply(CaseChangeOperation(mode="lower")))
["aaa", "bbb"]
Variables:
- mode (str) – one of ['lower', 'capitalize', 'upper']
- name (str) – name of this operation
class jange.ops.text.clean.TokenFilterOperation(patterns: List[List[Dict]], nlp: Optional[spacy.language.Language] = None, keep_matching_tokens=False, name: Optional[str] = 'token_filter')[source]¶
Operation for filtering individual tokens.
Spacy's token pattern matching is used for matching various tokens in the document. Tokens matching the filter can either be discarded, or kept while the non-matching ones are discarded.
Parameters:
- patterns (List[List[Dict]]) – a list of patterns where each pattern is a List[Dict]. The patterns are passed to spacy's token Matcher. See https://spacy.io/usage/rule-based-matching for more details on how to define patterns.
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- keep_matching_tokens (bool) – if true then any non-matching tokens are discarded from the document (e.g. extracting only nouns); if false then any matching tokens are discarded (e.g. stopword removal)
- name (Optional[str]) – name of this operation
Example
>>> nlp = spacy.load("en_core_web_sm")
>>> # define patterns to match [a, an, the] tokens
>>> patterns = [
...     [{"LOWER": "a"}],
...     [{"LOWER": "an"}],
...     [{"LOWER": "the"}],
... ]
>>> # define the token filter operation to match the patterns and discard them
>>> op = TokenFilterOperation(patterns=patterns, nlp=nlp, keep_matching_tokens=False)
>>> ds = stream.DataStream(["that is an orange"])
>>> print(list(ds.apply(op)))
["that is orange"]
See https://spacy.io/usage/rule-based-matching#adding-patterns-attributes for more details on what token patterns can be used.
Variables:
- nlp (spacy.language.Language) – spacy's language model
- keep_matching_tokens (bool) – whether to discard the tokens matched by the filter from the document or to keep them
- patterns (List[List[Dict]]) – patterns to pass to spacy’s Matcher
- name (str) – name of this operation
jange.ops.text.clean.lemmatize(nlp: Optional[spacy.language.Language] = None, name='lemmatize') → jange.ops.base.SpacyBasedOperation[source]¶
Helper function to return a SpacyBasedOperation for lemmatizing. This operation returns a stream.DataStream where each item is a string after being lemmatized.
Parameters:
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Returns: out
Return type: SpacyBasedOperation
jange.ops.text.clean.lowercase(name='lowercase') → jange.ops.text.clean.CaseChangeOperation[source]¶
Helper function to create a CaseChangeOperation with mode="lower"
jange.ops.text.clean.pos_filter(pos_tags: Union[str, List[str]], keep_matching_tokens: bool = False, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'filter_pos') → jange.ops.text.clean.TokenFilterOperation[source]¶
TokenFilterOperation to filter tokens based on part of speech
Parameters:
- pos_tags (Union[str, List[str]]) – a single POS tag or a list of POS tags to search for. See https://spacy.io/api/annotation#pos-tagging for more details on what tags can be used. These depend on the language model used.
- keep_matching_tokens (bool) – if true then tokens having the given part of speech are kept and others are discarded from the text. Otherwise, tokens not having the given part of speech tags are kept
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
Example
>>> ds = stream.DataStream(["Python is a programming language"])
>>> print(list(ds.apply(ops.text.filter_pos("NOUN", keep_matching_tokens=True))))
[programming language]
jange.ops.text.clean.remove_emails(nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_emails') → jange.ops.text.clean.TokenFilterOperation[source]¶
TokenFilterOperation to remove emails
Parameters:
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
jange.ops.text.clean.remove_links(nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_links') → jange.ops.text.clean.TokenFilterOperation[source]¶
TokenFilterOperation to remove hyperlinks
Parameters:
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
jange.ops.text.clean.remove_numbers(nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_numbers') → jange.ops.text.clean.TokenFilterOperation[source]¶
TokenFilterOperation to remove numbers
Parameters:
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
jange.ops.text.clean.remove_short_words(length: int, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_short_words') → jange.ops.text.clean.TokenFilterOperation[source]¶
TokenFilterOperation to remove tokens that have fewer characters than specified
Parameters:
- length (int) – at least this many characters should be in the token, otherwise it is discarded
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
jange.ops.text.clean.remove_stopwords(words: List[str] = None, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'remove_stopwords') → jange.ops.text.clean.TokenFilterOperation[source]¶
TokenFilterOperation to remove stopwords
Parameters:
- words (List[str]) – a list of words to remove from the text
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
Example
>>> ds = stream.DataStream(["Python is a programming language"])
>>> print(list(ds.apply(ops.text.remove_stopwords())))
[Python programming language]
>>> print(list(ds.apply(ops.text.remove_stopwords(words=["programming"]))))
[Python is a language]
jange.ops.text.clean.token_filter(patterns: List[List[Dict]], keep_matching_tokens, nlp: Optional[spacy.language.Language] = None, name: Optional[str] = 'token_filter') → jange.ops.text.clean.TokenFilterOperation[source]¶
Helper function to create a TokenFilterOperation
Parameters:
- patterns (List[List[Dict]]) – a list of patterns where each pattern is a List[Dict]. The patterns are passed to spacy's token Matcher. See https://spacy.io/usage/rule-based-matching for more details on how to define patterns.
- keep_matching_tokens (bool) – if true then any non-matching tokens are discarded from the document (e.g. extracting only nouns); if false then any matching tokens are discarded (e.g. stopword removal)
- nlp (Optional[spacy.language.Language]) – spacy's language model or None. If None, the en_core_web_sm spacy model is loaded by default
- name (Optional[str]) – name of this operation
Return type: TokenFilterOperation
Encoding Operations¶
This module contains several text encoding algorithms, including binary (one-hot), count based and tf-idf encodings.
jange.ops.text.encode.count(max_features: Optional[int] = None, max_df: Union[int, float] = 1.0, min_df: Union[int, float] = 1, ngram_range: Tuple[int, int] = (1, 1), name: Optional[str] = 'count', **kwargs) → jange.ops.base.ScikitBasedOperation[source]¶
Returns a count based feature vector extraction operation. Uses sklearn's CountVectorizer as the underlying model.
Parameters:
- max_features (Optional[int]) – If a value is provided then only the top max_features words, ordered by their count frequency, are considered in the vocabulary
- max_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency higher than the given value. If the value is a float, then it is treated as a ratio.
- min_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency lower than the given value. If the value is a float, then it is treated as a ratio.
- ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- name (str) – name of this operation
- **kwargs – Keyword parameters that will be passed to the initializer of CountVectorizer
Returns: SklearnBasedEncodeOperation
See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for details on the parameters and more examples.
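Since count wraps sklearn's CountVectorizer, its behaviour can be previewed with the underlying model directly (a sketch of the vectorizer itself, not the jange pipeline API):

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = ["this is text one", "this is text two"]
vectorizer = CountVectorizer(min_df=1, ngram_range=(1, 1))
matrix = vectorizer.fit_transform(texts)
# The vocabulary is built from the corpus; each row holds per-document counts
print(sorted(vectorizer.vocabulary_))  # ['is', 'one', 'text', 'this', 'two']
print(matrix.shape)                    # (2, 5)
```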
jange.ops.text.encode.one_hot(max_features: Optional[int] = None, max_df: Union[int, float] = 1.0, min_df: Union[int, float] = 1, ngram_range: Tuple[int, int] = (1, 1), name: Optional[str] = 'one_hot', **kwargs) → jange.ops.base.ScikitBasedOperation[source]¶
Returns an operation for performing one-hot encoding of texts.
Uses the sklearn.feature_extraction.text.CountVectorizer class with binary=True
Parameters:
- max_features (Optional[int]) – If a value is provided then only the top max_features words, ordered by their count frequency, are considered in the vocabulary
- max_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency higher than the given value. If the value is a float, then it is treated as a ratio.
- min_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency lower than the given value. If the value is a float, then it is treated as a ratio.
- ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- name (str) – name of this operation
- **kwargs – Keyword parameters that will be passed to the initializer of CountVectorizer
Returns: SklearnBasedEncodeOperation
See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html for details on the parameters and more examples.
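The binary mode that one_hot relies on can be sketched with the underlying sklearn vectorizer (not the jange pipeline API): with binary=True, every term count is capped at 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(["spam spam spam eggs"])
print(sorted(vectorizer.vocabulary_))  # ['eggs', 'spam']
# 'spam' occurs 3 times but is encoded as 1
print(matrix.toarray())                # [[1 1]]
```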
jange.ops.text.encode.tfidf(max_features: Optional[int] = None, max_df: Union[int, float] = 1.0, min_df: Union[int, float] = 1, ngram_range: Tuple[int, int] = (1, 1), norm: str = 'l2', use_idf: bool = True, name: str = 'tfidf', **kwargs) → jange.ops.base.ScikitBasedOperation[source]¶
Returns a tfidf based feature vector extraction operation. Uses sklearn's TfidfVectorizer as the underlying model.
Parameters:
- max_features (Optional[int]) – If a value is provided then only the top max_features words, ordered by their count frequency, are considered in the vocabulary
- max_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency higher than the given value. If the value is a float, then it is treated as a ratio.
- min_df (Union[int, float]) – When building the vocabulary, ignore terms that have a document frequency lower than the given value. If the value is a float, then it is treated as a ratio.
- ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.
- norm (str) – Each output row will have unit norm, either:
  * 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when the l2 norm has been applied.
  * 'l1': Sum of absolute values of vector elements is 1.
- use_idf (bool) – Enable inverse-document-frequency reweighting.
- name (str) – name of this operation
- **kwargs – Keyword parameters that will be passed to the initializer of TfidfVectorizer
Returns: SklearnBasedEncodeOperation
See https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html for details on the parameters and more examples.
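The effect of the norm parameter can be sketched with the underlying TfidfVectorizer (not the jange pipeline API): with norm='l2', every document vector has unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["python is great", "python is slow"]
vectorizer = TfidfVectorizer(norm="l2", use_idf=True)
matrix = vectorizer.fit_transform(texts)
# Each row is l2-normalized, so its Euclidean norm is 1
print(np.linalg.norm(matrix.toarray(), axis=1))  # [1. 1.]
```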
Embedding Operations¶
This module contains operations for extracting word/document embeddings using a language model.
class jange.ops.text.embedding.DocumentEmbeddingOperation(nlp: Optional[spacy.language.Language] = None, name: str = 'doc_embedding')[source]¶
Operation to calculate a document's vector using word embeddings. The word embedding of each token is collected and averaged.
Parameters: - nlp (Optional[Language]) – a spacy model
- name (str) – name of this operation
Example
>>> ds = DataStream(["this is text 1", "this is text 2"])
>>> vector_ds = ds.apply(DocumentEmbeddingOperation())
>>> print(vector_ds.items)
Variables:
- nlp (Language) – spacy model
- name (str) – name of this operation
jange.ops.text.embedding.doc_embedding(nlp: Optional[spacy.language.Language] = None, name: str = 'doc_embedding') → jange.ops.text.embedding.DocumentEmbeddingOperation[source]¶
Helper function to return a DocumentEmbeddingOperation
Parameters:
- nlp (Optional[Language]) – a spacy model
- name (str) – name of this operation
Return type: DocumentEmbeddingOperation