Base Operation Classes API

class jange.ops.base.ScikitBasedOperation(model, predict_fn_name: str, batch_size: int = 1000, name: str = 'sklearn_op')

Base class for operations that use scikit-learn estimators.

Variables:
  • model (any sklearn Estimator) – the estimator instance used by this operation
  • predict_fn_name (str) – name of the method or attribute on the model that yields predictions. Usually this is transform, predict or kneighbors. For models that cannot predict on a new dataset, this should be the name of the attribute that holds the fitted results, e.g. labels_ for clustering models such as DBSCAN and AgglomerativeClustering, or embedding_ for dimensionality reduction approaches such as TSNE and SpectralEmbedding.

Example

>>> import sklearn.linear_model as sklm
>>> import sklearn.decomposition as skdecomp
>>> import sklearn.cluster as skcluster
>>> op1 = ScikitBasedOperation(sklm.SGDClassifier(), predict_fn_name="predict")
>>> op2 = ScikitBasedOperation(skdecomp.PCA(15), predict_fn_name="transform")
>>> op3 = ScikitBasedOperation(skcluster.DBSCAN(), predict_fn_name="labels_")
can_predict_on_new

Returns whether sklearn’s estimator can predict on unseen data

It checks whether the given predict_fn_name is present on the model and, if it is, whether it is a function.

Note

Estimators that do not support prediction on unseen data populate an attribute such as labels_ or embedding_ only after the model has been trained.

Returns: True if the estimator can predict on a new dataset
Return type: bool
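
The check described above can be sketched as follows. This is a minimal illustration and not the library's actual implementation: supports_unseen_data is a hypothetical helper name, and it assumes that "is a function" can be approximated with Python's built-in callable().

```python
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA


def supports_unseen_data(model, predict_fn_name: str) -> bool:
    """Hypothetical helper: True only if the named member exists and is callable."""
    # getattr with a default covers the "is it present" part of the check;
    # callable() covers the "is it a function" part.
    member = getattr(model, predict_fn_name, None)
    return callable(member)


# PCA exposes a transform method, so it can be applied to unseen data.
pca_ok = supports_unseen_data(PCA(2), "transform")

# An unfitted DBSCAN has no labels_ attribute at all.
dbscan_before = supports_unseen_data(DBSCAN(), "labels_")

# After fitting, labels_ exists but is an array, not a function.
fitted = DBSCAN().fit([[0.0], [0.1], [5.0]])
dbscan_after = supports_unseen_data(fitted, "labels_")
```

In both DBSCAN cases the helper reports that the estimator cannot predict on new data, which matches the note above: labels_ only holds results for the data the model was trained on.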
class jange.ops.base.SpacyBasedOperation(nlp: Optional[spacy.language.Language] = None, process_doc_fn: Callable = <function _noop_process_doc_fn>, name: str = 'spacy_op')

Base class for operations using spacy’s language model

Parameters:
  • nlp (Optional[Language]) – spacy’s language model. If None, the model defined in config.DEFAULT_SPACY_MODEL is used
  • process_doc_fn (Callable) – a function that accepts a document and a context and returns a tuple (object, context). The default is an identity function. This function is called for each document in the stream
  • name (str) – name of this operation
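
As an illustration of the process_doc_fn contract described above, here is a minimal sketch of a function that takes a document and a context and returns an (object, context) tuple. The name lowercase_tokens is hypothetical, and the stand-in document built from SimpleNamespace objects only mimics the one thing the function relies on: iterating a document yields tokens with a text attribute, as a spacy Doc does.

```python
from types import SimpleNamespace
from typing import Any, List, Tuple


def lowercase_tokens(doc, context: Any) -> Tuple[List[str], Any]:
    """Hypothetical process_doc_fn: returns lowercased token texts and the untouched context."""
    tokens = [token.text.lower() for token in doc]
    return tokens, context


# Stand-in for a spacy Doc (assumption: only iteration and token.text are needed).
fake_doc = [SimpleNamespace(text="Hello"), SimpleNamespace(text="World")]
result, ctx = lowercase_tokens(fake_doc, {"id": 1})
```

Under the signature documented above, such a function would presumably be passed as SpacyBasedOperation(process_doc_fn=lowercase_tokens); the context is returned unchanged so that it stays paired with its document.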
get_docs_stream(ds: jange.base.DataStream) → jange.base.DataStream

Returns a DataStream of spacy Docs. If the data stream already contains spacy Docs, they are returned as-is; otherwise the nlp object is used to create them.

Parameters: ds (DataStream) – input data stream
Returns: out – a DataStream containing an iterable of spacy’s Doc objects
Return type: DataStream
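
The pass-through behaviour described above can be sketched without spacy. This is an illustration only, not the library's code: ensure_docs, FakeDoc and the make_doc parameter are hypothetical stand-ins for the method, spacy's Doc class and the nlp object, and the sketch assumes that inspecting the first item is enough to decide whether the stream already holds documents.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Type, TypeVar

D = TypeVar("D")


@dataclass
class FakeDoc:
    """Stand-in for spacy's Doc (assumption, for illustration only)."""
    text: str


def ensure_docs(items: Iterable, make_doc: Callable[[str], D], doc_type: Type[D]) -> List[D]:
    """Return items unchanged if they are already documents, else convert each one."""
    items = list(items)
    if items and isinstance(items[0], doc_type):
        return items  # already documents: pass through as-is
    return [make_doc(item) for item in items]


# Raw strings are converted; existing documents are returned untouched.
from_text = ensure_docs(["a b", "c"], FakeDoc, FakeDoc)
already = [FakeDoc("x")]
unchanged = ensure_docs(already, FakeDoc, FakeDoc)
```

Skipping the conversion when documents are already present lets several spacy-based operations run in sequence without re-parsing the same text.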
class jange.ops.base.SpacyModelPicklerMixin

Class intended to be inherited by classes that use a spacy model so that the model itself is not pickled. Instead, only the path to the model is pickled.
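
One common way to get this behaviour is to override __getstate__ and __setstate__. The sketch below shows the idea with stand-ins and should not be read as the mixin's actual code: ModelPathPicklerMixin, FakeModel, load_model, TokenizeOp and the path "/models/en" are all hypothetical, and the sketch assumes the model is stored in an attribute named nlp that exposes a path.

```python
import pickle


class FakeModel:
    """Stand-in for a loaded spacy model that remembers where it came from."""
    def __init__(self, path: str):
        self.path = path


def load_model(path: str) -> FakeModel:
    """Hypothetical loader standing in for reloading a model from its path."""
    return FakeModel(path)


class ModelPathPicklerMixin:
    def __getstate__(self):
        # Copy the instance dict so the live object keeps its model.
        state = self.__dict__.copy()
        nlp = state.pop("nlp")
        state["model_path"] = str(nlp.path)  # keep only the path
        return state

    def __setstate__(self, state):
        state = dict(state)
        path = state.pop("model_path")
        self.__dict__.update(state)
        self.nlp = load_model(path)  # rebuild the model on unpickling


class TokenizeOp(ModelPathPicklerMixin):
    def __init__(self, nlp, name: str = "tokenize"):
        self.nlp = nlp
        self.name = name


if __name__ == "__main__":
    op = TokenizeOp(FakeModel("/models/en"))
    restored = pickle.loads(pickle.dumps(op))
    print(restored.name, restored.nlp.path)
```

Pickling only the path keeps the serialized operation small, at the cost of requiring the same model to be available at that path when the operation is loaded again.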