Base Operation Classes API
class jange.ops.base.ScikitBasedOperation(model, predict_fn_name: str, batch_size: int = 1000, name: str = 'sklearn_op')
Base class for operations that use scikit-learn’s estimators.
Variables:
- model (any sklearn Estimator) – the scikit-learn estimator instance wrapped by this operation
- predict_fn_name (str) – name of the method or attribute on the model used to obtain predictions. Usually this is transform, predict or kneighbors. For models that cannot predict on a new dataset, this should be the name of the attribute that holds the fitted output. E.g. for clustering models such as DBSCAN and AgglomerativeClustering it would be labels_, and for dimensionality-reduction approaches such as TSNE and SpectralEmbedding it would be embedding_
Example
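>>> import sklearn.linear_model as sklm
>>> import sklearn.decomposition as skdecomp
>>> import sklearn.cluster as skcluster
>>> from jange.ops.base import ScikitBasedOperation
>>> # classifier: predictions come from the predict method
>>> op1 = ScikitBasedOperation(sklm.SGDClassifier(), predict_fn_name="predict")
>>> # dimensionality reduction: output comes from the transform method
>>> op2 = ScikitBasedOperation(skdecomp.PCA(15), predict_fn_name="transform")
>>> # clustering without predict: results are read from the labels_ attribute
>>> op3 = ScikitBasedOperation(skcluster.DBSCAN(), predict_fn_name="labels_")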
can_predict_on_new
Returns whether the sklearn estimator can predict on unseen data.
It checks whether the given predict_fn_name is present on the model and, if it is, whether it is a function.
Note
Estimators that do not support prediction on unseen data populate an attribute such as labels_ or embedding_ only after the model has been trained.
Returns: whether the estimator can predict on a new dataset
Return type: bool
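The check described for can_predict_on_new can be sketched without jange or scikit-learn installed. This is a minimal illustration and not jange's actual implementation: the two stand-in classes and the free function `can_predict_on_new` are hypothetical, and the sketch assumes that "is a function" means "is callable".

```python
class StandInTransformer:
    """Hypothetical estimator that exposes a callable `transform`."""

    def transform(self, X):
        return X


class StandInClusterer:
    """Hypothetical estimator that only stores results in `labels_` after fitting."""

    def fit(self, X):
        self.labels_ = [0 for _ in X]
        return self


def can_predict_on_new(model, predict_fn_name: str) -> bool:
    # Assumption: the named attribute must exist on the model and be callable.
    return callable(getattr(model, predict_fn_name, None))


# A method such as `transform` is callable, so new data can be handled.
transformer_ok = can_predict_on_new(StandInTransformer(), "transform")

# `labels_` is plain data that appears after fitting, so it is not callable.
clusterer_ok = can_predict_on_new(StandInClusterer().fit([1, 2, 3]), "labels_")
```

With these stand-ins, `transformer_ok` is `True` and `clusterer_ok` is `False`, which mirrors the distinction between `transform`-style methods and fitted attributes such as `labels_`.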
class jange.ops.base.SpacyBasedOperation(nlp: Optional[spacy.language.Language] = None, process_doc_fn: Callable = <function _noop_process_doc_fn>, name: str = 'spacy_op')
Base class for operations that use spaCy’s language model.
Parameters:
- nlp (Optional[Language]) – spaCy’s language model. If None, the model defined in config.DEFAULT_SPACY_MODEL is used
- process_doc_fn (Callable) – a function that accepts a document and a context and returns a tuple (object, context). The default is an identity function. It is called for each document in the stream
- name (str) – name of this operation
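The process_doc_fn contract (a document and a context in, a tuple of object and context out) can be illustrated with a small sketch. The function names `noop_process_doc_fn` and `lowercase_tokens` are hypothetical, and a plain list of strings stands in for a spaCy Doc so that the snippet runs without spaCy installed.

```python
from typing import Any, List, Tuple


def noop_process_doc_fn(doc: Any, context: Any) -> Tuple[Any, Any]:
    # Identity behaviour: hand back the document and context unchanged.
    return doc, context


def lowercase_tokens(doc: Any, context: Any) -> Tuple[List[str], Any]:
    # Replace the document with a list of lowercased token strings and
    # pass the context through untouched. str(token) is used so that
    # plain strings work as stand-in tokens.
    return [str(token).lower() for token in doc], context


stand_in_doc = ["Hello", "World"]  # hypothetical stand-in for a spaCy Doc
result, ctx = lowercase_tokens(stand_in_doc, context=0)
```

A function of this shape would presumably be passed as `process_doc_fn` when constructing the operation; how the context value is populated is not specified here and is left as an assumption.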
get_docs_stream(ds: jange.base.DataStream) → jange.base.DataStream
Returns a DataStream of spaCy Docs. If the data stream already contains spaCy Docs, they are returned as-is; otherwise the nlp object is used to create them.
Parameters: ds (DataStream) – input data stream
Returns: out – a DataStream containing an iterable of spaCy’s Doc objects
Return type: DataStream
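The pass-through behaviour of get_docs_stream can be sketched as follows. This illustrates the idea only and is not jange's implementation: `StandInDoc`, `stand_in_nlp` and `docs_stream` are hypothetical names, a plain Python list stands in for DataStream, and the per-item type check is an assumption about how "already contains Docs" is decided.

```python
from typing import Callable, Iterable, Iterator, Union


class StandInDoc:
    """Hypothetical stand-in for a spaCy Doc."""

    def __init__(self, text: str):
        self.text = text


def stand_in_nlp(text: str) -> StandInDoc:
    # Plays the role of the nlp object: turns raw text into a document.
    return StandInDoc(text)


def docs_stream(
    items: Iterable[Union[str, StandInDoc]],
    nlp: Callable[[str], StandInDoc],
) -> Iterator[StandInDoc]:
    for item in items:
        # Already a document: yield it as-is. Otherwise create one with nlp.
        yield item if isinstance(item, StandInDoc) else nlp(item)


existing = StandInDoc("already parsed")
docs = list(docs_stream([existing, "raw text"], stand_in_nlp))
```

The existing document comes back as the same object, while the raw string is converted, so downstream operations can rely on receiving documents in either case.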