Nearest Neighbors Operation API

class jange.ops.neighbors.GroupingOperation(name: str = 'grouping')

Operation to group a list of pairs.

This operation is similar to clustering, but it takes a list of pairs as input. It builds a graph from the pairs and finds the connected components of that graph; each component becomes one group.

e.g. if there are pairs [("a", "b"), ("b", "c"), ("e", "f")], then the groups formed will be [{'a', 'b', 'c'}, {'e', 'f'}].

Each item in the DataStream should be a tuple that represents a pair, as follows: <item1, item2, *other_properties>. All entries in the tuple other than item1 and item2 are not used by the operation and are discarded. Typically, the output of ops.neighbors.SimilarPairOperation is passed to this operation.
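The grouping idea described above can be sketched in plain Python. This is only an illustration of building a graph from pairs and collecting its connected components, not jange's actual implementation; the function name group_pairs and the sample similarity scores are hypothetical.

```python
from collections import defaultdict


def group_pairs(pairs):
    """Group items linked by pairs into connected components.

    Hypothetical helper for illustration; not part of jange.
    """
    # Build an undirected adjacency list. Anything after the first
    # two entries of each tuple is ignored, mirroring the
    # <item1, item2, *other_properties> layout.
    graph = defaultdict(set)
    for item1, item2, *_ in pairs:
        graph[item1].add(item2)
        graph[item2].add(item1)

    seen = set()
    groups = []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        groups.append(component)
    return groups


# The third entry of each tuple stands in for an extra property
# (for example a similarity score) and is discarded.
pairs = [("a", "b", 0.95), ("b", "c", 0.91), ("e", "f", 0.88)]
groups = group_pairs(pairs)
print(groups)
```

Items never have to be compared directly to land in the same group: "a" and "c" are grouped because both are paired with "b".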

Parameters: name (str) – name of this operation, default grouping
Variables: name (str) – name of this operation

class jange.ops.neighbors.NearestNeighborsOperation(n_neighbors: int = 10, metric='cosine', name: str = 'nearest_neighbors')

class jange.ops.neighbors.SimilarPairOperation(sim_threshold=0.8, metric='cosine', n_neighbors=10, name: str = 'similar_pair')

Finds similar pairs.

This operation uses the nearest neighbors algorithms from the sklearn.neighbors package to find similar items in a dataset and convert them into pairs. Unlike nearest neighbors, where you get n_neighbors items for each item in the input, similar pairs returns each pair of items only once. The input data stream should contain a numpy array or a scipy sparse matrix.
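To make the difference from plain nearest neighbors concrete, here is a minimal sketch that calls sklearn.neighbors.NearestNeighbors directly. It assumes that similarity is computed as 1 minus the cosine distance and that a pair (i, j) is treated the same as (j, i); both are assumptions for illustration and may not match what jange does internally. The function name similar_pairs is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def similar_pairs(features, sim_threshold=0.8, n_neighbors=10):
    """Return distinct (i, j, similarity) tuples with i < j.

    Hypothetical helper for illustration; not part of jange.
    """
    # kneighbors cannot return more neighbors than there are samples.
    n_neighbors = min(n_neighbors, features.shape[0])
    model = NearestNeighbors(n_neighbors=n_neighbors, metric="cosine")
    model.fit(features)
    distances, indices = model.kneighbors(features)

    pairs = {}
    for i, (dists, neighbors) in enumerate(zip(distances, indices)):
        for dist, j in zip(dists, neighbors):
            j = int(j)
            if i == j:
                continue  # skip the item matching itself
            # Assumption: similarity = 1 - cosine distance.
            similarity = 1.0 - float(dist)
            if similarity >= sim_threshold:
                # Order the indices so (i, j) and (j, i) collapse
                # into a single entry.
                pairs[(min(i, j), max(i, j))] = similarity
    return [(a, b, s) for (a, b), s in sorted(pairs.items())]


features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pairs = similar_pairs(features, sim_threshold=0.9)
print(pairs)
```

Rows 0 and 1 point in nearly the same direction, as do rows 2 and 3, so each of those pairs is reported once rather than once per item.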

Variables:
  • sim_threshold (float) – minimum similarity that a pair should have to be considered similar
  • model – any model from the sklearn.neighbors package. default sklearn.neighbors.NearestNeighbors
  • name (str) – name of this operation. default similar_pair

Example

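>>> import numpy as np
>>> from jange import stream
>>> from jange.ops.neighbors import SimilarPairOperation
>>> features_ds = stream.DataStream(np.random.uniform(size=(20, 100)))
>>> op = SimilarPairOperation(sim_threshold=0.9)
>>> similar_pairs = features_ds.apply(op)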