Data Stream API

class jange.stream.DataStream(items: Iterable[Any], applied_ops: Optional[List[T]] = None, context: Optional[Iterable[Any]] = None)[source]

A class representing a stream of data. A data stream is created as the result of some operation. Iterating over a DataStream object iterates through the underlying data, which is stored in the items attribute and can be any iterable object.

Parameters:
  • items (iterable) – an iterable that contains the raw data
  • applied_ops (Optional[List[Operation]]) – a list of operations that were applied to create this stream of data

Example

>>> ds = DataStream(items=[1, 2, 3])
>>> print(list(ds))
[1, 2, 3]
Variables:
  • applied_ops (List[Operation]) – a list of operations that were applied to create this stream of data
  • items (iterable) – an iterable that contains the raw data
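
The iteration behaviour described for DataStream can be illustrated with a small stand-in class. This is a minimal sketch, not jange's implementation: SketchStream is a hypothetical name, and it assumes only what the description states, namely that iterating the stream iterates the underlying items.

```python
from typing import Any, Iterable, Iterator, List, Optional


class SketchStream:
    """Hypothetical stand-in for a DataStream-like wrapper (not jange's code)."""

    def __init__(self, items: Iterable[Any], applied_ops: Optional[List[Any]] = None):
        self.items = items
        # Assumption: default to an empty list when no operations are given.
        self.applied_ops = applied_ops or []

    def __iter__(self) -> Iterator[Any]:
        # Iterating the stream delegates to the underlying iterable.
        return iter(self.items)


stream = SketchStream(items=[1, 2, 3])
values = list(stream)

# A generator also works as `items`, but it can only be consumed once.
lazy = SketchStream(items=(n * n for n in range(3)))
first_pass = list(lazy)
second_pass = list(lazy)
```

Because items can be any iterable, a one-shot iterable such as a generator is exhausted after the first pass in this sketch. Whether jange's DataStream behaves the same way is not stated in this reference.
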
class jange.stream.CSVDataStream(path: str, columns: Union[str, List[str]], context_column: Optional[str] = None)[source]

Represents a stream of data read from a CSV file. The pandas library is used to read the file.

Parameters:
  • path (str) – path to the CSV file to read. This parameter is passed directly to the pandas.read_csv function
  • columns (Union[str, List[str]]) – a column name or a list of column names in the csv file. The values from the given column(s) are used to create a stream. If a list is passed then each item in the stream will be a list of values for the given columns in that order.

Example

>>> ds = CSVDataStream(path="news_articles.csv", columns=["body", "title"])
Variables:
  • df (pd.DataFrame) – a pandas DataFrame object created after reading the csv file
  • columns (Union[str, list]) – a list of column names or a single column name. This value is used to select data from those columns only
  • path (str) – path to the csv file
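
The selection rule described for columns can be sketched with the standard-library csv module. This is an illustration only: jange reads the file with pandas, and select_columns and the sample data below are hypothetical.

```python
import csv
import io
from typing import Any, Iterable, Iterator, List, Union


def select_columns(rows: Iterable[dict], columns: Union[str, List[str]]) -> Iterator[Any]:
    """Yield one item per row, mirroring the documented `columns` rule."""
    for row in rows:
        if isinstance(columns, str):
            # A single column name yields the bare value.
            yield row[columns]
        else:
            # A list of names yields a list of values in the given order.
            yield [row[name] for name in columns]


# Hypothetical CSV content standing in for a file on disk.
sample = "title,body\nFirst,Body one\nSecond,Body two\n"

single = list(select_columns(csv.DictReader(io.StringIO(sample)), "title"))
multiple = list(select_columns(csv.DictReader(io.StringIO(sample)), ["body", "title"]))
```

Here single holds one string per row, while multiple holds one two-element list per row, ordered as the columns were requested rather than as they appear in the file.
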
class jange.stream.DataFrameStream(df, columns: Union[str, List[str]], context_column: Optional[str] = None)[source]

Represents a stream of data by iterating over the rows in a pandas DataFrame object.

Parameters:
  • df (pd.DataFrame) – pandas DataFrame object
  • columns (Union[str, List[str]]) – a column name or a list of column names in the dataframe. The values from the given column(s) are used to create a stream. If a list is passed then each item in the stream will be a list of values for the given columns in that order.

Example

>>> df = pd.DataFrame([{"text": "text 1", "id": "1"}, {"text": "text 2", "id": "2"}])
>>> ds = DataFrameStream(df=df, columns="text")
>>> print(list(ds))
['text 1', 'text 2']
>>> ds = DataFrameStream(df=df, columns=["id", "text"])
>>> print(list(ds))
[['1', 'text 1'], ['2', 'text 2']]
Variables:
  • df (pd.DataFrame) – a pandas DataFrame object
  • columns (Union[str, list]) – a list of column names or a single column name. This value is used to select data from those columns only