Data Stream API

class DataStream(items: Iterable[Any], applied_ops: Optional[List[T]] = None, context: Optional[Iterable[Any]] = None)

A class representing a stream of data. A data stream is created as the result of some operation. A DataStream object can be iterated, which iterates over the underlying data. The underlying data is stored in the items attribute, which can be any iterable object.

  • items (iterable) – an iterable that contains the raw data
  • applied_ops (Optional[List[Operation]]) – a list of operations that were applied to create this stream of data


>>> ds = DataStream(items=[1, 2, 3])
>>> print(list(ds))
[1, 2, 3]
  • applied_ops (List[Operation]) – a list of operations that were applied to create this stream of data
  • items (iterable) – an iterable that contains the raw data
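
The iteration behaviour described above can be sketched with a small stand-in class. SketchStream is a hypothetical name used only for illustration; it is not the library's implementation, and it leaves out the context parameter, which this page does not describe.

```python
from typing import Any, Iterable, Iterator, List, Optional


class SketchStream:
    """Hypothetical stand-in that mimics the documented iteration behaviour."""

    def __init__(self, items: Iterable[Any], applied_ops: Optional[List[Any]] = None):
        self.items = items                    # any iterable holding the raw data
        self.applied_ops = applied_ops or []  # operations that produced this stream

    def __iter__(self) -> Iterator[Any]:
        # Iterating the stream simply iterates the underlying items.
        return iter(self.items)


stream = SketchStream(items=[1, 2, 3])
collected = list(stream)
```

If items is a one-shot iterator such as a generator, a wrapper like this can be consumed only once; a list or tuple can be iterated repeatedly.
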
class CSVDataStream(path: str, columns: Union[str, List[str]], context_column: Optional[str] = None)

Represents a stream of data read from a CSV file. The pandas library is used to read the file.

  • path (str) – path to the CSV file to read. This parameter is passed directly to the pandas.read_csv method
  • columns (Union[str, List[str]]) – a column name or a list of column names in the csv file. The values from the given column(s) are used to create a stream. If a list is passed then each item in the stream will be a list of values for the given columns in that order.


>>> ds = CSVDataStream(path="news_articles.csv", columns=["body", "title"])
  • df (pd.DataFrame) – a pandas DataFrame object created after reading the csv file
  • columns (Union[str, list]) – a list of column names or a single column name. This value is used to select data from those columns only
  • path (str) – path to the csv file
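
As a rough illustration of the column selection described above, the sketch below reads a small in-memory CSV with pandas.read_csv and builds the items by hand. The helper select_columns and the sample rows are assumptions made for this example; the actual class may select the values differently.

```python
import io
from typing import Any, List, Union

import pandas as pd


def select_columns(df: pd.DataFrame, columns: Union[str, List[str]]) -> List[Any]:
    """Hypothetical helper: one value per row for a single column name,
    one list of values per row for a list of column names."""
    if isinstance(columns, str):
        return df[columns].tolist()
    # Values keep the order in which the column names were passed.
    return df[columns].values.tolist()


# In-memory stand-in for a file such as "news_articles.csv" (made-up rows).
csv_text = "title,body\nFirst title,First body\nSecond title,Second body\n"
df = pd.read_csv(io.StringIO(csv_text))

single = select_columns(df, "title")
pairs = select_columns(df, ["body", "title"])
```

Here single holds one string per row, while pairs holds one two-element list per row with the body first, matching the order of the columns argument.
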
class DataFrameStream(df: pd.DataFrame, columns: Union[str, List[str]], context_column: Optional[str] = None)

Represents a stream of data by iterating over the rows in a pandas DataFrame object.

  • df (pd.DataFrame) – pandas DataFrame object
  • columns (Union[str, List[str]]) – a column name or a list of column names in the dataframe. The values from the given column(s) are used to create a stream. If a list is passed then each item in the stream will be a list of values for the given columns in that order.


>>> import pandas as pd
>>> df = pd.DataFrame([{"text": "text 1", "id": "1"}, {"text": "text 2", "id": "2"}])
>>> ds = DataFrameStream(df=df, columns="text")
>>> print(list(ds))
['text 1', 'text 2']
>>> ds = DataFrameStream(df=df, columns=["id", "text"])
>>> print(list(ds))
[['1', 'text 1'], ['2', 'text 2']]
  • df (pd.DataFrame) – a pandas DataFrame object
  • columns (Union[str, list]) – a list of column names or a single column name. This value is used to select data from those columns only
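
Row-by-row iteration over a DataFrame can also be done lazily. The generator below is a minimal sketch under that assumption; iter_rows is a hypothetical name, and nothing on this page confirms that DataFrameStream is implemented this way.

```python
from typing import Any, Iterator, List, Union

import pandas as pd


def iter_rows(df: pd.DataFrame, columns: Union[str, List[str]]) -> Iterator[Any]:
    """Hypothetical generator yielding one item per DataFrame row."""
    if isinstance(columns, str):
        # Single column: each item is the bare value.
        yield from df[columns]
    else:
        # Several columns: each item is a list of values in the given order.
        for row in df[columns].itertuples(index=False, name=None):
            yield list(row)


df = pd.DataFrame([{"text": "text 1", "id": "1"}, {"text": "text 2", "id": "2"}])
texts = list(iter_rows(df, "text"))
pairs = list(iter_rows(df, ["id", "text"]))
```

Because iter_rows is a generator, no intermediate list is built until the caller asks for one, which can matter for large frames.
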