Readers
The WarpRec data reading module provides a unified interface to ingest datasets for recommendation tasks. It is designed to be flexible and extensible, allowing users to load interaction data from different sources, including:
- Local files
- Azure Blob Storage
The module abstracts the underlying data source and returns a consistent DataFrame object that contains user-item interactions and, optionally, side information and clustering information. This ensures that downstream components, such as dataset splitters, models, and callbacks, can operate without concern for the original data format or storage location.
The backend is selected through configuration; the data format and requirements are identical regardless of the source. When using Azure Blob Storage, WarpRec automatically handles blob download or in-memory reading.
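The idea of a source-independent reader can be illustrated with a minimal sketch. It is not WarpRec's implementation: the function names `parse_table`, `read_local`, and `read_in_memory` are hypothetical, and the in-memory variant only stands in for a remote backend by parsing bytes that are assumed to have been downloaded already. The point is that a single parser serves every backend, so the format requirements stay the same.

```python
import csv
import io
import tempfile
from pathlib import Path


def parse_table(text_stream, sep=","):
    """Parse delimited text into a list of row dicts, whatever the source."""
    return list(csv.DictReader(text_stream, delimiter=sep))


def read_local(path, sep=","):
    """Local backend: open the file and hand the stream to the parser."""
    with open(path, newline="", encoding="utf-8") as handle:
        return parse_table(handle, sep)


def read_in_memory(payload, sep=","):
    """Stand-in for a remote backend: parse bytes already held in memory."""
    return parse_table(io.StringIO(payload.decode("utf-8")), sep)


# The same raw content, once written to disk and once kept in memory.
raw = b"user_id,item_id,rating\nu1,i1,4\nu2,i7,5\n"

with tempfile.TemporaryDirectory() as tmp:
    local_path = Path(tmp) / "interactions.csv"
    local_path.write_bytes(raw)
    local_rows = read_local(local_path)

memory_rows = read_in_memory(raw)
print(local_rows == memory_rows)
```

Both calls go through `parse_table`, so the rows are identical regardless of where the bytes came from.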
API Reference
For class signatures and parameters, see the Data Management API Reference.
Reading from a Single Source
WarpRec expects the data to be in a single file organized in a tabular format. The requirements and customization options for the raw data file are:
- Header and Columns:
  - A header with the following labels is expected (order is not important): user_id,item_id,rating,timestamp.
  - Column labels can be customized through configuration.
  - The file can contain more columns; only those with the configured names will be considered.
- Separators: Values must be split by a fixed separator, which can be customized (e.g., comma, tab, semicolon).
- Required Columns:
  - The rating column is required only for the explicit rating type.
  - The timestamp column is required only if a temporal strategy is used. Timestamps should ideally be provided in numeric format for full support; string formats are accepted but may result in unexpected errors.
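The rules above can be sketched with the standard library. This is an illustration rather than WarpRec's reader: `read_interactions`, its `labels` mapping (canonical name to the label used in the file), and the `need_rating` / `need_timestamp` flags are all hypothetical names chosen for this example.

```python
import csv
import io

# Canonical column names mapped to the labels expected in the file header.
DEFAULT_LABELS = {
    "user_id": "user_id",
    "item_id": "item_id",
    "rating": "rating",
    "timestamp": "timestamp",
}


def read_interactions(handle, sep=",", labels=None,
                      need_rating=False, need_timestamp=False):
    """Hypothetical helper: read only the configured columns of a delimited file."""
    labels = {**DEFAULT_LABELS, **(labels or {})}
    reader = csv.DictReader(handle, delimiter=sep)
    header = reader.fieldnames or []

    # user_id and item_id are always needed; the rest depends on the setup.
    required = ["user_id", "item_id"]
    if need_rating:
        required.append("rating")
    if need_timestamp:
        required.append("timestamp")
    missing = [labels[name] for name in required if labels[name] not in header]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")

    # Extra columns are ignored; column order in the file does not matter.
    present = [name for name in labels if labels[name] in header]
    return [{name: row[labels[name]] for name in present} for row in reader]


# Tab-separated file with custom labels, shuffled columns, and one extra column.
sample = io.StringIO("ts\tuser\titem\tgenre\n10\tu1\ti1\tdrama\n11\tu2\ti3\tcomedy\n")
rows = read_interactions(
    sample,
    sep="\t",
    labels={"user_id": "user", "item_id": "item", "timestamp": "ts"},
    need_timestamp=True,
)
print(rows)
```

Here the `genre` column is dropped because it is not among the configured names, and no rating column is demanded because `need_rating` is left off, mirroring the implicit-feedback case.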
Reading Pre-split Data
When reading pre-split data, WarpRec expects the split files to reside within the same directory (or virtual folder, in the case of Azure). The required directory structure is as follows:
split_dir/
├── train.tsv
├── validation.tsv
├── test.tsv
├── 1/
│   ├── train.tsv
│   └── validation.tsv
└── 2/
    ├── train.tsv
    └── validation.tsv
- Each individual file is expected to follow the same format as unsplit dataset files.
- In this setup, both the training (e.g., train.tsv) and test (e.g., test.tsv) sets must be provided.
- The train/validation folds (e.g., directories 1/, 2/) are optional.
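A quick way to check a directory against this layout before launching an experiment is sketched below. `discover_splits` is a hypothetical helper, not a WarpRec function. It assumes that the `.tsv` extension shown above is used, that fold directories are named with integers, and that the top-level validation file is optional (the text only states that train and test are mandatory).

```python
import tempfile
from pathlib import Path


def discover_splits(split_dir, ext=".tsv"):
    """Hypothetical helper: locate the split files and any numbered folds."""
    root = Path(split_dir)
    train = root / f"train{ext}"
    test = root / f"test{ext}"
    for required in (train, test):
        if not required.is_file():
            raise FileNotFoundError(f"required split file is missing: {required.name}")

    # Assumption: the top-level validation file may be absent.
    validation = root / f"validation{ext}"

    # Optional folds live in sub-directories named 1, 2, ...
    fold_dirs = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.isdecimal()),
        key=lambda p: int(p.name),
    )
    folds = [
        (d / f"train{ext}", d / f"validation{ext}")
        for d in fold_dirs
        if (d / f"train{ext}").is_file() and (d / f"validation{ext}").is_file()
    ]
    return {
        "train": train,
        "validation": validation if validation.is_file() else None,
        "test": test,
        "folds": folds,
    }


# Build the example layout in a temporary directory and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in ("train.tsv", "validation.tsv", "test.tsv",
                "1/train.tsv", "1/validation.tsv",
                "2/train.tsv", "2/validation.tsv"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    layout = discover_splits(root)
    fold_names = [fold_train.parent.name for fold_train, _ in layout["folds"]]
    has_validation = layout["validation"] is not None

print(fold_names, has_validation)
```

For Azure virtual folders the same check would have to run against blob names instead of filesystem paths; that part is outside this sketch.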
Reading Side Information
Side information is used to train certain models and evaluate specific metrics. WarpRec expects the side information file to be formatted as:
- Column Ordering is Crucial:
  - The first column must contain the item ID.
  - All other columns will be interpreted as features.
- Data Type: WarpRec expects all feature data in this file to be numerical. The user must provide preprocessed input.
- Error Handling: During the configuration evaluation process, you will be notified if you attempt to use a model that requires side information but none has been provided. In that case, the experiment will be terminated.
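Because the features must already be numerical, it can help to validate a side information file before handing it to the framework. The following is a sketch under stated assumptions, not WarpRec code: `read_side_information` is a hypothetical helper, and it assumes the file has a header row and a comma separator by default.

```python
import csv
import io


def read_side_information(handle, sep=","):
    """Hypothetical check: first column is the item ID, the rest are numeric features."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    feature_names = header[1:]  # everything after the item ID column

    features = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        item_id, raw_values = row[0], row[1:]
        try:
            features[item_id] = [float(value) for value in raw_values]
        except ValueError as exc:
            raise ValueError(
                f"line {line_no}: non-numeric feature value for item {item_id!r}"
            ) from exc
    return feature_names, features


sample = io.StringIO("item_id,year,is_comedy\ni1,1999,0\ni2,2004,1\n")
names, side_info = read_side_information(sample)
print(names, side_info)
```

A categorical value such as a genre string would raise a `ValueError` here, which is the cue to encode it numerically during preprocessing.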
Reading Clustering Information
When reading clustering information, WarpRec expects the file to be formatted as follows:
- Header: The header is important and needs to be consistent with the other files.
- Cluster Numbering: The clusters must be numbered starting from 1, as cluster 0 is reserved as a fallback.
  - If the numbering is incorrect, the framework corrects it automatically.
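How the framework corrects the numbering is not described here. One plausible approach, shown purely as an assumption, is to map the distinct cluster labels onto 1..K in sorted order and keep 0 for entities without an assignment. `renumber_clusters` and `FALLBACK_CLUSTER` are hypothetical names.

```python
FALLBACK_CLUSTER = 0  # reserved for entities that have no cluster assignment


def renumber_clusters(assignments):
    """Hypothetical sketch: remap arbitrary integer labels onto 1..K."""
    distinct = sorted(set(assignments.values()))
    mapping = {old: new for new, old in enumerate(distinct, start=1)}
    return {entity: mapping[label] for entity, label in assignments.items()}


# Labels start at 0 and skip values, so they violate the expected numbering.
raw = {"u1": 0, "u2": 5, "u3": 5, "u4": 2}
clean = renumber_clusters(raw)
print(clean)

# An entity missing from the file falls back to cluster 0.
print(clean.get("u9", FALLBACK_CLUSTER))
```

Providing clusters already numbered from 1 avoids relying on any automatic correction and keeps the mapping between the file and the results easy to trace.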