Dataset - API Reference¶

`warprec.data.dataset.Dataset` ¶

The definition of the Dataset class that will handle transaction data.

Parameters:

Name	Type	Description	Default
`train_data`	`FrameT`	The train data.	required
`eval_data`	`Optional[FrameT]`	The evaluation data.	`None`
`side_data`	`Optional[FrameT]`	The side information data.	`None`
`user_cluster`	`Optional[FrameT]`	The user cluster data.	`None`
`item_cluster`	`Optional[FrameT]`	The item cluster data.	`None`
`batch_size`	`int`	The batch size that will be used evaluation.	`1024`
`rating_type`	`RatingType`	The type of rating used.	`IMPLICIT`
`user_id_label`	`str`	The label of the user id column.	`'user_id'`
`item_id_label`	`str`	The label of the item id column.	`'item_id'`
`rating_label`	`str`	The label of the rating column.	`None`
`timestamp_label`	`str`	The label of the timestamp column.	`None`
`cluster_label`	`str`	The label of the cluster column.	`None`
`context_labels`	`Optional[List[str]]`	The list of labels of the contextual data.	`None`
`evaluation_set`	`str`	The type of evaluation set. Can either be 'Test' or 'Validation'.	`'Test'`

Attributes:

Name	Type	Description
`train_set`	`Interactions`	Training set used by recommendation models.
`eval_set`	`Interactions`	Evaluation set used by recommendation models.
`train_session`	`Sessions`	Training session used by sequential models.
`user_cluster`	`Optional[dict]`	User cluster information.
`item_cluster`	`Optional[dict]`	Item cluster information.

Raises:

Type	Description
`ValueError`	If the evaluation_set is not supported.

Source code in warprec/data/dataset.py

class Dataset:
    """The definition of the Dataset class that will handle transaction data.

    Args:
        train_data (FrameT): The train data.
        eval_data (Optional[FrameT]): The evaluation data.
        side_data (Optional[FrameT]): The side information data.
        user_cluster (Optional[FrameT]): The user cluster data.
        item_cluster (Optional[FrameT]): The item cluster data.
        batch_size (int): The batch size that will be used evaluation.
        rating_type (RatingType): The type of rating used.
        user_id_label (str): The label of the user id column.
        item_id_label (str): The label of the item id column.
        rating_label (str): The label of the rating column.
        timestamp_label (str): The label of the timestamp column.
        cluster_label (str): The label of the cluster column.
        context_labels (Optional[List[str]]): The list of labels of the
            contextual data.
        evaluation_set (str): The type of evaluation set. Can either be 'Test'
            or 'Validation'.

    Attributes:
        train_set (Interactions): Training set used by recommendation models.
        eval_set (Interactions): Evaluation set used by recommendation models.
        train_session (Sessions): Training session used by sequential models.
        user_cluster (Optional[dict]): User cluster information.
        item_cluster (Optional[dict]): Item cluster information.

    Raises:
        ValueError: If the evaluation_set is not supported.
    """

    train_set: Interactions = None
    eval_set: Interactions = None
    train_session: Sessions = None
    user_cluster: Optional[dict] = None
    item_cluster: Optional[dict] = None

    def __init__(
        self,
        train_data: FrameT,
        eval_data: Optional[FrameT] = None,
        side_data: Optional[FrameT] = None,
        user_cluster: Optional[FrameT] = None,
        item_cluster: Optional[FrameT] = None,
        batch_size: int = 1024,
        rating_type: RatingType = RatingType.IMPLICIT,
        user_id_label: str = "user_id",
        item_id_label: str = "item_id",
        rating_label: str = None,
        timestamp_label: str = None,
        cluster_label: str = None,
        context_labels: Optional[List[str]] = None,
        evaluation_set: str = "Test",
    ):
        # Check evaluation set
        if evaluation_set not in ["Test", "Validation"]:
            raise ValueError("Evaluation set must be either 'Test' or 'Validation'.")

        # Materialize the data and ensure
        mat_train_data = self._materialize(train_data)
        mat_eval_data = self._materialize(eval_data)
        mat_side_data = self._materialize(side_data)
        mat_user_cluster = self._materialize(user_cluster)
        mat_item_cluster = self._materialize(item_cluster)

        # Initializing variables
        self._nuid: int = 0
        self._niid: int = 0
        self._nfeat: int = 0
        self._max_seq_len: int = 0
        self._umap: dict[Any, int] = {}
        self._imap: dict[Any, int] = {}
        self._feature_maps: dict[str, dict[Any, int]] = {}
        self._feature_dims: dict[str, int] = {}
        self._context_maps: dict[str, dict[Any, int]] = {}
        self._context_dims: dict[str, int] = {}
        self._feat_lookup: Tensor = None
        self._uc: Tensor = None
        self._ic: Tensor = None
        self._stash: Dict[
            str, Any
        ] = {}  # Stash will be used by the user for custom needs

        # Initialize the dataset dataloader
        self._precomputed_dataloader: Dict[str, Any] = {}

        # If side information data has been provided, we filter the main dataset
        if mat_side_data is not None:
            mat_train_data, mat_eval_data = self._filter_data(
                train_set=mat_train_data,
                eval_set=mat_eval_data,
                filter_data=mat_side_data,
                label=item_id_label,
            )

        # Define dimensions that will lead the experiment
        self._nuid = mat_train_data.select(nw.col(user_id_label).n_unique()).item()
        self._niid = mat_train_data.select(nw.col(item_id_label).n_unique()).item()
        self._nfeat = (
            (len(mat_side_data.columns) - 1) if mat_side_data is not None else 0
        )
        self.batch_size = batch_size

        # Values that will be used to calculate mappings
        _uid = (
            mat_train_data.select(user_id_label)
            .unique()
            .sort(user_id_label)
            .to_dict(as_series=False)[user_id_label]
        )
        _iid = (
            mat_train_data.select(item_id_label)
            .unique()
            .sort(item_id_label)
            .to_dict(as_series=False)[item_id_label]
        )

        # Calculate mapping for users and items
        self._umap = {user: i for i, user in enumerate(_uid)}
        self._imap = {item: i for i, item in enumerate(_iid)}

        # Process contextual data
        if context_labels:
            mat_train_data = self._process_context_data(
                mat_train_data, context_labels, fit=True
            )

            if mat_eval_data is not None:
                mat_eval_data = self._process_context_data(
                    mat_eval_data, context_labels, fit=False
                )

        # Process the side information data and filter not valid columns
        self.side = None
        if mat_side_data is not None:
            self.side = self._process_side_data(mat_side_data, item_id_label)

        # Save user and item cluster information inside the dataset
        if mat_user_cluster is not None:
            uc_dict = mat_user_cluster.select(user_id_label, cluster_label).to_dict(
                as_series=False
            )
            self.user_cluster = {
                self._umap[user_id]: cluster
                for user_id, cluster in zip(
                    uc_dict[user_id_label], uc_dict[cluster_label]
                )
                if user_id in self._umap
            }
        else:
            self.user_cluster = None

        if mat_item_cluster is not None:
            ic_dict = mat_item_cluster.select(item_id_label, cluster_label).to_dict(
                as_series=False
            )
            self.item_cluster = {
                self._imap[item_id]: cluster
                for item_id, cluster in zip(
                    ic_dict[item_id_label], ic_dict[cluster_label]
                )
                if item_id in self._imap
            }
        else:
            self.item_cluster = None

        # Pre compute lookup tensors for user clusters
        if self.user_cluster is not None:
            unique_user_clusters = sorted(set(self.user_cluster.values()))
            user_cluster_remap = {
                cud: idx + 1 for idx, cud in enumerate(unique_user_clusters)
            }
            self._uc = torch.zeros(self._nuid, dtype=torch.long)
            for u, c in self.user_cluster.items():
                self._uc[u] = user_cluster_remap[c]
        else:
            self._uc = torch.ones(self._nuid, dtype=torch.long)

        # Pre compute lookup tensors for item clusters
        if self.item_cluster is not None:
            unique_item_clusters = sorted(set(self.item_cluster.values()))
            item_cluster_remap = {
                cid: idx + 1 for idx, cid in enumerate(unique_item_clusters)
            }
            self._ic = torch.zeros(self._niid, dtype=torch.long)
            for i, c in self.item_cluster.items():
                self._ic[i] = item_cluster_remap[c]
        else:
            self._ic = torch.ones(self._niid, dtype=torch.long)

        # Add padding value to item clusters
        padding_value = torch.tensor([0], dtype=torch.long)
        self._ic = torch.cat((self._ic, padding_value), dim=0)

        # Create the main data structures
        self.train_set = self._create_inner_set(
            mat_train_data,
            side_data=self.side,
            user_cluster=self.user_cluster,
            item_cluster=self.item_cluster,
            batch_size=batch_size,
            rating_type=rating_type,
            rating_label=rating_label,
            timestamp_label=timestamp_label,
            context_labels=context_labels,
        )

        if mat_eval_data is not None:
            self.eval_set = self._create_inner_set(
                mat_eval_data,
                side_data=self.side,
                user_cluster=self.user_cluster,
                item_cluster=self.item_cluster,
                header_msg=evaluation_set,
                batch_size=batch_size,
                rating_type=rating_type,
                rating_label=rating_label,
                context_labels=context_labels,
            )

        # Save side information inside the dataset
        if self.side is not None:
            native_ns = nw.get_native_namespace(self.side)

            mapping_data = {
                item_id_label: list(self._imap.keys()),
                "__item_idx__": list(self._imap.values()),
            }
            mapping_df = nw.from_dict(mapping_data, native_namespace=native_ns)

            # Align the data
            aligned_side = mapping_df.join(self.side, on=item_id_label, how="left")
            aligned_side = aligned_side.sort("__item_idx__")

            feat_cols = [c for c in self.side.columns if c != item_id_label]

            side_values = (
                aligned_side.select(feat_cols)
                .with_columns(nw.all().fill_null(0))
                .to_numpy()
            )

            feature_lookup = torch.tensor(side_values).float()

            # Padding row
            padding_row = torch.zeros((1, feature_lookup.size(1)))
            self._feat_lookup = torch.cat([feature_lookup, padding_row], dim=0)

        # Sequential recommendation sessions
        self.train_session = Sessions(
            mat_train_data,
            self._umap,
            self._imap,
            self.train_set.get_sparse(),
            user_id_label=user_id_label,
            item_id_label=item_id_label,
            timestamp_label=timestamp_label,
            context_labels=context_labels,
        )

    def _materialize(self, data: Optional[FrameT]) -> Optional[DataFrame[Any]]:
        """Utility method to correctly materialize the input data.

        Args:
            data (Optional[FrameT]): The data to materialize.

        Returns:
            Optional[DataFrame[Any]]: The materialized data.
        """
        if data is None:
            return None

        # Ensure the data wrapped in Narwhals
        nw_data = nw.from_native(data, pass_through=True)

        # Materialize the data if in Lazy format
        if isinstance(nw_data, nw.LazyFrame):
            return nw_data.collect()

        # Return normal Eager DataFrame
        return nw_data

    def _filter_data(
        self,
        train_set: DataFrame[Any],
        eval_set: Optional[DataFrame[Any]],
        filter_data: DataFrame[Any],
        label: str,
    ) -> Tuple[DataFrame[Any], DataFrame[Any]]:
        """Filter the data based on a given additional information set and label.

        Args:
            train_set (DataFrame[Any]): The train set.
            eval_set (Optional[DataFrame[Any]]): The evaluation set.
            filter_data (DataFrame[Any]): The additional information dataset.
            label (str): The label used to filter the data.

        Returns:
            Tuple[DataFrame[Any], DataFrame[Any]]:
                - DataFrame[Any]: The filtered train set.
                - DataFrame[Any]: The filtered evaluation set.
        """
        train_vals = set(
            train_set.select(label).unique().to_dict(as_series=False)[label]
        )
        filter_vals = set(
            filter_data.select(label).unique().to_dict(as_series=False)[label]
        )

        shared_data = list(train_vals.intersection(filter_vals))

        # Count the number of data points before filtering
        train_data_before_filter = train_set.select(nw.col(label).n_unique()).item()

        # Filter all the data based on data points present in both train data and filter.
        train_set = train_set.filter(nw.col(label).is_in(shared_data))
        if eval_set is not None:
            eval_set = eval_set.filter(nw.col(label).is_in(shared_data))

        # Count the number of data points after filtering
        train_data_after_filter = train_set.select(nw.col(label).n_unique()).item()

        logger.attention(
            ""
            f"Filtered out {train_data_before_filter - train_data_after_filter} {label}."
        )

        return train_set, eval_set

    def _create_inner_set(
        self,
        data: DataFrame[Any],
        side_data: Optional[DataFrame[Any]] = None,
        user_cluster: Optional[dict] = None,
        item_cluster: Optional[dict] = None,
        header_msg: str = "Train",
        batch_size: int = 1024,
        rating_type: RatingType = RatingType.IMPLICIT,
        rating_label: str = None,
        timestamp_label: str = None,
        context_labels: Optional[List[str]] = None,
    ) -> Interactions:
        """Functionality to create Interaction data from DataFrame.

        Args:
            data (DataFrame[Any]): The data used to create the interaction object.
            side_data (Optional[DataFrame[Any]]): The side data information about the dataset.
            user_cluster (Optional[dict]): The user cluster information.
            item_cluster (Optional[dict]): The item cluster information.
            header_msg (str): The header of the logger output.
            batch_size (int): The batch size of the interaction.
            rating_type (RatingType): The type of rating used.
            rating_label (str): The label of the rating column.
            timestamp_label (str): The label of the timestamp column.
            context_labels (Optional[List[str]]): The list of labels of the
                contextual data.

        Returns:
            Interactions: The final interaction object.
        """
        inter_set = Interactions(
            data,
            (self._nuid, self._niid),
            self._umap,
            self._imap,
            side_data=side_data,
            user_cluster=user_cluster,
            item_cluster=item_cluster,
            batch_size=batch_size,
            rating_type=rating_type,
            rating_label=rating_label,
            timestamp_label=timestamp_label,
            context_labels=context_labels,
        )
        nuid, niid = inter_set.get_dims()
        transactions = inter_set.get_transactions()
        logger.stat_msg(
            (
                f"Number of users: {nuid}      "
                f"Number of items: {niid}      "
                f"Transactions: {transactions}"
            ),
            f"{header_msg} set",
        )

        return inter_set

    def _process_side_data(
        self, side_data: DataFrame[Any], item_id_label: str
    ) -> Optional[DataFrame[Any]]:
        """Process side information data.

        It maps categorical/numerical values to integer indices suitable for Embeddings.

        Args:
            side_data (DataFrame[Any]): The side data DataFrame.
            item_id_label (str): The label of the item ID.

        Returns:
            Optional[DataFrame[Any]]: The processed DataFrame with integer indices.

        Raises:
            ValueError: If the item ID label is not found in side_data.
        """
        # Check for item ID label
        if item_id_label not in side_data.columns:
            raise ValueError("Item ID label not found inside side information data.")

        df_processed = side_data.clone()

        # Identify feature columns (all except item_id_label)
        feature_cols = [c for c in df_processed.columns if c != item_id_label]

        if not feature_cols:
            logger.negative("No feature columns found in side data.")
            return None

        # Iterate over each feature column to create mappings and transform data
        for col in feature_cols:
            # Fill missing data with a placeholder
            df_processed = df_processed.with_columns(nw.col(col).fill_null("UNK"))

            # Create a mapping from unique values to integers
            unique_vals = (
                df_processed.select(col)
                .unique()
                .sort(col)
                .to_dict(as_series=False)[col]
            )

            # Map: value -> index (starting from 1, reserving 0 for unknown/padding)
            mapping = {val: i + 1 for i, val in enumerate(unique_vals)}

            # Save the mapping
            self._feature_maps[col] = mapping

            # Save the dimension (vocab size + 1 for UNK)
            self._feature_dims[col] = len(unique_vals) + 1

            # Apply the mapping to the columns
            map_df = nw.from_dict(
                {col: list(mapping.keys()), f"{col}_idx": list(mapping.values())},
                native_namespace=nw.get_native_namespace(df_processed),
            )

            # Join and replace
            df_processed = df_processed.join(map_df, on=col, how="left")

            # Fill nulls (if any value wasn't in mapping, though here we built mapping from data)
            # and replace original column
            df_processed = df_processed.with_columns(
                nw.col(f"{col}_idx").fill_null(0).cast(nw.Int64).alias(col)
            ).drop(f"{col}_idx")

            logger.msg(f"Side Feature '{col}': found {len(unique_vals)} unique values.")

        return df_processed

    def _process_context_data(
        self, df: DataFrame[Any], context_labels: List[str], fit: bool = False
    ) -> DataFrame[Any]:
        """Processes context columns: creates mappings (if fit=True) and converts
        values to integers.

        Strategy:
        - Index 0 is reserved for <UNK> (Unknown) or Padding.
        - Actual values start from index 1.

        Args:
            df (DataFrame[Any]): The DataFrame containing transaction data
                with contextual information.
            context_labels (List[str]): The labels of the columns containing
                the contextual information.
            fit (bool): Wether or not to fit the dataset on the DataFrame
                data. Usually used on the train set.

        Returns:
            DataFrame[Any]: The processed data.

        Raises:
            ValueError: If the contextual column names are not
                present the the DataFrame.
        """
        df_processed = df.clone()

        for col in context_labels:
            if col not in df_processed.columns:
                raise ValueError(f"Context label '{col}' not found in DataFrame.")

            if fit:
                # Create mapping based on unique values in this column
                uniques = (
                    df_processed.select(col)
                    .unique()
                    .sort(col)
                    .to_dict(as_series=False)[col]
                )
                # Start from 1, reserve 0 for Unknown
                mapping = {val: i + 1 for i, val in enumerate(uniques)}

                self._context_maps[col] = mapping
                # Dimension is len(uniques) + 1 (for the UNK token)
                self._context_dims[col] = len(uniques) + 1

                logger.msg(f"Context '{col}': found {len(uniques)} unique values.")
            else:
                # Use existing mapping
                mapping = self._context_maps.get(col)
                if mapping is None:
                    raise ValueError(
                        f"Mapping for context '{col}' not found. Fit on train first."
                    )

            # Apply mapping
            # Values not in mapping become NaN, then fill with 0 (UNK), then cast to int
            map_df = nw.from_dict(
                {col: list(mapping.keys()), f"{col}_idx": list(mapping.values())},
                native_namespace=nw.get_native_namespace(df_processed),
            )

            # Join
            df_processed = df_processed.join(map_df, on=col, how="left")

            # Optional: Check for OOV (Out Of Vocabulary) in Eval
            if not fit:
                # Check for nulls in the mapped column before filling
                num_unk = df_processed.select(
                    nw.col(f"{col}_idx").is_null().sum()
                ).item()
                if num_unk > 0:
                    logger.attention(
                        f"Context '{col}': found {num_unk} unknown values in Eval set (mapped to 0)."
                    )

            # Finalize column
            df_processed = df_processed.with_columns(
                nw.col(f"{col}_idx").fill_null(0).cast(nw.Int64).alias(col)
            ).drop(f"{col}_idx")

        return df_processed

    def get_evaluation_dataloader(self, **kwargs: Any) -> DataLoader:
        """Retrieve the full evaluation DataLoader for the dataset.

        Args:
            **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

        Returns:
            DataLoader: DataLoader that yields batches of interactions
                (eval_batch, user_indices).
        """
        key = f"full_{self._serialize_dataloader_kwargs(kwargs)}"
        if key not in self._precomputed_dataloader:
            eval_sparse = self.eval_set.get_sparse()

            dataset = EvaluationDataset(eval_interactions=eval_sparse)
            self._precomputed_dataloader[key] = DataLoader(
                dataset,
                batch_size=self.batch_size,
                shuffle=False,
                **kwargs,
            )

        return self._precomputed_dataloader[key]

    def get_sampled_evaluation_dataloader(
        self,
        num_negatives: int = 99,
        seed: int = 42,
        **kwargs: Any,
    ) -> DataLoader:
        """Retrieve the sampled evaluation DataLoader for the dataset.

        Args:
            num_negatives (int): Number of negative samples per user.
            seed (int): Random seed for negative sampling.
            **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

        Returns:
            DataLoader: DataLoader that yields batches
                of interactions (pos_items, neg_items, user_indices)
        """
        key = (
            f"sampled_{num_negatives}_{seed}_"
            f"{self._serialize_dataloader_kwargs(kwargs)}"
        )

        if key not in self._precomputed_dataloader:
            train_sparse = self.train_set.get_sparse()
            eval_sparse = self.eval_set.get_sparse()

            dataset = SampledEvaluationDataset(
                train_interactions=train_sparse,
                eval_interactions=eval_sparse,
                num_negatives=num_negatives,
                seed=seed,
            )
            self._precomputed_dataloader[key] = DataLoader(
                dataset,
                batch_size=self.batch_size,
                shuffle=False,
                collate_fn=dataset.collate_fn,
                **kwargs,
            )

        return self._precomputed_dataloader[key]

    def get_contextual_evaluation_dataloader(self, **kwargs: Any) -> DataLoader:
        """Retrieve the full contextual evaluation DataLoader for the dataset.

        This loader is specific for Context-Aware Recommender Systems.
        It iterates over transactions (User, Item, Context) instead of Users.

        Args:
            **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

        Returns:
            DataLoader: The contextual data loader.
        """
        key = f"full_contextual_{self._serialize_dataloader_kwargs(kwargs)}"
        if key not in self._precomputed_dataloader:
            eval_data = self.eval_set.get_df()

            # Retrieve labels to pre-compute eval data
            user_label = self.eval_set.user_label
            item_label = self.eval_set.item_label
            context_labels = self.eval_set.context_labels

            # Map the evaluation dataset to ensure consistency
            # Create mapping DFs
            u_map_df = nw.from_dict(
                {
                    user_label: list(self._umap.keys()),
                    f"{user_label}_idx": list(self._umap.values()),
                },
                native_namespace=nw.get_native_namespace(eval_data),
            )

            i_map_df = nw.from_dict(
                {
                    item_label: list(self._imap.keys()),
                    f"{item_label}_idx": list(self._imap.values()),
                },
                native_namespace=nw.get_native_namespace(eval_data),
            )

            # Apply mappings via Join
            eval_data = eval_data.join(u_map_df, on=user_label, how="inner").join(
                i_map_df, on=item_label, how="inner"
            )

            # Sort to ensure reproducibility
            eval_data = eval_data.sort(f"{user_label}_idx", f"{item_label}_idx")

            # Replace columns with mapped indices
            eval_data = eval_data.with_columns(
                nw.col(f"{user_label}_idx").alias(user_label),
                nw.col(f"{item_label}_idx").alias(item_label),
            ).drop(f"{user_label}_idx", f"{item_label}_idx")

            # dropna is implicit in inner join for the mapped keys,
            # but if we need to drop other nulls:
            # eval_data = eval_data.drop_nulls(subset=[user_label, item_label])

            dataset = ContextualEvaluationDataset(
                eval_data=eval_data,
                user_id_label=user_label,
                item_id_label=item_label,
                context_labels=context_labels,
            )
            self._precomputed_dataloader[key] = DataLoader(
                dataset, batch_size=self.batch_size, shuffle=False, **kwargs
            )

        return self._precomputed_dataloader[key]

    def get_sampled_contextual_evaluation_dataloader(
        self,
        num_negatives: int = 99,
        seed: int = 42,
        **kwargs: Any,
    ) -> DataLoader:
        """Retrieve the sampled contextual evaluation DataLoader for the dataset.

        Args:
            num_negatives (int): Number of negative samples per transaction.
            seed (int): Random seed.
            **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

        Returns:
            DataLoader: The sampled contextual loader.
        """
        key = (
            f"sampled_contextual_{num_negatives}_{seed}_"
            f"{self._serialize_dataloader_kwargs(kwargs)}"
        )

        if key not in self._precomputed_dataloader:
            train_sparse = self.train_set.get_sparse()
            eval_data = self.eval_set.get_df()

            # Retrieve labels to pre-compute eval data
            user_label = self.eval_set.user_label
            item_label = self.eval_set.item_label
            context_labels = self.eval_set.context_labels

            # Map the evaluation dataset
            u_map_df = nw.from_dict(
                {
                    user_label: list(self._umap.keys()),
                    f"{user_label}_idx": list(self._umap.values()),
                },
                native_namespace=nw.get_native_namespace(eval_data),
            )

            i_map_df = nw.from_dict(
                {
                    item_label: list(self._imap.keys()),
                    f"{item_label}_idx": list(self._imap.values()),
                },
                native_namespace=nw.get_native_namespace(eval_data),
            )

            eval_data = eval_data.join(u_map_df, on=user_label, how="inner").join(
                i_map_df, on=item_label, how="inner"
            )

            # Sort to ensure reproducibility
            eval_data = eval_data.sort(f"{user_label}_idx", f"{item_label}_idx")

            eval_data = eval_data.with_columns(
                nw.col(f"{user_label}_idx").alias(user_label),
                nw.col(f"{item_label}_idx").alias(item_label),
            ).drop(f"{user_label}_idx", f"{item_label}_idx")

            dataset = SampledContextualEvaluationDataset(
                train_interactions=train_sparse,
                eval_data=eval_data,
                user_id_label=user_label,
                item_id_label=item_label,
                context_labels=context_labels,
                num_items=self._niid,
                num_negatives=num_negatives,
                seed=seed,
            )
            self._precomputed_dataloader[key] = DataLoader(
                dataset,
                batch_size=self.batch_size,
                shuffle=False,
                collate_fn=dataset.collate_fn,
                **kwargs,
            )

        return self._precomputed_dataloader[key]

    @staticmethod
    def _serialize_dataloader_kwargs(kwargs: dict[str, Any]) -> str:
        """Serialize DataLoader kwargs into a deterministic cache-key suffix."""
        if not kwargs:
            return "default"

        return "_".join(f"{name}={value}" for name, value in sorted(kwargs.items()))

    def get_dims(self) -> Tuple[int, int]:
        """Returns the dimensions of the data.

        Returns:
            Tuple[int, int]:
                int: Number of unique user_ids.
                int: Number of unique item_ids.
        """
        return (self._nuid, self._niid)

    def get_feature_dims(self) -> Dict[str, int]:
        """Returns the dimensions (vocab size) of each side information feature.

        Returns:
            Dict[str, int]: A dictionary containing the name of the
                feature and the vocab size.
        """
        return self._feature_dims

    def get_context_dims(self) -> Dict[str, int]:
        """Returns the dimensions (vocab size) of each context feature.

        Returns:
            Dict[str, int]: A dictionary containing the name of the
                context feature and the vocab size.
        """
        return self._context_dims

    def get_mappings(self) -> Tuple[dict, dict]:
        """Returns the mapping used for this dataset.

        Returns:
            Tuple[dict, dict]:
                dict: Mapping of user_id -> user_idx.
                dict: Mapping of item_id -> item_idx.
        """
        return (self._umap, self._imap)

    def get_inverse_mappings(self) -> Tuple[dict, dict]:
        """Returns the inverse of the mapping.

        Returns:
            Tuple[dict, dict]:
                dict: Mapping of user_idx -> user_id.
                dict: Mapping of item_idxs -> item_id.
        """
        return {v: k for k, v in self._umap.items()}, {
            v: k for k, v in self._imap.items()
        }

    def get_features_lookup(self) -> Optional[Tensor]:
        """This method retrieves the lookup tensor for side information features.

        Returns:
            Optional[Tensor]: Lookup tensor for side information features.
        """
        return self._feat_lookup

    def get_user_cluster(self) -> Tensor:
        """This method retrieves the lookup tensor for user clusters.

        Returns:
            Tensor: Lookup tensor for user clusters.
        """
        return self._uc

    def get_item_cluster(self) -> Tensor:
        """This method retrieves the lookup tensor for item clusters.

        Returns:
            Tensor: Lookup tensor for item clusters.
        """
        return self._ic

    def get_stash(self) -> Dict[str, Any]:
        """This method retrieves the stash dictionary
        containing all user custom structures.

        Returns:
            Dict[str, Any]: The stash dictionary.
        """
        return self._stash

    def info(self) -> dict:
        """This method returns the main information of the
        dataset in dict format.

        Returns:
            dict: The dictionary with the main information of
                the dataset.
        """
        base_info = {
            "n_items": self._niid,
            "n_users": self._nuid,
            "n_features": self._nfeat,
            "item_mapping": self._imap,
            "user_mapping": self._umap,
        }

        # Optionally add feature dimensions if present
        if self._feature_dims:
            base_info["feature_dims"] = self._feature_dims

        # Optionally add contextual dimensions if present
        if self._context_dims:
            base_info["context_dims"] = self._context_dims

        return base_info

    def add_to_stash(self, key: str, value: Any):
        """Add a custom structure to the stash.

        Args:
            key (str): The key of the custom structure.
            value (Any): The value of the custom structure.
        """
        self._stash[key] = value

    def update_mappings(self, user_mapping: dict, item_mapping: dict):
        """Update the mappings of the dataset.

        Args:
            user_mapping (dict): The mapping of user_id -> user_idx.
            item_mapping (dict): The mapping of item_id -> item_idx.
        """
        self._umap = user_mapping
        self._imap = item_mapping

`add_to_stash(key, value)` ¶

Add a custom structure to the stash.

Parameters:

Name	Type	Description	Default
`key`	`str`	The key of the custom structure.	required
`value`	`Any`	The value of the custom structure.	required

Source code in warprec/data/dataset.py

def add_to_stash(self, key: str, value: Any):
    """Add a custom structure to the stash.

    Args:
        key (str): The key of the custom structure.
        value (Any): The value of the custom structure.
    """
    self._stash[key] = value

`get_context_dims()` ¶

Returns the dimensions (vocab size) of each context feature.

Returns:

Type	Description
`Dict[str, int]`	Dict[str, int]: A dictionary containing the name of the context feature and the vocab size.

Source code in warprec/data/dataset.py

def get_context_dims(self) -> Dict[str, int]:
    """Returns the dimensions (vocab size) of each context feature.

    Returns:
        Dict[str, int]: A dictionary containing the name of the
            context feature and the vocab size.
    """
    return self._context_dims

`get_contextual_evaluation_dataloader(**kwargs)` ¶

Retrieve the full contextual evaluation DataLoader for the dataset.

This loader is specific for Context-Aware Recommender Systems. It iterates over transactions (User, Item, Context) instead of Users.

Parameters:

Name	Type	Description	Default
`**kwargs`	`Any`	The keyword arguments to pass to DataLoader initialization.	`{}`

Returns:

Name	Type	Description
`DataLoader`	`DataLoader`	The contextual data loader.

Source code in warprec/data/dataset.py

def get_contextual_evaluation_dataloader(self, **kwargs: Any) -> DataLoader:
    """Retrieve the full contextual evaluation DataLoader for the dataset.

    This loader is specific for Context-Aware Recommender Systems.
    It iterates over transactions (User, Item, Context) instead of Users.

    Args:
        **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

    Returns:
        DataLoader: The contextual data loader.
    """
    key = f"full_contextual_{self._serialize_dataloader_kwargs(kwargs)}"
    if key not in self._precomputed_dataloader:
        eval_data = self.eval_set.get_df()

        # Retrieve labels to pre-compute eval data
        user_label = self.eval_set.user_label
        item_label = self.eval_set.item_label
        context_labels = self.eval_set.context_labels

        # Map the evaluation dataset to ensure consistency
        # Create mapping DFs
        u_map_df = nw.from_dict(
            {
                user_label: list(self._umap.keys()),
                f"{user_label}_idx": list(self._umap.values()),
            },
            native_namespace=nw.get_native_namespace(eval_data),
        )

        i_map_df = nw.from_dict(
            {
                item_label: list(self._imap.keys()),
                f"{item_label}_idx": list(self._imap.values()),
            },
            native_namespace=nw.get_native_namespace(eval_data),
        )

        # Apply mappings via Join
        eval_data = eval_data.join(u_map_df, on=user_label, how="inner").join(
            i_map_df, on=item_label, how="inner"
        )

        # Sort to ensure reproducibility
        eval_data = eval_data.sort(f"{user_label}_idx", f"{item_label}_idx")

        # Replace columns with mapped indices
        eval_data = eval_data.with_columns(
            nw.col(f"{user_label}_idx").alias(user_label),
            nw.col(f"{item_label}_idx").alias(item_label),
        ).drop(f"{user_label}_idx", f"{item_label}_idx")

        # dropna is implicit in inner join for the mapped keys,
        # but if we need to drop other nulls:
        # eval_data = eval_data.drop_nulls(subset=[user_label, item_label])

        dataset = ContextualEvaluationDataset(
            eval_data=eval_data,
            user_id_label=user_label,
            item_id_label=item_label,
            context_labels=context_labels,
        )
        self._precomputed_dataloader[key] = DataLoader(
            dataset, batch_size=self.batch_size, shuffle=False, **kwargs
        )

    return self._precomputed_dataloader[key]

`get_dims()` ¶

Returns the dimensions of the data.

Returns:

Type	Description
`Tuple[int, int]`	Tuple[int, int]: int: Number of unique user_ids. int: Number of unique item_ids.

Source code in warprec/data/dataset.py

def get_dims(self) -> Tuple[int, int]:
    """Returns the dimensions of the data.

    Returns:
        Tuple[int, int]:
            int: Number of unique user_ids.
            int: Number of unique item_ids.
    """
    return (self._nuid, self._niid)

`get_evaluation_dataloader(**kwargs)` ¶

Retrieve the full evaluation DataLoader for the dataset.

Parameters:

Name	Type	Description	Default
`**kwargs`	`Any`	The keyword arguments to pass to DataLoader initialization.	`{}`

Returns:

Name	Type	Description
`DataLoader`	`DataLoader`	DataLoader that yields batches of interactions (eval_batch, user_indices).

Source code in warprec/data/dataset.py

def get_evaluation_dataloader(self, **kwargs: Any) -> DataLoader:
    """Retrieve the full evaluation DataLoader for the dataset.

    Args:
        **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

    Returns:
        DataLoader: DataLoader that yields batches of interactions
            (eval_batch, user_indices).
    """
    key = f"full_{self._serialize_dataloader_kwargs(kwargs)}"
    if key not in self._precomputed_dataloader:
        eval_sparse = self.eval_set.get_sparse()

        dataset = EvaluationDataset(eval_interactions=eval_sparse)
        self._precomputed_dataloader[key] = DataLoader(
            dataset,
            batch_size=self.batch_size,
            shuffle=False,
            **kwargs,
        )

    return self._precomputed_dataloader[key]

`get_feature_dims()` ¶

Returns the dimensions (vocab size) of each side information feature.

Returns:

Type	Description
`Dict[str, int]`	Dict[str, int]: A dictionary containing the name of the feature and the vocab size.

Source code in warprec/data/dataset.py

def get_feature_dims(self) -> Dict[str, int]:
    """Returns the dimensions (vocab size) of each side information feature.

    Returns:
        Dict[str, int]: A dictionary containing the name of the
            feature and the vocab size.
    """
    return self._feature_dims

`get_features_lookup()` ¶

This method retrieves the lookup tensor for side information features.

Returns:

Type	Description
`Optional[Tensor]`	Optional[Tensor]: Lookup tensor for side information features.

Source code in warprec/data/dataset.py

def get_features_lookup(self) -> Optional[Tensor]:
    """This method retrieves the lookup tensor for side information features.

    Returns:
        Optional[Tensor]: Lookup tensor for side information features.
    """
    return self._feat_lookup

`get_inverse_mappings()` ¶

Returns the inverse of the mapping.

Returns:

Type	Description
`Tuple[dict, dict]`	Tuple[dict, dict]: dict: Mapping of user_idx -> user_id. dict: Mapping of item_idxs -> item_id.

Source code in warprec/data/dataset.py

def get_inverse_mappings(self) -> Tuple[dict, dict]:
    """Returns the inverse of the mapping.

    Returns:
        Tuple[dict, dict]:
            dict: Mapping of user_idx -> user_id.
            dict: Mapping of item_idxs -> item_id.
    """
    return {v: k for k, v in self._umap.items()}, {
        v: k for k, v in self._imap.items()
    }

`get_item_cluster()` ¶

This method retrieves the lookup tensor for item clusters.

Returns:

Name	Type	Description
`Tensor`	`Tensor`	Lookup tensor for item clusters.

Source code in warprec/data/dataset.py

def get_item_cluster(self) -> Tensor:
    """This method retrieves the lookup tensor for item clusters.

    Returns:
        Tensor: Lookup tensor for item clusters.
    """
    return self._ic

`get_mappings()` ¶

Returns the mapping used for this dataset.

Returns:

Type	Description
`Tuple[dict, dict]`	Tuple[dict, dict]: dict: Mapping of user_id -> user_idx. dict: Mapping of item_id -> item_idx.

Source code in warprec/data/dataset.py

def get_mappings(self) -> Tuple[dict, dict]:
    """Returns the mapping used for this dataset.

    Returns:
        Tuple[dict, dict]:
            dict: Mapping of user_id -> user_idx.
            dict: Mapping of item_id -> item_idx.
    """
    return (self._umap, self._imap)

`get_sampled_contextual_evaluation_dataloader(num_negatives=99, seed=42, **kwargs)` ¶

Retrieve the sampled contextual evaluation DataLoader for the dataset.

Parameters:

Name	Type	Description	Default
`num_negatives`	`int`	Number of negative samples per transaction.	`99`
`seed`	`int`	Random seed.	`42`
`**kwargs`	`Any`	The keyword arguments to pass to DataLoader initialization.	`{}`

Returns:

Name	Type	Description
`DataLoader`	`DataLoader`	The sampled contextual loader.

Source code in warprec/data/dataset.py

def get_sampled_contextual_evaluation_dataloader(
    self,
    num_negatives: int = 99,
    seed: int = 42,
    **kwargs: Any,
) -> DataLoader:
    """Retrieve the sampled contextual evaluation DataLoader for the dataset.

    Args:
        num_negatives (int): Number of negative samples per transaction.
        seed (int): Random seed.
        **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

    Returns:
        DataLoader: The sampled contextual loader.
    """
    key = (
        f"sampled_contextual_{num_negatives}_{seed}_"
        f"{self._serialize_dataloader_kwargs(kwargs)}"
    )

    if key not in self._precomputed_dataloader:
        train_sparse = self.train_set.get_sparse()
        eval_data = self.eval_set.get_df()

        # Retrieve labels to pre-compute eval data
        user_label = self.eval_set.user_label
        item_label = self.eval_set.item_label
        context_labels = self.eval_set.context_labels

        # Map the evaluation dataset
        u_map_df = nw.from_dict(
            {
                user_label: list(self._umap.keys()),
                f"{user_label}_idx": list(self._umap.values()),
            },
            native_namespace=nw.get_native_namespace(eval_data),
        )

        i_map_df = nw.from_dict(
            {
                item_label: list(self._imap.keys()),
                f"{item_label}_idx": list(self._imap.values()),
            },
            native_namespace=nw.get_native_namespace(eval_data),
        )

        eval_data = eval_data.join(u_map_df, on=user_label, how="inner").join(
            i_map_df, on=item_label, how="inner"
        )

        # Sort to ensure reproducibility
        eval_data = eval_data.sort(f"{user_label}_idx", f"{item_label}_idx")

        eval_data = eval_data.with_columns(
            nw.col(f"{user_label}_idx").alias(user_label),
            nw.col(f"{item_label}_idx").alias(item_label),
        ).drop(f"{user_label}_idx", f"{item_label}_idx")

        dataset = SampledContextualEvaluationDataset(
            train_interactions=train_sparse,
            eval_data=eval_data,
            user_id_label=user_label,
            item_id_label=item_label,
            context_labels=context_labels,
            num_items=self._niid,
            num_negatives=num_negatives,
            seed=seed,
        )
        self._precomputed_dataloader[key] = DataLoader(
            dataset,
            batch_size=self.batch_size,
            shuffle=False,
            collate_fn=dataset.collate_fn,
            **kwargs,
        )

    return self._precomputed_dataloader[key]

`get_sampled_evaluation_dataloader(num_negatives=99, seed=42, **kwargs)` ¶

Retrieve the sampled evaluation DataLoader for the dataset.

Parameters:

Name	Type	Description	Default
`num_negatives`	`int`	Number of negative samples per user.	`99`
`seed`	`int`	Random seed for negative sampling.	`42`
`**kwargs`	`Any`	The keyword arguments to pass to DataLoader initialization.	`{}`

Returns:

Name	Type	Description
`DataLoader`	`DataLoader`	DataLoader that yields batches of interactions (pos_items, neg_items, user_indices)

Source code in warprec/data/dataset.py

def get_sampled_evaluation_dataloader(
    self,
    num_negatives: int = 99,
    seed: int = 42,
    **kwargs: Any,
) -> DataLoader:
    """Retrieve the sampled evaluation DataLoader for the dataset.

    Args:
        num_negatives (int): Number of negative samples per user.
        seed (int): Random seed for negative sampling.
        **kwargs (Any): The keyword arguments to pass to DataLoader initialization.

    Returns:
        DataLoader: DataLoader that yields batches
            of interactions (pos_items, neg_items, user_indices)
    """
    key = (
        f"sampled_{num_negatives}_{seed}_"
        f"{self._serialize_dataloader_kwargs(kwargs)}"
    )

    if key not in self._precomputed_dataloader:
        train_sparse = self.train_set.get_sparse()
        eval_sparse = self.eval_set.get_sparse()

        dataset = SampledEvaluationDataset(
            train_interactions=train_sparse,
            eval_interactions=eval_sparse,
            num_negatives=num_negatives,
            seed=seed,
        )
        self._precomputed_dataloader[key] = DataLoader(
            dataset,
            batch_size=self.batch_size,
            shuffle=False,
            collate_fn=dataset.collate_fn,
            **kwargs,
        )

    return self._precomputed_dataloader[key]

`get_stash()` ¶

This method retrieves the stash dictionary containing all user custom structures.

Returns:

Type	Description
`Dict[str, Any]`	Dict[str, Any]: The stash dictionary.

Source code in warprec/data/dataset.py

def get_stash(self) -> Dict[str, Any]:
    """This method retrieves the stash dictionary
    containing all user custom structures.

    Returns:
        Dict[str, Any]: The stash dictionary.
    """
    return self._stash

`get_user_cluster()` ¶

This method retrieves the lookup tensor for user clusters.

Returns:

Name	Type	Description
`Tensor`	`Tensor`	Lookup tensor for user clusters.

Source code in warprec/data/dataset.py

def get_user_cluster(self) -> Tensor:
    """This method retrieves the lookup tensor for user clusters.

    Returns:
        Tensor: Lookup tensor for user clusters.
    """
    return self._uc

`info()` ¶

This method returns the main information of the dataset in dict format.

Returns:

Name	Type	Description
`dict`	`dict`	The dictionary with the main information of the dataset.

Source code in warprec/data/dataset.py

def info(self) -> dict:
    """This method returns the main information of the
    dataset in dict format.

    Returns:
        dict: The dictionary with the main information of
            the dataset.
    """
    base_info = {
        "n_items": self._niid,
        "n_users": self._nuid,
        "n_features": self._nfeat,
        "item_mapping": self._imap,
        "user_mapping": self._umap,
    }

    # Optionally add feature dimensions if present
    if self._feature_dims:
        base_info["feature_dims"] = self._feature_dims

    # Optionally add contextual dimensions if present
    if self._context_dims:
        base_info["context_dims"] = self._context_dims

    return base_info

`update_mappings(user_mapping, item_mapping)` ¶

Update the mappings of the dataset.

Parameters:

Name	Type	Description	Default
`user_mapping`	`dict`	The mapping of user_id -> user_idx.	required
`item_mapping`	`dict`	The mapping of item_id -> item_idx.	required

Source code in warprec/data/dataset.py

def update_mappings(self, user_mapping: dict, item_mapping: dict):
    """Update the mappings of the dataset.

    Args:
        user_mapping (dict): The mapping of user_id -> user_idx.
        item_mapping (dict): The mapping of item_id -> item_idx.
    """
    self._umap = user_mapping
    self._imap = item_mapping

`warprec.data.eval_loaders.EvaluationDataset` ¶

Bases: Dataset

Yields: (user_idx, dense_ground_truth)

Source code in warprec/data/eval_loaders.py

class EvaluationDataset(TorchDataset):
    """
    Yields: (user_idx, dense_ground_truth)
    """

    def __init__(
        self,
        eval_interactions: csr_matrix,
    ):
        self.eval_interactions = eval_interactions
        self.users_with_eval = [
            u
            for u in range(eval_interactions.shape[0])
            if eval_interactions.indptr[u + 1] - eval_interactions.indptr[u] > 0
        ]

    def __len__(self) -> int:
        return len(self.users_with_eval)

    def __getitem__(self, idx: int) -> Tuple[int, Tensor]:
        user_idx = self.users_with_eval[idx]
        eval_row = self.eval_interactions.getrow(user_idx)
        ground_truth = torch.from_numpy(eval_row.toarray()).to(torch.float32).squeeze(0)

        return user_idx, ground_truth

`warprec.data.eval_loaders.ContextualEvaluationDataset` ¶

Bases: Dataset

Yields: (user_idx, target_item_idx, context_vector)

Source code in warprec/data/eval_loaders.py

class ContextualEvaluationDataset(TorchDataset):
    """
    Yields: (user_idx, target_item_idx, context_vector)
    """

    def __init__(
        self,
        eval_data: DataFrame[Any],
        user_id_label: str,
        item_id_label: str,
        context_labels: List[str],
    ):
        # Pre-convert DataFrames to torch tensor to reduce overhead
        self.user_indices = torch.from_numpy(
            eval_data.select(user_id_label).to_numpy().flatten().astype(np.int64)
        )
        self.item_indices = torch.from_numpy(
            eval_data.select(item_id_label).to_numpy().flatten().astype(np.int64)
        )
        self.context_features = torch.from_numpy(
            eval_data.select(context_labels).to_numpy().astype(np.int64)
        )

    def __len__(self) -> int:
        return len(self.user_indices)

    def __getitem__(self, idx: int) -> Tuple[Tensor, Tensor, Tensor]:
        return (
            self.user_indices[idx],
            self.item_indices[idx],
            self.context_features[idx],
        )

`warprec.data.eval_loaders.SampledEvaluationDataset` ¶

Bases: Dataset

Yields: (user_idx, pos_item, neg_items_vector)

Source code in warprec/data/eval_loaders.py

class SampledEvaluationDataset(TorchDataset):
    """
    Yields: (user_idx, pos_item, neg_items_vector)
    """

    def __init__(
        self,
        train_interactions: csr_matrix,
        eval_interactions: csr_matrix,
        num_negatives: int = 99,
        seed: int = 42,
    ):
        super().__init__()
        self.num_users, self.num_items = train_interactions.shape

        # Pre-calculate all positives (Train + Eval) per user,
        # we use a list of arrays for fast access
        self.all_positives = []
        for u in range(self.num_users):
            train_indices = train_interactions.indices[
                train_interactions.indptr[u] : train_interactions.indptr[u + 1]
            ]
            eval_indices = eval_interactions.indices[
                eval_interactions.indptr[u] : eval_interactions.indptr[u + 1]
            ]
            self.all_positives.append(np.union1d(train_indices, eval_indices))

        # Identify users who actually have evaluation data
        self.users_with_eval = [
            u
            for u in range(self.num_users)
            if eval_interactions.indptr[u + 1] - eval_interactions.indptr[u] > 0
        ]

        self.positive_items_list = []
        self.negative_items_list = []

        np.random.seed(seed)

        for u in self.users_with_eval:
            # Store positives
            eval_pos = eval_interactions.indices[
                eval_interactions.indptr[u] : eval_interactions.indptr[u + 1]
            ]
            self.positive_items_list.append(torch.tensor(eval_pos, dtype=torch.long))

            # Compute seen items
            seen_items = self.all_positives[u]
            n_seen = len(seen_items)

            # If user has seen almost everything, return empty or partial
            if self.num_items - n_seen <= 0:
                self.negative_items_list.append(torch.tensor([], dtype=torch.long))
                continue

            # Sample one time 2x the number of negatives
            # NOTE: In most cases this will skip the while loop
            num_to_generate = num_negatives * 2
            candidates = np.random.randint(0, self.num_items, size=num_to_generate)

            # Fast filtering using numpy boolean masking
            mask = np.isin(candidates, seen_items, invert=True)
            valid_negatives = candidates[mask]

            # Remove duplicates in candidates if necessary
            valid_negatives = np.unique(valid_negatives)

            # If we don't have enough, fallback to a loop
            if len(valid_negatives) < num_negatives:
                final_negs = list(valid_negatives)
                while len(final_negs) < num_negatives:
                    cand = np.random.randint(0, self.num_items)
                    if cand not in seen_items and cand not in final_negs:
                        final_negs.append(cand)
                valid_negatives = np.array(final_negs)

            # Take exactly num_negatives
            self.negative_items_list.append(
                torch.tensor(valid_negatives[:num_negatives], dtype=torch.long)
            )

    def __len__(self) -> int:
        return len(self.users_with_eval)

    def __getitem__(self, idx: int) -> Tuple[int, Tensor, Tensor]:
        user_idx = self.users_with_eval[idx]
        return (
            user_idx,
            self.positive_items_list[idx],
            self.negative_items_list[idx],
        )

    def collate_fn(
        self, batch: List[Tuple[int, Tensor, Tensor]]
    ) -> Tuple[Tensor, Tensor, Tensor]:
        user_indices, positive_tensors, negative_tensors = zip(*batch)
        user_indices_tensor = torch.tensor(list(user_indices), dtype=torch.long)

        positives_padded = pad_sequence(
            positive_tensors,  # type: ignore[arg-type]
            batch_first=True,
            padding_value=self.num_items,
        )
        negatives_padded = pad_sequence(
            negative_tensors,  # type: ignore[arg-type]
            batch_first=True,
            padding_value=self.num_items,
        )
        return user_indices_tensor, positives_padded, negatives_padded

`warprec.data.eval_loaders.SampledContextualEvaluationDataset` ¶

Bases: Dataset

Yields: (user_idx, pos_item, neg_items_vector, context_vector)

Source code in warprec/data/eval_loaders.py

class SampledContextualEvaluationDataset(TorchDataset):
    """
    Yields: (user_idx, pos_item, neg_items_vector, context_vector)
    """

    def __init__(
        self,
        train_interactions: csr_matrix,
        eval_data: DataFrame[Any],
        user_id_label: str,
        item_id_label: str,
        context_labels: List[str],
        num_items: int,
        num_negatives: int = 99,
        seed: int = 42,
    ):
        # pylint: disable = too-many-nested-blocks
        self.num_negatives = num_negatives
        self.num_items = num_items

        # Pre-convert DataFrames to torch tensor to reduce overhead
        self.user_indices = torch.from_numpy(
            eval_data.select(user_id_label).to_numpy().flatten().astype(np.int64)
        )
        self.pos_item_indices = torch.from_numpy(
            eval_data.select(item_id_label).to_numpy().flatten().astype(np.int64)
        )
        self.context_features = torch.from_numpy(
            eval_data.select(context_labels).to_numpy().astype(np.int64)
        )

        n_train_users = train_interactions.shape[0]

        self.negatives_list: list[Tensor] = []
        np.random.seed(seed)

        for idx, user_idx_tensor in enumerate(self.user_indices):
            u = int(user_idx_tensor.item())
            target_item = int(self.pos_item_indices[idx].item())

            # Retrieve training history
            if u < n_train_users:
                train_items = train_interactions.indices[
                    train_interactions.indptr[u] : train_interactions.indptr[u + 1]
                ]
            else:
                train_items = np.array([], dtype=np.int64)

            # Compute seen items
            seen_items = np.append(train_items, target_item)

            # Generate 2x candidates to avoid loops in most cases
            num_to_generate = self.num_negatives * 2
            candidates = np.random.randint(0, self.num_items, size=num_to_generate)

            # Filter out seen items using optimized numpy boolean masking
            mask = np.isin(candidates, seen_items, invert=True)
            valid_negatives = candidates[mask]

            # Remove duplicates
            valid_negatives = np.unique(valid_negatives)

            # If we don't have enough, fallback to a loop
            if len(valid_negatives) < self.num_negatives:
                final_negs = list(valid_negatives)
                while len(final_negs) < self.num_negatives:
                    cand = np.random.randint(0, self.num_items)
                    if cand not in final_negs:
                        if cand != target_item:
                            if cand not in train_items:
                                final_negs.append(cand)  # type: ignore[arg-type]
                valid_negatives = np.array(final_negs)

            # Take exactly num_negatives
            self.negatives_list.append(
                torch.tensor(valid_negatives[: self.num_negatives], dtype=torch.long)
            )

    def __len__(self) -> int:
        return len(self.user_indices)

    def __getitem__(self, idx: int) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        return (
            self.user_indices[idx],
            self.pos_item_indices[idx],
            self.negatives_list[idx],
            self.context_features[idx],
        )

    def collate_fn(
        self, batch: List[Tuple[Tensor, Tensor, Tensor, Tensor]]
    ) -> Tuple[Tensor, Tensor, Tensor, Tensor]:
        user_indices, pos_items, neg_items, context_features = zip(*batch)

        tensor_user_indices = torch.stack(user_indices)
        tensor_pos_items = torch.stack(pos_items).unsqueeze(1)
        tensor_neg_items = torch.stack(neg_items)
        tensor_context_features = torch.stack(context_features)

        return (
            tensor_user_indices,
            tensor_pos_items,
            tensor_neg_items,
            tensor_context_features,
        )

Dataset - API Reference¶

warprec.data.dataset.Dataset ¶

add_to_stash(key, value) ¶

get_context_dims() ¶

get_contextual_evaluation_dataloader(**kwargs) ¶

get_dims() ¶

get_evaluation_dataloader(**kwargs) ¶

get_feature_dims() ¶

get_features_lookup() ¶

get_inverse_mappings() ¶

get_item_cluster() ¶

get_mappings() ¶

get_sampled_contextual_evaluation_dataloader(num_negatives=99, seed=42, **kwargs) ¶

get_sampled_evaluation_dataloader(num_negatives=99, seed=42, **kwargs) ¶

get_stash() ¶

get_user_cluster() ¶

info() ¶

update_mappings(user_mapping, item_mapping) ¶

warprec.data.eval_loaders.EvaluationDataset ¶

warprec.data.eval_loaders.ContextualEvaluationDataset ¶

warprec.data.eval_loaders.SampledEvaluationDataset ¶

warprec.data.eval_loaders.SampledContextualEvaluationDataset ¶

`warprec.data.dataset.Dataset` ¶

`add_to_stash(key, value)` ¶

`get_context_dims()` ¶

`get_contextual_evaluation_dataloader(**kwargs)` ¶

`get_dims()` ¶

`get_evaluation_dataloader(**kwargs)` ¶

`get_feature_dims()` ¶

`get_features_lookup()` ¶

`get_inverse_mappings()` ¶

`get_item_cluster()` ¶

`get_mappings()` ¶

`get_sampled_contextual_evaluation_dataloader(num_negatives=99, seed=42, **kwargs)` ¶

`get_sampled_evaluation_dataloader(num_negatives=99, seed=42, **kwargs)` ¶

`get_stash()` ¶

`get_user_cluster()` ¶

`info()` ¶

`update_mappings(user_mapping, item_mapping)` ¶

`warprec.data.eval_loaders.EvaluationDataset` ¶

`warprec.data.eval_loaders.ContextualEvaluationDataset` ¶

`warprec.data.eval_loaders.SampledEvaluationDataset` ¶

`warprec.data.eval_loaders.SampledContextualEvaluationDataset` ¶