Preparing Complex Datasets for Amazon's Recommender System Study

4 May 2024


(1) Jonathan H. Rystrøm.

Abstract and Introduction

Previous Literature

Methods and Data



Conclusions and References

A. Validation of Assumptions

B. Other Models

C. Pre-processing steps

C Pre-processing steps

Working with a dataset that has millions of rows and complex types such as "categories" and "dates" requires special engineering considerations. This section outlines the pre-processing steps required to get the data from Ni et al. (2019) into an analysis-ready shape.

All pre-processing of the data was done in Python (Van Rossum, 2007), primarily because of its rich ecosystem of scientific packages. For this project we use numpy (Harris et al., 2020), pandas (McKinney, 2011), and numba (Lam et al., 2015) for efficient large-scale data processing. We also use scikit-learn (Pedregosa et al., 2011) to efficiently parse categories (see repository for implementation).

Most computations were performed on the Oxford Internet Institute’s HPC cluster. This allowed us to benefit from multi-core processing (Gorelick & Ozsvald, 2020) and increased RAM.

The first step is creating a dataset of category relevance for the books (see repository for details). Here, we simply take the original gzipped file and extract a list of categories and item ID (asin). This drastically reduces the file size, so we can do the computations in memory.
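As a rough illustration of this extraction step, the sketch below streams a gzipped file line by line and keeps only the item ID and its categories. It is not the repository's implementation: it assumes the file is in JSON-lines format, and the field names `asin` and `category` as well as the function name `iter_item_categories` are assumptions made for the example.

```python
import gzip
import json
from typing import Iterator, List, Tuple


def iter_item_categories(path: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (asin, categories) pairs from a gzipped JSON-lines file.

    The field names "asin" and "category" are assumptions; adjust them
    to the actual metadata schema.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Items without a category field get an empty list.
            yield record["asin"], record.get("category", [])
```

Streaming the file this way avoids loading every metadata field into memory; only the two fields of interest are retained.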

The next step is preparing the rating data. We start by filtering the dataset to keep only users with more than 20 ratings. This reduces the dataset considerably, as we saw in Fig. 2. We then left-join the data with the category relevance data described above. Each row now consists of a user id, category id, timestamp, and preference score (i.e. the rating multiplied by category relevance; see eq. 1) for each rating that the user has made for any given category. Note that an individual rating can be represented in multiple rows if a book has multiple categories (which most do).
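The filter and join described here could be sketched in pandas roughly as follows. This is a minimal illustration under assumed column names (`user_id`, `asin`, `rating`, `timestamp`, `category_id`, `relevance`), and `build_preference_rows` is a hypothetical helper rather than the function used in the repository.

```python
import pandas as pd


def build_preference_rows(
    ratings: pd.DataFrame,
    relevance: pd.DataFrame,
    min_ratings: int = 20,
) -> pd.DataFrame:
    """Filter to active users and attach per-category preference scores.

    Column names are assumptions made for this sketch:
      ratings:   user_id, asin, rating, timestamp
      relevance: asin, category_id, relevance
    """
    # Keep only users with strictly more than `min_ratings` ratings.
    per_user = ratings["user_id"].map(ratings["user_id"].value_counts())
    active = ratings[per_user > min_ratings]

    # Left join: a rating for a book with k categories becomes k rows.
    merged = active.merge(relevance, on="asin", how="left")

    # Preference score: rating multiplied by category relevance.
    merged["preference"] = merged["rating"] * merged["relevance"]
    return merged[["user_id", "category_id", "timestamp", "preference"]]
```

Because the join is a left join, ratings for books without any category entry are kept with missing category and preference values; whether to drop them afterwards is a separate choice not covered by this sketch.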

Finally, we summarise the data to get the sum of preference scores and the number of ratings per user, category, and quarter. This gives us a further reduced dataset that is more manageable to work with. Revealed preferences are defined as the weighted sum of ratings and category relevance (eq. 1), so the choice of aggregation level is mainly one of granularity.
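A minimal sketch of this aggregation in pandas is given below. It assumes that `timestamp` holds Unix time in seconds and reuses hypothetical column names (`user_id`, `category_id`, `preference`); the function name `summarise_by_quarter` is likewise illustrative.

```python
import pandas as pd


def summarise_by_quarter(rows: pd.DataFrame) -> pd.DataFrame:
    """Aggregate preference scores per user, category, and quarter.

    Assumes columns user_id, category_id, timestamp, preference, with
    `timestamp` given as Unix seconds (an assumption for this sketch).
    """
    out = rows.copy()
    # Map each timestamp to its calendar quarter, e.g. 2020Q1.
    out["quarter"] = pd.to_datetime(out["timestamp"], unit="s").dt.to_period("Q")
    return out.groupby(
        ["user_id", "category_id", "quarter"], as_index=False
    ).agg(
        preference_sum=("preference", "sum"),  # sum of preference scores
        n_ratings=("preference", "size"),      # number of rating rows
    )
```

Each output row then holds one user-category-quarter cell, which is what makes the summarised dataset much smaller than the row-per-rating table.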

This paper is available on arXiv under a CC 4.0 license.