Learning is an analytic process of exploring the past in order to predict the future. Being able to travel back in time to create features is therefore critical to the success of machine learning projects. To enable this, we built a time machine that computes features as of any point in the recent past for offline experimentation. We also built a real-time stream processing system that captures members' interests at different times of the day and adapts quickly to changes in their collective interests as they happen, for example during real-world events.
Building the time machine for offline experimentation and the real-time infrastructure for online recommendations with Apache Spark (Streaming) and Apache Cassandra let us both scale up the data size by an order of magnitude and train and validate models in less time. We will delve into the architecture, use case details, and the data models used for Cassandra, and share what we learned.
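
To make the time machine idea concrete, here is a minimal sketch in plain Python of computing features "as of" a past moment. It is not the system described above, which runs on Spark and Cassandra. The event log, the `features_as_of` function, the seven-day window, and the feature names are all hypothetical and chosen only for illustration. The point it shows is that a feature computed for time `as_of` may only see events that happened before `as_of`.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical event log: (member_id, item_id, event_time).
EVENTS = [
    ("m1", "a", datetime(2024, 1, 1, 9, 0)),
    ("m1", "b", datetime(2024, 1, 2, 21, 0)),
    ("m1", "a", datetime(2024, 1, 5, 9, 30)),
    ("m2", "c", datetime(2024, 1, 3, 12, 0)),
]


def features_as_of(events, member_id, as_of, window=timedelta(days=7)):
    """Compute features for one member using only events in [as_of - window, as_of)."""
    start = as_of - window
    # Events at or after as_of are excluded, so nothing from the "future" leaks in.
    visible = [
        item
        for member, item, ts in events
        if member == member_id and start <= ts < as_of
    ]
    counts = Counter(visible)
    return {
        "event_count": len(visible),
        "distinct_items": len(counts),
        "top_item": counts.most_common(1)[0][0] if counts else None,
    }


if __name__ == "__main__":
    # The same member yields different features at different points in time.
    print(features_as_of(EVENTS, "m1", datetime(2024, 1, 3)))
    print(features_as_of(EVENTS, "m1", datetime(2024, 1, 6)))
```

Calling the function with different `as_of` values replays the member's history at each moment, which is what allows a model to be trained and validated offline on features it would actually have seen at that time.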