Jeffrey Carpenter

This is the latest installment of a series about building a Python application with Apache Cassandra — specifically a Python implementation of the <a href="http://killrvideo.github.io/">KillrVideo</a> microservice tier. In previous posts I shared <a href="https://medium.com/datadriveninvestor/killrvideo-python-pt-1-the-backstory-5c38191fb330">what motivated this project</a>, how I started with infrastructure including <a href="https://medium.com/@jscarp/python-app-dev-with-protobuf-and-grpc-e5bff779783d">GRPC</a> and <a href="https://medium.com/@jscarp/advertising-python-services-via-etcd-aa1eca6d8ecc">Etcd</a>, the <a href="https://medium.com/@jscarp/who-needs-unit-tests-im-building-microservices-4c8fe40d7095">testing approach</a>, and most recently, how I began implementing <a href="https://medium.com/@jscarp/when-data-access-is-the-easiest-part-of-a-microservice-c7a90dee701a">data access using Cassandra</a>.

In this post we’ll look at some additional examples of data access using the DataStax Python Driver, ranging from the simple to the complex. (I’ll be referring to some driver and mapper concepts from the previous post, so you may want to review that if you haven’t already.)

<h2>Keeping it simple</h2>

In the last post, I shared how easy it was to implement the data access for the User Management Service using Apache Cassandra and the DataStax Python Driver. Using the <code>cqlengine</code> mapper was a big factor in my productivity. Because of that productivity boost, when implementing the other services I started with the mapper and only used other approaches when the complexity of the data access required it.

In fact, a couple of additional services were implemented entirely using the mapper: the Statistics Service and the Ratings Service. These demonstrate Cassandra features including counters and batches.

<h2>Simple example 1: Counters in the Statistics Service</h2>

The Statistics Service (<a href="https://github.com/KillrVideo/killrvideo-python/blob/master/killrvideo/statistics/statistics_service.py">statistics_service.py</a>) stores counts of how many times each video has been viewed. Counting statistics is one of the relatively few use cases for which the <a href="https://docs.datastax.com/en/cql/3.3/cql/cql_reference/counter_type.html">Cassandra counter type</a> is a good fit, and the service is therefore a good example of how to manipulate counters with the mapper. The model class we used to track statistics is shown here:

<code>class VideoPlaybackStatsModel(Model): 
&nbsp; &nbsp;"""Model class that maps to the video_playback_stats table"""&nbsp; &nbsp; 
&nbsp; &nbsp;__table_name__ = 'video_playback_stats' 
&nbsp; &nbsp;video_id = columns.UUID(primary_key=True, db_field='videoid') 
&nbsp; &nbsp;views = columns.Counter() </code>

Note the use of the <code>columns.Counter()</code> type for the <code>views</code> column.

Incrementing the count of views in the <code>record_playback_started()</code> operation is very simple:

<code>VideoPlaybackStatsModel(video_id=video_id).update(views=1) </code>

Note that the value <code>views=1</code> represents a single view of the video.
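To make the increment semantics concrete, here is a minimal sketch of the kind of CQL a counter update corresponds to. The <code>counter_update_cql()</code> helper is hypothetical and exists only for illustration; it is not part of the driver, and the statement the mapper actually generates may differ in quoting and parameter style.

```python
def counter_update_cql(table, key_column, increments):
    """Build an illustrative CQL counter UPDATE.

    Hypothetical helper, not part of the DataStax driver.
    `increments` maps counter column names to the amount to add.
    """
    assignments = ", ".join(
        "{col} = {col} + {delta}".format(col=col, delta=delta)
        for col, delta in increments.items()
    )
    return "UPDATE {table} SET {assignments} WHERE {key} = ?".format(
        table=table, assignments=assignments, key=key_column
    )


# views=1 in the mapper call means "add 1 to the current value"
statement = counter_update_cql("video_playback_stats", "videoid", {"views": 1})
print(statement)
```

The point is that a counter column is never assigned an absolute value; each write only states how much to add.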

<h2>Simple example 2: Batch Writes in the Ratings Service</h2>

The Ratings Service (<a href="https://github.com/KillrVideo/killrvideo-python/blob/master/killrvideo/ratings/ratings_service.py">ratings_service.py</a>) stores user ratings of videos and allows retrieval of the ratings either by user or by video. One interesting aspect of this service is the use of the mapper as part of a <a href="https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlBatch.html">Cassandra batch</a> when storing ratings, in order to support writes to two denormalized tables supporting the “by user” and “by video” queries mentioned above. The code block below shows the data access code from the <code>rate_video()</code> operation:

<code>now = datetime.utcnow() </code>

<code># create and execute batch statement to insert into multiple tables 
batch_query = BatchQuery(timestamp=now) 
VideoRatingsByUserModel.batch(batch_query)\
&nbsp; &nbsp;.create(video_id=video_id, user_id=user_id, rating=rating) 
# updating counter columns rating_counter and rating_total 
# values are interpreted as amount to increment 
VideoRatingsModel(video_id=video_id).\
&nbsp; &nbsp;update(rating_counter=1, rating_total=rating).batch(batch_query) </code>

<code>batch_query.execute() </code>

In this example, we use two different mapper classes to write to two different tables. Note the use of each mapper’s <code>batch()</code> operation to add a statement to the <code>batch_query</code>, which we can then execute. The first statement created is a CQL <code>INSERT</code> into the <code>killrvideo.video_ratings_by_user</code> table via the <code>VideoRatingsByUserModel</code>.

If you’re looking closely, you may have also noticed something about that second write: we’re using the <code>VideoRatingsModel</code>, which goes to the <code>killrvideo.video_ratings</code> table, to do a CQL <code>UPDATE</code>. This table is another interesting use of counters. In this case, two counter columns are used. The <code>rating_counter</code> tracks the number of times a video has been rated, while the <code>rating_total</code> tracks the sum of all ratings for the video. The average rating can then be calculated by the client via simple division (<code>rating_total/rating_counter</code>).
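As a small illustration of that client-side division, a sketch might look like the following. The <code>average_rating()</code> function name and the choice to return <code>0.0</code> for an unrated video are assumptions made for this example, not something taken from the KillrVideo code.

```python
def average_rating(rating_total, rating_counter):
    """Return the mean rating, or 0.0 when the video has not been rated yet.

    Hypothetical helper: the zero-ratings fallback is an assumption.
    """
    if rating_counter == 0:
        return 0.0
    # float() keeps the division exact under Python 2 as well as Python 3
    return float(rating_total) / rating_counter


# Two ratings (4 and 5) give rating_total=9 and rating_counter=2
print(average_rating(9, 2))
```

Guarding against a zero counter matters because a video with no ratings has no row in the counter table, so a client would typically see both values as zero or missing.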

<h2>When simple doesn’t cut it</h2>

There were a few other services that involved more complex data access where the mapper couldn’t fully address my needs. These included the Video Catalog Service, the Comments Service, and the Search Service.

<h2>Complex example 1: Paging in the Comments Service</h2>

Although it’s not “YouTube scale”, the KillrVideo application is designed to support a very large number of videos, users, ratings, and other data types, in order to demonstrate best practices for data modeling and driver usage found in real-world applications.

With this in mind, imagine a user being presented with screens of videos and comments in a web browser on a client device. It would be infeasible and unnecessary to return the entire video catalog or the entire comment history for a popular video to the client at once, so of course, some form of paging is required. However, paging is a feature that isn’t supported by the <code>cqlengine</code> mapper provided with the DataStax Python Driver. So, we need to fall back to other methods.

<img alt="DataStax Python Driver" data-entity-type="file" data-entity-uuid="7548772c-eecd-42e1-af1a-bea11d1fa9f7" src="https://www.datastax.com/sites/default/files/inline-images/1_vLsuvnQUSnBZc5gsdSx6HA.png" />

A good example of the approach is found in the Comments Service (<a href="https://github.com/KillrVideo/killrvideo-python/blob/master/killrvideo/comments/comments_service.py">comments_service.py</a>). Since paging only comes into effect on reads, we can use the mapper for writes, and then use regular CQL statements on the reads to get control over paging.

As an example, to access comments by user, we first create prepared statements in the constructor for the class:

<code># Prepared statements for get_user_comments() 
self.userComments_startingPointPrepared = \ 
&nbsp; &nbsp;session.prepare('SELECT * FROM comments_by_user WHERE userid = ? AND (commentid) &lt;= (?)') 
self.userComments_noStartingPointPrepared = \ 
&nbsp; &nbsp;session.prepare('SELECT * FROM comments_by_user WHERE userid = ?') </code>

Then, in the <code>get_user_comments()</code> operation, we use one of the prepared statements to create a bound statement which we can then execute:

<code>bound_statement = None </code>

<code>if starting_comment_id: 
&nbsp; &nbsp;bound_statement = self.userComments_startingPointPrepared\
&nbsp; &nbsp;&nbsp; &nbsp;.bind([user_id, starting_comment_id]) </code>

<code>else: 
&nbsp; &nbsp;bound_statement = self.userComments_noStartingPointPrepared.bind([user_id]) </code>

<code>bound_statement.fetch_size = page_size 
result_set = None </code>

<code>if paging_state: 
&nbsp; &nbsp;# see below where we encode paging state to hex before returning 
&nbsp; &nbsp;result_set = self.session.execute(bound_statement, 
&nbsp; &nbsp;&nbsp; &nbsp;paging_state=paging_state.decode('hex')) </code>

<code>else: 
&nbsp; &nbsp;result_set = self.session.execute(bound_statement) </code>

Note where the <code>fetch_size</code> of the bound statement is set to the <code>page_size</code> requested by the client. The code after this iterates over the rows in the <code>result_set</code> to build up a list of comments to return to the client. If <code>page_size</code> rows were returned, then the Cassandra paging state is extracted from the result set and returned to the client:

<code>if len(results) == page_size: 
&nbsp; &nbsp;# Use hex encoding since paging state is raw bytes that won't encode to UTF-8 
&nbsp; &nbsp;next_page_state = result_set.paging_state.encode('hex') </code>

This allows the client to pass back the paging state on a subsequent call to <code>get_user_comments()</code> to retrieve the next page. Note that we encode/decode the paging state to a hex string, which works well over the Protobuf message format we’re using on our service interfaces.
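One caveat: the <code>.encode('hex')</code> and <code>.decode('hex')</code> calls shown above are Python 2 idioms and are not available on Python 3 strings. A minimal sketch of an equivalent round trip that should work on both versions uses the standard <code>binascii</code> module. The helper names here are hypothetical and are not taken from the KillrVideo code.

```python
import binascii


def encode_paging_state(raw_state):
    """Convert the driver's raw paging state (bytes) to a hex string.

    Hypothetical helper; returns None when there is no further page.
    """
    if raw_state is None:
        return None
    return binascii.hexlify(raw_state).decode("ascii")


def decode_paging_state(hex_state):
    """Convert a hex string received from the client back to raw bytes."""
    if not hex_state:
        return None
    return binascii.unhexlify(hex_state)


# Round trip with made-up bytes standing in for a real paging state
raw = b"\x00\x10\xff"
as_hex = encode_paging_state(raw)
print(as_hex)
print(decode_paging_state(as_hex) == raw)
```

The decoded bytes are what would be passed as the <code>paging_state</code> argument to <code>session.execute()</code>, and the encoded string is what would travel in the Protobuf response.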

<h2>Complex example 2: Full-text Search in the Search Service</h2>

The Search Service presents a different sort of problem. If you’re familiar with Cassandra data modeling practices, you’ll be aware that Cassandra doesn’t support arbitrary searches, and the secondary index implementation that comes with Cassandra is known to perform poorly over large data sets.

Instead, the best practice is to design a table per query, with primary keys based on the attributes you will specify in each query. However, your options are limited when you don’t know the exact key you are looking for: range queries have only limited support, and there is no support for text search features like prefix/postfix or fuzzy matching.

However, we do have a couple of cases where we need text search features in the Search Service, in both the <code>get_query_suggestions()</code> operation, which performs a typeahead search for common search terms, and the <code>search_videos()</code> operation, which is used to search for videos containing a search term in the title or description.

Following the pattern set in the other language implementations of the KillrVideo services, the Python implementation of the Search Service uses DataStax Enterprise Search. First, we need a search index on the <code>videos</code> table (extracted from the CQL for the search schema):

<code>CREATE SEARCH INDEX IF NOT EXISTS on videos; </code>

Then, we create a prepared statement in the constructor of the Search Service:

<code>self.search_videos_prepared = \
&nbsp; &nbsp;session.prepare('SELECT * FROM videos WHERE solr_query = ?') </code>

Finally, in the&nbsp;<code>search_videos()</code> operation, we use the&nbsp;<code>solr_query</code> syntax supported by DSE Search to create a bound statement containing our desired CQL query:

<code>solr_query = '{"q":"name:(' + query + ')^4 OR tags:(' + query + \
&nbsp; &nbsp;')^2 OR description:(' + query + ')", "paging":"driver"}' 
bound_statement = self.search_videos_prepared.bind([solr_query]) </code>

We can then execute the bound statement and iterate over the results (not shown).

The <code>solr_query</code> defined above searches for the provided search term (<code>query</code>) in the <code>name</code>, <code>tags</code>, or <code>description</code> columns. The <code>solr_query</code> places additional weight on the <code>name</code> and <code>tags</code> columns (boosts of 4 and 2, respectively) so that videos with the <code>query</code> appearing in the <code>name</code> column will appear the highest in search results, followed by those that have the search term in the <code>tags</code> column.
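Because the query string above is assembled by concatenation, a search term containing a double quote would produce invalid JSON. As a hedged alternative, here is a sketch that builds the same structure with the standard <code>json</code> module. The <code>build_solr_query()</code> helper is hypothetical and is not how the KillrVideo service builds its query; it also only handles JSON escaping, not escaping of Solr's own special characters.

```python
import json


def build_solr_query(term):
    """Build a solr_query JSON string with name boosted by 4 and tags by 2.

    Hypothetical helper; json.dumps takes care of JSON quoting only.
    """
    q = "name:({0})^4 OR tags:({0})^2 OR description:({0})".format(term)
    return json.dumps({"q": q, "paging": "driver"})


# The resulting string would be bound to the prepared statement's solr_query parameter
solr_query = build_solr_query("cassandra")
print(solr_query)
```

The result can be bound in the same way as before, for example with <code>self.search_videos_prepared.bind([solr_query])</code>.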

<h2>And now for something completely different</h2>

In the past two posts, I’ve taken you on a guided tour of the data access code for most of the KillrVideo Python services. The remaining service we haven’t discussed yet is the Suggested Videos Service. As you may have guessed, this involves building a recommender, which we’re doing using DataStax Enterprise Graph since that is built on top of Cassandra.

In upcoming posts I’ll share more about this recommender, starting with how we shared data between services using Kafka in order to populate data into a graph used for recommendations.

<img alt="KillrVideo Python" data-entity-type="file" data-entity-uuid="6613295d-eb80-46af-9183-90f00cd4a628" src="https://www.datastax.com/sites/default/files/inline-images/killrvideopython.jpg" />

KillrVideo Python Pt. 6— Cassandra with Python: Simple to Complex

Jeffrey Carpenter, Software Engineer - Stargate
