How Ankeri keeps cargo fleets sailing smoothly with DataStax Astra DB
“Once we had chosen DataStax and Astra DB as a serverless solution, we found out about a ‘get out of jail free’ card, or it felt like that, with the storage-attached indexing DataStax offers.”
—Nanna Einarsdóttir, Ankeri’s vice president of engineering.
Headquartered in Reykjavík, Iceland, Ankeri provides a unified platform to enable robust, data-driven performance analyses for companies that own and charter ships. Ankeri’s leadership team recognized an opportunity to provide real-time data for the shipping sector. While cargo ships and their management applications can provide data for analysis, the data itself is tied to those specific systems, which makes it harder to compare different vessels or aggregate data to help cargo companies make decisions about future projects. Ankeri’s founders, CEO Kristinn Aspelund and CTO Leifur Kristjansson, saw that the industry needed a new approach.
We recently sat down with Nanna Einarsdóttir, Ankeri’s vice president of engineering. She explains why Ankeri chose Apache Cassandra™ and DataStax Astra DB, and walks us through her team’s journey into the non-relational database world.
1. Tell us a little about yourself and your company.
I’m an electrical engineer and I graduated from the Technical University of Denmark in 2013. I’ve been working in software development for eight years and joined Ankeri in 2019. We are a startup software company and launched our first product in 2018.
A company’s shipping fleet usually consists of a combination of a company’s own ships along with chartered ships rented from others. Ankeri provides a platform for ship owners and charterers to communicate and stay on top of fleet performance and potential chartering prospects.
2. What was the genesis of Ankeri and your Data Connections Hub (DCH), the software you offer to companies that own and charter ships?
We came across a problem our customers had. Their fleets consist of both their own ships as well as chartered ships. Real-time data is recorded on board each ship using different internal systems. Then data is made available onshore via a number of APIs, each providing the data in a different format. And then you have the problem of ships regularly being added to and removed from each fleet. This data can be very difficult to capture and manage, both for monitoring and analytical purposes.
The solution is Ankeri’s Data Connections Hub (DCH), which is an API that serves real-time data to companies for their entire fleet. It is onboard system-agnostic, with data storage in the cloud. It is served via a single endpoint, in a single, configurable format. It features an administrative interface for easily managing ships and users.
We launched a pilot with our first customer and real-time data has been streaming in smoothly as of May.
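The "single endpoint, single format" idea can be pictured with a minimal Python sketch. It is only an illustration, not Ankeri's implementation: the two source payload shapes, the field names, and the `normalize` helper are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical payloads from two onboard-system APIs. The field names are
# invented for illustration and are not taken from any real vendor API.
def from_source_a(payload: dict) -> dict:
    return {
        "ship_id": payload["vessel"],
        "ts": datetime.fromtimestamp(payload["epoch_s"], tz=timezone.utc),
        "latitude": payload["pos"]["lat"],
        "longitude": payload["pos"]["lon"],
    }

def from_source_b(payload: dict) -> dict:
    return {
        "ship_id": payload["imo"],
        "ts": datetime.fromisoformat(payload["time"]),
        "latitude": payload["lat_deg"],
        "longitude": payload["lon_deg"],
    }

# One parser per source API; adding a source means adding one entry here.
PARSERS = {"source_a": from_source_a, "source_b": from_source_b}

def normalize(source: str, payload: dict) -> dict:
    """Convert a source-specific payload into one common row format."""
    return PARSERS[source](payload)
```

Whatever the real formats look like, the design point is the same: each source gets its own small parser, and everything downstream sees one shape.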
3. What are some of the particular challenges you faced as a startup developing a brand new solution in this space? And how did that affect how you architected your database infrastructure?
To give you some perspective, during our planning, our resources consisted of four coders. We had all the required business logic for the project and we had good general software development skills. We had vast experience managing databases, but only in the relational world.
When it came to requirements for the product, we knew it needed to be low maintenance because we have a small team. We needed good performance, because the data is meant not only for analytics but also for monitoring and for feeding our existing products. We needed performance that scaled: the solution should work as well for 10,000 ships as for 100. These requirements indicated that serverless was the way to go.
In addition, this is a new product, not a re-write. We needed a fast deployment of our minimum viable product to test the product. It’s fair to say that all the books tell you to know all your access patterns before you dive into big data, but in some cases it can be impossible to know everything in advance. We needed agility in our solution.
We knew we wanted a serverless solution. We chose AWS Lambda and API Gateway to implement the API itself and the back-end infrastructure. This led to the logical choice of AWS DynamoDB as our database-as-a-service.
We wrote up all the access patterns that we anticipated and came up with a design for a DynamoDB single table. Best practices for DynamoDB call for keeping all your data in a single table, no matter what type of data you have.
Just like that, we hit our first challenge. Our access patterns were not set in stone. As the product evolved and as we found out some implementation details, we needed to modify existing access patterns and add new ones. This meant re-visiting the whole DynamoDB single-table design. This was limiting and slowing us down. And even though it was already October last year, and already closer than we would have liked to our internal deadline, we decided to switch databases.
4. When you decided to switch databases, which alternatives did you consider? Why did you choose Apache Cassandra and DataStax?
We looked into Elasticsearch, MongoDB, and Cassandra. We decided to go with Cassandra because of CQL and its familiarity, its similarity to SQL. The table structure with defined column names and types resonated well with our data structure. And then coming from this challenge, we valued the ability to add new tables for new access patterns. It was an easier step than to overhaul the whole design.
We chose DataStax as our service provider due to the high level of support we received from the beginning. That is also an important part of the criteria because we knew that we couldn’t know everything about this going in. You need to make sure you have good support.
Once we had chosen DataStax and Astra DB as a serverless solution, we found out about a “get out of jail free” card, or it felt like that, with the storage-attached indexing DataStax offers. This can be a very comfortable solution if you already have a table out there. It has all the data you need, but you need to query by a column that isn’t in your primary key, or that is only part of it.
We created a table for ships, which is just a list of ships and all the available source APIs belonging to the ship. We created a custom index, which enabled us to also query the table by sources.
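For readers who have not seen a storage-attached index (SAI), the sketch below builds the kind of CQL statement involved. The keyspace, table, and column names are hypothetical, and the helper is only an illustration, not Ankeri's schema.

```python
def sai_index_cql(keyspace: str, table: str, column: str) -> str:
    """Build a CREATE CUSTOM INDEX statement for a storage-attached index."""
    index_name = f"{table}_{column}_sai"
    return (
        f"CREATE CUSTOM INDEX IF NOT EXISTS {index_name} "
        f"ON {keyspace}.{table} ({column}) "
        "USING 'StorageAttachedIndex';"
    )

# Hypothetical keyspace, table, and column names.
print(sai_index_cql("fleet", "ships", "source"))
# With such an index in place, a query like
#   SELECT * FROM fleet.ships WHERE source = ?;
# can filter on a column that is not part of the primary key.
```

The resulting string would be run once against the database, for example through a driver session, before issuing queries that filter on the indexed column.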
With the full-stack serverless solution set up, we can retrieve and parse data from onboard system APIs, then make data available to end users via the Ankeri app, with the Cassandra-compatible database in the middle to store parsed data.
5. Once you completed setup, what other challenges did you face as your software development evolved?
Our second challenge was that we assumed an unlimited number of columns per table. You hear the term “wide columnar storage,” and maybe you stop listening. But maybe you should listen a little more. We had our main table storing all the time series data from the ships. Our primary key contains the metadata for each data row, and then we have 25 data columns and each measurement has its own column, one for latitude, another for longitude, etc. And this currently works great for our main access patterns. You can select all the data columns or you can choose a handful.
But our wrong assumption was that we could continue adding columns indefinitely as the product grows and as we maybe move on to different market segments. The fact is that in Cassandra, even if you select only a couple of columns, the whole row gets read into memory. So if you have a couple hundred columns, all of them get read into memory, and this can slow the system down.
We probably wouldn’t have found out about this anti-pattern in our design if it weren’t for the Astra DB serverless guardrails, which are a set of rules, or limitations, depending on how you look at it. One of them is a hard limit of 75 columns per table. This was a very valuable lesson for us to learn. At first I thought it was silly and wanted to find a database provider that doesn’t have this limit. But after some research, I found out that the limit was there more for our safety.
6. Tell us about the challenge you had in optimizing throughput at one point.
A third challenge to look out for was when we designed the partition key for our biggest table, the ship data table. We did it very carefully. We made sure each ship had its own partition and furthermore we added the year as a bucketing feature to our partition key, so that the time series data would be bound to include only a single year for each partition.
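The ship-plus-year bucketing can be sketched in a few lines of Python. This is an illustration under assumptions: the function names are hypothetical, and timestamps are assumed to be timezone-aware and bucketed by their UTC year.

```python
from datetime import datetime, timezone
from typing import List, Tuple

def partition_key(ship_id: str, ts: datetime) -> Tuple[str, int]:
    """Return the (ship, year) bucket that a measurement belongs to."""
    return (ship_id, ts.astimezone(timezone.utc).year)

def partitions_for_range(
    ship_id: str, start: datetime, end: datetime
) -> List[Tuple[str, int]]:
    """List every (ship, year) partition a time-range query has to touch."""
    first = start.astimezone(timezone.utc).year
    last = end.astimezone(timezone.utc).year
    return [(ship_id, year) for year in range(first, last + 1)]
```

A query for a few days of data touches one partition, or two if it crosses New Year, and no single partition grows beyond one year of data for one ship.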
What we didn’t think about, however, was the throughput that smaller tables might have. As an example, each time we write data to the large table, Ship Data, an attribute in the Ships By Source table gets updated: the last data timestamp. This means that even though it’s a much smaller table, it gets similar traffic to the large one. But in the case of the Ships By Source table, we didn’t think to add Ship ID to the partition key. And our back-end system retrieves data simultaneously for all ships for a particular source. So each time we updated the table for a single source, this small table got bombarded with writes to the same partition.
This issue was hard to spot: we had to put real load on the table before we found it. But it was very easily fixed. We just needed to add Ship ID to the partition key.
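A small simulation shows why adding Ship ID to the partition key helps. It is a sketch, not a measurement: the source name, ship IDs, and the figure of 300 ships are made up for illustration.

```python
from collections import Counter

def writes_per_partition(updates, key_fn):
    """Count how many writes land on each partition for a given key design."""
    return Counter(key_fn(source, ship_id) for source, ship_id in updates)

# One polling round: 300 hypothetical ships all reporting through one source.
updates = [("source_a", f"ship-{i}") for i in range(300)]

# Original design: partition by source only.
by_source = writes_per_partition(updates, lambda source, ship_id: (source,))

# Fixed design: partition by source and ship.
by_source_and_ship = writes_per_partition(
    updates, lambda source, ship_id: (source, ship_id)
)

print(max(by_source.values()))           # 300 writes hit a single partition
print(max(by_source_and_ship.values()))  # 1 write per partition
```

With the source-only key, every update in a round lands on the same partition. With the composite key, the same load is spread across as many partitions as there are ships.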
7. How else did Astra DB guardrails prove useful?
Our fourth challenge was that we were leaning too hard on the “IN” operator with our primary select statement—the select statement that’s executed when a user makes a data request on the API. We have three “IN” operators. For two of them, those applying to year and source, there are only a few items in those lists, due to the nature of the attributes.
But for Ship ID, a company can have a few hundred ships, so there’s a possibility that a user might ask for data for all their ships. In that case, the “IN” operator, which I now know is an inefficient one, would slow down the query. We found out about this issue when we hit the 25-item hard limit of Astra DB guardrails, which was also a good thing to learn.
This is also easy to implement in another way, not necessarily by changing the select statement, but by changing how the back end interprets the request from the user, or even by placing some limitations on the number of ships the user can request. But the point is that I’m not sure how we would have learned about this if it weren’t for the guardrails. We would probably just have seen random slow queries. It might have been hard to spot.
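One way the back end could reinterpret a large request is to split the ship list into batches that stay within the 25-item limit mentioned above. The sketch below is an assumption about how that might look, not Ankeri's code: the table name `ship_data`, the column names, and the helper names are hypothetical.

```python
from typing import Iterator, List, Sequence, Tuple

MAX_IN_ITEMS = 25  # the guardrail limit mentioned in the interview

def chunked(items: Sequence[str], size: int = MAX_IN_ITEMS) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])

def build_queries(ship_ids: Sequence[str], year: int) -> List[Tuple[str, list]]:
    """Return one (statement, parameters) pair per batch of ship IDs."""
    queries = []
    for chunk in chunked(ship_ids):
        placeholders = ", ".join("?" for _ in chunk)
        cql = (
            "SELECT * FROM ship_data "  # hypothetical table and columns
            f"WHERE ship_id IN ({placeholders}) AND year = ?"
        )
        queries.append((cql, [*chunk, year]))
    return queries
```

A request for 60 ships becomes three statements with 25, 25, and 10 IDs. Issuing one query per ship and running them concurrently is another common option, since each query then targets exactly one partition.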
8. What has the biggest challenge been along the way? Is there one that’s larger than all the others?
The biggest challenge is an internal one. It’s shifting your mindset. It’s a pretty hard thing to get used to, creating a data model from your access pattern instead of from the nature of the data, like you do in the relational world. Because the task is laid out in front of you when you create the data model from the behavior of the data. You usually have access to data before you even need to design every part of your product.
But having to focus on your access pattern first requires you to answer questions that you don’t even know yet. Of course, when I look back on it now, I realize there’s homework that’s also healthy to do when you’re designing data modeling for the relational world. It can be difficult to get used to. I felt the thought of de-normalizing data was a bit uncomfortable.
Take duplicating data. Throughout your developer life, you may have had one rule, which is to eliminate duplication and eliminate maintenance of data and code. This is new, but if I think about it from another perspective, that of clean code, it takes weight away from the complex queries that you see everywhere in code. And instead, yes, you have duplicate data, but you have tables named in a very descriptive way. And you have data maintenance code that you can unit test.
Another thing is that it is difficult to get used to prioritizing and limiting access patterns when it comes to your users. I don’t know how many times I had this conversation with my product-facing team, “I guess this access pattern’s really necessary?” And the answer I got was always that it’s better to have it. It’s better to have all the options open. But in the non-relational world, you probably have to find some compromise. You can approach that compromise by getting used to de-normalizing. Saying more yes than no to your product-facing team. And you can educate them about the pros and cons of non-relational databases.
9. Why would you recommend Cassandra and DataStax Astra DB to others?
For a small team of people designing a new product and putting it out there, a team primarily with knowledge in the relational world, with SQL, I would go for Cassandra because of the familiar table structure and CQL. Yes, you should know your access pattern beforehand, but adding an SAI can be your answer, or at least it can be a short stop before you need to duplicate your data. And adding a new access pattern in the form of a table doesn’t require a revisit to the whole design. It’s an easy step. It’s available as a service and with a pay-for-what-you-use model.
I say get your seatbelt on with Astra DB’s guardrails. It’s definitely a comfort zone for me. Of course, read up on how Cassandra works. It gives you a certain comfort zone to know that you can’t just do whatever. Someone will stop you at some point. And it’s about learning possibilities. Then as a final note, make sure you have access to support and use it.
For more information on how Ankeri is using DataStax to deliver analytics at scale, read the full case study.
Behind the Innovator takes a peek behind the scenes with learnings and best practices from leading architects, operators, and developers building cloud-native, data-driven applications with Apache Cassandra™, Apache Pulsar, and open-source technologies in unprecedented times.