What is Apache Pulsar?
In these days of exploding data, Apache Pulsar™ is emerging as the new go-to platform for businesses that need to efficiently move their data. It has an exciting and growing feature set that is integrated into a single platform to meet a wide range of real-time event-streaming needs, including data pipelines, microservices, instant messaging, data integration, and more. Since being open-sourced in 2016, Pulsar has been adopted by hundreds of companies, including Verizon Media, Yahoo! Japan, Tencent, Comcast, Overstock, and, of course, DataStax.
In this article, we’ll cover Apache Pulsar:
- use cases
- best practices
Apache Pulsar basics
Before exploring all the ways Apace Pulsar can help, we’ll start with the basics.
What is Apache Pulsar?
Apache Pulsar is a cloud-native, distributed, open-source pub-sub messaging and streaming platform. Originally developed by Yahoo! and contributed to the Apache Software Foundation (ASF) in 2016, it now manages hundreds of billions of events per day. It is highly scalable and can handle the most demanding data movement use cases out there.
Pulsar combines the best features of a traditional messaging system like RabbitMQ with those of a pub-sub system like Kafka. You get the best of both worlds in a high performance, cloud-native package. It’s not a surprise that Pulsar has been increasing in popularity since it became an Apache open source project. Given its advantages, its community will likely continue to grow quickly.
Open-source pub-sub messaging system
Pub-sub is short for a publish-subscribe messaging platform. With pub-sub, the senders of messages, or publishers, don’t send messages (or events) to specific receivers, or subscribers. Instead, the consumers of messages subscribe to the topics they’re interested in. Each time a message associated with that topic is published, all subscribers immediately receive it. Very little queueing or batching is necessary.
Publishers don’t know who the subscribers are or what topics they subscribe to and subscribers receive messages of interest without knowledge of the publisher. They communicate independently. The asynchronous nature of pub-sub enables loose coupling and scalability, making it a perfect fit for distributed applications, as well as serverless and microservices architectures.
How does Pulsar compare to other messaging systems, including Apache Kafka?
Unlike many other event streaming platforms, Apache Pulsar is cloud-native, much easier to scale, and has multi-data center and active/active configuration support.
But how does it compare to a leading, traditional pub-sub messaging system option, Apache Kafka? Both systems enable applications to access data in motion, as well as at rest.
However, it turns out Pulsar has several advantages over Kafka, including:
- ease of deployment
- architecture (Pulsar has tiered storage, decoupled compute and storage, and multitenancy)
- support for messaging semantics of MQ-based solutions
Analysis by market research firm, GigaOm, found that Pulsar wins on price and performance.
Here some of the findings, from the report, showing Pulsar has:
- 81% lower 3-year cost compared to Kafka
- 35% higher performance
- 73% savings for high complexity scenarios
- 81% savings for higher data volumes
Apache Pulsar structural basics
Let’s take a look at Apache Pulsar’s building blocks.
Because Apache Pulsar uses a multiple layer approach, separating compute (brokers) from storage (BookKeeper), it fits very well into cloud infrastructures, which also separate these two concerns. Brokers are essentially stateless, and BookKeeper can easily be managed as a StatefulSet in container orchestration environments like Kubernetes — which is the de facto standard for cloud-native orchestration.
In fact, Apache Pulsar works naturally in Kubernetes, supporting rolling upgrades, rollbacks, and horizontal scaling. When coupled with persistent volumes backed by cloud storage with configurable performance dimensions, Pulsar is a highly durable and highly flexible messaging system that can scale from small test deployments to large production deployments with ease.
Pulsar has a wide variety of client libraries maintained by the core project: Java, Python, C++, Golang, Node.js, and C#. If you don’t want to use a Pulsar client library at all, Pulsar includes a WebSockets proxy.
There are many other clients being developed by the community, such as Scala and Rust. If you prefer to use HTTP to send and receive Pulsar messages, you can use Pulsar Beam.
Multi-tenancy and namespaces
Once you have a high-performance, scalable messaging system in place, you will want to share it between different teams and groups within your organization. It doesn’t make sense to have to replicate the high-performance system to make sure different teams don’t impact each other or build a complex overlay system to simulate multi-tenancy.
Pulsar was designed from the beginning to be a multi-tenant system. As such, different teams can safely share the messaging system. Each tenant has its own authentication, authorization, and policies. And tenants can be further divided into namespaces, which makes it easy to support different environments — such as development, staging, and production — within a single tenant.
Features of Apache Pulsar
Apache Pulsar has a rapidly growing feature set. Let’s review some of its key features.
Built-in schema registry
One of the biggest challenges of any messaging system is making sure producers and consumers are communicating in the same language. Because producers and consumers are decoupled, it’s easy for one or both to change the format of the messages they are sending or expecting to receive. The result? Applications end up broken.
The solution to this is a schema registry that requires producers and consumers to use messages with a compatible schema. Pulsar includes a schema registry out-of-the-box. You just need to register the schema with a Pulsar topic and it takes care of enforcing the schema rules.
Replicating messages to remote locations is important to support disaster recovery or to enable applications to operate on a global scale. When the users of your application travel, you want them to have the same experience no matter where they are. With geo-replication, applications can connect to the local cluster, and still send and receive to clusters around the world.
With Pulsar, geo-replication of messages is built in. If you publish a message to a topic in a replicated namespace, that message is automatically replicated to the configured remote geo-location or locations. No complex configurations or add-ons are needed.
One of the main functions of a messaging system is to glue together data-intensive systems like databases, stream-processing engines, and other messaging systems. Since this is common, it makes sense to provide a common framework and connectors to make it easy. That’s exactly what Pulsar does with its IO connectors.
Pulsar comes with a wide variety of ready-made connectors, including MySQL, MongoDb, Cassandra, RabbitMQ, Kafka, Flume, Redis, and many more, making it easy to glue your systems together.
Benefits of Apache Pulsar
Here are some of Apache Pulsar’s advantages.
One important advantage of Pulsar’s multi-layer architecture is that new layers can be added. For high performance, any persistent messaging system needs to use high-performance disks, because messages ultimately must be written to disk and may have to be retrieved from disk (if they aren’t consumed immediately). But what happens if you need to keep old messages in case you want to replay them or you are doing event sourcing? And what if you want to keep those messages forever? Storing those old messages on high-performance disks can get expensive, and fast.
To solve this problem, Pulsar supports tiered storage, allowing older messages to be offloaded to cheaper storage options — like S3 buckets. When a consumer needs an older message, Pulsar automatically retrieves it from the S3 bucket and delivers it to the consumer.
Yes, performance will be lower. But when dealing with messages that are months or even years old, performance doesn’t matter. You just want those messages to be available when or if you need them without breaking the bank.
Apache Pulsar supports four different subscription types: exclusive, failover, shared, and key shared. It also supports multiple subscriptions on a single topic. Using subscriptions, you can easily configure messaging patterns — such as queuing, pub-sub, fan-out, and competing consumers.
Apache Pulsar implements the competing-consumers pattern using its shared subscriptions. You can scale the number of consumers up and down on a shared subscription seamlessly. There are no partitions involved. Just add a consumer and it starts receiving messages right away.
Low latency, high throughput
From the beginning, Pulsar was designed to provide low, consistent latency at high throughput. It does this by separating the concerns of serving messages between producers and consumers and storing the messages for persistence. Pulsar uses a multi-tier architecture where messages are served by brokers and stored by Apache BookKeeper. Instead of building their own storage layer, Pulsar leverages the best-in-class performance and durability of BookKeeper.
BookKeeper is a distributed log that is designed to durably store messages with IO isolation between writing and reading. This means it can provide consistent, low latency even while large amounts of data are being written or read. Unlike traditional storage systems, performance doesn’t break down under high write pressure or under high read (consumer catch up) pressure. BookKeeper is a distributed system and it is able to seamlessly scale horizontally without needing to rebalance storage assignments.
Use cases / applications for Apache Pulsar
Features and benefits are one thing, but how can Apache Pulsar help each and every day? What are the use cases and applications for Pulsar? Take a look.
Pub-sub, streaming and queueing
Typically, a messaging solution is either good at streaming messages, where you are dealing with a high volume of messages, in real-time performance, with simple pub-sub messaging patterns, or queuing, where you are dealing with a variety of complex message exchange patterns, such as competing consumers.
Apache Pulsar is adept at handling high-volume pub-sub messaging as well as the more complex messaging patterns typical in a message queuing system. And these complex messaging patterns are handled by Pulsar — not left to the software developer to code around using a complex application built on top of a simple client.
Retention and message replay
In a traditional messaging system, the system keeps track if a particular message has been consumed. Once the consuming client is done with the message, it acknowledges the message, which tells the messaging system that message is no longer needed. A traditional messaging system will then delete the message from its persistent storage. After all, the message is no longer needed.
In a perfect world, that may be true. But in the real world, things go wrong, applications crash, availability zones go down, and being able to get that message back may be critical for rebuilding your application state. That’s why message retention is important. If something goes wrong, Pulsar can replay the messages that have been published to a topic—even if they have already been consumed. After all, you never know when you might need that message again.
The ability to retain messages also enables event-driven application architectures, such as event sourcing, where it is important to record each change of state as an event in the order it occurred.
Dead letter topic, negative acknowledgment, delayed delivery
Apache Pulsar supports a variety of advanced messaging capabilities that make it easy to build powerful and flexible applications around it. With negative acknowledgment, a consuming client can put a message back on a topic to process it later or allow another consumer to attempt to process it. If a consumer is unable to process a message, instead of getting blocked, it can send the message to a dead-letter topic to become unblocked and to save the problematic message for later analysis.
If you want to send a message after a delay, Pulsar can do that using the delayed delivery feature. When you publish a message, you can set a configurable amount of time to wait before the messages can be consumed.
Integrated streaming functions
Increasingly, we want to get insights from the data we are collecting in real time. Gone are the days when waiting for an overnight batch job to crunch all the data and getting insight the next day was considered good enough. Today, we want our insights in real time so we can react in real time.
In order to get real-time insights, data must be processed in real time. With Pulsar, you can seamlessly integrate lightweight functions into the message flow, performing cleaning, enrichment, and analysis of the data in real time. There’s no need to dump everything into a data lake and process it later. With Pulsar functions, you can process the data as it flows through the messaging system. Pulsar functions can be written in Java, Python, or Go and can be configured to run as Kubernetes pods.
Best practices for pub messaging with Apache Pulsar
To get the most out of Apache Pulsar, we recommend following these best practices.
Run queries on high data storage
If you are storing a lot of data in Pulsar it can be very useful to run queries on that data and do that while Pulsar is doing its main job of sending and receiving messages. Pulsar makes this possible by leveraging the SQL query engine Presto. Pulsar integrates with Presto so you can perform SQL queries on the data stored in your topics. You can even query the data if it is offloaded into tiered storage. And the queries bypass the broker, so they won’t impact the ability of the Pulsar cluster to send and receive messages in real time.
Coordinate partitions with performance
Apache Pulsar supports both partitioned and non-partitioned topics. For lower performance use cases, you can use a non-partitioned topic to keep things simple. But if you have a high-performance use case where you need to process a high volume of data on a single topic, you can use a partitioned topic to take advantage of parallelism in the processing. You can seamlessly add partitions as performance requirements grow.
Like Kafka, Pulsar is able to guarantee message order if you publish your message with keys. Pulsar will assign messages with the same key to the same partition, guaranteeing order for messages sent to that key.
Use non-persistent messages as necessary
Persistent messages are sent to Apache BookKeeper for storage on disk. These messages are guaranteed to be delivered at-least-once regardless of the failure of the network, application, or even Pulsar itself.
However, there are some cases where this level of guaranteed delivery is not required and at-most-once delivery is sufficient. For those cases, Apache Pulsar supports non-persistent messages. Non-persistent messages are not stored to disk, reducing resource requirements, while still delivering high throughput and low latency.
Store topics on compacted keys
Sometimes, only the latest instance of a piece of data is of interest. You don’t care about all the historical values — just the latest value. If that’s the case, you can use a Pulsar compacted topic to store only the latest value on a particular key in a topic.
All data is published to a compacted topic. But Pulsar will periodically remove the old values for a key, leaving only the latest. Compacted topics prevent the topic from growing indefinitely and gives you quick access to the latest values on a topic.
Getting started with Apache Pulsar
Only publicly available for a few years, Apache Pulsar’s rate of adoption has been staggering. After comparing it to Kafka and other messaging system options, and reviewing Pulsar’s features, benefits and wide-range of applications, it makes complete sense. As Pulsar’s open source community continues to grow at an incredible pace, its features, benefits and use cases will only multiply.
If you think Apache Pulsar could be a good fit for you, take a look at DataStax’s Luna Streaming—a production-ready, open-source distribution and support subscription for Apache Pulsar.
(Editor's note: DataStax acquired Kesque in January 2021.)