What is Apache Pulsar?
Apache Pulsar™ is an open-source pub-sub messaging system. It was originally developed by Yahoo! and was contributed to the Apache Software Foundation (ASF) in 2016. It is highly scalable and can handle the most demanding data movement use case out there. In these days of exploding data, Apache Pulsar is emerging as the new go-to platform for businesses that need to efficiently move their data.
Apache Pulsar combines an exciting and growing feature set into a single platform to meet a myriad of use cases, all in a single package supported by the ASF.
Pub-sub, streaming and queueing
Typically, a messaging solution is either good at streaming messages, where you are dealing with a high volume of messages in real-time performance with simple pub-sub messaging patterns, or queuing, where you are dealing with a variety of complex message exchange patterns, such as competing consumers.
Apache Pulsar is adept at handling high-volume pub-sub messaging as well as the more complex messaging patterns typical in a message queuing system. And these complex messaging patterns are handled by Pulsar — not left to the software developer to code around using a complex application built on top of a simple client.
Retention and message replay
In a traditional messaging system, the system keeps track if a particular message has been consumed. Once the consuming client is done with the message, it acknowledges the message, which tells the messaging system that message is no longer needed. A traditional messaging system will then delete the message from its persistent storage. After all, the message is no longer needed.
In a perfect world, that may be true. But in the real world, things go wrong, applications crash, availability zones go down, and being able to get that message back may be critical for rebuilding your application state. That’s why message retention is important. If something goes wrong, Pulsar can replay the messages that have been published to a topic — even if they have already been consumed. After all, you never know when you might need that message again.
The ability to retain messages also enables event-driven application architectures, such as event sourcing, where it is important to record each change of state as an event in the order it occurred.
Designed for low latency, high throughput
From the beginning, Pulsar was designed to provide low, consistent latency at high throughput. It does this by separating the concerns of serving messages between producers and consumers and storing the messages for persistence. Pulsar uses a multi-tier architecture where messages are served by brokers and stored by Apache BookKeeper. Instead of building their own storage layer, Pulsar leverages the best-in-class performance and durability of BookKeeper.
BookKeeper is a distributed log that is designed to durably store messages with IO isolation between writing and reading. This means that it can provide consistent, low latency even while large amounts of data are being written or read. Unlike traditional storage systems, performance doesn’t break down under high write pressure or under high read (consumer catch up) pressure. BookKeeper is a distributed system and it is able to seamlessly scale horizontally without needing to rebalance storage assignments.
Because Apache Pulsar uses a multiple layer approach, separating compute (brokers) from storage (BookKeeper), it fits very well into cloud infrastructures, which also separate these two concerns. Brokers are essentially stateless, and BookKeeper can easily be managed as a StatefulSet in container orchestration environments like Kubernetes — which is the defacto standard for cloud-native orchestration.
In fact, Apache Pulsar works naturally in Kubernetes, supporting rolling upgrades, rollbacks, and horizontal scaling. When coupled with persistent volumes backed by cloud storage with configurable performance dimensions, Pulsar is a highly durable and highly flexible messaging system that can scale from small test deployments to large production deployments with ease.
Infinite retention with tiered storage
Another advantage of Pulsar’s multi-layer architecture is that new layers can be added. For high performance, any persistent messaging system needs to use high-performance disks, because messages ultimately must be written to disk and may have to be retrieved from disk (if they aren’t consumed immediately).
But what happens if you need to keep around old messages in case you want to replay them or you are doing event sourcing? And what if you want to keep those messages forever? Storing those old messages on high-performance disks can get expensive, and fast.
To solve this problem, Pulsar supports tiered storage, allowing older messages to be offloaded to cheaper storage options — like S3 buckets. When a consumer needs an older message, Pulsar automatically retrieves it from the S3 bucket and delivers it to the consumer.
Yes, the performance will be lower. But when dealing with messages that are months or even years old, performance doesn’t matter. You just want those messages to be available when or if you need them without breaking the bank.
Multi-tenancy and namespaces
Once you have a high-performance, scalable messaging system in place, you will want to share it between different teams and groups within your organization. It doesn’t make sense to have to replicate the high-performance system to make sure different teams don’t impact each other or build a complex overlay system to simulate multi-tenancy.
Pulsar was designed from the beginning to be a multi-tenant system. As such, different teams can safely share the messages system. Each tenant has its own authentication, authorization, and policies. And tenants can be further divided into namespaces, which makes it easy to support different environments — such as development, staging, and production — within a single tenant.
Built-in schema registry
One of the biggest challenges of any messaging system is making sure producers and consumers are talking the same language. Because producers and consumers are decoupled, it is easy for one or both to change the format of the messages they are sending or expecting to receive and applications end up getting broken.
The solution to this is a schema registry that enforces producers and consumers to use messages with a compatible schema. Pulsar includes a schema registry out-of-the-box. You just need to register the schema with a Pulsar topic and it takes care of enforcing the schema rules.
Replicating messages to remote locations is important to support disaster recovery or to enable applications to operate on a global scale. If the users of your application move around the world, you want them to have the same user experience no matter where they are. With geo-replication, applications can connect to the local cluster but can send and receive to clusters around the world.
With Pulsar, geo-replication of messages is built in. If you publish a message to a topic in a replicated namespace, that message is automatically replicated to the configured remote geo-location or locations. No complex configurations or add-ons are needed.
Apache Pulsar supports four different subscription types: exclusive, failover, shared, and key shared. It also supports multiple subscriptions on a single topic. Using subscriptions, you can easily configure messaging patterns — such as queuing, pub-sub, fan-out, and competing consumers.
Apache Pulsar implements the competing-consumers pattern using its shared subscriptions. You can scale the number of consumers up and down on a shared subscription seamlessly. There are no partitions involved. Just add a consumer and it starts receiving messages right away.
Dead letter topic, negative acknowledgment, delayed delivery
Apache Pulsar supports a variety of advanced messaging capabilities that make it easy to build powerful and flexible applications around it. With negative acknowledgment, a consuming client can put a message back on a topic to process it later or allow another consumer to attempt to process it. If a consumer is unable to process a message, instead of getting blocked, it can send the message to a dead-letter topic to become unblocked and to save the problematic message for later analysis.
If you want to send a message after a delay, Pulsar can do that using the delayed delivery feature. When you publish a message you can set a configurable amount of time to wait before the messages can be consumed.
Integrated streaming functions
Increasingly, we want to get insights from the data we are collecting in real time. Gone are the days when waiting for an overnight batch job to crunch all the data and getting insight the next day was considered good enough. Today, we want our insights in real time so we can react in real time.
In order to get real-time insights, data must be processed in real time. With Pulsar, you can seamlessly integrate lightweight functions into the message flow, performing cleaning, enrichment, and analysis of the data in real time. There’s no need to dump everything into a data lake and process it later. With Pulsar functions, you can process the data as it flows through the messaging system. Pulsar functions can be written in Java, Python, or Go and can be configured to run as Kubernetes pods.
One of the main functions of a messaging system is to glue together data-intensive systems like databases, stream-processing engines, and other messaging systems. Since this is common, it makes sense to provide a common framework and connectors to make this easy to do. That’s exactly what Pulsar does with its IO connectors.
Pulsar provides a number of ready-made connectors that run inside the Pulsar cluster that make it easy to glue your systems together. Pulsar comes with a wide variety of connectors, including MySQL, MongoDb, Cassandra, RabbitMQ, Kafka, Flume, Redis, and many more.
SQL queries with Presto
If you are storing a lot of data in Pulsar it can be very useful to run queries on that data and do that while Pulsar is doing its main job of sending and receiving messages. Pulsar makes exactly this possible by leveraging the SQL query engine Presto. Pulsar integrates with Presto so you can perform SQL queries on the data stored in your topics. You can even query the data if it is offloaded into tiered storage. And the queries bypass the broker, so they won’t impact the ability of the Pulsar cluster to send and receive messages in real time.
Partitioned and non-partitioned topics
Apache Pulsar supports both partitioned and non-partitioned topics. For lower performance use cases, you can use a non-partitioned topic to keep things simple. But if you have a high-performance use case where you need to process a high volume of data on a single topic, you can use a partitioned topic to take advantage of parallelism in the processing. You can seamlessly add partitions as performance requirements grow.
Like Kafka, Pulsar is able to guarantee message order if you publish your message with keys. Pulsar will assign messages with the same key to the same partition, guaranteeing order for messages sent to that key.
Persistent and non-persistent messages
Persistent messages are sent to Apache BookKeeper for storage on disk. These messages are guaranteed to be delivered at-least-once regardless of the failure of the network, application, or even Pulsar itself.
However, there are some cases where this level of guaranteed delivery is not required. At-most-once delivery is sufficient. For those cases, Apache Pulsar supports non-persistent messages. Non-persistent messages are not stored to disk, reducing resource requirements while still delivering high throughput and low latency.
Sometimes, only the latest instance of a piece of data is of interest. You don’t care about all the historical values — just the latest value. If that’s the case, you can use a Pulsar compacted topic to store only the latest value on a particular key in a topic.
All data is published to a compacted topic. But Pulsar will periodically remove the old values for a key, leaving only the latest. Compacted topics prevent the topic from growing indefinitely and gives you quick access to the latest values on a topic.
Pulsar has a wide variety of client libraries that are maintained by the core project: Java, Python, C++, Golang, Node.js, and C#. And if you don’t want to use a Pulsar client library at all, Pulsar includes a WebSockets proxy.
There are many other clients being developed by the community, such as Scala and Rust. And if you prefer to use HTTP to send and receive Pulsar messages, you can use Pulsar Beam.
Apache Pulsar combines the best features of a traditional messaging system like RabbitMQ with those of a pub-sub system like Kafka. You get the best of both worlds in a high performance, cloud-native package. It’s not a surprise that Pulsar has been increasing in popularity since it became an Apache open source project. And given its advantages, it will likely continue to gain in popularity in the years to come.
(Editor's note: DataStax acquired Kesque in January 2021.)