Company · June 27, 2020

What is Apache Pulsar?

Chris Bartholomew

In these days of exploding data, Apache Pulsar™ is emerging as the new go-to platform for businesses that need to efficiently move their data. It has an exciting and growing feature set that is integrated into a single platform to meet a wide range of real-time event-streaming needs, including data pipelines, microservices, instant messaging, data integration, and more. Since being open-sourced in 2016, Pulsar has been adopted by hundreds of companies, including Yahoo! Japan, Tencent, Comcast, Overstock, and, of course, DataStax.

In this article, we’ll cover Apache Pulsar: 

  • basics
  • features
  • benefits
  • use cases
  • best practices 

Apache Pulsar basics

Before exploring all the ways Apache Pulsar can help, we’ll start with the basics.

What is Apache Pulsar?

Apache Pulsar is a cloud-native, distributed, open-source pub-sub messaging and streaming platform. Originally developed by Yahoo! and contributed to the Apache Software Foundation (ASF) in 2016, it now manages hundreds of billions of events per day. It is highly scalable and can handle the most demanding data movement use cases out there. 

Pulsar combines the best features of a traditional messaging system like RabbitMQ with those of a pub-sub system like Kafka. You get the best of both worlds in a high performance, cloud-native package. It’s not a surprise that Pulsar has been increasing in popularity since it became an Apache open source project. Given its advantages, its community will likely continue to grow quickly.

Open-source pub-sub messaging system

Pub-sub is short for publish-subscribe, a messaging pattern. With pub-sub, the senders of messages (or events), called publishers, don’t send them to specific receivers. Instead, they publish to topics, and the receivers, called subscribers, subscribe to the topics they’re interested in. Each time a message is published to a topic, all of that topic’s subscribers receive it right away. Very little queueing or batching is necessary.

Publishers don’t know who the subscribers are or what topics they subscribe to, and subscribers receive messages of interest without any knowledge of the publisher. The two sides communicate independently. The asynchronous nature of pub-sub enables loose coupling and scalability, making it a good fit for distributed applications, as well as serverless and microservices architectures.
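
To make the decoupling concrete, here is a minimal in-memory sketch of the pattern. It is a toy model, not the Pulsar client API: `ToyBroker`, the topic names, and the inbox lists are all made up for illustration.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a pub-sub broker (not the Pulsar API)."""

    def __init__(self):
        # topic name -> list of subscriber callbacks
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # The publisher never sees who, if anyone, is listening.
        for callback in self._subscribers[topic]:
            callback(message)


broker = ToyBroker()
billing_inbox, audit_inbox = [], []
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", audit_inbox.append)

broker.publish("orders", "order-1001 created")
broker.publish("payments", "nobody is subscribed to this topic yet")
```

Both inboxes end up with the same order message, and the publisher’s code is identical whether a topic has two subscribers or none.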

How does Pulsar compare to other messaging systems, including Apache Kafka?

Unlike many other event streaming platforms, Apache Pulsar is cloud-native, much easier to scale, and has multi-data center and active/active configuration support.

But how does it compare to Apache Kafka, a leading and long-established pub-sub messaging system? Both systems enable applications to access data in motion, as well as at rest.

However, it turns out Pulsar has several advantages over Kafka, including:

  • cost
  • performance
  • ease of deployment
  • geo-replication
  • scaling
  • architecture (Pulsar has tiered storage, decoupled compute and storage, and multitenancy)
  • queuing
  • support for messaging semantics of MQ-based solutions

An analysis by the market research firm GigaOm found that Pulsar wins on price and performance.

Here are some of the findings from the report, which show that Pulsar has:

  • 81% lower 3-year cost compared to Kafka
  • 35% higher performance
  • 73% savings for high complexity scenarios
  • 81% savings for higher data volumes

Apache Pulsar structural basics

Let’s take a look at Apache Pulsar’s building blocks.

Cloud-native architecture

Because Apache Pulsar uses a multi-layer approach, separating compute (brokers) from storage (BookKeeper), it fits very well into cloud infrastructures, which also separate these two concerns. Brokers are essentially stateless, and BookKeeper can easily be managed as a StatefulSet in container orchestration environments like Kubernetes, the de facto standard for cloud-native orchestration.

In fact, Apache Pulsar works naturally in Kubernetes, supporting rolling upgrades, rollbacks, and horizontal scaling. When coupled with persistent volumes backed by cloud storage with configurable performance dimensions, Pulsar is a highly durable and highly flexible messaging system that can scale from small test deployments to large production deployments with ease. 

Client libraries

Pulsar has a wide variety of client libraries maintained by the core project: Java, Python, C++, Golang, Node.js, and C#. If you don’t want to use a Pulsar client library at all, Pulsar includes a WebSockets proxy.

The community is developing many other clients, for languages such as Scala and Rust. If you prefer to use HTTP to send and receive Pulsar messages, you can use Pulsar Beam.

Multi-tenancy and namespaces

Once you have a high-performance, scalable messaging system in place, you will want to share it between different teams and groups within your organization. It doesn’t make sense to have to replicate the high-performance system to make sure different teams don’t impact each other or build a complex overlay system to simulate multi-tenancy.

Pulsar was designed from the beginning to be a multi-tenant system. As such, different teams can safely share the messaging system. Each tenant has its own authentication, authorization, and policies. And tenants can be further divided into namespaces, which makes it easy to support different environments — such as development, staging, and production — within a single tenant.
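
Tenants and namespaces show up directly in topic names, which take the form `persistent://tenant/namespace/topic`. The helper below is a small sketch of that naming scheme; the function name and the `payments-team` tenant are hypothetical.

```python
def topic_name(tenant, namespace, topic, persistent=True):
    """Build a fully qualified topic name from its three parts."""
    scheme = "persistent" if persistent else "non-persistent"
    return f"{scheme}://{tenant}/{namespace}/{topic}"


# One hypothetical tenant with a namespace per environment.
for env in ("development", "staging", "production"):
    print(topic_name("payments-team", env, "invoices"))
```

Because the environment is part of the name, the same application code can be pointed at development, staging, or production by changing only the namespace.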

Features of Apache Pulsar

Apache Pulsar has a rapidly growing feature set. Let’s review some of its key features. 

Built-in schema registry

One of the biggest challenges of any messaging system is making sure producers and consumers are communicating in the same language. Because producers and consumers are decoupled, it’s easy for one or both to change the format of the messages they are sending or expecting to receive. The result? Applications end up broken.

The solution to this is a schema registry that requires producers and consumers to use messages with a compatible schema. Pulsar includes a schema registry out-of-the-box. You just need to register the schema with a Pulsar topic and it takes care of enforcing the schema rules. 
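
The idea of enforcing a schema on a topic can be sketched with a toy model. This is not how Pulsar’s registry is implemented, and `ToySchemaTopic` is a made-up class; it only shows a topic refusing records that don’t match what was registered.

```python
class ToySchemaTopic:
    """Toy topic that enforces a registered schema (not the Pulsar API)."""

    def __init__(self, schema):
        # schema: field name -> required Python type
        self.schema = schema
        self.messages = []

    def publish(self, record):
        if set(record) != set(self.schema):
            raise ValueError(f"fields {sorted(record)} do not match schema")
        for field, expected in self.schema.items():
            if not isinstance(record[field], expected):
                raise ValueError(f"{field} must be {expected.__name__}")
        self.messages.append(record)


topic = ToySchemaTopic({"user_id": int, "email": str})
topic.publish({"user_id": 7, "email": "a@example.com"})

try:
    # A producer that silently switched user_id to a string is rejected.
    topic.publish({"user_id": "7", "email": "a@example.com"})
    rejected = False
except ValueError:
    rejected = True
```

The mismatched record never reaches consumers, so the breakage surfaces at the producer that introduced it.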

Built-in geo-replication

Replicating messages to remote locations is important to support disaster recovery or to enable applications to operate on a global scale. When the users of your application travel, you want them to have the same experience no matter where they are. With geo-replication, applications can connect to the local cluster and still exchange messages with clusters around the world.

With Pulsar, geo-replication of messages is built in. If you publish a message to a topic in a replicated namespace, that message is automatically replicated to the configured remote geo-location or locations. No complex configurations or add-ons are needed. 

IO connectors

One of the main functions of a messaging system is to glue together data-intensive systems like databases, stream-processing engines, and other messaging systems. Since this is common, it makes sense to provide a common framework and connectors to make it easy. That’s exactly what Pulsar does with its IO connectors.

Pulsar comes with a wide variety of ready-made connectors, including MySQL, MongoDB, Cassandra, RabbitMQ, Kafka, Flume, Redis, and many more, making it easy to glue your systems together.

Benefits of Apache Pulsar

Here are some of Apache Pulsar’s advantages. 

Infinite retention

One important advantage of Pulsar’s multi-layer architecture is that new layers can be added. For high performance, any persistent messaging system needs to use high-performance disks, because messages ultimately must be written to disk and may have to be retrieved from disk (if they aren’t consumed immediately). But what happens if you need to keep old messages in case you want to replay them or you are doing event sourcing? And what if you want to keep those messages forever? Storing those old messages on high-performance disks can get expensive, and fast.

To solve this problem, Pulsar supports tiered storage, allowing older messages to be offloaded to cheaper storage options — like S3 buckets. When a consumer needs an older message, Pulsar automatically retrieves it from the S3 bucket and delivers it to the consumer. 

Yes, performance will be lower. But for messages that are months or even years old, retrieval speed usually matters far less than cost. You just want those messages to be available when or if you need them without breaking the bank.
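
The offloading idea can be illustrated with a toy log that keeps only the newest messages on fast storage. This sketch is an assumption-laden simplification: `ToyTieredLog`, its two dictionaries, and the capacity-based trigger are invented for illustration and say nothing about how Pulsar actually decides what to offload.

```python
class ToyTieredLog:
    """Toy log that moves its oldest messages to cheaper storage."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = {}    # offset -> message on fast disks
        self.cold = {}   # offset -> message in cheap object storage
        self.next_offset = 0

    def append(self, message):
        self.hot[self.next_offset] = message
        self.next_offset += 1
        while len(self.hot) > self.hot_capacity:
            oldest = min(self.hot)
            self.cold[oldest] = self.hot.pop(oldest)

    def read(self, offset):
        # Readers make one call; which tier answers is hidden from them.
        if offset in self.hot:
            return self.hot[offset]
        return self.cold[offset]


log = ToyTieredLog(hot_capacity=2)
for i in range(5):
    log.append(f"m{i}")

oldest_message = log.read(0)   # served from the cold tier
newest_message = log.read(4)   # served from the hot tier
```

The consumer-facing `read` call is the same for old and new messages, which is the property that makes long retention practical.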

Flexible subscriptions

Apache Pulsar supports four different subscription types: exclusive, failover, shared, and key shared. It also supports multiple subscriptions on a single topic. Using subscriptions, you can easily configure messaging patterns — such as queuing, pub-sub, fan-out, and competing consumers.

Apache Pulsar implements the competing-consumers pattern using its shared subscriptions. You can scale the number of consumers up and down on a shared subscription seamlessly. There are no partitions involved. Just add a consumer and it starts receiving messages right away. 
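
The difference between the subscription types comes down to which consumer receives each message. The toy dispatcher below models that choice; it is not broker code, the consumer names are hypothetical, and `crc32` is only a stand-in for whatever key hashing the broker really uses.

```python
import zlib
from itertools import cycle


def dispatch(messages, consumers, mode):
    """Assign (key, value) messages to consumers of one subscription."""
    if mode == "exclusive" and len(consumers) != 1:
        raise ValueError("an exclusive subscription allows one consumer")
    assignments = {name: [] for name in consumers}
    round_robin = cycle(consumers)
    for key, value in messages:
        if mode in ("exclusive", "failover"):
            # One active consumer; in failover the others are standbys.
            target = consumers[0]
        elif mode == "shared":
            target = next(round_robin)
        elif mode == "key_shared":
            # Same key, same consumer (crc32 is a stand-in hash).
            target = consumers[zlib.crc32(key.encode()) % len(consumers)]
        else:
            raise ValueError(f"unknown subscription type: {mode}")
        assignments[target].append(value)
    return assignments


messages = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
shared = dispatch(messages, ["c1", "c2"], "shared")
key_shared = dispatch(messages, ["c1", "c2"], "key_shared")
failover = dispatch(messages, ["c1", "c2"], "failover")
```

In the shared case the work is spread across both consumers, which is the competing-consumers pattern. In the key-shared case, every message for a given key lands on the same consumer.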

Low latency, high throughput

From the beginning, Pulsar was designed to provide low, consistent latency at high throughput. It does this by separating the concerns of serving messages between producers and consumers and storing the messages for persistence. Pulsar uses a multi-tier architecture where messages are served by brokers and stored by Apache BookKeeper. Instead of building its own storage layer, Pulsar leverages the best-in-class performance and durability of BookKeeper.

BookKeeper is a distributed log that is designed to durably store messages with IO isolation between writing and reading. This means it can provide consistent, low latency even while large amounts of data are being written or read. Unlike traditional storage systems, performance doesn’t break down under high write pressure or under high read (consumer catch up) pressure. BookKeeper is a distributed system and it is able to seamlessly scale horizontally without needing to rebalance storage assignments.

Use cases / applications for Apache Pulsar

Features and benefits are one thing, but how can Apache Pulsar help each and every day? What are the use cases and applications for Pulsar? Take a look. 

Pub-sub, streaming and queueing

Typically, a messaging solution is good at one of two things. The first is streaming, where you deal with a high volume of messages in real time using simple pub-sub patterns. The second is queuing, where you deal with a variety of complex message exchange patterns, such as competing consumers.

Apache Pulsar is adept at handling high-volume pub-sub messaging as well as the more complex messaging patterns typical in a message queuing system. And these complex messaging patterns are handled by Pulsar — not left to the software developer to code around using a complex application built on top of a simple client. 

Retention and message replay

In a traditional messaging system, the system keeps track of whether a particular message has been consumed. Once the consuming client is done with the message, it acknowledges the message, which tells the messaging system that the message is no longer needed. A traditional messaging system will then delete the message from its persistent storage. After all, the consumer is done with it.

In a perfect world, that may be true. But in the real world, things go wrong, applications crash, availability zones go down, and being able to get that message back may be critical for rebuilding your application state. That’s why message retention is important. If something goes wrong, Pulsar can replay the messages that have been published to a topic—even if they have already been consumed. After all, you never know when you might need that message again.

The ability to retain messages also enables event-driven application architectures, such as event sourcing, where it is important to record each change of state as an event in the order it occurred.
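
Retention and replay can be pictured as an append-only log plus a cursor per subscription. The sketch below is a toy model with invented names (`ToyTopic`, `rewind`), not the Pulsar API, and it treats delivery as acknowledgment to stay short.

```python
class ToyTopic:
    """Append-only log with a single subscription cursor (toy model)."""

    def __init__(self):
        self.log = []      # retained messages, oldest first
        self.cursor = 0    # index of the next message to deliver

    def publish(self, message):
        self.log.append(message)

    def receive(self):
        message = self.log[self.cursor]
        self.cursor += 1   # delivery doubles as acknowledgment here
        return message

    def rewind(self, position=0):
        # Consumed messages are still in the log, so replaying them
        # only requires moving the cursor back.
        self.cursor = position


topic = ToyTopic()
for event in ("deposit 100", "withdraw 30", "deposit 5"):
    topic.publish(event)

first_pass = [topic.receive() for _ in range(3)]
topic.rewind()
replayed = [topic.receive() for _ in range(3)]
```

The second pass returns the same events in the same order, which is what an event-sourced application relies on to rebuild its state.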

Dead letter topic, negative acknowledgment, delayed delivery

Apache Pulsar supports a variety of advanced messaging capabilities that make it easy to build powerful and flexible applications around it. With negative acknowledgment, a consuming client can put a message back on a topic to process it later or allow another consumer to attempt to process it. If a consumer is unable to process a message, instead of getting blocked, it can send the message to a dead-letter topic to become unblocked and to save the problematic message for later analysis.

If you want to send a message after a delay, Pulsar can do that using the delayed delivery feature. When you publish a message, you can set a configurable amount of time to wait before the message can be consumed.
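
Here is a rough sketch of how negative acknowledgment and a dead-letter topic work together. It is a toy consumer loop, not the Pulsar client: `consume`, `max_redeliveries`, and the use of `ValueError` as the failure signal are assumptions made for the example, and delayed delivery is not modeled.

```python
from collections import deque


def consume(messages, handler, max_redeliveries=2):
    """Process messages; park repeated failures on a dead-letter list."""
    pending = deque((msg, 0) for msg in messages)
    processed, dead_letters = [], []
    while pending:
        msg, attempts = pending.popleft()
        try:
            processed.append(handler(msg))
        except ValueError:
            if attempts < max_redeliveries:
                # Negative acknowledgment: requeue for a later attempt.
                pending.append((msg, attempts + 1))
            else:
                # Give up and keep the message for later analysis.
                dead_letters.append(msg)
    return processed, dead_letters


# "oops" can never be parsed as an integer, so it ends up dead-lettered.
processed, dead_letters = consume(["1", "oops", "3"], int)
```

The bad message is retried a bounded number of times and then set aside, so the good messages behind it are never blocked.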

Integrated streaming functions

Increasingly, we want to get insights from the data we are collecting in real time. Gone are the days when waiting for an overnight batch job to crunch all the data and getting insight the next day was considered good enough. Today, we want our insights in real time so we can react in real time.

In order to get real-time insights, data must be processed in real time. With Pulsar, you can seamlessly integrate lightweight functions into the message flow, performing cleaning, enrichment, and analysis of the data in real time. There’s no need to dump everything into a data lake and process it later. With Pulsar functions, you can process the data as it flows through the messaging system. Pulsar functions can be written in Java, Python, or Go and can be configured to run as Kubernetes pods. 
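
A lightweight function of this kind can be very small. The sketch below assumes the language-native style for Python, in which a function is a module-level `process` that receives each message and returns the value to publish. The sensor payload and its field names are hypothetical.

```python
import json


def process(input):
    """Clean and enrich one JSON-encoded sensor reading."""
    reading = json.loads(input)
    celsius = float(reading["temp_c"])
    enriched = {
        # Cleaning: normalize whitespace and case in the sensor name.
        "sensor": reading["sensor"].strip().lower(),
        "temp_c": celsius,
        # Enrichment: add a derived Fahrenheit field.
        "temp_f": round(celsius * 9 / 5 + 32, 1),
    }
    return json.dumps(enriched, sort_keys=True)


result = process('{"sensor": " Boiler-7 ", "temp_c": "80"}')
```

Because the function is plain Python with no broker dependency, it can be unit tested on its own before it is deployed into the message flow.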

Best practices for pub-sub messaging with Apache Pulsar

To get the most out of Apache Pulsar, we recommend following these best practices.

Run queries on high data storage

If you are storing a lot of data in Pulsar, it can be very useful to run queries on that data while Pulsar is doing its main job of sending and receiving messages. Pulsar makes this possible by integrating with the SQL query engine Presto, so you can perform SQL queries on the data stored in your topics. You can even query data that has been offloaded into tiered storage. And the queries bypass the broker, so they won’t impact the ability of the Pulsar cluster to send and receive messages in real time.

Coordinate partitions with performance

Apache Pulsar supports both partitioned and non-partitioned topics. For lower performance use cases, you can use a non-partitioned topic to keep things simple. But if you have a high-performance use case where you need to process a high volume of data on a single topic, you can use a partitioned topic to take advantage of parallelism in the processing. You can seamlessly add partitions as performance requirements grow.

Like Kafka, Pulsar is able to guarantee message order if you publish your message with keys. Pulsar will assign messages with the same key to the same partition, guaranteeing order for messages sent to that key.
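
Key-based ordering follows from a deterministic mapping of keys to partitions. The sketch below shows the idea with a stand-in hash: `partition_for` is a hypothetical helper, and `crc32` is used only because it is stable across runs, not because it is what the Pulsar client uses.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a message key to a partition index, deterministically."""
    # crc32 stands in for the client's real hashing scheme.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
partitions = {}
for key, action in events:
    partitions.setdefault(partition_for(key, 4), []).append((key, action))

user_1_partition = partitions[partition_for("user-1", 4)]
```

Every `user-1` event lands in the same partition, in publish order. Note that in this sketch a key’s partition depends on the partition count, so the mapping shifts if that count changes.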

Use non-persistent messages as necessary

Persistent messages are sent to Apache BookKeeper for storage on disk. These messages are guaranteed to be delivered at-least-once regardless of the failure of the network, application, or even Pulsar itself.

However, there are some cases where this level of guaranteed delivery is not required and at-most-once delivery is sufficient. For those cases, Apache Pulsar supports non-persistent messages. Non-persistent messages are not stored to disk, reducing resource requirements, while still delivering high throughput and low latency.

Use compacted topics for latest-value data

Sometimes, only the latest instance of a piece of data is of interest. You don’t care about all the historical values — just the latest value. If that’s the case, you can use a Pulsar compacted topic to store only the latest value on a particular key in a topic.

All data is still published to the compacted topic, but Pulsar periodically removes the old values for each key, leaving only the latest. Compacted topics prevent the topic from growing indefinitely and give you quick access to the latest values on a topic.
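
The effect of compaction is easy to sketch: walk the log in order and keep only the last value seen for each key. The `compact` function and the sensor readings below are illustrative only and are not Pulsar’s compaction implementation.

```python
def compact(log):
    """Keep only the most recent value for each key."""
    latest = {}
    for key, value in log:
        latest[key] = value   # later entries overwrite earlier ones
    return latest


# Hypothetical readings, oldest first.
log = [
    ("sensor-a", 20.5),
    ("sensor-b", 18.0),
    ("sensor-a", 21.0),
    ("sensor-a", 19.5),
]
snapshot = compact(log)
```

Four published messages reduce to two retained values, one per key, and the size of the result is bounded by the number of distinct keys rather than by the number of messages.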

Getting started with Apache Pulsar

Although Apache Pulsar has been publicly available for only a few years, its rate of adoption has been staggering. After comparing it to Kafka and other messaging system options, and reviewing Pulsar’s features, benefits, and wide range of applications, it makes complete sense. As Pulsar’s open source community continues to grow at an incredible pace, its features, benefits, and use cases will only multiply.

If you think Apache Pulsar could be a good fit for you, take a look at DataStax’s Luna Streaming—a production-ready, open-source distribution and support subscription for Apache Pulsar.

Start building a more robust messaging infrastructure.


 (Editor's note: DataStax acquired Kesque in January 2021.)
