Monitoring the Health of Apache Pulsar with Pulsar Heartbeat
Apache Pulsar™ has become a popular choice for high-performance pub-sub messaging and data streaming needs. To provide high availability and consistency in a distributed environment, a Pulsar cluster is designed as a multi-tier system. It consists of many components, including BookKeeper, Broker, Proxy, and ZooKeeper. To help monitor the usage and overall health of the individual components of the cluster, each component exposes Prometheus metrics. These metrics are telemetry data that give engineers insight into Pulsar system performance and health.
However, analyzing these metrics requires in-depth Pulsar domain knowledge. The collected metrics are mostly passive data based on user-generated workload. For example, the message egress and ingress metrics might not indicate the health of an individual broker. With intermittent traffic, it's difficult to rely on metrics to determine the health of the WebSocket interface.
Limitations like these mean that measuring end-to-end system availability in real time requires a monitoring application that can continuously generate synthetic workload across various protocols.
DataStax has developed Pulsar Heartbeat (formerly known as Pulsar Monitor) to fulfill the following requirements:
- Monitor service availability of the Pulsar pub-sub protocol, WebSocket interface, partitioned topics, and Pulsar admin REST interface
- Measure latency over Pulsar pub-sub, WebSocket, and partitioned topics
Astra Streaming deploys Pulsar Heartbeat in every Pulsar cluster. This enables you to monitor SLAs in real time and gain performance insights into end-to-end latency. Pulsar Heartbeat also ships with Luna Streaming to help the community and our customers operate their clusters effectively and collect performance data.
Here’s a summary of Pulsar Heartbeat’s features and capabilities:
- Monitor end-to-end Pulsar pub-sub availability and latency
- Monitor end-to-end partitioned topic availability and latency
- Monitor end-to-end WebSocket availability and latency
- Monitor Pulsar Admin REST interface availability
- Monitor Pulsar geo-replication availability
- Monitor individual Pulsar broker health
- Monitor Pulsar Kubernetes state
- Monitor multiple Pulsar clusters
- Integrate with popular DevOps alerting tools
- Generate Prometheus metrics
- Open source, written in Go
What does Pulsar Heartbeat monitor?
Pulsar Heartbeat monitors a number of pub-sub mechanisms and aspects of system health, illustrated in the diagram below. Each test module is described in the following sections.
Pub-sub latency test
The pub-sub test monitors end-to-end service availability and measures the latency from publishing a message to receiving it from Pulsar. To do so, the test sends canary messages to a system topic created for this purpose. The number of messages, the payload size, and the frequency of the test are all configurable.
The test starts a consumer in its own thread that expects the arrival of the correct messages. Once the consumer is running, a producer sends timestamped messages to the topic. When the consumer receives each message (or the full batch), the latency is calculated from the send time. Each message is also keyed with an ordered ID that is used to detect out-of-order delivery.
Pulsar Heartbeat can also evaluate WebSocket availability. It sends a message to a topic via the WebSocket producer URL and expects the same message on the WebSocket consumer URL, measuring the end-to-end latency of the WebSocket pub-sub test.
Partitioned topic test
A single pub-sub test only evaluates the health of a single broker. To test the availability of all brokers, the tool sends a number of messages over a partitioned topic that spans a wide range of brokers. Pulsar Heartbeat creates a partitioned topic with a predefined number of partitions, N, via the Pulsar admin REST interface. It then starts the same number of consumers, which expect to receive the N messages published by a producer. The producer then sends N canary messages to the partitioned topic. The test not only monitors service availability across a number of brokers, it also measures the latency from publishing the first message to receiving the last.
Pulsar has a built-in broker health monitor that regularly publishes a message to a broker health check topic for validation. The health check topic is a special per-broker topic under the Pulsar system tenant. Pulsar Heartbeat uses a Pulsar Reader client to tap into this topic and verify the health of individual brokers. This feature is enabled when the health check is configured as part of the Pulsar broker's Kubernetes liveness probe.
Pulsar Admin REST API monitor
Pulsar has an admin REST interface that manages all of the important entities in a Pulsar instance, such as tenants, namespaces, and topics. Pulsar Heartbeat monitors the health of the Admin REST API by sending HTTP requests that query the tenant list on a predefined schedule.
If the partitioned topic test is enabled, the admin REST interface is also exercised by querying for the existence of the partitioned topic, or by creating it on the first test run.
Configuration, test, verdict
Pulsar Heartbeat's test modules are data-driven. They run independently of each other, each in its own execution thread. Each test module can be enabled and configured for connection, verdict, and alerting. The configuration can be specified in either a JSON or YAML file.
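To give a feel for this configuration, here is a YAML sketch. The key names below are approximations for illustration only; consult the sample configs in the Pulsar Heartbeat repo for the authoritative schema.

```yaml
# Illustrative Pulsar Heartbeat configuration -- key names are approximate.
name: "my-pulsar-cluster"            # cluster name (placeholder)
prometheusConfig:
  port: ":8080"                      # where Prometheus metrics are served
pubSubTest:                          # pub-sub latency test module
  topicName: "persistent://mytenant/myns/heartbeat-canary"
  intervalSeconds: 60                # test frequency
  numberOfMessages: 5
  payloadSize: "15B"
  latencyBudgetMs: 360               # per-test verdict threshold
websocketTest:
  enabled: true
opsGenieConfig:
  alertKey: "<api-key>"              # alerting integration (placeholder)
```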
Pulsar Heartbeat repeatedly runs these tests on a predefined schedule to test the availability of each protocol and measure the latency if applicable.
The basic verdict of these tests relies on successfully publishing and receiving the correct message or messages. When more than one message is sent, in-order receipt is also verified.
The verdict on end-to-end latency is evaluated against a predefined latency budget that is configurable for each test.
Kubernetes deployment monitor
Apache Pulsar's cloud-native design is a significant competitive advantage. Pulsar can be easily deployed and managed in a Kubernetes cluster. Pulsar Heartbeat can be set up as a Kubernetes deployment in the cluster to discover ZooKeeper, BookKeeper, Proxy, and Broker replicas. It can thus monitor their availability and alert users when any instance goes offline.
To monitor Pulsar instances in a Kubernetes cluster, Pulsar Heartbeat has to be deployed within the same Kubernetes cluster; this feature is not supported when Pulsar Heartbeat runs remotely from the Pulsar cluster. For Pulsar Heartbeat to access the Kubernetes API server, a service account with the required access role must be configured.
Monitor multiple clusters
Pulsar Heartbeat can run remotely against multiple Pulsar clusters. It can perform tests of Pulsar pub-sub, WebSocket, the REST interface, and partitioned topics, and it measures end-to-end latency for each of these tests.
Even if you operate only a single cluster, this feature can be used to factor network latency to a remote Pulsar client into the standard latency tests offered by Pulsar Heartbeat.
Geo-replication, which replicates messages across multiple Pulsar clusters, is supported out of the box in Apache Pulsar. Pulsar Heartbeat can orchestrate a test setup to monitor message replication among geo-replicated Pulsar clusters. In the diagram below, Pulsar Heartbeat sends a message to cluster A. The message is replicated to cluster B, where Pulsar Heartbeat subscribes to the same topic to validate the received message. It can also measure the end-to-end pub-sub latency for monitoring and performance analysis.
Pulsar Heartbeat can also verify subscription replication in a multi-cluster configuration. A producer sends multiple messages to cluster A, and Pulsar Heartbeat creates a consumer to receive messages on another cluster. It repeats the same procedure against the other clusters sequentially to make sure no duplicate messages are received under the same subscription.
Installation and runtime
Pulsar Heartbeat has several runtimes and installation options.
Build and run as a binary executable
Pulsar Heartbeat is written in Go, so it can easily be built for different OS and architecture targets. Our repo provides a script to build binaries for multiple supported operating systems and architectures. The configuration file is specified as a command-line option that instructs Pulsar Heartbeat how to run the tests and evaluate their respective verdicts.
Run as a Docker container
Pulsar Heartbeat can also run as a Docker container. The configuration file can be persisted on a Docker volume so that it is not reset every time the container restarts. Our repo offers a multi-stage Docker build to produce a small image. DataStax maintains an official Docker image repo for Pulsar Heartbeat, and the image is available for public use.
Kubernetes deployment and Helm chart
Pulsar Heartbeat is a stateless application, so it can be deployed as a Kubernetes service. The configuration file is sourced from a Kubernetes ConfigMap. A service account is required to access the Kubernetes API server for monitoring Pulsar replicas. Pulsar Heartbeat has been integrated into the DataStax Pulsar Helm chart as well as Luna Streaming's Replicated distribution, which enables monitoring in a single Helm install.
Pulsar Heartbeat exposes Prometheus metrics that cover real-time and historical end-to-end latency, WebSocket pub-sub latency, the availability of Pulsar's Kubernetes pods, and the Pulsar tenant count.
As part of the monitoring service, Pulsar Heartbeat can alert users through OpsGenie and PagerDuty integrations. An alert is triggered by a number of consecutive failed tests. To detect system degradation, an alert can also be generated based on the number of failures (not necessarily consecutive) within a moving window. Both the number of consecutive failures and the moving window size in seconds are configurable.
To prevent alert fatigue, Pulsar Heartbeat leverages the OpsGenie and PagerDuty APIs to deduplicate alerts of the same fault type. It can automatically resolve alerts when the system recovers on its own. This is useful when brokers or bookkeepers restart because of upgrades and other planned maintenance activities.
Any test failure is logged, and failures can also be configured to be sent to Slack.
Monitor the monitor
Pulsar Heartbeat monitors Pulsar, but we must account for the possibility that Pulsar Heartbeat itself goes down unexpectedly. We added an internal service that sends heartbeat messages to a Dead Man's Snitch service or any webhook-enabled heartbeat monitoring service.
Pulsar Heartbeat is an open source project under the Apache 2.0 license. Feature descriptions, configuration instructions, and design details can be found in the Pulsar Heartbeat repo (https://github.com/datastax/pulsar-heartbeat) on DataStax's GitHub. We welcome any feedback, contributions, and feature requests.
Lastly, did we mention that good things come in small packages? Pulsar Heartbeat's Docker image has a small footprint at 25MB.
Luna Streaming with Pulsar Heartbeat baked in
The easiest way to get a Pulsar Heartbeat-enabled deployment of Apache Pulsar is with our free distribution, Luna Streaming. If you haven't taken the plunge into Kubernetes yet, don't worry -- Luna Streaming gives you the option of deploying to its Kubernetes-in-a-box, either locally or across a cluster. Try it out now!