Benchmarking NATS JetStream Cluster: Hypothesis Testing for Enhancements
As part of a cloud engineering team, we primarily manage multiple NATS clusters, which serve as the central message queues in our systems. The JetStream feature of NATS has been operational across many of our business units. Utilizing multiple streams within a single cluster enhances our ability to distribute events swiftly and reliably. Additionally, this setup allows us to retain messages temporarily to accommodate slow processing or disconnections among consumers.
A significant challenge we face is modifying the configurations of the cluster or streams. As our cluster processes millions of write/read operations per second, it is crucial to ensure that these changes do not reduce the cluster’s throughput. To verify this, we benchmark the cluster’s performance before and after any adjustments.
Benchmarking assesses whether these changes impact throughput, but it is difficult to draw definitive conclusions from benchmarking alone. We need to determine how many benchmark runs are required to ensure the results are neither random nor biased. Additionally, when a configuration change proves beneficial, we need to quantify the extent of the improvement. These issues led us to apply hypothesis testing to our benchmarking results.
Our case
When we initially migrated our NATS cluster to JetStream, each stream had a single replica. A few months later, JetStream introduced an update that allowed streams to have multiple replicas, enhancing the robustness of our distributed NATS cluster. This feature is particularly valuable in failure scenarios, ensuring that our services continue to receive events even if one cluster node fails. Additionally, it helps distribute the load evenly across the cluster nodes, thus enhancing reliability under heavy loads.
However, adding replicas to our streams can reduce the write speed due to the time required for replicas to synchronize; a single-replica stream has no such synchronization overhead. Therefore, before scaling our streams, we needed to know how the change would affect our cluster's throughput.
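For context, here is a minimal sketch of how a stream's replica count can be set, assuming the nats-py client is available. This only illustrates the setting we were evaluating; the stream name, subjects, and server address are hypothetical, not our production configuration.
import asyncio
import nats

async def main():
    # connect to a NATS server (address is illustrative)
    nc = await nats.connect("nats://0.0.0.0:4222")
    js = nc.jetstream()

    # hypothetical stream with three replicas instead of one
    await js.add_stream(name="rides", subjects=["rides.>"], num_replicas=3)

    await nc.close()

asyncio.run(main())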
Benchmarking a NATS cluster
To benchmark a NATS cluster, we utilized a feature of the nats-cli called nats bench. This command includes various functions for conducting benchmarks on both Core NATS and JetStream. It offers several options to customize the benchmark, such as the number of publishers, subscribers, and messages. Below, you will find a list of these options along with their descriptions.
usage: nats bench [<flags>] <subject>
Benchmark utility
Core NATS publish and subscribe:
nats bench benchsubject --pub 1 --sub 10
Request reply with queue group:
nats bench benchsubject --sub 1 --reply
nats bench benchsubject --pub 10 --request
JetStream publish:
nats bench benchsubject --js --purge --pub 1
JetStream ordered ephemeral consumers:
nats bench benchsubject --js --sub 10
JetStream durable pull and push consumers:
nats bench benchsubject --js --sub 5 --pull
nats bench benchsubject --js --sub 5 --push
JetStream KV put and get:
nats bench benchsubject --kv --pub 1
nats bench benchsubject --kv --sub 10
Remember to use --no-progress to measure performance more accurately
Args:
<subject> Subject to use for the benchmark
Flags:
--pub=0 Number of concurrent publishers
--sub=0 Number of concurrent subscribers
--js Use JetStream
--request Request-Reply mode: publishers send requests waits
for a reply
--reply Request-Reply mode: subscribers send replies
--kv KV mode, subscribers get from the bucket and
publishers put in the bucket
--msgs=100000 Number of messages to publish
--size="128" Size of the test messages
--no-progress Disable progress bar while publishing
--csv=CSV Save benchmark data to CSV file
--purge Purge the stream before running
--storage=file JetStream storage (memory/file) for the
"benchstream" stream
--replicas=1 Number of stream replicas for the "benchstream"
stream
--maxbytes="1GB" The maximum size of the stream or KV bucket in
bytes
--stream="benchstream" When set to something else than "benchstream":
use (and do not attempt to define) the specified
stream when creating durable subscribers.
Otherwise define and use the "benchstream" stream
--bucket="benchbucket" When set to something else than "benchbucket":
use (and do not attempt to define) the specified
bucket when in KV mode. Otherwise define and use
the "benchbucket" bucket
--consumer="natscli-bench" Specify the durable consumer name to use
--jstimeout=30s Timeout for JS operations
--syncpub Synchronously publish to the stream
--pubbatch=100 Sets the batch size for JS asynchronous publishing
--pull Use a shared durable explicitly acknowledged JS
pull consumer rather than individual ephemeral
consumers
--push Use a shared durable explicitly acknowledged
JS push consumer with a queue group rather than
individual ephemeral consumers
--consumerbatch=100 Sets the batch size for the JS durable pull
consumer, or the max ack pending value for the JS
durable push consumer
--subsleep=0s Sleep for the specified interval before sending
the subscriber acknowledgement back in --js mode,
or sending the reply back in --reply mode,
or doing the next get in --kv mode
--pubsleep=0s Sleep for the specified interval after publishing
each message
--history=1 History depth for the bucket in KV mode
--multisubject Multi-subject mode, each message is published on
a subject that includes the publisher's message
sequence number as a token
--multisubjectmax=100000 The maximum number of subjects to use in
multi-subject mode (0 means no max)
--retries=3 The maximum number of retries in JS operations
--dedup Sets a message id in the header to use JS Publish
de-duplication
--dedupwindow=2m Sets the duration of the stream's deduplication
functionality
Also, you can read more about nats-cli and nats bench through these links.
- https://docs.nats.io/using-nats/nats-tools/nats_cli
- https://docs.nats.io/using-nats/nats-tools/nats_cli/natsbench
Benchmark Runbook
By using nats bench, we aimed to execute its commands repeatedly at will. Additionally, we sought to include failure recovery in our benchmarks and to automate the process as much as possible. To achieve this, we developed a runbook in Python. This runbook utilizes a cmd.json file, which contains the nats bench commands and flags, and a context.json file, which provides the details necessary to connect to a NATS cluster. Examples of each file are provided below: the first is the NATS context, and the second is the benchmark commands.
{
"description": "Default NATS context",
"url": "nats://0.0.0.0:4222",
"socks_proxy": "",
"token": "",
"user": "admin",
"password": "password",
"creds": "",
"nkey": "",
"cert": "",
"key": "",
"ca": "",
"nsc": "",
"jetstream_domain": "",
"jetstream_api_prefix": "",
"jetstream_event_prefix": "",
"inbox_prefix": "",
"user_jwt": "",
"color_scheme": "blue"
}
[
{
"name": "clear-stream",
"command": "nats stream delete benchstream --force -s nats://0.0.0.0:4222",
"syscall": true
},
{
"name": "normal-stats",
"count": 20,
"command": "rides --js --pub 5 --sub 15 -s nats://0.0.0.0:4222",
"pre-commands": [
"nats stream purge benchstream --force -s nats://0.0.0.0:4222"
],
"timeout": 180,
"syscall": false
}
]
The commands in the cmd.json file first delete benchstream from our NATS cluster. Afterward, they run a benchmark on the rides subject with 5 publishers and 15 subscribers against a local JetStream server, 20 times. A stream purge command is executed before each benchmark run.
Upon completion, the runbook generates a CSV file that details the data for write, read, and overall throughput in megabytes per second. An example is provided below.
pub-stats,sub-stats,overall-stats
40.4,606.89,646.21
34.96,524.86,559.25
33.96,510.61,543.1
27.47,412.89,439.56
32.0,480.52,511.7
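For illustration, the following is a minimal sketch of the kind of loop such a runbook can implement, assuming it shells out to the nats CLI via subprocess. The function names and the handling of the syscall flag are assumptions based on the cmd.json example above, not the actual runbook code, and parsing the nats bench output into the CSV columns is left out.
import json
import subprocess

def run(command, timeout=None):
    # execute a shell command and return its combined output
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr

def execute(entries):
    outputs = []
    for entry in entries:
        for _ in range(entry.get("count", 1)):
            # run any pre-commands (e.g. purging the stream) before each run
            for pre in entry.get("pre-commands", []):
                run(pre)
            # assumption: syscall=true entries are full shell commands,
            # while the others are argument strings for "nats bench"
            cmd = entry["command"] if entry.get("syscall") else f"nats bench {entry['command']}"
            outputs.append(run(cmd, timeout=entry.get("timeout")))
    # the actual runbook turns these outputs into the pub/sub/overall
    # throughput columns of the CSV shown above
    return outputs

if __name__ == "__main__":
    with open("cmd.json") as f:
        execute(json.load(f))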
You can access our runbook via the links in the References section at the end of this post.
Hypothesis Testing
Hypothesis testing is like trying to solve a mystery by gathering clues (data) and deciding whether they strongly suggest that something happened. According to Wikipedia, hypothesis testing is a statistical method used to determine whether there is enough evidence to reject a null hypothesis (our base assumption). Overall, it helps us make decisions based on data analysis.
Key terms
- Null Hypothesis (H₀): The default assumption that there is no effect or no difference.
- Alternative Hypothesis (H₁): The statement we want to test against.
- Significance Level (α): The probability of rejecting the null hypothesis when it is true, commonly set at 0.05.
- P-value: The probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
- Test Statistic: A standardized value used to decide whether to reject the null hypothesis.
- Equal vars: If this is false, the variances of the two groups are assumed to be unequal, and a different statistical procedure (Welch's t-test) is used to accommodate that assumption. This matters because it affects how the test statistic is calculated and which formula is used to analyze the data.
- Permutations: The number of random reshufflings of the pooled observations used to build the null distribution; when set, the p-value comes from a permutation test rather than the theoretical t-distribution.
Types of Hypothesis Testing
- t-test: Compares the means of two groups.
- Chi-square test: Tests for association between categorical variables.
- ANOVA (Analysis of Variance): Compares means among three or more groups.
- Regression Analysis: Examines relationships between variables.
As previously mentioned, we aimed to compare the cluster throughput before and after implementing our changes. To accomplish this, we utilized the t-test.
t-test steps
- State the Hypotheses
- Choose the Significance Level (α). Commonly used values are 0.05, 0.01.
- Collect Data and Compute the Test Statistic. The computation depends on the type of test (e.g., t-test, chi-square test).
- Determine the P-value
- Compare the p-value with α.
- Make a Decision. If p-value ≤ α, reject the null hypothesis. If p-value > α, do not reject the null hypothesis.
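As a toy walkthrough of these steps (the numbers below are illustrative; group_a reuses the pub-stats sample from the CSV above, and group_b is made up), the snippet states the hypotheses, fixes α, computes the test statistic and p-value with scipy, and makes the decision.
from scipy.stats import ttest_ind

# H0: the two configurations have the same mean throughput
# H1: the mean throughputs differ
ALPHA = 0.05

# illustrative throughput samples in MB/s
group_a = [40.4, 34.96, 33.96, 27.47, 32.0]
group_b = [30.1, 28.7, 29.5, 27.9, 31.2]

# two-sample t-test assuming equal variances
t_statistic, p_value = ttest_ind(group_a, group_b, equal_var=True)

if p_value <= ALPHA:
    print(f"reject H0: t={t_statistic:.2f}, p={p_value:.4f}")
else:
    print(f"fail to reject H0: t={t_statistic:.2f}, p={p_value:.4f}")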
Integrate Benchmark and t-test
Now equipped with a runbook to conduct benchmarks, we employed the following code to read the CSV files and perform a t-test on our data. In the code, we first load the data from the CSV files into pandas DataFrames. Subsequently, we use the scipy.stats library to execute the t-test on the throughput column we want to compare.
import pandas as pd
from scipy.stats import ttest_ind

GROUP_A = "no-syncpub-3/dataset.csv"
GROUP_B = "no-syncpub-rep1/dataset.csv"
FIELD = "pub-stats"  # throughput column to compare

# t-test variables
P_VALUE_BOUND = 0.05
PERMUTATION_VALUE = 25

# read the CSV files of the two benchmark groups
dfA = pd.read_csv(GROUP_A)
dfB = pd.read_csv(GROUP_B)

# run the two-sample t-test on the selected field
t_statistic, p_value = ttest_ind(
    dfA[FIELD], dfB[FIELD], equal_var=True, permutations=PERMUTATION_VALUE
)

if p_value < P_VALUE_BOUND:
    print("the difference is statistically significant at 95% confidence level.")
The output indicates which group performs better, by what percentage it improves, and the extent of the change in throughput.
testing `pub-stats` field:
mean of `no-syncpub-3/dataset.csv`: 11.9496
mean of `no-syncpub-rep1/dataset.csv`: 20.551999999999992
t-statistic: -22.983042981964857
p-value: 0.0
the difference is statistically significant at 95% confidence level.
`no-syncpub-rep1/dataset.csv` is better by 41.85675360062279%.
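The improvement percentage is not produced by the snippet shown earlier; one way to reproduce the reported figure is to express the difference of the two means relative to the larger mean. This is a reconstruction rather than the exact notebook code, but with the means printed above (20.552 vs. 11.9496) it yields the same ~41.86%.
# continuing from the earlier snippet (dfA, dfB, and FIELD are already defined)
mean_a = dfA[FIELD].mean()
mean_b = dfB[FIELD].mean()

better, worse = max(mean_a, mean_b), min(mean_a, mean_b)
improvement = (better - worse) / better * 100

print(f"the better group wins by {improvement}%")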
You can see a complete example in the notebook linked in the References section.
Conclusion
After conducting 100 benchmarks comparing different scenarios and using a t-test with equal variances and 50 permutations for our hypothesis testing, we found that migrating to streams with three replicas would result in a 45% reduction in throughput for synchronous publishers and a 28% reduction for asynchronous publishers. Nevertheless, our current throughput on the single-replica streams was utilizing less than 50% of their capacity. Therefore, we were confident that scaling up to three replicas would not adversely affect our operations.
You can see these stats and their results below.
Comparing replication 1 and 3 for sync publishers:
testing `pub-stats` field:
mean of `syncpub-rep1/dataset.csv`: 1.0848698
mean of `syncpub-3/dataset.csv`: 0.8084258333333334
t-statistic: 20.831836518519147
p-value: 0.0
the difference is statistically significant at 95% confidence level.
`syncpub-rep1/dataset.csv` is better by 25.481764416952768%
Comparing replication 1 and 3 for async publishers:
testing `pub-stats` field:
mean of `no-syncpub-rep1/dataset.csv`: 20.551999999999992
mean of `no-syncpub-3/dataset.csv`: 11.9496
t-statistic: 22.983042981964857
p-value: 0.0
the difference is statistically significant at 95% confidence level.
`no-syncpub-rep1/dataset.csv` is better by 41.85675360062279%
References
- https://github.com/amirhnajafiz/natsbench-runbook/blob/master/notebooks/presentation.ipynb
- https://github.com/amirhnajafiz/natsbench-runbook/blob/master/README.md
- https://www.investopedia.com/terms/t/t-test.asp
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
- https://docs.nats.io/using-nats/nats-tools/nats_cli/natsbench