Benchmarking NATS JetStream Cluster: Hypothesis Testing for Enhancements
As part of a cloud engineering team, we primarily manage multiple NATS clusters, which serve as the central message queues in our systems. The JetStream feature of NATS has been operational across many of our business units. Utilizing multiple streams within a single cluster enhances our ability to distribute events swiftly and reliably. Additionally, this setup allows us to retain messages temporarily to accommodate slow processing or disconnections among consumers.
A significant challenge we face is modifying the configurations of the cluster or streams. As our cluster processes millions of write/read operations per second, it is crucial to ensure that these changes do not reduce the cluster’s throughput. To verify this, we benchmark the cluster’s performance before and after any adjustments.
Benchmarking assesses whether these changes impact throughput, but it is difficult to draw definitive conclusions from benchmarking alone. We need to determine how many benchmark runs are required to ensure the results are neither random nor biased. Additionally, when a configuration change proves beneficial, we need to quantify the extent of the improvement. These issues led us to apply hypothesis testing to our benchmarking results.
Our case
When we initially migrated our NATS cluster to JetStream, each stream had a single replica. A few months later, JetStream introduced an update that allowed streams to have multiple replicas, enhancing the robustness of our distributed NATS cluster. This feature is particularly valuable in failure scenarios, ensuring that our services continue to receive events even if one cluster node fails. Additionally, it helps distribute the load evenly across the cluster nodes, thus enhancing reliability under heavy loads.
However, adding replicas to our streams can reduce the write speed due to the time required for replicas to synchronize; a single-replica stream has no such synchronization overhead. Therefore, before scaling our streams, we needed to know how the change would affect our cluster's throughput.
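For context, here is a minimal sketch of how a stream's replica count can be set, assuming the nats-py client is available. This only illustrates the setting we were evaluating; the stream name, subjects, and server address are hypothetical, not our production configuration.
import asyncio
import nats

async def main():
    # connect to a NATS server (address is illustrative)
    nc = await nats.connect("nats://0.0.0.0:4222")
    js = nc.jetstream()

    # hypothetical stream with three replicas instead of one
    await js.add_stream(name="rides", subjects=["rides.>"], num_replicas=3)

    await nc.close()

asyncio.run(main())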
Benchmarking a NATS cluster
To benchmark a NATS cluster, we utilized a feature of the nats-cli called nats bench. This command includes various functions for conducting benchmarks on both Core NATS and JetStream. It offers several options to customize the benchmark, such as the number of publishers, subscribers, and messages. Below, you will find a list of these options along with their descriptions.
usage: nats bench [<flags>] <subject>
Benchmark utility
Core NATS publish and subscribe:
nats bench benchsubject --pub 1 --sub 10
Request reply with queue group:
nats bench benchsubject --sub 1 --reply
nats bench benchsubject --pub 10 --request
JetStream publish:
nats bench benchsubject --js --purge --pub 1
JetStream ordered ephemeral consumers:
nats bench benchsubject --js --sub 10
JetStream durable pull and push consumers:
nats bench benchsubject --js --sub 5 --pull
nats bench benchsubject --js --sub 5 --push
JetStream KV put and get:
nats bench benchsubject --kv --pub 1
nats bench benchsubject --kv --sub 10
Remember to use --no-progress to measure performance more accurately
Args:
<subject> Subject to use for the benchmark
Flags:
--pub=0 Number of concurrent publishers
--sub=0 Number of concurrent subscribers
--js Use JetStream
--request Request-Reply mode: publishers send requests waits
for a reply
--reply Request-Reply mode: subscribers send replies
--kv KV mode, subscribers get from the bucket and
publishers put in the bucket
--msgs=100000 Number of messages to publish
--size="128" Size of the test messages
--no-progress Disable progress bar while publishing
--csv=CSV Save benchmark data to CSV file
--purge Purge the stream before running
--storage=file JetStream storage (memory/file) for the
"benchstream" stream
--replicas=1 Number of stream replicas for the "benchstream"
stream
--maxbytes="1GB" The maximum size of the stream or KV bucket in
bytes
--stream="benchstream" When set to something else than "benchstream":
use (and do not attempt to define) the specified
stream when creating durable subscribers.
Otherwise define and use the "benchstream" stream
--bucket="benchbucket" When set to something else than "benchbucket":
use (and do not attempt to define) the specified
bucket when in KV mode. Otherwise define and use
the "benchbucket" bucket
--consumer="natscli-bench" Specify the durable consumer name to use
--jstimeout=30s Timeout for JS operations
--syncpub Synchronously publish to the stream
--pubbatch=100 Sets the batch size for JS asynchronous publishing
--pull Use a shared durable explicitly acknowledged JS
pull consumer rather than individual ephemeral
consumers
--push Use a shared durable explicitly acknowledged
JS push consumer with a queue group rather than
individual ephemeral consumers
--consumerbatch=100 Sets the batch size for the JS durable pull
consumer, or the max ack pending value for the JS
durable push consumer
--subsleep=0s Sleep for the specified interval before sending
the subscriber acknowledgement back in --js mode,
or sending the reply back in --reply mode,
or doing the next get in --kv mode
--pubsleep=0s Sleep for the specified interval after publishing
each message
--history=1 History depth for the bucket in KV mode
--multisubject Multi-subject mode, each message is published on
a subject that includes the publisher's message
sequence number as a token
--multisubjectmax=100000 The maximum number of subjects to use in
multi-subject mode (0 means no max)
--retries=3 The maximum number of retries in JS operations
--dedup Sets a message id in the header to use JS Publish
de-duplication
--dedupwindow=2m Sets the duration of the stream's deduplication
functionality
Also, you can read more about nats-cli and nats bench through these links.
- https://docs.nats.io/using-nats/nats-tools/nats_cli
- https://docs.nats.io/using-nats/nats-tools/nats_cli/natsbench
Benchmark Runbook
By using nats bench, we aimed to execute its commands repeatedly at will. Additionally, we sought to include failure recovery in our benchmarks and to automate the process as much as possible. To achieve this, we developed a runbook in Python. This runbook utilizes a cmd.json file, which contains the nats bench commands and flags, and a context.json file, which provides the details necessary to connect to a NATS cluster. Examples of each file are provided below: the first is the NATS context, and the second is the benchmark commands.
{
"description": "Default NATS context",
"url": "nats://0.0.0.0:4222",
"socks_proxy": "",
"token": "",
"user": "admin",
"password": "password",
"creds": "",
"nkey": "",
"cert": "",
"key": "",
"ca": "",
"nsc": "",
"jetstream_domain": "",
"jetstream_api_prefix": "",
"jetstream_event_prefix": "",
"inbox_prefix": "",
"user_jwt": "",
"color_scheme": "blue"
}
[
{
"name": "clear-stream",
"command": "nats stream delete benchstream --force -s nats://0.0.0.0:4222",
"syscall": true
},
{
"name": "normal-stats",
"count": 20,
"command": "rides --js --pub 5 --sub 15 -s nats://0.0.0.0:4222",
"pre-commands": [
"nats stream purge benchstream --force -s nats://0.0.0.0:4222"
],
"timeout": 180,
"syscall": false
}
]
The commands in the cmd.json file first delete benchstream from our NATS cluster. Afterward, they run a benchmark on the rides subject with 5 publishers and 15 subscribers against a local JetStream server, 20 times. A stream purge command is executed before each benchmark run.
Upon completion, the runbook generates a CSV file that details the data for write, read, and overall throughput in megabytes per second. An example is provided below.
pub-stats,sub-stats,overall-stats
40.4,606.89,646.21
34.96,524.86,559.25
33.96,510.61,543.1
27.47,412.89,439.56
32.0,480.52,511.7
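For illustration, the following is a minimal sketch of the kind of loop such a runbook can implement, assuming it shells out to the nats CLI via subprocess. The function names and the handling of the syscall flag are assumptions based on the cmd.json example above, not the actual runbook code, and parsing the nats bench output into the CSV columns is left out.
import json
import subprocess

def run(command, timeout=None):
    # execute a shell command and return its combined output
    result = subprocess.run(command, shell=True, capture_output=True, text=True, timeout=timeout)
    return result.stdout + result.stderr

def execute(entries):
    outputs = []
    for entry in entries:
        for _ in range(entry.get("count", 1)):
            # run any pre-commands (e.g. purging the stream) before each run
            for pre in entry.get("pre-commands", []):
                run(pre)
            # assumption: syscall=true entries are full shell commands,
            # while the others are argument strings for "nats bench"
            cmd = entry["command"] if entry.get("syscall") else f"nats bench {entry['command']}"
            outputs.append(run(cmd, timeout=entry.get("timeout")))
    # the actual runbook turns these outputs into the pub/sub/overall
    # throughput columns of the CSV shown above
    return outputs

if __name__ == "__main__":
    with open("cmd.json") as f:
        execute(json.load(f))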
You can access our runbook via the links in the References section at the end of this post.
Hypothesis Testing
Hypothesis testing is like trying to solve a mystery by gathering clues (data) and deciding whether they strongly suggest that something happened. According to Wikipedia, hypothesis testing is a statistical method used to determine whether there is enough evidence to reject a null hypothesis (our base assumption). Overall, it helps us make decisions based on data analysis.
Key terms
- Null Hypothesis (H₀): The default assumption that there is no effect or no difference.
- Alternative Hypothesis (H₁): The statement we want to test against.
- Significance Level (α): The probability of rejecting the null hypothesis when it is true, commonly set at 0.05.
- P-value: The probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true.
- Test Statistic: A standardized value used to decide whether to reject the null hypothesis.
- Equal vars: If this is false, the variances of the two groups are assumed to be unequal, and a different statistical procedure (Welch's t-test) is used to accommodate that assumption. This matters because it affects how the test statistic is calculated and which formula is used to analyze the data.
- Permutations: The number of random reshufflings of the pooled observations used to build the null distribution; when set, the p-value comes from a permutation test rather than the theoretical t-distribution.
Types of Hypothesis Testing
- t-test: Compares the means of two groups.
- Chi-square test: Tests for association between categorical variables.
- ANOVA (Analysis of Variance): Compares means among three or more groups.
- Regression Analysis: Examines relationships between variables.
As previously mentioned, we aimed to compare the cluster throughput before and after implementing our changes. To accomplish this, we utilized the t-test.
t-test steps
- State the Hypotheses
- Choose the Significance Level (α). Commonly used values are 0.05, 0.01.
- Collect Data and Compute the Test Statistic. The computation depends on the type of test (e.g., t-test, chi-square test).
- Determine the P-value
- Compare the p-value with α.
- Make a Decision. If p-value ≤ α, reject the null hypothesis. If p-value > α, do not reject the null hypothesis.
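As a toy walkthrough of these steps (the numbers below are illustrative; group_a reuses the pub-stats sample from the CSV above, and group_b is made up), the snippet states the hypotheses, fixes α, computes the test statistic and p-value with scipy, and makes the decision.
from scipy.stats import ttest_ind

# H0: the two configurations have the same mean throughput
# H1: the mean throughputs differ
ALPHA = 0.05

# illustrative throughput samples in MB/s
group_a = [40.4, 34.96, 33.96, 27.47, 32.0]
group_b = [30.1, 28.7, 29.5, 27.9, 31.2]

# two-sample t-test assuming equal variances
t_statistic, p_value = ttest_ind(group_a, group_b, equal_var=True)

if p_value <= ALPHA:
    print(f"reject H0: t={t_statistic:.2f}, p={p_value:.4f}")
else:
    print(f"fail to reject H0: t={t_statistic:.2f}, p={p_value:.4f}")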
Integrate Benchmark and t-test
Now equipped with a runbook to conduct benchmarks, we employed the following code to read the CSV files and perform a t-test on our data. In the code, we first load the data from the CSV files into pandas DataFrames. Subsequently, we use the scipy.stats library to execute the t-test on the throughput column we want to compare.
import pandas as pd
from scipy.stats import ttest_ind

GROUP_A = "no-syncpub-3/dataset.csv"
GROUP_B = "no-syncpub-rep1/dataset.csv"
FIELD = "pub-stats"  # throughput column to compare

# t-test variables
P_VALUE_BOUND = 0.05
PERMUTATION_VALUE = 25

# read the CSV files of the two benchmark groups
dfA = pd.read_csv(GROUP_A)
dfB = pd.read_csv(GROUP_B)

# run the two-sample t-test on the selected field
t_statistic, p_value = ttest_ind(
    dfA[FIELD], dfB[FIELD], equal_var=True, permutations=PERMUTATION_VALUE
)

if p_value < P_VALUE_BOUND:
    print("the difference is statistically significant at 95% confidence level.")
The output indicates which group performs better, by what percentage it improves, and the extent of the change in throughput.
testing `pub-stats` field:
mean of `no-syncpub-3/dataset.csv`: 11.9496
mean of `no-syncpub-rep1/dataset.csv`: 20.551999999999992
t-statistic: -22.983042981964857
p-value: 0.0
the difference is statistically significant at 95% confidence level.
`no-syncpub-rep1/dataset.csv` is better by 41.85675360062279%.
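The improvement percentage is not produced by the snippet shown earlier; one way to reproduce the reported figure is to express the difference of the two means relative to the larger mean. This is a reconstruction rather than the exact notebook code, but with the means printed above (20.552 vs. 11.9496) it yields the same ~41.86%.
# continuing from the earlier snippet (dfA, dfB, and FIELD are already defined)
mean_a = dfA[FIELD].mean()
mean_b = dfB[FIELD].mean()

better, worse = max(mean_a, mean_b), min(mean_a, mean_b)
improvement = (better - worse) / better * 100

print(f"the better group wins by {improvement}%")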
You can see a complete example in the notebook linked in the References section.
Conclusion
After conducting 100 benchmarks comparing different scenarios and using a t-test with equal variances and 50 permutations for our hypothesis testing, we found that migrating to streams with three replicas would result in a 45% reduction in throughput for synchronous publishers and a 28% reduction for asynchronous publishers. Nevertheless, our current throughput on the single-replica streams was utilizing less than 50% of their capacity. Therefore, we were confident that scaling up to three replicas would not adversely affect our operations.
You can see these stats and their results below.
Comparing replication 1 and 3 for sync publishers:
testing `pub-stats` field:
mean of `syncpub-rep1/dataset.csv`: 1.0848698
mean of `syncpub-3/dataset.csv`: 0.8084258333333334
t-statistic: 20.831836518519147
p-value: 0.0
the difference is statistically significant at 95% confidence level.
`syncpub-rep1/dataset.csv` is better by 25.481764416952768%
Comparing replication 1 and 3 for async publishers:
testing `pub-stats` field:
mean of `no-syncpub-rep1/dataset.csv`: 20.551999999999992
mean of `no-syncpub-3/dataset.csv`: 11.9496
t-statistic: 22.983042981964857
p-value: 0.0
the difference is statistically significant at 95% confidence level.
`no-syncpub-rep1/dataset.csv` is better by 41.85675360062279%
References
- https://github.com/amirhnajafiz/natsbench-runbook/blob/master/notebooks/presentation.ipynb
- https://github.com/amirhnajafiz/natsbench-runbook/blob/master/README.md
- https://www.investopedia.com/terms/t/t-test.asp
- https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
- https://docs.nats.io/using-nats/nats-tools/nats_cli/natsbench