Rebalancing InfluxDB Enterprise clusters
Manually rebalance an InfluxDB Enterprise cluster to ensure:
- Shards are evenly distributed across all data nodes in the cluster
- Every shard exists on n nodes, where n is the replication factor.
Why rebalance?
Rebalance a cluster any time you do one of the following:
- Expand capacity and increase write throughput by adding a data node.
- Increase availability and query throughput by either:
  - Adding data nodes and increasing the replication factor.
  - Adjusting fully replicated data to partially replicated data. See Full versus partial replication.
- Add or remove a data node for any reason.
- Adjust your replication factor for any reason.
Full versus partial replication
When the replication factor equals the number of data nodes, data is fully replicated to every data node in the cluster. Full replication means each node can respond to queries without communicating with other data nodes. A more typical configuration is partial replication; for example, a replication factor of 2 with 4, 6, or 8 data nodes. Partial replication allows more series to be stored in a database.
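For illustration, the following InfluxQL statements sketch the difference on a hypothetical four-node cluster; the retention policy names here are placeholders and are not part of the rebalancing steps:

```
-- Full replication: every data node stores a copy of each shard
CREATE RETENTION POLICY "fully_replicated" ON "telegraf" DURATION 30d REPLICATION 4

-- Partial replication: each shard is stored on only 2 of the 4 data nodes
CREATE RETENTION POLICY "partially_replicated" ON "telegraf" DURATION 30d REPLICATION 2
```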
Rebalance your cluster
Before you begin, do the following:
- Stop writing data to InfluxDB. Rebalancing while writing data can lead to data loss.
- Enable anti-entropy (AE) to ensure all shard data is successfully copied (see the configuration sketch below). To learn more about AE, see Use Anti-Entropy service in InfluxDB Enterprise.
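The Anti-Entropy service is enabled in each data node's configuration file. A minimal sketch, assuming the default configuration path; verify the setting against the Anti-Entropy documentation for your version:

```
# /etc/influxdb/influxdb.conf on each data node (path may differ in your installation)
[anti-entropy]
  # Enable shard consistency checks and repairs
  enabled = true
```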
After adding or removing data nodes from your cluster, complete the following steps to rebalance:
- If applicable, update the replication factor.
- Truncate hot shards (shards actively receiving writes).
- Identify cold shards.
- Copy cold shards to the new data node.
- If you’re expanding capacity to increase write throughput, remove cold shards from an original data node.
- Confirm the rebalance.
Important: The number of data nodes in a cluster must be evenly divisible by the replication factor. For example, a replication factor of 2 works with 2, 4, 6, or 8 data nodes. A replication factor of 3 works with 3, 6, or 9 data nodes.
Update the replication factor
The following example shows how to increase the replication factor to 3. Adjust the replication factor as needed for your cluster, ensuring the number of data nodes is evenly divisible by the replication factor.
In the InfluxDB CLI, run the following query on any data node for each database and retention policy:

```
> ALTER RETENTION POLICY "<retention_policy_name>" ON "<database_name>" REPLICATION 3
```
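For example, the following statement sets the replication factor to 3 for the `autogen` retention policy on the `telegraf` database used later in this guide (substitute your own database and retention policy names):

```
> ALTER RETENTION POLICY "autogen" ON "telegraf" REPLICATION 3
```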
If successful, no query results are returned. At this point, InfluxDB automatically distributes all new shards across the data nodes in the cluster with the correct number of replicas.
To verify the new replication factor, run the `SHOW RETENTION POLICIES` query:

```
> SHOW RETENTION POLICIES ON "telegraf"
name    duration shardGroupDuration replicaN default
----    -------- ------------------ -------- -------
autogen 0s       1h0m0s             3        true
```
Truncate hot shards
Run the following command:

```
influxd-ctl truncate-shards
```
A message confirms that shards have been truncated. Previous writes are stored in cold shards, and InfluxDB starts writing new points to new hot shards, which are automatically distributed across the cluster.
Identify cold shards
Run the following command to view every shard in the cluster:

```
influxd-ctl show-shards

Shards
==========
ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
21  telegraf  autogen           2                 [...]  2020-01-26T18:00:00Z             [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}]
22  telegraf  autogen           2                 [...]  2020-01-26T18:05:36.418734949Z*  [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088}]
24  telegraf  autogen           2                 [...]  2020-01-26T19:00:00Z             [{5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
```
This example shows three shards:
- Shards are cold if their `End` timestamp occurs in the past. In this example, the current time is 2020-01-26T18:05:36.418734949Z, so the first two shards are cold.
- Shard owners identify data nodes in the cluster. In this example, the cold shards are on two data nodes: `enterprise-data-01:8088` and `enterprise-data-02:8088`.
- Truncated shards have an asterisk (`*`) after the timestamp. In this example, the second shard is truncated.
- Shards with an `End` timestamp in the future are hot shards. In this example, the third shard is the newly created hot shard. The hot shard's owners include one of the original data nodes (`enterprise-data-02:8088`) and the new data node (`enterprise-data-03:8088`).
For each cold shard, note the shard `ID` and its `Owners` data node TCP names (for example, `enterprise-data-01:8088`).

Run the following command to determine the size of the shards in your cluster:

```
find /var/lib/influxdb/data/ -mindepth 3 -type d -exec du -h {} \;
```
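Each line of output shows a shard directory and its size; the sizes and shard IDs below are hypothetical, shown only to illustrate the format:

```
896K    /var/lib/influxdb/data/telegraf/autogen/21
1.2G    /var/lib/influxdb/data/telegraf/autogen/22
```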
To increase capacity on the original data nodes, move larger shards to the new data node. Note that moving shards impacts network traffic.
Copy cold shards to new data node
Run the following command, specifying the shard ID to copy, the `source_TCP_name` of the original data node, and the `destination_TCP_name` of the new data node:

```
influxd-ctl copy-shard <source_TCP_name> <destination_TCP_name> <shard_ID>
```
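For example, to copy cold shard 22 from the earlier `show-shards` output from `enterprise-data-01:8088` to the new data node `enterprise-data-03:8088` (substitute your own shard ID and TCP names):

```
influxd-ctl copy-shard enterprise-data-01:8088 enterprise-data-03:8088 22
```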
A message appears confirming the shard was copied.
Repeat this command for every cold shard you want to move to the new data node.
Confirm copied shards display the TCP name of the new data node as an owner:

```
influxd-ctl show-shards
```
The expected output shows that the copied shard now has three owners:

```
Shards
==========
ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
22  telegraf  autogen           2                 [...]  2020-01-26T18:05:36.418734949Z*  [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
```
Copied shards appear in the new data node's shard directory: `/var/lib/influxdb/data/<database>/<retention_policy>/<shard_ID>`.
Remove cold shards from original data node
Removing a shard is an irrecoverable, destructive action; use this command with caution.
Run the following command, specifying the cold shard ID to remove and the TCP name (`source_TCP_name`) of one of the original data nodes:

```
influxd-ctl remove-shard <source_TCP_name> <shard_ID>
```

Expected output:

```
Removed shard <shard_ID> from <source_TCP_name>
```
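For example, continuing the hypothetical copy of shard 22 above, the following command removes that shard from `enterprise-data-01:8088`, leaving `enterprise-data-02:8088` and the new data node as its owners:

```
influxd-ctl remove-shard enterprise-data-01:8088 22
```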
Repeat this command to remove all cold shards from one of the original data nodes.
Confirm the rebalance
For each relevant shard, confirm that the expected data node TCP names appear in the `Owners` column by running the following command:

```
influxd-ctl show-shards
```
For example:
If you’re rebalancing to expand capacity and increase write throughput, verify that the new data node and only one of the original data nodes appear in the `Owners` column. The expected output shows that the copied shard now has only two owners:

```
Shards
==========
ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
22  telegraf  autogen           2                 [...]  2020-01-26T18:05:36.418734949Z*  [{5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
```
If you’re rebalancing to increase data availability and query throughput, verify that the TCP name of the new data node appears in the `Owners` column. The expected output shows that the copied shard now has three owners:

```
Shards
==========
ID  Database  Retention Policy  Desired Replicas  [...]  End                              Owners
22  telegraf  autogen           3                 [...]  2020-01-26T18:05:36.418734949Z*  [{4 enterprise-data-01:8088} {5 enterprise-data-02:8088} {6 enterprise-data-03:8088}]
```

In addition, verify that the copied shards appear in the new data node’s shard directory and match the shards in the source data node’s shard directory. Shards are located in `/var/lib/influxdb/data/<database>/<retention_policy>/<shard_ID>`. Here’s an example of the correct output for shard 22:

```
# On the source data node (enterprise-data-01)
~# ls /var/lib/influxdb/data/telegraf/autogen/22
000000001-000000001.tsm

# On the new data node (enterprise-data-03)
~# ls /var/lib/influxdb/data/telegraf/autogen/22
000000001-000000001.tsm
```