Troubleshoot high disk usage

Components of your TICK stack must never run out of disk space. A machine at 100% disk usage will not function properly.

In a monitoring dashboard, high disk usage will appear in the Disk Utilization % metric and look similar to the following:

[Image: High disk usage]

Potential causes

Old data not being downsampled

InfluxDB uses retention policies to expire old data and continuous queries to downsample it, which together preserve disk space. If you use an infinite retention policy or one with a lengthy duration, high-resolution data consumes more and more disk space over time.
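
As an illustration of how the two features work together, the following InfluxQL is a minimal sketch, not a recommended configuration. The database name "telegraf" and measurement "disk" mirror the alert example in this document; the policy names "two_weeks" and "one_year", the continuous query name, and all durations are hypothetical.

```sql
-- Keep raw, high-resolution data for two weeks.
-- DEFAULT routes new writes here; data already stored in another policy is not moved.
CREATE RETENTION POLICY "two_weeks" ON "telegraf" DURATION 2w REPLICATION 1 DEFAULT

-- Keep downsampled data for one year (52 weeks).
CREATE RETENTION POLICY "one_year" ON "telegraf" DURATION 52w REPLICATION 1

-- Every hour, write hourly means from the raw policy into the long-term policy.
-- "GROUP BY time(1h), *" keeps all tags on the downsampled series.
CREATE CONTINUOUS QUERY "cq_disk_1h" ON "telegraf"
BEGIN
  SELECT mean("used_percent") AS "used_percent"
  INTO "telegraf"."one_year"."disk"
  FROM "telegraf"."two_weeks"."disk"
  GROUP BY time(1h), *
END
```

The continuous query is split across lines for readability. If your client executes input line by line, put the whole statement on a single line.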

Log data not being dropped

Log data is incredibly useful in your monitoring solution, but it can require more disk space than other types of time series data. Log data is often stored in a retention policy with an infinite duration (the default), so it is never dropped. This inevitably leads to high disk utilization.
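
To check whether a database still relies on an infinite policy, you can list its retention policies. This is a sketch that assumes the database is named "telegraf", as in the alert example in this document.

```sql
-- List every retention policy defined on the database.
SHOW RETENTION POLICIES ON "telegraf"
```

In the output, a duration of 0s indicates a policy that never expires data.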


Remove unnecessary data

The simplest solution to high disk utilization is removing old or unnecessary data. This can be done by brute force (deleting/dropping data) or in a more graceful manner by tuning the duration of your retention policies and adjusting the downsampling rates in your continuous queries.
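
Both approaches can be sketched in InfluxQL. The statements below are illustrative only: the measurement names, the cutoff date, and the durations are hypothetical placeholders, and "autogen" is assumed to be the policy that holds the data. Deletions are permanent, so verify each statement against a test database first.

```sql
-- Brute force: delete points in one measurement older than a fixed date.
DELETE FROM "disk" WHERE time < '2020-01-01'

-- Brute force: remove an entire measurement that is no longer needed.
DROP MEASUREMENT "old_measurement"

-- Graceful: shorten the retention policy so InfluxDB expires old data itself.
-- Data is removed one shard group at a time, once a whole group is older than the duration.
ALTER RETENTION POLICY "autogen" ON "telegraf" DURATION 30d SHARD DURATION 1d
```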

Log data retention policies

Store log data only in a retention policy with a finite duration. Set the duration to however long you need to keep log data.

Whether or not you use a continuous query to downsample log data at the end of its retention period is up to you, but old log data should either be downsampled or dropped altogether.
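
A dedicated log policy might look like the following sketch. Every name here is hypothetical: the policies "logs_7d" and "logs_summary", the measurement "syslog", its field "message", and the 7-day and 26-week durations are assumptions chosen for illustration.

```sql
-- Keep raw log data for seven days.
CREATE RETENTION POLICY "logs_7d" ON "telegraf" DURATION 7d REPLICATION 1

-- Optional: keep a coarse summary of log volume for longer.
CREATE RETENTION POLICY "logs_summary" ON "telegraf" DURATION 26w REPLICATION 1

-- Optional: count log messages per hour before the raw data expires.
CREATE CONTINUOUS QUERY "cq_syslog_1h" ON "telegraf" BEGIN SELECT count("message") AS "message_count" INTO "telegraf"."logs_summary"."syslog_1h" FROM "telegraf"."logs_7d"."syslog" GROUP BY time(1h), * END
```

Log data lands in "logs_7d" only if the writer targets that policy explicitly. Writes that do not name a retention policy go to the database's default policy.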

Scale your machine’s disk capacity

If removing or downsampling data isn’t an option, you can always scale your machine’s disk capacity. How this is done depends on your hardware or virtualization configuration and is not covered in this documentation.


Set up a disk usage alert

To preempt disk utilization issues, create a task that alerts you if disk usage crosses certain thresholds. The example TICKscript below sets warning and critical disk usage thresholds and sends a message to Slack whenever those thresholds are crossed.

For information about Kapacitor tasks and alerts, see the Kapacitor alerts documentation.

Example TICKscript alert for disk usage

// Disk usage alerts
// Alert when disks are this % full
var warn_threshold = 80
var crit_threshold = 90

// Use a larger period here, as the telegraf data can be a little late
// if the server is under load.
var period = 10m

// How often to query for the period.
var every = 20m

// Query the most recent used_percent value for each host and path.
// Adjust the retention policy name ("default" here) to match your database.
var data = batch
    |query('''
        SELECT last(used_percent) FROM "telegraf"."default".disk
        WHERE ("path" = '/influxdb/conf' or "path" = '/')
    ''')
        .period(period)
        .every(every)
        .groupBy('host', 'path')

// Warning alert. Assumes the Slack handler is configured in Kapacitor.
data
    |alert()
        .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
        .warn(lambda: "last" > warn_threshold)
        .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
        .slack()

// Critical alert. Assumes the Slack handler is configured in Kapacitor.
data
    |alert()
        .id('Alert: Disk Usage, Host: {{ index .Tags "host" }}, Path: {{ index .Tags "path" }}')
        .crit(lambda: "last" > crit_threshold)
        .message('{{ .ID }}, Used Percent: {{ index .Fields "last" | printf "%0.0f" }}%')
        .slack()
