InfluxDB schema design and data layout
This page documents an earlier version of InfluxDB OSS. InfluxDB OSS v2 is the latest stable version.
Each InfluxDB use case is unique and your schema reflects that uniqueness. In general, a schema designed for querying leads to simpler and more performant queries. We recommend the following design guidelines for most use cases:
- Where to store data (tag or field)
- Avoid too many series
- Use recommended naming conventions
- Shard group duration management
Where to store data (tag or field)
Your queries should guide what data you store in tags and what you store in fields:
- Store commonly queried and grouping (group() or GROUP BY) metadata in tags.
- Store data in fields if each data point contains a different value.
- Store numeric values as fields (tag values only support string values).
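As a rough illustration of these guidelines, the following Python sketch assembles one line of line protocol from a tag dictionary and a field dictionary. The helper name to_line_protocol is hypothetical (it is not part of any InfluxDB client library), and the sketch assumes float fields only and performs no escaping.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line of line protocol.

    Simplified sketch: no escaping of spaces or commas, and field values
    are assumed to be floats (strings and integers need extra syntax).
    """
    # Tags hold string metadata that queries filter and group by.
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    # Fields hold the values that differ from point to point.
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    head = f"{measurement},{tag_part}" if tag_part else measurement
    return f"{head} {field_part} {timestamp_ns}"


point = to_line_protocol(
    "weather_sensor",
    tags={"crop": "blueberries", "plot": "1", "region": "north"},
    fields={"temp": 50.1},
    timestamp_ns=1472515200000000000,
)
print(point)
```

Note that the plot number is passed as the string "1" because tag values are always strings, while temp stays numeric because it is a field.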
Avoid too many series
InfluxDB indexes measurements and tags to speed up reads.
Tag values are indexed and field values are not. This means that querying by tags is more performant than querying by fields. However, when too many indexes are created, both writes and reads may start to slow down.
Each unique set of indexed data elements forms a series key. Tags containing highly variable information like unique IDs, hashes, and random strings lead to a large number of series, also known as high series cardinality. High series cardinality is a primary driver of high memory usage for many database workloads. Therefore, to reduce memory consumption, consider storing high-cardinality values in field values rather than in tags or field keys.
If reads and writes to InfluxDB start to slow down, you may have high series cardinality (too many series). See how to find and reduce high series cardinality.
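The effect of a high-cardinality tag can be approximated by counting distinct series keys in a batch of line protocol. The Python sketch below is only an illustration: the http measurement, the host tag, and the request_id key are made-up names, and the parser assumes no escaped spaces or commas.

```python
def series_key(line):
    """Return the measurement plus sorted tag set of a line protocol line."""
    head = line.split(" ", 1)[0]           # everything before the field set
    measurement, *tags = head.split(",")
    return ",".join([measurement, *sorted(tags)])


def series_cardinality(lines):
    """Count the distinct series keys in an iterable of lines."""
    return len({series_key(line) for line in lines})


# Unique ID stored as a tag: every point creates a new series.
as_tag = [f"http,host=web1,request_id={i} duration=0.2 {i}" for i in range(1000)]
# Same ID stored as an integer field: all points share one series.
as_field = [f"http,host=web1 duration=0.2,request_id={i}i {i}" for i in range(1000)]

print(series_cardinality(as_tag), series_cardinality(as_field))
```

In this toy data set, moving request_id from a tag to a field reduces the series count from 1000 to 1.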
Use recommended naming conventions
Use the following conventions when naming your tag and field keys:
- Avoid reserved keywords in tag and field keys
- Avoid the same name for a tag and a field
- Avoid encoding data in measurements and keys
- Avoid putting more than one piece of information in one tag
Avoid reserved keywords in tag and field keys
Although not required, avoiding reserved keywords in your tag and field keys simplifies writing queries because you won’t have to wrap your keys in double quotes. See InfluxQL and Flux keywords to avoid.
Also, if a tag or field key contains characters other than [A-z,_], you must wrap it in double quotes in InfluxQL or use bracket notation in Flux.
Avoid the same name for a tag and a field
Avoid using the same name for a tag and field key. This often results in unexpected behavior when querying data.
If you inadvertently add the same name for a tag and a field, see Frequently asked questions for information about how to query the data predictably and how to fix the issue.
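One way to catch this before writing data is to compare the tag keys and field keys of each point. The following Python sketch assumes points are held as plain dictionaries; the function name shared_keys is hypothetical.

```python
def shared_keys(tags, fields):
    """Return the key names used as both a tag key and a field key, sorted."""
    return sorted(set(tags) & set(fields))


# "temp" appears as both a tag key and a field key, so it is reported.
clashes = shared_keys(
    tags={"region": "north", "temp": "high"},
    fields={"temp": 50.1},
)
print(clashes)
```

An empty list means the point has no clashing names.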
Avoid encoding data in measurements and keys
Store data in tag values or field values, not in tag keys, field keys, or measurements. If you design your schema to store data in tag and field values, your queries will be easier to write and more efficient.
In addition, you’ll keep cardinality low by not creating measurements and keys as you write data. To learn more about the performance impact of high series cardinality, see how to find and reduce high series cardinality.
Compare schemas
Compare the following valid schemas represented by line protocol.
Recommended: the following schema stores metadata in separate crop, plot, and region tags. The temp field contains variable numeric data.
Good Measurements schema - Data encoded in tags (recommended)
-------------
weather_sensor,crop=blueberries,plot=1,region=north temp=50.1 1472515200000000000
weather_sensor,crop=blueberries,plot=2,region=midwest temp=49.8 1472515200000000000
Not recommended: the following schema stores multiple attributes (crop, plot, and region) concatenated (blueberries.plot-1.north) within the measurement, similar to Graphite metrics.
Bad Measurements schema - Data encoded in the measurement (not recommended)
-------------
blueberries.plot-1.north temp=50.1 1472515200000000000
blueberries.plot-2.midwest temp=49.8 1472515200000000000
Not recommended: the following schema stores multiple attributes (crop, plot, and region) concatenated (blueberries.plot-1.north) within the field key.
Bad Keys schema - Data encoded in field keys (not recommended)
-------------
weather_sensor blueberries.plot-1.north.temp=50.1 1472515200000000000
weather_sensor blueberries.plot-2.midwest.temp=49.8 1472515200000000000
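If data already arrives with Graphite-style names, it can be rewritten into the recommended form before it is written. The Python sketch below assumes every measurement follows the crop.plot-N.region pattern shown in the Bad Measurements schema; the function name split_measurement and the default target measurement are illustrative choices.

```python
def split_measurement(line, measurement="weather_sensor"):
    """Rewrite 'crop.plot-N.region <fields> <ts>' into tagged line protocol.

    Assumes exactly three dot-separated parts and a 'plot-' prefix;
    a ValueError is raised if the name has a different number of parts.
    """
    name, rest = line.split(" ", 1)
    crop, plot, region = name.split(".")
    plot_id = plot.split("-", 1)[1]        # "plot-1" -> "1"
    return f"{measurement},crop={crop},plot={plot_id},region={region} {rest}"


bad = "blueberries.plot-1.north temp=50.1 1472515200000000000"
good = split_measurement(bad)
print(good)
```

The result matches the first line of the Good Measurements schema.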
Compare queries
Compare the following queries of the Good Measurements and Bad Measurements schemas.
The Flux queries calculate the average temp for blueberries in the north region.
Easy to query: Good Measurements data is easily filtered by region tag values, as in the following example.
// Query *Good Measurements*, data stored in separate tag values (recommended)
from(bucket: "<database>/<retention_policy>")
|> range(start:2016-08-30T00:00:00Z)
|> filter(fn: (r) => r._measurement == "weather_sensor" and r.region == "north" and r._field == "temp")
|> mean()
Difficult to query: Bad Measurements requires regular expressions to extract plot and region from the measurement, as in the following example.
// Query *Bad Measurements*, data encoded in the measurement (not recommended)
from(bucket: "<database>/<retention_policy>")
|> range(start:2016-08-30T00:00:00Z)
|> filter(fn: (r) => r._measurement =~ /\.north$/ and r._field == "temp")
|> mean()
Complex measurements make some queries impossible. For example, calculating the average temperature of both plots is not possible with the Bad Measurements schema.
InfluxQL example to query schemas
# Query *Bad Measurements*, data encoded in the measurement (not recommended)
> SELECT mean("temp") FROM /\.north$/
# Query *Good Measurements*, data stored in separate tag values (recommended)
> SELECT mean("temp") FROM "weather_sensor" WHERE "region" = 'north'
Avoid putting more than one piece of information in one tag
Splitting a single tag with multiple pieces into separate tags simplifies your queries and improves performance by reducing the need for regular expressions.
Consider the following schema represented by line protocol.
Example line protocol schemas
Schema 1 - Multiple data encoded in a single tag
-------------
weather_sensor,crop=blueberries,location=plot-1.north temp=50.1 1472515200000000000
weather_sensor,crop=blueberries,location=plot-2.midwest temp=49.8 1472515200000000000
The Schema 1 data encodes two separate parameters, plot and region, in one long tag value (plot-1.north).
Compare this to the following schema represented in line protocol.
Schema 2 - Data encoded in multiple tags
-------------
weather_sensor,crop=blueberries,plot=1,region=north temp=50.1 1472515200000000000
weather_sensor,crop=blueberries,plot=2,region=midwest temp=49.8 1472515200000000000
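A Schema 1 tag set can be converted to Schema 2 by splitting the combined value. The Python sketch below assumes tags are held in a dictionary and that every location value has the form plot-N.region; the function name split_location_tag is hypothetical.

```python
def split_location_tag(tags):
    """Replace a combined 'location' tag with separate 'plot' and 'region' tags."""
    out = {k: v for k, v in tags.items() if k != "location"}
    plot, region = tags["location"].split(".", 1)   # "plot-1", "north"
    out["plot"] = plot.split("-", 1)[1]             # keep only the plot number
    out["region"] = region
    return out


schema2_tags = split_location_tag({"crop": "blueberries", "location": "plot-1.north"})
print(schema2_tags)
```

The input dictionary is left unmodified, so the original tag set remains available for comparison.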
Use Flux or InfluxQL to calculate the average temp for blueberries in the north region.
Schema 2 is preferable because, with multiple tags, you don’t need a regular expression.
Flux example to query schemas
// Schema 1 - Query for multiple data encoded in a single tag
from(bucket:"<database>/<retention_policy>")
|> range(start:2016-08-30T00:00:00Z)
|> filter(fn: (r) => r._measurement == "weather_sensor" and r.location =~ /\.north$/ and r._field == "temp")
|> mean()
// Schema 2 - Query for data encoded in multiple tags
from(bucket:"<database>/<retention_policy>")
|> range(start:2016-08-30T00:00:00Z)
|> filter(fn: (r) => r._measurement == "weather_sensor" and r.region == "north" and r._field == "temp")
|> mean()
InfluxQL example to query schemas
# Schema 1 - Query for multiple data encoded in a single tag
> SELECT mean("temp") FROM "weather_sensor" WHERE location =~ /\.north$/
# Schema 2 - Query for data encoded in multiple tags
> SELECT mean("temp") FROM "weather_sensor" WHERE region = 'north'
Shard group duration management
Shard group duration overview
InfluxDB stores data in shard groups. Shard groups are organized by retention policy (RP) and store data with timestamps that fall within a specific time interval called the shard duration.
If no shard group duration is provided, the shard group duration is determined by the RP duration at the time the RP is created. The default values are:
RP Duration | Shard Group Duration |
---|---|
< 2 days | 1 hour |
>= 2 days and <= 6 months | 1 day |
> 6 months | 7 days |
The shard group duration is also configurable per RP. To configure the shard group duration, see Retention Policy Management.
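The default mapping in the table above can be written as a small lookup. This Python sketch is not InfluxDB's implementation; in particular, treating 6 months as 180 days is an assumption made for illustration.

```python
from datetime import timedelta

SIX_MONTHS = timedelta(days=180)  # assumption: "6 months" taken as 180 days


def default_shard_group_duration(rp_duration):
    """Map an RP duration to the default shard group duration from the table."""
    if rp_duration < timedelta(days=2):
        return timedelta(hours=1)
    if rp_duration <= SIX_MONTHS:
        return timedelta(days=1)
    return timedelta(days=7)


print(default_shard_group_duration(timedelta(days=30)))
```

A 30-day RP falls in the middle row of the table, so the sketch returns a one-day shard group duration.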
Shard group duration tradeoffs
Determining the optimal shard group duration requires finding the balance between:
- Better overall performance with longer shards
- Flexibility provided by shorter shards
Long shard group duration
Longer shard group durations let InfluxDB store more data in the same logical location. This reduces data duplication, improves compression efficiency, and improves query speed in some cases.
Short shard group duration
Shorter shard group durations allow the system to more efficiently drop data and record incremental backups. When InfluxDB enforces an RP it drops entire shard groups, not individual data points, even if the points are older than the RP duration. A shard group will only be removed once a shard group’s duration end time is older than the RP duration.
For example, if your RP has a duration of one day, InfluxDB will drop an hour’s worth of data every hour and will always have 25 shard groups: one for each hour in the day, plus an extra shard group that is partially expired but isn’t removed until the whole shard group is older than 24 hours.
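The arithmetic behind this example can be sketched in Python. The model is deliberately simple: it assumes the RP is fully populated and counts the shard groups needed to cover the RP duration plus the one group that is partially expired. The function name max_shard_groups is hypothetical.

```python
import math
from datetime import timedelta


def max_shard_groups(rp_duration, shard_group_duration):
    """Shard groups covering the RP duration, plus one partially expired group."""
    return math.ceil(rp_duration / shard_group_duration) + 1


# One-day RP with one-hour shard groups, as in the example above.
print(max_shard_groups(timedelta(days=1), timedelta(hours=1)))
```

With a one-day RP and one-hour shard groups, the function returns 25, matching the example above.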
Note: A special use case to consider: filtering queries on schema data (such as tags, series, measurements) by time. For example, if you want to filter schema data within a one hour interval, you must set the shard group duration to 1h. For more information, see filter schema data by time.
Shard group duration recommendations
The default shard group durations work well for most cases. However, high-throughput or long-running instances will benefit from using longer shard group durations. Here are some recommendations for longer shard group durations:
RP Duration | Shard Group Duration |
---|---|
<= 1 day | 6 hours |
> 1 day and <= 7 days | 1 day |
> 7 days and <= 3 months | 7 days |
> 3 months | 30 days |
infinite | 52 weeks or longer |
Note: INF (infinite) is not a valid shard group duration. In extreme cases where data covers decades and will never be deleted, a long shard group duration like 1040w (20 years) is perfectly valid.
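The recommendations table can also be expressed as a lookup. In this Python sketch, None stands for an infinite RP and 3 months is taken as 90 days; both conventions are assumptions made for illustration.

```python
from datetime import timedelta

THREE_MONTHS = timedelta(days=90)  # assumption: "3 months" taken as 90 days


def recommended_shard_group_duration(rp_duration):
    """Map an RP duration (None = infinite) to the recommended shard group duration."""
    if rp_duration is None:
        return timedelta(weeks=52)         # "52 weeks or longer"
    if rp_duration <= timedelta(days=1):
        return timedelta(hours=6)
    if rp_duration <= timedelta(days=7):
        return timedelta(days=1)
    if rp_duration <= THREE_MONTHS:
        return timedelta(days=7)
    return timedelta(days=30)


print(recommended_shard_group_duration(timedelta(days=30)))
```

For an infinite RP the sketch returns the 52-week minimum from the table; a longer value may suit data that is never deleted.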
Other factors to consider before setting shard group duration:
- Shard groups should be twice as long as the longest time range of the most frequent queries
- Each shard group should contain more than 100,000 points
- Shard groups should each contain more than 1,000 points per series
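These rules of thumb can be checked against an estimated workload. The Python sketch below assumes a steady write rate spread evenly across all series, which real workloads rarely match exactly; the function name and parameters are hypothetical.

```python
from datetime import timedelta


def check_shard_group_duration(shard_group_duration, longest_common_query_range,
                               points_per_second, series_count):
    """Return a warning for each rule of thumb the duration does not meet."""
    warnings = []
    points_per_group = points_per_second * shard_group_duration.total_seconds()
    if shard_group_duration < 2 * longest_common_query_range:
        warnings.append("shorter than twice the longest common query range")
    if points_per_group <= 100_000:
        warnings.append("100,000 points or fewer per shard group")
    if points_per_group / series_count <= 1_000:
        warnings.append("1,000 points or fewer per series")
    return warnings


# 10 points/s over 100 series, with queries that commonly span one day.
for duration in (timedelta(hours=1), timedelta(days=7)):
    print(duration, check_shard_group_duration(duration, timedelta(days=1), 10, 100))
```

For this example workload, a one-hour duration fails all three checks and a seven-day duration passes them.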
Shard group duration for backfilling
Bulk insertion of historical data covering a large time range in the past will trigger the creation of a large number of shards at once. The concurrent access and overhead of writing to hundreds or thousands of shards can quickly lead to slow performance and memory exhaustion.
When writing historical data, we highly recommend temporarily setting a longer shard group duration so fewer shards are created. Typically, a shard group duration of 52 weeks works well for backfilling.
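To get a feel for the difference, the number of shard groups a backfill touches can be estimated by counting fixed-width time windows. The Python sketch below aligns windows to the Unix epoch, which is an assumption for illustration and may not match how InfluxDB aligns shard group boundaries; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed window alignment


def shard_groups_touched(start, end, shard_group_duration):
    """Count fixed-width windows that overlap the range [start, end]."""
    first = (start - EPOCH) // shard_group_duration
    last = (end - EPOCH) // shard_group_duration
    return last - first + 1


start = datetime(2016, 1, 1, tzinfo=timezone.utc)
end = datetime(2020, 12, 31, tzinfo=timezone.utc)
daily = shard_groups_touched(start, end, timedelta(days=1))
yearly = shard_groups_touched(start, end, timedelta(weeks=52))
print(daily, yearly)
```

Under these assumptions, five years of historical data spans 1827 one-day windows but only 6 windows of 52 weeks.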