If you run the Telegraf metric collector in any non-trivial setup, you have probably come across the problem that it doesn’t offer explicit routing between inputs and outputs.
All inputs, outputs, and the filters in between share a single namespace, so by default every metric from every input goes to every output, even when you use multiple configuration files.
This is convenient when you want to funnel metrics from several inputs into one InfluxDB, or when you want to fan out your metrics to several databases at once.
It’s not so convenient when you want to separate your metrics from different sources into different databases.
There are several approaches to this problem. One suggestion is to run multiple Telegraf instances, with separate configuration, either stand-alone or in a Docker environment.
Another suggestion is to apply arbitrary tags to the input plugins and then filter on those tags in the output plugins, like this:
[[inputs.cpu]]
  # [ ... ]
  [inputs.cpu.tags]
    route_to = "metricsdb"

[[outputs.influxdb]]
  # [ ... ]
  tagexclude = ["route_to"]
  [outputs.influxdb.tagpass]
    route_to = ["metricsdb"]
In this example, all CPU metrics get the route_to tag attached and are let through by the InfluxDB output whose tagpass matches. The route_to tag gets stripped on the way out via tagexclude, so it doesn’t spam the database.
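To actually separate sources into different databases, the same pattern simply repeats with a different tag value per route. A sketch of a second route (the nginx input, the database name, and the URLs are made up for illustration, not taken from the original setup):

[[inputs.nginx]]
  urls = ["http://localhost/status"]            # illustrative
  [inputs.nginx.tags]
    route_to = "webdb"

[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]   # illustrative
  database = "webdb"
  tagexclude = ["route_to"]
  [outputs.influxdb.tagpass]
    route_to = ["webdb"]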
I had been using this setup for years, and although you have to be extra careful to get the tags right, it worked surprisingly well.
Or rather, I thought it worked surprisingly well, until I started experimenting with our metrics stack and testing alternatives to InfluxDB.
On every server in our fleet, we run a Telegraf agent collecting basic system telemetry. Rather than sending the metrics directly to the database, the metrics are sent to a central “proxy” which then writes the data to the database. This proxy is a Telegraf agent with an InfluxDB Listener input plugin, utilizing the same tag/tagpass logic:
[[inputs.influxdb_listener]]
  # [ ... ]
  [inputs.influxdb_listener.tags]
    route_to = "metricsdb"

[[outputs.influxdb]]
  # [ ... ]
  tagexclude = ["route_to"]
  [outputs.influxdb.tagpass]
    route_to = ["metricsdb"]
Now, this host collects system telemetry itself like any other host in the fleet, and for the sake of simplicity I just reused the config, so it sent its metrics to the proxy port on the same host instead of directly to the database.
In retrospect it’s clear as day, but I completely failed to notice that both the output to the destination InfluxDB and the output to the proxy would let the same tag pass… so, yeah, I created a really nice feedback loop of metrics: whatever came in through the influxdb_listener went out to the proxy again, and again, and again.
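Pieced together, the effective configuration on the proxy host looked roughly like the following sketch (a simplified reconstruction with placeholder ports and URLs, not the literal config):

# Local system telemetry, reused from the fleet-wide agent config
[[inputs.cpu]]
  [inputs.cpu.tags]
    route_to = "metricsdb"

# The proxy input: everything the fleet sends in gets the same tag
[[inputs.influxdb_listener]]
  service_address = ":8186"                     # placeholder port
  [inputs.influxdb_listener.tags]
    route_to = "metricsdb"

# Output to the actual database
[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]   # placeholder URL
  tagexclude = ["route_to"]
  [outputs.influxdb.tagpass]
    route_to = ["metricsdb"]

# Output to "the proxy", which on this host is the local listener port.
# Metrics arriving via the listener match this tagpass too, so they are
# written straight back into the listener: the feedback loop.
[[outputs.influxdb]]
  urls = ["http://localhost:8186"]              # placeholder port
  tagexclude = ["route_to"]
  [outputs.influxdb.tagpass]
    route_to = ["metricsdb"]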
I did notice that the load on that host was pretty high, and I completely ignored the constant “metric buffer full” errors in the logs, but hey, I figured that the poor server was just getting hammered! Plus, I wasn’t missing any metrics in the database!
The thing with InfluxDB is that it treats a write with the same timestamp and the same tag set as an already existing record as an update instead of an insert. So nothing seemed off on that end of the pipeline: I found one record where I expected one record.
But TimescaleDB, which I tested as an alternative, runs on top of PostgreSQL, which does in fact insert a new row for every record, even if the same data already exists (barring any unique index constraints). So when I added another output plugin to my Telegraf proxy, sending everything to TimescaleDB as well, I was surprised and disappointed by how quickly this database grew, how much disk space it used, and how long the queries took.
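For reference, the extra output was presumably something along the lines of Telegraf’s PostgreSQL output plugin pointed at the TimescaleDB instance, with the same tag routing as the InfluxDB output; a sketch with a placeholder connection string (the exact options used aren’t in the post):

[[outputs.postgresql]]
  # libpq-style connection string; placeholder values
  connection = "host=timescale.example.com user=telegraf dbname=metrics"
  tagexclude = ["route_to"]
  [outputs.postgresql.tagpass]
    route_to = ["metricsdb"]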
After several hours of optimizing and prodding I finally noticed that instead of 1 record per measurement, I had 68. Or 90. Or 130.
And this led me to the discovery of the feedback loop.
I fixed that, and now TimescaleDB no longer eats through its disk space within a couple of hours, the system load on the proxy server has dropped to basically zero, and there are no more “metric buffer full” errors.
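For what it’s worth, one way to break a loop like this (not necessarily the exact fix applied here) is to simply drop the output that points back at the local listener port; the locally collected metrics already carry the route_to tag and get picked up by the database output that lives on the host anyway:

# Proxy host after the fix: only the database output remains, nothing
# points back at the local listener, so nothing can re-enter the pipeline.
[[outputs.influxdb]]
  urls = ["http://influxdb.example.com:8086"]   # placeholder URL
  tagexclude = ["route_to"]
  [outputs.influxdb.tagpass]
    route_to = ["metricsdb"]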
Go figure…