Telegraf Issues

On all of our servers we use telegraf to collect telemetry (CPU, RAM, disk, I/O, network). Most of them just have the system input plugin configured, but some dedicated monitoring hosts have many more input plugins to collect metrics from PostgreSQL, Elasticsearch, Logstash, etc.

With this week’s update to telegraf 1.26, the telegraf agent on these hosts started throwing errors, and no metrics were written to InfluxDB:

[outputs.influxdb] When writing to [https://localhost:8086]:
  failed making write req: getting password failed: cannot allocate memory

[agent] ["outputs.influxdb"] did not complete within its flush interval

I downgraded to 1.25.3 and the processes worked again as before, so it had to be some regression with 1.26.

After a little tinkering I found that version 1.26 would run fine as root but not as the default user telegraf. After a couple of tests and some back and forth, it was clear that the process needed its ulimits raised.
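One way to see what a process is actually allowed to do is to read the limits the kernel enforces for it. This is a minimal sketch, assuming a Linux host with /proc mounted; the show_limits helper is a name made up for this example and not part of telegraf or systemd:

```shell
#!/bin/sh
# Print the resource limits the kernel enforces for a given PID (Linux only).
# Without an argument, "self" is used, which resolves to the process reading the file.
show_limits() {
    pid="${1:-self}"
    cat "/proc/${pid}/limits"
}

# Example: limits of the current shell, reduced to the rows that matter here.
# For the agent, pass its PID instead, e.g. show_limits "$(pidof telegraf)"
# (this assumes exactly one telegraf process is running).
show_limits "$$" | grep -E 'Max (open files|processes|stack size|locked memory|data size)'
```

Running this once for a root shell and once for the telegraf process shows which soft and hard limits differ between the two.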

Since it runs as a systemd service, you cannot set these limits in /etc/security/limits.d/, which only applies to PAM login sessions. You need to write a systemd service override instead:

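# /etc/systemd/system/telegraf.service.d/override.conf

[Service]
# Maximum number of open file descriptors
LimitNOFILE=65536
# Maximum number of processes/threads for the service user
LimitNPROC=16384
# Maximum stack size, in bytes (32 KiB)
LimitSTACK=32768
# Maximum locked memory, in bytes (128 MiB)
LimitMEMLOCK=134217728
# Maximum data segment size (no limit)
LimitDATA=infinity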
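
A drop-in like this only takes effect once systemd has re-read its unit files and the service has been restarted. The following is a sketch of those steps, not a tested deployment script: the RUN variable and the apply_override function are hypothetical helpers for this example, and by default the commands are only printed rather than executed:

```shell
#!/bin/sh
# Dry run by default: the commands are only printed.
# Set RUN=sudo (or RUN=env when already root) to actually execute them.
RUN="${RUN:-echo}"
UNIT="${UNIT:-telegraf.service}"

apply_override() {
    # Make systemd re-read unit files, including the new drop-in.
    $RUN systemctl daemon-reload
    # Limits are applied at process start, so the service must be restarted.
    $RUN systemctl restart "$UNIT"
    # Show the limits systemd now has configured for the unit.
    $RUN systemctl show "$UNIT" -p LimitNOFILE -p LimitNPROC \
        -p LimitSTACK -p LimitMEMLOCK -p LimitDATA
}

apply_override
```

With RUN=sudo, the last command should echo back the values from the override file, which is a quick check that the drop-in was picked up.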

I didn’t experiment with the actual values to fine-tune them, so it’s likely that some of them are unnecessary or too high, but at least our metrics are available again, and that’s good enough for me!