Controlling Your Logs

Log management on Linux is a pain. It's not the first thing you think about when you're managing a Linux server, but it can really bite you on the ass if you don't think about it at some point. Problems come in several forms: company policy (for example, "personal information is not to be retained more than one month"), default software package settings that don't meet your standards or expectations, your developers' expectations, and uncontrolled log growth that can overwhelm your hard drive (sometimes courtesy of said developers).

We run Ruby-based applications with Nginx on cloud servers. I've found there are three forms of logging that can bite you: Nginx itself, journald, and the Ruby application logs. I'm going to talk about how to deal with all three. Your last line of defense should be some form of notification system that alerts you when any given server's hard drive passes the 80% full mark (or 70% or 90%, depending on your setup and personal tolerance). With DigitalOcean, their recent "extended monitoring" includes precisely this (... except once you choose a notification channel, you can't ever change it again - a "known bug" that's been outstanding for at least a month now).
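
If your provider doesn't offer that, a minimal home-grown fallback is a small cron script. This is just a sketch, assuming an 80% threshold, the root filesystem, a working 'mail' command, and a script dropped into (say) /etc/cron.hourly/disk-alert - all of those are choices, not requirements:

#!/bin/sh
# Warn when the root filesystem passes a usage threshold.
THRESHOLD=80
USED=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "Disk usage on $(hostname) is at ${USED}%" \
        | mail -s "disk space warning: $(hostname)" you@example.com
fi

Swap the mail line for whatever notification channel you actually use.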

Most log management is done by logrotate, a lovely utility that every system administrator should have at least a passing acquaintance with. It's installed and configured by default on almost every Linux system, certainly on Debian (which is where the settings mentioned here apply). There's one significant problem: with the move to systemd, nearly all system-level logs are A) binary, and B) managed by journald/journalctl, NOT logrotate. So they have to be configured separately.
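
A quick way to see which side of that split is eating your disk (both commands are standard; /var/log is where the plain-text logs live):

du -sh /var/log/* | sort -h | tail -n 5   # biggest plain-text logs (logrotate's domain)
journalctl --disk-usage                   # binary journals (journald's domain)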

logrotate

'logrotate' rotates, compresses, and deletes old logs at set intervals. "Log rotation" is the process of moving a currently active log to a named older file at whatever interval you specify - once a day, once a week - so /var/log/nginx/access.log becomes /var/log/nginx/access.log.1. logrotate's primary configuration file is /etc/logrotate.conf, and on Debian it comes with a good default: rotate all files weekly and delete them after four weeks. This sounded great for us (and covers all the default logs I'm not talking about), except that the default Debian nginx package turns out to override this immediately by installing its own logrotate config in /etc/logrotate.d/nginx - which logrotate reads automatically, and whose per-log settings take priority over the global defaults. (I'm not saying this is wrong: it's a good system, but annoying in this instance.) So we break out man logrotate to find out how to better configure nginx's log rotation.

# /etc/logrotate.d/nginx
/var/log/nginx/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    notifempty
    missingok
    maxsize 100M
    copytruncate
}

The first line says "this is how to deal with the files matching /var/log/nginx/*.log." 'weekly' means "rotate the logs once a week." Note that logrotate counts the week from the first time it saw the log file, not from any particular day of the week. 'rotate 4' means "keep four old logs" - access.log.1 through access.log.4 - in addition to the current one. 'compress' means "compress rotated logs with gzip," but it's modified by 'delaycompress', which means "leave the most recent rotated log uncompressed until the next rotation." This is surprisingly helpful, as that's the log you're most likely to look at other than the current one. 'notifempty' means "don't bother rotating the log if it's empty." Likewise, 'missingok' means "don't worry about it if the log you expect isn't there."

These settings are fairly reasonable, but now we get into the trickier ones. 'maxsize 100M' means what it suggests, but not quite the way you'd think. At first glance it suggests that when the current log hits 100 MB in size, it will be rotated. That's correct but, major caveat, logrotate itself only runs once a day, via the system /etc/cron.daily/logrotate cron script at midnight or 0400 each morning (I don't remember the exact time). So in practice 'maxsize' means "when the daily run happens, if the log is bigger than 100M, rotate it even if we haven't reached the one-week interval you also specified."

I'm pleased to say I've never dealt with a server so busy that rotating logs more often than every 24 hours was necessary. If you need that, I think you could simply move /etc/cron.daily/logrotate to /etc/cron.hourly/ - but note that this changes nothing if you say 'weekly' without 'maxsize 100M': logrotate will run every hour, see that the week isn't up, and do nothing.
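
On Debian you can check exactly when the daily run fires, and moving the script is a one-liner (paths here are the Debian defaults):

grep run-parts /etc/crontab               # shows when cron.daily actually runs
sudo mv /etc/cron.daily/logrotate /etc/cron.hourly/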

And finally, 'copytruncate'. Normally, when logrotate backs up a log, it uses the 'mv' command (or a system-level equivalent) to rename access.log to access.log.1, then immediately 'touch'es a new access.log so the logging process can continue. But some processes get very attached to a particular file: they hold an open file descriptor, which refers to the file's inode rather than its name, and the inode stays the same when a file is 'mv'ed. So the logging process - in this case Nginx - keeps writing to access.log.1 instead of to the new access.log as you'd expect. 'copytruncate' exists to tell logrotate to copy the contents of the file to the backup filename, and then empty the original file in place so the logging process can continue writing to it. I suspect 'copytruncate' should only be used when required: renaming a file has got to be cheaper than copying it. That doesn't sound like much, but I've had to do it with files 1G in size, and at that point it might cause a small VM's web server to hiccup. From personal experience, 'copytruncate' is needed with (at least) Nginx and Puma.
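
You can demonstrate the inode behaviour to yourself in a couple of commands:

touch access.log
ls -i access.log      # note the inode number printed before the name
mv access.log access.log.1
ls -i access.log.1    # same inode: any process that already has the file
                      # open keeps writing to it, whatever it's now called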

Sometimes developers decide they need more logging - and if they're trying to debug a buggy application, they're almost certainly right. But they usually don't concern themselves with what happens to the logs after they're done reading them - that's an ops problem (not sarcasm, just a statement of fact). So we apply a little logrotate magic to the application folder:

# /etc/logrotate.d/ruby (the filename is arbitrary, it just has to be in the folder)
/opt/ruby/poetry/*.log /opt/ruby/poetry/log/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    notifempty
    missingok
    maxsize 100M
    copytruncate
}

Note that I'm telling it to take care of two separate folders: you do this simply by placing both patterns on the same line with a space between them. We had log files in both places. In fact, it doesn't hurt to cover your bases: if there are no log files in one place or the other, 'missingok' means logrotate will skip it quietly and without problems.
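
If you'd rather not take that on faith, logrotate will tell you what it would do without touching anything:

sudo logrotate -d /etc/logrotate.d/ruby   # -d is a dry run: parse the config and report only
sudo logrotate -f /etc/logrotate.d/ruby   # -f forces a real rotation, to test end-to-end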

journalctl

journald logs to /var/log/journal/, and I first discovered what a pain it is when I found a cloud server running out of disk space because that folder occupied 10G of a 30G hard disk. It's supposed to have good defaults and politely manage itself, but that doesn't always hold true. So I learned about journald and its public-facing command, journalctl. Typing journalctl on its own gets you all your logs in a less session. You can find out how clogged with logs your system is simply by typing 'G' - in less, this jumps to the end of the file. On a relatively slow VM with a large journal, I've seen this take in excess of three minutes. The next most useful command (particularly given our subject) is 'journalctl --disk-usage', which tells you something like "Archived and active journals take up 512.0M on disk" (Fedora 25) or "Journals take up 2.8G on disk" (Debian stable). On a small VM, 2.8G is a bit of a pain, so we want to control that growth:
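
A few other journalctl invocations that make browsing less painful (all standard flags):

journalctl -e                  # open less already positioned at the end of the journal
journalctl -n 100 --no-pager   # just the last 100 lines, straight to stdout
journalctl -u nginx.service    # only one unit's logs
journalctl --since yesterday   # only recent entries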

# /etc/systemd/journald.conf
#  This file is part of systemd.
# See journald.conf(5) for details

[Journal]
# 2678400 seconds = 31 days.
MaxRetentionSec=2678400
# 86400 seconds = 1 day.
MaxFileSec=86400
SystemMaxUse=400M

We're capping the logs on two limits: nothing older than 31 days, and not more than 400M of disk space used for all journals. MaxFileSec is the interesting one: it appears that if you leave journald to its own devices with a maximum retention period of 31 days, it will wipe your entire log back to nothing at the end of each month - journald expires logs by deleting whole archived journal files, so one file spanning the month means one giant deletion. My guess (not confirmed yet!) is that by telling it no file may cover more than a day, it will instead drop a single day of data each time the 31-day limit is hit, so you'll always have most of the month available. As I say - that part is unproven so far.
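
Two caveats worth knowing: journald only picks these settings up on restart, and if you want the journals already on disk pruned immediately rather than waiting for journald to get around to it, the vacuum options do it on the spot (the values here just mirror the config above):

sudo systemctl restart systemd-journald   # apply the new journald.conf
sudo journalctl --vacuum-size=400M        # prune existing journals down to 400M
sudo journalctl --vacuum-time=31d         # ...and drop anything older than 31 days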

Lastly, if you suspect journald log corruption (or journald tells you about it), you can use 'journalctl --verify'. It has an ugly interface and is fantastically slow, though, so I'd recommend only using it if you really know you need it.
