
Double-indexed application logs in Splunk and other log indexers

Sunday, July 26, 2015

Introduction

A well-behaved application should output reasonably verbose logs. Disk is cheap, and hierarchical datastores make it possible to store large volumes of log data at near-negligible cost. Factor in the insight gained - developers can leave a metaphorical trail of breadcrumbs for issues down the road - and the cost-benefit will speak for itself by the first issue those logs help resolve.

Since highly-available applications distribute logging across several distinct data sources (VMs, partitioned logs, or containers, for instance), the use of Splunk or Logstash as a log indexer likewise becomes a no-brainer.

An application may use a well-known utility such as logrotate to rotate out its logs, ensuring that the application's disk footprint is kept in check. But the application itself may not be logging-aware; in fact, whatever script starts it might simply manage its output with shell redirection:

LOG_DIR=/var/log/myapp
java -jar /opt/myapp/my_application.jar > "$LOG_DIR/my_application.info" 2> "$LOG_DIR/my_application.error"

In this case, a logrotate policy such as the following might seem perfectly reasonable - it will keep five 50MB logs of each type on the machine, allocating 500MB of the application's disk budget to logs, excluding the log currently being appended to. The copytruncate directive spares the application from having to be log-aware - that is, it never has to accept a signal telling it to flush its log file handle.

/var/log/myapp/* {
        rotate 5
        size=50M
        copytruncate
}
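Note that after a couple of rotations, the directory holds more than just the two live logs - it will contain something like the following (an illustrative listing, assuming logrotate's default numeric suffixes):

my_application.error
my_application.error.1
my_application.error.2
my_application.info
my_application.info.1
my_application.info.2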

Likewise, the application’s logs might be indexed in a Splunk inputs.conf:

[monitor:///var/log/myapp/*]
index=myapp_logs
sourcetype=log4j

When performing an initial verification against the myapp_logs index, all will appear well: individual log4j events will be parsed into usable messages. After the first rotation, however, indexed events will be duplicated. This wastes disk space on the indexer, can trigger false alerts, and can skew any dashboard keying off the indexed data.
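A quick way to confirm the duplication is to search the index for raw events that appear more than once - a sketch, with the index name taken from the configuration above and an arbitrary 24-hour window:

index=myapp_logs earliest=-24h
| stats count by source, _raw
| where count > 1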

What’s the fix?

There are two primary conceptual ways to resolve the double-indexing issue. The first is to stop indexing everything within a particular hierarchy, and instead enumerate just those logs which are known to be written by the application, implicitly ignoring rotated log files. The second is to rewrite the application to be log-aware, so that its logging can specifically be reset via logrotate's prerotate and postrotate scripts.

The first fix changes the indexing configuration for the application, ensuring that no rotated logs are picked up:

[monitor:///var/log/myapp/my_application.info]
index=myapp_logs
sourcetype=log4j

[monitor:///var/log/myapp/my_application.error]
index=myapp_logs
sourcetype=log4j
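If enumerating each file feels too rigid, inputs.conf also supports a blacklist regex on a monitored path, which keeps the wildcard while skipping numeric rotation suffixes - a sketch:

[monitor:///var/log/myapp/*]
index=myapp_logs
sourcetype=log4j
blacklist=\.\d+$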

The second fix makes the application aware of an external signal, which tells it to close and reopen its logs. In this case, the hangup (HUP) signal is used.

/var/log/myapp/* {
        rotate 5
        size=50M
        sharedscripts
        postrotate
                kill -HUP $(pgrep -f my_application.jar)
        endscript
}
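On the application side, being log-aware can be as simple as reopening the output file when the signal arrives. Here is a minimal sketch using sun.misc.Signal - an unsupported internal JDK API, used purely for illustration; a real application would more likely reconfigure its logging framework instead:

import sun.misc.Signal;

import java.io.FileOutputStream;
import java.io.PrintStream;

public class HupAwareApp {
    private static volatile PrintStream log;

    public static void main(String[] args) throws Exception {
        reopen();
        // logrotate's postrotate script sends HUP after renaming the old file;
        // reopening creates a fresh my_application.info to write to.
        Signal.handle(new Signal("HUP"), signal -> reopen());
        while (true) {
            log.println("heartbeat " + System.currentTimeMillis());
            Thread.sleep(5000);
        }
    }

    private static synchronized void reopen() {
        try {
            PrintStream old = log;
            // Open in append mode so no messages are lost around the swap.
            log = new PrintStream(
                    new FileOutputStream("/var/log/myapp/my_application.info", true), true);
            if (old != null) {
                old.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}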

Of course, there are numerous other ways to skin this cat.

But why?

The first logging configuration introduced tells Splunk to read in every file in /var/log/myapp. Splunk has some built-in checks for uniqueness; importantly, it uses the CRC of the first and last 256 bytes of an indexed file as its designator of uniqueness. Because a copytruncated file is written in chunks rather than all at once, its last 256 bytes are constantly changing while it is being rotated out. Splunk will pick the rotated file up, assume it is new, and index its entire contents accordingly.
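The churn is easy to observe outside Splunk with cksum as a rough stand-in for Splunk's internal CRC - run this pair of commands twice, a few log writes apart:

# The checksum of the leading bytes stays stable once the file exceeds 256 bytes...
head -c 256 /var/log/myapp/my_application.info | cksum
# ...while the checksum of the trailing bytes changes with every write:
tail -c 256 /var/log/myapp/my_application.info | cksum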

In the first fix, this is worked around by explicitly telling Splunk not to look at any of the rotated files that get generated (e.g. "my_application.info.1"), ensuring no duplicates from copied files are indexed. In the second fix, the application no longer depends on copytruncate at all, and logrotate is able to rotate the file with a simple, atomic rename. The renamed file's contents - and thus the CRCs of its first and last 256 bytes - match what has already been indexed, preventing the file from being reindexed.

Why not write directly to the indexer?

As anyone using Docker may already be aware, using an external file for what is essentially interprocess communication is unnecessary - an application can log directly to Splunk or some other indexer, bypassing a file-based forwarder altogether; after all, the indexer is just another service to interact with. For Java applications, Splunk offers log providers for log4j / slf4j, Logback, and JDK 1.4-style (java.util.logging) loggers, avoiding a locally-deployed forwarder entirely. Docker's single-process model makes this approach an especially natural fit.
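Even without those providers, a plain network input on the indexer achieves much the same thing. As a sketch, a Splunk TCP input (the port here is arbitrary) that a logging framework's socket or syslog appender could write to directly:

[tcp://5514]
index=myapp_logs
sourcetype=log4j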

A word on JBoss AS / Wildfly

This blog post grew out of lessons learned from the stock logging configuration of Wildfly (formerly JBoss Application Server), where the out-of-box log appender is the periodic-rotating-file-handler, which produces a /var/log/jboss directory similar to the following:

[root@app01 /var/log/jboss]$ ls -1
server.log.2015-07-20
server.log.2015-07-19
server.log.2015-07-18
[...]

In order to keep disk space in check with this handler, logrotate must be used externally to clean up old logs, as the handler offers no maximum number of files to keep. Care must be taken not to specify a copytruncate rotation policy: Wildfly is smart enough to swap out file handles on its own, rendering copytruncate unnecessary - and, per the above, using it will likewise cause Wildfly logs to be double-indexed.

A better solution altogether is to use a size-rotating-file-handler, which allows the max-backup-index parameter to be specified to keep only a certain number of logs on the disk. This removes the need for parallel usage of native Wildfly rotation and logrotate. See Logging Configuration: Logging Subsystem Descriptions: size-rotating-file-handler for more information on the logging policies available to Wildfly and JBoss.
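As a sketch, the relevant fragment of the logging subsystem in standalone.xml might look like the following, mirroring the 50MB x 5 budget used earlier (the "FILE" handler and "PATTERN" formatter names follow the stock configuration):

<size-rotating-file-handler name="FILE" autoflush="true">
    <formatter>
        <named-formatter name="PATTERN"/>
    </formatter>
    <file relative-to="jboss.server.log.dir" path="server.log"/>
    <rotate-size value="50m"/>
    <max-backup-index value="5"/>
    <append value="true"/>
</size-rotating-file-handler>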