shopping24 tech blog

s is for shopping

May 21, 2014 / by Torsten Bøgh Köster / CTO / @tboeghk

Tomcat log management with Logstash and Graylog2

Collecting and parsing access logs from web servers like Nginx or Apache is widely covered. But when it comes to Apache Tomcat, even writing access logs can be tricky, and parsing them in Logstash is close to hell.

We run our Java applications on the fabulous Apache Tomcat server, either behind Nginx, HAProxy or Varnish. To get a 100% view of our traffic, we collect most access logs at the entry point into our datacenter. Still, it is sometimes desirable to collect Tomcat access and application logs as well.

Tomcat access logs

There’s no need to reinvent the wheel to configure Tomcat to produce decent access and application logs. If your Tomcat runs behind a reverse proxy, the RemoteIpValve comes in handy to replace the load balancer’s IP with the real client IP. Add it to your Tomcat server.xml:

<Valve className="org.apache.catalina.valves.RemoteIpValve"
  requestAttributesEnabled="true" />

The AccessLogValve produces an access log in the Apache combined format with some extras attached: it appends the request duration (%D), the current session identifier (%S) and the Location response header (%{Location}o) to each log message. Place it in your server.xml as well:

<Valve className="org.apache.catalina.valves.AccessLogValve"
  directory="logs"  prefix="access" suffix=".log"
  rotatable="false" resolveHosts="false"
  conditionUnless="ignore-access-log" encoding="utf-8"
  pattern='%h %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" %D "%S" "%{Location}o"' />

We usually have Tomcat’s built-in log rotation disabled, as a logrotate daemon rotates all our logfiles anyway; that’s why we set rotatable to false. Check the Tomcat documentation if you want to enable Tomcat’s own logfile rotation instead.
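As a rough sketch, a matching logrotate rule could look like the following (the log path and retention are assumptions, adjust them to your setup; copytruncate lets Tomcat keep writing to the same file handle):

```
/var/lib/tomcat/logs/access.log {
  daily
  rotate 14
  compress
  delaycompress
  missingok
  notifempty
  copytruncate
}
```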

Tomcat application logs

The easiest way to collect log data from your favorite logging framework (log4j, slf4j, JULI) is to format it in a way that is easy to parse and read: JSON. We use slf4j on top of log4j. Place the logstash log4j JAR either in your webapp’s WEB-INF/lib or in Tomcat’s /lib directory.

Logstash provides a JSONEventLayoutV1 as a log4j layout implementation. Configure your FileAppender (or RollingFileAppender) to use it:

<appender name="stdout-json" class="org.apache.log4j.FileAppender">
  <param name="Threshold" value="WARN" />
  <param name="File" value="${catalina.base}/logs/catalina.json.out" />
  <param name="Append" value="true" />
  <layout class="net.logstash.log4j.JSONEventLayoutV1" />
</appender>

The JSONEventLayoutV1 marshals all logging event details into JSON, even the NDC and MDC contents. Be aware that JSON logfiles get rather big and need proper rotation: use the logrotate daemon or switch to a RollingFileAppender in the configuration above.
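If you prefer log4j’s built-in rotation, the same layout works with a RollingFileAppender. A sketch (the size limit and backup count are assumptions):

```
<appender name="stdout-json" class="org.apache.log4j.RollingFileAppender">
  <param name="Threshold" value="WARN" />
  <param name="File" value="${catalina.base}/logs/catalina.json.out" />
  <param name="Append" value="true" />
  <param name="MaxFileSize" value="100MB" />
  <param name="MaxBackupIndex" value="7" />
  <layout class="net.logstash.log4j.JSONEventLayoutV1" />
</appender>
```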

Mining data in Logstash

We use a centralized Logstash architecture: Logstash shippers run on each machine, collecting logfiles and shipping them to a central Redis server. The Logstash supercharger pulls log entries off Redis and processes them. In the end, the logs are sent to a Graylog2 instance. Here are some useful filters.
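A minimal sketch of the shipper and supercharger input/output configuration (the hostnames, file paths and the Redis list key are assumptions):

```
# shipper: tail the Tomcat logs and push raw events to Redis
input {
  file { path => "/var/lib/tomcat/logs/access.log" type => "tomcat-access" }
  file { path => "/var/lib/tomcat/logs/catalina.json.out" type => "tomcat-app" codec => "json" }
}
output {
  redis { host => "redis.internal" data_type => "list" key => "logstash" }
}

# supercharger: pull events off Redis, apply the filters below, forward to Graylog2
input {
  redis { host => "redis.internal" data_type => "list" key => "logstash" }
}
```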

Parsing the log message

Grok is the tool of choice when it comes to parsing logfiles. The following expression parses the access log created above:

grok {
  match   => { "message" => "%{COMBINEDAPACHELOG} %{NUMBER:http_request_duration_ms} %{QS:http_session_id} %{QS:http_response_location}" }
  add_tag => [ "combined_apache_log", "extended_tomcat_log" ]
}

N.B.: We use tagging a lot for debugging purposes.
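Those tags come in handy in conditionals. As a sketch, events that grok failed to parse can be singled out via Logstash’s default failure tag (the field name below is an assumption):

```
filter {
  if "_grokparsefailure" in [tags] {
    # mark unparsed events so they are easy to find in Graylog2
    mutate { add_field => { "parse_status" => "failed" } }
  }
}
```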

Syncing the log message date

Redis buffers messages between the Logstash shippers and the central supercharger. After a service interruption, messages may be processed much later, so it’s important to restore the original log message date from the timestamp field. Make sure your timezone is set correctly.

date {
  locale   => "en"
  timezone => "Europe/Berlin"
  match    => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  add_tag  => [ "timestamped" ]
}


GeoIP lookup

Logstash ships with the MaxMind GeoLite database, which is pretty accurate.

geoip {
  source   => "clientip"
  fields   => [ "city_name", "country_code2", "country_name", "location", "real_region_name", "postal_code" ]
  add_tag  => [ "geoip" ]
}

Useragent & bot detection

The useragent filter is based on ua-parser:

useragent {
  source  => "agent"
  prefix  => "useragent_"
  add_tag => [ "useragent" ]
}


We fingerprint the clientip in combination with the useragent to obtain a pseudonymous session identifier:

fingerprint {
  source              => [ "clientip", "agent" ]
  concatenate_sources => true
  method              => "SHA1"
  key                 => "somecrazyseed"
  target              => "fingerprint_session_id"
  add_tag             => [ "fingerprint" ]
}

For compliance reasons, you should either remove the clientip or hash it before sending the log message off to Graylog2.
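A sketch of the hashing variant using Logstash’s anonymize filter (the key is an assumption; pick your own secret):

```
anonymize {
  fields    => [ "clientip" ]
  algorithm => "SHA1"
  key       => "somecrazyseed"
}
```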

Serve in Graylog2

Use the gelf output to send your messages to your Graylog2 instance. Enjoy in-depth analytics of your log messages and some serious cross-domain session tracking.
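A minimal gelf output could look like this (the hostname is an assumption; 12201 is the default GELF port):

```
output {
  gelf {
    host => "graylog2.internal"
    port => 12201
  }
}
```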