shopping24 tech blog

s is for shopping

January 02, 2014 / by Torben Greulich / Software engineer / @mr_tege

How to sync incidents of pagerduty and statuspage.io

Suppose you use pagerduty to aggregate all your monitoring and statuspage.io to communicate your server states. Wouldn’t it be nice to have a service that automatically updates statuspage.io based on your aggregated monitoring results? Unfortunately, such a service doesn’t exist. Until now :-)

Monitoring and alerting

pagerduty

We use pagerduty. It’s a service that aggregates the alarms of almost all monitoring tools and lets you set up an alerting system based on them. You can configure different types of alarms (SMS, email, phone …) and different stages of escalation (e.g. send an alarm by mail; if no one reacts within half an hour, send an SMS). This gives you plenty of options for managing your alarms and alerting.

Statuspage.io

If you are serving software and want to communicate your system state to your customers, it’s very useful to have a status page.

We chose statuspage.io. It lets you display the state of your components (Operational, Degraded Performance, Partial Outage, or Major Outage) and list current incidents. Furthermore, you can add performance metrics from other tools to display your component states. (At the moment you can integrate Pingdom, Librato, New Relic, Datadog, and TempoDB.)

Connecting pagerduty and statuspage.io

Because we aggregate all our monitoring and server states at pagerduty, it is the obvious input for statuspage.io. Unfortunately, there is no service that updates statuspage.io based on open incidents at pagerduty. That’s why we created such a service:

The Idea

  • Check all unresolved incidents at pagerduty and try to identify which component is affected.
  • Determine the consequences of these incidents for your component.
  • Update the component at statuspage.io based on this information.

Details

We load all pagerduty incidents with state = triggered or acknowledged and check their content against a predefined regex. The regex is applied to the trigger_summary_data object of each incident. According to the pagerduty documentation, this field contains “Some condensed information regarding the initial event that triggered this incident. …if an email triggered the incident, then the trigger_summary_data will likely contain a subject…”
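The matching step can be sketched roughly as follows. This is a minimal illustration, not the tool's own code: the class `IncidentMatcher` and its methods are hypothetical, the trigger summary is assumed to be already flattened to a plain string (for example the email subject), and it assumes the regex only needs to be found somewhere in that text.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical helper: maps a pagerduty trigger summary to a configured incident type. */
final class IncidentMatcher {
    // Incident type name -> compiled regex, kept in registration order.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    void register(String incidentTypeName, String regexp) {
        patterns.put(incidentTypeName, Pattern.compile(regexp));
    }

    /** Returns the first incident type whose regex is found in the summary text. */
    Optional<String> match(String triggerSummary) {
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            // Assumption: a partial match (find) is enough; the tool may use full matches.
            if (entry.getValue().matcher(triggerSummary).find()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Incidents for which `match` returns an empty result would simply be ignored, since they do not concern the configured component.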

All incidents that match are sent to statuspage.io with a predefined message, a new component state and a transformed incident state. Note that if there are several incidents for one component, the worst state wins.
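The "worst state wins" rule could look like the sketch below. The enum and its method names are hypothetical. It assumes that the severity order is operational < degraded_performance < partial_outage < major_outage, and that a component without matching incidents counts as operational.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Locale;

/** Component states, declared from best to worst (assumed severity order). */
enum ComponentStatus {
    OPERATIONAL, DEGRADED_PERFORMANCE, PARTIAL_OUTAGE, MAJOR_OUTAGE;

    /** Parses a configuration value such as "partial_outage". */
    static ComponentStatus fromConfig(String value) {
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Returns the lower-case form used in the configuration file. */
    String toConfig() {
        return name().toLowerCase(Locale.ROOT);
    }

    /** Picks the worst status of all matched incidents. */
    static ComponentStatus worstOf(Collection<ComponentStatus> statuses) {
        // Assumption: no matching incident means the component is healthy.
        if (statuses.isEmpty()) {
            return OPERATIONAL;
        }
        // Enums compare by declaration order, so max() yields the worst state.
        return Collections.max(statuses);
    }
}
```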

Incident transformation:

pagerduty incident state | statuspage.io incident state
:------------:| :-------------:
triggered | investigating
acknowledged | identified
resolved | resolved
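Expressed in code, the transformation above is a plain lookup. The class name `IncidentStateMapper` is made up for this sketch, and rejecting unknown states with an exception is an assumption.

```java
import java.util.Locale;

/** Hypothetical mapper from pagerduty incident states to statuspage.io incident states. */
final class IncidentStateMapper {
    private IncidentStateMapper() {
    }

    static String toStatuspageState(String pagerdutyState) {
        switch (pagerdutyState.toLowerCase(Locale.ROOT)) {
            case "triggered":
                return "investigating";
            case "acknowledged":
                return "identified";
            case "resolved":
                return "resolved";
            default:
                // Assumption: any other state is treated as a configuration or API error.
                throw new IllegalArgumentException("Unknown pagerduty incident state: " + pagerdutyState);
        }
    }
}
```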

Furthermore, all unresolved statuspage.io incidents are loaded and checked to see whether they are still active. If not, they are set to resolved.
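How the tool correlates statuspage.io incidents with pagerduty incidents is not described here. Assuming both sides can be reduced to a shared key (for example the incident type name), the cleanup step is a set difference. All names in this sketch are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper that finds statuspage.io incidents which are no longer active. */
final class StaleIncidentFinder {
    private StaleIncidentFinder() {
    }

    /**
     * @param openStatuspageKeys keys of all unresolved statuspage.io incidents
     * @param activePagerdutyKeys keys of all triggered or acknowledged pagerduty incidents
     * @return keys that are still open on statuspage.io but have no active counterpart
     */
    static Set<String> findStale(Set<String> openStatuspageKeys, Set<String> activePagerdutyKeys) {
        // Copy first so the caller's set is not modified.
        Set<String> stale = new LinkedHashSet<>(openStatuspageKeys);
        stale.removeAll(activePagerdutyKeys);
        return stale;
    }
}
```

Every key returned by `findStale` would then be set to resolved at statuspage.io.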

Setting up the tool

You can find the tool on GitHub.

Credentials

First, add your credentials to application.credentials:

# statuspage.io
statuspage.component.id=COMPONENT.ID
statuspage.api.key=API.KEY
statuspage.page.id=PAGE.ID

# pagerDuty
pagerduty.token=PAGERDUTY.TOKEN
pagerduty.host=PAGERDUTY.HOST

At the moment you can add only one statuspage.io component.
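Since the credentials file uses the standard key=value format, it can be read with `java.util.Properties`. The following is only a sketch of that idea, not the tool's own loader: the class `Credentials` is hypothetical, and failing fast on a missing key is an assumption.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical loader for application.credentials. */
final class Credentials {
    // Keys taken from the example credentials file above.
    static final String[] REQUIRED_KEYS = {
        "statuspage.component.id", "statuspage.api.key", "statuspage.page.id",
        "pagerduty.token", "pagerduty.host"
    };

    private Credentials() {
    }

    /** Reads the credentials and checks that every required key has a value. */
    static Properties load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException("Could not read credentials", e);
        }
        for (String key : REQUIRED_KEYS) {
            String value = props.getProperty(key);
            // Assumption: a missing or blank value should abort the run early.
            if (value == null || value.trim().isEmpty()) {
                throw new IllegalStateException("Missing credential: " + key);
            }
        }
        return props;
    }
}
```

A file on disk could be passed in via `java.nio.file.Files.newBufferedReader(...)`.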

Configuration

To configure the tool, you must add application.properties. Here is an example of what it can look like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<incidentTypes>
	<incidentType>
	    <name>Site is Down</name>
	    <pagerduty_messageregexp>{regexp}</pagerduty_messageregexp>
	    <statuspageio_message>We are investigating this issue.</statuspageio_message>
	    <statuspageio_componentstatus>major_outage</statuspageio_componentstatus>
	</incidentType>
	<incidentType>
	    <name>Site error rate high</name>
	    <pagerduty_messageregexp>{regexp}</pagerduty_messageregexp>
	    <statuspageio_message>We are investigating this issue.</statuspageio_message>
	    <statuspageio_componentstatus>partial_outage</statuspageio_componentstatus>
	</incidentType>
</incidentTypes>

value | definition
:------------:| :-------------:
name | name of this incident type, just for internal use and better readability
pagerduty_messageregexp | regexp to parse the incident message
statuspageio_message | message for the statuspage.io incident
statuspageio_componentstatus | statuspage.io component status (one of operational, degraded_performance, partial_outage, major_outage)
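To illustrate how the four values of each incident type fit together, here is a small sketch that reads the configuration shown above with the JDK's built-in DOM parser. How the tool itself parses the file is not shown in this post, so the class `IncidentTypeConfig` and the choice of DOM are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical value object for one <incidentType> entry. */
final class IncidentTypeConfig {
    final String name;
    final String regexp;
    final String message;
    final String componentStatus;

    IncidentTypeConfig(String name, String regexp, String message, String componentStatus) {
        this.name = name;
        this.regexp = regexp;
        this.message = message;
        this.componentStatus = componentStatus;
    }

    /** Parses all incident types from the XML configuration text. */
    static List<IncidentTypeConfig> parse(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("incidentType");
            List<IncidentTypeConfig> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                result.add(new IncidentTypeConfig(
                        text(e, "name"),
                        text(e, "pagerduty_messageregexp"),
                        text(e, "statuspageio_message"),
                        text(e, "statuspageio_componentstatus")));
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Invalid incident type configuration", ex);
        }
    }

    // Returns the trimmed text of the first child element with the given tag name.
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        if (list.getLength() == 0) {
            throw new IllegalArgumentException("Missing <" + tag + ">");
        }
        return list.item(0).getTextContent().trim();
    }
}
```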

Build

We use Maven, so just build the project with Maven and everything will be created automatically.

mvn clean install

Execution

Unix OS

After a successful build, you will find a shell script /target/bin/standaloneApp in your project. Just make it executable and run it with your credentials file and your properties file as parameters:

$ chmod 754 target/bin/standaloneApp
$ ./target/bin/standaloneApp application.credentials application.properties

Now you should see something like this:

...
...Updated component status in 4.758s
...Component status is now operational
...Updated incident status in 0.687s
...

Windows

After a successful build, you will find standaloneApp.bat in /target/bin. Run it with your application.credentials as first and your application.properties as second parameter. The output should look like above.