An Instrumentation Story (How I Learned to Love the Elastic Stack)

Michael Brown November 24, 2017

in which we discover a better way to meet our customer’s needs while showing us where we can improve our code

When you’re operating a hosting service it’s essential to know what’s going on. Visibility into your environment is required, both on a macro scale (performance graphs) and on a micro scale (request logs). Analyzing those request logs can provide you with details hidden by the big picture.

We have a few customers that are outliers in terms of how they use Discourse; their data size might scale out in one particular direction (such as millions of users or a large number of categories) in a way that we don’t expect. We’ve worked with them to ensure they don’t accidentally DOS themselves by sending too many expensive requests via the API, which has caused us to temporarily revoke their API key on their site in response to overloading their resources in a particular way. It’s intrusive and we don’t like doing that – we consider that to be less responsive than we ought to be.

During the discussion with one particular customer about their API usage, we were asked:

Are you able to provide more details so we might be better equipped to understand what might have happened (e.g., time window, quantity of requests, how many requests per API key, requested endpoints, etc.)?

That’s an ENTIRELY SENSIBLE question to ask of us, and with Kibana we’re easily equipped to provide an answer! Let’s work through how to get there together.

On top of the usual information about the request (hostname, remote IP) the Discourse backend adds information about the Ember route used and the CPU processing time taken into response headers:

X-Discourse-Route: list/latest
X-Runtime: 0.370071

which are then captured by our front-end haproxy. The resulting data is fed to logstash along which massages it and injects a document into Elasticsearch. It’s fantastic having this information in Elastisearch, but when you have millions of log entries, what’s the next step? How do you analyze? What do you look for?

To start, I’ve created a new visualization and chosen to work with an entire day’s data (actually: a subset of data across one hosting centre).

My starting screen is not that interesting but it’s still summarizing 279 million documents, comprising syslog entries, nginx logs, application logs and haproxy logs. Let’s pare that down a bit with a filter:


so that we only look at the documents relating to requests that actually hit our application. At the same time let’s add some more metrics that do some statistical analysis on our X-Runtime header value (reminder: this is in milliseconds) so we can gain some insight:

Naturally, we don’t care about EVERYTHING blended together so at this point it makes sense to aggregate statistics along a dimension (for you SQL users out there, this is a GROUP BY) to tease out the variability across types of requests:

And here’s where we start to see something interesting about those requests:

As expected, the most common request to Discourse is to show a topic. Fantastic! But let’s focus and dig into our original mission to find out information about a particular customer. I’ve restricted the record search to only their site and added a new metric: the total runtime taken for that route.

Restricting the search to only those requests with an API key, we arrive at the answer to the question asked by our customer:

That outlier route users/index route is always expensive. Something is probably be improved there. How expensive is it for other customers?

Just processing the requests for this single API route for Customer G is consuming 6 hours of CPU time per day. What is going on, and why is this so expensive?

A quick dig reveals the users/index route is an admin-restricted route that handles a search for users, filtering on username and email. This requires two full table scans to complete. Our asking customer has a lot of users in the system which stresses this particular query.

The fact that this route was being called often (and was so expensive) led us to investigate why it was being used. Quckly we discovered a gap in our API as customers had a need that wasn’t being met by our design. Their workflow looks roughly like this:

  • sync a user details
    • look up a user by email address:
    • check returned users to identify one with an exact email match
    • perform an operation on that user (e.g. add/remove from groups, update user information)

We addressed this by modifying our API in two ways:

  1. Use the column index to shortcut the logic: if there is an exact match in the email column, return only that user (commit)
  2. Add a new API call that requests only an exact email match and isn’t interested in anything else (commit)

These changes (as it turned out, mostly the second one) significantly reduced the time spent on these queries.

That’s a MASSIVE savings of 92.3% for us when handling these API calls from Customer G since they switched to using email= as a filter – more than 5.5 cpu-hours per day of savings just for a single site.

Read more about the Elasticsearch stack at

Many thanks to Jamie Riedesel, reading her article on Getting started with Logstash helped me understand the injest pipeline.


Shorewall+Docker: Two Great Tastes That Taste Great Together

Matt Palmer November 23, 2015

As has been mentioned previously, we lurve us some Docker here at Discourse. We also lurve us some security, and I’ve recently been replacing our “artisinally handcrafted iptables firewall rules” with a Shorewall-managed configuration, which plays better with Puppet.  Unfortunately, as it stands, like my twin three year olds, they don’t always play together well.

Both Docker and Shorewall assume that nobody else is actively messing with the firewall configuration.  Shorewall assumes this because it likes to completely blow away the existing firewall configuration, and replace it with a set of rules crafted from your rules files.  Docker inserts NAT rules to implement its port forwarding system, amongst other things.  Both make sense in isolation, but when you combine the two behaviours… FWACKOOM.


Every time you reload your Shorewall ruleset, all your Docker containers stop receiving traffic.  Restarting Docker fixes it, but who wants to do that on a large-scale production infrastructure?  Not me.

Luckily, Shorewall, being the awesome system that it is, has plenty of hook points (or, as it calls them, extension scripts) you can use to do funky, custom things.  Such as, in this case, saving the existing Docker-related firewall rules before blowing away the firewall, and restoring them afterwards.  Thanks to Docker’s decision to confine most of its rules to a special chain, named DOCKER, this is quite straightforward.

There are three hooks you need to create, all in the same path.

/etc/shorewall/init and /etc/shorewall/stop have the same contents:

if iptables -t nat -L DOCKER >/dev/null 2>&1; then
    echo '*nat' >/etc/shorewall/docker_rules
    iptables -t nat -S DOCKER >>/etc/shorewall/docker_rules
    iptables -t nat -S POSTROUTING >>/etc/shorewall/docker_rules
    echo "COMMIT" >>/etc/shorewall/docker_rules

    echo '*filter' >>/etc/shorewall/docker_rules
    iptables -S DOCKER >> /etc/shorewall/docker_rules
    echo "COMMIT" >>/etc/shorewall/docker_rules

/etc/shorewall/start looks like this:

if [ -f /etc/shorewall/docker_rules ]; then
    iptables-restore -n </etc/shorewall/docker_rules
    run_iptables -t nat -I PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
    run_iptables -t nat -I OUTPUT ! -d -m addrtype --dst-type LOCAL -j DOCKER
    run_iptables -I FORWARD -o docker0 -j DOCKER
    run_iptables -I FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    run_iptables -I FORWARD -i docker0 ! -o docker0 -j ACCEPT
    run_iptables -I FORWARD -i docker0 -o docker0 -j ACCEPT

    rm -f /etc/shorewall/docker_rules

Once you’ve created those three files, with the above contents, when you run shorewall start or shorewall restart, your firewall will be restarted, with all your Shorewall-defined rules, and your Docker rules, all in place.