An Instrumentation Story (How I Learned to Love the Elastic Stack)

Michael Brown November 24, 2017

in which we discover a better way to meet our customers’ needs, and learn where we can improve our code along the way

When you’re operating a hosting service it’s essential to know what’s going on. Visibility into your environment is required, both on a macro scale (performance graphs) and on a micro scale (request logs). Analyzing those request logs can provide you with details hidden by the big picture.

We have a few customers that are outliers in terms of how they use Discourse; their data might scale out in one particular direction (such as millions of users or a large number of categories) in ways we don’t expect. We’ve worked with them to ensure they don’t accidentally DoS themselves by sending too many expensive requests via the API. In the past, that has meant temporarily revoking the API key for their site when it was overloading their resources in a particular way. That’s intrusive and we don’t like doing it – we consider it less responsive than we ought to be.

During the discussion with one particular customer about their API usage, we were asked:

Are you able to provide more details so we might be better equipped to understand what might have happened (e.g., time window, quantity of requests, how many requests per API key, requested endpoints, etc.)?

That’s an ENTIRELY SENSIBLE question to ask of us, and with Kibana we’re easily equipped to provide an answer! Let’s work through how to get there together.

On top of the usual information about the request (hostname, remote IP), the Discourse backend adds the Ember route used and the CPU processing time taken to the response headers:

X-Discourse-Route: list/latest
X-Runtime: 0.370071

which are then captured by our front-end haproxy. The resulting data is fed to logstash, which massages it and injects a document into Elasticsearch. It’s fantastic having this information in Elasticsearch, but when you have millions of log entries, what’s the next step? How do you analyze them? What do you look for?
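The pipeline described above can be sketched roughly like this: pull the two captured Discourse headers out of a haproxy log line and turn them into a document ready for Elasticsearch. This is an illustrative sketch, not our actual logstash configuration; the log format and field names are assumptions.

```ruby
# Hypothetical sketch of the logstash step: extract the captured
# Discourse headers from a log line and build an Elasticsearch document.
# The log format and field names here are illustrative assumptions.
def build_document(log_line)
  # e.g. 'example.com 203.0.113.7 {list/latest|0.370071}'
  host, ip, captured = log_line.split(' ', 3)
  route, runtime = captured.delete('{}').split('|')
  {
    'hostname' => host,
    'remote_ip' => ip,
    'response_header_discourse_route' => route,
    'response_header_runtime' => runtime.to_f
  }
end

doc = build_document('example.com 203.0.113.7 {list/latest|0.370071}')
```

In the real pipeline logstash's grok/kv filters do this parsing declaratively, but the shape of the resulting document is the same idea.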

To start, I’ve created a new visualization and chosen to work with an entire day’s data (actually: a subset of data across one hosting centre).

My starting screen is not that interesting but it’s still summarizing 279 million documents, comprising syslog entries, nginx logs, application logs and haproxy logs. Let’s pare that down a bit with a filter:

_exists_:response_header_discourse_route

so that we only look at the documents relating to requests that actually hit our application. At the same time let’s add some more metrics that do some statistical analysis on our X-Runtime header value (reminder: this is in milliseconds) so we can gain some insight:
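Under the hood, that Kibana filter plus the runtime metrics correspond to an Elasticsearch query along these lines (expressed here as a Ruby hash for readability; the exact body Kibana generates will differ, and the field names follow the assumed naming above):

```ruby
# Sketch of the Elasticsearch query body behind the Kibana view: keep
# only documents where the route field exists, and compute summary
# statistics over the runtime field. Structure is illustrative.
query = {
  'query' => {
    'bool' => {
      'filter' => [
        { 'exists' => { 'field' => 'response_header_discourse_route' } }
      ]
    }
  },
  'aggs' => {
    'runtime_stats' => {
      'stats' => { 'field' => 'response_header_runtime' }
    }
  },
  'size' => 0  # we only want the aggregation, not the individual hits
}
```

The `stats` aggregation returns count, min, max, average, and sum in one pass, which is exactly the statistical summary we want over millions of documents.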

Naturally, we don’t care about EVERYTHING blended together so at this point it makes sense to aggregate statistics along a dimension (for you SQL users out there, this is a GROUP BY) to tease out the variability across types of requests:
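What that GROUP BY-style aggregation does can be shown in plain Ruby: bucket the requests by route, then total the runtime within each bucket (the sample data here is made up):

```ruby
# A plain-Ruby picture of the terms aggregation: group requests by
# route, then sum the runtime per route -- the equivalent of
#   SELECT route, SUM(runtime) FROM requests GROUP BY route;
# Sample data is invented for illustration.
requests = [
  { route: 'topics/show', runtime: 0.120 },
  { route: 'users/index', runtime: 2.500 },
  { route: 'topics/show', runtime: 0.180 },
  { route: 'users/index', runtime: 3.100 }
]

totals = requests
  .group_by { |r| r[:route] }
  .transform_values { |rs| rs.sum { |r| r[:runtime] }.round(3) }
# => {"topics/show"=>0.3, "users/index"=>5.6}
```

In Elasticsearch this is a `terms` aggregation with a `stats` sub-aggregation, computed server-side across all matching documents.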

And here’s where we start to see something interesting about those requests:

As expected, the most common request to Discourse is to show a topic. Fantastic! But let’s focus and dig into our original mission to find out information about a particular customer. I’ve restricted the record search to only their site and added a new metric: the total runtime taken for that route.

Restricting the search to only those requests with an API key, we arrive at the answer to the question asked by our customer:

That outlier, the users/index route, is always expensive. Something there can probably be improved. How expensive is it for other customers?

Just processing the requests for this single API route for Customer G is consuming 6 hours of CPU time per day. What is going on, and why is this so expensive?

A quick dig reveals that users/index is an admin-restricted route that handles a search for users, filtering on username and email. That filtering requires two full table scans to complete. The customer who asked has a lot of users in the system, which stresses this particular query.
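In spirit, the query behaved like a substring match evaluated against every row, for both columns. Here is a plain-Ruby caricature of that cost (the data and search term are made up):

```ruby
# A caricature of the expensive search: a substring filter must
# examine every user, for both the username and email columns, so the
# work grows linearly with the size of the user table -- painful when
# a customer has millions of users. Sample data is invented.
users = [
  { username: 'alice', email: 'alice@example.com' },
  { username: 'bob',   email: 'bob@example.com' },
  { username: 'carol', email: 'carol@example.com' }
]

def search_users(users, term)
  users.select do |u|
    u[:username].include?(term) || u[:email].include?(term)
  end
end

matches = search_users(users, 'bob@example.com')
```

In SQL terms this is two `LIKE '%term%'` predicates, which no ordinary btree index can serve, hence the full table scans.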

The fact that this route was being called often (and was so expensive) led us to investigate why it was being used. We quickly discovered a gap in our API: customers had a need that wasn’t being met by our design. Their workflow looks roughly like this:

  • sync a user’s details
    • look up a user by email address:
      https://discourse.CUSTOMER.com/admin/users/list/active.json?api_key=KEY&api_username=sync&email=USER@aol.com
    • check returned users to identify one with an exact email match
    • perform an operation on that user (e.g. add/remove from groups, update user information)
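The “check returned users” step above boils down to filtering the returned list client-side for an exact email match, something like this (the response body is a simplified, invented example):

```ruby
require 'json'

# The client-side matching step: the API returned every fuzzy match,
# so callers had to pick out the exact email themselves. The response
# body below is a simplified, made-up example of the JSON shape.
body = '[{"username":"sync","email":"sync@example.com"},' \
       '{"username":"jane","email":"user@aol.com"}]'

users = JSON.parse(body)
exact = users.find { |u| u['email'].casecmp?('user@aol.com') }
```

Every sync operation paid the cost of the broad server-side search just to throw away all but one result, which is what made the route so hot.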

We addressed this by modifying our API in two ways:

  1. Use the column index to shortcut the logic: if there is an exact match in the email column, return only that user (commit)
  2. Add a new API call that requests only an exact email match and isn’t interested in anything else (commit)
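The first change can be sketched as: try an indexed exact lookup on email before anything else, and only fall back to the broad scan on a miss. This is illustrative Ruby, not the actual Discourse commit; a hash stands in for the database index.

```ruby
# Sketch of the shortcut: an exact-match lookup (O(1) on a hash here,
# an index seek in the database) runs before the expensive scan.
# Data and method names are illustrative, not Discourse's real code.
USERS = [
  { username: 'alice', email: 'alice@example.com' },
  { username: 'ali',   email: 'ali@example.net' }
]
EMAIL_INDEX = USERS.to_h { |u| [u[:email], u] }  # stands in for a DB index

def find_users(term)
  exact = EMAIL_INDEX[term]
  return [exact] if exact  # exact email hit: skip the table scans entirely
  USERS.select { |u| u[:username].include?(term) || u[:email].include?(term) }
end
```

When the caller passes a full email address, the function returns a single user without ever touching the slow path, which is why the savings were so large for this sync-style workload.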

These changes (as it turned out, mostly the second one) significantly reduced the time spent on these queries.

That’s a MASSIVE savings of 92.3% for us when handling these API calls from Customer G since they switched to using email= as a filter: more than 5.5 CPU-hours per day saved for a single site.


Read more about the Elasticsearch stack at https://www.elastic.co

Many thanks to Jamie Riedesel; reading her article on Getting started with Logstash helped me understand the ingest pipeline.


Discourse team grows to 20

Erlend Sogge Heggen November 13, 2017

Over the last few months we’ve added a whopping 9 members to the Discourse team! We held off on any big announcements until our new team page was ready for primetime.

Behold the first twenty!

Drawings courtesy of the excellent Nick Staab

The nine new arrivals are as follows:

  • Joffrey Jaffeux – Software Engineer
  • Sarah Hawk – Community Advocate
  • Michael Brown – Operations Engineer
  • Joshua Rosenfield – Technical Advocate
  • Gerhard Schlager – Software Engineer
  • Andrew Schleifer – Operations Engineer
  • Kris Aubuchon – Designer
  • Vinoth Kannan – Software Engineer
  • Simon Cossar – Technical Advocate

Those who frequent our Meta community will no doubt recognize most of these names, as many of them have been around for years already. Hiring from within our community remains our go-to strategy, but it is by no means a requirement.

Our 100% remote company now spans 10 different timezones (hover over an avatar to see specific city) and includes a fish, a cat, a hawk and a mysterious plushy creature.

2017 has been an incredible year for us and we expect even greater things in the year to come.

Think you can do great things with us? Send us a note and describe what you can do as part of the team.


It’s Time We Talked About Tags

Sarah Hawk October 20, 2017

Consider the typical sections of a daily newspaper: Arts, Sports, Business, Travel, Local, and World. Any given article belongs to just one of those sections, and the content in each section is quite different, such that some people, for example, may only ever read the Sports or Business sections. These are what we call categories.

Categories are established by staff for strong, distinct, and secure divisions between content. But when it comes to categories, more is not necessarily better. You can think of categories as walls. Four walls make a nice room, sixteen walls might even make a nice house. But a hundred walls would present an impenetrable maze.

Tags, unlike the heavy walls of categories, are nimble, flexible and lightweight:

  • You can have thousands of tags, even tens of thousands of them.
  • Users, if you allow them to, can create their own tags to help organize things.
  • Tags don’t imply any kind of security “wall” or permissions.
  • Multiple tags can apply to the same topic, even across categories. If you were running a music forum, you could tag a topic as both “Hip Hop” and “Electronica”. But if you were using categories you’d have to choose one or the other!

In the last year we’ve made considerable enhancements to tags and how they work:

  • Users can tag topics themselves, taking the onus off staff.
  • You can restrict who can tag by trust level.
  • You can restrict tags by category.
  • Tag groups allow you to limit which tags can be used together on a single topic.
  • Parent-child tag relationships allow you to define which tags can be used in conjunction with other tags.
  • Users can choose to auto watch tags to create a relevant activity feed.
  • Users can view a list of all tags, and filter topics by tag.
  • There are bulk tools to assign tags to many topics at once and to rename tags when necessary.

You can build tag systems with your community to organize content in a way that is intuitive for your users. Here is an example of a tag schema.

The future

It’s been a slow burn for tags at Discourse. While Discourse itself had been in public beta since February 2013 and reached version 1.0 in August 2014, the tagging plugin wasn’t released until January 2015. It wasn’t until much later that year that tags were made visible in topic listings. Finally, in June 2016, tagging became part of core.

We have two exciting tag enhancements planned for 2018. First up will be the ability to tag personal messages, making it possible to organize your personal discussions as you would label your emails. We also plan to add an optional setting that makes at least one tag mandatory on each new topic.

How can you use tags?

Self-organizing communities

It’s certainly a good idea to start with a few key categories when you launch, but don’t overdo it. Instead of adding yet another category, ask yourself whether a tag might be a better and more flexible starting point. If a tag gets extremely popular you can easily graduate it to a full-blown category later, and the natural popularity of user-entered tags might help you decide which categories you need and don’t need.

Furthermore, take a hard look at the categories you have today. If they don’t house specific security settings and have very few posts, then do you really need that category? Cull them and use a tag instead. Be harsh.

Hopscotch is an example of a Discourse site that embraced tags wholeheartedly, letting the community organize itself as it saw fit.

Create a customized homepage

FeverBee surfaces relevant content into specific areas of its custom homepage using tag queries. Tagging specific topics as challenges or reading pushes them into specialized feeds. This helps first-time visitors find the information that they’re looking for.

Manage product feedback and roadmaps

On Meta we use tags to group and classify product feedback and indicate where we are on the product roadmap. That gives the community visibility on planned work and a better understanding of what external work we can support.

As Discourse sites grow more numerous and larger, we’re increasingly encouraging communities to explore tagging as a lightweight complement to traditional heavyweight discussion categories. We’ve come a long way with tagging in 2 years, but we’re always open to community feedback on improving tags to make them even more useful. If you have an interesting use case or constructive feedback we’d love to hear it!
