HTTPS by Default

Matt Palmer April 18, 2018

Here at Discourse World HQ, we’re firm believers in the value of security. We fund a public bug bounty program, and we document our security policy and procedures right in the repo. Securing traffic between a forum’s server and its users is important, too, and we’ve had first-class support for integrating Let’s Encrypt into a self-hosted Discourse server since virtually day one of its general availability.

All of this is to explain why we’re so very proud to announce that every new hosted Discourse instance now comes with HTTPS configured (and enforced via HSTS policy) by default, and that the majority of our existing customers have been transparently migrated to enforced HTTPS. Some of our customers’ sites can’t have HTTPS enforced automatically; if your site is still loading over HTTP, please contact our support team so we can walk you through the changes that need to be made.

The Story So Far

HTTPS has always been supported by our hosting, but for a long time it wasn’t feasible to automatically roll it out to everyone. Back in the Dark Times (up to 2015 or so), SSL was a headache to do at scale, unless you were big enough to be able to wangle a special deal with a CA. Sure, certificates cost money, but that wasn’t the biggest hassle. The thing that stopped us, honestly, was the painful story around automation. Basically, there wasn’t any – not end-to-end, anyway.

It’s fun to track the evolution of our SSL procedures by looking at the titles of our runbooks over time. In the Dark Times, we went through several iterations – there was “How to install a customer’s SSL certificate”, then “Customer SSL workflow (Manual Edition)”, “Semi-Automated Customer SSL workflow”, and finally, “Customer SSL workflow (99.5% Automated Edition)”. Whilst we tried very hard to automate as much of the process as we could, getting an SSL certificate always seemed to involve a human in the loop somewhere, and when you provide automated trial signups, you can’t put a human in the middle of that process without making everything terrible.

Thankfully, Let’s Encrypt came along to bring us into the age of certificate Enlightenment. We’ve been big supporters of Let’s Encrypt for a long time; we’ve been donating the hosting for the Let’s Encrypt community forums practically since the initial public announcement, and we donate what we would have otherwise spent on certificates in cash, as well. When Let’s Encrypt became generally available, we re-implemented our SSL certificate issuance pipeline on top of that, and everything was reasonably peachy.

Reality Ensues

So, Let’s Encrypt made it easy for us to fully automate getting SSL certificates for our customers as they asked for them. Except there’s a big difference between requesting a certificate for a running site when a customer asks for it, and making sure a certificate gets issued as soon as the site’s DNS is configured. Even detecting that DNS is correctly configured, given the wide variety of, shall we say, “interesting” configuration choices some people make, can be a challenge. This took a lot more time and ingenuity than getting issuance working. Requesting a certificate from Let’s Encrypt and dealing with validation is about 10 lines of Ruby (with the help of the acme-client gem) and a bit of HAProxy magic. Dealing with all the special cases and exceptions in people’s DNS and proxy setups is quite a bit more code.
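
Just to make the shape of the problem concrete, here’s a minimal sketch of the kind of DNS check involved before issuance is even attempted. The hostnames are hypothetical and our real tooling does rather more than this:

# Check that the customer's hostname points at the hosting target before
# asking Let's Encrypt for a certificate.
expected="customer.hosting-target.example."
actual="$(dig +short CNAME discourse.customer-example.com.)"
if [ "$actual" = "$expected" ]; then
    echo "DNS looks right; safe to request a certificate"
else
    echo "DNS not ready yet (got: ${actual:-nothing})"
fi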

As an added challenge, a large part of our team has been hard at work since mid last year building out a completely custom, hyper-scale AWS-based hosting environment for a large game company. That work, which has now borne fruit, diverted time and attention away from building out universal SSL as quickly as we would have liked.

The Final Hurdle

The final hurdle is the problem of migration. We want all our existing customers to automatically have the benefits of our SSL labours. There are a few things that get in the way of just flicking the switch for customers who have been with us for some time.

First off, forcing everyone to use HTTPS, via redirects and HSTS config, breaks some things. The biggest issue is the lack of a universally-supported way to say to an HTTP client, “transparently make this exact request again, but to this other URL”. There are newer, method-preserving HTTP response codes for “redirect with the same HTTP verb” (307 and 308), but they’re not supported by all browsers and HTTP libraries, and they sometimes prompt “do you want to do this again?”, which isn’t what you want for a smooth transition.
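
To make that concrete, here’s roughly what enforced HTTPS looks like from the outside (the hostname is hypothetical and the exact responses will vary): the plain-HTTP request is answered with a redirect to HTTPS, and the HTTPS response carries the HSTS policy header. A plain 301 may be replayed as a GET by some clients, which is exactly why the method-preserving codes matter.

# Show the redirect from HTTP, then the HSTS header on the HTTPS response.
curl -sI http://discourse.customer-example.com/latest | grep -i -e '^HTTP/' -e '^location:'
curl -sI https://discourse.customer-example.com/latest | grep -i '^strict-transport-security:'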

External authentication providers get in the way, too. They often whitelist the return URLs they’ll accept. When you call out to an authentication service, you typically send along a link which the browser should be redirected back to after authentication is complete. Sending an HTTPS link, when the authentication provider expects an HTTP one, results in authentication errors. Very not cool. We can’t fix this ourselves, either – we don’t have access to our customers’ Google, Facebook, and Twitter applications to update the whitelisted URL.

Finally, there’s the age-old problem of mixed-content warnings, which pretty much everyone who has stood up an HTTPS site has dealt with at one time or another. Thankfully, the vast majority of those we can fix on our customers’ behalf, by making sure the site being linked to supports HTTPS, and modifying the site config to add a strategically-placed “s”.

Security is no longer optional

Our deployment of HTTPS by default is coming at just the right time. Popular browsers such as Chrome and Firefox are in the process of rolling out changes to mark HTTP-only sites as “Not Secure”, like so:

[Screenshot: Google Chrome marking an HTTP-only page as “Not secure” in the address bar]

It’s reasonable to assume that this trend will continue, with other browsers following suit, and the warnings getting more and more scary over time. So, regardless of where you’re hosted, now would be a good time to start making sure all your web properties are being served over HTTPS.

Full of win

It has been a bit of a journey, but in the end, we’ve gotten to where we want to be: new customers get HTTPS enforced by default, most existing customers have HTTPS enforced by default without their even noticing, and for the remainder, we know exactly what they need to do to complete the upgrade.


An Instrumentation Story (How I Learned to Love the Elastic Stack)

Michael Brown November 24, 2017

in which we discover a better way to meet our customer’s needs and learn where we can improve our code

When you’re operating a hosting service it’s essential to know what’s going on. Visibility into your environment is required, both on a macro scale (performance graphs) and on a micro scale (request logs). Analyzing those request logs can provide you with details hidden by the big picture.

We have a few customers that are outliers in terms of how they use Discourse; their data size might scale out in one particular direction (such as millions of users or a large number of categories) in a way that we don’t expect. We’ve worked with them to ensure they don’t accidentally DOS themselves by sending too many expensive requests via the API, which has sometimes meant temporarily revoking the API key on their site when a particular kind of overload occurred. That’s intrusive and we don’t like doing it – we consider it less responsive than we ought to be.

During the discussion with one particular customer about their API usage, we were asked:

Are you able to provide more details so we might be better equipped to understand what might have happened (e.g., time window, quantity of requests, how many requests per API key, requested endpoints, etc.)?

That’s an ENTIRELY SENSIBLE question to ask of us, and with Kibana we’re easily equipped to provide an answer! Let’s work through how to get there together.

On top of the usual information about the request (hostname, remote IP), the Discourse backend adds the Ember route used and the CPU processing time taken to the response headers:

X-Discourse-Route: list/latest
X-Runtime: 0.370071

which are then captured by our front-end haproxy. The resulting data is fed along to logstash, which massages it and injects a document into Elasticsearch. It’s fantastic having this information in Elasticsearch, but when you have millions of log entries, what’s the next step? How do you analyze it? What do you look for?
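
If you want to see those headers for yourself, something like the following will do it (the hostname is hypothetical, and whether the headers are passed all the way through to your client depends on the proxy in front):

# Dump the response headers and keep only the Discourse instrumentation ones.
curl -s -o /dev/null -D - https://discourse.example.com/latest \
    | grep -i -e '^x-discourse-route' -e '^x-runtime'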

To start, I’ve created a new visualization and chosen to work with an entire day’s data (actually: a subset of data across one hosting centre).

My starting screen is not that interesting but it’s still summarizing 279 million documents, comprising syslog entries, nginx logs, application logs and haproxy logs. Let’s pare that down a bit with a filter:

_exists_:response_header_discourse_route

so that we only look at the documents relating to requests that actually hit our application. At the same time, let’s add some more metrics that do some statistical analysis on our X-Runtime header value (reminder: this value is in seconds) so we can gain some insight:
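
For the curious, the same “only application requests” filter can be expressed against Elasticsearch directly. This is a sketch rather than our exact setup, and the logstash-* index pattern is an assumption:

# Count only the documents that carry the Discourse route header.
curl -s -H 'Content-Type: application/json' \
    'http://localhost:9200/logstash-*/_count' -d '
{
  "query": {
    "exists": { "field": "response_header_discourse_route" }
  }
}'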

Naturally, we don’t care about EVERYTHING blended together, so at this point it makes sense to aggregate statistics along a dimension (for you SQL users out there, this is a GROUP BY) to tease out the variability across types of requests:
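
In query form, that GROUP BY is a terms aggregation with a stats sub-aggregation. The sketch below assumes a field named response_header_runtime for the X-Runtime value (only the route field name appears above), and you may need the .keyword sub-field if the route is mapped as analysed text:

# Bucket application requests by Discourse route, with runtime stats per bucket.
curl -s -H 'Content-Type: application/json' \
    'http://localhost:9200/logstash-*/_search?size=0' -d '
{
  "query": { "exists": { "field": "response_header_discourse_route" } },
  "aggs": {
    "by_route": {
      "terms": { "field": "response_header_discourse_route", "size": 20 },
      "aggs": { "runtime": { "stats": { "field": "response_header_runtime" } } }
    }
  }
}'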

And here’s where we start to see something interesting about those requests:

As expected, the most common request to Discourse is to show a topic. Fantastic! But let’s focus and dig into our original mission to find out information about a particular customer. I’ve restricted the record search to only their site and added a new metric: the total runtime taken for that route.

Restricting the search to only those requests with an API key, we arrive at the answer to the question asked by our customer:

That outlier, the users/index route, is always expensive. Something there can probably be improved. How expensive is it for other customers?

Just processing the requests for this single API route for Customer G is consuming 6 hours of CPU time per day. What is going on, and why is this so expensive?

A quick dig reveals that the users/index route is an admin-restricted route that handles a search for users, filtering on username and email. This requires two full table scans to complete. Our asking customer has a lot of users in the system, which stresses this particular query.
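
Purely as an illustration of why that hurts (the table and column names here are assumptions, not Discourse’s actual schema): a search that pattern-matches both usernames and email addresses can’t be satisfied from a simple index on either column, so the planner ends up scanning both tables in full.

# Ask PostgreSQL how it would run such a query; expect sequential scans.
psql discourse <<'SQL'
EXPLAIN
SELECT u.id
FROM users u
LEFT JOIN user_emails ue ON ue.user_id = u.id
WHERE u.username_lower LIKE '%smith%'
   OR ue.email LIKE '%smith%';
SQL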

The fact that this route was being called often (and was so expensive) led us to investigate why it was being used. We quickly discovered a gap in our API: customers had a need that wasn’t being met by our design. Their workflow looks roughly like this (a shell sketch follows the list):

  • sync a user’s details
    • look up a user by email address:
      https://discourse.CUSTOMER.com/admin/users/list/active.json?api_key=KEY&api_username=sync&email=USER@aol.com
    • check returned users to identify one with an exact email match
    • perform an operation on that user (e.g. add/remove from groups, update user information)
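
Here’s that workflow as a rough shell sketch. The jq filtering is illustrative and assumes the response includes each user’s email; the real integration is, of course, the customer’s own code:

# Fetch the candidate users, then keep only the exact email match before acting on it.
url="https://discourse.CUSTOMER.com/admin/users/list/active.json"
curl -s "${url}?api_key=KEY&api_username=sync&email=USER@aol.com" \
    | jq '[.[] | select(.email == "USER@aol.com")] | first'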

We addressed this by modifying our API in two ways:

  1. Use the column index to shortcut the logic: if there is an exact match in the email column, return only that user (commit)
  2. Add a new API call that requests only an exact email match and isn’t interested in anything else (commit)

These changes (as it turned out, mostly the second one) significantly reduced the time spent on these queries.

That’s a MASSIVE savings of 92.3% for us when handling these API calls from Customer G since they switched to using email= as a filter – more than 5.5 CPU-hours per day of savings just for a single site.


Read more about the Elastic Stack at https://www.elastic.co

Many thanks to Jamie Riedesel; reading her article on Getting started with Logstash helped me understand the ingest pipeline.


Shorewall+Docker: Two Great Tastes That Taste Great Together

Matt Palmer November 23, 2015

As has been mentioned previously, we lurve us some Docker here at Discourse. We also lurve us some security, and I’ve recently been replacing our “artisanally handcrafted iptables firewall rules” with a Shorewall-managed configuration, which plays better with Puppet. Unfortunately, as it stands, like my twin three-year-olds, they don’t always play together well.

Both Docker and Shorewall assume that nobody else is actively messing with the firewall configuration.  Shorewall assumes this because it likes to completely blow away the existing firewall configuration, and replace it with a set of rules crafted from your rules files.  Docker inserts NAT rules to implement its port forwarding system, amongst other things.  Both make sense in isolation, but when you combine the two behaviours… FWACKOOM.

Every time you reload your Shorewall ruleset, all your Docker containers stop receiving traffic.  Restarting Docker fixes it, but who wants to do that on a large-scale production infrastructure?  Not me.

Luckily, Shorewall, being the awesome system that it is, has plenty of hook points (or, as it calls them, extension scripts) you can use to do funky, custom things.  Such as, in this case, saving the existing Docker-related firewall rules before blowing away the firewall, and restoring them afterwards.  Thanks to Docker’s decision to confine most of its rules to a special chain, named DOCKER, this is quite straightforward.

There are three hooks you need to create, all in the same directory.

/etc/shorewall/init and /etc/shorewall/stop have the same contents:

# If Docker has already created its NAT chain, snapshot its rules (plus the
# NAT POSTROUTING entries) in iptables-restore format before Shorewall blows
# the tables away, so they can be replayed once the new ruleset is in place.
if iptables -t nat -L DOCKER >/dev/null 2>&1; then
    echo '*nat' >/etc/shorewall/docker_rules
    iptables -t nat -S DOCKER >>/etc/shorewall/docker_rules
    iptables -t nat -S POSTROUTING >>/etc/shorewall/docker_rules
    echo "COMMIT" >>/etc/shorewall/docker_rules

    # Do the same for Docker's chain in the filter table.
    echo '*filter' >>/etc/shorewall/docker_rules
    iptables -S DOCKER >> /etc/shorewall/docker_rules
    echo "COMMIT" >>/etc/shorewall/docker_rules
fi

/etc/shorewall/start looks like this:

# Once Shorewall has built its own ruleset, replay the saved Docker rules
# without flushing anything (-n), then re-create the jump rules into the
# DOCKER chains that Docker itself would normally have installed.
if [ -f /etc/shorewall/docker_rules ]; then
    iptables-restore -n </etc/shorewall/docker_rules
    run_iptables -t nat -I PREROUTING -m addrtype --dst-type LOCAL -j DOCKER
    run_iptables -t nat -I OUTPUT ! -d 127.0.0.0/8 -m addrtype --dst-type LOCAL -j DOCKER
    run_iptables -I FORWARD -o docker0 -j DOCKER
    run_iptables -I FORWARD -o docker0 -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
    run_iptables -I FORWARD -i docker0 ! -o docker0 -j ACCEPT
    run_iptables -I FORWARD -i docker0 -o docker0 -j ACCEPT

    # The snapshot is only needed for this one restart.
    rm -f /etc/shorewall/docker_rules
fi

Once you’ve created those three files with the above contents, running shorewall start or shorewall restart will rebuild your firewall with all your Shorewall-defined rules and your Docker rules in place.
