Amazon’s ELB implementation is really a Proxy – lessons learned from 1.2MM spam emails in 24 hours

We’ve had a challenging last 24 hours at AffinityLive. After a fairly successful cut-over to Amazon over the weekend, we learned the hard way that Amazon’s implementation of load balancing leaves a lot to be desired.

Background

AffinityLive does a lot of email handling. Not only do our users use AffinityLive to send emails, we also process a lot of incoming and outgoing email for our users so they can use our cool automatic email capture feature. Capturing inbound emails is pretty easy – people set up a forward to a special dropbox account, and since this is on our host and in our domain, it is all easy and part of the normal way email is managed.

Unfortunately, the outgoing email channel is much more challenging. In all but a few isolated situations, users don’t have the ability to set up an outgoing BCC rule to send all their emails to the outgoing log address for capture, so we need to actually act as an outgoing relay for them. For users who use POP3 accounts on generic ISP provided mail servers, we provide a different outgoing server for them to put into Outlook and all their connections are authenticated. However, for users of Gmail, there isn’t a “client” per-se to reconfigure, so we allow our Google Apps users to use us as a relay server without authentication.

This is a plan we’ve had running flawlessly for a couple of years now, and involved us using a script to periodically interrogate Google’s SPF records to ensure we had a list of all their sender IP addresses so we could trust them. The script we use is included below in case you want to do something similar some time.

use strict;
use Error qw( :try );
use Mail::SPF;
use NetAddr::IP;
use Scalar::Util qw( blessed );
sub process;
my @domains = @ARGV or die "Usage: $0 DOMAIN...\n";
my $server = new Mail::SPF::Server();

my @results = map {
 try {
  process $server, new Mail::SPF::Request(
   identity => $_,
   ip_address => "0.0.0.0",
  );
 } catch Mail::SPF::Exception with {
  $@ =~ s/ at \S+ line \S+$//;
  warn "$_: $@";
 }
} @domains;
@results or die "No IPs expanded\n";
print "$_\n" foreach @results;
exit;

############################################################################
use constant DEBUG => $ENV{DEBUG};
sub debug { DEBUG and warn @_ }

sub dns_lookup {
 my ($server, $request, $type, $domain, $max) = @_;
 debug "DNS: $type => $domain\n";
 my $packet = $server->dns_lookup($domain, $type);
 my @rrs = $packet->answer or $server->count_void_dns_lookup($request);
 debug "... ", (map { $_->string } @rrs), "\n";
 @rrs = splice @rrs, 0, $max if defined $max;
 grep { $_->type eq $type } @rrs;
}
sub process {
 my ($server, $request) = @_;
 debug "SPF: ", $request->identity, " from ", $request->authority_domain, "\n";
 my $record = $server->select_record($request);
 my @terms = $record->terms;
 my @results;
 while (my $term = shift @terms) {
  debug "Term: $term\n";
  my $domain = $term->domain($server, $request);
  for (blessed $term) {
   /^Mail::SPF::Mech::IP4$/ and do {
    push @results, $term->ip_network;
    last;
   };
   /^Mail::SPF::Mech::Include$/ and do {
    $server->count_dns_interactive_term($request);
    push @results, process $server, $request->new_sub_request(authority_domain => $domain);;
    last;
   };
   /^Mail::SPF::Mech::MX$/ and do {
    $server->count_dns_interactive_term($request);
    push @results,
     map { $_->address }
     map { dns_lookup $server, $request, A => $_->exchange }
     dns_lookup $server, $request, MX => $domain, $server->max_name_lookups_per_mx_mech;
    last;
   };
   /^Mail::SPF::Mech::A$/ and do {
    $server->count_dns_interactive_term($request);
    push @results, map { $_->address }
    dns_lookup $server, $request, A => $domain;
    last;
   };
  }
 }
 @results;
}

By asking Google regularly about the servers they send from and assert are their own (via their SPF records) we could confidently trust that email from them could be relayed (on behalf of our users).

The other type of email that we’d relay and trust without needing a specific user/pass for AUTH is internal email – the emails that come from the servers in our cluster. Whether they’re reports from cron or puppet or nagios, emails that originate locally need to be trusted, and to keep a handle on things (and keep firewall rules tight) we made sure that all of the servers in our cluster would forward email to our mail servers before being delivered to the outside world, ie, relayed.

When we moved to Amazon, though, this plan fell apart, and in the last 24 hours we had a spammer take advantage of the trusted relay rules to send out 1.2MM spam emails via our infrastructure – all because Amazon’s Load Balancers aren’t actually load balancers, but instead operate like proxies. Taking email out of a load balancer arrangement solves part of the problem, but the other challenge/problem is that our users set their outgoing relay server to be their own AffinityLive domain (like clientdomain.affinitylive.com) which resolves to the load balancers… and when they get forwarded through to the mail servers in the cluster, we couldn’t use the Google IP address trusting since the Amazon Load Balancers were stripping the details of the original sending IP address.

AWSs Enterprise Load Balancers – they’re actually proxies

When a request come into an Amazon AWS Load Balancer (called an ELB or Enterprise Load Balancer), they are passed onto the internal hosts (we have a bunch of them for horizontal scalability) and these hosts then process them.

With a normal load balancer (like the load-director package we were using), the traffic is passed across to the back-end machine with the originating packet/sender in-tact. This meant that tests like checking the host is one of the trusted Google hosts or one of the trusted internal hosts before relaying worked perfectly.

Unfortunately, though, Amazon doesn’t work this way – it actually proxies the request and on the inside network everything looks like it has come from a trusted, internal server – the AWS load balancer.

It was this mistake in understanding how Amazon doesn’t actually act as a load balancer but rather a proxy that lead to us being used by spammers as an open relay – with the load balancer on our trusted internal network, spammers were able to send mail to our load balancer and then we’d send the emails out on their behalf.

sendgrid-stats

We’ve since patched this hole, and we’re working on a solution to allow Gmail to be a trusted host that we’ll relay for using their TLS certificates (which don’t care about origin IP address); I’ll update this post with this work around once we’ve finalized it.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s