By Jeff Dwyer • July 4, 2017

Log aggregation at scale for cheap with AWS Athena and Kinesis Firehose

RateLim.it is a service that gives you robust, scalable, durable ratelimits so you can save money. If you have a bunch of somethings and you want fewer somethings, we're your people.

Now it turns out that when you find yourself wanting a rate limit, you often have something that is pretty high throughput, and since we're a SaaS and we charge by the request, we obviously can't charge all that much. How little? Well, we charge $0.000005 per request, which is... a very, very small part of a penny.

That's all well and good for you, the customer, but when it comes to actually figuring out how much to bill each account, it takes a little thinking to figure out how to do that for substantially less than 5 millionths of a dollar per request. Essentially we want to:

  1. Track every time our users hit an endpoint
  2. Hourly, aggregate how many times a user hit each endpoint
  3. If reasonable it would be nice to give users access to a tail of usage logs for debugging

The cheapest, easiest way to store a bunch of data like this in AWS is Kinesis Firehose. Kinesis Firehose is very confusingly different from AWS Kinesis, but it is designed for just this sort of massive data ingestion. For these small log lines it ends up costing $0.000000138 per request. That's about 2.8% of what I charge per request, which seems fair for now.
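To make step 1 concrete, here is a minimal sketch of shipping one usage record to Firehose. The field names (`ts`, `account_id`, `endpoint`), the JSON format and the stream name are all assumptions made up for illustration, not what RateLim.it actually sends. The client is passed in so the sketch runs without AWS credentials; in real use it would be something like `boto3.client("firehose")`, whose `put_record` call takes these arguments.

```python
import json
import time


def build_usage_record(account_id, endpoint, now=None):
    """Serialize one request as a newline-terminated JSON log line."""
    event = {
        "ts": int(now if now is not None else time.time()),  # epoch seconds
        "account_id": account_id,   # hypothetical field name
        "endpoint": endpoint,       # hypothetical field name
    }
    # Firehose concatenates records as-is, so we add the newline ourselves.
    return (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")


def track_request(firehose_client, stream_name, account_id, endpoint, now=None):
    """Send one usage record to a Firehose delivery stream."""
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": build_usage_record(account_id, endpoint, now)},
    )
```

At real volume you would probably want to batch records rather than send one call per request, but the shape of the data is the same.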


Firehose has 3 choices of destination into which it will deposit our data: Redshift, S3 and Elasticsearch. Since we're being cheap, the cheapest place to store it is of course S3. But that still leaves actually aggregating the data.

There are a kajillion ways to query and aggregate S3 once you have a Hadoop cluster. Hive, Pig, Presto, Spark, EMR. The choices are endless!

But what if you are a mere mortal and don't have a Hadoop cluster (and really don't feel like having one since, say, you've already done your time in the Hadoop salt mines)?

What if you just wanted to run one Presto job a night? What if you didn't want to write a mapper and a reducer? What if you just wanted to write some basic SQL and get charged by the GB of data that was queried?

How in the heck can I actually read all of the data I put in S3?


Well, AWS Athena is here for you! Athena is pretty rad: simply point her at an S3 bucket of CSVs or other data, tell her how the CSVs should be translated to tables, and start writing SQL. You only pay by the GB of data read. I don't have much else to add on this point besides: Athena is awesome! Seriously, if you have some data in S3 you really should hook up Athena to check it out. It'll take you 10 minutes and it is very cool.
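For a flavor of what "start writing SQL" looks like for the hourly aggregation, here is a rough sketch. It assumes a table called `usage_logs` has already been defined in Athena with `account_id`, `endpoint` and an epoch-seconds `ts` column; those names, the database and the results location are all hypothetical. The client is passed in and would be something like `boto3.client("athena")` in practice.

```python
def hourly_usage_sql(table, start_ts, end_ts):
    """Build a query counting requests per account, endpoint and hour."""
    # int() the bounds so only numbers are ever interpolated into the SQL.
    return (
        "SELECT account_id, endpoint, "
        "date_trunc('hour', from_unixtime(ts)) AS hour, "
        "count(*) AS requests "
        f"FROM {table} "
        f"WHERE ts >= {int(start_ts)} AND ts < {int(end_ts)} "
        "GROUP BY 1, 2, 3"
    )


def run_hourly_usage(athena_client, database, output_location,
                     start_ts, end_ts, table="usage_logs"):
    """Kick off the aggregation and return Athena's query execution id."""
    response = athena_client.start_query_execution(
        QueryString=hourly_usage_sql(table, start_ts, end_ts),
        QueryExecutionContext={"Database": database},
        # Athena writes the result set as a file under this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

The query runs asynchronously, so a real job would poll for completion and then read the results from the output location.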

So what does it cost overall? Let's take a service that's logging 200 r/s 24x7. As we mentioned above, the cost of Firehose will be about $0.000000138 per request, for a total of $71 / month.

We paid for roughly 2.5TB of Kinesis Firehose in that $71, because Firehose rounds every record up to 5KB for billing, but realistically our payloads won't be 5KB. Beyond that, Athena lets us "partition" our data, meaning it can skip reading S3 data that isn't applicable if you give it well-named files. For Athena itself, say you query 20GB a day: that is .02 of a TB a day or .6 of a TB per month, which at $5 per TB scanned comes to a grand total of $3 a month for your personal Presto cluster. 20GB a day is enough to scan each full day of logs at 1KB per log line and 200 r/s, all month long.
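As a sketch of how the partitioning might be wired up: Firehose writes objects under a UTC `YYYY/MM/DD/HH/` prefix by default, so one option is to register each day's prefix as an Athena partition. The table name, bucket name and the `dt` partition column below are assumptions for illustration, and the table would need to have been created with a matching `PARTITIONED BY` clause.

```python
from datetime import date


def add_partition_sql(table, bucket, day):
    """Build the DDL that registers one day's S3 prefix as a partition."""
    # Matches the default Firehose layout: s3://bucket/YYYY/MM/DD/HH/...
    location = f"s3://{bucket}/{day:%Y/%m/%d}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt = '{day.isoformat()}') "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    print(add_partition_sql("usage_logs", "example-usage-logs", date(2017, 7, 4)))
```

Queries that filter on `dt` then only read (and only get billed for) the matching day's objects.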

The actual S3 costs for 1KB payloads at 200 r/s end up being the cost for 494GB/month, which is $12 / month.

So all told that's $86 a month to store and query half a terabyte of data. That's been good enough to get ratelim.it off the ground and seems like a good deal to me.
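If you want to plug in your own traffic, the arithmetic above can be sketched in a few lines. The unit prices here are my assumption of the 2017 list prices ($0.029 per GB for Firehose with records rounded up to 5KB, $5 per TB scanned for Athena, $0.023 per GB-month for S3), so check them against current pricing before trusting the output.

```python
import math

SECONDS_PER_MONTH = 86400 * 30
KB_PER_GB = 1024 * 1024

# Assumed list prices, not taken from the post itself.
FIREHOSE_USD_PER_GB = 0.029
FIREHOSE_BILLING_UNIT_KB = 5      # each record is rounded up to 5KB
ATHENA_USD_PER_TB = 5.0
S3_USD_PER_GB_MONTH = 0.023


def monthly_costs(requests_per_sec=200, payload_kb=1, athena_gb_per_day=20):
    """Estimate monthly Firehose, Athena and S3 costs in USD."""
    requests = requests_per_sec * SECONDS_PER_MONTH
    # Firehose bills the rounded-up record size, not the actual payload.
    billed_kb = math.ceil(payload_kb / FIREHOSE_BILLING_UNIT_KB) * FIREHOSE_BILLING_UNIT_KB
    firehose_gb = requests * billed_kb / KB_PER_GB
    s3_gb = requests * payload_kb / KB_PER_GB
    costs = {
        "firehose": firehose_gb * FIREHOSE_USD_PER_GB,
        "athena": athena_gb_per_day * 30 / 1000 * ATHENA_USD_PER_TB,
        "s3": s3_gb * S3_USD_PER_GB_MONTH,
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    for name, usd in monthly_costs().items():
        print(f"{name:>8}: ${usd:,.2f}")
```

With the defaults this lands within a dollar or so of the figures above; the S3 line comes out a little under $12 with the storage price I assumed.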

The last thing I'd note is that there's a TON of inefficiency in what I have outlined so far. We're not using compression. We're not using Parquet. tl;dr: this is a solution that scales, is cheap and easy to get started with, and has good options for when you get even bigger.