NewRelic or NewRe-Leak?

in #code · 9 years ago (edited)

This post is written for code geeks and developers, and tells the story of a memory leak we finally caught last night after a month-long chase.

Spot.IM is the biggest commenting system on the web, managing the conversation on many well-known domains and handling billions of page views daily (aol.com, engadget.com, thedrive.com, rt.com, and many more). As part of our solution we run a pre-render (a.k.a. server-side rendering) server written in Node.js with React. Last month we started to see a memory leak in the pre-render container. It began as a minor leak that popped up once in a while and consumed the container's entire memory until it crashed.

We use AWS ECS (EC2 Container Service) with Docker, and CloudWatch for monitoring (all leak images are taken from it). The story is quite simple. Our service was stable for a few hours, maybe even a day, and then memory started leaking. Recently it became much worse, and memory was leaking every minute. We looked at the problem from every angle, going back through the history and cherry-picking until we located the commit that introduced the leak. In the meantime we purchased a NewRelic Pro account for thousands of dollars to help us locate the leak.

When we started hunting the leak it looked like this:

Pre-Render AVG is the average memory. You can see it is stable for a while, with wild jumps once every few hours. Each drop you see is a deployment of a new version, which resets the memory.

So we used the good old, well-known method for catching leaks: binary search over our commit history. Since we could not locate the leak with any other tool (heap dumps and others), we added more and more NewRelic logic and logging to catch it. Last week was the worst. The memory graph looked like this:

Every drop in the memory graph is the container running out of memory: the health check stops responding, the AWS ELB kills the container and replaces it with a new one, which starts leaking again the minute it launches.
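The binary search over commits mentioned above can be sketched in a few lines. This is only an illustration of the idea, not our actual tooling: `commitLeaks` is a hypothetical predicate standing in for "deploy this commit and watch its memory graph".

```javascript
// Binary search for the first leaking commit in an ordered commit list.
// `commitLeaks(commit)` is a hypothetical predicate: true if a build of
// that commit leaks memory (in practice: deploy it and watch the graph).
function findFirstBadCommit(commits, commitLeaks) {
  let lo = 0;                   // oldest commit, assumed good
  let hi = commits.length - 1;  // newest commit, known bad
  while (lo < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (commitLeaks(commits[mid])) {
      hi = mid;      // leak already present: culprit is mid or earlier
    } else {
      lo = mid + 1;  // still clean: culprit is later
    }
  }
  return commits[lo];
}

// Illustrative run: pretend the leak appeared in 'c4'.
const commits = ['c1', 'c2', 'c3', 'c4', 'c5'];
const firstBad = findFirstBadCommit(commits, (c) => c >= 'c4');
```

Each probe halves the remaining range, so even a long history needs only a handful of deployments (this is also exactly what `git bisect` automates).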

Last night, at around 2:00, we got down to two versions at feature parity: one that leaks and one that does not. Since we had walked the source tree from both directions, in one direction (old to new) we did not have NewRelic configured. The only difference was the NewRelic agent. We deployed the latest version without NewRelic to production, and this is how the graph looks now:

Today we cut the number of ECS containers from 32 down to 8:

And reduced each container's memory from 3 GB to 1 GB!
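For reference, turning the Node.js NewRelic agent off is usually a configuration change rather than a code change. A hedged sketch of a `newrelic.js` config file, using key names from the agent's documented configuration (the app name here is illustrative; verify the keys against your agent version):

```javascript
// newrelic.js -- loaded by the agent when the app does require('newrelic').
// Setting agent_enabled to false disables the agent without removing the
// dependency; the NEW_RELIC_ENABLED environment variable can do the same.
'use strict';

exports.config = {
  app_name: ['pre-render'],                        // illustrative name
  license_key: process.env.NEW_RELIC_LICENSE_KEY,  // keep secrets in env
  agent_enabled: false                             // turn the agent off
};
```

Disabling via config (or an environment variable) makes it easy to A/B the agent itself between otherwise identical deployments, which is effectively what settled the question for us.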

It is important to state that all tests and major iterations were done in a pre-master environment to avoid problems in production. The bug did not impact consumers, because we had implemented a draining mechanism that stops accepting new connections once a container reaches a specific amount of memory.

Moral of the Story

  1. Never trust anyone
  2. Always suspect 3rd party as a source of problems
  3. Do minor changes each release to be able to revert fast when you get production problems
  4. Production scale of billions of requests is an odd animal, and is very hard to simulate in testing
  5. The monitoring tool itself can be the problem in your app (the uncertainty principle)
