How a Golang Upgrade Led to Breaking and Fixing a Service

Ethos Life
Jul 21, 2022

Vishal Sood, Software Engineer

We use a plethora of different technologies here at Ethos to make life insurance easy and protect the next million families. One of those technologies is Go (or golang) which powers multiple backend services. A couple of months ago, we decided to upgrade the version of golang being used in one of these services and started experiencing an issue that one wouldn’t normally expect during such an upgrade.

Upgrading the Golang Version Caused a Service to Run Out of Memory

Upon upgrading the golang version, we started to see pods dying (and restarting) every couple of hours because they ran OoM (out of memory). As an initial brute-force solution, we tried increasing the amount of memory allocated to the service. This seemed to decrease the frequency of the pods dying but didn’t prevent them from eventually running OoM and dying.

Memory usage before the resource increase: each discontinuity in the graph corresponds to one pod dying and another spinning up

Memory usage after the resource increase

On inspecting the memory usage, we found that the memory utilization was going up every hour, coinciding with the start of a long, memory-intensive hourly job. Memory usage didn’t recover after the job finished and instead increased in visible steps until the pod hit the limit and was then automatically killed. This indicated a memory leak, but no major changes had been made to the job recently and it had been running in production without any memory issues for over a year!

Diagnosing the Issue and Resolving it

I decided to take a deeper dive into this issue to try and figure out why, all of a sudden, the pods were running up against their memory limits. As would be expected, I started by using pprof to collect and analyze the application’s memory allocation data. I collected data before, during, and after the job while using different versions of golang. It was clear from the memory profiles that memory usage wasn’t returning to normal levels after the job finished on the newer golang version. It did, however, return to the pre-job level when the older golang version was used.
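(For reference, collecting these profiles is straightforward if the service exposes the standard net/http/pprof endpoints. The sketch below shows a minimal setup; the port and layout are illustrative, not our actual configuration.)

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Serve the default mux on a side port to expose the profiling endpoints.
	// Heap profiles can then be pulled before, during, and after the job with:
	//   go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```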

pprof’s top command showed that the memory was being held by a method responsible for downloading a large resource from an S3 bucket. Digging further with the list command, I found that a buffer object used to download the S3 resource was the one holding on to the memory. Some googling revealed that a few other people had run into issues using the AWS SDK’s aws.WriteAtBuffer to download large S3 objects. So I started looking into that, but soon realized that we hadn’t touched the download code in months.
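(For context, downloading a whole object into memory with the v1 AWS SDK for Go looks roughly like the sketch below. The bucket and key are placeholders, and our actual download code differs in its details.)

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

func main() {
	sess := session.Must(session.NewSession())
	downloader := s3manager.NewDownloader(sess)

	// aws.WriteAtBuffer keeps the entire object in memory, which is why it
	// dominated the heap profile for a large resource.
	buf := aws.NewWriteAtBuffer([]byte{})
	n, err := downloader.Download(buf, &s3.GetObjectInput{
		Bucket: aws.String("example-bucket"), // placeholder
		Key:    aws.String("large-resource"), // placeholder
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("downloaded %d bytes into memory", n)
}
```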

Applying “first-principles thinking” (manifesting “Intellectual Rigor”, one of our core values and the one that resonates the most with me as an engineer), I reasoned that since the issue appeared after a golang version upgrade and the initial investigation showed a specific object holding on to memory, the problem most likely had something to do with memory management. With that in mind, I changed the course of my investigation to focus on memory management and garbage collection, specifically on the garbage collection of the buffer object and any changes between versions that might have affected it.

Inspecting the different places in the code that referenced the buffer object, I stumbled upon an odd-looking construct that was indirectly interacting with the buffer object (referencing some other field on the struct containing the buffer). A simplified simulation is shown below.

Simplified version of the odd-looking construct encountered in the code

A method creates and returns an unbuffered channel to which we send a value from inside a goroutine, but the returned channel is not captured by the caller. After examining the git blame and asking around, I learned that a few months ago, a well-intentioned developer wanted to make the method behave asynchronously (by ignoring the channel) for the HTTP request handler but synchronously (by waiting for a value on the channel) for use in integration tests.
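(A minimal sketch of such a construct might look like the following; processor, doHeavyWork, and the field names are illustrative, not the actual code.)

```go
package service

// processor stands in for the struct that owns the large download buffer
// alongside a few small bookkeeping fields.
type processor struct {
	buf  []byte // the downloaded S3 object
	name string
}

func (p *processor) doHeavyWork() error {
	// ... download into p.buf and process it ...
	return nil
}

// processResource kicks off the work in a goroutine and returns an unbuffered
// channel on which the result is eventually sent.
func (p *processor) processResource() chan error {
	done := make(chan error)
	go func() {
		done <- p.doHeavyWork() // blocks forever if nobody receives from done
	}()
	return done
}
```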

Using the same method asynchronously and synchronously

But, unknowingly, the developer introduced a goroutine leak in the handler flow (as is illustrated in this playground link).
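(Building on the previous sketch, and again with hypothetical names, the two call sites might have looked roughly like this.)

```go
package service

import (
	"net/http"
	"testing"
)

// HTTP handler flow: the returned channel is discarded, so the call behaves
// "asynchronously" and the goroutine sending on that channel leaks.
func (p *processor) handleDownload(w http.ResponseWriter, r *http.Request) {
	p.processResource() // channel ignored
	w.WriteHeader(http.StatusAccepted)
}

// Integration-test flow (normally in a _test.go file): the channel is received
// from, making the call synchronous.
func TestProcessResource(t *testing.T) {
	p := &processor{name: "test-object"}
	if err := <-p.processResource(); err != nil {
		t.Fatal(err)
	}
}
```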

Refactoring the method to not return a channel and re-running memory profiling on the updated code indicated that the refactor solved the memory issue.

Code refactored to resolve goroutine leak
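(Continuing with the hypothetical names from the sketches above, such a refactor might look something like this.)

```go
package service

import (
	"log"
	"net/http"
)

// processResource now runs synchronously and simply returns the error; there
// is no channel for a caller to forget about.
func (p *processor) processResource() error {
	return p.doHeavyWork()
}

// Handler flow: asynchronous by choice, with nothing left blocked behind it.
func (p *processor) handleDownload(w http.ResponseWriter, r *http.Request) {
	go func() {
		if err := p.processResource(); err != nil {
			log.Printf("resource processing failed: %v", err)
		}
	}()
	w.WriteHeader(http.StatusAccepted)
}
```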

So, this means problem solved, right? The memory usage returns to normal levels after job completion.

Hooray! 🎉

Going a little further!

But wait! This still doesn’t answer the question of why this was not an issue with the older golang version (the leaky goroutine code had been in production for several months).

Leaving the investigation here would solve the issue but would neither lead to substantial learning nor exemplify “Intellectual Rigor”!

So, I decided to spend some more time getting to the bottom of this. I started by running the job multiple times (using different golang versions and different versions of the code) and gathering multiple memory profiles to dig into. But the memory profiles weren’t revealing any new information at this point. So I collected other profiles (CPU, goroutine, etc.) to inspect, but to no avail!

Going back to “first principles” — I conjectured that, in the simplest terms, the issue can be summed up as follows:

something changed between golang versions that resulted in the prevention of memory being garbage collected (which was being collected previously)

I started looking into golang’s garbage collection specifics. Based on the release notes, nothing seemed to have changed garbage collection drastically. So maybe what had changed was when the garbage collector ran. Even though it was unreasonable to think that the garbage collector simply wasn’t running for several minutes after the job completed (and almost an hour in the case of production), just to be entirely sure I tried running the old code with garbage collection forced right before exiting the job processing. That didn’t help either; memory was still leaking.

It was clear that something was keeping a reference to the buffer object and preventing it from being garbage collected, but I wasn’t sure what exactly that something was.

So, I resorted to replicating the issue in a simpler setup and trying to reason about it with the simpler code! One thing to keep in mind here is that a GC cycle might never run in a short-lived program, so we enforce garbage collection explicitly.

To accomplish this, I created a code snippet to simulate the scenario but with enforced garbage collection and tried running that with different versions of golang.

Simulation of the offending application code in a simpler code setup
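(A sketch along these lines, with made-up sizes and names, captures the idea: leak the goroutine, then force a GC and inspect the heap.)

```go
package main

import (
	"fmt"
	"runtime"
)

// holder stands in for the struct that owns the large download buffer plus a
// small bookkeeping field.
type holder struct {
	buf  []byte
	name string
}

// leakyProcess mimics the application code: it returns an unbuffered channel
// that is written to from a goroutine, and the caller never reads from it.
func leakyProcess() chan string {
	h := &holder{buf: make([]byte, 512<<20), name: "large-s3-object"}
	h.buf[0] = 1 // pretend to use the buffer

	done := make(chan string)
	go func() {
		done <- "finished " + h.name // blocks forever: nobody receives on done
	}()
	return done
}

func main() {
	leakyProcess() // channel discarded, as in the handler flow

	// A short-lived program may never run a GC cycle on its own, so force one
	// before inspecting the heap.
	runtime.GC()

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("heap in use after forced GC: %d MiB\n", m.HeapInuse>>20)
}
```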

This successfully replicated the goroutine leak but didn’t cause the memory leak with the newer golang version, which meant it wasn’t simulating the exact scenario.

I was unable to make much progress and called it a day. But the issue stayed stuck in my mind even after work. Later that night, right before I was about to go to bed, it suddenly occurred to me that one construct missing from my replication was defer: the original code writes to the channel in a deferred closure, and that closure also indirectly interacts with the buffer object (just like the goroutine, by referencing a different field in the struct holding the buffer object).

Updating the code snippet to use defer in the same way resulted in a replication of the memory leak.

Simulation of the offending application code along with use of defer
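(Again a rough sketch: the only change from the previous one is that the channel send moves into a deferred closure, which also touches a field of the struct holding the buffer.)

```go
// Same holder and main as in the previous sketch; only leakyProcess changes.
func leakyProcess() chan string {
	h := &holder{buf: make([]byte, 512<<20), name: "large-s3-object"}

	done := make(chan string)
	go func() {
		// As in the original application code, the channel is written to from
		// a deferred closure that references another field of the struct
		// holding the buffer.
		defer func() {
			done <- "finished " + h.name // blocks forever: nobody receives on done
		}()
		h.buf[0] = 1 // the goroutine's "real" work touches the buffer
	}()
	return done
}
```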

So, I went through the release notes again, this time to see if there were any updates to how defer affects garbage collection. I couldn’t find anything explicitly about defer affecting garbage collection, but I did discover that some optimizations were made to defer in v1.14.

To see if that could be directly affecting our use case, I tried running the original (non-refactored) code with v1.14 and that resulted in the memory leak too!

Thereafter, I started looking further into these optimizations made to defer. At this point, I’d like to point out that if you search for “defer changes in golang v1.14”, most of the first-page results you get are release notes (different people writing different articles just copying the official release notes).

Thankfully, it’s just most and not all. In the results, I found this podcast where the engineer behind the defer optimizations himself explains the changes made (optimizations are discussed at ~35:30).

He goes on to explain that, to reduce the overhead cost of defer, they made an optimization so that for any defer that is not in a loop, the compiler generates inline code, evaluates any arguments passed to the deferred function, and stores them on the stack instead of the heap, which, if you think about it, is a pretty clever optimization.

But this also means that any objects referenced within the deferred closure will stay in memory (and not be garbage collected) until the closure returns.

And that’s it! That explains exactly what’s going on in our scenario. The deferred closure is blocked because the channel is never read from; thus, the buffer object lives on in memory (and is not garbage collected) because its parent struct is referenced inside that closure.

To Conclude…

So, that’s how upgrading golang broke the service (by making it run out of memory) but at the same time allowed us to “fix” it by surfacing an undiscovered issue.

This was a particularly interesting investigation that provided a lot of opportunities to learn. This issue also seems to be a perfect example of the Murphy’s Law adage:

“Anything that can go wrong will go wrong”

Because if we think about it, the main issue with the code design was only the leaky goroutine. But for this exact bug (the memory leak) to manifest, and for the leaky goroutine to stay undiscovered for so long, multiple things had to happen in pretty specific ways, for instance:

  1. use of defer within the leaky goroutine
    As illustrated in a previous section, a (noticeable¹) memory leak was happening only when defer was added to the mix. So, had defer not been used, we probably wouldn’t have encountered the OoM issue and, consequently, might not have discovered the goroutine leak.
  2. defer optimization shifting to use the stack instead of the heap
    If the v1.14 optimizations hadn’t shifted to utilizing stack memory, the issue wouldn’t have surfaced. As alluded to before, the leaky goroutine had been in place for several months (albeit with an older golang version), but that didn’t lead to pods running OoM.
  3. referencing the (memory-intensive) buffer object within the deferred closure
    If the buffer object had not been referenced inside the deferred closure, then we wouldn’t have seen the memory spikes and subsequent OoM issues, because it would have been GC’ed after its last reference went out of scope!
  4. channel being used within the deferred closure²
    If the channel causing the leak had been used outside of the deferred closure, the issue would also have manifested in the older golang version. Thus, this specific point prevented the issue from being discovered for many months!
  5. channel used after the last reference to the memory-intensive object
    This investigation was triggered by the OoM issue, and the leaky goroutine bug had escaped us for quite some time. The issue would have surfaced earlier if the memory-intensive object had been referenced after the channel was used in the deferred closure!

¹ The noticeable memory leak was happening because the memory-intensive buffer object stayed on in memory, but even without that, the goroutine leak would itself cause a minuscule amount of memory to leak. It’s just that such a small leak wouldn’t surface easily!

² I don’t mean to suggest that there’s anything wrong with the pattern of using a channel inside a deferred closure. But, in this specific instance, it prevented the issue from being caught with the older version(s) of golang!

Vishal Sood, Software Engineer

Vishal Sood joined Ethos in March, 2021 as a Software Engineer. When Vishal isn’t working on empowering and developing the Agent ecosystem at Ethos, he enjoys keeping up to speed on all things MCU, cooking and learning new things. Interested in joining Vishal’s team? Learn more about our career opportunities here.
