A strange bug on AWS Lambda

At Adaptavist we run a lot of Lambda functions to provide functionality in our Cloud apps. We use both the Java and Node.js runtimes for lambdas, and recently we came across a weird bug in production that was caused by the way that the lambda function runtime is re-used in AWS.

As part of our ScriptRunner for Jira Cloud app, we run a service (built using lambda functions) that processes some input data from our customers' Jira instances and then PUTs the output back into those instances. Part of the processing logic requires us to hold in memory a list of the fields that exist in the customer's Jira instance. This lambda function is triggered, in parallel, once per user per Jira instance, for a subset of the users in each Jira instance.

Let's say we have 10 customers and 2 users per customer that need this service. We would trigger the lambda function that does the processing 20 times in parallel. The processing usually takes less than 30 seconds, but we run the process every minute. That frequency is important.

I mentioned that we need a list of fields in memory. When we first built the service we fetched that list on demand for each item in the input that the lambda receives. Recently we started caching that list of fields, so that we would only need to fetch it once per lambda invocation.

The lambda uses a node module that our frontend code for ScriptRunner also uses, and it was inside this module that we added the caching.

Once we deployed this change, everything seemed to be fine, until we got some support requests along the lines of "the service is telling me the fields are invalid". We checked the logs and, sure enough, there were lots of errors reported about fields mentioned in the lambda input being invalid (i.e. not in the list of fields we should have been fetching from Jira).

I mentioned at the start of the blog post that we use the Java and Node.js runtimes. When developing our Java lambdas we had been very cognizant of the fact that a single lambda instance can have its runtime frozen once it completes, and that a second request a short time later can cause the runtime to be thawed to process that request and avoid a cold start. We therefore knew we should clean up after ourselves in case the JVM is re-used.

For our Node.js lambdas we hadn't really thought that much about it, but once we saw this bug in production I had a hunch what might be happening.

Because we run the service so frequently (every minute) the lambda function is always "hot", which means the Node.js runtime is shared between invocations of the function. We had introduced caching into a node module and that module was not being re-initialized because the runtime (and therefore data stored inside modules) was being kept between invocations! Because we only have one lambda function for this process, the first time it ran it cached the field list from one Jira instance and all the subsequent invocations of the lambda that were supposed to be for other Jira instances just used the cached data which didn't match the input!
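To make the failure mode concrete, here is a minimal sketch of the pattern. It is not our real code: `fetchFields`, `getFields`, `handler`, the event shape and the example URLs are all made up for illustration, and the Jira call is replaced by a stub. Calling `handler` twice in the same process stands in for two invocations landing on the same warm runtime.

```javascript
// Hypothetical sketch: every name here is illustrative, not the real module.

// Module-level state is initialised once per runtime, NOT once per invocation.
let cachedFields = null;

// Stub standing in for the HTTP request that lists a Jira instance's fields.
async function fetchFields(instanceUrl) {
  return instanceUrl.includes('customer-a')
    ? ['summary', 'story-points']
    : ['summary', 'severity'];
}

async function getFields(instanceUrl) {
  if (cachedFields === null) {
    cachedFields = await fetchFields(instanceUrl);
  }
  // Bug: after the first call, instanceUrl is ignored entirely.
  return cachedFields;
}

// Shaped like a Lambda handler: reports which input fields look invalid.
async function handler(event) {
  const fields = await getFields(event.instanceUrl);
  const invalid = event.fieldsUsed.filter((f) => !fields.includes(f));
  return { invalid };
}

// Two "invocations" on the same warm runtime.
async function demo() {
  const first = await handler({
    instanceUrl: 'https://customer-a.example',
    fieldsUsed: ['story-points'],
  });
  const second = await handler({
    instanceUrl: 'https://customer-b.example',
    fieldsUsed: ['severity'],
  });
  return [first.invalid, second.invalid]; // [[], ['severity']]
}
```

The second call is for customer B, whose instance really does have a `severity` field, but it is validated against customer A's cached list and so gets reported as invalid.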

So, if you cache anything in a lambda function, make sure you think about the possibility of the cached value being shared between invocations of that lambda!
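One way to keep a cache that is safe to share is to key it by whatever the cached value depends on, in our case the Jira instance. The sketch below assumes that approach. As before, `fetchFields`, `getFields`, `handler` and the event shape are hypothetical names for illustration, and the Jira request is stubbed out.

```javascript
// Hypothetical sketch: every name here is illustrative, not the real module.

// Still module-level, so it survives warm invocations, but entries are
// looked up per Jira instance instead of being one shared value.
const fieldCache = new Map();

// Counts stubbed "requests" so the caching behaviour is observable.
let fetchCount = 0;

// Stub standing in for the HTTP request that lists a Jira instance's fields.
async function fetchFields(instanceUrl) {
  fetchCount += 1;
  return instanceUrl.includes('customer-a')
    ? ['summary', 'story-points']
    : ['summary', 'severity'];
}

async function getFields(instanceUrl) {
  if (!fieldCache.has(instanceUrl)) {
    fieldCache.set(instanceUrl, await fetchFields(instanceUrl));
  }
  return fieldCache.get(instanceUrl);
}

async function handler(event) {
  const fields = await getFields(event.instanceUrl);
  const invalid = event.fieldsUsed.filter((f) => !fields.includes(f));
  return { invalid };
}

// Three "invocations" on the same warm runtime, two of them for customer A.
async function demo() {
  const a1 = await handler({
    instanceUrl: 'https://customer-a.example',
    fieldsUsed: ['story-points'],
  });
  const b = await handler({
    instanceUrl: 'https://customer-b.example',
    fieldsUsed: ['severity'],
  });
  const a2 = await handler({
    instanceUrl: 'https://customer-a.example',
    fieldsUsed: ['severity'],
  });
  return { a1: a1.invalid, b: b.invalid, a2: a2.invalid, fetchCount };
}
```

With this shape each instance is validated against its own field list, and a repeat invocation for the same instance reuses the cached entry. The trade-off is that entries live for as long as the runtime stays warm, so a field added in Jira will not be seen until the runtime is recycled. If that matters, clearing the map at the start of each invocation, or creating the cache inside the handler, gives back "fetch once per invocation" behaviour.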

09 Oct 2018 » A strange bug on AWS Lambda