June 2, 2021 · 7 min read
At Transcend, our mission is to put users everywhere in control of their data.
At the core of this mission is our data privacy infrastructure, which handles everything for companies: letting their users submit privacy requests, orchestrating personal data across hundreds of SaaS partners and internal databases, and returning the data found to the data subject.
But with that comes an inherent challenge: running hundreds of thousands of processes in parallel, all while keeping request completion times to a minimum. While our platform could technically handle current volumes in parallel with low turnaround times, we knew there were low-hanging architectural improvements that would let us handle volumes an order of magnitude greater.
This led to a reconnaissance mission into Serverless territory, and as we are an AWS-only shop, our prime candidate was AWS's Lambda. This blog post chronicles our efforts to move most of our privacy request processing workflows from a more traditional monolithic service to Lambda functions.
Before we go any further, let's first touch on the objectives we were hoping to achieve with this project.
In the short term, we wanted to quickly scale our ability to serve hundreds of thousands of privacy requests for our customer base with minimal delay. At the same time, we sought to break apart our monolithic application to achieve much faster builds and deploys.
In the long term, the migration sought to bring more modularity to our codebase and platform. We wanted to be able to scale different parts of our platform independently, enable smoother transitions when pushing big changes, and process and deploy our services and events on a more granular basis.
Prior to the introduction of Lambda functions as our primary way of processing data subject requests, our application's backend was deployed in an Elastic Container Service (ECS) cluster as a monolithic application. We had a single backend "service" that would perform multiple tasks, such as invoking remote APIs, scheduling and running workflows, and acting as the server for our customer-facing API.
The bulk of our workflow involves invoking SaaS APIs using metadata from a privacy request. That involves locking the appropriate database rows to prevent multiple application threads from acting upon the same privacy request.
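To make that locking concrete, here is a minimal sketch of the pattern, assuming a Postgres database; the privacy_requests table and helper names are hypothetical, not our actual schema or code:

```js
// Minimal sketch of per-request row locking with the 'pg' client.
const { Pool } = require('pg');
const pool = new Pool();

async function processPrivacyRequest(requestId) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // FOR UPDATE locks the row so no other application thread can
    // pick up the same privacy request until we commit or roll back.
    const { rows } = await client.query(
      'SELECT * FROM privacy_requests WHERE id = $1 FOR UPDATE',
      [requestId]
    );
    // ... invoke the SaaS API using metadata from rows[0] ...
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```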
Unfortunately, this approach didnât scale well with multiple application threads or when faced with unexpected spikes in incoming requests.
Lambda functions are self-contained pieces of code, usually deployed by uploading a zip archive containing the code to run, with functions configured via the AWS UI or a configuration file. They are analogous to functions in your code, in the sense that you pass arguments to your function and operate on the response returned. However, when a function's code changes, you only need to re-deploy that specific function instead of your entire application.
Lambda functions are built on the Serverless philosophy, in that you donât need to worry about the actual hardware running your code and instead can focus on writing code, letting AWS handle the actual provisioning and runtime requirements of your function.
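In Node.js terms, a Lambda is just an exported handler function. Here's a minimal, purely illustrative example:

```js
// handler.js -- a minimal Node.js Lambda function.
// AWS invokes the exported handler with the event payload.
exports.handler = async (event) => {
  console.log('received event:', JSON.stringify(event));
  // ... act on the event's arguments ...
  return { message: 'done' };
};
```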
So what's changed?
With Lambdas, we now send messages to a Simple Queue Service (SQS) queue that triggers a Lambda function. The message describes the workflow to be run along with the parameters required to invoke the remote APIs, among other things. The results of the Lambda function's execution are stored in a DynamoDB table, which is regularly polled by our backend service. These results are then ingested in batches into the database to update the application state presented to the user.
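To sketch what that enqueueing looks like (the queue URL and message shape here are illustrative, not our exact schema):

```js
// Sketch: the backend enqueues a workflow run for a Lambda to pick up.
const AWS = require('aws-sdk');
const sqs = new AWS.SQS({ region: 'eu-west-1' });

async function enqueueWorkflow(workflowName, params) {
  await sqs
    .sendMessage({
      QueueUrl: process.env.HANDLER_QUEUE_URL, // queue that triggers the Lambda
      MessageBody: JSON.stringify({ workflow: workflowName, params }),
    })
    .promise();
}
```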
The asynchronous nature of this processing means that thread execution times for the backend service are now less than a second, instead of the multiple minutes they averaged with our earlier approach.
But what happens to messages that fail to process, whether it's because a SaaS API rate limits us, there's a bug in our function code, or something else entirely?
There are a couple of different ways we handle error states. In the case of API rate limits, or if we encounter a temporary error state, we retry the event after waiting for a short period of time. If we encounter a state where a bug in our application code is causing messages to fail processing, we send our messages to what's called a Dead Letter Queue (DLQ) for analysis of the failed messages and their error stacks. We'll talk about DLQs later in this post.
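Mechanically, with an SQS trigger much of this comes for free by letting the handler fail. A sketch, assuming a hypothetical runWorkflow helper:

```js
exports.handler = async (event) => {
  // An SQS-triggered Lambda receives a batch of records.
  for (const record of event.Records) {
    const message = JSON.parse(record.body);
    // runWorkflow is a stand-in for our actual workflow logic.
    // If it throws (e.g. on a rate limit), the invocation fails and
    // SQS redelivers the message after the visibility timeout -- the
    // short wait-and-retry. Once a message fails more times than the
    // queue's maxReceiveCount allows, SQS moves it to the DLQ.
    await runWorkflow(message);
  }
};
```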
As mentioned earlier, we knew that our monolithic approach was not sustainable in the long term as the sheer volume of privacy requests we were handling increased. Based on earlier attempts to increase our processing capacity, by adding application threads or upsizing the database to cope with the extra connections, we knew some of the bottlenecks in our architecture would have to be addressed before we could grow.
We were also aware that we would have to adopt a more asynchronous approach to processing privacy requests and rely less on interacting with the database at every turn to update the application state. And as previously mentioned, we are an AWS-only setup, meaning AWS's Lambda became our prime choice.
There are multiple ways to invoke a Lambda function, such as using AWS's API Gateway, using a Simple Notification Service (SNS) topic or an SQS queue trigger, or even the AWS API directly.
Considering our need for asynchronous processing, we chose to go with the message queue approach, with SQS and Apache Kafka as our primary candidates. In the end, SQS won for its ease of use, first-class integration with the AWS ecosystem, and relatively lightweight setup.
We also decided to go with the Serverless framework over AWS's Serverless Application Model (SAM) framework, as the former is much more widely used and its configuration syntax is much easier for humans to parse, things we hoped would mean an easier time finding solutions if, and when, we ever ran into issues.
Spoiler: we ran into a number of issues, and while the Serverless framework documentation and forums were enough to resolve most of them, there are still some rough edges.
For example, Serverless does not support creating a DLQ for a regular message queue using its config syntax. Fortunately, it supports embedding CloudFormation configuration blocks in your Serverless configuration, which meant we were able to create and link the two queues and then refer to both of them in the rest of our Serverless configuration.
```yaml
service: handler-lambda

plugins:
  - serverless-webpack

provider:
  name: aws
  runtime: nodejs14.x
  region: eu-west-1
  timeout: 300
  memorySize: 512

functions:
  handler:
    name: handler
    handler: src/handler.handler
    events:
      - sqs:
          arn:
            Fn::GetAtt:
              - HandlerQueue
              - Arn

resources:
  Resources:
    HandlerQueue:
      Type: 'AWS::SQS::Queue'
      Properties:
        QueueName: 'handler-queue'
        RedrivePolicy:
          deadLetterTargetArn:
            'Fn::GetAtt':
              - HandlerDeadLetterQueue
              - Arn
          # maxReceiveCount is required by CloudFormation; 5 here is an
          # assumed value for how many failures precede the DLQ.
          maxReceiveCount: 5
    HandlerDeadLetterQueue:
      Type: 'AWS::SQS::Queue'
      Properties:
        QueueName: 'handler-dlq'
```
Next, let's talk about how we package our code onto the Lambda function itself.
We used Webpack to bundle our code, given that the serverless-webpack plugin is one of the most downloaded and mentioned plugins in the Serverless ecosystem.
We also decided to package and deploy our monorepo code and external dependencies (like request-promise, lodash, bluebird, etc.) separately, as this allows us to create the bundle for external dependencies once and re-use it for each of our hundreds of Lambda functions, drastically cutting down on deployment times.
As part of this bundling approach, the webpack config had to be modified to bundle only our monorepo dependencies, ignoring all external packages.
```js
// Bundle only our monorepo packages; everything in node_modules is
// treated as external and packaged separately.
const nodeExternals = require('webpack-node-externals');
const { merge: webpackMerge } = require('webpack-merge');
const commonConfig = require('./webpack.common'); // shared config path is illustrative

module.exports = webpackMerge(commonConfig, {
  externals: [
    nodeExternals({
      allowlist: ['monorepo-package-a', 'monorepo-package-b' /* ... */],
    }),
  ],
});
```
Midway through this project, we began looking into converting our Yarn v1 workspaces into Yarn v2 for multiple reasons, chief amongst which were the tremendously shorter install times on both local and CI and lower disk space usage. In the end, we were able to move our project to use Yarn v2, but this came with pretty big changes to our entire Serverless deployment workflow. We're planning on publishing a chronicle on this migration in a future post, so stay tuned!
Yarn v2 does away with node_modules folders completely and patches NodeJS's require/import functions to redirect any dependency import calls to the new Yarn Plug'n'Play cache system when running code via the yarn run … command. Yarn v2 uses a map file that tells NodeJS's require() how to import packages from Yarn v2's cache.
The changes we had to make were twofold:
```js
const { BannerPlugin } = require('webpack');
const { merge: webpackMerge } = require('webpack-merge');
const commonConfig = require('./webpack.common'); // shared config path is illustrative

module.exports = webpackMerge(commonConfig, {
  plugins: [
    // Prepend the bundle with a call into Yarn PnP's runtime so the
    // bundled code can resolve dependencies from the Yarn v2 cache.
    new BannerPlugin({
      banner: `require('/path/to/pnp.cjs').setup();`,
      raw: true,
    }),
  ],
});
```
After figuring out how to package and deploy our Lambda functions, we were ready to start testing Lambdas as part of our privacy request workflows! But, we weren't quite ready to roll out untested code to customers in production and had to find a way to do a phased switchover.
The approach we used was to start switching over a few of the least-used workflows to Lambdas, testing extensively, and patching any issues. We found that this approach worked best, as it allowed us to control the "blast radius" of any issues we found and move more quickly while refining our Lambda development and deployment processes.
The main downside to this approach was that we had to maintain two versions of the workflow logic: one in the monolithic backend service and another in the Lambdas. But the fact that this was temporary made it an easier pill to swallow.
With these new Lambda functions and their associated infrastructure (SQS queues, DynamoDB tables), we went from 0 to over 300 functions! Our deploy process at the time was quite simple: build a Docker image, upload it to the Elastic Container Registry (ECR), and update the ECS tasks to use the new image.
The Serverless framework is quite useful as it also helps us deploy all the Lambda functions along with their SQS queues, IAM roles, and DynamoDB tables, with the downside being that we need to deploy each function separately.
We did look into tools like Seed.run and Serverless's own deployment offering to deploy our new infrastructure in a parallel fashion but ultimately decided to roll our own deploy pipeline using multiple CircleCI containers in a single deploy job, as we didn't want to hand over the keys to the proverbial kingdom to external vendors.
The deploy job is straightforward conceptually; it first indexes all the functions that can be deployed, reads a log of all deployments to find the latest version deployed for a particular function, and only creates a new deploy if a dependency of the function has changed since it was last deployed. The deploy job is then given the list of functions to deploy, and each container in the overall job deploys anywhere from 3 to 15 functions in a single run, with 15 containers running in parallel.
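In pseudocode-level terms, the selection step looks something like the sketch below; the helper names and the deploy log store are illustrative, not our actual pipeline code:

```js
// Sketch of the deploy job's "what changed?" selection logic.
async function selectFunctionsToDeploy() {
  const allFunctions = await indexDeployableFunctions(); // scan the monorepo
  const toDeploy = [];
  for (const fn of allFunctions) {
    const lastDeploy = await getLatestDeploy(fn.name); // read the deploy log
    if (!lastDeploy || dependenciesChangedSince(fn, lastDeploy)) {
      toDeploy.push(fn);
    }
  }
  // The resulting list is split across ~15 parallel CircleCI
  // containers, each deploying 3 to 15 functions per run.
  return toDeploy;
}
```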
To say this project was convoluted is an understatement. There were many hurdles to clear, quirks to work around, and rabbit holes to fall into. At the end of it, though, it's quite clear to us that there's a strong use case for the Serverless philosophy. It has allowed us to scale our compute up (and down) based on workload, reduce our reliance on the database and its connection limits, and cleanly delineate the different functions our platform performs.
While there is definitely an argument to be made against prematurely optimizing for workloads you might never see, in this case, the long-term benefits tied in very neatly with our short-term requirements, and so making this move made sense for us.