An overview of common problems and tactical solutions for those approaching the serverless crossroads.
Serverless is one of those technologies that comes along every five to ten years and promises to completely revolutionize the way we design and build applications, nowhere more so than in the context of cloud-native systems.
When we look back at the last thirty years of significant technology changes, so much work has been done to abstract away various layers of infrastructure to the point where application developers are meant to only think about application code and not the servers that execute it.
Virtual Machines, the JVM, Containerisation, and now Serverless are all abstractions on abstractions that try to commoditize the way we execute code and move us as far away from the bare metal as possible.
Serverless or, more specifically, Function-as-a-service (FaaS) is the culmination of these abstractions. The idea is to give a cloud vendor your code, and they will run it for you.
The internet is full of examples and opinion pieces on this - especially on AWS Lambda, the forerunner and thought leader in the FaaS space. Most of these articles offer fairly simplistic examples of how to build basic APIs and event processing systems using FaaS and other fully managed cloud services like NoSQL databases and streaming services.
What happens if you decide to bet the house on Serverless, though? Is FaaS ready to run a fully-fledged, enterprise-grade application stack with all the complexity that comes with it?
Can you really hand all of your event-based processing to a FaaS provider and trust them to handle everything with the same performance and scalability you could achieve with VMs?
Drowning in Lambdas
FaaS seems to be a natural fit for handling any kind of event-based processing use case, and actually, a lot of the workloads we write code for in the cloud are event-based, even though that may not appear to be the case at first glance.
A web API is event-based - your event is the API call. There are plenty more examples: stream processing, image rendering, ML inference, file scanning, and many more. This is also where things start to get complex.
If you follow best practices, you’re going to want to write a function for each type of event you process - and although this might start out at a small number, you’re likely going to end up with hundreds or thousands of different functions across your organization.
Let's take microservices as an example. A typical microservice might have 20 endpoints, and in the case of a RESTful API, for each endpoint, you’re likely to have a combination of GET, POST, PUT, DELETE, and LIST operations available - maybe more. Conservatively, this means 50-100 endpoint/operation combinations, translating to 50-100 functions you need to version, manage, and deploy.
Without the right automation strategy, this gets complicated, and fast. Often the answer appears to be something like the AWS Serverless Application Model (SAM), but it’s easy to run into trouble here too. SAM is an open-source framework for building serverless applications using standard AWS components like API Gateway, Lambda, and DynamoDB. Although it is a wrapper around CloudFormation templates, it provides a much simpler YAML syntax for defining a serverless stack, allowing you to define resources in just a couple of lines.
If you follow the principle of least privilege, you’re likely to define an IAM role for each Lambda function, granting it very fine-grained permissions to your resources. You will also probably want to define some other resources your service needs - DynamoDB tables, for example, and CloudWatch Log Groups. Very quickly, you’re running right into the CloudFormation hard limit of 200 resources per stack (AWS has since raised this to 500). Oops!
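To get a feel for how quickly a stack fills up, here is a rough back-of-the-envelope sketch. The per-function breakdown (one function, one IAM role, one log group) and the handful of shared resources are illustrative assumptions, not figures from SAM or CloudFormation; real templates often expand into more resources per function than this.

```python
# Rough estimate of how many CloudFormation resources one FaaS-based
# microservice needs. All counts here are illustrative assumptions.

RESOURCES_PER_FUNCTION = {
    "lambda_function": 1,
    "iam_role": 1,   # one least-privilege role per function
    "log_group": 1,  # one CloudWatch Log Group per function
}


def estimate_stack_resources(function_count: int, shared_resources: int = 0) -> int:
    """Estimate the total number of resources in a single stack."""
    per_function = sum(RESOURCES_PER_FUNCTION.values())
    return function_count * per_function + shared_resources


def exceeds_limit(function_count: int, limit: int, shared_resources: int = 0) -> bool:
    """True if the estimated stack would not fit under the given limit."""
    return estimate_stack_resources(function_count, shared_resources) > limit


if __name__ == "__main__":
    # shared_resources=5 stands in for tables, an API definition, and so on.
    for functions in (50, 75, 100):
        total = estimate_stack_resources(functions, shared_resources=5)
        over = exceeds_limit(functions, limit=200, shared_resources=5)
        print(f"{functions} functions -> ~{total} resources, over a 200 limit: {over}")
```

Under these assumptions, a service at the low end of the 50-100 range still fits in one stack, but anything much larger has to be split across stacks or nested stacks.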
Any microservices architecture results in a significant distribution of services, logic, and API calls - pretty much by definition. This in itself brings several challenges that are well documented. Perhaps the most pertinent challenge to Serverless architectures, and FaaS, in particular, is that of the cascading failure.
To demonstrate this, consider a fictional e-commerce website and an extremely common action - viewing the items in a user's basket before checkout.
When a user wants to browse the items in their basket, a number of things need to happen, each of which is likely to be designed as its own microservice:
- Make a call to the user-basket service to get a list of items in the basket.
  - This will probably call the authentication service to check if the user is allowed to retrieve the basket items for this user.
For each item in the basket:
- Retrieve the name of the item and an image thumbnail.
- Check if the item is still in stock.
- Call the pricing service to get the current item price.
  - Call the authentication service to see if the user is allowed to do this.
- Call the stock service to see if the item is in stock.
  - Call the authentication service to see if the user is allowed to do this.
- Check if there are any promos to be applied for this user.
  - Call the authentication service to see if the user is allowed to do this.
That’s at least four calls to the authentication service. In this example, a cascading failure occurs when the authentication service starts to degrade - let's say every 10th call results in a 500 server-side error. This will cascade up through the services that call it (e.g. the stock service), which will then throw an error, which manifests itself to the user.
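To put a rough number on this, the sketch below treats "every 10th call fails" as an independent 10% failure rate per call, which is a simplifying assumption. It also assumes one authentication call for the basket lookup plus three per item (pricing, stock, promos), as in the list above.

```python
def auth_calls_for_basket(item_count: int) -> int:
    """One auth call for the basket lookup, plus three per item."""
    return 1 + 3 * item_count


def request_failure_probability(call_failure_rate: float, dependent_calls: int) -> float:
    """Probability that at least one of the dependent calls fails,
    assuming failures are independent of each other."""
    return 1.0 - (1.0 - call_failure_rate) ** dependent_calls


if __name__ == "__main__":
    for items in (1, 3, 10):
        calls = auth_calls_for_basket(items)
        p = request_failure_probability(0.10, calls)
        print(f"{items} item(s): {calls} auth calls, {p:.1%} chance the page fails")
```

With a single item, that is four calls and roughly a one-in-three chance that the user sees an error, even though the authentication service itself is still answering 90% of its requests correctly.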
It’s not uncommon for a microservice to have a call stack that's more than ten layers deep. What happens when the service at the 10th layer fails, manifesting errors all the way up the call stack?
While not fully solved in the container space, this problem has been somewhat addressed by service mesh tools such as Istio and App Mesh, and (to some extent) by circuit-breaker libraries such as Hystrix. Each of these provides some combination of sidecar proxy and traffic management to route calls around dead endpoints, providing additional observability over what’s working and what’s not.
Although these tools don’t provide a silver bullet, they are relatively mature, broadly adopted, and battle-tested.
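One of the patterns these tools rely on is the circuit breaker: after a run of failures, stop calling the unhealthy dependency for a while and fail fast instead. The following is a minimal, in-memory sketch of that idea, not the implementation used by any of the tools named above; the class name, thresholds, and the injectable clock are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock      # injectable so the breaker can be tested
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Note that this state lives in process memory. In a FaaS setting each container would hold its own copy, so sharing breaker state across invocations needs an external store, which is exactly the kind of in-house engineering the mesh tools save you in the container world.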
To date, achieving similar functionality with FaaS architectures requires the introduction of those ugly server things you had tried so hard to remove from your stack. There is no native FaaS service-mesh tooling, and any solution requires a significant degree of in-house engineering.
Remember that each FaaS invocation runs in a container on your cloud vendor's compute plane - a managed container, but it’s still a container. The vendor may reuse a warm container for subsequent invocations, but you can't count on it: containers hang around only for as long as the vendor sees fit, making it difficult to trace what has happened - particularly if your logging isn't extremely detailed.
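One common way to make short-lived invocations traceable is to emit structured log lines that all carry the same correlation ID, and to pass that ID on to every downstream call. The sketch below assumes a Lambda-style `handler(event, context)` entry point; the `correlation_id` event field, the service names, and the `log_event` helper are hypothetical names chosen for illustration.

```python
import json
import time
import uuid


def log_event(correlation_id, service, message, **fields):
    """Print one JSON log line; stdout is assumed to reach your log store."""
    record = {
        "ts": time.time(),
        "correlation_id": correlation_id,
        "service": service,
        "message": message,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


def basket_handler(event, context=None):
    # Reuse the caller's ID if there is one; otherwise start a new trace.
    correlation_id = event.get("correlation_id") or str(uuid.uuid4())
    user_id = event.get("user_id")
    log_event(correlation_id, "user-basket", "request received", user_id=user_id)

    # Whatever is sent downstream carries the same ID, so log lines from
    # separate containers can be stitched back together afterwards.
    downstream_event = {"correlation_id": correlation_id, "user_id": user_id}
    log_event(correlation_id, "user-basket", "calling pricing service")

    return {"correlation_id": correlation_id, "downstream_event": downstream_event}
```

Searching the logs for a single correlation ID then reconstructs the path of one request, even though no individual container lived long enough to see all of it.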
Compound this with the potentially thousands upon thousands of functions that will need deploying, versioning, and monitoring, and you end up with a stack that's orders of magnitude more difficult to operate and run in production.
Observability at Scale
The nature of serverless makes figuring out what’s going on, especially in a complex stack, a significant engineering challenge that requires careful planning. We’ve already talked about how difficult it is to obtain good observability in a distributed call stack, but serverless takes monitoring requirements to another level.
Compared to containers you host yourself, there are now several additional variables that you need to think about, all of which directly affect the time it takes for an API call to return and, in turn, directly impact user experience.
How long does it take for your function container to start? How long does it take for your function to obtain a network connection? How long does it take for the code to start, and how does the size and complexity of the code impact this? Why are some functions taking longer than others to start? Sometimes this comes down merely to the scheduling strategy used by the cloud host, which you cannot control.
Every one of these parameters contributes to your API response time, and each of them is critical to understand, monitor, and tune. Remember that you now need to perform these actions for each of the 50-100 functions in your microservice - rather than the single replica set of Docker containers you would have deployed to Kubernetes.
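Some of these timings can be captured from inside the function itself. The sketch below relies on the fact that module-level code runs once per container, so a module-level flag distinguishes a cold start from a warm invocation. The metric names and the `do_work` placeholder are hypothetical, and time spent before your code is loaded (container scheduling, network setup) is not visible from here and has to come from the provider's own metrics.

```python
import json
import time

# Module-level code runs once per container, i.e. on a cold start.
_MODULE_LOADED_AT = time.monotonic()
_is_cold_start = True


def do_work(event):
    """Placeholder for the real business logic."""
    return {"echo": event}


def timed_handler(event, context=None):
    global _is_cold_start
    started = time.monotonic()
    cold = _is_cold_start
    _is_cold_start = False  # later invocations in this container are warm

    result = do_work(event)

    metrics = {
        "cold_start": cold,
        # Gap between module load and first invocation; only meaningful when cold.
        "init_to_invoke_ms": round((started - _MODULE_LOADED_AT) * 1000, 3) if cold else None,
        "handler_ms": round((time.monotonic() - started) * 1000, 3),
    }
    print(json.dumps(metrics))  # one metrics line per invocation
    return {"result": result, "metrics": metrics}
```

Emitting a line like this from every function gives you a consistent baseline to compare across all 50-100 functions, rather than instrumenting each one differently.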
Understanding an extensive FaaS system is hard. Knowing what calls what, where to make a change, and how that change affects the way information flows into and out of the system requires a consolidated strategy. Observability is key to this, but so is advance planning, functionality management, technical consistency, and generally keeping a tight grip on the architecture of your system.
As with all new technologies, you need to be cautious when adopting serverless. While it’s very attractive to no longer have to think about patching VMs, security scanning containers, and scaling clusters, the nature of serverless introduces its own challenges.
Observability is hard and requires proper planning and engineering from the start. Deployments and automation are hard and also require adequate thought and design to work properly. Beyond this, it’s critical to understand and have a plan for how you will version your functions, how you will respond when they don’t perform as expected, and how you will test them. All of these things are possible and solvable but may involve more complicated solutions than you had first hoped.
While serverless is going to take its place as one of the cloud revolution’s flagship features, it doesn’t come without its challenges or its pitfalls. As tooling matures, we can only wait to see where serverless might lead the industry.