Posted on Mar 30

After 5 years, I'm out of the serverless compute cult

I have been using serverless computing and storage for nearly five years and I'm finally tired of it. I do feel like it has become a cult. In a cult, brainwashing is done so gradually, people have no idea it is going on. I feel like this has happened across the board with so many developers; many don’t even realize they are clouded. In my case, I took the serverless marketing and hype hook, line, and sinker for the first half of my serverless journey. After working with several companies small and large, I have been continually disappointed as our projects grew. The fact is, serverless technology is amazingly simple to start, but becomes a bear as projects and teams accelerate. A serverless project typically includes a fully serverless stack which can include (using a non-exhaustive list of AWS services):

API Gateway
Cognito
Lambda
DynamoDB
DAX
SQS/SNS/EventBridge

Combining all of these into a serverless project becomes a huge nightmare for the following reasons.

Testing

All these solutions are proprietary to AWS. Sure, a Lambda function is a pretty simple idea; it is simply a function that executes your code. The other services listed above have almost no easy, testable alternatives when integrated together. Serverless Application Model and Localstack have done some amazing work attempting to emulate these services. However, they usually only cover basic use cases, and an engineer ends up spending a chunk of time trying to mock services or figure out a way to get their function to test locally. Or they simply give up and deploy it. Also, since these functions typically depend on other developers' functions or API Gateway, there tends to be 10 different ways to authorize their function. For example, someone might have an unauthorized API, one may use AWS credentials, another might use Cognito, and yet another uses an API key. All of these factors lead to an engineer having little to no confidence in their ability to test anything locally.
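One way to claw back some local confidence, sketched here rather than prescribed, is to keep the business logic free of AWS calls and pass the storage dependency in. Every name below (`CustomerStore`, `InMemoryStore`, `get_customer`, `make_handler`) is hypothetical, and the event shape assumes API Gateway's Lambda proxy format.

```python
import json
from typing import Callable, Optional, Protocol


class CustomerStore(Protocol):
    """Hypothetical storage interface; a DynamoDB-backed class would implement it in production."""

    def get(self, account_id: str) -> Optional[dict]: ...


class InMemoryStore:
    """Test double that stands in for the real table during local runs."""

    def __init__(self, rows: dict):
        self._rows = rows

    def get(self, account_id: str) -> Optional[dict]:
        return self._rows.get(account_id)


def get_customer(store: CustomerStore, account_id: str) -> dict:
    """Pure business logic: no AWS SDK calls, so it runs anywhere."""
    customer = store.get(account_id)
    if customer is None:
        return {"statusCode": 404, "body": json.dumps({"reason": "customer not found"})}
    return {"statusCode": 200, "body": json.dumps(customer)}


def make_handler(store: CustomerStore) -> Callable[[dict, object], dict]:
    """Build a Lambda-style handler around whichever store is injected."""

    def handler(event: dict, context: object) -> dict:
        return get_customer(store, event["pathParameters"]["accountId"])

    return handler
```

This does nothing for the integration with API Gateway or Cognito. It only shrinks the part of the function that needs a deployment to verify.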

Account Chaos

Since engineers typically don't have high confidence in their code locally, they depend on testing their functions by deploying. This means possibly breaking their own code. As you can imagine, this breaks everyone else deploying and testing any code which relies on the now-broken function. While there are a few solutions to this scenario, all are usually quite complex (e.g. using an AWS account per developer) and still cannot be tested locally with much confidence. Chaos engineering has a time and a place. This is not it.

Security

With all the possible permutations of deployments and account structures, security becomes a big problem. Good IAM practices are hard. Many engineers simply put a dynamodb:* for all resources in the account for a lambda function. (BTW this is not good). It becomes hard to manage all of these because developers can usually quite easily deploy and manage their own IAM roles and policies. And since it is hard to test locally, trying to fix serverless IAM issues requires deploying to AWS and testing (or breaking) in the environment.
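As a rough illustration of the alternative to a blanket dynamodb:*, the sketch below builds a policy document scoped to one table and a named set of actions, and refuses wildcards outright. The helper name `table_policy` and the region, account ID, and table name are made up for the example.

```python
import json
from typing import List


def table_policy(region: str, account_id: str, table: str, actions: List[str]) -> dict:
    """Build an IAM policy document scoped to one DynamoDB table and an explicit action list."""
    for action in actions:
        if "*" in action:
            raise ValueError(f"wildcard action not allowed: {action}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": f"arn:aws:dynamodb:{region}:{account_id}:table/{table}",
            }
        ],
    }


# Hypothetical values for illustration only.
policy = table_policy(
    "us-east-1", "123456789012", "customers", ["dynamodb:GetItem", "dynamodb:Query"]
)
print(json.dumps(policy, indent=2))
```

A check like this could run in CI so that an overly broad policy is caught before it reaches an account, though it is no substitute for reviewing what the function actually needs.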

Bad (Cult-like) Practices

No Fundamental Enforcement

Without help from frameworks, DRY (Don't Repeat Yourself), KISS (Keep It Simple Stupid) and other essential programming paradigms are simply ignored. In a perfect world, a team would reject PRs that do not abide by these basic principles. However, with the huge push for the cloud over the past several years, many junior developers have had the freedom to do what they want in the serverless space because of its ease of use, resulting in developers en masse adopting something that doesn’t increase the health of the developer ecosystem as a whole. AWS gives you a knife by providing such an easy way to deploy code on the internet. Please don't hurt yourself with it.

Copy and Paste Culture

Most teams end up copying code to the new microservices and proliferating it across many services. I have seen teams with hundreds and even thousands of functions with nearly every function being different. This culture has gotten out of hand and now teams are stuck with these functions. Another symptom of this is not taking the time to provide a proper DNS name.

DNS Migration Failures

Developers take the generic API Gateway generated DNS name (abcd1234.amazonaws.com) and litter their code with it. There will come a time when the teams want to put a real DNS name in front of it, and now you're faced with locating the 200 different spots it was used to change it. And it's not as easy as a Find/Replace. Searching like this can become a problem when you have a mix of hard-coded strings, parameterized/concatenated strings, and environment variables everywhere that DNS name lies. Oh and telemetry? Yeah that's nowhere to be found.
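A small sketch of the habit that avoids this: resolve every URL through one function that reads the host from a single setting. The variable name `CUSTOMERS_API_BASE_URL` and the helper `api_url` are assumptions made for the example.

```python
import os
from typing import Mapping
from urllib.parse import urljoin

# Hypothetical variable name; the point is that exactly one place knows the host.
BASE_URL_ENV = "CUSTOMERS_API_BASE_URL"


def api_url(path: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve a path against the single configured base URL."""
    base = env.get(BASE_URL_ENV)
    if not base:
        raise RuntimeError(f"{BASE_URL_ENV} is not set")
    # Normalize slashes so the base path is kept and the pieces join cleanly.
    return urljoin(base.rstrip("/") + "/", path.lstrip("/"))
```

With something like this in place, moving from the generated name to a real one becomes a configuration change instead of a code search.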

Microservice Hell

This isn't a post about microservices. However, as teams and developers can decide and add whatever they want into their YAML files for deployment, you end up with hundreds of dependent services and hundreds of repositories. Many have different approaches and/or have different CI/CD workflows. Also, I've found that repository structures begin widely diverging. Any perceived cost savings have now been moved to managing all of these deployments and repositories. Here are a few examples of how developers choose to break up their serverless functions by Git repositories:

Use a monolith for all their APIs.
Separate each API Gateway or queue processor.
Separate by "domain" (e.g. /customers or /invoices).
Separate by endpoint (I have seen developers break out a repository for POST:/customers while maintaining a separate one for GET:/customers/:id and so on…).

Many times, developers switch between different styles and structures daily. This becomes a nightmare not only for day-to-day development, but also for any developer trying to get a quick understanding of how the code deploys and what dependencies it has or impacts.

API Responses

The serverless cult has been active long enough now that many newer engineers entering the field don't seem to even know about the basics of HTTP responses. Now there are many veteran developers lacking this knowledge. While this is not strictly a serverless problem, I never have experienced this much abuse outside of serverless. I've seen endpoints returning 200, 400, 500 like normal. Yet another set of endpoints return all 2xx responses, with a payload like:

{
  "status": "ERROR",
  "reason": "malformed input"
}
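For contrast, here is a minimal sketch of an error helper that puts the failure in the HTTP status itself. It assumes the response shape used by API Gateway's Lambda proxy integration (`statusCode`, `headers`, `body`); the helper name `error_response` is hypothetical.

```python
import json
from http import HTTPStatus


def error_response(status: HTTPStatus, reason: str) -> dict:
    """Build an error response whose HTTP status matches the failure."""
    if status < 400:
        raise ValueError("error responses must use a 4xx or 5xx status")
    return {
        "statusCode": int(status),
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"reason": reason}),
    }


# The malformed-input case from the payload above, reported as a 400 instead of a 200.
print(error_response(HTTPStatus.BAD_REQUEST, "malformed input"))
```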

Then, another set of endpoints implement inconsistent response patterns dependent on some permutation of query parameters. For example:
Query 1:
/customers?firstName=John

[{
  "accountId": "1234",
  "firstName": "John",
  "lastName": "Newman"
}]

Query 2:
/customers?lastName=Newman

{
  "accountId": "1234",
  "firstName": "John",
  "lastName": "Newman"
}
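A sketch of the consistent alternative: a collection endpoint that returns a JSON array no matter which query parameter was used, including when nothing matches. The in-memory `CUSTOMERS` list and the `search_customers` function are stand-ins invented for the example.

```python
import json
from typing import Dict, List

# Hypothetical in-memory data standing in for a real table.
CUSTOMERS: List[Dict[str, str]] = [
    {"accountId": "1234", "firstName": "John", "lastName": "Newman"},
]


def search_customers(query: Dict[str, str]) -> str:
    """Filter on whichever fields were supplied and always return a JSON array."""
    matches = [
        customer
        for customer in CUSTOMERS
        if all(customer.get(field) == value for field, value in query.items())
    ]
    return json.dumps(matches)
```

Clients can then parse every response from /customers the same way, whether they searched by firstName or lastName.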

Inventing New Problems

As mentioned previously, initially deploying these types of services is easy. The reality is there are new problems with these kinds of serverless structures that don't typically occur in server-backed services:

Cold starts - many engineers don't care too much about this. But they suddenly start caring when Function A calls Function B which calls Function C and so on. Without some voodoo warm-up scripting solution, paying for provisioned concurrency, or ignoring it, you may be out of luck.
In the past five years, prior to our work on our flooring installation marketplace, the teams I have been a part of have always chased the latest features because we had been doing workarounds like FIFO queues, state machines, provisioned concurrency, etc. As teams chase the latest features released by AWS (or your cloud provider of choice), things then become even harder to test and maintain since SAM or Localstack don't match these features for some time.
Some awful custom eventing solution because… serverless. Engineers think simply putting an API Gateway in front of EventBridge will solve all their eventing problems. What about retries? What about duplicate events? What about replaying events? Schema enforcement? Where does the data land? How do I get the data? These are all questions that have to be answered or documented in a custom fashion. Ok, EventBridge supports a few of these things in some form but it does leave engineers chasing the latest features, waiting for these to become available. However, outside of the serverless cult, these issues can be solved with Kafka, NATS, or other technologies. Use the right tool.
When it’s not okay to talk about the advantages and disadvantages of serverless with other engineers without fear of reprisal, it might be a cult. Many of these engineers say Lambda is the only way to deploy anymore. There isn't much thought to offline solutions when things need to be run onsite or separated from the cloud. For some companies this can be a fine approach. However, many medium to large organizations have (potentially) offline computing needs outside of the cloud. Lambda cannot provide a sensitive, remote pressure device real-time updates in the event of an internet outage in the middle of Canada during winter.
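To make the retry and duplicate-event questions in the list above concrete, here is a minimal sketch of what a custom eventing layer ends up having to write by hand. It is not EventBridge's API: the class `IdempotentConsumer`, the `id` field on the event, and the in-memory set of seen IDs are all assumptions, and a real system would need a durable store and a dead-letter destination.

```python
from typing import Callable, Dict, Set


class IdempotentConsumer:
    """Skips events already processed and retries failures a bounded number of times."""

    def __init__(self, process: Callable[[Dict], None], max_attempts: int = 3):
        self._process = process
        self._max_attempts = max_attempts
        self._seen: Set[str] = set()  # assumption: a durable store in real use

    def handle(self, event: Dict) -> str:
        event_id = event["id"]  # assumption: every event carries a unique ID
        if event_id in self._seen:
            return "duplicate"
        for _ in range(self._max_attempts):
            try:
                self._process(event)
            except Exception:
                continue  # try again until the attempts run out
            self._seen.add(event_id)
            return "processed"
        return "failed"  # the caller would route this to a dead-letter queue
```

Replay and schema enforcement would each need similar bespoke code, which is roughly the work that purpose-built tools such as Kafka or NATS already cover.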

So, how do I get out of the cult?

In this article, I didn’t plan to address the many options you have to extricate yourself from the grips of mindless serverless abuse. If you're interested, please leave a comment and I will write a follow-up on the different solutions and alternatives to serverless I’ve found as well as some tips to incrementally shift back to a normal life. What I did want to do here was to express the pains I have experienced with serverless technologies over the past couple years now that I have helped architect more traditional VM and container-based tech stacks. I felt compelled to ensure individuals, teams, and organizations know what serverless can really mean for long-term sustainability in an environment.

What does serverless do well?

Deployment and scaling. That's really it for most organizations. For a lot of these organizations, it's hard to find the time, people, and money to figure out how to automatically provision new VMs, get access to a K8S cluster, etc. My challenge to you is to first fix your deployment and scaling problems internally before thinking about serverless compute.

Conclusion

Serverless is one of the hottest new cloud trends. However, I have found it leads to more harm than good in the long run. While I understand some of the problems listed above are not unique to serverless, they are much more prevalent there, leading engineers to spend most of their time with YAML configuration or troubleshooting function execution rather than crafting business logic. What I find odd is the lack of complaints from the community. If I’m alone in my assessment, I’d love to hear from you in the comments below. I’ve spent a significant amount of time over the last few years working to undo my own serverless mistakes as well as those made by other developers. Maybe I’m the one who has been brainwashed? Time will tell.

Discussion (16)

Elias Brange • Mar 30 • Edited on Mar 30

Most of the points here sound like a result of bad communication and organization, and all of them could surface without Serverless as well. Teams that build stuff without communicating on certain standards are bound to build services that don't interact very nicely with each other, regardless of whether they are running in a Lambda, container or VM.

Depending directly on other teams' functions?

Stop doing that, and expose each individual service as an API. Then it does not matter whether there are lambdas, containers or even VMs behind the API. Care must be taken to not break the APIs, and that needs to be done regardless of what kind of compute you are using.

10 different ways to authorize

Also boils down to communication between teams. If all services are exposed as APIs, it would be preferable to use a common auth mechanism for them. Here you could even centralize that function to a team and let them be in charge of an authorizer that other teams can use in API Gateways. For other services that do not run behind API Gateway, expose the authorizer functionality as an API that can be called from inside containers in an API middleware or similar.

Account chaos

More an organizational issue. I wouldn't want to log in to a console where 20 teams have 4 EC2 instances each either. Use AWS Organizations and automate creation of accounts, preferably one account per service (or some similar scope).

DNS migration

Same goes for ALB, NLB and other services.

leading engineers to spend most of their time with YAML configuration

Kubernetes says hello.

Brent Mitchell • Mar 30

Hey there, Elias, appreciate the feedback! There is no doubt that team communication is critical, serverless or otherwise. My larger point was that focusing on business objectives (that move business forward) rather than technical ones should be the priority of every productive team.

Additionally, I probably could have been more clear that it doesn't matter whether a developer is hitting an API or lambda directly. The testing difficulty of serverless means many developers and teams are first testing in their dev and test environments. This means my functions/apis may start suddenly failing in dev because another team is making a change to their api, which is now broken for whatever reason. Again, this is because it is very hard to impossible to test anything locally.

As I mentioned, appreciate the feedback!

Elias Brange • Mar 30

Hello there! Great answer! :)

I totally agree on the part that testing distributed systems, where different teams are responsible for different services, is very hard. And I also agree on that all the serverless offerings still need a bit of work to mimic it perfectly locally. However, I feel that this problem would still be there even if the compute layer was running on something else.

How we solved it at my previous place was that every team had a development environment, where they experimented and stuff was not expected to always work. Other services were mocked where needed. We then had staging and production deploys of every service where all services were expected to be stable, with integration suites testing and verifying that the actual business use cases (often spanning multiple services) were working as expected.

Since all contracts between teams were defined by APIs, it didn't really matter what was running underneath. Some teams used API Gateway + Lambdas extensively, others used ALB + Fargate, and some teams used NLB + EC2.

tdsanchez • Apr 20

Sounds like your dev teams never heard of blue-green release patterns.

It sounds like your organization REALLY needs to consult with cloud native practitioners. Your organization will continue to fail at the cloud until it does and makes changes on how it communicates and enforces best practices.

Brian Burton • Mar 30

I'm a big proponent of serverless, especially in bootstrapped startups, but it has its place. We have long-running processes that we have running on K8s, but for everything else we use Google Cloud Run and it's frankly amazing.

I agree with Elias, your post sounds like technical decisions are being made by people who don't have a full understanding of how it all works individually or together. The "API Responses" section seems like a rant about the quality of your engineers, to be honest. There are definitely solid negatives to using serverless, but your post comes across not as, "serverless is bad" but "we're bad at serverless" if that makes sense.

Questions I asked myself reading your post:

Why aren't you using an OpenAPI schema and tests to standardize and validate your API endpoints?
Why are you allowing developers to deploy code that hasn't been reviewed for consistency and quality?
Why aren't you containerizing your services?

At our company we have a very elegant environment on GCP that has been properly architected and, frankly, is working incredibly well. We have processes for each step, from pushing code to CI/CD to building new services from skeletons, that have helped maintain consistency across the product. We utilize services such as Pub/Sub when appropriate if we need to run background processes that may take several seconds to minutes and may fail and need to be retried. All internal security including the compartmentalization of customer data is handled by IAM. It's elegant and easy to manage.

Also if it helps, many of your complaints about the quality of the services at AWS are why I prefer Google Cloud. For example, Anthos Service Mesh is the solution you're looking for for Problem #4. From project management to permissions to serverless, the entire experience is so much more logical and easier to manage. Hellooooo folders!

Brent Mitchell • Mar 30

Hey there Brian,

Totally agree, in a perfect world, a senior engineer or architect would guide these decisions. However, there are instances (especially as things scale, like in large organizations) where it is out of a developer's control and technical decisions have been made without fully understanding the tech. The path is wide for poor decision-making in a serverless application where many times those configurations or options are not available in a non-serverless application (e.g. domain name, choosing whether you want auth or not). This can lead to a lot of pain. Why blame or subject yourself to the lack of knowledge when the human problem can be eliminated? (Yes, I realize this approach still necessitates a senior engineer or architect laying the proper groundwork).

It's not always (and rarely in my experience) a breaking schema change when an API breaks. For example, OpenAPI test validations don't do much when a developer forgets to add an environment variable for some refactoring. Now, their function is breaking everyone as soon as they deploy, even though it passed locally fine because they had their environment set correctly.

Services are containerized. The problem is these container definitions change too easily as people move teams and teams grow in their understanding of serverless.

Can these be prevented through code review? Maybe. But now the code reviewer has to check all of these things before even beginning to check actual business impact of the function. Again, why not enforce this outside the code review process to eliminate the human aspect?

Adam Hisley • Apr 4 • Edited on Apr 4

It's nice to hear critical perspectives, but I can't agree with most of these points. Serverless is "cult-like" because.... it doesn't support KISS and DRY "without help from frameworks?" This sounds more like a local problem than a fundamental critique of serverless architectures. You seem to acknowledge that good frameworks are the answer here, so what's wrong with CDK, serverless.com, etc.?

In addition, it seems like many of the listed problems boil down to "junior devs with weak support and practices ran into these issues." But have junior members on standalone DevOps/Ops teams never created an overly permissive IAM policy? Never copy/pasted YAML? Have devs never cut corners on local testing prior to serverless?

I think it's important to differentiate problems that are endemic to serverless/cloud native architectures from generic challenges that apply to all organizations adapting to new technologies and software principles. In my experience, the real cult-like mentality in software development is the pernicious belief that the benefits of innovation can be bought cheaply:

-- "We can just lift & shift our data center into the cloud, and reap massive benefits overnight!"
-- "QA automation totally replaces our need for dedicated QA!"
-- "Software devs will handle cloud infrastructure, plus all their traditional responsibilities, with minimal support! And output will be higher than ever before!"

In many such cases, after the hype wave crashes, the most successful software businesses are the ones who followed through on innovative practices, but also paid their entry fee and have an impressive list of "Challenges we solved..." by the end of their journey. If your organization's leadership isn't prepared to support a serverless transformation, I would certainly stick to more traditional and practiced approaches, but I struggle to imagine many of these problems you listed going away with a simple tech stack change.

Steven Chu • Mar 31

I've always felt the same way about serverless, but never knew it was okay to talk about :P

I'm definitely repeating a lot of what you say but I thought I'd throw my hat in the ring to both sympathize and keep myself honest. My biggest pain points are as follows:

Copy pasta -- yes layers are a way of modularizing your code but each layer comes with a version bump and, let's be honest, it's much harder to develop a stable library than it is to simply import from a namespace; yes those namespaced utils may change a lot over time but I'm more okay with that than constantly redeploying a new version of my layers just because I keep writing buggy code; OTOH, maybe I personally just am not that very good at writing modular code and that no amount of process fixes anything and that really what's the cost of a version bump?
An AWS resource for everything! With every new lambda I have to add a metric, a metric filter, a log group, and probably at least 2 other things I want to get basic log lines in CloudWatch and then on top of that I have to remember what each resource I create was named! OTOH, building a centralized logging system is no walk in the park either and bespoke solutions are never really as ready-out-the-box as advertised
Local testing. I'm running into this now where my laptop is just too darn slow and I constantly forget what's a dev dependency and what's a prod dependency and whether I need to update my .env files to point to the stage or prod instance when I'm testing. I have no clue how this would work if it weren't a side project :P I've loved, loved, loved dedicated dev environments at my 9-5 but even there (OTOH) sometimes dev isn't quite up to par with prod and it does take time and a process (and money and servers!) to make sure that you have a decent dev environment to begin with

Adam Crockett • Mar 30

I never fully invested in serverless because things worked just fine the way I had them. I have bookmarked your post because I will read it. I want to believe I'm the one with the problem, and I hope you convince me that perhaps I'm not.

Alex Lohr • Mar 31

Serverless is such a misnomer, as it is the exact opposite: managed servers as a service. You get what you pay for. It's your own decision if the advantages outweigh the disadvantages. It's certainly a nice choice for more front-end oriented people.

Mike Talbot • Mar 30

Great article, thought provoking and mirrored at least partially in my experiences.

Thomas Hansen • Apr 3

I worked for a company once who were using CosmosDB. All inserts had to go through stored procedures because we needed atomicity (sigh!), we paid €8,000 per month in Azure fees, we couldn't do partial document updates complicating our code resulting in forcing us to manually in code implement locking, and we were basically using the things as a "badly implemented RDBMS".

Use the right tool for the right job - For software that is (almost) never CosmosDB, Lambdas or Dynamo ... :/

DanielSanchez97 • Apr 5

I agree with a lot of the things that were brought up in this article about the shortcomings of using a Serverless compute platform. I think Serverless compute platforms created fundamentally different deployment primitives, which is the underlying cause for a lot of the problems in this article. I am currently trying to build a framework to solve these underlying problems at staging.cdevframework.io/ if anyone is interested in trying it out. It definitely still has a lot of rough edges, but I think it shows some promise.

Alex Hutton • Apr 19

Good article. You're definitely on to something, but I couldn't help but disagree with you on almost all of your specific points. The problems you raise all seem solvable.

When I look at serverless, I see something beautiful and unique (let's call it a monolith), that's been shattered into a thousand pieces. Now each piece can be purchased individually as a turnkey solution. Is this genius or is this a crime -- a difficult question to answer. If an experienced developer is not used to serverless, it's only natural for them to feel like something is wrong when making the adjustment. On the other hand, you're five years in and you still have a problem with serverless. I'd be interested in hearing more of your perspective in terms of comparing serverless with concrete examples of "non-serverless" and explanations as to why the latter is better.

Tony Nguyễn • Mar 31 • Edited on Mar 31

nice and interesting viewpoint!
Thanks for your sharing!