The Big DevOps Misunderstanding (wolfoliver.medium.com)
516 points by WolfOliver 2 days ago | 308 comments

As someone who has led a devops team, and still has a devops team reporting to me, I have also struggled with the idea of having a team called devops, since it does go against the devops philosophy of devs owning their software in production.

However, in our case (and probably many others), our devops team is responsible for creating and maintaining the tools and processes that allow the dev teams to manage their systems in production. At a larger organization, it doesn’t make sense for each individual dev team to have to create and maintain their own deployment and management systems… that would end up with a lot of duplicated effort and a wide variety of systems that would need to be maintained. Devops creates the tools and procedures, and dev teams use them.

You might say this is just sysops, but I think the key difference is that the dev teams are the ones who are using the production management tools, not the sysops team. Devops is not the team that gets paged if a system has an issue; the dev team that owns it is paged. Devops is only paged if the tool that the dev team is using to handle the issue has issues itself.

Being a devops guy these last 2 years I have noticed that besides doing what you just mentioned, I am basically end user support for developers.

And through this I've noticed that developers can be just as bad at communicating issues as any end user I've ever dealt with.

Just because they've written the system end users use, does not mean they're better at reporting problems to me, because their problems are about the pipelines, the kubernetes clusters, the containers, anything outside of their programming code basically.

So I still need to use all the communication skills I learned decades ago when I was doing end user support for people who didn't even work in IT.

In my experience, it's because the tools that are given to the dev team are opaque, poorly documented, made for a general case that rarely is sufficient, and usually locked down with some role based access control to the point where, even if I can figure out what's wrong with some pipeline, I rarely have the access to change something. I have an AWS CodePipeline that I can see running, but the logs from each step are sent to a CloudWatch instance running in another AWS account, which only the devops folks can view. I have a Jenkins pipeline that was cookie cutter from what the devops team offers. If I wanted to actually grok what's occurring during the pipeline, I'd have to jump between 3 or 4 GitHub repos, and also account for the versions/branches on each one.

It really feels worse than back in my websphere days, where in order to see app logs, we had to submit a ticket to the hosting teams and wait 2 days for a .log file to be sent back through email. I'm consistently left wondering if we've actually gained anything.

Yeah, one of the reasons I left a previous team was the company's attempt at embracing this. The standards were set down from a central platform team, the access control was held by a department-specific operations team, and developers were now "responsible" for the production deployments but granted neither the agency nor the access to do shit about it. So you'd be paged about an issue with a load balancer setup that is understood by 2 guys in the platform team and configured by a different guy, and he'd have to send you screenshots of the config while you dug through the source of the platform team's tool, because of course there are no docs.

I think the above is what happens when a devops tools team doesn't do discovery and user interviews.

We've been at a point of software and development where we can build pretty much anything. But people are still making the same mistakes in figuring out what we should build.

Fix the cause, not the symptom.

Edit: And because I know it's coming... if IT security (or insert other org here) is the reason things can't be better, then it becomes the devops tools team's job to cajole them into doing it anyway, on behalf of their users.

I understand where you're coming from, but I don't think you really need "discovery and user interviews" to tell you that app developers need to be able to see their deployment logs, or that having your deployment code spread out across 3-4 repos with frequent changes makes it harder to troubleshoot. I think a lot of people might read this comment as blameshifting, where obvious flaws in the engineering and reliability of the system are blamed instead on a failure to define the specifications properly ahead of time.

Here's the reason, 9 out of 10 times, this situation persists -- no one realizes it's happening.

No one realizes some dev teams can't see their own deployment logs. No one realizes teams don't have the correct Jenkins access. No one realizes there's a mysterious gatekeeper to use the corporate time series database as a service. No one realizes all new containers are transferred to a few people who have the rights to publish them.

And when I say "no one realizes", I mean two things. One, the people who do know, because they're the ones trying to do it, have given up bitching about it, because they did for years and nothing happened. Two, the people who can actually change things don't know, either because they actively ignored / forgot or because the requests never reached them (usually because a middle transport manager didn't understand what was actually being asked).

And so... the status quo prevails, and everyone has a vague sense that something isn't working, but doesn't know what to do about it.

And to that problem, the best solution I've found is to get the fixer and the problem user as close as possible, wherein the problem is usually revealed as a relatively simple change or feature oversight. And for that, discovery and user interviews are the most reliable method.

> And to that problem, the best solution I've found is to get the fixer and the problem user as close as possible

You mean put development and operations together, into some kind of dev..ops.. :)

I'm sorry for the snark, it was not meant as disagreement with what you said. The whole idea of the role was to try to remedy that exact issue, but old habits die hard and sometimes things just change names but stay the same. I shouldn't throw stones, I'm struggling at our shop to steer outside of that behavior as well.

devops is the new agile

The only good thing that has come out of the rise of devops is all the conferences I have been able to attend.

This is why I pushed for a developer advocate role at my last place. The idea was shit on pretty quickly.

I actually think that it's an issue of company structure. Once you concentrate ALL deployment/operational power to a single DevOps team, you really lose the power to touch individual teams and make human budgets clearly. This is more of a "cost-center" move that implicitly signals that company X needs a single platform and individual requirements are not going to be prioritized.

The same thing with a single DBA or whatever admin team. My dream team structure is a task-force like one. Each team has its own DevOps/DBA. They don't have to be dedicated roles, but someone in the team needs to be well-trained on those topics.

But I guess this model never scales in larger corporations so eventually they all adopt the One X-admin team model.

> My dream team structure is a task-force like one. Each team has its own DevOps/DBA. They don't have to be dedicated roles, but someone in the team needs to be well-trained on those topics.

Looking back at the last twenty years, oscillating between central sysops and everyone doing their own thing (is that "devops" ?), I feel that an optimum requires both: local skills to iterate fast, central skills to consolidate. Striking a budgetary balance is difficult, but at least ensure that some local skills remain - sell it to central ops as a way to get the trivial demands off their backs !

This is good point and definitely more doable when the company is already centralized.

IMHO, the "centralized team" + "decentralized liaisons" model isn't explicitly used nearly as often as it should be.

It usually happens in practice ("Sarah knows someone in security"), but there's no reason "devops liaison" can't be an explicit 2nd+ hat for someone on a team to wear.

Centralized teams are necessary for budgetary reasons: one person on every team asking for a devops product will never make a coherent case to management for why it's needed, but a team will.

But having been on both sides of the equation, the central-decentralized relationship usually breaks down because there's always a new face on the right side of that equation, and that requires reteaching / recoordinating all the basics. More consistency on the user side (a hat) helps significantly.

And ultimately, it's a trust problem: do I (central person) trust this person who's asking me for something? If they're new to me... probably not very much. If I've been working with them for awhile... probably a lot more.

Agreed. And I don't really see a good solution for this. I guess we have to recognize that when companies grow and especially when structures are moved around, the original "agile" efficiency is not going to be there so we need to slow down in whole.

Do you have a good resource on how to do discovery/user interviews? I’m on a DevOps team working to mature the first few tools we made.

To add on to ethbr0's good response, a few other things to do:

1. Find early adopters of the thing you're building who are willing to use and give feedback (bugs, friction logs) in the early stages of the feature.

2. Build relationships with your users / VIP users and meet regularly to discuss friction points and upcoming features (feasibility, etc). To make it time efficient, make sure you're talking to staff engineers who can speak on behalf of a significant portion of your user base. Make sure your team is closing the feedback loop with these engineers (i.e. shipping things that address their concerns) in a timely fashion.

I'd look at UX books. They've got more formal approaches.

Personally? Make a list of all the teams that use your product. Schedule interviews (30 minutes to an hour for initial) with a few that use it the most, and a few that use it less. Ask them what the best and worst parts are. Ask them details about the things they bring up, and prompt about related areas they might be struggling with too.

If you get any repeats from team to team, assume those are systemic issues and try and find a solution.

The hard part usually isn't figuring out the nuances, but rather realizing that someone wants to do something at all and/or can't because of some restriction. Assume you know nothing about how they actually do what they do and listen closely.

Sending logs to a dedicated AWS account is usually done for compliance. That account will have rules set via AWS Organization policies that limit IAM user rights and protect S3 buckets from being tampered with. I don't see why it should be necessary to ban read-only access to these logs for troubleshooting.

Anyone here know a good legal, compliance or best practice reason for blocking all access, or did the infra/sec team just get carried away here?

IMO the biggest issue other engineers tend to have with security is that it needlessly restricts them to the point where their job sucks. It's just as important to avoid "too much restriction" as "too little restriction".

> Anyone here know a good legal, compliance or best practice reason for blocking all access

The philosophies include:

1. Applying a maxim like "Principle of Least Privilege", or "That which is not allowed, is forbidden". Maybe through choice, maybe due to standards like PCI-DSS and SOC2.

2. Blocking access isn't a problem, because we're going to grant access requests quickly. Just as soon as we fill these five vacancies on the team....

3. Surely bugs making it to production will be a very rare event? Aren't these developers testing properly?

4. Don't we have test environments for troubleshooting? Isn't the heart of devops that you have a test pipeline that you have confidence in? If your developers think they're likely to deploy buggy software to production, maybe you aren't ready for devops...

5. Any large organisation will, at some point in its history, have had some chump run their test code against the production system and send every customer an e-mail saying "Hello World". If that happened when you had 30 developers, I doubt hiring standards are any higher now you have 300 developers! Stopping that happening again is common sense professionalism.

6. Would you want Google's 100,000 engineers to have read-only access to your gmail account? Of course you wouldn't, you've got loads of private shit in there. And if you were CEO, would you want an intern on a 2 week placement to be able to download the entire customer database? Of course you wouldn't. People having access - even read-only access - to the production system is a bad thing.

7. Locking down a lot of things gives the security team more political power, as every request to reduce restrictions or expedite a request is a favour owed.

Yea I completely understand the complexities that emerge when you're doing this stuff for a ~1500 person organization. And I know that overwhelmingly the people building these systems are talented and competent and have a stake in things and take pride in what they're doing. My comment was really more in response to the tone of the parent comment which insinuates that these helpless devs just can't figure out what the fuck is going on.

Coming from a team that builds and supports an end user suite of applications. I'm not asking you to bend over backwards like I would have to do for a client. But at least don't enter in to conversations with that attitude. You are in a service organization. You deliver value insofar as you can help the next link in the chain.

Yeah that's also part of our job, to expose the right info to developers so they don't have full access to everything but can still do their job. In my case that involves shipping logs to Grafana so they can view them there. In AWS it wouldn't be hard to create a policy for IAM users so the devs can view logs in the AWS console.
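For anyone wondering what such a policy might look like, here is a minimal sketch in Python that just builds the JSON document. The region, account ID and log group name are made-up placeholders, and the function name is hypothetical. It only covers the same-account case; if the logs live in a separate account (as in the CodePipeline example upthread), the devs would additionally need a role in that account to assume, which is not shown here.

```python
import json


def read_only_logs_policy(region: str, account_id: str, log_group: str) -> dict:
    """Build an IAM policy document granting read-only access to one log group.

    All arguments are placeholders for illustration; nothing here calls AWS.
    """
    group_arn = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the console list log groups so the dev can find theirs.
                "Sid": "ListLogGroups",
                "Effect": "Allow",
                "Action": ["logs:DescribeLogGroups"],
                "Resource": "*",
            },
            {
                # Read-only actions, scoped to the single log group and its streams.
                "Sid": "ReadOneLogGroup",
                "Effect": "Allow",
                "Action": [
                    "logs:DescribeLogStreams",
                    "logs:GetLogEvents",
                    "logs:FilterLogEvents",
                ],
                "Resource": [group_arn, f"{group_arn}:*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical values, chosen only to make the output readable.
    policy = read_only_logs_policy("us-east-1", "123456789012", "/ci/example-pipeline")
    print(json.dumps(policy, indent=2))
```

Nothing in it allows writes or deletes, so in principle it should be compatible with a least-privilege stance while still letting devs troubleshoot.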

You're right, it would not be hard. But I don't have the keys to do that. I would have to talk to someone on the devops team and I would hope they wouldn't write me off as some dumb developer who only knows how to code, whatever that means.

Agreed. I approach shared services like a customer because that is all the power I have. It is supposed to work; that is all I know or have time to deal with. Off to the Slack support channel... if you are real lucky.

> I am basically end user support for developers.

Pull up a seat at the bar with us.

And drink whenever the dev works out that it was their fault the entire time.

enlightenment comes seldom and without thanks, unfortunately.

> And through this I've noticed that developers can be just as bad at communicating issues

I swear to god the next guy who sends me the screenshot of an error log...

You haven't known hell until you've been sent a video of someone scrolling through a log file.

The number of developers who believe that operations is a support channel for development is truly amazing. How am I supposed to know why your software is broken? You wrote it. In reality it's easy enough most of the time, if you can read logs.

Still, there's a weird expectation that operations is a bunch of elite developers, knowledgeable about all programming languages, who just choose to do operations because they enjoy getting calls at 3AM.

> The number of developers who believe that operations is a support channel for development is truly amazing.

The amount of engineering leaders that allow it to happen is far more frustrating, at least IMO it is.

Or another example from last week, a java developer sends me 50 lines of exception text when even I know that ONE of those lines is all you need to explain the problem.

And this particular developer is my main problem in that project. So it's not all developers who are bad, it's about the same percentage as regular end users I'd say.
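To illustrate the "one line out of fifty" point: in a Java stack trace the last `Caused by:` entry is usually the root cause, so a rough sketch like the following (plain Python, the function name and the sample trace are made up) would pull out the line worth pasting. It is only a heuristic; sometimes the useful line is the top-level exception or an application frame instead.

```python
def root_cause(stack_trace: str) -> str:
    """Return the deepest 'Caused by:' message, or the first line if there is none."""
    lines = [line.strip() for line in stack_trace.splitlines() if line.strip()]
    causes = [line for line in lines if line.startswith("Caused by:")]
    if causes:
        # The last 'Caused by:' in the trace is the innermost exception.
        return causes[-1][len("Caused by:"):].strip()
    return lines[0] if lines else ""


# Invented example trace, shortened for readability.
SAMPLE_TRACE = """\
java.lang.RuntimeException: request failed
\tat com.example.Handler.handle(Handler.java:42)
Caused by: java.sql.SQLException: connection refused
\tat com.example.Db.connect(Db.java:17)
\t... 12 more
"""

if __name__ == "__main__":
    print(root_cause(SAMPLE_TRACE))
```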

My boss will text me screenshots of error logs and ask me to help debug his problem via text message.


Blame management. They thought it would be a good idea to go into the DevOps PaaS business.

At a certain point you have to ask how can someone not think about what they're doing at all. If someone asks for a log, what would go through your brain that says, should I give it to the person asking in the most inconvenient format possible? What's the first thing they're going to do with it?

I’ve seen this and I empathize. I’ve seen it lead to a sort of learned helplessness in developers who are supposed to own their whole stack.

I do think it’s unavoidable though. No company will have a truly open stack. There will be customized versions of everything, with caveats only the infrastructure team or developer-tools team are aware of. Naturally these teams become the go to folks to ask questions and resolve issues.

I don’t know of a good way around it. Documentation works to an extent. But at some point you have to decide what parts of the system you want to expose to developers, and whenever that happens you become the layer they communicate with rather than the underlying technology.

I’m a consultant data engineer and am on my third engagement in a row where devops has been the hardest part of my job. Every org has their pet stack with their own poorly documented process. Having devops as developer support is critical to getting solutions out the door.

Agreed. In my experience there are only a handful of companies (Netflix for example) where the pipeline is truly hands off and any developer can deploy to production at any time. We’re talking about the people who literally built Spinnaker, and most teams aren’t that.

In my experience that's not true, but it does take education and cultural change to get there.

Devops is having an identity crisis, but what you describe is what I call devops. It's the subset of engineering that supports engineering by defining/inferring workflows and building systems and tools to codify them. But I've seen it called ops, sysops, internal platform, and even infrastructure -- in some companies, devops and infra are the same people.

Yeah, I am not sure how typical this is, but my company has DC (data center) ops, NetOps, SysOps, and DevOps…

DC ops is in charge of the physical system, from racking and stacking to hardware replacements.

NetOps handles the networking stack, managing the routers and switches, vlans, announcements, IP block management, network access requests, etc.

SysOps handles provisioning systems, working with dc ops to get systems repaired, laying down the OS, managing the inventory system, and anything else the systems need to be ready to run our software.

Then DevOps is responsible for all the software that our company writes, and managing how developers get their software onto the machines, how they configure the software, and how they interact with their software in production.

The lines are obviously not that clean cut, but that is the basic idea.

this seems to be roughly the division i have seen at most companies.

Sometimes, automation and tooling are written by separate teams, or are done by teams of different disciplines of operations/engineering.

I have almost never seen network and system groups NOT being separate. Mind you, this is in DC/ISP environments, where the complexity of networks is usually the mainstay or a heavy dependency of the business.

Who's clicking the button that says deploy this service to production?

The devs do. We have a system that slowly releases software to production in stages, and devs watch their deployment and a-b metrics to make sure the deployment is working.
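Very roughly, the staged part can be pictured like the sketch below. This is not our actual system, just a toy illustration in Python: the stage percentages, the error-rate threshold and the `fake_metrics` function are all invented, and a real setup would wait and watch between stages rather than check once.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> Tuple[bool, int]:
    """Advance through traffic percentages, stopping at the first unhealthy stage.

    Returns (completed, last_healthy_percent).
    """
    last_healthy = 0
    for percent in stages:
        # In real life this is where devs watch dashboards and A/B metrics.
        if error_rate_at(percent) > max_error_rate:
            return False, last_healthy
        last_healthy = percent
    return True, last_healthy


def fake_metrics(percent: int) -> float:
    """Pretend the new version starts failing once it takes half the traffic."""
    return 0.001 if percent < 50 else 0.08


if __name__ == "__main__":
    completed, last_ok = staged_rollout([1, 10, 50, 100], fake_metrics, 0.01)
    print(completed, last_ok)
```

With these invented numbers the rollout halts at the 50% stage and reports 10% as the last healthy one, which is the point where a dev would roll back or investigate.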

Yeah, that's totally reasonable (even a requirement for anything remotely important these days!). I'd consider that setup devs owning their deployments.

It is the J2EE role "Application Deployer"

You're the librarians of the dev world by this definition. That's interesting to think about; that's what good academic librarians do. They also have a very similar identity crisis.

Devops is a relatively new term. Back in the day (in the *nix world), there were only system admins, db admins, etc. Personally, I think full-stack people now really need to know everything: development, system admin, and db admin.

> it's the subset of engineering that supports engineering

only it's not what the idea behind it was. It was exactly the opposite of another compartmentalization. it was supposed to bridge silos not carve out new job titles for quality control

> only it's not what the idea behind it was

It's like the article says:

> At some point in time, this idea has been so widely misunderstood that the wrong definition of DevOps became the right one.

See also: 'agile'.

Disclosure: this is my role.

I work as a sort of "dev tooling engineer" or "platform engineer" for several companies who are our customers.

After years of consulting in the area of DevOps, the pretty smart people I work with figured out that while the developers' user story in the process of devops transformation is a "revelation", the one thing that usually gets forgotten is the maintenance and upkeep of those tools. So we set up a pilot project where we started hosting, maintaining and upkeeping a set of tools consisting roughly of Jira+Confluence, Jenkins as the CI/CD server, Artifactory for binary hosting and Bitbucket/Gitlab for source code.

It has worked damn well for the first pilot company (signed on 5+ years ago), the second one, and by the third the platform started selling itself (getting a public reference from the first couple companies helped a lot). The second reason for the easy buy-in is to use known software that our customers could host themselves if they had to. This practically means that we aren’t vendor-locking them with our own proprietary tools. We just take care of off-the-shelf tools.

If there is a competent tooling team that takes care of the intricacies of the tools and maintains their uptimes, the dream of "you build it you run it" can still live. There is a giant gap between "you build it you run it" and reality -- much how there is a gap between a pilot being able to fly a plane and a pilot knowing how to maintain an Airbus, the airport and the runways.

It says something about the job market that losing a customer pretty much never happens. It’s so much easier to find sysops/devs than it is to find people who specialize in tooling.

your company sounds like it would be very useful to many startups. The tooling stack we went for is:

Jira+Confluence, CircleCI as the CI/CD service, Artifactory for binary hosting, and GitHub+Git LFS

CircleCI had lower friction than Jenkins, and GitHub Actions wasn't out at the time!

Last couple of places we’ve called it “dev platform” or “dev tooling” team.

I usually say that it’s like one of the business domain development teams, it’s just that the domain is “dev”. The business’s devs are the stakeholders, and the tooling/platform team members are usually already domain experts that write code.

Having a “devops” team would feel like having a team called “agile team” or something - you do “devops”, you can’t build it.

> Having a “devops” team would feel like having a team called “agile team” or something - you do “devops”, you can’t build it.

Too late. The industry has chosen to deviate the word.

For years as a sysadmin I was arguing with people that devops is a methodology and not a function. Now I have the Sr Devops Engineer title. I know I am still just a sys and platform engineer whose job is to facilitate the life of developers, but hey, the industry insists on calling me devops, and I have other fights more important than fixing my title and team name.

I just tend to think that 20y ago people were relaying memes about sysadmins being motherfuckers who just randomly denied everything to clueless developers, while the devops term emphasizes the collaboration side more. I hope the devs I work with are happy working with me; I sure have a good relationship with them. If that is what the industry now chooses to call a devops, I am a devops.

Exactly. Just call it the platform/tool team.

Realistically though, I'm 100% sure that absolutely nothing will stand in the way of the DevOps job title being denigrated.

Companies have for a very long time used title inflation to bait and switch the unwary software engineer into editing enterprise software configs, and will continue to do so for all of time..

It wasn't necessarily the standard or even the majority, but we had that with Sysops whenever "the build team" or "the deployment team" was building self-serve tools.

And I definitely agree with the article here that if you're using PaaS, you don't need a DevOps team. And there's certainly a better feedback loop there.

But many will argue that PaaS _still_ isn't there yet for highly complex applications. Unless you go to a point where you basically need a team managing the eventual complexity that PaaS adds back to things for you to achieve complexity.

In any case, this is mostly pedantry. We should all strive for simplicity as much as possible, but complexity in apps breeds complexity in organizational structures. DevOps was never a silver bullet, and there probably won't ever be one.

Evidently any PaaS has to have _some_ opinions on the kind of apps it is going to run. You trade some assumptions against magic (and time you won't spend on making decisions that are very possibly trivial). You could very probably design some pretty complex applications on any PaaS worth its salt... as long as you accept its assumptions and opinions. But when you do you might very well end up with the only actually achievable form of the original DevOps. Weirdly if Devs actually touch Ops it's probably no longer DevOps ... just people being cheap and asking other people to have two jobs for the price of one...

To adapt Greenspun's tenth rule:

You have a dedicated team building an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Google App Engine (or Heroku, or Digital Ocean App Platform, etc).

I know my company is likely not the norm, but we can’t use Google App Engine or anything like that because we are a CDN. Our infrastructure is core to our product.

I know this might be unreasonable, but I always get frustrated when everyone suggests that anyone running their own infrastructure is clearly doing it wrong. I get that most dev groups are creating web apps and should use a PaaS, but there are also a lot of us who are CREATING those platforms that other companies use. We also have to write and deploy software, and sometimes I wish I could have discussions about how to do that without most responses being that we shouldn’t do that.

Maybe general developer forums like HN aren’t the best place to have those conversations, but I haven’t found great active forums dedicated to platform developers.

I'm sure there are many people in the same boat, but I don't know that there could be the forum you imagine, not with our current cloud landscape. You're building a CDN, Netflix's video CDN is different from Akamai's is different from Fastly's is different from yours, internally and externally. More mainstream; AWS engineers don't know GCP's architecture don't know Azure engineers' needs. And just right there are 7 different dialects of "cloud".

There's tons of interest, but much of the knowledge is covered behind corporate NDAs/employe and getting people aligned on terminology and technologies is hard because so much is hidden behind NDAs. The current solution is either to work for a big company where there's rigorous internal discussion or go to tech conference after-parties and hope to meet the right people at the right time/place on the Ballmer curve to have a productive discussion.

If your infrastructure is core to your product then your product teams should be working on it. There are definitely teams who should be making infrastructure because it's a differentiator for them. But an internal platform team seems like an antipattern, because at that point you've got the worst of both worlds - your platform is just as distant from your developers as something like AWS, and is unlikely to offer better functionality than AWS.

Kind of, but it has better observability, more reliable networking, and supports the specific subset of features needed from that platform, and better flexibility in terms of what you actually need, while not locking us to a specific way to write applications (like google app engine).

Heroku's simplicity for deploying is honestly a thing to aim for, but using Heroku at scale while keeping things consistent and secure is extremely difficult. It's not built for large, complex infrastructures. It's built for small shops and ad-hoc deployments, and it's dope for that.

This describes what I've heard referred to as an internal platform team.

Yeah, that might be a more accurate title at this point.

Part of the reason that the name DevOps was chosen almost 10 years ago when it was started was to help signal to the wider org the cultural shift we were undertaking, and that the expectation was going to be for devs to own their systems all the way through production.

At this point, we would probably change the name, but the team is quite attached to the name after 10 years. We actually did try to rename the team, but everyone still called it the devops team and we gave up on the change.

Agreed - this idea is often cited by fans of the book 'Team Topologies'. It's important here to distinguish between the idea of _an_ internal platform team and a _platform engineering_ team. A platform team is any team which provides services to internal clients - one such team would be the one (or ones) supporting whatever platforms-as-a-service are being used to deploy software to production.

Or infrastructure team, or systems administrator, perhaps sysadmin for short ;)

In the places I've been, a sysadmin has been someone who operates someone else's product -- not a person operating the systems developers use to operate their own product.

This is my observation as well. Sysadmins write scripts or small programs for automation, but it's mostly using those tools for the operation of larger systems developed elsewhere (including other internal depts).

A sysadmin wouldn’t write payment code.

A sysadmin would write the tooling to ensure the payment code could work (failover databases, redundancies) and put safeguards in place so that it does not get traffic that is not intended for it.

Sysadmins used to write the code that touched thousands of machines at once, developers used to write the more complex code which were the bread and butter of the company.

But where I’ve worked, it was never a silo. Just a different mindset, and everyone worked together to do the best they can. That was before “DevOps”.

> sysadmin would write the tooling to ensure the payment code could work

Uhh if by tooling you mean a bash script to deploy ssh keys to prevent team A from accessing a machine owned by team B.

You really have no idea what you're saying.

Dijit is completely correct.

A junior sysadmin executes checklists, writes simple scripts, reads logs, and solves common problems, all while learning the complex supersystem of their workplace. They learn configuration of systems in use and how infrastructure is arranged and commonly changed.

A sysadmin helps to create checklists, writes more complex code to work in an automation system, and generally solves problems with non-obvious causes and/or remedies. They know all the routine events in their workplace, and add to documentation. They build, configure, test and recommend new tools and solutions to users, devs and the rest of the ops staff.

A senior sysadmin works on policy, debugs automation, solves complex interactions, and has an excellent understanding of complex systems and networks, not just at their workplace but in interactions and hypothetical interactions with the Internet at large.

You've nailed it.

DevOps guys are senior sys admin in the cloud (and that last caveat isn't mandatory). People shit on sys admins all the time but the only real difference between senior sys admins and DevOps engineers is pay packet, tools and responsibility.

> Devops creates the tools and procedures, and dev teams use them.

Which is one of the things the whole institution was designed to solve.

I realize that there’s quite a lot of devs these days that got in it solely for the money, and don’t care one whit about what happens after they throw the app over the fence.

But the whole reason I liked the movement was that it gave me more control over my own destiny. With half baked devops systems and processes that I have to follow it’s (mostly) no different from the BoFH of yesteryear.

It isn't just a lack of care. What I have seen is some engineers coming on and focussing on the domain knowledge, others focussing on the pipelines, and they organically divide in two and provide each other help in good cases or each whine about the other not understanding the job in the bad cases. In the bad cases they are both underestimating dramatically how much the other has learned.

Honestly, and I understand this is possibly a hard pill to swallow for anyone that really believes in devops, most devs don't necessarily think in the right mindset for high reliability and performance often required for that role. And most sysadmins don't think in the right mindset for application development and getting features specced and out the door. What you're left with is a position mostly staffed by people that are a poor fit, even more so than the regular problems hiring for either.

It's sort of like expecting you can get by with only sales engineers. Having someone good at either job is hard enough, expecting you can staff only those that are just as good at both? Reality is you'll just end up with people deficient in one area or the other.

You might get by with a small core of people that can successfully straddle both roles, as long as you don't expect them to fully handle the needs of both departments, but I suspect hiring only for either department and fostering good communication and encouraging them to work together on projects that help deployment and availability yields better results. Maybe that partnership will grow you some good devops people if you're lucky. Maybe you should resist making a department out of them if that happens.

You lead a sysadmin team. DevOps was supposed to bring systems administration and development under the same roof. All of it. It’s a failed movement.

I’m curious as to what you’re maintaining. Once we set our pipeline up with a consultant agency and build step-by-step guides on how to launch a docker container through it, the maintenance of the pipeline has required no human hands.

This is in an enterprise sized organisation with 10,000 employees.

We still have a sys ops team to handle security, network and all those other things, but deploying software? That’s really easy for our developers.

We use Azure DevOps, we deploy to Azure apps with a dev, testing, staging (5% traffic) and prod (95% traffic) slots. It takes maybe 5 minutes to go through the to-do-list when a new application or function or whatever is created and we have a very fascist naming convention. This isn’t an azure commercial, we’re only in azure because we’re the public sector and already so in bed with Microsoft that it’s the cheapest solution. I’m sure you could do the same in any other cloud.
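For anyone wondering what a 5%/95% split between slots amounts to, here is a minimal Python sketch of weighted routing. It is not how Azure implements it; `pick_slot`, `SLOT_WEIGHTS` and the hashing scheme are names and choices made up for illustration.

```python
import hashlib

# Percentages taken from the comment above; the slot names are illustrative.
SLOT_WEIGHTS = {"staging": 5, "prod": 95}


def pick_slot(user_id: str, weights=SLOT_WEIGHTS) -> str:
    """Map a user to a slot so that roughly `weight` percent land in each slot.

    Hashing the user id (instead of picking randomly per request) keeps a
    given user on the same slot across requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    threshold = 0
    for slot, weight in weights.items():
        threshold += weight
        if bucket < threshold:
            return slot
    raise ValueError("weights must sum to 100")
```

Hashing rather than random choice is a design decision, not a requirement: it trades a perfectly even split for sticky behaviour per user.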

This is partially correct and honestly if your consulting agency wasn't upfront about this they're overcharging the business either today or down the road when the business comes back needing more.

Good services are ones that solve the problem in a reasonable time. Great services accomplish the same as good services, but understand that the code, the company and even the pipeline will evolve, and that the client will need to reinvest either internally or externally again. They will communicate this, along with the limitations, to the client.

There are only two types of clients I've seen that are more in line with the original statement: declining or flat companies, and both are in the "maintenance, do not touch" phase.

I’m not sure what the issue with hiring them back is. We spend around 5% of a full time employee for their services, we can hire them back another 19 times this year, 39 times by next year and so on.
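The arithmetic behind "19 times" and "39 times" can be checked with a few lines of Python. This is only a sketch: it assumes each engagement costs the same 5% of one full-time salary, and `extra_engagements` is a made-up helper name.

```python
# Assumption: every engagement costs about 5% of one full-time employee (FTE) per year.
ENGAGEMENT_COST_PERCENT = 5


def extra_engagements(years: int, already_used: int = 1) -> int:
    """How many more engagements fit before total spend equals one FTE over `years` years."""
    total = (years * 100) // ENGAGEMENT_COST_PERCENT
    return total - already_used


print(extra_engagements(1))  # 19
print(extra_engagements(2))  # 39
```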

Nothing they build us would last forever, but then, we operate 300 IT systems, only a few major systems have a life cycle of more than 5 years. 5 years ago everything we ran was on prem but on VMs on rented hardware, 10 years ago it was physically on prem some on physical servers, now it’s in azure, who knows where it’ll be in another 5 years.

I fail to see what a full time staffed devops department would give us except for added continuous cost that is so much higher than its alternatives that it’s just ridiculous.

I mean, I understand full well that system operators are worried in the current climate, but most of ours simply developed into broader roles that aren’t becoming obsolete.

I worked with a client and a devops team before, and I still think having a separate devops team is not the right approach.

In my case, the team « responsible for creating and maintaining the tools and processes that allow the dev teams to manage their systems in production », was just a regular OPS team. So they built tools that were not working properly and they struggled to focus on the most valuable parts because they were not using them. At least, they were the one getting paged when systems were down.

My next ask for my project with my next client, will be to embed a devops full-time within the dev team. At least they will understand and share the pain of having to work with a git+terraform ticket system and all the bad ideas that our OPS friends think are great.

This makes sense for a smaller org, but what do you do when there are dozens of dev teams? You can try to make every team a full stack team and have a devops engineer embedded with every team, but you will end up with a different process and set of tools for every team. I feel like having a dedicated devops (or tools or platform or whatever you want to call them) is needed in a large org.

I do work with large orgs almost exclusively. I have no problem with individual devops engineers going back to their team after X weeks or at the end of the project. This way they can integrate advances with other teams and share knowledge.

I just don't want them to stay in their ivory tower and keep breaking my balls with their broken terraform templates.

"but you will end up with a different process and set of tools for every team"

This keeps being trotted out, without question, as the reason for the shared devops platform team. So what if each team has their own opinionated build, scan, deploy tooling?

It's used for compliance security scans now, so we are never going back.

The problem with philosophy is that reality has a way of getting in the way.

Think of home improvement. If my toilet is clogged, it doesn’t make sense for most people to call a plumber. That doesn’t mean that plumbers are bad, just that it’s easy for a generalist to do basic tasks.

Likewise “pure” devops rarely makes sense because developers are assigned to projects that end, and sustainment doesn’t require the same skills. It’s cheaper to pay for 20% of a $75/hr ops engineer than 100% of a $150/hr senior dev.

Me too. I worked with customers whose "DevOps" teams were responsible for creating and maintaining the tools and processes, which created a sort of polarization between two groups: devs and the "devops (sysadmin)" team.

Some developers tend to stick with the dev parts and ignore the Docker/helm configs, so yeah, there is some trouble with a true "DevOps" project.

And Google is re-introducing a "K8s stability role", which is a Sys(Dev)Ops role, pushing in that direction (for valid reasons).

So a true shift is difficult, but DevOps time-to-market deploy speed is so high I will not look back :)

The problem is what happens when there are niggling small problems. Do you do the work to handle the problem? If so dev gets a free pass to write unreliable code again. Next what about monitoring? Is it the same as dev environment or all stuff you wrote? How do you test it? So much duplication and overlap.

> You might say this is just sysops, but I think the key difference is that the dev teams are the ones who are using the production management tools, not the sysops team.

FYI, that is sysops, in case that you missed it

The DevOps team can often be the team that handles Compliance, Security and Audit for the platform.

Then there is the infrastructure/deployment team and finally the developers.

At least in larger organizations i think that is a common way of doing things.

If you try that you have totally broken the DevOps idea, but by now the whole world of ops and IT did that ages ago, as anything that is configured in code, e.g. yaml, is called devops.

The article suggests using a higher level abstraction with tools better suited to developers than low level tools such as docker, Kubernetes etc. You can make this into a separate internal product and still keep the devops idea. But as soon as you break out any of the core responsibilities for the product from the team, whether that's business, security, stability or quality, you are moving away from devops. Sure, devops isn't the solution to all problems. But why call it devops when one doesn't like the basic devops idea?

In my experience within enterprise businesses, the DevOps people are almost the worst people to handle Governance, and not too great at end to end Security (I've no idea about Audit).

IMHO I would guess they're too close to the dev people to care about the bigger picture

Would you say it's a platform team?

I used to run a team like this; we called it “the tools group”.

> the key difference is that the dev teams are the ones who are using the production management tools, not the sysops team.

That doesn't make your team a devops team. The devs are in fact the devops team. Engineer is ostensibly a better title so, because reasons, the engineers are just calling themselves engineers.

What you're doing is really internal tooling, dev tooling, infra, prod, those kind of team names.

I'm frankly astounded at the complexity around modern development due to dev ops and how much worse developer experience is because of it. I used to develop in a great IDE with debugging, right click re-run failed tests, I could follow the console right there in a nice, clean window integrated with my IDE, click on stack traces, etc.

Now I'm running my app via multiple docker images, trying to get a buggy remote debugger protocol to work (no local variables, wtff?) where I have to coordinate the container and the IDE, command line wrangling / using print debugging in the container to chase down stuff, digging around through multiple log files, copying-and-pasting stack traces into my IDE to analyze, which kinda works in some cases since the file paths are different, and on and on.

DevOps came and took my nice dev environment away, and I am mad.

exactly right. the rug was pulled out from underneath us and replaced with technology lacking parity.

devops was able to convince the industry that "software engineers own their own infra" is a viable thing to do so they could do engineering work, except, they're out of their element. That's not their job. devops should be pulling in open-source technology that's developed by real engineers and write tiny packets of glue code to make it work to add value.

Never mind devops teams, I remember when my team had a _BUILD_ team. Serious companies still do this, engineers don't touch production. devops and sysadmins make sure that the business software stays operational, they could care less what it is. their mandate is to make sure it's operational and they hold engineers accountable.

k8s is a terrible experience and a terrible abstraction. I'm not smart enough to do better but I'm smart enough to know it's terrible. It's snake-oil. My team is now migrating to the cloud, along with of course k8s, and guess what? the k8s env. in the cloud are to be managed and owned by the software engineering teams, because devops doesn't want to do their job. What a disaster.

at my company, a bunch of devops folks that deployed k8s 80% of the way left soon later to make 2x the pay as contractors at other companies to help them push through their last 80%. This is another reason that there is a lack of parity on the new infrastructure. Unfortunately, everyone is still stuck at 80%. Strange.

> devops and sysadmins make sure that the business software stays operational, they could care less what it is. their mandate is to make sure it's operational and they hold engineers accountable.

It's sad to see something like devops that was supposed to bring teams together is still creating divides and one team sees their role is to hold another team to account.

Last year at the beginning of the pandemic I started implementing a devops infrastructure at work where I work as a lead developer. Initially the idea was to allow developers to run anything (or a selection of things) required for production on their local machine (k3s, kind etc) with added tooling for development (consistently working tools inside the container, but even outside by the use of binding). All I needed was a simple way to run containers and the requirements. Docker soon proved to be inadequate, it lacked certain features and I soon moved to k8s for local development and a devops pipeline.

People commit their changes, I (and sometimes another dev) review them and they are pushed to the test environment, from there to staging and then production (with a way to easily roll back). This is all done automatically with out-of-the-box solutions (fluxcd, flagger, kilo, kaniko etc).

At work, Kubernetes is not considered too difficult to manage. Much can be done in a very consistent and complete way unlike Docker. What am I overlooking that makes k8s to be considered snake-oil by so many?

Edit: Forgot to add. I am so happy with k8s that I even run it at home and on my external servers.

> k8s is a terrible experience and a terrible abstraction. I'm not smart enough to do better but I'm smart enough to know it terrible. It's snake-oil

If you're running Kubernetes on bare metal then I don't blame you for having a bad impression. I can't help you understand that your problems are almost certainly to do with a group of "hardcore Ops" guys trying to do everything GKE or EKS gives you for free while neglecting your actual needs, not to do with Kubernetes itself, which is most useful when Google just creates a cluster for you and you can immediately start dumping manifests into it.

> at my company, a bunch of devops folks that deployed k8s 80% of the way left soon later to make 2x the pay as contractors at other companies to help them push through their last 80%. This is another reason that there is a lack of parity on the new infrastructure. Unfortunately, everyone is still stuck at 80%. Strange.

Again, this isn't really to do with k8s is it? Probably better to point the finger at management for not retaining the people who they needed to run k8s or listening to your complaints trying to work with something 80% done instead of 100%. At my company we used k8s 110% for everything and have amazing support for it, and as a dev it's a dream come true. For people who don't understand Ops, it can be difficult, but when you open the "DevOps" box of saying "Devs own their services in production" for the first time, you're bound to do a lot of learning. I see this in my teammates a lot, who have basically zero understanding of how things actually deploy or run once they've left their local machine.

So much to unpack here. Let's start with:

> devops was able to convince the industry that "software engineers own their own infra" is a viable thing to do so they could do engineering work, expect, they're out of their element.

Devops wasn't about getting ops people to start doing development. It was about getting developers to start doing ops because infrastructure definition became a type of coding.

What's incredible about your inversion of understanding is it's exactly the point of OP's posted article - that people misunderstand devops in exactly the way you just did.

> That's not their job. devops should be pulling in open-source technology that's developed by real engineers and write tiny packets of glue code to make it work to add value.

"Real engineers"? This is a Googleism and their caste system of "real engineers" and "software reliability engineers". The rest of the industry doesn't coddle their precious "real engineers".

Designing a resilient, scalable, frugal, operateable architecture of load balancers, servers, caches, queues, dead letter queues, and all their associated metrics and alarms is absolutely "real engineering" and worth the time and attention of the same engineers that are building the core business value of their company.

> k8s is a terrible experience and a terrible abstraction.

Completely agreed. But note that K8s came from Google and their 2-tier system of engineering. But the quote in OP's article is from Werner Vogels - AWS's CTO, and Amazon is legendary (nay infamous) for asking their engineers to work on ops.

So, I agree with the last half of your post - K8s IS a scam, and should not be operated by an ops-focused team (probably not operated by anyone at all)

But if you had followed the intention of the DevOps movement, not the perverted definition of DevOps, then the k8s scam was obvious from day one.

> "Real engineers"? This is a Googleism and their caste system of "real engineers" and "software reliability engineers". The rest of the industry doesn't coddle their precious "real engineers".

There's no such caste system at Google. SREs have the same bar as the SWEs, and plenty of people switch roles from one to other. SREs are also well-respected and I don't think anyone would consider them as _not_ real engineers.

> Designing a resilient, scalable, frugal, operateable architecture of load balancers, servers, caches, queues, dead letter queues, and all their associated metrics and alarms is absolutely "real engineering" and worth the time and attention of the same engineers that are building the core business value of their company.

You are absolutely right and that's why having people focusing on that is _really_ important. Having worked at AWS previously, I can tell you that when you force your software engineers to do everything and promote them based on _what_ they deliver without caring about the operational load, you end up with crazy oncalls and poorly-designed software. There's a reason why AWS oncalls are infamous whereas Google's are not.

> Having worked at AWS previously, I can tell you that when you force your software engineers to do everything and promote them based on _what_ they deliver without caring about the operational load, you end up with crazy oncalls and poorly-designed software.

I work at AWS. While it's true that promotions can create all sorts of incorrect incentives, the reason why AWS oncall can be brutal on some teams isn't because of promotion-orientated architectures that disregard operational load.

The reason is that the company prioritizes moving fast and shipping stuff to customers. We have a deep backlog of customer asks and we want to ship them quickly. The systems need to be secure, durable, available, and fast - those are non-negotiable. But do they need to be operated automatically? Well, no, that's the easiest thing to compromise - ship quickly, have initially shitty oncall, improve operations behind the scenes.

This is usually why customers are always confused/concerned why AWS ships something new then releases no updates for a year. We don't compromise on security or durability, but we do compromise on engineer happiness during oncall for the first release. So after shipping, the immediate next step is to start improving automation and operations.

This is done with open eyes, and most engineers are happy with it if it means shipping big things quicker, rather than releasing a new version of chat software every year.

None of this is promotion-orientated. It's working backwards from customers.

> There's a reason why AWS oncalls are infamous whereas Google's are not.

Could market share be a significant factor here?

It really isn't.

It's just the cultural difference between the two companies. I have been told that GCP oncalls are a lot busier than rest of the Google, but it's still nowhere close to the suffering I had at AWS. It's an organizational pain and comes from the mindset AWS has towards software development and their engineers.

AWS's mindset is software is useless without customers. Google doesn't seem to have the same care - it's a charity for academic software engineers to spend AdSense revenue on abstract high level computer science problems not caring about practical applications.

It's why most Google X ideas flame out. It's why there's a new chat app every year.

But - it's a net positive for humanity. Google publishes a lot of papers, and I think genuinely has moved humanity forward in the last few decades.

I don't have a PhD and I don't do L33tcode, so I don't think I'm smart enough to be at Google, but honestly I don't know if I would want to be.

I prefer shipping.

I'm sure you're more than smart enough to be at Google :) It's just practice and luck.

I completely agree that people who prefer to ship fast would be happier at AWS. I've heard all the horror stories about how slow Google is, but I think GCP has a great balance between speed and quality. I'm definitely happier here, but I understand why some people would be happier at AWS.

I can't find the post, but I base this on rachelbythebay's writing - who got hired as an SRE, and then tried to become an SWE and was told she couldn't just convert.

But fair enough, I don't work at Google, so I'll withdraw the point. Having said that, knowing that AWS engineers do oncall, and GCP don't (letting the SREs do it) makes me still think there is some sort of two-tier system.

Before you withdraw your point, I guess I have to say my experience is limited to my org and maybe one or two more. Google is a big company, so I can see that happening. I'd still presume it's not the general attitude towards SREs, though.

re: Oncall, we have two rotations:

- one manned (personed?) by SWEs, 9-5, responsible for dealing with customer issues and mandatory.

- one is mix of SREs and SWEs, 24/7, responsible for prod issues

I believe SREs also have their own rotation, but that's 9am-9pm because they are always spread among two timezones. Overall, this is muuuuuch better for everyone involved compared to the AWS oncalls. I remember barely sleeping for a week being the norm on one of the teams I worked at. Our cries for another team in a different timezone, similar to Google SREs, were shut down every single time. "Customer obsession" at AWS means delivering stuff as fast as possible and then throwing the engineers under the bus. I still remember the days I had to wake up multiple times a night to run a command manually (literally 5 minutes) because engineers couldn't take the extra 4 months to do it right.

Thanks, but no thanks. I was at a great team at AWS for ~2 years with great engineering culture and little operational load, but unlike Google, that's rare.

> Devops wasn't about getting ops people to start doing development. It was about getting developers to start doing ops because infrastructure definition became a type of coding.

Reminds me of an idea that any user may expand a system by making plugins/changes, if you provide them a sufficiently humane programming language. Then they still can’t, and a completely new developers category is born.

Seems like another one here, cause it always turns backwards like that.

Given the second principle of devops is amplifying feedback loops, I think a more charitable view could be "the common misunderstanding of DevOps has led to a terrible dev environment".

Fwiw, I'm fully onboard with you, but really love the core message of books like the phoenix project and don't really want it to be lost because of overzealous use of docker.

OMG, I read the sample chapters from Amazon about the Phoenix project and that was a really really good read, thank you!

It’s an amazingly good read. The follow up - The Unicorn Project - wasn’t as good, but definitely worth reading.

Thanks, really LOL a few times during the read.

What you just described is not really related to devops. I agree that a complicated dev environment is bad. However, that isn’t a devops thing. You can have devops with a monolith that is statically compiled into a single binary. I’ve built it and done it. It is quite nice.

we currently do this at our company.

compiled code gets (autodeployed) with SFTP in some cases even.

deployment has become massively complex because of the need to keep production and test environments roughly the same. Deployment to production should not be complex and basically boils down to the following:

- think about a method of getting the binaries installed on the system in a secure way.

- think about sending some state back that the binary has been installed successfully.

- add node/member/app/whathaveyou to the production membership pool/loadbalancer/BGP Route reflector/whathaveyou.
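The three steps above can be sketched in Python roughly like this. Everything here is hypothetical: `install_binary`, `deploy`, the `report` callback and the `pool` set stand in for whatever transport, status channel and load balancer you actually have, and a checksum comparison is only one possible reading of "in a secure way".

```python
import hashlib
import os
import tempfile


def install_binary(payload: bytes, expected_sha256: str, target_path: str) -> bool:
    """Step 1: verify the artifact, then move it into place atomically."""
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        return False  # refuse anything that does not match the published checksum
    directory = os.path.dirname(target_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(payload)
    os.chmod(tmp_path, 0o755)
    os.replace(tmp_path, target_path)  # atomic when both paths share a filesystem
    return True


def deploy(payload, expected_sha256, target_path, node, pool, report) -> bool:
    ok = install_binary(payload, expected_sha256, target_path)
    report(node, ok)    # Step 2: send the install state back
    if ok:
        pool.add(node)  # Step 3: only healthy nodes join the membership pool
    return ok
```

Writing to a temporary file and renaming means a half-copied binary never sits at the target path, which matters if the old binary is still serving traffic.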

I second that. I experienced this drive towards overly complicated pipelines that require a bunch of external systems to work in order to test my application and claims like “it’s impossible to run it locally”.

The answer should be more modular code with clear interfaces and contracts so that I can test as much as possible locally.
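A small Python sketch of what "clear interfaces and contracts" can look like in practice. The names (`MailSender`, `InMemoryMailSender`, `welcome_user`) are invented for the example; the point is only that the application depends on an interface, so a local test can swap the external system for an in-memory stand-in.

```python
from typing import List, Protocol, Tuple


class MailSender(Protocol):
    """The contract the application depends on, not a concrete SMTP client."""

    def send(self, to: str, body: str) -> None: ...


class InMemoryMailSender:
    """Local stand-in that records messages instead of talking to a server."""

    def __init__(self) -> None:
        self.sent: List[Tuple[str, str]] = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


def welcome_user(mailer: MailSender, email: str) -> None:
    """Application code: it only knows about the MailSender contract."""
    mailer.send(email, "Welcome aboard")


# Runs entirely on a laptop, no pipeline or external system needed.
fake = InMemoryMailSender()
welcome_user(fake, "dev@example.com")
print(fake.sent)
```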

> require a bunch of external systems to work

- unit tests should always complete locally

- integration could also complete locally but are allowed more complex scenarios

- in any case, we require developers to pull all their deps through a caching proxy (external system) to keep a lid on what they use, keep a lid on licenses used in our software, and keep open a door to centralized vuln tracking & management.

Though if you can't fully test a software locally by getting all units and integration components to spin up, this is mostly a consequence of bad system design, for example, if the one application depends on another one which can't be run locally, reasons for which are (a) no bootstrappable dev environment (b) no test data management/generation (c) no license (d) too complex for the originating team to provide or (e) proprietary appliance no one but the operating team understands. (a) through (e) all encountered at $currentjob and not happy about it, but what can you do? beg the other team to fix their shit and wait or just bite the apple and deploy to "dev stage" to get the testing done.

This really isn't a DevOps thing. If you hard depend on lots of cloud bullshit and 3rd party services but don't also put in the work to emulate it locally in your dev environment then congrats you can't really run it locally anymore.

I'm in that situation now and the best solution is to just give every dev their own scaled down but complete copy of production with a little magic so the load balancer proxies requests from their dev URL to their laptop. Like it's pants-on-head stupid but I'll be damned if it didn't make onboarding new devs instant.

PyCharm Professional lets you use a remote docker-compose interpreter[1] which brings back that live debugging experience. I did have to work pretty hard to figure out the correct strings to launch Gunicorn and connect to nginx. But it's all worth it to be able to evaluate code at a breakpoint and refresh my code changes without relaunching all my containers.

[1]: https://www.jetbrains.com/help/pycharm/using-docker-compose-...

Did it? I'm using GoLand and have exactly the experience you described. I can debug, rerun tests, terminal integrated, ... It's just a decision you took or someone else took for you.

I prefer to be able to fully run my application locally, with all databases required, and test everything on my machine. DBs and other tools like a simple SMTP server run in Docker, yes, but that's just a decision I took and I could easily run these tools without containerization.

Maybe I'm just in a lucky position to be able to decide this by myself, but I think we can get back there.

I started my career as a backend developer and did a fair bit of frontend development as well. However, I'm solely doing infrastructure engineering for the last 6-7 years.

I think you are kind of mistaken here, as a _devops_ I never forced our development teams to use docker, or any other specific tooling. The thing is, software engineering became a whole lot more difficult and complicated over the years. When I was first learning web-development 15 years ago, notepad was the only thing you needed to do web-development, now look what you need to have a simple CRUD app running locally. Setting up local environments became so much harder, so people found the answer in containers, as you set it up once and you can share it within the team easily. Please do not blame _devops_ for that.

You don't have to run your app via multiple docker images locally, you can still configure everything to work natively. It's just way more difficult because of the dependencies of modern technologies.

Sounds like what you're really angry at here is a case of micro-services taken to the extreme.

A huge part is companies rejecting simpler solutions.

For example, you can use latest systemd to basically replace all of docker and then some. It provides advanced sandboxing, resource control and job control for units and even manages "containers". Which is why my private servers mostly use systemd properly configured to provide what containers are thought to provide. The deployment is switching out a binary or add a directory tree, override a .service line and then reload and restart. Easy.

Though over a 10 year period, companies, developers and operators went all-in on "container runtimes" and here we are.

That being said, things like kubernetes are NOT NEW and NOT DIFFERENT from solutions like corosync+pacemaker (who remembers them?). They do the same thing, with the same levels of flexibility and extensibility and pluggability, just with containers. From my POV people who are surprised by k8s and its complexity never lived through, or have forgotten, the cluster tech of the 00s. It's just about to come together as a circle, and the next iteration seems to be micro VMs instead of containers.

I use containers in production and for local development dependencies but for exactly these reasons I run the application locally via the IDE like I always have, with the option to run it locally via Docker for the times I need to dip into making sure specific bits run there fine. So, I still have all those nice things.

This problem you describe isn't due to devops. It's due to lack of care or understanding of the dev and deploy environments. If your company is smaller you can be part of the solution. If your company is too large for that, just move on. What you are describing is not universal.

I had to basically give up my full time role as a Data Engineer who is supposed to solve "big data" problems (Storage, Streaming, Machine Learning pipelines) etc, and had to solely focus on DevOps because there was no other expert in the team.

We (a colleague and I) basically learned Terraform, then Kubernetes, and then Helm in an extremely stressful environment. I wouldn't recommend that to most people. (Also, I can hear "you don't need k8s", but tbh, it makes the difficult things possible, and that's what we needed).

I've done Ops in bare-metal environments before, and at least there, even if there were new skills to learn, the environment just wasn't so hostile for beginners, and the learning curve was simpler.

The YAML (or heaven forbid, jsonnet) does *not* help. TBH, I'm convinced if the IaC bits were done in a regular Java/Go/Rust SDK style instead of mixing and matching YAML glued together via labels and tags, I would have had a much easier time.

The positive side of that is that now the IaC bits are mature enough that even the Data Scientists in our team write their own Terraform code, and I've upskilled myself quite a lot. I'm slowly going back to solving Storage/Data/ML issues, which I enjoy a lot more.

Maybe I'm just becoming a cranky Software Dev as I enter my 30s. I do _not_ enjoy reading YAML.

I've been in two production environments where IaC was done with Turing-complete languages (Google and Facebook). It's really not better (it just lets the truly impossible/unmanageable become possible) because they were both dynamically typed.

If anything, Terraform comes the closest to being a good IaC language because providers form an API to production resources as opposed to unstructured code/config like Helm or Jenkins/Puppet/Chef. Production in reality is well-typed, and unstructured configuration makes static analysis very difficult and makes diffs only a guess at what will actually be changing when code changes are applied.

What I'd like to see is Haskell (or similar) typeclasses as production API; make it impossible to construct malformed configurations in the first place but allow expressive functions to replace the limited templating functionality that current languages have (why can't helm templatize inside values.yaml? Why can't Terraform templatize backends and why aren't providers first-class objects?). It's very difficult to keep current IaC DRY, and if it's not DRY then it has bugs and there are misunderstandings about how environments are actually configured.
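As a rough illustration of "impossible to construct malformed configurations" plus "expressive functions instead of templating", here is a minimal Python sketch. It is not Haskell typeclasses: the checks run at construction time rather than at compile time, and the `Service` type, its fields, and the limits are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Service:
    """A config value that cannot exist in an invalid state."""
    name: str
    replicas: int
    memory_mb: int

    def __post_init__(self) -> None:
        # Validation runs on every construction, including via replace().
        if not self.name:
            raise ValueError("name must not be empty")
        if self.replicas < 1:
            raise ValueError("replicas must be at least 1")
        if self.memory_mb < 64:
            raise ValueError("memory_mb must be at least 64")


def for_environment(base: Service, env: str) -> Service:
    """Derive a per-environment config from one base definition (DRY)."""
    scale = {"staging": 1, "production": 3}[env]
    return replace(base, name=f"{base.name}-{env}", replicas=base.replicas * scale)


BASE = Service(name="api", replicas=2, memory_mb=256)
PROD = for_environment(BASE, "production")

if __name__ == "__main__":
    print(PROD)
```

Every environment is derived from `BASE` by an ordinary function, so there is one place to read to learn how environments differ.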

Cuelang is a nice compromise: https://cuelang.org

Composition over encapsulation, expressive enough for complex configs, not expressive enough to be turing complete.

+1. Cue is actually very nice. I wish there were more adoption of Cue instead of most other config languages. Config languages are one of those fields where the problem looks easy and the barrier to entry is low enough that we have many different designs that don't provide meaningful enough differences, yet the actual problem is quite difficult and subtle. Cue is one of the few that has a clean theoretical foundation and looks very promising. Too bad we have YAMLs and friends everywhere.

One can write good and bad code in any language. It just so happens that our daily drivers, procedural or functional languages, are the ones where a Software Engineer's mental model + existing tooling have evolved enough to make understanding it _easier_ (as opposed to just easy).

The problem with DSLs and Config languages is that someone will inevitably try to use them in a way they are not designed to be used, and then circumvent the way you're "supposed to" do things. Not only do you lose all the advances in tooling, but people will end up reinventing things that already exist in the other languages.

PS: Of course, I'll pick Static Typing every time.

In terms of “Haskell type classes as production API”, I assume you’ve come across Dhall?


> The YAML (or heaven forbid, jsonnet) does not help. TBH, I'm convinced if the IaC bits were done in a regular Java/Go/Rust SDK style instead of mixing and matching YAML glued together via labels and tags, I would have had a much easier time.

I have mixed feelings about this. I've seen some absolutely atrocious Python and JavaScript code, and I wouldn't want to use either of them, or the flexibility (to shoot yourself in the foot) they provide. Go is better. On the other hand, YAML is just terrible.

I think DSLs such as HCL are better than either option, as long as they're flexible enough while staying limited enough not to become illegible/unmaintainable.

I think the YAML issues center around a simple fact: few are taking 1000+ line YAML (or YAML for X) development seriously. Complex YAML becomes code, and requires just as much refactoring, idioms, reuse, and static checking. At least it's not terribly hard to make a script output it as a format.
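A minimal sketch of "make a script output it", assuming a Kubernetes Deployment as the target. The service names and image references are made up. It emits JSON rather than YAML to stay within the standard library; kubectl accepts JSON manifests as well.

```python
import json


def deployment(name: str, image: str, replicas: int) -> dict:
    """Build one Deployment manifest; labels are defined once and reused."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
        },
    }


# Hypothetical services; the image references are placeholders.
SERVICES = [
    ("web", "example.invalid/web:1.0", 2),
    ("worker", "example.invalid/worker:1.0", 1),
]


def render_all() -> list[str]:
    """Render every service as a JSON document, one string per manifest."""
    return [json.dumps(deployment(n, i, r), indent=2) for n, i, r in SERVICES]


if __name__ == "__main__":
    for doc in render_all():
        print(doc)
```

Because the selector and the pod template share the same `labels` value, the classic hand-edited mismatch between the two cannot happen here.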

You are on the right track. I have exactly the same story, data engineer, working on automation a lot.

YAML, JSON and HCL should be purged from IaC asap. Luckily Dhall can be integrated everywhere we need YAML, otherwise we would go nuts. The biggest failure of k8s is not understanding that YAML does not scale to teams or beyond 100 lines, and that nesting it is especially bad (hello ArgoCD).

A type-safe, not accidentally Turing-complete configuration language, or for that matter any language (hello Pulumi), is much better than YAML ever will be.

Anyways, if I can give you a piece of advice, drop k8s like its hot. It is much more trouble than it is worth. We have phased it out everywhere we could. The companies still using it are in 1 year long implementation cycles with it with 10+ consultants. It is going very well. For consultants, obviously.


Dhall looks interesting. Thanks!

I went into the rabbithole of terraform/HCL and came up with Terragrunt to do dynamic code generation. Combining it together with generous use of `make` keeps me sane, but it's not a long term solution.

Even now, there are things within the system where the boundaries should be separate, but they aren't.

We literally got exhausted and burned out writing our automation, and decided to leave it as it is and then slowly build it up over the next few months. :/

PS: You're 100% right re K8s/YAML. It feels like everything in the Kubernetes world is optimized for writing the code just once. Which, of course, is easy if you already know how all the various bits fit together (true in any system).

But Reading/Debugging kubernetes world is a dreadful experience.

Edit: typo

> But Reading/Debugging kubernetes world is a dreadful experience.

Exactly. It is abstraction, over abstraction, over abstraction for no obvious benefits.

Dhall or other typed languages like Cue definitely help a lot. It seems to me that the only acceptable amount of YAML you could have in deployment config is for a setup small enough that it doesn't warrant using Kubernetes, so any use case where Kubernetes is a good fit quickly becomes a major pain to maintain.

That said, you wrote:

> drop k8s like its hot [...] We have phased it out everywhere we could.

Could you elaborate on what you've replaced Kubernetes with, and for what kind of deployment?

AWS Lambda, EC2, Nomad, depending on the usecase.

"turning complete" : typo?

Yes, thanks.

could you give me a hint why you didn't like jsonnet? I dreaded k8s yaml when needing to (write) and use helm packages and think of indenting levels, but jsonnet via grafanas tanka gave me the option to be concise, writing functions and creating reusable bits. Of course then you're not far away from using a real language "sdk style".

As you so aptly put, k8s learning curve is beginner hostile and needs system knowledge before you go at it, so I'm not sure if this was also with being more familiar with the abstractions - after having looked at k8s for 2-3 years, jsonnet made me finally like interfacing with the k8s apiserver.

Yeah, I think the difference is that I was

a) Learning K8s and Helm Concepts (both different things)

b) Reading/Debugging K8s and Helm YAML (both different things)

c) Ending up at Helm Chart Configs (another config)

d) Figuring out how to set all this to help me deploy the solution I needed and customize it according to my needs

...all within the span of a few weeks.

So by the time I got to some GitHub repos which were using Jsonnet, I was basically done. I'm sure it has its uses and given enough time and experience, I can see how to utilize it more effectively.

yes, both different things. I guess helm is still the most common entry point for people to handle the contents of a cluster. I'm checking out other suggestions in this thread (dhall), but from my personal experience I can already tell, grafanas k8s jsonnet-libs was a second chance to kubernetes I'm glad I got to know (thanks coworkers).

> it makes the difficult things possible, and that's what we needed

Indeed, the problem of just saying "you don't need k8s" is you may accidentally invent an even worse in-house version of k8s.

Though for 80-90% of cases out there you seriously don’t need k8s, and when the small internal IT dept that used to “manage” a dozen virtual machines starts talking about “need to keep up with the times and use new tech” what they are really saying is “let’s gamble infrastructure stability on my personal self education in k8s for my next job” and it’s perfectly reasonable to give them a hard no.

> regular Java/Go/Rust SDK style

I thought that until I started working with Gradle (which is a build script tool whose syntax is nearly-fully-functioning Groovy). I miss Makefiles - they were strictly declarative and, while I've seen some hairy Makefiles, you'll never need a debugger to figure out what they're actually doing wrong.

If you haven’t heard about cdktf, check it out! It lets you do infrastructure as code but in full programming languages like Python, C# and TypeScript. It’s by hashicorp, and under the covers it runs Terraform, so you have the whole set of terraform providers at your disposal.

Ah, CDKTF seems to be the Pulumi equivalent for Terraform. I think I remember seeing it.

We're far too invested into Terraform at this point to learn yet another tool. Oh well. Hopefully the next time I have to implement something like this, K8s ecosystem would also have similar tools (or maybe I just missed something)

There’s a straight up conversion tool iirc, and you can do things like include hcl as well I believe. Cdktf is just terraform at the bottom.

If you haven't heard, check out Pulumi.

I don’t understand the appeal of CDK. It’s the most difficult to understand, overly complex part of IaC I have ever come across.

I want simple yaml/hcl that is easy to understand and write. I don’t need to feel smart writing TF, I need it to be easy to understand to reduce MTTR when things go wrong.

Try to TF manage a large github organization with just terraform and get back to me. I'm having to wrap it with python code already, and my next step is the CDK.

The biggest issue is that parts of the TF for github require references, and it fetches each of those references individually (which will quickly make you hit your ratelimit, while also taking 10+ minutes for just a plan). Wrapping this in python, doing more efficient fetches ahead of time, and feeding the references in yourself can turn a 10+ minute plan into a 20 second plan for 1000+ repos.
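The difference between per-reference lookups and fetching ahead of time can be sketched as follows. `FakeApi` is a hypothetical in-memory stand-in, not the GitHub API or the Terraform provider; it only counts calls so the two strategies can be compared under an assumed page size of 100.

```python
class FakeApi:
    """Stand-in for a rate-limited API; every method call counts as a request."""

    def __init__(self, repos: dict[str, int]) -> None:
        self._repos = repos
        self.calls = 0

    def get_repo_id(self, name: str) -> int:
        self.calls += 1
        return self._repos[name]

    def list_repos(self, page: int, per_page: int = 100) -> list[tuple[str, int]]:
        self.calls += 1
        items = sorted(self._repos.items())
        return items[page * per_page:(page + 1) * per_page]


def resolve_individually(api: FakeApi, names: list[str]) -> dict[str, int]:
    """One request per reference: simple, but burns the rate limit."""
    return {name: api.get_repo_id(name) for name in names}


def resolve_prefetched(api: FakeApi, names: list[str]) -> dict[str, int]:
    """Page through the listing once, then answer lookups from memory."""
    index: dict[str, int] = {}
    page = 0
    while True:
        batch = api.list_repos(page)
        if not batch:
            break
        index.update(batch)
        page += 1
    return {name: index[name] for name in names}


if __name__ == "__main__":
    repos = {f"repo-{i:04d}": i for i in range(1000)}
    for resolve in (resolve_individually, resolve_prefetched):
        api = FakeApi(repos)
        resolve(api, list(repos))
        print(resolve.__name__, api.calls)
```

The resolved mapping is what would then be fed into the Terraform inputs instead of letting each resource look up its own reference.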

the issue with yaml is that it breaks down under real life complexities in larger/more complex environments.

DevOps specialist here and I agree with your complaints about IaC meaning JSON and YAML documents. sigh

any suggestions on learning materials?

You'll find a plethora of learning documentation, but by far, the one talk/presentation that made it click for me was this one: https://kube-decon.carson-anderson.com/

DevOps was always about de-siloing. In that, it is a culture but you need people to "be the change" and that's how I've played a DevOps Engineer for the last five years (well now SRE) and how I believe DevOps Engineers (and SREs by extension) ought to be playing their role today.

Corporate management is always surprised when my first move is to go sit with the developers and ask how I can unblock their workflow, reduce pain and tighten their development loop. Most SREs and professional DevOps people I've met do not do this and insist that their function is to tell the developers how they'll build their app and in which way AKA the same shit from the 90s.

The portmanteau should be a dead giveaway that the number one priority of this job is building cooperation between these divisions. And often, it's not a technical solution.

I think you've hit the nail on the head. In a previous job I tried to outsource "DevOps" to Rackspace, only to learn the hard way that if you can outsource it it's not DevOps—_increasing_ siloization is the antithesis of DevOps. But this is also the point where I think SRE is most compelling. Rather than trying to build an organization around "you build it, you run it", SRE recognizes the need for specialization while deliberately striving to align incentives between the various parties via SLOs.

This summer I finally hopped over to the DevOps side of the house as a manager, having always been a software engineer. My mandate from higher ups is to reform the team to start practicing SRE, but I think it's really more about reforming the whole organization. In every way possible, I'm trying to build a team of people willing to "be the change" like you say. But I also hope that SLOs, as a formal way of agreeing on priorities, will grease the wheels a bit.
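One reason SLOs work as a way of agreeing on priorities is that they turn into simple arithmetic, usually called an error budget in SRE practice. A minimal sketch, assuming a request-based SLO; the 99.9% target and the request counts are made-up numbers.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates within the window."""
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 0.0
    return (budget - failed_requests) / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    print(error_budget(0.999, 1_000_000))           # allowed failures
    print(budget_remaining(0.999, 1_000_000, 250))  # fraction left
```

While budget remains, feature work takes priority; once it is spent, reliability work does. That rule is the agreement both sides sign up to in advance.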

> And often, it's not a technical solution.

Absolutely. In my experience, it's almost never technical. If you can structure the collaboration so that correct incentives apply, any technical solutions required come naturally.

(I wrote about my Rackspace experience here: http://zephyri.co/2016/can-devops-be-outsourced/)

I see the function of SRE as a place to pass off an app when an app team (DevOps included) need to develop another app.

SRE sets standards so a centralized team can manage the ongoing maintenance and on call break/fix of many apps in an organization while development teams move on to something else.

"be the change" I think nails the difficulty of devops and why we end up with the non Devops that we call Devops.

Most of us are not Fang companies and don't have the pick of the best engineers in the world (or the region). We cannot trust some of our devs to deploy directly to production because quite simply, some do not understand that responsibility and cannot rise to it. We still need our CTO to check any big changes because things are missed and this just reinforces the idea that you don't have to nail it, the CTO will probably find any problems.

We are also in Europe where you can't just fire people if they e.g. deploy 3 bugs to production. So what do you do? Like most companies, you try and introduce a good level of automated testing, some testers to write smoke tests etc. and we don't just deploy whenever like some companies do.

Then we are just back where we started with a bit more overhead and still not deploying more quickly to production.

> We cannot trust some of our devs to deploy directly to production

Neither can FAANG - one of the reasons why SRE is a thing.

> We are also in Europe where you can't just fire people if they e.g. deploy 3 bugs to production

This sounds like a process failure nobody should be fired for.

DevOps should be embedded on the team. Doing development and taking care of pipelines/iac when needed. If these people are not embedded on the team, it’s just more of the same with a new title.

I get paid a lot because I can write api endpoints, front end code, iac, pipelines, and understand networking enough to make our apps SOC2 or HIPAA compliant.

I've also become a little disillusioned with the 'DevOps means developers create their infra/deploy their applications' mantra too, especially when you see several years of accumulated infrastructure spun up by developers that don't really care about its quality. Then again, having a dedicated ops team can lead to siloing and learned helplessness on the part of devs too. It's a bit cyclic.

I agree 100%. It's cyclical, the trend goes one way for a while, then it goes the other way.

"We don't need a separate devops department, the programmers can do it"

"We don't need a separate QA/QE/SDET department, the programmers can do it"

"We don't need DBAs designing normalized database schemas for ACID databases, we'll have the programmers use ORM/NoSQL/whatever"

I mean this is true for the most part. It's one of those things that feels really wrong but the truth is that your code can be hot garbage for way longer than anyone cares to admit. I think a lot of shops should be popping champagne the moment they need to hire a dedicated DBA / ops person to clean house because it means they got big enough that they need to do better.

If they don't care about the quality ... do they suffer from its problems and enjoy its fruits? It seems they either don't have the time/resource/ownership/change budget to make it better, or the team is so dysfunctional that they need to be pumped full of new blood from top to bottom, stat.

One thing I really hate about a dedicated ops team (DevOps that is independent of devs) is that they have to use just one hammer for all kinds of nails. It's simply impossible for a small team to maintain multiple hammers.

So somehow the tools and workflow are exactly the same for an app that has millions of external users and an internal app that only 10 people use. You still have to go through a whole stack of K8S/docker who-knows-wtf-they-are configurations just to publish a dashboard from an internal IP. It took more than 8 months for my previous team to try to pull it off but it never went through.

As an ops guy: You’re absolutely right.

For better or worse I’m constantly annoyed by developers trying to be “special”. Most projects involve stuffing things into a database, calling a webservice and reading stuff back. Don’t try to be clever about it, just stick to one type of database and Springboot so we can all go home and sleep with no incidents.

But there are projects which really do need to be different, and we don’t deal with them well, and they don’t scale.

I don't blame the devops guys though, I think this is a corporate thing as teams have to follow procedures and devops are not an exception (to waive some rules for us). I just hope it's a lot easier if we can just use a VM instead, but that would need to go through security certification.

I wonder how other companies deal with these kinds of issues. Eventually my previous team decided to host the app on one of the laptops and tell clients to visit a "192.168.0.X:port".

Nothing better than a team that doesn't use the internal platform. They go and write their own scripts, tools and terraform only for the developers responsible to leave the company.

They found another way to skin the cat purely because they didn't like the existing approach and didn't want the hassle of getting changes integrated upstream.

Integrating the two apps hosted on different platforms is also quite difficult normally. Things you take for granted that need to be shared, reachable, etc between apps all now need to be done manually for that snowflake platform.

From the other extreme: One thing I hate about most developers is that they have to use 15 different kinds of nails to join two boards together.

We have a full platform implemented, base images available, standards on how applications are developed and deployed, extensive documentation, etc. This has evolved over time along with the projects and business. This represents a _huge_ amount of work being abstracted away.

But every week we have to have another discussion about pulling in the database-of-the-week or language-of-the-week because apparently we were completely unable to develop basic CRUD apps until the new hottest crypto-document-as-a-service-database came along to let us store and retrieve data.

And every time they think _their_ application deserves special snowflake status. We're just making things difficult, they could fire up a VM and have this set up in 15 minutes.

This completely fails to account for _most_ of the actual work involved in setting most production systems up properly. How is this being kept updated and patched? Where are the backups? Where's the monitoring? What's the disaster recovery plan? Where's the documentation? Never mind trying to integrate it into all of our existing systems (making sure it's on the appropriate subnet, has appropriately configured security groups, etc, etc).

This also completely fails to account for the long tail: Even if you reimplement all of this, now it _all_ needs to be maintained for the life of your application. Any time we need to make major infrastructure-level changes, we need to account for n+1 different systems that need to be updated. If we need to do any migrations, we need to migrate n+1 things. If you need functionality X, we can't re-use the existing battle-tested modular templates to provide it to you; it has to be implemented from scratch (and also maintained indefinitely).

Most devs I've worked with aren't well-versed enough in ops to understand the full immediate scope nor the long tail. They only see the project as far as, and only plan to see it through as far as, get it working and shitfix as required.

To your dashboard example... it's for 10 people, but that doesn't change the underlying equation for most of these things. The fact that only 10 people use it will not stop it being turned into a jumpbox for some hacker group or getting our external IPs blackholed after it sends a bunch of spam because nobody bothered to patch the OS for five years. The fact that it's only for 10 people doesn't mean it's not business critical and could be down indefinitely. The fact that it's only for 10 people doesn't mean we shouldn't have a local/staging environment and should be working in prod. The fact that it's only for 10 people doesn't mean it doesn't need to integrate into any other systems (where is it sourcing data? how is it authenticating users? how is it authenticating itself to other systems?). The fact that it's only for 10 people doesn't mean someone doesn't have to maintain it after you're gone.

Following the standard _should_ be the happy path so that you don't have to worry about all this stuff. If it's not, I think your ops department was just dropping the ball.

Ayyy, even the author gets it wrong.

DevOps is not about letting developers own services in production. This is the common misunderstanding that has led to DevOps teams being platform teams. DevOps is about helping separate Development and Operations teams work better together so that the organization, which needs both, can benefit from their harmony.

The simple fact of the matter is that the modern stack is too big. It's not possible to expect to be able to hire someone who can walk in and do their own product management + write frontend code + backend code + host the frontend on a CDN + own the backend on Kubernetes + handle all the security, monitoring, uptime requirements around the clock. Maybe over time people will expand their skillsets, but right on their first day? No way. Not a reasonable expectation.

Organizations need to allow people to specialize. This is the only way anybody ever truly gets productive: they learn some part of the stack, learn it deeply, handle that part of the stack for the company, and communicate. True DevOps isn't pipeline engineering / automation engineering, it's corporate communication engineering, so that the concerns of people later in the value stream (ops, security) can be handled as early as possible (e.g. development/staging environments) to improve overall throughput.

You still need dedicated ops engineers! We just call them SRE now. You still need dedicated security engineers! We still call them security engineers.

Heroku came up with the idea of buildpacks [01]; the idea was later picked up by Cloud Foundry [02]. Using buildpacks (as an alternative to dockerfiles/images) mitigates this problem. A lot of best practices are built into the buildpacks.

Of course this does not work for everything. But for most cases this is just good enough.

[01] https://buildpacks.io/ [02] https://www.cloudfoundry.org/

Following this approach it becomes harder for the developer to do things wrong. This is one reason why Cloud Foundry is used by a couple of big enterprises.

Letting/making devs own the services they deploy/develop is not the full picture, nor is it a real goal, but still it sums up the problem and solution space concisely.

Yes, specialization is a must for productivity, but that's a given already in the old model. What the DevOps effort tries to cut through is the trust barrier between the two. Sure, it's communications engineering too, but it's easy to weasel out of any real change with that phrasing.

> Letting/making devs own the services they deploy/develop is not the full picture, nor is it a real goal, but still it sums up the problem and solution space concisely.

What I'm trying to explain is, that I don't believe that this is a realistic expectation, because owning a service in production includes aspects of the stack that are not related to development, like networking, storage, and various non-dependency security-related concerns. When you task a platform team with building an easy-to-use abstraction for developers, those abstractions will inevitably leak. When they leak, one of two things happens: either the developers page the platform engineers, in which case you're back to shared ownership and the need to coordinate the concerns and responsibilities of multiple teams, or the developers are expected to fix the leak themselves, in which case, most of the time, the developers will find themselves out of their depth, most likely doing something inefficient at best or flat out wrong at worst.

I agree with the concern regarding leaky abstractions, but ultimately the developers should still own their services and be on-call for it, e.g.

- On-call dev deploys a buggy change that messes up some transactional data. Product owner passes the complaints received by customer service. On-call dev rolls back the deployment, writes a script to fix the bad data, and then prepares a post-mortem.

- On-call dev and on-call dba are paged because the database for the service is slow. On-call dev checks the service metrics and logs, and confirms that this is not caused by a deployment nor abnormal traffic. On-call dba checks the database metrics and logs, and finds some runaway queries due to silly optimizer. On-call dba kills the runaway queries, and proposes to add an index hint. On-call dev agrees, works on it, and asks on-call dba to review the change.

- On-call sec is notified because of a new rce vulnerability. On-call sec comes up with the steps to identify and remedy the issue. All on-call devs are notified to follow the steps, to ensure that their services are not impacted.

That's why ideally you have someone in your team who is trusted by the platform team to do what needs to be done to unblock your main goals/tasks. The platform team of course can help (or hinder), but the ability to do your own stack helps uncouple problems, helps teams progress.

Ideally there's an equilibrium. Some teams will inevitably be outliers, but the majority will benefit from the platform.

There's always a degree of sharing/coordination/communication. After all the teams are part of the same company. (Eg. when it comes to data access, API stability, it's a must to coordinate. Though famously Amazon "simply" solved this by a companywide mandate, requiring everything to have an API, and there's probably some story about their stability too.)

So, in any case the coupling should be loose. Basically an almost wait-free (but at least lock-free) algorithm.

Exactly, this is a huge problem.

A lot of developers are exactly that, developers. Sure, if you have somebody who is very motivated he might know a thing or two about DevOps etc., but hiring becomes almost impossible if you expect a candidate to be a good developer and also have good knowledge of devops.

It's just too much to handle for one person.

A sysadmin team leader walks into the room and says, "Everyone, you are now DevOps engineers, carry on", and walks out to collect a bigger paycheck...

So this, but instead of starting with Sysadmins with tens of years of unix experience, start with junior developers with 1 year of experience using Docker as a developer and have them read a blog post on how-to set up Kubernetes.

Now I feel this is exactly how it is happening at work, even though I am not part of the "devops" team. Any working solution at my place is now legacy, ready to be dumped and replaced with the next generation stack. The other day a "devops" engineer couldn't figure out wrong directory permissions and the whole deployment failed.

we once had this issue at an old job.

People wanted to replace a monitoring solution for something new, because snmp was "bad and outdated".

SNMP is definitely not the holy grail of solutions (to be honest, it's kind of a mess). But at least SNMP is widely supported, works reasonably well, and gets you the data you need in a dashboard. Even for those old Cisco 7200's with an uptime of over a decade.

Being new and shiny in operations is usually a sign of immature technology (with the bugs and issues associated with it).

Thank you for the post! At some point years ago I started realising that most problems in SNMP are stemming from badly implemented SNMP agents. Those problems can be:

1. Returning values different than what the MIB says.

2. Missing values that are not optional in the MIB.

3. Broken getnext code that returns an incorrect oid, which could even lead to loops. Net-snmp will detect it, but proper snmpwalk gets broken.

4. Broken getbulk repeater code that can cause crashes if the repeater is too large.

5. OIDs that will crash/stall/slow the agent.

6. Implementations that cannot handle simultaneous requests. Admittedly this is notoriously difficult to achieve: most MIB implementation code is not reentrant for various reasons.

7. SNMP SET is very difficult to implement, cannot be meaningfully used for almost anything beyond extremely simple stuff.

8. UDP encapsulation can cause MTU/fragmentation issues that are a nightmare to debug. Net-snmp actually implemented transports such as TCP and it was very nice, but it never caught on in general.

Bonus, there is also the not related to implementation problem of:

9. The query semantics of SNMP are non-existent, and the indirection in the SMI is cumbersome, which means that as a client/manager you'll have to do all sorts of data wrangling to find what you need.
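Problem 3 in the list above is easy to picture with a toy walk. The sketch below is not Net-snmp code and speaks no real SNMP; `get_next` is a hypothetical callable standing in for an agent, and the OIDs are plain tuples. The point is only the guard: a walk must refuse to continue when the returned OID does not strictly increase.

```python
Oid = tuple[int, ...]


class OidNotIncreasing(Exception):
    """Raised when an agent returns an OID that is not after the requested one."""


def walk(get_next, root: Oid, max_steps: int = 10_000) -> list[tuple[Oid, str]]:
    """Walk the subtree under root, guarding against broken getnext replies."""
    results: list[tuple[Oid, str]] = []
    current = root
    for _ in range(max_steps):
        reply = get_next(current)
        if reply is None:
            break  # end of the MIB view
        oid, value = reply
        if oid[:len(root)] != root:
            break  # left the requested subtree
        if oid <= current:  # tuples compare lexicographically, like OIDs
            raise OidNotIncreasing(f"{oid} is not after {current}")
        results.append((oid, value))
        current = oid
    return results


def make_agent(table: dict[Oid, str]):
    """A well-behaved toy agent backed by a sorted table."""
    ordered = sorted(table)

    def get_next(oid: Oid):
        for candidate in ordered:
            if candidate > oid:
                return candidate, table[candidate]
        return None

    return get_next


def broken_agent(oid: Oid):
    """A toy agent with the bug from item 3: it always returns the same OID."""
    return (1, 3, 6, 1, 2, 1, 1, 1, 0), "stuck"


TABLE = {
    (1, 3, 6, 1, 2, 1, 1, 1, 0): "descr",
    (1, 3, 6, 1, 2, 1, 1, 5, 0): "name",
    (1, 3, 6, 1, 2, 1, 2, 1, 0): "ifNumber",
}
ROOT = (1, 3, 6, 1, 2, 1, 1)

if __name__ == "__main__":
    print(walk(make_agent(TABLE), ROOT))
    try:
        walk(broken_agent, ROOT)
    except OidNotIncreasing as exc:
        print("loop detected:", exc)
```

Without the `oid <= current` check the second walk would spin until `max_steps`, which is the loop described in item 3.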

I really wish this wasn't the case but after sitting on the other side of the interview table I'll take the junior developer almost every time. All the best ops people start as devs with an interest in infra. The worst interviews I've ever been in have been sysadmins trying to pivot to infra work. You can't teach the tinkerer mentality or the culture of "I'll just look at the source and fix the problem."

I am sorry but I just laughed at this. 20 years ago sysadmins used perl and bash to automate everything. Now devs that don't understand OS, networking, package managers, etc, think they will obsolete ops people by writing yaml configuration for a 0.12 version software, looking at its source code and fixing the limitations. May God have mercy on your soul as the hell you have developed for yourself has none.

Look, I am that ops person sans the age. I desperately want one of these mythical graybeards to show up for an interview but it just doesn't happen. I get "senior" UNIX sysadmins that can't write a basic Python script despite it being on their resumes [1], can't mock out a basic HA web architecture or a logging pipeline, and can't explain at all how Linux namespaces work.

We tried to hire someone senior for 8 months before giving up and just grabbing a junior dev that wanted to do infra stuff and trained him up. I have worked at shops with people like you describe but it seems they're all happily employed.

[1] Not whiteboard style, on their own time, on their own computer, not timed or watched. They can take a week for all we care.

This is your personal recruiting problem and has nothing to do with "all the best ops start as devs" statement that you made.

I have no idea how you could find true experienced sysadmins in your case. They exist though, that much I can tell you. I have 20 years of experience, I write bash, python, perl and lately mostly go. I am not a greybeard as I am just old enough to have been a junior for the greybeards that wrote linux kernel patches among other things to solve infra problems.

I mean how could I know which part of my comment you took issue with, I was describing my recruiting experience. I still stand by that statement though -- doesn't matter if they've been doing it for 2 years or 20 all the best ops people started on their path by being a dev who found themselves as the ops person of last resort and fell in love with it. The experience gained by being thrown into responsibility for a production system and having to figure it out in real time genuinely can't be beat.

If a company is having trouble recruiting, it normally means the package on offer just isn't competitive. Remote work and money are the big ones nowadays. It might also mean your company is unattractive to engineers; location, negative reviews, difficult tech stack, business model based around clubbing baby puppies... Whatever it is.

I have led Ops teams in one shape or another for about 10 years. I second that having a junior dev do Ops work without supervision will not bring joy to your life, mostly because they do the darnedest things! One of the trainees here decided to remove a ControlTower guardrail from a root account, for a personal experiment in a sandbox; then three weeks later created a public S3 bucket in a production subaccount and leaked a bunch of data...

90% of developers just don't have the mindset for Ops work.

> This trend is also supported by Kubernetes as this is a technology which is hard to master. So people start splitting up responsibilities again. One part of the team developing the application the other part of the team managing the infrastructure.

Kubernetes administration is hard to master. It's analogous to running VMware in production, or any cloud technology for that matter. None of them is easy, and they all require relatively specialized knowledge.

Kubernetes development on the other hand is relatively straightforward. Minikube is my goto environment whenever I have to put together an app that consists of more than a single process. As long as you pay reasonable attention it is not hard to move such applications from dev to staging to production.

Edit: fixed markup

Yeah I don't really get the hate on Kubernetes. Run it managed with EKS, GKE or similar, set it up with the needed controllers (ingress, external secrets, whatnot). The development afterward is, as you say, straightforward.

I think a lot of people are missing the point of the article: it's saying that devops was coined by Vogels (i.e. Amazon) as a value add because actual ops people are a necessary evil for a lot of companies that just want the features devs are cranking out to WORK without needing to be managed. Every salary that has to be paid to an ops person is a salary not spent on a dev making new features, fixing important bugs, or doing data engineering, etc. that clients are paying the real money for. Clients don't wanna hear about your deployment problems or DB backups. I'm not saying they are unimportant, but that they're painful cost centers which people looking after the bottom line would love to minimize, i.e. pay Amazon a fraction of what you'd pay full time in house people to do - at least in part. DevOps was meant to give devs more control over this work so that fewer ops people would be needed.

So when there is suddenly a new generation of ops people AGAIN getting paid just to manage stuff instead of build stuff, only now in the cloud instead of on on-site bare metal, this is a BIG misunderstanding of what the intent was.

> pay Amazon a fraction of what you'd pay full time in house people to do - at least in part. DevOps was meant to give devs more control over this work so that fewer ops people would be needed.

And here is the lie in the AWS marketing promise. In reality, without ops, the infrastructure-clueless devs will generate huge AWS bills that move the profits from AWS customers into Amazon's pockets. At the same time, production becomes more brittle and CI/CD becomes a complex nightmare that gradually takes larger chunks of dev time until things grind to a halt.

The catch-22 is leaky abstractions. In the same way that devs cannot really empower regular users to generate apps with no-code that work well, devops cannot create IaaS or PaaS that can be used without understanding the underlying OS, networking, configuration management, package managers, etc.

seems to me the real point of devops was employers trying to mash together two expensive categories of labor in a desperate effort to control overhead in a nation that spent 40 years prioritizing blind consumerism, reality television and talent shows over public education and STEM.

mashing together devs and ops inevitably results in schisms of devsecops and netops and ^ops because developers have enough shit to worry about on the daily without needing to learn the absolute intricacies of the OS as it applies to dev and prod. employers tried to get them around this by insisting "move fast and break things" with the push-to-prod mentality only to realize a 7 hour outage as a result kind of dims the outlook for everyone, shareholders included.

so employers wound up embracing the buzz and holding the line where it made sense, with some devops as developers, other devops as mostly ops, and still more devops as network-centric roles that handle things like firewalls and traffic too. now, people are going to point to automation and say "infra as code means devops is still a departure from traditional ^ops" but heres the secret: shit like cfengine has been around since 1993 and grandstand conferences on devops didnt show up until nearly 16 years later, and hyperconvergence showed up in 2009 which basically expands on software defined network and infra, so a cogent argument can be made that automation/convergence alone isnt devops, it was always happening anyways...so theres the wizard behind the curtain:

devops was just a trick to get you to do more so your boss didnt have to hire network and ops.

yes but no. the problem was that in mostly big financially privileged companies ops, dev, net and itsec were each their own little fiefdoms. and there was no incentive to get shit done. because that would have required working with the enemy. that gets you nowhere on your way up the corporate ladder in your guild.

lack of education never helps, but it's laughable to point at that as a serious causally significant factor. outsourcing, communication barriers, trust barriers, organization barriers, ossification of the business structure led to serious loss of productivity of the whole enterprise.

the answer is the cross-functional team. agile. scrum. (and many names and many crazy offshoots) but the core recipe is good and sane: people above process. trust (and self-check) instead of mandatory responsibility separation to check each other. empower engineers to be able to do the right thing by default, engineer tools for everything, don't let non-tooled non-integrated handoff points develop, because that leads to fiefdoms and communication bottlenecks.

Personally I think it's pretty great that I can write Dockerfiles, run them locally, and call it a day because this is also what runs in production.

So basically I have no clue what the dude is talking about - we have a great setup which is x10 better than what we had before.


> Personally I think it's pretty great that I can write Dockerfiles, run them locally, and call it a day because this is also what runs in production.

It goes far, far beyond Dockerfiles. Are you on-call for production? Can you debug a problem in production?

If not, then that's the root of the problem that the dude is talking about. If these are true for you (not specifically you, any reader of this):

- You don't have a good idea of how to ship a change all the way to production after committing it

- You would not know how to debug and fix a failure of your code in production

- Your "Dev" and "DevOps" teams are in separate departments

...then congratulations, you have just reinvented the old-school sysadmin/developer dichotomy from the 90s. You just happen to be using Dockerfiles instead of Makefiles.

I'm glad there's a dichotomy. I (a dev) never signed up to be on call 24/7 for production and it's ridiculous to just assume that I should be. I have a life outside of my work. If the system needs 24/7 uptime, they better be paying someone other than me to ensure that's the case, because my time is too valuable to waste my already precious freetime on fixing bugs in prod.

> If the system needs 24/7 uptime, ..., because my time is too valuable to waste my already precious freetime on fixing bugs in prod.

As an ops-sided person, this is the exact reason why ops folks always said no to releases on a Friday afternoon.

Ops people would rather be in the pub enjoying their bug free weekend by not risking pushing buggy broken code from developers, in spite of the desperate pleas and guarantees that this batch of code is *definitely* bug free. Last week was just a one-off. Honest. We swear.

EDIT: In the old style of Ops team versus Dev team versus Infra team versus everyone else.

Who wrote those bugs in the first place?

What's your point?

His point is that Ops time is more expensive than Dev time, they get paid more. The Dev teams should really fix their own bugs if they happen in the wee-small-hours.

Not just that (since there's a lot of situations you may not be able to actually fix bugs in the moment, just work around them), but it also aligns incentives and enforces lessons learned. If you know you're gonna be on the other end of that pager, maybe that hack won't see the light of day. For more senior folks, pattern recognition around how things were built and what issues they led to are incredibly valuable. Being able to tell a junior engineer no don't do it like that, we did that last time and it led to x y and z, could save the company a ton of money and pain.

If you're separated out from ops, what would ever drill that into you? Saying you're too busy for ops is just wild to me. That's some of the highest effort to reward work someone can do.

Not to mention the layer that lives behind this.

Can you troubleshoot network level issues that seem to occur between your system and someone else's?

What about storage? Can you make sure your data is still there if your environment gets royally screwed?

I know these are usually done by separate companies nowadays, but getting an actual computing environment up and running with redundancy in place (down to the cooling and electrical level) is no easy feat.

Agreed. Dockerfiles beat ansible scripts which in turn beat that-bash-script-that-PXE-ran. Skaffold and kubernetes are amazing tools, but just tools.

So much easier to be cynical about new tools than to investigate and understand the pros and cons.

Can we please get rid of the DevOps hype train?

What gets completely ignored is that there is a reason that sysops has been a specialized role for decades. How often do we get threads here that people get their AWS accounts hijacked because of some credential leak (e.g. access keys in frontend builds), extreme expenses due to stupid autoscaling settings combined with a DoS attack or content leaks out of insecure S3 buckets? Or how much money gets wasted because clueless "devops" developers think that failsafe redundancy for their tiny PHP app is configuring a bunch of 2xlarge instances and putting ELB in front of it instead of either going with lambda functions in the first place or at the very least using small instances and sensible autoscaling.

Or ffs just take gitlab ci pipelines. Nearly every time when taking over a project or hopping in to consult I see utter fucking bullshit written by people who don't have any idea what the fuck they are doing. Yeah, your build takes 10 minutes because you don't use any cache anywhere and you don't use Nexus or any other form of repository cache. And ffs no you should not push your application image to public Dockerhub because you have no idea that Gitlab has its own registry - and no, "I read it on some 'getting started' guide for <insert JS framework here>" ain't an excuse either. People get barely trained in the technology they use, most JS frameworks (and fwiw, Gitlab and Github CI themselves!) are young enough that it's impossible to get senior people for them, and the stuff that ships for "CI templates" there is atrocious.

The amount of knowledge required to securely operate any kind of hosting or development operation is immense, and acting like developers should also shoulder the role of unix sysadmin, IT security staff and cloud configurations is extremely ignorant - not to mention only half of the developers know more than "ls/cd/rm" on the command line, less if they are Windows developers.

I do see value in "devops" as in "a person primarily focused on operating infrastructure, but with enough capability to troubleshoot production issues in applications", but devops as in "save money for actually qualified ops staff" is irresponsible penny pinching.

Great to see someone else saying this for a change. "Dev" and "Ops" are different skill sets with some potential overlap, but not 100%. Organizations I've been in have never done the "throw it over the wall" thing, either. Dev and Ops are partners, and treating things any other way is a recipe for failure.

This. I think that today the barrier to entry is low enough that, especially in small companies, developers with little knowledge of the space can get a basic CI system or production deployment going. And, the company is more than happy to let them instead of hiring somebody specializing in the area. This seems to really blow up if it goes unaddressed as the team/product grows.

I've watched at least one company burn time on the same issues for years because instead of acknowledging the need for and empowering people with distinct skillsets or knowledge, they're content to let developers bumble through the process of learning and reinventing the wheel over and over in the margins. That's if they can even get the developers on growing and siloed teams to communicate with one another at all. There needs to be experienced leadership somewhere.

Developers should have more responsibility over their apps and what they're putting into production. But it's unfair and inefficient to expect that all but a few will ever even have time to become skilled practitioners in these areas, especially while already keeping up with their normal duties. Now not just tech debt, but foundational infrastructure can be pushed aside by the drive for new/more features.

For the foreseeable future it seems like the trend of tasking devs with so much of this work will result in engineering teams continuously, wastefully toiling over fractured infrastructure built by novices. (Maybe someday the tools will actually allow for developers to take this on with limited knowledge/resources, though.)

Access keys on frontend is the fault of the frontend developer or culture

There will always be irresponsible or junior people, the problem is that AWS allows extremely unsafe things in the first place.

Normal hosters wouldn't even open internal infra like DB instances, mail relays or storage up to the Internet. AWS by default doesn't even firewall their internal stuff off and thinks that, as long as you have the right credentials, you can't be wrong, even if the IP is no longer from eu-central-1 but from some dial-up in Mexico. S3 and SES have mind-bogglingly insecure defaults, and it is hard to wall off your SES credentials enough that they're useless to an attacker who has managed to exfiltrate them (and that one is easy enough, all an attacker needs is a vulnerable log4j and the dubious "best practice" of passing credentials via environment variables).

It is so incredibly easy to mess up on AWS it hurts. AWS is a gun shop that not just sells handguns but nuclear armed bomber jets to anyone presenting a credit card. No knowledge checks, training, nothing required, and you have the firepower of all of AWS on your hands.

> This trend is also supported by Kubernetes as this is a technology which is hard to master.

Maybe this is rich coming from a Kubernetes person, but sounds like the tide is turning against this viewpoint.

Running Kubernetes itself is definitely hard to master, but you can take a <10 hour online course and be able to "helm install" your application.

Kubernetes can be the new PaaS (if you have someone else running the control plane and maybe a monitoring stack). But there's not a PaaS on the planet where you can just skip the step where you learn about it. Even Heroku users eventually have to learn about how connection handling and limiting works, and the specific Heroku commands to upgrade your database.

> but you can take a <10 hour online course and be able to "helm install" your application.

Genuinely I believe that good QuickStarts are the cause of so many expert beginners.

What I mean is: there is a person on an adjacent team who truly believes he is very competent in nearly all technologies. He is able to deploy a k3s cluster no problem.

He is absolutely unable to debug that cluster, or use it in a way which is reliable (local persistent volumes means you are dependent on the node never going away), he is unable to troubleshoot networking issues that have occurred in that cluster and he has struggled many times to admit that he is unable.

But heck. He can helm install no problem!

> What I mean is: there is a person on an adjacent team who truly believes he is very competent in nearly all technologies. He is able to deploy a k3s cluster no problem.

Aren't you confusing two roles, Kubernetes application developer and Kubernetes administrator? The Kubernetes application developer should focus on their application, NodeJS, Java, DotNET, whatever. Helm install (actually helm upgrade --install) is all they should know/care about when it comes to Kubernetes.

K3s deploy and local storage providers? Let those be the Kubernetes administrator's headaches.

Are you comparing the cost of learning kubernetes to the cost of learning heroku? Or even AppEngine? Those are totally different leagues.

Also, with heroku, AppEngine, etc you have tons of documentation (official and unofficial) and resources. It is well tested and the exact same service runs thousands of businesses.

While kubernetes is like the react of the infrastructure world: you won't find two similar implementations. You need a thousand tools configured around it to make it usable, and of course you might have docs about the individual pieces but you never have it for the big picture.

I made a similar ranty blog post about devops.

I was surprised by the amount of history and misconception that exists around the term.

The term was originally intended as “Agile Systems Administration” but came about at the same time as a talk titled 100 deploys a day.

So people conflated the two, and it didn’t help that the conference name was a portmanteau of “Developers, Operations days” to encourage it being a conference for everyone. Either way. We’re stuck with it now, and it’s the least meaningful job title I’ve ever known.

If you have to ask “what does x mean to you” then it’s a failed descriptor.


> The term was originally intended as “Agile Systems Administration”

That explains why it's such a terrible idea.

Exactly. One meaningless definition attempts to justify its existence by referencing another.

As I learn I’m realizing that what a lot of people miss is the rigor required once you pass a certain company/infra size. The same stack and processes that worked to get you to $X million is not the one that will allow you to scale software deployments to support $X billion worth of business. Documentation, real ci/cd with multiple environments, strong access controls. It’s so easy to gloss over these when you’re small. You have someone on the team who is “the expert”, you don’t have the money to do a real blue/green setup, you don’t have the time on an incident to give someone the proper access so you just give them a shared ssh key…success at scale requires discipline that makes the work hard and not fun. But it’s the key. Once you have dozens of infra engineers and many microservice teams, this laziness (that’s what it is, I know because I’m lazy) doesn’t work.

Kubernetes serves to allow a team within an organisation to build a PaaS that's customized for your organisation. You do this because 1. Baking in your organizational defaults and opinions reduces work and cognitive load for your engineers and 2. A PaaS is too expensive at scale (compared to IaaS), especially given what you get isn't tailored to your org.

> You build it, you run it

While I do like that definition, it also means that most companies/organisations cannot do DevOps, as software development is often bought as projects, and operations is a recurring cost.

Also, government regulations such as HIPAA explicitly force you to have separation of duties, so true DevOps is against them. Developers being able to push their code to production can earn you a hefty fine at the next audit.

We pass our audits all day long and I can push to prod multiple times a day. But there are caveats. Another developer must sign off on a code review and quality check before code is rolled out, and a build pipeline ensures that we run our test suites and all changes have an audit trail.
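
A rough sketch of the kind of gate described here, purely as an illustration: a change is deployable only if someone other than the author approved it and the test suite passed, and every attempt is recorded. The `Change` and `Pipeline` names and their fields are hypothetical and do not correspond to any particular CI product.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    author: str
    approvers: list
    tests_passed: bool


@dataclass
class Pipeline:
    # Every deploy attempt is appended here, whether or not it was allowed.
    audit_log: list = field(default_factory=list)

    def deploy(self, change):
        """Allow the deploy only with an independent approval and green tests."""
        independent = [a for a in change.approvers if a != change.author]
        allowed = bool(independent) and change.tests_passed
        self.audit_log.append({
            "author": change.author,
            "approvers": list(change.approvers),
            "tests_passed": change.tests_passed,
            "deployed": allowed,
        })
        return allowed


pipeline = Pipeline()
reviewed = pipeline.deploy(Change("alice", ["bob"], tests_passed=True))
self_approved = pipeline.deploy(Change("alice", ["alice"], tests_passed=True))
```

The separation of duties lives in the rule that the author's own approval does not count, not in who holds the deploy button.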

Software is never done. It can only be abandoned.

There are projects that change slowly but they do change nonetheless unless they are no longer used.

We get heavily clamped down on access to production by our auditors for Sarbanes-Oxley reasons. Technically we aren't even allowed production data in our dev environment. I don't understand why any organization would allow it. Seems the tech world is still ripe to get sued.

Because most of the world (also, as in industries) isn't under those type of laws.

Organizations did (do?) allow it because it's easy. It's also a way to test/debug production problems with actual production data instead of spending time trying to make up data that you hope covers all the actual use cases.

It's also lazy and doesn't respect the privacy of your users.

I see it as a specialization of Software Engineering. Like we have (self proclaimed as well as normalized by job posts): Frontend Developers, Backend Developers, (And Full-stack developers if one can do both, or simply, Software Engineers), And infrastructure developers.

DevOps, and especially with Pulumi / CDK (where it's actually infrastructure-as-actual-code and not infrastructure-as-a-JSON/Yaml file), is simply becoming a "developer who also likes ops" type of specialization. E.g. front-end, back-end and stack-end.

The nice thing is, and what I try to do in my team (both at my current startup and previously at AWS) is to break the silo and keep the "you build it you ship it" notion. Frontend - TypeScript, Backend - TypeScript, Infra - also TypeScript. When an engineer wants to build a lambda that listens to SQS or a Fargate ECS task or an S3 bucket, they can just do it. If you know how to use the AWS SDK to let's say read/write to S3/DynamoDB/SQS, the API to create S3/DynamoDB/SQS etc is not that far off, and if you are using CDK in TypeScript for example (whether it's Terraform / CloudFormation / CDK for k8s doesn't matter) then indeed the boundary between backend code and infra code is getting smaller and smaller (both technically and also visually, the declarative code to generate an S3 bucket in CDK is pretty damn close to how it will look to generate that same S3 bucket via the SDK, it's just that CDK takes care of idempotency / change sets / diffs etc...)
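
For readers who have not used a declarative tool, the "idempotency / change sets / diffs" part can be illustrated with a toy model. This is not CDK or CloudFormation code (and it is in Python rather than TypeScript to keep it dependency-free): the dictionaries stand in for a desired template and a cloud account, and `plan_changes` / `apply_changes` are made-up helpers.

```python
def plan_changes(desired, current):
    """Return the create/update/delete actions needed to move current to desired."""
    actions = []
    for name, props in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != props:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_changes(desired, current):
    """Apply the plan to an in-memory stand-in for the cloud account."""
    plan = plan_changes(desired, current)
    for action, name in plan:
        if action == "delete":
            del current[name]
        else:
            current[name] = dict(desired[name])
    return plan


# Hypothetical resources: what the code declares vs. what already exists.
desired = {"uploads-bucket": {"type": "bucket", "versioned": True}}
account = {"old-queue": {"type": "queue"}}

first_plan = apply_changes(desired, account)   # creates the bucket, deletes the queue
second_plan = apply_changes(desired, account)  # nothing left to do
```

With direct SDK calls the second run would fail or duplicate work unless you wrote the existence checks yourself; the declarative layer is what makes re-running safe.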

In summary. I agree that DevOps should mean - you build it, you ship it and that the current industry trend towards a separate DevOps department is just plain wrong, but on the other hand, it's ok for specialization within a team, some will lean toward frontend, some toward backend, some toward data engineering, some toward data science, some toward security and some toward operations. Some will do a bit of both and some will prefer to focus and specialize, but I think that the main lesson I learned is that it all should be part of the team. Like a D&D party, you don't go to an adventure with just Warriors, then send the Wizards separately to ship the dragon to production.

Running systems is a different set of skills than building them. You can mitigate this to an extent with a PaaS but eventually you'll run up against operational issues that require expertise to remedy properly. You can try to use generalists to handle both product development and operational roles but you'll start accumulating inefficiencies rather quickly (i.e., your entire dev team needs to understand the operational tooling so they can be on-call, you're going to be spending more for resources you aren't using optimally, and you'll likely start to accumulate toil, downtime, performance, or security problems). So specialists tend to pay for themselves pretty quickly.

Even just having separate reporting chains is helpful to achieve a balance between work driving operational metrics (stability, performance, security), and product development work driving product metrics (sign ups, sales, acquisition costs, etc)

I thought “dev ops” was applying dev work flows to ops processes. That is, sprints, jira stories, etc.

Then SRE came along, and it was all about devs doing ops work focusing on keeping things up and running. That is, a dev doing rigorous ops work with a focus on SLIs, error budgets, etc.
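
As a small worked example of the error-budget arithmetic mentioned above: an availability SLO implies a fixed amount of allowed "badness" per window, and the team tracks how much of it has been spent. The function names and the 30-day window below are assumptions chosen for illustration, not a standard API.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return (1.0 - slo) * total_minutes


def budget_remaining(slo, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative means it is blown)."""
    budget = error_budget_minutes(slo, window_days)
    return 1.0 - bad_minutes / budget


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
budget = error_budget_minutes(0.999)
# After 21.6 bad minutes, about half of that budget is left.
remaining = budget_remaining(0.999, bad_minutes=21.6)
```

The point of the number is the policy attached to it: while budget remains, teams ship; once it is spent, reliability work takes priority.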

But then, the term got trendy, leaked out of Google, and everyone wanted to be cool so they called their devops guys (which were really repurposed sysadmins) SREs but the job really didn’t change.

Source: sysadmin/devops/SRE since 1995

Devops is a philosophy not a role. The title DevOps engineer is meaningless.

There are software engineers, who write software and then deploy it onto self serve infrastructure, and there are operations engineers and data engineers who build and maintain the systems that the software engineers use to deploy and operate their software.

Usually when someone calls themselves a DevOps engineer they are an operations engineer, but sometimes they are a software engineer at a small company who also has to maintain their own infrastructure.

Totally agree, I’ve always insisted myself and/or my team are called “infrastructure”[1], who provide the platform for the rest of engineering to deploy stuff to (devops). But when hiring for this team, have always used devops as the title because it brings in the right type of candidates.

[1] platform is probably a better name :)

I always want to call my DevOps team the "tools" team, but that's not as catchy.

It's hopeless to argue for changing the name. You'll never convince people whose careers are tied to DevOps to change the word they use.

We should come up with a buzzword that we can use for a few years to indicate real-DevOps. It will be taken over by the same managers shortly after, but it can be used for a little while by people who know the real deal.

We can call this "buzzword maintenance". The industry of engineers agrees to regularly update lingo to keep ahead of corporate concept degradation.

We can do this for agile and DevOps, at least. I suggest taborca (acrobat backwards) and DesMis (designer-mission). These were done with just a quick thesaurus search.

He’s not wrong. Over complicating DevOps does take the control away from developers but I think a lot of the Ops part of DevOps is to create an internal PaaS that your devs can deploy to themselves.

Except then it's not DevOps. If you have people working on 'DevOps' 100% of the time, that means they aren't working on Dev, and are only working on Ops.

The idea behind DevOps is that Dev(elopers) do their own Op(eration)s, rather than throwing stuff over the wall for a separate team to manage.

Not only has the industry done the opposite in many cases, they kept the name DevOps, meanwhile the full time operations staff have over complicated deployments and builds to the point that even they don't understand how they work anymore.

I agree that the term was kinda stolen and in many places Ops became DevOps while still doing Ops stuff 100%.

I think there is a place for both. Going into DevOps does not mean you need to get rid of your Ops team.

I like the approach when Ops are handling the backbone infrastructure of the organization while individual teams work in DevOps methodology meaning they are responsible for building necessary application infra, programs, as well as bringing it to production and monitoring.

Yes, many enterprise devs I know are completely uninterested in infrastructure.

This is also a great way of scaling teams - keeping a path of least resistance that is secure and compliant while giving observability and conventions.

Infrastructure in enterprises is usually completely uninteresting, to be fair. You need like eight different people on a call in order to stand up a VM, install a service, and configure a database connection.

Broken processes and missing self-service, as you point out, is why creating a platform team is important for these types of places. They might not know it yet, although some are trying.

It doesn’t make the infra technology more or less interesting?

Or any CS field for that matter.

100% agree. The industry took the term DevOps and started applying it to old-style ops teams (only difference is the ops teams use more "modern" tools).

There's still value in having an infrastructure or platform team focused on building tools to enable dev teams to more efficiently deploy, monitor, and run their code, and to step in with expertise when needed (ie help debug things like database performance, etc). More "teach a man to fish" type help, but also "give them a good boat and fishing net".

My DevOps team are actually SREs, and our SRE team are just webdevs who are on call. It’s pretty ridiculous.

Could be that someone in management understood site reliability as website reliability and decided to put web devs on call.

As an SRE, I’m starting to think the misunderstanding of SRE tradecraft is probably more pronounced than the misunderstandings of Devops tradecraft. The risks definitely feel higher.

I’m six months into a new SRE role and it really feels like instead of doing any kind of engineering I’m instead just getting “overflow pipeline work from the rest of engineering that the Devops team was too busy for” and I’m already looking for the door.

Similar thing happened to Agile.

Agile is a practice of speeding up improvement. Idea being that organisations that improve faster will, over time, overtake organisations that improve slower. If so, it makes sense to focus on the improvement loop itself. You remove management from the loop (because management takes too long and slows down improvement) and you empower and encourage individual contributors within their teams to improve the process because they can do it faster, have hands on knowledge of the process and can observe results faster.

An additional benefit of Agile is that you now have a tightly knit team that tends to be more engaged with their work and more focused on thinking through what they are doing -- this tends to make people more satisfied with their jobs, because being able to affect the process positively is a great motivator.

Yet, the management thinks they can come with a book and institute "Agile process" by telling everybody what they have to do, exactly.

Usually, improvement within team structure is then actively discouraged with many explanations like "that's not what the book says" or "I was doing scrum at a company where it worked and so we must do the same" or "we need to standardise processes" or some other variation of it.

I have observed this time and time again at various stages -- always with the same miserable outcome.

At my last company they additionally decided to introduce the Scaled Agile Framework, meaning, according to them, that we would now plan not at the team level but at the department level. This innovation meant all planning and retrospectives were combined for about 400 people. Because this was too time-consuming to do every two weeks, they decided we would do it every 5 sprints (10 weeks). Just to be clear -- every 5 sprints we were supposed to select and plan tickets for the next 10 weeks of development. This was presented as an efficiency improvement.

To be honest, I never understood how this "DevOps vision" could scale up to an engineering department bigger than, let's say, 10-15 people. At some point, keeping the platform working is enough work by itself to justify moving dedicated hours from business code to platform/infrastructure code. So you can decide whether to use developers' time or bring in someone dedicated to this. But maybe all the developers like the OP are 10x developers who can just do everything without skipping a beat.

I tend to agree with the author of the article. Somewhere along the way, the term devops changed its meaning. Initially, devops symbolized a way of working and a multi-disciplinary team. This is nicely captured in [1]:

> Under a DevOps model, development and operations teams are no longer “siloed.” Sometimes, these two teams are merged into a single team where the engineers work across the entire application lifecycle, from development and test to deployment to operations, and develop a range of skills not limited to a single function.

Today, devops is often used to describe an engineering role very similar to the classic system administrator. I guess this is mostly because it sounds "more modern" and is more attractive.

But, the comments in this thread made me aware that there indeed is a role that's focused only on serving developer needs. I guess we could call an engineer a devops engineer if their job is only to build and maintain system tools for developers.

[1] https://aws.amazon.com/devops/what-is-devops/
