However, in our case (and probably many others), our devops team is responsible for creating and maintaining the tools and processes that allow the dev teams to manage their systems in production. At a larger organization, it doesn't make sense for each individual dev team to create and maintain its own deployment and management systems... that would mean a lot of duplicated effort and a wide variety of systems to maintain. Devops creates the tools and procedures, and dev teams use them.
You might say this is just sysops, but I think the key difference is that the dev teams are the ones using the production management tools, not the sysops team. Devops isn't paged when a system has an issue; the dev team who owns it is. Devops is only paged if the tool the dev team is using to handle the issue has issues itself.
And through this I've noticed that developers can be just as bad at communicating issues as any end user I've ever dealt with.
Just because they've written the system end users use does not mean they're better at reporting problems to me, because their problems are about the pipelines, the Kubernetes clusters, the containers: basically anything outside of their own programming code.
So I still need to use all the communication skills I learned decades ago when I was doing end user support for people who didn't even work in IT.
It really feels worse than back in my WebSphere days, when in order to see app logs we had to submit a ticket to the hosting team and wait two days for a .log file to be sent back through email. I'm consistently left wondering if we've actually gained anything.
We've reached a point in software development where we can build pretty much anything. But people are still making the same mistakes in figuring out what we should build.
Fix the cause, not the symptom.
Edit: And because I know it's coming... if IT security (or insert other org here) is the reason things can't be better, then it becomes the devops tools team's job to cajole them into doing it anyway, on behalf of their users.
No one realizes some dev teams can't see their own deployment logs. No one realizes teams don't have the correct Jenkins access. No one realizes there's a mysterious gatekeeper to use the corporate time series database as a service. No one realizes all new containers are transferred to a few people who have the rights to publish them.
And when I say "no one realizes", I mean two things. One, the people who do know, because they're the ones trying to do it, have given up bitching about it, because they did for years and nothing happened. Two, the people who can actually change things don't know either, because they actively ignored / forgot or because the requests never reached them (usually because a middle manager relaying the request didn't understand what was actually being asked).
And so... the status quo prevails, and everyone has a vague sense that something isn't working, but doesn't know what to do about it.
And to that problem, the best solution I've found is to get the fixer and the problem user as close as possible, wherein the problem is usually revealed as a relatively simple change or feature oversight. And for that, discovery and user interviews are the most reliable method.
You mean put development and operations together, into some kind of dev..ops.. :)
I'm sorry for the snark, it was not meant as disagreement with what you said. The whole idea of the role was to try to remedy that exact issue, but old habits die hard and sometimes things just change names but stay the same. I shouldn't throw stones, I'm struggling at our shop to steer outside of that behavior as well.
The same thing with a single DBA or whatever admin team. My dream team structure is a task-force like one. Each team has its own DevOps/DBA. They don't have to be dedicated roles, but someone in the team needs to be well-trained on those topics.
But I guess this model never scales in larger corporations so eventually they all adopt the One X-admin team model.
Looking back at the last twenty years, oscillating between central sysops and everyone doing their own thing (is that "devops"?), I feel that an optimum requires both: local skills to iterate fast, central skills to consolidate. Striking a budgetary balance is difficult, but at least ensure that some local skills remain - sell it to central ops as a way to get the trivial demands off their backs!
It usually happens in practice ("Sarah knows someone in security"), but there's no reason "devops liaison" can't be an explicit 2nd+ hat for someone on a team to wear.
Centralized teams are necessary for budgetary reasons: one person on every team asking for a devops product will never make a coherent case to management for why it's needed, but a team will.
But having been on both sides of the equation, the central-decentralized relationship usually breaks down because there's always a new face on the right side of that equation, and that requires reteaching / recoordinating all the basics. More consistency on the user side (a hat) helps significantly.
And ultimately, it's a trust problem: do I (central person) trust this person who's asking me for something? If they're new to me... probably not very much. If I've been working with them for a while... probably a lot more.
1. Find early adopters of the thing you're building who are willing to use it and give feedback (bugs, friction logs) in the early stages of the feature.
2. Build relationships with your users / VIP users and meet regularly to discuss friction points and upcoming features (feasibility, etc). To make it time efficient, make sure you're talking to staff engineers who can speak on behalf of a significant portion of your user base. Make sure your team is closing the feedback loop with these engineers (i.e. shipping things that address their concerns) in a timely fashion.
Personally? Make a list of all the teams that use your product. Schedule interviews (30 minutes to an hour for initial) with a few that use it the most, and a few that use it less. Ask them what the best and worst parts are. Ask them details about the things they bring up, and prompt about related areas they might be struggling with too.
If you get any repeats from team to team, assume those are systemic issues and try and find a solution.
The hard part usually isn't figuring out the nuances, but rather realizing that someone wants to do something at all and/or can't because of some restriction. Assume you know nothing about how they actually do what they do and listen closely.
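That "repeats from team to team" step can even be mechanized. A minimal sketch in shell, assuming you keep one plain-text file of pain points per team (the team names and pain points below are made up for illustration):

```shell
#!/bin/sh
# Hypothetical interview notes: one pain point per line, one file per team.
cat > team-a.txt <<'EOF'
cannot see deployment logs
jenkins access missing
EOF
cat > team-b.txt <<'EOF'
cannot see deployment logs
container publish gatekeeper
EOF

# Pain points reported by more than one team are candidates for
# systemic issues: count duplicates and keep anything seen twice.
systemic=$(sort team-a.txt team-b.txt | uniq -c | awk '$1 > 1 {$1=""; print substr($0, 2)}')
echo "$systemic"
```

The point isn't the tooling; it's that you only get useful duplicates if you record what interviewees actually said rather than your interpretation of it.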
Anyone here know a good legal, compliance or best practice reason for blocking all access, or did the infra/sec team just get carried away here?
IMO the biggest issue other engineers tend to have with security is that it needlessly restricts them to the point where their job sucks. It's just as important to avoid "too much restriction" as "too little restriction".
The philosophies include:
1. Applying a maxim like "Principle of Least Privilege", or "That which is not allowed, is forbidden". Maybe through choice, maybe due to standards like PCI-DSS and SOC2.
2. Blocking access isn't a problem, because we're going to grant access requests quickly. Just as soon as we fill these five vacancies on the team....
3. Surely bugs making it to production will be a very rare event? Aren't these developers testing properly?
4. Don't we have test environments for troubleshooting? Isn't the heart of devops that you have a test pipeline that you have confidence in? If your developers think they're likely to deploy buggy software to production, maybe you aren't ready for devops...
5. Any large organisation will, at some point in its history, have had some chump run their test code against the production system and send every customer an e-mail saying "Hello World". If that happened when you had 30 developers, I doubt hiring standards are any higher now you have 300 developers! Stopping that happening again is common sense professionalism.
6. Would you want Google's 100,000 engineers to have read-only access to your gmail account? Of course you wouldn't, you've got loads of private shit in there. And if you were CEO, would you want an intern on a 2 week placement to be able to download the entire customer database? Of course you wouldn't. People having access - even read-only access - to the production system is a bad thing.
7. Locking down a lot of things gives the security team more political power, as every request to reduce restrictions or expedite a request is a favour owed.
Coming from a team that builds and supports an end-user suite of applications: I'm not asking you to bend over backwards like I would have to do for a client. But at least don't enter into conversations with that attitude. You are in a service organization. You deliver value insofar as you can help the next link in the chain.
Pull up a seat at the bar with us.
I swear to god the next guy who sends me the screenshot of an error log...
Still, there's a weird expectation that operations is a bunch of elite developers, knowledgeable about every programming language, who just choose to do operations because they enjoy getting calls at 3AM.
The number of engineering leaders who allow it to happen is far more frustrating, at least IMO.
And this particular developer is my main problem in that project. So it's not all developers who are bad, it's about the same percentage as regular end users I'd say.
I do think it’s unavoidable though. No company will have a truly open stack. There will be customized versions of everything, with caveats only the infrastructure team or developer-tools team are aware of. Naturally these teams become the go to folks to ask questions and resolve issues.
I don’t know of a good way around it. Documentation works to an extent. But at some point you have to decide what parts of the system you want to expose to developers, and once that happens you become the layer they communicate with rather than the underlying technology.
DC ops is in charge of the physical system, from racking and stacking to hardware replacements.
NetOps handles the networking stack, managing the routers and switches, vlans, announcements, IP block management, network access requests, etc.
SysOps handles provisioning systems, working with dc ops to get systems repaired, laying down the OS, managing the inventory system, and anything else the systems need to be ready to run our software.
Then DevOps is responsible for all the software that our company writes, and managing how developers get their software onto the machines, how they configure the software, and how they interact with their software in production.
The lines are obviously not that clean cut, but that is the basic idea.
Sometimes, automation and tooling are written by separate teams, or are done by teams of different disciplines of operations/engineering.
I have almost never seen network and system groups NOT being separate. Mind you, this is in DC/ISP environments, where the complexity of networks is usually the mainstay or a heavy dependency of the business.
Only that's not what the idea behind it was. It was supposed to be exactly the opposite of another compartmentalization: to bridge silos, not carve out new job titles for quality control.
It's like the article says:
> At some point in time, this idea has been so widely misunderstood that the wrong definition of DevOps became the right one.
See also: 'agile'.
I work as a sort of "dev tooling engineer" or "platform engineer" for several companies who are our customers.
After years of consulting in the area of DevOps, the pretty smart people I work with figured out that while the developers' user story in the process of devops transformation is a "revelation", the one thing that usually gets forgotten is the maintenance and upkeep of those tools. So we set up a pilot project where we started hosting, maintaining and upkeeping a set of tools consisting roughly of Jira+Confluence, Jenkins as the CI/CD server, Artifactory for binary hosting and Bitbucket/Gitlab for source code.
It has worked damn well for the first pilot company (signed on 5+ years ago), the second one, and by the third the platform started selling itself (getting a public reference from the first couple companies helped a lot). The second reason for the easy buy-in is to use known software that our customers could host themselves if they had to. This practically means that we aren’t vendor-locking them with our own proprietary tools. We just take care of off-the-shelf tools.
If there is a competent tooling team that takes care of the intricacies of the tools and maintains their uptimes, the dream of "you build it you run it" can still live. There is a giant gap between "you build it you run it" and reality -- much how there is a gap between a pilot being able to fly a plane and a pilot knowing how to maintain an Airbus, the airport and the runways.
It says something about the job market that losing a customer pretty much never happens. It’s so much easier to find sysops/devs than it is to find people who specialize in tooling.
circle CI as the CI/CD service
Artifactory for binary hosting
CircleCI had the lowest friction compared to Jenkins, and GitHub Actions wasn't out at the time!
I used to say that it’s like one of the business domain development teams; it’s just that the domain is “dev”.
The business devs are the stakeholders, and the tooling/platform team members are usually already domain experts who write code.
Having a “devops” team would feel like having a team called “agile team” or something - you do “devops”, you can’t build it.
Too late. The industry has chosen to redefine the word.
For years as a sysadmin I was arguing with people that devops is a methodology and not a function. Now I have the Sr DevOps Engineer title. I know I am still just a sys and platform engineer whose job is to facilitate the life of developers, but hey, the industry insists on calling me devops, and I have fights more important than fixing my title and team name.
I just tend to think that 20 years ago people were relaying memes about sysadmins being motherfuckers who just randomly denied everything to clueless developers, while the devops term emphasizes the collaboration side more. I hope the devs I work with are happy working with me; I sure have a good relationship with them. If that is what the industry now chooses to call a devops, I am a devops.
Realistically though, I'm 100% sure that absolutely nothing will stand in the way of the DevOps job title being denigrated.
Companies have for a very long time used title inflation to bait and switch the unwary software engineer into editing enterprise software configs, and will continue to do so for all of time.
And I definitely agree with the article here that if you're using PaaS, you don't need a DevOps team. And there's certainly a better feedback loop there.
But many will argue that PaaS _still_ isn't there yet for highly complex applications. Eventually you reach a point where you basically need a team managing the complexity that PaaS adds back in.
In any case, this is mostly pedantry. We should all strive for simplicity as much as possible, but complexity in apps breeds complexity in organizational structures. DevOps was never a silver bullet, and there probably won't ever be one.
You have a dedicated team building an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Google App Engine (or Heroku, or Digital Ocean App Platform, etc).
I know this might be unreasonable, but I always get frustrated when everyone suggests that anyone running their own infrastructure is clearly doing it wrong. I get that most dev groups are creating web apps and should use a PaaS, but there are also a lot of us who are CREATING those platforms that other companies use. We also have to write and deploy software, and sometimes I wish I could have discussions about how to do that without most responses being that we shouldn’t do that.
Maybe general developer forums like HN aren’t the best place to have those conversations, but I haven’t found great active forums dedicated to platform developers.
There's tons of interest, but much of the knowledge is covered by corporate NDAs and employment agreements, and getting people aligned on terminology and technologies is hard because so much is hidden behind them. The current solution is either to work for a big company where there's rigorous internal discussion, or to go to tech conference after-parties and hope to meet the right people at the right time/place on the Ballmer curve to have a productive discussion.
Heroku's simplicity for deploying is honestly a thing to aim for, but using Heroku at scale while keeping things consistent and secure is extremely difficult. It's not built for large, complex infrastructures. It's built for small shops and ad-hoc deployments, and it's dope for that.
Part of the reason that the name DevOps was chosen almost 10 years ago when it was started was to help signal to the wider org the cultural shift we were undertaking, and that the expectation was going to be for devs to own their systems all the way through production.
At this point, we would probably change the name, but the team is quite attached to the name after 10 years. We actually did try to rename the team, but everyone still called it the devops team and we gave up on the change.
A sysadmin would write the tooling to ensure the payment code could work (failover databases, redundancies) and put safeguards in place so that it does not get traffic that is not intended for it.
Sysadmins used to write the code that touched thousands of machines at once; developers used to write the more complex code that was the bread and butter of the company.
But where I’ve worked, it was never a silo. Just a different mindset, and everyone worked together to do the best they can. That was before “DevOps”.
Uhh if by tooling you mean a bash script to deploy ssh keys to prevent team A from accessing a machine owned by team B.
You really have no idea what you're saying.
A junior sysadmin executes checklists, writes simple scripts, reads logs, and solves common problems, all while learning the complex supersystem of their workplace. They learn configuration of systems in use and how infrastructure is arranged and commonly changed.
A sysadmin helps to create checklists, writes more complex code to work in an automation system, and generally solves problems with non-obvious causes and/or remedies. They know all the routine events in their workplace, and add to documentation. They build, configure, test and recommend new tools and solutions to users, devs and the rest of the ops staff.
A senior sysadmin works on policy, debugs automation, solves complex interactions, and has an excellent understanding of complex systems and networks, not just at their workplace but in interactions and hypothetical interactions with the Internet at large.
DevOps guys are senior sysadmins in the cloud (and that last caveat isn't mandatory). People shit on sysadmins all the time, but the only real difference between senior sysadmins and DevOps engineers is pay packet, tools and responsibility.
Which is one of the things the whole institution was designed to solve.
I realize that there’s quite a lot of devs these days that got in it solely for the money, and don’t care one whit about what happens after they throw the app over the fence.
But the whole reason I liked the movement was that it gave me more control over my own destiny. With half baked devops systems and processes that I have to follow it’s (mostly) no different from the BoFH of yesteryear.
It's sort of like expecting you can get by with only sales engineers. Having someone good at either job is hard enough, expecting you can staff only those that are just as good at both? Reality is you'll just end up with people deficient in one area or the other.
You might get by with a small core of people that can successfully straddle both roles, as long as you don't expect them to fully handle the needs of both departments, but I suspect hiring only for either department and fostering good communication and encouraging them to work together on projects that help deployment and availability yields better results. Maybe that partnership will grow you some good devops people if you're lucky. Maybe you should resist making a department out of them if that happens.
This is in an enterprise-sized organisation with 10,000 employees.
We still have a sys ops team to handle security, network and all those other things, but deploying software? That’s really easy for our developers.
We use Azure DevOps, and we deploy to Azure apps with dev, testing, staging (5% traffic) and prod (95% traffic) slots. It takes maybe 5 minutes to go through the to-do list when a new application or function or whatever is created, and we have a very fascist naming convention. This isn’t an Azure commercial; we’re only in Azure because we’re the public sector and already so in bed with Microsoft that it’s the cheapest solution. I’m sure you could do the same in any other cloud.
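For what it's worth, the slot and traffic-split setup described above can be sketched with the Azure CLI. The app and resource-group names here are made up, and this is an outline of the shape of the config rather than a recipe:

```shell
# Hypothetical names; substitute your own app and resource group.
APP=myapp
RG=myapp-rg

# Create a staging slot alongside production.
az webapp deployment slot create --name "$APP" --resource-group "$RG" --slot staging

# Route 5% of traffic to staging; the remaining 95% stays on production.
az webapp traffic-routing set --name "$APP" --resource-group "$RG" --distribution staging=5

# Once staging looks healthy, swap it into production.
az webapp deployment slot swap --name "$APP" --resource-group "$RG" --slot staging --target-slot production
```

In practice these steps would live in the pipeline definition rather than be run by hand, which is what makes the five-minute onboarding possible.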
A good service is one that solves the problem in a reasonable time. A great service accomplishes the same, but understands that the code, the company and even the pipeline will evolve, and that the client will need to reinvest, either themselves or externally, again. It will communicate this, along with the limitations, to the client.
There are only two types of clients I've seen more in line with the original statement, declining or flat companies, and both are only in the maintenance, do-not-touch phase.
Nothing they build us would last forever, but then, we operate 300 IT systems, only a few major systems have a life cycle of more than 5 years. 5 years ago everything we ran was on prem but on VMs on rented hardware, 10 years ago it was physically on prem some on physical servers, now it’s in azure, who knows where it’ll be in another 5 years.
I fail to see what a full time staffed devops department would give us except for added continuous cost that is so much higher than its alternatives that it’s just ridiculous.
I mean, I understand full well that system operators are worried in the current climate, but most of ours simply developed into broader roles that aren’t becoming obsolete.
In my case, the team "responsible for creating and maintaining the tools and processes that allow the dev teams to manage their systems in production" was just a regular OPS team. So they built tools that were not working properly, and they struggled to focus on the most valuable parts because they were not using them. At least they were the ones getting paged when systems were down.
My next ask for my project with my next client will be to embed a devops engineer full-time within the dev team. At least they will understand and share the pain of having to work with a git+terraform ticket system and all the bad ideas that our OPS friends think are great.
I just don't want them to stay in their ivory tower and keep breaking my balls with their broken terraform templates.
This keeps being trotted out, without question, as the reason for the shared devops platform team. So what if each team has their own opinionated build, scan, deploy tooling?
It's used for compliance security scans now, so we are never going back.
Think of home improvement. If my toilet is clogged, it doesn’t make sense for most people to call a plumber. That doesn’t mean that plumbers are bad, just that it’s easy for a generalist to do basic tasks.
Likewise “pure” devops rarely makes sense because developers are assigned to projects that end, and sustainment doesn’t require the same skills. It’s cheaper to pay for 20% of a $75/hr ops engineer than 100% of a $150/hr senior dev.
Some developers tend to stick with the dev parts and ignore the Docker/Helm configs, so yeah, there is some trouble with a true "DevOps" project.
And Google is re-introducing a "K8s stability role", which is a Sys(Dev)Ops role, pushing in that direction (for valid reasons).
A true shift is difficult, but the DevOps time-to-market deploy speed is so high I will not look back :)
FYI, that is sysops, in case you missed it.
Then there is the infrastructure/deployment team and finally the developers.
At least in larger organizations I think that is a common way of doing things.
The article suggests using a higher-level abstraction, with better tools suited to the developers than low-level tools like Docker, Kubernetes etc. You can make this into a separate internal product and still keep the devops idea. But as soon as you break out any of the core responsibilities for the product from the team, whether that's business, security, stability or quality, you are moving away from devops. Sure, devops isn't the solution to all problems. But why call it devops when one doesn't like the basic devops idea?
IMHO I would guess they're too close to the dev people to care about the bigger picture
That doesn't make your team a devops team. The devs are in fact the devops team. Engineer is ostensibly a better title so, because reasons, the engineers are just calling themselves engineers.
What you're doing is really internal tooling, dev tooling, infra, prod, those kind of team names.
Now I'm running my app via multiple docker images, trying to get a buggy remote debugger protocol to work (no local variables, wtff?) where I have to coordinate the container and the IDE, command line wrangling / using print debugging in the container to chase down stuff, digging around through multiple log files, copying-and-pasting stack traces into my IDE to analyze, which kinda works in some cases since the file paths are different, and on and on.
DevOps came and took my nice dev environment away, and I am mad.
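For what it's worth, the remote-debugging wrangling described above can sometimes be tamed by publishing a debug port and attaching the IDE to it instead of print debugging. A hedged sketch, assuming a Python app and the debugpy package; the image, port, and entry point are placeholders:

```shell
# Run the app with a debug server that waits for the IDE to attach.
# Mounting the source tree at a known path keeps file paths consistent
# between the container's stack traces and the IDE.
docker run --rm -p 5678:5678 \
  -v "$PWD:/app" -w /app \
  python:3.11 \
  sh -c "pip install debugpy && python -m debugpy --listen 0.0.0.0:5678 --wait-for-client app.py"

# In the IDE, attach a remote Python debugger to localhost:5678 and map
# the local source root to /app.
```

Whether this is actually less painful than the old single-process setup is debatable, which is rather the commenter's point.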
devops was able to convince the industry that "software engineers own their own infra" is a viable thing to do so they could do engineering work, except they're out of their element. That's not their job. devops should be pulling in open-source technology that's developed by real engineers and writing tiny packets of glue code to make it work to add value.
Never mind devops teams, I remember when my team had a _BUILD_ team. Serious companies still do this; engineers don't touch production. devops and sysadmins make sure that the business software stays operational; they couldn't care less what it is. Their mandate is to make sure it's operational, and they hold engineers accountable.
k8s is a terrible experience and a terrible abstraction. I'm not smart enough to do better, but I'm smart enough to know it's terrible. It's snake-oil. My team is now migrating to the cloud, along with of course k8s, and guess what? The k8s environments in the cloud are to be managed and owned by the software engineering teams, because devops doesn't want to do their job. What a disaster.
at my company, a bunch of devops folks that deployed k8s 80% of the way left soon after to make 2x the pay as contractors at other companies to help them push through their last 80%. This is another reason that there is a lack of parity on the new infrastructure. Unfortunately, everyone is still stuck at 80%. Strange.
It's sad to see something like devops that was supposed to bring teams together is still creating divides and one team sees their role is to hold another team to account.
People commit their changes, I (and sometimes another dev) review them and they are pushed to the test environment, from there to staging and then production (with a way to easily roll back). This is all done automatically with out-of-the-box solutions (fluxcd, flagger, kilo, kaniko etc).
At work, Kubernetes is not considered too difficult to manage. Much can be done in a very consistent and complete way, unlike with Docker alone. What am I overlooking that makes k8s considered snake-oil by so many?
Edit: Forgot to add. I am so happy with k8s that I even run it at home and on my external servers.
If you're running Kubernetes on bare metal then I don't blame you for having a bad impression. I can't help you understand that your problems are almost certainly to do with a group of "hardcore Ops" guys trying to do everything GKE or EKS gives you for free while neglecting your actual needs, not to do with Kubernetes itself, which is most useful when Google just creates a cluster for you and you can immediately start dumping manifests into it.
> at my company, a bunch of devops folks that deployed k8s 80% of the way left soon after to make 2x the pay as contractors at other companies to help them push through their last 80%. This is another reason that there is a lack of parity on the new infrastructure. Unfortunately, everyone is still stuck at 80%. Strange.
Again, this isn't really to do with k8s is it? Probably better to point the finger at management for not retaining the people who they needed to run k8s or listening to your complaints trying to work with something 80% done instead of 100%. At my company we used k8s 110% for everything and have amazing support for it, and as a dev it's a dream come true. For people who don't understand Ops, it can be difficult, but when you open the "DevOps" box of saying "Devs own their services in production" for the first time, you're bound to do a lot of learning. I see this in my teammates a lot, who have basically zero understanding of how things actually deploy or run once they've left their local machine.
> devops was able to convince the industry that "software engineers own their own infra" is a viable thing to do so they could do engineering work, except they're out of their element.
Devops wasn't about getting ops people to start doing development. It was about getting developers to start doing ops, because infrastructure definition became a type of coding.
What's incredible about your inversion of understanding is it's exactly the point of OP's posted article - that people misunderstand devops in exactly the way you just did.
> That's not their job. devops should be pulling in open-source technology that's developed by real engineers and write tiny packets of glue code to make it work to add value.
"Real engineers"? This is a Googleism and their caste system of "real engineers" and "site reliability engineers". The rest of the industry doesn't coddle their precious "real engineers".
Designing a resilient, scalable, frugal, operable architecture of load balancers, servers, caches, queues, dead letter queues, and all their associated metrics and alarms is absolutely "real engineering" and worth the time and attention of the same engineers that are building the core business value of their company.
> k8s is a terrible experience and a terrible abstraction.
Completely agreed. But note that K8s came from Google and their 2-tier system of engineering. But the quote in OP's article is from Werner Vogels, AWS's CTO, and Amazon is legendary (nay, infamous) for asking their engineers to work on ops.
So, I agree with the last half of your post: K8s IS a scam, and should not be operated by an ops-focused team (probably not operated by anyone at all).
But if you had followed the intention of the DevOps movement, not the perverted definition of DevOps, then the k8s scam would have been obvious from day one.
There's no such caste system at Google. SREs have the same bar as the SWEs, and plenty of people switch roles from one to other. SREs are also well-respected and I don't think anyone would consider them as _not_ real engineers.
> Designing a resilient, scalable, frugal, operable architecture of load balancers, servers, caches, queues, dead letter queues, and all their associated metrics and alarms is absolutely "real engineering" and worth the time and attention of the same engineers that are building the core business value of their company.
You are absolutely right and that's why having people focusing on that is _really_ important. Having worked at AWS previously, I can tell you that when you force your software engineers to do everything and promote them based on _what_ they deliver without caring about the operational load, you end up with crazy oncalls and poorly-designed software. There's a reason why AWS oncalls are infamous whereas Google's are not.
I work at AWS. While it's true that promotions can create all sorts of incorrect incentives, the reason why AWS oncall can be brutal on some teams isn't because of promotion-orientated architectures that disregard operational load.
The reason is that the company prioritizes moving fast and shipping stuff to customers. We have a deep backlog of customer asks and we want to ship them quickly. The systems need to be secure, durable, available, and fast - those are non-negotiable. But do they need to be operated automatically? Well, no, that's the easiest thing to compromise - ship quickly, have initially shitty oncall, improve operations behind the scenes.
This is usually why customers are always confused/concerned why AWS ships something new then releases no updates for a year. We don't compromise on security or durability, but we do compromise on engineer happiness during oncall for the first release. So after shipping, the immediate next step is to start improving automation and operations.
This is done with open eyes, and most engineers are happy with it if it means shipping big things quicker, rather than releasing a new version of chat software every year.
None of this is promotion-orientated. It's working backwards from customers.
Could market share be a significant factor here?
It's just the cultural difference between the two companies. I have been told that GCP oncalls are a lot busier than the rest of Google, but it's still nowhere close to the suffering I had at AWS. It's an organizational pain and comes from the mindset AWS has towards software development and their engineers.
It's why most Google X ideas flame out. It's why there's a new chat app every year.
But - it's a net positive for humanity. Google publishes a lot of papers, and I think genuinely has moved humanity forward in the last few decades.
I don't have a PhD and I don't do L33tcode, so I don't think I'm smart enough to be at Google, but honestly I don't know if I would want to be.
I prefer shipping.
I completely agree that people who prefer to ship fast would be happier at AWS. I've heard all the horror stories about how slow Google is, but I think GCP has a great balance between speed and quality. I'm definitely happier here, but I understand why some people would be happier at AWS.
But fair enough, I don't work at Google, so I'll withdraw the point. Having said that, knowing that AWS engineers do oncall and GCP engineers don't (letting the SREs do it) still makes me think there is some sort of two-tier system.
re: Oncall, we have two rotations:
- one manned (personed?) by SWEs, 9-5, responsible for dealing with customer issues and mandatory.
- one is mix of SREs and SWEs, 24/7, responsible for prod issues
I believe SREs also have their own rotation, but that's 9am-9pm because they are always spread among two timezones. Overall, this is muuuuuch better for everyone involved compared to the AWS oncalls. I remember barely sleeping for a week being the norm on one of the teams I worked at. Our cries for another team in a different timezone, similar to Google SREs, were shut down every single time. "Customer obsession" at AWS means delivering stuff as fast as possible and then throwing the engineers under the bus. I still remember the days I had to wake up multiple times a night to run a command manually (literally 5 minutes) because engineers couldn't take the extra 4 months to do it right.
Thanks, but no thanks. I was at a great team at AWS for ~2 years with great engineering culture and little operational load, but unlike Google, that's rare.
Reminds me of the idea that any user may extend a system by making plugins/changes, if you provide them a sufficiently humane programming language. Then it turns out they still can't, and a completely new category of developers is born.
Seems like another one here, because it always turns backwards like that.
Fwiw, I'm fully onboard with you, but really love the core message of books like the phoenix project and don't really want it to be lost because of overzealous use of docker.
compiled code gets (auto)deployed with SFTP in some cases, even.
deployment has become massively complex because of the need to keep production and test environments roughly the same. Deployment to production should not be complex; it basically boils down to the following:
- think about a method of getting the binaries installed on the system in a secure way.
- think about sending some state back confirming the binary has been installed successfully.
- add node/member/app/whathaveyou to the production membership pool/loadbalancer/BGP Route reflector/whathaveyou.
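The three steps above can be sketched as code. This is only an illustrative model, not a real deployment tool: the function names are made up, and an in-memory dict stands in for whatever transport and pool mechanism you actually use.

```python
import hashlib

def install_binary(node: dict, binary: bytes) -> str:
    """Step 1: ship the binary securely; here we just model the node's
    disk and return a checksum for later verification."""
    node["binary"] = binary
    return hashlib.sha256(binary).hexdigest()

def report_install(node: dict, expected_digest: str) -> bool:
    """Step 2: the node sends state back confirming the install succeeded."""
    actual = hashlib.sha256(node["binary"]).hexdigest()
    node["healthy"] = (actual == expected_digest)
    return node["healthy"]

def add_to_pool(pool: list, node: dict) -> None:
    """Step 3: only verified nodes join the load-balancer pool."""
    if node.get("healthy"):
        pool.append(node)

# Usage: deploy to one node, then enroll it.
pool: list = []
node = {"name": "app-01"}
digest = install_binary(node, b"\x7fELF...")
if report_install(node, digest):
    add_to_pool(pool, node)
```

The point of the sketch is that each step produces a verifiable signal before the next one runs, which is what keeps the production path simple.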
The answer should be more modular code with clear interfaces and contracts so that I can test as much as possible locally.
- unit tests should always complete locally
- integration could also complete locally but are allowed more complex scenarios
- in any case, we require developers to pull all their deps through a caching proxy (external system) to keep a lid on what they use, keep a lid on licenses used in our software, and keep open a door to centralized vuln tracking & management.
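For Python projects, routing dependency pulls through such a caching proxy usually boils down to a one-line client config (the proxy URL below is a placeholder):

```ini
# ~/.pip/pip.conf -- pull all dependencies through the internal caching
# proxy (hypothetical URL) so licenses and vulns can be tracked centrally.
[global]
index-url = https://proxy.internal.example/pypi/simple
```

Most package managers (npm, Maven, Go modules) have an equivalent registry/mirror setting, so the same policy can be applied per ecosystem.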
Though if you can't fully test software locally by spinning up all units and integration components, that's mostly a consequence of bad system design, for example when one application depends on another that can't be run locally. The reasons are (a) no bootstrappable dev environment, (b) no test data management/generation, (c) no license, (d) too complex for the originating team to provide, or (e) a proprietary appliance no one but the operating team understands.
(a) through (e) all encountered at $currentjob, and I'm not happy about it, but what can you do? Beg the other team to fix their shit and wait, or just bite the bullet and deploy to "dev stage" to get the testing done.
I'm in that situation now and the best solution is to just give every dev their own scaled down but complete copy of production with a little magic so the load balancer proxies requests from their dev URL to their laptop. Like it's pants-on-head stupid but I'll be damned if it didn't make onboarding new devs instant.
I prefer to be able to fully run my application locally, with all databases required, and test everything on my machine. DBs and other tools like a simple SMTP server run in Docker, yes, but that's just a decision I took and I could easily run these tools without containerization.
Maybe I'm just in a lucky position to be able to decide this by myself, but I think we can get back there.
I think you are kind of mistaken here; as a _devops_ I never forced our development teams to use Docker or any other specific tooling. The thing is, software engineering has become a whole lot more difficult and complicated over the years. When I was first learning web development 15 years ago, Notepad was the only thing you needed; now look at what you need just to have a simple CRUD app running locally. Setting up local environments has become so much harder, so people found the answer in containers: you set them up once and can share them within the team easily. Please do not blame _devops_ for that.
You don't have to run your app via multiple docker images locally, you can still configure everything to work natively. It's just way more difficult because of the dependencies of modern technologies.
For example, you can use a recent systemd to replace basically all of Docker and then some. It provides advanced sandboxing, resource control, and job control for units, and even manages "containers". Which is why my private servers mostly use systemd, properly configured to provide what containers are thought to provide. Deployment is switching out a binary or adding a directory tree, overriding a .service line, then reload and restart. Easy.
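To make that concrete, here is a sketch of such a unit. The service name and paths are hypothetical, but the directives are standard systemd options:

```ini
# /etc/systemd/system/myapp.service -- hypothetical service
[Unit]
Description=App sandboxed by systemd instead of a container runtime

[Service]
ExecStart=/opt/myapp/bin/myapp
# Sandboxing: ephemeral unprivileged user, read-only OS dirs, private /tmp
DynamicUser=yes
ProtectSystem=strict
PrivateTmp=yes
NoNewPrivileges=yes
# Resource control via cgroups
MemoryMax=512M
CPUQuota=50%
# Job control
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

A deploy is then: replace the binary, `systemctl daemon-reload`, `systemctl restart myapp`.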
Though over a 10 year period, companies, developers and operators went all-in on "container runtimes" and here we are.
That being said, things like kubernetes are NOT NEW and NOT DIFFERENT from solutions like corosync+pacemaker (who remembers them?). They do the same thing, with the same levels of flexibility, extensibility and pluggability, just with containers. From my POV, people who are surprised by k8s and its complexity never lived through, or forgot, the cluster tech of the 00s. The circle is just about to close, and the next iteration seems to be micro VMs instead of containers.
We (a colleague and I) basically learned Terraform, then Kubernetes, and then Helm in an extremely stressful environment. I wouldn't recommend that to most people. (Also, I can hear "you don't need k8s", but tbh, it makes the difficult things possible, and that's what we needed).
I've done Ops in bare-metal environments before, and at least there, even if there were new skills to learn, the environment just wasn't so hostile for beginners, and the learning curve was simpler.
The YAML (or heaven forbid, jsonnet) does *not* help. TBH, I'm convinced if the IaC bits were done in a regular Java/Go/Rust SDK style instead of mixing and matching YAML glued together via labels and tags, I would have had a much easier time.
The positive side of that is that now the IaC bits are mature enough that even the Data Scientists in our team write their own Terraform code, and I've upskilled myself quite a lot. I'm slowly going back to solving Storage/Data/ML issues, which I enjoy a lot more.
Maybe I'm just becoming a cranky Software Dev as I enter my 30s. I do _not_ enjoy reading YAML.
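The SDK-style idea can be roughed out with nothing but stdlib dataclasses. This is not a real k8s SDK, just an illustration of the difference: a typo'd field name or invalid value fails at construction time, instead of surfacing as a mystery at apply time.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Container:
    name: str
    image: str

@dataclass(frozen=True)
class Deployment:
    name: str
    replicas: int
    containers: list

    def __post_init__(self):
        # Validation runs when the object is built, not when it's applied.
        if self.replicas < 1:
            raise ValueError("replicas must be >= 1")

def manifest(d: Deployment) -> dict:
    """Render the typed object as a plain k8s-style manifest dict."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": d.name},
        "spec": {
            "replicas": d.replicas,
            "template": {
                "spec": {"containers": [asdict(c) for c in d.containers]}
            },
        },
    }

dep = Deployment("web", 3, [Container("web", "nginx:1.25")])
print(json.dumps(manifest(dep), indent=2))  # JSON is valid YAML
```

Tools like Pulumi and the various k8s client SDKs take this much further, but even this sketch shows why typed construction beats templated YAML glued together via labels.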
If anything, Terraform comes the closest to being a good IaC language because providers form an API to production resources as opposed to unstructured code/config like Helm or Jenkins/Puppet/Chef. Production in reality is well-typed, and unstructured configuration makes static analysis very difficult and makes diffs only a guess at what will actually be changing when code changes are applied.
What I'd like to see is Haskell (or similar) typeclasses as production API; make it impossible to construct malformed configurations in the first place but allow expressive functions to replace the limited templating functionality that current languages have (why can't helm templatize inside values.yaml? Why can't Terraform templatize backends and why aren't providers first-class objects?). It's very difficult to keep current IaC DRY, and if it's not DRY then it has bugs and there are misunderstandings about how environments are actually configured.
Composition over encapsulation, expressive enough for complex configs, not expressive enough to be turing complete.
The problem with DSLs and Config languages is that someone will inevitably try to use them in a way they are not designed to be used, and then circumvent the way you're "supposed to" do things. Not only do you lose all the advances in tooling, but people will end up reinventing things that already exist in the other languages.
PS: Of course, I'll pick Static Typing every time.
I think DSLs such as HCL are better than either option, as long as they're flexible enough while staying limited enough not to become illegible/unmaintainable.
YAML, JSON and HCL should be purged from IaC asap. Luckily Dhall can be integrated everywhere we need YAML, otherwise we would go nuts. The biggest failure of k8s is not understanding that YAML does not scale to teams or beyond 100 lines, and nesting it is especially bad (hello ArgoCD).
A type-safe, not-accidentally-Turing-complete configuration language, or for that matter any real language (hello Pulumi), is much better than YAML will ever be.
Anyway, if I can give you some advice: drop k8s like it's hot. It is much more trouble than it's worth. We have phased it out everywhere we could. The companies still using it are in year-long implementation cycles with 10+ consultants. It is going very well. For consultants, obviously.
I went into the rabbithole of terraform/HCL and came up with Terragrunt to do dynamic code generation. Combining it together with generous use of `make` keeps me sane, but it's not a long term solution.
Even now, there are things within the system where the boundaries should be separate, but they aren't.
We literally got exhausted and burned out writing our automation, and decided to leave it as it is and then slowly build it up over the next few months. :/
PS: You're 100% right re K8s/YAML. It feels like everything in the Kubernetes world is optimized for writing the code just once. Which, of course, is easy if you already know how all the various bits fit together (true in any system).
But Reading/Debugging kubernetes world is a dreadful experience.
Exactly. It is abstraction, over abstraction, over abstraction for no obvious benefits.
That said, you wrote:
> drop k8s like its hot [...] We have phased it out everywhere we could.
Could you elaborate on what you've replaced Kubernetes with, and for what kind of deployment?
As you so aptly put it, the k8s learning curve is beginner-hostile and needs systems knowledge before you go at it, so I'm not sure if this was also about being more familiar with the abstractions - after having looked at k8s for 2-3 years, jsonnet finally made me like interfacing with the k8s apiserver.
a) Learning K8s and Helm Concepts (both different things)
b) Reading/Debugging K8s and Helm YAML (both different things)
c) Ending up at Helm Chart Configs (another config)
d) Figuring out how to set all this up to help me deploy the solution I needed and customize it according to my needs
...all within the span of a few weeks.
So by the time I got to some GitHub repos which were using Jsonnet, I was basically done. I'm sure it has its uses and given enough time and experience, I can see how to utilize it more effectively.
Indeed, the problem of just saying "you don't need k8s" is you may accidentally invent an even worse in-house version of k8s.
I thought that until I started working with Gradle (a build tool whose syntax is nearly-fully-functioning Groovy). I miss Makefiles - they were strictly declarative and, while I've seen some hairy Makefiles, you'll never need a debugger to figure out what they're actually doing wrong.
We're far too invested into Terraform at this point to learn yet another tool. Oh well. Hopefully the next time I have to implement something like this, K8s ecosystem would also have similar tools (or maybe I just missed something)
I want simple yaml/hcl that is easy to understand and write. I don’t need to feel smart writing TF, I need it to be easy to understand to reduce MTTR when things go wrong.
The biggest issue is that parts of the TF for GitHub require references, and it fetches each of those references individually (which quickly makes you hit your rate limit, while also taking 10+ minutes for just a plan). Wrapping this in Python, doing more efficient fetches ahead of time, and feeding the references in yourself can turn a 10+ minute plan into a 20-second plan for 1000+ repos.
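A minimal sketch of that wrapper idea (the bulk fetching itself is omitted; the file and variable names are assumptions): gather repo metadata in a few efficient API calls, then hand it to Terraform as a single variables file, which Terraform auto-loads via the `*.auto.tfvars.json` convention, instead of letting the provider make one API call per reference.

```python
import json

def write_tfvars(repos: list, path: str = "prefetched.auto.tfvars.json") -> dict:
    """Write prefetched repo metadata where Terraform will auto-load it,
    so the plan reads local data instead of hitting the GitHub API per repo."""
    payload = {
        "repositories": {
            r["name"]: {"default_branch": r["default_branch"]} for r in repos
        }
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload

# Usage with data already fetched in bulk (e.g. one paginated API call):
repos = [
    {"name": "app", "default_branch": "main"},
    {"name": "infra", "default_branch": "master"},
]
vars_written = write_tfvars(repos)
```

The Terraform side then iterates over `var.repositories` rather than declaring a data source per repo.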
Corporate management is always surprised when my first move is to go sit with the developers and ask how I can unblock their workflow, reduce pain and tighten their development loop. Most SREs and professional DevOps people I've met do not do this and insist that their function is to tell the developers how they'll build their app and in which way AKA the same shit from the 90s.
The portmanteau should be a dead giveaway that the number one priority of this job is building cooperation between these divisions. And often, it's not a technical solution.
This summer I finally hopped over to the DevOps side of the house as a manager, having always been a software engineer. My mandate from higher ups is to reform the team to start practicing SRE, but I think it's really more about reforming the whole organization. In every way possible, I'm trying to build a team of people willing to "be the change" like you say. But I also hope that SLOs, as a formal way of agreeing on priorities, will grease the wheels a bit.
> And often, it's not a technical solution.
Absolutely. In my experience, it's almost never technical. If you can structure the collaboration so that correct incentives apply, any technical solutions required come naturally.
(I wrote about my Rackspace experience here: http://zephyri.co/2016/can-devops-be-outsourced/)
SRE sets standards so a centralized team can manage the ongoing maintenance and on call break/fix of many apps in an organization while development teams move on to something else.
Most of us are not FAANG companies and don't have the pick of the best engineers in the world (or the region). We cannot trust some of our devs to deploy directly to production because, quite simply, some do not understand that responsibility and cannot rise to it. We still need our CTO to check any big changes because things get missed, and this just reinforces the idea that you don't have to nail it; the CTO will probably find any problems.
We are also in Europe where you can't just fire people if they e.g. deploy 3 bugs to production. So what do you do? Like most companies, you try and introduce a good level of automated testing, some testers to write smoke tests etc. and we don't just deploy whenever like some companies do.
Then we are just back where we started with a bit more overhead and still not deploying more quickly to production.
Neither can FAANG - one of the reasons why SRE is a thing.
> We are also in Europe where you can't just fire people if they e.g. deploy 3 bugs to production
This sounds like a process failure nobody should be fired for.
I get paid a lot because I can write api endpoints, front end code, iac, pipelines, and understand networking enough to make our apps SOC2 or HIPAA compliant.
"We don't need a separate devops department, the programmers can do it"
"We don't need a separate QA/QE/SDET department, the programmers can do it"
"We don't need DBAs designing normalized database schemas for ACID databases, we'll have the programmers use ORM/NoSQL/whatever"
So somehow the tools and workflow are exactly the same for an app that has millions of external users and an internal app that only 10 people use. You still have to go through a whole stack of K8s/Docker who-knows-wtf-they-are configurations just to publish a dashboard from an internal IP. My previous team spent more than 8 months trying to pull it off, and it never went through.
For better or worse, I'm constantly annoyed by developers trying to be "special". Most projects involve stuffing things into a database, calling a webservice and reading stuff back. Don't try to be clever about it; just stick to one type of database and Spring Boot so we can all go home and sleep with no incidents.
But there are projects which really do need to be different, and we don’t deal with them well, and they don’t scale.
I wonder how other companies deal with these kinds of issues; eventually my previous team decided to host the app on one of the laptops and tell clients to visit "192.168.0.X:port".
They found another way to skin the cat purely because they didn't like the existing approach and didn't want the hassle of getting changes integrated upstream.
We have a full platform implemented, base images available, standards on how applications are developed and deployed, extensive documentation, etc. This has evolved over time along with the projects and business. This represents a _huge_ amount of work being abstracted away.
But every week we have to have another discussion about pulling in the database-of-the-week or language-of-the-week because apparently we were completely unable to develop basic CRUD apps until the new hottest crypto-document-as-a-service-database came along to let us store and retrieve data.
And every time they think _their_ application deserves special snowflake status. We're just making things difficult, they could fire up a VM and have this set up in 15 minutes.
This completely fails to account for _most_ of the actual work involved in setting most production systems up properly. How is this being kept updated and patched? Where are the backups? Where's the monitoring? What's the disaster recovery plan? Where's the documentation? Never mind trying to integrate it into all of our existing systems (making sure it's on the appropriate subnet, has appropriately configured security groups, etc, etc).
This also completely fails to account for the long tail: Even if you reimplement all of this, now it _all_ needs to be maintained for the life of your application. Any time we need to make major infrastructure-level changes, we need to account for n+1 different systems that need to be updated. If we need to do any migrations, we need to migrate n+1 things. If you need functionality X, we can't re-use the existing battle-tested modular templates to provide it to you; it has to be implemented from scratch (and also maintained indefinitely).
Most devs I've worked with aren't well-versed enough in ops to understand the full immediate scope nor the long tail. They only see the project as far as, and only plan to see it through as far as, get it working and shitfix as required.
To your dashboard example... it's for 10 people, but that doesn't change the underlying equation for most of these things. The fact that only 10 people use it will not stop it being turned into a jumpbox for some hacker group or getting our external IPs blackholed after it sends a bunch of spam because nobody bothered to patch the OS for five years. The fact that it's only for 10 people doesn't mean it's not business critical and could be down indefinitely. The fact that it's only for 10 people doesn't mean we shouldn't have a local/staging environment and should be working in prod. The fact that it's only for 10 people doesn't mean it doesn't need to integrate into any other systems (where is it sourcing data? how is it authenticating users? how is it authenticating itself to other systems?). The fact that it's only for 10 people doesn't mean someone doesn't have to maintain it after you're gone.
Following the standard _should_ be the happy path so that you don't have to worry about all this stuff. If it's not, I think your ops department was just dropping the ball.
DevOps is not about letting developers own services in production. This is the common misunderstanding that has led to DevOps teams becoming platform teams. DevOps is about helping separate Development and Operations teams work better together so that the organization, which needs both, can benefit from their harmony.
The simple fact of the matter is that the modern stack is too big. It's not possible to expect to be able to hire someone who can walk in and do their own product management + write frontend code + backend code + host the frontend on a CDN + own the backend on Kubernetes + handle all the security, monitoring, uptime requirements around the clock. Maybe over time people will expand their skillsets, but right on their first day? No way. Not a reasonable expectation.
Organizations need to allow people to specialize. This is the only way anybody ever truly gets productive: they learn some part of the stack, learn it deeply, handle that part of the stack for the company, and communicate. True DevOps isn't pipeline engineering / automation engineering, it's corporate communication engineering, so that the concerns of people later in the value stream (ops, security) can be handled as early as possible (e.g. development/staging environments) to improve overall throughput.
You still need dedicated ops engineers! We just call them SRE now. You still need dedicated security engineers! We still call them security engineers.
Of course this does not work for everything. But for most cases this is just good enough.
Yes, specialization is a must for productivity, but that's a given already in the old model. What the DevOps effort tries to cut through is the trust barrier between the two. Sure, it's communications engineering too, but it's easy to weasel out of any real change with that phrasing.
What I'm trying to explain is that I don't believe this is a realistic expectation, because owning a service in production includes aspects of the stack that are not related to development, like networking, storage, and various non-dependency security-related concerns. When you task a platform team with building an easy-to-use abstraction for developers, those abstractions will inevitably leak. When they leak, one of two things happens: either the developers page the platform engineers, in which case you're back to shared ownership and the need to coordinate the concerns and responsibilities of multiple teams, or the developers are expected to fix the leak themselves, in which case, most of the time, the developers will find themselves out of their depth, most likely doing something inefficient at best or flat out wrong at worst.
- On-call dev deploys a buggy change that messes up some transactional data. Product owner passes the complaints received by customer service. On-call dev rolls back the deployment, writes a script to fix the bad data, and then prepares a post-mortem.
- On-call dev and on-call dba are paged because the database for the service is slow. On-call dev checks the service metrics and logs, and confirms that this is not caused by a deployment nor abnormal traffic. On-call dba checks the database metrics and logs, and finds some runaway queries due to silly optimizer. On-call dba kills the runaway queries, and proposes to add an index hint. On-call dev agrees, works on it, and asks on-call dba to review the change.
- On-call sec is notified because of a new rce vulnerability. On-call sec comes up with the steps to identify and remedy the issue. All on-call devs are notified to follow the steps, to ensure that their services are not impacted.
Ideally there's an equilibrium. Some teams will inevitably be outliers, but the majority will benefit from the platform.
There's always a degree of sharing/coordination/communication. After all the teams are part of the same company. (Eg. when it comes to data access, API stability, it's a must to coordinate. Though famously Amazon "simply" solved this by a companywide mandate, requiring everything to have an API, and there's probably some story about their stability too.)
So, in any case the coupling should be loose. Basically an almost wait-free (but at least lock-free) algorithm.
A lot of developers are exactly that: developers. Sure, if you have somebody who is very motivated, he might know a thing or two about DevOps, but hiring becomes almost impossible if you expect a candidate to be a good developer with a good knowledge of devops on top.
It's just too much to handle for one person.
People wanted to replace a monitoring solution with something new, because SNMP was "bad and outdated".
SNMP is definitely not the holy grail of solutions (to be honest, it's kind of a mess). But at least SNMP is widely supported, works reasonably well, and gets you the data you need in a dashboard. Even for those old Cisco 7200s with an uptime of over a decade.
Being new and shiny in operations is usually a sign of immature technology (with the bugs and issues associated with it).
1. Returning values different than what the MIB says.
2. Missing values that are not optional in the MIB.
3. Broken getnext code that returns an incorrect oid, which could even lead to loops. Net-snmp will detect it, but proper snmpwalk gets broken.
4. Broken getbulk repeater code that can cause crashes if the repeater is too large.
5. OIDs that will crash/stall/slow the agent.
6. Implementations that cannot handle simultaneous requests. Admittedly this is notoriously difficult to achieve: most MIB implementation code is not reentrant for various reasons.
7. SNMP SET is very difficult to implement, cannot be meaningfully used for almost anything beyond extremely simple stuff.
8. UDP encapsulation can cause MTU/fragmentation issues that are a nightmare to debug. Net-snmp actually implemented transports such as TCP and it was very nice, but it never caught on in general.
Bonus, there is also the not related to implementation problem of:
9. The query semantics of SNMP are non-existent, and the indirection in the SMI is cumbersome, which means that as a client/manager you'll have to do all sorts of data wrangling to find what you need.
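Problem 3 is at least easy to guard against on the client side: a well-behaved agent must return an OID strictly greater (in lexicographic order) than the one requested, which is roughly what net-snmp's loop detection amounts to. A sketch of that check:

```python
def oid_to_tuple(oid: str) -> tuple:
    """'1.3.6.1.2.1.1.1.0' -> (1, 3, 6, 1, 2, 1, 1, 1, 0)"""
    return tuple(int(part) for part in oid.split("."))

def getnext_is_sane(requested: str, returned: str) -> bool:
    """A broken agent that returns an OID <= the requested one would send
    a naive snmpwalk into an infinite loop; refuse to follow it."""
    return oid_to_tuple(returned) > oid_to_tuple(requested)

# A correct step moves strictly forward in the MIB tree...
assert getnext_is_sane("1.3.6.1.2.1.1.1", "1.3.6.1.2.1.1.1.0")
# ...while an agent echoing the same OID back signals a loop.
assert not getnext_is_sane("1.3.6.1.2.1.1.5.0", "1.3.6.1.2.1.1.5.0")
```

Python's tuple comparison happens to match OID ordering exactly, which keeps the check trivial.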
We tried to hire someone senior for 8 months before giving up and just grabbing a junior dev that wanted to do infra stuff and trained him up. I have worked at shops with people like you describe but it seems they're all happily employed.
 Not whiteboard style, on their own time, on their own computer, not timed or watched. They can take a week for all we care.
I have no idea how you could find true experienced sysadmins in your case. They exist though, that much I can tell you. I have 20 years of experience, I write bash, python, perl and lately mostly go. I am not a greybeard as I am just old enough to have been a junior for the greybeards that wrote linux kernel patches among other things to solve infra problems.
I have led Ops teams in one shape or another for about 10 years. I second that having a junior dev do Ops work without supervision will not bring joy to your life, mostly because they do the darnedest things! One of the trainees here decided to remove a ControlTower guardrail from a root account, for a personal experiment in a sandbox; then three weeks later created a public S3 bucket in a production subaccount and leaked a bunch of data...
90% of developers just don't have the mindset for Ops work.
Kubernetes administration is hard to master. It's analogous to running VMware in production, or any cloud technology for that matter. None of them is easy, and they all require relatively specialized knowledge.
Kubernetes development, on the other hand, is relatively straightforward. Minikube is my go-to environment whenever I have to put together an app that consists of more than a single process. As long as you pay reasonable attention, it is not hard to move such applications from dev to staging to production.
Edit: fixed markup
So when there is suddenly a new generation of ops people AGAIN getting paid just to manage stuff instead of building stuff, only now in the cloud instead of on-site bare metal, this is a BIG misunderstanding of what the intent was.
And here is the lie in the AWS marketing promise. In reality, without ops, infrastructure-clueless devs will generate huge AWS bills that move profits from AWS customers into Amazon's pockets, all while production gets more brittle and CI/CD becomes a complex nightmare that gradually eats larger chunks of dev time until things grind to a halt.
The catch-22 being leaky abstractions. In the same way that devs cannot really empower regular users to generate apps with no-code tools that work well, devops cannot create IaaS or PaaS that can be used without understanding the underlying OS, networking, configuration management, package managers, etc.
mashing together devs and ops inevitably results in schisms of devsecops and netops and ^ops because developers have enough shit to worry about on the daily without needing to learn the absolute intricacies of the OS as it applies to dev and prod. employers tried to get them around this by insisting "move fast and break things" with the push-to-prod mentality only to realize a 7 hour outage as a result kindof dims the outlook for everyone, shareholders included.
so employers wound up embracing the buzz and holding the line where it made sense, with some devops as developers, other devops as mostly ops, and still more devops as network-centric roles that handle things like firewalls and traffic too. now, people are going to point to automation and say "infra as code means devops is still a departure from traditional ^ops" but heres the secret: shit like cfengine has been around since 1993 and grandstand conferences on devops didnt show up until nearly 16 years later, and hyperconvergence showed up in 2009 which basically expands on software defined network and infra, so a cogent argument can be made that automation/convergence alone isnt devops, it was always happening anyways...so theres the wizard behind the curtain:
devops was just a trick to get you to do more so your boss didnt have to hire network and ops.
lack of education never helps, but it's laughable to point at that as a serious causally significant factor. outsourcing, communication barriers, trust barriers, organizational barriers, and ossification of the business structure are what led to serious loss of productivity across the whole enterprise.
the answer is the cross-functional team. agile. scrum. (and many names and many crazy offshoots) but the core recipe is good and sane: people above process. trust (and self-check) instead of mandatory responsibility separation to check each other. empower engineers to be able to do the right thing by default, engineer tools for everything, don't let non-tooled non-integrated handoff points develop, because that leads to fiefdoms and communication bottlenecks.
So basically I have no clue what the dude is talking about - we have a great setup which is 10x better than what we had before.
It goes far, far beyond Dockerfiles. Are you on-call for production? Can you debug a problem in production?
If not, then that's the root of the problem that the dude is talking about. If these are true for you (not specifically you, any reader of this):
- You don't have a good idea of how to ship a change all the way to production after committing it
- You would not know how to debug and fix a failure of your code in production
- Your "Dev" and "DevOps" teams are in separate departments
...then congratulations, you have just reinvented the old-school sysadmin/developer dichotomy from the 90s. You just happen to be using Dockerfiles instead of Makefiles.
As an ops-sided person, this is the exact reason why ops folks always said no to releases on a Friday afternoon.
Ops people would rather be in the pub enjoying their bug free weekend by not risking pushing buggy broken code from developers, in spite of the desperate pleas and guarantees that this batch of code is *definitely* bug free. Last week was just a one-off. Honest. We swear.
EDIT: In the old style of Ops team versus Dev team versus Infra team versus everyone else.
If you're separated out from ops, what would ever drill that into you? Saying you're too busy for ops is just wild to me. That's some of the best reward-for-effort work someone can do.
Can you troubleshoot network level issues that seem to occur between your system and someone else's?
What about storage? Can you make sure your data is still there if your environment gets royally screwed?
I know these are usually done by separate companies nowadays, but getting an actual computing environment up and running with redundancy in place (down to the cooling and electrical level) is no easy feat.
So much easier to be cynical about new tools than to investigate and understand the pros and cons.
What gets completely ignored is that there is a reason sysops has been a specialized role for decades. How often do we get threads here about people getting their AWS accounts hijacked because of some credential leak (e.g. access keys in frontend builds), extreme expenses due to stupid autoscaling settings combined with a DoS attack, or content leaking out of insecure S3 buckets? Or how much money gets wasted because clueless "devops" developers think that failsafe redundancy for their tiny PHP app means configuring a bunch of x2large instances and putting ELB in front of it, instead of either going with lambda functions in the first place or at the very least using small instances and sensible autoscaling?
Or ffs just take GitLab CI pipelines. Nearly every time I take over a project or hop in to consult, I see utter fucking bullshit written by people who don't have any idea what the fuck they are doing. Yeah, your build takes 10 minutes because you don't use any cache anywhere and you don't use Nexus or any other form of repository cache. And ffs no, you should not push your application image to public Dockerhub because you have no idea that GitLab has its own registry - and no, "I read it on some 'getting started' guide for <insert JS framework here>" ain't an excuse either. People get barely trained in the technology they use, most JS frameworks (and fwiw, GitLab and GitHub CI themselves!) are young enough that it's impossible to get senior people for them, and the stuff that ships as "CI templates" there is atrocious.
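For what it's worth, the two fixes complained about above take only a few lines. This is a minimal sketch of a .gitlab-ci.yml fragment (job names and the Node project are hypothetical; the CI_REGISTRY* variables are predefined by GitLab itself):

```yaml
# Cache dependencies between runs, keyed on the lockfile,
# and push images to the project's own GitLab registry, not Docker Hub.
stages:
  - build

build:
  stage: build
  image: node:20
  cache:
    key:
      files:
        - package-lock.json   # cache is reused until the lockfile changes
    paths:
      - node_modules/
  script:
    - npm ci --cache .npm --prefer-offline
    - npm run build

package:
  stage: build
  image: docker:24
  services:
    - docker:24-dind
  script:
    # CI_REGISTRY, CI_REGISTRY_USER, CI_REGISTRY_PASSWORD and
    # CI_REGISTRY_IMAGE are supplied automatically by GitLab.
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
```

A repository cache like Nexus would additionally be configured as the npm registry mirror; the point is simply that none of this requires deep expertise, just knowing it exists.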
The amount of knowledge required to securely operate any kind of hosting or development operation is immense, and acting like developers should also shoulder the role of unix sysadmin, IT security staff and cloud configurations is extremely ignorant - not to mention only half of the developers know more than "ls/cd/rm" on the command line, less if they are Windows developers.
I do see value in "devops" as in "a person primarily focused on operating infrastructure, but with enough capability to troubleshoot production issues in applications", but devops as in "save money for actually qualified ops staff" is irresponsible penny pinching.
I've watched at least one company burn time on the same issues for years because instead of acknowledging the need for and empowering people with distinct skillsets or knowledge, they're content to let developers bumble through the process of learning and reinventing the wheel over and over in the margins. That's if they can even get the developers on growing and siloed teams to communicate with one another at all. There needs to be experienced leadership somewhere.
Developers should have more responsibility over their apps and what they're putting into production. But it's unfair and inefficient to expect all but a few will ever even have time to become skilled practitioners in these areas, especially while already keeping up with their normal duties. Now not just tech debt but foundational infrastructure can be pushed aside by the drive for new/more features.
For the foreseeable future it seems like the trend of tasking devs with so much of this work will result in engineering teams continuously, wastefully toiling over fractured infrastructure built by novices. (Maybe someday the tools will actually allow developers to take this on with limited knowledge/resources, though.)
Normal hosters wouldn't even open internal infra like DB instances, mail relays or storage up to the Internet. AWS by default doesn't firewall its internal stuff off at all and assumes that as long as you have the right credentials you can't be wrong, even if the request comes not from eu-central-1 but from some dial-up in Mexico. S3 and SES have mind-bogglingly insecure defaults, and it is hard to wall off your SES credentials enough that they're useless to an attacker who has managed to exfiltrate them (and that is easy enough: all an attacker needs is a vulnerable log4j and the dubious "best practice" of passing credentials via environment variables).
It is so incredibly easy to mess up on AWS it hurts. AWS is a gun shop that sells not just handguns but nuclear-armed bomber jets to anyone presenting a credit card. No knowledge checks, no training, nothing required, and you have the firepower of all of AWS in your hands.
Maybe this is rich coming from a Kubernetes person, but sounds like the tide is turning against this viewpoint.
Running Kubernetes itself is definitely hard to master, but you can take a <10 hour online course and be able to "helm install" your application.
Kubernetes can be the new PaaS (if you have someone else running the control plane and maybe a monitoring stack). But there's not a PaaS on the planet where you can just skip the step where you learn about it. Even Heroku users eventually have to learn about how connection handling and limiting works, and the specific Heroku commands to upgrade your database.
Genuinely I believe that good QuickStarts are the cause of so many expert beginners.
What I mean is: there is a person on an adjacent team who truly believes he is very competent in nearly all technologies. He is able to deploy a k3s cluster no problem.
He is absolutely unable to debug that cluster or use it in a way that is reliable (local persistent volumes mean you are dependent on the node never going away), he is unable to troubleshoot networking issues that have occurred in that cluster, and he has struggled many times to admit that he is unable.
But heck. He can helm install no problem!
Aren't you confusing two roles, Kubernetes application developer and Kubernetes administrator? The Kubernetes application developer should focus on their application: NodeJS, Java, DotNET, whatever. Helm install (actually helm upgrade --install) is all they should know or care about when it comes to Kubernetes.
K3s deploy and local storage providers? Let those be the Kubernetes administrator's headaches.
Also, with Heroku, App Engine, etc. you have tons of documentation (official and unofficial) and resources. It is well tested and the exact same service runs thousands of businesses.
Kubernetes, meanwhile, is like the React of the infrastructure world: you won't find two similar implementations. You need a thousand tools configured around it to make it usable, and while you might have docs for the individual pieces, you never have them for the big picture.
I was surprised by the amount of history and misconception that exists around the term.
The term was originally intended as “Agile Systems Administration” but came about at the same time as a talk titled 100 deploys a day.
So people conflated the two, and it didn't help that the conference name was a portmanteau of “Developers, Operations days” to encourage it being a conference for everyone. Either way, we're stuck with it now, and it's the least meaningful job title I've ever known.
If you have to ask “what does x mean to you” then it’s a failed descriptor.
That explains why it's such a terrible idea.
While I do like that definition, it also means that most companies/organisations cannot do DevOps, as software development is often bought as projects, while operations is a recurring cost.
There are projects that change slowly but they do change nonetheless unless they are no longer used.
DevOps, and especially with Pulumi / CDK (where it's actually infrastructure-as-actual-code and not infrastructure-as-a-JSON/Yaml file), is simply becoming a "developer who also likes ops" type of specialization. E.g. front-end, back-end and stack-end.
The nice thing, and what I try to do in my team (both at my current startup and previously at AWS), is to break the silo and keep the "you build it, you ship it" notion. Frontend - TypeScript, Backend - TypeScript, Infra - also TypeScript. When an engineer wants to build a lambda that listens to SQS, or a Fargate ECS task, or an S3 bucket, they can just do it. If you know how to use the AWS SDK to, let's say, read/write to S3/DynamoDB/SQS, the API to create S3/DynamoDB/SQS etc. is not that far off. And if you are using CDK in TypeScript for example (whether it's Terraform / CloudFormation / CDK for k8s doesn't matter), then the boundary between backend code and infra code is getting smaller and smaller - both technically and visually: the declarative code to create an S3 bucket in CDK is pretty damn close to how it looks to work with that same S3 bucket via the SDK, it's just that CDK takes care of idempotency / change sets / diffs etc.
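To illustrate the point about the shrinking boundary, here is a sketch of the two side by side. This is illustrative only, not runnable as-is: it assumes aws-cdk-lib, constructs and @aws-sdk/client-s3 are installed, and the bucket names are hypothetical.

```typescript
// Sketch: infra code and app code in the same language, with similar shape.
import { Stack, StackProps } from "aws-cdk-lib";
import * as s3 from "aws-cdk-lib/aws-s3";
import { Construct } from "constructs";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Infra code (CDK): declare the bucket; `cdk deploy` handles
// idempotency, change sets and diffs.
class StorageStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    new s3.Bucket(this, "MyBucket", {
      versioned: true,
      blockPublicAccess: s3.BlockPublicAccess.BLOCK_ALL,
    });
  }
}

// App code (SDK): talk to that same bucket at runtime.
async function writeObject(): Promise<void> {
  const client = new S3Client({});
  await client.send(
    new PutObjectCommand({ Bucket: "my-bucket", Key: "hello.txt", Body: "hi" })
  );
}
```

An engineer comfortable with the bottom half is most of the way to reading and writing the top half, which is the whole argument.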
In summary, I agree that DevOps should mean "you build it, you ship it" and that the current industry trend towards a separate DevOps department is just plain wrong. On the other hand, specialization within a team is fine: some will lean toward frontend, some toward backend, some toward data engineering, some toward data science, some toward security and some toward operations. Some will do a bit of both and some will prefer to focus and specialize, but the main lesson I learned is that it all should be part of the team. Like a D&D party, you don't go on an adventure with just Warriors, then send the Wizards separately to ship the dragon to production.
Even just having separate reporting chains is helpful to achieve a balance between work driving operational metrics (stability, performance, security), and product development work driving product metrics (sign ups, sales, acquisition costs, etc)
Then SRE came along, and it was all about devs doing ops work focusing on keeping things up and running. That is, a dev doing rigorous ops work with a focus on SLIs, error budgets, etc.
But then, the term got trendy, leaked out of Google, and everyone wanted to be cool, so they called their devops guys (who were really repurposed sysadmins) SREs, but the job really didn't change.
Source: sysadmin/devops/SRE since 1995
There are software engineers, who write software and then deploy it onto self serve infrastructure, and there are operations engineers and data engineers who build and maintain the systems that the software engineers use to deploy and operate their software.
Usually when someone calls themselves a DevOps engineer they are an operations engineer, but sometimes they are a software engineer at a small company who also has to maintain their own infrastructure.
platform is probably a better name :)
It's helpless to argue for changing the name. You'll never convince people whose careers are tied to DevOps to change the word they use.
We should come up with a buzzword that we can use for a few years to indicate real DevOps. It will be taken over by the same managers shortly after, but it can be used for a little while by people who know the real deal.
We can call this "buzzword maintenance". The industry of engineers agrees to regularly update lingo to keep ahead of corporate concept degradation.
We can do this for agile and DevOps, at least. I suggest taborca (acrobat backwards) and DesMis (designer-mission). These were done with just a quick thesaurus search.
The idea behind DevOps is that Dev(elopers) do their own Op(eration)s, rather than throwing stuff over the wall for a separate team to manage.
Not only has the industry done the opposite in many cases, they kept the name DevOps, meanwhile the full-time operations staff have overcomplicated deployments and builds to the point that even they don't understand how they work anymore.
I think there is a place for both. Going into DevOps does not mean you need to get rid of your Ops team.
I like the approach when Ops are handling the backbone infrastructure of the organization while individual teams work in DevOps methodology meaning they are responsible for building necessary application infra, programs, as well as bringing it to production and monitoring.
This is also a great way of scaling teams - keeping a path of least resistance that is secure and compliant while giving observability and conventions.
It doesn’t make the infra technology more or less interesting?
There's still value in having an infrastructure or platform team focused on building tools to enable dev teams to more efficiently deploy, monitor, and run their code, and to step in with expertise when needed (ie help debug things like database performance, etc). More "teach a man to fish" type help, but also "give them a good boat and fishing net".
I’m six months into a new SRE role and it really feels like instead of doing any kind of engineering I’m instead just getting “overflow pipeline work from the rest of engineering that the Devops team was too busy for” and I’m already looking for the door.
Agile is a practice of speeding up improvement. Idea being that organisations that improve faster will, over time, overtake organisations that improve slower. If so, it makes sense to focus on the improvement loop itself. You remove management from the loop (because management takes too long and slows down improvement) and you empower and encourage individual contributors within their teams to improve the process because they can do it faster, have hands on knowledge of the process and can observe results faster.
Additional benefit of Agile is that you now have a tightly knit team that tends to be more engaged with their work and more focused on thinking through what they are doing -- this tends to cause people to be more satisfied with their jobs because satisfaction of being able to affect the process positively is a great motivator.
Yet, the management thinks they can come with a book and institute "Agile process" by telling everybody what they have to do, exactly.
Usually, improvement within team structure is then actively discouraged with many explanations like "that's not what the book says" or "I was doing scrum at a company where it worked and so we must do the same" or "we need to standardise processes" or some other variation of it.
I have observed this time and time again at various stages -- always with the same miserable outcome.
At my last company they additionally decided to introduce the Scaled Agile Framework, meaning, according to them, we would now plan not at the team level but at the department level. This innovation meant all planning and retrospectives were combined for about 400 people. Because this was too time consuming to do every two weeks, they decided to do it every 5 sprints (10 weeks). Just to be clear -- every 5 sprints we were supposed to select and plan tickets for the next 10 weeks of development. This was presented as an efficiency improvement.
> Under a DevOps model, development and operations teams are no longer “siloed.” Sometimes, these two teams are merged into a single team where the engineers work across the entire application lifecycle, from development and test to deployment to operations, and develop a range of skills not limited to a single function.
Today, devops is often used to describe the engineering role very similar to the classic system administrator. I guess this is mostly because it sounds "more modern" and it's more attractive.
But, the comments in this thread made me aware that there indeed is a role that's focused only on serving developer needs. I guess we could call an engineer a devops engineer if their job is only to build and maintain system tools for developers.