I sold out.
Well I didn’t actually sell out, because if I did I would be on a yacht off the coast of Ibiza sipping Chateau d’Yquem Sauternes. But in the world of young hungry startups, I sold out. At the end of last year I joined Scalr, a successful eight-year-old company building the market’s leading enterprise cloud management platform.
Even though I’m in product marketing today, I’m a developer at heart. But the roles have been reversed - I have no clue what’s under the hood at Scalr. The product that Engineering has built over the last eight years is nothing short of incredible. Every time public and private cloud providers push out new features (eyes on you Amazon), we’re right behind them building bridges across the cloud providers, services, and tools every enterprise uses.
Working in product marketing has taught me that it’s one thing to write good code - but communicating to the people you want to help is just as important.
I thought it would be helpful to talk about Scalr from the perspective of a regular developer. I’ll explain why companies approach cloud management solutions and why it’s a great thing for everyone: front-line developers, DevOps gurus, IT administrators, even the finance departments.
Let’s start at the top. What is a cloud management platform?
Cloud management platforms are a way for companies to standardize cloud consumption. They give administrators the ability to create centralized best practices. Some of those practices are preventative (think guardrails for new developers), and some are reactive (when servers crash, how do we handle this?).
At Scalr we call these standardized practices Policies, and they’re enforced through our Policy Engine. Policies fall into five general categories: access, workload placement, integration, financial, and lifecycle. These Policies don’t just refer to standardization on cloud providers - we can create best practices around the tools we use for user hierarchy, automation, deployment, configuration management, security, and more. We’ll jump into the details later, but that’s the idea.
When more users are consuming more cloud resources, policies enable us to keep everything under control.
This also makes life easier for developers because it reduces friction. While we have dozens of options for instance types, security group configurations, tags, resource groups, and auto-scaling policies - we only need a few.
More often than not, we go through the same use cases. New testing, staging, production servers, configured for the application we’re in charge of. The processes around those servers, like assigning security groups, accepting security keys we already have, hooking them up into our deployment and configuration management systems are a chore. Wouldn’t it be great to have a simple UI or API call pre-configured for your everyday requests?
While some may look at cloud management as mandated restrictions, products like Scalr are a way to replicate the skillsets of a company’s best engineers and extend them to all users.
I know from personal experience. At my last company, the deployment automation behind our web app was built by one serious wizard using Capistrano and homegrown scripts. There was no UI or clarity, just: when you’re ready to push to production, run these scripts from your terminal. If it works, it works. If it breaks, Slack me.
It was fine on the startup scale, but as we grew and had issues like when production servers weren’t all running the latest application update, it created a bottleneck through one administrator. We’ve all been there - high level guys build a powerful backend but when they’re gone things get complicated. The best engineers tend to hate writing documentation.
In order for a company to hit that level of cloud consumption and care about using a management platform to solve these issues, they usually hit a few bumps in the road. We call it the ‘cloud journey’: the problems companies see as consumption increases.
Let’s start at the beginning.
We have a web application. This application is composed of app servers, databases, and the load balancer across those servers. We’ve deployed it on a cloud provider because it’s cheap and easy. Deadlines are conquered, load times are quick, users are happy. For startups deploying to the cloud is a no-brainer, but for developers in bigger companies, there’s an incentive to use cloud services to skip the slowness of traditional IT.
When I worked as a developer at a hospital, just being in the realm of sensitive data restricted every action we could take. We never had access to patient data, just staging databases filled with autogenerated dummy users. Even so we had two layers of approvals for minor updates, security groups on security, physical security keys for logins - all of it for legacy software that was first developed back in the 90s. Enterprise is weird. I’m sure many of us have experienced levels of complication like this.
Sure, I could jump through the hoops of requesting infrastructure and waiting for department approvals. But that would be unbearable in 2007, much less 2017. Instead I get the ok from the team leader, spin up servers, and get to work. Regardless of your cloud provider (AWS, GCP, Azure, OpenStack), convenience beats inefficiency.
We call this process of pockets of teams spinning up servers without any central guidance skunkworks. At the developer level, it’s great. We use the servers we need, shut down the ones we’re not using. We automate deployment and configuration on our own terms. There are fewer roadblocks to get through - apart from the application itself not working. As my old mentor used to say - Scrum can’t help stupid.
At the company level this isn’t a major issue either. Management doesn’t know or doesn’t care as long as security is locked down and costs are under control.
Unless they’re forward-looking and expecting rapid growth, a company probably wouldn’t explore cloud management at this stage.
So we’ve got our cloud servers. Over time, more developers get turned on to using AWS accounts for speed and flexibility. Some applications hook into our old infrastructure, some are entirely on cloud resources. We make IAM policies for our accounts and resources, know the services and instance types to use. At the team level, we’re all on the same page. We’ve got objectives for the applications we have ownership of. The same goes for other teams creating their own accounts.
Let’s say our team has been working on a new release. We’ve merged all our changes into our master on GitHub and we’re ready to deploy across all our servers. Some teams use Jenkins. If you’re living completely in AWS, CodeDeploy/CodePipeline is a friendly alternative. Capistrano is big in Ruby shops. Small operations may use Heroku. On the server configuration side, Chef, Puppet, or Ansible are the go-tos.
The process is generally the same. You run a script which stops a segment of servers, runs tests, deploys the new code, clears the cache, runs tests, and restarts the servers. Then you move on to the next segment of servers. Rinse and repeat until all production servers are on the same page.
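To make that loop concrete, here is a minimal Python sketch of a rolling deployment. It is not tied to any tool mentioned above: `rolling_deploy`, `chunk`, and the `run_step` hook are hypothetical names, and `run_step` stands in for whatever your real tooling does on each server.

```python
from __future__ import annotations

from typing import Callable, Sequence

# Steps in the order described above: stop, test, deploy, clear cache, test, restart.
STEPS = ["stop", "test", "deploy", "clear_cache", "test", "restart"]


def chunk(servers: Sequence[str], size: int) -> list[list[str]]:
    """Split the server list into segments of at most `size` servers."""
    return [list(servers[i:i + size]) for i in range(0, len(servers), size)]


def rolling_deploy(
    servers: Sequence[str],
    segment_size: int,
    run_step: Callable[[str, str], bool],
) -> list[str]:
    """Update one segment at a time and halt at the first failing step.

    `run_step(step, server)` is a hypothetical hook that wraps your real
    deployment tooling and returns True on success.
    Returns the servers that were fully updated.
    """
    updated: list[str] = []
    for segment in chunk(servers, segment_size):
        for step in STEPS:
            for server in segment:
                if not run_step(step, server):
                    # Stop here so the remaining segments keep serving the old release.
                    raise RuntimeError(f"{step} failed on {server}; halting rollout")
        updated.extend(segment)
    return updated
```

Halting on the first failure is one design choice among several. It leaves the untouched segments on the old release, which is usually what you want when a test step fails halfway through.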
But at this point a few questions start to come up.
What’s the protocol in deploying new code? How do we handle remote server automation and configuration? What do we educate our new developers to do? Should they learn the process Team A created, or Team B? If they both work does it matter? If we’re building separate applications, does it matter? Sure, the CIO said this way, but let’s do it our way so we don’t have to relearn new processes right now.
As more developers and administrators use cloud resources, it becomes harder to know what’s right. If I’m training new people, what’s the right way to educate them?
At this stage in the cloud journey, administrators and developers struggle with defining best practices for automation. This is the first place a company might look at using cloud management platforms. How can we standardize cloud usage? How can we universally define DevOps workflows?
Scalr customers find a solution with Integration and Lifecycle Policies, which enable companies to define their DevOps workflows where it’s relevant. Do I want to define what developers can use at the application level, the team level, or at the department level?
At this point cost comes into play. At small shops and startups you typically have complete visibility of cost. A few staging servers, clusters of production servers, and the individual services that you use. If you’re not the one paying the bills, chances are you sit across from the person that does.
At scale it gets tricky. Yes, we try to be good and keep track of everything we’re doing, but when the developers aren’t paying the bills we don’t have to sweat too much. Then you look at enterprises that are spending hundreds of thousands of dollars a month on their AWS accounts and you get why billing managers lose their hair. Before you scoff and say that’s impossible, think about it.
While writing this post I took a look at my personal AWS Billing Dashboard and noticed that I had racked up $100 for some servers I was using to play with Docker. I simply forgot to shut them down when I was done experimenting. When you have thousands of developers working on concurrent projects, using multiple accounts across regions it’s not surprising that things can get messy.
At this stage companies look at ways to control their costs. Usually it starts with IAM policies, unifying AWS accounts, and declaring that users have to tag everything they use so they can be held accountable. Easier said than done, of course. Nobody wants to be grilled when they forgot to shut down some testing servers over the weekend. It’s also not fun to be stressed about watching how much servers are costing you. When I talked about reducing friction for developers before, I know from experience that tagging is a chore.
For many of our Scalr customers, the ability to set Financial Policies at multiple levels was a game changer. As an example, you can create a ‘template’ for a test server and say that when a new one is spun up, auto-tag it, show an estimate of how much it’ll cost, and define that it gets 3 days of life. If developers need more time with it, they can request an extension, and if the request is within budget it’s approved automatically. The detail of Financial Policies goes as deep as teams need it to be.
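The template idea can be sketched in a few lines of Python. To be clear, this is not Scalr's API or data model: `ServerTemplate`, `Server`, `launch`, and `request_extension` are hypothetical names, and the hourly rate is a number you would supply yourself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ServerTemplate:
    """Hypothetical test-server template; all field names are illustrative."""
    hourly_rate_usd: float
    lifetime_days: int = 3
    default_tags: dict = field(default_factory=dict)

    def estimate_cost(self, days: int) -> float:
        """Estimated cost of running one server for `days` full days."""
        return round(self.hourly_rate_usd * 24 * days, 2)


@dataclass
class Server:
    owner: str
    tags: dict
    expires_at: datetime
    estimated_cost_usd: float


def launch(template: ServerTemplate, owner: str, now: datetime) -> Server:
    """Create a server that is auto-tagged, cost-estimated, and given an expiry."""
    tags = {**template.default_tags, "owner": owner}
    return Server(
        owner=owner,
        tags=tags,
        expires_at=now + timedelta(days=template.lifetime_days),
        estimated_cost_usd=template.estimate_cost(template.lifetime_days),
    )


def request_extension(
    server: Server,
    template: ServerTemplate,
    extra_days: int,
    remaining_budget_usd: float,
) -> bool:
    """Auto-approve the extension only if its estimated cost fits the budget."""
    extra_cost = template.estimate_cost(extra_days)
    if extra_cost > remaining_budget_usd:
        return False  # over budget: leave the expiry alone and escalate to a human
    server.expires_at += timedelta(days=extra_days)
    server.estimated_cost_usd = round(server.estimated_cost_usd + extra_cost, 2)
    return True
```

The developer never tags anything by hand here. The owner tag, the expiry date, and the cost estimate all fall out of the template at launch time.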
The S word. Security becomes an issue as applications start going into production. Security isn’t absolutely essential when everything is inside developer sandboxes and staging servers on our IP. But once applications are exposed to the real world - attack surfaces are exposed.
There are two goals for security. The first is to ensure absolute internal visibility. It should be clear to team leaders and department heads how our system is orchestrated, and developers should know how to adhere to those standards. The second is layering security and compliance. There are multiple types of users in a production system, so we should have security policies at multiple levels. Applications should have defined policies, as should the abilities of individual users and their teams. If it’s necessary, departments shouldn’t have access to other departments (like do sales engineers need visibility on financial records?). By layering security we have multiple layers of protection.
There are a few ways to accomplish these goals. You could set up IAM policies that match users to their roles in the company. As an example, junior developers can’t add instances to different security groups they’ve configured. QA teams can’t access production servers or databases. When creating security groups for our infrastructure, make sure that they’re only accessible from our IP, and that developers can only create security groups from our IP. Distributed systems are the future for everything except security - at the management level it should be consolidated into as few access points as possible. In the words of one of our customers, "De-centralize access, centralize control".
Here’s a good question: as a team developer I like to work from home and have access to our servers from home. How can we create policies around situations like that? If we’re a distributed team, how do we declare where ‘home’ or ‘work’ IPs are? Should every developer be able to jump into staging servers when they’re at home? What about when developers leave - how can we revoke their keys and tokens? There have been a few products I’ve been working on where I look at my dev folder and realize, oh god, I should definitely get rid of these keys. I shouldn’t be able to SSH into production servers for startups I no longer work for.
At scale these are the questions that matter the most. And don’t forget - it’s one thing to declare this inside one cloud provider, but if you use multiple clouds, you have to replicate these solutions. For some developers this may be above our pay grade, but understanding best practices and security concerns is important for everyone.
All of these factors converge at the end of the cloud journey, taking everything we talked about and doing it even bigger. From a few pockets of developers experimenting with AWS to a CIO-mandated march to the cloud, economies of scale come into play. How can we handle costs as our teams grow bigger and more distributed? How can we handle security? How do we orchestrate deployment with bigger development teams?
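The "only accessible from our IP" rule is easy to audit once the ingress rules are in hand. Below is a rough Python sketch using only the standard library's `ipaddress` module. The function names and the shape of the input (a mapping of security group name to its allowed source CIDRs) are assumptions for illustration; fetching the real rules from your cloud provider is left out.

```python
from __future__ import annotations

import ipaddress


def is_rule_allowed(rule_cidr: str, office_cidrs: list[str]) -> bool:
    """True if the rule's source range sits entirely inside an approved range."""
    rule_net = ipaddress.ip_network(rule_cidr)
    for cidr in office_cidrs:
        office_net = ipaddress.ip_network(cidr)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so compare versions first.
        if rule_net.version == office_net.version and rule_net.subnet_of(office_net):
            return True
    return False


def find_violations(
    security_groups: dict[str, list[str]],
    office_cidrs: list[str],
) -> dict[str, list[str]]:
    """Map each security group name to the source CIDRs outside the approved ranges."""
    violations: dict[str, list[str]] = {}
    for name, cidrs in security_groups.items():
        bad = [c for c in cidrs if not is_rule_allowed(c, office_cidrs)]
        if bad:
            violations[name] = bad
    return violations


# Hypothetical data: 203.0.113.0/24 is a documentation-only range standing in for "our IP".
groups = {
    "web": ["203.0.113.10/32"],
    "db": ["0.0.0.0/0", "203.0.113.0/25"],
}
violations = find_violations(groups, ["203.0.113.0/24"])
```

With this sample data, only the wide-open `0.0.0.0/0` rule on the `db` group is flagged. Keeping the approved ranges in a single list is a small instance of the centralized control the quote above is getting at.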
Defining standardization is the key to success and sanity. Imagine a team of developers as a unit. Inside each unit, give us what we need. Preconfigured storage and network resources, templates of application & infrastructure that are easy to duplicate. Make it easy to streamline deployment across servers. Handle routine IT work like change requests and backup services.
Set up monitoring for improved performance and availability, and cost analytics for budgeting.
Once you can train the people within that unit and define the system, we can replicate those units infinitely. That’s not a very exciting concept, but systems get me hot and bothered.
Here’s the beauty behind cloud management: by abstracting all of these services one level higher, we can function across cloud providers.
Cloud management is transparency, security, automation, and efficiency all wrapped into one pane of glass that works across clouds. Some companies may view this concept as nice-to-have, but the reduction in spend and increase in productivity allows your company to (using 2016’s buzzword) innovate faster.
Hopefully you got more of a sense of why companies approach cloud management platforms and how a product like Scalr works through Policies. If you’d like to try our Open Source Scalr click here, or watch the videos we have on our Resource hub.
If you have any questions or feedback, please feel free to contact me directly at firstname.lastname@example.org.