Those of us who deal with cloud on a daily basis live in a bubble. We tend to forget that even though for many of us cloud is a commodity, it’s still not the obvious choice for many enterprises.
Before I got into cloud computing and “saw the light”, I was all about routers, switches and firewalls. I was studying for the CCIE (Cisco Certified Internetwork Expert), a fancy, top-tier Cisco certification that teaches you how to make Cisco hardware sing and dance.
Ever since I left the networking world and put all my eggs in the cloud basket, I’ve found that I tend to forget that these architectures, these “traditional” ways of securing and delivering IT resources, are still very much alive.
The sharp contrast between the cloud and traditional IT is one of the major barriers to cloud adoption. More accurately, it’s a barrier to letting cloud do its job properly. Many of the IT professionals I talk to ask about the best way to create an approval process for cloud resources. If we bear in mind that the two main drivers for cloud are business agility (getting things done faster) and cost (it’s supposed to be cheaper!), we should understand that an approval process in which a human needs to manually approve things will prevent us from reaching either of these goals!
The need for a human component in the provisioning process belongs to a situation where resources are scarce. It doesn’t align with the cloud’s self-service nature. In a traditional datacenter resources might need to be rationed, so it makes sense that a human should decide where those resources are needed most. A cloud deployment doesn’t share that problem: the resources are always there, and if a user meets all the required criteria they should be able to get what they need immediately. So when my peers ask me about the best way to implement a cloud resource approval process, my answer will always be - don’t.
Now that we’ve established the obvious - cloud works differently from traditional IT, we can take a look at the problem and the solution.
Self-service is great, but it’s also dangerous. A developer working for a large enterprise will rarely lose any sleep over the company’s finances, so leaving some servers running over the weekend might not seem like a big deal. Another cloud end-user might be in a hurry to get something done, and might not have time to think about which tags or security groups she should use.
Let’s consider this example: our user wishes to provision 3 servers, each of which has an EBS volume, all tied to the same Elastic Load Balancer. There are now 7 objects to tag (3 servers, 3 volumes and 1 load balancer). Even if we’re only looking at an owner tag and a production-status tag, that’s still 14 tags to set! Specifically on AWS, users have the option to “review & launch” an instance as soon as they’ve selected an instance type, which dramatically increases the chances of skipping the “Tag instance” stage of the workflow.
Earlier we recognized agility as one of the main drivers for cloud. The self-service model is what delivers that agility: users get what they need, when they need it.
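To make the tagging burden concrete, here is a minimal sketch of deriving the tags once and applying them to every object in the request, instead of asking the user to set each one by hand. The resource identifiers, the tag values and the build_tag_plan helper are all hypothetical and only illustrate the idea; a real deployment would pass the resulting plan to the platform's own tagging mechanism.

```python
from itertools import chain


def build_tag_plan(resource_ids, tags):
    """Map every resource id to its own copy of the same tag set."""
    return {resource_id: dict(tags) for resource_id in resource_ids}


# Hypothetical identifiers for the request described above.
servers = ["server-1", "server-2", "server-3"]
volumes = ["volume-1", "volume-2", "volume-3"]
load_balancers = ["elb-1"]

# Hypothetical tag values; in practice these would come from the user's profile.
plan = build_tag_plan(
    chain(servers, volumes, load_balancers),
    {"owner": "alice", "production-status": "dev"},
)

for resource_id, tags in plan.items():
    print(resource_id, tags)
```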
Unfortunately, the self-service model presents us with problems around security, cost and the efficiency of cloud usage.
As many have pointed out before, and as is illustrated in Bernard Golden’s great article about this very subject, the solution is a policy engine. The traditional workflow for ordering resources involves approvals, and checking if we actually have the requested resources. The question of “if” doesn’t exist with cloud. The resources are there. The question is: should we provide them?
“Should this person get resources?” is a much easier question to answer than “Do we have the resources this person is asking for?”. Either way, the question should be answered automatically, without human intervention. A set of rules tied to a user or an environment should determine all the actions that can be performed, and all the resources that can be consumed. With all of this in mind, let’s take a look at a few different types of policy enforcement.
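As a rough illustration of answering “should this person get resources?” automatically, the sketch below ties a small rule set to each environment and evaluates a request against it. The Policy and Request types, the environment names and the limits are assumptions made up for this example, not any particular product's model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    allowed_actions: frozenset
    allowed_instance_types: frozenset
    max_instances: int


@dataclass(frozen=True)
class Request:
    environment: str
    action: str
    instance_type: str
    count: int


# Hypothetical rule sets, keyed by environment.
POLICIES = {
    "dev": Policy(frozenset({"launch", "terminate"}), frozenset({"small", "medium"}), 5),
    "prod": Policy(frozenset({"launch"}), frozenset({"medium", "large"}), 20),
}


def evaluate(request, running_instances=0):
    """Return (allowed, reason) for a request, with no human in the loop."""
    policy = POLICIES.get(request.environment)
    if policy is None:
        return False, "no policy for environment"
    if request.action not in policy.allowed_actions:
        return False, "action not permitted"
    if request.instance_type not in policy.allowed_instance_types:
        return False, "instance type not permitted"
    if request.action == "launch" and running_instances + request.count > policy.max_instances:
        return False, "instance limit exceeded"
    return True, "ok"


print(evaluate(Request("dev", "launch", "small", 2)))
print(evaluate(Request("dev", "launch", "large", 1)))
print(evaluate(Request("prod", "terminate", "large", 1)))
```

The decision depends only on the rules and the request, so it can run on every call without slowing the user down.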
Observer policies are policies that sit outside of the workflow and provide information about usage patterns and violations. An observer will log in to your cloud platform and check if users are consuming resources in accordance with financial policies, security policies and more. Whenever a violation is detected, you’ll be notified. Observers are a great way to learn about patterns of cloud usage, although they tend to produce reports that aren’t actionable. For example, your observer lets you know that port 22 is open to the internet on a server. What’s the next step? In order to make a decision you need context: whose server is this? Will quarantining it mess up something else? These tools generally don’t enforce policies, and when they do, it’s usually reactive.
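A minimal sketch of an observer, assuming a hypothetical snapshot of security groups expressed as plain dictionaries (the field names are invented for this example). It only reports; it never changes anything, and a finding without an owner shows how little context such a report can carry.

```python
OPEN_TO_INTERNET = "0.0.0.0/0"
SSH_PORT = 22


def find_open_ssh(security_groups):
    """Report groups that expose port 22 to the internet; enforce nothing."""
    findings = []
    for group in security_groups:
        for rule in group["ingress"]:
            if rule["port"] == SSH_PORT and rule["cidr"] == OPEN_TO_INTERNET:
                findings.append({
                    "group": group["name"],
                    "owner": group.get("owner", "unknown"),
                    "issue": "port 22 open to the internet",
                })
    return findings


# Hypothetical snapshot of what the observer read from the platform.
snapshot = [
    {"name": "web", "owner": "alice", "ingress": [{"port": 443, "cidr": "0.0.0.0/0"}]},
    {"name": "bastion", "ingress": [{"port": 22, "cidr": "0.0.0.0/0"}]},
    {"name": "db", "owner": "bob", "ingress": [{"port": 22, "cidr": "10.0.0.0/8"}]},
]

for finding in find_open_ssh(snapshot):
    print(finding)
```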
Most of the policy engines employed by the cloud platforms themselves are reactive. This type of system will allow you to define a set of rules, such as “don’t use machines larger than size X”, and then will execute some predefined action, such as “scale down any machine that violated that rule”. The issue with reactive policies is that they are simply too slow for the ephemeral nature of cloud, and they don’t prevent you from incurring cost for mistakes made. Since reactive policies will only enforce rules post facto, each case of misprovisioning will carry a small financial penalty.
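The sketch below illustrates why a reactive rule still costs money: it scales down oversized machines only after they have been running, and estimates the overrun accrued in the meantime. The size names, the hourly prices and the enforce function are hypothetical values chosen for the example, not real platform pricing.

```python
SIZE_ORDER = ["small", "medium", "large", "xlarge"]
MAX_SIZE = "medium"  # the "size X" in the rule

# Hypothetical hourly prices, for illustration only.
HOURLY_PRICE = {"small": 0.02, "medium": 0.05, "large": 0.10, "xlarge": 0.20}


def enforce(instances, minutes_since_launch):
    """Scale down oversized instances and return the overrun already incurred."""
    limit = SIZE_ORDER.index(MAX_SIZE)
    overrun = 0.0
    for instance in instances:
        if SIZE_ORDER.index(instance["size"]) > limit:
            extra_rate = HOURLY_PRICE[instance["size"]] - HOURLY_PRICE[MAX_SIZE]
            overrun += extra_rate * minutes_since_launch / 60
            instance["size"] = MAX_SIZE  # the predefined reaction
    return overrun


fleet = [{"id": "a", "size": "xlarge"}, {"id": "b", "size": "small"}]
cost = enforce(fleet, minutes_since_launch=30)
print(f"overrun before enforcement: {cost:.3f}")
print(fleet)
```

The longer the gap between provisioning and enforcement, the larger the overrun, which is exactly the penalty an in-line check avoids.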
Train the People, Not the System
Surprisingly, some companies choose not to employ a policy engine at all, and attempt to train the people instead of the system. This means creating a centralized wiki, onboarding plans and training sessions to explain the workflow one should use when provisioning cloud resources: which tags and security groups should be used, how network placement should be handled, etc. Most companies that go this route quickly reconsider it, as this system doesn’t scale well (there’s a reason RTFM is a thing).
In-line policy engines create guardrails that prevent users from making poor choices, and will only allow the use of approved and available resources. This type of policy engine is the easiest way to safely provide a self-service model in an enterprise. As the policy enforcement point sits between the user and the cloud platform, policy can be transparently enforced before any resources are provisioned. Ideally, the user will only be presented with options they have the permission to use, and the policy engine will make sure only actions that are in compliance with the budget, security policies and governance policies will actually be performed. This model of policy enforcement is a departure from the traditional, ticket-based system, and the only cloud-native way to do policy. If the creation of tickets and the logging of provisioned resources is required, the process should be automated as well.
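Here is a minimal sketch of an enforcement point sitting between the user and the platform. FakeCloud is a stand-in for a real cloud API, and InlinePolicyGate, PolicyViolation and the permissions table are hypothetical names invented for this example. The gate only offers permitted options, rejects a request before anything is provisioned, and records every decision automatically.

```python
class PolicyViolation(Exception):
    """Raised when a request is blocked before reaching the cloud."""


class FakeCloud:
    """Stand-in for a cloud platform API; records what was launched."""

    def __init__(self):
        self.launched = []

    def launch(self, user, instance_type):
        self.launched.append((user, instance_type))
        return f"instance-{len(self.launched)}"


class InlinePolicyGate:
    def __init__(self, cloud, permissions):
        self.cloud = cloud
        self.permissions = permissions  # user -> set of permitted instance types
        self.audit_log = []

    def available_options(self, user):
        """Only show the user what they are allowed to use."""
        return sorted(self.permissions.get(user, ()))

    def provision(self, user, instance_type):
        allowed = instance_type in self.permissions.get(user, ())
        # Logging is automated, so no ticket has to be filed by hand.
        self.audit_log.append(
            {"user": user, "instance_type": instance_type, "allowed": allowed}
        )
        if not allowed:
            raise PolicyViolation(f"{user} may not launch {instance_type}")
        return self.cloud.launch(user, instance_type)


cloud = FakeCloud()
gate = InlinePolicyGate(cloud, {"alice": {"small", "medium"}})

print(gate.available_options("alice"))
print(gate.provision("alice", "small"))
try:
    gate.provision("alice", "xlarge")
except PolicyViolation as err:
    print("blocked:", err)
```

Because the check happens before the call to the platform, the blocked request never creates a resource and never incurs cost.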
There’s no doubt datacenters and the systems that manage them will still be around for a while, but this is the era of cloud. Enterprises can’t afford to make the mistake of using irrelevant workflows where cloud-native ones are needed. One important thing to remember is that this whole thing shouldn’t be that much of a challenge. We describe cloud as a massive paradigm shift, and in many aspects it is. Cloud adoption has a huge impact on agility, ease of use when consuming IT resources and more. But from an IT administrator’s standpoint, management and policy enforcement should be easier with cloud, not harder.
Most enterprises turn to Cloud Management Platforms and other similar tools to handle policies on multi-cloud deployments. When evaluating these tools, remember that a policy engine should be aware of the context of each action. To that end, a cloud-native CMP with an in-line policy engine is the most effective choice.
If you have any questions or feedback, please feel free to contact me directly at firstname.lastname@example.org.