A brief disclaimer:
Before I set to work on this post, I asked one of our more knowledgeable legal guys whether I could discuss specifics of OpenStack without getting legal approval to do so. He seemed to think I’d be fine as long as I didn’t specifically reference any code. Okay, that works for me, at least for now. I figure at this point it’s better to ask forgiveness than make a policy issue out of it. I will also be steering well clear of exposing any internal policy at my place of employment, as that too could get me into trouble. Additionally, this is a blog post, not a white paper, and certainly not an authoritative document. Take it as it is: off the cuff and highly opinionated. Grains of salt are not included.
A brief note about me:
Now you may be asking yourself who the hell I am to be speaking to the security model of OpenStack. I’m not exactly listed among the top code committers in the OpenStack Launchpad group. However, I did deploy the first “production” cluster on earth using the Nova cloud compute controller, and I still maintain that and several other clusters. We’re still growing our environment, and unlike most ( though not all ) adopters of OpenStack in the world, we’ve got some actual security oversight. I’ll allude to some of that later. I do have an infosec past, one that remains a hobby of sorts, though my chief expertise is in large scale automation environments. Suffice it to say, OpenStack is a beast I probably have more experience with than most at the moment.
A brief note on OpenStack:
OpenStack is an open source IaaS software solution. The bulk of the code base is written in Python and is highly portable. The project originally consisted of two main components, Nova and Swift. Swift is a data object store, and Nova is a cloud compute controller. In Amazon terms, think of EC2 ( Nova ) and S3 ( Swift ). OpenStack was originally released to the public about 1 year ago. Since then it has grown by leaps and bounds.
For example let’s look at a chart showing the Nova release family:
As you can see, the Austin release is coming up on its 1 year anniversary. At that time OpenStack itself will be 1 year old. And the reality is, at this point there is still a great deal about the way OpenStack will develop that is entirely open to change. In fact, a Security Governance Group is only now being proposed. You can read more about the accepted governance model of OpenStack here: Governance Model
So I want everyone to be very aware that the beast, that is OpenStack, is still experiencing the roaring instability of a young and highly successful project. While that’s great in a number of ways, it will cause some alarm for the policy architects and security experts out there. The simple fact is OpenStack is young. And like anything young, it hasn’t proven itself yet and it will make some mistakes.
But young though it may be, it is growing FAST. Between that October release last year and today, the code base has more than doubled in size, and two projects have become 9 or more diverse technology projects. Major new engineering projects are on the horizon too, such as Quantum, which will be a paradigm shift in the way networking in OpenStack is handled.
In a lot of ways, security assessments of OpenStack are premature. But OpenStack is ripe for security innovation, and there are some folks who are attempting to drive some of that.
My good friend and world renowned gentleman hacker Vyrus referred a tweet to me, as is so often the way of our contemporary discourse.
Now, that’s a devil of an open-ended question. And it’s absolutely primed for the sort of response that engenders terms from my technological peers such as “clown computing”. Luckily for all of you, I’m not a sales rep, manager, or self-proclaimed founder of the next Apple. So my response will hopefully be of some actual value to folks that happen upon it. I hope so anyway.
Before I get started, it’s important to note that one of the things I learned while working for HP as a professional services consultant is that every environment is different. That’s something you need to be aware of when you start talking about Infrastructure as a Service. A federal research facility that performs largely civilian research isn’t going to have the same network security policy as a military research facility or a financial institution. Each of these entities has its own accepted standards that it chooses to apply to its deployment strategies. Each has its own legal and fiduciary requirements as well. And, just as importantly, each has a different cross section of experts in the areas that your IaaS solution will be falling across.
So when we start discussing “security” in such an open-ended and generic way, what we need to do first and foremost is discuss what OpenStack expects from you. For some folks this will simply be a show stopper. In a past blog post I touched on the fundamental differences in operations shops that have resulted in ITIL vs DevOps: what you are doing impacts how you do it. So the second important thing to scope in your analysis of OpenStack’s security is what you are going to be doing with it. I want to discuss some of what you should be asking yourself when you are broaching these subjects.
Only once we’ve got you in a good place on those two subjects can we start to really look at the internals of OpenStack and what it has to offer. We’ll want to start with how to approach a risk assessment for IaaS, and then we can focus on where you can apply your risk factors to OpenStack on a component by component level. I can try to elaborate on the components and how they are intended to operate from a security viewpoint. On this level, some of the work being done by Piston Cloud may be of interest to some of you.
The impenetrable wall of the hypervisor
In security policy, what leaves most security folks with a sour taste in their mouths is that they have to view hypervisors as a security layer. In fact, as far as policy goes, they are considered to be basically impenetrable. That’s, of course, a fantasy, especially so with KVM and QEMU. In the past year I’ve heard of at least 4 different ways for a user on a KVM instance to compromise the host device, in some cases with a staggering degree of ease, and in ways that were almost impossible to prevent. So try not to get too caught up in the policy that tells you hypervisors are a wall. Recent experience tells us all that as far as walls go, KVM’s is pretty porous. But security models for cloud rely heavily on the separation of instances from the hosting platform. So you are going to need to acknowledge that some pretty serious security concerns stem from any possible compromise of KVM or QEMU. And that, for me, is the nastiest component of the security model of cloud clusters.
The show stopper
The CIA is pretty well focused on securing information. It’s kind of what they do. And since they were called the OSS, they’ve been pretty heavily focused on compartmentalizing information. The fundamental problem they face on this approach is that sometimes even when they have all the pieces of the puzzle, there is no single person with enough pieces to put them together and see the big picture. But, it also means no one can compromise very much of anything that isn’t within their direct purview. Now this approach all comes down to risk. You see, as we discussed before even the most infallible barriers can be subverted. And, if anyone puts forth a significant effort to do so, they will succeed eventually. The trick in security is to make it hard enough to block out the opportunistic attacks and to be able to respond to targeted ones effectively. From a purely objective viewpoint, all the revenge and justice in the world isn’t going to undo the harm that a successful attack will cause. So the first and foremost goal of security policy is to minimize risk. In this regard, risk is really something each organization needs to identify for themselves. Reducing risk in the cloud is a bit of a problem.
Clouds like OpenStack rely on the creation of a homogeneous hosting platform. The idea is to create a single cookie cutter design for each type of machine: compute host, storage host, network switch, etc. There could be 10,000 compute nodes at a site. Each of them would be the same or a similar generation of the same manufacturer’s x series server. They would have the same disk geometry, the same RAM, the same OS, the same binary packages, and the same authentication sources. The idea is that, much like Spartacus, you can lose any number of your compute nodes but there will be twice as many that will stand up in their place. Also, because you have reduced variety in your environment, in theory you have been able to increase the quality of your system build profiles. Of course, any immunologist can tell you that if we were all perfectly identical, no matter how wonderful we may be, we’d probably all have the same predisposition to fall ill when facing certain bacteria or viruses. Sadly this is also true of systems and security vulnerabilities. If you can own one host in the cluster you can probably own them all. The simple fact is, on a very fundamental level, in cloud any user can pass execution sets to any host device in the entire cloud compute environment. As a counterpoint, the operators of the DNS root servers deliberately keep those servers heterogeneous as an added security measure.
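To put a rough number on the monoculture problem, here is a minimal sketch in Python. The host inventories, the profile names, and the `blast_radius` helper are all hypothetical and invented for illustration; none of it comes from OpenStack. It simply asks what fraction of a fleet shares a single build profile, on the assumption that one flaw in a profile exposes every host built from it.

```python
from collections import Counter


def blast_radius(host_profiles):
    """Return, per build profile, the fraction of hosts that share it.

    Assumption: a vulnerability in one build profile exposes every host
    built from that profile, and no others.
    """
    counts = Counter(host_profiles.values())
    total = len(host_profiles)
    return {profile: n / total for profile, n in counts.items()}


# Hypothetical inventories mapping host name -> build profile.
PROFILES = ("profile-a", "profile-b", "profile-c", "profile-d")
homogeneous = {f"compute{i:03d}": "profile-a" for i in range(100)}
mixed = {f"compute{i:03d}": PROFILES[i % len(PROFILES)] for i in range(100)}

# Worst case: the share of the fleet lost to a single-profile flaw.
worst_homogeneous = max(blast_radius(homogeneous).values())
worst_mixed = max(blast_radius(mixed).values())
```

In the cookie cutter fleet the worst case is the whole fleet. With four evenly spread profiles it is a quarter, at the cost of maintaining four build profiles instead of one.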
Security in Depth
Now, the approach that’s generally taken in cloud is called “Security in Depth”. The idea is to generate layers in your infrastructure topology. The outermost layer would be where your users’ instances reside. Then you would have the compute hosts’ base OS profile. Then probably configuration management and package mirrors. Finally, you would likely have authoritative sources such as Git repositories and secure backups. In most IaaS environments the first goal is to protect the cloud itself. The best way to do that is to ensure that the cloud can continue to heal itself. Security in depth separates the mechanisms that can be used to heal the cloud from likely vectors of attack.
Now, you might be thinking, well that’s all well and good for the cloud provider, but what about the lowly downtrodden user? Are they expected to simply fend for themselves? The answer is, it depends.
In an IaaS configuration, the assumption is that your users are in fact system administrators. Maybe not the world’s best, and certainly still beholden to the overriding authority of your cloud’s management teams, but administrators nonetheless. In that scenario you are expecting, and hopefully educating, your users in the way that clouds “heal”. In a security in depth scenario, one of the things that can be restored from secured backups is the bundled images repository. In short, the images from which you cast new instances ( or VMs if you prefer ). So your “recovery procedure” in most situations would be to terminate running instances ( probably forcibly terminated already by your admins if they were on a compromised host ) and then relaunch the instances from the still safe and authenticated bundled images. Or, if you’ve been keeping snapshots of instances stored in your object store, you can simply relaunch from the snapshot. Of course this glosses over some of the clean up work you will likely need to do, such as generating a new key set to authenticate with before launching these new instances. Also, this doesn’t address your data. It’s probably been available to the sticky fingered thieves that have so ungraciously invited themselves into your systems. The reality is, in an IaaS environment your users should know when they need to store their data encrypted at rest, and when they need to keep off cloud backups.
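The “still safe and authenticated” part of that recovery procedure is where a little code helps. What follows is a minimal sketch, not how OpenStack itself does it: it assumes you keep a manifest of SHA-256 digests for your bundled images somewhere on the trusted side of your layers ( with the secured backups, say ), and it only clears an image for relaunch if the copy on disk still matches. The function names, the manifest layout, and the image names are hypothetical.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def images_safe_to_relaunch(image_paths, trusted_manifest):
    """Return the names of images whose digest matches the trusted manifest.

    image_paths:      hypothetical mapping of image name -> local file path
    trusted_manifest: hypothetical mapping of image name -> expected SHA-256
                      hex digest, kept with the secured backups
    Images missing from the manifest are treated as untrusted.
    """
    safe = []
    for name, path in image_paths.items():
        expected = trusted_manifest.get(name)
        if expected is not None and hmac.compare_digest(sha256_of(path), expected):
            safe.append(name)
    return sorted(safe)
```

Anything that fails the check gets restored from the secured backups before it is used to cast new instances. A digest match only tells you the image is unchanged since the manifest was written, so the manifest has to live somewhere the attacker could not reach.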
But that’s not where IaaS stops. The simple fact is, there are many different tiers of security that might exist in a federated environment. If your organization is big enough to roll its own private clouds, there’s a chance it may be rolling different cloud clusters for compartmentalized tiers of security. This could mean intranet only cloud environments, or ITAR compliant cloud environments. Heck, it could get even more locked down than that, and I’ll discuss that later. But depending on your needs, you will find that there are levels of compartmentalization inside and outside of OpenStack that can be deployed to make your environment just a little bit better at handling the risk your organization is concerned about addressing. Of course, some configurations have their drawbacks, and become investments in time, complexity, or simply speed.
I’ll discuss IaaS in federated environments in depth, maybe in another blog post.
In a PaaS scenario, you may have users whom you can trust to run a server, but you want them to run instances based on sets of images that one or more of your teams authoritatively maintain. This allows you to prepackage things like monitoring and security related tools, or forcibly maintain updated software stacks on all cloud instances in that environment. More importantly for your users, you can provide bundled images, and SOPs for using them, to provide a hardened or even highly available cloud instance. Of course, in this scenario your PaaS offerings are contingent upon you having people to create and maintain those bundled images, and that is a job in and of itself.
In this case you can rely on all the security and response procedures IaaS requires and offers, as well as take a hand in preventing your cloud users from making costly mistakes that will detract from their overall experience in the cloud, or more likely cause your organization terrible grief in the event of a compromise.
Major cloud providers tend to take this approach to varying degrees of complexity. Linode, for instance, if you have ever used it, is more of a PaaS provider than an IaaS provider, if only because they offer pre-bundled images and overwrite authentication configuration and other system profile components when running an instance.
SaaS is interesting. When you think of SaaS you think of a developer requesting a MySQL instance or a WordPress blog instance. Simple things that you occasionally get asked to deploy but that can be time consuming. However, SaaS can be so much more than that to your systems management teams. SaaS can form the core of authoritative management of multiple clusters.
SaaS separates users from the active maintenance of and access to the VMs themselves and relegates them to users of services running on those hosts. In a way, you can almost view each instance as its own highly specialized VM ( JVM anyone? ) that is geared to run a specific service in a predefined ( and hopefully well defined ) way. Now, if you deploy an operations team only cloud, and allow it to expose only SaaS style instances to dynamically handle many clusters spanning different security and performance environments, you can provide an even greater degree of defense in depth, and increase your ability to dynamically / automatically maintain an environment of increasing complexity from a single homogeneously designed environment.
Now of course there is a risk here, as any service you expose from this cluster is now a path into an environment with direct access to your authoritative data sources, and since it’s somewhat homogeneous with the other clusters it may be susceptible to some of the same vulnerabilities they are. You need to be very careful about what you choose to expose and where you choose to expose it. This design methodology is probably its own blog post.
Authoritative Data Sources
In DevOps there’s an idea of agile deployment applied to operations tasks: short sprints, and deployments in line with development methodologies. That’s cool, and in a lot of ways it syncs up with ITIL requirements. But one of the cool cultural paradigms of the DevOps crowd is an increased reliance on Git to serve as an authoritative data source for configuration management of your environment. In cloud, this is pretty effective.
If you look at the compromise of the Kernel.org website and its own Git and package repositories, you can see what is effectively a very nasty breach, handled very well by a group of guys who just plain do it right. Of course, if you synced your apt mirror off kernel.org during a period it was compromised, you are probably too busy cursing them and fate to care. My thoughts are with you.
The fact is authoritative sources do get compromised. You try your damnedest to prevent it, but anyone looking to really do something nasty will be trying their abject hardest to hit you right in your Git or apt repositories. The trick is to be ready to respond to them when they do hit you. And don’t think they won’t. Murphy’s law is as true in tech as it is anywhere else.
In a proper configuration management architecture you are trying to centralize the management conduit for all of your cloud / cluster systems. The idea is since the environment is massively scaled based off a cookie cutter design you can manage the environment with a few highly skilled folks rather than team upon team. Of course, I think that model is still an evolving paradigm in technology and certainly not standardized at this point. Claim what you want I call bullshit on that front. Regardless, defense in depth relies on keeping your change control access far the heck away from the users and their code execution and entirely on a one way path. That helps architecture wise with preventing compromises. It will certainly cut opportunistic attacks down to virtually nothing. However even if you’ve got your authenticated data sources secured, you need to make sure that all of your configuration teams have full and revision controlled copies of everything… INCLUDING DOCUMENTATION.
If your primary DC decides to go for a swim in the Hudson river, or gets wiped out by hurricane Inigo Montoya ( you killed his parent process prepare to SIGHUP without return.. ) you are going to want to be able to deploy a new control architecture quick. Also you want your clusters to be able to continue to operate without their command and control paths active and be unimpaired by it ( except maybe backups not occurring and the sort ). The same is true in the event of a compromise.
One of the things working for you against an attack on a large cloud deployment is that your architecture is likely pretty custom, and even more likely pretty darned complex. It’s not going to be easy for an attacker that isn’t local to your environment to do anything that will be difficult to recover from.
And even if all your clusters are suspect, well… you just automatically re-provision them while ensuring healing of Swift object stores, and RAID of volumes in compute host volume storage. Of course OpenStack could use some work on that front. Data persistence and high availability at this point are mostly clever hacks. The good stuff is still not in the recipe yet.
If you are unhappy with this model, odds are cloud just isn’t for you. The design approach has specific goals in mind, and to achieve them it chooses to adopt some pretty difficult to stomach architectural requirements. But personally, I consider them to be a hell of a lot more practical than the ideals that most security and operations guys set themselves up to pursue in vain.
I feel like OpenStack is probably fine for most things. And in another year, it might actually be =P
Regarding Risk Assessment and the Cloud
Choosing whether or not to deploy OpenStack is going to become an interesting prospect. I think today OpenStack is still very immature as software. But it shows terrific promise. With the rising cost of VMware vSphere there will be many people looking for an open source equivalent. OpenStack may yet grow to fill that need in a way that is both unexpected and possibly a bit wonderful for VMware engineers. That is of course once they get past the fact that it is “different”.
When you are identifying risk, you need to view your possible vectors of attack in terms of a defense in depth model overlaid by network topology and specifically segmentation.
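One way to make that overlay concrete is to write your segments down as a graph and ask what an attacker sitting in the outermost layer can actually reach. The sketch below is a hypothetical model in Python, not a description of any real deployment. The segment names and the `allowed_flows` map are made up for illustration, and each flow points from the side that is allowed to initiate a connection to the side that receives it.

```python
from collections import deque


def reachable_from(start, allowed_flows):
    """Breadth-first search over permitted flows.

    Returns every segment that can be reached from `start` by following
    allowed initiator -> target flows, excluding `start` itself.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for target in allowed_flows.get(current, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    seen.discard(start)
    return seen


# Hypothetical one-way topology: change control flows outward only.
allowed_flows = {
    "git_and_backups": ["config_mgmt"],
    "config_mgmt": ["compute_hosts"],
    "compute_hosts": ["user_instances"],
    "user_instances": ["internet"],
}

# Segments a compromised user instance can reach over the network.
exposed = reachable_from("user_instances", allowed_flows)
```

With the one way layout above, a compromised user instance can reach the internet and nothing else. Add a single convenience rule that lets instances talk back to configuration management and the same search will show the inner layers lighting up. Note that this only models network paths; a hypervisor escape is a separate vector that no segmentation map will show you.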
If you are thinking that cloud could be used to help back end some sort of SCADA environment… odds are it can. But when risk means mortal injury to a human being, you need to put on your big boy pants and be willing to sacrifice when you need to, to ensure the safety of your people above all else. That doesn’t mean OpenStack isn’t right for you. What it ultimately means is that you need to make sure that you can segment cloud management and access from any possible attack vector. Something along the lines of an intranet only SaaS offering would probably be adequate. A self healing cloud environment could very well be exactly the right remedy for what ails you in this market space. I can’t say for sure, as I don’t do SCADA, but I can imagine a few scenarios where this could work out serendipitously ( real word believe it or not ).
If you are in a financially regulated environment, maybe you need to give thought to federated tiers of cloud offerings. Maybe you don’t want to place financial data in an IaaS offering at all. I wouldn’t, knowing what I know about bank environments. But PaaS might work if your bundle management teams are on the ball. SaaS could certainly work. And heck, for dev environments IaaS may work FINE. It might even improve your developers’ productivity by making more adaptive use of the resources allotted to them for development.
It’s a use case question. Of course you want to be sure in that environment that unencrypted financial data is NEVER accessed by anyone who doesn’t have the keys. That’s the best you can hope for operationally. And when that’s not enough generally the bank will eat the cost. Sometimes shit happens and you plan for it fiscally as well as in engineering. It’s only money if you’ve put aside cash in advance to deal with the risks you know may rear their heads. That’s just smart business for anyone.
In a government environment, and possibly medical, data classification can become a bit of an enormous headache in an environment that loves to replicate data and share resources. It sort of conflicts with the compartmentalization of data approach we discussed earlier. But, even if the guys in dark suits walk off with half your object store because Captain Capslock decided to dump the extraordinary rendition film collection onto the FISMA low object store you can probably recover surprisingly well.
Of course in my industry, government regulation also means protecting against “mortal peril”. And that means for now that OpenStack doesn’t run nuclear power plants. And for now I am totally okay with that. I have every faith OpenStack will be there some day, but not today. I also wouldn’t run Shuttle mission related software on it. But you know, that being said, I’ve seen some stuff on those lines that surprised the hell out of me. The simple fact is you need to think about how security in depth works in the environment you are planning to deploy it, and how the risks to that environment can be mitigated.
Hope that was helpful. Comments section for further and more specific discussion I guess. Or OpenStack’s freenode IRC channel.
Quantum is probably the biggest change for OpenStack on the horizon. Once Quantum rolls out in Q1 next year, the theory is that OpenStack will be relying on Open vSwitch for providing network segmentation inside IaaS user network space. Currently 802.1Q VLANs provide some level of segmentation ( coupled with other technologies this can be pretty effective ). However, there are drawbacks to the current network controller models and to a large number of VLANs suddenly popping up dynamically on all your hosts, particularly scaling issues and single point iptables complexity that increases automatically.
I am deeply looking forward to Quantum, and personally don’t consider Nova complete until it is released. The networking in OpenStack has been hokey since day 1. But soon that will be coming to an end. And the increasing capabilities of Open vSwitch operating in an environment that really wants to see it shine will allow for a greater degree of SaaS style switch deployment. And possibly some amazing innovation in network controller management.
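One of those scaling issues is easy to show with arithmetic. The 802.1Q tag carries a 12 bit VLAN ID, and with IDs 0 and 4095 reserved that leaves 4094 usable VLANs per layer 2 domain. The sketch below is a toy allocator in Python, assuming one VLAN per project; the `VlanPool` class and project names are hypothetical and this is not Nova’s network manager code.

```python
FIRST_VLAN_ID = 1     # 0 is reserved by 802.1Q
LAST_VLAN_ID = 4094   # 4095 is reserved by 802.1Q


class VlanPool:
    """Toy allocator handing out one VLAN ID per project (an assumption)."""

    def __init__(self, first=FIRST_VLAN_ID, last=LAST_VLAN_ID):
        # Stored in descending order so pop() hands out the lowest ID
        # first on a fresh pool.
        self._free = list(range(last, first - 1, -1))
        self._by_project = {}

    def allocate(self, project):
        """Return the project's VLAN ID, assigning a free one if needed."""
        if project in self._by_project:
            return self._by_project[project]
        if not self._free:
            raise RuntimeError("no free VLAN IDs left")
        vlan_id = self._free.pop()
        self._by_project[project] = vlan_id
        return vlan_id

    def release(self, project):
        """Return a project's VLAN ID to the pool."""
        self._free.append(self._by_project.pop(project))

    @property
    def free_count(self):
        return len(self._free)
```

Under that one VLAN per project assumption, project number 4095 simply cannot be segmented this way, no matter how many compute nodes you have standing by.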
At the very least manageability and network segmentation in project space in OpenStack will be that much better.
I also included a link to Piston Cloud’s Paxos approach below. Interesting read.
Of course there is much more coming, seems like half of silly valley is now working on OpenStack, but that could just be because I am in the middle of it and my perspective is skewed.
Some fun extra reading:
ref : http://www.cso.com.au/blog/cso-bloggers/2011/10/14/if-i-was-cso-hacker/ A few comments from a penetration tester about hardening
ref : http://en.wikipedia.org/wiki/Paxos_%28computer_science%29 Paxos is the fundamental design ideology that Piston Cloud is attempting to work into their flavor of OpenStack.