Constructing internal clouds on disconnected environments
The term “private cloud” is already a reserved keyword for “a cloud provided to a specific user”, but it usually still comes with the implicit assumption that you’re connected to the internet. After all, who isn’t connected to the internet? How could you possibly work any other way?
The problem space of disconnected environments isn’t often discussed on the internet, for good reason: the people who deal with such problems mostly (or even exclusively) don’t work on the internet. Banks are the usual example of this use case, but many companies and government agencies operate this way too; in other words, places where security requires a total disconnect from all internet access. The talk of private vs. public cloud usually centers on ownership dilemmas, not on “how does one create an entire cloud from scratch”.
For the last 9 years or so, I’ve been working in such an environment: from developer, to team lead, to infrastructure group leader. Since I’ve searched and haven’t found anyone discussing this issue, I’ll give a basic drilldown of how such a cloud environment looks, both from the inside and from the developer’s side.
This is only the narrow keyhole through which I view this subject, so I’d love to hear from (or be pointed to) other disconnected environments, to see what we do similarly or differently!
In classic “bottom line up front” fashion, this is how our stack will look at the end:
Step 1 — Providing bare metal servers
Everyone knows that the cloud is “other people’s computers”. These computers need electricity in order to work, and they have specific physical requirements (e.g. cooling). To make life easier for everyone, and to let us work with hundreds of these physical computers, we have data centers, or DCs. They provide these ‘base needs’ at scale, and somebody needs to be in charge of all of it: electricity, cooling, floorspace, etc.
Within the DC, you own specific parts, or “racks”. These are the big cabinets full of computers, and so someone needs to be in charge of those, as well. RAM sticks need replacing, physical computers have hardware problems, and of course everything has an “end of warranty” and needs replacing.
To actually be useful, these computers need to run something that communicates with something else. So we need to make sure our network is set up; that’s another responsibility.
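Much of that ownership is day-to-day plumbing. As one illustration, here is a minimal sketch of the kind of hardware health sweep a bare-metal team ends up automating, assuming the vendors expose the standard Redfish API; the BMC addresses, credentials, and CA path are placeholders, not anything real:

```python
# Sketch: sweep BMCs over the standard Redfish API and report each system's health.
# All addresses, credentials, and paths below are hypothetical placeholders.
import requests

BMC_HOSTS = ["10.0.10.11", "10.0.10.12"]          # hypothetical BMC addresses
AUTH = ("inventory-bot", "not-a-real-password")   # pull from a vault in real life
CA_BUNDLE = "/etc/pki/internal-ca.pem"            # internal PKI, since there's no public CA

def system_health(bmc: str) -> list[tuple[str, str]]:
    """Return (system name, health status) for every system the BMC exposes."""
    base = f"https://{bmc}/redfish/v1"
    index = requests.get(f"{base}/Systems", auth=AUTH, verify=CA_BUNDLE).json()
    results = []
    for member in index.get("Members", []):
        system = requests.get(f"https://{bmc}{member['@odata.id']}",
                              auth=AUTH, verify=CA_BUNDLE).json()
        results.append((system.get("Name", "unknown"),
                        system.get("Status", {}).get("Health", "unknown")))
    return results

if __name__ == "__main__":
    for bmc in BMC_HOSTS:
        for name, health in system_health(bmc):
            print(f"{bmc}: {name} -> {health}")
```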
Step 2 — From physical server to VM
Physical servers incur considerable costs beyond the price of the server itself: running costs, maintenance, etc. Physical servers are also rather large, and our users are (on average) rather small. And we have a hundred of them. All things considered, providing a physical server per project is not only grossly inefficient, it is infeasible.
That’s why the next level is splitting these servers up into bite-sized pieces with virtualization, be it VMware, OpenStack, etc.
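To make that splitting-up concrete, here is a minimal sketch using the OpenStack SDK (the VMware route would go through its own APIs instead); the cloud name, image, flavor, and network are placeholders for whatever an internal catalog actually offers:

```python
# Sketch: carve a VM out of the physical pool with openstacksdk.
# "internal", the image, flavor, and network names are hypothetical placeholders.
import openstack

conn = openstack.connect(cloud="internal")         # credentials come from clouds.yaml

image = conn.compute.find_image("rhel-9-base")     # whatever base image you mirror internally
flavor = conn.compute.find_flavor("m1.medium")     # the "bite size", e.g. 2 vCPU / 4 GB RAM
network = conn.network.find_network("project-net")

server = conn.compute.create_server(
    name="team-a-app-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.name, server.status)
```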
At this stage we also need to address storage. In our environment, both block storage (for VMs) and object storage (S3-compatible) are provided by dedicated storage servers. There are good reasons for this, but every such separation of concerns incurs operational costs for managing the storage separately, which is why some places provide the storage on the same compute nodes that supply the VMs’ CPU/RAM.
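From the user’s side, the object storage just looks like an S3-compatible endpoint that happens to live inside the network. A minimal sketch with boto3, where the endpoint, bucket, and credentials are hypothetical placeholders (real credentials would come from your secrets management):

```python
# Sketch: use the internal, S3-compatible object store through boto3.
# Endpoint, credentials, and bucket names are hypothetical placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.internal.example",   # internal storage cluster, not AWS
    aws_access_key_id="PLACEHOLDER_KEY",
    aws_secret_access_key="PLACEHOLDER_SECRET",
)

s3.upload_file("backup.tar.gz", "team-a-backups", "2024-06-01/backup.tar.gz")
for obj in s3.list_objects_v2(Bucket="team-a-backups").get("Contents", []):
    print(obj["Key"], obj["Size"])
```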
Step 3 — Data storage and running code
Congratulations, you can provide your users with virtual machines! But that’s not exactly what most people expect from a “cloud”, is it? Where are the services?
I conceptualize software development in the following (possibly contentious) way:
Let’s focus just on our running program for now, and we can see that there are 2 kinds of services we can provide:
A. A place to run your code
B. A place to store your data
The “trivial” service that covers both is, of course, the VM we already provide. It’s both a place to run code and a place to store data, so it fulfills the minimum requirements for running our program!
But if we go a little deeper, we can provide multiple kinds of services for each.
Each of these services can run on VMs, or on the giant bare-metal computes we provided in step 1. There are pros and cons to each, but in general: if your service is centralized and gets requests from multiple users, bare-metal compute is the way to go; and if your service is distributed, with each user having their own instance, bare-metal computes reduce much of the overhead and the network problems.
Obviously, this too is a tradeoff. You can run one giant PostgreSQL server and give each user their own namespace, or you can give each user a small, 3-VM cluster with full permissions.
You can run one giant K8S cluster with hundreds of nodes, or provide a cluster for each user.
The tradeoff is the higher management overhead of many small deployments versus the localization and the limited blast radius when something goes wrong.
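To make the shared-server end of that spectrum concrete, here is a minimal sketch of provisioning a per-project role and namespace (schema) on one big PostgreSQL cluster; the DSN, project name, and password are hypothetical placeholders, and a real version would pull credentials from your secrets management:

```python
# Sketch: one shared PostgreSQL cluster, one role + schema (namespace) per project.
# The DSN, project name, and password are hypothetical placeholders.
import psycopg2
from psycopg2 import sql

ADMIN_DSN = "host=pg.internal.example dbname=shared user=cloud_admin"

def provision_project(project: str, password: str) -> None:
    with psycopg2.connect(ADMIN_DSN) as conn:
        with conn.cursor() as cur:
            # One login role and one schema per project, owned by that role.
            cur.execute(sql.SQL("CREATE ROLE {} LOGIN PASSWORD %s")
                        .format(sql.Identifier(project)), [password])
            cur.execute(sql.SQL("CREATE SCHEMA {} AUTHORIZATION {}")
                        .format(sql.Identifier(project), sql.Identifier(project)))
            # Keep the project inside its own namespace by default.
            cur.execute(sql.SQL("ALTER ROLE {} SET search_path TO {}")
                        .format(sql.Identifier(project), sql.Identifier(project)))

provision_project("team_a", "change-me")
```

The per-user-cluster alternative skips all of this bookkeeping; in exchange, you now have dozens of small database clusters to patch, back up, and monitor.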
At this stage we also provide DNS services, because they’re essential to the architecture of both the data services and the runtime services.
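With no upstream internet DNS to fall back on, everything resolves against our own nameservers. A minimal sketch with dnspython, where the nameserver addresses and the internal zone are purely illustrative:

```python
# Sketch: resolve internal service names directly against our own DNS servers.
# Nameserver addresses and the zone below are hypothetical placeholders.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)   # ignore /etc/resolv.conf
resolver.nameservers = ["10.0.0.53", "10.0.1.53"]   # internal DNS, no public forwarders

answer = resolver.resolve("pg.team-a.cloud.internal", "A")
for record in answer:
    print(record.address)
```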
Step 4 — Everything else
Apart from the program itself, we also have developers!
We also have users, or other services, that want to access our programs, along with the myriad ways we can handle the problems in that access path!
These are all SaaS services that run somewhere and store state somewhere; i.e., they run on top of Step 3.
This leads to 6 kinds of services we can provide users:
(Conceptually there could be services that stand at the threshold between running programs and storing state, but I’m not aware of any.)
There are, of course, many other SaaS services that you can provide your users.
The services I outlined here are unique in that they may end up as essential parts of your PaaS and data storage services, so you must pay special attention to ensure there are no circular dependencies!
- Does your monitoring service also monitor its underlying services?
- Does your authentication limit access to your DB administration, including the authentication DB administration?
- Does your DNS service both use, and direct traffic to, your DB services?
These are complex systems with many interlocking parts, so building your dependency tree is essential!
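One low-tech way to keep yourself honest is to keep that dependency map in code and check it for cycles on every change. A minimal sketch using Python’s standard library, where the services and edges are illustrative only:

```python
# Sketch: keep the service dependency map as data and fail loudly on cycles.
# The services and edges below are illustrative, not our real map.
from graphlib import TopologicalSorter, CycleError

# "service": {services it depends on}
DEPENDENCIES = {
    "dns":        {"postgres"},       # DNS backend stores its records in a DB...
    "postgres":   {"dns", "auth"},    # ...which resolves its peers via DNS: a cycle!
    "auth":       {"postgres"},
    "monitoring": {"dns", "auth"},
}

try:
    order = list(TopologicalSorter(DEPENDENCIES).static_order())
    print("Bring-up order:", " -> ".join(order))
except CycleError as err:
    print("Circular dependency detected:", err.args[1])
```

Run against the (deliberately cyclic) map above, this prints the offending cycle instead of a bring-up order, which is exactly the kind of thing you want to catch before it surfaces in production.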
Afterword
Searching the web for “cloud architecture” nets results about how to build your application on top of cloud services, plus ads for various cloud providers. “In-house cloud” gives results comparing in-house servers vs. the cloud, again from an application perspective. People working in disconnected environments don’t write about it on the internet, which creates the immediate perception that no one does this anymore, and makes it hard to reach out for advice on such matters.
“How to build a cloud” is very different from “how to build on a cloud”; the problems tend to focus intensely on stability and reliability, delivered via automation and monitoring. It’s a fascinating problem space that I would love to discuss with anyone who may be interested :)