When packaging any app in a Docker image, you run into the issue of how to protect secrets such as database credentials and API keys.
We all know not to check secrets into GitHub (don’t we?), but we also don’t want to package secrets into the image. Anyone who can get at the image can also get at the secrets, and since the image proliferates through CI/CD pipelines and image repositories, it’s nearly impossible to limit access effectively.
What we found to work best is to store secrets in a well-protected Cloud Storage bucket and have the app download them on startup. This decouples the secrets from the Docker image, and rotating secrets becomes a cinch.
The example here is from an app written to run on Google Cloud Platform.
Layering Secrets in Flask
Flask gives you multiple ways to load configuration, and each time it loads one, it overlays the new values on top of the existing configuration.
We can keep the default settings (i.e., dev settings or reasonable defaults) in the code, then download and merge the secrets on top.
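A minimal sketch of this layering (the file names here are illustrative):
# app.py -- defaults live in the repo; secrets are overlaid at startup.
from flask import Flask

app = Flask(__name__)

# 1. Reasonable defaults checked into source control.
app.config.from_pyfile('config/default_settings.py')

# 2. Secrets downloaded at startup (see below); these override the defaults.
app.config.from_pyfile('/tmp/secrets.cfg', silent=True)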
Here we download the secrets from a Cloud Storage bucket. The bucket should be restricted so that only the specific service account assigned to the app has read access.
As a more paranoid implementation, you can encrypt the secrets in the bucket. The app decrypts the downloaded secrets and loads them in, then deletes both the encrypted and decrypted files. For this, the app needs permission for the bucket as well as for the crypto keyring in KMS.
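A sketch of what this startup step could look like, assuming the google-cloud-storage and google-cloud-kms client libraries (v2+ KMS API). The paths, the project_id argument, and the load_secrets helper are illustrative; the bucket, keyring, and key names match the upload script below.
# fetch_secrets.py -- download the encrypted config from GCS, decrypt it with
# Cloud KMS, overlay it on the Flask config, then remove both files.
import os
from google.cloud import storage, kms

BUCKET = 'vault'
ENC_PATH = '/tmp/secrets.cfg.encrypted'
CFG_PATH = '/tmp/secrets.cfg'

def load_secrets(app, project_id):
    # Download the encrypted secrets file from the vault bucket.
    blob = storage.Client().bucket(BUCKET).blob('secrets.cfg.encrypted')
    blob.download_to_filename(ENC_PATH)

    # Decrypt with the same keyring/key used by the upload script.
    kms_client = kms.KeyManagementServiceClient()
    key_name = kms_client.crypto_key_path(project_id, 'global', 'keyring', 'secrets')
    with open(ENC_PATH, 'rb') as f:
        ciphertext = f.read()
    plaintext = kms_client.decrypt(
        request={'name': key_name, 'ciphertext': ciphertext}).plaintext
    with open(CFG_PATH, 'wb') as f:
        f.write(plaintext)

    # Overlay the decrypted values onto the app config, then clean up.
    app.config.from_pyfile(CFG_PATH)
    os.remove(ENC_PATH)
    os.remove(CFG_PATH)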
You need to encrypt the secrets file before uploading it to the bucket. The shell script below encrypts and uploads a secrets.cfg file. To run it, you need access to the KMS keyring as well as write permission on the vault bucket.
#!/usr/bin/env bash
# Encrypt secrets.cfg with Cloud KMS and upload it to the vault bucket.
gcloud services enable cloudkms.googleapis.com

KEYRING=keyring
KEY=secrets

# Create the keyring and key if they don't exist yet.
if ! gcloud kms keys list --location global --keyring ${KEYRING} > /dev/null 2>&1; then
    gcloud kms keyrings create ${KEYRING} --location global
    gcloud kms keys create ${KEY} --location global \
        --keyring ${KEYRING} --purpose encryption
fi

# Encrypt the plaintext secrets file.
rm -f secrets.cfg.encrypted
gcloud kms encrypt --location global \
    --keyring ${KEYRING} \
    --key ${KEY} \
    --plaintext-file secrets.cfg \
    --ciphertext-file secrets.cfg.encrypted

# Upload the encrypted file to the vault bucket.
gsutil cp secrets.cfg.encrypted gs://vault
Flask + Gunicorn Caveat
When running a Flask app under gunicorn, loading a configuration file using a relative path (e.g., app.config.from_pyfile('config/default_settings.py')) can fail. This is because gunicorn doesn’t change to the app’s directory. You need to supply the --chdir flag to make gunicorn change to the directory the app expects.
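For example (the app directory and module name are placeholders):
gunicorn --chdir /opt/app --bind 0.0.0.0:8000 app:app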
Any sizable cloud infrastructure maintained manually is usually a mess.
What I’ve seen happen is this: A developer manually sets up an instance or two to get his/her project going, and manually sets up the firewall rules (security groups on AWS) and subnets to go with them.
Then someone else comes along, brings up some other instances, and adds more subnets and firewall rules.
Once this goes on a few times, you have lots of firewall rules and subnet configurations. If you let it continue, it grows into a big mess where no one knows how anything is connected to what.
Quite often, everyone is too scared to touch anything because it might break something. So we keep adding compute/DB instances, subnets, and firewall rules, making matters worse.
How do I know all this? Well, because I’m often that guy who set things up by hand in the very beginning. The cliché, of course, is “I’ll just get by for now; we can do it right later.”
Guilty as charged, but I’m sure you’ve been there too, no?
Start Off Right
So how do we avoid ending up with this mess?
The root cause is that the initial setup is not automated and you start off creating instances by hand.
We all know DevOps automation is a tremendous help in maintaining cloud infrastructure. You really need to use it from the start, and it really isn’t that hard.
We are all lazy creatures, but if we make it easier to add a few lines of code to an automation script than to futz with the cloud console UI, we will probably do the right thing.
So I wrote a simple set of Terraform scripts to bootstrap a generic environment, so that no manual setup is ever needed.
What follows is an example using Terraform on Google Cloud Platform, in which we bootstrap a multi-region VPC network with public/private subnets and instances.
Creating the Environment
The example creates a standalone environment on Google Cloud Platform that contains the following:
A VPC Network
Two public subnets, one in region 1 and another in region 2
Two private subnets, one in region 1 and another in region 2
A firewall rule to allow traffic between all subnets
Firewall rules to allow ssh and http (see the sketch right after this list)
A compute instance-1 in region 1 on public-subnet-1
A compute instance-2 in region 2 on public-subnet-2
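These firewall rules aren’t spelled out later in the walkthrough, so here is a rough sketch of how they could look in vpc.tf (the rule names and the region-2 subnet variables are assumptions):
resource "google_compute_firewall" "allow-internal" {
  name    = "${var.company}-${var.env}-allow-internal"
  network = "${google_compute_network.vpc.name}"
  allow {
    protocol = "icmp"
  }
  allow {
    protocol = "tcp"
    ports    = ["0-65535"]
  }
  allow {
    protocol = "udp"
    ports    = ["0-65535"]
  }
  source_ranges = [
    "${var.r1_public_subnet}", "${var.r1_private_subnet}",
    "${var.r2_public_subnet}", "${var.r2_private_subnet}"
  ]
}

resource "google_compute_firewall" "allow-ssh-http" {
  name    = "${var.company}-${var.env}-allow-ssh-http"
  network = "${google_compute_network.vpc.name}"
  allow {
    protocol = "tcp"
    ports    = ["22", "80"]
  }
  source_ranges = ["0.0.0.0/0"]
}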
Terraform Scripts
Prerequisites
To run the Terraform scripts, you need a Google Cloud Platform account, with Terraform and the Google Cloud SDK installed. There are many tutorials on this, so I won’t go into it here.
You also need a service account with the proper permissions and its credentials downloaded locally. The credentials file is referenced in main.tf.
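If you don’t have one yet, a sketch of creating it with gcloud (the account name, project ID, key file path, and the broad editor role are placeholders; scope the role down for real use):
# Create a service account for Terraform and download its key.
gcloud iam service-accounts create terraform --display-name "terraform"
gcloud projects add-iam-policy-binding my-project \
    --member "serviceAccount:terraform@my-project.iam.gserviceaccount.com" \
    --role "roles/editor"
gcloud iam service-accounts keys create ./credentials.json \
    --iam-account terraform@my-project.iam.gserviceaccount.com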
Directory Organization
To make things easier, the Terraform scripts are separated into:
main.tf – loads the Terraform providers for GCP
variables.tf – defines the variables used by the scripts
vpc.tf – defines the VPC and firewall rules
r1_network.tf – defines the subnets for region 1
r2_network.tf – defines the subnets for region 2
r1_instance.tf – defines the instance to start in region 1
r2_instance.tf – defines the instance to start in region 2
main.tf
We define two providers for GCP. A lot of Google Cloud features are made available while still in beta, which requires defining the beta (google-beta) provider in order to access them.
The credentials are for the service account that is allowed to create/delete compute resources in the GCP project.
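A sketch of what main.tf might contain (the credentials file name and the project variable are assumptions; the regions come from variables.tf):
provider "google" {
  credentials = "${file("credentials.json")}"
  project     = "${var.project}"
  region      = "${var.region1}"
}

provider "google-beta" {
  credentials = "${file("credentials.json")}"
  project     = "${var.project}"
  region      = "${var.region1}"
}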
As you most likely already know, with GCP you can create multiple VPC networks within a project. If you are coming from AWS, this looks the same at first, but there is a big difference: a VPC network is global on GCP, whereas a VPC is regional on AWS.
This means your environment is multi-regional from the very beginning on GCP. All subnets route to each other globally by default, so to place resources in multiple regions, all you have to do is create subnets in the regions of your choice. (As a side note, each project comes with a default network that has a subnet in every GCP region.)
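A minimal sketch of the network resource in vpc.tf (the naming here mirrors the subnet definitions shown later):
resource "google_compute_network" "vpc" {
  # Custom-mode network: we create the subnets ourselves.
  name                    = "${var.company}-${var.env}-vpc"
  auto_create_subnetworks = "false"
}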
In this example, we create public and private subnets in each of two regions (r1/r2). The regions are defined in variables.tf.
The idea here is to attach private instances (with no public IP address) to a private subnet, while public instances are attached to a public subnet.
This is one area where we often see manual configurations get out of hand, because the intended network topology isn’t always followed. Having the subnets and instances defined in the Terraform scripts makes the topology far easier to maintain.
resource "google_compute_subnetwork" "public_subnet_r1" {
name = "${format("%s","${var.company}-${var.env}-${var.region1}-pub-net")}"
ip_cidr_range = "${var.r1_public_subnet}"
network = "${google_compute_network.vpc.name}"
region = "${var.region1}"
}
resource "google_compute_subnetwork" "private_subnet_r1" {
name = "${format("%s","${var.company}-${var.env}-${var.region1}-pri-net")}"
ip_cidr_range = "${var.r1_private_subnet}"
network = "${google_compute_network.vpc.name}"
region = "${var.region1}"
}
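The subnet definitions above pull their names and CIDR ranges from variables.tf; a minimal sketch (all of the default values here are made-up examples) might look like:
variable "company"           { default = "acme" }
variable "env"               { default = "dev" }
variable "region1"           { default = "us-west1" }
variable "region2"           { default = "us-east1" }
variable "r1_public_subnet"  { default = "10.0.1.0/24" }
variable "r1_private_subnet" { default = "10.0.2.0/24" }
variable "r2_public_subnet"  { default = "10.0.3.0/24" }
variable "r2_private_subnet" { default = "10.0.4.0/24" }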
r1_instance.tf
A compute instance is brought up in each region, with Nginx installed. These instances are handy when developing the Terraform scripts: we can log into them to test the network configuration, and their definitions can serve as boilerplate for creating more instances. GCP gives you a pretty good view of the applied firewall rules in the console, so you could also do away with these instances.
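A sketch of what r1_instance.tf could look like (the machine type, image, zone suffix, and startup script are assumptions):
resource "google_compute_instance" "r1_instance_1" {
  name         = "${var.company}-${var.env}-${var.region1}-instance-1"
  machine_type = "f1-micro"
  zone         = "${var.region1}-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    # Attach to the public subnet and request an ephemeral public IP.
    subnetwork = "${google_compute_subnetwork.public_subnet_r1.name}"
    access_config {}
  }

  # Install Nginx on first boot.
  metadata_startup_script = "apt-get update && apt-get install -y nginx"
}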
Once the Terraform scripts are in place, bringing up the environment is easy:
$> terraform init
$> terraform apply
This also executes really fast on GCP: it took a mere 70 seconds!
Once the apply completes, you can see in the console that the two sets of subnets have been created in the two regions.
Two instances are also created, one in each region. At this point, you can SSH into one instance and reach the other.
You can delete this environment with:
$> terraform destroy
Summary
There aren’t that many examples of setting up a GCP environment with Terraform, so initially I ran into some issues and was a bit skeptical about how well Terraform would work with GCP.
Once I got the basics up and running, though, I was pleasantly surprised by how well it worked, and also by how fast GCP created resources and spun up instances.
These Terraform scripts also allow you to create a completely separate clone of the VPC with a mere command-line argument. You can now build dev, test, integration, or production environments with ease.
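For instance, assuming the environment name is driven by the env variable used in the resource names above, spinning up a clone could be as simple as:
$> terraform apply -var="env=staging"
(In practice you would also keep each environment’s state separate, e.g., with Terraform workspaces or separate state files.)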
All this, of course, is possible if you start off your project with an automated build of the cloud environment.