How to use Cloud without losing Sleep

A hands-on guide to using Cloud services safely, for developers without dedicated SRE or on-call teams.

(Originally published at https://sudcha.com)

Introduction

This post outlines practices I learnt and now follow to keep Cloud usage in check and eliminate possible surprises.

Disclaimers (!)

  1. Examples are centric to GCP but fundamentals should apply to all platforms:
    Although the examples in this post are on GCP, I’ve tried to structure it around fundamentals rather than platform-specific features. All cloud services have similar features, but with different names, prices and usage policies.
  2. This is not an exhaustive list
    Like everyone, I’m still learning and growing. Below is what I learnt from several mistakes and from reading everything I could find online from great, kind minds. If you have more tips, feel free to share them with me, and I’ll be happy to add them to this list.
  3. This is a long post
    Sorry about it, but there’s no way around it. If you find it useful and can’t finish in one go, you can always come back to this post.

Let’s jump into it!

1. Use multiple forms of payments (FOP), preferably with spend caps

1a. Have spending limits/caps on the forms of payments (FOPs) you use

Ideally, the spending limit should be somewhere between 120% and 150% of the cost you expect to incur given the usage of your platform.
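To make the 120–150% rule concrete, here is a minimal sketch of the arithmetic. The function name and the $200 figure are hypothetical, chosen only for illustration.

```python
def spend_cap_range(expected_monthly_cost: float) -> tuple[float, float]:
    """Return (low, high) spending caps at 120% and 150% of the expected cost."""
    if expected_monthly_cost < 0:
        raise ValueError("expected cost must be non-negative")
    low = round(expected_monthly_cost * 1.20, 2)
    high = round(expected_monthly_cost * 1.50, 2)
    return low, high


# Hypothetical example: a project expected to cost about $200 per month.
low, high = spend_cap_range(200.0)
print(f"Set the card limit between ${low:.2f} and ${high:.2f}")
```

The cap leaves headroom for normal growth while still stopping a runaway bill at the payment level.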

From what I’ve read, it’s relatively easier to get a bill waived than to get a refund. While the suggestions in this post should keep such situations from arising, it’s better to have a cap set up anyway.

Don’t use your $25,000-limit or no-limit credit card when setting up billing for a fresh account.

1b. Use different FOPs for development and production.

We learnt this the hard way when all our billing accounts were tied to the same credit card:

  • First, GCP suspended all our billing accounts tied to that card on suspicion of fraud.
  • Second, our bank started blocking all future transaction requests from GCP, also suspecting fraud.

Sorting this out took several days, and it halted our development cycle. Thankfully we didn’t have a production service back then.

In an ideal setup, I recommend using one FOP for the production account and another for dev/test accounts. Never mix the two.

Note:
If you’re a solo developer, I highly recommend setting up an LLC or some umbrella company that gives your personal assets protection. In recent years setting up a basic business entity has become quite standard everywhere in the world, thanks to the entrepreneurial boom.

Some wise people online have even suggested setting up a shell company to use Cloud Platforms (!!). I personally don’t recommend spending time doing this. If you’re building something meaningful, spend all your time making it better in a legit way.

2. Set up Service Quotas

Quotas can be set for usage per day, per minute or even per user per minute. I don’t really trust the per-user quota because who defines “a user”? The other quotas are, for the most part, fairly reliable.

While GCP doesn’t allow auto cut off on billing, it does allow auto cut off based on usage, well at least for most services.

For any account with billing, when I enable a new service, I first check whether it has a quota, and if it does, I set it to a really low value while getting familiar with the service. Many services are charged per use, and setting this quota also helps validate the costs that use will incur.

As an example, if I’m testing the GCP Adult Image Classifier service, as soon as I enable the service, I first set the quota to 100 per day and then try the Codelab provided.
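A low quota also puts a ceiling on what a day of experimenting can cost. The sketch below shows that arithmetic; the unit price is a made-up placeholder, not the real price of any GCP service, so substitute the figure from your provider's pricing page.

```python
def worst_case_daily_cost(daily_quota: int, price_per_call: float) -> float:
    """Upper bound on daily spend if every call allowed by the quota is billed."""
    if daily_quota < 0 or price_per_call < 0:
        raise ValueError("quota and price must be non-negative")
    return daily_quota * price_per_call


# Hypothetical price: $1.50 per 1,000 calls (placeholder, not a real price).
price_per_call = 1.50 / 1000
daily = worst_case_daily_cost(100, price_per_call)
print(f"Worst case with a quota of 100/day: ${daily:.2f} per day")
```

If the bill for a day of testing comes in above this bound, either the quota is not being enforced or the pricing works differently from what you assumed.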

If the cloud provider you use has neither automatic billing shutoff nor budgets, treat that as a big red flag before using the service.

Some Caveats

  • Not all services have Quota limits
    For example, Firestore read/write ops. From an engineering standpoint, this is understandable: if the service had to check a quota on every operation, how could it be real time? There are some quotas available for Firebase which are useful, but don’t assume that every service has a quota.
  • Not all quotas work as advertised
    Test each quota before fully relying on it. I found several that don’t work, and my enquiries led to bugs being filed with Google. Those bugs are not fixed yet, so if something were to go wrong with those services, the open bug reports would act as a fallback because the issue is on GCP’s side.

3. Cloud Monitoring

These monitoring services are either free for the project services (standard metrics), or available at very, very cheap prices.

Setting up Monitoring isn’t in your face when you start using Cloud services, nor is it widely advertised, but it’s a great feature that everyone should use.

What it allows:

  • Creating alerts that fire if usage goes beyond a user-defined limit. Alerts can arrive as SMS, email or app notifications.
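The idea behind such an alert can be sketched in a few lines. This is not the Cloud Monitoring API, just an illustration of the rule an alert policy encodes; the function name, the limit and the sample numbers are all hypothetical.

```python
from typing import Iterable


def breaches_limit(samples: Iterable[int], limit: int, sustained: int = 3) -> bool:
    """Return True if `sustained` consecutive samples are all above `limit`."""
    if sustained <= 0:
        raise ValueError("sustained must be positive")
    run = 0  # length of the current streak of samples above the limit
    for value in samples:
        run = run + 1 if value > limit else 0
        if run >= sustained:
            return True
    return False


# Hypothetical per-minute read counts; three minutes in a row exceed 1,000 reads.
reads_per_minute = [10, 12, 5000, 6000, 7000, 20]
if breaches_limit(reads_per_minute, limit=1000):
    print("ALERT: read ops above limit for 3 consecutive minutes")
```

Requiring a few consecutive breaches, rather than a single one, is one way to avoid being paged for a momentary blip.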

How to use Monitoring services

Set up Alerts

Case study from our development cycle: During development of our first product Announce, one of the engineers on my team developing locally (live build on localhost) accidentally created an infinite loop that led to infinite Firebase read ops for a few minutes. We had a free limited project for development so nothing could go wrong, but the monitoring alerted me and I reached out to the team to see what was up.

The only downside was that we had very little quota remaining for the day, so the team waited until the next day to resume development.

Look for spikes!

Always take the time to explain spikes

After a few steady days, your job is to watch out for spikes in usage. Set the chart’s time range to 6 weeks and see if there is a spike. If there is one, it should be explainable: too much local testing, a surge in usage, expected triggers and so on.

Looking for spikes takes only the 2–3 minutes the dashboard needs to load, and your job is done. This should be part of your daily routine if you have any service deployed on cloud.
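If you export daily usage numbers, the same check can be automated. The sketch below flags days that exceed a multiple of the median; the factor of 3 and the sample data are assumptions for illustration, not a recommendation from any provider.

```python
from statistics import median
from typing import Sequence


def find_spikes(daily_usage: Sequence[float], factor: float = 3.0) -> list[int]:
    """Return the indexes of days whose usage exceeds `factor` times the median."""
    if not daily_usage:
        return []
    baseline = median(daily_usage)
    return [day for day, usage in enumerate(daily_usage) if usage > factor * baseline]


# Hypothetical week of read ops (in thousands); day 4 is the odd one out.
usage = [120, 130, 110, 125, 900, 115, 128]
for day in find_spikes(usage):
    print(f"Day {day}: {usage[day]}k ops needs an explanation")
```

The median is used as the baseline because, unlike the mean, a single large spike barely moves it.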

Anticipate and Reduce Costs

Another big advantage is that it helps you identify low-hanging fruit for reducing costs.

For example, looking at the Storage charts, I realized that GCP was continuously storing the artifacts from every build. This took a development project whose storage shouldn’t exceed 100MB all the way to 25GB, increasing storage costs for nothing. There are ways to fix this, but you must first know that the problem exists.

Another example is memory or CPU consumption from the deployed services. Some graphs can easily tell if you have over/under allocation of resources to services.

High execution time could mean under allocation of resources, leading to more cost.

Write better Code

4. Use Free Projects — Stretch the limits

Firebase and GCP allow more than 10 projects, which can be extended further. If there’s some issue with your account, you can always file for an extension.

That’s great… but how to use these free projects?

Well, for one, set up multiple environments. The software development cycle is common knowledge, so I’ll be brief about it. Ideally, any project should have dev, test, staging (alpha), preview (beta) and production environments. No matter how small the project is, you should still have at least dev and prod environments that are completely decoupled from each other. Read that again: completely decoupled from each other.

Sample project structure using Firebase, and GCP resources in different environments

You can use a free Firebase project for development, and a paid one for production. If you have backend services that need cloud resources, you can grant the paid project’s service accounts access to resources in your free project.
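One simple way to keep environments decoupled in code is to resolve the project from a single, explicit setting. This is a minimal sketch: the project IDs and the `APP_ENV` variable name are hypothetical placeholders.

```python
import os

# Hypothetical project IDs: a free project for dev and a paid one for prod.
PROJECTS = {
    "dev": "myapp-dev",
    "prod": "myapp-prod",
}


def project_for(env: str) -> str:
    """Return the project ID for an environment, rejecting unknown names."""
    try:
        return PROJECTS[env]
    except KeyError:
        raise ValueError(f"unknown environment: {env!r}") from None


def current_project() -> str:
    """Pick the project from the APP_ENV variable, defaulting to the safe one (dev)."""
    return project_for(os.environ.get("APP_ENV", "dev"))


print(f"Deploying against project: {current_project()}")
```

Defaulting to dev means a forgotten setting can only ever hit the free project, never production.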

This topic could be an entire post, as creating an architecture that supports multiple environments (dev, test, prod) for multiple platforms (Web, Android, iOS, API) can be a little complex.

Granting permissions of one project to another

Ideally, almost all access should be granted only to service accounts and none to humans (except maybe admins).

In GCP it’s possible to give permissions to access resources of one project from another project, which is a great feature. I’m very sure every other platform has the same feature and I highly recommend using it.
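An IAM policy is essentially a list of role bindings, and cross-project access means adding another project's service account as a member. The sketch below edits a policy held as a plain dictionary; it does not call any GCP API, and the service account name and role are example values.

```python
def grant_role(policy: dict, role: str, member: str) -> dict:
    """Return a copy of `policy` in which `member` holds `role`."""
    # Copy the bindings so the caller's policy is left untouched.
    bindings = [
        dict(b, members=list(b.get("members", [])))
        for b in policy.get("bindings", [])
    ]
    for binding in bindings:
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            break
    else:
        # No binding for this role yet, so create one.
        bindings.append({"role": role, "members": [member]})
    return {**policy, "bindings": bindings}


# Hypothetical: let the prod deployer service account read dev storage objects.
dev_policy = {"bindings": []}
updated = grant_role(
    dev_policy,
    role="roles/storage.objectViewer",
    member="serviceAccount:deployer@myapp-prod.iam.gserviceaccount.com",
)
print(updated)
```

Granting a narrow, read-only role to one service account keeps the cross-project link as small as possible.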

5. Spend a good amount of time understanding and predicting costs

Most cloud services now have a great cost calculator (such as the GCP Calculator). It is a tool to help developers decide whether and how to use a service. Spend some time on these calculators, and test the costs in extreme cases.

Over time, I have started testing services for a day or two in a safe environment (a development account). I wait for the billing to process, and only once I understand it correctly do I move on to integrating the service into the product.

A better approach is to predict the cost first and then test the service. The actual cost and your prediction should match. Never assume the pricing works the way you expect, because there are a lot of factors to consider.
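Comparing the prediction with the bill can be as simple as a relative-difference check. This is a sketch; the 10% tolerance and the dollar amounts are arbitrary assumptions.

```python
def prediction_matches(predicted: float, billed: float, tolerance: float = 0.10) -> bool:
    """Return True if the billed amount is within `tolerance` of the prediction."""
    if predicted <= 0:
        raise ValueError("predicted cost must be positive")
    return abs(billed - predicted) / predicted <= tolerance


# Hypothetical two-day test: predicted $10.00, billed $14.00.
if not prediction_matches(predicted=10.00, billed=14.00):
    print("Bill differs from the prediction: revisit the pricing assumptions")
```

A mismatch in either direction is worth investigating, since a bill lower than predicted also means the pricing model was misunderstood.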

6. CICD = Operational Efficiency

Time spent on CICD is like time spent converting a cube into a sphere

Any project that has billing enabled should only be accessed by machines (service accounts) and not humans. This means all code should be deployed through CICD pipelines triggered after a code merge, which should ideally only happen after code review.

This falls under operational efficiency. No matter how small your project or team is, it’s a good idea to set things up like this because there’s no reason not to.

Time spent in infrastructure is like an investment. Over the course of time it will save you time and energy.

Less human intervention is better, both for security and efficiency. If you’re on GitHub, GitHub Actions is one of the most valuable resources available, and it is mostly free. So far there hasn’t been a month in which I paid for Actions, and I manage a lot of CICD pipelines.

GCP Cloud Build gives 120 free build minutes per day per billing account, which, to be honest, is a lot. Unless you’re deploying ML models, or have a lot of projects and a large team, I highly doubt you would exceed this free tier.
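A quick back-of-the-envelope check tells you whether your pipelines fit inside that allowance. The sketch uses the 120 minutes per day quoted above; the build counts and durations are hypothetical.

```python
FREE_BUILD_MINUTES_PER_DAY = 120  # figure quoted in this post; verify current pricing


def fits_free_tier(builds_per_day: int, avg_minutes_per_build: float,
                   free_minutes: float = FREE_BUILD_MINUTES_PER_DAY) -> bool:
    """Return True if the expected daily build time stays within the free allowance."""
    return builds_per_day * avg_minutes_per_build <= free_minutes


# Hypothetical team: 20 merges a day, about 4 minutes per build.
print(fits_free_tier(builds_per_day=20, avg_minutes_per_build=4))
```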

Firebase has a great CLI, but I don’t recommend using it for deploying to production environments. For one, it’s harder to version or review the code and you might end up deploying buggy code. Second, any change in the deployment has to go through your machine.

CICD forces you to be good at code management and versioning, and that’s a good thing.

What if I have to deploy just one function?
I still recommend setting up CICD because it’s reusable. Once you’ve figured out how to set it up for one function, all your future deployments will require zero work.

7. Protect the keys (and tokens)!

No Key == Better Security

Most starter Codelabs suggest downloading “service-key.json” or printing some token on the CLI, and setting up the environment locally. This is a great suggestion for someone testing a new service on free projects. However, for any project where you plan to add billing, don’t ever download a key. It’s just not needed, and there’s a lot of risk that goes with it.

For example you may accidentally add it to a git commit, forget to delete it while sharing code, or simply leave it somewhere on your system.

The simplest solution to securing keys and tokens is to never download them.
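If a key does end up on your machine, a small scan before committing can catch it. This sketch looks for JSON files that resemble a downloaded service account key (a `"type": "service_account"` entry together with a `private_key` field); it is a rough heuristic of my own, not an official tool, and it will not catch other kinds of tokens.

```python
import json
from pathlib import Path


def looks_like_service_key(path: Path) -> bool:
    """Heuristic: a JSON object with type 'service_account' and a private_key field."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Unreadable or invalid JSON files cannot be service keys.
        return False
    return (
        isinstance(data, dict)
        and data.get("type") == "service_account"
        and "private_key" in data
    )


def find_service_keys(root: str) -> list[Path]:
    """Return every .json file under `root` that looks like a service account key."""
    return sorted(
        p for p in Path(root).rglob("*.json")
        if p.is_file() and looks_like_service_key(p)
    )


if __name__ == "__main__":
    for key_file in find_service_keys("."):
        print(f"Possible service key, do not commit: {key_file}")
```

Running something like this from a pre-commit hook turns "I forgot to delete it" into a blocked commit instead of a leaked key.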

I do download service keys, but only when absolutely required, and only for development accounts that are on the free plan, for example in order to use the Firebase emulator.

When keys are needed in CICD, encrypt them right away and delete the local copies before committing any code.

You can easily use encryption on Cloud without spending any time learning it. KMS and Secret Manager are as simple as it can get.

8. Multi Cloud

If you’re a solo indie developer or a small startup, I recommend using just one cloud unless you really need a specific feature from another cloud service. This is because each cloud service has its own vocabulary and learning curve.

“I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.”

-Bruce Lee

It doesn’t matter which service, language or platform you choose; my recommendation is to spend enough time learning and experimenting with it that you become highly knowledgeable about it.

Each cloud provider has a similar set of tools, but using them, understanding their pricing and vocabulary, and navigating their console takes a lot of time to learn. My recommendation is to master the one you choose and optimize your usage based on the positives and negatives of that platform.

9. Read the best practices from the provider

Cloud Platforms are very similar to fancy expensive coffee machines in this way. They come with a guidebook manual that’s continuously evolving, and often has a set of dos / don’ts included.

I highly recommend going through the dos and don’ts for each service you use, not only for best performance but also to avoid unnecessary trouble.

A good example is the Google Cloud Run development tips page, which clearly states that developers should avoid background activities. If you never visited this page and wrote a program with background activities that works perfectly on your local machine, you would never understand why it performs so badly on Cloud.

The point is: always proactively look for best practices for any service you use. Stack Overflow (SO) and Medium have some great resources for most services :)

10. Billing Budget Alerts / Notifications

I do recommend setting a billing budget alert because it’s there, but it isn’t a feature I suggest anyone rely upon: it only notifies you and does not stop the spending.
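What a budget alert evaluates boils down to comparing spend against a few percentages of the budget. The sketch below is illustrative only; the 50%, 90% and 100% thresholds and the amounts are assumptions, so use whatever your provider lets you configure.

```python
from typing import Sequence


def crossed_thresholds(spend: float, budget: float,
                       thresholds: Sequence[float] = (0.5, 0.9, 1.0)) -> list[float]:
    """Return the budget fractions that the current spend has reached or passed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    return [t for t in thresholds if ratio >= t]


# Hypothetical month: $95 spent against a $100 budget.
for threshold in crossed_thresholds(spend=95.0, budget=100.0):
    print(f"Notify: spend has passed {threshold:.0%} of the budget")
```

Note that nothing here cuts anything off, which mirrors why this post treats quotas and spend caps as the real safety net.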

All of this to deploy simple code?

If you’re deploying a scalable product that has multiple connected services and you don’t have deep pockets, I recommend following the suggestions in this post.

Building something from scratch and making it economically scalable isn’t an easy task. It doesn’t matter if your product cures cancer or you’re working on a spam attack; building anything requires a substantial amount of work.

There’s a reason why SREs are paid so well across the industry, and there’s a huge constant demand for them.

Thankfully the world is becoming better every day, with better free tutorials, codelabs, and people sharing their awesome strategies on SO for free. The key is to begin :).

Sleep like a baby

Great Sleep == Great Productivity.

This post covers the basic setup that I usually do for any project that I deploy on Cloud. All of this helps me sleep well at night without having any SRE or anyone in dev-ops.

Well, that’s probably not entirely true, as each of the big cloud services has tens of thousands of brilliant engineers that I can easily piggyback on.

There are several service-specific hacks not included in this post, which I might write about in the future.

This is not a sponsored post. There are no newsletter signups, no affiliate links, no product promotion. If you found this post useful, please share it.

If you’d like to get my future posts, or just want to say hello, connect with me.

Originally published at https://sudcha.com on January 13, 2021.

Founder of Milkie Way, Inc. https://tomilkieway.com
