How to use Cloud without losing Sleep

A hands-on guide for developers without dedicated SRE or on-call teams to use Cloud services safely.

(Originally published at https://sudcha.com)

Introduction

About a month ago, I published two posts on our company blog about one chapter of our journey: We burnt $72K and almost went bankrupt. The posts circulated fairly widely in the tech community and inspired me to write a post dedicated to the hacks that help me sleep better at night without having an on-call team to manage all the cloud services.

This post outlines the practices I learnt and now follow to keep Cloud usage in check and eliminate possible surprises.

Disclaimers (!)

  1. If you’re capable of hosting everything on your own server(s) … Congratulations, you’re one of the very few who have this knowledge and experience. You know what you’re doing and keep it up. Depending on whether you have interest in Cloud platforms, this post may or may not add any value to you.
  2. Examples are centric to GCP but fundamentals should apply to all platforms:
    Although the examples in this post are on GCP, I’ve tried to structure it around fundamentals rather than platform-specific features. All cloud services have similar features, but they have different names, prices and usage policies.
  3. This is not an exhaustive list
    Like everyone, I’m still learning and growing. Below is what I learnt from several mistakes and from going through every piece of literature shared online by great, kind minds. If you have more tips, feel free to share them with me, and I’ll be happy to add them to this list.
  4. This is a long post
    Sorry about it, but there’s no way around it. If you find it useful and can’t finish in one go, you can always come back to this post.

Let’s jump into it!

1. Use multiple forms of payments (FOP), preferably with spend caps

Let’s begin with the simplest yet the most effective fixes.

1a. Have spending limits/caps on the forms of payments (FOPs) you use

Most cloud services bill monthly unless a spend threshold is reached first. If the threshold is reached, the cloud platform charges the account right away. So if one of your services falters, the cloud provider directly charges the form of payment (like a credit card) on file and, on non-payment, stops the service.

Ideally the spending limit could be anywhere between 120–150% of the cost you expect to incur given the usage of your platform.
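
As a quick worked example of the 120–150% rule above, here is a minimal sketch. The function name and the $200 expected cost are made up for illustration; plug in your own estimate.

```python
def spend_cap_range(expected_monthly_cost, low=1.20, high=1.50):
    """Return (min_cap, max_cap), i.e. 120% and 150% of the expected monthly cost."""
    if expected_monthly_cost < 0:
        raise ValueError("expected_monthly_cost must be non-negative")
    return (
        round(expected_monthly_cost * low, 2),
        round(expected_monthly_cost * high, 2),
    )


# Hypothetical project that is expected to cost about $200 per month.
low_cap, high_cap = spend_cap_range(200)
print(f"Set the card limit between ${low_cap:.2f} and ${high_cap:.2f}")
```

Anything the provider tries to charge beyond that range gets declined by the card, which is exactly the surprise-stopper we want.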

From what I’ve read, it’s relatively easier to get a bill waived than to get a refund. While the suggestions in this post will make sure such situations don’t arise, it’s better to have it set up anyway.

Don’t use your $25,000 no limit credit card while setting up billing for a fresh account.

1b. Use different FOPs for development and production.

After we messed up in one of our projects, our credit card declined payment, and this had several unexpected ripple effects.

  • First, GCP suspended all our billing accounts tied to the same credit card on suspicion of fraud.
  • Second, our bank started declining all future transaction requests from GCP, also suspecting fraud.

Sorting this took several days, and it halted our development cycle. Thankfully we didn’t have a production service back then.

In an ideal setup, I recommend using one FOP for the production account, and another one for dev/test accounts. Never mix these two.

Note:
If you’re a solo developer, I highly recommend setting up an LLC or some umbrella company that gives your personal assets protection. In recent years setting up a basic business entity has become quite standard everywhere in the world, thanks to the entrepreneurial boom.

Some wise people online have even suggested setting up a shell company to use Cloud Platforms (!!). I personally don’t recommend spending time doing this. If you’re building something meaningful, spend all your time making it better in a legit way.

2. Setup Service Quotas

In GCP, users can define “Quotas” for most services (Admin -> Quotas). AWS has a similar feature called “Service Quotas”, and I’m sure other providers have this as well.

Quotas can be set for usage per day, per minute or even per user per minute. I don’t really trust the per-user quota because who defines “a user”? The other quotas are, for the most part, fairly reliable.

While GCP doesn’t allow auto cut off on billing, it does allow auto cut off based on usage, well at least for most services.

For any account with billing, when I enable a new service, I first check if there’s a quota, and if there’s one, I set it to a really low value while getting familiar with the service. A lot of the services are charged per use, and setting up this quota also helps in validating the costs that will be incurred on use.

As an example, if I’m testing the GCP Adult Image Classifier service, as soon as I enable the service, I would first set the Quota to 100 per day and then try the Codelab provided.
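
While I’m getting familiar with a service, I sometimes like a second guard in my own test code that mirrors the quota. The sketch below is not a GCP API; `DailyCallBudget` is a hypothetical helper that simply refuses to make more than a fixed number of calls per day from the local process. The real protection is still the quota set in the console.

```python
import datetime


class DailyCallBudget:
    """Local guard that refuses calls once a daily limit is used up.

    Hypothetical helper for illustration; it only counts calls made
    through this object, in this process.
    """

    def __init__(self, limit_per_day, today=datetime.date.today):
        self.limit_per_day = limit_per_day
        self._today = today      # injectable clock, handy for testing
        self._day = today()
        self._used = 0

    def consume(self, calls=1):
        """Record `calls` upcoming requests and return the remaining budget."""
        current = self._today()
        if current != self._day:
            # A new day has started: reset the counter.
            self._day = current
            self._used = 0
        if self._used + calls > self.limit_per_day:
            raise RuntimeError(
                f"daily budget of {self.limit_per_day} calls exhausted"
            )
        self._used += calls
        return self.limit_per_day - self._used


# Mirror the 100-per-day quota from the example above.
budget = DailyCallBudget(limit_per_day=100)
remaining = budget.consume()  # call this right before each request to the service
print(f"{remaining} calls left today")
```

If a loop in the test code goes wild, this raises after 100 calls instead of hammering the service until the platform quota kicks in.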

If the cloud provider you use has neither auto billing shutoff nor budgets, treat that as a big red flag against using the service.

Some Caveats

  • Default Values are counter intuitive!
    Most services have a preset default quota of unlimited, or some absurd number like 1,000,000. Why? God knows. Maybe GCP engineers never paid attention to it, or they assume that each of their users, a considerable share of whom are students, builds apps with a million users from the get-go. Setting quotas is also an arduous multi-click process, probably because it is optimized for large organizations. Don’t fret about these extra clicks; they are worth it.
  • Not all services have Quota limits
    For example, Firestore Read/Write ops. From an engineering standpoint, this is understandable because if the service has to check for quota, how can it be real time? There are some quotas available for Firebase which are useful, but don’t assume that every service has a quota.
  • Not all quotas work as advertised
    Test each of the quotas before fully relying on them. I found several that don’t work, and my support consultations led to bugs being filed with Google. The bugs are not fixed yet, so if something were to go wrong with those services, the open bug reports would act as a fallback because it’s an issue on GCP’s side.

3. Cloud Monitoring

While billing is delayed by about a day, most metrics provided by Cloud platforms are delayed by only a few minutes. For GCP this is called Cloud Monitoring, for AWS it’s called CloudWatch, and for Microsoft Azure it’s called Azure Monitor.

These monitoring services are either free for the project services (standard metrics), or available at very, very cheap prices.

Setting up Monitoring isn’t really in your face when you start using Cloud Services, nor is it widely advertised, but it’s a great feature that everyone should use.

What it allows:

  • Creating beautiful custom dashboards with usage graphs for the services you care about
  • Creating alerts that fire if usage goes beyond a user defined limit. The alerts can be SMS, email and app notifications.

How to use Monitoring services

Set up Alerts

You can also set up alerts that’ll fire emails, text messages and mobile app notifications, all for free. While they may not be of much use while you’re sleeping, they do provide a lot of support while you’re awake.

Case study from our development cycle: During development of our first product Announce, one of the engineers on my team developing locally (live build on localhost) accidentally created an infinite loop that led to infinite Firebase read ops for a few minutes. We had a free limited project for development so nothing could go wrong, but the monitoring alerted me and I reached out to the team to see what was up.

The only downside was that we had a very limited quota remaining for the day, so the team waited until the next day to resume development.

Look for spikes!

Always take the time to explain spikes

After a few steady days, your job should be to watch out for spikes in usage. Set the time range of the usage graph to 6 weeks and see if there is a spike. If there is a spike, it should be explained: say, too much local testing, a surge in usage, expected triggers, etc.

Looking for spikes takes only the 2–3 minutes the dashboard needs to load, and your job is done. This should be part of your daily working routine if you have any service deployed on the cloud.
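
If you export the daily numbers from the dashboard, the same eyeball check can be sketched in a few lines. This is only an illustration under my own assumptions: `find_spikes`, the 7-day window and the 3x factor are arbitrary choices, not anything the monitoring services define.

```python
from statistics import median


def find_spikes(daily_usage, window=7, factor=3.0):
    """Return indexes of days whose usage exceeds `factor` times the
    median of the preceding `window` days."""
    spikes = []
    for i in range(window, len(daily_usage)):
        baseline = median(daily_usage[i - window:i])
        # With a zero baseline, any usage at all counts as a spike.
        if daily_usage[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Hypothetical daily read counts: steady usage with one runaway day.
reads_per_day = [100] * 10 + [900] + [100] * 3
for day in find_spikes(reads_per_day):
    print(f"Day {day}: {reads_per_day[day]} reads needs an explanation")
```

Using the median rather than the mean keeps one bad day from inflating the baseline for the days that follow it.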

Anticipate and Reduce Costs

As the graphs are only a few minutes late, after doing something you can wait for a few minutes to see what resources you consumed. You can also predict costs of something before it shows up in billing.

Another big advantage is identifying low-hanging fruit for reducing costs.

For example, looking at the Storage charts, I realized that GCP was storing the build artifacts from every build. This took the storage of a development project that shouldn’t be using any more than 100MB up to 25GB, increasing storage costs for nothing. There are ways to fix this, but one must first know that the problem exists.

Another example is memory or CPU consumption from the deployed services. Some graphs can easily tell if you have over/under allocation of resources to services.

High execution time could mean under allocation of resources, leading to more cost.

Write better Code

These graphs easily show low-hanging fruit for optimizing code. Maybe your cloud function is going into background processes, or perhaps it’s timing out because it calls other services serially when they could be called asynchronously.
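
To make the serial vs. asynchronous point concrete, here is a minimal sketch using Python’s asyncio. The three service names are made up, and `asyncio.sleep` stands in for real network calls; the only thing it shows is how waiting on independent calls at the same time cuts the execution time.

```python
import asyncio
import time


async def call_service(name, delay=0.1):
    """Stand-in for a network call to another (hypothetical) service."""
    await asyncio.sleep(delay)
    return f"{name}: ok"


async def call_serially():
    # Each call waits for the previous one to finish.
    return [
        await call_service("auth"),
        await call_service("storage"),
        await call_service("email"),
    ]


async def call_concurrently():
    # All three calls are in flight at the same time.
    return await asyncio.gather(
        call_service("auth"),
        call_service("storage"),
        call_service("email"),
    )


def timed(coroutine_function):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coroutine_function())
    return result, time.perf_counter() - start


serial_result, serial_seconds = timed(call_serially)
concurrent_result, concurrent_seconds = timed(call_concurrently)
print(f"serial: {serial_seconds:.2f}s, concurrent: {concurrent_seconds:.2f}s")
```

On a pay-per-execution-time service, the shorter run is also the cheaper one. This only works when the calls don’t depend on each other’s results.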

4. Use Free Projects — Stretch the limits

Cloud Platforms today provide a lot of free services per project, and there are a lot of projects that a user can create. There’s a reason why the systems are set up in such a way. The primary reason, as far as I can think of, is to give enough room for testing and learning about the service.

Firebase and GCP allow more than 10 projects, which can be extended further. If there’s some issue with your account, you can always file for an extension.

That’s great… but how to use these free projects?

Well, for one, set up multiple environments. The software development cycle is common knowledge, so I’ll be brief about it. In any project you should ideally have dev, test, staging (alpha), preview (beta) and production environments. No matter how small the project is, you should still have at least dev and prod environments that are completely decoupled from each other. Read that again: completely decoupled from each other.

Sample project structure using Firebase, and GCP resources in different environments

You can use a free Firebase project for development, and a paid one for production. If you have backend services that need cloud resources, you can grant the service accounts of the paid project permissions on the resources of your free project.

This topic could be an entire post, as creating an architecture that supports multiple environments (dev, test, prod) for multiple platforms (Web, Android, iOS, API) can be a little complex.

Granting permissions of one project to another

An important concept to understand is the types of accounts and access. There are two types of access in cloud services: 1. access for a human, and 2. access for a service account.

Ideally almost all accesses should be only to service accounts and none to humans (except maybe admins).

In GCP it’s possible to give permissions to access resources of one project from another project, which is a great feature. I’m very sure every other platform has the same feature and I highly recommend using it.

5. Spend a good amount of time understanding and predicting costs

Before going through any Codelabs for a service, first spend some time on Pricing of that service/feature. This is tied to designing the architecture of a project which can be a long post of its own.

Most cloud services now have a great cost calculator (for GCP, the Google Cloud Pricing Calculator). This is a tool for developers to decide whether and how to use that service. Spend some time on these calculators, and test the costs in extreme cases.

Over time, I have started testing services for a day or two in a safe environment (a development account). I wait for the billing to process, and only when I understand it correctly do I move on to integrating the service into the product.

A better approach is to predict the costs first and then test the service. The actual cost of the service and your prediction should match. Never assume the pricing works exactly as listed, because there are a lot of factors to consider.
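
A prediction doesn’t need to be fancy. Here is a minimal sketch for a pay-per-operation service. The price ($0.06 per 100,000 operations) and the free tier (50,000 operations per day) are placeholder numbers I made up for the example; always take the real ones from the provider’s pricing page.

```python
def estimate_monthly_cost(daily_ops, price_per_100k_ops, free_ops_per_day=0, days=30):
    """Estimate the monthly bill for a service charged per operation.

    All prices and free-tier numbers passed in are the caller's
    assumptions; this function only does the arithmetic.
    """
    billable_per_day = max(0, daily_ops - free_ops_per_day)
    return round(billable_per_day * days * price_per_100k_ops / 100_000, 2)


# Placeholder pricing, NOT a real price list.
PRICE_PER_100K = 0.06
FREE_PER_DAY = 50_000

expected = estimate_monthly_cost(1_000_000, PRICE_PER_100K, FREE_PER_DAY)
extreme = estimate_monthly_cost(1_000_000_000, PRICE_PER_100K, FREE_PER_DAY)
print(f"expected usage: ${expected:.2f} per month")
print(f"extreme case:   ${extreme:.2f} per month")
```

Write down the expected number before testing, then compare it with what shows up in billing a day or two later. If they don’t match, there is a pricing factor you haven’t understood yet.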

6. CICD = Operational Efficiency

Time spent in CICD is like time spent converting a cube into a sphere

Any project that has billing enabled should only be accessed by machines (service accounts) and not humans. This means all code should be deployed through CICD pipelines triggered after a code merge, which should ideally only happen after code review.

This falls under operational efficiency, and no matter how small your project or team is, it’s a good idea to set it up like this because there’s no reason not to.

Time spent in infrastructure is like an investment. Over the course of time it will save you time and energy.

Less human intervention is better, both for security and efficiency. If you’re on GitHub, GitHub Actions is one of the most valuable resources available, and it is mostly free. So far, there hasn’t been a month in which I paid for Actions, and I manage a lot of CICD pipelines.

GCP Cloud Build gives 120 minutes per day per billing account free, which to be honest is a lot. Unless you’re deploying ML models, or have a lot of projects plus a large team, I highly doubt that this free tier would be breached.

Firebase has a great CLI, but I don’t recommend using it for deploying to production environments. For one, it’s harder to version or review the code and you might end up deploying buggy code. Second, any change in the deployment has to go through your machine.

CICD forces you to be good at code management and versioning, and that’s a good thing.

What if I have to deploy just one function?
I still recommend setting up CICD because it’s reusable. Once you’ve figured out how to set it up for one function, all your future deployments will require zero work.

7. Protect the keys (and tokens)!

No Key == Better Security

Most starter Codelabs suggest downloading “service-key.json” or printing some token on the CLI, and setting up the environment locally. This is a great suggestion for someone testing a new service on free projects. However, for any project that you plan on adding Billing to, don’t ever download a key. It’s just not needed, and there’s a lot of risk that goes with it.

For example you may accidentally add it to a git commit, forget to delete it while sharing code, or simply leave it somewhere on your system.

Simplest solution to securing keys and tokens is to never download them.

I do download service keys, but only when absolutely required, and only for development accounts that are on the Free plan; for example, in order to use the Firebase emulator.

When keys are needed in CICD, encrypt them right away and delete the local copies before committing any code.
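
A quick way to catch a forgotten local copy before committing is to scan the working directory. The sketch below relies on the fact that a downloaded GCP service account key is a JSON file with "type": "service_account" and a "private_key" field. The function names are my own, and it only catches that one file shape, not tokens or other secrets.

```python
import json
from pathlib import Path


def looks_like_service_account_key(path):
    """True if the file is JSON shaped like a GCP service account key."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # Unreadable file or not valid JSON: not a key file.
        return False
    return (
        isinstance(data, dict)
        and data.get("type") == "service_account"
        and "private_key" in data
    )


def find_key_files(root="."):
    """Return the sorted paths of all *.json files under `root` that look like keys."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*.json")
        if looks_like_service_account_key(path)
    )


if __name__ == "__main__":
    leftovers = find_key_files(".")
    for path in leftovers:
        print(f"Possible service account key: {path}")
    if leftovers:
        raise SystemExit(1)  # non-zero exit so a git hook or CI step can block the commit
```

Run it by hand before a commit, or wire it into a pre-commit hook so it runs every time.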

You can easily use encryption on Cloud without spending any time learning it. KMS and Secret Manager are as simple as it can get.

8. Multi Cloud

If you have resources and are not bootstrapping with an extremely small team, you are probably using more than one Cloud. DigitalOcean (DO), GCP, Azure and AWS all have some great pluses and minuses. If you have dedicated dev-ops and SREs, I highly recommend making use of multi-cloud.

However, if you’re a solo indie developer or a small startup, I recommend using just one cloud unless you really need a specific feature from another cloud service. This is because each cloud service has its own dictionary and learning curve.

“I fear not the man who has practiced 10,000 kicks once, but I fear the man who has practiced one kick 10,000 times.”

-Bruce Lee

It doesn’t matter which service, language or platform you choose; my recommendation is to spend enough time learning and experimenting with it that you become the most knowledgeable person about it.

Each cloud provider has a similar set of tools, but using them, understanding their pricing/dictionary, and navigating their console takes a lot of time to learn and adapt to. My recommendation is to master the one you choose and optimize your usage based on the positives and negatives of that platform.

9. Read the best practices from the provider

Think of cloud providers as coffee machines. The fundamental principles of making coffee remain the same, but the engineers who design the machines chose to prioritize a certain set of features that lead to certain design trade offs. The machine must be used in a certain way to work the best, and certain things should always be avoided.

Cloud Platforms are very similar to fancy expensive coffee machines in this way. They come with a guidebook manual that’s continuously evolving, and often has a set of dos / don’ts included.

I highly recommend going through the dos and don’ts for each of the services you use, not only for best performance but also to avoid unnecessary trouble.

A good example of this is the Google Cloud Run development tips. This article clearly states that developers should avoid background activities. If you as a developer didn’t go to this page on your own, and wrote a program that has some potential background activities but works perfectly on your local machine, you’d never realize why your code performs so badly on Cloud.

Point being, always proactively look for best practices for any service you use. SO and Medium have some great resources for most services :)

10. Billing Budget Alerts / Notifications

Every single guide, article and post recommends setting up billing alerts and auto-shutoff cloud functions. This is my least favorite feature, and the last on my list, simply because it’s not proactive but post facto.

I do recommend setting a Billing/Budget alert because it’s there, but it isn’t a feature I suggest anyone rely upon.

All of this to deploy simple code?

Well, it depends. If you’re deploying a small Cloud Function or equivalent thereof which has no possibility of backfiring then probably not.

If you’re deploying a scalable product that has multiple connected services and you don’t have deep pockets, I recommend following the suggestions in this post.

Building something from scratch and making it economically scalable isn’t an easy task. It doesn’t matter if your product cures cancer, or you’re working on a spam attack; building anything requires a substantial amount of work.

There’s a reason why SREs are paid so well across the industry, and there’s a huge constant demand for them.

Thankfully the world is becoming better every day with better free tutorials, codelabs, and people sharing their awesome strategies on SO for free. The key is to begin :).

Sleep like a baby

Great Sleep == Great Productivity.

This post covers the basic setup that I usually do for any project that I deploy on Cloud. All of this helps me sleep well at night without having any SRE or anyone in dev-ops.

Well, that’s probably not entirely true, as each of the big cloud services has tens of thousands of brilliant engineers that I can easily piggyback on.

There are several service specific hacks not included in this post, which I might write about in future.

This is not a sponsored post. There are no newsletter signups, no affiliate links, no product promotion. If you found this post useful, please share it.

If you’d like to get my future posts, or just want to say hello, connect with me.

Originally published at https://sudcha.com on January 13, 2021.

Founder of Milkie Way, Inc. https://tomilkieway.com
