We Burnt $72K Testing Firebase + Cloud Run and Almost Went Bankrupt [Part 2]
One Hundred Sixteen Billion: that’s the number of times our test code read the Firestore database in a few hours.
This post is Part 2 in the series. If you haven’t already, go through Part 1 first so this one makes sense :)
Personally, this was the first time I had received such a big setback. It had the potential to alter the course of our company as well as my life. There were several lessons on entrepreneurship in this incident; one important one was to stay strong.
I had a team of ~7 engineers/interns at this time, and it would take Google about 10 days to get back to us on this incident. In the meantime, we had to resume development and find our way around the account suspensions. Despite this weighing on my mind, we had to focus on the features and our product.

Poem: Khada Himalaya bata raha hai (Standing Himalaya is telling us)
For some reason one poem from my childhood kept playing in my head. It was my favorite one, and I remembered it word for word, even though the last time I recited it was over 15 years ago.
What did we actually do?
As a very small team, we wanted to stay serverless for as long as we could. The problem with serverless solutions like Cloud Functions and Cloud Run is the request timeout.
A single instance would serially scrape the URLs on a web page, but after 9 minutes it would time out.
After discussing this problem, and powered by caffeine, within minutes I wrote some rough code on the whiteboard which, I now see, had serious design issues. Back then we were focused on failing and learning super fast and trying new things, so we went ahead and experimented.

To overcome the timeout limitation, I suggested sending jobs to instances via POST requests (with the URL as the payload), and using multiple instances in parallel instead of one instance serially. Because each Cloud Run instance would scrape only one page, it would never time out; all pages would be processed in parallel (scale); and usage would be highly optimized, because Cloud Run billing is accurate to the millisecond.

If you look closely, the flow is missing a few important pieces.
- Exponential recursion without a break: the instances had no way to know when to stop, as there was no termination condition.
- No deduplication: the POST requests could contain the same URLs. If a page links back to the previous page, the Cloud Run service gets stuck in infinite recursion; worse, this recursion multiplies exponentially (our max instances were set to 1000!).
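The runaway can be reproduced in a few lines. Here is a minimal in-memory sketch (hypothetical names, not our actual service code) where each loop iteration stands in for a POST dispatched to a fresh Cloud Run instance; a shared visited set is one way to supply the missing termination condition:

```javascript
// Hypothetical in-memory model of the fan-out, not the real service.
// A tiny "site" where page B links back to page A.
const links = {
  'site/A': ['site/B'],
  'site/B': ['site/A'],
};

// Flawed version: no termination condition, no deduplication.
// It would bounce between A and B forever, so we model the runaway
// with an explicit budget of dispatches.
function crawlNaive(startUrl, budget) {
  let dispatched = 0;
  const queue = [startUrl];
  while (queue.length > 0 && dispatched < budget) {
    const page = queue.shift();
    dispatched += 1;                  // one more "instance" spun up
    queue.push(...(links[page] || []));
  }
  return dispatched;                  // always hits the budget: it never stops
}

// Fixed version: a shared visited set breaks the cycle.
function crawlDeduped(startUrl) {
  const visited = new Set();
  const queue = [startUrl];
  while (queue.length > 0) {
    const page = queue.shift();
    if (visited.has(page)) continue;  // already scraped: drop the job
    visited.add(page);
    queue.push(...(links[page] || []));
  }
  return visited.size;
}

console.log(crawlNaive('site/A', 10000)); // 10000 — exhausts the whole budget
console.log(crawlDeduped('site/A'));      // 2 — two pages, then it stops
```

In production the visited set would have to live in shared state (e.g. a database or cache) so that all instances see it, which is exactly the piece our whiteboard design lacked.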
As you can imagine, this led to 1000 instances querying and writing to the Firebase DB every few milliseconds. Looking at the data post-incident, we saw that Firestore reads at one point reached about 1 billion requests per minute!

116 Billion Reads (!!!) and 33 Million Writes
Running this version of our “Hello World” deployment on Cloud Run made 116 billion reads and 33 million writes to Firestore. Ouch!
Read Operations Cost on Firebase:
$ (0.06 / 100,000) * 116,000,000,000 = $69,600
To put this number into perspective, Google gets around 3.5 billion searches per day. Even if each search required 30 table lookups, that would still be fewer lookups than we did in a few hours.
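The arithmetic above can be sanity-checked in a few lines, using the $0.06-per-100,000-reads rate quoted in the formula:

```javascript
// Back-of-the-envelope check of the read-cost figure above.
const READ_PRICE_PER_100K = 0.06; // USD per 100,000 reads, as quoted above

function firestoreReadCost(reads) {
  return (READ_PRICE_PER_100K / 100000) * reads;
}

const ourReads = 116e9;                   // 116 billion reads
console.log(firestoreReadCost(ourReads)); // ≈ 69600 USD

// Perspective: 3.5B Google searches/day at 30 hypothetical lookups each
const googleDailyLookups = 3.5e9 * 30;    // 105 billion
console.log(googleDailyLookups < ourReads); // true
```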
16,000 hours (!) of Cloud Run compute time
After testing, we assumed the request had died because logging stopped, but it had actually gone into a background process. As we didn’t delete the services (this was our first time using Cloud Run, and we didn’t really understand it back then), multiple services continued to run slowly.
In 24 hours, these services, each scaled to 1000 instances, consumed 16,022 hours of compute.
All our Mistakes
Deploying flawed algorithm on Cloud
Already discussed above. We did discover a new way to use serverless by dispatching jobs via POST requests (something I hadn’t found anywhere on the internet), but we deployed it without refining the algorithm.
Deploying Cloud Run with Default Options
While creating the Cloud Run service, we kept the default values. max-instances is preset to 1000, and concurrency to 80. We didn’t know at the time that these values are the worst-case scenario for a test program.
Had we chosen max-instances to be “2”, our costs would’ve been 500 times lower: the $72,000 bill would’ve been $144.
Had we chosen concurrency of “1” request, we probably wouldn’t have even noticed the bill.
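Under the rough assumption that a runaway workload saturates every instance it is allowed to create, the bill scales roughly linearly with the instance cap, which is where the $144 figure comes from:

```javascript
// Rough model (an assumption, not an exact billing formula): with a
// runaway workload saturating every instance, cost scales roughly
// linearly with how many instances are allowed to spin up.
const DEFAULT_MAX_INSTANCES = 1000; // Cloud Run default at the time
const ACTUAL_BILL = 72000;          // USD, what we were charged

function billWithCap(maxInstances) {
  return (ACTUAL_BILL * maxInstances) / DEFAULT_MAX_INSTANCES;
}

console.log(billWithCap(2));    // 144 — the "$144" figure above
console.log(billWithCap(1000)); // 72000 — the default cap
```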
Using Firebase without understanding it completely
There are some things that can only be learned through a lot of experience. Firebase isn’t a language that one can learn; it’s a managed platform service provided by Google. It has rules defined by Google, not by the laws of nature or by how a particular user may assume it works.
The integration of Firebase and GCP is a slightly tricky one. If billing is enabled on one platform, GCP assumes it’s enabled everywhere.

Also, while writing code in Node.js, one must take care with background processes. If the code goes into a background process, there’s no easy way for the developer to know that the service is still running, but it might be, for a fairly long time. As we learned later, this was also why most of our Cloud Functions were timing out.
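The Node.js pitfall boils down to responding before awaiting your async work, leaving a “floating” promise that keeps running (or being throttled) after the response is sent. Here is a minimal sketch with hypothetical handler and function names, not our actual service code:

```javascript
// Sketch of the floating-promise pitfall (hypothetical names).
let pagesScraped = 0;

function scrapePage(url) {
  // Stands in for real async scraping work.
  return new Promise((resolve) =>
    setTimeout(() => {
      pagesScraped += 1;
      resolve(url);
    }, 50)
  );
}

// Buggy: the response is returned immediately and the promise is left
// floating. On Cloud Run the container may be throttled or kept alive,
// so the work drags on in the background unnoticed.
function handleBuggy(url) {
  scrapePage(url); // not awaited!
  return 'accepted'; // "response" sent, work still pending
}

// Fixed: finish the work before responding.
async function handleFixed(url) {
  await scrapePage(url);
  return 'done';
}

const reply = handleBuggy('site/A');
console.log(reply, pagesScraped); // 'accepted' 0 — the scrape hasn't run yet

handleFixed('site/B').then(() => {
  console.log(pagesScraped); // both scrapes have completed by now
});
```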
“Fail fast, learn fast” with Cloud is a bad idea
Cloud, overall, is like a double-edged sword. Used properly, it can be of great use; used incorrectly, it can have serious consequences.
If you count the pages in the GCP documentation, there are probably more than in a few novels. Understanding pricing and usage is not only time-consuming but also requires a deep understanding of how cloud services work. No wonder there are full-time jobs for just this purpose!
Firebase, and Cloud Run are really powerful
At the peak, Firebase handled about one billion reads per minute. This is exceptionally powerful. We had been playing with Firebase for 2–3 months and were still learning about it, but I had absolutely no idea how powerful it was until now.
Same goes for Cloud Run! With concurrency == 60, max_containers == 1000, and each request taking 400 ms (i.e., 2.5 requests per second per slot), Cloud Run can handle 9 million requests per minute!
60 * 1000 * 2.5 * 60 = 9,000,000 requests / minute
For comparison, Google Search gets 3.8 million searches per minute.
This means that, if set up correctly, with really fast microservices powering the backend, the entire Google Search front end could be deployed on Cloud Run.
I’m not suggesting that Google Search is simple, just pointing out that Cloud Run is very powerful.
We Survived!

After going through our lengthy doc on this incident sharing our side of the story, plus various consultations, talks, and internal discussions, Google waived our bill as a one-time gesture!
Thank you Google!
We got our lifeline and got back on our feet to build Announce, this time with a much better perspective, a better architecture, and a much safer implementation.
Google, my favorite tech company, is not just a great company to work for; it’s also a great company to collaborate with. The tools provided by Google are very developer friendly, have great documentation (for the most part), and are constantly expanding.
(UPDATE) Easier Option | Save on GCP Costs
After writing this article, a startup whose goal is to optimize GCP costs reached out to me. I tried their services, and I highly recommend them if you are short on resources for mastering GCP. Give them a try:
What Next?
After this incident, we spent a few months understanding the cloud and our architecture. Within a few weeks my understanding improved so much that I could approximate the cost of scraping the “entire web” on Cloud Run with an improved algorithm.
This incident led me to analyze our product’s architecture in depth, and we scrapped V1 of our product to build scalable infrastructure to power our products.
In Announce V2, we didn’t just build an MVP; we built a platform where we could iteratively develop new products rapidly, and test them thoroughly in a safe environment.
This journey took us some time… Announce launched at the end of November, ~7 months later than we had planned for our V1, but it is highly scalable, gets the best out of cloud services, and is highly optimized for usage.
We also launched on all platforms, and not just web.
What’s more, we reused the entire platform to build our second product, Point Address. Not only are both products scalable, well-architected, and highly efficient, they are built on a platform that allows us to rapidly build and deploy ideas into usable products.
Check out Announce and Point Address:
I will soon be writing another post on how to deploy on GCP while keeping usage, development and cost in check.
Follow Up:
I wrote a post on best practices to keep usage and development in check on most cloud platforms. You can refer to the Medium article below:
Blog post originally published at the Milkie Way Blog: https://blog.tomilkieway.com.