The Economic Times

How Google engineer Bhasker Goel is rethinking monitoring to catch failures before they escalate

When a major platform ships a broken update, the damage rarely arrives with a clear warning. A core metric drifts imperceptibly, a graph tilts a few degrees, and engineers argue over whether the signal is real. Bhasker Goel, a software engineer at Google Ads in Bengaluru, works on the narrow but consequential gap between a system beginning to fail and a monitoring system recognising that failure.

Goel, an alumnus of IIT Guwahati, built financial infrastructure at DE Shaw before joining Google, where he now designs reliability systems for Google Ads. In environments of this size, small anomalies can cascade rapidly. The regressions his systems are built to catch tend to be gradual and silent; left undetected, they can become expensive quickly.

He recognised early on that traditional monitoring struggles to keep up with modern infrastructure. Older systems were designed for predictable, monolithic applications.

Today’s distributed infrastructure involves shifting traffic, layered dependencies and continuous deployment pipelines. A fixed threshold calibrated for normal conditions might miss a regression affecting only a specific user segment, or it might fire false alerts so frequently that engineers suffer alarm fatigue and simply ignore them.

The pace of software development is compounding the problem. AI-assisted coding tools are increasing the volume of code moving through production systems, which means more changes, more edge cases and less time for engineers to distinguish a real regression from routine variation. In that environment, monitoring systems built for slower, more predictable release cycles are under growing strain [1].

“The default industry model is still to pick a threshold, watch a dashboard, and alert when a number crosses a line,” Goel said. “That worked well when systems were smaller and traffic was predictable. At current scale, relying solely on static thresholds can become a liability.”

Goel’s work centres on an approach known as comparative monitoring. Rather than measuring metrics against a static line, his systems compare two segments of the same environment in real time. If functionally identical parts of the system begin diverging under the same external conditions, that divergence surfaces a regression well before a conventional alert fires. Because both sides share the same live traffic patterns, the ambient noise that makes conventional monitoring unreliable is largely filtered out.
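
To make the contrast concrete, here is a minimal illustrative sketch in Python. It is not Goel's or Google's implementation; the function names, the 5% tolerance, the three-window persistence rule and the sample numbers are all assumptions chosen for illustration. It compares the same metric from two segments that see the same traffic and flags the point where they stop agreeing, alongside a conventional fixed-floor check.

```python
from typing import Optional, Sequence


def relative_gap(a: float, b: float) -> float:
    """Symmetric relative difference between two readings of the same metric."""
    denom = (abs(a) + abs(b)) / 2
    return 0.0 if denom == 0 else abs(a - b) / denom


def first_divergence(
    segment_a: Sequence[float],
    segment_b: Sequence[float],
    tolerance: float = 0.05,   # assumed: gaps above 5% count as disagreement
    min_consecutive: int = 3,  # assumed: require 3 windows in a row to cut noise
) -> Optional[int]:
    """Index of the window where the two segments have disagreed for
    min_consecutive windows in a row, or None if they never do."""
    streak = 0
    for i, (a, b) in enumerate(zip(segment_a, segment_b)):
        if relative_gap(a, b) > tolerance:
            streak += 1
            if streak >= min_consecutive:
                return i
        else:
            streak = 0
    return None


def first_static_breach(series: Sequence[float], floor: float) -> Optional[int]:
    """Conventional check: index of the first reading below a fixed floor."""
    for i, value in enumerate(series):
        if value < floor:
            return i
    return None


# Hypothetical data: both segments ride the same traffic swings.
# From window 3 onward, segment B silently runs 10% below segment A.
segment_a = [100, 140, 90, 130, 95, 135, 92, 128]
segment_b = [100, 140, 90, 117, 85.5, 121.5, 82.8, 115.2]

print(first_divergence(segment_a, segment_b))      # comparative check
print(first_static_breach(segment_b, floor=80))    # static threshold check
```

With these made-up numbers, the comparative check fires at window 5, three windows after the regression begins, while the static floor of 80 never trips because the degraded segment stays inside the range of normal traffic swings. The persistence rule is one simple way to avoid alerting on a single noisy window; a production system would likely use a proper statistical test instead.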

The logic extends beyond advertising. Cloud infrastructure rollouts, algorithmic trading platforms and global consumer apps all face the same fundamental engineering challenge: distinguishing genuine degradation from normal production variation fast enough to prevent an outage.

Goel’s work aims to shift monitoring from a reactive reporting layer to a system of live validation, where every change is tested against a dynamic baseline before it spreads.

Vinay Kakade, co-founder of Infino AI and former senior staff engineer at Lyft, noted that Goel’s contributions address a widespread industry challenge. “Once you operate distributed systems at scale, you stop trusting static thresholds as your primary defence because the baseline is always moving,” Kakade said.

“What Goel is working on asks a much more robust question: did two identical parts of the system suddenly stop agreeing? That is often the fastest way to separate a real regression from production noise,” he added.

“You stop asking what happened after the fact, and start asking whether what is happening right now matches what was supposed to happen,” Goel said.

As software infrastructure grows more complex, approaches like these are becoming harder to ignore. Catching failures before they become incidents rarely makes product release headlines, but it remains some of the most critical engineering work happening behind the scenes [2].

References:

  1. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566

  2. https://sre.google/workbook/reaching-beyond/

Disclaimer: This article is generated and published by the ET Spotlight team. You can get in touch with them on etspotlight@timesinternet.in