

From chaos to clarity with Grafana dashboards: How video game company EA monitors 200+ metrics


2025-07-07 8 min

To be a successful gamer, you have to think strategically and creatively. Working as a software engineer at Electronic Arts (EA), a top video game company, requires the same skills. That’s especially true when it comes to monitoring the EA app, which is the launcher for EA games and used by hundreds of millions of people around the world.

Monitoring the app’s status for errors, crash rates, and other issues can be as tricky as battling a game boss. For one thing, the app runs on users’ own machines, and according to EA Software Engineer Kenny Chen, it emits 500 different events. Combinations of those events can yield more than 4,000 possible metrics. Even if the reality ends up being a fraction of that, a developer could still be watching more than 200 core metrics at a time, making for what Chen calls a “huge and overwhelming” dashboard.

In a GrafanaCON 2025 presentation, Chen revealed how he used Grafana to build what he calls a “functional dashboard” system that lets devs effectively monitor the status of the app. In the past, a complete review would have taken several hours — “or maybe never, I’ll be honest,” he quipped — and now the work can be done in a fraction of that time. “Our devs, after proper training, can review 200 metrics in about 10 minutes,” he said, “and that makes it possible to always keep an eye on everything that’s important.”

In his presentation, Chen discussed the importance of finding the most effective visualizations, the math behind effective monitoring, and large-scale root cause analysis.

“Before this, we sometimes missed critical issues simply because there are so many things to look at and things were not very obvious,” Chen said. “Since we’ve set up this system, we haven’t missed any critical bugs caused by our own code changes.”

Big challenges

Work on the EA app comes with a challenge from the get-go. Unlike microservices, Chen explained, “an app is a monolith by design. You can’t split it further, and monoliths are just hard to monitor.”

Among the metrics being tracked are logins, different types of errors, and various crash rates. “They are equally important to our user experience, and we need to closely monitor them all,” he said. “And whenever there’s a problem, we want to know as soon as possible.”

With a backend service, engineers would monitor just one version of the app, but EA has multiple versions running at the same time. Their old dashboard used one gauge per version, per metric, to achieve full coverage.

EA also releases new builds every week, so the version list being monitored is continuously growing.

All of that translated to a dashboard like this:

A Grafana dashboard with 165 gauge panels

To make matters worse, the team realized all of those gauges couldn’t be trusted, Chen said.

“A gauge can mean a lot of things: It could show the last reading, it could show the median, it could show the average. If you have a dashboard full of gauges, it’s really easy to make things up. Plus gauges are kind of vague by design. Totally different situations can look exactly the same on a gauge.”

A gauge panel at .23% on the left and two related graphs on the right, one that requires action and one that does not

Leveling up visualizations

Visualizations fall along a spectrum from simple to complex, with information density ranging from sparse to dense, Chen explained.

Time series contain every data point in a history, but “they’re not very easy to scan when you have tons of them,” he said. Like gauges, they don’t work well for monitoring a large number of metrics, so the goal at EA was to find something that was both simple and informative.

The solution: status history.

“You can think of a status history panel like a colored calendar,” Chen said. “You can stack rows with different meanings, and each cell is like a time block — for example, an hour.”

A slide called Status history that shows two rows of color-coded boxes that relate to rate and volume

In the example above, different colors are used to show whether a metric is good or bad, based on thresholds. “The key is, it’s got a time axis and it’s color-coded. That makes it super useful,” Chen said.

This type of visualization makes it easy to quickly spot patterns, such as which metric is improving or which one is in trouble. It also lets the EA team observe many app versions in a compact way.

“Instead of having a gauge for each version, we can just have one status history panel that can automatically display the most popular versions in just one panel.”

A panel with three rows of green boxes, each related to a different app version at different time intervals
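Behind a panel like this sits a single ranking query rather than one query per version. Chen’s talk doesn’t specify EA’s datasource, but with a Prometheus-style backend, a hypothetical query such as topk(3, sum by (app_version) (rate(app_sessions_total[1h]))) would pick out the three most popular versions automatically, so the panel keeps up as new builds ship each week.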

A panel like this also helps the team spot different types of issues. Referencing the image below as an example, Chen said, “If only the latest version is red, it indicates a regression.”

A panel with three rows of boxes for three app versions, with one red row and two green ones

On the other hand, if all versions suddenly go red at the same time, that likely indicates the problem is an external outage.

A panel with three rows representing app versions, with green boxes on left half of the rows and red on the right half

Finding good thresholds

Although color-coding is an effective visualization tool, it’s not perfect. As Chen noted, “If you just randomly pick a threshold, the same metric can look good or bad for no real reason.”

To choose an ideal threshold, the first step is to check the histogram of your metric.

Side-by-side images of an hourly time series graph and an hourly error rate histogram

The graph on the left is a normal time series, while the image on the right is a histogram of the same metric. “The histogram shows how often each value comes up,” Chen explained.

A closer look shows that the histogram plots the frequency of each possible value over the past seven days.

An image showing how to read a histogram, highlighting a high point and a low point

In the above example, the metric was at 1.5% for 22 hours and at 2.1% for one hour. “It should look like a bell curve,” Chen explained — and the reason for that is based on the Central Limit Theorem. “If your metric comes from many independent random events, it should follow a normal distribution,” he said. “And this often works for error rates.”

With a normal distribution, values shouldn’t stray far above the mean; only about 0.3% of readings fall more than three standard deviations from it. You could draw a line to show the boundary between what’s likely to happen and what’s unlikely to happen, he said. “If you see a bunch of unlikely values, something’s fundamentally changed.” It could be a new bug or a new incident, for example. “The boundary is a really good place to put your threshold for good or bad.”

A graph illustrating the 3-sigma (standard deviation) rule, alongside a similarly shaped histogram labeled 'good' and a region of it with no values labeled 'bad'

This is the formula Chen uses to calculate the threshold:

A slide with a time series graph for a metric and a formula to create its threshold: Threshold = mean + 3 * stddev
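In code, the calculation is only a few lines. Here’s a minimal sketch in Python, assuming you’ve exported a week of hourly error-rate readings for the metric; the numbers below are illustrative, not EA’s data:

```python
# Minimal sketch of the 3-sigma threshold described above: take the mean
# and standard deviation over a week of hourly readings, and flag anything
# above mean + 3 * stddev as "unlikely" (and therefore red).
from statistics import mean, stdev

def three_sigma_threshold(hourly_rates: list[float]) -> float:
    return mean(hourly_rates) + 3 * stdev(hourly_rates)

# Hypothetical week of hourly error rates (%), hovering around 1.5.
history = [1.4, 1.5, 1.6, 1.5, 1.7, 1.5, 1.4, 1.6] * 21  # 168 hours
print(f"Alert when the rate exceeds {three_sigma_threshold(history):.2f}%")
```

Recomputing this weekly, as EA does, keeps the boundary in step with the metric’s recent behavior.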

Big payoff

Today, rather than scanning an overwhelming wall of gauges, the EA team has a dashboard where they can quickly spot external issues, regressions, fixed bugs, and resolved external issues.

A screenshot of panels; four with red error boxes are labeled external issue, regression, bug fixed, and external resolved

Once solid thresholds were established, the team set up alerts for every metric, providing 24/7 coverage. As a result, they haven’t missed any critical bugs caused by their own code changes since implementing the new system.

Moving beyond detection

EA’s app doesn’t have a traditional backend setup with separate metrics, logs, and traces — “You can’t seriously send all the logs from a user’s machine to our server,” Chen explained. So when it comes to finding the cause of an error, the team can’t simply dive into the logs. Instead, EA collects only a few anonymous JSON events with the type of error and a few fields as context.

The app’s dashboard is set up so that if someone sees an error rate spike, for example, they can click on a graph in a panel and find out more about the context of the error. “Those contexts help us figure out the root cause,” Chen said. The team can also break down the issue by app version or game ID. “We can test those games locally and try to reproduce the issue, or even reach out to the game team,” Chen said. “Either way, those are actionable insights.”
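As a hypothetical sketch of such a breakdown (again assuming a Prometheus-style datasource, which the talk doesn’t confirm), a query like sum by (game_id) (rate(launch_errors_total[1h])) would split a launch-error spike across individual games, pointing the team at the title worth reproducing locally.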

Still, breakdowns don’t scale, he explained: if EA has 200 metrics and each metric has 10 fields worth breaking down, the result is another 2,000 panels. “That’s just impossible to build or use.”

This is where his idea for functional dashboarding comes in.

“The core idea is that dashboards are not static pages — they’re functions,” he said. “You can call them by visiting the URL with different parameters and you get the dashboard you need.”

Conceptually, that function looks like this:

An image of the formula for a function call: fn DeepDive(field_breakdown, query_error, query_startup,...) --> Dashboard
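In Grafana terms, those parameters map naturally onto dashboard template variables, which can be set straight from the URL. A hypothetical deep-dive link might look like https://grafana.example.com/d/deepdive/deep-dive?var-breakdown=game_id&var-error_type=startup&from=now-24h&to=now, where the var- query parameters select the breakdown field and error query, and from/to set the time range. (The host, dashboard UID, and variable names here are illustrative.)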

Chen then did a quick demo of a functional dashboard at work, explained what’s happening behind the scenes, and shared how to craft links for parameters.

He also pointed out how easy it is to maintain this type of dashboard. EA’s alert dashboard has more than 200 alerts and is generated from templates. “Once you define a query, everything else is automatically generated by scripts,” he said. “It can be really easy to extend, and we have pipelines that calculate the three sigma thresholds from history and rebuild everything every week so it’s always up to date.”
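Chen didn’t share EA’s scripts, but the pattern is easy to sketch. The Python below stamps out one status history panel per metric from a query map; the JSON loosely follows Grafana’s dashboard model, and the metric names, queries, and threshold value are hypothetical stand-ins for what EA’s weekly pipeline would produce:

```python
# Sketch of template-driven panel generation: a developer defines one query
# per metric, and the script emits a color-coded status history panel for
# each, thresholded by the weekly 3-sigma job.
import json

# Hypothetical metric -> query map; in practice, this is the only part
# edited by hand.
METRICS = {
    "login_error_rate": "sum(rate(login_errors_total[1h])) / sum(rate(logins_total[1h]))",
    "crash_rate": "sum(rate(crashes_total[1h])) / sum(rate(sessions_total[1h]))",
}

def status_history_panel(title: str, query: str, threshold: float) -> dict:
    """One panel: green below the threshold, red above it."""
    return {
        "title": title,
        "type": "status-history",
        "targets": [{"expr": query}],
        "fieldConfig": {"defaults": {"thresholds": {"mode": "absolute", "steps": [
            {"color": "green", "value": None},
            {"color": "red", "value": threshold},
        ]}}},
    }

# Thresholds would come from the 3-sigma pipeline; hardcoded for the sketch.
panels = [status_history_panel(name, q, 2.1) for name, q in METRICS.items()]
print(json.dumps({"title": "EA app status (sketch)", "panels": panels}, indent=2))
```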

To learn more, check out Chen’s presentation, “Monitoring EA App’s 200+ core error metrics with a scalable, color-coded Grafana dashboard.”
