🕵️ Intro to API Observability
APIs are like blood vessels to a digital business. As data flows through, energy is delivered to activate new opportunities. Oftentimes, we focus on specialized components, the vital organs of our software systems. What can we learn by tapping into the connectors themselves and pulling insights from the streams? Here's a quick overview of bootstrapping an observability strategy for APIs.
Data lives everywhere. When it comes to measuring success, anticipating problems, or looking for our next opportunity, we instinctively scrape, scrub, polish, and analyze information to the best of our ability. Finding signals in the noise has been a natural activity for living beings since the dawn of vigilance.
As time has progressed, we've applied these data-crunching instincts to our digital assets as well. For the modern business, this practice is a matter of survival. With so much of our business being driven by APIs, are we searching for the right signals?
The acronym MELT defines our starting point.
- M - Metrics, numeric measurements collected and tracked over time.
- E - Events, snapshots of significant state changes.
- L - Logs, a detailed transcript of system behavior.
- T - Traces, a route of interactions between components, coupled with an associated context.
The process of communicating and recording these signals is called telemetry.
Most interactions we have with APIs over the network are fairly high-level. We send a blob of JSON. We receive a blob of JSON. Profit! 💰 What signals can we acquire from what lies below?
- When was there a significant shift in traffic?
- How much time is spent receiving data from external sources?
- What's the trend in connection errors over time?
The association of signal trackers to our systems is called instrumentation. When it comes to the lower-level components, we can often take advantage of automatic instrumentation. This can surface in the form of wrapping components within a standard library or adding listeners along connection paths.
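To make the wrapping idea concrete, here is a minimal sketch with no external dependencies. The `instrument` helper, the `record` sink, and `fetchInventory` are hypothetical names invented for illustration. Real auto-instrumentation libraries apply the same pattern to standard library and framework functions on our behalf.

```javascript
// Hypothetical helper: wraps an async function and records how long each call
// takes and whether it failed. `record` is whatever sink receives the signal.
function instrument(name, fn, record) {
  return async function instrumented(...args) {
    const start = process.hrtime.bigint();
    try {
      const result = await fn.apply(this, args);
      record({ name, ok: true, durationMs: elapsedMs(start) });
      return result;
    } catch (err) {
      record({ name, ok: false, durationMs: elapsedMs(start), error: err.message });
      throw err; // instrumentation should never swallow the original error
    }
  };
}

// Converts a nanosecond BigInt delta into fractional milliseconds.
function elapsedMs(start) {
  return Number(process.hrtime.bigint() - start) / 1e6;
}

// Usage: the caller's code does not change, it just calls the wrapped function.
const measurements = [];
const fetchInventory = instrument(
  "fetchInventory",
  async (sku) => ({ sku, inStock: 3 }), // stand-in for a real network call
  (measurement) => measurements.push(measurement)
);

fetchInventory("sku-1").then(() => console.log(measurements));
```

The wrapper re-throws errors after recording them, so adding instrumentation does not change the behavior callers observe.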
Today, we strive to capture more signals than ever before. We see both virtual machine metrics and application logs being shipped to storage and analysis tools. But what about all the business-y stuff in between? How are we capturing measurements for business Key Performance Indicators (KPIs)? We look to instrumenting the domain.
Domain events are the result of applying a command in the business domain to a specific context.
Whether captured or not, these events are happening all the time. What kinds of questions might we ask of them?
- What's the average length of time between discount codes being offered and being applied at checkout?
- What's the correlation between in-app product announcements and newsletter sign-ups?
- When appointments are canceled, what behavior directly precedes this action?
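Once domain events are captured, the first question above becomes a small computation. The sketch below is illustrative only: the event names, fields, and sample data are hypothetical, and it assumes the events arrive in chronological order.

```javascript
// Hypothetical domain events; the names and fields are made up for illustration.
const events = [
  { type: "DiscountCodeOffered", code: "SPRING10", at: "2021-03-01T10:00:00Z" },
  { type: "DiscountCodeApplied", code: "SPRING10", at: "2021-03-01T10:30:00Z" },
  { type: "DiscountCodeOffered", code: "VIP20", at: "2021-03-02T09:00:00Z" },
  { type: "DiscountCodeApplied", code: "VIP20", at: "2021-03-02T10:30:00Z" },
  { type: "DiscountCodeOffered", code: "UNUSED5", at: "2021-03-03T09:00:00Z" },
];

// Average number of minutes between a code being offered and being applied.
// Codes that were offered but never applied are ignored.
function averageMinutesToApply(domainEvents) {
  const offeredAt = new Map();
  const gaps = [];
  for (const event of domainEvents) {
    const time = Date.parse(event.at);
    if (event.type === "DiscountCodeOffered") {
      offeredAt.set(event.code, time);
    } else if (event.type === "DiscountCodeApplied" && offeredAt.has(event.code)) {
      gaps.push((time - offeredAt.get(event.code)) / 60000);
    }
  }
  if (gaps.length === 0) return null;
  return gaps.reduce((sum, gap) => sum + gap, 0) / gaps.length;
}

console.log(averageMinutesToApply(events)); // 60 (gaps of 30 and 90 minutes)
```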
As the questions we ask evolve, so too must our methods of collecting these signals.
When we blew up the monolith into many services, we lost the ability to step through our code with a debugger: it now hops the network. Our tools are still coming to grips with this seismic shift. — Charity Majors, Observability — a 3-Year Retrospective
To reap the benefits of a distributed system, we sacrifice the convenience of having one-stop inspection. It wasn't always this way, and that's one aspect which makes upgrading our observability strategy difficult.
Let's take a look at how the observability tooling landscape has evolved.
Logging and monitoring solutions started when we were writing code close to the metal. The open source stack that used to dominate the landscape was a combination of these tools:
- Graylog - full log management system
- Nagios - systems, network, and application monitoring and alerting
- StatsD - metrics collection and forwarding
- Carbon - metrics ingestion for Graphite, stored in the Whisper database
- Graphite - metrics querying, visualization, and alerting
What many observability articles tend to ignore is that this stack is still heavily deployed and in use today. Some of us are still here, and that's okay.
As virtual machines—and eventually cloud infrastructure—gained traction over running on bare metal servers, we saw a shift in how we approach signal-gathering. This gave rise to two prominent stacks in the observability space:
- ELK stack
  - Elasticsearch - full-text search engine for log storage and querying
  - Logstash - log ingestion
  - Kibana - log visualization and alerting
- TICK stack
  - Telegraf - metrics ingestion
  - InfluxDB - a time-series database for metric storage and querying
  - Chronograf - metrics visualization
  - Kapacitor - metrics processing and alerting
It is common to run these stacks—or some combination thereof—in parallel. Many organizations are still here, and it makes sense. For the most part, they're incredibly robust and mature solutions.
However, we are in the midst of yet another sea change. There's one last stop on the map.
There has been a dramatic shift to cloud-native infrastructure. And for systems running in self-managed data centers, containers are beginning to take over as the atomic unit of application deployment. On top of this, Kubernetes has grown to be the dominant container orchestrator (Note: in the Kubernetes world, an atomic unit is known as a Pod and consists of one or more related containers).
What tools do we use to capture signals in this new world?
- Prometheus - metrics collection, querying, and alerting
- Grafana - metrics visualization and alerting
- EFK stack (Elasticsearch, Fluentd, Kibana) - log ingestion, storage, and visualization
- Jaeger or Zipkin - trace querying and visualization
Why so many options? This world is still maturing, and it has become significantly more complex. The mere existence of OpenTelemetry, discussed more in the next section, hints at how quickly the number of options in this space is growing.
No matter where businesses are in their journey today, observability of containers and their interactions is likely to become an important initiative.
OpenTelemetry is a tool-agnostic observability framework for communicating telemetry. OpenTelemetry.io defines the project as:
OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.
Many Application Performance Monitoring (APM) tools are adding support for OpenTelemetry, as well. Check the OpenTelemetry Registry for more information.
Here's an example of adding auto-instrumentation to a Node.js application.
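A minimal sketch of such a setup file, assuming the `@opentelemetry/sdk-node`, `@opentelemetry/sdk-trace-node`, and `@opentelemetry/auto-instrumentations-node` packages are installed. The file name `tracing.js` and the `loadOpenTelemetry` and `startTelemetry` helpers are hypothetical. The guard around `require` only keeps the sketch runnable where the packages are missing; a real setup file would usually require them unconditionally.

```javascript
// tracing.js: load this before the rest of the application, for example with
// `node --require ./tracing.js app.js`, so libraries are patched before use.

// Hypothetical helper: returns the OpenTelemetry pieces, or null if not installed.
function loadOpenTelemetry() {
  try {
    const { NodeSDK } = require("@opentelemetry/sdk-node");
    const { ConsoleSpanExporter } = require("@opentelemetry/sdk-trace-node");
    const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
    return { NodeSDK, ConsoleSpanExporter, getNodeAutoInstrumentations };
  } catch (err) {
    return null;
  }
}

function startTelemetry() {
  const otel = loadOpenTelemetry();
  if (otel === null) {
    console.warn("OpenTelemetry packages not installed; telemetry disabled.");
    return false;
  }
  const sdk = new otel.NodeSDK({
    // Prints finished spans to stdout; swap in another exporter for a real backend.
    traceExporter: new otel.ConsoleSpanExporter(),
    // Registers instrumentation for commonly used Node.js libraries.
    instrumentations: [otel.getNodeAutoInstrumentations()],
  });
  sdk.start();
  return true;
}

const telemetryEnabled = startTelemetry();
```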
This gives us an excellent jumpstart on acquiring signals with low effort.
Distributed traces offer a superpower for API observability. They allow us to track requests through our distributed systems, and we can even include domain events in our context propagation.
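Context propagation usually means passing trace identifiers along with each request. As a rough sketch, the W3C Trace Context format carries them in a `traceparent` header made of a version, a trace ID, a parent span ID, and flags. The helper names below are hypothetical, and the parsing is simplified: the specification also rejects all-zero IDs, which this sketch does not check.

```javascript
const { randomBytes } = require("crypto");

// Builds a `traceparent` value: version-traceid-parentid-flags.
// The trace ID is 16 random bytes and the span ID is 8, both hex-encoded.
function createTraceparent(traceId = randomBytes(16).toString("hex"), sampled = true) {
  const spanId = randomBytes(8).toString("hex");
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}

// Parses a `traceparent` value, returning null when the shape is wrong.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// A downstream service keeps the incoming trace ID and issues its own span ID,
// which is what ties spans from different services into a single trace.
function continueTrace(incomingHeader) {
  const parent = parseTraceparent(incomingHeader);
  return parent ? createTraceparent(parent.traceId, parent.sampled) : createTraceparent();
}

const outgoing = createTraceparent();
console.log(outgoing);
console.log(continueTrace(outgoing));
```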
A trace has a few components worth noting.
A trace is made up of spans, which can be nested, creating tree-based observability structures.
Spans log segments of a trace. Here's an example of a server receiving an HTTP request:
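A minimal, dependency-free sketch of what such a span might record. The `startSpan` and `endSpan` helpers and the field layout are hypothetical, and the attribute keys only loosely follow OpenTelemetry's HTTP semantic conventions, so treat them as illustrative. The child span shows how a parent ID produces the nesting described above.

```javascript
const { randomBytes } = require("crypto");

// Hypothetical helper: creates a plain object describing an in-progress span.
function startSpan(name, { traceId, parentId = null, kind = "SERVER", attributes = {} } = {}) {
  return {
    name,
    context: {
      traceId: traceId ?? randomBytes(16).toString("hex"),
      spanId: randomBytes(8).toString("hex"),
    },
    parentId,
    kind,
    startTime: new Date().toISOString(),
    endTime: null,
    attributes: { ...attributes },
  };
}

// Hypothetical helper: adds final attributes and stamps the end time.
function endSpan(span, attributes = {}) {
  Object.assign(span.attributes, attributes);
  span.endTime = new Date().toISOString();
  return span;
}

// The server span covers the whole request.
const serverSpan = startSpan("GET /appointments/:id", {
  attributes: { "http.request.method": "GET", "http.route": "/appointments/:id" },
});

// A child span shares the trace ID and points at its parent's span ID.
const querySpan = startSpan("load appointment", {
  traceId: serverSpan.context.traceId,
  parentId: serverSpan.context.spanId,
  kind: "INTERNAL",
});
endSpan(querySpan);

endSpan(serverSpan, { "http.response.status_code": 200 });
console.log(JSON.stringify([serverSpan, querySpan], null, 2));
```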
Span events are a special type of structured logging. They can be associated with a trace, giving insight into what domain events are happening in the broader context of a full interaction.
An example of adding events to a span:
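A minimal sketch using a hypothetical in-memory span, so it runs without any packages. Only the `addEvent(name, attributes)` call shape mirrors the OpenTelemetry JavaScript API; `createSpan`, the event names, and the attribute keys are invented for illustration.

```javascript
// Hypothetical in-memory span that collects timestamped events.
function createSpan(name) {
  return {
    name,
    events: [],
    addEvent(eventName, attributes = {}) {
      this.events.push({ name: eventName, time: new Date().toISOString(), attributes });
      return this; // returning the span allows calls to be chained
    },
  };
}

const span = createSpan("POST /checkout");

// Domain events are recorded in the context of the request that caused them.
span.addEvent("DiscountCodeApplied", { "discount.code": "SPRING10" });
span.addEvent("OrderPlaced", { "order.total": 42.5 });

console.log(span.events.map((event) => event.name));
// [ 'DiscountCodeApplied', 'OrderPlaced' ]
```

With a real tracer, the same calls would be made on the active span, so the events travel with the rest of the trace.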
This is only a high-level overview. Check out the OpenTelemetry docs for more details!
- OpenTelemetry Specification Overview
- W3C Trace Context
- Propagation format for distributed trace context: Baggage
A significant driver of containerization is a shift in architectural trends to break apart monolithic applications. Containers and microservices have a symbiotic relationship. A co-evolution is occurring in this space.
The VMware State of Observability Report 2021 cites the following reasons for the rising complexity of managing cloud applications:
- Cross-team adoption of polyglot microservices frameworks
- Application requests traversing many third-party APIs and technologies
- Varying approaches in application security across different vendors
Legacy telemetry strategies are not enough. How do we start pushing forward an initiative to improve?
- Evaluate how observability fits into an ongoing API strategy. Talk to stakeholders. Research the impact. Make a case. Follow the guidelines presented in 5 Developer Tips for Surviving API-First.
- Educate and experiment. There are many links in this post with a few references at the end, as well. Make space for trial and error. Start small, perhaps with a greenfield project.
- Delegate responsibilities. Team Topologies describes an approach that can help share the load of instrumenting applications and managing observability infrastructure. Read more at: 🚀 Scale API Teams with Platform Ops.
Observability creates a window into the organic flow of information that moves through our systems. It allows us to ask the important questions that impact our business. The good news is we almost certainly have familiarity with some of the practices involved. As the landscape continues to grow, it takes a lot of effort to stay ahead of the curve. That's expected. Following an observability initiative is a long-term approach to ensuring survival. Evolution, as we know, requires patience. 🧘
- A Three-Phased Approach to Observability by New Relic
- Distributed Systems Observability by Cindy Sridharan
- Observing is not Debugging (and other misnomers) by Kislay Verma
- Splunk's State of Observability 2021