Observability Journey Maturity Model
Modern infrastructure and applications are complex and constantly evolving. Understand where your organization is on its observability journey and how to improve your maturity.
Assess your organization’s observability maturity
The model
This model helps you identify your level of observability maturity by giving you a method to evaluate your tools, people, and processes across 9 key dimensions, showing both your strengths and your opportunities for improvement. It then helps you identify actions you can take to become systematic about observability and keep your apps running.
How to use it
You can use this model to evaluate observability through the lens of how you access and analyze observability data, as well as how you respond to and prevent incidents. To achieve systematic observability, there are 9 key dimensions to master.
Dimensions of observability
Access
Observability coverage
Observability data access
Observability data efficiency
Analyze
Visualization
Correlation
Root Cause Analysis
Respond and Prevent
SLOs and Business Impact
Incident Response & Management
Observability Driven Development
Access
Observability coverage
Start your observability journey by determining the applications, cloud services, and infrastructure that you need to observe to keep your environment running. Then collect observability data needed to get visibility over your systems, including the metrics, logs, and traces from the key components of your application architecture. Observability data sources include cloud and self-hosted infrastructure, databases, APIs, as well as network, security, real-user and synthetic monitoring. Increasing the data sources and types of observability data you can access will broaden your observability coverage across your organization and limit blind spots.
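One way to make coverage concrete is to keep an inventory of components and the signals each one emits, then list what is missing. The following is a minimal sketch under stated assumptions: the component names, the inventory structure, and the choice of metrics, logs, and traces as the required set are all hypothetical and only illustrate the idea.

```python
# Signals every component is expected to emit (an assumption for this sketch).
REQUIRED_SIGNALS = {"metrics", "logs", "traces"}

def blind_spots(inventory):
    """Map each component to the required signals it does not yet emit."""
    gaps = {}
    for component, signals in inventory.items():
        missing = REQUIRED_SIGNALS - set(signals)
        if missing:
            gaps[component] = sorted(missing)
    return gaps

# Hypothetical inventory: component name -> signals currently collected.
inventory = {
    "checkout-api": ["metrics", "logs", "traces"],
    "orders-db": ["metrics"],
    "cdn": [],
}

gaps = blind_spots(inventory)
for component, missing in gaps.items():
    print(f"{component}: missing {', '.join(missing)}")
```

Fully covered components do not appear in the result, so an empty result means no known blind spots for the components you listed.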
Observability data access
Next, evaluate how to collect, store, and access your observability data. Some data collection agents are proprietary, while others use open standards whose data can be stored and visualized using a wide ecosystem of tools. Observability teams should offer developers and operations teams data stores for hosting metrics, logs, and traces using the latest standards, such as Prometheus and OpenTelemetry. You should determine which data can be accessed using APIs and which data must be collected and stored so you can fully observe your environment.
Observability data efficiency
Finally, you will need to efficiently store and manage large volumes of observability data. The data stores you offer should be scalable, highly available, secure, and highly performant. You should define policies for data fidelity and retention, as well as policies for managing cardinality and cost.
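Cardinality here means the number of distinct time series a metric produces, which grows with every unique combination of label values. As a rough illustration of what a cardinality policy might check, the sketch below counts distinct label sets per metric and flags metrics over a budget. The sample data, the `over_budget` helper, and the budget value are assumptions made for this example, not part of any specific product.

```python
from collections import defaultdict

def series_cardinality(samples):
    """Count distinct label sets (time series) per metric name."""
    series = defaultdict(set)
    for metric, labels in samples:
        # A series is identified by its metric name plus its full label set.
        series[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in series.items()}

def over_budget(cardinality, budget):
    """Return the metrics whose series count exceeds the given budget."""
    return sorted(m for m, count in cardinality.items() if count > budget)

# Hypothetical samples: (metric name, labels).
samples = [
    ("http_requests_total", {"method": "GET", "status": "200"}),
    ("http_requests_total", {"method": "GET", "status": "500"}),
    ("http_requests_total", {"method": "POST", "status": "200"}),
    ("http_requests_total", {"method": "GET", "status": "200"}),  # same series again
    ("queue_depth", {"queue": "emails"}),
]

cardinality = series_cardinality(samples)
flagged = over_budget(cardinality, budget=2)
```

Labels with unbounded values, such as user IDs or request IDs, are the usual cause of runaway series counts, so a check like this is most useful when run per label as well as per metric.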
Analyze
Visualization
Once you can access your observability data, you’ll need to determine the best way to visualize it. You’ll want to create a global view across your organization as well as role-specific views for executives and technical users, and it's a best practice to offer the ability to visualize data from numerous sources in a single place. You should provide your business with a reliable platform for visualization, such as Grafana dashboards, while securing data access among teams using an RBAC model that is integrated with your directory service.
Correlation
After visualizing data, you’ll need to be able to correlate across data sources to solve problems quickly. This includes the ability to correlate many types of data, including metrics, logs, and traces, as well as business and technology data sources. Navigating between tools is time-consuming and error-prone, so reducing the number of tools required to correlate data can make a big difference in how long it takes to identify and solve issues.
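A common way to correlate signals is to join them on a shared identifier. The sketch below attaches log messages to trace spans that carry the same trace ID. It assumes that both logs and spans are plain dictionaries with a `trace_id` field; the field names and sample records are hypothetical and stand in for whatever your tooling actually emits.

```python
from collections import defaultdict

def correlate_by_trace(spans, logs):
    """Attach to each span the log messages that share its trace ID."""
    logs_by_trace = defaultdict(list)
    for entry in logs:
        logs_by_trace[entry["trace_id"]].append(entry["message"])
    # Spans with no matching logs get an empty list rather than being dropped.
    return [
        {**span, "logs": logs_by_trace.get(span["trace_id"], [])}
        for span in spans
    ]

# Hypothetical records for illustration only.
spans = [
    {"trace_id": "abc123", "service": "checkout", "duration_ms": 840},
    {"trace_id": "def456", "service": "search", "duration_ms": 35},
]
logs = [
    {"trace_id": "abc123", "message": "payment gateway timeout"},
    {"trace_id": "abc123", "message": "retrying payment"},
]

correlated = correlate_by_trace(spans, logs)
```

With the join done in one place, the slow `checkout` span and its timeout messages appear together, which is the kind of context switching between tools that this dimension aims to remove.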
Root Cause Analysis
Once you can correlate data, you will want to create an efficient root cause analysis (RCA) process and toolset. Great observability teams track Mean Time To Recover (MTTR) metrics and are constantly seeking to improve their process to identify root causes faster. The RCA step depends on solid data fidelity and retention policies to ensure enough data is on hand, balanced against budgetary needs. Cross-functional collaboration that reduces the number of tools and people required to determine a root cause will reduce errors and shorten MTTR. A well-designed RCA practice also allows observability teams to learn from each issue and continually improve their RCA process to prevent future outages.
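To show what tracking MTTR can look like in its simplest form, the sketch below averages the time between detection and recovery over a set of incidents. The incident timestamps are made up for illustration, and measuring from detection to recovery is an assumption; teams define the start and end points differently.

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """Average of (recovered - detected) across incidents, or None if empty."""
    durations = [recovered - detected for detected, recovered in incidents]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)

# Hypothetical incidents: (detected at, recovered at).
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30)),     # 30 minutes
    (datetime(2024, 2, 11, 14, 0), datetime(2024, 2, 11, 15, 30)),  # 90 minutes
    (datetime(2024, 3, 2, 22, 0), datetime(2024, 3, 2, 23, 0)),     # 60 minutes
]

mttr = mean_time_to_recover(incidents)
print(f"MTTR: {mttr}")
```

A mean hides outliers, so it can help to review the longest incidents individually alongside the average.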
Respond and Prevent
SLOs and Business Impact
For each of the services you support, you will need to determine expectations for performance and availability. These expectations often take the form of a Service Level Agreement (SLA) for external customers and an Operational Level Agreement (OLA) for internal customers. Most observability teams have defined Service Level Objectives (SLOs) for key services, whether or not they are bound by an SLA. Reporting on SLO performance and business impact becomes an essential tool for executives to manage their key business systems.
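One common way to report on SLO performance is an error budget: the number of failures the SLO tolerates over a period, and how much of that allowance remains. The sketch below assumes a request-based availability SLO; the function name, the 99.9% target, and the request counts are illustrative assumptions rather than recommended values.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is the required success ratio, for example 0.999 for 99.9%.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% target or no traffic leaves no budget to spend.
        return 1.0 if failed_requests == 0 else 0.0
    return 1 - failed_requests / allowed_failures

# Hypothetical month: 1,000,000 requests against a 99.9% availability SLO,
# which tolerates about 1,000 failed requests.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

In this example roughly three quarters of the budget is left. A negative result means the SLO has been breached for the period, which is the kind of signal worth surfacing in executive reporting.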
Incident Response & Management
Unifying alerts through a central system can help observability teams identify and notify the appropriate on-call engineers with relevant information. Observability teams should define on-call rotations and escalation policies, along with runbooks to guide troubleshooting so that the dependence on specific individuals is minimized.
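An escalation policy can be thought of as an ordered list of who to notify after an alert has gone unacknowledged for a given time. The sketch below models that idea with plain data structures. The `EscalationStep` type, the role names, and the timings are hypothetical and do not reflect the configuration format of any particular incident management tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes without acknowledgement before this step fires
    notify: str         # who gets notified at this step

def who_to_notify(policy, minutes_unacknowledged):
    """Return everyone who should have been notified by now, in order."""
    return [
        step.notify
        for step in sorted(policy, key=lambda s: s.after_minutes)
        if minutes_unacknowledged >= step.after_minutes
    ]

# Hypothetical policy for one service.
policy = [
    EscalationStep(after_minutes=0, notify="primary-on-call"),
    EscalationStep(after_minutes=15, notify="secondary-on-call"),
    EscalationStep(after_minutes=30, notify="engineering-manager"),
]

notified = who_to_notify(policy, minutes_unacknowledged=20)
```

Keeping the policy as data rather than tribal knowledge is what reduces dependence on specific individuals: anyone can read who is paged and when.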
Observability Driven Development
Organizations that implement observability during the development process roll out applications with higher uptime and improved performance. The earlier in the Software Development Life Cycle (SDLC) that observability and performance testing are implemented, the more issues can be prevented before they impact users. This “shift-left” approach requires metrics, logs, and traces to be included in the coding process. It also includes a Quality Assurance (QA) performance testing stage, as well as tools designed for developers that can stress test applications to make sure they will perform when production loads hit each new release of the application. Ideally, these should be tools and methodologies that your developers are already familiar with so that they are easy to integrate into your development pipelines.
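As a small illustration of building telemetry into the coding process, the sketch below wraps a function in a decorator that records its latency and logs any failure. It uses only the Python standard library; the in-memory `latencies` store, the `observed` decorator, and the `apply_discount` function are hypothetical stand-ins. In practice you would more likely use an instrumentation library such as OpenTelemetry or a Prometheus client.

```python
import functools
import logging
import time
from collections import defaultdict

logger = logging.getLogger("checkout")
latencies = defaultdict(list)  # in-memory stand-in for a metrics backend

def observed(func):
    """Record call latency and log failures for the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Log with traceback, then re-raise so callers still see the error.
            logger.exception("%s failed", func.__name__)
            raise
        finally:
            # Runs on both success and failure, so every call is measured.
            latencies[func.__name__].append(time.perf_counter() - start)
    return wrapper

@observed
def apply_discount(total, percent):
    """Hypothetical business function used only to demonstrate the decorator."""
    return total * (100 - percent) / 100

result = apply_discount(200, 10)
```

Because the instrumentation lives with the code, the same measurements are available in development, in QA performance tests, and in production without extra work at release time.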