Menu
Grafana Cloud

otelcol.connector.servicegraph

otelcol.connector.servicegraph accepts span data from other otelcol components and outputs metrics representing the relationship between various services in a system. A metric represents an edge in the service graph. Those metrics can then be used by a data visualization application (e.g. Grafana) to draw the service graph.

NOTE: otelcol.connector.servicegraph is a wrapper over the upstream OpenTelemetry Collector servicegraph connector. Bug reports or feature requests will be redirected to the upstream repository, if necessary.

Multiple otelcol.connector.servicegraph components can be specified by giving them different labels.

This component is based on Grafana Tempo’s service graph processor.

Service graphs are useful for a number of use-cases:

  • Infer the topology of a distributed system. As distributed systems grow, they become more complex. Service graphs can help you understand the structure of the system.
  • Provide a high level overview of the health of your system. Service graphs show error rates, latencies, and other relevant data.
  • Provide a historic view of a system’s topology. Distributed systems change very frequently, and service graphs offer a way of seeing how these systems have evolved over time.

Since otelcol.connector.servicegraph has to process both sides of an edge, it needs to process all spans of a trace to function properly. If spans of a trace are spread out over multiple Alloy instances, spans cannot be paired reliably. A solution to this problem is using otelcol.exporter.loadbalancing in front of Alloy instances running otelcol.connector.servicegraph.

Usage

alloy
otelcol.connector.servicegraph "LABEL" {
  output {
    metrics = [...]
  }
}

Arguments

otelcol.connector.servicegraph supports the following arguments:

NameTypeDescriptionDefaultRequired
latency_histogram_bucketslist(duration)Buckets for latency histogram metrics.["2ms", "4ms", "6ms", "8ms", "10ms", "50ms", "100ms", "200ms", "400ms", "800ms", "1s", "1400ms", "2s", "5s", "10s", "15s"]no
dimensionslist(string)A list of dimensions to add with the default dimensions.[]no
cache_loopdurationConfigures how often to delete series which have not been updated."1m"no
store_expiration_loopdurationThe time to expire old entries from the store periodically."2s"no
metrics_flush_intervaldurationThe interval at which metrics are flushed to downstream components."0s"no
database_name_attributestringThe attribute name used to identify the database name from span attributes."db.name"no

Service graphs work by inspecting traces and looking for spans with parent-children relationship that represent a request. otelcol.connector.servicegraph uses OpenTelemetry semantic conventions to detect a myriad of requests. The following requests are currently supported:

  • A direct request between two services, where the outgoing and the incoming span must have a Span Kind value of client and server respectively.
  • A request across a messaging system, where the outgoing and the incoming span must have a Span Kind value of producer and consumer respectively.
  • A database request, where spans have a Span Kind with a value of client, as well as an attribute with a key of db.name.

Every span which can be paired up to form a request is kept in an in-memory store:

  • If the TTL of the span expires before it can be paired, it is deleted from the store. TTL is configured in the store block.
  • If the span is paired prior to its expiration, a metric is recorded and the span is deleted from the store.

The following metrics are emitted by the processor:

MetricTypeLabelsDescription
traces_service_graph_request_totalCounterclient, server, connection_typeTotal count of requests between two nodes
traces_service_graph_request_failed_totalCounterclient, server, connection_typeTotal count of failed requests between two nodes
traces_service_graph_request_server_secondsHistogramclient, server, connection_typeTime for a request between two nodes as seen from the server
traces_service_graph_request_client_secondsHistogramclient, server, connection_typeTime for a request between two nodes as seen from the client
traces_service_graph_unpaired_spans_totalCounterclient, server, connection_typeTotal count of unpaired spans
traces_service_graph_dropped_spans_totalCounterclient, server, connection_typeTotal count of dropped spans

Duration is measured both from the client and the server sides.

The latency_histogram_buckets argument controls the buckets for traces_service_graph_request_server_seconds and traces_service_graph_request_client_seconds.

Each emitted metrics series have a client and a server label corresponding with the service doing the request and the service receiving the request. The value of the label is derived from the service.name resource attribute of the two spans.

The connection_type label may not be set. If it is set, its value will be either messaging_system or database.

Additional labels can be included using the dimensions configuration option:

  • Those labels will have a prefix to mark where they originate (client or server span kinds). The client_ prefix relates to the dimensions coming from spans with a Span Kind of client. The server_ prefix relates to the dimensions coming from spans with a Span Kind of server.
  • Firstly the resource attributes will be searched. If the attribute is not found, the span attributes will be searched.

When metrics_flush_interval is set to 0s, metrics will be flushed on every received batch of traces.

Blocks

The following blocks are supported inside the definition of otelcol.connector.servicegraph:

HierarchyBlockDescriptionRequired
storestoreConfigures the in-memory store for spans.no
outputoutputConfigures where to send telemetry data.yes
debug_metricsdebug_metricsConfigures the metrics that this component generates to monitor its state.no

store block

The store block configures the in-memory store for spans.

NameTypeDescriptionDefaultRequired
max_itemsnumberMaximum number of items to keep in the store.1000no
ttldurationThe time to live for spans in the store."2s"no

output block

The output block configures a set of components to forward resulting telemetry data to.

The following arguments are supported:

NameTypeDescriptionDefaultRequired
metricslist(otelcol.Consumer)List of consumers to send metrics to.[]no

You must specify the output block, but all its arguments are optional. By default, telemetry data is dropped. Configure the metrics argument accordingly to send telemetry data to other components.

debug_metrics block

The debug_metrics block configures the metrics that this component generates to monitor its state.

The following arguments are supported:

NameTypeDescriptionDefaultRequired
disable_high_cardinality_metricsbooleanWhether to disable certain high cardinality metrics.trueno
levelstringControls the level of detail for metrics emitted by the wrapped collector."detailed"no

disable_high_cardinality_metrics is the Grafana Alloy equivalent to the telemetry.disableHighCardinalityMetrics feature gate in the OpenTelemetry Collector. It removes attributes that could cause high cardinality metrics. For example, attributes with IP addresses and port numbers in metrics about HTTP and gRPC connections are removed.

Note

If configured, disable_high_cardinality_metrics only applies to otelcol.exporter.* and otelcol.receiver.* components.

level is the Alloy equivalent to the telemetry.metrics.level feature gate in the OpenTelemetry Collector. Possible values are "none", "basic", "normal" and "detailed".

Exported fields

The following fields are exported and can be referenced by other components:

NameTypeDescription
inputotelcol.ConsumerA value that other components can use to send telemetry data to.

input accepts otelcol.Consumer traces telemetry data. It does not accept metrics and logs.

Component health

otelcol.connector.servicegraph is only reported as unhealthy if given an invalid configuration.

Debug information

otelcol.connector.servicegraph does not expose any component-specific debug information.

Example

The example below accepts traces, creates service graph metrics from them, and writes the metrics to Mimir. The traces are written to Tempo.

otelcol.connector.servicegraph also adds a label to each metric with the value of the “http.method” span/resource attribute.

alloy
otelcol.receiver.otlp "default" {
  grpc {
    endpoint = "0.0.0.0:4320"
  }

  output {
    traces  = [otelcol.connector.servicegraph.default.input,otelcol.exporter.otlp.grafana_cloud_traces.input]
  }
}

otelcol.connector.servicegraph "default" {
  dimensions = ["http.method"]
  output {
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}

otelcol.exporter.prometheus "default" {
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url = "https://prometheus-xxx.grafana.net/api/prom/push"

    basic_auth {
      username = sys.env("PROMETHEUS_USERNAME")
      password = sys.env("GRAFANA_CLOUD_API_KEY")
    }
  }
}

otelcol.exporter.otlp "grafana_cloud_traces" {
  client {
    endpoint = "https://tempo-xxx.grafana.net/tempo"
    auth     = otelcol.auth.basic.grafana_cloud_traces.handler
  }
}

otelcol.auth.basic "grafana_cloud_traces" {
  username = sys.env("TEMPO_USERNAME")
  password = sys.env("GRAFANA_CLOUD_API_KEY")
}

Some of the metrics in Mimir may look like this:

traces_service_graph_request_total{client="shop-backend",failed="false",server="article-service",client_http_method="DELETE",server_http_method="DELETE"}
traces_service_graph_request_failed_total{client="shop-backend",client_http_method="POST",failed="false",server="auth-service",server_http_method="POST"}

Compatible components

otelcol.connector.servicegraph can accept arguments from the following components:

otelcol.connector.servicegraph has exports that can be consumed by the following components:

Note

Connecting some components may not be sensible or components may require further configuration to make the connection work correctly. Refer to the linked documentation for more details.