Menu

Send metrics, logs, and events with Grafana Agent

Note

Grafana Agent has been deprecated. Configuration with the Grafana Kubernetes Monitoring Helm chart is the recommended method.

When you configure Kubernetes Monitoring using Grafana Agent, Agent scrapes the following targets by default:

The default ConfigMap that results from the configuration process creates an allowlist. This allowlist is configured to drop all metrics not referenced in the Kubernetes Monitoring alerts and alert and recording rules. You can optionally do any of these with the allowlist:

  • Modify it.
  • Replace it with a denylist (by using the drop directive).
  • Omit it entirely.
  • Move it to the remote_write level so that it applies globally to all configured scrape jobs.

To learn more, see:

Before you begin

To deploy Kubernetes Monitoring, you need:

  • A Kubernetes cluster, environment, or fleet you want to monitor
  • The kubectl, curl, envsubst, and Helm command-line tools

Note

Make sure you deploy the required resources in the same namespace to avoid any missing data. For example, if Grafana Agent is deployed in one namespace, and ’node_exporter` is deployed in another, you will have missing resource efficiency data.

Use a Grafana Cloud Access Policy Token

You can create a new access policy token or use an existing token. See Grafana Cloud Access Policies for more information.

You’ll use this token in future steps.

Deploy Agent ConfigMap & StatefulSet for metrics and events

The ConfigMap configures:

  • The Grafana Agent StatefulSet to scrape the cadvisor, kubelet, and kube-state-metrics endpoints in your cluster
  • The Agent to collect Kubernetes events from your cluster’s control plane and send them to your Cloud Loki instance.
  1. Save the following ConfigMap to a file, and replace within it the following:

    • NAMESPACE with the namespace for the Grafana Agent for metrics
    • CLUSTER_NAME with the name of your cluster
    • METRICS_HOST with the hostname for your Prometheus instance
    • METRICS_USERNAME with the username for your Prometheus instance
    • METRICS_PASSWORD with your Access Policy Token from earlier
    • LOGS_HOST with the hostname for your Loki instance
    • LOGS_USERNAME with the username for your Loki instance
    • LOGS_PASSWORD with your Access Policy Token from earlier
    kind: ConfigMap
    metadata:
      name: grafana-agent
      namespace: NAMESPACE
    apiVersion: v1
    data:
      agent.yaml: |
        metrics:
          wal_directory: /var/lib/agent/wal
          global:
            scrape_interval: 60s
            external_labels:
              cluster: CLUSTER_NAME
          configs:
          - name: integrations
            remote_write:
            - url: METRICS_HOST/api/prom/push
              basic_auth:
                username: METRICS_USERNAME
                password: METRICS_PASSWORD
            scrape_configs:
            - bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              job_name: integrations/kubernetes/cadvisor
              kubernetes_sd_configs:
                  - role: node
              metric_relabel_configs:
                  - source_labels: [__name__]
                    regex: kubelet_running_containers|go_goroutines|kubelet_runtime_operations_errors_total|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|kubelet_volume_stats_inodes_used|kubelet_certificate_manager_server_ttl_seconds|namespace_workload_pod:kube_pod_owner:relabel|kubelet_node_config_error|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_limits:sum|container_memory_working_set_bytes|container_fs_reads_bytes_total|kube_node_status_condition|namespace_cpu:kube_pod_container_resource_requests:sum|kubelet_server_expiration_renew_errors|container_fs_writes_total|kube_horizontalpodautoscaler_status_desired_replicas|node_filesystem_avail_bytes|kube_pod_status_reason|node_filesystem_size_bytes|kube_deployment_spec_replicas|kube_statefulset_metadata_generation|namespace_workload_pod|storage_operation_duration_seconds_count|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_container_resource_limits|kube_statefulset_status_replicas_updated|node_namespace_pod_container:container_memory_rss|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_pleg_relist_interval_seconds_bucket|kube_job_status_start_time|kube_deployment_status_observed_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_cache|kube_resourcequota|kube_horizontalpodautoscaler_spec_min_replicas|namespace_memory:kube_pod_container_resource_requests:sum|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_daemonset_status_number_available|kube_job_failed|storage_operation_errors_total|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|container_fs_writes_bytes_total|kube_statefulset_replicas|kube_replicaset_owner|container_network_receive_bytes_total|volume_manager_total_volumes|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_desired_number_scheduled|kube_pod_container_status_waiting_reason|process_cpu_seconds_total|kube_node_status_allocatable|kube_deployment_status_replicas_available|kube_daemonset_status_updated_number_scheduled|container_network_receive_packets_total|container_memory_rss|container_cpu_usage_seconds_total|kube_namespace_status_phase|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kubelet_volume_stats_available_bytes|kube_deployment_status_replicas_updated|kubelet_running_container_count|kube_node_info|container_network_transmit_packets_dropped_total|kubelet_certificate_manager_client_ttl_seconds|kube_pod_owner|kubelet_volume_stats_inodes|kubelet_runtime_operations_total|container_cpu_cfs_throttled_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_running_pod_count|container_network_transmit_packets_total|kubelet_node_name|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_ready|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_node_spec_taint|kubelet_pleg_relist_duration_seconds_bucket|kube_pod_status_phase|container_cpu_cfs_periods_total|kube_deployment_metadata_generation|node_namespace_pod_container:container_memory_cache|kube_statefulset_status_current_revision|kubelet_pleg_relist_duration_seconds_count|container_fs_reads_total|kube_statefulset_status_update_revision|container_network_receive_packets_dropped_total|kube_pod_info|kubelet_running_pods|process_resident_memory_bytes|kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_count|kube_node_status_capacity|container_network_transmit_bytes_total|rest_client_requests_total|kubernetes_build_info|machine_memory_bytes|kube_statefulset_status_replicas|container_memory_swap|kube_job_status_active|kubelet_pod_start_duration_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|node_namespace_pod_container:container_memory_swap|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*|node_network_transmit_bytes_total
                    action: keep
              relabel_configs:
                  - replacement: kubernetes.default.svc.cluster.local:443
                    target_label: __address__
                  - regex: (.+)
                    replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
                    source_labels:
                      - __meta_kubernetes_node_name
                    target_label: __metrics_path__
              scheme: https
              tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: false
                  server_name: kubernetes
            - job_name: integrations/kubernetes/kubelet
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                  - role: node
              metric_relabel_configs:
                  - source_labels: [__name__]
                    regex: kubelet_running_containers|go_goroutines|kubelet_runtime_operations_errors_total|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|kubelet_volume_stats_inodes_used|kubelet_certificate_manager_server_ttl_seconds|namespace_workload_pod:kube_pod_owner:relabel|kubelet_node_config_error|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_limits:sum|container_memory_working_set_bytes|container_fs_reads_bytes_total|kube_node_status_condition|namespace_cpu:kube_pod_container_resource_requests:sum|kubelet_server_expiration_renew_errors|container_fs_writes_total|kube_horizontalpodautoscaler_status_desired_replicas|node_filesystem_avail_bytes|kube_pod_status_reason|node_filesystem_size_bytes|kube_deployment_spec_replicas|kube_statefulset_metadata_generation|namespace_workload_pod|storage_operation_duration_seconds_count|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_container_resource_limits|kube_statefulset_status_replicas_updated|node_namespace_pod_container:container_memory_rss|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_pleg_relist_interval_seconds_bucket|kube_job_status_start_time|kube_deployment_status_observed_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_cache|kube_resourcequota|kube_horizontalpodautoscaler_spec_min_replicas|namespace_memory:kube_pod_container_resource_requests:sum|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_daemonset_status_number_available|kube_job_failed|storage_operation_errors_total|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|container_fs_writes_bytes_total|kube_statefulset_replicas|kube_replicaset_owner|container_network_receive_bytes_total|volume_manager_total_volumes|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_desired_number_scheduled|kube_pod_container_status_waiting_reason|process_cpu_seconds_total|kube_node_status_allocatable|kube_deployment_status_replicas_available|kube_daemonset_status_updated_number_scheduled|container_network_receive_packets_total|container_memory_rss|container_cpu_usage_seconds_total|kube_namespace_status_phase|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kubelet_volume_stats_available_bytes|kube_deployment_status_replicas_updated|kubelet_running_container_count|kube_node_info|container_network_transmit_packets_dropped_total|kubelet_certificate_manager_client_ttl_seconds|kube_pod_owner|kubelet_volume_stats_inodes|kubelet_runtime_operations_total|container_cpu_cfs_throttled_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_running_pod_count|container_network_transmit_packets_total|kubelet_node_name|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_ready|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_node_spec_taint|kubelet_pleg_relist_duration_seconds_bucket|kube_pod_status_phase|container_cpu_cfs_periods_total|kube_deployment_metadata_generation|node_namespace_pod_container:container_memory_cache|kube_statefulset_status_current_revision|kubelet_pleg_relist_duration_seconds_count|container_fs_reads_total|kube_statefulset_status_update_revision|container_network_receive_packets_dropped_total|kube_pod_info|kubelet_running_pods|process_resident_memory_bytes|kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_count|kube_node_status_capacity|container_network_transmit_bytes_total|rest_client_requests_total|kubernetes_build_info|machine_memory_bytes|kube_statefulset_status_replicas|container_memory_swap|kube_job_status_active|kubelet_pod_start_duration_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|node_namespace_pod_container:container_memory_swap|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*|node_network_transmit_bytes_total
                    action: keep
              relabel_configs:
                  - replacement: kubernetes.default.svc.cluster.local:443
                    target_label: __address__
                  - regex: (.+)
                    replacement: /api/v1/nodes/${1}/proxy/metrics
                    source_labels:
                      - __meta_kubernetes_node_name
                    target_label: __metrics_path__
              scheme: https
              tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: false
                  server_name: kubernetes
            - job_name: integrations/kubernetes/kube-state-metrics
              kubernetes_sd_configs:
                  - role: pod
              metric_relabel_configs:
                  - source_labels: [__name__]
                    regex: kubelet_running_containers|go_goroutines|kubelet_runtime_operations_errors_total|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|kubelet_volume_stats_inodes_used|kubelet_certificate_manager_server_ttl_seconds|namespace_workload_pod:kube_pod_owner:relabel|kubelet_node_config_error|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_limits:sum|container_memory_working_set_bytes|container_fs_reads_bytes_total|kube_node_status_condition|namespace_cpu:kube_pod_container_resource_requests:sum|kubelet_server_expiration_renew_errors|container_fs_writes_total|kube_horizontalpodautoscaler_status_desired_replicas|node_filesystem_avail_bytes|kube_pod_status_reason|node_filesystem_size_bytes|kube_deployment_spec_replicas|kube_statefulset_metadata_generation|namespace_workload_pod|storage_operation_duration_seconds_count|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_container_resource_limits|kube_statefulset_status_replicas_updated|node_namespace_pod_container:container_memory_rss|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_pleg_relist_interval_seconds_bucket|kube_job_status_start_time|kube_deployment_status_observed_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_cache|kube_resourcequota|kube_horizontalpodautoscaler_spec_min_replicas|namespace_memory:kube_pod_container_resource_requests:sum|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_daemonset_status_number_available|kube_job_failed|storage_operation_errors_total|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|container_fs_writes_bytes_total|kube_statefulset_replicas|kube_replicaset_owner|container_network_receive_bytes_total|volume_manager_total_volumes|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_desired_number_scheduled|kube_pod_container_status_waiting_reason|process_cpu_seconds_total|kube_node_status_allocatable|kube_deployment_status_replicas_available|kube_daemonset_status_updated_number_scheduled|container_network_receive_packets_total|container_memory_rss|container_cpu_usage_seconds_total|kube_namespace_status_phase|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kubelet_volume_stats_available_bytes|kube_deployment_status_replicas_updated|kubelet_running_container_count|kube_node_info|container_network_transmit_packets_dropped_total|kubelet_certificate_manager_client_ttl_seconds|kube_pod_owner|kubelet_volume_stats_inodes|kubelet_runtime_operations_total|container_cpu_cfs_throttled_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_running_pod_count|container_network_transmit_packets_total|kubelet_node_name|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_ready|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_node_spec_taint|kubelet_pleg_relist_duration_seconds_bucket|kube_pod_status_phase|container_cpu_cfs_periods_total|kube_deployment_metadata_generation|node_namespace_pod_container:container_memory_cache|kube_statefulset_status_current_revision|kubelet_pleg_relist_duration_seconds_count|container_fs_reads_total|kube_statefulset_status_update_revision|container_network_receive_packets_dropped_total|kube_pod_info|kubelet_running_pods|process_resident_memory_bytes|kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_count|kube_node_status_capacity|container_network_transmit_bytes_total|rest_client_requests_total|kubernetes_build_info|machine_memory_bytes|kube_statefulset_status_replicas|container_memory_swap|kube_job_status_active|kubelet_pod_start_duration_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|node_namespace_pod_container:container_memory_swap|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*|node_network_transmit_bytes_total
                    action: keep
              relabel_configs:
                  - action: keep
                    regex: kube-state-metrics
                    source_labels:
                      - __meta_kubernetes_pod_label_app_kubernetes_io_name
            - job_name: integrations/node_exporter
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
              kubernetes_sd_configs:
                  - namespaces:
                      names:
                          - NAMESPACE
                    role: pod
              metric_relabel_configs:
                  - source_labels: [__name__]
                    regex: kubelet_running_containers|go_goroutines|kubelet_runtime_operations_errors_total|cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits|namespace_memory:kube_pod_container_resource_limits:sum|kubelet_volume_stats_inodes_used|kubelet_certificate_manager_server_ttl_seconds|namespace_workload_pod:kube_pod_owner:relabel|kubelet_node_config_error|kube_daemonset_status_number_misscheduled|kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_limits:sum|container_memory_working_set_bytes|container_fs_reads_bytes_total|kube_node_status_condition|namespace_cpu:kube_pod_container_resource_requests:sum|kubelet_server_expiration_renew_errors|container_fs_writes_total|kube_horizontalpodautoscaler_status_desired_replicas|node_filesystem_avail_bytes|kube_pod_status_reason|node_filesystem_size_bytes|kube_deployment_spec_replicas|kube_statefulset_metadata_generation|namespace_workload_pod|storage_operation_duration_seconds_count|kubelet_certificate_manager_client_expiration_renew_errors|kube_pod_container_resource_limits|kube_statefulset_status_replicas_updated|node_namespace_pod_container:container_memory_rss|kube_statefulset_status_observed_generation|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|kubelet_pleg_relist_interval_seconds_bucket|kube_job_status_start_time|kube_deployment_status_observed_generation|kubelet_pod_worker_duration_seconds_bucket|container_memory_cache|kube_resourcequota|kube_horizontalpodautoscaler_spec_min_replicas|namespace_memory:kube_pod_container_resource_requests:sum|kube_persistentvolumeclaim_resource_requests_storage_bytes|kube_daemonset_status_number_available|kube_job_failed|storage_operation_errors_total|cluster:namespace:pod_memory:active:kube_pod_container_resource_limits|container_fs_writes_bytes_total|kube_statefulset_replicas|kube_replicaset_owner|container_network_receive_bytes_total|volume_manager_total_volumes|kube_horizontalpodautoscaler_spec_max_replicas|kube_daemonset_status_desired_number_scheduled|kube_pod_container_status_waiting_reason|process_cpu_seconds_total|kube_node_status_allocatable|kube_deployment_status_replicas_available|kube_daemonset_status_updated_number_scheduled|container_network_receive_packets_total|container_memory_rss|container_cpu_usage_seconds_total|kube_namespace_status_phase|cluster:namespace:pod_memory:active:kube_pod_container_resource_requests|kubelet_volume_stats_available_bytes|kube_deployment_status_replicas_updated|kubelet_running_container_count|kube_node_info|container_network_transmit_packets_dropped_total|kubelet_certificate_manager_client_ttl_seconds|kube_pod_owner|kubelet_volume_stats_inodes|kubelet_runtime_operations_total|container_cpu_cfs_throttled_periods_total|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_running_pod_count|container_network_transmit_packets_total|kubelet_node_name|kube_daemonset_status_current_number_scheduled|kube_statefulset_status_replicas_ready|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|kubelet_volume_stats_capacity_bytes|kube_horizontalpodautoscaler_status_current_replicas|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|kube_node_spec_taint|kubelet_pleg_relist_duration_seconds_bucket|kube_pod_status_phase|container_cpu_cfs_periods_total|kube_deployment_metadata_generation|node_namespace_pod_container:container_memory_cache|kube_statefulset_status_current_revision|kubelet_pleg_relist_duration_seconds_count|container_fs_reads_total|kube_statefulset_status_update_revision|container_network_receive_packets_dropped_total|kube_pod_info|kubelet_running_pods|process_resident_memory_bytes|kubelet_pod_worker_duration_seconds_count|kubelet_pod_start_duration_seconds_count|kubelet_cgroup_manager_duration_seconds_count|kube_node_status_capacity|container_network_transmit_bytes_total|rest_client_requests_total|kubernetes_build_info|machine_memory_bytes|kube_statefulset_status_replicas|container_memory_swap|kube_job_status_active|kubelet_pod_start_duration_seconds_bucket|node_namespace_pod_container:container_memory_working_set_bytes|node_namespace_pod_container:container_memory_swap|kube_namespace_status_phase|container_cpu_usage_seconds_total|kube_pod_status_phase|kube_pod_start_time|kube_pod_container_status_restarts_total|kube_pod_container_info|kube_pod_container_status_waiting_reason|kube_daemonset.*|kube_replicaset.*|kube_statefulset.*|kube_job.*|kube_node.*|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests|namespace_cpu:kube_pod_container_resource_requests:sum|node_cpu.*|node_memory.*|node_filesystem.*|node_network_transmit_bytes_total
                    action: keep
              relabel_configs:
                  - action: keep
                    regex: prometheus-node-exporter.*
                    source_labels:
                      - __meta_kubernetes_pod_label_app_kubernetes_io_name
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_pod_node_name
                    target_label: instance
                  - action: replace
                    source_labels:
                      - __meta_kubernetes_namespace
                    target_label: namespace
              tls_config:
                  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
                  insecure_skip_verify: false
            - job_name: integrations/kubernetes/opencost
              kubernetes_sd_configs:
                - role: pod
              relabel_configs:
                - action: keep
                  regex: opencost-*
                  source_labels:
                    - __meta_kubernetes_pod_label_app_kubernetes_io_name
    
        integrations:
          eventhandler:
            cache_path: /var/lib/agent/eventhandler.cache
            logs_instance: integrations
        logs:
          configs:
          - name: integrations
            clients:
            - url: LOGS_HOST/loki/api/v1/push
              basic_auth:
                username: LOGS_USERNAME
                password: LOGS_PASSWORD
              external_labels:
                cluster: CLUSTER_NAME
                job: integrations/kubernetes/eventhandler
            positions:
              filename: /tmp/positions.yaml
            target_config:
              sync_period: 10s
  2. Deploy the file using kubectl apply -f <file.yaml>

  3. In the following command, replace NAMESPACE with the namespace for your Grafana Agent, and deploy the Agent by running the command:

    MANIFEST_URL=https://raw.githubusercontent.com/grafana/agent/v0.35.3/production/kubernetes/agent-bare.yaml NAMESPACE="NAMESPACE" /bin/sh -c "$(curl -fsSL https://raw.githubusercontent.com/grafana/agent/v0.35.3/production/kubernetes/install-bare.sh)" | kubectl apply -f -

Deploy kube-state-metrics

kube-state-metrics is a service that listens to the Kubernetes API server, and generates metrics on the state of the objects. The metrics are exported on the HTTP endpoint /metrics, on the listening port.

Replace NAMESPACE with the namespace for your Grafana Agent in the following command, and run it to deploy kube-state-metrics:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update && helm install ksm prometheus-community/kube-state-metrics -n "NAMESPACE"

Deploy node_exporter for resource monitoring

Node Exporter is required for resource efficiency monitoring.

Replace NAMESPACE with the namespace for your Grafana Agent in the following command, and run it to deploy Node Exporter:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && helm repo update && helm install nodeexporter prometheus-community/prometheus-node-exporter -n "NAMESPACE"

Deploy OpenCost for cost monitoring

OpenCost provides estimates for the Cost view.

Run the following command, replacing within the command the following:

  • NAMESPACE with the namespace for the Grafana Agent for metrics
  • CLUSTER_NAME with the name of your cluster
  • METRICS_HOST with the hostname for your Prometheus instance
  • METRICS_USERNAME with the username for your Prometheus instance
  • METRICS_PASSWORD with your Access Policy Token from a previous step
helm repo add opencost https://opencost.github.io/opencost-helm-chart && helm repo update && \
  helm install opencost opencost/opencost -n "NAMESPACE" -f - <<EOF
fullnameOverride: opencost
opencost:
  exporter:
    defaultClusterId: "CLUSTER_NAME"
    extraEnv:
      CLOUD_PROVIDER_API_KEY: AIzaSyD29bGxmHAVEOBYtgd8sYM2gM2ekfxQX4U
      EMIT_KSM_V1_METRICS: "false"
      EMIT_KSM_V1_METRICS_ONLY: "true"
      PROM_CLUSTER_ID_LABEL: cluster
  prometheus:
    password: "METRICS_PASSWORD"
    username: "METRICS_USERNAME"
    external:
      enabled: true
      url: "METRICS_HOST/api/prom"
    internal: { enabled: false }
  ui: { enabled: false }
EOF

Deploy Agent ConfigMap & DaemonSet for logs

The grafana-agent-logs ConfigMap configures the Grafana Agent DaemonSet to tail container logs and send them to Grafana Cloud.

  1. Save the following to a file, and replace within the file the following:

    • NAMESPACE with the namespace for the Grafana Agent for metrics
    • CLUSTER_NAME with the name of your cluster
    • METRICS_HOST with the hostname for your Prometheus instance
    • METRICS_USERNAME with the username for your Prometheus instance
    • METRICS_PASSWORD with your Access Policy Token from earlier
    • LOGS_HOST with the hostname for your Loki instance
    • LOGS_USERNAME with the username for your Loki instance
    • LOGS_PASSWORD with your Access Policy Token from earlier

    If your Kubernetes cluster does not use Docker, copy the following script to an editor and replace docker: {} with cri: {}.

    kind: ConfigMap
    metadata:
      name: grafana-agent-logs
      namespace: NAMESPACE
    apiVersion: v1
    data:
      agent.yaml: |
        metrics:
          wal_directory: /tmp/grafana-agent-wal
          global:
            scrape_interval: 60s
            external_labels:
              cluster: CLUSTER_NAME
          configs:
          - name: integrations
            remote_write:
            - url: METRICS_HOST/api/prom/push
              basic_auth:
                username: METRICS_USERNAME
                password: METRICS_PASSWORD
        integrations:
          prometheus_remote_write:
          - url: METRICS_HOST/api/prom/push
            basic_auth:
              username: METRICS_USERNAME
              password: METRICS_PASSWORD
    
        logs:
          configs:
          - name: integrations
            clients:
            - url: LOGS_HOST/loki/api/v1/push
              basic_auth:
                username: LOGS_USERNAME
                password: LOGS_PASSWORD
              external_labels:
                cluster: CLUSTER_NAME
            positions:
              filename: /tmp/positions.yaml
            target_config:
              sync_period: 10s
            scrape_configs:
            - job_name: integrations/kubernetes/pod-logs
              kubernetes_sd_configs:
                - role: pod
              pipeline_stages:
                - docker: {}
              relabel_configs:
                - source_labels:
                    - __meta_kubernetes_pod_node_name
                  target_label: __host__
                - action: replace
                  replacement: $1
                  separator: /
                  source_labels:
                    - __meta_kubernetes_namespace
                    - __meta_kubernetes_pod_name
                  target_label: job
                - action: replace
                  source_labels:
                    - __meta_kubernetes_namespace
                  target_label: namespace
                - action: replace
                  source_labels:
                    - __meta_kubernetes_pod_name
                  target_label: pod
                - action: replace
                  source_labels:
                    - __meta_kubernetes_pod_container_name
                  target_label: container
                - replacement: /var/log/pods/*$1/*.log
                  separator: /
                  source_labels:
                    - __meta_kubernetes_pod_uid
                    - __meta_kubernetes_pod_container_name
                  target_label: __path__
  2. Deploy the file using kubectl apply -f <file.yaml>

  3. Replace NAMESPACE with the namespace for your Grafana Agent in the following command, and deploy the Agent by running it:

    MANIFEST_URL=https://raw.githubusercontent.com/grafana/agent/v0.35.3/production/kubernetes/agent-loki.yaml NAMESPACE="NAMESPACE" /bin/sh -c "$(curl -fsSL https://raw.githubusercontent.com/grafana/agent/v0.35.3/production/kubernetes/install-bare.sh)" | kubectl apply -f -

Done with setup

To finish up:

  1. Navigate to Kubernetes Monitoring, and click Configuration on the main menu.

  2. Click the Metrics status tab to view the data status. Your data becomes populated as the system components begin scraping and sending data to Grafana Cloud.

    Metrics status tab
    Metrics status tab

  3. Click Kubernetes Monitoring on the main menu to view the home page and see any issues currently highlighted. You can drill into the data from here.

  4. Explore your Kubernetes infrastructure:

    • Click Cluster navigation in the menu, then click your namespace to view the grafana-agent StatefulSet, the grafana-agent-logs DaemonSet, and the ksm-kube-state-metrics deployment. Click the kube-system namespace to see Kubernetes-specific resources.
    • Click the Nodes tab, then click the Nodes of your cluster to view their condition, utilization, and pod density.