Import recording and alerting rules

You can import Kube-Prometheus’s recording and alerting rules to Grafana Cloud. Recording rules let you cache expensive queries at customizable intervals to reduce load on your Prometheus instances and improve performance. To learn more, refer to Recording rules. Alerting rules allow you to define alert conditions based on PromQL queries and your Prometheus metrics.

Avoid evaluating the same recording rules in both your local Prometheus instance and on Grafana Cloud, as this creates additional data points for the same time series. You may wish to split up recording and alerting rule evaluation across your local Prometheus instance and Grafana Cloud Prometheus. This lets you keep alerting and recording rule evaluation local to your cluster, and use Grafana Cloud for rules that require global, multi-cluster aggregations.

Note

To enable multi-cluster support for Kube-Prometheus rules, refer to Enable multi-cluster support.

Extract Kube-Prometheus rules from Prometheus Pod

Extract the Kube-Prometheus recording and alerting rules from the Prometheus Pod’s file system. Since you’re using the Kube-Prometheus Helm chart in these steps, there are two ways to extract the rules YAML files:

  • kubectl exec into the Prometheus container and copy out the rules files, which Helm generated from the templated source.
  • Use the helm template command to generate the K8s manifests locally, and extract the rules files from this output.

For these steps, use the first method. If you are not using Helm, you might want to do either of the following:

  • Generate or extract the rules files directly from the Jsonnet source
  • Import the Jsonnet source directly using a tool like Grizzly.

These methods are beyond the scope of these instructions. You can also find a reduced set of rules in the kubernetes-mixin project’s GitHub repository. The Kubernetes mixin is a subset of the Kube-Prometheus stack.
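
If you prefer the helm template method instead, a minimal sketch looks like the following. The release name foo is an example, and the rendered PrometheusRule manifests are located here by searching the output rather than by assuming a fixed path:

    # Render the chart's manifests locally without installing anything
    helm template foo prometheus-community/kube-prometheus-stack --output-dir rendered

    # List the rendered files that contain PrometheusRule resources
    grep -rl "kind: PrometheusRule" rendered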

To extract the Kube-Prometheus rules:

  1. Fetch the latest release of mimirtool from the Releases page.
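
    For example, on a Linux amd64 workstation you might download the binary as follows. The release URL and asset name here are assumptions, so check the Releases page for the artifact that matches your platform and version:

    # Download the mimirtool binary (asset name assumed for Linux amd64)
    curl -fLo mimirtool https://github.com/grafana/mimir/releases/latest/download/mimirtool-linux-amd64
    chmod +x mimirtool
    # Confirm the tool runs locally
    ./mimirtool --help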

  2. After you confirm that you can run mimirtool locally, copy the rules files from your Prometheus instance’s container using kubectl exec:

    kubectl exec -n "default" "prometheus-foo-kube-prometheus-stack-prometheus-0" -- tar cf - "/etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0" | tar xf -

    In this command, replace:

    • prometheus-foo-kube-prometheus-stack-prometheus-0 with the name of your Prometheus Pod
    • /etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0 with the path to Prometheus’s rules directory. To find this, port-forward to your Prometheus Pod using kubectl port-forward, access Prometheus’ configuration from the UI (Status -> Configuration), and search for the rule_files parameter.
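
    As an alternative to the UI, one way to read the rule_files setting is to query the Prometheus HTTP API through a port-forward. This sketch assumes jq is installed and reuses the example Pod name above:

    # Forward local port 9090 to the Prometheus Pod
    kubectl port-forward -n default prometheus-foo-kube-prometheus-stack-prometheus-0 9090 &

    # Fetch the loaded configuration and show the rule_files entry
    curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A 2 'rule_files'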

    The kubectl exec command creates a directory named etc in your current working directory. Navigate through the nested hierarchy to locate the rules YAML files.

    Note

    These files are symlinks to the actual rules definitions, which are in a hidden directory. Copy these rule definitions to a more convenient location.
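
    For example, a short sketch that dereferences the symlinks and gathers the rule files into a rules/ directory in your working directory (the directory name is arbitrary):

    mkdir -p rules
    # cp -L follows the symlinks so the copies contain the actual rule definitions
    find etc -name '*.yaml' -exec cp -L {} rules/ \;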

With the Kube-Prometheus rules files available locally, you can upload them to Grafana Cloud Prometheus using mimirtool.

Load rules into Grafana Cloud Prometheus

Use mimirtool to load the Kube-Prometheus stack recording and alerting rules into your Cloud Prometheus endpoint.

  1. Use the rules load command to load the defined rule groups to Grafana Cloud using the HTTP API.

    Warning

    Your active series usage will increase with this step. You may need to increase your stack’s default rule limits or break up rule groups as described in the introductory note before running this command.

    mimirtool rules load --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_cloud_access_policy_token> *.yaml

    Replace the parameters in the previous command with the appropriate values. You can find these in your Grafana Cloud portal. Be sure to omit the /api/prom path from the endpoint URL. To learn how to create a Cloud Access Policy token, follow the instructions in Create a Grafana Cloud Access Policy.

    This command loads the rules files into Grafana Cloud’s rule evaluation engine. If you encounter errors, use the --log.level=debug flag to increase the tool’s verbosity.
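
    To confirm that the rule groups were uploaded, you can list them back from the same endpoint with the rules list subcommand, using the same address, instance ID, and token:

    # Lists the namespaces and rule groups stored in Grafana Cloud
    mimirtool rules list --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_cloud_access_policy_token>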

  2. After loading the rules, navigate to your hosted Grafana instance, then Grafana Cloud Alerting in the left-hand menu, and finally Rules. Select the appropriate data source from the dropdown menu (ending in -prom). You should see a list of alerting and recording rules.

In the Reduce your active series usage instructions, you limited metrics sent to Grafana Cloud from the local Prometheus instance to only those referenced in Kubernetes Monitoring. Now you need to expand this set of metrics to include those referenced in the recording and alerting rules you just imported.

Expand the allowlist to capture rules metrics

Expand the allowlist of sent metrics to include those referenced in Kube-Prometheus’s recording and alerting rules.

Warning

Your active series usage will increase with this step.

  1. Open the values.yaml file you used to configure remote_write in the Migrate a Kube-Prometheus Helm stack to Grafana Cloud guide.

  2. Modify the values.yaml file as follows:

    prometheus:
      prometheusSpec:
        remoteWrite:
        - url: "<Your Cloud Prometheus instance remote_write endpoint>"
          basicAuth:
            username:
              name: kubepromsecret
              key: username
            password:
              name: kubepromsecret
              key: password
          writeRelabelConfigs:
          - sourceLabels:
            - "__name__"
            regex: ":node_memory_MemAvailable_bytes:sum|aggregator_unavailable_apiservice|aggregator_unavailable_apiservice_total|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_cluster_members|alertmanager_config_hash|alertmanager_config_last_reload_successful|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request:availability30d|apiserver_request:burnrate1d|apiserver_request:burnrate1h|apiserver_request:burnrate2h|apiserver_request:burnrate30m|apiserver_request:burnrate3d|apiserver_request:burnrate5m|apiserver_request:burnrate6h|apiserver_request_duration_seconds_bucket|apiserver_request_duration_seconds_count|apiserver_request_terminations_total|apiserver_request_total|cluster:node_cpu_seconds_total:rate5m|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code:apiserver_request_total:increase30d|code_resource:apiserver_request_total:rate5m|code_verb:apiserver_request_total:increase1h|code_verb:apiserver_request_total:increase30d|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:n
ode_disk_io_time_weighted_seconds:rate5m|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_spec_completions|kube_job_status_succeeded|kube_node_spec_taint|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_status_phase|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_replicaset_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_shard_ordinal|kube_state_metrics_total_shards|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_certificate_manager_client_ttl_seconds|kubelet_certificate_manager_server_ttl_seconds|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|kubernetes_build_info|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node
_memory_Slab_bytes|node_namespace_pod:kube_pod_info:|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_packets_total|node_network_transmit_bytes_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_packets_total|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|node_textfile_scrape_error|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_vmstat_pgmajfault|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_config_last_reload_successful|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_notifications_alertmanagers_discovered|prometheus_notifications_errors_total|prometheus_notifications_queue_capacity|prometheus_notifications_queue_length|prometheus_notifications_sent_total|prometheus_operator_list_operations_failed_total|prometheus_operator_list_operations_total|prometheus_operator_managed_resources|prometheus_operator_node_address_lookup_errors_total|prometheus_operator_ready|prometheus_operator_reconcile_errors_total|prometheus_operator_reconcile_operations_total|prometheus_operator_syncs|prometheus_operator_watch_operations_failed_total|prometheus_operator_watch_operations_total|prometheus_remote_storage_failed_samples_total|prometheus_remote_storage_highest_timestamp_in_seconds|prometheus_remote_storage_queue_highest_sent_timestamp_seconds|prometheus_remote_storage_samples_failed_total|prometheus_remote_storage_samples_total|prometheus_remote_storage_shards_desired|prometheus_remote_storage_shards_max|prometheus_remote_storage_succeeded_samples_total|prometheus_rule_evaluation_failures_total|prometheus_rule_group_iterations_missed_total|prometheus_rule_group_rules|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_metadata_cache_entries|prometheus_target_scrape_pool_exceeded_label_limits_total|prometheus_target_scrape_pool_exceeded_target_limit_total|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_compactions_failed_total|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|prometheus_tsdb_reloads_failures_total|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_ma
nager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
            action: "keep"
        replicaExternalLabelName: "__replica__"
        externalLabels: {cluster: "test"}

Note

In these steps, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as rules are updated.
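
After saving the expanded allowlist, roll the change out with helm upgrade. For example, using the same release name placeholder as elsewhere in this guide:

    helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack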

Disable local Prometheus rules evaluation

Local Prometheus rule evaluation will create additional data points and alerts for the same time series. After importing the Kube Prometheus recording and alerting rules to Grafana Cloud, you might want to turn off this local rule evaluation.

To disable local rule evaluation:

  1. Add the following to your values.yaml Helm configuration file:

    . . .
    defaultRules:
      create: false
      rules:
        alertmanager: false
        etcd: false
        general: false
        k8s: false
        kubeApiserver: false
        kubeApiserverAvailability: false
        kubeApiserverError: false
        kubeApiserverSlos: false
        kubelet: false
        kubePrometheusGeneral: false
        kubePrometheusNodeAlerting: false
        kubePrometheusNodeRecording: false
        kubernetesAbsent: false
        kubernetesApps: false
        kubernetesResources: false
        kubernetesStorage: false
        kubernetesSystem: false
        kubeScheduler: false
        kubeStateMetrics: false
        network: false
        node: false
        prometheus: false
        prometheusOperator: false
        time: false
  2. Apply the changes using helm upgrade:

    helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack
  3. After the changes have been applied, use port-forward to forward a local port to your Prometheus Service:

    kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090
  4. Navigate to http://localhost:9090 in your browser, then go to Status -> Rules.

  5. Verify that Prometheus has ceased evaluating recording and alerting rules.
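
    For example, with the port-forward from the previous step still active, the rules API should return an empty list of rule groups. This sketch assumes jq is installed:

    # An empty array confirms that no recording or alerting rules are loaded locally
    curl -s http://localhost:9090/api/v1/rules | jq '.data.groups'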

To learn more, refer to the Helm chart’s values.yaml file.

Next steps

To learn how to enable multi-cluster support for Kube-Prometheus rules, refer to Enable multi-cluster support.