Import recording and alerting rules
You can import Kube-Prometheus’s recording and alerting rules to Grafana Cloud. Recording rules let you precompute expensive queries at customizable intervals, which reduces load on your Prometheus instances and improves query performance. To learn more, refer to Recording rules. Alerting rules let you define alert conditions based on PromQL queries against your Prometheus metrics.
Avoid evaluating the same recording rules in both your local Prometheus instance and on Grafana Cloud, as this creates additional data points for the same time series. You may wish to split up recording and alerting rule evaluation across your local Prometheus instance and Grafana Cloud Prometheus. This lets you keep alerting and recording rule evaluation local to your cluster, and use Grafana Cloud for rules that require global, multi-cluster aggregations.
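As a hedged illustration of that split, a rule group like the following (the group name, rule name, and expression are illustrative assumptions, not part of the Kube-Prometheus rules) only makes sense in Grafana Cloud, because it aggregates a metric across every cluster remote_writing to the same stack:

```yaml
# Hypothetical Grafana Cloud-only rule group: aggregates CPU usage across all
# clusters sending data to this stack, which no single local Prometheus can compute.
groups:
  - name: global-aggregations
    rules:
      - record: global:node_cpu_seconds_total:rate5m
        expr: sum by (mode) (rate(node_cpu_seconds_total[5m]))
```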
Note
To enable multi-cluster support for Kube-Prometheus rules, refer to Enable multi-cluster support.
Extract Kube-Prometheus rules from Prometheus Pod
Extract the Kube-Prometheus recording and alerting rules from the Prometheus Pod’s file system. Since you’re using the Kube-Prometheus Helm chart in these steps, there are two ways to extract the rules YAML files:
- `kubectl exec` into the Prometheus container and copy out the rules files, which Helm generated from the templated source.
- Use the `helm template` command to generate the Kubernetes manifests locally, and extract the rules files from this output (a sketch of this approach follows below).
For these steps, use the first method. If you are not using Helm, you might want to do either of the following:
- Generate or extract the rules files directly from the Jsonnet source
- Import the Jsonnet source directly using a tool like Grizzly.
These methods go beyond the scope of these instructions. You can also find a reduced set of rules in the kubernetes-mixin project’s GitHub repository. The Kubernetes mixin is a subset of the Kube-Prometheus stack.
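If you would rather use the second method, a minimal sketch looks like the following. The release name `foo`, the chart repository, and the `values.yaml` file are assumptions carried over from the Migrate a Kube-Prometheus Helm stack to Grafana Cloud guide:

```bash
# Render the chart's manifests locally instead of exec-ing into the Pod
helm template foo prometheus-community/kube-prometheus-stack -f values.yaml > manifests.yaml

# The rule groups live in the rendered PrometheusRule resources
grep -c "kind: PrometheusRule" manifests.yaml
```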
To extract the Kube-Prometheus rules:
Fetch the latest release of mimirtool from the Releases page.
After you confirm that you can run `mimirtool` locally, copy the rules files from your Prometheus instance’s container using `kubectl exec`:

```bash
kubectl exec -n "default" "prometheus-foo-kube-prometheus-stack-prometheus-0" -- tar cf - "/etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0" | tar xf -
```
In this command, replace:
- `prometheus-foo-kube-prometheus-stack-prometheus-0` with the name of your Prometheus Pod
- `/etc/prometheus/rules/prometheus-foo-kube-prometheus-stack-prometheus-rulefiles-0` with the path to Prometheus’s rules directory. To find this, port-forward to your Prometheus Pod using `kubectl port-forward`, access Prometheus’s configuration from the UI (Status -> Configuration), and search for the `rule_files` parameter.
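As an alternative to checking the UI, you can read the same setting from Prometheus’s configuration API once the port-forward is running. This is a sketch that reuses the example Pod name above and assumes `jq` is installed:

```bash
# Forward the Prometheus UI/API to localhost:9090 (example Pod name)
kubectl port-forward -n default pod/prometheus-foo-kube-prometheus-stack-prometheus-0 9090 &

# Print the rule_files section of the running configuration
curl -s http://localhost:9090/api/v1/status/config | jq -r '.data.yaml' | grep -A1 'rule_files'
```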
Running the `kubectl exec` command creates a directory called `etc` in your current directory. Navigate through the nested hierarchy to locate a set of rules YAML files.

Note

These files are symlinks to the actual rules definitions, which are in a hidden directory. Copy these rule definitions to a more convenient location.
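A minimal sketch of that copy step, assuming the extracted tree sits under `./etc` in your working directory:

```bash
# List the extracted rule files (they are symlinks into a hidden ..data directory)
find etc -name "*.yaml"

# Copy them into ./rules, dereferencing the symlinks so you get the real file contents
mkdir -p rules
find etc -name "*.yaml" -exec cp -L {} rules/ \;
```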
With the Kube-Prometheus rules files available locally, you can upload them to Cloud Prometheus using mimirtool.
Load rules into Grafana Cloud Prometheus
Use mimirtool to load the Kube-Prometheus stack recording and alerting rules into your Cloud Prometheus endpoint.
Use the `rules load` command to load the defined rule groups to Grafana Cloud using the HTTP API.

Warning

Your active series usage will increase with this step. You may need to increase your stack’s default rule limits or break up rule groups as described in the introductory note before running this command.

```bash
mimirtool rules load --address=<your_cloud_prometheus_endpoint> --id=<your_instance_id> --key=<your_cloud_access_policy_token> *.yaml
```
Replace the parameters in the previous command with the appropriate values. You can find these in your Grafana Cloud portal. Be sure to omit the `/api/prom` path from the endpoint URL. To learn how to create a Cloud Access Policy token, follow the instructions in Create a Grafana Cloud Access Policy.

This command loads the rules files into Mimir’s rule evaluation engine. If you encounter errors, use the `--log.level=debug` flag to increase the tool’s verbosity.

After loading the rules, navigate to your hosted Grafana instance, then Grafana Cloud Alerting in the left-hand menu, and finally Rules. Select the appropriate data source from the dropdown menu (ending in `-prom`). You should see a list of alerting and recording rules.
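To confirm the upload from the command line as well, you can list the rule namespaces and groups now stored in your Cloud Prometheus instance. This sketch reuses the same connection flags as the load command:

```bash
# List the rule namespaces and groups that mimirtool uploaded
mimirtool rules list \
  --address=<your_cloud_prometheus_endpoint> \
  --id=<your_instance_id> \
  --key=<your_cloud_access_policy_token>
```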
In the Reduce your active series usage instructions, you limited metrics sent to Grafana Cloud from the local Prometheus instance to only those referenced in Kubernetes Monitoring. Now you need to expand this set of metrics to include those referenced in the recording and alerting rules you just imported.
Expand the allowlist to capture rules metrics
Expand the allowlist of sent metrics to include those referenced in Kube-Prometheus’s recording and alerting rules.
Warning
Your active series usage will increase with this step.
Open the `values.yaml` file you used to configure `remote_write` in the Migrate a Kube-Prometheus Helm stack to Grafana Cloud guide.

Modify the `values.yaml` file as follows:

```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: "<Your Cloud Prometheus instance remote_write endpoint>"
        basicAuth:
          username:
            name: kubepromsecret
            key: username
          password:
            name: kubepromsecret
            key: password
        writeRelabelConfigs:
          - sourceLabels:
              - "__name__"
            regex: ":node_memory_MemAvailable_bytes:sum|aggregator_unavailable_apiservice|aggregator_unavailable_apiservice_total|alertmanager_alerts|alertmanager_alerts_invalid_total|alertmanager_alerts_received_total|alertmanager_cluster_members|alertmanager_config_hash|alertmanager_config_last_reload_successful|alertmanager_notification_latency_seconds_bucket|alertmanager_notification_latency_seconds_count|alertmanager_notification_latency_seconds_sum|alertmanager_notifications_failed_total|alertmanager_notifications_total|apiserver_client_certificate_expiration_seconds_bucket|apiserver_client_certificate_expiration_seconds_count|apiserver_request:availability30d|apiserver_request:burnrate1d|apiserver_request:burnrate1h|apiserver_request:burnrate2h|apiserver_request:burnrate30m|apiserver_request:burnrate3d|apiserver_request:burnrate5m|apiserver_request:burnrate6h|apiserver_request_duration_seconds_bucket|apiserver_request_duration_seconds_count|apiserver_request_terminations_total|apiserver_request_total|cluster:node_cpu_seconds_total:rate5m|cluster_quantile:apiserver_request_duration_seconds:histogram_quantile|code:apiserver_request_total:increase30d|code_resource:apiserver_request_total:rate5m|code_verb:apiserver_request_total:increase1h|code_verb:apiserver_request_total:increase30d|container_cpu_cfs_periods_total|container_cpu_cfs_throttled_periods_total|container_cpu_usage_seconds_total|container_fs_reads_bytes_total|container_fs_reads_total|container_fs_writes_bytes_total|container_fs_writes_total|container_memory_cache|container_memory_rss|container_memory_swap|container_memory_usage_bytes|container_memory_working_set_bytes|container_network_receive_bytes_total|container_network_receive_packets_dropped_total|container_network_receive_packets_total|container_network_transmit_bytes_total|container_network_transmit_packets_dropped_total|container_network_transmit_packets_total|coredns_cache_entries|coredns_cache_hits_total|coredns_cache_misses_total|coredns_cache_size|coredns_dns_do_requests_total|coredns_dns_request_count_total|coredns_dns_request_do_count_total|coredns_dns_request_duration_seconds_bucket|coredns_dns_request_size_bytes_bucket|coredns_dns_request_type_count_total|coredns_dns_requests_total|coredns_dns_response_rcode_count_total|coredns_dns_response_size_bytes_bucket|coredns_dns_responses_total|etcd_disk_backend_commit_duration_seconds_bucket|etcd_disk_wal_fsync_duration_seconds_bucket|etcd_http_failed_total|etcd_http_received_total|etcd_http_successful_duration_seconds_bucket|etcd_mvcc_db_total_size_in_bytes|etcd_network_client_grpc_received_bytes_total|etcd_network_client_grpc_sent_bytes_total|etcd_network_peer_received_bytes_total|etcd_network_peer_round_trip_time_seconds_bucket|etcd_network_peer_sent_bytes_total|etcd_server_has_leader|etcd_server_leader_changes_seen_total|etcd_server_proposals_applied_total|etcd_server_proposals_committed_total|etcd_server_proposals_failed_total|etcd_server_proposals_pending|go_goroutines|grpc_server_handled_total|grpc_server_handling_seconds_bucket|grpc_server_started_total|instance:node_cpu_utilisation:rate5m|instance:node_load1_per_cpu:ratio|instance:node_memory_utilisation:ratio|instance:node_network_receive_bytes_excluding_lo:rate5m|instance:node_network_receive_drop_excluding_lo:rate5m|instance:node_network_transmit_bytes_excluding_lo:rate5m|instance:node_network_transmit_drop_excluding_lo:rate5m|instance:node_num_cpu:sum|instance:node_vmstat_pgmajfault:rate5m|instance_device:node_disk_io_time_seconds:rate5m|instance_device:node_disk_io_time_weighted_seconds:rate5m|kube_daemonset_status_current_number_scheduled|kube_daemonset_status_desired_number_scheduled|kube_daemonset_status_number_available|kube_daemonset_status_number_misscheduled|kube_daemonset_updated_number_scheduled|kube_deployment_metadata_generation|kube_deployment_spec_replicas|kube_deployment_status_observed_generation|kube_deployment_status_replicas_available|kube_deployment_status_replicas_updated|kube_horizontalpodautoscaler_spec_max_replicas|kube_horizontalpodautoscaler_spec_min_replicas|kube_horizontalpodautoscaler_status_current_replicas|kube_horizontalpodautoscaler_status_desired_replicas|kube_job_failed|kube_job_spec_completions|kube_job_status_succeeded|kube_node_spec_taint|kube_node_status_allocatable|kube_node_status_capacity|kube_node_status_condition|kube_persistentvolume_status_phase|kube_pod_container_resource_limits|kube_pod_container_resource_requests|kube_pod_container_status_restarts_total|kube_pod_container_status_waiting_reason|kube_pod_info|kube_pod_owner|kube_pod_status_phase|kube_replicaset_owner|kube_resourcequota|kube_state_metrics_list_total|kube_state_metrics_shard_ordinal|kube_state_metrics_total_shards|kube_state_metrics_watch_total|kube_statefulset_metadata_generation|kube_statefulset_replicas|kube_statefulset_status_current_revision|kube_statefulset_status_observed_generation|kube_statefulset_status_replicas|kube_statefulset_status_replicas_current|kube_statefulset_status_replicas_ready|kube_statefulset_status_replicas_updated|kube_statefulset_status_update_revision|kubelet_certificate_manager_client_expiration_renew_errors|kubelet_certificate_manager_client_ttl_seconds|kubelet_certificate_manager_server_ttl_seconds|kubelet_cgroup_manager_duration_seconds_bucket|kubelet_cgroup_manager_duration_seconds_count|kubelet_node_config_error|kubelet_node_name|kubelet_pleg_relist_duration_seconds_bucket|kubelet_pleg_relist_duration_seconds_count|kubelet_pleg_relist_interval_seconds_bucket|kubelet_pod_start_duration_seconds_count|kubelet_pod_worker_duration_seconds_bucket|kubelet_pod_worker_duration_seconds_count|kubelet_running_container_count|kubelet_running_containers|kubelet_running_pod_count|kubelet_running_pods|kubelet_runtime_operations_duration_seconds_bucket|kubelet_runtime_operations_errors_total|kubelet_runtime_operations_total|kubelet_server_expiration_renew_errors|kubelet_volume_stats_available_bytes|kubelet_volume_stats_capacity_bytes|kubelet_volume_stats_inodes|kubelet_volume_stats_inodes_used|kubeproxy_network_programming_duration_seconds_bucket|kubeproxy_network_programming_duration_seconds_count|kubeproxy_sync_proxy_rules_duration_seconds_bucket|kubeproxy_sync_proxy_rules_duration_seconds_count|kubernetes_build_info|namespace_cpu:kube_pod_container_resource_limits:sum|namespace_cpu:kube_pod_container_resource_requests:sum|namespace_memory:kube_pod_container_resource_limits:sum|namespace_memory:kube_pod_container_resource_requests:sum|namespace_workload_pod|namespace_workload_pod:kube_pod_owner:relabel|node_cpu_seconds_total|node_disk_io_time_seconds_total|node_disk_io_time_weighted_seconds_total|node_disk_read_bytes_total|node_disk_written_bytes_total|node_exporter_build_info|node_filesystem_avail_bytes|node_filesystem_files|node_filesystem_files_free|node_filesystem_readonly|node_filesystem_size_bytes|node_load1|node_load15|node_load5|node_md_disks|node_md_disks_required|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemAvailable_bytes|node_memory_MemFree_bytes|node_memory_MemTotal_bytes|node_memory_Slab_bytes|node_namespace_pod:kube_pod_info:|node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate|node_namespace_pod_container:container_memory_cache|node_namespace_pod_container:container_memory_rss|node_namespace_pod_container:container_memory_swap|node_namespace_pod_container:container_memory_working_set_bytes|node_network_receive_bytes_total|node_network_receive_drop_total|node_network_receive_errs_total|node_network_receive_packets_total|node_network_transmit_bytes_total|node_network_transmit_drop_total|node_network_transmit_errs_total|node_network_transmit_packets_total|node_network_up|node_nf_conntrack_entries|node_nf_conntrack_entries_limit|node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile|node_textfile_scrape_error|node_timex_maxerror_seconds|node_timex_offset_seconds|node_timex_sync_status|node_vmstat_pgmajfault|process_cpu_seconds_total|process_resident_memory_bytes|process_start_time_seconds|prometheus|prometheus_build_info|prometheus_config_last_reload_successful|prometheus_engine_query_duration_seconds|prometheus_engine_query_duration_seconds_count|prometheus_notifications_alertmanagers_discovered|prometheus_notifications_errors_total|prometheus_notifications_queue_capacity|prometheus_notifications_queue_length|prometheus_notifications_sent_total|prometheus_operator_list_operations_failed_total|prometheus_operator_list_operations_total|prometheus_operator_managed_resources|prometheus_operator_node_address_lookup_errors_total|prometheus_operator_ready|prometheus_operator_reconcile_errors_total|prometheus_operator_reconcile_operations_total|prometheus_operator_syncs|prometheus_operator_watch_operations_failed_total|prometheus_operator_watch_operations_total|prometheus_remote_storage_failed_samples_total|prometheus_remote_storage_highest_timestamp_in_seconds|prometheus_remote_storage_queue_highest_sent_timestamp_seconds|prometheus_remote_storage_samples_failed_total|prometheus_remote_storage_samples_total|prometheus_remote_storage_shards_desired|prometheus_remote_storage_shards_max|prometheus_remote_storage_succeeded_samples_total|prometheus_rule_evaluation_failures_total|prometheus_rule_group_iterations_missed_total|prometheus_rule_group_rules|prometheus_sd_discovered_targets|prometheus_target_interval_length_seconds_count|prometheus_target_interval_length_seconds_sum|prometheus_target_metadata_cache_entries|prometheus_target_scrape_pool_exceeded_label_limits_total|prometheus_target_scrape_pool_exceeded_target_limit_total|prometheus_target_scrapes_exceeded_sample_limit_total|prometheus_target_scrapes_sample_duplicate_timestamp_total|prometheus_target_scrapes_sample_out_of_bounds_total|prometheus_target_scrapes_sample_out_of_order_total|prometheus_target_sync_length_seconds_sum|prometheus_tsdb_compactions_failed_total|prometheus_tsdb_head_chunks|prometheus_tsdb_head_samples_appended_total|prometheus_tsdb_head_series|prometheus_tsdb_reloads_failures_total|rest_client_request_duration_seconds_bucket|rest_client_requests_total|scheduler_binding_duration_seconds_bucket|scheduler_binding_duration_seconds_count|scheduler_e2e_scheduling_duration_seconds_bucket|scheduler_e2e_scheduling_duration_seconds_count|scheduler_scheduling_algorithm_duration_seconds_bucket|scheduler_scheduling_algorithm_duration_seconds_count|scheduler_volume_scheduling_duration_seconds_bucket|scheduler_volume_scheduling_duration_seconds_count|storage_operation_duration_seconds_bucket|storage_operation_duration_seconds_count|storage_operation_errors_total|up|volume_manager_total_volumes|workqueue_adds_total|workqueue_depth|workqueue_queue_duration_seconds_bucket"
            action: "keep"
    replicaExternalLabelName: "__replica__"
    externalLabels: {cluster: "test"}
```
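As with other `values.yaml` changes in this guide, the expanded allowlist only takes effect after you roll it out. A sketch of that step, using the placeholder release name `your_release_name` from the command later in this guide:

```bash
# Roll out the expanded writeRelabelConfigs allowlist
helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack
```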
Note
In these steps, the metrics allowlist corresponds to version 16.12.0 of the Kube-Prometheus Helm chart. This list of metrics may change as rules are updated.
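If your chart version ships a different rule set, one way to rebuild the allowlist is to extract the metric names directly from the rule files you copied earlier. This is a hedged sketch using `mimirtool analyze rule-file`; it assumes the rule files sit in `./rules`, that `jq` is installed, and that the default output file name and JSON field match your mimirtool version:

```bash
# Analyze the local rule files and record which metrics they reference
mimirtool analyze rule-file rules/*.yaml

# Join the extracted metric names into a writeRelabelConfigs "keep" regex
jq -r '.metricsUsed | join("|")' metrics-in-ruler.json
```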
Disable local Prometheus rules evaluation
Local Prometheus rule evaluation creates additional data points and alerts for the same time series. After importing the Kube-Prometheus recording and alerting rules to Grafana Cloud, you might want to turn off this local rule evaluation.
To disable local rule evaluation:
Add the following to your `values.yaml` Helm configuration file:

```yaml
. . .
defaultRules:
  create: false
  rules:
    alertmanager: false
    etcd: false
    general: false
    k8s: false
    kubeApiserver: false
    kubeApiserverAvailability: false
    kubeApiserverError: false
    kubeApiserverSlos: false
    kubelet: false
    kubePrometheusGeneral: false
    kubePrometheusNodeAlerting: false
    kubePrometheusNodeRecording: false
    kubernetesAbsent: false
    kubernetesApps: false
    kubernetesResources: false
    kubernetesStorage: false
    kubernetesSystem: false
    kubeScheduler: false
    kubeStateMetrics: false
    network: false
    node: false
    prometheus: false
    prometheusOperator: false
    time: false
```
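If you prefer the split described at the start of this section, where some rule groups continue to evaluate locally, you can leave those entries set to true instead of disabling everything. Which groups to keep is entirely your choice; just avoid also loading those same groups into Grafana Cloud, or you will duplicate series as noted earlier. A sketch:

```yaml
# Hypothetical partial disable: keep node and kubelet rules local,
# and evaluate the remaining (disabled) groups in Grafana Cloud instead.
defaultRules:
  create: true
  rules:
    node: true
    kubelet: true
    kubeApiserver: false
    alertmanager: false
```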
Apply the changes using `helm upgrade`:

```bash
helm upgrade -f values.yaml your_release_name prometheus-community/kube-prometheus-stack
```
After the changes have been applied, use `port-forward` to forward a local port to your Prometheus Service:

```bash
kubectl port-forward svc/foo-kube-prometheus-stack-prometheus 9090
```
Navigate to `http://localhost:9090` in your browser, then Status and Rules. Verify that Prometheus has ceased evaluating recording and alerting rules.
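You can also check this from the command line while the port-forward is still running. The rules endpoint is the standard Prometheus HTTP API; `jq` is assumed to be installed:

```bash
# Count the rule groups Prometheus still evaluates; expect 0 after disabling defaultRules
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups | length'
```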
To learn more, refer to the Helm chart’s values.yaml file.
Next steps
To learn how to enable multi-cluster support for Kube-Prometheus rules, refer to Enable multi-cluster support.