How to build an alerting system with Prometheus and Alertmanager

[ learning  monitoring  devops  prometheus  alert  sre  ]

Introduction

While Prometheus is excellent at collecting and storing metrics, it does not provide a built-in mechanism for alert notifications.

This is where Alertmanager comes in. Alertmanager is a component of the Prometheus ecosystem that manages and sends notifications for alerts raised by the rules defined in Prometheus. It allows you to route alerts from different sets of metrics differently and to specify how you want to receive them, such as via email, text messages, Slack, PagerDuty, or other integrations.

Without Alertmanager, Prometheus would be limited to only collecting and storing metrics without providing any actionable alerts when certain thresholds or conditions are met. Therefore, Alertmanager is a crucial component of the Prometheus ecosystem that complements its monitoring capabilities by enabling efficient and reliable alerting.

How Alertmanager works

The Prometheus server evaluates the alerting rules and sends firing alerts to Alertmanager. Alertmanager is responsible for receiving, deduplicating, grouping, and routing alerts to various notification systems such as email, webhooks, or other alerting systems via an API. Alertmanager also provides additional features such as silence management, notification inhibition, and alert template rendering.

The receivers can be configured to handle the alerts in different ways, such as sending an email, triggering a webhook, or forwarding the alert to another system for further processing. Alertmanager comes with built-in support for various receivers, but it also allows users to define custom receivers to meet their specific needs.
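
For instance, a generic webhook receiver, which simply POSTs alert payloads to an HTTP endpoint, can be declared in the Alertmanager configuration like this (the receiver name and URL below are hypothetical, just to illustrate the shape of the configuration):

# alertmanager.yml (hypothetical fragment)
receivers:
  - name: ops-webhook
    webhook_configs:
      - url: "http://alert-handler.internal:5000/alerts"   # hypothetical endpoint that accepts Alertmanager's JSON payload
        send_resolved: true                                 # also notify when the alert is resolved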

Alertmanager is composed of several components that work together to provide a flexible and powerful alert management system.

Here is a brief overview of some of the key Alertmanager components:

  1. Dispatcher is responsible for processing incoming alerts, deduplicating them, grouping them by their labels, and sending them to the router.

  2. Inhibitor allows users to suppress alerts based on the presence of other alerts. This can be useful in situations where a higher-level alert should suppress lower-level alerts to avoid unnecessary noise. For example, if a server is down, there is no need to receive alerts for all of the services running on that server (a minimal configuration sketch follows this overview).

  3. Silencer allows users to temporarily suppress alerts for a given period of time. This can be useful in situations where a known issue is being addressed or during maintenance windows when alerts are not needed. The Silencer supports both manual and automatic silencing based on a set of predefined rules.

  4. Router is responsible for determining which receiver(s) should receive a given alert. It uses a set of routing rules that are defined by the user to match alerts based on their labels and route them to the appropriate receiver(s). The Router supports powerful regular expression matching and can be configured to route alerts based on the severity or type of the alert.

  5. Receiver is a key component that defines how alerts are sent to various notification channels or systems. The Receiver is essentially a target for alerts, and it defines which notification channels should be used for a particular alert.

Overall, these Alertmanager components work together to provide a flexible and powerful alert management system that can be customized to meet the specific needs of each user or organization.
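
As an example of the Inhibitor at work, inhibition is configured in the inhibit_rules section of the Alertmanager configuration. The sketch below is only illustrative; the label values are assumptions, not part of the setup built later in this post. It suppresses warning-level alerts for an instance as long as a critical alert is firing for the same instance:

# alertmanager.yml (hypothetical fragment)
inhibit_rules:
  - source_matchers:
      - severity="critical"        # if a critical alert is firing...
    target_matchers:
      - severity="warning"         # ...mute matching warning alerts...
    equal: ["instance"]            # ...but only when they concern the same instance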

Alerting rules

Alertmanager does not trigger alerts itself; that is done by the Prometheus server. So, as a first step, we will define and trigger an alert on the Prometheus side, without handling it with any notification target yet.

In the previous post, we set up the Prometheus server to collect metrics from a web application. Now, we can use the existing Docker Compose file and extend it with some configurations.

First, we have to define a Prometheus rules file. In the rules file, we define a list of groups, each identified by a name key. Rule groups are evaluated in parallel. Within a group, we can define a set of rules (alerting rules, but also recording rules) that are evaluated sequentially. The simplest alert that is very useful and easy to test is a check whether our application is alive.

# rules.yml
groups:
  - name: web-app
    rules:
      - alert: ApplicationDown
        expr: up{job="web-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          title: App is down on {{ $labels.instance }}
          description: The app on instance {{ $labels.instance }} has been down for the past 1 minute.

In our example, we have defined one rule that checks whether the application is down using the metric up{job="web-app"}. The for parameter in Prometheus alerting rules specifies how long a condition must be true before an alert fires. It helps prevent false positives by ensuring that the condition is sustained over a certain period of time before the alert is triggered. We have also added a label severity: critical, which will be useful later when we route notifications. Labels are used to classify alerts, while annotations provide additional information about incidents.
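
To illustrate how expr, for, labels, and annotations play together, here is one more hedged example of a rule that could be added to the same group. The metric http_requests_total and the threshold are assumptions for illustration; our web application does not necessarily expose them:

# rules.yml (hypothetical additional rule in the web-app group)
      - alert: HighErrorRate
        expr: rate(http_requests_total{job="web-app", status="500"}[5m]) > 0.1
        for: 5m
        labels:
          severity: warning        # lower severity, so it can be routed differently than critical alerts
        annotations:
          title: High error rate on {{ $labels.instance }}
          description: Instance {{ $labels.instance }} is returning more than 0.1 HTTP 500 responses per second.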

To configure a rules file in Prometheus, we need to add its path to the rule_files section in the prometheus.yml configuration file.

# prometheus.yml
global:
  scrape_interval: 15s
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "web-app"
    scrape_interval: 5s
    static_configs:
      - targets: ["web-app:8000"]

If we store our rules file in the same folder as the prometheus.yml file, we can use a relative file path. To make this work, we have to modify our Docker Compose file with a proper bind mount for the rules.yml file.

# docker-compose.yml
version: "3.7"

services:
  prometheus:
    image: prom/prometheus
    ports:
      - target: 9090
        published: 9090
    volumes:
      - type: bind
        source: ./prometheus.yml
        target: /etc/prometheus/prometheus.yml
      - type: bind
        source: ./rules.yml
        target: /etc/prometheus/rules.yml
      - ./prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  web-app:
    image: web-app-monitoring:v0
    ports:
      - target: 8000
        published: 8080

Now we can start our Docker Compose stack:

$ docker-compose up
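
Before starting the stack (or after changing the rules), it is worth validating the rules file. The prom/prometheus image ships with the promtool utility, so a quick sanity check could look like this (the bind-mount path is only an example):

$ docker run --rm -v $(pwd)/rules.yml:/rules.yml --entrypoint /bin/promtool prom/prometheus check rules /rules.yml

If the file is valid, promtool reports the number of rules it found; otherwise it prints the parsing error.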

Our application and the Prometheus server should now be running. We can check the alert definitions and their state in the Prometheus UI (http://localhost:9090/alerts).

We can see that there is one alert called “ApplicationDown” in the inactive state, which means that our application is currently up and running.

To trigger the alert, we kill the application process.

Now our alert is in the pending state. In that state, the expression defining the alert is true, but has not been true for 1 minute yet (the time specified in the for parameter).

If we check again after 1 minute, our alert will be firing.
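
Besides the UI, the alert state can also be inspected through the Prometheus HTTP API, for example:

$ curl -s http://localhost:9090/api/v1/alerts

The response lists the currently pending and firing alerts together with their labels and annotations.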

As we said before, a firing alert does not send a notification on its own, because Prometheus is not responsible for that. In the next step, we will configure Alertmanager to handle firing alerts and send notifications to external systems (e.g. email, Slack, etc.).

Alertmanager installation

The easiest way to add Alertmanager to our stack is by modifying the Docker Compose file. We can use the Docker image provided by the Prometheus community (however, if you would like to install Alertmanager from scratch on a host, here is a great tutorial describing this process step by step).

First, we need to define the alertmanager.yml configuration file. At this stage, it will be almost empty; we just want to run the Alertmanager instance. Eventually, we will configure our alert routing and receivers in this file.

# alertmanager.yml
route:
  receiver: default

receivers:
  - name: default

In the next step, we specify the Alertmanager image and the location of the alertmanager.yml configuration file. All of this is configured within the Docker Compose file:

# docker-compose.yml
version: "3.7"

services:
  prometheus:
    image: prom/prometheus
    ports:
      - target: 9090
        published: 9090
    volumes:
      - type: bind
        source: ./prometheus.yml
        target: /etc/prometheus/prometheus.yml
      - type: bind
        source: ./rules.yml
        target: /etc/prometheus/rules.yml
      - ./prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'

  alertmanager:
    image: prom/alertmanager:v0.25.0
    ports:
      - 9093:9093
    volumes:
      - type: bind
        source: ./alertmanager.yml
        target: /etc/alertmanager/config.yml
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'

  web-app:
    image: web-app-monitoring:v0
    ports:
      - target: 8000
        published: 8080

We also need to tell the Prometheus server the address of the running Alertmanager instance. There is a dedicated alerting section in the prometheus.yml file for this:

# prometheus.yml

global:
  scrape_interval: 15s
rule_files:
  - "rules.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
scrape_configs:
  - job_name: "prometheus"
    scrape_interval: 5s
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "web-app"
    scrape_interval: 5s
    static_configs:
      - targets: ["web-app:8000"]

When we restart our Docker Compose stack, we will see that the Alertmanager instance is running and available at http://localhost:9093.
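
A quick way to verify this from the command line is to hit Alertmanager's health endpoint and, using the amtool binary bundled in the image, list the alerts it currently knows about:

$ curl -s http://localhost:9093/-/healthy
$ docker-compose exec alertmanager amtool alert --alertmanager.url=http://localhost:9093

At this point the alert list will be empty unless an alert is already firing on the Prometheus side.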

Notifications configuration

In the Alertmanager configuration file, we can specify routing rules for alerts and set up notification receivers. For the purposes of this post, we will create a simple Slack integration for our Alertmanager instance.

First, you’ll need to create a Slack app and obtain an incoming webhook URL. To do this, go to https://api.slack.com/apps and create a new app. Once you’ve created the app, go to the “Incoming Webhooks” section, activate incoming webhooks, and add a new webhook for the channel you want to post to. Finally, copy the generated webhook URL; it will be used as the api_url in the Alertmanager configuration.

In the next step, add the following configuration to the route and receivers section in the alertmanager.yml file:

# alertmanager.yml

route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: slack
receivers:
  - name: slack
    slack_configs:
      - channel: "#monitoring"
        send_resolved: true
        api_url: "https://hooks.slack.com/services/XXXXX"
        title: Alert
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.title }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
            {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
            {{ end }}
          {{ end }}
  - name: default

This will route all severity: critical alerts to the slack receiver and send a Slack notification to the #monitoring channel. In the api_url field, you should put your own webhook URL obtained during the Slack app setup.
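
The grouping behaviour mentioned earlier can also be tuned on the route. The fragment below shows the typical knobs; the values are illustrative defaults, not something this setup requires:

# alertmanager.yml (optional grouping settings, illustrative values)
route:
  receiver: default
  group_by: ["alertname", "job"]   # alerts sharing these labels are batched into one notification
  group_wait: 30s                  # how long to wait before sending the first notification for a new group
  group_interval: 5m               # how long to wait before notifying about new alerts added to the group
  repeat_interval: 4h              # how often to re-send a notification while the alert keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: slack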

We can now restart our Docker Compose stack and kill the web-app instance again to trigger the ApplicationDown alert. It should result in the alert firing and a notification being sent to the configured Slack channel.
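
If you prefer not to kill the application, a test alert can also be pushed directly to Alertmanager's API to exercise the Slack route (this bypasses Prometheus entirely; the alert name below is arbitrary):

$ curl -s -XPOST http://localhost:9093/api/v2/alerts \
    -H "Content-Type: application/json" \
    -d '[{"labels": {"alertname": "ManualTest", "severity": "critical"}}]'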

Conclusion

In conclusion, Alertmanager is a powerful tool that allows for efficient alert management and notification handling. It integrates seamlessly with Prometheus and supports a variety of notification channels. By properly configuring Alertmanager and defining appropriate alerting rules, organizations can ensure that they are notified promptly of important issues and can take action to prevent or mitigate downtime. With its ease of use and flexible configuration options, Alertmanager is an essential tool for any organization using Prometheus for monitoring and alerting.

If you are interested in the configuration files presented in this post, you can find them here.

Written on May 10, 2023