
Prometheus

  • Prometheus - the time series database and server
  • Exporter - a tool that exposes metrics for Prometheus to collect
  • Alertmanager - handles notifications and everything related to them

Basic startup parameters

We will start looking at the configuration file in the next task, but now let’s look at the main Prometheus startup flags:

--config.file="prometheus.yml" - which configuration file to use;

--web.listen-address="0.0.0.0:9090" - the address that the embedded web server will listen to;

--web.enable-admin-api - enable the administrative API endpoints of the web interface;

--web.console.templates="consoles" - path to the directory with html templates;

--web.console.libraries="console_libraries" - path to the directory with libraries for templates;

--web.page-title - the title of the web page;

--web.cors.origin=".*" - CORS settings for web-interface;

--storage.tsdb.path="data/" - path for storing time series database;

--storage.tsdb.retention.time - default metrics retention time is 15 days, anything older will be deleted;

--storage.tsdb.retention.size - TSDB size, after which Prometheus will start deleting the oldest data;

--query.max-concurrency - maximum number of PromQL queries Prometheus will execute concurrently;

--query.timeout=2m - maximum execution time of one query;

--enable-feature - flag to enable additional (often experimental) features;

--log.level - logging level

Let’s take a look at the main files:

console_libraries - a directory that contains libraries for html templates;

consoles - directory that contains html page templates for the web interface;

prometheus - Prometheus server executable;

prometheus.yml - Prometheus configuration file;

promtool - utility for checking configuration, working with the TSDB, and querying metrics.

Push and Pull Model:

Prometheus uses a Pull model (also called Scraping) to collect metrics, meaning the Prometheus server will reach out to specified services by calling their configured HTTP endpoint to pull those metrics.

For example, the configuration defined in the prometheus.yml file tells the Prometheus server to fetch metrics every 15s from the specified endpoints.
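A minimal prometheus.yml along these lines (a sketch; the self-scrape job below also appears in the full configs later in this document) makes Prometheus poll every listed target every 15 seconds:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']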


Install

Install Prometheus, Alertmanager, Node Exporter and the Nginx exporter from the distribution repositories:

sudo apt install prometheus prometheus-alertmanager prometheus-node-exporter prometheus-nginx-exporter

Check that all components are running:

ps auxf | grep prometheus

If the nginx exporter does not start, configure the stub_status endpoint in nginx:

sudo nano /etc/nginx/sites-available/site-local.conf

If the nginx exporter is not on the local machine, change 127.0.0.1 to the IP address of the exporter host:

listen 8080 default_server;

location /stub_status {
    stub_status;
    allow 127.0.0.1;
    deny all; 
}

After that, restart nginx:

sudo systemctl restart nginx

and the nginx exporter:

sudo systemctl restart prometheus-nginx-exporter

and check its status:

sudo systemctl status prometheus-nginx-exporter

Advanced options can be edited in

sudo nano /etc/prometheus/prometheus.yml

Add an nginx job to the scrape_configs section (a minimal sketch follows):
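A sketch of such a job, assuming the exporter runs locally on its usual default port 9113:

  - job_name: 'nginx'
    static_configs:
      - targets: ['localhost:9113']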

Install Prometheus from Binary

Download prometheus-2.48.0.linux-amd64.tar.gz:

cd /opt/
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
Unpack it:

tar -zxf prometheus-2.48.0.linux-amd64.tar.gz

Delete the Prometheus tar.gz archive:

rm -rf prometheus-2.48.0.linux-amd64.tar.gz

Rename the directory to prometheus:

mv prometheus-2.48.0.linux-amd64 prometheus

Enter the directory:

cd prometheus

Run Prometheus (default web UI on localhost:9090):

./prometheus

Test

rate(prometheus_http_requests_total[1m])

Configure Prometheus for systemd

useradd -s /sbin/nologin -d /opt/prometheus prometheus
chown -R prometheus:prometheus /opt/prometheus

Copy & paste:

nano /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --web.console.templates=/opt/prometheus/consoles \
    --web.console.libraries=/opt/prometheus/console_libraries

[Install]
WantedBy=default.target
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
systemctl status prometheus
ps ax | grep prometheus

See all Metrics

localhost:9090/metrics

Test with Promtool

Test config

./promtool check config prometheus.yml
./promtool query instant http://localhost:9090 up
up{instance="localhost:9090", job="prometheus"} => 1 @[1617970111.787]
up{instance="localhost:9100", job="node"} => 1 @[1617970111.787]
root@node1:~# ./promtool query instant http://localhost:9090 'up{job="node"}'
up{instance="localhost:9100", job="node"} => 1 @[1617970210.143]
./promtool query instant http://localhost:9090 'node_disk_written_bytes_total'
node_disk_written_bytes_total{device="vda", instance="localhost:9100", job="node"} => 52323148800 @[1617970275.39]
./promtool query instant http://localhost:9090 'node_disk_written_bytes_total'
node_disk_written_bytes_total{device="vda", env="dev", instance="localhost:9100", job="node"} => 52338947072 @[1617970356.777]
node_disk_written_bytes_total{device="vda", instance="localhost:9100", job="node"} => 52332880896 @[1617970356.777]

Exporters

Node_Exporter from Binary

Download node_exporter-1.7.0.linux-amd64.tar.gz

cd /opt/
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

Unpack it:

tar -xvf node_exporter-1.7.0.linux-amd64.tar.gz 

Remove archive tar.gz

rm -f node_exporter-1.7.0.linux-amd64.tar.gz 

Rename the directory to node_exporter:

mv node_exporter-1.7.0.linux-amd64 node_exporter 
cd node_exporter

Run Node-Exporter (default port localhost:9100)

./node_exporter

See all metrics

http://localhost:9100/metrics

Check that the target has been added: localhost:9090 => Status => Targets

Collectors

./node_exporter --help 2>&1 | grep collector

In addition to selecting the collectors that will give us different metrics, we can specify the following startup parameters (an example invocation follows the list):

--web.listen-address=":9100" - the address and port where the node_exporter will wait for incoming connections;

--web.telemetry-path="/metrics" - URL where our metrics will be available;

--web.max-requests - the maximum number of simultaneous requests to receive metrics on the port specified in the first parameter;

--web.config="" - path to a configuration file enabling TLS and/or basic authentication (experimental).
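A hedged example combining these flags (the values shown are illustrative):

./node_exporter --web.listen-address=":9100" --web.telemetry-path="/metrics" --web.max-requests=40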

Configure Node Exporter for systemd

useradd -s /sbin/nologin -d /opt/node_exporter node_exporter
chown -R node_exporter:node_exporter /opt/node_exporter

Copy & paste:

nano /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
Type=simple
ExecStart=/opt/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target

Start the service with systemd

sudo systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
systemctl status node_exporter

On the Prometheus server, don’t forget to add a static_configs entry to prometheus.yml so this data is collected!

BEST PRACTICE: The official documentation does NOT recommend running node_exporter in docker, because it needs access to the host system to get all metrics. And you will need to mount all file systems inside the docker container. So it is much easier to run it as a service from the example above.

Test all metrics

curl localhost:9100/metrics

Connect Node_Exporter to Prometheus

In prometheus.yml on the Prometheus server:

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']
Then restart Prometheus:

systemctl restart prometheus

Redis_Exporter from Binary

Install Redis server

apt install redis-server

check status

systemctl status redis-server

Connect to Redis via redis-cli:

redis-cli
info

Install Redis-Exporter

Download

cd /opt/
wget https://github.com/oliver006/redis_exporter/releases/download/v1.55.0/redis_exporter-v1.55.0.linux-amd64.tar.gz
tar -xvf redis_exporter-v1.55.0.linux-amd64.tar.gz

Remove archive tar.gz

rm -rf redis_exporter-v1.55.0.linux-amd64.tar.gz 

Rename the directory to redis_exporter:

mv redis_exporter-v1.55.0.linux-amd64 redis_exporter
cd redis_exporter

Run Redis Exporter (default port localhost:9121/metrics):

./redis_exporter
./redis_exporter --help

Configure Redis Exporter for systemd

Add user

useradd -s /sbin/nologin -d /opt/redis_exporter redis_exporter
chown -R redis_exporter:redis_exporter /opt/redis_exporter

Copy & paste:

nano /etc/systemd/system/redis_exporter.service
[Unit]
Description=Redis Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=redis_exporter
Group=redis_exporter
Type=simple
ExecStart=/opt/redis_exporter/redis_exporter
RestartSec=5s
Restart=on-success

[Install]
WantedBy=multi-user.target

Another option is to pass the Redis connection parameters directly on the command line instead of relying on a local Redis config:

ExecStart=/usr/bin/redis_exporter \
    -web.listen-address ":9121" \
    -redis.addr "redis://ip.of.redis.server:6379" \
    -redis.password "your-strong-redis-password"
Start and enable the service, then check its status:

systemctl start redis_exporter.service
systemctl enable redis_exporter.service
systemctl status redis_exporter.service

Connect Redis_Exporter to Prometheus

Edit prometheus.yml

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']
Then restart Prometheus:

systemctl restart prometheus

PostgreSQL from Repository

install postgresql-14

apt install  postgresql-14

check if postgresql running

ps ax | grep post

switch user to postgres

su postgres

Enter psql, list the databases, and quit:

psql
\l
\q

PostgreSQL Server Configuration

The PostgreSQL server provides two different password encryption methods: md5 and scram-sha-256. Both password encryptions can be configured via the PostgreSQL config file postgresql.conf.

In this step, you’ll set up PostgreSQL to use the scram-sha-256 password encryption.

This example uses the PostgreSQL server v14 that is installed on an Ubuntu system, so the PostgreSQL configuration files is stored in the /etc/postgresql/14/main directory.

Change the working directory to /etc/postgresql/14/main and open the configuration file postgresql.conf via the nano editor command.

cd /etc/postgresql/14/main
sudo nano postgresql.conf

Uncomment the option password_encryption and change the value to scram-sha-256.

password_encryption = scram-sha-256     # scram-sha-256 or md5

Save the file and exit the editor when you are finished.

Next, open the config file pg_hba.conf via the nano editor command below. The file pg_hba.conf is the configuration where password authentication methods are defined for hosts or IP addresses.

sudo nano pg_hba.conf

Change the default authentication methods for the host 127.0.0.1/32 and ::1/128 to scram-sha-256. With this, the authentication method scram-sha-256 will be used for every client connection to the PostgreSQL server 127.0.0.1.

# "local" is for Unix domain socket connections only
local   all             all                                     peer
# IPv4 local connections:
host    all             all             127.0.0.1/32            scram-sha-256
# IPv6 local connections:
host    all             all             ::1/128                 scram-sha-256

Save and exit the editor when you’re done.

Lastly, run the below systemctl command utility to restart the PostgreSQL service and apply the changes.

sudo systemctl restart postgresql
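Optionally, you can double-check that the new encryption method is active by querying the setting from psql:

sudo -u postgres psql -c 'SHOW password_encryption;'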

Postgres_Exporter from Binary

cd /opt/
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-arm64.tar.gz
tar -xvf postgres_exporter-0.15.0.linux-arm64.tar.gz

Remove archive tar.gz

rm -f postgres_exporter-0.15.0.linux-arm64.tar.gz

Rename the directory to postgres_exporter:

mv postgres_exporter-0.15.0.linux-arm64 postgres_exporter
cd postgres_exporter

Run postgres_exporter (metrics at localhost:9187/metrics):

DATA_SOURCE_NAME=postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable ./postgres_exporter

Configuring Postgres Exporter

In this step, you’ll configure postgres_exporter to gather PostgreSQL metrics by defining the PostgreSQL user and password. You’ll also set up and configure the systemd service for postgres_exporter.

With the postgres_exporter, you can expose metrics for all available databases on the PostgreSQL server, or you can expose specific databases that you want to monitor. You can also use secure SSL mode or non-SSL mode.

Move the current working directory to /opt/postgres_exporter via the cd command below.

cd /opt/postgres_exporter

Now create a new file .env using the below nano editor command.

nano .env

Add the following lines to the file. Also, be sure to change the details of the PostgreSQL user, password, and host. With this .env file, you’ll scrape and gather PostgreSQL metrics from all available databases. To gather metrics from a specific PostgreSQL database instead, adjust the config file accordingly.

# Format
#DATA_SOURCE_NAME=postgresql://username:password@localhost:5432/postgres?sslmode=disable

# Monitor all databases via postgres_exporter
DATA_SOURCE_NAME="postgresql://postgres:strongpostgrespassword@localhost:5432/?sslmode=disable"

# Monitor specific databases on the PostgreSQL server
# DATA_SOURCE_NAME="postgresql://username:password@localhost:5432/database-name?sslmode=disable"

Save the file and exit the editor when you’re finished.

Configure Postgres Exporter for systemd

Create a new system user postgres_exporter

useradd -s /sbin/nologin -d /opt/postgres_exporter postgres_exporter

Change ownership of the /opt/postgres_exporter directory to the user postgres_exporter:

chown -R postgres_exporter:postgres_exporter /opt/postgres_exporter

Copy & paste:

sudo nano /etc/systemd/system/postgres_exporter.service
[Unit]
Description=Postgresql Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=postgres_exporter
Group=postgres_exporter
WorkingDirectory=/opt/postgres_exporter
EnvironmentFile=/opt/postgres_exporter/.env
ExecStart=/opt/postgres_exporter/postgres_exporter --web.listen-address=:9187 --web.telemetry-path=/metrics
Restart=always

[Install]
WantedBy=multi-user.target

Reload the systemd manager and apply the changes.

sudo systemctl daemon-reload

After the systemd manager is reloaded, start and enable the postgres_exporter service via the systemctl command utility below.

sudo systemctl start postgres_exporter
sudo systemctl enable postgres_exporter

The postgres_exporter should now be running and scraping metrics from the PostgreSQL server. It should also be enabled so that it starts automatically on boot.

You’ll receive the output similar to this - the postgres_exporter service is running and it’s enabled.

sudo systemctl status postgres_exporter

Setting up the Firewall

In this step, you’ll set up the firewall to open the default postgres_exporter port, TCP 9187. After that, you’ll verify that the postgres_exporter metrics are accessible via the web browser.

For Ubuntu systems that use UFW as the firewall, run the ufw command below to add port 9187 to the firewall. Then, reload the firewall to apply the changes.

sudo ufw allow 9187/tcp
sudo ufw reload

You can now verify the list of ports on UFW via the ufw command below.

sudo ufw status

Connect Postgres_Exporter to Prometheus

Edit prometheus.yml

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']
Find the Prometheus process ID:

ps aux | grep prom

Reload process

kill -HUP process_id

Or restart the Prometheus service and apply the changes.

sudo systemctl restart prometheus

Tip: Create a PostgreSQL superuser for easy connection

su - postgres

psql

create user prom with password '12345678';

alter user prom superuser;
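If you use this prom superuser, the exporter's DATA_SOURCE_NAME (for example in the .env file from the previous section) could look like this - the password is just the example one from the tip above:

DATA_SOURCE_NAME="postgresql://prom:12345678@localhost:5432/postgres?sslmode=disable"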

MyAPP_Exporter

If you have your own application, you need to install Golang and integrate the Golang Prometheus client.

Download Golang

wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz

unpack tar.gz

tar  -xzvf go1.21.5.linux-amd64.tar.gz

Remove the archive and add the Go environment variables to .bashrc:

rm -rf go1.21.5.linux-amd64.tar.gz
echo 'export GOROOT=$HOME/go' >> .bashrc
echo 'export PATH=$PATH:$GOROOT/bin' >> .bashrc

The first variable, GOROOT, is needed by the Go toolchain itself: thanks to it, Go finds the path to its built-in libraries. The second variable is required so that we do not have to write the full path to the go binary; the system finds it by itself. Now re-login (or source ~/.bashrc) and check that Golang has been successfully installed:

go version

Creating the application

Create directory for our project:

mkdir -p app/src/app
cd app/src/app

Inside it we will create a simple main.go file

package main

// Import basic modules - fmt, log, net/http, time, math/rand
import (
    "fmt"
    "log"
    "net/http"
    "time"
    "math/rand"

    // Add Prometheus client library import
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
        // Metric for calculating request processing time
        apiDurations = prometheus.NewSummary(
                prometheus.SummaryOpts{
                        Name:       "app_api_durations_seconds",
                        Help:       "API latency distributions.",
                        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
                })

        // Metric for counting the number of incoming requests
        apiProcessed = prometheus.NewCounter(prometheus.CounterOpts{
                Name: "app_api_processed_ops_total",
                Help: "The total number of processed requests",
        })
)

// A function that will handle all requests to our web server
func handler(w http.ResponseWriter, r *http.Request) {
    // Increase the counter of the number of incoming requests
    apiProcessed.Inc()
    
    // Time to start processing the request
    start := time.Now()
    
    // Sleep for a random number of seconds - 0 or 1 (rand.Intn(2))
    time.Sleep(time.Duration(rand.Intn(2)) * time.Second)
    // In response, we send the path that the user has requested
    fmt.Fprintf(w, "%s\n", r.URL.Path)

    // After the end of processing, count how much time has passed since the start of processing
    duration := time.Since(start)
    // Save the processing time to a metric
    apiDurations.Observe(duration.Seconds())
}

// The main function is the starting point of our program
func main() {
    // Register our metrics
    prometheus.MustRegister(apiDurations)
    prometheus.MustRegister(apiProcessed)

    // Initialize the random number generator
    rand.Seed(time.Now().UnixNano())
    // Register the handler function for requests to / - any HTTP path will be handled by it
    http.HandleFunc("/", handler)


    // When requesting the path "/metrics" we will output metrics in Prometheus format
    http.Handle("/metrics", promhttp.Handler())

    // start server on port 8080
    log.Fatal(http.ListenAndServe(":8080", nil))
}

Build project

With Go 1.21, initialize a module and fetch the Prometheus client library first, then build:

go mod init app
go mod tidy
go build -o app main.go

Run

./app

For testing:

curl localhost:8080/test
curl localhost:8080/test/user/path
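You can also check that the custom metrics defined in main.go are exposed:

curl -s localhost:8080/metrics | grep app_api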

Connect MyApp to Prometheus

Edit file prometheus.yml

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']
  
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']

Pushgateway from Binary

The Pushgateway is an intermediary service which allows you to push metrics from jobs which cannot be scraped. For details, see Pushing metrics.

Use it only for short-lived jobs, such as cron tasks or anything else that finishes quickly.

Main startup options (an example invocation follows the list):

--web.config.file="" - [EXPERIMENTAL] Path to configuration file that can enable TLS or authentication. See: https://github.com/prometheus/exporter-toolkit/blob/master/docs/web-configuration.md

--web.listen-address=:9091 - Addresses on which to expose metrics and web interface. Repeatable for multiple addresses.

--web.telemetry-path="/metrics" - Path where the metrics for Prometheus will be available.

--web.enable-lifecycle - Enable shutdown via HTTP request.

--web.enable-admin-api - Enable API endpoints for admin control actions.

--persistence.file="/var/lib/prometheus/pushgateway.data" - File to persist metrics. If empty, metrics are only kept in memory.

--persistence.interval=5m - The minimum interval at which to write out the persistence file.
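A sketch of an invocation combining these options (the paths and intervals are illustrative):

./pushgateway \
    --web.listen-address=":9091" \
    --persistence.file="/opt/pushgateway/pushgateway.data" \
    --persistence.interval=5m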

Download

cd /opt/
wget https://github.com/prometheus/pushgateway/releases/download/v1.6.2/pushgateway-1.6.2.linux-arm64.tar.gz

Unpack

tar -xvf pushgateway-1.6.2.linux-arm64.tar.gz

Remove archive

rm -rf pushgateway-1.6.2.linux-arm64.tar.gz

Rename

mv pushgateway-1.6.2.linux-arm64 pushgateway

Run (default port 127.0.0.1:9091)

./pushgateway

Configure Pushgateway for systemd

Create a new system user pushgateway

useradd -s /sbin/nologin -d /opt/pushgateway pushgateway

Change ownership of the /opt/pushgateway directory to the user pushgateway:

chown -R pushgateway:pushgateway /opt/pushgateway

Copy & paste:

sudo nano /etc/systemd/system/pushgateway.service
[Unit]
Description=Prometheus Pushgateway
Wants=network-online.target
After=network-online.target

[Service]
User=pushgateway
Group=pushgateway
Type=simple
ExecStart=/opt/pushgateway/pushgateway

[Install]
WantedBy=multi-user.target

Start and enable the pushgateway service:

sudo systemctl enable pushgateway
sudo systemctl start pushgateway

Verify the service is running and serving metrics:

sudo systemctl status pushgateway
curl localhost:9091/metrics

Connect Pushgateway to Prometheus

Edit file prometheus.yml

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']
  
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'Pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']  
  • Of interest here is the honor_labels: true parameter. Let’s remember how Prometheus collects metrics. First of all, it accesses the exporter via HTTP at the path /metrics, from where it takes all metrics with their labels. After that it adds its own internal labels - job, which is specified in the job_name configuration, as well as instance - the endpoint from which these metrics were collected. In our case, we expose these labels ourselves in pushgateway: since it can collect data from different services, we pass the job name and the instance (e.g. db1) directly. The honor_labels: true option prohibits Prometheus from replacing these labels with its own and makes it keep the ones specified in pushgateway.

Restart Prometheus to load the new configuration:

sudo systemctl restart prometheus

Send Metrics to PushGateway

Sending basic metrics to pushgateway

First, it’s worth mentioning that pushgateway has a built-in web interface, which can be accessed using port 9091 (default). We will use the console to work with it. So, pushgateway provides a simple API that you can use to send metrics to it. After receiving the metrics, pushgateway can give them to Prometheus server via the /metrics endpoint. Let’s try to create our first metric using curl. Let’s say our cron script runs once an hour and takes money from users’ accounts. Then we can send information to pushgateway about the number of users processed. Or the total amount of debit:

echo "cron_app_processed_users 112" | curl --data-binary @- http://localhost:9091/metrics/job/cron_app

echo "cron_app_payed_sum 13423" | curl --data-binary @- http://localhost:9091/metrics/job/cron_app

In this example we sent two metrics - cron_app_processed_users - number of processed users with the value 112 and cron_app_payed_sum - total amount of debit in the amount of 13423. We also used the --data-binary option, which does not modify the data passed to it from stdin in any way, but just passes it in the POST request.

We used the path /metrics/job/cron_app. In this case, job is a label, and cron_app is its value. Remember that in scrape_config for Prometheus we set the job option, which was displayed on metrics as a label? This is exactly that label. Pushgateway will group your metrics by it. This is very useful when working with a large number of cron jobs or other short-lived tasks, and you can view the metrics grouped this way.

curl http://localhost:9091/metrics
# TYPE cron_app_payed_sum untyped
cron_app_payed_sum{instance="",job="cron_app"} 13423
# TYPE cron_app_processed_users untyped
cron_app_processed_users{instance="",job="cron_app"} 112

Let’s now try to see how our metrics will be given to Prometheus:

curl http://localhost:9091/metrics
# TYPE cron_app_payed_sum untyped
cron_app_payed_sum{instance="",job="cron_app"} 13423
# TYPE cron_app_processed_users untyped
cron_app_processed_users{instance="",job="cron_app"} 112

... skipped ...

# HELP push_failure_time_seconds Last Unix time when changing this group in the Pushgateway failed.
# TYPE push_failure_time_seconds gauge
push_failure_time_seconds{instance="",job="cron_app"} 0
# HELP push_time_seconds Last Unix time when changing this group in the Pushgateway succeeded.
# TYPE push_time_seconds gauge
push_time_seconds{instance="",job="cron_app"} 1.6208582543334155e+09

... skipped ...

Additional parameters when sending metrics :

In addition to the basic option of sending metrics, we can also use the advanced option - specifying the type of metric to be sent:

cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/cron_app
# TYPE cron_app_payed_sum gauge
cron_app_payed_sum 15487
# TYPE cron_app_processed_users gauge
# HELP cron_app_processed_users Processed Users Counter.
cron_app_processed_users 238
EOF

In this example we sent the same metrics as before, but we specified their type in the comment. We will talk more about data types in Prometheus in the promql assignment, now the main thing is to understand that you can specify the type of data to be sent.

Another important feature when passing metrics is the ability to tag your metrics. Let’s add a new metric to our application. For example, the number of users processed, categorized by groups.

For example our script processes two groups of users - corporate clients and individuals. We want to see how many users from one or the other group were processed. To do this, it is enough to send metrics in this format:

cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/cron_app
cron_app_users{type="corp"} 64
cron_app_users{type="man"} 47
EOF

That is, we specified the tag directly in our metric. Let’s see what happened with our metrics:

curl http://localhost:9091/metrics

# TYPE cron_app_payed_sum gauge
cron_app_payed_sum{instance="",job="cron_app"} 15487
# HELP cron_app_processed_users Processed Users Counter.
# TYPE cron_app_processed_users gauge
cron_app_processed_users{instance="",job="cron_app"} 238
# TYPE cron_app_users untyped
cron_app_users{instance="",job="cron_app",type="corp"} 64
cron_app_users{instance="",job="cron_app",type="man"} 47

... skipped ...

As you can see from the output, we have a new metric with different tags.

Dividing metrics into groups

In addition to adding labels directly in metrics, you can add tags via URLs. For example like this:

echo 'test_app_processed{type="users"} 12' | curl --data-binary @- http://localhost:9091/metrics/job/test_app/instance/db1

In this case, we added a new metric test_app_processed with three tags. One tag type=users, which was specified directly in the metric, the other two were specified via URL: job=test_app and instance=db1. Let’s see how this metric will look like in Prometheus format:

curl -s http://localhost:9091/metrics

... skipped ...
# TYPE test_app_processed untyped
test_app_processed{instance="db1",job="test_app",type="users"} 12

Delete Metrics

curl -X DELETE http://localhost:9091/metrics/job/test_app/instance/db1

And all metrics that belonged to this group will be removed. To summarize, pushgateway grouping allows you to flexibly manage incoming metrics.

Prometheus Service Discovery

Manually adding targets via files

Create a directory for the target files:

mkdir -p /opt/targets
cd /opt/targets

Now create targets.json, adding an entry for the Node Exporter:

[
  {
    "targets": [
      "localhost:9100"
    ],
    "labels": {
      "job": "node"
    }
  }
]

Validate the JSON:

cat targets.json | jq .

Then reference the targets file from prometheus.yml using file-based service discovery:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']
  
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'Pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']
  
  - job_name: 'file'
    file_sd_configs:
    - files:
      - '/opt/targets/*.json'
      refresh_interval: 10s

Go to the Prometheus web UI at localhost:9090 => Status => Service Discovery

DNS SRV(Service Record) method

global:
  scrape_interval:     15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'redis'
    static_configs:
      - targets: ['localhost:9121']

  - job_name: 'postgres'
    static_configs:
      - targets: ['localhost:9187']
  
  - job_name: 'myapp'
    static_configs:
      - targets: ['localhost:8080']

  - job_name: 'Pushgateway'
    honor_labels: true
    static_configs:
      - targets: ['localhost:9091']
  
  - job_name: 'file'
    file_sd_configs:
    - files:
      - '/opt/targets/*.json'
      refresh_interval: 10s

  - job_name: 'dns'
    dns_sd_configs:
    - names:
      - '_payment._tcp.api.prod.example.org'
      type: SRV
      refresh_interval: 120s

If you don’t want to bother with SRV record detection, you can use regular A types. The only difference is that you will have to specify the port to which prometheus should connect. In SRV record the port is specified, but in A record there is no such field, so the configuration will take the following form:

  - job_name: 'dns'
    dns_sd_configs:
    - names:
      - example.com
      type: A
      port: 9090
      refresh_interval: 10s
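Before relying on DNS service discovery, it can help to confirm that the records actually resolve; the names below are the example ones from the configs above:

dig +short SRV _payment._tcp.api.prod.example.org
dig +short A example.com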

PromQL

In Prometheus there are only 4 types of metrics (a small Go sketch follows the list):


  • Counter is a simple counter. Its peculiarity is that the value of this metric cannot decrease. For example, the processed_requests metric is the number of processed requests; it cannot decrease - with each new request it increases by one.

  • Gauge is a simple metric that can both increase and decrease. A good example of such a metric is temperature: at the moment it is 18 degrees, in a couple of hours it will be 21 degrees, and at night it will drop to 10 degrees.

  • Histogram collects several pieces of data about a metric - the total number of observations, their sum, as well as a breakdown into buckets. It is most often used to collect timing metrics (e.g. response time). All values are calculated on the server side.

  • Summary is essentially the same as a histogram, but all calculations are done on the client, so the client can eat up some percentage of CPU time. In return, summaries do not load the Prometheus server, unlike histograms.
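As a rough illustration, here is how all four types could be declared with the same Go client library used in the MyAPP example above (the metric names and the port are made up for this sketch):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter - can only go up (e.g. number of processed requests)
    processedRequests = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "demo_processed_requests_total",
        Help: "Total number of processed requests.",
    })

    // Gauge - can go up and down (e.g. temperature)
    temperature = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "demo_temperature_celsius",
        Help: "Current temperature in Celsius.",
    })

    // Histogram - the server aggregates observations into buckets
    requestDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "demo_request_duration_seconds",
        Help:    "Request latency distribution.",
        Buckets: prometheus.DefBuckets,
    })

    // Summary - quantiles are calculated on the client side
    requestDurationSummary = prometheus.NewSummary(prometheus.SummaryOpts{
        Name:       "demo_request_duration_summary_seconds",
        Help:       "Request latency quantiles.",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
    })
)

func main() {
    // Register all four metrics and expose them on /metrics
    prometheus.MustRegister(processedRequests, temperature, requestDuration, requestDurationSummary)

    processedRequests.Inc()              // a counter only increases
    temperature.Set(18)                  // a gauge can be set to any value
    requestDuration.Observe(0.42)        // a histogram puts the value into a bucket
    requestDurationSummary.Observe(0.42) // a summary updates its quantiles

    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":8081", nil))
}

Each type exposes the update methods that match its semantics: Inc() for the counter, Set() for the gauge, and Observe() for the histogram and summary.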

Data filtering

./promtool query instant http://localhost:9090/ up

up{instance="localhost:8080", job="app"} => 1 @[1622717586.873]
up{instance="localhost:9090", job="prometheus"} => 1 @[1622717586.873]

In this case, we have output the latest values of the up metric. To filter the values - we can add the tags we want to see there to the queries:

./promtool query instant http://localhost:9090/ "up{job='app'}"

up{instance="localhost:8080", job="app"} => 1 @[1622717644.331]

In this way prometheus shows us all up metrics that have a job tag identical to app. Besides direct comparison you can also use negation and even regexp:

./promtool query instant http://localhost:9090/ "up{job!='prometheus'}"

up{instance="localhost:8080", job="app"} => 1 @[1622717699.7]

./promtool query instant http://localhost:9090/ "up{job=~'..p'}"

up{instance="localhost:8080", job="app"} => 1 @[1622717711.238]

By the way, in Prometheus you can use the special label __name__, which contains the name of the metric. This way you can output all metrics with the name you are interested in with one query:

./promtool query instant http://localhost:9090/ "{__name__=~'^up$'}"

up{instance="localhost:8080", job="app"} => 0 @[1622759105.491]
up{instance="localhost:9090", job="prometheus"} => 1 @[1622759105.491]

In addition to filtering by tags, you can also filter metrics by current values - for example, let’s display all up metrics that have a value of 1:

./promtool query instant http://localhost:9090/ "up == 1"

up{instance="localhost:8080", job="app"} => 1 @[1622717784.272]
up{instance="localhost:9090", job="prometheus"} => 1 @[1622717784.272]

And now all up metrics whose value is not equal to 1:

./promtool query instant http://localhost:9090/ "up != 1"

Everything is logical - the output is empty - we don’t have such metrics. Or let’s output all up metrics that have a value greater than 0:

./promtool query instant http://localhost:9090/ "up > 0"

up{instance="localhost:8080", job="app"} => 1 @[1622717832.933]
up{instance="localhost:9090", job="prometheus"} => 1 @[1622717832.933]

Working with basic functions

https://prometheus.io/docs/prometheus/latest/querying/functions/

Instant Vectors are multiple time series that each reflect a single value at a specific time, for example:

./promtool query instant http://localhost:9090/ "up"
up{instance="localhost:8080", job="app"} => 1 @[1622719041.253]
up{instance="localhost:9090", job="prometheus"} => 1 @[1622719041.253]

This displays the value of the up metric for all tags for the same timestamp - 1622719041.253.

Range Vectors are multiple time series metrics that reflect multiple values over a specific time interval, for example:

./promtool query instant http://localhost:9090/ "up[2m]"
up{instance="localhost:8080", job="app"} =>
1 @[1622718987.573]
1 @[1622719002.573]
1 @[1622719017.573]
1 @[1622719032.573]
1 @[1622719047.573]
1 @[1622719062.573]
1 @[1622719077.573]
1 @[1622719092.573]
up{instance="localhost:9090", job="prometheus"} =>
1 @[1622718989.926]
1 @[1622719004.926]
1 @[1622719019.926]
1 @[1622719034.926]
1 @[1622719049.926]
1 @[1622719064.926]
1 @[1622719079.926]
1 @[1622719094.926]

This displays the value of the up metric for the last two minutes. When using range vector, you specify the time period in square brackets. You can also use tag filtering in front of them:

./promtool query instant http://localhost:9090/ "up{job='app'}[2m]"
up{instance="localhost:8080", job="app"} =>
1 @[1622719047.573]
1 @[1622719062.573]
1 @[1622719077.573]
1 @[1622719092.573]
1 @[1622719107.573]
1 @[1622719122.573]
1 @[1622719137.573]
1 @[1622719152.573]

Scalar - simple floating point numbers. You will be able to use them, for example, by adding to or multiplying by metrics.

Now let’s move on to the operators. In general, they are all intuitive:

  • + addition

  • - subtraction

  • * multiplication

  • / division

  • % modulus

  • ^ exponentiation

These operators work both between scalar values (5*5) and instant vectors, for example:

./promtool query instant http://localhost:9090/ "2 + 2"

scalar: 4 @[1622719951.177]

 ./promtool query instant http://localhost:9090/ "up + 1"

{instance="localhost:8080", job="app"} => 2 @[1622719953.617]
{instance="localhost:9090", job="prometheus"} => 2 @[1622719953.617]

./promtool query instant http://localhost:9090/ "up + up"

{instance="localhost:8080", job="app"} => 2 @[1622719955.02]
{instance="localhost:9090", job="prometheus"} => 2 @[1622719955.02]

In the first example, we add just two scalar values. In the second example we add a scalar value of 1 to the up metric. And in the third example we add two instant vectors.

In addition to standard additions there are also aggregation operators - they allow to aggregate values of one instant vector by different tags - for example, to summarize the number of requests between all hosts with the tag env=dev. Here are the main ones:

  • sum - summation
  • min - take the minimum value
  • max - take the maximum value
  • avg - take average value
  • count - count the number of elements in the vector

Let’s try to summarize all the values of all the up metrics:

./promtool query instant http://localhost:9090/ "sum(up)"
{} => 2 @[1622720202.092]

In this case prometheus took all the up metrics, discarded all the tags and summed their values - the result is 2.

You can do a job-by-job breakdown:

./promtool query instant http://localhost:9090/ "sum by (job) (up)"
{job="app"} => 1 @[1622720257.153]
{job="prometheus"} => 1 @[1622720257.153]

In this case, prometheus took the up metric, grouped them by job tag and summed up the values in each group. Since we have only one metric in each group (app and prometheus), the values are identical to the original ones.

Working with Functions in Prometheus

The rate function takes a range vector as input and then performs a simple conversion - it takes the difference between the last and the first value and divides it by the number of seconds in the interval - so you get the counter growth rate. This function works only with the counter data type. Let’s look at an example:

./promtool query instant http://localhost:9090/ "prometheus_http_requests_total{}"

prometheus_http_requests_total{code="200", handler="/api/v1/query", instance="localhost:9090", job="prometheus"} => 27 @[1622721465.532]
prometheus_http_requests_total{code="200", handler="/metrics", instance="localhost:9090", job="prometheus"} => 366 @[1622721465.532]
prometheus_http_requests_total{code="400", handler="/api/v1/query", instance="localhost:9090", job="prometheus"} => 9 @[1622721465.532]

 ./promtool query instant http://localhost:9090/ "rate(prometheus_http_requests_total{}[2m])"

{code="200", handler="/api/v1/query", instance="localhost:9090", job="prometheus"} => 0 @[1622721473.43]
{code="200", handler="/metrics", instance="localhost:9090", job="prometheus"} => 0.06666666666666667 @[1622721473.43]
{code="400", handler="/api/v1/query", instance="localhost:9090", job="prometheus"} => 0 @[1622721473.43]

The first time we just asked for our metrics and the second time we asked prometheus to calculate the average number of requests for the last two minutes. Prometheus took the current value of the prometheus_http_requests_total metric and then subtracted the value of the prometheus_http_requests_total metric that was 2 minutes ago and then divided the obtained value by 120 seconds (2 minutes). Thus we got the average number of requests per second.

We can also use the sum operator to summarize the number of requests for all the same instances:

./promtool query instant http://localhost:9090/ "sum by (instance) (rate(prometheus_http_requests_total{}[2m]))"

{instance="localhost:9090"} => 0.06666666666666667 @[1622721815.76]

Everything is the same here - we got the number of requests per second using rate, and after that we added up our metrics by instance tag - all those who had it matched were added up. In this way you can output, for example, the number of requests per second on all dev and prod servers with a simple query.

In addition to the rate function, there is the irate function - it does the same thing, but it takes the difference of the last and penultimate value instead of the first and last value. That is, it does not care what interval you specify - [2m] or [10m] - it will take the last and penultimate values for calculation

The next function in our list is delta. It works only with the gauge metrics type and calculates the difference of the first and the last value in the range vector. For example:

./promtool query instant http://localhost:9090/ "delta(prometheus_tsdb_blocks_loaded[2m])"

{instance="localhost:9090", job="prometheus"} => 0 @[1622722060.54]

So delta will take the last value of the prometheus_tsdb_blocks_loaded metric and subtract from it the value of the prometheus_tsdb_blocks_loaded metric from 2 minutes ago.

Another interesting function for gauge metrics is predict_linear - it predicts the value of a metric for a given number of seconds ahead. Linear regression is used. That is, in essence, a straight line is drawn using the average values in the past and future values are predicted on its basis:

./promtool query instant http://localhost:9090/ "predict_linear(prometheus_tsdb_blocks_loaded[120m], 120)"

{instance="localhost:9090", job="prometheus"} => 1.0048419915591933 @[1622722310.225]

In this case we took the values of the prometheus_tsdb_blocks_loaded metric for the last 120 minutes and tried to predict the value 120 seconds into the future. This function is very useful with available disk space - it can predict when the disk space will run out.

Let’s take a look at the function that is designed to work with histograms. Since summaries are calculated directly on the client, this function does not apply to them. Let’s try to calculate the 85th percentile of Prometheus’ HTTP response time:

./promtool query instant http://localhost:9090/ "histogram_quantile(0.85, prometheus_http_request_duration_seconds_bucket)"

{handler="/metrics", instance="localhost:9090", job="prometheus"} => 0.085 @[1622722586.628]
{handler="/api/v1/query", instance="localhost:9090", job="prometheus"} => 0.085 @[1622722586.628]

In this case, we use the histogram_quantile function with a parameter of 0.85 (that is, the 85th percentile), specifying the histogram metric to use - prometheus_http_request_duration_seconds_bucket. You can also calculate the value over a time window by adding the rate function:

./promtool query instant http://localhost:9090/ "histogram_quantile(0.85, rate(prometheus_http_request_duration_seconds_bucket[10m]))"

{handler="/api/v1/query", instance="localhost:9090", job="prometheus"} => 0.085 @[1622722648.458]
{handler="/metrics", instance="localhost:9090", job="prometheus"} => 0.085 @[1622722648.458]

This calculates the same metric, but over a 10 minute interval.

Before installing Alertmanager (Prometheus alerts)

Customizing Rules Alerts in Prometheus

To configure alerts in Prometheus you will need to specify the rule_files parameter in the main config file prometheus.yml, which tells Prometheus which files to read alerting rules from.

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "alerts.yml"

After that we can create an alerts.yml file (it should be in the same directory as prometheus.yml) with our host availability alert:

nano alerts.yml
groups:
- name: instances
  interval: 10s
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for 1 minute."

Here we use Go templates: https://pkg.go.dev/text/template

More ready-made alerts can be taken from https://samber.github.io/awesome-prometheus-alerts/

Let’s turn off our application that was polled by prometheus and look at the current alerts:

./promtool query instant http://localhost:9090/ "ALERTS{}"

ALERTS{alertname="InstanceDown", alertstate="firing", instance="localhost:8080", job="app", severity="critical"} => 1 @[1622757949.277]

There is a special ALERTS metric in Prometheus that allows you to see all current alerts. In our case we can see the alert alertname=InstanceDown and its labels. Please note that all labels of the triggered metric are added to the alert.

Testing of Alerting Rules

Create a test file test.yaml with the following content:

# Path to the file with alerting rules
rule_files:
    - alerts.yml

# How often the queries from the test rules are evaluated
evaluation_interval: 10s

tests:
    - interval: 15s # How often data for our metric is scraped
      input_series: # Test metric
          - series: 'up{job="prometheus", instance="localhost:9090"}' # Name and labels of the test metric
            values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0' # Values of the test metric - each value is effectively
                                                     # scraped every interval seconds. In our example the
                                                     # interval is 15s, so the first value (zero) is collected
                                                     # at 0 seconds, the second value (zero) at 15 seconds,
                                                     # the third at 30 seconds, and so on. This matters because
                                                     # you will also want to test edge cases - when for,
                                                     # evaluation_interval and scrape_interval differ and the
                                                     # alert may not fire.

      # Test rules
      alert_rule_test:
          - eval_time: 5m # Time elapsed since the test started - during this
                          # time the test collects all metric values
                          # (every 15 seconds) and evaluates our alerting
                          # rules every 10 seconds
            alertname: InstanceDown # Name of the alert being tested
            exp_alerts: # What we expect to get as output
                - exp_labels: # Which labels the alert should have
                      severity: critical
                      instance: localhost:9090
                      job: prometheus
                  exp_annotations: # Which annotations the alert should have
                      summary: "Instance localhost:9090 down"
                      description: "localhost:9090 of job prometheus has been down for 1 minute."

Try running our tests:

./promtool test rules test.yaml

Unit Testing:  test.yaml
  SUCCESS

Standard alerts for systems

The example will be about disk space:

  - alert: HostOutOfDiskSpace
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

The same example with disk space, but using the predict_linear function, which predicts the value 24 hours ahead:

  - alert: HostDiskWillFillIn24Hours
    expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

The following alert will help you quickly locate heavily loaded servers:

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

The amount of free memory:

  - alert: HostOutOfMemory
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

In this example, the alert is triggered if the network interfaces receive more than 100 MB of traffic per second. You can always adjust this value to suit your requirements.

  - alert: HostUnusualNetworkThroughputIn
    expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Alertmanager

Alertmanager from Binary

cd /opt
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz

Unpack

tar zxf alertmanager-0.26.0.linux-amd64.tar.gz

Rename

mv alertmanager-0.26.0.linux-amd64 alertmanager
rm -rf alertmanager-0.26.0.linux-amd64.tar.gz

Move to the alertmanager directory:

cd alertmanager

Run (default port localhost:9093)

./alertmanager
./alertmanager --config.file=alertmanager.yml

Alertmanager Systemd

useradd -s /sbin/nologin -d /opt/alertmanager alertmanager

Change ownership

chown alertmanager:alertmanager -R /opt/alertmanager

Create systemd unit file

nano /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
WorkingDirectory=/opt/alertmanager/
ExecStart=/opt/alertmanager/alertmanager --config.file=/opt/alertmanager/alertmanager.yml --web.external-url http://0.0.0.0:9093

[Install]
WantedBy=multi-user.target

Run the service (web UI on localhost:9093 => Alerts):

systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
systemctl status alertmanager

Basic Alertmanager Configuration

That’s actually the whole installation. Now let’s discuss the main features of alertmanager in more detail:

  • Grouping. This is a very important point - imagine that one node on your network crashed. And a bunch of alerts were triggered on it at once - node_exporter unavailability, mysql unavailability and so on. With the help of grouping you can group alerts into one group and send only one notification, not dozens or hundreds.

  • Inhibition. One more important thing about alertmanager is that it allows you not to send a new group of alerts if some of the specified alerts have already been sent. From the example above - you may not send a new alert that the site is unavailable, if you already have an alert that the database is unavailable - because most likely it is the root of all problems.

  • Silences. The ability to stop sending alerts for a period of time - for example, a weekend for a dev server or all alerts for a period of system maintenance.

  • High Availability - alertmanager can be run in cluster mode to ensure high availability. Prometheus will send information about alerts to all available alertmanagers, and they in turn will select a leader that will directly handle them. In case of leader’s failure - he will be replaced by another available alertmanager. This way you will not miss any alert.

Create alertmanager.yml file

nano /opt/alertmanager/alertmanager.yml

Pagerduty and Slack

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'web.hook'
  routes:
  - receiver: 'pagerduty-critical'
    group_wait: 10s
    matchers:
    - severity="critical"
  - receiver: 'slack-warning'
    group_wait: 60s
    matchers:
    - severity="warning"

receivers:
- name: 'web.hook'
  webhook_configs:
  - url: 'http://127.0.0.1:5001/'
- name: 'pagerduty-critical'
  pagerduty_configs:
  - routing_key: 'test'
    service_key: 'md5-hash'
- name: 'slack-warning'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/TOKEN'
    channel: '#warnings'


inhibit_rules:
  - source_matchers:
    - severity="critical"
    target_matchers:
    - severity="warning"
    equal: ['instance']

Test Configuration of Alertmanager

Here it is worth mentioning an interesting utility amtool - it allows, for example, to check the alertmanager configuration for correctness. In fact, it is similar to the promtool utility for prometheus. examples:

./amtool check-config alertmanager.yml
Checking 'alertmanager.yml'  SUCCESS
Found:
 - global config
 - route
 - 1 inhibit rules
 - 3 receivers
 - 0 templates

Test routing for your Alerts:

amtool config routes --config.file alertmanager.yml

Routing tree:
.
└── default-route  receiver: web.hook
    ├── {severity="critical"}  receiver: pagerduty-critical
    └── {severity="warning"}  receiver: slack-warning
./amtool config routes test severity="warning" --config.file alertmanager.yml

slack-warning
./amtool config routes test severity="info" --config.file alertmanager.yml

web.hook

Connect Alertmanager to Prometheus

The only thing you need to change is Prometheus’ prometheus.yml:

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

Connecting Notifications in Telegram

In fact, from Alertmanager’s point of view everything is simple - we just need to send a webhook to the Telegram API to deliver a notification. But on the Telegram side there are a few steps to take. Let’s go through them all:

  • First, you need to create a new bot. To do this, you should turn to the bot @BotFather. Directly type this nickname in the Telegram search and then type the command /newbot. Specify its name - for example rebrainme_notify, its nickname (it must end with bot) - rebrainme_notify_bot. In response botFather will give you an API key to access the bot using API. Save this token - we will use it further as $TOKEN.

  • Find your bot by searching for the nickname - rebrainme_notify_bot and send it a message - to do this, first press the start command and then write anything. Yes, it won’t reply, but we don’t need to - we just need to know our chat_id, which appeared when we sent it a message.

To find out the chat_id, you can send an http request using curl:

$ curl -s https://api.telegram.org/bot$TOKEN/getUpdates | jq .
{
  "ok": true,
  "result": [
    {
      "update_id": 736752606,
      "message": {
        "message_id": 2,
        "from": {
          "id": 113317645,
          "is_bot": false,
          "first_name": "Vasiliy",
          "last_name": "Ozerov",
          "username": "vasiliyozerov",
          "language_code": "en"
        },
        "chat": {
          "id": 113317645,
          "first_name": "Vasiliy",
          "last_name": "Ozerov",
          "username": "vasiliyozerov",
          "type": "private"
        },
        "date": 1623194263,
        "text": "gfhjkm"
      }
    }
  ]
}

Instead of $TOKEN you need to substitute your token, which was given to you by BotFather in the first step. And you can get your chat_id - in our case it is 113317645.

Now you can send messages through the bot using api requests, using the bot’s $TOKEN and the chat_id obtained in the previous step. Let’s try sending a test message:

curl -s "https://api.telegram.org/bot$TOKEN/sendMessage?chat_id=113317645&text=Heya"

{"ok":true,"result":{"message_id":3,"from":{"id":1765727458,"is_bot":true,"first_name":"rebrainme_notify","username":"rebrainme_notify_bot"},"chat":{"id":113317645,"first_name":"Vasiliy","last_name":"Ozerov","username":"vasiliyozerov","type":"private"},"date":1623194449,"text":"Heya"}}

Configure alertmanager

The official configuration documentation is on https://prometheus.io/docs/alerting/latest/configuration

For my tests, I added these 2 snippets.

route:
  routes:
  - match:
      severity: test
    receiver: test-telegram
receivers:
 
- name: 'test-telegram'
  telegram_configs:
  - bot_token: YOUR_BOT_TOKEN
    api_url: https://api.telegram.org
    chat_id: YOUR_CHAT_ID
    parse_mode: ''

If you need a proxy, add the http_config section below:

receivers:
- name: 'stardata-telegram'
  telegram_configs:
  - bot_token: YOUR_BOT_TOKEN
    api_url: https://api.telegram.org
    chat_id: YOUR_CHAT_ID
    parse_mode: ''
    http_config:
      proxy_url: 'http://your-proxy-server-if-required:3128'

Test the configuration

To create a temporary alert to test the configuration, run

amtool --alertmanager.url=http://localhost:9093/ alert add alertname="test123" severity="test" job="test-alert" instance="localhost" exporter="none" cluster="test"
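
If everything is wired up correctly, the alert should be visible in Alertmanager; amtool can query it back (the alertname is the one used in the test command above):

amtool --alertmanager.url=http://localhost:9093/ alert query alertname="test123"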

Prometheus Federation

High Availability

First things first, let's tackle the question of high availability. Unlike Alertmanager, which allows clustering and communication between multiple instances, Prometheus doesn't extend the same privilege. However, the truth is, clustering Prometheus is not needed to achieve high availability. This tool has been ingeniously engineered to maintain high availability without requiring server communication.

In fact, setting up a highly available Prometheus system is a breeze just by running multiple Prometheus servers with identical configurations. prometheus8

Without further ado, I’ll explain how to set up this HA in a few steps, if you don’t mind.

Let’s assume that we have two Prometheus servers, Prometheus_DC1 and Prometheus_DC2.

To set up the HA, we have to copy all Prometheus configurations from Prometheus_DC1 to Prometheus_DC2 :

  • Copy the Prometheus configuration file /etc/prometheus/prometheus.yml from the Prometheus_DC1 to Prometheus_DC2.

  • Copy the rules configuration /etc/prometheus/rules/rules.yaml from the Prometheus_DC1 to Prometheus_DC2

  • On Prometheus_DC2, enable the Prometheus service, and start it.

sudo systemctl enable prometheus
sudo systemctl start prometheus

And that’s it. We don’t need anything else to configure HA at the Prometheus level. If we wish, we can add a load balancer configured in round-robin. It can be helpful if we’re using Grafana.
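
A quick sanity check that both replicas are up is to hit the /-/healthy endpoint of each server (the hostnames below are placeholders for Prometheus_DC1 and Prometheus_DC2):

curl -s http://prometheus-dc1:9090/-/healthy
curl -s http://prometheus-dc2:9090/-/healthy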

Federation

When we start monitoring with Prometheus, we typically begin with one server per datacenter or environment, which is sufficient due to its efficiency and lower failure rates. However, as operational overhead and performance issues arise, it may be necessary to split the Prometheus server into separate servers for network, infrastructure, and application monitoring, known as vertical sharding. Teams can also run their own Prometheus servers for greater flexibility and control over target labels and scrape intervals. Social factors often lead to Prometheus servers being split before performance concerns arise.

But how do we aggregate all this data? To perform global aggregations, federation allows for a global Prometheus server to pull aggregated metrics from datacenter Prometheus servers. This ensures reliability and simplicity in monitoring systems, particularly for graphing and alerting purposes.

We can easily share our metric data with Federation across different Prometheus servers.

For informational purposes, there are two types of Federation

  • Hierarchical Federation

This means we have bigger Prometheus servers that collect time-series data from smaller ones. We have a top-down approach where data is gathered from different levels.

prometheus7

  • Cross-Service Federation

This method involves one Prometheus server monitoring a particular service or group of services, gathering specific time-series data from another server that is monitoring a different set of services. By doing this, we can run queries and alerts on the merged data from both servers.

prometheus9

Implementing a hierarchical Federation

For this blog, we will assume a configuration with two servers to be monitored – one in Basel and the other in Zurich, each hosting a node_exporter. These servers will be monitored by dedicated Prometheus servers named dc1 and dc2, respectively. Finally, the metrics from these two Prometheus servers will be aggregated by a federation server.

prometheus

From the Prometheus user interface, we can observe that the “Office Server” jobs gather metrics from the Basel and Zurich servers through their respective endpoints http://dbi-basel-srv1:9100/metrics and http://dbi-zurich-srv1:9100/metrics, via the Prometheus servers DC1 and DC2.

Prometheus DC1: prometheus11

Prometheus DC2: prometheus12

Now, we can log in to our third Prometheus server (the Federate one) and edit the Prometheus configuration file /etc/prometheus/prometheus.yml.

Add the following block at the bottom of the file, just after the static configuration of the “Prometheus” job name.

# Federation configuration
  - job_name: 'federation'
    scrape_interval: 15s
    
    honor_labels: true
    metrics_path: '/federate'
    
    params:
      'match[]':
        - '{job!~"prometheus"}'
        - '{__name__=~"job:.*"}'
    
    static_configs:
    - targets:
      - 'prometheus-dc-1:9090'
      - 'prometheus-dc-2:9090'
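
Before restarting, you can also query the /federate endpoint of one of the datacenter servers directly to preview what the federation job will scrape (a sketch using the same match[] selector as above):

curl -sG --data-urlencode 'match[]={__name__=~"job:.*"}' http://prometheus-dc-1:9090/federate | head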

check config

./promtool check config ./prometheus.yml

Once configured, we enable the Federal Prometheus Server

sudo systemctl enable prometheus

And start Prometheus

sudo systemctl start prometheus

We can access the Federal Prometheus Server through the web UI at http://<federation-server-address>:9090.

Then click on Status and finally on Targets.

prometheus12

Creating new metrics via recording rules

Let's look at a simple example. First, let's configure recording (aggregation) rules on the local Prometheus host. We haven't covered recording rules yet, but they are very similar to alerting rules - they are evaluated at a given interval and essentially generate a new metric based on your query. This way you can create new metrics that will be stored in Prometheus.

So we have a local prometheus that collects data from all nodes in the datacenter via a simple node_exporter. Let’s take a look at its configuration:

global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
    datacenter: dc1

rule_files:
  - "rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['localhost:9100']

Check config

./promtool check config ./prometheus.yml

In addition to external_labels, we specified a file with rules - rules.yml. In fact, we already did this when we added the rules for alerting, but now we will add the rules for creating new metrics. So, let’s have a look at rules.yml:

rules.yml

groups:
  - name: globaldc
    interval: 5s
    rules:
    - record: job:node_memory_MemTotal_bytes:sum
      expr: sum without(instance)(node_memory_MemTotal_bytes{job="node"})
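
Before restarting, the rules file can be validated with promtool (same binary as used for checking the main config):

./promtool check rules rules.yml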

Restart Prometheus so it picks up the new rules:

systemctl restart prometheus

test new metric

./promtool query instant http://localhost:9090/ 'job:node_memory_MemTotal_bytes:sum'

Long Term Storage for Prometheus

How TSDB works (short time storage)

All metrics that prometheus receives from exporters are grouped into two-hour blocks and saved to disk. One block is a separate directory, which contains the data itself (chunks), a metadata file, and an index file, which contains information on all metrics available for a given time period, as well as all labels of these metrics. The metrics themselves within the chunks directory are grouped into files up to 512 MB in size

root@prom-test:/opt/prometheus/data# ls -la
total 24
drwxr-xr-x 5 prometheus prometheus  4096 Jun 16 07:52 .
drwxr-xr-x 5 prometheus prometheus  4096 Apr 30 08:03 ..
drwxr-xr-x 3 prometheus prometheus  4096 Jun 16 07:52 01F89WWE7DY9X2X3Q2E47VAA8C
drwxr-xr-x 2 prometheus prometheus  4096 Jun 16 07:52 chunks_head
-rw-r--r-- 1 prometheus prometheus     0 Apr 30 08:03 lock
-rw-r--r-- 1 prometheus prometheus 20001 Jun 16 07:51 queries.active
drwxr-xr-x 2 prometheus prometheus  4096 Jun 16 07:52 wal
root@prom-test:/opt/prometheus/data# ls -la 01F89WWE7DY9X2X3Q2E47VAA8C/
total 68
drwxr-xr-x 3 prometheus prometheus  4096 Jun 16 07:52 .
drwxr-xr-x 5 prometheus prometheus  4096 Jun 16 07:52 ..
drwxr-xr-x 2 prometheus prometheus  4096 Jun 16 07:52 chunks
-rw-r--r-- 1 prometheus prometheus 45866 Jun 16 07:52 index
-rw-r--r-- 1 prometheus prometheus   272 Jun 16 07:52 meta.json
-rw-r--r-- 1 prometheus prometheus     9 Jun 16 07:52 tombstones
root@prom-test:/opt/prometheus/data#

The metrics that prometheus currently receives from exporters do not immediately get to disk - they are stored in memory for some time. To protect against unexpected termination, a write-ahead log (WAL) is used, thanks to which prometheus can replay the history of metric ingestion at startup and save the data to disk in the necessary directory. WAL segments are stored in the wal directory and each occupies at most 128 MB:

root@prom-test:/opt/prometheus/data# ls -la wal/
total 100
drwxr-xr-x 2 prometheus prometheus  4096 Jun 16 07:52 .
drwxr-xr-x 5 prometheus prometheus  4096 Jun 16 07:52 ..
-rw-r--r-- 1 prometheus prometheus 32768 Apr 30 08:03 00000000
-rw-r--r-- 1 prometheus prometheus 32768 Jun 16 07:52 00000001
-rw-r--r-- 1 prometheus prometheus 25704 Jun 16 07:55 00000002

After writing to wal, prometheus performs a compaction operation in the background and stores the resulting metrics into the desired data blocks. In addition, initial two-hour data blocks can be grouped into larger blocks - usually up to 10% of retention time. That is, if you set 30 days to store metrics, the blocks may contain 3 days of data.

The only limitation of this storage is that it cannot be clustered or replicated, but it is very fast. The developers claim that you will be able to store a year’s worth of metrics in this engine. It is worth noting that you should take care of prometheus failover by using a raid array or migrated virtual machines, otherwise you may lose all your data if a hard disk fails.

Basic TSDB settings

  • --storage.tsdb.path - path where tsdb data will be located. If you run prometheus in docker, it makes sense to mount this directory on the host system.

  • --storage.tsdb.retention.time - storage time of collected metrics. By default metrics are retained for 15 days.

  • --storage.tsdb.retention.size - experimental function to clean up metrics based on the size they take.

  • --storage.tsdb.wal-compression - enable wal archive compression. Enabled by default since version 2.20.

  • --storage.tsdb.no-lockfile - do not create lockfile in data directory. It will allow to run the second prometheus on the same data. But data consistency cannot be guaranteed anymore.
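
To see what the current blocks look like on disk - how many series and chunks they contain and which labels have the highest cardinality - promtool can analyze the data directory (path taken from the examples above):

./promtool tsdb analyze /opt/prometheus/data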

Setting up Long Term Storage

This magic works thanks to the standardized remote read/write API offered by Prometheus. The scheme works as follows: Prometheus calls an external application (an adapter) using this API; the adapter converts the request into the required format and sends it to the storage - for example, to PostgreSQL or ClickHouse. It then receives the response, converts it back into the Prometheus format and returns it. The whole scheme is therefore very similar to the scheme with exporters: Prometheus talks to the adapter, and the adapter talks to the storage. An example of such an adapter can be seen here(https://github.com/prometheus/prometheus/tree/release-2.27/documentation/examples/remote_storage/example_write_adapter).

https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage

Kafka

So, let’s start by installing zookeeper, since kafka uses it:

apt-get install zookeeper zookeeper-bin zookeeperd

Now we can configure it and add the id of our first zookeeper:

echo 1 > /etc/zookeeper/conf/myid
systemctl start zookeeper
systemctl status zookeeper
zookeeper.service - LSB: centralized coordination service
     Loaded: loaded (/etc/init.d/zookeeper; generated)
     Active: active (running) since Wed 2021-06-16 08:24:11 UTC; 18s ago
       Docs: man:systemd-sysv-generator(8)
      Tasks: 17 (limit: 2345)
     Memory: 37.2M
     CGroup: /system.slice/zookeeper.service
             └─11859 /usr/bin/java -cp /etc/zookeeper/conf:/usr/share/java/jline.jar:/usr/share/java/log4j-1.2.jar:/usr/share/java/xercesImpl.jar:/usr/share/java/xmlParserAPIs.j>

Install Kafka from binary

cd /opt/
wget https://downloads.apache.org/kafka/3.6.1/kafka_2.13-3.6.1.tgz

Unpack

tar zxf kafka_2.13-3.6.1.tgz
mv kafka_2.13-3.6.1 kafka
rm -rf kafka_2.13-3.6.1.tgz
cd kafka/

Start Kafka and check that the broker is reachable by listing topics (note that in Kafka 3.x the kafka-topics.sh tool talks to the broker via --bootstrap-server; the old --zookeeper flag has been removed):

./bin/kafka-server-start.sh -daemon config/server.properties
./bin/kafka-topics.sh --list --bootstrap-server localhost:9092

So, Kafka is installed - now we just need to get prometheus to write to it.

Great, we just need to check that everything works as it should - for this purpose we will create a new topic:

./bin/kafka-topics.sh --bootstrap-server localhost:9092 --topic "test" --create --partitions 5 --replication-factor 1
Created topic test.
./bin/kafka-topics.sh --list --bootstrap-server localhost:9092
test

Configuring Prometheus to Write to Remote Storage

To connect prometheus, we need to run an adapter that will receive messages from prometheus and send them to kafka. To do this, let’s use the officially recommended adapter - prometheus-kafka-adapter. It will accept messages from prometheus in its format, convert it to json and save it to kafka.

So, before installing the adapter, we will need to install the librdkafka dependencies. The official repositories contain an outdated version, so we will have to add a new repository:

wget -qO - https://packages.confluent.io/deb/6.2/archive.key | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://packages.confluent.io/deb/6.2 stable main"
echo "deb http://security.ubuntu.com/ubuntu bionic-security main" | sudo tee -a /etc/apt/sources.list.d/bionic.list
 apt-get install librdkafka-dev librdkafka1 zlib1g zlib1g-dev libcrypto++-dev libcrypto++6 libssl-dev

install prom-kafka-adapter:

cd /opt/
wget https://go.dev/dl/go1.21.5.linux-amd64.tar.gz
tar -zxf go1.21.5.linux-amd64.tar.gz

export GOROOT=/opt/go
export PATH=/opt/go/bin:$PATH
go version
go version go1.21.5 linux/amd64
mkdir prom-kafka-adapter
cd prom-kafka-adapter/
export GOPATH=`pwd`
# use @<version> instead of @latest to pin a specific release
go install github.com/Telefonica/prometheus-kafka-adapter@latest

check that everything works:

KAFKA_BROKER_LIST=localhost:9092 ./bin/prometheus-kafka-adapter

All that's left is to configure Prometheus. For this we add to prometheus.yml:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

remote_write:
  - url: "http://localhost:8080/receive"

Restart Prometheus and check that the metrics arrive in Kafka:

systemctl restart prometheus
./bin/kafka-topics.sh --list --bootstrap-server localhost:9092
./bin/kafka-console-consumer.sh --from-beginning --bootstrap-server localhost:9092 --topic metrics

Prometheus throws all our metrics into prom-kafka-adapter, and that in turn into kafka, in json format.

Firewall and SSL/TLS

iptables firewall for prometheus

cat /etc/iptables/rules.v4
# Generated by iptables-save v1.4.21 on Tue Jun 23 15:46:13 2020
*nat
:PREROUTING ACCEPT [17566285:1114295190]
:INPUT ACCEPT [8292701:445269860]
:OUTPUT ACCEPT [12857064:888777529]
:POSTROUTING ACCEPT [12857064:888777529]
COMMIT
*filter
:INPUT DROP [4:176]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [67101:106192188]
-A INPUT -i lo -j ACCEPT
-A INPUT -p tcp -m tcp --dport 22 -j ACCEPT
-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
COMMIT
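
With an INPUT policy of DROP, only SSH and established connections are allowed in, so the Prometheus web port is unreachable from outside. If, for example, a Grafana host needs access, a rule like the following can be added before COMMIT (192.168.1.10 is a hypothetical monitoring-host address):

iptables -A INPUT -p tcp -s 192.168.1.10 --dport 9090 -j ACCEPT
iptables-save > /etc/iptables/rules.v4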

Configuring Prometheus SSL

In our example we will use a self-signed certificate. First we need to generate the root key and certificate so that we can issue certificates for prometheus and node_exporter. This can be done using the openssl utility:

cd /opt/
mkdir ssl
cd ssl
openssl genrsa -out ca.key 4096
openssl req -new -x509 -days 720 -key ca.key -out ca.crt
Country Name (2 letter code) [AU]:IL
State or Province Name (full name) [Some-State]:Center
Locality Name (eg, city) []:Moscow
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Mycompany
Organizational Unit Name (eg, section) []:IT
Common Name (e.g. server FQDN or YOUR name) []:myweb.com
Email Address []:[email protected]

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
ls -la

Here we created a private key ca.key, as well as a certificate ca.crt, in which we added general information about us.

Now all that’s left is to generate a server certificate and key for Prometheus, and then sign it with the root certificate:

openssl genrsa -out server.key 4096
openssl req -new -key server.key -out server.csr

Country Name (2 letter code) [AU]:IL
State or Province Name (full name) [Some-State]:Center
Locality Name (eg, city) []:Tel-aviv
Organization Name (eg, company) [Internet Widgits Pty Ltd]:Mycompany
Organizational Unit Name (eg, section) []:IT
Common Name (e.g. server FQDN or YOUR name) []:mywebsite.com
Email Address []:[email protected]
openssl x509 -req -days 365 -in server.csr -CA ca.crt -CAkey ca.key -set_serial 1 -out server.crt

We should end up with a server.key and a server.crt certificate

Now we can describe the TLS and basic auth settings in the /opt/prometheus/web.config file for our prometheus server:

tls_server_config:
  cert_file: /opt/ssl/server.crt
  key_file: /opt/ssl/server.key

http_server_config:
  http2: true

basic_auth_users:
  dan: $2a$10$j6bz1OjUBADJ.dx5CxZGA.KVMAsNvAEEmJdrk0W1LHlxQDyqXGte.

In this example, we simply specified the server key and certificate, enabled HTTP/2 support, and set a login/password for our user dan. The password hash was generated with htpasswd; bcrypt hashing must be used:

install htpasswd

apt install apache2-utils

Generate password for auth_users:

htpasswd -bnBC 10 "" password | tr -d ':\n' | sed 's/$2y/$2a/'
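
The web configuration file itself can also be validated with promtool before restarting (assuming the promtool binary from the Prometheus distribution in /opt/prometheus; the check subcommand is available in recent Prometheus releases):

./promtool check web-config /opt/prometheus/web.config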

Now let’s assign permissions to our certificates, since prometheus is running under the prometheus user:

chown prometheus:prometheus /opt/ssl/server.*

Now let’s add a new parameter to start (--web.config.file) where we describe our web.config in /etc/systemd/system/prometheus.service:

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/prometheus \
    --config.file=/opt/prometheus/prometheus.yml \
    --storage.tsdb.path=/opt/prometheus/data \
    --web.console.templates=/opt/prometheus/consoles \
    --web.console.libraries=/opt/prometheus/console_libraries \
    --web.config.file=/opt/prometheus/web.config

[Install]
WantedBy=default.target

Now you can reboot systemd and run prometheus:

systemctl daemon-reload
systemctl restart prometheus
systemctl status prometheus

let’s try to address prometheus:

curl -D - -u dan:password --cacert /opt/ssl/ca.crt --resolve mywebsite.com:9090:127.0.0.1 https://mywebsite.com:9090

Let’s deal with these parameters. The first parameter -D will cause curl to display all the headers on the screen.

The second parameter --cacert /opt/ssl/ca.crt specifies the path to our root certificate, which will allow us to check its validity, since it is the one that the prometheus server certificate is signed with.

The third parameter --resolve mywebsite.com:9090:127.0.0.1 - since we don't have a DNS entry for mywebsite.com, this parameter forces curl to resolve mywebsite.com:9090 to 127.0.0.1. So when we specify the URL https://mywebsite.com:9090 at the end, curl will not go to DNS, but straight to 127.0.0.1:9090. This is required for curl to be able to verify the server name mywebsite.com in our server certificate. And indeed we got a valid response from prometheus: 302 Found.

Configuring Node_Exporter with Nginx Proxy

Let's try to implement the scheme with TLS and basic auth for node_exporter. So, let's start by generating a key and certificate for node_exporter, similar to Prometheus (on the Prometheus machine):

openssl genrsa -out node_exporter.key 4096
cat > req.conf << EOF
[req]
distinguished_name = req_distinguished_name
x509_extensions = v3_req
prompt = no
[req_distinguished_name]
C = IL
ST = Telaviv
L = Telaviv
O = Mycompany
OU = IT
CN = node_exporter.mywebsite.com
[v3_req]
keyUsage = keyEncipherment, dataEncipherment
extendedKeyUsage = serverAuth
subjectAltName = @alt_names
[alt_names]
DNS.1 = node_exporter.mywebsite.com
EOF
openssl req -new -key node_exporter.key -out node_exporter.csr -config req.conf -extensions 'v3_req'
openssl x509 -req -days 365 -in node_exporter.csr -CA ca.crt -CAkey ca.key -set_serial 1 -extensions 'v3_req' -extfile req.conf -out node_exporter.crt

After these steps, we will get the files node_exporter.key - the key - and node_exporter.crt - the certificate for our exporter.

We used the req.conf configuration where we described all the fields we need, as well as additional SAN names for our host. This is important, because if we make an old certificate (without SAN names), prometheus will get an error like this:

Get “https://127.0.0.1:9100/metrics”: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

Now let’s install Nginx and make a configuration for it:

apt-get install nginx

We place the configuration in /etc/nginx/sites-enabled/node_exporter.conf:

server {
  listen 9100 ssl;

  ssl_certificate /opt/ssl/node_exporter.crt;
  ssl_certificate_key /opt/ssl/node_exporter.key;

  server_name node_exporter.mywebsite.com;

  location / {
    proxy_pass http://127.0.0.1:9100/;
    auth_basic "Restricted!";
    auth_basic_user_file .htpasswd;
  }
}

Here everything is quite simple - we use the generated certificate and key for node_exporter and also enable basic authentication for requests. We listen on port 9100 with ssl enabled.

Now we can create the htpasswd file and reload nginx:

htpasswd -n dan
echo 'dan:$apr1$P1sF6Xzd$4rCqbl8FN.Fk9DqmUSEeb0' > /etc/nginx/.htpasswd

reload nginx

/etc/init.d/nginx reload
./node_exporter --web.listen-address="127.0.0.1:9100"

In this case we just run node_exporter in tmux, although it is better to run it through systemd. The main thing is that we added the --web.listen-address="127.0.0.1:9100" parameter, so it will only listen for connections on the loopback interface and external clients can reach it only via nginx. Let's check if everything works:

curl -D - -u dan:password --cacert /opt/ssl/ca.crt --resolve node_exporter.mywebsite.com:9100:127.0.0.1 https://node_exporter.mywebsite.com:9100

All parameters are taken from the previous command, so there is no point in parsing them here - the only thing we have changed is the server name to node_exporter.mywebsite.com, as it is specified in our certificates and port 9100, on which nginx listens.

Connect this exporter to prometheus:

global:
  scrape_interval:     15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: node_exporter
    static_configs:
    - targets: ['127.0.0.1:9100']
    scheme: https
    tls_config:
      ca_file: /opt/ssl/ca.crt
      server_name: node_exporter.mywebsite.com
    basic_auth:
      username: dan
      password: password

In this case we have defined a new job, node_exporter, which connects to our nginx at 127.0.0.1:9100 via HTTPS (scheme: https). Then we specify the path to the root certificate so that Prometheus can make sure it connects to the right host, as well as the server name we set in the certificate - node_exporter.mywebsite.com. In addition, we add authentication settings - basic_auth. After this configuration we can restart Prometheus and make sure that all metrics are successfully collected.
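
A minimal way to verify the result, reusing the curl parameters from the Prometheus TLS example above: validate the config with promtool and then ask the targets API whether the node_exporter scrape is healthy.

./promtool check config /opt/prometheus/prometheus.yml
curl -s -u dan:password --cacert /opt/ssl/ca.crt --resolve mywebsite.com:9090:127.0.0.1 "https://mywebsite.com:9090/api/v1/targets" | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'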

Prometheus Useful Cases

Case 1 is an active prometheus replica.

TSDB works only locally and cannot be configured for replication or clustering. But the workaround is much simpler than you think. Since Prometheus works on the pull model, you can simply put a second virtual machine next to the first Prometheus with the same configuration, and it starts pulling metrics from the same sources in parallel. If one server dies, you can easily switch to the reserve server.

Case 2 - prometheus backup.

How to back up Prometheus? It's really simple - you just need to back up the data directory (/opt/prometheus/data). There is also an interesting dump command in promtool, but it just outputs a list of all metrics from the data directory:

./promtool tsdb dump /opt/prometheus/data | head -2

It can output them, but you can't import them back :) Prometheus even has an issue on this topic.

But promtool allows you to import data in openmetrics format:

./promtool tsdb create-blocks-from openmetrics <input file> [<output directory>]

Back to backups - you can use the API to create a snapshot of the TSDB, roughly like this:

curl -XPOST -D - -s http://localhost:9090/api/v1/admin/tsdb/snapshot

That is, we just send a POST request to a specific URL. By the way, this url is not available by default and to enable it you need to add the --web.enable-admin-api option when starting prometheus. After running this command we will get a snapshot in the /opt/prometheus/data/snapshots directory:

ls -la /opt/prometheus/data/snapshots/

Case 3 - how to monitor prometheus itself.

If Prometheus goes down, nobody will even notice, because Prometheus will not be able to send alerts to Alertmanager - and Alertmanager will probably die together with Prometheus.

To solve this problem we use a very simple method - “Dead Man’s Switch”. Its essence is the following - you set up an alert like this inside prometheus:

     - alert: DeadMansSwitch
       annotations:
         description: This is a DeadMansSwitch meant to ensure that the entire Alerting
           pipeline is functional.
         summary: Alerting DeadMansSwitch
       expr: vector(1)
       labels:
         severity: none

Since the vector(1) expression is used, this alert will essentially always be active. Next, you configure the alertmanager roughly as follows:

global:
 ...
route:
 ...
    routes:
    - match:
        alertname: DeadMansSwitch
      receiver: 'cole'
      group_wait: 0s
      group_interval: 1m
      repeat_interval: 50s
receivers:
- name: 'cole'
  webhook_configs:
  - url: 'https://nosnch.in/c2354d53d2'
    send_resolved: false

This URL can be obtained from the deadmanssnitch.com service. Thus Prometheus will constantly send alerts to Alertmanager, and Alertmanager in turn to Dead Man's Snitch. If your Prometheus or Alertmanager goes down, the notifications will stop arriving, which means something went wrong. The actual alerting on missing notifications is configured in Dead Man's Snitch itself. The same functionality is offered by PagerDuty: https://www.pagerduty.com/docs/guides/dead-mans-snitch-integration-guide/

Grafana

Install Grafana from Binary

cd /opt/
wget https://dl.grafana.com/enterprise/release/grafana-enterprise-10.2.2.linux-amd64.tar.gz
tar -zxvf grafana-enterprise-10.2.2.linux-amd64.tar.gz
mv grafana-enterprise-10.2.2.linux-amd64 grafana

Create a systemd unit for Grafana, for example /etc/systemd/system/grafana-server.service:

[Unit]
Description=Grafana instance
Documentation=http://docs.grafana.org
Wants=network-online.target
After=network-online.target
After=postgresql.service mariadb.service mysql.service influxdb.service

[Service]
EnvironmentFile=/etc/default/grafana-server
User=grafana
Group=grafana
Type=simple
Restart=on-failure
WorkingDirectory=/usr/share/grafana
RuntimeDirectory=grafana
RuntimeDirectoryMode=0750
ExecStart=/usr/share/grafana/bin/grafana server                                     \
                            --config=${CONF_FILE}                                   \
                            --pidfile=${PID_FILE_DIR}/grafana-server.pid            \
                            --packaging=deb                                         \
                            cfg:default.paths.logs=${LOG_DIR}                       \
                            cfg:default.paths.data=${DATA_DIR}                      \
                            cfg:default.paths.plugins=${PLUGINS_DIR}                \
                            cfg:default.paths.provisioning=${PROVISIONING_CFG_DIR}

LimitNOFILE=10000
TimeoutStopSec=20
CapabilityBoundingSet=
DeviceAllow=
LockPersonality=true
MemoryDenyWriteExecute=false
NoNewPrivileges=true
PrivateDevices=true
PrivateTmp=true
ProtectClock=true
ProtectControlGroups=true
ProtectHome=true
ProtectHostname=true
ProtectKernelLogs=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectProc=invisible
ProtectSystem=full
RemoveIPC=true
RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
RestrictNamespaces=true
RestrictRealtime=true
RestrictSUIDSGID=true
SystemCallArchitectures=native
UMask=0027

[Install]
WantedBy=multi-user.target
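
Assuming the unit file above was saved as /etc/systemd/system/grafana-server.service, reload systemd and start the service:

sudo systemctl daemon-reload
sudo systemctl enable --now grafana-server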

Nginx configuration example

upstream grafana {
    server localhost:3000;
}

server {
    listen 80;
    #listen 443 ssl;

    root /var/www/grafana;
    index index.html index.htm index.nginx-debian.html;

    server_name grafana.localhost;

    #ssl_certificate     /etc/letsencrypt/live/grafana.localhost/fullchain.pem;
    #ssl_certificate_key /etc/letsencrypt/live/grafana.localhost/privkey.pem;

    location / {
        proxy_set_header Host $http_host;
        proxy_pass http://grafana;
    }
}

Reload the NGINX configuration via the command line:

/etc/init.d/nginx reload

Grafana from repository

Install the Grafana repository authentication key:

wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -

Add a repository with the command:

sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"

Update the list of packages in the repositories:

sudo apt update

Installation:

sudo apt install grafana

After installation, reload the systemd units:

sudo systemctl daemon-reload

Enable Grafana to start at boot:

sudo systemctl enable grafana-server

Start the Grafana daemon:

sudo systemctl start grafana-server

Grafana is up and running and ready to accept incoming connections on port 3000. The default login and password are admin/admin.

Config file and Environment Variables

By default, the Grafana configuration file is located in an OS-specific default path. For Ubuntu, this is /etc/grafana/grafana.ini. You can override the location of the file by using the --config <path> flag.

The .ini configuration file consists of sections labeled with square brackets.

# default section
instance_name = ${HOSTNAME}

[security]
admin_user = admin

[auth.google]
client_secret = 0ldS3cretKey

[plugin.grafana-image-renderer]
rendering_ignore_https_errors = true

A detailed description of the parameters can be found in the official documentation: https://grafana.com/docs/grafana/latest/administration/configuration/.

You can also use environment variables to configure Grafana. Environment variable names are always uppercase and are built from the section name and parameter name in the configuration file: GF_<SECTION NAME>_<PARAMETER NAME>.

export GF_DEFAULT_INSTANCE_NAME=my-instance
export GF_SECURITY_ADMIN_USER=owner
export GF_AUTH_GOOGLE_CLIENT_SECRET=newS3cretKey
export GF_PLUGIN_GRAFANA_IMAGE_RENDERER_RENDERING_IGNORE_HTTPS_ERRORS=true

Using environment variables can be particularly useful when using a Docker container or for testing a new parameter configuration.

As with the configuration file, Grafana needs to be restarted for the changes to take effect.

Configuring Nginx Reverse Proxy for Grafana

Install nginx

sudo apt install nginx

Check nginx status

sudo service nginx status

If you see Active status, it means that nginx is installed and working correctly.

Best practice:

Before starting nginx, you may need to first stop another web server, Apache2, and remove it from autostart. To do this, run the following commands in sequence:

sudo systemctl stop apache2
sudo systemctl disable apache2
sudo systemctl status apache2

Go to the IP address of our server (or localhost if you execute commands locally). Note that this time port 3000 should not be specified! The browser should display the nginx start page.

Create an nginx config file for Grafana.:

nano /etc/nginx/sites-enabled/my-grafana.conf

A text editor will open. Let’s fill in the nginx configuration file

server {
    listen 80;
    listen [::]:80;
    server_name  my-grafana;

    location / {
        proxy_pass http://localhost:3000/;
    }
}

Where my-grafana is the domain name you plan to use to access Grafana.

Note that port 80 is specified in the nginx configuration as an example only. In a production environment, SSL (and an HTTPS connection) must be used.

Basic Authentication

Basic Authentication is basic authentication using a login/password pair.

LDAP Authentication

LDAP Authentication is an authentication method that uses the LDAP protocol and access to a directory service.

LDAP is most often encountered in the case of Windows domain authorization. Let’s take a closer look at the LDAP configuration process in Grafana.

Before configuring Grafana itself, we need to go to the Windows server where we have the domain controller configured and create a grafana user.

With this technical account we will “go” to LDAP and fetch users from there when they try to log in.

  • To do this, go to Active Directory Users and Computers. Open our DC and in the Users group right-click New and User.

  • Fill in the parameters of the new user.

  • On the “Account” tab, enter the name “grafana” in the “User Login Name” field.

Next, log in to our server with Grafana and go to the home directory of the service.

cd /etc/grafana

Open the grafana.ini file

sudo nano grafana.ini

Go down to the [auth.ldap] parameter group.

Set enabled = true and remove the semicolons to make it work:

[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml
allow_sign_up = true

Open the file ldap.toml

sudo nano ldap.toml

Replace the parameters with the ones shown below.

host = "192.168.67.138"

…where host is the IP address of your Windows server acting as the Domain Controller.

bind_dn = "cn=grafana LDAP,cn=users,dc=MRS,DC=CONTOSO,DC=COM"

To find out the exact bind_dn, you need to run the cmdlet on a Windows server in PowerShell:

Get-ADUser grafana -Properties *

or

(Get-ADUser -Identity "grafana").DistinguishedName

In this data you need to find the DistinguishedName attribute - this is the bind_dn.

Enter the password for the grafana user in bind_password.

Filling in:

search_base_dns = ["dc=MRS,DC=CONTOSO,DC=COM"]
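
Putting the pieces together, the relevant servers block of ldap.toml for this setup might look roughly like this (port 389 without SSL and the sAMAccountName search filter are assumptions for a plain AD setup; bind_password is the password set for the grafana user):

[[servers]]
host = "192.168.67.138"
port = 389
use_ssl = false
bind_dn = "cn=grafana LDAP,cn=users,dc=MRS,DC=CONTOSO,DC=COM"
bind_password = "grafana-user-password"
search_filter = "(sAMAccountName=%s)"
search_base_dns = ["dc=MRS,DC=CONTOSO,DC=COM"]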

Restart Grafana

sudo systemctl restart grafana-server

Go to the Grafana web interface, log in as admin. Go to Server Admin - LDAP.

In Test user mapping you can enter any user from AD. If it exists, Grafana will show its status. You can now log in to Grafana with a domain account.

https://grafana.com/docs/grafana/latest/setup-grafana/configure-security/configure-authentication/ldap/

GitLab OAuth

If your team works in GitLab and has a workspace configured there, you can use authorization with Gitlab.

To do this, go to the Grafana home directory and open the configuration file.

sudo nano /etc/grafana/grafana.ini

Go down to the bottom and find the [auth.gitlab] parameter group. If this group is not in the configuration file, add it. If the parameters are present but commented out, remove the comment characters, set enabled to true, and fill in the other parameters according to the authorization provider's settings.

It should go something like this:

[auth.gitlab]
enabled = true
allow_sign_up = true
client_id = some_id
client_secret = some_secret
scopes = api
auth_url = https://gitlab.com/oauth/authorize
token_url = https://gitlab.com/oauth/token
api_url = https://gitlab.com/api/v4
allowed_domains =
allowed_groups =

…where the auth_url, token_url, api_url parameters should be changed to your GitLab repository parameters.

Basic Authorization without reverse proxy

Let’s install dependencies to create an encrypted version of the password.

sudo apt install python3-bcrypt

Create a Python script that asks for a password and returns an encrypted version of it:

nano gen-pass.py

Script code:

import getpass
import bcrypt

password = getpass.getpass("password: ")
hashed_password = bcrypt.hashpw(password.encode("utf-8"), bcrypt.gensalt())
print(hashed_password.decode())

Run the created utility:

python3 gen-pass.py

Enter the password and copy the output result - you will need it to configure Prometheus.

Let’s create an additional configuration file for storing authentication settings.

sudo nano /opt/prometheus/web.yml

Let’s create a basic_auth section and fill in the information about our user.

basic_auth_users:
    admin: $2b$12$8aGxV85H8Ukkcft4e2X6XuCKATuPU8Fywokq.QlZuS/Dms6.HyJMe

To add more users, repeat the previous steps by adding other users in the same configuration section under the admin user.

basic_auth_users:
    admin: $2b$12$8aGxV85H8Ukkcft4e2X6XuCKATuPU8Fywokq.QlZuS/Dms6.HyJMe
    bob: $2b$12$oLA0.oCvRGe7gN1WhvdomOF34143UqBzd8jr/hFT1c8dqXf/TcuQa

If Prometheus was launched as a service, we need to adjust the startup settings. To do this, let’s find out where the file of our service is located:

sudo systemctl status prometheus

In the line with Loaded we will see the path to the file with the extension .service. Edit this file by adding a new flag to the ExecStart line --web.config.file=/opt/prometheus/web.yml.
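
After the unit file is changed, reload systemd and restart Prometheus, then check that unauthenticated requests are rejected while the configured user gets through (the password is the one you entered in gen-pass.py):

sudo systemctl daemon-reload
sudo systemctl restart prometheus
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9090/metrics
curl -s -o /dev/null -w "%{http_code}\n" -u admin:yourpassword http://localhost:9090/metrics

The first request should return 401, the second 200.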

Configuring Nginx Reverse Proxy with SSL for Prometheus

For security purposes, every exporter and solution you install on the host should be closed from the outside world. Access to metrics should be strictly limited, and all ports that provide information about the host system should be secured.

Using HTTPS encrypts the requests sent to the server, but most importantly, it encrypts the server responses containing all the metrics.

Since Prometheus originally did not support TLS authentication and encryption, it is common practice to use reverse proxy and TLS at the proxy level to provide protection between clients and the Prometheus server. NGINX will be used in this capacity.

Workflow

Let’s configure Prometheus to run behind an nginx server available in a domain named example.com:

prometheus5

In the example Prometheus is installed on the same host as nginx.

Note: In the example, Ubuntu is installed on the host. For other OS versions nix commands may be different

Setting up the working directory

Let’s create a directory where our keys will be stored and navigate to it:

sudo mkdir -p /opt/prometheus-certs
cd /opt/prometheus-certs

Creating TLS certificates

sudo openssl req -new -newkey rsa:2048 -days 365 -nodes -x509 -keyout prometheus.key -out prometheus.crt -subj "/C=IL/ST=Telaviv/L=Telaviv/O=Mycompany/CN=<dns_record>" -addext "subjectAltName = DNS:<dns_record>"

…where <dns_record> is a DNS record.

Detailed description of the command

openssl - command to start OpenSSL.

req - OpenSSL utility for creating CSRs.

-newkey rsa:2048 - tells OpenSSL to create a new 2048-bit RSA private key.

-nodes - tells OpenSSL to create an unencrypted key.

-x509 - creates a new self-signed certificate instead of a CSR.

-keyout prometheus.key - indicates where to save the private key file.

-out prometheus.crt - indicates where to save the self-signed certificate.

Example output of the command

root@prometheus:~/certs/prometheus# openssl req -newkey rsa:4096 -nodes -x509 -keyout prometheus.key -out prometheus.crt
Generating a RSA private key
....................................++++
..................++++
writing new private key to 'prometheus.key'
-----
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [AU]:IL
State or Province Name (full name) [Some-State]:Center
Locality Name (eg, city) []:Israel
Organization Name (eg, company) [Internet Widgits Pty Ltd]:mycompany
Organizational Unit Name (eg, section) []:-
Common Name (e.g. server FQDN or YOUR name) []:prometheus
Email Address []:[email protected]

Fill in the appropriate information in the prompts in English and without using wildcards. Be sure to enter the correct DNS record in Common Name.

Two files will be created in the directory: prometheus.crt and prometheus.key.

Configuring TLS at nginx level

In this step, enable the TLS protocol in our nginx instance.

Let’s go to the nginx configuration files directory.

cd /etc/nginx/sites-enabled

Let’s create and fill in the configuration file for Prometheus.

sudo nano prometheus

server {
    listen              443 ssl;
    server_name         example.com;  # DNS record
    ssl_certificate     /opt/prometheus-certs/prometheus.crt;  # path to crt
    ssl_certificate_key /opt/prometheus-certs/prometheus.key;  # path to key

    location / {
        proxy_pass http://localhost:9090/;  # local address Prometheus
    }
}

We have configured the HTTPS server by enabling the ssl parameter in the server block, and specifying the location of the server certificate and secret key files.

Also in this configuration nginx will proxy all connections to / endpoint to the Prometheus server running on the same host (localhost).

Let’s check that the configuration file is correct.

sudo nginx -t

Let’s restart nginx to apply the configuration:

sudo systemctl restart nginx

Port binding errors may happen when restarting nginx:

Aug 08 21:33:00 prometheus nginx[20782]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)

Aug 08 21:33:00 prometheus nginx[20782]: nginx: [emerg] bind() to [::]:80 failed (98: Address already in use)

 Aug 08 21:33:00 prometheus nginx[20782]: nginx: [emerg] bind() to 0.0.0.0:443 failed (98: Address already in use)

 Aug 08 21:33:00 prometheus nginx[20782]: nginx: [emerg] bind() to 0.0.0.0:80 failed (98: Address already in use)

Aug 08 21:33:01 prometheus nginx[20782]: nginx: [emerg] still could not bind()

Solution: Kill the nginx process and try restarting the service again:

 sudo pkill -f nginx & wait $!
 sudo systemctl start nginx
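
Once nginx is running, a quick check that it terminates TLS and proxies to Prometheus (the --resolve trick is the same as earlier, since example.com is not a real DNS record; -k skips verification of the self-signed certificate):

curl -k -D - --resolve example.com:443:127.0.0.1 https://example.com/-/healthy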

Data filtering

mkdir -p /home/certs/prometheus
cd /home/certs/prometheus
sudo openssl req -newkey rsa:4096 -nodes -x509 -keyout prometheus.key -out prometheus.crt 

Fill in all the fields.

sudo nano /etc/nginx/nginx.conf 

Edit conf

server {
    listen              443 ssl;
    server_name         www.example.com;
    ssl_certificate     /home/certs/prometheus/prometheus.crt;
    ssl_certificate_key /home/certs/prometheus/prometheus.key;
    location / {
        proxy_pass http://localhost:9090/;
    }
}

Restart nginx, stop the packaged Prometheus service, and run Prometheus manually with the external URL set:

systemctl restart nginx
systemctl stop prometheus
sudo prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--web.external-url=https://prometheus.do.rbr-grafana.com \
--web.route-prefix="/"

sudo apt-get install apache2-utils

sudo htpasswd -c /home/user.htpasswd admin

sudo nano /etc/nginx/nginx.conf 

Edit conf

server {
    listen              443 ssl;
    server_name         www.example.com;
    ssl_certificate     /home/certs/prometheus/prometheus.crt;
    ssl_certificate_key /home/certs/prometheus/prometheus.key;
    location / {
        proxy_pass http://localhost:9090/;
        auth_basic           "Prometheus";
        auth_basic_user_file /home/user.htpasswd;
    }
}

systemctl restart nginx

curl -u admin -k https://prometheus.do.rbr-grafana.com:443/metrics

On grafana

Configuration => Data source => Add data source => Select Prometheus

HTTP
URL : https://prometheus.do.rbr-grafana.com
Basic Auth [x]
Skip TLS verify [x]
with Credentials [x]
User: admin 
Password: *********

[Save & Test]


Restart everything and check the alerts in the Prometheus web interface: localhost:9090 => Alerts

PagerDuty

An alternative to PagerDuty is Zenduty.

Services => Service Directory => New Service

Enter the service name, add an integration of the Prometheus type and press Add Service.

After that, go to the integration and copy the Integration Key and Integration URL.

Then edit alertmanager.yml:

global:
  resolve_timeout: 5m
  pagerduty_url: '<url from web>'
route:
  receiver: 'pagerduty_notifications'
receivers:
- name: 'pagerduty_notifications'
  pagerduty_configs: 
  - service_key: '<key from web>'
    send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
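
As earlier, the file can be validated with amtool before restarting Alertmanager:

./amtool check-config alertmanager.yml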

Grafana

install Grafana on Ubuntu

Install the prerequisite packages:

sudo apt-get install -y apt-transport-https software-properties-common wget

Import the GPG key:

sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

To add a repository for stable releases, run the following command:

echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list

run the following command to update the list of available packages:

# Updates the list of available packages
sudo apt-get update

To install Grafana OSS, run the following command:

# Installs the latest OSS release:
sudo apt-get install grafana

See the Grafana status:

systemctl status grafana-server.service

Start Grafana (default port localhost:3000):

systemctl start grafana-server.service

On the first login, use user admin / password admin; you will be asked to change the password.

Add metrics to a Grafana dashboard:

+ => Dashboard => + Add new panel => change the default data source to Prometheus

To add a panel:

Panel => Graph => panel title: my_event

To add metrics:

A => metrics => prometheus_http_requests_total, and pick the most suitable legend

To name the legend, use {{handler}}

then Apply

Metrics for CPU are taken from node_exporter: node_cpu_seconds_total{mode="iowait"} with legend {{cpu}} - {{job}}

To save the dashboard, click Save dashboard and give it a name.

Simple Metrics

To monitor whether a server is up or down: A => metrics up, Legend {{job}}

To monitor scrape time: A => metrics scrape_duration_seconds, Legend {{instance}} {{job}}

To monitor filesystem usage in %: A => metrics 100 - node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes * 100, Legend {{device}}

To monitor CPU aggregated by instance and core: A => metrics avg by (instance, cpu) (node_cpu_seconds_total)

To monitor how many CPU cores there are: A => metrics count(count(node_cpu_seconds_total) without (mode)) without (cpu) - these expressions can also be checked from the command line, see the promtool example below
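
Any of these expressions can also be evaluated without the UI, using promtool against the local Prometheus (assuming it listens on localhost:9090 as in the earlier examples):

./promtool query instant http://localhost:9090/ 'count(count(node_cpu_seconds_total) without (mode)) without (cpu)'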

Function

To monitor how free disk space changed over a time window: delta(node_filesystem_avail_bytes{device!='tmpfs'}[45s])

Install Grafana Loki

Loki is Grafana's log aggregation system - a database for logs.

prometheus5

Download loki-linux-amd64.zip and promtail-linux-amd64.zip (promtail is needed on the endpoint node) and unzip both.

curl -O -L "https://github.com/grafana/loki/releases/download/v2.9.2/loki-linux-amd64.zip"
# extract the binary
unzip "loki-linux-amd64.zip"
# make sure it is executable
chmod a+x "loki-linux-amd64"

Download the Loki config to the server:

wget https://raw.githubusercontent.com/grafana/loki/main/cmd/loki/loki-local-config.yaml

Download the Promtail config to the node:

wget https://raw.githubusercontent.com/grafana/loki/main/clients/cmd/promtail/promtail-local-config.yaml

In Grafana add Loki as a data source:

Configuration => Data Source => Add data source => Loki => HTTP URL: http://localhost:3100 => Save & Test

Change Configuration of Promtail

promtail-local-config.yaml

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log/*log

Run Loki

run with config (Loki listens on localhost:3100 by default)

./loki-linux-amd64 -config.file=loki-local-config.yaml

Run Promtail

run with config

./promtail-linux-amd64 -config.file=promtail-local-config.yaml
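
To confirm that Loki is up and Promtail has started shipping logs, you can hit Loki's readiness endpoint and list the labels it has received (ports are the defaults from the configs above):

curl http://localhost:3100/ready
curl -s http://localhost:3100/loki/api/v1/labels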

Explore logs

Explore => log labels => {filename="/var/log/syslog"}

For 2 files: {filename=~"/var/log/syslog|/var/log/auth.log"}

LogQL: Log query language

https://grafana.com/docs/loki/latest/query/

{filename=-"/var/log/syslog|/var/log/auth.log"} |= "requested_routers" see all logs with requested_routers

{filename=-"/var/log/syslog|/var/log/auth.log"} != "Succseed" see all logs without Succseed

{job="app"} |= count() by (status) |- {job="db"} |= count() by (status) combine 2 results to 1

count_over_time({filename="/var/log/syslog"} |= "error" [12h])

Wiki Awesome Prometheus alerts

Collection of alerting rules

https://samber.github.io/awesome-prometheus-alerts/

Grafana Dashboard Collection

https://grafana.com/grafana/dashboards/

Import Dashboard

+ => import => load