[Linux] Remote alarm agent with Telegram
Hello everybody,
today we’re going to see how you can implement some useful agents that monitor the status of your machines and send alerts through Telegram.
Let’s start!
Machine alive
First of all, is the machine alive? We’re going to use ping to find out:
function check_alive(){
    # Send a single echo request; ping exits with 0 only if a reply arrives
    ping -c 1 "${machine}"
    alive=$?
}
If ping returns a code other than 0, the machine is unreachable.
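As a possible variation, the check can be made quiet and bounded in time. The sketch below assumes the Linux iputils ping, where -W sets the reply timeout in seconds; the name check_alive_quiet and the 2-second timeout are illustrative choices, not part of the original script.

```shell
# Hypothetical variant of check_alive: same contract (it sets $alive),
# but it discards ping's output and waits at most 2 seconds for a reply.
check_alive_quiet() {
    # -c 1: one echo request; -W 2: reply timeout in seconds (iputils ping)
    ping -c 1 -W 2 "${machine}" >/dev/null 2>&1
    alive=$?
}
```

It can be called exactly like check_alive, after the variable machine has been set.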
Monitor Filesystem
One of the most important things to monitor is the status of the filesystems, so let’s see how we can implement it:
function check_fs(){
    ret_fs=0
    # Keep only the df lines whose Use% column (field 5) is at or above the threshold
    fs_info=$(ssh -n "${user}@${machine}" "df -Ph" | awk "0 + \$5 >= ${busy_fs} { print }")
    echo "fs_info: ${fs_info}"
    if [[ -n ${fs_info} ]]
    then
        ret_fs=1
    fi
}
This function gets the status of the remote filesystems and keeps every one whose usage is at or above ${busy_fs}, a variable that the script sets before calling the function. Then it sets ret_fs to the alert state: 0 is OK, 1 is alarm.
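To see how the awk filter behaves, here is a minimal sketch that runs it on made-up df output instead of a real ssh call. The helper name busy_filesystems and the sample values are assumptions for illustration; unlike the function above, it skips the header line explicitly and prints only the mount point and the usage.

```shell
# Hypothetical helper: read "df -P" style output on stdin and print the
# mount point and usage of every filesystem at or above the limit in $1.
busy_filesystems() {
    # "0 + $5" turns a value such as "90%" into the number 90;
    # NR > 1 skips the header line.
    awk -v limit="$1" 'NR > 1 && 0 + $5 >= limit { print $6, $5 }'
}

# Sample input (made-up values)
sample_df='Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 45G 5G 90% /
/dev/sdb1 100G 20G 80G 20% /data'

printf '%s\n' "$sample_df" | busy_filesystems 80
```

With a limit of 80, only the root filesystem of the sample is reported.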
Monitor CPU
Another thing that you want to monitor is the CPU: sustained activity near 100% causes problems such as overheating and a slow machine. So we’re going to monitor the CPU in this way:
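function check_cpu(){
    ret_cpu=0
    # The first vmstat report is an average since boot, so take a second, 1-second sample
    cpu_info=$(ssh -n "${user}@${machine}" "vmstat 1 2 | tail -1")
    # Field 15 is the idle CPU percentage (the "id" column)
    free_cpu=$(echo ${cpu_info} | awk '{print $15}')
    if [[ ${free_cpu_perc} -gt ${free_cpu} ]]
    then
        ret_cpu=1
    fi
}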
The concept is very similar to the previous function: we “return” 1 if the idle CPU percentage is below our threshold, and 0 otherwise.
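The idle value is read from a fixed field position, which depends on the vmstat column layout. As a hedged alternative, the sketch below finds the "id" column from the header line and works on made-up vmstat output; the helper name idle_from_vmstat and the sample numbers are assumptions.

```shell
# Hypothetical helper: read vmstat output on stdin and print the idle CPU
# percentage of the last sample, locating the "id" column by its header.
idle_from_vmstat() {
    awk '
        # Header line with the column names: remember where "id" is
        $1 == "r" { for (i = 1; i <= NF; i++) if ($i == "id") col = i; next }
        # Data lines start with a number; keep the value of the latest one
        col && $1 ~ /^[0-9]+$/ { idle = $col }
        END { if (idle != "") print idle }
    '
}

# Sample input (made-up values)
sample_vmstat='procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 801234  10240 204800    0    0     5    10  100  200  3  1 95  1  0
 0  0      0 801200  10240 204800    0    0     0     0  120  210 10  5 85  0  0'

printf '%s\n' "$sample_vmstat" | idle_from_vmstat
```

On a real host the input would come from the ssh call instead of the sample variable.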
Monitor RAM
The last thing that you usually want to monitor is the RAM. Here is the function:
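function check_ram(){
    ret_ram=0
    # Both values are in kB: the MemTotal line comes first, MemAvailable second
    mem_info=$(ssh -n "${user}@${machine}" "grep -E 'MemTotal|MemAvailable' /proc/meminfo")
    # The unquoted echo joins the two lines, so field 2 is the total and field 5 the available memory
    total_mem=$(echo ${mem_info} | awk '{print $2}')
    free_mem=$(echo ${mem_info} | awk '{print $5}')
    free_perc=$((100*free_mem/total_mem))
    if [[ ${free_ram_perc} -gt ${free_perc} ]]
    then
        ret_ram=1
    fi
}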
Again the concept is the same: we “return” 1 if the available RAM percentage is below our threshold, and 0 otherwise.
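The percentage arithmetic can also be done in a single awk pass that matches the field names instead of relying on their positions. This is only a sketch on made-up /proc/meminfo content; the helper name available_mem_perc and the sample values are assumptions, and hosts that do not expose MemAvailable would need a different field.

```shell
# Hypothetical helper: read /proc/meminfo content on stdin and print the
# available memory as an integer percentage of the total.
available_mem_perc() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END { if (total > 0) print int(100 * avail / total) }
    '
}

# Sample input (made-up values, in kB)
sample_meminfo='MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:    4000000 kB'

printf '%s\n' "$sample_meminfo" | available_mem_perc
```

For the sample, 4000000 kB available out of 16000000 kB gives 25.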
Create alert and send it
Now we want to send an alert report so that we can fix the problems. We define two functions: one creates the report and one sends it. Here is the first one:
function report(){
    if [[ ${ret_ram} -eq 1 ]]
    then
        echo "Free ram percentage of ${machine} is: " ${free_perc} >>${basedir}/log.txt
    fi
    if [[ ${ret_cpu} -eq 1 ]]
    then
        echo "Free cpu percentage of ${machine} is: " ${free_cpu} >>${basedir}/log.txt
    fi
    if [[ ${ret_fs} -eq 1 ]]
    then
        echo "Full filesystems of ${machine} are: " ${fs_info} >>${basedir}/log.txt
    fi
}
This function uses all the previous return codes to build the log. The next one sends it through Telegram:
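function send_alert(){
    # Send only if the report is not empty
    if [[ -s ${basedir}/log.txt ]]
    then
        token=$(cat "${basedir}/token.txt")
        message=$(cat "${basedir}/log.txt")
        url="https://api.telegram.org/bot${token}/sendMessage"
        # One message per chat id listed in dist_list.txt
        while read -r id
        do
            # Quote and URL-encode the text so that spaces and newlines reach the API intact
            curl -s -X POST "${url}" -d "chat_id=${id}" --data-urlencode "text=${message}"
        done <"${basedir}/dist_list.txt"
    fi
}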
This function takes the report created by the previous function and sends it through a Telegram bot.
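The call to the sendMessage method can be isolated in a small helper, which makes it easier to try the bot by hand. This is a minimal sketch: the name send_telegram and its argument order are assumptions, while the URL and the chat_id and text parameters are the ones used above.

```shell
# Hypothetical helper: send one text message to one chat.
# $1 = bot token, $2 = chat id, $3 = message text
send_telegram() {
    # --data-urlencode keeps spaces and newlines of the text intact
    curl -s -X POST "https://api.telegram.org/bot$1/sendMessage" \
        -d "chat_id=$2" \
        --data-urlencode "text=$3"
}
```

For example, send_telegram "$(cat token.txt)" 12334 "test message" would send a test message to the first id of the sample dist_list.txt, assuming the token is valid.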
Variable parser
OK, we have defined a lot of functions, but they need a lot of variables, so this is the parser that sets them:
function get_variables(){
    basedir=$(dirname "$0")
    machine=$(echo $line | awk '{print $1}')
    user=$(echo $line | awk '{print $2}')
    free_ram_perc=$(echo $line | awk '{print $3}')
    free_cpu_perc=$(echo $line | awk '{print $4}')
    free_fs=$(echo $line | awk '{print $5}')
    # The configuration gives the minimum free space, while df reports the used space
    busy_fs=$((100 - free_fs))
}
This parser runs on every line of the configuration file and extracts all the given parameters. The next function checks that all the variables are meaningful:
function check_variables(){
    check_var=0
    if [[ ${free_ram_perc} -lt 0 ]] || [[ ${free_ram_perc} -gt 100 ]]
    then
        echo "Ram percentage value is not valid"
        check_var=1
    fi
    if [[ ${free_cpu_perc} -lt 0 ]] || [[ ${free_cpu_perc} -gt 100 ]]
    then
        echo "Cpu percentage value is not valid"
        check_var=1
    fi
    if [[ ${busy_fs} -lt 0 ]] || [[ ${busy_fs} -gt 100 ]]
    then
        echo "Free fs value is not valid"
        check_var=1
    fi
}
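The range checks above assume that every field is already an integer. As a hedged sketch, parsing and validation can be combined so that missing or non-numeric fields are rejected too. The names is_percentage and parse_line are hypothetical; the variables they set are the same ones used by get_variables.

```shell
# Hypothetical helper: succeed only if $1 is an integer between 0 and 100.
is_percentage() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or containing a non-digit
    esac
    [ "$1" -le 100 ]
}

# Hypothetical parser: split one configuration line into the same
# variables that get_variables sets, using a single read.
parse_line() {
    read -r machine user free_ram_perc free_cpu_perc free_fs <<EOF
$1
EOF
    is_percentage "${free_ram_perc}" &&
        is_percentage "${free_cpu_perc}" &&
        is_percentage "${free_fs}" || return 1
    busy_fs=$((100 - ${free_fs}))
}

if parse_line "host1 Vito 5 5 20"
then
    echo "${machine} ${user} busy_fs=${busy_fs}"
fi
```

A line with a missing, extra or non-numeric field makes parse_line return 1, so the caller can skip that machine.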
Putting it all together: Main
Now we can define the main, putting all the functions together:
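# basedir must be known before the loop can open the host list
basedir=$(dirname "$0")
cat ${basedir}/hostname.txt | while read line
do
    echo $line
    get_variables
    check_variables
    if [[ ${check_var} -eq 0 ]]
    then
        # Start every machine with an empty report
        >${basedir}/log.txt
        check_alive
        echo $alive
        if [[ ${alive} -eq 0 ]]
        then
            check_ram
            check_cpu
            check_fs
            # Build the report only from the fresh results of this machine
            report
        else
            echo "Machine ${machine} doesn't respond to ping" >${basedir}/log.txt
        fi
        send_alert
    fi
    echo "check of the machine ${machine} terminated"
done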
Configuration file
Now we need to create the configuration files:
$ vi hostname.txt
host1 Vito 5 5 20
host2 Vito 10 10 50
$ vi token.txt
aaaabbbbbcccc
$ vi dist_list.txt
12334
45566
In the first file we write all the machines that we need to monitor, with this syntax: machine user free_ram% free_cpu% free_fs%.
In the second file we put the token of the Telegram bot. To generate the token, follow the official documentation.
In the third file we write the Telegram ids of all the operators who want to be alerted. You can find your id through @userinfobot.
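Since the script depends on all three files, a quick sanity check before the first run can save some debugging. The sketch below is an assumption about how such a check could look; the function name check_config_files is hypothetical and only the file names come from the example above.

```shell
# Hypothetical helper: verify that the three configuration files exist
# and are not empty in the directory given as $1.
check_config_files() {
    missing=0
    for name in hostname.txt token.txt dist_list.txt
    do
        if [ ! -s "$1/${name}" ]
        then
            echo "Missing or empty: $1/${name}"
            missing=1
        fi
    done
    return "${missing}"
}
```

It returns 0 when everything is in place, so it can guard the main loop, for example with check_config_files "${basedir}" || exit 1.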
SSH Passwordless
Lastly, you need to set up a passwordless SSH connection from the monitoring machine to all the machines that need to be monitored. Just follow this.
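One common way to do this is key-based authentication with ssh-keygen and ssh-copy-id. The sketch below only prints the ssh-copy-id commands for the hosts listed in hostname.txt, so they can be reviewed before being run; the helper name print_copy_commands and the key path in the usage example are assumptions.

```shell
# Generate a key pair once on the monitoring machine, for example:
#   ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519
#
# Hypothetical helper: print, without running them, the ssh-copy-id
# commands for every "machine user ..." line of the host list in $1.
# $2 is the public key to install (an assumed path, adjust as needed).
print_copy_commands() {
    while read -r machine user _rest
    do
        # Skip blank lines
        [ -n "${machine}" ] || continue
        echo "ssh-copy-id -i $2 ${user}@${machine}"
    done <"$1"
}
```

For example, print_copy_commands hostname.txt ~/.ssh/id_ed25519.pub lists one command per monitored machine; each of them asks for the remote password one last time.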
Git Project
I’ve published the project of the bot here: https://github.com/crujiff/alarmbot
Feel free to share or improve it.
Thanks for reading, I hope that you’ll find this useful!
Regards