Monitoring - OpenSkill

Sharing Memcached and Redis Monitoring Scripts

Operations | posted by push | 0 comments | 2661 views | 2016-09-20 16:05

Memcached:

#!/usr/bin/env python
# coding=utf8
# Print one Memcached statistic; the key (e.g. curr_connections) is the only argument.

import sys


class GetMemStatus(object):
    def __init__(self):
        self.val = {}

    def check(self):
        # Import lazily so a missing dependency produces a clear message.
        try:
            import memcache
        except ImportError:
            raise Exception('Plugin needs the memcache module')
        self.mc = memcache.Client(['127.0.0.1:11211'], debug=0)

    def extract(self, key):
        # get_stats() returns a list of (server, stats_dict) tuples, one per server.
        stats = self.mc.get_stats()
        try:
            self.val[key] = stats[0][1][key]
        except (IndexError, KeyError):
            raise Exception('ERROR: key is not in stats!!!')
        return self.val[key]


def main():
    if len(sys.argv) == 1:
        print("ERROR! Please enter a key")
    elif len(sys.argv) == 2:
        key = sys.argv[1]
        a = GetMemStatus()
        a.check()
        print(a.extract(key))


if __name__ == "__main__":
    main()

Redis:

#!/usr/bin/env python
# coding=utf8
# Print one field of the Redis INFO output; the key (e.g. connected_clients) is the only argument.

import sys


class GetRedisStatus(object):
    def __init__(self):
        self.val = {}

    def check(self):
        # Import lazily so a missing dependency produces a clear message.
        try:
            import redis
        except ImportError:
            raise Exception('Plugin needs the redis module')
        self.redis = redis.Redis('127.0.0.1', port=6379, password=None)

    def extract(self, key):
        # info() returns the INFO output as a dict.
        info = self.redis.info()
        try:
            self.val[key] = info[key]
        except KeyError:
            raise Exception('ERROR info not include this key!')
        return self.val[key]


def main():
    if len(sys.argv) == 1:
        print("ERROR! Please enter a key")
    elif len(sys.argv) == 2:
        key = sys.argv[1]
        a = GetRedisStatus()
        a.check()
        print(a.extract(key))


if __name__ == "__main__":
    main()
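
The Memcached script above returns one raw counter per invocation. A derived figure such as the cache hit ratio needs two of them. The following is a minimal sketch, assuming the standard get_hits and get_misses stats keys; the helper name hit_ratio and the sample values are hypothetical and not part of the original scripts.

```python
def hit_ratio(stats):
    """Return get_hits / (get_hits + get_misses), or None if there were no gets.

    `stats` is a dict shaped like the per-server dict returned by
    memcache.Client.get_stats(); values may be strings, so they are cast.
    """
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    if total == 0:
        return None
    return float(hits) / total


if __name__ == "__main__":
    # Sample values for illustration only, not real measurements.
    sample = {'get_hits': '900', 'get_misses': '100'}
    print(hit_ratio(sample))
```

Returning None for an idle server keeps the caller from reporting a misleading 0% hit ratio.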

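Introduction to the InfluxData Monitoring System

Open Source Projects | posted by koyo | 0 comments | 5381 views | 2016-09-12 20:03

InfluxData is a powerful real-time monitoring system made up of four parts (the TICK stack), plus an Enterprise Manager UI that is still under development. The components are as follows:

Telegraf

Telegraf collects monitoring data and writes it to the InfluxDB database. It supports many kinds of input, such as httpjson, mysql and rabbitMQ.

InfluxDB

InfluxDB is an open-source distributed database for time series, events and metrics. It is written in Go and has no external dependencies. It is designed to be distributed and to scale horizontally.

Chronograf

A visualization tool for InfluxDB time-series data. It reads data from InfluxDB and publishes the resulting charts as web pages.

Kapacitor

Kapacitor is the data processing engine for InfluxDB. Its main jobs are processing, monitoring and alerting on time-series data.

Enterprise Manager

Enterprise Manager is a UI that is still under development and is intended for broader graphical display.

The InfluxData platform is presented as the first purpose-built, end-to-end solution for collecting, storing, visualizing and alerting on time-series data at scale. All components of the platform are designed as a stack and work together seamlessly. What is the TICK stack? It is InfluxData's vision of a complete data platform for managing time-series data.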
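
Telegraf hands points to InfluxDB in the InfluxDB line protocol (measurement, optional tags, fields, optional timestamp). The sketch below shows roughly what one such line looks like when built by hand. It is an illustration under stated assumptions: the function names, the measurement name redis and the sample values are hypothetical, and a real client library would normally do this for you.

```python
def escape_tag(value):
    # Tag keys, tag values and field keys escape commas, equals signs and spaces.
    return str(value).replace(',', r'\,').replace('=', r'\=').replace(' ', r'\ ')


def format_field(value):
    # Booleans are tested first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return 'true' if value else 'false'
    if isinstance(value, int):
        return '%di' % value  # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    # Strings are double-quoted, with backslashes and quotes escaped.
    return '"%s"' % str(value).replace('\\', '\\\\').replace('"', '\\"')


def to_line(measurement, tags, fields, timestamp_ns=None):
    # Measurement names escape commas and spaces only.
    head = measurement.replace(',', r'\,').replace(' ', r'\ ')
    for key in sorted(tags):
        head += ',%s=%s' % (escape_tag(key), escape_tag(tags[key]))
    body = ','.join('%s=%s' % (escape_tag(k), format_field(fields[k]))
                    for k in sorted(fields))
    line = '%s %s' % (head, body)
    if timestamp_ns is not None:
        line += ' %d' % timestamp_ns
    return line


if __name__ == "__main__":
    print(to_line('redis', {'host': 'centos02'},
                  {'connected_clients': 12, 'used_memory_rss': 1.5},
                  1474358700000000000))
```

Tags and fields are sorted by key so that the same point always serializes to the same line.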

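Ganglia as an Operations Monitoring Platform

Operations | posted by Ansible | 0 comments | 4121 views | 2016-08-24 21:24

Introduction to Ganglia

Ganglia is a scalable distributed monitoring system designed for HPC (high-performance computing) clusters. It monitors and displays the state of the nodes in a cluster. The gmond daemon running on each node collects data such as CPU, memory and disk utilization, I/O load and network traffic. The data is then aggregated by the gmetad daemon and stored with rrdtool, and the history is finally rendered as graphs on PHP pages.

Ganglia has the following characteristics:

Good scalability: the layered architecture suits large server clusters
Low overhead, with support for high concurrency
Broad support for operating systems (UNIX and others) and CPU architectures, with support for virtualization

Ganglia components

The Ganglia monitoring system consists of three parts: gmond, gmetad and webfrontend. Their roles are as follows.

gmond: the ganglia monitoring daemon. It runs on every node that needs to be monitored, collects that node's information and sends it to other nodes, and also receives data sent by other nodes. Its default listening port is 8649.
gmetad: the ganglia meta daemon. It runs on a data aggregation node, periodically polls the gmond process on each monitored node and fetches data from it, then stores the metrics in the local RRD storage engine.
webfrontend: a web-based graphical monitoring interface. It must be installed on the same node as gmetad. It fetches data from gmetad, reads the RRD databases and presents the result through the front end. The interface is attractive, rich and powerful. Its structure is shown in the figure below:

Environment (CentOS 6.7): server 172.16.80.117; clients 172.16.80.117 and 172.16.80.116

Installing Ganglia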

[root@centos02 tools]# wget wget
[root@centos02 tools]# rpm -ivh epel-release-6-8.noarch.rpm
[root@centos02 tools]# yum install ganglia-gmetad.x86_64  ganglia-gmond.x86_64 ganglia-gmond-python.x86_64  -y

Edit the server-side configuration file

[root@centos02 tools]# vim /etc/ganglia/gmetad.conf
data_source "my cluster"  172.16.80.117 172.16.80.116
gridname "MyGrid"

Install ganglia web (on an LNMP environment)

[root@centos02 tools]# tar xf ganglia-web-3.7.2.tar.gz
[root@centos02 tools]# mv ganglia-web-3.7.2 /application/nginx/html/ganglia

Edit the PHP configuration file of ganglia web

[root@centos02 tools]# vim /application/nginx/html/ganglia/conf_default.php
$conf['gweb_confdir'] = "/application/nginx/html/ganglia";

nginx configuration

[root@centos02 ganglia]# cat /application/nginx/conf/nginx.conf
worker_processes  2;
events {
    worker_connections  1024;
}
http {
    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';
    include       mime.types;
    default_type  application/octet-stream;
    sendfile        on;
    keepalive_timeout  65;

    server {
        listen       80;
        server_name  www.martin.com martin.com;
        location / {
            root   html/zabbix;
            index  index.php index.html index.htm;
        }
        location ~ .*\.(php|php5)?$ {
            root  html/zabbix;
            fastcgi_pass 127.0.0.1:9000;
            fastcgi_index index.php;
            include fastcgi.conf;
        }
        access_log  logs/access_zabbix.log  main;
    }

    server {
        listen       80;
        server_name  ganglia.martin.com;
        location / {
            root   html/ganglia;
            index  index.php index.html index.htm;
        }
        location ~ .*\.(php|php5)?$ {
            root   html/ganglia;
            fastcgi_pass 127.0.0.1:9000;
            fastcgi_index index.php;
            include fastcgi.conf;
        }
        access_log  logs/access_bbs.log  main;
    }

    ###status
    server {
        listen 80;
        server_name status.martin.org;
        location / {
            stub_status on;
            access_log off;
        }
    }
}

Access test; the following error is reported:

Fatal error:
Errors were detected in your configuration.
DWOO compiled templates directory '/application/nginx/html/ganglia/dwoo/compiled' is not writeable.
Please adjust $conf['dwoo_compiled_dir'].
DWOO cache directory '/application/nginx/html/ganglia/dwoo/cache' is not writeable.
Please adjust $conf['dwoo_cache_dir'].
in /application/nginx-1.6.3/html/ganglia/eval_conf.php on line 126

Solution:

[root@centos02 tools]# mkdir /application/nginx/html/ganglia/dwoo/compiled
[root@centos02 tools]# mkdir /application/nginx/html/ganglia/dwoo/cache
[root@centos02 tools]# chmod 777 /application/nginx/html/ganglia/dwoo/compiled
[root@centos02 tools]# chmod 777 /application/nginx/html/ganglia/dwoo/cache
[root@centos02 html]# chmod -R 777 /var/lib/ganglia/rrds

Edit the client configuration file (required on every client)

[root@centos02 tools]# vim /etc/ganglia/gmond.conf
cluster {
  name = "my cluster"    # this name must match the name given after data_source on the server
  owner = "unspecified"
  latlong = "unspecified"
  url = "unspecified"
}

udp_send_channel {
  #bind_hostname = yes # Highly recommended, soon to be default.
                       # This option tells gmond to use a source address
                       # that resolves to the machine's hostname.  Without
                       # this, the metrics may appear to come from any
                       # interface and the DNS names associated with
                       # those IPs will be used to create the RRDs.
  #  mcast_join = 239.2.11.71
  host = 172.16.80.117      # unicast is used here; the default is multicast
  port = 8649
  #  ttl = 1
}

udp_recv_channel {
  #  mcast_join = 239.2.11.71
  port = 8649
  #  bind = 239.2.11.71
  retry_bind = true
  # Size of the UDP buffer. If you are handling lots of metrics you really
  # should bump it up to e.g. 10MB or even higher.
  # buffer = 10485760
}

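Access test

This graph is an aggregate for the whole cluster rather than for a single server. Next, open the graphs for an individual server.

Then look at the view that shows the same metric for every server side by side.

Ways to extend Ganglia monitoring

A default Ganglia installation only provides basic system metrics. Ganglia plugins offer two ways to extend what it monitors:

1. Add in-band plugins, mainly through the gmetric command. This is the commonly used approach: a crontab entry calls Ganglia's gmetric command to feed data into gmond, which brings the data into the same monitoring view. It is simple and works for a small number of metrics, but with large-scale custom monitoring the data becomes hard to manage consistently.

2. Add out-of-band plugins from other sources, mainly through the C or Python interface.

Ganglia 3.1.x and later provide a C and a Python interface. Through them you can write custom data-collection modules and plug them directly into gmond to monitor your own applications.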
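
The gmetric route described in the first item can be sketched as a small script that cron runs periodically. This is only an illustration: the metric name redis_connected_clients, the value and the helper names are hypothetical, and it assumes the gmetric binary is on PATH and that gmond is reachable with its default configuration.

```python
import subprocess


def gmetric_command(name, value, metric_type='uint32', units=''):
    """Build the argument list for one gmetric invocation."""
    cmd = ['gmetric', '--name', name, '--value', str(value), '--type', metric_type]
    if units:
        cmd += ['--units', units]
    return cmd


def publish(name, value, metric_type='uint32', units=''):
    # Runs gmetric and returns its exit status.
    return subprocess.call(gmetric_command(name, value, metric_type, units))


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works without Ganglia.
    print(' '.join(gmetric_command('redis_connected_clients', 12, units='clients')))
```

Building the argument list separately from running it makes the command easy to inspect before it is wired into crontab.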

As an example, here we use the out-of-band extension method to monitor the running state of nginx.

Configure the Ganglia client to collect nginx_status data

[root@centos02 nginx_status]# pwd

/tools/gmond_python_modules-master/nginx_status

[root@centos02 nginx_status]# cp conf.d/nginx_status.pyconf /etc/ganglia/conf.d/

[root@centos02 nginx_status]# cp python_modules/nginx_status.py  /usr/lib64/ganglia/python_modules/

[root@centos02 nginx_status]# cp graph.d/nginx_* /application/nginx/html/ganglia/graph.d/

 

[root@centos02 mysql]# cat /etc/ganglia/conf.d/nginx_status.pyconf 

#

 

modules {

  module {

    name = 'nginx_status'

    language = 'python'

 

    param status_url {

      value = 'http://status.martin.org/'

    }

    param nginx_bin {

      value = '/application/nginx/sbin/nginx'

    }

    param refresh_rate {

      value = '15'

    }

  }

}

 

collection_group {

  collect_once = yes

  time_threshold = 20

 

  metric {

    name = 'nginx_server_version'

    title = "Nginx Version"

  }

}

 

collection_group {

  collect_every = 10

  time_threshold = 20

 

  metric {

    name = "nginx_active_connections"

    title = "Total Active Connections"

    value_threshold = 1.0

  }

 

  metric {

    name = "nginx_accepts"

    title = "Total Connections Accepted"

    value_threshold = 1.0

  }

 

  metric {

    name = "nginx_handled"

    title = "Total Connections Handled"

    value_threshold = 1.0

  }

 

  metric {

    name = "nginx_requests"

    title = "Total Requests"

    value_threshold = 1.0

  }

 

  metric {

    name = "nginx_reading"

    title = "Connections Reading"

    value_threshold = 1.0

  }

 

  metric {

    name = "nginx_writing"

    title = "Connections Writing"

    value_threshold = 1.0

  }

 

  metric {

    name = "nginx_waiting"

    title = "Connections Waiting"

    value_threshold = 1.0

  }

}

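After completing all the steps above, restart the gmond service on the Ganglia client. Running "gmond -m" on the client lists the supported metrics. The Nginx status can then be viewed in the Ganglia web interface.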
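
A module like nginx_status has to turn the plain-text stub_status page into the metrics named in the pyconf file. The following is only an illustrative sketch of that parsing step, not the code of nginx_status.py itself. The function name parse_stub_status and the sample page are assumptions; the metric names are the ones used in the configuration above.

```python
import re


def parse_stub_status(text):
    """Parse the plain-text page produced by nginx's stub_status directive."""
    result = {}
    m = re.search(r'Active connections:\s+(\d+)', text)
    if m:
        result['nginx_active_connections'] = int(m.group(1))
    # The counters line holds three integers: accepts, handled, requests.
    m = re.search(r'^\s*(\d+)\s+(\d+)\s+(\d+)\s*$', text, re.MULTILINE)
    if m:
        result['nginx_accepts'] = int(m.group(1))
        result['nginx_handled'] = int(m.group(2))
        result['nginx_requests'] = int(m.group(3))
    m = re.search(r'Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)', text)
    if m:
        result['nginx_reading'] = int(m.group(1))
        result['nginx_writing'] = int(m.group(2))
        result['nginx_waiting'] = int(m.group(3))
    return result


if __name__ == "__main__":
    # Illustrative page content, not captured from the servers in this article.
    sample = ("Active connections: 291 \n"
              "server accepts handled requests\n"
              " 16630948 16630948 31070465 \n"
              "Reading: 6 Writing: 179 Waiting: 106 \n")
    for name, value in sorted(parse_stub_status(sample).items()):
        print('%s = %d' % (name, value))
```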

Reposted with permission; original article: http://huaxin.blog.51cto.com/903026/1841208

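10 Monitoring Lessons from the Bitly Operations Team

Operations | posted by Geek小A | 0 comments | 3336 views | 2016-06-29 23:01

bit.ly is a well-known URL shortening service. Bitly was founded in New York in 2008 and is said to shorten more than one billion links per month for sharing on social networks. On 6 May 2009 bit.ly became Twitter's default URL shortener, and it was later replaced by Twitter's own t.co. Earlier this year the bitly operations team published a post on its official engineering blog sharing some of the lessons it had learned. The full text follows.

We always monitor plenty of metrics (disk usage, memory usage, load, ping and so on). Beyond those, we have learned many lessons from running our own production systems, and those lessons have helped us extend what we monitor at bitly.

Below is one of my favourite tweets, from @DevOps_Borat:
Murphy's law for developers: if something can go wrong, that means it has already gone wrong; you just have not noticed yet.

Below is a list of things we monitor when running bitly. The stories behind these examples, some of which could fairly be called painful experiences, helped bitly grow.

1. Fork Rate

We once ran into the following problem. IPv6 had been disabled on a server by setting options ipv6 disable=1 and alias ipv6 off in /etc/modprobe.conf. That caused us a lot of trouble: every time a new curl object was created, modprobe was invoked to check net-pf-10 and determine the state of IPv6. This put a heavy load on the server. We eventually noticed that the process counter in /proc/stat was growing by hundreds per second, which led us to the cause. On a machine with steady traffic you would normally expect a fork rate of 1-10/s.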

#!/bin/bash

# Copyright bitly, Aug 2011 

# written by Jehiah Czebotar



DATAFILE="/var/tmp/nagios_check_forkrate.dat"

VALID_INTERVAL=600



OK=0

WARNING=1

CRITICAL=2

UNKNOWN=3



function usage()

{

    echo "usage: $0 --warn=<int> --critical=<int>"

    echo "this script checks the rate processes are created"

    echo "and alerts when it goes above a certain threshold"

    echo "it saves the value from each run in $DATAFILE"

    echo "and computes a delta on the next run. It will ignore"

    echo "any values that are older than --valid-interval=$VALID_INTERVAL (seconds)"

    echo "warn and critical values are in # of new processes per second"

}



while [ "$1" != "" ]; do

    PARAM=`echo $1 | awk -F= '{print $1}'`

    VALUE=`echo $1 | awk -F= '{print $2}'`

    case $PARAM in

        -w | --warn)

            WARN_THRESHOLD=$VALUE

            ;;

        -c | --critical)

            CRITICAL_THRESHOLD=$VALUE

            ;;

        --valid-interval)

            VALID_INTERVAL=$VALUE

            ;;

        -h | --help)

            usage

            exit 0;

            ;;

    esac

    shift

done



if [ -z "$WARN_THRESHOLD" ] || [ -z "$CRITICAL_THRESHOLD" ]; then

    echo "error: --warn and --critical parameters are required"

    exit $UNKNOWN

fi

if [[ $WARN_THRESHOLD -ge $CRITICAL_THRESHOLD ]]; then

    echo "error: --warn ($WARN_THRESHOLD) can't be greater than --critical ($CRITICAL_THRESHOLD)"

    exit $UNKNOWN

fi



NOW=`date +%s`

min_valid_ts=$(($NOW - $VALID_INTERVAL))

current_process_count=`awk '/processes/ {print $2}' /proc/stat`



if [ ! -f $DATAFILE ]; then

    mkdir -p $(dirname $DATAFILE)

    echo -e "$NOW\t$current_process_count" > $DATAFILE

    echo "Missing $DATAFILE; creating"

    exit $UNKNOWN

fi



# now compare this to previous

mv $DATAFILE{,.previous}

while read ts process_count; do

    if [[ $ts -lt $min_valid_ts ]]; then

        continue

    fi

    if [[ $ts -ge $NOW ]]; then

        # we can't use data from the same second

        continue

    fi

    # calculate the rate

    process_delta=$(($current_process_count - $process_count))

    ts_delta=$(($NOW - $ts))

    current_fork_rate=`echo "$process_delta / $ts_delta" | bc`

    echo -e "$ts\t$process_count" >> $DATAFILE

done < $DATAFILE.previous

echo -e "$NOW\t$current_process_count" >> $DATAFILE



echo "fork rate is $current_fork_rate processes/second (based on the last $ts_delta seconds)"

if [[ $current_fork_rate -ge $CRITICAL_THRESHOLD ]]; then

    exit $CRITICAL

fi

if [[ $current_fork_rate -ge $WARN_THRESHOLD ]]; then

    exit $WARNING

fi

exit $OK
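
The core of the check above is a delta between two samples of the cumulative "processes" counter in /proc/stat. The same calculation can be sketched in Python as follows. The function names are hypothetical, and unlike the bc call in the script, which truncates to an integer, this sketch keeps the fractional part.

```python
def read_process_count(stat_text):
    """Return the cumulative fork count from the text of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == 'processes':
            return int(fields[1])
    raise ValueError('no "processes" line found')


def fork_rate(prev_ts, prev_count, now_ts, now_count):
    """Forks per second between two samples; None if no time has elapsed."""
    elapsed = now_ts - prev_ts
    if elapsed <= 0:
        return None
    return (now_count - prev_count) / float(elapsed)


if __name__ == "__main__":
    import time
    # Take two samples one second apart (Linux only, needs /proc/stat).
    with open('/proc/stat') as f:
        first = read_process_count(f.read())
    start = time.time()
    time.sleep(1)
    with open('/proc/stat') as f:
        second = read_process_count(f.read())
    print(fork_rate(start, first, time.time(), second))
```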

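2. Flow Control Packets

See NIC flow control. If your network setup includes flow control packets and you have not disabled them, they can sometimes cause traffic to be dropped. (If that does not sound serious enough to you, you may want to get your head checked.)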

$ /usr/sbin/ethtool -S eth0 | grep flow_control

rx_flow_control_xon: 0

rx_flow_control_xoff: 0

tx_flow_control_xon: 0

tx_flow_control_xoff: 0

Note: the original post links to further reading on how these flow control frames relate to loss of link when certain Broadcom NICs are used.

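3. Swap In/Out Rate

People usually check whether swap usage exceeds some threshold. But even if only a small amount of memory is swapped, what actually hurts performance is the rate of swapping in and out, not the amount. Checking the swap in/out rate is the more direct check.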

#!/bin/bash

# Show the rate of swapping (in number of pages) between executions



OK=0

WARNING=1

CRITICAL=2

UNKNOWN=3

EXITFLAG=$OK



WARN_THRESHOLD=1

CRITICAL_THRESHOLD=100



IN_DATAFILE="/var/tmp/nagios_check_swap_pages_in.dat"

OUT_DATAFILE="/var/tmp/nagios_check_swap_pages_out.dat"

VALID_INTERVAL=600



function usage()

{

    echo "usage: $0 --warn=<int> --critical=<int>"

    echo "Script checks for any swap usage"

}



while [ "$1" != "" ]; do

    PARAM=`echo $1 | awk -F= '{print $1}'`

    VALUE=`echo $1 | awk -F= '{print $2}'`

    case $PARAM in

        --warn)

            WARN_THRESHOLD=$VALUE

            ;;

        --critical)

            CRITICAL_THRESHOLD=$VALUE

            ;;

        -h | --help)

            usage

            exit 0;

            ;;

    esac

    shift

done



NOW=`date +%s`

min_valid_ts=$(($NOW - $VALID_INTERVAL))



CURRENT_PAGES_SWAPPED_IN=`vmstat -s | grep 'pages swapped in' | awk '{print $1}'`

CURRENT_PAGES_SWAPPED_OUT=`vmstat -s | grep 'pages swapped out' | awk '{print $1}'`



mkdir -p $(dirname $IN_DATAFILE)

if [ ! -f $IN_DATAFILE ]; then

    echo -e "$NOW\t$CURRENT_PAGES_SWAPPED_IN" > $IN_DATAFILE

    echo "Missing $IN_DATAFILE; creating"

    EXITFLAG=$UNKNOWN

fi

if [ ! -f $OUT_DATAFILE ]; then

    echo -e "$NOW\t$CURRENT_PAGES_SWAPPED_OUT" > $OUT_DATAFILE

    echo "Missing $OUT_DATAFILE; creating"

    EXITFLAG=$UNKNOWN

fi



if [ $EXITFLAG != $OK ]; then

    exit $EXITFLAG

fi



function swap_rate() {

    local file=$1

    local current=$2

    local rate=0



    mv $file ${file}.previous

    while read ts swap_count; do

        if [[ $ts -lt $min_valid_ts ]]; then

            continue

        fi

        if [[ $ts -ge $NOW ]]; then

            # we can't use data from the same second

            continue

        fi

        # calculate the rate

        swap_delta=$(($current - $swap_count))

        ts_delta=$(($NOW - $ts))

        rate=`echo "$swap_delta / $ts_delta" | bc`

        echo -e "$ts\t$swap_count" >> $file

    done < ${file}.previous

    echo -e "$NOW\t$current" >> $file

    echo $rate

}



in_rate=`swap_rate $IN_DATAFILE $CURRENT_PAGES_SWAPPED_IN`

out_rate=`swap_rate $OUT_DATAFILE $CURRENT_PAGES_SWAPPED_OUT`



echo "swap in/out is $in_rate/$out_rate per second"

if [[ $in_rate -ge $CRITICAL_THRESHOLD ]] || [[ $out_rate -ge $CRITICAL_THRESHOLD ]]; then

    exit $CRITICAL

fi

if [[ $in_rate -ge $WARN_THRESHOLD ]] || [[ $out_rate -ge $WARN_THRESHOLD ]]; then

    exit $WARNING

fi

exit $OK

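4. Server Boot Notification

Unexpected reboots are part of life. Do you know when your servers reboot? Many people do not. We use a simple init script that sends an email notification when the system boots. This is useful when adding new servers. When a server misbehaves, it also gives a graceful heads-up about the change in server state rather than just an alert.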

#!/bin/bash

#

# *************************************************

# chkconfig: 2345 99 99

# description: notify email address on system boot.

# *************************************************

# Installing:

# 1) save as /etc/rc.d/init.d/notify

# 2) set the desired email address in "MAILADD" variable

# 3) chmod a+x /etc/rc.d/init.d/notify

# 4) /sbin/chkconfig --level 2345 notify on



PATH=/bin:/usr/sbin:/usr/bin

SERVER=`hostname`

case $1 in

    start)

        PUBLIC_IP=`curl --connect-timeout 5 -s icanhazip.com`

        PUBLIC_IPV6=`curl -6 --connect-timeout 5 -s icanhazip.com`

        MAILADD=your@email.example

        mail -s " Boot of $SERVER" $MAILADD <<EOF
From: $0
To: $MAILADD
$SERVER has booted up.
public ip $PUBLIC_IP $PUBLIC_IPV6
If this is news to you, please investigate.
`date -u`
EOF

    ;;

esac

exit 0

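5. NTP Clock Skew

If this is not being monitored then, yes, one of your servers is probably already off. If you have never thought about clock skew, you may not even be running ntpd on your servers. There are generally three things to check:

1) Whether ntpd is running.
2) Clock skew inside your data center.
3) Clock skew between your master time servers and an external source.

We use the Nagios check_ntp_time plugin for this check.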

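6. DNS Resolution

Internal DNS: this is a hidden part of your infrastructure that you rely on but often overlook. The things to check are:

1) Local resolution from each server.
2) If you have local DNS servers in your data center, check resolution and the number of queries.
3) Check that each upstream DNS resolver you use is available.

External DNS: it is good to verify that your external domains resolve correctly against each of your published external name servers. At bitly we also rely on several ccTLDs, and we monitor those authoritative servers directly as well. (Yes, it has happened that all the name servers for a TLD were offline.)

7. SSL Expiration

This happens so rarely that many people forget about it. The fix is simple: renew the SSL certificate.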

define command{

    command_name    check_ssl_expire

    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$

}

define service{

    host_name               virtual

    service_description     bitly_com_ssl_expiration

    use                     generic-service

    check_command           check_ssl_expire!bitly.com

    contact_groups          email_only

    normal_check_interval   720

    retry_check_interval    10

    notification_interval   720

}
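
The Nagios definition above delegates the work to check_http. For readers who want to see the underlying idea, the sketch below computes the number of days left on a certificate with the Python 3 standard library. The function names are hypothetical, and this is not how check_http is implemented.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days between `now` (epoch seconds) and a certificate's notAfter string."""
    if now is None:
        now = time.time()
    expires = ssl.cert_time_to_seconds(not_after)
    return int((expires - now) // 86400)


def fetch_not_after(hostname, port=443, timeout=5):
    """Connect with TLS and return the peer certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()['notAfter']


if __name__ == "__main__":
    # Needs network access; bitly.com is the host used in the Nagios example above.
    print(days_until_expiry(fetch_not_after('bitly.com')))
```

Because the default context verifies the certificate, an already expired certificate makes the handshake fail with an exception instead of returning a negative number, so a caller would want to treat that exception as critical.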

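8. DELL OpenManage Server Administrator (OMSA)

bitly is deployed across two data centers, one on DELL hardware and the other on Amazon EC2. For the DELL hardware, monitoring the output of OMSA is very important. It alerts us to RAID status, failed disks (predictive hardware failures), memory problems, power supply status and more.

9. Connection Limits

You probably run things like memcached and mysql with connection limits, but as you scale out the application tier, do you actually monitor how close you are to those limits?

Related to this is dealing with processes that run into file descriptor limits. In practice, we often add ulimit -n 65535 to the startup scripts of our services to minimize the impact of these limits. For Nginx, the limit can also be set with worker_rlimit_nofile.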
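
How close a process is to its file descriptor limit can be measured directly. The following sketch does so for the current process on Linux. The function names are hypothetical, and checking another process would mean reading /proc/<pid>/limits instead, which is not shown here.

```python
import os
import resource


def current_fd_usage():
    """Return (open_fds, soft_limit) for the current process (Linux only)."""
    open_fds = len(os.listdir('/proc/self/fd'))
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fds, soft


def usage_ratio(open_fds, soft_limit):
    """Fraction of the limit in use; 0.0 when the limit is unlimited."""
    if soft_limit == resource.RLIM_INFINITY or soft_limit <= 0:
        return 0.0
    return open_fds / float(soft_limit)


if __name__ == "__main__":
    fds, limit = current_fd_usage()
    print('%d of %d file descriptors in use (%.1f%%)'
          % (fds, limit, 100 * usage_ratio(fds, limit)))
```

Alerting on the ratio rather than on an absolute count means the same threshold still makes sense after the limit is raised with ulimit -n.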

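10. Load Balancer Status

We configure our load balancers with a health check, which lets us easily take a server out of rotation. (If a server goes down, the load balancer detects it and stops sending traffic to that server. Translator's note.) We have found that visibility into the health check is very important, so we monitor and alert based on the same health check. (If you use EC2 load balancers, you can monitor the ELB state through Amazon's API.)

Odds and ends (these should be monitored too)
Nginx error logs, service restarts (assuming services are restarted on errors), numa statistics, and core dumps from new processes.

Conclusion
The above only scratches the surface of what we do to keep bitly running reliably. If it struck a chord with you, see the original posts linked below.
Chinese translation: http://blog.jobbole.com/62783/
English original: http://word.bitly.com/post/74839060954/ten-things-to-monitor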

Comparing and Choosing Monitoring Software

Operations | posted by Rock | 2 comments | 4326 views | 2016-01-03 04:18

Feature comparison of Cacti, Nagios and Zabbix

Other comparisons

Scripts for Monitoring MySQL Master-Slave Replication

Operations | posted by 空心菜 | 2 comments | 3187 views | 2015-11-24 00:02

Shell version

#!/bin/bash
#Auth: lucky.chen

hosts="192.168.3.9:3305
192.168.3.10:3306
"
for i in $hosts
do
    # Reset per-host state so one host's problems are not reported for the next.
    declare -i alert=0
    status=""
    host=`echo $i|awk -F':' '{print $1}'`
    port=`echo $i|awk -F':' '{print $2}'`
    # Query the slave status once and pick the three fields out of the result.
    slave_status=`mysql -uwrite -P$port -p'write@jkb' -h${host} -e "show slave status\G"`
    IO=`echo "$slave_status"|grep Slave_IO_Running: |awk '{print $NF}'`
    SQL=`echo "$slave_status"|grep Slave_SQL_Running: |awk '{print $NF}'`
    declare -i BEHIN=`echo "$slave_status"|grep Seconds_Behind_Master|awk '{print $NF}'`

    if [ "$IO" != Yes ] ;then
        status="${status} \n IO is $IO"
        alert=1
    fi

    if [ "$SQL" != Yes ] ;then
        status="${status} \n SQL is $SQL"
        alert=1
    fi

    if [[ $BEHIN -gt 100 ]] ;then
        status="${status} \n behind master $BEHIN second"
        alert=1
    fi

    if [[ alert -eq 1 ]] ;then
        echo -e "$host : $status"
        php /usr/local/bin/sendmail/tongbu.php  "$host $status" "$status"
    fi
done

Simple Python version

#!/usr/bin/env python
# -*- coding: utf8 -*-
import MySQLdb
from MySQLdb import cursors
import threading

slaveList = [
    'ip list'
]


def getSlaveTime(host):
    try:
        username = 'username'
        passwd = 'password'
        conn = MySQLdb.connect(user=username, passwd=passwd, host=host,
                               connect_timeout=5, cursorclass=cursors.DictCursor)
        cur = conn.cursor()
        cur.execute('show slave status')
        fallSec = cur.fetchone()['Seconds_Behind_Master']
        cur.close()
        conn.close()
        print(host + ' is behind by ' + str(fallSec))
    except Exception:
        # Any failure (connection error, host is not a slave) is reported as a huge lag.
        print(host + ' is behind by ' + str(10000000))


# Check every slave concurrently, one thread per host.
for host in slaveList:
    s = threading.Thread(target=getSlaveTime, args=(host,))
    s.start()
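
Both versions apply the same three tests to the output of SHOW SLAVE STATUS. That decision logic can be separated from the database access and checked on its own. The sketch below takes one result row as a dict, for example from a DictCursor. The function name replication_problems and the default threshold of 100 seconds, taken from the shell version, are illustrative assumptions.

```python
def replication_problems(row, max_lag=100):
    """Return a list of problems found in one SHOW SLAVE STATUS row (a dict)."""
    if row is None:
        return ['host is not configured as a slave']
    problems = []
    if row.get('Slave_IO_Running') != 'Yes':
        problems.append('IO is %s' % row.get('Slave_IO_Running'))
    if row.get('Slave_SQL_Running') != 'Yes':
        problems.append('SQL is %s' % row.get('Slave_SQL_Running'))
    lag = row.get('Seconds_Behind_Master')
    if lag is None:
        # MySQL reports NULL here when replication is not running.
        problems.append('lag is unknown')
    elif int(lag) > max_lag:
        problems.append('behind master %d second' % int(lag))
    return problems


if __name__ == "__main__":
    # Sample row for illustration only.
    sample = {'Slave_IO_Running': 'Yes', 'Slave_SQL_Running': 'No',
              'Seconds_Behind_Master': None}
    print(replication_problems(sample))
```

An empty list means the slave is healthy, so a caller only needs to alert when the list is non-empty.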

Bitly运维团队的10个监控教训

运维 Geek小A 发表了文章 0 个评论 3336 次浏览 2016-06-29 23:01 来自相关话题

#!/bin/bash

# Copyright bitly, Aug 2011 

# written by Jehiah Czebotar



DATAFILE="/var/tmp/nagios_check_forkrate.dat"

VALID_INTERVAL=600



OK=0

WARNING=1

CRITICAL=2

UNKNOWN=-1



function usage()

{

    echo "usage: $0 --warn= --critical="

    echo "this script checks the rate processes are created"

    echo "and alerts when it goes above a certain threshold"

    echo "it saves the value from each run in $DATAFILE"

    echo "and computes a delta on the next run. It will ignore"

    echo "any values that are older than --valid-interval=$VALID_INTERVAL (seconds)"

    echo "warn and critical values are in # of new processes per second"

}



while [ "$1" != "" ]; do

    PARAM=`echo $1 | awk -F= '{print $1}'`

    VALUE=`echo $1 | awk -F= '{print $2}'`

    case $PARAM in

        -w | --warn)

            WARN_THRESHOLD=$VALUE

            ;;

        -c | --critical)

            CRITICAL_THRESHOLD=$VALUE

            ;;

        --valid-interval)

            VALID_INTERVAL=$VALUE

            ;;

        -h | --help)

            usage

            exit 0;

            ;;

    esac

    shift

done



if [ -z "$WARN_THRESHOLD" ] || [ -z "$CRITICAL_THRESHOLD" ]; then

    echo "error: --warn and --critical parameters are required"

    exit $UNKNOWN

fi

if [[ $WARN_THRESHOLD -ge $CRITICAL_THRESHOLD ]]; then

    echo "error: --warn ($WARN_THRESHOLD) can't be greater than --critical ($CRITICAL_THRESHOLD)"

    exit $UNKNOWN

fi



NOW=`date +%s`

min_valid_ts=$(($NOW - $VALID_INTERVAL))

current_process_count=`awk '/processes/ {print $2}' /proc/stat`



if [ ! -f $DATAFILE ]; then

    mkdir -p $(dirname $DATAFILE)

    echo -e "$NOW\t$current_process_count" > $DATAFILE

    echo "Missing $DATAFILE; creating"

    exit $UNKNOWN

fi



# now compare this to previous

mv $DATAFILE{,.previous}

while read ts process_count; do

    if [[ $ts -lt $min_valid_ts ]]; then

        continue

    fi

    if [[ $ts -ge $NOW ]]; then

        # we can't use data from the same second

        continue

    fi

    # calculate the rate

    process_delta=$(($current_process_count - $process_count))

    ts_delta=$(($NOW - $ts))

    current_fork_rate=`echo "$process_delta / $ts_delta" | bc`

    echo -e "$ts\t$process_count" >> $DATAFILE

done < $DATAFILE.previous

echo -e "$NOW\t$current_process_count" >> $DATAFILE



echo "fork rate is $current_fork_rate processes/second (based on the last $ts_delta seconds)"

if [[ $current_fork_rate -ge $CRITICAL_THRESHOLD ]]; then

    exit $CRITICAL

fi

if [[ $current_fork_rate -ge $WARN_THRESHOLD ]]; then

    exit $WARNING

fi

exit $OK
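The core of the fork-rate check above is a delta between two readings of the cumulative `processes` counter in `/proc/stat`, divided by the elapsed seconds. The following Python 3 sketch illustrates that calculation only; the function names are hypothetical, the sample text is made up, and unlike the `bc` call in the script it keeps the fractional part of the rate.

```python
import re


def read_process_count(stat_text):
    """Return the cumulative fork count from /proc/stat-style text."""
    match = re.search(r"^processes\s+(\d+)$", stat_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'processes' line found")
    return int(match.group(1))


def fork_rate(prev_ts, prev_count, now_ts, now_count):
    """Forks per second between two samples; None if no time has elapsed."""
    elapsed = now_ts - prev_ts
    if elapsed <= 0:
        # Same rule as the script: data from the same second is unusable.
        return None
    return (now_count - prev_count) / elapsed


def nagios_state(rate, warn, critical):
    """Map a rate to a plugin exit code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN."""
    if rate is None:
        return 3
    if rate >= critical:
        return 2
    if rate >= warn:
        return 1
    return 0


if __name__ == "__main__":
    # Made-up sample standing in for the contents of /proc/stat.
    sample = "cpu  1 2 3 4\nprocesses 1200\nprocs_running 1\n"
    count = read_process_count(sample)
    rate = fork_rate(1000, 600, 1060, count)
    print(rate, nagios_state(rate, warn=5, critical=20))
```

On a real host the text would come from reading `/proc/stat`, and the previous sample from a state file such as the script's `$DATAFILE`.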

$ /usr/sbin/ethtool -S eth0 | grep flow_control

rx_flow_control_xon: 0

rx_flow_control_xoff: 0

tx_flow_control_xon: 0

tx_flow_control_xoff: 0

#!/bin/bash

# Show the rate of swapping (in number of pages) between executions



OK=0

WARNING=1

CRITICAL=2

UNKNOWN=3  # Nagios plugin exit code for UNKNOWN; "exit -1" would wrap to 255

EXITFLAG=$OK



WARN_THRESHOLD=1

CRITICAL_THRESHOLD=100



IN_DATAFILE="/var/tmp/nagios_check_swap_pages_in.dat"

OUT_DATAFILE="/var/tmp/nagios_check_swap_pages_out.dat"

VALID_INTERVAL=600



function usage()

{

    echo "usage: $0 --warn= --critical="

    echo "Script checks for any swap usage"

}



while [ "$1" != "" ]; do

    PARAM=`echo $1 | awk -F= '{print $1}'`

    VALUE=`echo $1 | awk -F= '{print $2}'`

    case $PARAM in

        --warn)

            WARN_THRESHOLD=$VALUE

            ;;

        --critical)

            CRITICAL_THRESHOLD=$VALUE

            ;;

        -h | --help)

            usage

            exit 0;

            ;;

    esac

    shift

done



NOW=`date +%s`

min_valid_ts=$(($NOW - $VALID_INTERVAL))



CURRENT_PAGES_SWAPPED_IN=`vmstat -s | grep 'pages swapped in' | awk '{print $1}'`

CURRENT_PAGES_SWAPPED_OUT=`vmstat -s | grep 'pages swapped out' | awk '{print $1}'`



mkdir -p $(dirname $IN_DATAFILE)

if [ ! -f $IN_DATAFILE ]; then

    echo -e "$NOW\t$CURRENT_PAGES_SWAPPED_IN" > $IN_DATAFILE

    echo "Missing $IN_DATAFILE; creating"

    EXITFLAG=$UNKNOWN

fi

if [ ! -f $OUT_DATAFILE ]; then

    echo -e "$NOW\t$CURRENT_PAGES_SWAPPED_OUT" > $OUT_DATAFILE

    echo "Missing $OUT_DATAFILE; creating"

    EXITFLAG=$UNKNOWN

fi



if [ $EXITFLAG != $OK ]; then

    exit $EXITFLAG

fi



function swap_rate() {

    local file=$1

    local current=$2

    local rate=0



    mv $file ${file}.previous

    while read ts swap_count; do

        if [[ $ts -lt $min_valid_ts ]]; then

            continue

        fi

        if [[ $ts -ge $NOW ]]; then

            # we can't use data from the same second

            continue

        fi

        # calculate the rate

        swap_delta=$(($current - $swap_count))

        ts_delta=$(($NOW - $ts))

        rate=`echo "$swap_delta / $ts_delta" | bc`

        echo -e "$ts\t$swap_count" >> $file

    done < ${file}.previous

    echo -e "$NOW\t$current" >> $file

    echo $rate

}



in_rate=`swap_rate $IN_DATAFILE $CURRENT_PAGES_SWAPPED_IN`

out_rate=`swap_rate $OUT_DATAFILE $CURRENT_PAGES_SWAPPED_OUT`



echo "swap in/out is $in_rate/$out_rate per second"

if [[ $in_rate -ge $CRITICAL_THRESHOLD ]] || [[ $out_rate -ge $CRITICAL_THRESHOLD ]]; then

    exit $CRITICAL

fi

if [[ $in_rate -ge $WARN_THRESHOLD ]] || [[ $out_rate -ge $WARN_THRESHOLD ]]; then

    exit $WARNING

fi

exit $OK
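Both scripts above share the same bookkeeping: drop samples older than `VALID_INTERVAL`, ignore samples from the current second, compute the rate against the newest remaining sample, then append the current reading. A minimal Python 3 sketch of that window logic follows; `prune_and_rate` is a hypothetical name, and it assumes the counter never decreases between samples.

```python
def prune_and_rate(samples, now, current, valid_interval=600):
    """Return (kept_samples, rate) for a list of (timestamp, counter) pairs.

    samples must be ordered oldest first, as in the scripts' data files.
    rate is None when no earlier sample is usable.
    """
    min_valid_ts = now - valid_interval
    # Keep samples inside the window, excluding anything from this second.
    kept = [(ts, count) for ts, count in samples if min_valid_ts <= ts < now]
    rate = None
    if kept:
        # The shell loop overwrites the rate on every pass, so the newest
        # valid sample decides the final value.
        ts, count = kept[-1]
        rate = (current - count) // (now - ts)  # integer rate, like bc
    kept.append((now, current))
    return kept, rate


if __name__ == "__main__":
    history = [(100, 5), (900, 50), (1000, 80)]
    print(prune_and_rate(history, now=1000, current=250))
```

Returning `None` instead of a default of 0 makes the "no usable history" case explicit, so a caller can report UNKNOWN rather than OK.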

#!/bin/bash

#

# *************************************************

# chkconfig: 2345 99 99

# description: notify email address on system boot.

# *************************************************

# Installing:

# 1) save as /etc/rc.d/init.d/notify

# 2) set the desired email address in "MAILADD" variable

# 3) chmod a+x /etc/rc.d/init.d/notify

# 4) /sbin/chkconfig --level 2345 notify on



PATH=/bin:/usr/sbin:/usr/bin

SERVER=`hostname`

case $1 in

    start)

        PUBLIC_IP=`curl --connect-timeout 5 -s icanhazip.com`

        PUBLIC_IPV6=`curl -6 --connect-timeout 5 -s icanhazip.com`

        MAILADD=your@email.example

        mail -s " Boot of $SERVER" $MAILADD <<EOF
From: $0

To: $MAILADD

$SERVER has booted up.

public ip $PUBLIC_IP $PUBLIC_IPV6

If this is news to you, please investigate.

`date -u`

EOF

    ;;

esac

exit 0

Whether ntpd is running.

The clock skew inside your data center.

The clock skew between your master time servers and external sources.
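One way to measure skew on a host running ntpd is to read the `offset` column (in milliseconds) that `ntpq -pn` prints for the currently selected peer, which is marked with a leading `*`. The Python 3 sketch below only parses text in that layout; the function names and the sample addresses are hypothetical, and the threshold is an arbitrary illustration.

```python
def selected_peer_offset_ms(ntpq_output):
    """Return the offset (ms) of the peer marked '*' in `ntpq -pn` output.

    Returns None when no peer is selected, which usually means the
    daemon is not synchronised.
    """
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            # Columns: remote refid st t when poll reach delay offset jitter
            fields = line.split()
            return float(fields[8])
    return None


def skew_exceeds(offset_ms, limit_ms):
    """True when the offset is unknown or larger than limit_ms in magnitude."""
    return offset_ms is None or abs(offset_ms) > limit_ms


if __name__ == "__main__":
    # Made-up sample in the layout printed by `ntpq -pn`.
    sample = (
        "     remote           refid      st t when poll reach   delay   offset  jitter\n"
        "==============================================================================\n"
        "+192.0.2.11      192.0.2.10       2 u   12   64  377    0.801    1.250   0.105\n"
        "*192.0.2.10      .GPS.            1 u   33   64  377    0.512   -0.034   0.021\n"
    )
    offset = selected_peer_offset_ms(sample)
    print(offset, skew_exceeds(offset, limit_ms=50.0))
```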

define command{

    command_name    check_ssl_expire

    command_line    $USER1$/check_http --ssl -C 14 -H $ARG1$

}

define service{

    host_name               virtual

    service_description     bitly_com_ssl_expiration

    use                     generic-service

    check_command           check_ssl_expire!bitly.com

    contact_groups          email_only

    normal_check_interval   720

    retry_check_interval    10

    notification_interval   720

}
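The `-C 14` argument in the command above asks `check_http` to alert when the certificate has fewer than 14 days of validity left. As a rough illustration of the same idea in Python 3, the sketch below converts a certificate's `notAfter` string into days remaining; the function names are hypothetical, and the mapping to a single WARNING state is a simplifying assumption.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a notAfter time such as 'Jan  5 09:34:43 2018 GMT'."""
    if now is None:
        now = time.time()
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0


def fetch_not_after(hostname, port=443, timeout=10):
    """Connect over TLS and return the peer certificate's notAfter string.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLError here instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def expiry_state(days_left, warn_days=14):
    """Return 1 (WARNING) when fewer than warn_days remain, else 0 (OK)."""
    return 1 if days_left < warn_days else 0
```

With network access, `expiry_state(days_until_expiry(fetch_not_after("bitly.com")))` would approximate the service definition above.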

Comparing and Choosing Monitoring Software

Ops: Rock posted this article, 2 comments, 4326 views, 2016-01-03 04:18

Feature comparison of Cacti, Nagios, and Zabbix

Other comparisons

MySQL Master-Slave Replication Monitoring Script

Ops: 空心菜 posted this article, 2 comments, 3187 views, 2015-11-24 00:02

Shell version

#!/bin/bash
# Auth: lucky.chen
# Check each slave and alert when a replication thread stops or lag is high.

hosts="192.168.3.9:3305
192.168.3.10:3306
"

for i in $hosts
do
    declare -i alert=0
    status=""    # reset per host so one host's problems don't leak into the next
    host=`echo $i | awk -F':' '{print $1}'`
    port=`echo $i | awk -F':' '{print $2}'`

    # Query once so all three fields come from the same snapshot
    slave_status=`mysql -uwrite -P$port -p'write@jkb' -h${host} -e "show slave status\G"`
    IO=`echo "$slave_status" | grep Slave_IO_Running: | awk '{print $NF}'`
    SQL=`echo "$slave_status" | grep Slave_SQL_Running: | awk '{print $NF}'`
    # NULL (replication stopped) evaluates to 0 here; the IO/SQL checks cover that case
    declare -i BEHIN=`echo "$slave_status" | grep Seconds_Behind_Master | awk '{print $NF}'`

    # Quote the variables so an empty value (e.g. connection failure) still compares cleanly
    if [ "$IO" != "Yes" ]; then
        status="${status} \n IO is $IO"
        alert=1
    fi

    if [ "$SQL" != "Yes" ]; then
        status="${status} \n SQL is $SQL"
        alert=1
    fi

    if [[ $BEHIN -gt 100 ]]; then
        status="${status} \n behind master $BEHIN second"
        alert=1
    fi

    if [[ $alert -eq 1 ]]; then
        echo -e "$host : $status"
        php /usr/local/bin/sendmail/tongbu.php "$host $status" "$status"
    fi
done

Simple Python version

#!/usr/bin/env python

# -*- coding: utf8 -*-

import MySQLdb

from MySQLdb import cursors

import threading



slaveList = [

             'ip list'

             ]

def getSlaveTime(host):

    try:

        username = 'username'

        passwd = 'password'

        conn = MySQLdb.connect(user = username, passwd = passwd,  host = host, connect_timeout = 5, cursorclass = cursors.DictCursor)

        cur = conn.cursor()

        cur.execute('''show slave status''')

        fallSec = cur.fetchone()['Seconds_Behind_Master']

        cur.close()

        conn.close()

        print(host + ' is behind by ' + str(fallSec))

    except Exception:

        # connection or query failed: report a sentinel lag value
        print(host + ' is behind by ' + str(10000000))

    

for host in slaveList:

    s = threading.Thread(target = getSlaveTime,args = (host,))

    s.start()
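The Python version above reports only `Seconds_Behind_Master`, while the shell version also checks the two replication threads. A small Python 3 sketch that applies all three checks to one row of `SHOW SLAVE STATUS` might look like this; `replication_problems` is a hypothetical helper, the row is assumed to be a dict as returned by a `DictCursor`, and the 100-second limit mirrors the shell script.

```python
def replication_problems(row, max_lag=100):
    """Return a list of problem descriptions for one SHOW SLAVE STATUS row.

    row is a dict keyed by column name, or None when the server returned
    no row (it is not configured as a slave).
    """
    if row is None:
        return ["no slave status returned"]
    problems = []
    if row.get("Slave_IO_Running") != "Yes":
        problems.append("IO is %s" % row.get("Slave_IO_Running"))
    if row.get("Slave_SQL_Running") != "Yes":
        problems.append("SQL is %s" % row.get("Slave_SQL_Running"))
    lag = row.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL when replication is not running.
        problems.append("lag unknown (Seconds_Behind_Master is NULL)")
    elif int(lag) > max_lag:
        problems.append("behind master %d seconds" % int(lag))
    return problems


if __name__ == "__main__":
    healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
               "Seconds_Behind_Master": 3}
    print(replication_problems(healthy))
```

An empty list means the slave looks healthy; any entries can be joined into the alert text.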
