前言: 公司目前还是传统方式的微服务集群,使用的是SpringCloud架构,最近晚上经常出现服务注册失败,consul上的服务会下线,导致整个微服务应用不可以使用,出现Down机状态,
前言:
公司目前还是传统方式的微服务集群,使用的是SpringCloud架构,最近晚上经常出现服务注册失败,consul上的服务会下线,导致整个微服务应用不可以使用,出现Down机状态,这个时候到集群节点上通过 ps -ef|grep xxx.jar 发现服务进程还在,然后继续查看相应日志,找到了原因,由于微服务会每隔一段时间去自检mysql,mq的连通性,如果出现一个超时(公司使用aws,难免会出现超时),那么服务就任务自己是异常状态,从而自动退出,鉴于这种情况,博主自己设计了一套服务治理方案,可以在服务出现异常退出(异常是指由于网络延时,连接超时等)时,去优雅的重启服务,减少了人工干预。
环境介绍:
主机名 角色 操作系统 opsServer 跳板机+ansible+python ubuntu18.0 cloudServer1p 微服务+consul ubuntu18.0 cloudServer2p 微服务+consul ubuntu18.0 cloudServer3p 微服务+consul ubuntu18.0
操作步骤:
1. 在 opsServer上操作:
安装ansible:
apt-get update #更新软件源中的所有软件列表 apt-get install ansible #安装ansible配置ansible:
vim /etc/ansible/ansible.cfg [defaults] inventory = /etc/ansible/hosts forks = 5 remote_port = 22 host_key_checking = False timeout = 10 log_path = /var/log/ansible.log [inventory] [privilege_escalation] [paramiko_connection] [ssh_connection] [persistent_connection] [accelerate] [selinux] [colors] [diff] vim /etc/ansible/hosts [webservers] cloudServer1p ansible_ssh_user=cloud ansible_ssh_key=/home/cloud/.ssh/id_rsa cloudServer2p ansible_ssh_user=cloud ansible_ssh_key=/home/cloud/.ssh/id_rsa cloudServer3p ansible_ssh_user=cloud ansible_ssh_key=/home/cloud/.ssh/id_rsa 解释: 确保opsServer与cloudServer1p ,cloudServer2p ,cloudServer3p的ssh 打通,这里是用密钥进行认证配置crontab任务:
crontab -l */5 * * * * /usr/bin/python /home/cloud/ops/manageCloudService1p.py `curl -s http://cloudServer1p:8530/v1/agent/services` >/home/cloud/ops/cloudServer1p.log */5 * * * * /usr/bin/python /home/cloud/ops/manageCloudService2p.py `curl -s http://cloudServer2p:8530/v1/agent/services` >/home/cloud/ops/cloudServer2p.log */5 * * * * /usr/bin/python /home/cloud/ops/manageCloudService3p.py `curl -s http://cloudServer3p:8530/v1/agent/services` >/home/cloud/ops/cloudServer3p.log注释:
curl -s http://cloudServer1p:8530/v1/agent/services 通过调用consul接口获取consul注册的微服务然后作为参数传递给脚本去处理。
manageCloudService1p.py内容:
manageCloudService2p.py内容:
# python json_format.py json_text import os import sys import json import smtplib from email.mime.text import MIMEText length = len(sys.argv) def sendMail(name): mail_host = 'mail.163.com' mail_user = 'alert@163.com' mail_pass = 'test123' sender = 'alert@163.com' receivers = ['alert@163.com'] message = MIMEText(name+' server check failure! i will start this server.','plain','utf-8') message['Subject'] = 'cloud server check' message['From'] = sender message['To'] = receivers[0] try: smtpObj = smtplib.SMTP() smtpObj.connect(mail_host,25) smtpObj.login(mail_user,mail_pass) smtpObj.sendmail( sender,receivers,message.as_string()) smtpObj.quit() print('success') except smtplib.SMTPException as e: print('error',e) if length > 1: try: jsonstr = sys.argv[1] jsonObj = json.loads(jsonstr) serverName=["gateway-service-9999","order-service-8020","portal-service-8080","product-service-8010"] realServerName=[] for key in jsonObj.keys(): realServerName.append(key) for name in serverName: if name not in realServerName: print(name+"is down") if name == 'gateway-service-9999': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer2p -e serverName=restartGatewayService") elif name == 'order-service-8020': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer2p -e serverName=restartOrderService") elif name == 'portal-service-8080': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer2p -e serverName=restartPortalService") elif name == 'product-service-8010': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer2p -e serverName=restartProductService") else: print(name+"not exist!") else: print(name+"is up") except Exception: print("json parse error.") else : print("argv's length is 1, no json text input.")manageCloudService3p.py内容:
# python json_format.py json_text import os import sys import json import smtplib from email.mime.text import MIMEText length = len(sys.argv) def sendMail(name): mail_host = 'mail.163.com' mail_user = 'alert@163.com' mail_pass = 'test123' sender = 'alert@163.com' receivers = ['alert@163.com'] message = MIMEText(name+' server check failure! i will start this server.','plain','utf-8') message['Subject'] = 'cloud server check' message['From'] = sender message['To'] = receivers[0] try: smtpObj = smtplib.SMTP() smtpObj.connect(mail_host,25) smtpObj.login(mail_user,mail_pass) smtpObj.sendmail( sender,receivers,message.as_string()) smtpObj.quit() print('success') except smtplib.SMTPException as e: print('error',e) if length > 1: try: jsonstr = sys.argv[1] jsonObj = json.loads(jsonstr) serverName=["gateway-service-9999","order-service-8020","portal-service-8080","product-service-8010"] realServerName=[] for key in jsonObj.keys(): realServerName.append(key) for name in serverName: if name not in realServerName: print(name+"is down") if name == 'gateway-service-9999': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer3p -e serverName=restartGatewayService") elif name == 'order-service-8020': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer3p -e serverName=restartOrderService") elif name == 'portal-service-8080': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer3p -e serverName=restartPortalService") elif name == 'product-service-8010': sendMail(name) os.system("ansible-playbook /etc/ansible/restartService.yml -e serverList=cloudServer3p -e serverName=restartProductService") else: print(name+"not exist!") else: print(name+"is up") except Exception: print("json parse error.") else : print("argv's length is 1, no json text input.")restartService.yml内容如下:
--- - hosts: '{{ serverList }}' tasks: - name: restart server command: "/bin/bash /home/cloud/startService.sh {{serverName}}" register: result - name: show debug info debug: var=result.stdout verbosity=02. 在微服务节点上操作:
将 startService.sh 这个脚本分别拷贝到三个微服务节点的 /home/cloud/ 目录下面:
startService.sh 内容如下:
#!/bin/bash restartGatewayService() { srevice_name="gateway-service" service_pid=`ps -ef|grep ${srevice_name} |grep -v task|grep prod |grep -v grep |awk '{print $2}'` if [ ! -n "$service_pid" ];then echo "${srevice_name} is already stopped,i will start ${srevice_name} now!" /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=9999 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/gateway-service.jar >> /home/cloud/services/gateway-service.log & sleep 10 else echo "${srevice_name} is running,i will kill -15 ${srevice_name} and start it now!" kill -15 $service_pid sleep 10 /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=9999 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/gateway-service.jar >> /home/cloud/services/gateway-service.log & fi } restartOrderService() { srevice_name="order-service" service_pid=`ps -ef|grep ${srevice_name} |grep -v task|grep prod |grep -v grep |awk '{print $2}'` if [ ! -n "$service_pid" ];then echo "${srevice_name} is already stopped,i will start ${srevice_name} now!" /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=8020 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/order-service.jar >> /home/cloud/services/order-service.log & sleep 10 else echo "${srevice_name} is running,i will kill -15 ${srevice_name} and start it now!" kill -15 $service_pid sleep 10 /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=8020 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/order-service.jar >> /home/cloud/services/order-service.log & fi } restartPortalService() { srevice_name="portal-service" service_pid=`ps -ef|grep ${srevice_name} |grep -v task|grep prod |grep -v grep |awk '{print $2}'` if [ ! -n "$service_pid" ];then echo "${srevice_name} is already stopped,i will start ${srevice_name} now!" /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=8080 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/portal-service.jar >> /home/cloud/services/portal-service.log & sleep 10 else echo "${srevice_name} is running,i will kill -15 ${srevice_name} and start it now!" kill -15 $service_pid sleep 10 /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=8080 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/portal-service.jar >> /home/cloud/services/portal-service.log & fi } restartProductService() { srevice_name="product-service" service_pid=`ps -ef|grep ${srevice_name} |grep -v task|grep prod |grep -v grep |awk '{print $2}'` if [ ! -n "$service_pid" ];then echo "${srevice_name} is already stopped,i will start ${srevice_name} now!" /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=8010 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/product-service.jar >> /home/cloud/services/product-service.log & sleep 10 else echo "${srevice_name} is running,i will kill -15 ${srevice_name} and start it now!" kill -15 $service_pid sleep 10 /usr/bin/nohup java -jar -Xms512m -Xmx512m -Dserver.port=8010 -Dspring.profiles.active=prod -Dlogging.level.root=info /home/cloud/services/product-service.jar >> /home/cloud/services/product-service.log & fi } restartAllService() { restartGatewayService restartOrderService restartPortalService restartProductService } serviceStatus() { echo "service status as:" jps -l } case $1 in "all") echo "restart all service!" restartAllService serviceStatus ;; "restartGatewayService") echo "restart gateway service!" restartGatewayService serviceStatus ;; "restartOrderService") echo "restart order service!" restartOrderService serviceStatus ;; "restartPortalService") echo "restart portal service!" restartPortalService serviceStatus ;; "restartProductService") echo "restart product service!" restartProductService serviceStatus ;; *) echo "input parameter error! USAGE: bash startService.sh all|one" ;; esac此方案已经在生产环境开始运行,运行效果测试有效,如有问题请及时联系博主。