Symptoms
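1. The access log and the error log are almost 1:1 in size.
2. Every entry in the error log is "(104: Connection reset by peer) while reading upstream".
3. The access log does not show a matching spike in HTTP error status codes (only 8 responses with status 500).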
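A claim like point 2 can be checked by counting rather than by eyeballing the tail of the log. The following is a minimal sketch, assuming the error log uses nginx's usual "[error]" level tag as in the sample line in this post; the function name reset_share is made up for illustration.

```shell
# Report how many [error] lines in an nginx error log are upstream resets.
# $1: path to the error log
reset_share() {
  log="$1"
  # All lines logged at error level.
  total=$(grep -c '\[error\]' "$log")
  # Lines carrying the exact message seen in this incident (fixed-string match).
  resets=$(grep -cF '(104: Connection reset by peer) while reading upstream' "$log")
  echo "$resets of $total error lines are upstream connection resets"
}

# Usage (file name taken from this post):
#   reset_share ttt.minminmsn.com_error.log
```

If the two numbers are nearly equal, the error log is dominated by this single failure mode, which is what the symptoms above describe.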
[root@VM_0_22_centos logs]# ls -lh
total 389M
-rw-r--r-- 1 work work 191M Oct 30 17:30 ttt.minminmsn.com_access.log
-rw-r--r-- 1 work work 199M Oct 30 17:30 ttt.minminmsn.com_error.log
[root@VM_0_22_centos logs]# tail -n 1 ttt.minminmsn.com_error.log
2020/10/30 17:30:27 [error] 14063#0: *807476828 readv() failed (104: Connection reset by peer) while reading upstream, client: 117.61.242.104, server: ttt.minminmsn.com, request: "POST /yycp-launcherSnapshot/launcherSnapshot/querySnapshotSync HTTP/1.1", upstream: "http://192.168.88.31:8081/ttt", host: "ttt.minminmsn.com"
[root@VM_0_22_centos logs]# cat ttt.minminmsn.com_access.log |awk '{print $9}'|sort |uniq -dc
1081274 200
      6 304
    125 400
  27482 404
    145 429
    106 499
      8 500

Problem analysis
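1. After contacting the service owner about the business scenario, we learned that almost all client requests are POST. At first I suspected that large file uploads were timing out, but the developers confirmed that POST is used for security reasons and no large files are uploaded.
2. The error says the upstream reset the connection, that is, the backend closed the connection on its side. The connection table also showed many sockets in TIME-WAIT: at high QPS, with short-lived connections that are processed quickly, sockets still in the closing phase pile up.
3. As a reverse proxy, nginx is both a server (to the clients) and a client (to the backend). By default it does not use keep-alive connections to the backend: it speaks HTTP/1.0 upstream and sends "Connection: close". Enabling keep-alive should improve performance considerably.
4. Upstream keep-alive takes effect only when the keepalive directive in the upstream block is combined with proxy_http_version 1.1 and an empty Connection header. The official description of keepalive quoted below is terse, but one of the linked references explains it clearly: the directive activates a cache of idle connections, so it belongs to keep-alive performance tuning.
5. The keepalive value should be related to QPS and does not need to be large to start with. If 5XX errors show up in the access log, tune it according to the actual load to get the best result.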
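To check point 2 more precisely, the TCP states can be counted only for connections whose peer is the backend port. This is a minimal sketch, assuming the column layout of "ss -tan" (state in column 1, peer address:port in column 5) and the backend port 8081 from the configuration in this post; the function name count_states_for_port is made up for illustration.

```shell
# Count TCP states for connections whose peer port matches $1.
# Reads "ss -tan" output on stdin; the first line is the header and is skipped.
count_states_for_port() {
  port="$1"
  awk -v port="$port" '
    NR > 1 {
      n = split($5, parts, ":")          # peer "addr:port"; the port is the last piece
      if (parts[n] == port) count[$1]++  # $1 is the TCP state
    }
    END { for (s in count) print count[s], s }
  ' | sort -rn
}

# Usage on the nginx host:
#   ss -tan | count_states_for_port 8081
```

A large TIME-WAIT count here, rather than in the overall socket table, points at the proxy-to-backend connections specifically instead of the client-facing side.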
Below is the official description of the keepalive directive:
Syntax: keepalive connections;
Default: —
Context: upstream
This directive appeared in version 1.1.4.
Activates the cache for connections to upstream servers.
The connections parameter sets the maximum number of idle keepalive connections to upstream servers that are preserved in the cache of each worker process. When this number is exceeded, the least recently used connections are closed.
It should be particularly noted that the keepalive directive does not limit the total number of connections to upstream servers that an nginx worker process can open. The connections parameter should be set to a number small enough to let upstream servers process new incoming connections as well.
When using load balancing methods other than the default round-robin method, it is necessary to activate them before the keepalive directive.
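Because the cache is kept per worker process, the number of idle connections nginx as a whole may hold open to the backends is bounded by the worker count multiplied by the keepalive value. A minimal sketch of that arithmetic follows; the worker count of 4 is an assumption (it is not stated in this post), and the helper name is made up for illustration.

```shell
# Upper bound on idle upstream connections cached by all nginx workers together.
# $1: number of worker processes (assumed value, not given in this post)
# $2: value of the keepalive directive in the upstream block
max_idle_upstream_conns() {
  echo $(( $1 * $2 ))
}

max_idle_upstream_conns 4 100   # hypothetical 4 workers with "keepalive 100"; prints 400
```

This is only a ceiling on idle cached connections. As the quoted text says, the directive does not limit how many connections a worker may open in total.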
Solution
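1. Modify the nginx configuration to enable keep-alive connections to the backend together with the connection cache.
2. Restart the nginx service.
The main configuration is as follows: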
upstream gateway {
    server 192.168.88.31:8081;
    server 192.168.88.44:8081;
    server 192.168.88.115:8081;
    server 192.168.88.80:8081;
    # newly added setting below
    keepalive 100;
}
location / {
    proxy_pass http://gateway;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    # newly added settings below
    proxy_connect_timeout 120;
    proxy_send_timeout 300;
    proxy_read_timeout 300;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
}

Verifying the result
1. Check the error log. After it was cleared, it has not grown:
[root@VM_0_22_centos logs]# ls -lh
total 389M
-rw-r--r-- 1 work work 389M Oct 30 18:50 ttt.minminmsn.com_access.log
-rw-r--r-- 1 work work  446 Oct 30 18:10 ttt.minminmsn.com_error.log

2. Check the connection states. Before enabling keep-alive, TIME-WAIT dominates:
[root@VM_0_22_centos logs]# ss -an |awk '{print $2}'|sort |uniq -dc |sort -rn
   5045 TIME-WAIT
    156 ESTAB
     62 UNCONN
     21 LISTEN

After enabling keep-alive, ESTAB dominates:
[root@VM_0_22_centos ~]# ss -an |awk '{print $2}'|sort |uniq -dc |sort -rn
    511 ESTAB
     62 UNCONN
     52 TIME-WAIT
     21 LISTEN

References
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive
https://www.cnblogs.com/sunsky303/p/10648861.html
http://blog.51yip.com/apachenginx/2203.html