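Hadoop HA: ZKFC node exits abnormally

In one of our company's analytics systems, the Hadoop ZKFC process has recently been exiting abnormally on a regular basis. The relevant log output is: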

2020-08-26 14:30:14,455 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.100.232.31:33989, server: hadoop-dn1/10.100.232.35:2181
2020-08-26 14:30:14,457 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop-dn1/10.100.232.35:2181, sessionid = 0x46dc45ca67201f1, negotiated timeout = 5000
2020-08-26 14:30:14,459 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2020-08-26 14:30:17,792 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 3335ms for sessionid 0x46dc45ca67201f1, closing socket connection and attempting reconnect

2020-08-26 14:30:17,899 FATAL org.apache.hadoop.ha.ActiveStandbyElector: Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.

2020-08-26 14:30:18,179 INFO org.apache.zookeeper.ZooKeeper: Session: 0x46dc45ca67201f1 closed

2020-08-26 14:30:18,180 FATAL org.apache.hadoop.ha.ZKFailoverController: Fatal error occurred:Received stat error from Zookeeper. code:CONNECTIONLOSS. Not retrying further znode monitoring connection errors.

2020-08-26 14:30:18,180 INFO org.apache.hadoop.ipc.Server: Stopping server on 8019
2020-08-26 14:30:18,180 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x46dc45ca67201f1
2020-08-26 14:30:18,181 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x46dc45ca67201f1
2020-08-26 14:30:18,181 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x46dc45ca67201f1
2020-08-26 14:30:18,181 INFO org.apache.hadoop.ha.ActiveStandbyElector: Yielding from election
2020-08-26 14:30:18,181 INFO org.apache.hadoop.ha.HealthMonitor: Stopping HealthMonitor thread
2020-08-26 14:30:18,181 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server listener on 8019
2020-08-26 14:30:18,181 INFO org.apache.hadoop.ipc.Server: Stopping IPC Server Responder
2020-08-26 14:30:18,181 WARN org.apache.hadoop.ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x46dc45ca67201f1
2020-08-26 14:30:18,181 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

How can this be tuned or fixed? Any suggestions or pointers from people who have seen this would be appreciated.

空心菜 - Facing the sun, growing strong

1. First, test the network quality with tools such as telnet and ping, and check whether the link between the host running ZKFC and the ZooKeeper servers is healthy.
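
A rough way to script the check in step 1 is to time repeated TCP connects from the ZKFC host to the ZooKeeper client port. This is only a minimal sketch: the function names `measure_connect_ms` and `probe` are made up for illustration, and a clean result here does not rule out short stalls between samples.

```python
import socket
import time


def measure_connect_ms(host, port, timeout=2.0):
    """Return the TCP connect time to host:port in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return None


def probe(host, port, attempts=5, pause=0.2):
    """Connect several times; each entry is a latency in ms, or None if that attempt failed."""
    samples = []
    for _ in range(attempts):
        samples.append(measure_connect_ms(host, port))
        time.sleep(pause)
    return samples
```

For example, `probe("hadoop-dn1", 2181)` run on the ZKFC host (host name and port taken from the log above) returns a list of connect times; failed attempts or values approaching a second or more would point at the network rather than at ZooKeeper's settings.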
 
2. Tune the ZooKeeper configuration to give the timeouts more headroom:

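# The basic time unit for every time-related setting in ZooKeeper, in milliseconds (like the second on a clock); 2000 ms in the sample zoo.cfg.
tickTime=4000
# On startup a Follower syncs the latest data from the Leader before it starts serving; the Leader allows it initLimit ticks to finish.
initLimit=30
# At runtime the Leader exchanges heartbeats with every server in the ensemble to check liveness; a Follower that falls more than syncLimit ticks behind is dropped.
syncLimit=15
# With 'no', ZooKeeper does not force the transaction log to disk on each update. This lowers latency, but a crash or restart may lose some data.
forceSync=no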

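"Tick" refers to the tick of a clock, and tickTime is the length of one tick. It is the basic unit for every time-related setting in ZooKeeper, in milliseconds, much like the second on a clock. For example, with tickTime=4000 and initLimit=30, initLimit spans 30 ticks, i.e. 4000 × 30 = 120 seconds; likewise syncLimit=15 spans 4000 × 15 = 60 seconds. tickTime also implies the heartbeat interval (one heartbeat per tickTime) and bounds the session timeout negotiated between client and server (by default at least 2 × tickTime and at most 20 × tickTime).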
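
The arithmetic above can be sketched in a few lines of Python. The helper name `zk_timeouts` is hypothetical, and the requested session timeout of 5000 ms is an assumption inferred from the "negotiated timeout = 5000" line in the log. The last field uses the fact that the ZooKeeper client gives up on a server after two thirds of the negotiated timeout without hearing from it, which is consistent with the "have not heard from server in 3335ms" message.

```python
def zk_timeouts(tick_time_ms, init_limit, sync_limit, requested_session_ms):
    """Derive the effective ZooKeeper time windows from zoo.cfg-style values."""
    # Default session bounds when minSessionTimeout/maxSessionTimeout are not set.
    min_session = 2 * tick_time_ms
    max_session = 20 * tick_time_ms
    # The server clamps the client's requested timeout into [min, max].
    negotiated = max(min_session, min(requested_session_ms, max_session))
    return {
        "init_window_ms": init_limit * tick_time_ms,
        "sync_window_ms": sync_limit * tick_time_ms,
        "min_session_ms": min_session,
        "max_session_ms": max_session,
        "negotiated_session_ms": negotiated,
        # The client treats 2/3 of the negotiated timeout as its read timeout.
        "client_read_timeout_ms": negotiated * 2 // 3,
    }


# Assumed previous setup: tickTime=2000, client asks for 5000 ms.
before = zk_timeouts(2000, 10, 5, 5000)
# Suggested setup from this answer: tickTime=4000, initLimit=30, syncLimit=15.
after = zk_timeouts(4000, 30, 15, 5000)
print(before["client_read_timeout_ms"], after["client_read_timeout_ms"])
```

Under these assumptions, raising tickTime to 4000 lifts the minimum session timeout to 8000 ms, so a client that still asks for 5000 ms would be granted 8000 ms and tolerate a longer silence before disconnecting.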

3. In the Hadoop configuration files, the following properties are recommended.
hdfs-site.xml:

<property>
<name>dfs.qjournal.start-segment.timeout.ms</name>
<value>90000</value>
</property>
<property>
<name>dfs.qjournal.select-input-streams.timeout.ms</name>
<value>90000</value>
</property>
<property>
<name>dfs.qjournal.write-txns.timeout.ms</name>
<value>90000</value>
</property>

core-site.xml:

<property>
<name>ipc.client.connect.timeout</name>
<value>90000</value>
</property>
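
To confirm that a deployed file really carries these values, something like the following could be used. It is a sketch only: `read_properties` and `below_recommended` are hypothetical helpers, and it assumes the usual Hadoop layout in which the `<property>` elements sit inside a single `<configuration>` root.

```python
import xml.etree.ElementTree as ET

# Values suggested in this answer for hdfs-site.xml.
RECOMMENDED = {
    "dfs.qjournal.start-segment.timeout.ms": 90000,
    "dfs.qjournal.select-input-streams.timeout.ms": 90000,
    "dfs.qjournal.write-txns.timeout.ms": 90000,
}


def read_properties(xml_text):
    """Parse a Hadoop-style configuration document into a name -> value dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None and value is not None:
            props[name.strip()] = value.strip()
    return props


def below_recommended(props, recommended):
    """Return the names that are missing or set lower than the recommended value."""
    result = []
    for name, minimum in recommended.items():
        if name not in props or int(props[name]) < minimum:
            result.append(name)
    return result
```

Reading the file with `open(path).read()` and passing the text through both functions yields the list of properties that still need attention; the same check works for `ipc.client.connect.timeout` in core-site.xml with a different dictionary.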

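If the problem persists after these changes, look in the following directions:

  1. Server performance: CPU, memory, disk I/O.
  2. Work with the developers to see whether the application's jobs can be spread across different time windows, and so on.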
