Sergey Shelukhin (JIRA)
2018-12-11 00:06:00 UTC
Sergey Shelukhin created HBASE-21576:
Summary: master should proactively reassign meta when killing a RS with it
Key: HBASE-21576
Project: HBase
Issue Type: Bug
Reporter: Sergey Shelukhin
Master killed an RS that was hosting meta because of some internal error (I still need to see whether it's a separate bug or just a machine/HDFS issue; I've lost the RS logs due to HBASE-21575).
The RS took a very long time to die (again, this might be a separate bug; I'll file one if I see a repro) and a long time to restart. Meanwhile, master never tried to reassign meta, and it eventually killed itself because it could not update meta.
An RS on a bad machine seems especially prone to slow abort/startup, as well as to the issues that cause master to kill it, so it would make sense for master to relocate meta immediately once the meta-hosting RS is dead, or even as soon as it kills the RS. In the former case (if the RS needs to die for meta to be reassigned safely), perhaps the RS hosting meta in particular should try to die fast in such circumstances and skip any cleanup.
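To make the two options concrete, here is a toy model of the decision, not a patch. Every name in it (MetaReassignSketch, Trigger, killRequested, confirmedDead) is hypothetical and none of them are HBase classes or methods. It only shows where the reassignment would be triggered under each option; whether reassigning before the RS is confirmed dead is actually safe is the open question above and is not addressed here.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names are HBase classes or methods. */
public class MetaReassignSketch {
    /** When the (hypothetical) master should move hbase:meta off a doomed server. */
    public enum Trigger { ON_KILL_REQUEST, ON_CONFIRMED_DEAD }

    private final Trigger trigger;
    private String metaHost; // server currently hosting hbase:meta, or null
    public final List<String> actions = new ArrayList<>();

    public MetaReassignSketch(Trigger trigger, String metaHost) {
        this.trigger = trigger;
        this.metaHost = metaHost;
    }

    /** Master has decided to kill the server; the server may take minutes to die. */
    public void killRequested(String server) {
        if (trigger == Trigger.ON_KILL_REQUEST) {
            reassignMetaIfHostedOn(server);
        }
    }

    /** Master has confirmed that the server is gone. */
    public void confirmedDead(String server) {
        reassignMetaIfHostedOn(server);
    }

    private void reassignMetaIfHostedOn(String server) {
        if (server.equals(metaHost)) {
            actions.add("reassign hbase:meta from " + server);
            metaHost = null; // placeholder: a real master would pick a new host
        }
    }

    public static void main(String[] args) {
        MetaReassignSketch eager =
            new MetaReassignSketch(Trigger.ON_KILL_REQUEST, "rs1");
        eager.killRequested("rs1");
        System.out.println(eager.actions);
    }
}
```

With Trigger.ON_CONFIRMED_DEAD the same kill request records nothing until confirmedDead is called, which is why a fast abort on the meta-hosting RS matters in that variant.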
2018-12-08 04:52:55,144 WARN [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal error:
***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL required. Forcing server shutdown *****
.... [aborting for ~7 minutes]
2018-12-08 04:53:44,190 INFO [PEWorker-7] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server <server1>,17020,1544264858183 aborting, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [starting for ~5 minutes]
2018-12-08 04:59:58,574 INFO [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on connection exception: connection timed out: <server1>, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [re-initializing for at least ~7 minutes]
2018-12-08 05:04:17,271 INFO [hconnection-0x4d58bcd4-shared-pool3-t1877] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41137 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server <server1>,17020,1544274145387 is not running yet
2018-12-08 05:11:18,470 ERROR [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... state=OPEN *****
There are no signs of meta assignment activity at all in the master logs.
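For reference, the window between the RS reporting its fatal error and the master aborting can be computed from the first and last timestamps in the excerpt above. This is a small sketch; the class and method names (MetaOutageWindow, between) are illustrative, and it assumes both log lines were stamped by the same clock.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class MetaOutageWindow {
    // Timestamp layout used in the log excerpt, e.g. "2018-12-08 04:52:55,144".
    private static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Elapsed time between two log timestamps in the layout above. */
    public static Duration between(String start, String end) {
        return Duration.between(
            LocalDateTime.parse(start, LOG_TS),
            LocalDateTime.parse(end, LOG_TS));
    }

    public static void main(String[] args) {
        // First line: RS reports the fatal error. Last line: master aborts.
        Duration d = between("2018-12-08 04:52:55,144", "2018-12-08 05:11:18,470");
        System.out.println(d.toMinutes() + " min " + (d.getSeconds() % 60) + " s");
        // prints "18 min 23 s"
    }
}
```

So meta stayed on the dying/restarting RS for roughly 18 minutes before the master gave up.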
This message was sent by Atlassian JIRA