Discussion:
[jira] [Created] (HBASE-21464) Splitting blocked when meta relocates during split transaction
Andrew Purtell (JIRA)
2018-11-09 18:34:00 UTC
Andrew Purtell created HBASE-21464:
--------------------------------------

Summary: Splitting blocked when meta relocates during split transaction
Key: HBASE-21464
URL: https://issues.apache.org/jira/browse/HBASE-21464
Project: HBase
Issue Type: Bug
Affects Versions: 1.4.8, 1.5.0
Reporter: Andrew Purtell
Fix For: 1.5.0, 1.4.9


ITBLL tests with an internal fork of 1.4.7 looked fine, but the same tests with an internal fork of 1.4.8 showed an alarming performance problem and eventual test failure. I can reproduce this with the upstream 1.4.8 release. I haven't tried 1.4.7 yet and will need to as a sanity check, but let's assume for now that a bad bug was introduced somewhere between 1.4.7 and 1.4.8.

Splitting is blocked when meta relocates during split transaction because the splitting thread does not try to relocate meta.
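
To make the failure mode concrete, here is a minimal, self-contained sketch of a retry loop that either keeps using a stale meta location or looks it up again after an NSRE. It is a toy model, not HBase code: the class, the `Cluster` stand-in, the `NotServingRegion` exception, and the `updateMeta` helper are all hypothetical names invented for illustration.

```java
public class MetaRelocationSketch {
    /** Hypothetical stand-in for an NSRE; this is not the HBase exception class. */
    static class NotServingRegion extends Exception {
        NotServingRegion(String msg) { super(msg); }
    }

    /** Toy cluster that only tracks which server currently hosts meta. */
    static class Cluster {
        private volatile String metaHost;

        Cluster(String metaHost) { this.metaHost = metaHost; }

        void moveMeta(String newHost) { metaHost = newHost; }

        /** Authoritative lookup of the current meta location. */
        String locateMeta() { return metaHost; }

        /** Fails unless the request is sent to the server that hosts meta now. */
        void putToMeta(String host, String row) throws NotServingRegion {
            if (!host.equals(metaHost)) {
                throw new NotServingRegion("hbase:meta is not online on " + host);
            }
        }
    }

    /** Returns the number of attempts used, or -1 if all retries were exhausted. */
    static int updateMeta(Cluster cluster, String cachedHost, String row,
                          int maxTries, boolean relocateOnNsre) {
        String target = cachedHost;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                cluster.putToMeta(target, row);
                return attempt;
            } catch (NotServingRegion e) {
                if (relocateOnNsre) {
                    // Drop the stale location and look meta up again.
                    target = cluster.locateMeta();
                }
                // Otherwise the next attempt goes to the same stale server.
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster("rs-a");
        String cached = cluster.locateMeta();
        cluster.moveMeta("rs-b"); // meta relocates in the middle of the split

        System.out.println("without relocation: "
            + updateMeta(cluster, cached, "row", 5, false));
        System.out.println("with relocation: "
            + updateMeta(cluster, cached, "row", 5, true));
    }
}
```

In this toy model the caller that never relocates exhausts its retries (returns -1), while the caller that relocates succeeds on its second attempt. The real split worker's retry and cache logic is more involved; the sketch only illustrates why retrying against a stale location cannot make progress.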

The split worker is trying to update meta but doesn't relocate it even after NSRE:
{noformat}
2018-11-09 17:50:45,277 INFO  [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434] client.RpcRetryingCaller: Call exception, tries=13, retries=350, started=88590 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
     at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 'hbase:meta' at region=hbase:meta,1.1588230740, hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, seqNum=0{noformat}
Clients, in this case YCSB, are hung with part of the keyspace missing:
{noformat}
2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying after sleep of 20158 because: No server address listed in hbase:meta for region test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. containing row user3301635648728421323{noformat}
Additional confirmation of the problem on the master: the balancer is blocked indefinitely because the split transaction is stuck:
{noformat}
2018-11-09 17:49:55,478 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: Not running balancer because 3 region(s) in transition: [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute....{noformat}
Unfortunately I don't have a lot of time to debug this before heading out for the weekend. Will pick it up on Monday. I saved all of the cluster logs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
Andrew Purtell (JIRA)
2018-11-30 22:48:00 UTC
[ https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell reopened HBASE-21464:
------------------------------------

I reverted my commit because I was able to reproduce the problem again. 

The part of the change that we removed, which modified the cache lookup for the meta region, turns out to have been the necessary part. There is something wrong with the cache lookup. I don't see a change between 1.4.2 and 1.4.3 that is any kind of smoking gun, so I am a bit stumped here.

Splitting blocked with meta NSRE during split transaction
---------------------------------------------------------
Key: HBASE-21464
URL: https://issues.apache.org/jira/browse/HBASE-21464
Project: HBase
Issue Type: Bug
Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
Reporter: Andrew Purtell
Assignee: Andrew Purtell
Priority: Blocker
Fix For: 1.5.0, 1.4.9
Attachments: HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch