Discussion: commit semantics
Joydeep Sarma
2010-01-11 23:46:25 UTC
Hey HBase-devs,

we have been going through hbase code to come up to speed.

One of the questions was regarding the commit semantics. Thumbing through
the RegionServer code that's appending to the wal:

syncWal -> HLog.sync -> addToSyncQueue -> syncDone.await()

and the log writer thread calls:

hflush(), syncDone.signalAll()

however hflush doesn't necessarily call a sync on the underlying log file:

if (this.forceSync ||
this.unflushedEntries.get() >= this.flushlogentries) { ... sync()
... }

so it seems that if forceSync is not true, syncWal can unblock before a
sync is called (and forceSync seems to be true only for metaregion()).

are we missing something - or is there a bug here? (the signalAll should be
conditional on hflush having actually flushed something.)
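
for reference, the interaction we are describing boils down to something like
the following - a simplified sketch with made-up names, not the actual HLog code:

    // simplified sketch of the pattern described above; illustrative only
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.locks.Condition;
    import java.util.concurrent.locks.ReentrantLock;

    class WalSketch {
      private final ReentrantLock lock = new ReentrantLock();
      private final Condition syncDone = lock.newCondition();
      private final AtomicLong unflushedEntries = new AtomicLong();
      private final long flushlogentries = 1;    // the configurable threshold
      private volatile boolean forceSync = false;

      // called from the region server when a commit wants durability
      void syncWal() throws InterruptedException {
        lock.lock();
        try {
          syncDone.await();                      // wakes up when the writer thread signals
        } finally {
          lock.unlock();
        }
      }

      // called from the log writer thread
      void hflush() {
        if (forceSync || unflushedEntries.get() >= flushlogentries) {
          // the real sync() to HDFS would happen here
          unflushedEntries.set(0);
        }
        lock.lock();
        try {
          syncDone.signalAll();                  // signalled whether or not sync() ran
        } finally {
          lock.unlock();
        }
      }
    }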

thanks,

Joydeep
Ryan Rawson
2010-01-11 23:58:55 UTC
Performance.... It's all about performance.

In my own tests, calling sync() in HDFS-0.21 on every single commit
caps small rows at a max of about 1,200 a second. One way to speed
things up is to sync less often. Another way is to sync on a timer
instead. Both of these are going to be way more important in
HDFS-0.21/HBase-0.21.

If we are talking about hdfs/hadoop 0.20, it hardly matters either
way; there is that whole 'no append/sync' thing you know all about.

-ryan
Post by Joydeep Sarma
Hey HBase-devs,
we have been going through hbase code to come up to speed.
One of the questions was regarding the commit semantics. Thumbing through
syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
hflush(), syncDone.signalAll()
     if (this.forceSync ||
         this.unflushedEntries.get() >= this.flushlogentries) { ... sync()
... }
so it seems that if forceSync is not true, the syncWal can unblock before a
sync is called (and forcesync seems to be only true for metaregion()).
are we missing something - or is there a bug here (the signalAll should be
conditional on hflush having actually flushed something).
thanks,
Joydeep
Jean-Daniel Cryans
2010-01-12 00:03:49 UTC
Hey Joydeep,

This is actually intended, but the name of the variable is
misleading. The sync is done only if forceSync is set or if we have
enough unflushed entries (the default threshold is 1). If someone wants
to sync only every 100 entries, for example, they would play with that
configuration.
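
e.g. something like this in hbase-site.xml would raise the threshold to 100
(the property name is inferred from the this.flushlogentries field in the code
quoted above, so double-check it against your version):

    <property>
      <name>hbase.regionserver.flushlogentries</name>
      <value>100</value>
    </property>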

Hope that helps,

J-D
Post by Joydeep Sarma
Hey HBase-devs,
we have been going through hbase code to come up to speed.
One of the questions was regarding the commit semantics. Thumbing through
syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
hflush(), syncDone.signalAll()
     if (this.forceSync ||
         this.unflushedEntries.get() >= this.flushlogentries) { ... sync()
... }
so it seems that if forceSync is not true, the syncWal can unblock before a
sync is called (and forcesync seems to be only true for metaregion()).
are we missing something - or is there a bug here (the signalAll should be
conditional on hflush having actually flushed something).
thanks,
Joydeep
Joydeep Sarma
2010-01-12 04:12:24 UTC
ok - hadn't thought about it that way - but yeah, with a default of 1
the semantics seem correct.

under high load some batching would automatically happen at this
setting (or so one would think - not sure if hdfs appends are blocked
on pending syncs, in which case the batching wouldn't quite happen, I
think - cc'ing Dhruba).

if the performance with a setting of 1 doesn't work out, we may need an
option to delay acks until actual syncs... (most likely we would be
able to compromise on latency to get higher throughput, but wouldn't
be willing to compromise on data integrity)
Post by Jean-Daniel Cryans
Hey Joydeep,
This is actually intended this way but the name of the variable is
misleading. The sync is done only if forceSync or we have enough
entries to sync (default is 1). If someone wants to sync only 100
entries for example, they would play with that configuration.
Hope that helps,
J-D
Post by Joydeep Sarma
Hey HBase-devs,
we have been going through hbase code to come up to speed.
syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
hflush(), syncDone.signalAll()
      if (this.forceSync ||
          this.unflushedEntries.get() >= this.flushlogentries) { ... sync() ... }
so it seems that if forceSync is not true, the syncWal can unblock before a sync is called (and forcesync seems to be only true for metaregion()).
are we missing something - or is there a bug here (the signalAll should be conditional on hflush having actually flushed something).
thanks,
Joydeep
Jean-Daniel Cryans
2010-01-12 04:48:08 UTC
Inline.

J-D
Post by Joydeep Sarma
ok - hadn't thought about it that way - but yeah with a default of 1 -
the semantics seem correct.
under high load - some batching would automatically happen at this
setting (or so one would think - not sure if hdfs appends are blocked
on pending syncs (in which case the batching wouldn't quite happen i
think) - cc'ing Dhruba).
Yes, this is our version of group commit.
Post by Joydeep Sarma
if the performance with setting of 1 doesn't work out - we may need an
option to delay acks until actual syncs .. (most likely we would be
able to compromise on latency to get higher throughput - but wouldn't
be willing to compromise on data integrity)
Good idea - we don't currently support that feature, although we have
the opposite running by default, which is deferred log flush: deferred
tables are never sync'ed per edit and rely on the LogSyncer thread's
awaitNanos timeout (configurable), while other tables can be kept
highly durable. In our opinion, a cluster with a healthy mix of
deferred and non-deferred tables still guarantees a very high level of
durability with the default settings.
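
For what it's worth, the per-table switch on trunk looks roughly like this
(a sketch from memory - verify the method name against your build):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // sketch: create a table that does NOT defer log flushes, i.e. its edits go
    // through the group-commit/sync path rather than only the LogSyncer timer
    public class CreateDurableTable {
      public static void main(String[] args) throws Exception {
        HTableDescriptor desc = new HTableDescriptor("important_table");
        desc.addFamily(new HColumnDescriptor("cf"));
        desc.setDeferredLogFlush(false);   // false = sync per edit/batch, true = rely on the timer
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        admin.createTable(desc);
      }
    }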
Dhruba Borthakur
2010-01-12 06:25:38 UTC
Any IO to an HDFS file (appends, writes, etc.) is actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands.

If we want the best of both worlds - latency as well as data integrity - how
about inserting the same record into two completely separate HBase tables in
parallel... the operation can complete as soon as the record is inserted
into the first HBase table (thus giving low latencies), but data integrity
will not be compromised because it is unlikely that two region servers will
fail at exactly the same time (assuming that there is a way to ensure that
these two tables are not handled by the same region server).

thanks,
dhruba
Post by Joydeep Sarma
ok - hadn't thought about it that way - but yeah with a default of 1 -
the semantics seem correct.
under high load - some batching would automatically happen at this
setting (or so one would think - not sure if hdfs appends are blocked
on pending syncs (in which case the batching wouldn't quite happen i
think) - cc'ing Dhruba).
if the performance with setting of 1 doesn't work out - we may need an
option to delay acks until actual syncs .. (most likely we would be
able to compromise on latency to get higher throughput - but wouldn't
be willing to compromise on data integrity)
Post by Jean-Daniel Cryans
Hey Joydeep,
This is actually intended this way but the name of the variable is
misleading. The sync is done only if forceSync or we have enough
entries to sync (default is 1). If someone wants to sync only 100
entries for example, they would play with that configuration.
Hope that helps,
J-D
Post by Joydeep Sarma
Hey HBase-devs,
we have been going through hbase code to come up to speed.
One of the questions was regarding the commit semantics. Thumbing
syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
hflush(), syncDone.signalAll()
however hflush doesn't necessarily call a sync on the underlying log
if (this.forceSync ||
this.unflushedEntries.get() >= this.flushlogentries) { ...
sync() ... }
so it seems that if forceSync is not true, the syncWal can unblock
before a sync is called (and forcesync seems to be only true for
metaregion()).
are we missing something - or is there a bug here (the signalAll should
be conditional on hflush having actually flushed something).
thanks,
Joydeep
--
Connect to me at http://www.facebook.com/dhruba
Ryan Rawson
2010-01-12 06:53:08 UTC
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.

There are two performance-enhancing things in trunk:
- bulk commit - we only call sync() once per RPC, no matter how many
rows are involved. If you use the batch put API (see the sketch below)
you can get really high levels of performance.
- group commit - we can take multiple thread's worth of sync()s and do
it in one, not N. This improves performance while maintaining high
data security.
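
To make the first point concrete, the client side of a batch put is roughly
this (a sketch - constructor details differ a bit between 0.20 and trunk):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchPutExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "my_table");
        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 1000; i++) {
          Put p = new Put(Bytes.toBytes("row-" + i));
          p.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
          batch.add(p);
        }
        // one client call; the server-side sync() cost is paid per RPC batch,
        // not per individual row
        table.put(batch);
        table.close();
      }
    }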

If you are expecting very high concurrency, group commit is your
friend. The more concurrent operations, the more rows per sync you are
capturing, and the higher overall rows/sec you can achieve while the
number of sync() calls per second stays constant.

The other option is to sync() on a fine-grained timer, e.g. every 10ms
(or at 100 Hz). The window of data loss is small, and the performance
boost is substantial. I asked JD to implement a switchable config so
that you can choose, on a table-by-table basis, the right mix of
performance vs. persistence with a better control feature than merely
"sync every N rows".

I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.

-ryan
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands.
if we want the best of both worlds.. latency as well as data integrity, how
about inserting the same record into two completely separate HBase tables in
parallel... the operation can complete as soon as the record is inserted
into the first HBase table (thus giving low latencies) but data integrity
will not be compromised because it is unlikely that two region servers will
fail exactly at the same time (assuming that there is a way to ensure that
these two tables are not handled by the same region server).
thanks,
dhruba
Post by Joydeep Sarma
ok - hadn't thought about it that way - but yeah with a default of 1 -
the semantics seem correct.
under high load - some batching would automatically happen at this
setting (or so one would think - not sure if hdfs appends are blocked
on pending syncs (in which case the batching wouldn't quite happen i
think) - cc'ing Dhruba).
if the performance with setting of 1 doesn't work out - we may need an
option to delay acks until actual syncs .. (most likely we would be
able to compromise on latency to get higher throughput - but wouldn't
be willing to compromise on data integrity)
Post by Jean-Daniel Cryans
Hey Joydeep,
This is actually intended this way but the name of the variable is
misleading. The sync is done only if forceSync or we have enough
entries to sync (default is 1). If someone wants to sync only 100
entries for example, they would play with that configuration.
Hope that helps,
J-D
Post by Joydeep Sarma
Hey HBase-devs,
we have been going through hbase code to come up to speed.
One of the questions was regarding the commit semantics. Thumbing
syncWal -> HLog.sync -> addToSyncQueue ->syncDone.await()
hflush(), syncDone.signalAll()
however hflush doesn't necessarily call a sync on the underlying log
      if (this.forceSync ||
          this.unflushedEntries.get() >= this.flushlogentries) { ...
sync() ... }
so it seems that if forceSync is not true, the syncWal can unblock
before a sync is called (and forcesync seems to be only true for
metaregion()).
are we missing something - or is there a bug here (the signalAll should
be conditional on hflush having actually flushed something).
thanks,
Joydeep
--
Connect to me at http://www.facebook.com/dhruba
Dhruba Borthakur
2010-01-12 08:24:36 UTC
Hi Ryan,

thanks for your response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different HBase instances on two different
pieces of hardware.

On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? Are you worried about the number of files in HDFS? Or
maybe the number of sync threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?

thanks,
dhruba
Ryan Rawson
2010-01-12 08:39:01 UTC
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
Ah yes, of course, I thought you meant 2 tables in the same cluster.
Post by Dhruba Borthakur
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
So each hlog needs to be treated as a stream of edits for log
recovery. Adding more logs requires the code to still treat the
pool as 1 log and keep an overall ordering across all logs as a merged
set. It just adds complexity, and I'd like to put it off as long as
possible. Initially, when I was worried about performance issues,
adding a pool only improved performance by a linear amount, and I
was looking for substantially more than that.
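
Roughly, recovery over a pool of logs turns into a merge like this (a toy
illustration with made-up types, not HBase code):

    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;

    // toy illustration of replaying several logs as one merged, ordered stream of edits
    class Edit {
      long seqId;          // global sequence id assigned at append time
      // key/value payload omitted
    }

    interface LogReader {
      boolean hasNext();
      Edit peek();
      Edit next();
    }

    class MergedReplay {
      static void replay(List<LogReader> logs) {
        PriorityQueue<LogReader> heap = new PriorityQueue<LogReader>(
            Math.max(1, logs.size()),
            new Comparator<LogReader>() {
              public int compare(LogReader a, LogReader b) {
                return Long.compare(a.peek().seqId, b.peek().seqId);
              }
            });
        for (LogReader r : logs) {
          if (r.hasNext()) heap.add(r);
        }
        while (!heap.isEmpty()) {
          LogReader r = heap.poll();
          apply(r.next());           // always the smallest outstanding sequence id
          if (r.hasNext()) heap.add(r);
        }
      }

      static void apply(Edit e) {
        // replay into the memstore for the edit's region; omitted
      }
    }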
Post by Dhruba Borthakur
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
Normally this would be the case, but consider the call
'incrementColumnValue', which essentially maintains a counter. Losing
some edits means losing counter values - if we are talking about a
counter that is incremented 100m times a day, then speed is more
important than potentially losing some extremely small number of
updates when a server crashes.
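
For reference, the client call looks roughly like this (a sketch; some versions
also have a variant taking a writeToWAL flag, which is exactly where this
tradeoff surfaces):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CounterExample {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "counters");
        // atomically bump the counter and get the new value back
        long hits = table.incrementColumnValue(
            Bytes.toBytes("page#123"), Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
        System.out.println("hits = " + hits);
        table.close();
      }
    }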

-ryan
Post by Dhruba Borthakur
thanks,
dhruba
Jean-Daniel Cryans
2010-01-12 17:41:44 UTC
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files: if you have 1000
region servers * 500 regions, a log per region would mean 500 000
HLogs to manage. Also you can have more than one file per HLog, so
with on average 5 log files per HLog that's 2 500 000 files on HDFS.

J-D
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
thanks,
dhruba
Kannan Muthukkaruppan
2010-01-12 19:40:00 UTC
Btw, is there much gain in having a large number of regions -- i.e. to the tune of 500 -- per region server?

I understand that having multiple regions per region server allows finer grained rebalancing when new nodes are added or a node goes down. But would having a smaller number of regions per region server (say ~50) really be that bad? If a region server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as 500 other nodes picking up 1/500 of its work each -- but it seems acceptable still. Are there other advantages to having a large number of regions per region server?

regards,
Kannan
-----Original Message-----
From: jdcryans-***@public.gmane.org [mailto:jdcryans-***@public.gmane.org] On Behalf Of Jean-Daniel Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
To: hbase-dev-7ArZoLwFLBtd/SJB6HiN2Ni2O/***@public.gmane.org
Subject: Re: commit semantics

wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 100 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog that's 500 000 files on HDFS.

J-D
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
thanks,
dhruba
Jean-Daniel Cryans
2010-01-12 19:53:55 UTC
It all depends very much on the size of your data vs. the size of your
cluster vs. your usage pattern.

Example: you have 50 regions on a RS and they are all filled at the
same rate. The RS dies so the master has to split the logs of 50
regions before reassigning.

Example2: you have 500 regions on a RS and only 1 is filled. When it
dies, the master will only have 1 region to process.

Since a planned optimization is to reassign, right before log splitting,
any region that has no edits in any HLog (you have to have that knowledge
prior to processing the files - maybe store that in zookeeper), you lose
availability on 49 regions in this case. Nevertheless, splitting a small
number of regions should be more efficient.

Also, more regions in general mean more memory usage and possibly more
opened files, and if your data should be served very fast, a
higher number of regions means more data to keep in memory.

J-D

On Tue, Jan 12, 2010 at 11:40 AM, Kannan Muthukkaruppan
Post by Kannan Muthukkaruppan
Btw, is there much gains in having a large number of regions-- i.e. to the tune of 500 -- per region server?
I understand that having multiple regions per region server allows finer grained rebalancing when new nodes are added or a node goes down. But would say having a smaller number of regions per region server (say ~50) be really bad. If a region server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as 500 other nodes picking up 1/500 of its work each-- but seems acceptable still. Are there other advantages of having a large number of regions per region server?
regards,
Kannan
-----Original Message-----
Sent: Tuesday, January 12, 2010 9:42 AM
Subject: Re: commit semantics
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 100 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog that's 500 000 files on HDFS.
J-D
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
thanks,
dhruba
Andrew Purtell
2010-01-12 20:49:31 UTC
Post by Kannan Muthukkaruppan
But would say having a
smaller number of regions per region server (say ~50) be really bad.
Not at all.

There are some (test) HBase deployments I know of that go pretty
vertical, multiple TBs of disk on each node therefore wanting a high
number of regions per region server to match that density. That may meet
with operational success but it is architecturally suspect. I ran a test
cluster once with > 1,000 regions per server on 25 servers, in the 0.19
timeframe. 0.20 is much better in terms of resource demand (less) and
liveness (enormously improved), but I still wouldn't recommend it,
unless your clients can wait for up to several minutes on blocked reads
and writes to affected regions should a node go down. With that many
regions per server, it stands to reason just about every client would be
affected.

The numbers I have for Google's canonical BigTable deployment are several
years out of date but they go pretty far in the other direction -- about
100 regions per server is the target.

I think it also depends on whether you intend to colocate TaskTrackers
with the region servers. I presume you intend to run HBase region servers
colocated with HDFS DataNodes. After you have a HBase cluster up for some
number of hours, certainly ~24, background compaction will bring the HDFS
blocks backing region data local to the server, generally. MapReduce
tasks backed by HBase tables will see similar advantages of data locality
that you are probably accustomed to with working with files in HDFS. If
you mix storage and computation this way it makes sense to seek a balance
between the amount of data stored on each node (number of regions being
served) and the available computational resources (available CPU cores,
time constraints (if any) on task execution).

Even if you don't intend to do the above, it's possible that an overly
high region density can negatively impact performance if too much I/O
load is placed on average on each region server. Adding more servers to
spread load would then likely help**.

These considerations bias against hosting a very large number of regions
per region server.

- Andy

**: I say likely because this presumes query and edit patterns have been
guided as necessary through engineering to be widely distributed in the
key space. You have to take some care to avoid hot regions.


----- Original Message ----
Post by Kannan Muthukkaruppan
Sent: Tue, January 12, 2010 11:40:00 AM
Subject: RE: commit semantics
Btw, is there much gains in having a large number of regions-- i.e. to the tune
of 500 -- per region server?
I understand that having multiple regions per region server allows finer grained
rebalancing when new nodes are added or a node goes down. But would say having a
smaller number of regions per region server (say ~50) be really bad. If a region
server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as
500 other nodes picking up 1/500 of its work each-- but seems acceptable still.
Are there other advantages of having a large number of regions per region
server?
regards,
Kannan
-----Original Message-----
Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
Subject: Re: commit semantics
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 100 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog that's 500 000 files on HDFS.
J-D
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
thanks,
dhruba
Kannan Muthukkaruppan
2010-01-12 21:07:59 UTC
Post by Andrew Purtell
I presume you intend to run HBase region servers
colocated with HDFS DataNodes.
Yes.

---

Seems like we all generally agree that a large number of regions per region server may not be the way to go.

So coming back to Dhruba's question on having one commit log per region instead of one commit log per region server: is the number of open HDFS files still a major concern?

Is my understanding correct that the unavailability window during region server failover is large due to the time it takes to split the shared commit log into per-region logs? Instead, if we always had per-region commit logs even in the normal mode of operation, the unavailability window would be minimized? It does limit the extent of batch/group commits you can do though -- since you can only batch updates going to the same region. Any other gotchas/issues?

regards,
Kannan
-----Original Message-----
From: Andrew Purtell [mailto:apurtell-1oDqGaOF3Lkdnm+***@public.gmane.org]
Sent: Tuesday, January 12, 2010 12:50 PM
To: hbase-dev-7ArZoLwFLBtd/SJB6HiN2Ni2O/***@public.gmane.org
Subject: Re: commit semantics
Post by Andrew Purtell
But would say having a
smaller number of regions per region server (say ~50) be really bad.
Not at all.

There are some (test) HBase deployments I know of that go pretty
vertical, multiple TBs of disk on each node therefore wanting a high
number of regions per region server to match that density. That may meet
with operational success but it is architecturally suspect. I ran a test
cluster once with > 1,000 regions per server on 25 servers, in the 0.19
timeframe. 0.20 is much better in terms of resource demand (less) and
liveness (enormously improved), but I still wouldn't recommend it,
unless your clients can wait for up to several minutes on blocked reads
and writes to affected regions should a node go down. With that many
regions per server, it stands to reason just about every client would be
affected.

The numbers I have for Google's canonical BigTable deployment are several
years out of date but they go pretty far in the other direction -- about
100 regions per server is the target.

I think it also depends on whether you intend to colocate TaskTrackers
with the region servers. I presume you intend to run HBase region servers
colocated with HDFS DataNodes. After you have a HBase cluster up for some
number of hours, certainly ~24, background compaction will bring the HDFS
blocks backing region data local to the server, generally. MapReduce
tasks backed by HBase tables will see similar advantages of data locality
that you are probably accustomed to with working with files in HDFS. If
you mix storage and computation this way it makes sense to seek a balance
between the amount of data stored on each node (number of regions being
served) and the available computational resources (available CPU cores,
time constraints (if any) on task execution).

Even if you don't intend to do the above, it's possible that an overly
high region density can negatively impact performance if too much I/O
load is placed on average on each region server. Adding more servers to
spread load would then likely help**.

These considerations bias against hosting a very large number of regions
per region server.

- Andy

**: I say likely because this presumes query and edit patterns have been
guided as necessary through engineering to be widely distributed in the
key space. You have to take some care to avoid hot regions.


----- Original Message ----
Post by Andrew Purtell
Sent: Tue, January 12, 2010 11:40:00 AM
Subject: RE: commit semantics
Btw, is there much gains in having a large number of regions-- i.e. to the tune
of 500 -- per region server?
I understand that having multiple regions per region server allows finer grained
rebalancing when new nodes are added or a node goes down. But would say having a
smaller number of regions per region server (say ~50) be really bad. If a region
server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as
500 other nodes picking up 1/500 of its work each-- but seems acceptable still.
Are there other advantages of having a large number of regions per region
server?
regards,
Kannan
-----Original Message-----
Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
Subject: Re: commit semantics
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 100 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog that's 500 000 files on HDFS.
J-D
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
thanks,
dhruba
Jean-Daniel Cryans
2010-01-12 21:36:02 UTC
Even with 100 regions times 1000 region servers, we are talking about
potentially having 100 000 opened files instead of 1 000 (and we also
have to count every replica).

I guess that an OS that was configured for such usage would be able to
sustain it... You would have to watch that metric cluster-wide, get
new nodes when needed, etc.

Then you need to make sure that GC pauses won't block for too long if
you want a very low unavailability time.

J-D

On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
Post by Kannan Muthukkaruppan
Post by Andrew Purtell
I presume you intend to run HBase region servers
colocated with HDFS DataNodes.
Yes.
---
Seems like we all generally agree that large number of regions per region server may not be the way to go.
So coming back to Dhruba's question on having one commit log per region instead of one commit log per region server. Is the number of HDFS files open still a major concern?
Is my understanding correct that unavailability window during region server failover is large due to the time it takes to split the shared commit log into a per region log? Instead, if we always had per-region commit logs even in the normal mode of operation, then the unavailability window would be minimized? It does minimize the extent of batch/group commits you can do though-- since you can only batch updates going to the same region. Any other gotchas/issues?
regards,
Kannan
-----Original Message-----
Sent: Tuesday, January 12, 2010 12:50 PM
Subject: Re: commit semantics
Post by Andrew Purtell
But would say having a
smaller number of regions per region server (say ~50) be really bad.
Not at all.
There are some (test) HBase deployments I know of that go pretty
vertical, multiple TBs of disk on each node therefore wanting a high
number of regions per region server to match that density. That may meet
with operational success but it is architecturally suspect. I ran a test
cluster once with > 1,000 regions per server on 25 servers, in the 0.19
timeframe. 0.20 is much better in terms of resource demand (less) and
liveness (enormously improved), but I still wouldn't recommend it,
unless your clients can wait for up to several minutes on blocked reads
and writes to affected regions should a node go down. With that many
regions per server,  it stands to reason just about every client would be
affected.
The numbers I have for Google's canonical BigTable deployment are several
years out of date but they go pretty far in the other direction -- about
100 regions per server is the target.
I think it also depends on whether you intend to colocate TaskTrackers
with the region servers. I presume you intend to run HBase region servers
colocated with HDFS DataNodes. After you have a HBase cluster up for some
number of hours, certainly ~24, background compaction will bring the HDFS
blocks backing region data local to the server, generally. MapReduce
tasks backed by HBase tables will see similar advantages of data locality
that you are probably accustomed to with working with files in HDFS. If
you mix storage and computation this way it makes sense to seek a balance
between the amount of data stored on each node (number of regions being
served) and the available computational resources (available CPU cores,
time constraints (if any) on task execution).
Even if you don't intend to do the above, it's possible that an overly
high region density can negatively impact performance if too much I/O
load is placed on average on each region server. Adding more servers to
spread load would then likely help**.
These considerations bias against hosting a very large number of regions
per region server.
  - Andy
**: I say likely because this presumes query and edit patterns have been
guided as necessary through engineering to be widely distributed in the
key space. You have to take some care to avoid hot regions.
----- Original Message ----
Post by Andrew Purtell
Sent: Tue, January 12, 2010 11:40:00 AM
Subject: RE: commit semantics
Btw, is there much gains in having a large number of regions-- i.e. to the tune
of 500 -- per region server?
I understand that having multiple regions per region server allows finer grained
rebalancing when new nodes are added or a node goes down. But would say having a
smaller number of regions per region server (say ~50) be really bad. If a region
server goes down, 50 other nodes would pick up ~1/50 of its work. Not as good as
500 other nodes picking up 1/500 of its work each-- but seems acceptable still.
Are there other advantages of having a large number of regions per region
server?
regards,
Kannan
-----Original Message-----
Cryans
Sent: Tuesday, January 12, 2010 9:42 AM
Subject: Re: commit semantics
wrt 1 HLog per region server, this is from the Bigtable paper. Their
main concern is the number of opened files since if you have 1000
region servers * 500 regions then you may have 100 000 HLogs to
manage. Also you can have more than one file per HLog, so let's say
you have on average 5 log files per HLog that's 500 000 files on HDFS.
J-D
Post by Dhruba Borthakur
Hi Ryan,
thanks for ur response.
Post by Ryan Rawson
Right now each regionserver has 1 log, so if 2 puts on different
tables hit the same RS, they hit the same HLog.
I understand. My point was that the application could insert the same record
into two different tables on two different Hbase instances on two different
piece of hardware.
On a related note, can somebody explain what the tradeoff is if each region
has its own hlog? are you worried about the number of files in HDFS? or
maybe the number of sync-threads in the region server? Can multiple hlog
files provide faster region splits?
Post by Ryan Rawson
I've thought about this issue quite a bit, and I think the sync every
1 rows combined with optional no-sync and low time sync() is the way
to go. If you want to discuss this more in person, maybe we can meet
up for brews or something.
The group-commit thing I can understand. HDFS does a very similar thing. But
can you explain your alternative "sync every 1 rows combined with optional
no-sync and low time sync"? For those applications that have the natural
characteristics of updating only one row per logical operation, how can they
be sure that their data has reached some-sort-of-stable-storage unless they
sync after every row update?
thanks,
dhruba
stack
2010-01-13 01:23:00 UTC
On Tue, Jan 12, 2010 at 1:07 PM, Kannan Muthukkaruppan
Post by Kannan Muthukkaruppan
Seems like we all generally agree that large number of regions per region
server may not be the way to go.
What Andrew says. You could make regions bigger, so there is more data per
regionserver but the same rough (small) number of regions to redeploy on a
crash; the logs to replay will be correspondingly bigger, though, taking
longer to process.
Post by Kannan Muthukkaruppan
So coming back to Dhruba's question on having one commit log per region
instead of one commit log per region server. Is the number of HDFS files
open still a major concern?
Yes. From "Commit-log implementation" section of the BT paper:

"If we kept the commit log for each tablet in a separate log file, a very
large number of files would be written concurrently in GFS. Depending on the
underlying file system implementation on each GFS server, these writes could
cause a large number of disk seeks to write to the different physical log
files. In addition, having separate log files per tablet also reduces the
effectiveness of the group commit optimization, since groups would tend to
be smaller. To fix these issues, we append mutations to a single commit log
per tablet server, co-mingling mutations for different tablets in the same
physical log file."

Not knowing any better, we presume hdfs is kinda-like gfs.
Post by Kannan Muthukkaruppan
Is my understanding correct that unavailability window during region server
failover is large due to the time it takes to split the shared commit log
into a per region log?
Yes, though truth be told, this area of hbase performance has had very
little attention paid to it. There are things that we could do much better
-- e.g. a distributed split instead of a threaded split inside a single
process -- and ideas for making it so we can take on writes much sooner than
we currently do; e.g. open regions immediately on the new server before the
split completes.
Post by Kannan Muthukkaruppan
Instead, if we always had per-region commit logs even in the normal mode of
operation, then the unavailability window would be minimized? It does
minimize the extent of batch/group commits you can do though-- since you can
only batch updates going to the same region. Any other gotchas/issues?
Just those listed above.

St.Ack
stack
2010-01-13 05:12:23 UTC
(Below is a note from Joydeep. Something about Joydeep's messages
requires that I approve/disapprove them. For the message below, I hit
disapprove by mistake, so am copying it here manually.)

---------- Forwarded message ----------
From: Joydeep Sarma <jssarma-1oDqGaOF3Lkdnm+***@public.gmane.org>
To: hbase-dev-7ArZoLwFLBtd/SJB6HiN2Ni2O/***@public.gmane.org, kannan-/eYsKH2uxv5Wk0Htik3J/***@public.gmane.org, Dhruba Borthakur <
dhruba-/eYsKH2uxv5Wk0Htik3J/***@public.gmane.org>
Date: Tue, 12 Jan 2010 15:39:05 -0800
Subject: Re: commit semantics
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it stands.
I think this is likely to explain the limited throughput with the default
write queue threshold of 1. If the appends cannot make progress while
one is waiting for the sync, then the write pipeline is going to be
idle most of the time (with a queue threshold of 1).

I think it would be good to have the sync not block other writers on
the file/pipeline. Logically, it's not clear why it needs to (since
the sync is just a wait for completion up to some write
transaction id, allowing new ones to be queued up subsequently).
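
in other words, something with this shape (a hypothetical sketch of the idea,
not DFSClient code):

    import java.util.concurrent.atomic.AtomicLong;

    // hypothetical sketch: appends get a transaction id immediately and are pipelined;
    // sync(txid) only waits for the ack watermark to catch up, so new appends keep flowing
    class PipelinedLog {
      private final AtomicLong nextTxid = new AtomicLong();
      private volatile long ackedTxid = -1;

      long append(byte[] record) {
        long txid = nextTxid.getAndIncrement();
        // hand (txid, record) to the write pipeline without waiting for acks
        return txid;
      }

      // called by the pipeline's response-processing thread as acks come back
      synchronized void onAck(long txid) {
        if (txid > ackedTxid) {
          ackedTxid = txid;
          notifyAll();
        }
      }

      // blocks only the caller that asked for durability up to txid
      synchronized void sync(long txid) throws InterruptedException {
        while (ackedTxid < txid) {
          wait();
        }
      }
    }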

Joydeep
stack
2010-01-13 05:16:56 UTC
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep? Or some
synchronization block up in hlog?

St.Ack
Joydeep Sarma
2010-01-13 05:41:27 UTC
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.

leave it up to Dhruba to explain the details.
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep?  Or some
synchronization block up in hlog?
St.Ack
Dhruba Borthakur
2010-01-13 16:51:13 UTC
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.

thanks,
dhruba
Post by Joydeep Sarma
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
Post by stack
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep? Or some
synchronization block up in hlog?
St.Ack
--
Connect to me at http://www.facebook.com/dhruba
Jean-Daniel Cryans
2010-01-13 17:56:09 UTC
That's great dhruba, I guess the soonest it could go in is 0.21.1?

J-D
Post by Dhruba Borthakur
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks,
dhruba
Post by Joydeep Sarma
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep?  Or some
synchronization block up in hlog?
St.Ack
--
Connect to me at http://www.facebook.com/dhruba
Dhruba Borthakur
2010-01-13 18:38:16 UTC
I will try to make a patch for it first. Depending on the complexity of the
patch code, we can decide which release it can go into.

thanks,
dhruba
Post by Jean-Daniel Cryans
That's great dhruba, I guess the sooner it could go in is 0.21.1?
J-D
Post by Dhruba Borthakur
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks,
dhruba
Post by Joydeep Sarma
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
Post by stack
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep? Or some
synchronization block up in hlog?
St.Ack
--
Connect to me at http://www.facebook.com/dhruba
--
Connect to me at http://www.facebook.com/dhruba
Jean-Daniel Cryans
2010-01-13 18:40:57 UTC
I'll be happy to benchmark, we already have code to test the
multi-client hitting 1 region server case.

J-D
Post by Dhruba Borthakur
I will try to make a patch for it first. depending on the complexity of the
patch code, we can decide which release it can go in.
thanks,
dhruba
Post by Jean-Daniel Cryans
That's great dhruba, I guess the sooner it could go in is 0.21.1?
J-D
Post by Dhruba Borthakur
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks,
dhruba
Post by Joydeep Sarma
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep?  Or some
synchronization block up in hlog?
St.Ack
--
Connect to me at http://www.facebook.com/dhruba
--
Connect to me at http://www.facebook.com/dhruba
Dhruba Borthakur
2010-01-13 18:43:58 UTC
Awesome, I will try to post a patch soon and will let you know as soon as I
have the first version ready.

thanks,
dhruba
Post by Jean-Daniel Cryans
I'll be happy to benchmark, we already have code to test the
multi-client hitting 1 region server case.
J-D
Post by Dhruba Borthakur
I will try to make a patch for it first. depending on the complexity of the
patch code, we can decide which release it can go in.
thanks,
dhruba
Post by Jean-Daniel Cryans
That's great dhruba, I guess the sooner it could go in is 0.21.1?
J-D
Post by Dhruba Borthakur
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks,
dhruba
Post by Joydeep Sarma
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
Post by stack
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep? Or some
synchronization block up in hlog?
St.Ack
--
Connect to me at http://www.facebook.com/dhruba
--
Connect to me at http://www.facebook.com/dhruba
--
Connect to me at http://www.facebook.com/dhruba
Joydeep Sarma
2010-01-13 19:01:51 UTC
i posted on the jira as well - but we should be able to simulate the
effect of the patch.

if the sync was simulated with merely a sleep (for 2-3ms - whatever the
average RTT for the dfs write pipeline is) instead of an actual call into
the dfs client - it should simulate the effect of the patch. (the appends
would proceed in parallel, and each sync would block for some time).

so we should be able to test whether this gets a performance win for
the queue threshold=1 case.
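
concretely, something like swapping the body of the sync path for this in a
throwaway benchmark build (purely for measurement, obviously not for real use):

    // stand-in for the ~2-3 ms write-pipeline round trip, without actually
    // blocking the appends behind a real DFSClient sync
    static void simulatedSync() {
      try {
        Thread.sleep(3);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    }
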
Awesome, I will try to post a patch soon and  will let you know as soon as I
have the first version ready.
thanks,
dhruba
Post by Jean-Daniel Cryans
I'll be happy to benchmark, we already have code to test the
multi-client hitting 1 region server case.
J-D
Post by Dhruba Borthakur
I will try to make a patch for it first. depending on the complexity of the
patch code, we can decide which release it can go in.
thanks,
dhruba
Post by Jean-Daniel Cryans
That's great dhruba, I guess the sooner it could go in is 0.21.1?
J-D
Post by Dhruba Borthakur
I opened http://issues.apache.org/jira/browse/HDFS-895 for this one.
thanks,
dhruba
Post by Joydeep Sarma
this is internal to the dfsclient. this would explain why performance
would suck with queue threshold of 1.
leave it up to Dhruba to explain the details.
Post by Dhruba Borthakur
Post by Dhruba Borthakur
any IO to a HDFS-file (appends, writes, etc) ae actually blocked on a
pending sync. "sync" in HDFS is a pretty heavyweight operation as it
stands.
i think this is likely to explain limited throughput with the default
write queue threshold of 1. if the appends cannot make progress while
one is waiting for the sync - then the write pipeline is going to be
idle most of the time (with queue threshold of 1).
i think it would be good to have the sync not block other writers on
the file/pipeline. logically - it's not clear why it needs to (since
the sync is just a wait for the completion as of some write
transaction id - allowing new ones to be queued up subsequently).
Are you talking about internal to DFSClient Joydeep?  Or some
synchronization block up in hlog?
St.Ack
--
Connect to me at http://www.facebook.com/dhruba
--
Connect to me at http://www.facebook.com/dhruba
--
Connect to me at http://www.facebook.com/dhruba
Jean-Daniel Cryans
2010-01-13 22:56:15 UTC
Permalink
Good idea, let me try it.

J-D
stack
2010-01-12 17:58:44 UTC
Permalink
Post by Dhruba Borthakur
if we want the best of both worlds.. latency as well as data integrity, how
about inserting the same record into two completely separate HBase tables in
parallel... the operation can complete as soon as the record is inserted
into the first HBase table (thus giving low latencies)
Return after insert into the first table? Then internally hbase is meant to
take care of the insert into the second table? What if the latter fails for
some reason other than regionserver crash?

The two writes would have to be done as hdfs does, in series, if the two
tables are to remain in sync, with the addition of a rollback of the
transaction if the insert does not go through to both tables, since we don't
have something like the hdfs background thread ensuring replica counts.
Post by Dhruba Borthakur
but data integrity
will not be compromised because it is unlikely that two region servers will
fail exactly at the same time (assuming that there is a way to ensure that
these two tables are not handled by the same region server).
How do you suggest the application deal with reading from these two tables?
If they are guaranteed in-sync, then it could pick either. If the two can
wander, then the application needs to read from both and reconcile them
somehow?

Just trying to understand what you are suggesting Dhruba,
St.Ack
Dhruba Borthakur
2010-01-12 18:14:07 UTC
Permalink
Hi stack,

I meant "what if the application inserted the same record into two
HBase instances"? Of course, now the onus is on the application to keep both
of them in sync and recover from any inconsistencies between them.

thanks,
dhruba
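
For what it's worth, a minimal sketch of that double-write, assuming a
hypothetical putToCluster() wrapper around the HBase client put - the call
acks as soon as the first write lands, the second finishes (or fails) in the
background, and the application owns detecting and repairing any divergence:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorCompletionService;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch only: writes one record to two independent HBase instances and
    // returns once either of them has accepted it.
    class DualClusterWriter {
      private final ExecutorService pool = Executors.newFixedThreadPool(2);

      void writeToBoth(final byte[] row, final byte[] value) throws Exception {
        ExecutorCompletionService<Void> done =
            new ExecutorCompletionService<Void>(pool);
        done.submit(new Callable<Void>() {
          public Void call() throws Exception { putToCluster("clusterA", row, value); return null; }
        });
        done.submit(new Callable<Void>() {
          public Void call() throws Exception { putToCluster("clusterB", row, value); return null; }
        });
        done.take().get();   // ack the caller after the first put that finishes
        // the second put is still in flight; its failure has to be noticed
        // and repaired by the application (read-repair, backfill, ...)
      }

      // hypothetical: open an HTable against the named cluster and do the Put
      private void putToCluster(String cluster, byte[] row, byte[] value) { }
    }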
stack
2010-01-12 18:51:32 UTC
Permalink
Post by Dhruba Borthakur
Hi stack,
I meant "what if the application inserted the same record into two
HBase instances"? Of course, now the onus is on the application to keep both of
them in sync and recover from any inconsistencies between them.
Ok. Like your "Overlapping Clusters for HA" from
http://www.borthakur.com/ftp/hdfs_high_availability.pdf?

I'm not sure how the application could return after writing one cluster
without waiting on the second to complete, as you suggest above. It could
write in parallel, but the second thread might not complete for myriad
reasons. What then? And as you say, when reading, the client would have to
reconcile the two.

Isn't there already a 'scalable database' that gives you this headache for
free without your having to do work on your part (smile)?

Do you think there is a problem syncing on every write (with some batching of
writes happening under high concurrency) or, if that is too slow for your
needs, adding the holding of clients until the sync happens, as Joydeep
suggests? Will that be sufficient data integrity-wise?

St.Ack

Thanks,
St.Ack
Kannan Muthukkaruppan
2010-01-12 19:29:46 UTC
Permalink
Dhruba & I just talked off-line about this as well. Yes, writing to two clusters would result in unnecessary complexity... we will essentially need to deal with inconsistencies between the two clusters at the application level.

For data integrity, going with group commits (batch commits) seems like a good option. My understanding of group commits as implemented in 0.21 is as follows:

* We wait on acknowledging back to the client until the transaction has been synced to HDFS.

*         Syncs are batched: a sync is called if the queue has enough transactions or if a timer expires. (I would imagine that both the # of transactions to batch up as well as the timer are configurable knobs already?) In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself.
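
To make that concrete, a small sketch of such a deferred-ack batcher with a
dedicated syncer thread. The names are illustrative (batchSize standing in
for flushlogentries, maxDelayMs for the timer); this is not the actual HLog
code, just the shape of the behaviour described above:

    // Edits accumulate; the syncer thread syncs when either batchSize edits are
    // pending or maxDelayMs has passed, and only then are waiting clients acked.
    class DeferredAckLog {
      private final Object lock = new Object();
      private long lastAppendedSeq = 0;  // highest edit appended to the log buffer
      private long lastSyncedSeq = 0;    // highest edit known durable in HDFS
      private final int batchSize;       // e.g. what flushlogentries configures
      private final long maxDelayMs;     // upper bound on how long an ack is held

      DeferredAckLog(int batchSize, long maxDelayMs) {
        this.batchSize = batchSize;
        this.maxDelayMs = maxDelayMs;
        Thread syncer = new Thread(new Runnable() {
          public void run() { syncLoop(); }
        }, "log-syncer");
        syncer.setDaemon(true);
        syncer.start();
      }

      // Handler thread: append the edit, then hold the ack until it is durable.
      void appendAndWait(byte[] edit) throws InterruptedException {
        long mySeq;
        synchronized (lock) {
          appendToBuffer(edit);          // buffered write, not yet durable
          mySeq = ++lastAppendedSeq;
          lock.notifyAll();              // wake the syncer if it is waiting
          while (lastSyncedSeq < mySeq) {
            lock.wait();                 // ack deferred until the batch is synced
          }
        }
      }

      private void syncLoop() {
        try {
          while (true) {
            long target;
            synchronized (lock) {
              long deadline = System.currentTimeMillis() + maxDelayMs;
              while (lastAppendedSeq - lastSyncedSeq < batchSize) {
                long waitMs = deadline - System.currentTimeMillis();
                if (waitMs <= 0) {
                  if (lastAppendedSeq > lastSyncedSeq) break;           // timer expired: sync what we have
                  deadline = System.currentTimeMillis() + maxDelayMs;   // nothing pending, restart timer
                  waitMs = maxDelayMs;
                }
                lock.wait(waitMs);
              }
              target = lastAppendedSeq;
            }
            syncToHdfs();                // one sync covers the whole batch
            synchronized (lock) {
              if (target > lastSyncedSeq) lastSyncedSeq = target;
              lock.notifyAll();          // release every client whose edit is <= target
            }
          }
        } catch (InterruptedException ie) {
          // shutting down
        }
      }

      private void appendToBuffer(byte[] edit) { /* write to the WAL writer's buffer */ }
      private void syncToHdfs() { /* the actual sync()/hflush() call */ }
    }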




Jean-Daniel Cryans
2010-01-12 19:43:33 UTC
Permalink
On Tue, Jan 12, 2010 at 11:29 AM, Kannan Muthukkaruppan wrote:
*         We wait on acknowledging back to the client until the transaction has been synced to HDFS.
Yes
*         Syncs are batched: a sync is called if the queue has enough transactions or if a timer expires. (I would imagine that both the # of transactions to batch up as well as the timer are configurable knobs already?) In this mode, for the client, the latency increase on writes is upper bounded by the timer setting + the cost of sync itself.
Nope. There are two kinds of group commit around that piece of code:

1) What you called batch commit, which is a configurable value
(flushlogentries): we have to append that many entries to trigger a
sync. Clients don't hold until that sync happens, so a region server
failure could lose some rows depending on the time between the last
sync and the failure.

If flushlogentries=100 and 99 entries are lying around for more than
the timer's timeout (default 1 sec), the timer will force-sync those
entries.

2) Group commit happens at high concurrency and is only useful if a
high number of clients are writing at the same time and
flushlogentries=1. What happens in the LogSyncer thread is that
instead of calling sync() for every entry, we "group" the clients
waiting on the previous sync and issue only 1 sync for all of them. In
this case, when the call returns in the client, we are sure that the
value is in HDFS.
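
A compressed sketch of that second mode, with illustrative names only. In the
real code the grouping lives in the LogSyncer thread; here, to keep the
example short, whichever waiting client finds no sync in flight drives the
sync for the whole group:

    class GroupCommit {
      private final Object lock = new Object();
      private long highestAppended = 0;   // last edit handed to the log writer
      private long highestSynced = 0;     // last edit known durable
      private boolean syncInFlight = false;

      // Handler thread: append the edit, remember its sequence number.
      long append(byte[] edit) {
        synchronized (lock) {
          appendToBuffer(edit);
          return ++highestAppended;
        }
      }

      // Handler thread: do not ack the client until the edit is durable.
      void waitDurable(long myEditSeq) throws InterruptedException {
        while (true) {
          long target;
          synchronized (lock) {
            if (highestSynced >= myEditSeq) return;      // covered by an earlier group sync
            if (syncInFlight) { lock.wait(); continue; } // piggyback on the sync in progress
            syncInFlight = true;
            target = highestAppended;                    // this sync covers everyone waiting so far
          }
          syncToHdfs();                                  // a single sync() for the whole group
          synchronized (lock) {
            if (target > highestSynced) highestSynced = target;
            syncInFlight = false;
            lock.notifyAll();                            // one sync releases the whole group
          }
        }
      }

      private void appendToBuffer(byte[] edit) { /* buffered append to the WAL */ }
      private void syncToHdfs() { /* the real sync()/hflush() into HDFS */ }
    }

With, say, 10 handlers writing at once, one or two sync() calls end up
covering all 10 edits instead of 10 separate syncs, which is where the
throughput win comes from while still returning only after the value is in
HDFS.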
Kannan Muthukkaruppan
2010-01-12 20:10:42 UTC
Permalink
Ok cool. Thanks for clarifying.

I think what I had in mind was a hybrid: basically, accumulate transactions up to a certain app-configurable time window before syncing (and delay acking the client until the sync).

Just caught up on an earlier response from Joy on this as well.

<<if the performance with setting of 1 doesn't work out - we may need an option to delay acks until actual syncs .. (most likely we would be able to compromise on latency to get higher throughput - but wouldn't be willing to compromise on data integrity)>>

Yes, that's what I had in mind. Agree that this could be something we explore later if necessary.

Regards,
Kannan