Using an HBase table as MapReduce source - hadoop

As far as I understand, when using an HBase table as the source for a MapReduce job, we have to define a value for the scan. Let's say we set it to 500: does this mean that each mapper is given only 500 rows from the HBase table? Is there any problem if we set it to a very high value?
If the scan size is small, don't we have the same problem as with small files in MapReduce?

Here's the sample code from the HBase Book on how to run a MapReduce job reading from an HBase table.
Configuration config = HBaseConfiguration.create();
Job job = new Job(config, "ExampleRead");
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
    tableName,      // input HBase table name
    scan,           // Scan instance to control CF and attribute selection
    MyMapper.class, // mapper
    null,           // mapper output key
    null,           // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper
boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
When you say "value for the scan", that's not a real setting. You mean either scan.setCaching(), scan.setBatch(), or scan.setMaxResultSize().
setCaching tells the server how many rows to load before returning the result to the client.
setBatch limits the number of columns returned in each call, which matters if you have a very wide table.
setMaxResultSize limits the size, in bytes, of the results returned to the client in each call.
Typically you don't set MaxResultSize in a MapReduce job. None of these settings caps the total number of rows a mapper processes, so you will see all of the data.
Reference for the above information is here.

The mapper code that you write is given the data row by row. The mapper runtime, however, fetches the records in chunks of the caching size (i.e. 500 rows at a time in your case).
If the caching size is too small, execution becomes very inefficient (lots of I/O calls).
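To make the cost of a small caching value concrete, here is a rough Java sketch that only does the arithmetic and never talks to HBase. It assumes each scanner round trip returns at most `caching` rows; the class name, method name, and row count are made up for illustration.

```java
class ScanCachingSketch {
    // Approximate number of scanner round trips needed to read `rows` rows
    // when each round trip returns at most `caching` rows. This is an
    // assumption for illustration; real scans are also bounded by result
    // size in bytes.
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // hypothetical number of rows in one mapper's split
        for (int caching : new int[] {1, 500, 10_000}) {
            System.out.println("caching=" + caching + " -> "
                    + roundTrips(rows, caching) + " round trips");
        }
    }
}
```

Under this assumption, the mapper still processes every row in its split; only the number of client-server exchanges changes.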

Related

union not happening with Spark transform

I have a Spark stream with records flowing in. The batch interval is 1 second.
I want to union all the data in the stream. So I created an empty RDD and then, using the transform method, I union each RDD in the stream with this empty RDD.
I expect this empty RDD to contain all the data at the end.
But this RDD always remains empty.
Also, can somebody tell me whether my logic is correct?
JavaRDD<Row> records = ss.emptyDataFrame().toJavaRDD();
JavaDStream<Row> transformedMessages = messages.flatMap(record -> processData(record))
    .transform(rdd -> rdd.union(records));
transformedMessages.foreachRDD(record -> {
    System.out.println("Aman" + record.count());
    StructType schema = DataTypes.createStructType(fields);
    Dataset ds = ss.createDataFrame(records, schema);
    ds.createOrReplaceTempView("tempTable");
    ds.show();
});
Initially, records is empty.
Then we have transformedMessages = messages + records, but records is empty, so transformedMessages = messages (leaving aside the flatMap function, which is not relevant to the discussion).
Later on, when we do Dataset ds = ss.createDataFrame(records, schema); records is still empty. Nothing in the flow of the program changes that, so it remains empty over time.
I think what we want to do is, instead of
.transform(rdd -> rdd.union(records));
we should do:
.foreachRDD{rdd => records = rdd.union(records)} //Scala: translate to Java syntax
That said, please note that this process iteratively adds to the lineage of the 'records' RDD and also accumulates all data over time. This is not a job that can run stably for a long period: given enough data, it will eventually grow beyond the limits of the system.
There's no information about the use case behind this question, but the current approach seems neither scalable nor sustainable.
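The difference between the two variants can be illustrated without Spark. The following is only a sketch under stated assumptions: plain Java lists stand in for RDDs, a hypothetical `union` helper returns a new list the way RDD.union returns a new RDD, and an AtomicReference is used because a Java lambda cannot reassign a captured local variable. All names are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class UnionAccumulationSketch {
    // Stand-in for RDD.union: returns a NEW list and leaves both inputs untouched.
    static List<String> union(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Mirrors the question: the union result flows downstream,
    // but `records` itself is never reassigned.
    static List<String> transformOnly(List<List<String>> batches) {
        List<String> records = new ArrayList<>();
        for (List<String> batch : batches) {
            union(batch, records); // `records` is unchanged by this call
        }
        return records;
    }

    // Mirrors the suggested fix: keep the union result by replacing the reference.
    static List<String> reassign(List<List<String>> batches) {
        AtomicReference<List<String>> records = new AtomicReference<>(new ArrayList<>());
        batches.forEach(batch -> records.set(union(batch, records.get())));
        return records.get();
    }

    public static void main(String[] args) {
        // Two hypothetical micro-batches.
        List<List<String>> batches =
                Arrays.asList(Arrays.asList("a", "b"), Arrays.asList("c"));
        System.out.println("transform only: " + transformOnly(batches).size());
        System.out.println("reassign:       " + reassign(batches).size());
    }
}
```

The second variant grows with every batch, which is the same unbounded accumulation the answer warns about.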

Cassandra WordCount Hadoop

Can anyone explain to me the following lines from Cassandra 2.1.15 WordCount example?
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "3");
CqlConfigHelper.setInputCql(job.getConfiguration(), "select * from " + COLUMN_FAMILY + " where token(id) > ? and token(id) <= ? allow filtering");
How do I define concrete values which will be used to replace "?" in the query?
And what is meant by page row size?
How do I define concrete values which will be used to replace "?" in the query?
You don't. These parameterized values are set by the splits created by the input format. They are set automatically but can be adjusted (to a degree) by adjusting the split size.
And what is meant by page row size?
Page row size determines the number of CQL Rows retrieved in a single request by a mapper during execution. If a C* partition contains 10000 CQL rows and the page row size is set to 1000, it will take 10 requests to retrieve all of the data.

How to know Hadoop reducer assigned records

I'm using a custom partitioner that assigns records to the reducers randomly. Then the reducers start processing.
Is there a way to know how many records are assigned to each reducer before the reducers start working?
A partitioner does not assign records to reducers randomly; it follows predefined logic.
When we write a custom partitioner, we write the logic for how records should be distributed among the reducers.
For instance, suppose you are dealing with data that consists of a single field, age.
You can decide how your input will be distributed across the reducers.
First of all, you have to configure the number of reducers you want for the job, which can be done in the driver program of the MapReduce job.
Suppose you have configured 3 reducers.
While writing the custom partitioner, you would define logic such as:
if (ageInt <= 20) {
    return 0;
}
// else if the age is between 20 and 50, assign partition 1
if (ageInt > 20 && ageInt <= 50) {
    return 1 % numReduceTasks;
}
// otherwise assign partition 2
else
    return 2 % numReduceTasks;
All records with an age of 20 or less go to the first reducer.
Even before executing the job, you can count the number of records each reducer will receive by applying the same condition to your input.
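As a minimal sketch of that pre-count, the partitioning rule can be written as a plain function and applied to the input before the job runs. The class name, method names, and sample ages below are hypothetical, and the sketch assumes the ages are available as an in-memory array; for real data the same rule would have to be applied in a separate pass over the input.

```java
import java.util.Arrays;

class AgePartitionCountSketch {
    // Same rule as the partitioner logic above, as a plain function.
    static int partitionFor(int age, int numReduceTasks) {
        if (age <= 20) {
            return 0;
        }
        if (age <= 50) {
            return 1 % numReduceTasks;
        }
        return 2 % numReduceTasks;
    }

    // Counts how many records each reducer would receive under that rule.
    static long[] countPerReducer(int[] ages, int numReduceTasks) {
        long[] counts = new long[numReduceTasks];
        for (int age : ages) {
            counts[partitionFor(age, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ages = {12, 18, 25, 47, 50, 63}; // hypothetical sample input
        System.out.println(Arrays.toString(countPerReducer(ages, 3)));
    }
}
```

This only works because the rule is deterministic; a partitioner that really assigned records at random could not be predicted this way.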

Hadoop Buffering vs Streaming

Could someone please explain the difference between streaming and buffering in Hadoop?
Here is the context, which I read in the Hive documentation:
In every map/reduce stage of the join, the last table in the sequence is streamed through the reducers whereas the others are buffered. Therefore, it helps to reduce the memory needed in the reducer for buffering the rows for a particular value of the join key by organizing the tables such that the largest tables appear last in the sequence. e.g. in:
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
In a reduce-side join, the values from multiple tables are usually tagged so that the reducer can identify which table each value comes from.
Consider the case of two tables:
On each reduce call, the mixed values associated with both tables are iterated.
During iteration, the values for one of the tags/tables are stored locally in an ArrayList. (This is buffering.)
As the rest of the values are streamed through and values for the other tag/table are detected, the values of the first tag are fetched from the saved ArrayList. The two tagged values are joined and written to the output collector.
Contrast this with the case where the larger table's values are kept in the ArrayList: this could result in an OOM if the ArrayList outgrows the memory of the container's JVM.
void reduce(TextPair key, Iterator<TextPair> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // buffer for table1
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // table1 tag
    Text table1Tag = key.getSecond();
    TextPair value = null;
    while (values.hasNext()) {
        value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // buffering: copy the value, because Hadoop reuses the object returned by the iterator
            table1Values.add(new Text(value.getFirst()));
        } else {
            // streaming: join each incoming row of the other table against the buffered rows
            for (Text val : table1Values) {
                output.collect(key.getFirst(), new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}
You can use the below hint to specify which of the joined tables would be streamed on reduce side:
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
Hadoop Streaming in general refers to using custom-made Python or shell scripts to perform your map-reduce logic (for example, via the Hive TRANSFORM keyword).
Hadoop buffering, in this context, refers to the phase in a map-reduce job of a Hive query with a join in which records are read into the reducers after having been sorted and grouped coming out of the mappers. The author is explaining why you should order the join clauses in a Hive query so that the largest tables are last: it helps optimize the implementation of joins in Hive.
They are completely different concepts.
In response to your comments:
In Hive's join implementation, it must take records from multiple tables, sort them by the join key, and then collate them in the proper order. The reducer reads them grouped by table, so it sees a group from each table in turn and can start processing only once all tables have been seen. The groups from the first tables need to be buffered (kept in memory) because they cannot be processed until the last table is seen. The last table can be streamed (each row processed as it is read), since the other tables' groups are already in memory and the join can start.

Hbase quickly count number of rows

Right now I implement row count over ResultScanner like this
for (Result rs = scanner.next(); rs != null; rs = scanner.next()) {
number++;
}
If the data reaches millions of rows, the computation takes a long time. I want to compute the count in real time, and I don't want to use MapReduce.
How can I quickly count the number of rows?
Use RowCounter in HBase
RowCounter is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.
$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>
Usage: RowCounter [options]
<tablename> [
--starttime=[start]
--endtime=[end]
[--range=[startKey],[endKey]]
[<column1> <column2>...]
]
You can use the count command in the HBase shell to count the number of rows. But yes, counting the rows of a large table can be slow.
count 'tablename' [interval]
Return value is the number of rows.
This operation may take a LONG time (Run ‘$HADOOP_HOME/bin/hadoop jar
hbase.jar rowcount’ to run a counting mapreduce job). Current count is shown
every 1000 rows by default. Count interval may be optionally specified. Scan
caching is enabled on count scans by default. Default cache size is 10 rows.
If your rows are small in size, you may want to increase this
parameter.
Examples:
hbase> count 't1'
hbase> count 't1', INTERVAL => 100000
hbase> count 't1', CACHE => 1000
hbase> count 't1', INTERVAL => 10, CACHE => 1000
The same commands can also be run on a table reference. Suppose you had a reference t to table 't1'; the corresponding commands would be:
hbase> t.count
hbase> t.count INTERVAL => 100000
hbase> t.count CACHE => 1000
hbase> t.count INTERVAL => 10, CACHE => 1000
If you cannot use RowCounter for whatever reason, then a combination of these two filters should be an optimal way to get a count:
FirstKeyOnlyFilter() AND KeyOnlyFilter()
The FirstKeyOnlyFilter makes the scanner return only the first column qualifier it finds in each row, rather than all of the column qualifiers in the table, which minimizes network bandwidth. What about simply picking one column qualifier to return? This would work if you could guarantee that the column qualifier exists for every row; if it does not, you would get an inaccurate count.
The KeyOnlyFilter makes the scanner return only the key part of each cell (row key, column family, qualifier, timestamp), without the value. This further reduces network bandwidth. In the general case that wouldn't amount to much of a reduction, but there can be an edge case where the first column picked by the previous filter happens to hold an extremely large value.
I tried playing around with scan.setCaching, but the results were all over the place. Perhaps it could help.
I had 16 million rows between a start and stop row, on which I did the following pseudo-empirical testing:
With FirstKeyOnlyFilter and KeyOnlyFilter activated:
With caching not set (i.e., the default value), it took 188 seconds.
With caching set to 1, it took 188 seconds
With caching set to 10, it took 200 seconds
With caching set to 100, it took 187 seconds
With caching set to 1000, it took 183 seconds.
With caching set to 10000, it took 199 seconds.
With caching set to 100000, it took 199 seconds.
With FirstKeyOnlyFilter and KeyOnlyFilter disabled:
With caching not set, (i.e., the default value), it took 309 seconds
I didn't bother to do proper testing on this, but it seems clear that the FirstKeyOnlyFilter and KeyOnlyFilter are good.
Moreover, the cells in this particular table are very small - so I think the filters would have been even better on a different table.
Here is a Java code sample:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.FilterList;
public class HBaseCount {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();
        HTable table = new HTable(config, "my_table");
        // scan only the rows between the start and stop keys
        Scan scan = new Scan(
            Bytes.toBytes("foo"), Bytes.toBytes("foo~")
        );
        // optional first argument: scanner caching value
        if (args.length == 1) {
            scan.setCaching(Integer.valueOf(args[0]));
        }
        System.out.println("scan's caching is " + scan.getCaching());
        // both filters must pass (the FilterList default)
        FilterList allFilters = new FilterList();
        allFilters.addFilter(new FirstKeyOnlyFilter());
        allFilters.addFilter(new KeyOnlyFilter());
        scan.setFilter(allFilters);
        ResultScanner scanner = table.getScanner(scan);
        int count = 0;
        long start = System.currentTimeMillis();
        try {
            for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                count += 1;
                if (count % 100000 == 0) System.out.println(count);
            }
        } finally {
            scanner.close();
        }
        long end = System.currentTimeMillis();
        long elapsedTime = end - start;
        System.out.println("Elapsed time was " + (elapsedTime/1000F));
    }
}
Here is a pychbase code sample:
from pychbase import Connection
c = Connection()
t = c.table('my_table')
# Under the hood this applies the FirstKeyOnlyFilter and KeyOnlyFilter
# similar to the happybase example below
print(t.count(row_prefix="foo"))
Here is a Happybase code sample:
from happybase import Connection
c = Connection(...)
t = c.table('my_table')
count = 0
for _ in t.scan(filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()'):
    count += 1
print(count)
Thanks to @Tuckr and @KennyCason for the tip.
Use the HBase rowcount map/reduce job that's included with HBase
A simple and efficient way to count rows in HBase:
Whenever you insert a row, call this API, which increments a dedicated counter cell.
Htable.incrementColumnValue(Bytes.toBytes("count"), Bytes.toBytes("details"), Bytes.toBytes("count"), 1);
To check the number of rows present in the table, just use the Get or Scan API on that particular row, 'count'.
Using this method, reading the row count is a single-row lookup instead of a full scan, so it can return in under a millisecond.
To count the records of an HBase table on a proper YARN cluster, you have to set the MapReduce job queue name as well:
hbase org.apache.hadoop.hbase.mapreduce.RowCounter -Dmapreduce.job.queuename=<your queue name, for which you have SUBMIT access> <TABLE_NAME>
You can use a coprocessor, which has been available since HBase 0.92. See Coprocessor and AggregateProtocol, and the example.
Two ways worked for me to get the row count of an HBase table quickly.
Scenario #1
If the HBase table is small, log in to the HBase shell as a valid user and execute:
>count '<tablename>'
Example
>count 'employee'
6 row(s) in 0.1110 seconds
Scenario #2
If the HBase table is large, execute the built-in RowCounter MapReduce job:
Log in to the Hadoop machine as a valid user and execute:
$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter '<tablename>'
Example:
$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'employee'
....
....
....
Virtual memory (bytes) snapshot=22594633728
Total committed heap usage (bytes)=5093457920
org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
ROWS=6
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
If you're using a scanner, try to have it return as few qualifiers as possible. In fact, the qualifier(s) that you do return should be the smallest (in bytes) that you have available. This will speed up your scan tremendously.
Unfortunately, this will only scale so far (millions to billions of rows?). To take it further, you can do this in real time, but you will first need to run a MapReduce job to count all rows.
Store the MapReduce output in a cell in HBase. Every time you add a row, increment the counter by 1. Every time you delete a row, decrement the counter.
When you need the number of rows in real time, you read that cell in HBase.
Otherwise, there is no fast way to count the rows that scales. You can only count so fast.
You can find a sample here:
/**
 * Used to get the number of rows of the table
 * @param tableName
 * @param familyNames
 * @return the number of rows
 * @throws IOException
 */
public long countRows(String tableName, String... familyNames) throws IOException {
    long rowCount = 0;
    Configuration configuration = connection.getConfiguration();
    // Increase RPC timeout, in case of a slow computation
    configuration.setLong("hbase.rpc.timeout", 600000);
    // Default is 1, set to a higher value for faster scanner.next(..)
    configuration.setLong("hbase.client.scanner.caching", 1000);
    AggregationClient aggregationClient = new AggregationClient(configuration);
    try {
        Scan scan = new Scan();
        if (familyNames != null && familyNames.length > 0) {
            for (String familyName : familyNames) {
                scan.addFamily(Bytes.toBytes(familyName));
            }
        }
        rowCount = aggregationClient.rowCount(TableName.valueOf(tableName), new LongColumnInterpreter(), scan);
    } catch (Throwable e) {
        throw new IOException(e);
    }
    return rowCount;
}
Go to the HBase home directory and run this command:
./bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'namespace:tablename'
This will launch a MapReduce job, and the output will show the number of records in the HBase table.
You could also try the HBase API:
org.apache.hadoop.hbase.client.coprocessor.AggregationClient
