Hadoop: MapReduce: read from HBase, but filter rows by the content of one column

I am really new to Hadoop and I am not able to find an answer to my question. I want to write a MapReduce job where I read from HBase and then write to a simple text file.
In HBase, I've got a column representing an id. Now I don't want to work on all rows in my HBase table, but only on those between a maxId and a minId.
I found out that I could possibly use filters (scan.setFilter) so that I can skip rows which don't match my request.
This is my first MapReduce job, so please be patient :-)
I've got a starter class where I configure the job and the Scan object and then start the job.
Now, my first try looks like this:
private Scan getScan()
{
    final Scan scan = new Scan();

    // ** FILTER **
    // FilterList defaults to Operator.MUST_PASS_ALL, so both bounds have to match
    List<Filter> filters = new ArrayList<Filter>();
    Filter filter1 = new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(minId))));
    filters.add(filter1);
    Filter filter2 = new ValueFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
            new BinaryComparator(Bytes.toBytes(Integer.parseInt(maxId))));
    filters.add(filter2);
    FilterList filterList = new FilterList(filters);
    scan.setFilter(filterList);

    scan.setCaching(500);
    scan.setCacheBlocks(false);

    // id
    scan.addColumn("columnfamily".getBytes(), "id".getBytes());
    return scan;
}
Well, I'm not sure if this is the right way to do it. I also read that I could pass my minId and maxId to the map job via the Configuration object, but I'm not sure how.
Besides, what do I have to do afterwards? I would normally just initiate the job with initTableMapperJob and pass the Scan object to it. I've read something about ResultScanner; do I need it? I thought the MapReduce framework would automatically pass the correct rows to my map job, is that correct?
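For reference, here is a minimal sketch of how the pieces could fit together; the table name "mytable", the property names "job.minId"/"job.maxId" and the class names Starter/IdMapper are placeholders, not part of the original setup:
// Driver: put the bounds into the Configuration and wire the Scan into the mapper.
Configuration conf = HBaseConfiguration.create();
conf.set("job.minId", minId);   // hypothetical property names
conf.set("job.maxId", maxId);
Job job = Job.getInstance(conf, "export-by-id-range");
job.setJarByClass(Starter.class);
TableMapReduceUtil.initTableMapperJob(
        "mytable",        // input table (placeholder)
        getScan(),        // the Scan built above, filters included
        IdMapper.class,   // mapper class
        Text.class,       // mapper output key
        Text.class,       // mapper output value
        job);
FileOutputFormat.setOutputPath(job, new Path(args[0]));

// Mapper: rows arriving here have already passed the Scan's filters,
// so no ResultScanner is needed; the bounds can be re-read in setup() if required.
public static class IdMapper extends TableMapper<Text, Text> {

    private int minId;
    private int maxId;

    @Override
    protected void setup(Context context) {
        minId = context.getConfiguration().getInt("job.minId", 0);
        maxId = context.getConfiguration().getInt("job.maxId", Integer.MAX_VALUE);
    }

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        byte[] id = value.getValue("columnfamily".getBytes(), "id".getBytes());
        context.write(new Text(row.get()), new Text(Bytes.toString(id)));
    }
}
So yes: the framework, together with TableInputFormat, applies the Scan (including its filters) and only hands the matching rows to the mapper; a ResultScanner is only needed when querying HBase outside of the MapReduce input path.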

Related

Spark not able to retrieve all of the HBase data in a specific column

My HBase table has 30 million records; each record has the column raw:sample, where raw is the column family and sample is the column qualifier. This column is very big, ranging from a few KB to 50 MB. When I run the following Spark code, it only gets 40 thousand records, but I should get 30 million records:
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "10.1.1.15:2181")
conf.set(TableInputFormat.INPUT_TABLE, "sampleData")
conf.set(TableInputFormat.SCAN_COLUMNS, "raw:sample")
conf.set("hbase.client.keyvalue.maxsize","0")
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
  classOf[org.apache.hadoop.hbase.client.Result])
var arrRdd:RDD[Map[String,Object]] = hBaseRDD.map(tuple => tuple._2).map(...)
Right now I work around this by getting the id list first and then iterating over the id list to fetch the column raw:sample with the plain HBase Java client inside a Spark foreach.
Any ideas why I cannot get all of the column raw:sample via Spark? Is it because the column is too big?
A few days ago one of my ZooKeeper nodes and one datanode went down, but I fixed it soon; since the replication factor is 3, could this be the reason? Do you think running hbck -repair would help? Thanks a lot!
Internally, TableInputFormat creates a Scan object in order to retrieve the data from HBase.
Try to create a Scan object (without using Spark), configured to retrieve the same column from HBase, and see if the error repeats:
// Instantiating Configuration class
Configuration config = HBaseConfiguration.create();
// Instantiating HTable class
HTable table = new HTable(config, "emp");
// Instantiating the Scan class
Scan scan = new Scan();
// Scanning the required columns
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"));
// Getting the scan result
ResultScanner scanner = table.getScanner(scan);
// Reading values from scan result
for (Result result = scanner.next(); result != null; result = scanner.next())
System.out.println("Found row : " + result);
//closing the scanner
scanner.close();
In addition, by default, TableInputFormat is configured to request only a very small chunk of data from the HBase server per call (which is bad and causes a large overhead). Set the following on the Scan to increase the chunk size:
scan.setCacheBlocks(false);
scan.setCaching(2000);
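When the Scan is not built by hand but assembled by TableInputFormat from the job configuration (as in the Spark snippet above), the same two settings can be expressed as configuration properties; a small sketch using the TableInputFormat constants:
// Equivalent settings via the job Configuration, read by TableInputFormat
// when it builds its internal Scan.
conf.set(TableInputFormat.SCAN_CACHEDROWS, "2000");   // rows fetched per RPC
conf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false"); // skip the block cache for full scans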
For a high throughput like yours, Apache Kafka is the best solution to integrate the data flow and keep the data pipeline alive. Please refer to http://kafka.apache.org/08/uses.html for some use cases of Kafka.
One more reference:
http://sites.computer.org/debull/A12june/pipeline.pdf

Triple composite key in Hbase

I have a use case where I want 3 level composite key.
For example:
Rollnumber:class:friendsRollNumber
I would want to query "Get all friends for a particular roll number and a class".
I could not find sufficient examples on the net of using composite keys together with range scans over them.
Currently, I am doing the following:
byte[] rowkey = Bytes.add(Bytes.toBytes("myrollnumber"), Bytes.toBytes("myClass"), Bytes.toBytes("myFriendsRollNumber"));
This is the way I form the row key.
Will it select the region server based on myRollNumber and myClass? If not, how can I do that?
Also, for the range scan, what is the correct way to use it? I am doing it in the following way. I am still in the process of writing the code, so I have not tested it yet.
Scan s = new Scan();
Filter f = new PrefixFilter(Bytes.add(Bytes.toBytes("myrollnumber"), Bytes.toBytes("class")));
s.setFilter(f);
Is the above way correct to scan as per my requirement?
Also, how do I get the individual parts of the row key from the scanner?
Try this:
byte[] prefix=Bytes.toBytes("rollnumber" + "class");
Scan scan = new Scan(prefix);
PrefixFilter prefixFilter = new PrefixFilter(prefix);
scan.setFilter(prefixFilter);
ResultScanner resultScanner = table.getScanner(scan);
For your requirement, you can use the start/stop row feature of Scan. You do not need a filter.
byte[] startRow = Bytes.add(Bytes.toBytes("search_rollnumber"), Bytes.toBytes("myClass"));
byte[] stopRow = Bytes.add(Bytes.toBytes("search_rollnumber"), Bytes.toBytes("myClass"));
stopRow[stopRow.length - 1]++;
Scan s = new Scan(startRow, stopRow);
Using this scan you will get all rows starting with search_rollnumbermyClass.
I am not sure whether you use : in your row key, but I think you should if both rollnumber and class are represented as integers.
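As for getting the individual parts of the row key back from the scanner: if the three parts are joined with a delimiter such as :, they can simply be split again while iterating the results. A rough sketch; the delimiter-based, string-encoded key layout is an assumption, not something stated in the question:
// Build the key with an explicit delimiter so it can be taken apart later.
byte[] rowkey = Bytes.toBytes("myrollnumber" + ":" + "myClass" + ":" + "myFriendsRollNumber");

// Split the key back into its parts while scanning.
ResultScanner scanner = table.getScanner(s);
for (Result result : scanner) {
    String[] parts = Bytes.toString(result.getRow()).split(":");
    String rollNumber = parts[0];
    String clazz = parts[1];
    String friendsRollNumber = parts[2];
    // ... work with the three parts ...
}
scanner.close();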

Column Value Range Filter in Hbase 0.94

I want to use a range filter in HBase on more than one column. I know we can use SingleColumnValueFilter with AND/OR conditions, but I want to run the same filter condition against two different columns.
Example: myhbase table
rowkey,cf:bidprice,cf:askprice,cf:product
I want to filter all the rows with (cf:bidprice > 10 and cf:bidprice < 20) or (cf:askprice > 10 and cf:askprice < 20).
I think I figured it out. The code snippet below is an example implementation.
byte[] startRow=Bytes.toBytes("startrow");
byte[] endRow=Bytes.toBytes("stoprow");
SingleColumnValueFilter bidPriceGreaterFilter=new SingleColumnValueFilter("q".getBytes(), "bidprice".getBytes(), CompareFilter.CompareOp.GREATER_OR_EQUAL, "12345".getBytes());
SingleColumnValueFilter bidPricelesserFilter=new SingleColumnValueFilter("q".getBytes(), "bidprice".getBytes(), CompareFilter.CompareOp.LESS_OR_EQUAL, "12346".getBytes());
SingleColumnValueFilter askPriceGreaterFilter=new SingleColumnValueFilter("q".getBytes(), "askprice".getBytes(), CompareFilter.CompareOp.GREATER_OR_EQUAL, "12345".getBytes());
SingleColumnValueFilter askPricelesserFilter=new SingleColumnValueFilter("q".getBytes(), "askprice".getBytes(), CompareFilter.CompareOp.LESS_OR_EQUAL, "12346".getBytes());
FilterList andFilter1= new FilterList(FilterList.Operator.MUST_PASS_ALL);
andFilter1.addFilter(bidPriceGreaterFilter);
andFilter1.addFilter(bidPricelesserFilter);
FilterList andFilter2= new FilterList(FilterList.Operator.MUST_PASS_ALL);
andFilter2.addFilter(askPriceGreaterFilter);
andFilter2.addFilter(askPricelesserFilter);
FilterList finalFilterList=new FilterList(FilterList.Operator.MUST_PASS_ONE);
finalFilterList.addFilter(andFilter1);
finalFilterList.addFilter(andFilter2);
Scan scan = new Scan(startRow,endRow);
scan.setFilter(finalFilterList);
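One caveat worth adding (not part of the original snippet): by default a SingleColumnValueFilter also lets rows through that do not contain the tested column at all. If such rows should be excluded, setFilterIfMissing(true) can be set on each filter:
// Skip rows that do not contain the column at all (the default is to include them).
bidPriceGreaterFilter.setFilterIfMissing(true);
bidPricelesserFilter.setFilterIfMissing(true);
askPriceGreaterFilter.setFilterIfMissing(true);
askPricelesserFilter.setFilterIfMissing(true);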

Performing bulk load in cassandra with map reduce

I haven't got much experience working with Cassandra, so please excuse me if my approach is wrong.
I am trying to do a bulk load into Cassandra with MapReduce, basically the word count example.
Reference: http://henning.kropponline.de/2012/11/15/using-cassandra-hadoopbulkoutputformat/
I have taken the simple Hadoop WordCount mapper example and slightly modified the driver code and the reducer as per the above example.
I have successfully generated the output file as well. Now my doubt is: how do I perform the loading into Cassandra? Is there any difference in my approach?
Please advise.
This is a part of the driver code:
Job job = new Job();
job.setJobName(getClass().getName());
job.setJarByClass(CassaWordCountJob.class);
Configuration conf = job.getConfiguration();
conf.set("cassandra.output.keyspace", "test");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port","9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "400");
job.setMapperClass(CassaWordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setReducerClass(CassaWordCountReducer.class);
FileOutputFormat.setOutputPath(job, new Path("/home/user/Desktop/test/cassandra"));
MultipleOutputs.addNamedOutput(job, "reducer", BulkOutputFormat.class, ByteBuffer.class, List.class);
return job.waitForCompletion(true) ? 0 : 1;
The mapper is the same as the normal word-count mapper that just tokenizes the input and emits (word, 1).
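For completeness, a standard word-count mapper of that shape would look roughly like this (the class name CassaWordCountMapper is taken from the driver code above; the body is the usual Hadoop example, not code from the question):
public static class CassaWordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and emit (word, 1) for every token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}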
The reducer class is of the form
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Mutation> columnsToAdd = new ArrayList<Mutation>();
        Integer wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        Column countCol = new Column(ByteBuffer.wrap("count".getBytes()));
        countCol.setValue(ByteBuffer.wrap(wordCount.toString().getBytes()));
        countCol.setTimestamp(new Date().getTime());
        ColumnOrSuperColumn wordCosc = new ColumnOrSuperColumn();
        wordCosc.setColumn(countCol);
        Mutation countMut = new Mutation();
        countMut.column_or_supercolumn = wordCosc;
        columnsToAdd.add(countMut);
        context.write(ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }
}
To do bulk loads into Cassandra, I would advise looking at this article from DataStax. Basically you need to do two things for bulk loading:
Your output data won't natively fit into Cassandra; you need to transform it into SSTables.
Once you have your SSTables, you need to be able to stream them into Cassandra. Of course you don't simply want to copy each SSTable to every node; you only want to copy the relevant part of the data to each node.
In your case, when using the BulkOutputFormat, it should do all of that, as it uses the sstableloader behind the scenes. I've never used it with MultipleOutputs, but it should work fine.
I think the error in your case is that you're not using MultipleOutputs correctly: you're still doing a context.write, when you should really be writing to your MultipleOutputs object. The way you're doing it right now, since you're writing to the regular Context, the output gets picked up by the default output format, TextOutputFormat, and not the one you defined in your MultipleOutputs. More information on how to use MultipleOutputs in your reducer can be found here.
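As a rough sketch of what that change could look like in the reducer from the question (it uses the named output "reducer" registered via MultipleOutputs.addNamedOutput in the driver; this illustrates the MultipleOutputs API, it is not the answerer's exact code):
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    private MultipleOutputs<ByteBuffer, List<Mutation>> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<ByteBuffer, List<Mutation>>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Mutation> columnsToAdd = new ArrayList<Mutation>();
        // ... build the Mutation list exactly as in the original reducer ...

        // Write to the named output so BulkOutputFormat is used,
        // instead of writing to the regular Context (TextOutputFormat).
        multipleOutputs.write("reducer", ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Flush and close all named outputs.
        multipleOutputs.close();
    }
}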
Once you write to the correct output format, BulkOutputFormat, like you defined, your SSTables should get created and streamed to Cassandra from each node in your cluster; you shouldn't need any extra step, the output format will take care of it for you.
Also, I would advise looking at this post, where they also explain how to use BulkOutputFormat, but they're using a ConfigHelper, which you might want to take a look at to configure your Cassandra endpoint more easily.
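For illustration, the endpoint settings from the driver above expressed through ConfigHelper could look roughly like this (a sketch based on the old Thrift-era org.apache.cassandra.hadoop.ConfigHelper API, not code from the linked post):
// Same keyspace/endpoint settings as the raw conf.set(...) calls,
// but going through ConfigHelper's typed setters.
Configuration conf = job.getConfiguration();
ConfigHelper.setOutputColumnFamily(conf, "test", "words");
ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
ConfigHelper.setOutputInitialAddress(conf, "localhost");
ConfigHelper.setOutputRpcPort(conf, "9160");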

Adding partitions to Hive from a MapReduce Job

I am new to Hive and MapReduce and would really appreciate your answer; please also suggest the right approach.
I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS at /data/logs/. I have a MapReduce job which fetches these log files, splits them and stores them under the folder mentioned above, like:
"/data/logs/dt=2012-10-01/server01/"
"/data/logs/dt=2012-10-01/server02/"
...
...
From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:
the alter table command -- too many alter table commands
adding dynamic partitions
For the second approach I see only examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?
To do this from within a MapReduce job, I would recommend using Apache HCatalog, which is a new project under the Hadoop umbrella.
HCatalog really is an abstraction layer on top of HDFS, so you can write your outputs in a standardized way, be it from Hive, Pig or MapReduce. Where this comes into the picture for you is that you can directly load data into Hive from your MapReduce job using the output format HCatOutputFormat. Below is an example taken from the official website.
A code example for writing out a specific partition (a=1, b=1) would go something like this:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
To write to multiple partitions, a separate job will have to be kicked off for each of them with the above approach.
You can also use dynamic partitions with HCatalog, in which case you can load as many partitions as you want in the same job!
I recommend reading further about HCatalog on the website mentioned above, which should give you more details if needed.
In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of now), and it takes a few days of frustration to figure out.
I've found that I need to do the following to get HCatalog Mapreduce jobs to work with writing to dynamic partitions:
In the record-writing phase of my job (usually the reducer), I have to manually add my dynamic partition fields (HCatFieldSchema) to my HCatSchema objects.
The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partition fields. They need to be added manually:
HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);
Here's the code for writing into multiple tables with dynamic partitioning in one job using HCatalog; the code has been tested on Hadoop 2.5.0 and Hive 0.13.1:
// ... Job setup, InputFormatClass, etc ...
String dbName = null;
String[] tables = {"table0", "table1"};

job.setOutputFormatClass(MultiOutputFormat.class);
MultiOutputFormat.JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);

List<String> partitions = new ArrayList<String>();
partitions.add(0, "partition0");
partitions.add(1, "partition1");

HCatFieldSchema partition0 = new HCatFieldSchema("partition0", TypeInfoFactory.stringTypeInfo, null);
HCatFieldSchema partition1 = new HCatFieldSchema("partition1", TypeInfoFactory.stringTypeInfo, null);

for (String table : tables) {
    configurer.addOutputFormat(table, HCatOutputFormat.class, BytesWritable.class, HCatRecord.class);
    OutputJobInfo outputJobInfo = OutputJobInfo.create(dbName, table, null);
    outputJobInfo.setDynamicPartitioningKeys(partitions);
    HCatOutputFormat.setOutput(configurer.getJob(table), outputJobInfo);
    HCatSchema schema = HCatOutputFormat.getTableSchema(configurer.getJob(table).getConfiguration());
    schema.append(partition0);
    schema.append(partition1);
    HCatOutputFormat.setSchema(configurer.getJob(table), schema);
}
configurer.configure();

return job.waitForCompletion(true) ? 0 : 1;
Mapper:
public static class MyMapper extends Mapper<LongWritable, Text, BytesWritable, HCatRecord> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        HCatRecord record = new DefaultHCatRecord(3); // including partitions
        record.set(0, value.toString());

        // partitions must be set after the non-partition fields
        record.set(1, "0"); // partition0=0
        record.set(2, "1"); // partition1=1

        MultiOutputFormat.write("table0", null, record, context);
        MultiOutputFormat.write("table1", null, record, context);
    }
}
