HBase - Hadoop: TableInputFormat extension

Using an HBase table as my input, whose keys I have pre-processed to consist of a number concatenated with the respective row ID, I want to be sure that all rows whose key starts with the same number are processed by the same mapper in a MapReduce job. I am aware that this could be achieved by extending TableInputFormat, and I have seen one or two posts about extending this class, but I am looking for the most efficient way to do this in particular.
If anyone has any ideas, please let me know.

You can use a PrefixFilter in your scan.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/PrefixFilter.html
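For example, a minimal sketch of wiring a PrefixFilter into the Scan handed to TableMapReduceUtil (the table name, thePrefix and MyPrefixMapper are placeholders, not part of the original question):
Scan scan = new Scan();
scan.setCaching(500);        // scanner caching is commonly raised for MR scans
scan.setCacheBlocks(false);  // avoid polluting the block cache during a full scan
scan.setFilter(new PrefixFilter(Bytes.toBytes(thePrefix)));

TableMapReduceUtil.initTableMapperJob(
        "myTable",             // placeholder input table
        scan,                  // only rows whose key starts with thePrefix
        MyPrefixMapper.class,  // placeholder TableMapper implementation
        Text.class,            // mapper output key class
        IntWritable.class,     // mapper output value class
        mapReduceJob);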
And parallelize the launch of your different mappers using a Future:
final Future<Boolean> newJobFuture = executor.submit(new Callable<Boolean>() {
    @Override
    public Boolean call() throws Exception {
        Job mapReduceJob = MyJobBuilder.createJob(args, thePrefix,
                ...);
        return mapReduceJob.waitForCompletion(true);
    }
});
But I believe what you are really looking for is the behavior of a reducer.

Related

Spring Batch - Loop reader, processor and writer for N times

In Spring Batch, how do I loop the reader, processor and writer N times?
My requirement is:
I have "N" customers/clients.
For each customer/client, I need to fetch the records from the database (Reader), then process (Processor) all records for that customer/client, and then write the records into a file (Writer).
How do I loop the Spring Batch job N times?
AFAIK I'm afraid there's no framework support for this scenario, at least not the way you want to solve it.
I'd suggest to solve the problem differently:
Option 1
Read/Process/Write all records from all customers at once. You can only do this if they are all in the same DB; I would not recommend it otherwise, because you'll have to configure JTA/XA transactions and it's not worth the trouble.
Option 2
Run your job once for each client (the best option in my opinion). Save the necessary info for each client in different properties files (DB connection data, values to filter records by client, whatever other client-specific data you may need) and pass the client to use as a parameter to the job. This way you can control which client is processed, and when, using bash scripts and/or cron. If you use Spring Boot + Spring Batch you can store the client configuration in profiles (application-clientX.properties) and run the process like:
$> java -Dspring.profiles.active="clientX" \
-jar "yourBatch-1.0.0-SNAPSHOT.jar" \
-next
Bonus - Option 3
If none of the above fits your needs, or you insist on solving the problem the way you presented it, then you can dynamically configure the job depending on parameters, creating one step for each client using JavaConfig:
@Bean
public Job job() {
    JobBuilder jb = jobBuilders.get("job");
    SimpleJobBuilder sjb = null;
    for (Client c : clientsToProcess) {
        // chain one step per client
        sjb = (sjb == null) ? jb.start(buildStepByClient(c)) : sjb.next(buildStepByClient(c));
    }
    return sjb.build();
}
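A hedged sketch of what buildStepByClient could look like (the Record item type, the chunk size, and the readerFor/processorFor/writerFor helpers are hypothetical, not from the original answer):
private Step buildStepByClient(Client c) {
    return stepBuilders.get("step-" + c.getId())
            .<Record, Record>chunk(100)   // hypothetical item type and chunk size
            .reader(readerFor(c))         // hypothetical client-specific reader
            .processor(processorFor(c))   // hypothetical client-specific processor
            .writer(writerFor(c))         // hypothetical client-specific writer
            .build();
}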
Again, I strongly advise you not to go this way: it's ugly, against the framework philosophy, hard to maintain and debug, and you'll probably also have to use JTA/XA here.
I hope I've been of some help!
Local partitioning will solve your problem.
In your partitioner, you put all of your client IDs in a map as shown below (just pseudo code):
public class PartitionByClient implements Partitioner {
    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> result = new HashMap<>();
        int partitionNumber = 1;
        for (String client : allClients) {
            ExecutionContext value = new ExecutionContext();
            value.putString("client", client);
            result.put("Client [" + client + "] : THREAD " + partitionNumber, value);
            partitionNumber++;
        }
        return result;
    }
}
This is just pseudo code; have a look at the detailed documentation of partitioning.
You will have to mark your reader, processor and writer with @StepScope (i.e. whichever part needs the value of the client). The reader will use this client in the WHERE clause of its SQL, and you inject the value with @Value("#{stepExecutionContext[client]}") String client in the reader definition, as in the sketch below.
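A rough sketch of such a step-scoped reader (the Record class, the DataSource wiring, and the table/column names are made up for illustration):
@Bean
@StepScope
public JdbcCursorItemReader<Record> reader(
        @Value("#{stepExecutionContext[client]}") String client,
        DataSource dataSource) {
    JdbcCursorItemReader<Record> reader = new JdbcCursorItemReader<>();
    reader.setDataSource(dataSource);
    // restrict the query to the client of the current partition
    reader.setSql("SELECT * FROM records WHERE client_id = ?");
    reader.setPreparedStatementSetter(ps -> ps.setString(1, client));
    reader.setRowMapper(new BeanPropertyRowMapper<>(Record.class));
    return reader;
}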
Now the final piece: you will need a task executor. A number of clients equal to concurrencyLimit will run in parallel, provided you set this task executor in your master partitioner step configuration.
@Bean
public TaskExecutor taskExecutor() {
    SimpleAsyncTaskExecutor simpleTaskExecutor = new SimpleAsyncTaskExecutor();
    simpleTaskExecutor.setConcurrencyLimit(concurrencyLimit);
    return simpleTaskExecutor;
}
Set concurrencyLimit to 1 if you want only one client running at a time.
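For completeness, a minimal sketch of wiring the partitioner and the task executor into the master step (the bean and step names are made up):
@Bean
public Step masterStep(StepBuilderFactory stepBuilderFactory, Step slaveStep) {
    return stepBuilderFactory.get("masterStep")
            .partitioner("slaveStep", new PartitionByClient())  // the partitioner shown above
            .step(slaveStep)                                    // step with the step-scoped reader/processor/writer
            .taskExecutor(taskExecutor())                       // limits how many clients run in parallel
            .build();
}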

Performing bulk load in cassandra with map reduce

I haven't got much experience working with Cassandra, so please excuse me if my approach is wrong.
I am trying to do a bulk load into Cassandra with MapReduce,
basically the word count example.
Reference : http://henning.kropponline.de/2012/11/15/using-cassandra-hadoopbulkoutputformat/
I have put the simple Hadoop Wordcount Mapper Example and slightly modified the driver code and the reducer as per the above example.
I have successfully generated the output file as well. Now my doubt is how to perform the loading-into-Cassandra part. Is there any difference in my approach?
Please advise.
This is part of the driver code:
Job job = new Job();
job.setJobName(getClass().getName());
job.setJarByClass(CassaWordCountJob.class);
Configuration conf = job.getConfiguration();
conf.set("cassandra.output.keyspace", "test");
conf.set("cassandra.output.columnfamily", "words");
conf.set("cassandra.output.partitioner.class", "org.apache.cassandra.dht.RandomPartitioner");
conf.set("cassandra.output.thrift.port","9160"); // default
conf.set("cassandra.output.thrift.address", "localhost");
conf.set("mapreduce.output.bulkoutputformat.streamthrottlembits", "400");
job.setMapperClass(CassaWordCountMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
job.setReducerClass(CassaWordCountReducer.class);
FileOutputFormat.setOutputPath(job, new Path("/home/user/Desktop/test/cassandra"));
MultipleOutputs.addNamedOutput(job, "reducer", BulkOutputFormat.class, ByteBuffer.class, List.class);
return job.waitForCompletion(true) ? 0 : 1;
The mapper is the same as the normal word count mapper that just tokenizes and emits (word, 1).
The reducer class is of the form
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        List<Mutation> columnsToAdd = new ArrayList<Mutation>();
        Integer wordCount = 0;
        for (IntWritable value : values) {
            wordCount += value.get();
        }
        Column countCol = new Column(ByteBuffer.wrap("count".getBytes()));
        countCol.setValue(ByteBuffer.wrap(wordCount.toString().getBytes()));
        countCol.setTimestamp(new Date().getTime());
        ColumnOrSuperColumn wordCosc = new ColumnOrSuperColumn();
        wordCosc.setColumn(countCol);
        Mutation countMut = new Mutation();
        countMut.column_or_supercolumn = wordCosc;
        columnsToAdd.add(countMut);
        context.write(ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }
}
To do bulk loads into Cassandra, I would advise looking at this article from DataStax. Basically you need to do 2 things for bulk loading:
Your output data won't natively fit into Cassandra; you need to transform it into SSTables.
Once you have your SSTables, you need to be able to stream them into Cassandra. Of course you don't simply want to copy each SSTable to every node; you only want to copy the relevant part of the data to each node.
In your case, when using the BulkOutputFormat, it should do all of that, since it uses the sstableloader behind the scenes. I've never used it with MultipleOutputs, but it should work fine.
I think the error in your case is that you're not using MultipleOutputs correctly: you're still doing a context.write, when you should really be writing to your MultipleOutputs object. The way you're doing it right now, since you're writing to the regular Context, your output gets picked up by the default output format, TextOutputFormat, and not the one you defined in your MultipleOutputs. More information on how to use MultipleOutputs in your reducer can be found here.
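A rough sketch of what that reducer change could look like (buildMutations is a hypothetical placeholder for the count/Mutation-building code already shown above; "reducer" is the named output registered with MultipleOutputs.addNamedOutput):
public class CassaWordCountReducer extends
        Reducer<Text, IntWritable, ByteBuffer, List<Mutation>> {

    private MultipleOutputs<ByteBuffer, List<Mutation>> multipleOutputs;

    @Override
    protected void setup(Context context) {
        multipleOutputs = new MultipleOutputs<ByteBuffer, List<Mutation>>(context);
    }

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // buildMutations(...) stands for the Mutation-building code from the original reducer
        List<Mutation> columnsToAdd = buildMutations(key, values);
        // write through the named output so BulkOutputFormat, not TextOutputFormat, handles it
        multipleOutputs.write("reducer", ByteBuffer.wrap(key.toString().getBytes()), columnsToAdd);
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        multipleOutputs.close();
    }
}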
Once you write to the correct output format, the BulkOutputFormat you defined, your SSTables should get created and streamed to Cassandra from each node in your cluster; you shouldn't need any extra step, the output format takes care of it for you.
Also, I would advise looking at this post, where they also explain how to use BulkOutputFormat, but with a ConfigHelper, which you might want to take a look at to configure your Cassandra endpoint more easily.
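If you go the ConfigHelper route, the driver settings above would look roughly like the sketch below. The exact method names are an assumption based on the old Thrift-era org.apache.cassandra.hadoop.ConfigHelper and should be verified against your Cassandra version:
// Assumed equivalents of the conf.set(...) calls in the driver; verify against your Cassandra version
ConfigHelper.setOutputColumnFamily(conf, "test", "words");
ConfigHelper.setOutputInitialAddress(conf, "localhost");
ConfigHelper.setOutputRpcPort(conf, "9160");
ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");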

Adding partitions to Hive from a MapReduce Job

I am new to Hive and MapReduce and would really appreciate your answer; please also suggest the right approach.
I have defined an external table logs in Hive, partitioned on date and origin server, with an external location on HDFS /data/logs/. I have a MapReduce job which fetches these log files, splits them, and stores them under the folder mentioned above, like:
"/data/logs/dt=2012-10-01/server01/"
"/data/logs/dt=2012-10-01/server02/"
...
...
From the MapReduce job I would like to add partitions to the table logs in Hive. I know of two approaches:
alter table command -- Too many alter table commands
adding dynamic partitions
For approach two I only see examples of INSERT OVERWRITE, which is not an option for me. Is there a way to add these new partitions to the table after the end of the job?
To do this from within a MapReduce job I would recommend using Apache HCatalog, which is a new project under the Hadoop umbrella.
HCatalog really is an abstraction layer on top of HDFS, so you can write your outputs in a standardized way, be it from Hive, Pig or MapReduce. Where this comes into the picture for you is that you can load data directly into Hive from your MapReduce job using the output format HCatOutputFormat. Below is an example taken from the official website.
A current code example for writing out to a specific partition (a=1, b=1) would go something like this:
Map<String, String> partitionValues = new HashMap<String, String>();
partitionValues.put("a", "1");
partitionValues.put("b", "1");
HCatTableInfo info = HCatTableInfo.getOutputTableInfo(dbName, tblName, partitionValues);
HCatOutputFormat.setOutput(job, info);
And to write to multiple partitions, a separate job will have to be kicked off for each of the above.
You can also use dynamic partitions with HCatalog, in which case you can load as many partitions as you want in the same job!
I recommend reading further on HCatalog on the website provided above, which should give you more details if needed.
In reality, things are a little more complicated than that, which is unfortunate because it is undocumented in official sources (as of now), and it takes a few days of frustration to figure out.
I've found that I need to do the following to get HCatalog Mapreduce jobs to work with writing to dynamic partitions:
In the record-writing phase of my job (usually the reducer), I have to manually add my dynamic partitions (HCatFieldSchema) to my HCatSchema objects.
The trouble is that HCatOutputFormat.getTableSchema(config) does not actually return the partition fields; they need to be added manually:
HCatFieldSchema hfs1 = new HCatFieldSchema("date", Type.STRING, null);
HCatFieldSchema hfs2 = new HCatFieldSchema("some_partition", Type.STRING, null);
schema.append(hfs1);
schema.append(hfs2);
Here's the code for writing into multiple tables with dynamic partitioning in one job using HCatalog; the code has been tested on Hadoop 2.5.0 and Hive 0.13.1:
// ... Job setup, InputFormatClass, etc ...
String dbName = null;
String[] tables = {"table0", "table1"};
job.setOutputFormatClass(MultiOutputFormat.class);
MultiOutputFormat.JobConfigurer configurer = MultiOutputFormat.createConfigurer(job);
List<String> partitions = new ArrayList<String>();
partitions.add(0, "partition0");
partitions.add(1, "partition1");
HCatFieldSchema partition0 = new HCatFieldSchema("partition0", TypeInfoFactory.stringTypeInfo, null);
HCatFieldSchema partition1 = new HCatFieldSchema("partition1", TypeInfoFactory.stringTypeInfo, null);
for (String table : tables) {
configurer.addOutputFormat(table, HCatOutputFormat.class, BytesWritable.class, HCatRecord.class);
OutputJobInfo outputJobInfo = OutputJobInfo.create(dbName, table, null);
outputJobInfo.setDynamicPartitioningKeys(partitions);
HCatOutputFormat.setOutput(
configurer.getJob(table), outputJobInfo
);
HCatSchema schema = HCatOutputFormat.getTableSchema(configurer.getJob(table).getConfiguration());
schema.append(partition0);
schema.append(partition1);
HCatOutputFormat.setSchema(
configurer.getJob(table),
schema
);
}
configurer.configure();
return job.waitForCompletion(true) ? 0 : 1;
Mapper:
public static class MyMapper extends Mapper<LongWritable, Text, BytesWritable, HCatRecord> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        HCatRecord record = new DefaultHCatRecord(3); // including partitions
        record.set(0, value.toString());
        // partitions must be set after non-partition fields
        record.set(1, "0"); // partition0=0
        record.set(2, "1"); // partition1=1
        MultiOutputFormat.write("table0", null, record, context);
        MultiOutputFormat.write("table1", null, record, context);
    }
}

SingleColumnValueFilter not returning proper number of rows

In our HBase table, each row has a column called crawl identifier. Using a MapReduce job, we only want to process at any one time rows from a given crawl. In order to run the job more efficiently we gave our scan object a filter that (we hoped) would remove all rows except those with the given crawl identifier. However, we quickly discovered that our jobs were not processing the correct number of rows.
I wrote a test mapper that simply counts the number of rows with the correct crawl identifier, without any filters. It iterated over all the rows in the table and counted the correct, expected number of rows (~15000). When we took that same job and added a filter to the scan object, the count dropped to ~3000. There was no manipulation of the table itself during or between these two jobs.
Since adding the scan filter caused the visible rows to change so dramatically, we suspect that we simply built the filter incorrectly.
Our MapReduce job features a single mapper:
public static class RowCountMapper extends TableMapper<ImmutableBytesWritable, Put> {
    public String crawlIdentifier;

    // counters
    private static enum CountRows {
        ROWS_WITH_MATCHED_CRAWL_IDENTIFIER
    }

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();
        crawlIdentifier = configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
    }

    @Override
    public void map(ImmutableBytesWritable legacykey, Result row, Context context) {
        String rowIdentifier = HBaseSchema.getValueFromRow(row, HBaseSchema.CRAWL_IDENTIFIER_COLUMN);
        if (StringUtils.equals(crawlIdentifier, rowIdentifier)) {
            context.getCounter(CountRows.ROWS_WITH_MATCHED_CRAWL_IDENTIFIER).increment(1l);
        }
    }
}
The filter setup is like this:
String crawlIdentifier=configuration.get(ConfigPropertyLib.CRAWL_IDENTIFIER_PROPERTY);
if (StringUtils.isBlank(crawlIdentifier)){
throw new IllegalArgumentException("Crawl Identifier not set.");
}
// build an HBase scanner
Scan scan=new Scan();
SingleColumnValueFilter filter=new SingleColumnValueFilter(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
CompareOp.EQUAL,
Bytes.toBytes(crawlIdentifier));
filter.setFilterIfMissing(true);
scan.setFilter(filter);
Are we using the wrong filter, or have we configured it wrong?
EDIT: we're looking at manually adding all the column families as per https://issues.apache.org/jira/browse/HBASE-2198 but I'm pretty sure the Scan includes all the families by default.
The filter looks correct, but one scenario that could cause this relates to character encodings. Your filter is using Bytes.toBytes(String), which uses UTF-8 [1], whereas you might be using the platform's native character encoding in HBaseSchema or when you write the record, if you use String.getBytes() [2]. Check that the crawlIdentifier was originally written to HBase using the following, to ensure the filter is comparing like for like in the filtered scan:
Bytes.toBytes(crawlIdentifier)
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Bytes.html#toBytes(java.lang.String)
[2] http://docs.oracle.com/javase/1.4.2/docs/api/java/lang/String.html#getBytes()
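For example, on the write path something like this (a sketch using the old-style Put API and the same HBaseSchema constants; the rowKey and table variables are placeholders) keeps the stored value UTF-8 encoded, so it matches the bytes the SingleColumnValueFilter compares against:
Put put = new Put(rowKey);  // rowKey is whatever row key you already use
put.add(HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getFamily(),
        HBaseSchema.CRAWL_IDENTIFIER_COLUMN.getQualifier(),
        Bytes.toBytes(crawlIdentifier));  // UTF-8, same encoding the filter uses
table.put(put);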

Does HBase scan returns sorted columns?

I am working on an HBase MapReduce job and need to understand whether the columns in a single column family are returned sorted by their names (keys). If so, I wouldn't need to do it in the shuffle/sort stage.
Thanks
I have a very similar data model to yours. Upon insertion, however, I set my own values for the timestamps on the Put object. I did so in a way that took a "seed" of the current time and appended an incrementing counter for each event I persisted in the batch.
When I pulled the results out of the Scan, I wrote a comparator:
public class KVTimestampComparator implements Comparator<KeyValue> {
    @Override
    public int compare(KeyValue kv1, KeyValue kv2) {
        Long kv1Timestamp = kv1.getTimestamp();
        Long kv2Timestamp = kv2.getTimestamp();
        return kv1Timestamp.compareTo(kv2Timestamp);
    }
}
Then sorted the raw row:
List<KeyValue> row = Arrays.asList(result.raw());
Collections.sort(row, new KVTimestampComparator());
Got this idea from the person who answered this: Sorted results from hbase scanner
no, columns are not sorted
They are stored internally as key-value pairs in a long byte array. But you should clarify your question about what you actually need this for.
