Can we get all the column names from an HBase table? - hadoop

Setup:
I have an HBase table with 100M+ rows and 1 million+ columns. Every row has data for only 2 to 5 columns. There is just 1 column family.
Problem:
I want to find out all the distinct qualifiers (columns) in this column family. Is there a quick way to do that?
I can think of scanning the whole table, getting the familyMap for each row, reading each qualifier and adding it to a Set<>. But that would be awfully slow, as there are 100M+ rows.
Can we do any better?

You can use a MapReduce job for this. In this case you don't need to install any custom libraries on the HBase servers, as you would for a coprocessor.
Below is the code for creating the MapReduce job.
Job setup
Job job = Job.getInstance(config);
job.setJobName("Distinct columns");
Scan scan = new Scan();
scan.setBatch(500);
scan.addFamily(YOU_COLUMN_FAMILY_NAME);
scan.setFilter(new KeyOnlyFilter()); // return only the key part of each KeyValue (row, column family, qualifier)
scan.setCacheBlocks(false); // don't set to true for MR jobs
TableMapReduceUtil.initTableMapperJob(
YOU_TABLE_NAME,
scan,
OnlyColumnNameMapper.class, // mapper
Text.class, // mapper output key
Text.class, // mapper output value
job);
job.setNumReduceTasks(1);
job.setReducerClass(OnlyColumnNameReducer.class);
Mapper
public class OnlyColumnNameMapper extends TableMapper<Text, Text> {
@Override
protected void map(ImmutableBytesWritable key, Result value, final Context context) throws IOException, InterruptedException {
CellScanner cellScanner = value.cellScanner();
while (cellScanner.advance()) {
Cell cell = cellScanner.current();
byte[] q = Bytes.copy(cell.getQualifierArray(),
cell.getQualifierOffset(),
cell.getQualifierLength());
context.write(new Text(q),new Text());
}
}
}
Reducer
public class OnlyColumnNameReducer extends Reducer<Text, Text, Text, Text> {
@Override
protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
context.write(new Text(key), new Text());
}
}

HBase can be visualised as a distributed NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>>
There is no "metadata" (say something centrally stored in the master node) about the list of all qualifiers that's available in all region servers.
So if you have a one-time use-case, the only way for you would be to scan through the entire table and add the qualifier names in a Set<>, like you mentioned.
If this is a recurring use case (and you have the discretion to add components to your tech stack), you may want to consider adding Redis. The set of qualifiers can then be maintained in a distributed fashion using a Redis Set.
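The nested-map view above can be sketched in plain Java to show what the full scan has to do. This is only an in-memory model, not HBase client code: the class name DistinctQualifiers, the helper put, and the use of String instead of byte[] are assumptions made for illustration, and the version (timestamp) level of the map is left out.

```java
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class DistinctQualifiers {
    // Model: row key -> column family -> qualifier -> value (versions omitted).
    static Set<String> distinctQualifiers(
            NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> table,
            String family) {
        Set<String> qualifiers = new TreeSet<>();
        // No central list of qualifiers exists, so every row has to be visited.
        for (NavigableMap<String, NavigableMap<String, String>> row : table.values()) {
            NavigableMap<String, String> columns = row.get(family);
            if (columns != null) {
                qualifiers.addAll(columns.keySet());
            }
        }
        return qualifiers;
    }

    // Hypothetical helper that stores one cell in the model.
    static void put(
            NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> table,
            String row, String family, String qualifier, String value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(family, f -> new TreeMap<>())
             .put(qualifier, value);
    }

    public static void main(String[] args) {
        NavigableMap<String, NavigableMap<String, NavigableMap<String, String>>> table = new TreeMap<>();
        put(table, "row1", "cf", "name", "x");
        put(table, "row1", "cf", "age", "y");
        put(table, "row2", "cf", "name", "z");
        System.out.println(distinctQualifiers(table, "cf"));
    }
}
```

Each row contributes only its own handful of qualifiers, which is why the set stays small even though every row must be read.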

HBase coprocessors can be used for this scenario. You can write a custom Endpoint implementation, which works like a stored procedure in an RDBMS. It executes your code on the server side and collects the distinct columns for each region. The client then merges the distinct columns across all regions.
Performance benefit: the cells themselves are not transferred to the client, only the per-region sets of qualifiers, which reduces network traffic.

Related

MapReduce sorting with heap

I am trying to analyze the social network data which contains follower and followee pairs. I want to find the top 10 users who have the most followees using MapReduce.
I made pairs of userID and number_of_followee with one MapReduce step.
With this data, however, I am not sure how to sort them in distributed systems.
I am not sure how priority queue can be used in either of Mappers and Reducers since they have the distributed data.
Can someone explain to me how I can use data structures to sort such a massive data set?
Thank you very much.
If you have a big input file (or files) in the format user_id = number_of_followers, a simple MapReduce algorithm to find the top N users is:
each mapper processes its own input and finds top N users in its file, writes them to a single reducer
single reducer receives number_of_mappers * N rows and finds top N users among them
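The two steps above can be sketched without Hadoop to show where the priority queue fits. This is a plain-Java simulation under stated assumptions: TopNSketch, entry, topN and topNAcrossSplits are hypothetical names, each inner list stands in for one mapper's input split, and ties are not handled specially.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopNSketch {
    static final Comparator<Map.Entry<String, Long>> BY_COUNT =
            (a, b) -> Long.compare(a.getValue(), b.getValue());

    static Map.Entry<String, Long> entry(String user, long count) {
        return new AbstractMap.SimpleEntry<>(user, count);
    }

    // Keep the n largest rows using a min-heap that never grows beyond n + 1 elements.
    static List<Map.Entry<String, Long>> topN(List<Map.Entry<String, Long>> rows, int n) {
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(BY_COUNT);
        for (Map.Entry<String, Long> row : rows) {
            heap.offer(row);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(BY_COUNT.reversed()); // largest first
        return result;
    }

    // "Mapper" phase: local top N per split. "Reducer" phase: top N of the candidates.
    static List<Map.Entry<String, Long>> topNAcrossSplits(
            List<List<Map.Entry<String, Long>>> splits, int n) {
        List<Map.Entry<String, Long>> candidates = new ArrayList<>();
        for (List<Map.Entry<String, Long>> split : splits) {
            candidates.addAll(topN(split, n));
        }
        return topN(candidates, n);
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Long>>> splits = Arrays.asList(
                Arrays.asList(entry("a", 5), entry("b", 9), entry("c", 1)),
                Arrays.asList(entry("d", 7), entry("e", 3), entry("f", 8)));
        System.out.println(topNAcrossSplits(splits, 2));
    }
}
```

In a real job the per-split heap would live in the mapper and be written out in cleanup, so the single reducer only sees number_of_mappers * N rows.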
To sort the data in descending order, you need another MapReduce job. The mapper emits the number of followers as the key and the Twitter handle as the value.
class SortingMap extends Mapper<LongWritable, Text, LongWritable, Text> {
// Reused output objects; named differently from the map() parameters to avoid shadowing
private final Text outValue = new Text();
private final LongWritable outKey = new LongWritable(0);
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
// Assuming that the input data is "TwitterId <number of followers>" separated by a tab
String[] tokens = line.split("\t");
if (tokens.length > 1) {
outKey.set(Long.parseLong(tokens[1]));
outValue.set(tokens[0]);
context.write(outKey, outValue);
}
}
}
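For the reducer, use an identity reducer: IdentityReducer<K,V> in the old mapred API, or the base Reducer class in the new mapreduce API, which writes its input through unchanged.
// Sort comparator class
public class DescendingOrderKeyComparator extends WritableComparator {
protected DescendingOrderKeyComparator() {
// Register the key class so the framework can deserialize keys before comparing them
super(LongWritable.class, true);
}
@Override
@SuppressWarnings({"rawtypes", "unchecked"})
public int compare(WritableComparable w1, WritableComparable w2) {
return -1 * w1.compareTo(w2);
}
}
In the driver class, set the sort comparator: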
job.setSortComparatorClass(DescendingOrderKeyComparator.class);

Could reducer class not be launched by any chance? Can't see System.out.println statements in the reducer logs

I have a driver class, a mapper class and a reducer class. The MapReduce job runs fine, but it does not produce the desired output. I have put System.out.println statements in the reducer. I looked at the logs of the mapper and the reducer. The System.out.println statements that I put in the mapper can be seen in the logs, but the println statements in the reducer cannot. Is it possible that the reducer is not launched at all?
This is the log file from the reducer.
I assume this question is based on the code in your earlier question: mapreduce composite Key sample - doesn't show the desired output
public class CompositeKeyReducer extends Reducer<Country, IntWritable, Country, IntWritable> {
public void reduce(Country key, Iterator<IntWritable> values, Context context) throws IOException, InterruptedException {
}
}
The reduce isn't running because the reduce method signature is wrong. You have:
public void reduce(Country key, Iterator<IntWritable> values, Context context)
It should be:
public void reduce(Country key, Iterable<IntWritable> values, Context context)
To make sure this doesn't happen again you should add the @Override annotation to the method. The compiler will then report an error if the signature doesn't match the one in the base class.
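Why the wrong signature fails silently can be shown with a small plain-Java sketch. BaseReducer here is a hypothetical stand-in for Hadoop's Reducer, not the real class; the point is only that a method taking Iterator is a new overload, so the framework keeps calling the inherited Iterable version.

```java
import java.util.Arrays;
import java.util.Iterator;

class OverloadPitfall {
    // Stand-in for the framework base class: run() only ever calls reduce(key, Iterable).
    static class BaseReducer {
        protected String reduce(String key, Iterable<Integer> values) {
            return "base";
        }

        String run(String key, Iterable<Integer> values) {
            return reduce(key, values);
        }
    }

    static class WrongReducer extends BaseReducer {
        // Iterator instead of Iterable: this overloads rather than overrides.
        // Adding @Override here would turn the mistake into a compile error.
        public String reduce(String key, Iterator<Integer> values) {
            return "custom";
        }
    }

    static class RightReducer extends BaseReducer {
        @Override
        public String reduce(String key, Iterable<Integer> values) {
            return "custom";
        }
    }

    public static void main(String[] args) {
        System.out.println(new WrongReducer().run("k", Arrays.asList(1, 2)));
        System.out.println(new RightReducer().run("k", Arrays.asList(1, 2)));
    }
}
```

With the real Reducer the inherited method is the identity reduce, so the job still completes and simply copies the mapper output, which matches the symptom described in the question.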
No change in the code. It works now.
All I did was restart my Hadoop Cloudera image, and it works now. I can't believe this happened.

Use two Mappers on same file simultaneously in Hadoop

Assume there is a file and two different, independent mappers that should be executed on that file in parallel. To do that, it seems we need to use a copy of the file.
What I want to know is: is it possible to use the same file for the two mappers? That would reduce resource utilization and make the system more time efficient.
Is there any research in this area, or any existing tool in Hadoop, which can help with this?
Assuming that both Mappers have the same K,V signature, you could use a delegating mapper and then call the map method of your two mappers:
public class DelegatingMapper extends Mapper<LongWritable, Text, Text, Text> {
// MyMapper1 and MyMapper2 must override setup, map and cleanup as public
// so that they can be called from this class
private MyMapper1 mapper1;
private MyMapper2 mapper2;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
mapper1 = new MyMapper1();
mapper1.setup(context);
mapper2 = new MyMapper2();
mapper2.setup(context);
}
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
// every input record is handed to both mappers
mapper1.map(key, value, context);
mapper2.map(key, value, context);
}
@Override
protected void cleanup(Context context) throws IOException, InterruptedException {
mapper1.cleanup(context);
mapper2.cleanup(context);
}
}
On a high level, there are 2 scenarios I could imagine with the question in hand.
Case 1:
If you are trying to write the SAME implementation in both Mapper classes to process the same input file, with the sole aim of efficient resource utilization, this probably isn't the correct approach. When a file is saved in the cluster, it is already divided into blocks and replicated across data nodes.
This already gives you efficient resource utilization, because all the data blocks of the same input file are processed in PARALLEL.
Case 2:
If you are trying to write two DIFFERENT Mapper implementations (each with its own business logic) for a particular workflow that your business requirements call for, then yes, you can pass the same input file to two different mappers using the MultipleInputs class.
MultipleInputs.addInputPath(job, file1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, file1, TextInputFormat.class, Mapper2.class);
This could only be a workaround based on what you want to implement.
Thanks.

Why can't Reducer.class be used as a real reducer in Hadoop MapReduce?

I noticed that Mapper.class can be used as a real mapper in a phase, together with a user-defined reducer. For example,
Phase 1:
Mapper.class -> WordCountReduce.class
This will work.
However, Reducer.class cannot be used the same way. Namely something like
Phase 2:
WordReadMap.class -> Reducer.class
will not work.
Why is that?
I don't see why it wouldn't as long as the outputs are of the same class as the inputs. The default in the new API just writes out whatever you pass into it, it's implemented as
@SuppressWarnings("unchecked")
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
) throws IOException, InterruptedException {
for(VALUEIN value: values) {
context.write((KEYOUT) key, (VALUEOUT) value);
}
}
For the old API, it's an interface, and you can't directly instantiate an interface. If you're using that, then that's the reason it fails. Then again, the Mapper is an interface as well, and you shouldn't be able to instantiate it...

Hadoop Map Reduce , How to combine first reducer output and first map input , as input for second mapper?

I need to implement some functionality using MapReduce.
The requirement is described below.
The input for the mapper is a file containing two columns: productId, salescount
The reducer's output is the sum of salescount
I need to calculate salescount / sum(salescount).
For this I am planning to use nested MapReduce jobs.
But for the second mapper I need to use the first reducer's output and the first mapper's input.
How can I implement this? Or is there an alternative way?
Regards
Vinu
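You can use ChainMapper and ChainReducer to pipe Mappers and Reducers the way you want. Note that a chained job has the shape [MAP+ / REDUCE MAP*], i.e. one or more mappers, a single reducer, then optional further mappers, so the second setReducer call in the snippet would have to move to a separate job.
The code you need will look similar to the following snippet: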
JobConf mapBConf = new JobConf(false);
JobConf reduceConf = new JobConf(false);
ChainMapper.addMapper(conf, FirstMapper.class, FirstMapperInputKey.class, FirstMapperInputValue.class,
FirstMapperOutputKey.class, FirstMapperOutputValue.class, false, mapBConf);
ChainReducer.setReducer(conf, FirstReducer.class, FirstMapperOutputKey.class, FirstMapperOutputValue.class,
FirstReducerOutputKey.class, FirstReducerOutputValue.class, true, reduceConf);
ChainReducer.addMapper(conf, SecondMapper.class, FirstReducerOutputKey.class, FirstReducerOutputValue.class,
SecondMapperOutputKey.class, SecondMapperOutputValue.class, false, null);
ChainReducer.setReducer(conf, SecondReducer.class, SecondMapperOutputKey.class, SecondMapperOutputValue.class, SecondReducerOutputKey.class, SecondReducerOutputValue.class, true, reduceConf);
or if you don't want to use multiple Mappers and Reducers you can do the following
public static class ProductIndexerMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, LongWritable> {
private static Text productId = new Text();
private static LongWritable salesCount = new LongWritable();
@Override
public void map(LongWritable key, Text value,
OutputCollector<Text, LongWritable> output, Reporter reporter)
throws IOException {
String[] values = value.toString().split("\t");
productId.set(values[0]);
salesCount.set(Long.parseLong(values[1]));
output.collect(productId, salesCount);
}
}
public static class ProductIndexerReducer extends MapReduceBase implements Reducer<Text, LongWritable, Text, DoubleWritable> {
private static DoubleWritable productWritable = new DoubleWritable();
@Override
public void reduce(Text key, Iterator<LongWritable> values,
OutputCollector<Text, DoubleWritable> output, Reporter reporter)
throws IOException {
// Hadoop reuses the value object between next() calls, so store the primitive, not the Writable
List<Long> items = new ArrayList<Long>();
long total = 0;
while(values.hasNext()) {
long count = values.next().get();
total += count;
items.add(count);
}
for (long count : items) {
// divide as double; long division would truncate every ratio to 0
productWritable.set((double) count / total);
output.collect(key, productWritable);
}
}
}
With the use case in hand, I believe we don't need two different mappers or MapReduce jobs to achieve this. (This extends the answer given in the comments above.)
Let's assume you have a very large input file split into multiple blocks in HDFS. When you trigger a MapReduce job with this file as input, multiple mappers (one per input split, which by default corresponds to one HDFS block) will start executing in parallel.
In your mapper implementation, read each line from the input and write the productId as key and the saleCount as value to the context. This data is passed to the Reducer.
We know that in an MR job all the data with the same key is passed to the same reducer. So in your reducer implementation you can calculate the sum of all saleCounts for a particular productId.
Note: I'm not sure what the value 'salescount' in your numerator refers to.
Assuming that it's the number of occurrences of a particular product, use a counter to get the total count in the same for loop where you are calculating SUM(saleCount). So, we have
totalCount -> Count of number of occurrences of a product
sumSaleCount -> Sum of saleCount value for each product.
Now, you can directly divide the above values: totalCount/sumSaleCount.
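A minimal sketch of that single loop, in plain Java rather than as a Hadoop reducer. The class name RatioSketch and the method occurrencesOverSum are hypothetical, the list stands in for the values the reducer receives for one productId, and the interpretation of the numerator is the assumption stated in the note above.

```java
import java.util.Arrays;
import java.util.List;

class RatioSketch {
    // One pass over the values of a single productId:
    // count the occurrences and sum the sale counts, then divide.
    static double occurrencesOverSum(List<Long> saleCounts) {
        long totalCount = 0;
        long sumSaleCount = 0;
        for (long saleCount : saleCounts) {
            totalCount++;
            sumSaleCount += saleCount;
        }
        // Cast to double first; long / long would truncate the ratio.
        return (double) totalCount / sumSaleCount;
    }

    public static void main(String[] args) {
        System.out.println(occurrencesOverSum(Arrays.asList(1L, 1L, 2L)));
    }
}
```

If the intended numerator is instead each row's own salescount, the same loop structure applies, but the individual values have to be kept until the sum is known.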
Hope this helps! Please let me know if you have a different use case in mind.

Resources