Extracting rows containing a specific value using MapReduce and Hadoop

I'm new to developing map-reduce functions. Suppose I have a CSV file containing four columns of data.
For example:
101,87,65,67
102,43,45,40
103,23,56,34
104,65,55,40
105,87,96,40
Now, I want to extract, say,
40 102
40 104
40 105
as those rows contain 40 in the fourth column.
How do I write the map and reduce functions?

Basically, the WordCount example resembles very well what you are trying to achieve. Instead of initializing a count for each word, you should have a condition that checks whether the tokenized String contains the required value, and only in that case write to the context. This will work, since the Mapper receives each line of the CSV separately.
The Reducer will then receive the list of values, already grouped by key. In the Reducer, instead of IntWritable, you can use NullWritable as the output value type, so your code will only output the keys. You also do not need the loop in the Reducer, since you only want to output the keys.
I am not providing any code in my answer, since you would learn nothing from that. Make your way from the recommendations.
EDIT: since you modified your question with a request for the Reducer, here are some tips on how you can achieve what you want.
One possibility for achieving the desired result is: in the Mapper, after splitting (or tokenizing) the line, write column 3 as the key and column 0 as the value to the context. Since you do not need any kind of aggregation, your Reducer can simply write out the keys and values produced by the Mappers (yes, your Reducer code will end up with a single line of code). You can check one of my previous answers; the figure there explains quite well what the Map and Reduce phases are doing.
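For reference, a minimal sketch of what those tips could look like in code (class names are illustrative, and the value 40 is hard-coded for brevity; in practice you would pass it in through the job configuration):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class ExtractRows {

        // Emits (column 3, column 0) only for rows whose fourth column matches the wanted value.
        public static class FilterMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length == 4 && "40".equals(fields[3].trim())) {
                    context.write(new Text(fields[3]), new Text(fields[0]));
                }
            }
        }

        // No aggregation needed: simply re-emit every (key, value) pair produced by the Mappers.
        public static class PassThroughReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text value : values) {
                    context.write(key, value);
                }
            }
        }
    }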

Related

How to map the value in Mapper

I have a file with data like
City,Quarter,Classification,Index
Bordeux,Q1,R,3
Krakow,Q1,U,2
Halifax,Q1,U,4
I need to find out the highest Index in each Classification and write them to two separate files. The output should be
Bordeux,Q1,R,3
Halifax,Q1,U,4
How do I load the data in the Mapper, as it requires a key/value pair? In the mapper it seems the programmer should not make any modification to the data, so how do I load it into the Context object?
I think the data types of the key and value are not changed in the Reducer. If so, once I add my logic to find the top records, how do I organize them into the Context object there?
I don't have a clue how to proceed.
Any pointers will help me proceed further.
In your case, when the file is read in the Mapper, the input key is the byte offset of the line and the value is the line itself. So, in other words, each line of the file is received in the Mapper as the value field. Now, the output (key, value) of the Mapper should be (Classification, Index).
The output of the Mapper becomes the input (key, value) of the Reducer, so the Reducer will receive (Classification, Iterable<Index>) as input. For each classification you can iterate over the list of Index values to get the maximum, and the output of the Reducer will be (Classification, Max).
In this case, the output key and value types will be the same for both the Mapper and the Reducer.
However, regarding writing the results to separate files: separate files are generated only if every key is routed to a different reducer instance. So in your case, the total number of reducers should be equal to the total number of unique classifications (not great in terms of resource utilization, though), and you have to write a custom partitioner to make that happen.
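A minimal sketch of that Mapper/Reducer pair might look like this (class names are illustrative; the Reducer here emits the whole winning record, and getting each classification into its own output file would still require the custom partitioner described above):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class MaxIndexPerClassification {

        // Emits (Classification, whole record) for every data row.
        public static class ClassificationMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(",");
                if (fields.length == 4 && !"City".equals(fields[0])) {   // skip the header row
                    context.write(new Text(fields[2]), value);
                }
            }
        }

        // Keeps only the record with the highest Index for each Classification.
        public static class MaxIndexReducer extends Reducer<Text, Text, Text, NullWritable> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                String bestRecord = null;
                int bestIndex = Integer.MIN_VALUE;
                for (Text value : values) {
                    String[] fields = value.toString().split(",");
                    int index = Integer.parseInt(fields[3]);
                    if (index > bestIndex) {
                        bestIndex = index;
                        bestRecord = value.toString();
                    }
                }
                context.write(new Text(bestRecord), NullWritable.get());
            }
        }
    }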

How do we count the number of times a map function is called in a mapreduce program?

I have to do certain operations on my input data and write the result to HDFS using a MapReduce program.
My input data looks like
abc
some data
some data
some data
def
other data
other data
other data
and it continues in the same way, where abc and def are the headers and the "some data" lines are tab-separated records.
My task is to eliminate the headers and append each header to the records below it, like
some data abc
some data abc
some data abc
other data def
other data def
other data def
Each header will have 50 records.
I am using the default record reader, so it reads one line at a time.
Now my problem is: how do I know that the map function has been called for the nth time?
Is there a counter I can use to know that?
So that I can use that counter to append the header to the string, as in
if (counter % 50 == 0)
*some code*
Or are static variables the only way?
You can use a member variable to keep the count of how many records have been processed so far. Member variables are instance variables and are not reset each time the map function is called. You can initialize them in the mapper's setup method.
Obviously, you can also use a static variable to keep the counter.
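A minimal sketch of that idea, assuming each group is one header line followed by exactly 50 records (so 51 lines per group) and that the mapper sees the file from its first line (the split problem discussed below is not handled here); class and variable names are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HeaderAppendMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

        private long counter;          // member variable: survives across map() calls
        private String currentHeader;  // header of the group currently being read

        @Override
        protected void setup(Context context) {
            counter = 0;
            currentHeader = null;
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (counter % 51 == 0) {
                // every 51st line (1 header + 50 records) starts a new group
                currentHeader = value.toString();
            } else {
                context.write(new Text(value.toString() + "\t" + currentHeader), NullWritable.get());
            }
            counter++;
        }
    }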
The data in HDFS is stored in blocks; how are you going to handle the case where the data is split across two blocks?
To handle data split across two blocks, you might need Reducers. The property of reducers is that all the data (values) related to a particular key are always sent to the same (single) reducer. The input to the reducer is a key and a list of values, which in your case is the list of data records, so you can store them very easily as per your requirement.
Optimization: you can use the same Reducer code as a Combiner to optimize the data transfer.
Idea: the Mapper emits the key and value as they are. When the Reducer receives the data, which is (key, List<value>), all of your values have already been grouped by the MapReduce framework. You just need to emit them again. This is the output you are looking for.
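A minimal sketch of such a pass-through Reducer, assuming the Mapper emits the header as the key and each record as the value (names are illustrative; because this version swaps key and value on output to match the desired "record <tab> header" layout, it could not be reused unchanged as a Combiner):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // All records belonging to one header arrive in a single reduce() call;
    // we only need to emit them again with the header appended.
    public class HeaderGroupReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text header, Iterable<Text> records, Context context)
                throws IOException, InterruptedException {
            for (Text record : records) {
                context.write(record, header);
            }
        }
    }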

Increasing mappers in Pig

I am using Pig to load data from Cassandra using CqlStorage. I have 4 data nodes, each of which can have 7 mappers, and there are ~30 million rows in Cassandra. When I run
LOAD 'cql://keyspace/columnfamily' using CqlStorage
it takes 27 mappers to run.
But if I give a where clause in the load function, like
LOAD 'cql://keyspace/columnfamily?where_clause=id%3D100' using CqlStorage
it always takes one mapper.
Can anyone help me increase the number of mappers?
It looks from your WHERE clause like your map input will only be a single key, which would be the reason why you only get one mapper. Hadoop will allocate mappers based on the number of input keys. If you have only one input key, additional mappers will do nothing.
The bottom line is that if you specify your partition key in the where clause, you will get one mapper (since that's the way it gets distributed). Based on the comments I presume you are doing analysis for more than just one student, so there's no reason you'd be specifying the partition key. You also don't seem to have any columns that make sense for a secondary index. So I'm not sure why you even have a where clause.
It looks from your data model like you'll have to map over all your data to get aggregate marks for a combination of student and time range. It's possible you could change to a time-series data model and successfully filter in the where clause, but your current model doesn't support this.

What happens when identical keys are passed to the Mapper in Hadoop

What is the significance of data being passed as key/value pairs to the mapper as well in the Hadoop MapReduce framework? I understand that key/value pairs hold significance when they are passed to the reducers, as they cater to the partitioning of data coming from the mappers: values belonging to the same key go as a list from the mapper to the reducer stage. But how are the keys used before the mapper stage itself? What happens to values belonging to the same key? If we don't define a custom input format, I presume Hadoop takes the record number from the input file as the key and the text line as the value in the mapper function. But if we decide to implement a custom input format, there is a custom selection of keys, and there could be a possibility where several values correspond to the same key.
How does this get handled in the mapper stage? Does the mapper ignore the duplication and treat them as separate records, or does it choose only one record per key?
An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record—a key-value pair—in turn.
So the mapper treats records with the same key as separate records.
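For illustration, a minimal sketch (assuming an input format such as KeyValueTextInputFormat that produces Text keys and Text values): map() simply runs once per record, so two records sharing a key result in two independent calls.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class PassThroughMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
            // No merging or de-duplication happens here; duplicate keys are
            // only brought together later, at the shuffle/reduce stage.
            context.write(key, value);
        }
    }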

How to perform ETL in map/reduce

How do we design the mapper/reducer if I have to transform a text file line by line into another text file?
I wrote a simple map/reduce program which did a small transformation, but the requirement is a bit more elaborate. Below are the details:
The file is usually structured like this: the first row contains a comma-separated list of column names; the second and remaining rows specify values against those columns.
In some rows the trailing column values might be missing, e.g. if there are 15 columns then values might be specified only for the first 10 columns.
I have about 5 input files which I need to transform and aggregate into one file. The transformations are specific to each of the 5 input files.
How do I pass contextual information like the file name to the mapper/reducer program?
Transformations are specific to columns, so how do I remember the columns mentioned in the first row and then correlate and transform the values in the following rows?
Do I split the file into lines, transform (map) each line in parallel, and join (reduce) the resulting lines into one file?
You cannot rely on the column info in the first row. If your file is larger than an HDFS block, it will be broken into multiple splits and each split handed to a different mapper. In that case, only the mapper receiving the first split will receive the first row with the column info; the rest won't.
I would suggest putting the file-specific metadata in a separate file and distributing it as side data. Your mapper or reducer tasks could read the metadata file.
Through the Hadoop Context object, you can get hold of the name of the file being processed by a mapper. Between all of these, I think you have all the context information you are referring to, and you can do the file-specific transformation. Even though the transformation logic is different for different files, the mapper output needs to have the same format.
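For example, a minimal sketch of looking up the input file name from the Context (assuming the default FileInputFormat-style splits; the transform method is a placeholder for your file-specific logic):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class EtlMapper extends Mapper<LongWritable, Text, Text, Text> {

        private String fileName;

        @Override
        protected void setup(Context context) {
            // Name of the file the current split belongs to.
            fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Branch into the file-specific transformation, but keep one common output format.
            context.write(new Text(fileName), new Text(transform(fileName, value.toString())));
        }

        private String transform(String file, String line) {
            return line;   // placeholder for the real per-file transformation
        }
    }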
If you are using a reducer, you could set the number of reducers to one to force all output to aggregate into one file.
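In the driver, that is a single call (sketch; the usual mapper/reducer/path setup is omitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class EtlDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "etl-aggregate");
            // One reducer means all mapper output ends up in a single output file.
            job.setNumReduceTasks(1);
            // ... set jar, mapper, reducer, input/output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }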

Resources