Parallel running MapReduce on Hadoop

I'm trying to understand how the parallel processing works with Hadoop & MapReduce.
I understand how Map can be run in parallel, but I don't understand how Reduce can. For example, if I want to find the average of the following list:
COMPUTER | YEAR | RUNS
‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
compA | 1989 | 20
compA | 1990 | 10
compB | 1991 | 300
where compA and compB are two data nodes.
If the average function in the Reduce step is run on compA and compB separately, and the results from the two data nodes are then averaged, the answer will be wrong.

Multiple reducer tasks are not spawned for every type of logic. For aggregation logic such as averaging, the data must go through a single reducer.
The Mapper class receives data as key-value pairs and sends its output to the Reducer class. Before reaching the reducer, the keys are sorted and all values corresponding to the same key reach the same reducer. This is how the results stay correct. If your requirement is to average the entire data set, then the output from all the map tasks must be sent to a single reducer.
E.g.: I want to find the average number of vehicles sold per month over a year.
I have data in 12 files, each file contains the sales details corresponding to a month.
I have to write the following logic.
Mapper class
Mapper will get every record as input.
Parse the number of sales and month details from every record.
Write a constant as the key and the sales count as the value (e.g., 2000 cars sold in January becomes [sales,2000]; similarly, 1000 cars sold in February becomes [sales,1000]). I used the key sales so that all the values with the same key reach the same reducer.
Sample Input data
jan.txt
cars-sold,2000
feb.txt
cars-sold,1000
.....
.....
dec.txt
cars-sold,5000
Mapper output
[sales,2000]
[sales,1000]
....
....
[sales,5000]
Reducer Input
{sales,[1000,2000,.....,5000]}
i.e., (key, list of values)
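To make the flow concrete, here is a minimal sketch of such a job in Java (the class names and parsing logic are illustrative, not from the original post). Every mapper emits its sales figure under the constant key sales, and a single reducer computes the average over all values:

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class AverageSales {

    // Emits [sales, 2000] for an input record such as "cars-sold,2000".
    public static class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final Text SALES = new Text("sales");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");
            context.write(SALES, new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }

    // Every value shares the key "sales", so one reduce() call sees the whole
    // year and can compute the correct global average.
    public static class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (IntWritable v : values) {
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable((double) sum / count));
        }
    }
}

In the driver you would also call job.setNumReduceTasks(1) so that all [sales, n] pairs land in the same reduce() call.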

Related

Increase scan performance in Apache HBase

I am working on a use case and need help improving the scan performance.
Customer visits to our website are captured as logs, which we process (usually with Apache Pig) and insert directly into an HBase table (test) using HBaseStorage. This is done every morning. The data consists of the following columns:
Customerid | Name | visitedurl | timestamp | location | companyname
I have only one column family (test_family)
As of now I generate a random number for each row and insert it as the row key for that table. For example, I have the following data to be inserted into the table:
1725|xxx|www.something.com|127987834 | india |zzzz
1726|yyy|www.some.com|128389478 | UK | yyyy
In this case I would add 1 as the row key for the first row, 2 for the second one, and so on.
Note: the same ID will be repeated on different days, so I chose a random number as the row key.
When querying data from the table with scan 'test', {FILTER=>"SingleColumnValueFilter('test_family','Customerid',=,'binary:1002')"} it takes more than 2 minutes to return the results.
Please suggest a way to bring this down to 1 to 2 seconds, since I am using it for real-time analytics.
Thanks
Based on the query you mentioned, I am assuming you need records based on Customer ID. If that is correct, then to improve performance you should use the Customer ID as the row key.
However, there can be multiple entries for a single Customer ID, so it is better to design the row key as CustomerID|unique number. The unique number could be the timestamp; it depends on your requirements.
To scan the data in this case, use a PrefixFilter on the row key. This will give you much better performance.
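For illustration, a minimal sketch of such a scan with the HBase Java client, assuming the row key has been redesigned as CustomerID|timestamp (the customer id 1002 and the key layout are just examples):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class CustomerScan {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");

        // Row keys are assumed to look like "1002|1279878340000".
        // The prefix filter (together with the start/stop rows) touches only
        // this customer's rows instead of scanning the whole table.
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("1002|"));
        scan.setStopRow(Bytes.toBytes("1002|~")); // '~' sorts after the digits
        scan.setFilter(new PrefixFilter(Bytes.toBytes("1002|")));

        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result result : scanner) {
                System.out.println(result);
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}

Because the scan is bounded to the key prefix, HBase reads only the few rows for that customer rather than filtering every row in the table.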
Hope this helps.

MapReduce with multiple mappers and reducers

I am thinking of processing CSV files, each around 1 MB in size, using MapReduce. The files contain data like:
lat , lng
18.123, 77.312
18.434, 77.456
18.654, 77.483
....
....
I want to keep the mapper and reducer count tied to the number of input files:
file1 -> map1 -> reducer1 -> output1
file2 -> map2 -> reducer2 -> output2
.....
....
For my mapper, though, the input is key -> filename, value -> 18.123, 77.312.
But I need to calculate the distance between the first record and the second one,
i.e. the geo distance for the 1st record -> 18.123, 77.312 and 18.434, 77.456
2nd record -> 18.123, 77.312 and 18.434, 77.456
In the mapper I get one line of the CSV file at a time and pass it to the reducer.
So I am thinking of concatenating the CSV file records with a pipe delimiter (|).
The above CSV data would then become:
lat|lng
18.123, 77.312|18.434, 77.456|18.654, 77.483|....|....|...
In the mapper I will get key -> filename and value -> all records of the CSV joined with the pipe sign.
In the reducer I will split it, calculate the distances, and store them in MySQL.
Is this the correct approach, or is there a better way?
Thanks in advance.
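A minimal mapper sketch of the idea described above, assuming the goal is simply to get all records of one file into the same reduce() call (the class name is illustrative): emitting the file name as the key already groups a file's records together, so concatenating them with a pipe delimiter in the mapper is not strictly necessary.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (file name, "lat, lng") so that every record of one CSV file reaches
// the same reducer, which can then pair up the points and compute distances.
// Note: without a secondary sort the values arrive in no guaranteed order,
// so the reducer must sort or otherwise order them before pairing.
public class PointMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        String record = line.toString().trim();
        if (!record.isEmpty() && !record.startsWith("lat")) { // skip the header line
            context.write(new Text(fileName), new Text(record));
        }
    }
}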

Input/Output flow in MapReduce chaining

I need help regarding MapReduce chaining. I have a MapReduce chain like this:
map -> reduce -> map
I want the output of the reducer to be used in the last mapper.
For example, in my reducer I get the max salary of an employee, and this value is supposed to be used in the next mapper to find the record with that max salary value. So obviously my last mapper should get both the output of the reducer and the contents of the original file? Is that possible? How can I solve this? Is there a better solution?
I'm not sure I understood the problem, but I will try to help.
You have reduced some input containing employee salaries (let's call it input1) into an output (let's call it output1) that looks like this:
Key: someEmployee, Value: max salary
Now you want another mapper to map the data from both input1 and output1?
If so, then you have a few options; choose one according to your needs.
Manipulate the first reducer's output: instead of creating output1 in the format Key: someEmployee, Value: max_salary, emit something like
Key: someEmployee, Value: max_salary##salary_1,salary_2,salary_3...salary_n
then create a new job and set the new mapper's input to output1.
Alternatively, try reading about how to get multiple inputs into one mapper.
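A minimal driver sketch of that chaining, where the first job's reducer output directory becomes one of the second job's inputs (the paths are examples, and the placeholder Mapper/Reducer classes would be replaced with your own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalaryChainDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Job 1: map -> reduce, writes the max salary per employee to output1.
        Job job1 = Job.getInstance(conf, "max salary");
        job1.setJarByClass(SalaryChainDriver.class);
        job1.setMapperClass(Mapper.class);   // replace with your salary mapper
        job1.setReducerClass(Reducer.class); // replace with your max-salary reducer
        // (also set your map/reduce output key and value classes here)
        FileInputFormat.addInputPath(job1, new Path("input1"));
        FileOutputFormat.setOutputPath(job1, new Path("output1"));
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Job 2: reads BOTH the original records (input1) and the first
        // reducer's output (output1), each through its own mapper.
        Job job2 = Job.getInstance(conf, "records with max salary");
        job2.setJarByClass(SalaryChainDriver.class);
        MultipleInputs.addInputPath(job2, new Path("input1"),
                TextInputFormat.class, Mapper.class);  // replace with your record mapper
        MultipleInputs.addInputPath(job2, new Path("output1"),
                TextInputFormat.class, Mapper.class);  // replace with your max-value mapper
        FileOutputFormat.setOutputPath(job2, new Path("final-output"));
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}

If the final step should stay a pure map (map -> reduce -> map), you can also call job2.setNumReduceTasks(0) so its mapper output is written out directly.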

Sorting Mapper output by key and then by value

I am trying to write a sample MapReduce program whose mapper output looks like:
1/1/2012 15:11:46
1/1/2012 19:09:26
1/1/2012 14:01:25
1/1/2012 17:32:26
1/1/2012 17:41:00
1/1/2012 19:35:38
1/1/2012 14:28:10
1/1/2012 15:45:55
I want my input to the reducer sorted by key and then by value.
By default, Hadoop framework sorts the mapper output only by key.
I think I should be using secondary sort to accomplish this, but I am not sure how to use it.
Can anyone please help me with this?
At a high level:
Make your key a concatenation of the current key and value. Keep the value the same.
Create a grouping comparator that takes two keys (which are concatenations), extracts just the dates, and returns a comparison of the two dates. This makes all records with the same date get passed in a single call to reduce().
Specify your grouping comparator in your job driver with all of the other job and configuration settings.
Note that your date values as shown won't be sorted by date lexically - you want the year to be first.
EDIT: It occurs to me that you'll also probably have to write a partitioner, since you want to make sure that keys that apparently have different values (but which are all on the same day) get sent to the same partition.
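A minimal sketch of the grouping comparator, assuming the mapper emits the full "date time" string as a Text key (the class name is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Groups composite "date time" keys by the date part only, so all records of
// one day arrive in a single reduce() call while still being sorted by the
// full key (date first, then time).
public class DateGroupingComparator extends WritableComparator {

    protected DateGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Keys look like "1/1/2012 15:11:46"; compare only the date portion.
        String dateA = a.toString().split(" ")[0];
        String dateB = b.toString().split(" ")[0];
        return dateA.compareTo(dateB);
    }
}

Register it in the driver with job.setGroupingComparatorClass(DateGroupingComparator.class). The partitioner mentioned in the edit would be written the same way: extract the date part of the key and return its hash modulo the number of reducers.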
Have a custom Hadoop WritableComparable, as in the TextPair example, and use it as the key, with the date as the first element of the TextPair class and the time as the second.
In case you don't want to allocate a different reducer for the same date with different times, use a custom partitioner which partitions based on the date alone.

Simplifying a Cascading pipeline used for aggregating sales data

I'm very new to Cascading and Hadoop both, so be gentle... :-D
I think I'm way over-engineering something. Basically my situation is that I have a pipe-delimited file with 9 fields. I want to compute some aggregated statistics over those 9 fields using different groupings. The result should be 10 fields, of which only 6 are either counts or sums. So far I'm up to 4 Unique pipes, 4 CountBy pipes, 1 SumBy, 1 GroupBy, 1 Every, 2 Each, 5 CoGroups and a couple of others. I need to add another small piece of functionality, and the only way I can see to do it is to add 2 Filters, 2 more CoGroups and 2 more Each pipes. This all seems like overkill just to compute a few aggregated statistics, so I'm thinking I'm really misunderstanding something.
My input file looks like this:
storeID | invoiceID | groupID | customerID | transaction date | quantity | price | item type | customer type
Item type is either "I", "S" or "G" for inventory, service or group items; customers belong to groups. The rest should be self-explanatory.
The result I want is:
project ID | storeID | year | month | unique invoices | unique groups | unique customers | customer visits | inventory type sales | service type sales |
Project ID is a constant; customer visits is the number of days during the month on which the customer came in and bought something.
The setup that I'm using right now uses a TextDelimited Tap as my source to read the file and passes the records to an Each pipe which uses a DateParser to parse the transaction date and adds in year, month and day fields. So far so good. This is where it gets out of control.
I'm splitting the stream from there up into 5 separate streams to process each of the aggregated fields that I want. Then I'm joining all the results together in 5 CoGroup pipes, sending the result through Insert (to insert the project ID) and writing through a TextDelimited sink Tap.
Is there an easier way than splitting into 5 streams like that? The first four streams do almost the exact same thing just on different fields. For example, the first stream uses a Unique pipe to just get unique invoiceID's then uses a CountBy to count the number of records with the same storeID, year and month. That gives me the number of unique invoices created for each store by year and month. Then there is a stream that does the same thing with groupID and another that does it with customerID.
Any ideas for simplifying this? There must be an easier way.
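One thing that may help collapse the parallel counting/summing branches is Cascading's AggregateBy family (CountBy, SumBy), which lets several partial aggregations share a single grouping instead of separate GroupBy/CountBy pipes joined back together with CoGroups. A rough sketch, assuming Cascading 2.x and assuming the incoming stream has already been split into per-type sale amount fields (the field names are invented for illustration, and the unique invoice/group/customer counts would still need a Unique step upstream):

import cascading.pipe.Pipe;
import cascading.pipe.assembly.AggregateBy;
import cascading.pipe.assembly.CountBy;
import cascading.pipe.assembly.SumBy;
import cascading.tuple.Fields;

public class SalesAggregation {

    public static Pipe buildAggregation(Pipe sales) {
        Fields groupingFields = new Fields("storeID", "year", "month");

        // Partial aggregations that all run under one grouping.
        CountBy lineCount = new CountBy(new Fields("lineCount"));
        SumBy inventorySales = new SumBy(new Fields("inventorySale"),
                new Fields("inventorySales"), double.class);
        SumBy serviceSales = new SumBy(new Fields("serviceSale"),
                new Fields("serviceSales"), double.class);

        // One AggregateBy replaces several separate count/sum branches plus
        // the CoGroups that would otherwise re-join their results.
        return new AggregateBy(sales, groupingFields,
                lineCount, inventorySales, serviceSales);
    }
}

This will not remove every split (the unique counts still need their own handling), but it usually cuts down the number of GroupBy, CoGroup and Each pipes considerably.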
