Mapping through two data sets with Hadoop

Suppose I have two key-value data sets--Data Sets A and B, let's call them. I want to update all the data in Set A with data from Set B where the two match on keys.
Because I'm dealing with such large quantities of data, I'm using Hadoop MapReduce. My concern is that to do this key matching between A and B, I need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient.
Would there be a recommended way to do this that doesn't require repeating the work of loading in A every time?
Some pseudocode to clarify what I'm currently doing:
Load in Data Set A  # This seems like the expensive step to always be doing
For each key/value in Data Set B:
    If key is in Data Set A:
        Update Data Set A

According to the documentation, the MapReduce framework includes the following steps:
1. Map
2. Sort/Partition
3. Combine (optional)
4. Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer need only output, for each key, the updated value from Set B if one exists, and otherwise the original value from Set A. To distinguish values from Set A and Set B, try setting a flag on the value output by the Mapper.
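For concreteness, here is a minimal Java sketch of that reduce-side join, assuming both data sets are tab-separated text files of the form key<TAB>value; the class names and the "A"/"B" flag are illustrative choices, not something prescribed by the framework.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for Set A: flag each value with "A" so the Reducer can tell the sources apart.
class SetAMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length == 2) {
            ctx.write(new Text(parts[0]), new Text("A\t" + parts[1]));
        }
    }
}

// Mapper for Set B: flag each value with "B".
class SetBMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] parts = line.toString().split("\t", 2);
        if (parts.length == 2) {
            ctx.write(new Text(parts[0]), new Text("B\t" + parts[1]));
        }
    }
}

// Reducer: for each key present in Set A, emit the updated value from Set B if one
// arrived, otherwise the original value from Set A. Keys that appear only in Set B
// are dropped, matching the question's pseudocode.
class UpdateJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String valueFromA = null;
        String valueFromB = null;
        for (Text v : values) {
            String[] flagged = v.toString().split("\t", 2);
            if ("A".equals(flagged[0])) {
                valueFromA = flagged[1];
            } else {
                valueFromB = flagged[1];
            }
        }
        if (valueFromA != null) {
            ctx.write(key, new Text(valueFromB != null ? valueFromB : valueFromA));
        }
    }
}

When setting up the Job, the two Mappers can be attached to their respective inputs with MultipleInputs.addInputPath(...); the only essential point is that every value carries a source flag the Reducer can inspect.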

All of the answers posted so far are correct - this should be a Reduce-side join... but there's no need to reinvent the wheel! Have you considered Pig, Hive, or Cascading for this? They all have joins built-in, and are fairly well optimized.

This video tutorial by Cloudera gives a great description of how to do a large-scale Join through MapReduce, starting around the 12 minute mark.
Here are the basic steps he lays out for joining records from file B onto records from file A on key K, with pseudocode. If anything here isn't clear, I'd suggest watching the video as he does a much better job explaining it than I can.
In your Mapper:
K from file A:
    tag K to identify as Primary Key
    emit <K, value of K>
K from file B:
    tag K to identify as Foreign Key
    emit <K, record>
Write a Partitioner and a Grouper (grouping comparator) which ignore the PK/FK tagging, so that your records are sent to the same Reducer and grouped together regardless of whether they are PK or FK records (a sketch of these two pieces appears at the end of this answer).
Write a sort Comparator which compares the PK and FK tags and sends the PK record first.
The result of this step is that all records with the same key will be sent to the same Reducer and arrive in the same set of values to be reduced. The record tagged as PK will be first, followed by all the records from B which need to be joined. Now, the Reducer:
value_of_PK = values[0] // First value is the value of your primary key
for value in values[1:]:
    value.replace(FK, value_of_PK) // Replace the foreign key with the key's value
    emit <key, value>
The result of this will be file B, with all occurrences of K replaced by the value of K in file A. You can also extend this to effect a full inner join, or to write out both files in their entirety for direct database storage, but those are pretty trivial modifications once you get this working.
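To make the Partitioner/Grouper step above concrete, here is one possible shape for those two pieces in Java, assuming the Mapper emits a Text composite key of the form K + "\t" + tag, with tag "0" for PK records and "1" for FK records (the tag values and key layout are my own illustration, not taken from the video). With such a Text key, the default sort order already places the "0"-tagged PK record ahead of the "1"-tagged FK records for the same K.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the natural key K only, so PK and FK records for the same K
// reach the same Reducer even though their full composite keys differ.
class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split("\t", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key K only, so the PK record and all FK records for K
// are handed to a single reduce() call.
class NaturalKeyGroupingComparator extends WritableComparator {
    public NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split("\t", 2)[0];
        String kb = b.toString().split("\t", 2)[0];
        return ka.compareTo(kb);
    }
}

These would be wired in with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).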

Related

Sorting after Repartitioning PySpark Dataframe

We have a giant file which we repartitioned according to one column, for example, say it is STATE. Now it seems like after repartitioning, the data cannot be sorted completely. We are trying to save our final file as a text file but instead of the first state listed being Alabama, now California shows up first. OrderBy doesn't seem to have an effect after running the repartition.
df = df.repartition(100, ['STATE_NAME'])\
.sortWithinPartitions('STATE_NAME', 'CUSTOMER_ID', 'ROW_ID')
I can't find a clear statement in the documentation about this, only this hint for pyspark.sql.DataFrame.repartition:
The resulting DataFrame is hash partitioned.
Obviously, repartition doesn't put the rows into any particular (e.g. alphabetical) order, even if they were ordered previously; it only groups them. That .sortWithinPartitions imposes no global order is unsurprising given the name: the sorting happens only within each partition, not across them. You can try .sort instead.

Comparing data from same file in reducer function of a map reduce program

In my MapReduce program, the mapper function will give two key-value pairs:
1) (person1, age)
2) (person2, age)
(I have kept only 2 pairs for simplicity; it would be nice if you could explain for n lines.)
Now I want to write a reducer which will compare the ages of both and give the answer as to who is older.
The thing I cannot understand is that the mapper's output will be on different lines in the file. And since the reducer works line by line over a file, how will it compare them?
Thanks in advance.
See if any of the following logic serves your purpose:
A.
emit (age,person_name) from your map
have only 1 reducer -
you will get all (age, person) pairs in sorted order, so simply emitting them will give the first one as the youngest and the last one as the oldest.
If you don't want to print all values, just keep two references in the reducer task - youngest and oldest - set them in the reduce method and emit whichever you want in the cleanup of the reducer task.
B.
Have a mapper emitting (name, age) as you said.
In the reducer task:
a. Use setup() to create a TreeMap
b. In reduce(), add (age, person) to the TreeMap
c. Your map will be sorted by age, which you can use in cleanup() to do whatever you need.
Essentially, you can store all key/value pairs in internal object(s) in reduce(); in cleanup() you will have access to all of these values and can perform any logic you want on them.
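If it helps, here is a rough Java sketch of option B, assuming the mapper emits (Text name, IntWritable age); the class name is made up, and note that two people with exactly the same age would overwrite each other in the TreeMap, which is fine for this sketch.

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class OldestPersonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Built up across all reduce() calls of this task; TreeMap keeps it sorted by age.
    private final TreeMap<Integer, String> ageToName = new TreeMap<>();

    @Override
    protected void reduce(Text name, Iterable<IntWritable> ages, Context ctx) {
        for (IntWritable age : ages) {
            ageToName.put(age.get(), name.toString());
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // Runs once at the end of the task; emit the oldest person seen.
        Map.Entry<Integer, String> oldest = ageToName.lastEntry();
        if (oldest != null) {
            ctx.write(new Text(oldest.getValue()), new IntWritable(oldest.getKey()));
        }
    }
}

As in option A, this only gives a single global answer if the job runs with one reducer.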
I think your use case fits straight into the secondary sorting technique.
Secondary sorting is a technique that was introduced to sort the "values" emitted by the mapper; primary sorting is done on the "key" emitted by the mapper.
If you try to sort all values at the reducer level, you may run out of memory. Secondary sorting should be done at the mapper level.
Have a look at this article
In the above example, just replace "year" with "person" and "temperature" with "age".
Solution:
Create a custom Partitioner to send all values for a particular key to a single reducer.
Sorting should be done on the (key, value) combination emitted by the mapper => create a composite key of key + value to be used for sorting. Come up with a Comparator which sorts first by key and then by value.
In the reduce method, all you get is a key and a list of values, so you can find the min or max among the values for that key. However, if you need to compare across keys, then maybe you should think of a single reducer: get all the records from the mappers, handle that logic in your reducer class with the help of an instance variable rather than a local variable, and update that variable with the min/max value for each key.

How to process large file with one record dependent on another in MapReduce

I have a scenario where there is a really large file, and say the record on line 1 might have a dependency on the data on line 1000, and lines 1 and 1000 can be part of separate splits. My understanding of the framework is that the record reader is going to return one key-value pair to the mapper, and each k,v pair will be independent of the others. Moreover, since the file has been divided into splits and I want it that way (i.e. setting splittable to false is not an option), can I handle this somehow, maybe by writing my own record reader, mapper, or reducer?
Dependency is like -
Row1: a,b,c,d,e,f
Row2: x,y,z,p,q,r
Now x in Row2 needs to be used with, say, d in Row1 to get my desired output.
Thanks.
I think what you need is to implement a reduce-side join. Here you can see a better explanation of it: http://hadooped.blogspot.mx/2013/09/reduce-side-joins-in-java-map-reduce.html.
Both related values have to end up in the same reducer (determined by the key and the Partitioner), they should be grouped together (GroupingComparator), and you can use a secondary sort to order the grouped values.

Sort reducer input iterator value before processing in Hadoop

I have some input data coming to the reducer, with the values arriving as an Iterator.
How can I sort this list of values into ascending order?
I need to sort them in order, since they are time values, before processing them all in the reducer.
To sort the reducer input values using Hadoop's built-in features, you can do this:
1. Modify the map output key - append the corresponding value to the map output key, and emit this composite key together with the value from the map. Since Hadoop uses the entire key for sorting by default, the map output records will be sorted by (your old key + value). (A sketch of such a composite key follows below.)
2. Although the sorting is taken care of in step 1, you have manipulated the map output key in the process, and Hadoop does partitioning and grouping based on the key by default.
3. Since you have modified the original key, you need to take care of modifying the Partitioner and the GroupingComparator to work on the old key, i.e., only the first part of your composite key.
Partitioner - decides which key-value pairs land in the same Reducer instance.
GroupingComparator - decides which of the key-value pairs that landed in the Reducer go into the same reduce method call.
4. Finally (and obviously), you need to extract the first part of the input key in the reducer to get the old key back.
If you need a more complete (and better) answer, turn to Hadoop: The Definitive Guide, 3rd Edition -> Chapter 8 -> Sorting -> Secondary Sort.
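As a rough illustration of step 1, a composite key might look like the following; the class and field names are my own, and you would still pair this with a Partitioner and GroupingComparator that look only at the natural-key part, as steps 2-3 describe.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: the original ("natural") key plus the time value to sort on.
class KeyTimePair implements WritableComparable<KeyTimePair> {
    private String naturalKey = "";
    private long time;

    public void set(String naturalKey, long time) {
        this.naturalKey = naturalKey;
        this.time = time;
    }

    public String getNaturalKey() {
        return naturalKey;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(naturalKey);
        out.writeLong(time);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        naturalKey = in.readUTF();
        time = in.readLong();
    }

    // Sort by natural key first, then by time, so values reach the reducer in ascending time order.
    @Override
    public int compareTo(KeyTimePair other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : Long.compare(time, other.time);
    }
}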
What you are asking for is called a secondary sort. In a nutshell, you extend the key to add a "value sort key" to it, and make Hadoop group by only the "real key" but sort by both.
Here is a very good explanation about the secondary sort:
http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/

Removing duplicates from a huge file

We have a huge chunk of data and we want to perform a few operations on it. Removing duplicates is one of the main operations.
Ex.
a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361
These are three entries in a file, and we want to remove duplicates on the basis of the 1st column, so the 3rd row should be deleted. Each row may have a different number of columns, but the column we are interested in will always be present.
An in-memory operation doesn't look feasible.
Another option is to store the data in a database and remove duplicates there, but again that's not a trivial task.
What design should I follow to dump the data into a database and remove the duplicates?
I am assuming that people must have faced such issues and solved them.
How do we usually solve this problem?
PS: Please consider this as a real life problem rather than interview question ;)
If the set of keys is also infeasible to load into memory, you'll have to do a stable (order-preserving) external merge sort to sort the data and then a linear scan to do the duplicate removal. Or you could modify the external merge sort to eliminate duplicates while merging sorted runs.
I guess since this isn't an interview question, efficiency/elegance don't seem to be an issue(?). Write a hacky Python script that creates one table with the first field as the primary key. Parse the file and just insert the records into the database, wrapping each insert in a try/except statement. Then perform a select * on the table, parse the data, and write it back out to a file line by line.
If you go down the database route, you can load the CSV into a database and use 'duplicate key update'.
Using MySQL:
Create a table with columns to match your data (you may be able to get away with just 2 columns - id and data).
Load the data using something along the lines of:
LOAD DATA LOCAL infile "rs.txt" REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';
You should then be able to dump the data back out in CSV format without duplicates.
If the number of unique keys isn't extremely high, you could simply do this:
(Pseudocode, since you didn't mention a language)
Set keySet;
while(not end_of_input_file)
    read line from input file
    if first column is not in keySet
        add first column to keySet
        write line to output file
end while
If the input is sorted or can be sorted, then one could do the following, which only needs to store one value in memory:
import sys

# read_row() and write_row() are assumed helpers: read_row() returns the next row
# as a list of columns, or None at end of input; write_row() writes one row out.
r = read_row()
if r is None:
    sys.exit()
last = r[0]
write_row(r)
while True:
    r = read_row()
    if r is None:
        sys.exit()
    if r[0] != last:
        write_row(r)
        last = r[0]
Otherwise:
What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.
import sys

S = set()
while True:
    r = read_row()
    if r is None:
        sys.exit()
    if r[0] not in S:
        write_row(r)
        S.add(r[0])
This will stream over the input using only memory proportional to the size of the set of values from the first column.
If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.
Say you have the following data: a, c, a, b
With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c
We can then de-duplicate, trivially being able to choose either the first or last entry to keep (we can also, with a bit more memory consumption, keep another) and get: 0/a, 3/b, 1/c.
We then sort by position and strip that: a, c, b
This would involve three linear scans over the data set and two sorting steps.
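As a small, in-memory Java illustration of this position/data idea (for data that doesn't fit in memory you would replace the two in-memory sorts with external sorts; all names here are illustrative):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class PositionDedup {
    // A (position, value) pair.
    record Entry(int pos, String value) {}

    public static void main(String[] args) {
        List<String> data = List.of("a", "c", "a", "b");

        // 1. Pair each value with its original position: 0/a, 1/c, 2/a, 3/b
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < data.size(); i++) {
            entries.add(new Entry(i, data.get(i)));
        }

        // 2. Sort by value: 0/a, 2/a, 3/b, 1/c
        entries.sort(Comparator.comparing(Entry::value));

        // 3. Linear scan, keeping the first occurrence of each value: 0/a, 3/b, 1/c
        List<Entry> deduped = new ArrayList<>();
        String last = null;
        for (Entry e : entries) {
            if (!e.value().equals(last)) {
                deduped.add(e);
                last = e.value();
            }
        }

        // 4. Restore the original order by sorting on position, then strip it: a, c, b
        deduped.sort(Comparator.comparingInt(Entry::pos));
        deduped.forEach(e -> System.out.println(e.value()));
    }
}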
