Joins using Custom File Format - hadoop

If I wish to perform a reduce-side join using a custom file format, how should I implement it, particularly the RecordReader? Say I have to fetch data from two datasets: one from a customers table (customerid, fname, lname, age, profession) and one from a transactions table (transId, transdate, customerId, itemPurchased1, itemPurchased2, city, state, methodOfPayment). In order to fetch data from two datasets, I need two mappers. Can I have two record readers for the two mappers? If so, how?
Please explain along with the driver implementation. If that is not possible, please suggest a way to implement a reduce-side join using a custom file format.
Thank you in Advance :)

You want to join two data sets with a reduce-side join.
You need two mappers because the two data sets have different layouts and need separate parsing. Each mapper should output the join attribute (the customer id in your case) as the key and the entire record as the value. You can also drop unnecessary fields here to optimize. The important thing is to prepend a tag to the value (for example "set1:" + the map value) so the reducer can identify which mapper a record came from.
In the reducer you will receive a customer id as the key and a list of values containing records from both data sets, and you can join them there as your requirements dictate.
Once the two mappers are written, you have to tell the job about them. This is done in the driver (Job) class using MultipleInputs, as below:
MultipleInputs.addInputPath(job, new Path("inputPath1"), TextInputFormat.class, com.abc.HBaseMapper1.class);
MultipleInputs.addInputPath(job, new Path("inputPath2"), TextInputFormat.class, com.abc.HBaseMapper2.class);
From a performance point of view, if one of the data sets is small, you can load it via the distributed cache in each mapper and do the join map-side against the other data set instead.
In Mapper 1, parse the customer id from the row and emit it as the key:
context.write(new Text(custId), new Text("##map1##|" + value));
In Mapper 2, do the same with the customer id from the transaction record:
context.write(new Text(custId), new Text("##map2##|" + value));
In the reducer:
StringBuilder output = new StringBuilder();
for (Text txt : values) {
    String record = txt.toString();
    if (record.startsWith("##map1##")) {
        // append the customer fields to the output string
        output.append(record.substring("##map1##|".length()));
    } else if (record.startsWith("##map2##")) {
        // append the transaction fields to the output string
        output.append(",").append(record.substring("##map2##|".length()));
    }
}
context.write(key, new Text(output.toString()));

Related

Apache Storm, co-partitioning of streams

I have the following situation: I need to join two streams, Bid(Seller, Item, Price) and Ask(Buyer, Item, Price), and emit a tuple (Seller, Buyer) when the buyer offers a higher price than the seller requested.
I know that I can configure the bolt's grouping with fieldsGrouping. But if I configure each input separately, is there a guarantee that data with the same value will always go to the same bolt task?
I am putting a pseudo code to help explain more
builder.setBolt("goodPrice", new GoodPriceBolt(), 5)
.fieldsGrouping("Bid", new Fields("Item"))
.FieldsGrouping("Ask", new Fields("Item"));
Now, as per the documentation http://storm.apache.org/releases/current/Concepts.html, we can guarantee that all Bid data points for the same item value will be delivered to the same task. But, I am not sure if the code above will guarantee also that all Ask data points with the same item value as that of the Bid will be delivered to the same task.
In other words, I need to partition on Bid.Item = Ask.Item. Is that possible in Storm?
Yes, as far as I know. Joins are listed as a common pattern on Storm's site: http://storm.apache.org/releases/2.0.0-SNAPSHOT/Common-patterns.html.
Here's the implementation of fields grouping in Storm https://github.com/apache/storm/blob/09e01231cc427004bab475c9c70f21fa79cfedef/storm-client/src/jvm/org/apache/storm/daemon/GrouperFactory.java#L160. The values list contains the values of the fields you've specified in the field grouping (in your case "Item"). The id of the task to send the tuple to is based on https://github.com/apache/storm/blob/09e01231cc427004bab475c9c70f21fa79cfedef/storm-client/src/jvm/org/apache/storm/utils/TupleUtils.java#L44, which uses the hash code of the values. As long as whatever is in your "Item" field implements hashCode properly, you should be good.
You might also be interested in http://storm.apache.org/releases/1.2.1/Joins.html, and maybe https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/org/apache/storm/starter/SingleJoinExample.java. Keep in mind that when you join streams, you should try to take into account that the matching tuples might not show up in the joiner at the same time, which is why Storm provides a join bolt that lets you specify a window for how long you want to wait on one part of the match.
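For illustration only (not from the linked examples), a minimal GoodPriceBolt might buffer the latest bid and ask per item in local state and match them there. This sketch assumes Storm 2.x method signatures, that the feeding components are named "Bid" and "Ask" as in the pseudocode above, and that they emit fields named Seller/Buyer/Item/Price; a production version would need the windowing/eviction logic mentioned above.

import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Keeps only the latest bid/ask per item; real code would evict or window this state.
public class GoodPriceBolt extends BaseRichBolt {
    private OutputCollector collector;
    private Map<String, Tuple> latestBidByItem;
    private Map<String, Tuple> latestAskByItem;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.latestBidByItem = new HashMap<>();
        this.latestAskByItem = new HashMap<>();
    }

    @Override
    public void execute(Tuple tuple) {
        String item = tuple.getStringByField("Item");
        // Both streams are fields-grouped on "Item", so every bid and ask for this
        // item arrives at this same task and can be matched from local state.
        if ("Bid".equals(tuple.getSourceComponent())) {
            latestBidByItem.put(item, tuple);
        } else {
            latestAskByItem.put(item, tuple);
        }
        Tuple bid = latestBidByItem.get(item);
        Tuple ask = latestAskByItem.get(item);
        if (bid != null && ask != null
                && ask.getDoubleByField("Price") >= bid.getDoubleByField("Price")) {
            collector.emit(new Values(bid.getStringByField("Seller"), ask.getStringByField("Buyer")));
        }
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("Seller", "Buyer"));
    }
}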

Multiple table input for mapreduce

I am thinking of doing a mapreduce using accumulo tables as input.
Is there a way to have two different tables as input, the same way addInputPath allows multiple file inputs?
Or is it possible to have one input from a file and the other one from a table with AccumuloInputFormat ?
You probably want to take a look at AccumuloMultiTableInputFormat. The Accumulo manual demonstrates how to use it here.
Example Usage:
job.setInputFormat(AccumuloMultiTableInputFormat.class);
AccumuloMultiTableInputFormat.setConnectorInfo(job, user, new PasswordToken(pass));
AccumuloMultiTableInputFormat.setMockInstance(job, INSTANCE_NAME);
InputTableConfig tableConfig1 = new InputTableConfig();
InputTableConfig tableConfig2 = new InputTableConfig();
Map<String, InputTableConfig> configMap = new HashMap<String, InputTableConfig>();
configMap.put(table1, tableConfig1);
configMap.put(table2, tableConfig2);
AccumuloMultiTableInputFormat.setInputTableConfigs(job, configMap);
See the unit test for AccumuloMultiTableInputFormat here for some additional information.
Note that, unlike normal MultipleInputs, you can't specify a different Mapper to run on each table. That's not a massive problem in this case, though, since the incoming Key/Value types are the same and you can use:
RangeInputSplit split = (RangeInputSplit)c.getInputSplit();
String tableName = split.getTableName();
in your mapper to work out which table the records are coming from (taken from the Accumulo manual).
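For illustration (not from the Accumulo manual), a single mapper that branches on the source table might look roughly like this; "table1" and "table2" stand for whatever table names you put in the config map:

import java.io.IOException;
import org.apache.accumulo.core.client.mapreduce.RangeInputSplit;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultiTableMapper extends Mapper<Key, Value, Text, Text> {
    @Override
    protected void map(Key key, Value value, Context context)
            throws IOException, InterruptedException {
        RangeInputSplit split = (RangeInputSplit) context.getInputSplit();
        String tableName = split.getTableName();
        if ("table1".equals(tableName)) {
            // parse and emit records from the first table, tagged with their source
            context.write(key.getRow(), new Text("table1|" + value.toString()));
        } else {
            // parse and emit records from the second table
            context.write(key.getRow(), new Text("table2|" + value.toString()));
        }
    }
}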

Hadoop - Get Split Ids in map function

I'm working on a project with map reduce.
My understanding of Hadoop is that it will separate my data into blocks, which are then turned into splits, where a split corresponds to a single map task.
It would be my assumption that each split would have an ID or number associated with it.
I'm wondering if there is any way to get this split Id/number or even the block Id/number as the key to the map function?
ie:
map(split_id, data)
The InputSplit toString() method returns a descriptive pattern. If we hash this pattern using MD5Hash, we get a unique id identifying each input split.
InputSplit is = context.getInputSplit();
String splitId = MD5Hash.digest(is.toString()).toString();
Then we can use the splitId as the key to the mapper function.
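Putting that into a mapper, a minimal sketch (assuming plain text input; SplitIdMapper is a hypothetical name) might look like this:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;

public class SplitIdMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text splitId = new Text();

    @Override
    protected void setup(Context context) {
        // Compute the split id once per map task.
        InputSplit is = context.getInputSplit();
        splitId.set(MD5Hash.digest(is.toString()).toString());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Every record from this split shares the same split-derived key.
        context.write(splitId, value);
    }
}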

How to fill a Cassandra Column Family from another one's columns?

I have always read that Cassandra is good if your application changes frequently and features are added frequently.
That makes sense: since you don't have a fixed schema, you can add columns to rows to suit your needs, instead of running an ALTER TABLE query which may freeze your database for hours on very large tables.
However I have a hypothetical problem which I'm not able to solve.
Let's say I have:
CREATE COLUMN FAMILY Students
with comparator='CompositeType(UTF8Type,UTF8Type)',
and key_validation_class=UUIDType;
Each student has some generic columns (you know, meta:username, meta:password, meta:surname, etc.), plus each student may follow N courses. This N-N relationship is resolved using denormalization, adding N columns to each Student (course:ID1, course:ID2).
On the other side, I may have a Courses CF, where each row contains the UUIDs of all students following that course.
So I can ask "which courses are followed by XXX" and "which students follow course YYY".
The problem is: what if I didn't create the second column family? Maybe at the time when the application was built, getting the students following a specific course wasn't a requirement.
This is a simple example, but I believe it's quite common. "With Cassandra you plan CFs in terms of queries instead of relationships". I need that query now, while at first it wasn't needed.
Given a Students CF with thousands of entries, how would you fill the Courses CF? Is this a job for Hadoop, Pig or Hive (I've never touched any of those, just guessing)?
Pig (which uses the Hadoop integration) is actually perfect for this type of work, because you can not only read but also write data back into Cassandra using CassandraStorage. It gives you the parallel processing capability to do the job with minimal time and overhead. Otherwise the alternative is to write something to do the extraction yourself, then write the new CF.
Here is a Pig example that computes averages from a set of data in one CF and outputs them to another:
rows = LOAD 'cassandra://HadoopTest/TestInput' USING CassandraStorage() AS (key:bytearray,cols:bag{col:tuple(name:chararray,value)});
columns = FOREACH rows GENERATE flatten(cols) AS (name,value);
grouped = GROUP columns BY name;
vals = FOREACH grouped GENERATE group, columns.value AS values;
avgs = FOREACH vals GENERATE group, 'Pig_Average' AS name, (long)SUM(values.value)/COUNT(values.value) AS average;
cass_group = GROUP avgs BY group;
cass_out = FOREACH cass_group GENERATE group, avgs.(name, average);
STORE cass_out INTO 'cassandra://HadoopTest/TestOutput' USING CassandraStorage();
If you use the existing Cassandra column family, you would have to unwind the data. Since NoSQL data is stored unidirectionally, this could be a very time-consuming operation in Cassandra itself: the data would have to be sorted in the opposite order from the first CF. Frankly, I believe you would have to go back to the original data that was used to populate the first CF and populate the new one from that.

Mapping through two data sets with Hadoop

Suppose I have two key-value data sets; call them Data Set A and Data Set B. I want to update all the data in Set A with data from Set B where the two match on keys.
Because I'm dealing with such large quantities of data, I'm using Hadoop MapReduce. My concern is that to do this key matching between A and B, I need to load all of Set A (a lot of data) into the memory of every mapper instance. That seems rather inefficient.
Would there be a recommended way to do this that doesn't require repeating the work of loading in A every time?
Some pseudocode to clarify what I'm currently doing:
Load in Data Set A   # This seems like the expensive step to always be doing
For each key/value in Data Set B:
    If key is in Data Set A:
        Update Data Set A
According to the documentation, the MapReduce framework includes the following steps:
Map
Sort/Partition
Combine (optional)
Reduce
You've described one way to perform your join: loading all of Set A into memory in each Mapper. You're correct that this is inefficient.
Instead, observe that a large join can be partitioned into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. Sorted Map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.
To finish your join, the Reducer needs only to output the key and either the updated value from Set B, if it exists; otherwise, output the key and the original value from Set A. To distinguish between values from Set A and Set B, try setting a flag on the output value from the Mapper.
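As an illustration of that flag-based approach (my own sketch, not part of the answer), a minimal reducer might look like the following, assuming the mappers prefix values from Set A with "A|" and values from Set B with "B|":

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UpdateJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String original = null;
        String updated = null;
        for (Text val : values) {
            String s = val.toString();
            if (s.startsWith("A|")) {
                original = s.substring(2);
            } else if (s.startsWith("B|")) {
                updated = s.substring(2);
            }
        }
        // Prefer the value from Set B when it exists; otherwise keep Set A's value.
        if (updated != null) {
            context.write(key, new Text(updated));
        } else if (original != null) {
            context.write(key, new Text(original));
        }
    }
}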
All of the answers posted so far are correct - this should be a Reduce-side join... but there's no need to reinvent the wheel! Have you considered Pig, Hive, or Cascading for this? They all have joins built-in, and are fairly well optimized.
This video tutorial by Cloudera gives a great description of how to do a large-scale Join through MapReduce, starting around the 12 minute mark.
Here are the basic steps he lays out for joining records from file B onto records from file A on key K, with pseudocode. If anything here isn't clear, I'd suggest watching the video as he does a much better job explaining it than I can.
In your Mapper:
    for K from file A:
        tag K to identify it as a Primary Key
        emit <K, value of K>
    for K from file B:
        tag K to identify it as a Foreign Key
        emit <K, record>
Write a Sorter and Grouper which will ignore the PK/FK tagging, so that your records are sent to the same Reducer regardless of whether they are a PK record or an FK record, and are grouped together.
Write a Comparator which will compare the PK and FK keys and send the PK first. (A sketch of these pieces appears at the end of this answer.)
The result of this step will be that all records with the same key will be sent to the same Reducer and be in the same set of values to be reduced. The record tagged with PK will be first, followed by all records from B which need to be joined. Now, the Reducer:
value_of_PK = values[0]             // First value is the value of your primary key
for value in values[1:]:
    value.replace(FK, value_of_PK)  // Replace the foreign key with the key's value
    emit <key, value>
The result of this will be file B, with all occurrences of K replaced by the value of K in file A. You can also extend this to effect a full inner join, or to write out both files in their entirety for direct database storage, but those are pretty trivial modifications once you get this working.
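To make the Sorter/Grouper step concrete, here is a rough sketch (mine, not from the video), assuming the mappers emit Text keys of the form "K#0" for the primary-key record and "K#1" for foreign-key records:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the natural key only, so the PK-tagged and FK-tagged forms of the
// same K land on the same reducer.
class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split("#", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key only, so the PK record and all FK records for K share
// a single reduce() call.
class NaturalKeyGroupingComparator extends WritableComparator {
    NaturalKeyGroupingComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ka = a.toString().split("#", 2)[0];
        String kb = b.toString().split("#", 2)[0];
        return ka.compareTo(kb);
    }
}

Because "#0" sorts before "#1", the default Text ordering already delivers the PK-tagged key first within each group; you wire these in with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).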
