Understanding a custom Partitioner in Hadoop

I am learning the Partitioner concept now. Can anyone explain the piece of code below? It is hard for me to understand.
public class TaggedJoiningPartitioner extends Partitioner<TaggedKey, Text> {

    @Override
    public int getPartition(TaggedKey taggedKey, Text text, int numPartitions) {
        return taggedKey.getJoinKey().hashCode() % numPartitions;
    }
}
How does taggedKey.getJoinKey().hashCode() % numPartitions determine which reducer a key is sent to?
Can anyone explain this?

It's not as complex as you think once you break things down a little bit.
taggedKey.getJoinKey().hashCode() will simply return an integer. Every object has a hashCode() function that returns a number which will, hopefully, be unique to that object. You could look into the source code of TaggedKey to see how it works if you'd like, but all you need to know is that it returns an integer based on the contents of the object.
The % operator performs modulo division, which returns the remainder after division (8 % 3 = 2, 15 % 7 = 1, etc.).
So let's say you have 3 partitions (numPartitions = 3). Every time you do modulo division by 3 you'll get 0, 1, or 2, no matter what non-negative number you start with (negative hash codes need an extra guard, covered in a later question below). This result determines which of the 3 partitions gets the record.
The whole idea of partitioners is that you use them to control how data is grouped across reducers. If you wanted to group by month, you could send every record tagged "January" to the first partition, every record tagged "December" to the twelfth, and so on. Your case looks a bit confusing from the outside, but the intent is simply to spread the data out (hopefully) evenly, so a simple hash/modulo function chooses the partition. The choice is not random, though: the hash is deterministic, so every record with the same join key lands in the same partition, and therefore at the same reducer.
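To make this concrete, here is a small standalone sketch (plain Java, no Hadoop required; the key strings are made up) showing how the hash/modulo pair maps keys onto a fixed set of partitions:

import java.util.Arrays;

public class PartitionDemo {
    public static void main(String[] args) {
        int numPartitions = 3;
        String[] joinKeys = { "orders", "customers", "products", "orders" };
        for (String key : joinKeys) {
            // Same formula as the partitioner above. Masking with
            // Integer.MAX_VALUE keeps the result non-negative even for
            // negative hash codes (explained in a later question below).
            int partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            System.out.println(key + " -> partition " + partition);
        }
    }
}

Because the hash of a given key never changes, the two "orders" records above print the same partition number both times; that is exactly what sends all records sharing a join key to the same reducer.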

Generate “hash” functions programmatically

I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170-odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input and output has its own distinct set of values, e.g. [ red, green, yellow ] or [ yes, no ].
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n); then the entropy is -sum(p[i] * log(p[i]) for i in outputs). (Note: since p[i] < 1, log(p[i]) is negative, and therefore the entropy is positive. Also note: if p[i] = 0, we take p[i] * log(p[i]) to be zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the set of outputs restricted to that value. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v, weighted by how often each value occurs.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
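As a sketch of the scoring step in Java (my own illustration; the names are made up, and "samples" is the enumerated, or randomly sampled, list of input-vector/output pairs):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InfoGain {
    // One observed case: an input vector and the output it produced.
    record Sample(Object[] inputs, Object output) {}

    // Entropy of the output distribution: -sum(p[i] * log(p[i])),
    // where p[i] is the fraction of samples whose output is i.
    static double entropy(List<Sample> samples) {
        Map<Object, Integer> counts = new HashMap<>();
        for (Sample s : samples) counts.merge(s.output(), 1, Integer::sum);
        double h = 0.0, n = samples.size();
        for (int c : counts.values()) {
            double p = c / n;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Information gained by learning input variable v: total entropy minus
    // the entropy of each value's subset, averaged with weights proportional
    // to how often that value occurs.
    static double informationGain(List<Sample> samples, int v) {
        Map<Object, List<Sample>> byValue = new HashMap<>();
        for (Sample s : samples)
            byValue.computeIfAbsent(s.inputs()[v], k -> new ArrayList<>()).add(s);
        double avg = 0.0, n = samples.size();
        for (List<Sample> group : byValue.values())
            avg += (group.size() / n) * entropy(group);
        return entropy(samples) - avg;
    }

    // The ratio used to pick the branching variable (assumes v has at least
    // 2 possible values): information gained per unit needed to specify v.
    static double score(List<Sample> samples, int v, int numValuesOfV) {
        return informationGain(samples, v) / Math.log(numValuesOfV);
    }
}

Picking the variable with the highest score at each node, then recursing on each branch, produces the cascading nested ifs described above.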
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer is that the analysis I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, you'll come up with fairly good entropies for your first level. Repeat with 10,000 for each of your branches on the second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this, that, and the other, here is the output you always get." If there is a reasonably short set of nested ifs that gives the right output, you're probably going to find it. After that, you have to decide whether to verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
A trickier approach for the validation: if you can find fuzzing software written for your language, run it with the goal of teasing out every possible internal execution path for each bucket you find. If the fuzzing software can't produce an answer different from the one suggested by the above approach, then you can probably trust it.
The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then, for each output, we can eliminate the inputs that do not matter for that output. As a result, for each output we get a matrix showing the output values for all input combinations, excluding the inputs that do not matter for the given output.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
    InputPossibleValues = new object[][]
    {
        new object[] { 1, 2, 3 },        // input #0
        new object[] { 'a', 'b', 'c' },  // input #1
        new object[] { "foo", "bar" },   // input #2
    },
    Converters = new System.Func<object[], object>[]
    {
        input => input[0],                                   // output #0
        input => (int)input[0] + (int)(char)input[1],        // output #1
        input => (string)input[2] == "foo" ? 1 : 42,         // output #2
        input => input[2].ToString() + input[1].ToString(),  // output #3
        input => (int)input[0] % 2,                          // output #4
    }
};
The sample output is a mapping table per output (omitted here; it is included in the linked snippet). The heart of the backward conversion is below; the full code, in the form of a LINQPad snippet, is at http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
    // Generate all possible input vectors and record the result for each case;
    // then, for each output, figure out which inputs matter.
    object[][] inputs = schema.GenerateInputVectors();

    // reversal path
    for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
    {
        List<int> inputsThatDoNotMatter = new List<int>();

        for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
        {
            // Find all groups of input vectors where all other inputs (excluding
            // the current one) are the same. If the outputs are identical across
            // every such group, the current input does not matter for this output.
            bool inputMatters = inputs.GroupBy(
                    input => ExcudeByIndexes(input, new[] { inputIdx }),
                    input => schema.Convert(input)[outputIdx],
                    ObjectsByValuesComparer.Instance)
                .Where(x => x.Distinct().Count() > 1)
                .Any();

            if (!inputMatters)
            {
                inputsThatDoNotMatter.Add(inputIdx);
                Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
            }
        }

        // Mapping table (only the inputs that matter).
        var mapping = new List<dynamic>();
        foreach (var inputGroup in inputs.GroupBy(
                input => ExcudeByIndexes(input, inputsThatDoNotMatter),
                ObjectsByValuesComparer.Instance))
        {
            dynamic record = new ExpandoObject();
            object[] sampleInput = inputGroup.First();
            object output = schema.Convert(sampleInput)[outputIdx];

            for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
            {
                if (inputsThatDoNotMatter.Contains(inputIdx))
                    continue;
                AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
            }
            AddProperty(record, $"Output #{outputIdx}", output);
            mapping.Add(record);
        }

        // "input x, ..., input y, output z" form
        mapping.Dump();
    }
}

How to Connect Logic with Objects

I have a system that contains x number of strings. These strings are shown in a UI based on some logic. For example, string 1 should only show if the current time is past midday, and string 3 only shows if a randomly generated number between 0 and 1 is less than 0.5.
What would be the best way to model this?
Should the logic just be in code and be linked to a string by some sort of ID?
Should the logic somehow be stored with the strings?
NOTE: the above is a theoretical example, before people start questioning my logic.
It's usually better to keep resources (such as strings) separate from logic, so referring to strings by IDs is a good idea.
It seems that you have a bunch of rules which you have to link to the display of strings. I'd keep all three as separate entities: rules, strings, and the linking between them.
An illustration in Python, necessarily simplified:
STRINGS = {
    'morning': 'Good morning',
    'afternoon': 'Good afternoon',
    'luck': 'you must be lucky today',
}

# predicates
import datetime, random

def showMorning():
    return datetime.datetime.now().hour < 12

def showAfternoon():
    return datetime.datetime.now().hour >= 12

def showLuck():
    return random.random() > 0.5

# interconnection
RULES = {
    'morning': showMorning,
    'afternoon': showAfternoon,
    'luck': showLuck,
}

# usage
for string_id, predicate in RULES.items():
    if predicate():
        print(STRINGS[string_id])

What does Text.hashCode() & Integer.MAX_VALUE mean?

Recently, I have been reading Hadoop: The Definitive Guide.
I have two questions.
1. I saw this piece of code in a custom Partitioner:
public class KeyPartitioner extends Partitioner<TextPair, Text> {

    @Override
    public int getPartition(TextPair key, Text value, int numPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
What does & Integer.MAX_VALUE mean? Why use the & operator?
2. I also want to write a custom Partitioner for IntWritable. Is it OK (and best) to use key.value % numPartitions directly?
Like I already wrote in the comments, it is used to keep the resulting integer positive.
Let's use a simple example using Strings:
String h = "Hello I'm negative!";
int hashCode = h.hashCode();
hashCode is negative with the value of -1937832979.
If you mod this with a positive number (> 0) denoting the number of partitions, the result is negative (or zero).
System.out.println(hashCode % 5); // yields -4
Since partition numbers can never be negative, you need to make sure the result is non-negative. Here a simple bit-twiddling trick comes into play: Integer.MAX_VALUE is all ones except for the sign bit (the most significant bit), which in two's complement is 1 only on negative numbers.
So when you AND a number with Integer.MAX_VALUE, the sign bit is ANDed with a 0 and is therefore always cleared, while every other bit passes through unchanged; the result is always non-negative.
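A quick standalone check (reusing the string and hash value from above):

public class MaskDemo {
    public static void main(String[] args) {
        int hashCode = "Hello I'm negative!".hashCode(); // -1937832979
        System.out.println(hashCode % 5);                        // -4
        System.out.println((hashCode & Integer.MAX_VALUE) % 5);  // 4: sign bit cleared first
    }
}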
You can make it more readable though:
return Math.abs(key.getFirst().hashCode() % numPartitions);
For example, I have done that in Apache Hama's partitioner for arbitrary objects:
@Override
public int getPartition(K key, V value, int numTasks) {
    return Math.abs(key.hashCode() % numTasks);
}
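As for question 2, here is a sketch of such a partitioner (my own illustration, not from the book). Note that IntWritable's accessor is get(), and the value it returns can be negative, so the same masking guard still applies:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class IntKeyPartitioner extends Partitioner<IntWritable, Text> {
    @Override
    public int getPartition(IntWritable key, Text value, int numPartitions) {
        // Mask off the sign bit before the modulo, exactly as discussed above.
        return (key.get() & Integer.MAX_VALUE) % numPartitions;
    }
}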

How to choose all possible combinations?

Let's assume that we have the list of loans a user has, like below:
loan1
loan2
loan3
...
loan10
And we have the function which can accept from 2 to 10 loans:
function(loans).
For example, the following calls are possible:
function(loan1, loan2)
function(loan1, loan3)
function(loan1, loan4)
function(loan1, loan2, loan3)
function(loan1, loan2, loan4)
function(loan1, loan2, loan3, loan4, loan5, loan6, loan7, loan8, loan9, loan10)
How to write the code to pass all possible combinations to that function?
RosettaCode has combination generation implemented in many languages; pick whichever suits you.
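If you happen to be working in Java, here is a minimal bit-mask sketch of the same idea (illustrative names; process stands in for the question's function). Each bit of mask decides whether the loan at that index is included, so counting through all masks visits every subset exactly once:

import java.util.ArrayList;
import java.util.List;

public class LoanCombinations {
    public static void main(String[] args) {
        List<String> loans = new ArrayList<>();
        for (int i = 1; i <= 10; i++) loans.add("loan" + i);

        for (int mask = 0; mask < (1 << loans.size()); mask++) {
            if (Integer.bitCount(mask) < 2) continue; // the function needs 2 to 10 loans
            List<String> combination = new ArrayList<>();
            for (int i = 0; i < loans.size(); i++) {
                if ((mask & (1 << i)) != 0) combination.add(loans.get(i));
            }
            process(combination); // stand-in for function(loans)
        }
    }

    static void process(List<String> combination) {
        System.out.println(combination);
    }
}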
Here's how you could do it in Ruby:
loans = (1..10).map { |i| "loan#{i}" }

def my_function(loans)
  # every combination of 2 loans up to all of the loans
  (2..loans.length).each do |size|
    loans.combination(size).each do |combination|
      # do something with this combination
    end
  end
end
To call:
my_function(loans)
I have written a class to handle common functions for working with the binomial coefficient, which is the type of problem that your problem falls under. It performs the following tasks:
Outputs all the K-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters. This method makes solving this type of problem quite trivial.
Converts the K-indexes to the proper index of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle. My paper talks about this. I believe I am the first to discover and publish this technique, but I could be wrong.
Converts the index in a sorted binomial coefficient table to the corresponding K-indexes. I believe it might be faster than the link you have found.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to perform the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with 2 cases and there are no known bugs.
To read about this class and download the code, see Tablizing the Binomial Coefficient.
It should not be hard to convert this class to the language of your choice.
To solve your problem, you might want to write a new loans function that takes as input an array of loan objects and works on those objects with the BinCoeff class. In C#, to obtain the array of loans for each unique combination, something like the following example code could be used:
void LoanCombinations(Loan[] Loans)
{
    // The Loans array contains all of the loan objects that need
    // to be handled.
    int LoansCount = Loans.Length;
    // Loop through all possible combination sizes.
    // Start with 2 loan objects, then 3, 4, and so forth.
    for (int K = 2; K <= LoansCount; K++)
    {
        // Create the bin coeff object required to get all
        // the combos for this N choose K combination.
        BinCoeff<int> BC = new BinCoeff<int>(LoansCount, K, false);
        int NumCombos = BinCoeff<int>.GetBinCoeff(LoansCount, K);
        int[] KIndexes = new int[K];
        // Loop through all the combinations for this N choose K.
        for (int Combo = 0; Combo < NumCombos; Combo++)
        {
            // Get the K-indexes for this combination, which in this case
            // are the indexes of each loan in Loans.
            BC.GetKIndexes(Combo, KIndexes);
            // Create a new array of Loan objects that correspond to
            // this combination group.
            Loan[] ComboLoans = new Loan[K];
            for (int Loop = 0; Loop < K; Loop++)
                ComboLoans[Loop] = Loans[KIndexes[Loop]];
            // Call the ProcessLoans function with the loans to be processed.
            ProcessLoans(ComboLoans);
        }
    }
}
I have not tested the above code, but in general it should solve your problem.

How can I get an integer index for a key in Hadoop?

Intuitively, Hadoop is doing something like this to distribute keys to mappers, in Python-esque pseudocode.
# data is a dict with many key-value pairs
keys = data.keys()
key_set_size = len(keys) / num_mappers
index = 0
mapper_keys = []
for i in range(num_mappers):
    end_index = index + key_set_size
    send_to_mapper(keys[int(index):int(end_index)], i)
    index = end_index
# And something vaguely similar for the reducer (but not exactly).
It seems like somewhere hadoop knows the index of each key it is passing around, since it distributes them evenly among the mappers (or reducers). My question is: how can I access this index? I'm looking for a range of integers [0, n) mapping to all my n keys; this is what I mean by an "index".
I'm interested in the ability to get the index from within either the mapper or reducer.
After doing more research on this question, I don't believe it is possible to do exactly what I want. Hadoop does not seem to have such an index that is user-visible after all, although it does try to distribute work evenly among the mappers (so such an index is theoretically possible).
Actually, each individual reducer gets back an array of items that correspond to the reduce key. So do you want the offset of items within the reduce key in your reducer, or the overall offset of the particular item in the global array of all lines being processed? To get an index in your mapper, you can simply prepend a line number to each line of the file before the file gets to the mapper. This will give you the "global index". However, keep in mind that with 1,000,000 items, item 662,345 could be processed before item 10,000.
If you are using the new MR API, then org.apache.hadoop.mapreduce.lib.partition.HashPartitioner is the default partitioner; otherwise, org.apache.hadoop.mapred.lib.HashPartitioner is the default partitioner. You can call getPartition() on either HashPartitioner to get the partition number for the key (which you referred to as an index).
Note that the HashPartitioner class is only used to distribute the keys to the Reducer. When it comes to a mapper, each input split is processed by a map task and the keys are not distributed.
Here is the code from HashPartitioner for the getPartition(). You can write a simple Java program for the same.
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
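For example, a tiny standalone program along those lines (assuming hadoop-common is on the classpath; the key and the reducer count here are made up):

import org.apache.hadoop.io.Text;

public class PartitionCheck {
    public static void main(String[] args) {
        Text key = new Text("some key");
        int numReduceTasks = 10;
        // Same formula as HashPartitioner.getPartition() above.
        int partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        System.out.println("Key would go to reducer " + partition);
    }
}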
Edit: Including another way to get the index.
The following code should also work; it is to be included in your map or reduce class.
// `partition` is an int field of your Mapper/Reducer class (old mapred API);
// it holds the index of this task's partition.
public void configure(JobConf job) {
    partition = job.getInt("mapred.task.partition", 0);
}
