Related
Given finder : a List of maps, with String keys and String values.
On receiving a super map (a map which is a superset of one or more maps in the finder), find the map in the finder list that contains the maximum number of elements from the super map.
For instance
Map { Name = Adam, Qualification = Engg, Job = Manager , Country = US, City = Seattle}
list of sub maps :
List [ submap1 { Name = Adam, Job = Manager },
submap2 { Name = Adam, Country = US, City = Seattle },
submap3 { Name = Adam, Country = US, City = Seattle, Job = Manager, Nickname = bobby } ]
Result should be submap2 (submap3 is bigger, but its Nickname entry does not appear in the super map). Is there a way I could use tries to preprocess the list of submaps for faster lookups? I have the list of submaps at compile time, and on receiving a super map I need to find the best submap.
Looking for some pointers on the algorithm and data structure I could use to preprocess the submaps and match any super map faster. Some sort of trie-based structure?
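For what it's worth, a trie is not strictly necessary here; an inverted index over (key, value) entries gives the same fast lookup. A sketch (class and method names are mine, not from the question): preprocess each submap's entries into an index, count per-submap hits for the incoming super map, keep only fully contained submaps, and return the largest.

```java
import java.util.*;

public class SubmapIndex {
    private final List<Map<String, String>> submaps;
    // Inverted index: "key=value" entry -> ids of the submaps containing it
    private final Map<String, List<Integer>> index = new HashMap<>();

    public SubmapIndex(List<Map<String, String>> submaps) {
        this.submaps = submaps;
        for (int id = 0; id < submaps.size(); id++) {
            for (Map.Entry<String, String> e : submaps.get(id).entrySet()) {
                index.computeIfAbsent(e.getKey() + "=" + e.getValue(),
                                      k -> new ArrayList<>()).add(id);
            }
        }
    }

    /** Index of the largest submap fully contained in superMap, or -1 if none. */
    public int findBest(Map<String, String> superMap) {
        int[] hits = new int[submaps.size()];
        for (Map.Entry<String, String> e : superMap.entrySet()) {
            for (int id : index.getOrDefault(e.getKey() + "=" + e.getValue(),
                                             Collections.emptyList())) {
                hits[id]++;
            }
        }
        int best = -1, bestSize = 0;
        for (int id = 0; id < submaps.size(); id++) {
            int size = submaps.get(id).size();
            if (hits[id] == size && size > bestSize) { // fully contained and bigger
                best = id;
                bestSize = size;
            }
        }
        return best;
    }
}
```

On the example above this picks submap2: submap3 accumulates only 4 hits against its 5 entries (Nickname = bobby is missing from the super map), so it is rejected.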
I want to union/merge two files using Pig, but this is a different union than the usual one. Following are my files (h* are the file headers):
F1 :
h1,h2,h3,h4
a01,a02,a03,a04
a11,a12,a13,a14
F2 :
h3,h4,h5,h6
a23,a24,b01,b02
a33,a34,b11,b12
The resulting output must be a Union of these files like this :
FR :
h1,h2,h3,h4,h5,h6
a01,a02,a03,a04,,
a11,a12,a13,a14,,
,,a23,a24,b01,b02
,,a33,a34,b11,b12
One more difficulty: I want to make it generic, so that it works for a dynamic number of common columns. Currently there are two common columns; there could be 3, 1, or even no common column at all. For example:
F1 :
h1,h2,h3,h4
a1,a2,a3,a4
F2 :
h5,h6,h7,h8
b1,b2,b3,b4
FR :
h1,h2,h3,h4,h5,h6,h7,h8
a1,a2,a3,a4,,,,
,,,,b1,b2,b3,b4
Any hint/help is appreciated.
Here is how you can do it statically:
F1full = FOREACH F1 GENERATE h1,h2,h3,h4, NULL as h5, NULL as h6;
F2full = FOREACH F2 GENERATE NULL as h1,NULL as h2,h3,h4, h5, h6;
FR = UNION F1full, F2full;
Pig is not very flexible, so I don't think it is possible to generate this dynamically for the generic case.
If you want a solution for the generic case, you could use a language like Python to build the required Pig commands based on the metadata of the stored tables/files.
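To illustrate that last point, here is a sketch (in Java rather than Python, names hypothetical) that generates the static Pig statements above from the two header lists: take the union of the headers and pad each relation with NULL for the columns it lacks.

```java
import java.util.*;

public class PigUnionGen {
    /** Builds the FOREACH/UNION statements, padding missing headers with NULL. */
    public static String generate(List<String> h1, List<String> h2) {
        // Union of headers: first file's order, then new headers from the second
        LinkedHashSet<String> all = new LinkedHashSet<>(h1);
        all.addAll(h2);
        StringBuilder sb = new StringBuilder();
        sb.append("F1full = FOREACH F1 GENERATE ").append(project(all, h1)).append(";\n");
        sb.append("F2full = FOREACH F2 GENERATE ").append(project(all, h2)).append(";\n");
        sb.append("FR = UNION F1full, F2full;");
        return sb.toString();
    }

    private static String project(Set<String> all, List<String> present) {
        List<String> cols = new ArrayList<>();
        for (String h : all) {
            cols.add(present.contains(h) ? h : "NULL as " + h);
        }
        return String.join(", ", cols);
    }
}
```

For the h1..h6 example this emits the padded FOREACH statements and the final UNION, and it degrades gracefully to the no-common-column case.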
I tried to solve the problem using the following approach:
1) Load both of the files.
2) Add counter to generate a unique field (ID).
3) Start the counter for file B where counter for A ended.
4) Cogroup both files on the common columns, including the counter.
5) Take all group columns in a different schema.
6) Generate uncommon columns from both files, along with the counter.
7) First join uncommon columns from file A with group columns on counter.
8) Join the result of step 7 with uncommon columns from file B on counter.
Following is the Pig script to do this. As the script is generic, I have listed all the parameters required before running it.
-- Parameters required : $file1_path, $file2_path, $file1_schema, $file2_schema, $COUNT_A (number of rows in file A), $CMN_COLUMN_A (common columns in A), $CMN_COLUMN_B, $UNCMN_COLUMN_A(Unique columns in file A), $UNCMN_COLUMN_B.
A = LOAD '$file1_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file1_schema);
B = LOAD '$file2_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER') as ($file2_schema);
RANK_A = RANK A;
RANK_B = RANK B;
COUNT_RANK_B = FOREACH RANK_B GENERATE ($0+(long)'$COUNT_A') as rank_B, $1 ..;
COGRP_RANK_AB = COGROUP RANK_A BY($CMN_COLUMN_A), COUNT_RANK_B BY ($CMN_COLUMN_B);
CMN_COGRP_RANK_AB = FOREACH COGRP_RANK_AB GENERATE FLATTEN(group) AS ($CMN_COLUMN_A);
UNCMN_RA = FOREACH RANK_A GENERATE $UNCMN_COLUMN_A;
UNCMN_RB = FOREACH COUNT_RANK_B GENERATE $UNCMN_COLUMN_B;
JOIN_CMN_UNCMN_A = JOIN CMN_COGRP_RANK_AB BY(rank_A) LEFT OUTER, UNCMN_RA BY rank_A;
JOIN_CMN_UNCMN_B = JOIN JOIN_CMN_UNCMN_A BY(CMN_COGRP_RANK_AB::rank_A) LEFT OUTER, UNCMN_RB BY rank_B;
STORE JOIN_CMN_UNCMN_B INTO '$store_path' USING org.apache.pig.piggybank.storage.CSVExcelStorage('~', 'NO_MULTILINE', 'UNIX', 'WRITE_OUTPUT_HEADER');
I have a JavaPairRDD with 2 elements:
("TypeA", List<jsonTypeA>),
("TypeB", List<jsonTypeB>)
I need to combine the 2 pairs into 1 pair of type:
("TypeA_B", List<jsonCombinedAPlusB>)
I need to combine the 2 lists into 1 list, where each pair of JSONs (one of type A and one of type B) shares a common field I can join on.
Consider that list of type A is significantly smaller than the other, and the join should be inner, so the result list should be as small as the list of type A.
What is the most efficient way to do that?
rdd.join(otherRdd) performs an inner join on the two RDDs. To use it, you will need to transform both RDDs into PairRDDs keyed by the common attribute you will be joining on.
Something like this (example, untested):
val rddAKeyed = rddA.keyBy{case (k,v) => key(v)}
val rddBKeyed = rddB.keyBy{case (k,v) => key(v)}
val joined = rddAKeyed.join(rddBKeyed).map{case (k,(json1,json2)) => (newK, merge(json1,json2))}
Where merge(j1,j2) is the specific business logic on how to join the two json objects.
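The core of this, stripped of Spark, is a plain hash join: build a hash table on the smaller list (type A), then probe it with the larger one, so the cost is roughly O(|A| + |B|) and the result is at most as large as the matched side. A generic sketch in Java (keyOfA, keyOfB, and merge stand in for your field extraction and merge logic):

```java
import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;

public class HashJoin {
    /** Inner-joins listA (the smaller list) with listB on a common key. */
    public static <A, B, K, R> List<R> join(List<A> listA, List<B> listB,
                                            Function<A, K> keyOfA,
                                            Function<B, K> keyOfB,
                                            BiFunction<A, B, R> merge) {
        // Build the hash table on the smaller side
        Map<K, List<A>> byKey = new HashMap<>();
        for (A a : listA) {
            byKey.computeIfAbsent(keyOfA.apply(a), k -> new ArrayList<>()).add(a);
        }
        // Probe with the larger side; only matching keys survive (inner join)
        List<R> result = new ArrayList<>();
        for (B b : listB) {
            for (A a : byKey.getOrDefault(keyOfB.apply(b), Collections.emptyList())) {
                result.add(merge.apply(a, b));
            }
        }
        return result;
    }
}
```

Inside Spark, the analogous optimization when one side is small is a broadcast (map-side) join: broadcast the small collection to the executors and map over the large RDD, which avoids shuffling the large side.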
EDIT
I'm going to illustrate the exact problem I'm trying to solve. The simplified problem explanation wasn't working.
I'm writing a framework that requires me to assign threads to CPU cores based on load factor. Please let's not debate the point as to why I'm doing this.
When the framework boots, it forms a map of the following hardware:
Level 1: processor workgroups.
Level 2: NUMA nodes.
Level 3: processors (sockets).
Level 4: cores.
Level 5: logical processors (only applicable with SMT systems).
I represent this with a fairly complex 5-level hierarchy.
Users may query this hardware info. A user can specify nothing, the desired workgroup, the desired NUMA nodes, etc., down to level 4. In this case, the framework simply filters the full data set and returns only what matches the input, so long as it complies with the hierarchy (i.e. the user cannot specify cores that don't appear under the specified processors).
Next, the user may specify ranges, as in "give me any 1 workgroup, any 1 NUMA node, and any 3 CPUs", for example. In this case, the framework should return the 3 CPUs with the lowest assignment. This is a filter & sort process.
Again, the user may specify his filter to any level.
The user could also simply specify nothing, which means the framework must return the hardware info, but sorted according to the load assignment at each level.
The process is always filter & sort, regardless of what the user specifies. The only difference is the user may specify a range, a count, or nothing.
To begin this process, I get the raw hardware data filtered according to the info supplied by the user. This comes back as a flattened enumeration of object {L1, L2, L3, L4, L5} for each L5 object.
Next, I do the following:
IEnumerable<KeyValuePair<int, double>> wgSub;
IEnumerable<KeyValuePair<int, double>> nnSub;
IEnumerable<KeyValuePair<int, double>> cpSub;
IEnumerable<KeyValuePair<int, double>> coSub;
wgSub = (
from n in query
group n by n.L1.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L1.Assignment))
)
.OrderBy(o => o.Value);
nnSub = (
from n in query
group n by n.L2.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L2.Assignment))
)
.OrderBy(o => o.Value);
cpSub = (
from n in query
group n by n.L3.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L3.Assignment))
)
.OrderBy(o => o.Value);
coSub = (
from n in query
group n by n.L4.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L4.Assignment))
)
.OrderBy(o => o.Value);
query = (
from n in query
join wgj in wgSub on n.L1.ID equals wgj.Key
join nnj in nnSub on n.L2.ID equals nnj.Key
join cpj in cpSub on n.L3.ID equals cpj.Key
join coj in coSub on n.L4.ID equals coj.Key
select n
)
.OrderBy(o => o.L1.ID == wgSub.Key)
.ThenBy(o => o.L2.ID == nnSub.Key)
.ThenBy(o => o.L3.ID == cpSub.Key)
.ThenBy(o => o.L4.ID == coSub.Key);
Where I'm stuck is on the orderby (which will be 4 levels deep). I need to sort the input query by the ID in each sub-query, "thenby" the next, etc. What I wrote is not correct.
If the user specified a range or a count (both imply a quantity), I also need to implement a Take, possibly for each level.
I'm not super clear on what you're going for, but would it be something along these lines?
query = query
.OrderBy(n => wgSub.First(g => g.Key == n.L1.ID).Value)
.ThenBy(n => nnSub.First(g => g.Key == n.L2.ID).Value)
...
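In case it helps to see the same pattern outside LINQ, here is a one-level sketch in Java streams (Row, l1Id, and assignment are stand-ins for the real types): precompute each group's summed assignment, then sort rows by their group's total; the L2..L4 levels chain on in the same way.

```java
import java.util.*;
import java.util.stream.Collectors;

public class LoadSort {
    public record Row(int l1Id, double assignment) {}

    /** Sorts rows so that rows from the least-loaded L1 group come first. */
    public static List<Row> sortByGroupLoad(List<Row> rows) {
        // Group-by + sum: the equivalent of the wgSub sub-query
        Map<Integer, Double> groupLoad = rows.stream()
            .collect(Collectors.groupingBy(Row::l1Id,
                     Collectors.summingDouble(Row::assignment)));
        // Order rows by their group's total load; chain .thenComparing(...)
        // with the L2..L4 lookups for the deeper levels
        return rows.stream()
            .sorted(Comparator.comparingDouble((Row r) -> groupLoad.get(r.l1Id())))
            .collect(Collectors.toList());
    }
}
```

The key point, same as in the LINQ answer: the sort key is the looked-up group total, not a boolean comparison against the sub-query.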
In Pig Latin, I want to group twice, so as to select rows according to 2 different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to grab the specifications of the persons whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money).
B = GROUP A BY (address, zipcode); # group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age
-- Then group by again to select the richest; the group by fails:
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter on the nearest age and the richest at the same time, but it doesn't work, because the richest people are not necessarily those whose age is nearest to mine.
Another, more realistic example:
You have demands file like : iddem, idopedem, datedem
You have operations file like : idope,labelope,dateope,idoftheday,infope
I want to return operations that matches demands like :
idopedem matches idope.
The dateope must be the nearest with datedem.
If datedem - dateope >= 0, then I must select the operation with the max(idoftheday); else I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH E GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP C BY iddem;
F = FOREACH D GENERATE group, MIN(C.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data in the 2nd example (note: dates are already in Unix timestamp format):
You have demands file like :
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
You have operations file like :
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
Result must be like :
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
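To pin down the intended rules, here is the selection logic in plain Java (not Pig; the names are mine), which reproduces the expected result above. Note it tie-breaks with datedem - dateope >= 0, as the attempted script does: a strict > 0 would pick idoftheday 1 for demand 1 instead of the expected 2.

```java
import java.util.*;

public class DemandMatcher {
    public record Demand(int iddem, String idopedem, long datedem) {}
    public record Operation(String idope, String labelope, long dateope,
                            int idoftheday, String infope) {}

    /** Picks, for one demand, the matching operation under the rules above. */
    public static Operation bestFor(Demand d, List<Operation> ops) {
        // 1) inner match: idope == idopedem
        List<Operation> matched = ops.stream()
            .filter(o -> o.idope().equals(d.idopedem())).toList();
        // 2) keep only the operations nearest in date to the demand
        long minDelta = matched.stream()
            .mapToLong(o -> Math.abs(d.datedem() - o.dateope())).min().orElseThrow();
        List<Operation> nearest = matched.stream()
            .filter(o -> Math.abs(d.datedem() - o.dateope()) == minDelta).toList();
        // 3) tie-break: datedem - dateope >= 0 -> max idoftheday, else min
        boolean takeMax = d.datedem() - nearest.get(0).dateope() >= 0;
        Comparator<Operation> byDayId = Comparator.comparingInt(Operation::idoftheday);
        return takeMax ? nearest.stream().max(byDayId).orElseThrow()
                       : nearest.stream().min(byDayId).orElseThrow();
    }
}
```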
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgment. For example, with the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, modify the formula. A simple quadratic factor likely won't be sufficient, but with a little experimentation you can hit upon something that works for you.
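Checking the arithmetic behind those comparisons (a quick sanity check of the formula, nothing more):

```java
public class ScoreCheck {
    /** score = money / (1 + |my_age - age| / 10)^2, as in the Pig expression above. */
    static double score(double money, double ageDiff) {
        return money / Math.pow(1 + Math.abs(ageDiff) / 10.0, 2);
    }
}
```

100000 / 1 = 100000 beats 350000 / 4 = 87500, and 500000 / 9 ≈ 55556 beats 50000 / 1 = 50000, matching the stated preferences.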