I performed a symmetric difference (i.e., the union of two sets minus their intersection) on two sets, t1 and t2. The result is still a set. I used the following script to iterate over all the elements in the result, but it raises an error. Where is the problem?
t1 = set(1..23)
t2 = set(2..34)
result = t1^t2
for(i in result){
    print(i)
}
Error message:
result => set doesn't support random access.
A set is a collection of unique elements and does not support random access, so it cannot be iterated over directly with a for loop. Convert the set to a vector with keys() before you proceed:
t1 = set(1..23)
t2 = set(2..34)
result = t1^t2
vecResult = result.keys()
print(vecResult);
[24,25,26,27,28,29,1,30,31,32,33,34]
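For comparison, the same operation can be sketched in Python, whose built-in set also uses ^ for symmetric difference. This is only an illustration in another language, not DolphinDB code; the variable names mirror the script above, and sorted() is used here simply to turn the set into an ordered list.

```python
# Python analogue (not DolphinDB): symmetric difference of two integer sets.
t1 = set(range(1, 24))   # 1..23 inclusive
t2 = set(range(2, 35))   # 2..34 inclusive

result = t1 ^ t2         # elements that are in exactly one of the two sets

# Convert the set to a list to get an ordered sequence to work with.
vec_result = sorted(result)
print(vec_result)
```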
I'm trying to get a dataframe with all the different layers of a KML file. The code below gives a dataframe, but I also want the name of each KML layer in a column of data. Any idea what I'm doing wrong?
import fiona
import geopandas as gpd
import pandas as pd

gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
fp = "file.kml"
data = gpd.GeoDataFrame()
layers_list = pd.Series(fiona.listlayers(fp))
list(layers_list)
for l in layers_list:
    s = gpd.read_file(fp, driver='KML', layer=l)
    data = data.append(s, ignore_index=True)
    data['layers'] = l
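One likely cause, offered as a hedged guess: data['layers'] = l assigns a single value to the whole column, so every row ends up labelled with whichever layer was processed last. Tagging each layer's rows before combining them avoids that. The sketch below uses plain lists of dicts as stand-ins for the per-layer GeoDataFrames; read_layer, FAKE_LAYERS and the sample rows are hypothetical and only mimic what gpd.read_file(fp, layer=l) would return.

```python
# Stand-in data: layer name -> rows. Purely illustrative, not real KML content.
FAKE_LAYERS = {
    "roads": [{"Name": "r1"}, {"Name": "r2"}],
    "parks": [{"Name": "p1"}],
}

def read_layer(layer):
    """Hypothetical stand-in for gpd.read_file(fp, driver='KML', layer=layer)."""
    return [dict(row) for row in FAKE_LAYERS[layer]]

def combine_layers(layer_names):
    combined = []
    for layer in layer_names:
        rows = read_layer(layer)
        for row in rows:
            row["layer"] = layer   # tag only the rows of this layer
        combined.extend(rows)      # then add them to the running result
    return combined

data = combine_layers(list(FAKE_LAYERS))
for row in data:
    print(row)
```

With GeoDataFrames the same idea would be to set the column on s (for example s['layer'] = l) before adding s to data.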
In this raw data we have info of baseball players, the schema is:
name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]
Using the following script we are able to list out players and the different positions they have played. How do we get a count of how many players have played a particular position?
E.G. How many players were in the 'Designated_hitter' position?
A single position can't appear multiple times in position bag for a player.
Pig Script and output for the sample data is listed below.
--pig script
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;
dump groupbyposition;
--dump groupbyposition (output of one position i.e Designated_hitter)
(Designated_hitter,{(Michael Young,Designated_hitter)})
From what I can tell you've already done all of the 'grunt' (Ha!, Pig joke) work. All that is left to do is use COUNT on the output of the GROUP BY. Something like:
groupbyposition = group pos by position ;
pos_count = FOREACH groupbyposition GENERATE group AS position, COUNT(pos) ;
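To make the FLATTEN, GROUP and COUNT steps concrete, here is a rough Python analogue. It is only an illustration of the data flow, not Pig code; apart from Michael Young at Designated_hitter, which appears in the question's output, the sample rows are made up.

```python
from collections import Counter

# (name, positions) pairs; only Michael Young / Designated_hitter comes from
# the question's output, the other rows are hypothetical.
players = [
    ("Michael Young", ["Designated_hitter", "Third_baseman"]),
    ("Player A", ["Designated_hitter"]),
    ("Player B", ["Catcher"]),
]

# FLATTEN: one (name, position) row per position a player has played.
pos = [(name, p) for name, positions in players for p in positions]

# GROUP BY position + COUNT: number of rows (players) per position.
pos_count = Counter(p for _, p in pos)

print(pos_count["Designated_hitter"])
```

Because a position cannot appear twice in one player's bag, counting rows per position is the same as counting players.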
Note: Using UDFs you may be able to get a more efficient solution. If you only care about counting a few specific positions, it should be more efficient to filter the position bag beforehand (this is why I said UDF; I forgot you could just use a nested FILTER). For example:
pos = FOREACH players {
    -- you can also add the DISTINCT that alexeipab points out here
    -- make sure to change position in the FILTER to dist!
    -- dist = DISTINCT position ;
    filt = FILTER position BY p MATCHES 'Designated_hitter|etc.' ;
    GENERATE name, FLATTEN(filt) ;
}
If none of the positions you want appear in position, the FILTER creates an empty bag. When an empty bag is FLATTENed, the row is discarded. This means you'll be FLATTENing bags of N or fewer elements (where N is the number of positions you want) instead of 7-15 (I didn't look at the data that closely), and the GROUP will run on significantly less data.
Notes: I'm not sure if this will be significantly faster (if at all). Also, using a UDF to perform the nested FILTER may be faster.
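The effect of filtering before flattening can be sketched in Python as follows. This is an analogue of the nested FILTER, not Pig code; the pattern and the sample players other than Michael Young are hypothetical. Like MATCHES, fullmatch requires the whole string to match.

```python
import re
from collections import Counter

# Hypothetical positions of interest.
wanted = re.compile(r"Designated_hitter|Catcher")

players = [
    ("Michael Young", ["Designated_hitter", "Third_baseman"]),
    ("Player A", ["Shortstop"]),                 # no wanted position
    ("Player B", ["Catcher", "First_baseman"]),
]

pos = []
for name, positions in players:
    # Nested FILTER: keep only the positions that fully match the pattern.
    filt = [p for p in positions if wanted.fullmatch(p)]
    # FLATTEN: an empty list contributes no rows, so that player is dropped.
    pos.extend((name, p) for p in filt)

pos_count = Counter(p for _, p in pos)
print(pos)
print(pos_count)
```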
You can use a nested DISTINCT to get the list of players and then count them.
players = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
pos = foreach players generate name, flatten(position) as position;
groupbyposition = group pos by position;
pos_count = foreach groupbyposition {
    names = pos.name;
    uniq = DISTINCT names;
    generate group, COUNT(uniq) as num, pos;
}
In Pig Latin, I want to group twice, so as to select rows by two different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the people who are closest to my age ($my_age) and have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age;
--Then group by as to select the richest, group by fails :
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried filtering on nearest age and richest at the same time, but that doesn't work, because the richest people may not be close to my age.
Another, more realistic example:
You have demands file like : iddem, idopedem, datedem
You have operations file like : idope,labelope,dateope,idoftheday,infope
I want to return the operations that match demands as follows:
idopedem matches idope.
The dateope must be the nearest to datedem.
If datedem - dateope >= 0, then I must select the operation with the max(idoftheday); otherwise I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH E GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP C BY iddem;
F = FOREACH D GENERATE group, MIN(C.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data for the 2nd example (note: dates are already Unix timestamps in milliseconds):
You have demands file like :
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
You have operations file like :
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
Result must be like :
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
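To pin down the selection rule independently of Pig, here is a minimal Python sketch of it applied to the sample data above. The function name select_operation is hypothetical, and the sketch assumes that all operations tied for nearest date fall on the same side of the demand date; it uses the >= 0 test from the script.

```python
# Data copied from the example: (iddem, idopedem, datedem)
demands = [
    (1, "ctr1", 1359460800000),
    (2, "ctr2", 1354363200000),
]
# (idope, labelope, dateope, idoftheday, infope)
operations = [
    ("ctr0", "toto", 1359460800000, 1, "blabla0"),
    ("ctr0", "tata", 1359460800000, 2, "blabla1"),
    ("ctr1", "toto", 1359460800000, 1, "blabla2"),
    ("ctr1", "tata", 1359460800000, 2, "blabla3"),
    ("ctr2", "toto", 1359460800000, 1, "blabla4"),
    ("ctr2", "tata", 1359460800000, 2, "blabla5"),
    ("ctr3", "toto", 1359460800000, 1, "blabla6"),
    ("ctr3", "tata", 1359460800000, 2, "blabla7"),
]

def select_operation(demand, operations):
    """Hypothetical helper: pick the single operation matching one demand."""
    iddem, idopedem, datedem = demand
    # Step 1: operations whose idope matches the demand's idopedem.
    matching = [op for op in operations if op[0] == idopedem]
    if not matching:
        return None
    # Step 2: keep only the operations closest in time to the demand.
    delta_min = min(abs(datedem - op[2]) for op in matching)
    nearest = [op for op in matching if abs(datedem - op[2]) == delta_min]
    # Step 3: choose by idoftheday; max if datedem - dateope >= 0, else min.
    # Assumption: every operation in `nearest` lies on the same side of datedem.
    pick = max if datedem - nearest[0][2] >= 0 else min
    best = pick(nearest, key=lambda op: op[3])
    return (iddem,) + best

results = [select_operation(d, operations) for d in demands]
for row in results:
    print(row)
```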
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10.0, 2) AS score;
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only remaining work is to design a function that matches your own judgment. For example, with the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, modify the formula. A simple quadratic factor likely won't be sufficient, but with a little experimentation you can hit upon something that works for you.
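The two comparisons above can be checked with a small Python version of the same formula. The value chosen for my_age is arbitrary, and the variable names are only for illustration.

```python
def score(money, age, my_age):
    """Money discounted quadratically by the age gap, measured in decades."""
    return money / (1 + abs(my_age - age) / 10) ** 2

my_age = 30  # hypothetical value standing in for $my_age

same_age_100k = score(100_000, 30, my_age)   # 100000 / 1
ten_off_350k = score(350_000, 40, my_age)    # 350000 / 4
twenty_off_500k = score(500_000, 50, my_age) # 500000 / 9
same_age_50k = score(50_000, 30, my_age)     # 50000 / 1

print(same_age_100k > ten_off_350k)     # first comparison in the text
print(twenty_off_500k > same_age_50k)   # second comparison in the text
```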
Scenario:
I have a database table that stores the hierarchy of another table's many-to-many relationship. An item can have multiple children and can also have more than one parent.
Items
------
ItemID (key)
Hierarchy
---------
MemberID (key)
ParentItemID (fk)
ChildItemID (fk)
Sample hierarchy:
Level1  Level2  Level3
X       A       A1
                A2
        B       B1
                X1
Y       C
I would like to group all of the child nodes by each parent node in the hierarchy.
Parent  Child
X       A1
        A2
        B1
        X1
A       A1
        A2
B       B1
        X1
Y       C
Notice how there are no leaf nodes in the Parent column, and how the Child column only contains leaf nodes.
Ideally, I would like the results to be in the form of IEnumerable<IGrouping<Item, Item>> where the key is a Parent and the group items are all Children.
Ideally, I would like a solution that the entity provider can translate into T-SQL, but if that is not possible then I need to keep round trips to a minimum.
I intend to Sum values that exist in another table joined on the leaf nodes.
Since you are always going to be returning ALL of the items in the table, why not just make a recursive method that gets all children for a parent and then use that on the in-memory Items:
partial class Items
{
    public IEnumerable<Item> GetAllChildren()
    {
        //recursively or otherwise get all the children (using the Hierarchy navigation property?)
    }
}
then:
var items =
    from item in Items.ToList()
    from child in item.GetAllChildren()
    group child by item;
Sorry for any syntax errors...
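The recursive idea can be sketched language-independently; here it is in Python over in-memory (parent, child) pairs. The edge list is transcribed from the sample hierarchy under the assumption that X1 is a child of B, which is what the expected output suggests; leaf_descendants is a hypothetical helper name.

```python
from collections import defaultdict

# (ParentItemID, ChildItemID) rows of the Hierarchy table.
# Assumption: X1 is read as a child of B, matching the expected output.
edges = [
    ("X", "A"), ("X", "B"),
    ("A", "A1"), ("A", "A2"),
    ("B", "B1"), ("B", "X1"),
    ("Y", "C"),
]

children = defaultdict(list)
for parent, child in edges:
    children[parent].append(child)

def leaf_descendants(item, seen=None):
    """Return the leaf items reachable from item, in first-visit order."""
    seen = set() if seen is None else seen
    leaves = []
    for child in children.get(item, []):
        if child in seen:        # an item may be reachable via several parents
            continue
        seen.add(child)
        if child in children:    # child is itself a parent, so recurse
            leaves.extend(leaf_descendants(child, seen))
        else:
            leaves.append(child)
    return leaves

# Only items that appear as a parent become group keys.
groups = {parent: leaf_descendants(parent) for parent in children}
for parent, leaves in groups.items():
    print(parent, leaves)
```

The seen set keeps an item with more than one parent from being listed twice under the same ancestor.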
Well, if the hierarchy is strictly 2 levels you can always union them and let LINQ sort out the SQL (it ends up being a single trip though it needs to be seen how fast it will run on your volume of data):
var hlist = from h in Hierarchies
            select new {h.Parent, h.Child};
var slist = from h in Hierarchies
            join h2 in hlist on h.Parent equals h2.Child
            select new {h2.Parent, h.Child};
hlist = hlist.Union(slist);
This gives you a flat IEnumerable<{Item, Item}> list, so if you want to group them you just follow on:
var glist = from pc in hlist.AsEnumerable()
            group pc.Child by pc.Parent into g
            select new { Parent = g.Key, Children = g };
I used AsEnumerable() here because grouping a Union goes beyond what the LINQ to SQL provider can translate. If you try it against IQueryable, it will run a basic Union for eligible parents and then do a round trip for every parent (which is what you want to avoid). Whether it's OK for you to use regular LINQ for the grouping is up to you; the same volume of data has to come through the pipe either way.
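As a rough illustration of what the union of the two queries contains, here is the same self-join written with Python sets. It is only an analogue of the LINQ above, and the pairs are transcribed from the question's sample under the assumption that X1 is a child of B.

```python
from collections import defaultdict

# (Parent, Child) rows; X1 under B is an assumption about the sample data.
hierarchies = {
    ("X", "A"), ("X", "B"),
    ("A", "A1"), ("A", "A2"),
    ("B", "B1"), ("B", "X1"),
    ("Y", "C"),
}

# hlist: the direct parent/child pairs.
hlist = set(hierarchies)

# slist: join each row h to a row h2 whose Child is h's Parent, which pairs
# every grandparent with its grandchildren.
slist = {
    (h2_parent, h_child)
    for (h_parent, h_child) in hierarchies
    for (h2_parent, h2_child) in hlist
    if h_parent == h2_child
}

combined = hlist | slist  # set union drops duplicate pairs

# Group children by parent, as the final LINQ query does.
glist = defaultdict(set)
for parent, child in combined:
    glist[parent].add(child)

for parent in sorted(glist):
    print(parent, sorted(glist[parent]))
```

Note that in this sketch X is grouped with its intermediate children A and B as well as the leaves, so restricting the Child column to leaf nodes would need an extra filtering step.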
EDIT: Alternatively you could build a view linking parent to all its children and use that view as a basis for tying Items. In theory this should allow you/L2S to group over it with a single trip.