EDIT
I'm going to illustrate the exact problem I'm trying to solve. The simplified problem explanation wasn't working.
I'm writing a framework that requires me to assign threads to CPU cores based on load factor. Please let's not debate the point as to why I'm doing this.
When the framework boots, it forms a map of the following hardware:
Level 1: processor workgroups.
Level 2: NUMA nodes.
Level 3: processors (sockets).
Level 4: cores.
Level 5: logical processors (only applicable with SMT systems).
I represent this with a fairly complex 5-level hierarchy.
Users may query this hardware info. A user can specify nothing, the desired workgroup, the desired NUMA nodes, etc. down to level 4. In this case, the framework simply filters the full data set and returns only what matches the input, so long as it complies with the hierarchy (i.e. the user doesn't specify cores that don't appear under the specified processors).
Next, the user may specify ranges, as in "give me any 1 workgroup, any 1 NUMA node, and any 3 CPUs", for example. In this case, the framework should return the 3 CPUs with the lowest assignment. This is a filter & sort process.
Again, the user may specify his filter to any level.
The user could also simply specify nothing, which means the framework must return the hardware info, but sorted according to the load assignment at each level.
The process is always filter & sort, regardless of what the user specifies. The only difference is the user may specify a range, a count, or nothing.
To begin this process, I get the raw hardware data filtered according to what info is supplied by the user. This comes back as a flattened enumeration of objects {L1, L2, L3, L4, L5}, one for each L5 object.
Next, I do the following:
IEnumerable<KeyValuePair<int, double>> wgSub;
IEnumerable<KeyValuePair<int, double>> nnSub;
IEnumerable<KeyValuePair<int, double>> cpSub;
IEnumerable<KeyValuePair<int, double>> coSub;
wgSub = (
from n in query
group n by n.L1.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L1.Assignment))
)
.OrderBy(o => o.Value);
nnSub = (
from n in query
group n by n.L2.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L2.Assignment))
)
.OrderBy(o => o.Value);
cpSub = (
from n in query
group n by n.L3.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L3.Assignment))
)
.OrderBy(o => o.Value);
coSub = (
from n in query
group n by n.L4.ID into g
select new KeyValuePair<int, double>(g.Key, g.Sum(n => n.L4.Assignment))
)
.OrderBy(o => o.Value);
query = (
from n in query
join wgj in wgSub on n.L1.ID equals wgj.Key
join nnj in nnSub on n.L2.ID equals nnj.Key
join cpj in cpSub on n.L3.ID equals cpj.Key
join coj in coSub on n.L4.ID equals coj.Key
select n
)
.OrderBy(o => o.L1.ID == wgSub.Key)
.ThenBy(o => o.L2.ID == nnSub.Key)
.ThenBy(o => o.L3.ID == cpSub.Key)
.ThenBy(o => o.L4.ID == coSub.Key);
Where I'm stuck is on the orderby (which will be 4 levels deep). I need to sort the input query by the ID in each sub-query, "thenby" the next, etc. What I wrote is not correct.
If the user specified a range or a count (both imply a quantity), I also need to implement a Take, possibly for each level.
I'm not super clear on what you're going for, but would it be something along these lines?
query = query
.OrderBy(n => wgSub.First(g => g.Key == n.L1.ID).Value)
.ThenBy(n => nnSub.First(g => g.Key == n.L2.ID).Value)
...
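For readers who want to try the idea outside LINQ, here is a minimal standalone sketch in Java. It computes a load total per ID at each level, sorts the flattened rows by those totals level by level, and applies the equivalent of Take. The names HwRow, LoadOrder, totals and leastLoaded are hypothetical, and only two of the five levels are modeled; it is an illustration of the technique, not the framework's actual code.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.ToDoubleFunction;
import java.util.stream.Collectors;

// Hypothetical flattened row; only levels 1 and 2 are modeled here.
final class HwRow {
    final int l1Id;
    final double l1Load;
    final int l2Id;
    final double l2Load;

    HwRow(int l1Id, double l1Load, int l2Id, double l2Load) {
        this.l1Id = l1Id;
        this.l1Load = l1Load;
        this.l2Id = l2Id;
        this.l2Load = l2Load;
    }
}

final class LoadOrder {
    // Sum a load value per ID, mirroring the wgSub / nnSub sub-queries.
    static Map<Integer, Double> totals(List<HwRow> rows,
                                       Function<HwRow, Integer> id,
                                       ToDoubleFunction<HwRow> load) {
        return rows.stream()
                .collect(Collectors.groupingBy(id, Collectors.summingDouble(load)));
    }

    // Sort by level-1 total, then level-2 total, and keep the first `take` rows.
    static List<HwRow> leastLoaded(List<HwRow> rows, int take) {
        Map<Integer, Double> wg = totals(rows, r -> r.l1Id, r -> r.l1Load);
        Map<Integer, Double> nn = totals(rows, r -> r.l2Id, r -> r.l2Load);
        return rows.stream()
                .sorted(Comparator.<HwRow>comparingDouble(r -> wg.get(r.l1Id))
                        .thenComparingDouble(r -> nn.get(r.l2Id)))
                .limit(take)
                .collect(Collectors.toList());
    }
}
```

Looking the totals up in a map avoids re-scanning the sub-query for every row, which is what First(...) inside OrderBy would do on an in-memory sequence.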
I'm new to Cascading and trying to find a way to get the top N tuples based on a sort order. For example, I'd like to know the top 100 first names people are using.
Here's how I can do something similar in Teradata SQL:
select top 100 first_name, num_records
from
(select first_name, count(1) as num_records
from table_1
group by first_name) a
order by num_records DESC
Here's something similar in Hadoop Pig:
a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;
It seems very easy to do in SQL or Pig, but I'm having a hard time finding a way to do it in Cascading. Please advise!
Assuming you just need the Pipe set up on how to do this:
In Cascading 2.1.6,
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
new Fields("first_name"));
// Count the tuples in each first_name group; emits (first_name, num_records)
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
new Count(new Fields("num_records")), Fields.ALL);
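// Group on Fields.NONE so all tuples land in one group and the sort on num_records is global
// (grouping on first_name again would leave one tuple per group and nothing to rank)
firstNamePipe = new GroupBy(firstNamePipe,
Fields.NONE,
new Fields("num_records"),
true); // true sorts in descending order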
// Keep only the first 100 tuples of each sorted group
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
new First(Fields.ARGS, 100), Fields.ALL);
Where InPipe is formed with your incoming tap that holds the tuple data that you are referencing above. Namely, "first_name". "num_records" is created when new Count() is called.
If you have the "num_records" and "first_name" data in separate taps (tables or files) then you can set up two pipes that point to those two Tap sources and join them using CoGroup.
The definitions I used are from Cascading 2.1.6:
GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)
Count(Fields fieldDeclaration)
First(Fields fieldDeclaration, int firstN)
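To make the intent of the pipe assembly concrete, here is a minimal plain-Java sketch of the same computation on an in-memory list: count per first name, sort by count descending, keep the first N. It does not use Cascading at all. The class and method names (TopNames, topN) are hypothetical, and the tie-break on the name is an assumption added only to make the order deterministic.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class TopNames {
    // Returns up to n (first_name, num_records) pairs, highest count first.
    static List<Map.Entry<String, Long>> topN(List<String> firstNames, int n) {
        return firstNames.stream()
                // GROUP BY first_name, COUNT(*)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet().stream()
                // ORDER BY num_records DESC, then by name to break ties
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                // TOP n / LIMIT n
                .limit(n)
                .collect(Collectors.toList());
    }
}
```

The three stream stages correspond to the first GroupBy plus Count, the sorting GroupBy, and First in the pipe assembly.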
Method 1
Use a GroupBy on the required columns and take advantage of the secondary sorting that Cascading provides. By default it sorts in ascending order; for descending order, set the reverseOrder flag.
To get the top N tuples or rows:
It's quite simple: keep a static counter in a Filter and increment it by 1 for each tuple, then check whether it is greater than N.
Return true (which removes the tuple) when the counter is greater than N, otherwise return false.
This leaves only the first N tuples in the output.
Method 2
Cascading provides a built-in Unique assembly, which is built on FirstNBuffer.
See the link below:
http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html
In Pig Latin, I want to group twice, so as to select rows according to two different criteria.
I'm having trouble explaining the problem, so here is an example. Let's say I want to get the details of the people whose age is nearest to mine ($my_age) and who have a lot of money.
Relation A has five columns: (name, address, zipcode, age, money)
B = GROUP A BY (address, zipcode); -- group by the address
-- generate the address, the person's age ...
C = FOREACH B GENERATE group, MIN($my_age - age) AS min_age, FLATTEN(A);
D = FILTER C BY min_age == age
--Then group by as to select the richest, group by fails :
E = GROUP D BY group; or E = GROUP D BY (address, zipcode);
-- The end would work
D = FOREACH E GENERATE group, MAX(money) AS max_money, FLATTEN(A);
F = FILTER C BY max_money == money;
I've tried to filter on the nearest age and the most money at the same time, but it doesn't work, because the richest people may be much older than I am.
Another, more realistic example is:
You have demands file like : iddem, idopedem, datedem
You have operations file like : idope,labelope,dateope,idoftheday,infope
I want to return the operations that match demands, such that:
idopedem matches idope.
The dateope must be the nearest to datedem.
If datedem - dateope > 0, then I must select the operation with the max(idoftheday), else I must select the operation with the min(idoftheday).
Relation A is 5 columns (idope,labelope,dateope,idoftheday,infope)
Relation B is 3 columns (iddem, idopedem, datedem)
C = JOIN A BY idope, B BY idopedem;
D = FOREACH E GENERATE iddem, idope, datedem, dateope, ABS(datedem - dateope) AS datedelta, idoftheday, infope;
E = GROUP C BY iddem;
F = FOREACH D GENERATE group, MIN(C.datedelta) AS deltamin, FLATTEN(D);
G = FILTER F BY deltamin == datedelta;
--Then I must group by another time as to select the min or max idoftheday
H = GROUP G BY group; --Does not work when dump
H = GROUP G BY iddem; --Does not work when dump
I = FOREACH H GENERATE group, (datedem - dateope >= 0 ? max(idoftheday) as idofdaysel : min(idoftheday) as idofdaysel), FLATTEN(D);
J = FILTER F BY idofdaysel == idoftheday;
DUMP J;
Data for the 2nd example (note that dates are already in Unix format):
You have demands file like :
1, 'ctr1', 1359460800000
2, 'ctr2', 1354363200000
You have operations file like :
idope,labelope,dateope,idoftheday,infope
'ctr0','toto',1359460800000,1,'blabla0'
'ctr0','tata',1359460800000,2,'blabla1'
'ctr1','toto',1359460800000,1,'blabla2'
'ctr1','tata',1359460800000,2,'blabla3'
'ctr2','toto',1359460800000,1,'blabla4'
'ctr2','tata',1359460800000,2,'blabla5'
'ctr3','toto',1359460800000,1,'blabla6'
'ctr3','tata',1359460800000,2,'blabla7'
Result must be like :
1, 'ctr1', 'tata',1359460800000,2,'blabla3'
2, 'ctr2', 'toto',1359460800000,1,'blabla4'
Sample input and output would help greatly, but from what you have posted it appears to me that the problem is not so much in writing the Pig script but in specifying what exactly it is you hope to accomplish. It's not clear to me why you're grouping at all. What is the purpose of grouping by address, for example?
Here's how I would solve your problem:
First, design an optimization function that will induce an ordering on your dataset that reflects your own prioritization of money vs. age. For example, to severely penalize large age differences but prefer more money with small ones, you could try:
scored = FOREACH A GENERATE *, money / POW(1+ABS($my_age-age)/10.0, 2) AS score; -- 10.0 avoids integer division when age is an int
ordered = ORDER scored BY score DESC;
top10 = LIMIT ordered 10;
That gives you the 10 best people according to your optimization function.
Then the only work is to design a function that matches your own judgments. For example, in the function I chose, a person with $100,000 who is your age would be preferred to someone with $350,000 who is 10 years older (or younger). But someone with $500,000 who is 20 years older or younger is preferred to someone your age with just $50,000. If either of those doesn't fit your intuition, then modify the formula. Likely a simple quadratic factor won't be sufficient. But with a little experimentation you can hit upon something that works for you.
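The comparisons above can be checked with a few lines of arithmetic. The sketch below re-implements the same formula in plain Java; the names Score and score are hypothetical, and a reference age of 30 is an arbitrary assumption used only for the example.

```java
final class Score {
    // money / (1 + |myAge - age| / 10)^2, the same formula as in the Pig snippet.
    static double score(double money, int age, int myAge) {
        return money / Math.pow(1 + Math.abs(myAge - age) / 10.0, 2);
    }
}
```

With myAge = 30: score(100000, 30, 30) is 100000 while score(350000, 40, 30) is 350000 / 4 = 87500, so the same-age person wins. score(500000, 50, 30) is 500000 / 9, about 55556, which beats score(50000, 30, 30) = 50000.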
I'm porting some SQL stored procedure logic, which would return multiple tables in a dataset, to Entity Framework strongly typed objects, queried with LINQ.
Basically I need the data from tables A, B, and C, where C has a foreign key to B, and B has a foreign key to A. But I don't want every C with a FK to B, just the C's with a certain constraint X.
So the stored proc basically said
TableA = select from A where A.AID = AIDPassedIn
TableB = select from B where B.AID = AIDPassedIn
TableC = select from TableB where TableB.XID = XIDPassedIn
return new DataSet(TableA, TableB, TableC);
//yes this is gross and confusing, thus our current efforts
Entity framework almost makes this super easy like so
A.Include("B.C").Where(a => a.AID == AIDPassedIn)
My only problem is that this doesn't include constraint X on the C table. I've read a bunch of articles, but everything I've read suggests things I could add to the where clause, and that would filter which A objects I end up with. I should only end up with one A object though, regardless of the properties of its children. What I want is the A with AIDPassedIn, and all its child B's, and all the B's children C that match constraint X.
I feel like this is one of my worst phrased questions ever but I'm at a bit of a block. Any help would be great thanks!
You can try it along the lines of the following:
var AList = context.As.Where(a => a.AID == AIDPassedIn)
.Select(a => new
{
A = a,
Bs = a.Bs,
Cs = a.Bs.Select(b => b.Cs.Where(c => c.XID == XIDPassedIn))
})
.AsEnumerable()
.Select(x => x.A)
.ToList(); // or SingleOrDefault if AIDPassedIn is the PK
Entity Framework will put the object graph together automatically (even without using Include) as long as you don't disable change tracking.
I am trying to compare two tables (i.e. values, count, etc.) in LINQ to SQL but I can't figure out how to achieve it. I tried the following,
Table1.Any(i => i.itemNo == Table2.itemNo)
It gives an error. Could you please help me?
Thanks in Advance.
how about
var isDifferent =
Table1.Zip(Table2, (j, k) => j.itemNo == k.itemNo).Any(m => !m);
EDIT
if Linq-To-Sql does not support Zip.
var one = Table1.ToList();
var two = Table2.ToList();
var isDifferent =
one.Zip(two, (j, k) => j.itemNo == k.itemNo).Any(m => !m);
If the tables are very large this could cause performance problems. In that case you will need a much more sophisticated solution; if so, please ask.
EDIT2
If the tables are very large you don't want to get all the data from the server and hold it in memory. Additionally, Linq and SQL Server do not guarantee the order of the rows unless you specify an order in the query. This becomes especially relevant for large result sets returned by a multi-processor server, where the effects of parallelism are likely to come into play.
I suggest that Linq-to-Sql doesn't really cater well for your scenario, so you will have to help it out using ExecuteQuery, something like this.
string zipQuery =
@"SELECT
CASE WHEN
EXISTS (
SELECT * FROM [Table1] [one]
WHERE NOT EXISTS (
SELECT * FROM [Table2] [two] WHERE [two].[itemNo] = [one].[itemNo]
)
)
OR EXISTS (
SELECT * FROM [Table2] [two]
WHERE NOT EXISTS (
SELECT * FROM [Table1] [one] WHERE [one].[itemNo] = [two].[itemNo]
)
)
THEN 1 ELSE 0 END";
// The query always returns exactly one row, so Single() cannot throw here.
var isDifferent = context.ExecuteQuery<int>(zipQuery).Single() == 1;
This will do the select on the server without returning lots of data to the client but, I think you will agree, it is much more complicated.
EDIT3
Okay, the zip approach should be fine for 1000 rows. I've read your comment and I suggest changing the code accordingly.
var one = Table1.ToList();
var two = Table2.ToList();
var isDifferent =
one.Count != two.Count ||
one.Zip(two, (o, t) => o.itemNo == t.itemNo).Any(m => !m);
You should probably consider putting an order by on the list retrievers, like this.
var one = Table1.OrderBy(o => o.itemNo).ToList();
Strictly, the results of a Linq-to-Sql query come back in any order unless an order is specified.
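The order-then-compare logic is independent of LINQ, so here is a minimal standalone Java sketch of it on plain lists of item numbers. TableCompare and isDifferent are hypothetical names, and the lists stand in for the itemNo columns already loaded into memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TableCompare {
    // True when the two lists differ in length or in any item after sorting.
    static boolean isDifferent(List<Integer> one, List<Integer> two) {
        if (one.size() != two.size()) {
            return true; // cheap check first, as with one.Count != two.Count
        }
        // Sort copies so the pairwise comparison does not depend on retrieval order.
        List<Integer> a = new ArrayList<>(one);
        List<Integer> b = new ArrayList<>(two);
        Collections.sort(a);
        Collections.sort(b);
        return !a.equals(b);
    }
}
```

Sorting both sides plays the role of the OrderBy on each list retriever: without it, two tables holding the same rows in a different order would be reported as different.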
I want a list of counts for some of my data (count the number of open/closed tasks etc.). I want to get all counts inside 1 query, so I am not sure what to do with my LINQ statement below...
_user is an object that returns info about the currently logged-on user
_repo is an object that returns an IQueryable of whichever table I want to select
var counters = (from task in _repo.All<InstructionTask>()
where task.AssignedToCompanyID == _user.CompanyID || task.CompanyID == _user.CompanyID
join instructions in _repo.GetAllMyInstructions(_user) on task.InstructionID equals
instructions.InstructionID
group new {task, instructions}
by new
{
task
}
into g
select new
{
TotalEveryone = g.Count(),
TotalMine = g.Count(),
TotalOpen = g.Count(x => x.task.IsOpen),
TotalClosed = g.Count(c => !c.task.IsOpen)
}).SingleOrDefault();
Do I convert my object to single or default? The exception I am getting is, this sequence contains more than one element
Note: I want overall stats, not for each task, but for all tasks - not sure how to get that?
You need to dump everything into a single group, and use a regular Single. I am not sure if LINQ-to-SQL would be able to translate it correctly, but it's definitely worth a try.
var counters = (from task in _repo.All<InstructionTask>()
where task.AssignedToCompanyID == _user.CompanyID || task.CompanyID == _user.CompanyID
join instructions in _repo.GetAllMyInstructions(_user) on task.InstructionID equals instructions.InstructionID
group task by 1 /* <<=== All tasks go into one group */ into g select new {
TotalEveryone = g.Count(),
TotalMine = g.Count(), // <<=== You probably need a condition here
TotalOpen = g.Count(x => x.IsOpen),
TotalClosed = g.Count(c => !c.IsOpen)
}).Single();
From MSDN
Returns the only element of a sequence, or a default value if the
sequence is empty; this method throws an exception if there is more
than one element in the sequence.
You need to use FirstOrDefault. SingleOrDefault is designed for collections that contain exactly 1 element (or none).