How to traverse all the elements in a set? - set

I computed the symmetric difference (i.e., the union of two sets minus their intersection) of two sets, t1 and t2. The result is still a set. I used the following script to iterate over all the elements in the result, but an error is raised. Where is the problem?
t1 = set(1..23)
t2 = set(2..34)
result = t1^t2
for(i in result){
print(i)
}
Error message:
result => set doesn't support random access.

A set is a collection of unique elements and does not support random access, so it cannot be traversed directly with a for loop. Convert the set to a vector with keys() before you proceed:
t1 = set(1..23)
t2 = set(2..34)
result = t1^t2
vecResult = result.keys()
print(vecResult);
[24,25,26,27,28,29,1,30,31,32,33,34]
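For comparison, here is a rough Python analogue of the same operation. It assumes 1..23 denotes an inclusive range, and the name vec_result is only illustrative. Python sets can be iterated directly, unlike the set in the script above, so this only shows what t1^t2 computes and why the output has 12 elements.

```python
# Python analogue of the script above (assumption: 1..23 is an inclusive range).
t1 = set(range(1, 24))   # 1..23
t2 = set(range(2, 35))   # 2..34
result = t1 ^ t2         # symmetric difference: union minus intersection

# sorted() returns a list, the counterpart of the vector returned by keys(),
# with a predictable order; the set itself has no defined order.
vec_result = sorted(result)
for i in vec_result:
    print(i)
```

The elements are 1 and 24 through 34, the same values as in the vector above, only in sorted order.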

Related

DAX IF measure - return fixed value

This should be a very simple requirement, but it seems impossible to implement in DAX.
Data model: a User lookup table joined to many "Cards" rows linked to each user.
I have a measure set up to count rows in CardUser. That is working fine.
<measureA> = count rows in CardUser
I want to create a new measure,
<measureB> = IF(User.boolean = 1,<measureA>, 16)
If User.boolean = 1, I want to return a fixed value of 16. Effectively, bypassing measureA.
I can't simply put User.boolean = 1 in the IF condition; that throws an error.
I can modify measureA itself to return 0 if User.boolean = 1:
<measureA> =
CALCULATE (
COUNTROWS(CardUser),
FILTER ( User.boolean != 1 )
)
This works, but I still can't find a way to return 16 ONLY if User.boolean = 1.
That's easy in DAX, you just need to learn "X" functions (aka "Iterators"):
Measure B =
SUMX( VALUES(User.boolean),
IF(User.Boolean, [Measure A], 16))
The VALUES function generates a list of the distinct User.boolean values (1 and 0 in this case). SUMX then iterates over this list and applies the IF logic to each value.
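To make the iteration concrete, here is a small Python sketch of what the SUMX / VALUES / IF combination does. It is not DAX and simplifies the model: the rows, the field names and the helpers measure_a and measure_b are hypothetical, and each card row is assumed to already carry its user's flag.

```python
# Hypothetical data: each card row carries its user's boolean flag.
card_user = [
    {"user": "a", "boolean": 1},
    {"user": "a", "boolean": 1},
    {"user": "b", "boolean": 0},
]

def measure_a(rows):
    """Counterpart of [Measure A]: count the rows visible in the current filter."""
    return len(rows)

def measure_b(rows):
    """Counterpart of SUMX(VALUES(User.boolean), IF(User.Boolean, [Measure A], 16))."""
    total = 0
    for flag in {r["boolean"] for r in rows}:                 # VALUES: distinct flags
        visible = [r for r in rows if r["boolean"] == flag]   # restrict to this flag
        total += measure_a(visible) if flag else 16           # IF applied per flag
    return total

print(measure_b(card_user))
```

When only one flag value is visible (for example in a visual filtered to a single user), the loop runs once and returns either the count or the fixed 16.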

GLua - Getting the difference between two tables

Disclaimer: This is Glua (Lua used by Garry's Mod)
I just need to compare two tables and return the difference, as if I were subtracting them.
TableOne = {thing = "bob", [89] = 1, [654654] = {"hi"}} --Around 3k items like that
TableTwo = {thing = "bob", [654654] = "hi"} --Same, around 3k
function table.GetDifference(t1, t2)
    local diff = {}
    for k, dat in pairs(t1) do --Loop through the biggest table
        if(!table.HasValue(t2, t1[k])) then --Checking if t2 hasn't the value
            table.insert(diff, t1[k]) --Insert the value in the difference table
            print(t1[k])
        end
    end
    return diff
end
if table.Count(TableOne) != table.Count(TableTwo) then --Check if amount is equal, in my use I don't need to check if they are exact.
    PrintTable(table.GetDifference(TableOne, TableTwo)) --Print the difference.
end
My problem is that with only one difference between the two tables, this returns more than 200 items. The only item I added was a string. I tried many other functions like this one, but they usually cause a stack overflow error because of the table's length.
Your problem is with this line:
if(!table.HasValue(t2, t1[k])) then --Checking if t2 hasn't the value
Change it to this:
if(t2[k] == nil or t1[k] != t2[k]) then --Checking if t2[k] matches
Right now, for an entry like thing = "bob", you are checking whether t2 contains the value "bob" anywhere, under any key, instead of checking whether t2 holds the same value under the same key. table.HasValue searches values, not keys, so the key has to be checked by indexing t2[k] directly. Note that nested tables are compared by reference, so two separate tables with the same contents still count as different.
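The key-by-key comparison can be sketched in Python as follows. This is only an illustration, not GLua: get_difference is a hypothetical helper, dicts stand in for Lua tables, and Python compares nested lists by content whereas Lua compares tables by reference.

```python
def get_difference(t1, t2):
    """Return the entries of t1 whose key is missing from t2 or maps to a different value."""
    return {k: v for k, v in t1.items() if k not in t2 or t2[k] != v}

# Sample data modelled on the tables in the question.
table_one = {"thing": "bob", 89: 1, 654654: ["hi"]}
table_two = {"thing": "bob", 654654: "hi"}

print(get_difference(table_one, table_two))
```

Because each key is looked up directly, the cost is one lookup per entry of t1 rather than a scan of every value in t2 for each entry.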

Iterate over table in order of value

Let's say I have a table like so:
{
value = 4
},
{
value = 3
},
{
value = 1
},
{
value = 2
}
and I want to iterate over this and print the value in order so the output is like so:
1
2
3
4
How do I do this? I understand how to use ipairs, pairs, and table.sort, but as far as I can tell that only works when the entries were added with table.insert and the keys are valid. I need to loop over this in order of the value field.
I tried a custom function, but it simply printed them in the wrong order.
I have tried:
Creating an index and looping that
Sorting the table (throws error: attempt to perform __lt on table and table)
And a combination of sorts, indexes and other tables that not only didn't work, but also made it very complicated.
I am well and truly stumped.
"Sorting the table"
This was the right solution.
"(throws error: attempt to perform __lt on table and table)"
Sounds like you tried to use a < b.
For Lua to be able to sort values, it has to know how to compare them. It knows how to compare numbers and strings, but by default it has no idea how to compare two tables. Consider this:
local people = {
{ name = 'fred', age = 43 },
{ name = 'ted', age = 31 },
{ name = 'ned', age = 12 },
}
If I call sort on people, how can Lua know what I intend? It doesn't know what 'age' or 'name' means or which I'd want to use for comparison. I have to tell it.
It's possible to add a metatable to a table which tells Lua what the < operator means for a table, but you can also supply sort with a callback function that tells it how to compare two objects.
You supply sort with a function that receives two values and you return whether the first is "less than" the second, using your knowledge of the tables. In the case of your tables:
table.sort(t, function(a,b) return a.value < b.value end)
for i,entry in ipairs(t) do
print(i,entry.value)
end
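The same idea, telling the sort which field to compare, looks like this in Python. It is a sketch for comparison only, with dicts standing in for the Lua tables; Python dicts do not define < either, so a key function plays the role of the comparison callback.

```python
# Same data as in the question, as a list of dicts.
t = [{"value": 4}, {"value": 3}, {"value": 1}, {"value": 2}]

# Sort in place (like table.sort) and tell the sort which field to compare.
t.sort(key=lambda entry: entry["value"])

for i, entry in enumerate(t, start=1):
    print(i, entry["value"])
```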
If you want to leave the original table unchanged, you could create a custom 'sort by value' iterator like this:
local function valueSort(a,b)
    return a.value < b.value;
end
function sortByValue( tbl ) -- use as iterator
    -- build new table to sort
    local sorted = {};
    for i,v in ipairs( tbl ) do sorted[i] = v end;
    -- sort new table
    table.sort( sorted, valueSort );
    -- return iterator
    return ipairs( sorted );
end
When sortByValue() is called, it clones tbl to a new sorted table, and then sorts the sorted table. It then hands the sorted table over to ipairs(), and ipairs outputs the iterator to be used by the for loop.
To use:
for i,v in sortByValue( myTable ) do
print(v)
end
While this ensures your original table remains unaltered, it has the downside that each time you do an iteration the iterator has to clone myTable to make a new sorted table, and then table.sort that sorted table.
If performance is vital, you can greatly speed things up by 'caching' the work done by the sortByValue() iterator. Updated code:
local resort, sorted = true;
local function valueSort(a,b)
    return a.value < b.value;
end
function sortByValue( tbl ) -- use as iterator
    if not sorted then -- rebuild sorted table
        sorted = {};
        for i,v in ipairs( tbl ) do sorted[i] = v end;
        resort = true;
    end
    if resort then -- sort the 'sorted' table
        table.sort( sorted, valueSort );
        resort = false;
    end
    -- return iterator
    return ipairs( sorted );
end
Each time you add or remove an element to/from myTable set sorted = nil. This lets the iterator know it needs to rebuild the sorted table (and also re-sort it).
Each time you update a value property within one of the nested tables, set resort = true. This lets the iterator know it has to do a table.sort.
Now, when you use the iterator, it will try and re-use the previous sorted results from the cached sorted table.
If it can't find the sorted table (eg. on first use of the iterator, or because you set sorted = nil to force a rebuild) it will rebuild it. If it sees it needs to resort (eg. on first use, or if the sorted table was rebuilt, or if you set resort = true) then it will resort the sorted table.
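The two flags described above can be sketched in Python like this. It is an illustration of the caching idea only: SortedView, invalidate, mark_dirty and the sort_calls counter are hypothetical names, and the counter exists just to make the caching observable.

```python
class SortedView:
    """Caches a sorted copy of `items`; the caller signals when the cache is stale."""

    def __init__(self, items):
        self.items = items
        self._sorted = None    # None means "rebuild the copy" (sorted = nil)
        self._resort = True    # True means "sort again" (resort = true)
        self.sort_calls = 0    # only here to show how often a sort really happens

    def invalidate(self):
        """Call after adding or removing an element."""
        self._sorted = None

    def mark_dirty(self):
        """Call after changing a value inside an element."""
        self._resort = True

    def __iter__(self):
        if self._sorted is None:           # rebuild the cached copy
            self._sorted = list(self.items)
            self._resort = True
        if self._resort:                   # sort only when needed
            self._sorted.sort(key=lambda e: e["value"])
            self._resort = False
            self.sort_calls += 1
        return iter(self._sorted)


data = [{"value": 4}, {"value": 3}, {"value": 1}, {"value": 2}]
view = SortedView(data)
for entry in view:
    print(entry["value"])
```

Iterating view a second time reuses the cached list without sorting again, and the original data list keeps its order.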

hadoop cascading how to get top N tuples

I'm new to Cascading and trying to find a way to get the top N tuples based on a sort order. For example, I'd like to know the top 100 first names people are using.
Here's how I can do something similar in Teradata SQL:
select top 100 first_name, num_records
from
(select first_name, count(1) as num_records
from table_1
group by first_name) a
order by num_records DESC
Here's something similar in Hadoop Pig:
a = load 'table_1' as (first_name:chararray, last_name:chararray);
b = foreach (group a by first_name) generate group as first_name, COUNT(a) as num_records;
c = order b by num_records DESC;
d = limit c 100;
It seems very easy to do in SQL or Pig, but I'm having a hard time finding a way to do it in Cascading. Please advise!
Assuming you just need the Pipe set up on how to do this:
In Cascading 2.1.6,
Pipe firstNamePipe = new GroupBy("topFirstNames", InPipe,
    new Fields("first_name"));
firstNamePipe = new Every(firstNamePipe, new Fields("first_name"),
    new Count(new Fields("num_records")), Fields.ALL);
firstNamePipe = new GroupBy(firstNamePipe,
    new Fields("first_name"),
    new Fields("num_records"),
    true); // where true is descending order
firstNamePipe = new Every(firstNamePipe, new Fields("first_name", "num_records"),
    new First(Fields.ARGS, 100), Fields.ALL);
Here InPipe is formed from your incoming tap, which holds the tuple data you reference above, namely "first_name". The "num_records" field is created when new Count() is called.
If you have the "num_records" and "first_name" data in separate taps (tables or files), then you can set up two pipes that point to those two Tap sources and join them using CoGroup.
The definitions I used are from Cascading 2.1.6:
GroupBy(String groupName, Pipe pipe, Fields groupFields, Fields sortFields, boolean reverseOrder)
Count(Fields fieldDeclaration)
First(Fields fieldDeclaration, int firstN)
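As a plain illustration of the steps the SQL, Pig and pipe versions share (group by first name, count, sort descending, keep the first N), here is a small Python sketch. It is not Cascading code; top_first_names and the sample records are hypothetical.

```python
from collections import Counter

def top_first_names(records, n):
    """Group by first_name, count, sort by count descending, keep the first n."""
    counts = Counter(r["first_name"] for r in records)
    return counts.most_common(n)

# Hypothetical sample data standing in for table_1.
records = [
    {"first_name": "ann", "last_name": "a"},
    {"first_name": "bob", "last_name": "b"},
    {"first_name": "ann", "last_name": "c"},
    {"first_name": "cy", "last_name": "d"},
    {"first_name": "ann", "last_name": "e"},
    {"first_name": "bob", "last_name": "f"},
]

print(top_first_names(records, 2))
```

Note that the limit is applied once, over the whole sorted result, not per first name.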
Method 1
Use a GroupBy to group on the required columns and make use of the secondary sorting that Cascading provides. By default it sorts in ascending order; if you want descending order, set reverseOrder.
To get the top N tuples (rows):
It's quite simple: use a static counter variable in a Filter, increment it by 1 for each tuple, and check whether it is greater than N.
Return true when the counter is greater than N, otherwise return false.
This produces an output containing only the first N tuples.
Method 2
Cascading provides a built-in Unique assembly, which makes use of FirstNBuffer.
See the link below:
http://docs.cascading.org/cascading/2.2/javadoc/cascading/pipe/assembly/Unique.html

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
minsample = MIN(samples.samp);
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
minsample = 3.4F;
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but it results (in the real case) in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
minsamples = ORDER samples BY samp;
minsample = LIMIT minsamples 1;
maxsamples = ORDER samples BY samp DESC;
maxsample = LIMIT maxsamples 1;
GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
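To show what the nested FOREACH is expected to produce, here is a small Python sketch that keeps the whole record for the minimum and the maximum of each day. It is an illustration only, using the three sample rows from the question; daily_extremes is a hypothetical helper.

```python
from collections import defaultdict

# (day, time, samp) rows from the question.
samples = [(1, 1, 4.5), (1, 2, 3.4), (1, 5, 5.6)]

def daily_extremes(rows):
    """Map each day to (record with the minimum samp, record with the maximum samp)."""
    by_day = defaultdict(list)
    for row in rows:                  # GROUP samples BY day
        by_day[row[0]].append(row)
    return {
        day: (min(group, key=lambda r: r[2]), max(group, key=lambda r: r[2]))
        for day, group in by_day.items()
    }

print(daily_extremes(samples))
```

Selecting the whole record, rather than just the minimum value, is what preserves the time at which each extreme occurred.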
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the field used for the comparison (samp) is chosen in the DEFINE statement, by passing 3 as the first parameter.
