Pig order preservation across RANK and in general - sorting

In the following code:
dsBodyStartStopSort =
order dsBodyStartStop
by terminal_id, point_date_stamp, point_time_stamp, event_code
;
dsBodyStartStopRank =
rank dsBodyStartStopSort
;
store dsBodyStartStopRank
into 'xStartStopSort.csv'
using PigStorage(';')
;
I know that if I don't do that RANK in the middle, the sort order will make it to the STORE command. And that is guaranteed by Pig.
And it appears from the testing I've done that doing RANK does not mess up the sort order--but is that guaranteed? I don't want to just be running on luck.
More generally, what is Pig's rule for preserving sort once it's done? Is it until some reduce event occurs? Will it work across FILTER? Certainly not GROUP? Just wondering if there is a well defined set of rules on when and how Pig guarantees or does not guarantee order.
To summarize: 1) Is order preserved across RANK? 2) How is order preserved generally?

The best piece of documentation I found on the topic:
Bags are disordered unless you explicitly apply a nested ORDER BY
operation as demonstrated below. A nested FOREACH will preserve
ordering, letting you order by one combination of fields then project
out just the values you'd like to concatenate.
From looking at unofficial examples and comments, I conclude the following:
If you do an order right before a rank, it should preserve the order. Personally I prefer to just use RANK xxx BY a,b,c; and only use ORDER afterwards if it is really needed.
If you do an order right before a LIMIT, it should feed LIMIT with the top lines. However the output would be sorted rather than in the original order.

Related

In SPSS Modeler (17/18), what are the criteria for evaluating ties encountered while sorting on a particular column using the sorting nugget?

I am sorting by a particular column using the sorting nugget in SPSS Modeler 17/18. However, I do not understand how ties are evaluated when values are repeated in the sorting column. None of the other columns have any sequence associated with them. Can someone shed some light on this?
I have attached an illustration here in which I am sorting on col3 (the Excel file is the original data). However, after sorting, none of the other columns (Key) seem to follow any sequence or order. How was the final data arrived at, then?
I have not been able to find any documentation to answer this question, but I believe that the order of ties after the sort is essentially random or at least determined by a number of factors that are outside of the user's control. Generally, I think it is determined by the order of the records in the source, but if you are querying a database or similar without specifying a sort order, you may see that the data will be sorted differently depending on the source system and it may even differ between each execution.
If your processing depends on the sort order of the data (including the order of the ties), the best approach is to specify the sort order in enough detail that ties cannot happen, for example by adding a unique column as the last sort key.
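As a language-neutral illustration (Python here, not SPSS Modeler), here is a minimal sketch of that advice. The rows and column names are made up. Sorting on col3 alone leaves tied rows in an order that depends on how the data arrived, while adding a unique key as a second sort field gives the same result for any arrival order. Python's sort happens to be stable; I am not assuming that SPSS Modeler's is.

```python
# Hypothetical rows of (key, col3); names and values are illustrative only.
rows_a = [("k3", 10), ("k1", 10), ("k2", 5)]
rows_b = [("k1", 10), ("k3", 10), ("k2", 5)]  # same data, different arrival order


def sort_by_col3(rows):
    # Sort on col3 only; tied rows keep whatever order they arrived in.
    return sorted(rows, key=lambda r: r[1])


def sort_by_col3_then_key(rows):
    # Sort on col3, then on the unique key, so no ties remain.
    return sorted(rows, key=lambda r: (r[1], r[0]))


# The col3-only sort differs between the two arrival orders.
print(sort_by_col3(rows_a) == sort_by_col3(rows_b))                    # False
# With the unique key as a tie-breaker, the result no longer depends on input order.
print(sort_by_col3_then_key(rows_a) == sort_by_col3_then_key(rows_b))  # True
```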

Fast algorithm for approximate lookup on multiple keys

I have formulated a solution to a problem where I am storing parameters in a set of tables, and I want to be able to look up the parameters based on multiple criteria.
For example, if criteria 1 and criteria 2 can each be either A or B, then I'd have four potential parameters - one for each combination A&A, A&B, B&A and B&B. For these sort of criteria I could concatenate the fields or something similar and create a unique key to look up each value quickly.
Unfortunately not all of my criteria are like this. Some of the criteria are numerical and I only care about whether or not a result sits above or below a boundary. That also wouldn't be a problem on its own - I could maybe use a binary search or something relatively quick to find the nearest key above or below my value.
My problem is I need to include a number of each in the same table. In other words, I could have three criteria - two with A/B entries, and one with less-than-x/greater-than-x type entries, where x is in no way fixed. So in this example I would have a table with 8 entries. I can't just do a binary search for the boundary because the closest boundary won't necessarily be applicable due to the other criteria. For example, if the first two criteria are A&B, then the closest boundary might be 100, but if the first two criteria are A&A, the closest boundary might be 50. If I want to look up A, A, 101, then I want it to recognise that 50 is the closest boundary that applies - not 100.
I have a procedure to do the lookup but it gets very slow as the tables get bigger - it basically goes through each criterion, checks if a match is still possible, and if so it looks at more criteria - if not, it moves on to check the next entry in the table. So in other words, my procedure requires cycling through the table entries one by one and checking for a match. I have tried to optimise that by ensuring the tables that are input to the procedure are as small as possible and by making sure it looks at the criteria that are least likely to match first (so that it checks each entry as quickly as possible) but it is still very slow.
The biggest tables are maybe 200 rows with about 10 criteria to check, but many are much smaller (maybe 10x5). The issue is that I need to call the procedure many times during my application, so algorithms with some initial overhead don't necessarily make things better. I do have some scope to change the format of the tables before runtime but I would like to keep away from that as much as possible (while recognising it may be the only way forward).
I've done quite a bit of research but I haven't had any luck. Does anyone know of any algorithms that have been designed to tackle this kind of problem? I was really hoping that there would be some clever hash function or something that means I won't have to cycle through the tables, but from my limited knowledge something like that would struggle here. I feel confident that I understand the problem well enough to gradually optimise the solution I have at the moment, but I want to be sure I've not missed a much better solution.
Apologies for the very long and abstract description of the problem - hopefully it's clear what I'm trying to do. I'll amend my question if it's unclear.
Thanks for any help.
This is basically what a query optimizer does in SQL land. There are fast, free, in-memory databases for exactly this purpose. Check out SQLite: https://www.sqlite.org/inmemorydb.html.
It sounds like you are doing what is called a 'full table scan' for each query, which is like the last resort for a query optimizer.
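To make the in-memory database suggestion concrete, here is a minimal sketch using Python's standard sqlite3 module. The table name, column names, and parameter values are hypothetical. I am also assuming that a row applies from its boundary upward until the next boundary, which is only one reading of the question. The composite index lets SQLite avoid a full table scan.

```python
import sqlite3

# In-memory database; it lives only as long as this connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE params (c1 TEXT, c2 TEXT, boundary REAL, value TEXT)")
conn.executemany(
    "INSERT INTO params VALUES (?, ?, ?, ?)",
    [
        ("A", "A", 50, "p_AA_50"),    # hypothetical parameter rows
        ("A", "A", 200, "p_AA_200"),
        ("A", "B", 100, "p_AB_100"),
    ],
)
# Composite index: equality columns first, then the range column.
conn.execute("CREATE INDEX idx_params ON params (c1, c2, boundary)")


def lookup(c1, c2, x):
    # Largest boundary at or below x among rows matching both categorical criteria.
    row = conn.execute(
        "SELECT value FROM params "
        "WHERE c1 = ? AND c2 = ? AND boundary <= ? "
        "ORDER BY boundary DESC LIMIT 1",
        (c1, c2, x),
    ).fetchone()
    return row[0] if row else None


print(lookup("A", "A", 101))  # p_AA_50
```

Whether this beats a hand-written loop for tables of 10 to 200 rows would need measuring, since each query carries some fixed overhead.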
As I understand it, you want to select entries by criteria like
A & not B & x1 >= lower_x1 & x1 < upper_x1 & x2 >= lower_x2 & x2 < upper_x2 & ...
The easiest way is to keep the entries sorted by each xi (i = 1, 2, ...) in separate sets, and to keep a separate 'world' (partition) for each combination of A, B, ...
The search works as follows:
1. Select the proper world by the Boolean criteria combination.
2. For each i, find the population of the lower_xi..upper_xi range in the corresponding set (this operation is O(log(N))).
3. Select the i for which the population is lowest.
4. While iterating over the instances in that lower_xi..upper_xi range, filter the results by checking the other upper/lower bound criteria (for all xj where j != i).
Note that this is a general solution. Of course, if you know some relation between your bounds, you may use a list sorted by the respective combination(s) of item values.
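A simplified sketch of the first two steps, assuming a single numeric criterion: partition the table by the categorical combination with a dict, then binary-search the sorted boundaries inside that partition. The function names and data are hypothetical, and "the row applies from its boundary upward" is my assumption about the intended semantics. With several numeric criteria you would still need the filtering pass described above.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(rows):
    """rows: iterable of (c1, c2, boundary, value) tuples.

    Returns {(c1, c2): (sorted_boundaries, values_in_the_same_order)}.
    """
    parts = defaultdict(list)
    for c1, c2, boundary, value in rows:
        parts[(c1, c2)].append((boundary, value))
    index = {}
    for key, entries in parts.items():
        entries.sort(key=lambda e: e[0])  # sort once, before runtime
        index[key] = ([b for b, _ in entries], [v for _, v in entries])
    return index


def lookup(index, c1, c2, x):
    part = index.get((c1, c2))        # O(1) hash lookup on the categorical part
    if part is None:
        return None
    boundaries, values = part
    i = bisect_right(boundaries, x) - 1  # O(log n): last boundary <= x
    return values[i] if i >= 0 else None


rows = [
    ("A", "A", 50, "p_AA_50"),    # hypothetical parameter rows
    ("A", "A", 200, "p_AA_200"),
    ("A", "B", 100, "p_AB_100"),
]
index = build_index(rows)
print(lookup(index, "A", "A", 101))  # p_AA_50
```

The index is built once, so the per-call cost is one dict lookup plus one binary search, which suits a procedure that is called many times.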

Can pig relational operators work on bags

I am running through the example given in Programming Pig. Take a look at the --analyze_stock.pig example.
I am confused about how relational operators work on bags; I have read that relational operators can work only on relations.
daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,
date:chararray, open:float, high:float, low:float,
close:float, volume:int, adj_close:float);
grpd = group daily by symbol;
After running these two statements, if I run
describe grpd
The output i get is
{group: chararray,daily: {(exchange: chararray,symbol: chararray,date: chararray,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
This clearly shows that daily is a bag
The next statement in the script is
analyzed = foreach grpd {
    sorted = order daily by date;
    generate group, analyze(sorted);
};
Here the order (relational operator) is being applied on daily (bag) based on the describe statement above.
I realize that in all probability my concepts are a little weak here; I would appreciate it if someone could help me out.
Remember that a bag is an unordered collection of tuples. Also remember that the records in a Pig relation are tuples. That means that a relation is actually just a big bag. That means that conceptually, anything you can do to a whole relation you can also do to a smaller bag in a single record. This is done using a nested foreach.
In practice, they are not identical -- when dealing with a bag, there are no map-reduce cycles involved; it's more like using a UDF. Consequently, not every operator can be used this way. Note that the source I linked to is out-of-date on this point: you can now use, e.g., GROUP BY as well.
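If it helps to see the idea outside Pig, here is a rough Python analogy (not Pig code, and the records are made up). Grouping produces one record per symbol whose second field is a bag of the original tuples, and the nested order is just a sort applied to each of those bags in turn.

```python
from collections import defaultdict

# Hypothetical (symbol, date, close) records standing in for the 'daily' relation.
daily = [
    ("IBM", "2009-01-03", 87.4),
    ("AAPL", "2009-01-02", 90.8),
    ("IBM", "2009-01-02", 86.1),
]

# Analogue of: grpd = group daily by symbol;
# each group key maps to a bag (here a list) of the original tuples.
grpd = defaultdict(list)
for rec in daily:
    grpd[rec[0]].append(rec)

# Analogue of: foreach grpd { sorted = order daily by date; ... }
# the sort runs once per group, on that group's bag only.
analyzed = {symbol: sorted(bag, key=lambda r: r[1]) for symbol, bag in grpd.items()}

print(analyzed["IBM"])
```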

Why are the elements of a rascal set not ordered when printing?

For the sake of usability, when I write the following code
{1,2,3,4,5,6,7,8,9,10}
I expect the Rascal console to print the same, yet in the output window I see:
{10,9,8,7,6,5,4,3,2,1}
This example is overly simple of course, and for this the ordering doesn't really hurt. However, in more complex examples I would expect the output to be sorted so that I can more easily verify if a certain element is included in the set.
Does the current ordering of the printed set have a meaning?
From the Rascal Tutor:
A set is an unordered sequence of values and has the following
properties:
All elements have the same static type.
The order of the elements does not matter.
A set contains an element only once. In other words, duplicate elements are eliminated and no matter how many times an element is
added to a set, it will occur in it only once.
And the wikipedia page on Sets:
In computer science, a set is an abstract data structure that can store certain values, without any particular order, and no repeated values.
So the behaviour you observe is as expected: there is no order inside a set, and the order displayed is due to the implementation (a Java HashSet). Sorting before printing, or during construction, would add performance overhead and might give a user the incorrect impression that there is an order.
Regarding the first suggestion, using the same sequence as supplied: that would require a less efficient data structure, and would again hurt performance just for the off chance that we have to print a set.
And of course, you can always do:
import List;
import Set;
sort(toList({4,2,1,3}))
if you really want the output sorted.

Dynamic tuple in Pig?

I have data containing tuples in Pig:
0,(0),(zero)
1,(1,2),(first,second)
Can I get this output from it?
0,0,zero
1,1,first
1,2,second
To start off, I'm going to correct your terminology: you should be treating (0) and (1,2) as bags, not tuples. Tuples are intended to be fixed-length data structures that represent some sort of entity. Say (name, address, year of birth), for example. If you have a list of similar objects, like {(apple), (orange), (banana)}, you want a bag.
There is no built-in behavior that lets you "zip" up multiple bags/lists. The reason is that, from a design perspective, Pig treats bags as unordered lists, hence the term "bag" rather than "list". This assumption really helps with parallelism, since you don't have to be concerned with order. It also means it's really hard to match up 1 with first.
What you could try is writing an eval function UDF that takes two bags as parameters, zips them up, and returns a single bag containing the zipped pairs.
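A minimal sketch of the zipping logic in plain Python, with bags modelled as lists of one-field tuples. The function name is hypothetical. To call it from Pig it would still need to be registered as a UDF with a declared output schema, which I have left out here. It also relies on the two bags arriving in matching order, which, as noted, Pig does not guarantee in general.

```python
def zip_bags(left, right):
    """Pair the i-th tuple of `left` with the i-th tuple of `right`.

    Both arguments are lists of tuples standing in for Pig bags.
    """
    # zip() stops at the shorter input, so unmatched trailing elements are dropped.
    # Tuple concatenation turns (1,) and ('first',) into (1, 'first').
    return [l + r for l, r in zip(left, right)]


print(zip_bags([(1,), (2,)], [("first",), ("second",)]))
# [(1, 'first'), (2, 'second')]
```

Flattening the returned bag in the calling script should then give one output row per pair.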
