What do the statements in MonetDB query plan explanations mean? - monetdb

I am trying to understand the query plan of MonetDB.
Is there documentation anywhere that explains what each instruction stands for?
If not, can anybody tell me what the following return:
sql.projectdelta(X_15,X_23,X_25,r1_30,X_27)
and
sql.subdelta(X_246,X_4,X_10,X_247,X_249), for example?
In my query I am sorting the result by two attributes (e.g., by A,B). Can you tell me why the second sort has more parameters than the first?
(X_29,r1_36,r2_36) := algebra.subsort(X_28,false,false);
(X_33,r1_40,r2_40) := algebra.subsort(X_22,r1_36,r2_36,false,false);
Is algebra.subsort returning (oid, columnType) pairs, or just oid?
Thank you!!

Understanding the output of the EXPLAIN SQL statement requires knowledge of the MonetDB Assembly Language (MAL).
For the functions sql.projectdelta, sql.subdelta, and algebra.subsort, you'll find their signatures and a (brief) description in the MonetDB lib folder, for example:
[MonetDB_install_folder]\MonetDB5\lib\monetdb5\sql.mal for all sql functions
[MonetDB_install_folder]\MonetDB5\lib\monetdb5\algebra.mal for all algebra functions
Concerning the different number of parameters for algebra.subsort :
(X_29,r1_36,r2_36) := algebra.subsort(X_28,false,false);
is described as :
Returns a copy of the BAT sorted on tail values, a BAT that specifies
how the input was reordered, and a BAT with group information.
The input and output are (must be) dense headed.
The order is descending if the reverse bit is set.
This is a stable sort if the stable bit is set.
(X_33,r1_40,r2_40) := algebra.subsort(X_22,r1_36,r2_36,false,false);
is described as:
Returns a copy of the BAT sorted on tail values, a BAT that specifies
how the input was reordered, and a BAT with group information.
The input and output are (must be) dense headed.
The order is descending if the reverse bit is set.
This is a stable sort if the stable bit is set.
MAL functions can be overloaded based on their return values. algebra.subsort can return 1, 2 or 3 values depending on what you ask for. Check algebra.mal for the different possibilities.
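To make the extra parameters concrete, here is a minimal Python sketch of how a two-key sort can be built from two single-column sorts. It is an illustration only, not MonetDB's implementation: the function name subsort merely mirrors the MAL name, Python lists stand in for BATs, and it assumes that the second call uses the order and group outputs of the first call to break ties inside each group of equal first-key values.

```python
def subsort(values, order=None, groups=None):
    """Sort one column, optionally refining an earlier sort.

    Returns (sorted_values, order, groups), loosely mirroring the three
    results described in algebra.mal. Hypothetical sketch, not MonetDB code.
    """
    n = len(values)
    if order is None:
        order = list(range(n))      # no earlier sort: identity order
    if groups is None:
        groups = [0] * n            # no earlier sort: one big group

    # Bring this column into the order produced by the earlier sort.
    permuted = [values[i] for i in order]

    # Sort positions by (group, value): earlier keys stay dominant.
    idx = sorted(range(n), key=lambda p: (groups[p], permuted[p]))

    sorted_values = [permuted[p] for p in idx]
    new_order = [order[p] for p in idx]

    # A new group starts whenever the (group, value) pair changes.
    new_groups, gid, prev = [], -1, None
    for p in idx:
        key = (groups[p], permuted[p])
        if key != prev:
            gid += 1
            prev = key
        new_groups.append(gid)
    return sorted_values, new_order, new_groups


# ORDER BY A, B on four rows
A = [2, 1, 2, 1]
B = [9, 8, 7, 6]
_, order_a, groups_a = subsort(A)                  # first sort: column only
_, order_ab, _ = subsort(B, order_a, groups_a)     # second sort: column + order + groups
rows = [(A[i], B[i]) for i in order_ab]
```

In this sketch the first call needs only the column, while the second call also needs to know how the rows were already reordered and which rows tied on A, which is one way to read the longer parameter list in the plan.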

Related

Logic to compare rows in pig

I need logic for the scenario below, to be implemented in Pig scripts. Can anyone suggest how to do this?
The input contains a column groupName in which some values are others or unknown. Each such value needs to be replaced by the group name of the nearest preceding record that has a real value.
Input:
id,groupName
123,casc0001
124,casc0002
125,sale0001
126,unknown
127,nave9876
128,casc0001
129,sale0002
130,others
131,casc0004
132,unknown
133,unknown
134,others
135,nave1234
output:
123,casc0001
124,casc0002
125,sale0001
126,sale0001
127,nave9876
128,casc0001
129,sale0002
130,sale0002
131,casc0004
132,casc0004
133,casc0004
134,casc0004
135,nave1234
In the input above, 126,unknown is to be replaced with the group from 125 (sale0001); 130,others with the group from 129 (sale0002); and 132,unknown, 133,unknown and 134,others with the group from 131 (casc0004).
--Edit--
I tried the LEAD function in Pig, but it only compares a fixed number n of rows at a time, so it cannot solve this completely.
Another approach works, but I am looking for a more optimized one:
Cogroup for the same data set (like Dataset and Dataset_self)
-Filter Dataset.id=Dataset_self.id or Dataset_self.groupname='others' or Dataset_self.groupname='unknown'
-Generate IdDiff like (Dataset_self.id-Dataset.id), CASE when id=id then ( id, group) else (id_self,group)
-Foreach (group id){
ordered = order by id,diff,group;
limited = ordered limit 1;
generate limited ;
}
This is going to be a complicated problem on a distributed system like Hadoop, especially since your file is going to be split between nodes. In your case, what if 126 happens to be the first record in a new split? Then you would need to look back into the previous file split, which is most likely on a different node. Let's say you come up with a MapReduce program to do this; in all likelihood it would be an extremely slow and inefficient way to do it. The solution might be simpler if you are on a single-node system where the splittable property of your input format is false and the number of reducers is set to 1.
In that case you could almost make the argument that a traditional database like Oracle or Teradata might be a better fit for your problem, as they have lead or lag functions readily available which could be used to do exactly what you need.
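For the single-node case described above, the logic reduces to one ordered pass that remembers the last real group name. A minimal Python sketch follows (Python rather than Pig, purely to show the logic); the names PLACEHOLDERS and fill_forward are made up for illustration, and leaving a placeholder untouched when no earlier real value exists is an assumption, since the question does not cover that case.

```python
# Values that should be replaced by the previous real group name.
PLACEHOLDERS = {"others", "unknown"}


def fill_forward(rows):
    """Replace placeholder group names with the last real one seen.

    rows is an iterable of (id, groupName) tuples; it is sorted by id
    first, because the fill only makes sense on ordered data.
    """
    filled = []
    last_real = None
    for row_id, group in sorted(rows):
        if group in PLACEHOLDERS:
            # Assumption: keep the placeholder if nothing precedes it.
            if last_real is not None:
                group = last_real
        else:
            last_real = group
        filled.append((row_id, group))
    return filled


sample = [
    (125, "sale0001"), (126, "unknown"), (127, "nave9876"),
    (131, "casc0004"), (132, "unknown"), (133, "unknown"), (134, "others"),
]
result = fill_forward(sample)
```

The single pass depends on seeing the rows in id order, which is exactly what a split input file does not guarantee.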

FindNextFile order NTFS

The FindNextFile WinAPI function is used to list the contents of directories. Microsoft's documentation states that the order is file system dependent, although NTFS should return names in alphabetical order most of the time:
The order in which this function returns the file names is dependent on the file system type. With the NTFS file system and CDFS file systems, the names are usually returned in alphabetical order. With FAT file systems, the names are usually returned in the order the files were written to the disk, which may or may not be in alphabetical order. However, as stated previously, these behaviors are not guaranteed.
My application needs some ordering of the objects in directories. Because the majority of Windows users use NTFS, I would like to optimize my application for that case, so I use the function _wcsicmp to compare names. Most of the time this is correct and the results from FindNextFile are sorted according to _wcsicmp. However, sometimes the results are not sorted. I thought that was natural, because FindFirstFile doesn't guarantee the order and I must sort anyway (in case of another file system). Then I noticed a strange pattern: the character '_' appears to be returned after letters. A folder containing (a.txt, b.txt, _.txt) is returned in the order a, b, _, whereas _wcsicmp sorts it as _, a, b. Tested on Windows 8.1. I ran some tests and this behavior is consistent.
Can someone explain what comparison criterion NTFS uses? Or why FindNextFile returns names out of alphabetical order?
Because NTFS sort rules are not as simple as plain alphabetical order. Here is an MSDN blog article that sheds some light on the problem:
Why do NTFS and Explorer disagree on filename sorting?
One reason for this may be that NTFS captures the case mapping table at the time the drive is formatted and continues to use that table, even if the OS's case mapping tables change subsequently.
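The a, b, _ order from the question can be reproduced with a small Python sketch, under the assumption that NTFS compares the upper-cased names code unit by code unit, while _wcsicmp compares lower-cased ones. Python's str.upper and str.lower are only stand-ins here for the volume's case mapping table and the C runtime; the point is that '_' (0x5F) lies between the uppercase letters (0x41 to 0x5A) and the lowercase letters (0x61 to 0x7A).

```python
names = ["a.txt", "b.txt", "_.txt"]

# Stand-in for _wcsicmp: compare lower-cased names by code point.
# '_' (0x5F) is smaller than 'a' (0x61), so it sorts first.
lowercase_order = sorted(names, key=str.lower)

# Stand-in for an upper-casing comparison: 'A' (0x41) and 'B' (0x42)
# are smaller than '_' (0x5F), so the underscore sorts last.
uppercase_order = sorted(names, key=str.upper)

print(lowercase_order)  # ['_.txt', 'a.txt', 'b.txt']
print(uppercase_order)  # ['a.txt', 'b.txt', '_.txt']
```

If the application has to agree with the directory order for plain ASCII names, sorting on the upper-cased name is therefore a closer approximation than _wcsicmp, though the order is still not guaranteed.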
You can use CompareStringEx with the SORT_DIGITSASNUMBERS flag.
The minimum system requirement for this function is Windows Vista (the SORT_DIGITSASNUMBERS flag itself needs Windows 7 or later).
int result = CompareStringEx(NULL, 0x00000008 /* SORT_DIGITSASNUMBERS */,
    lpString1, cchCount1, lpString2, cchCount2, NULL, NULL, 0);
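The comparison result of this function is unusual: it returns 1, 2, or 3 (and 0 on failure), so subtracting 2 from a nonzero result gives the usual C runtime convention: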
#define CSTR_LESS_THAN 1 // string 1 less than string 2
#define CSTR_EQUAL 2 // string 1 equal to string 2
#define CSTR_GREATER_THAN 3 // string 1 greater than string 2
You can also try _wcsicoll for older systems. If I recall correctly _wcsicoll works better but not the same as Windows's sort.

SAS- how to do ASCENDING order when concatenating

How do you sort values in ascending order when concatenating them in SAS?
e.g. In this example I am trying to put the values of aeacnoth1_std, aeacnoth2_std, etc. in ascending order.
if cmiss( aeacnoth1_std, aeacnoth2_std)=0
then aeacolst=strip(aeacnoth1_std)||','||strip(aeacnoth2_std);
if cmiss( aeacnoth1_std, aeacnoth2_std, aeacnoth3_std)=0
then aeacolst=strip(aeacnoth1_std)||','||strip(aeacnoth2_std)||','||strip(aeacnoth3_std);
if cmiss( aeacnoth1_std, aeacnoth2_std, aeacnoth3_std, aeacnoth4_std)=0
then aeacolst=strip(aeacnoth1_std)||','||strip(aeacnoth2_std)||','||strip(aeacnoth3_std)||','||strip(aeacnoth4_std);
One possible approach:
1. Declare an array containing all the variables you want to concatenate
2. Sort the array into the desired order
3. Concatenate the array
The hard part is step 2, as SAS 9.1 or earlier doesn't provide any direct way of doing this. You might find this paper useful, or just Google for 'sas sort array' and see what comes up:
http://www2.sas.com/proceedings/sugi26/p096-26.pdf
EDIT: if you have SAS 9.2 or later, you can use call sortc to sort the array:
http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a003106052.htm
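The sort-then-concatenate approach can be sketched in a few lines of Python (not SAS, just to show the logic). The function name concat_sorted and the sample values are hypothetical, and dropping missing or blank values before joining is an assumption; the original code instead only concatenates when none of the listed variables is missing.

```python
def concat_sorted(values, sep=","):
    """Strip, sort ascending, and join the non-missing values."""
    # Treat None and blank strings as missing (assumption).
    present = [v.strip() for v in values if v is not None and v.strip()]
    return sep.join(sorted(present))


# Hypothetical values for aeacnoth1_std .. aeacnoth4_std
aeacolst = concat_sorted(["DRUG WITHDRAWN", " DOSE REDUCED ", None, ""])
print(aeacolst)  # DOSE REDUCED,DRUG WITHDRAWN
```

In SAS itself, the sort step would be done on the array with call sortc, followed by the concatenation.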

confirm conditional statement applies to >0 observations in Stata

This is something that has puzzled me for some time and I have yet to find an answer.
I am in a situation where I am applying a standardized data cleaning process to (supposedly) similarly structured files, one file for each year. I have a statement such as the following:
replace field="Plant" if field=="Plant & Machinery"
This line came from writing the original code against the data file for year 1. I then generalized the code to loop through the years of data. The problem arises if, in year 3, the analogous value in that variable was coded as "Plant and MachInery ": the line above would not make the intended change because the text string differs, yet it would not raise an error to tell me that no change was made.
What I am after is some sort of confirmation that >0 observations actually satisfied the condition each time the code is executed in the loop, and an error otherwise. Any combination of trimming, removing spaces, and standardizing the text case is not a workaround option. At the same time, I don't want to add a count if and then an assert statement before every conditional replace, as that becomes quite bulky.
Aside from going to the raw files to ensure the variable values are standardized, is there any way to do this validation "on the fly" as I have tried to describe? Maybe just write a custom program that combines a count if, assert and replace?
The idea has surfaced occasionally that replace should return the number of observations changed, but there are good reasons why not, notably that it is not an r-class or e-class command anyway, and it's quite important not to change the way it works because that could break innumerable programs and do-files.
So, I think the essence of any answer is that you have to set up your own monitoring process counting how many values have been (or would be) changed.
One pattern is -- when working on a current variable:
gen was = .
foreach ... {
...
replace was = current
replace current = ...
qui count if was != current
<use the result>
}
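The combined "count, assert, replace" idea from the question can be sketched as a single helper. This is Python rather than Stata, purely to show the control flow; replace_checked is a made-up name, and a real Stata version would be a small user-written program wrapping count and replace.

```python
def replace_checked(values, old, new):
    """Replace old with new, failing loudly if nothing matched."""
    changed = sum(1 for v in values if v == old)
    if changed == 0:
        # Mirrors an assert after "count if": stop instead of passing silently.
        raise ValueError(f"no observations matched {old!r}")
    return [new if v == old else v for v in values], changed


year1 = ["Plant & Machinery", "Buildings"]
year3 = ["Plant and MachInery ", "Buildings"]

cleaned, n_changed = replace_checked(year1, "Plant & Machinery", "Plant")

try:
    replace_checked(year3, "Plant & Machinery", "Plant")
    year3_failed = False
except ValueError:
    year3_failed = True   # the differently coded year is caught here
```

Wrapping the check in one helper keeps each cleaning step to a single line inside the loop while still stopping on a year whose coding differs.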

Format statement with unknown columns

I am attempting to use Fortran to write out a comma-delimited file for import into another commercial package. The issue is that I have an unknown number of data columns. My output needs to look like this:
a_string,a_float,a_different_float,float_array_elem1,float_array_elem2,...,float_array_elemn
which would result in something that might look like this:
L1080,546876.23,4325678.21,300.2,150.125,...,0.125
L1090,563245.1,2356345.21,27.1245,...,0.00983
I have three issues. One, I would prefer the elements to be tightly grouped (variable column width), two, I do not know how to define a variable number of array elements in the format statement, and three, the array elements can span a large range--maybe 12 orders of magnitude. The following code conceptually does what I want, but the variable 'n' and the lack of column-width definition throws an error (of course):
WRITE(50,900) linenames(ii),loc(ii,1:2),recon(ii,1:n)
900 FORMAT(A,',',F,',',F,n(',',F))
(I should note that n is fixed at run-time.) The write statement does what I want it to when I do WRITE(50,*), except that it's width-delimited.
I think this thread almost answered my question, but I got quite confused: SO. Right now I have a shell script with awk fixing the issue, but that solution is...inelegant. I could do some manipulation to make the output a string, and then just write it, but I would rather like to avoid that option if at all possible.
I'm doing this in Fortran 90 but I like to try to keep my code as backwards-compatible as possible.
The format closest to what you want is f0.3: this gives no leading spaces and a fixed number of decimal places. If you also want to lop off trailing zeros, I think you'll need to do a good bit of work.
The 'n' in your format statement can be larger than the number of data values, so one (old-school) approach is to put a big number there, e.g. 100000. Modern Fortran does have syntax to specify an indefinite repeat; I'm sure someone will offer that up.
----edit
The unlimited repeat is, as you might guess, an asterisk, and is evidently "brand new" in Fortran 2008.
To make sure that no spaces occur between the entries in your line, you can write them separately into character variables and then print them out using the adjustl() function in Fortran:
program csv
  implicit none

  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nn = 3
  real(dp), parameter :: floatarray(nn) = [ -1.0_dp, -2.0_dp, -3.0_dp ]

  integer :: ii
  character(30) :: buffer(nn+2), myformat

  ! Create format string with appropriate number of fields.
  write(myformat, "(A,I0,A)") "(A,", nn + 2, "(',',A))"

  ! You should execute the following lines in a loop for every line you want to output
  write(buffer(1), "(F20.2)") 1.0_dp  ! a_float
  write(buffer(2), "(F20.2)") 2.0_dp  ! a_different_float
  do ii = 1, nn
    write(buffer(2+ii), "(F20.3)") floatarray(ii)
  end do
  write(*, myformat) "a_string", (trim(adjustl(buffer(ii))), ii = 1, nn + 2)

end program csv
The demonstration above is only for one output line, but you can easily write a loop around the appropriate block to execute it for all your output lines. Also, you can choose different numerical format for the different entries, if you wish.
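To check what the program above is meant to produce, the same two steps (build the format string, then join the trimmed fields) can be mirrored in a short Python sketch. The helper names fortran_format and csv_line are hypothetical, and the fixed two and three decimal places simply copy the F20.2 and F20.3 edit descriptors used above.

```python
def fortran_format(n_fields):
    """Build the same format string the Fortran code writes into myformat."""
    return f"(A,{n_fields}(',',A))"


def csv_line(name, a_float, a_different_float, array):
    """Join the fields with commas and no padding, like trim(adjustl(...))."""
    fields = [name, f"{a_float:.2f}", f"{a_different_float:.2f}"]
    fields += [f"{x:.3f}" for x in array]
    return ",".join(fields)


fmt = fortran_format(3 + 2)
line = csv_line("a_string", 1.0, 2.0, [-1.0, -2.0, -3.0])
print(fmt)   # (A,5(',',A))
print(line)  # a_string,1.00,2.00,-1.000,-2.000,-3.000
```

Because the repeat count is written into the format string at run time, the number of array columns does not have to be known when the program is compiled.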
