How to give equations in Apache pig - hadoop

I am trying to get a value from this equation
--counted gives the total row count in a file
samplecount = counted*(10/100);
How to sample data according to this
--Load data
examples = LOAD '/home/sreeveni/myfiles/PE/USCensus1990New.csv' ;
--Group data
groupedByUser = group examples all;
--count no of lines in the file
counted = FOREACH groupedByUser generate COUNT(examples) ;
--sampling
sampled = SAMPLE examples counted*(10/100);
store sampled into '/home/sreeveni/myfiles/OUT/samplesout';
Showing error in above line
Invalid scalar projection: counted : A column needs to be projected
from a relation for it to be used as a scalar
Please advice.
Am I doing anything wrong.

i guess sample works with a number between [0,1]. In your case, its exceeding the required value. If you want just 10% of the data, pass 0.1 directly and to get that in a code, find this percentage in a FOREACH statement only.

If you are trying to generate a sample of "examples" with 10% of the total number of rows, all you have to do is:
SAMPLE examples 0.1;
Read the documentation for SAMPLE command here.

Related

How do I add noise/variability to a dataset in Python, given the CV?

Given a dataset of blood results, say cholesterol level, and knowing that the instrument that produced those results is subject to a known degree of variability, how would I add that variability back into the dataset? i.e. I want to assume the result in the original dataset is the true/mean value, and then produce new results that are subject to the known variability of the instrument.
In Excel you use =NORM.INV(RAND(), mean, std_dev), where RAND() provides a random value between 0 and 1, "mean" will be the original value and I have the CV so I can calculate the SD. NORM.INV then provides the inverse of the cumulative normal distribution function.
I've done the following to create a new column with my new values, but would like to know if it is valid (i.e., will each row have a different random number between 0 and 1 as the probability? and is this formula equivalent to NORM.INV?
df8000['HDL_1'] = norm.ppf(random(), loc = df8000['HDL_0'], scale = TAE_df.loc[0,'HDL'])
Thanks in advance!

How to format thousands from digit a with a point

I'am looking for the best way to format thousands from digit a with a point. I am creating a view from an other table and I need for some columns to have this type of format.
ex : 10000 -> 10.000
Thank you in advance for your possible answers !
There is way how to do it without UDF but it can be performance issue. But for small sets of data it works:
You can use the following combination of build-in functions:
// Template
select reverse(concat_ws(".",split(reverse(<column-to-format>),"(?<=\\G.{3})")));
// sample 1 - format number 12345678
select reverse(concat_ws(".",split(reverse("12345678"),"(?<=\\G.{3})")));
// sample 2 - format number 10000
select reverse(concat_ws(".",split(reverse("10000"),"(?<=\\G.{3})")));
Explanation for sample 1 with number 12345678
1. reverse("12345678") Result: 87654321
2. split(reverse("12345678"),"(?<=\\G.{3})")")) Result: array [876,543,21]
3. concat_ws(".",split(reverse("10000"),"(?<=\\G.{3})")) Result "876.543.21"
4. finally reverse it back. Result: "12.543.876"

SPSS Count function advice

Im having some issues with some SPSS code. Im new to SPSS and still trying to figure out the syntax. I'm trying to get my code to count the sum of two dice equal to 7. I cant get the count function to work the way I want it. Below is my code. Any tips would be greatly appreciated.
INPUT PROGRAM.
LOOP #I=1 TO 100000.
COMPUTE case = 1.
END CASE.
END LOOP.
END FILE.
END INPUT PROGRAM.
EXECUTE.
COMPUTE Dice_1 = TRUNC (RV.UNIFORM(1,7)).
COMPUTE Dice_2 = TRUNC (RV.UNIFORM(1,7)).
COMPUTE total = Dice_1+Dice_2.
COMPUTE Number_Sum7= Dice_1+Dice_2 = 7.
COUNT Num= case TO Number_Sum7(1).
SAVE outfile = 'my file path'.
count function counts across a list of variables, in each line separately.
What you seem to be looking for is to count over rows.
You can start with:
frequencies total. /* see counts of all possible totals.
means Number_Sum7/cells=sum. /* count only the cases where total=7.
These will give you the answers in the output window.
If you want the answers in data for further analysis, look up the aggregate function.
For example, the following will give you the same results but in a new datasets:
DATASET NAME ORIG.
DATASET DECLARE freqs.
AGGREGATE /OUTFILE='freqs' /BREAK=total /Mycount=N.
DATASET ACTIVATE ORIG.
DATASET DECLARE only7.
AGGREGATE /OUTFILE='only7' /BREAK= /only7=sum(Number_Sum7).
Or, instead, you can add the results to your present data:
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK=total /TotalCount=N.
AGGREGATE /OUTFILE=* MODE=ADDVARIABLES /BREAK= /total7=sum(Number_Sum7).

Creating an average matrix from four individual matrices of same size in SAS / IML

I am using IML/SAS in SAS Enterprise Guide for the first time, and want to do the following:
Read some datasets into IML matrices
Average the matrices
Turn the resulting IML matrix back into a SAS data set
My input data sets look something like the following (this is dummy data - the actual sets are larger). The format of the input data sets is also the format I want from the output data sets.
data_set0: d_1 d_2 d_3
1 2 3
4 5 6
7 8 9
I proceed as follows:
proc iml;
/* set the names of the migration matrix columns */
varNames = {"d_1","d_2","d_3"};
/* 1. transform input data set into matrix
USE data_set_0;
READ all var _ALL_ into data_set0_matrix[colname=varNames];
CLOSE data_set_0;
USE data_set_1;
READ all var _ALL_ into data_set1_matrix[colname=varNames];
CLOSE data_set_1;
USE data_set_2;
READ all var _ALL_ into data_set2_matrix[colname=varNames];
CLOSE data_set_2;
USE data_set_3;
READ all var _ALL_ into data_set3_matrix[colname=varNames];
CLOSE data_set_3;
/* 2. find the average matrix */
matrix_sum = (data_set0_matrix + data_set1_matrix +
data_set2_matrix + data_set3_matrix)/4;
/* 3. turn the resulting IML matrix back into a SAS data set */
create output_data from matrix_sum[colname=varNames];
append from matrix_sum;
close output_data;
quit;
I've been trying loads of stuff, but nothing seems to work for me. The error I currently get reads:
ERROR: Matrix matrix_sum has not been set to a value
What am I doing wrong? Thanks up front for the help.
The above code works. In the full version of this code (this is simplified for readability) I had misnamed one of my variables.
I'll leave the question up in case somebody else wants to use SAS / IML to find an average matrix.

How to create missing records within date-time range in pig latin

I have input records of the form
2013-07-09T19:17Z,f1,f2
2013-07-09T03:17Z,f1,f2
2013-07-09T21:17Z,f1,f2
2013-07-09T16:17Z,f1,f2
2013-07-09T16:14Z,f1,f2
2013-07-09T16:16Z,f1,f2
2013-07-09T01:17Z,f1,f2
2013-07-09T16:18Z,f1,f2
These represent timestamps and events. I have written these by hand, but actual data should be sorted based on time.
I would like to generate a set of records which would be input to graph plotting function which needs continuous time series. I would like to fill in missing values, i.e. if there are entries for "2013-07-09T19:17Z" and "2013-07-09T19:19Z", I would like to generate entry for "2013-07-09T19:18Z" with predefined value.
My thoughts on doing this:
Use MIN and MAX to find the start and end date in the series
Write UDF which takes min and max and returns relation with missing
timestamps
Join above 2 relations
I cannot get my head around on how to implement this in PIG though. Would appreciate any help.
Thanks!
Generate another file using a script (outside pig)with all time stamps between MIN and MAX , including MIN and MAX. Load this as a second data set. Here is a sample that I used from your data set. Please note I filled in only few gaps not all.
2013-07-09T01:17Z,d1,d2
2013-07-09T01:18Z,d1,d2
2013-07-09T03:17Z,d1,d2
2013-07-09T16:14Z,d1,d2
2013-07-09T16:15Z,d1,d2
2013-07-09T16:16Z,d1,d2
2013-07-09T16:17Z,d1,d2
2013-07-09T16:18Z,d1,d2
2013-07-09T19:17Z,d1,d2
2013-07-09T21:17Z,d1,d2
Do a COGROUP on the original dataset and the generated dataset above. Use a nested FOREACH GENERATE to write output dataset. If first dataset is empty, use the values from second set to generate output dataset else the first dataset. Here is the piece of code I used on these two datasets.
Org_Set = LOAD 'pigMissingData/timeSeries' USING PigStorage(',') AS (timeStamp, fl1, fl2);
Default_set = LOAD 'pigMissingData/timeSeriesFull' USING PigStorage(',') AS (timeStamp, fl1, fl2);
coGrouped = COGROUP Org_Set BY timeStamp, Default_set BY timeStamp;
Filled_Data_set = FOREACH coGrouped {
x = COUNT(times);
y = (x == 0? (Default_set.fl1, Default_set.fl2): (Org_Set.fl1, Org_Set.fl2));
GENERATE FLATTEN(group), FLATTEN(y.$0), FLATTEN(y.$1);
};
if you need further clarification or help let me know
In addition to #Rags answer, you could use the STREAM x THROUGH command and a simple awk script (similar to this one) to generate the date range once you have the min and max dates. Something similar to (untested! - you might need to single line the awk script with semi-colon command delimitation, or better to ship it as a script file)
grunt> describe bounds;
(min:chararray, max:chararray)
grunt> dump bounds;
(2013/01/01,2013/01/04)
grunt> fullDateBounds = STREAM bounds THROUGH `gawk '{
split($1,s,"/")
split($2,e,"/")
st=mktime(s[1] " " s[2] " " s[3] " 0 0 0")
et=mktime(e[1] " " e[2] " " e[3] " 0 0 0")
for (i=st;i<=et;i+=60*24) print strftime("%Y/%m/%d",i)
}'`;

Resources