sparse matrix list .cat.codes invalid syntax error - matrix

I am having trouble with the piece of code below, which is part of a tutorial where we have to create a sparse matrix from two columns by first converting each of them into a list. I am getting a syntax error and am not sure how to resolve it. I would much appreciate any help.
Code:
from pandas.api.types import CategoricalDtype
customers = list(np.sort(grouped_purchased.CustomerID.unique())) # Get our unique customers
products = list(grouped_purchased.StockCode.unique()) # Get our unique products that were purchased
quantity = list(grouped_purchased.Quantity) # All of our purchases
rows = grouped_purchased.CustomerID.astype(CategoricalDtype(categories = ['customers']).cat.codes
# Get the associated row indices
cols = grouped_purchased.StockCode.astype(CategoricalDtype(categories = ['products']).cat.codes
# Get the associated column indices
purchases_sparse = sparse.csr_matrix((quantity, (rows, cols)), shape=(len(customers), len(products)))
Error:
File "", line 10
cols = grouped_purchased.StockCode.astype(CategoricalDtype(categories = ['products']).cat.codes
^
SyntaxError: invalid syntax
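A likely cause, judging only from the code shown: the rows line opens astype( and CategoricalDtype( but closes just one of them, so Python keeps reading the expression and reports the error on the next statement (the cols line, which has the same problem). Separately, categories = ['customers'] passes a one-element list containing the text 'customers' rather than the customers list itself. The sketch below shows one way the lines could look once both points are addressed; the small grouped_purchased frame is a made-up stand-in for the tutorial's data.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
from scipy import sparse

# Hypothetical stand-in for the tutorial's grouped_purchased frame.
grouped_purchased = pd.DataFrame({
    "CustomerID": [12, 12, 15, 17],
    "StockCode": ["A", "B", "A", "C"],
    "Quantity": [2, 1, 5, 3],
})

customers = list(np.sort(grouped_purchased.CustomerID.unique()))  # unique customers
products = list(grouped_purchased.StockCode.unique())  # unique products purchased
quantity = list(grouped_purchased.Quantity)  # all purchases

# categories takes the list itself (customers), not the quoted name,
# and astype(...) needs its own closing parenthesis before .cat.codes.
rows = grouped_purchased.CustomerID.astype(
    CategoricalDtype(categories=customers)
).cat.codes  # row index of each purchase
cols = grouped_purchased.StockCode.astype(
    CategoricalDtype(categories=products)
).cat.codes  # column index of each purchase

purchases_sparse = sparse.csr_matrix(
    (quantity, (rows, cols)), shape=(len(customers), len(products))
)
```

Each customer becomes one row and each product one column, with the purchased quantity as the stored value.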

Powerquery: passing column value to custom function

I'm struggling to pass a column value to a formula. I tried many different combinations, but it only works when I hard-code the column:
(tbl as table, col as list) =>
let
avg = List.Average(col),
sdev = List.StandardDeviation(col)
in
Table.AddColumn(tbl, "newcolname" , each ([column] - avg)/sdev)
I'd like to replace [column] with a variable. It is the same column I use for the average and the standard deviation.
Any help would be appreciated.
Thank you
This probably does what you want. It is called as x = fctn(Source,"ColumnA").
It computes the average and standard deviation from ColumnA of the Source table and applies the calculation to that same column:
(tbl as table, col as text) =>
let
avg = List.Average(Table.Column(tbl,col)),
sdev = List.StandardDeviation(Table.Column(tbl,col))
in Table.AddColumn(tbl, "newcolname" , each (Record.Field(_, col) - avg)/sdev)
Alternatively, you may want this version. It computes the average and standard deviation on the list provided (which can come from any table) and then applies the calculation to the named column of the table passed in.
It is called as x = fctn(Source,"ColumnNameInSource",SomeSource[SomeColumn]):
(tbl as table, cname as text, col as list) =>
let
avg = List.Average(col),
sdev = List.StandardDeviation(col)
in Table.AddColumn(tbl, "newcolname" , each (Record.Field(_, cname) - avg)/sdev)

geopandas loop does not give me name of layers

I'm trying to build a dataframe from all the layers of a KML file. The code below produces a dataframe, but I also want the name of each KML layer as a column in data. Any idea what I'm doing wrong?
gpd.io.file.fiona.drvsupport.supported_drivers['KML'] = 'rw'
fp="file.kml"
data = gpd.GeoDataFrame()
layers_list=pd.Series(fiona.listlayers(fp))
list(layers_list)
for l in layers_list:
    s = gpd.read_file(fp, driver='KML', layer=l)
    data = data.append(s, ignore_index=True)
    data['layers'] = l
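One probable explanation: data['layers'] = l assigns to every row of the accumulated frame, so once the loop finishes all rows carry the name of the last layer. Tagging each layer's rows before combining them avoids that. The sketch below shows the pattern with plain pandas frames standing in for what gpd.read_file would return (the layer names and the Name column are invented for illustration); it also uses pd.concat, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Stand-ins for the per-layer frames; in the question each one would come from
# gpd.read_file(fp, driver='KML', layer=l). Names here are hypothetical.
layers = {
    "roads": pd.DataFrame({"Name": ["r1", "r2"]}),
    "parks": pd.DataFrame({"Name": ["p1"]}),
}

frames = []
for layer_name, s in layers.items():
    s = s.copy()
    s["layers"] = layer_name  # tag only this layer's rows
    frames.append(s)

# Combine once at the end instead of appending inside the loop.
data = pd.concat(frames, ignore_index=True)
```

With real GeoDataFrames the same loop should apply unchanged, with s read from the KML layer instead of taken from the dictionary.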

Error when using MAX in Apache Pig (Hadoop)

I am trying to calculate maximum values for different groups in a relation in Pig. The relation has three columns patientid, featureid and featurevalue (all int).
I group the relation by featureid and want to calculate the maximum feature value of each group. Here's the code:
grpd = GROUP features BY featureid;
DUMP grpd;
temp = FOREACH grpd GENERATE $0 as featureid, MAX($1.featurevalue) as val;
It gives me an "Invalid scalar projection: grpd" exception. I read on various forums that MAX expects a bag as input, but when I dump grpd, it does show a bag. Here's a small part of the output from the dump:
(5662,{(22579,5662,1)})
(5663,{(28331,5663,1),(2624,5663,1)})
(5664,{(27591,5664,1)})
(5665,{(30217,5665,1),(31526,5665,1)})
(5666,{(27783,5666,1),(30983,5666,1),(32424,5666,1),(28064,5666,1),(28932,5666,1)})
(5667,{(31257,5667,1),(27281,5667,1)})
(5669,{(31041,5669,1)})
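What's the issue?
The issue was with column addressing. After GROUP, each tuple holds the key in a field named group and a bag named after the grouped relation (features). Here's the correct, working code: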
grpd = GROUP features BY featureid;
temp = FOREACH grpd GENERATE group as featureid, MAX(features.featurevalue) as val;

How can I select record with minimum value in pig latin

I have timestamped samples and I'm processing them using Pig. I want to find, for each day, the minimum value of the sample and the time of that minimum. So I need to select the record that contains the sample with the minimum value.
In the following for simplicity I'll represent time in two fields, the first is the day and the second the "time" within the day.
1,1,4.5
1,2,3.4
1,5,5.6
To find the minimum the following works:
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as samp;
But then I've lost the exact time at which the minimum happened. I hoped I could use nested expressions. I tried the following:
dailyminima = FOREACH g {
minsample = MIN(samples.samp);
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
But with that I receive the error message:
2012-11-12 12:08:40,458 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000:
<line 5, column 29> Invalid field reference. Referenced field [samp] does not exist in schema: .
Details at logfile: /home/hadoop/pig_1352722092997.log
If I set minsample to a constant, it doesn't complain:
dailyminima = FOREACH g {
minsample = 3.4F;
mintuple = FILTER samples BY samp == minsample;
GENERATE group as day, mintuple.time, mintuple.samp;
};
And indeed produces a sensible result:
(1,{(2)},{(3.4)})
While writing this I thought of using a separate JOIN:
dailyminima = FOREACH g GENERATE group as day, MIN(samples.samp) as minsamp;
dailyminima = JOIN samples BY (day, samp), dailyminima BY (day, minsamp);
That works, but it results (in the real case) in a join over two large data sets instead of a search through a single day's values, which doesn't seem healthy.
In the real case I actually want to find max and min and associated times. I hoped that the nested expression approach would allow me to do both at once.
Suggestions of ways to approach this would be appreciated.
Thanks to alexeipab for the link to another SO question.
One working solution (finding both min and max and the associated time) is:
dailyminima = FOREACH g {
minsamples = ORDER samples BY samp;
minsample = LIMIT minsamples 1;
maxsamples = ORDER samples BY samp DESC;
maxsample = LIMIT maxsamples 1;
GENERATE group as day, FLATTEN(minsample), FLATTEN(maxsample);
};
Another way to do it, which has the advantage that it doesn't sort the entire relation, and only keeps the (potential) min in memory, is to use the PiggyBank ExtremalTupleByNthField. This UDF implements Accumulator and Algebraic and is pretty efficient.
Your code would look something like this:
DEFINE TupleByNthField org.apache.pig.piggybank.evaluation.ExtremalTupleByNthField('3', 'min');
samples = LOAD 'testdata' USING PigStorage(',') AS (day:int, time:int, samp:float);
g = GROUP samples BY day;
bagged = FOREACH g GENERATE TupleByNthField(samples);
flattened = FOREACH bagged GENERATE FLATTEN($0);
min_result = FOREACH flattened GENERATE $1 .. ;
Keep in mind that the field to compare on is set in the DEFINE statement: passing '3' as the first parameter selects the third field, samp, and 'min' asks for the minimum.
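For readers who want to check the intended result outside Pig, the logic both answers implement (keep the whole record with the smallest and largest samp for each day) can be sketched in plain Python. This is only an illustration of the expected output, not Pig code; the first three samples are the ones from the question, and the day-2 rows are made up.

```python
from collections import defaultdict

# (day, time, samp) records; day 1 is from the question, day 2 is invented.
samples = [(1, 1, 4.5), (1, 2, 3.4), (1, 5, 5.6), (2, 3, 7.0), (2, 4, 6.5)]

# Equivalent of GROUP samples BY day.
by_day = defaultdict(list)
for record in samples:
    by_day[record[0]].append(record)

# For each day keep the full record with the minimum and maximum samp (field 3).
daily_extremes = {
    day: (min(rows, key=lambda r: r[2]), max(rows, key=lambda r: r[2]))
    for day, rows in by_day.items()
}
```

Because min and max return the whole tuple, the time of each extreme is kept alongside the value, which is what the nested ORDER and LIMIT version and the ExtremalTupleByNthField version both aim for.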

Handling DataTable Column Name Mismatch Exception

I am populating my DataTable from an uploaded Excel sheet. I run into a problem when I look for particular columns: I don't know a column's position, since it can be anywhere or may not be present at all, so I can't use indexing. When I go by column name instead, whitespace causes problems.
Suppose I do know the index of the column: how can I handle the whitespace?
Here is what I have tried so far:
Code:
if (ds.Tables[0].Columns[3].Caption.Replace(" ", "").Equals("XXXX"))
{
    // Renamed from ds so the query result does not clash with the DataSet variable.
    var names = from r in ds.Tables[0].AsEnumerable() select new { Fname=r.Field<String>("XX XX") , Lname=r.Field<string>(" Yy YY Y ") };
    names.ToList();
}
Do I need to care about case sensitivity in column names?
How can I find a column's index if its name matches a given string?
You can find the column like this:
DataColumn yourColumn = ds.Tables[0].Columns.Cast<DataColumn>()
.Where(r => r.Caption.Trim().Equals("XXXX",StringComparison.InvariantCultureIgnoreCase))
.FirstOrDefault();
