Groupby and sort Pandas - sorting

I have a dataframe:
df = pd.DataFrame({
'Metric': ['Total Assets', 'Total Promo', 'Total Assets', 'Total Promo'],
'Risk': ['High', 'High','Low', 'Low'],
'2021': [ 200, 100, 400, 50]})
I want to group by the Metric column and sort by the '2021' column.
I tried:
df = df.sort_values(['2021'],ascending=False).groupby(['Metric', 'Risk'])
But I get the following output:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000213CA672B48>
The output should look like:
df = pd.DataFrame({
'Metric': ['Total Assets', 'Total Assets', 'Total Promo', 'Total Promo'],
'Risk': ['Low', 'High', 'High', 'Low'],
'2021': [ 400, 200, 100, 50]})

If I understand you right, you want to sort by column "Metric" (ascending) and then "2021" (descending):
df = df.sort_values(["Metric", "2021"], ascending=[True, False], ignore_index=True)
print(df)
Prints:
         Metric  Risk  2021
0  Total Assets   Low   400
1  Total Assets  High   200
2   Total Promo  High   100
3   Total Promo   Low    50
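As a minimal sketch of why the original attempt printed an object instead of a table: groupby() on its own only builds a lazy DataFrameGroupBy, while sort_values() returns a DataFrame. The variable names grouped and result are illustrative, and reset_index(drop=True) is added here only to renumber the rows.

```python
import pandas as pd

df = pd.DataFrame({
    "Metric": ["Total Assets", "Total Promo", "Total Assets", "Total Promo"],
    "Risk": ["High", "High", "Low", "Low"],
    "2021": [200, 100, 400, 50],
})

# groupby() alone is lazy: it returns a DataFrameGroupBy object, not a frame.
grouped = df.groupby(["Metric", "Risk"])
print(type(grouped).__name__)

# Sorting on both columns gives the ordering asked for in the question.
result = (
    df.sort_values(["Metric", "2021"], ascending=[True, False])
      .reset_index(drop=True)
)
print(result)
```

No aggregation is needed here, because every (Metric, Risk) pair occurs only once in the sample data.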

Related

Laravel group data by months

I am working on a Laravel application, users can place football bets.
This is a simplified version of my tables:
users
- id
- name
bets
- id
- id_user
- cost
- profit (e.g. can be 0 if user lost this bet or any integer value if won)
- created_at (default laravel column, this should be used to group bets by month)
I need to show a chart with the ROI (I'm not looking for the formula; it can be simplified as calculateROI in your comments) of the last six months, counting back from the current one.
Let's assume the current month is July. How can I write a query or use Eloquent to get something like:
[
[
"february" => 2%
],
[
"march" => 0%
],
[
"april" => 100%
],
[
"may" => 500%
],
[
"june" => 13%
],
[
"july" => 198%
],
]
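The thread has no answer, so what follows is only a sketch of the grouping logic, written in plain Python rather than Eloquent. Everything in it is an assumption made for illustration: the function names, the in-memory list of bets, and especially the calculate_roi formula, which the question leaves open.

```python
from collections import defaultdict
from datetime import date, datetime

MONTH_NAMES = ("january", "february", "march", "april", "may", "june", "july",
               "august", "september", "october", "november", "december")


def calculate_roi(total_cost, total_profit):
    # Placeholder formula (an assumption): net gain relative to the amount staked.
    if total_cost == 0:
        return 0.0
    return (total_profit - total_cost) / total_cost * 100


def last_six_months(today):
    # (year, month) pairs for the six months ending with today's month, oldest first.
    months = []
    year, month = today.year, today.month
    for _ in range(6):
        months.append((year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return list(reversed(months))


def roi_by_month(bets, today):
    totals = defaultdict(lambda: [0, 0])  # (year, month) -> [cost, profit]
    for bet in bets:
        key = (bet["created_at"].year, bet["created_at"].month)
        totals[key][0] += bet["cost"]
        totals[key][1] += bet["profit"]
    # Months without any bets still appear, with an ROI of 0.
    return {MONTH_NAMES[m - 1]: calculate_roi(*totals[(y, m)])
            for (y, m) in last_six_months(today)}


bets = [
    {"cost": 10, "profit": 0, "created_at": datetime(2023, 6, 3)},
    {"cost": 10, "profit": 30, "created_at": datetime(2023, 7, 1)},
    {"cost": 10, "profit": 10, "created_at": datetime(2023, 7, 15)},
    {"cost": 5, "profit": 50, "created_at": datetime(2022, 12, 1)},  # outside the window
]
roi = roi_by_month(bets, date(2023, 7, 20))
print(roi)
```

Building the list of months first, and only then looking up totals, is what keeps empty months in the result; a plain GROUP BY on created_at would silently skip them.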

Removing categories with patsy and statsmodels

I am using statsmodels and patsy for building a logistic regression model. I'll use pseudocode here. Let's assume I have a dataframe containing a categorical variable, say Country, with 200 levels. I have reasons to believe some of them would be predictive, so I build a model as in
formula = 'outcome ~ C(Country)'
patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy:
formula = 'outcome ~ C(Country) - C(Country)[GB]'
I tried and it did not change anything.
I don't know if there is a way to subset a Category with patsy formula, but you can do it in the DataFrame.
For example
import numpy as np
import pandas as pd
import statsmodels.api as sm
# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
'outcome': np.random.random(size),
'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')
print(df.Country)
0 ES
1 IT
2 UK
3 US
4 UK
..
95 FR
96 UK
97 ES
98 UK
99 US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Let us suppose we want to remove Category "US"
# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Even if there are no more elements with category "US" in the DataFrame, the category is still there. If we use this DataFrame in a statsmodels model, we'd get a singular matrix error, so we need to remove unused categories
# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
and now we can fit a model
mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 0.684054
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: outcome No. Observations: 83
Model: Logit Df Residuals: 79
Method: MLE Df Model: 3
Date: Sun, 16 May 2021 Pseudo R-squ.: 0.01179
Time: 22:43:37 Log-Likelihood: -56.776
converged: True LL-Null: -57.454
Covariance Type: nonrobust LLR p-value: 0.7160
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.1493 0.438 -0.341 0.733 -1.007 0.708
Country[T.FR] 0.4129 0.614 0.673 0.501 -0.790 1.616
Country[T.IT] -0.1223 0.607 -0.201 0.840 -1.312 1.068
Country[T.UK] 0.1027 0.653 0.157 0.875 -1.178 1.383
=================================================================================
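The filtering and cleanup steps above can be wrapped in a small helper. This is a sketch only: drop_levels is a hypothetical name, not part of pandas, patsy or statsmodels, and the tiny DataFrame is made up for the demonstration.

```python
import pandas as pd


def drop_levels(frame, column, levels):
    # Keep only rows whose value in `column` is not listed in `levels`.
    kept = frame[~frame[column].isin(levels)].copy()
    # Drop the now-empty categories so they no longer appear as model levels.
    kept[column] = kept[column].cat.remove_unused_categories()
    return kept


df = pd.DataFrame({
    "outcome": [0.1, 0.9, 0.4, 0.7, 0.3],
    "Country": pd.Categorical(["IT", "UK", "US", "FR", "US"]),
})
trimmed = drop_levels(df, "Country", ["US"])
print(list(trimmed["Country"].cat.categories))
```

The returned frame can then be passed as data= to the formula interface in the same way as _df above.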

Laravel: Is it possible to use eagerloading after pagination?

Eagerloading with pagination is simple:
Model::with(['relation1', 'relation2'])->paginate();
There are 6 models M1, ..., M6 and model M1 has foreign key to models M2, ..., M6. There are at least 2,000,000 records in each model and model M1 has more than 10,000,000 records. The following statement
M1::paginate();
is fast enough but when relations are included, it takes more than 45 seconds to return the results. To improve the performance, I need to run the M1::paginate(); at the beginning, then include other relations.
My solution is to loop through the collection, gather the ids and add the relations. Has something like this already been implemented in Laravel?
Whenever you are unsure about how the queries are made, open the console (php artisan tinker) and write the following:
DB::listen(fn($q) => dump([$q->sql, $q->bindings, $q->time]))
For each query you make (in the current console session), you'll get an array containing the SQL, the bindings and the time it actually takes for the database to return the data (this does not take into account how long it takes PHP to turn these results into an Eloquent Collection).
For example, for a Model (A) that has one hasMany relation with another Model (B), look at the output below:
>>> DB::listen(fn($q) => dump([$q->sql, $q->bindings, $q->time]))
=> null
>>> App\Models\A::with('b')->get()->first()->id
array:3 [
0 => "select * from "a""
1 => []
2 => 0.0
]
array:3 [
0 => "select * from "b" where "b"."a_id" in (1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
27)"
1 => []
2 => 0.0
]
=> 1
>>> App\Models\A::with('b')->paginate(5)->first()->id
array:3 [
0 => "select count(*) as aggregate from "a""
1 => []
2 => 0.0
]
array:3 [
0 => "select * from "a" limit 5 offset 0"
1 => []
2 => 0.0
]
array:3 [
0 => "select * from "b" where "b"."a_id" in (1, 2, 3, 4, 5)"
1 => []
2 => 0.0
]
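As you can see, pagination also affects the relationship queries: with paginate(5), the query on "b" is restricted to the ids of the five rows on the current page, so the relations are already loaded only after the page has been selected.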
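The two-step pattern visible in the log (select one page of parents, then load the children with a single IN query on those ids) can be sketched outside Laravel. The example below uses Python's sqlite3 module with throwaway tables a and b that mirror the log; page_with_relation is a hypothetical helper, and the ORDER BY is added here only to make the page deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER);
""")
conn.executemany("INSERT INTO a (id) VALUES (?)", [(i,) for i in range(1, 28)])
conn.executemany("INSERT INTO b (a_id) VALUES (?)", [(i,) for i in range(1, 28)])


def page_with_relation(conn, per_page, page):
    # Step 1: fetch only the parent ids for the requested page.
    offset = (page - 1) * per_page
    parents = [row[0] for row in conn.execute(
        "SELECT id FROM a ORDER BY id LIMIT ? OFFSET ?", (per_page, offset))]
    if not parents:
        return {}
    # Step 2: load related rows for those ids only, in a single IN query.
    placeholders = ", ".join("?" for _ in parents)
    children = conn.execute(
        f"SELECT id, a_id FROM b WHERE a_id IN ({placeholders})", parents)
    related = {parent_id: [] for parent_id in parents}
    for child_id, a_id in children:
        related[a_id].append(child_id)
    return related


first_page = page_with_relation(conn, per_page=5, page=1)
print(sorted(first_page))
```

With this shape, the cost of loading relations depends on the page size rather than on the size of the parent table, provided the foreign key column is indexed.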

Converting Language Detection Score of CLD2 to CLD3 Accuracy

My cld2 language detection model (langID) returns for the input sentence to classify the following values
{ reliable: true,
textBytes: 181,
languages:
[ { name: 'ITALIAN', code: 'it', percent: 61, score: 774 },
{ name: 'ENGLISH', code: 'en', percent: 38, score: 1573 } ],
chunks:
[ { name: 'ITALIAN', code: 'it', offset: 0, bytes: 116 },
{ name: 'ENGLISH', code: 'en', offset: 116, bytes: 71 } ] }
where textBytes represents the size of the input text, percent is the share of the text attributed to each language code, and score is an indicator of the quality of the detection (the smaller, the better).
That said, in the brand new CLD3 neural network, the result of the classification is just the accuracy (a probability value between 0 and 1), for example:
println(ld.getCode(0))
println(ld.getScore(0))
en
0.99
I would like to figure out how to convert the CLD2 score to probability values in order to compare the results with the new CLD3 model.

Please suggest a linq query for my requirement

Can anyone suggest a LINQ query for the requirement below?
There is a checkbox on the form. When it is clicked, the data table below has to be grouped according to ItemCode, Sum(SoldQty), StockInHand, LatestRecordValueOfSales, Amount, Description.
The following columns can't be grouped:
solddate - show the latest sold date
department
category
ItemCode Description UOM SoldQty Stock in Hand SellPrice Amount
---------------------------------------------------------------
100 Paracetamol 200MG UOM1 5 -5 3 8 0 100 1/21/2013 MEAT INDIAN BEAF
100 Paracetamol 200MG UOM1 5 -5 3 8 0 100 1/21/2013 MEAT INDIAN BEAF
200 frozen meat Kilograms 0.005 88.19 4 4.01 0 200 1/21/2013 OTHERS INDIAN BEAF
200 frozen meat Kilograms 0.044 88.19 4 4.04 0 200 1/21/2013 OTHERS INDIAN BEAF
100 Paracetamol 200MG UOM1 5 -5 3 8 0 100 1/22/2013 MEAT INDIAN BEAF
200 frozen meat Kilograms 0.054 88.19 4 4.05 0 200 1/22/2013 OTHERS INDIAN BEAF
200 frozen meat Kilograms 0.055 88.19 4 4.06 0 200 1/22/2013 OTHERS INDIAN BEAF
========================================================================
General query
var resQuery = from i in someQueryable
group i by new {i.groupProperty1, i.groupProperty2} into g
select new
{
Property1 = g.Key.groupProperty1,
Property2 = g.Key.groupProperty2,
Total = g.Sum(p => p.SumProperty),
// other properties
};
For your example data it could be like:
var resQuery = from i in dbContext.Items
group i by new{ i.ItemCode, i.Description, i.UOM} into g
select new
{
ItemCode = g.Key.ItemCode,
TotalSold = g.Sum(p => p.SoldQty),
Description = g.Key.Description,
UOM = g.Key.UOM
// other properties
};
Try example on Ideone: http://ideone.com/xXwgoG
Similar questions asked on SO many times:
Linq Objects Group By & Sum
LINQ Lambda Group By with Sum
Multiple group by and Sum LINQ
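To check what the grouped result should contain, the same group-and-aggregate logic can be sketched in Python. Only three values per row are used (ItemCode, SoldQty and the sold date, read off the sample rows in the question); the remaining columns are left out because their mapping to the headers is not clear from the table. The names rows and groups are illustrative.

```python
from datetime import datetime
from decimal import Decimal

# (ItemCode, SoldQty, SoldDate) taken from the sample rows; other columns omitted.
rows = [
    ("100", "5", "1/21/2013"),
    ("100", "5", "1/21/2013"),
    ("200", "0.005", "1/21/2013"),
    ("200", "0.044", "1/21/2013"),
    ("100", "5", "1/22/2013"),
    ("200", "0.054", "1/22/2013"),
    ("200", "0.055", "1/22/2013"),
]

groups = {}
for item_code, sold_qty, sold_date in rows:
    day = datetime.strptime(sold_date, "%m/%d/%Y").date()
    entry = groups.setdefault(item_code, {"total_sold": Decimal("0"), "latest": day})
    entry["total_sold"] += Decimal(sold_qty)       # counterpart of g.Sum(p => p.SoldQty)
    entry["latest"] = max(entry["latest"], day)    # latest sold date in the group

for item_code, entry in groups.items():
    print(item_code, entry["total_sold"], entry["latest"])
```

Decimal is used instead of float so that the small quantities add up exactly, which mirrors the decimal? fields used on the .NET side.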
Below is my code. It works, but for the first row only, the SoldQty and Amount values are doubled, while the data in the other rows is fine. I can't understand why Sum(SoldQty) is doubled only for the first row.
decimal? SoldQty, stockinhand,SellPrice,Amount,CostPrice;
string ItemCode, Description,UOM,BarCode,SoldDate,Department,Category,User;
var resQuery = from row in dtFilter.AsEnumerable()
group row by row.Field<string>("Item Code") into g
select dtFilter.LoadDataRow(new object[]
{
ItemCode=g.Key,
Description=g.Select(r=>r.Field<string>("Description")).First<string>(),
UOM=g.Select(r=>r.Field<string>("UOM")).First<string>(),
SoldQty = g.Sum(r => r.Field<decimal?>("Sold Qty")).Value,
stockinhand=g.Select(r=>r.Field<decimal?>("Stock in Hand")).First<decimal?>(),
SellPrice=g.Select(r=>r.Field<decimal?>("Sell Price")).First<decimal?>(),
Amount = g.Sum(r => r.Field<decimal?>("Amount")).Value,
CostPrice = g.Sum(r => r.Field<decimal?>("Cost Price")).Value,
BarCode=g.Select(r=>r.Field<string>("Barcode")).First<string>(),
SoldDate=g.Select(r=>r.Field<string>("SoldDate")).Last<string>(),
Department=g.Select(r=>r.Field<string>("Department")).First<string>(),
Category=g.Select(r=>r.Field<string>("Category")).First<string>(),
User=g.Select(r=>r.Field<string>("User")).First<string>(), }, false);
