Removing categories with patsy and statsmodels - statsmodels

I am using statsmodels and patsy to build a logistic regression model. I'll use pseudocode here. Suppose I have a dataframe containing a categorical variable, say Country, with 200 levels. I have reason to believe some of them are predictive, so I build a model as in
formula = 'outcome ~ C(Country)'
patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy:
formula = 'outcome ~ C(Country) - C(Country)[GB]'
I tried this and it did not change anything.

I don't know whether a patsy formula can subset a categorical variable, but you can do it in the DataFrame.
For example:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
'outcome': np.random.random(size),
'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')
print(df.Country)
0 ES
1 IT
2 UK
3 US
4 UK
..
95 FR
96 UK
97 ES
98 UK
99 US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Suppose we want to remove the category "US":
# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Even though no rows with category "US" remain in the DataFrame, the category itself is still there. If we used this DataFrame in a statsmodels model, we would get a singular matrix error, so we need to remove the unused categories:
# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
Now we can fit a model:
mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())
Optimization terminated successfully.
Current function value: 0.684054
Iterations 4
Logit Regression Results
==============================================================================
Dep. Variable: outcome No. Observations: 83
Model: Logit Df Residuals: 79
Method: MLE Df Model: 3
Date: Sun, 16 May 2021 Pseudo R-squ.: 0.01179
Time: 22:43:37 Log-Likelihood: -56.776
converged: True LL-Null: -57.454
Covariance Type: nonrobust LLR p-value: 0.7160
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Intercept -0.1493 0.438 -0.341 0.733 -1.007 0.708
Country[T.FR] 0.4129 0.614 0.673 0.501 -0.790 1.616
Country[T.IT] -0.1223 0.607 -0.201 0.840 -1.312 1.068
Country[T.UK] 0.1027 0.653 0.157 0.875 -1.178 1.383
=================================================================================
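The two steps above (filter the rows, then drop the now-unused level) can be wrapped in a small helper. This is only a sketch: `drop_levels` is a hypothetical name, not part of pandas, patsy, or statsmodels, and the tiny DataFrame is made-up illustration data.

```python
import pandas as pd


def drop_levels(df, column, levels):
    """Return a copy of df without the rows whose `column` value is in `levels`.

    Hypothetical helper: it also removes every unused level from the
    categorical dtype, so the design matrix gets no all-zero column.
    """
    out = df[~df[column].isin(levels)].copy()
    out[column] = out[column].cat.remove_unused_categories()
    return out


# made-up sample data, only for illustration
df = pd.DataFrame({
    "outcome": [0, 1, 0, 1, 1, 0],
    "Country": pd.Categorical(["IT", "UK", "US", "FR", "US", "ES"]),
})

trimmed = drop_levels(df, "Country", ["US"])
print(list(trimmed["Country"].cat.categories))
```

The returned frame can then be passed as `data=` to the formula interface, as in the answer above. Note that `remove_unused_categories` drops any level without rows, not only the ones listed in `levels`.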

Related

How to plot shaded areas on a seaborn countplot

I would like to be able to show shaded areas on a seaborn countplot as shown below. The idea is to show the covid lockdown periods that fall within the data period. I have the countplot, but I can't figure out how to add the shaded areas.
My current df contains dates of properties for sale (from domain.com.au), and I have a small dataset of covid dates:
The code that generates the seaborn plot:
fig, ax = plt.subplots(figsize=(16,10));
sns.countplot(data=df, x='dates', ax=ax)
Representation of what I am looking to produce (created in Excel).
Assuming your property dataset looks like this:
df = pd.DataFrame({
"date": ["2022-07-01", "2022-07-01", "2022-07-01", "2022-07-02", "2022-07-02", "2022-07-03", "2022-07-04", "2022-07-05", "2022-07-05", "2022-07-05", "2022-07-06"],
"prop": ['test'] * 11
})
df['date'] = pd.to_datetime(df['date']).dt.date
date prop
0 2022-07-01 test
1 2022-07-01 test
2 2022-07-01 test
3 2022-07-02 test
4 2022-07-02 test
5 2022-07-03 test
6 2022-07-04 test
7 2022-07-05 test
8 2022-07-05 test
9 2022-07-05 test
10 2022-07-06 test
Then you could do a bar plot of the counts and shade date ranges with matplotlib's axvspan:
fig, ax = plt.subplots(figsize=(16, 10))
# count the listings per date
prop_count = df['date'].value_counts()
ax.bar(prop_count.index, prop_count.values)
# shade each lockdown period (covid_df is the covid dates dataset from the question)
for start_date, end_date in zip(covid_df['start_date'], covid_df['end_date']):
    ax.axvspan(start_date, end_date, alpha=0.1, color='red')
plt.show()
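For reference, a self-contained sketch of the same idea follows. The question's covid dataset was not included in the post, so the `covid_df` below (its `start_date` and `end_date` columns and the single lockdown period) is an assumption made up for illustration. The dates are kept as pandas timestamps here rather than `datetime.date` objects.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd

# property listings, same dates as the sample data above
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2022-07-01", "2022-07-01", "2022-07-01", "2022-07-02",
        "2022-07-02", "2022-07-03", "2022-07-04", "2022-07-05",
        "2022-07-05", "2022-07-05", "2022-07-06",
    ])
})

# ASSUMPTION: made-up lockdown period, one row per period
covid_df = pd.DataFrame({
    "start_date": pd.to_datetime(["2022-07-02"]),
    "end_date": pd.to_datetime(["2022-07-04"]),
})

fig, ax = plt.subplots(figsize=(8, 4))

# one bar per date, bar height = number of listings on that date
counts = df["date"].value_counts().sort_index()
bars = ax.bar(counts.index, counts.values)

# one translucent vertical band per lockdown period
spans = [
    ax.axvspan(start, end, alpha=0.1, color="red")
    for start, end in zip(covid_df["start_date"], covid_df["end_date"])
]
```

In an interactive session, `plt.show()` would display the figure; the `Agg` backend line can be dropped there.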

Calculate features at multiple training windows in Featuretools

I have a table with customers and transactions. Is there a way to get features filtered to the last 3/6/9/12 months? I would like to automatically generate features such as:
number of trans in last 3 months
....
number of trans in last 12 months
average trans in last 3 months
...
average trans in last 12 months
I've tried using training_window=["1 month", "3 months"], but it does not seem to return multiple features for each window.
Example:
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
window_features = ft.dfs(entityset=es,
target_entity="customers",
training_window=["1 hour", "1 day"],
features_only = True)
window_features
Do I have to do individual windows separately and then merge the results?
As you mentioned, in Featuretools 0.2.1 you have to build the feature matrices individually for each training window and then merge the results. With your example, you would do that as follows:
import pandas as pd
import featuretools as ft
es = ft.demo.load_mock_customer(return_entityset=True)
cutoff_times = pd.DataFrame({"customer_id": [1, 2, 3, 4, 5],
"time": pd.date_range('2014-01-01 01:41:50', periods=5, freq='25min')})
features = ft.dfs(entityset=es,
target_entity="customers",
agg_primitives=['count'],
trans_primitives=[],
features_only = True)
fm_1 = ft.calculate_feature_matrix(features,
entityset=es,
cutoff_time=cutoff_times,
training_window='1h',
verbose=True)
fm_2 = ft.calculate_feature_matrix(features,
entityset=es,
cutoff_time=cutoff_times,
training_window='1d',
verbose=True)
new_df = fm_1.reset_index()
new_df = new_df.merge(fm_2.reset_index(), on="customer_id", suffixes=("_1h", "_1d"))
Then, the new dataframe will look like:
customer_id  COUNT(sessions)_1h  COUNT(transactions)_1h  COUNT(sessions)_1d  COUNT(transactions)_1d
1            1                   17                      3                   43
2            3                   36                      3                   36
3            0                   0                       1                   25
4            0                   0                       0                   0
5            1                   15                      2                   29
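With more than two windows, chaining `merge` calls with `suffixes` gets awkward. One possible alternative, sketched below with plain pandas, is to suffix each feature matrix with its window label and concatenate them on the shared index. The two small frames stand in for the matrices returned by `calculate_feature_matrix` and reuse the values from the table above; the names `fm_1h`, `fm_1d`, and `matrices` are assumptions for illustration.

```python
import pandas as pd

index = pd.Index([1, 2, 3, 4, 5], name="customer_id")

# stand-ins for the per-window feature matrices (values from the table above)
fm_1h = pd.DataFrame(
    {"COUNT(sessions)": [1, 3, 0, 0, 1],
     "COUNT(transactions)": [17, 36, 0, 0, 15]},
    index=index,
)
fm_1d = pd.DataFrame(
    {"COUNT(sessions)": [3, 3, 1, 0, 2],
     "COUNT(transactions)": [43, 36, 25, 0, 29]},
    index=index,
)

# window label -> feature matrix; add as many windows as needed
matrices = {"1h": fm_1h, "1d": fm_1d}

# tag every column with its window, then align all matrices on customer_id
combined = pd.concat(
    [fm.add_suffix(f"_{window}") for window, fm in matrices.items()],
    axis=1,
)
print(combined)
```

This relies on every matrix being indexed by the same `customer_id` values, which holds when they are all computed from the same cutoff times.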

PIG - retrieve data from XML using XPATH

I have n XML files of this type.
<students roll_no="1">
  <name>abc</name>
  <gender>m</gender>
  <maxmarks>
    <marks>
      <year>2014</year>
      <maths>100</maths>
      <english>100</english>
      <spanish>100</spanish>
    </marks>
    <marks>
      <year>2015</year>
      <maths>110</maths>
      <english>110</english>
      <spanish>110</spanish>
    </marks>
  </maxmarks>
  <marksobt>
    <marks>
      <year>2014</year>
      <maths>90</maths>
      <english>95</english>
      <spanish>82</spanish>
    </marks>
    <marks>
      <year>2015</year>
      <maths>94</maths>
      <english>98</english>
      <spanish>02</spanish>
    </marks>
  </marksobt>
</students>
I need output like
roll_no name gender year eng_max_marks maths_max_marks spanish_max_marks
1 abc m 2014 100 100 100
1 abc m 2015 110 110 110
I am able to retrieve the marks row-wise in a single statement, but I am not able to extract roll_no and name along with them.
A = LOAD 'student.xml' using org.apache.pig.piggybank.storage.XMLLoader('marks') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'marks/year'), XPath(x, 'marks/english'), XPath(x, 'marks/maths'), XPath(x, 'marks/spanish');
This returns
year eng_max_marks maths_max_marks spanish_max_marks
2014 100 100 100
2015 110 110 110
I can extract both chunks, but I don't see how to join the other fields. I can't use a cross join because I have n other files.
Let's forget the attribute (roll_no) for now. How can I extract the rest of the nodes?
name gender year eng_max_marks maths_max_marks spanish_max_marks
abc m 2014 100 100 100
abc m 2015 110 110 110
I don't want to use the marks(1)/english approach because these nodes can also vary, and I don't want to adopt a dirty workaround.
Any pointers?

Products are not being assigned to categories via Magmi

I am using Magmi to upload products, and the upload itself works fine.
The only problem is that the products do not show up on the front end, although they do show up in the admin.
When I looked for the reason, I found that the products are not being assigned to any category; when I assigned them manually, they showed up on the front end.
Can anybody help?
Here is a sample of my CSV:
sku _store _attribute_set _type _category _root_category _product_websites ada_compliant backplate_dimension base_dimension brand bulb_included bulb_type bulb_wattage canopy_dimension carton_height carton_length carton_width collection1 cost country_of_manufacture country_orgin created_at custom_design custom_design_from custom_design_to custom_layout_update depth description designer diameter dimension enable_googlecheckout energy extension finish finish1 gallery gender gift_message_available harddrive_speed hardrive has_options height height_1 image image_label in_depth lamping length manufacturer1 max_resolution media_gallery megapixels memory meta_description meta_keyword meta_title minimal_price model msrp msrp_display_actual_price_type msrp_enabled name news_from_date news_to_date no_bulbs options_container page_layout price processor ram_size required_options response_time room screensize shade_color shade_dimension shade_material shape shirt_size shoe_size shoe_type short_description small_image small_image_label special_from_date special_price special_to_date status style switch tax_class_id thumbnail thumbnail_label updated_at url_key url_path visibility weight width qty min_qty use_config_min_qty is_qty_decimal backorders use_config_backorders min_sale_qty use_config_min_sale_qty max_sale_qty use_config_max_sale_qty is_in_stock notify_stock_qty use_config_notify_stock_qty manage_stock use_config_manage_stock stock_status_changed_auto use_config_qty_increments qty_increments use_config_enable_qty_inc enable_qty_increments is_decimal_divided _links_related_sku _links_related_position _links_crosssell_sku _links_crosssell_position _links_upsell_sku _links_upsell_position _associated_sku _associated_default_qty _associated_position _tier_price_website _tier_price_customer_group _tier_price_qty _tier_price_price _group_price_website _group_price_customer_group _group_price_price _media_attribute_id _media_image _media_lable _media_position _media_is_disabled
EP777777-81 admin Default simple Wall Lights/Wall Sconces base No Maxim Lighting No Medium base bulbs 100 29.72 33.66 10.43 Basix 170 Contemporary collection with sweeping arms and clean lines. Offered in Ice glass and Satin Nickel finish or Wilshire glass and Oil Rubbed Bronze finish. Maxim Lighting 31.5 H x 32 W x L Dry Locations Satin Nickel 1 31.5 /10001CLPC.jpg Basix 9-Light Chandelier Maxim Lighting 0 Basix 9-Light Chandelier7777 Ceiling Lights, Chandeliers, lighting, lights, Maxim Lighting Maxim Lighting Basix 9-Light Chandelier $510.00 Basix 9-Light Chandelier9999 9 255 Basix 9-Light Chandelier /10001CLPC.jpg Basix 9-Light Chandelier 1 Contemporary 2 /10001CLPC.jpg Basix 9-Light Chandelier 4 26 32 10 0 1 0 0 1 1 1 100 1 1 1 0 1 0 1 0 1 0 0
I had the same problem, but I was using the Magmi classes in a personal project rather than the Magmi interface.
I solved it by adding "category_ids" => "2" to the product data array, where 2 is the ID of the category.
It may not help you, but it may help others.
Looking at your CSV, some of your column names are incorrect (including Category)
For example:
_store _attribute_set _type _category _product_websites
Should be:
store attribute_set type category websites
You can see the required column names at Magmi: Import New Products
Also ensure you have the On the fly category creator/importer plugin enabled in your Magmi configuration, and that your Category column values follow the format outlined in the documentation.
Try naming the field "categories" instead of "category"

Please suggest a linq query for my requirement

Can anyone suggest a LINQ query for the requirement below?
There is a checkbox on the form. When it is clicked, the datatable below has to be grouped by ItemCode, showing Sum(SoldQty), StockInHand, the latest record's value of sales, Amount, and Description.
The following columns cannot be grouped:
solddate - show the latest sold date
department
category
ItemCode Description UOM SoldQty Stock in Hand SellPrice Amount
---------------------------------------------------------------
100 Paracetamol 200MG UOM1 5 -5 3 8 0 100 1/21/2013 MEAT INDIAN BEAF
100 Paracetamol 200MG UOM1 5 -5 3 8 0 100 1/21/2013 MEAT INDIAN BEAF
200 frozen meat Kilograms 0.005 88.19 4 4.01 0 200 1/21/2013 OTHERS INDIAN BEAF
200 frozen meat Kilograms 0.044 88.19 4 4.04 0 200 1/21/2013 OTHERS INDIAN BEAF
100 Paracetamol 200MG UOM1 5 -5 3 8 0 100 1/22/2013 MEAT INDIAN BEAF
200 frozen meat Kilograms 0.054 88.19 4 4.05 0 200 1/22/2013 OTHERS INDIAN BEAF
200 frozen meat Kilograms 0.055 88.19 4 4.06 0 200 1/22/2013 OTHERS INDIAN BEAF
========================================================================
General query
var resQuery = from i in someQueryable
               group i by new { i.groupProperty1, i.groupProperty2 } into g
               select new
               {
                   Property1 = g.Key.groupProperty1,
                   Property2 = g.Key.groupProperty2,
                   Total = g.Sum(p => p.SumProperty),
                   // other properties
               };
For your example data it could be like:
var resQuery = from i in dbContext.Items
               group i by new { i.ItemCode, i.Description, i.UOM } into g
               select new
               {
                   ItemCode = g.Key.ItemCode,
                   TotalSold = g.Sum(p => p.SoldQty),
                   Description = g.Key.Description,
                   UOM = g.Key.UOM
                   // other properties
               };
Try example on Ideone: http://ideone.com/xXwgoG
Similar questions asked on SO many times:
Linq Objects Group By & Sum
LINQ Lambda Group By with Sum
Multiple group by and Sum LINQ
Below is my code. It works fine, except that the SoldQty and Amount values are doubled for the first row only, while the data in the other rows is correct. I can't understand why Sum(SoldQty) is doubled only for the first row.
decimal? SoldQty, stockinhand,SellPrice,Amount,CostPrice;
string ItemCode, Description,UOM,BarCode,SoldDate,Department,Category,User;
var resQuery = from row in dtFilter.AsEnumerable()
group row by row.Field<string>("Item Code") into g
select dtFilter.LoadDataRow(new object[]
{
ItemCode=g.Key,
Description=g.Select(r=>r.Field<string>("Description")).First<string>(),
UOM=g.Select(r=>r.Field<string>("UOM")).First<string>(),
SoldQty = g.Sum(r => r.Field<decimal?>("Sold Qty")).Value,
stockinhand=g.Select(r=>r.Field<decimal?>("Stock in Hand")).First<decimal?>(),
SellPrice=g.Select(r=>r.Field<decimal?>("Sell Price")).First<decimal?>(),
Amount = g.Sum(r => r.Field<decimal?>("Amount")).Value,
CostPrice = g.Sum(r => r.Field<decimal?>("Cost Price")).Value,
BarCode=g.Select(r=>r.Field<string>("Barcode")).First<string>(),
SoldDate=g.Select(r=>r.Field<string>("SoldDate")).Last<string>(),
Department=g.Select(r=>r.Field<string>("Department")).First<string>(),
Category=g.Select(r=>r.Field<string>("Category")).First<string>(),
User=g.Select(r=>r.Field<string>("User")).First<string>(), }, false);
