How can I use one-hot encoding on the categorical data of a dataset and then apply feature selection on the dataset?

I have a dataset in which there are features of both float and object type. I want to apply feature selection to this dataset as follows: first find the mutual information score of every feature with the target, then choose the 20 highest-scoring features and run SFFS on them. So I use mutual_info_classif in my code, but I get the error "could not convert string to float" because one of my features (Name=Grade) is categorical, with the unique values A, B, C, D. I have searched for a solution and everybody says that in this situation you should use one-hot encoding, but I can't understand how to use one-hot encoding here, because I want to score each feature, not each category of a feature. And if, for example, category A gets a high score and category D gets a low score, how can I decide whether or not to select the Grade feature?
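For what it's worth, one common way out (a minimal sketch on toy data; the DataFrame, column names, and the max-aggregation below are assumptions, not part of the question) is to one-hot encode the categorical feature, score every encoded column with mutual_info_classif, and then collapse the per-category scores back to one score per original feature, so that Grade is kept or dropped as a whole:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Toy data standing in for the real dataset (columns are hypothetical).
df = pd.DataFrame({
    'Age':    [25, 32, 47, 51, 38, 29],
    'Grade':  ['A', 'B', 'C', 'D', 'A', 'B'],
    'Target': [1, 0, 1, 0, 1, 0],
})

X = df.drop(columns=['Target'])
y = df['Target']

# One-hot encode the categorical column: Grade -> Grade_A, Grade_B, ...
X_enc = pd.get_dummies(X, columns=['Grade'])

# Mutual information score for every (encoded) column against the target.
scores = pd.Series(mutual_info_classif(X_enc, y, random_state=0),
                   index=X_enc.columns)

# Collapse the per-category scores back to a single score for 'Grade',
# e.g. by taking the maximum over its dummy columns; the whole feature is
# then selected or dropped based on that one number.
grade_score = scores.filter(like='Grade_').max()
print(grade_score)
Whether to aggregate with max, mean, or sum is a design choice; max asks "does any category of Grade carry information about the target?", which matches the keep-or-drop-the-whole-feature framing.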

Related

(Google Sheets) How to remove certain dropdown options once a certain number of cells contain said option?

I'm currently working on a Google Sheets file to organize the members of my class. I am currently assigning committees and I want the members to choose their committee in Google Sheets. However, I want to apply a limit per committee.
What I want to happen is: once a certain choice has been chosen, say, 5 times, I would like that choice to disappear from the options, and to reappear if a student ever changes their choice. However, I do not know how to do this with a formula or through data validation.
I would really appreciate your help. Thank you!
Here's a toy example you may be able to adapt to your needs:
Create a list of options a,b,c,d,e in A1:E1 of Sheet1
Create a list of the limits for each option in A2:E2 (for instance 2,1,3,5,3)
Create a list of people Person1,Person2,Person3 in G2:G4
Apply data validation to H2:H4:
Use criteria 'drop down (from a range)'
Set the data range to =Sheet1!$A3:$E3 (only lock columns, not rows)
In A3 enter the following formula:
=lambda(people,choices,list,limits,
makearray(counta(people),counta(list),lambda(r,c,
if(index(choices,r)<>index(list,,c),if(countif(choices,index(list,,c))<index(limits,,c),index(list,,c),),index(list,,c)))))(
$G$2:$G$4,$H$2:$H$4,$A$1:$E$1,$A$2:$E$2)
We are using MAKEARRAY to create a 2D array with the list of options on each row; however, we ask it to omit an option from a person's row if that person hasn't already selected it and the preset limit on the number of selections for that option has been reached. Obviously in a 'real' example you would place the data range for validation in a separate sheet, and probably hide and protect that sheet as well. You could also potentially use an array literal of strings rather than a cell range as the list of options, in order to make the validation list formula completely self-contained.

Replicate "Countif" in PowerBI using DAX

I am a new user of DAX and Power BI, but I am familiar with Excel. I want to replicate these COUNTIF formulas in DAX. In Excel, they count how many times a specific text string (in this case, the name of a brand) appears in a column, for example:
=COUNTIF(BH2:BH31,"Brand_A"), it is counting how many times the text "Brand_A" appears in the selection.
I would like to know how I can do this in DAX in Power BI. If anyone would be interested in providing some sample code I could try out, that would be very helpful.
You will likely want something like the COUNTX or COUNTAX function, combined with a FILTER, to replicate the functionality of Excel's COUNTIF.
https://learn.microsoft.com/en-us/dax/countax-function-dax
https://learn.microsoft.com/en-us/dax/countx-function-dax
E.g.
=COUNTAX(FILTER('YourTable', 'YourTable'[BrandColumn] = "Brand_A"), 'YourTable'[BrandColumn])
Power BI's various COUNT functions have slightly different criteria for whether a row gets counted (based on whether the function considers purely "empty" cells, or how the expression is evaluated), so you'd need to check the docs for each function and work out which one suits your specific requirement.
(And by the way, a Google search of "Power BI COUNTIF" will give you plenty of results where you will find a range of different examples that should help)
You can use this calculation (COUNTX may be slow because it's an iterator):
CountIf = CALCULATE(COUNTROWS('YourTable'), FILTER(ALL('YourTable'), 'YourTable'[Brand] = "YourBrand"))

How do h2o models determine what columns to use for predictions (position, name, etc.)?

I'm using the h2o Python API to train some models and am a bit confused about how to correctly implement some parts of the API: specifically, which columns should be ignored in a training dataset, how models look for the actual predictor features in a dataset when the model's predict() method is used, and how weight columns should be handled (when the actual prediction datasets don't really have weights).
The details of the code here (I think) are not majorly important, but the basic training logic looks something like
drf_dx = h2o.estimators.H2ORandomForestEstimator(
    # denoting update version name by epoch timestamp
    model_id='drf_dx_v' + str(version) + 't' + str(int(time.time())),
    response_column='dx_outcome',
    ignored_columns=[
        'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
        'procedure_outcome', 'provider_outcome',
        'weight'
    ],
    weights_column='weight',
    ntrees=64,
    nbins=32,
    balance_classes=True,
    binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train,
             training_frame=train_u, validation_frame=val_u,
             max_runtime_secs=max_train_time_hrs * 60 * 60)
(note the ignored columns) and the prediction logic just looks like
preds = model.predict(X)
where X is some (h2o) dataframe with more (or fewer) columns than the X_train used to train the model (it includes some columns kept for post-processing exploration in a Jupyter notebook). E.g. the X_train columns may look like
<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>
and X columns may look like
<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>
My question is: is this going to confuse the model when making predictions? I.e., is the model finding the columns to use as features by column name (in which case I don't think the different dataframe width would be a problem), or is it going by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or something else? And what happens given that these extra columns were not explicitly listed in the ignored_columns arg in the model constructor?
** Slight aside: should the weights_column name be in the ignored_columns list or not? The example in the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature, and also seems to recommend it:
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).
but these weight values are not something that comes with the data used in actual predictions.
I've summarized your question into a few distinct parts, so the answers will be in a Q/A type fashion.
1) When I use my_model.predict(X), how does H2O-3 know which columns to predict with?
H2O-3 will use the columns that you passed as predictors when you built your model: i.e. whatever you passed to the x argument in the estimator, or all the columns included in your training_frame that were not ignored via ignored_columns, passed as the target via the y argument, or dropped because the column has a constant value. My recommendation would be to use the x argument to specify your predictors and ignore the ignored_columns parameter. If X, the new dataframe you are predicting on, includes columns that were not used when you were building the model, those columns will be ignored - so matching is by column name, not column position.
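To make that concrete, here's a minimal sketch of the name-based behaviour (file paths and column names below are hypothetical, not taken from the question):
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
train = h2o.import_file("train.csv")        # hypothetical training data
scoring = h2o.import_file("score.csv")      # same features plus extra columns

predictors = ["age", "lab_value", "grade"]  # hypothetical predictor names
model = H2ORandomForestEstimator(ntrees=64)
model.train(x=predictors, y="dx_outcome", training_frame=train)

# Columns in `scoring` are matched to the model's predictors by name;
# any column not among the predictors is simply ignored.
preds = model.predict(scoring)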
2) Should the weights column name be in the ignored column list?
No, if you pass the weights column to the ignored column list, that column will not be considered in any fashion during the model building phase. In fact, if you test this out, you should get a null pointer error or something similar.
3) Why is the "weights" column specified as a predictor and as the weights_column in the code example from the docs?
This is a great question! I've created two Jira tickets: one to update the documentation to clear up the confusion, and another to potentially add a user warning.
The short answer is: if you pass the same column to the predictors argument x and to the weights_column argument, that column will only be used as a weight - it will not be used as a feature.
4) Does the user guide recommend using the weights as a feature and as a weight?
No, in the paragraph you are pointing to, the recommendation is to ensure that the column you pass as your weights_column exists in your training frame and validation frame - not that it should also be included as a feature.
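Tying 2) - 4) together, a sketch of the weights handling under the same hypothetical names: leave the weights column out of x and out of ignored_columns, and pass it only via weights_column.
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()
train = h2o.import_file("train.csv")  # hypothetical; includes a 'weight' column
val = h2o.import_file("val.csv")      # should carry 'weight' too, for metrics

model = H2ORandomForestEstimator(ntrees=64)
model.train(x=["age", "lab_value", "grade"],  # 'weight' deliberately not listed
            y="dx_outcome",
            weights_column="weight",          # used only for weighting
            training_frame=train,
            validation_frame=val)

# Frames passed to predict() don't need a 'weight' column: the weights affect
# training and the metrics computed on train/validation, not the predictions.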

How to change nominal attribute value order in WEKA GUI?

I have 2 data sets, for training and testing, with Weka. They both have the same number of attributes and the same data type for each variable (numeric or nominal), but they are not compatible with each other because the order of the nominal values is different.
e.g. Training set:
Occupation
1 Doctor 40%
2 Engineer 40%
3 Teacher 20%
Test set:
Occupation
1 Engineer 40%
2 Doctor 40%
3 Teacher 20%
So the two sets are incompatible. My question is: how can I change the order of these distinct values to make them compatible?
It looks a bit like a data pre-processing issue. I am quite curious as to how the training and testing data ended up looking like this!
If you would like to change the nominal values, you could use the RenameNominalValues filter to rename the labels of your data. One possible method is to apply this to your testing data.
This solution assumes that you are dealing with a nominal attribute, that it is your last attribute, and that the labels are mapped as shown in the valueReplacements field.
Failing this, depending on the number of cases, you could edit the values manually or use your favourite spreadsheet to replace the values.
Hope this helps!
Use "SwapValues" under unsupervised > attribute

Limiting AutoIncrement to a specific range

I am trying to create an application for work. The app will be used internally and should allow us to assign barcode numbers to our product SKUs. I am using Visual Studio / Visual Basic 2010 Express to build this, as my very limited beginner's experience is with VS 2010 Express.
I'll give a bit of information about how I see this application working and then I'll get on with my actual question:
I see the app allowing a user to create a new product in the database by entering the SKU and description of the product; the app will then assign that product the next available base number for the barcode, and from there the app will (if required) generate the correct EAN13 and GTIN14 barcodes and store them against that SKU.
As a company we have a large range of barcode numbers we can use and we have split this large range up so that the first 50,000 (for example) are for our EAN13 codes, the next 50K are for our GTIN14 codes for Inner Cartons and the remaining 50K are for Master Cartons.
So in order to achieve this I have my Product table which contains the fields 'SKU', 'Description' and 'BarcodeBase'. I have managed to set the BarcodeBase field as unique and I am attempting to use AutoIncrement(Seed & Step) to make sure that this assigns the product a base barcode (before I calculate the check digit) that falls within the EAN13 range as described above...
So finally my question is: Is there a way I can put an upper limit on AutoIncrement so that on the off chance, way way in the future, the base barcode number will not overflow into the next range?
I've been googling unsuccessfully for an answer and I am only coming across things which talk about the data type of the field having a limit. For example the upper limit of an Int32 type. Through my searches I have become vaguely aware of the 'Expression' property of the field and also the possibility of coding a partial class - but I don't know if that is the right direction to go in or if there is something much simpler that I am overlooking / have not found.
I would really appreciate any help!
Edit: As per GrandMasterFlush's comment - I have added a local database to my VS project. So I think I am using a SQL Server Compact 3.5 db.
Use a CHECK constraint, e.g.:
ALTER TABLE dbo.Product ADD CONSTRAINT ...
CHECK (BarcodeBase BETWEEN 1 AND 50000);
I suggest you do not make BarcodeBase an IDENTITY column in the Product table (IDENTITY is the feature you are referring to as "autoincrement"). IDENTITY is really designed for surrogate-key use only and isn't ideal for meaningful business data: you can't update an IDENTITY column, it isn't necessarily sequential, it may have gaps in the number sequence, and you only get one IDENTITY column per table. Instead of using IDENTITY in the Product table, you can generate the sequence elsewhere, for example by incrementing a single value stored in a one-row table, as sketched below.
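For illustration, a rough sketch of that one-row counter idea (table and column names are made up, and the read-and-increment should be wrapped in a transaction in real use):
-- One-row table holding the next unassigned EAN13 base number.
CREATE TABLE BarcodeCounter (NextBase INT NOT NULL);
INSERT INTO BarcodeCounter (NextBase) VALUES (1);

-- Allocate a number: advance the counter only while it is still inside the
-- EAN13 range, then read back the value just claimed. An UPDATE that affects
-- zero rows means the range is exhausted.
UPDATE BarcodeCounter SET NextBase = NextBase + 1 WHERE NextBase <= 50000;
SELECT NextBase - 1 AS AllocatedBase FROM BarcodeCounter;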
