I get a KeyError whenever I try to encode categorical features in my dataframe - categorical-data

These are the names of the categorical features that I have stored in a list:
my_list=['MSZoning','Street','LotShape','LandContour','Utilities','LotConfig','LandSlope',
'Neighborhood','Condition1','Condition2','BldgType','HouseStyle','RoofStyle','RoofMatl',
'Exterior1st','Exterior2nd','MasVnrType','ExterQual','ExterCond','Foundation','BsmtFinType2',
'Heating','HeatingQC','CentralAir','KitchenQual','Functional','GarageType','GarageFinish',
'GarageQual','GarageCond','PavedDrive','SaleType','SaleCondition']
My code for encoding is as follows:
for cols in my_list:
    df[cols] = pd.get_dummies(df[cols], drop_first=True)
I get the following error:
KeyError: 'MSZoning'
During handling of the above exception, another exception occurred:
I tried the same approach on another dataset and it worked just fine, but here it is giving me the above error.

First of all, welcome to Stack Overflow.
for cols in my_list:
    df[cols] = pd.get_dummies(df[cols], drop_first=True)
Reason for the error:
You are getting the KeyError because of the argument drop_first=True in the pd.get_dummies() function.
Explanation
Before making any dummy column, the pd.get_dummies() function checks whether that column is present in the dataframe or not.
For example, say you want to make a dummy column for MSZoning in your dataframe df. The function (get_dummies) first checks whether an MSZoning column is present in your df or not. If it is present in df, it will delete that column and make a new column with the name MSZoning, because you have written drop_first=True, which means it drops or deletes the first column with the same name as the dummy that is being created.
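For reference, here is a small standalone snippet (with made-up category values, not the actual data) showing what pd.get_dummies() returns for a single column, with and without drop_first:
import pandas as pd

# Toy column with made-up MSZoning-style values (hypothetical data)
s = pd.Series(["RL", "RM", "FV", "RL"], name="MSZoning")

print(pd.get_dummies(s))                   # one indicator column per category
print(pd.get_dummies(s, drop_first=True))  # the first category's column is dropped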
Solution
Remove the drop_first=True argument and write it like below:
for cols in my_list:
    df[cols] = pd.get_dummies(df[cols])
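As a side note, this KeyError can also occur simply because some names in my_list are not columns of df (for example, after an earlier run has already replaced them). A minimal defensive sketch, assuming df and my_list are as in the question:
import pandas as pd

# Encode only the listed columns that are actually present in df,
# and report any that are missing instead of raising a KeyError.
present = [c for c in my_list if c in df.columns]
missing = [c for c in my_list if c not in df.columns]
if missing:
    print("Not found in df:", missing)

# pd.get_dummies can also encode all of these columns in one call.
df = pd.get_dummies(df, columns=present)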

Related

Getting an error in the SPLIT function inside an array formula when using it with a column title (array literal)

I am using these formulas in my Google Sheets. With an array literal, I am getting an error when there are comma-separated inputs that need to be split, but it works fine when there is only one value in column K. It also works fine without the column title. Can someone explain the error in the first formula?
={"Don't Edit this Column TargetGroup ID";Arrayformula(IFERROR(SPLIT(MainSheet!K2:K,",",TRUE, True),""))}
and
=Arrayformula(IFERROR(SPLIT(MainSheet!K2:K,",",TRUE, True),""))
Try this one:
={
"Don't Edit this Column TargetGroup ID", Arrayformula(SPLIT(REPT(",", COLUMNS(SPLIT(MainSheet!K2:K,",")) - 2), ",", True, False));
Arrayformula(IFERROR(SPLIT(MainSheet!K2:K,","),""))
}
You had only one string value for the first row in your array literal ({}), so it is only one column.
Presumably, SPLIT found at least one comma and produced a range of at least two columns, which cannot be attached below that first row of yours (the header string) because they do not match column-wise.
The SPLIT(REPT(...), ...) part generates the needed number of empty cells to append to the right of your header so that the number of columns will match.
If that is not the case, then please provide the error message or, even better, share a sample sheet where this issue is reproduced.

Merge two bags and get all the fields from the first bag in Pig

I am new to Pig scripting and need some help with this issue.
I have two bags in Pig, and from them I want to get all the fields from the first bag, overwriting the data of the first bag wherever the second bag has data for the same field.
The column list is dynamic (columns may be added or deleted at any time).
In set b we may also get data in other fields that are currently blank; if so, we need to overwrite set a with the data available in set b.
columns - uniqueid,catagory,b,c,d,e,f,region,g,h,date,direction,indicator
EG:
all_data= COGROUP a by (uniqueid), b by (uniqueid);
Output:
(1,{(1,test,,,,,,,,city,,,,,2020-06-08T18:31:09.000Z,west,,,,,,,,,,,,,A)},{(1,,,,,,,,,,,,,,2020-09-08T19:31:09.000Z,,,,,,,,,,,,,,N)})
(2,{(2,test2,,,,,,,,dist,,,,,2020-08-02T13:06:16.000Z,east,,,,,,,,,,,,A)},{(2,,,,,,,,,,,,,,2020-09-08T18:31:09.000Z,,,,,,,,,,,,,,N)})
Expected Result:
(1,test,,,,,,,,city,,,,,2020-09-08T19:31:09.000Z,west,,,,,,,,,,,,,N)
(2,test2,,,,,,,,dist,,,,,2020-09-08T18:31:09.000Z,east,,,,,,,,,,,,N)
I was able to achieve the expected output with the following:
final = FOREACH all_data GENERATE flatten($1),flatten($2.(region)) as region ,flatten($2.(indicator)) as indicator;
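For clarity, the merge rule described in the question (keep the first bag's value unless the second bag has a non-empty value for the same field) can be sketched in plain Python with hypothetical records:
# Hypothetical records illustrating the merge rule; field names are from the question.
a = {"uniqueid": "1", "region": "west", "date": "2020-06-08T18:31:09.000Z", "indicator": "A"}
b = {"uniqueid": "1", "region": "",     "date": "2020-09-08T19:31:09.000Z", "indicator": "N"}

# Take b's value only when it is non-empty; otherwise keep a's value.
merged = {k: (b.get(k) or v) for k, v in a.items()}
print(merged)  # date and indicator come from b, region stays "west" from a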

Sorting in Excel gives an error when there is duplicate data

I am trying a formula for sorting a column in Excel. Data is entered in one column and the sorted data is output in another. Please see the attached Excel file.
As you can see, due to duplicate data in the second column I am getting rank 2 twice, and VLOOKUP cannot find an entry named 3, so it gives an error. When all the data entered is unique there is no problem, but with duplicated data I am having a problem. What can I do?
Kbv.
Just noticed a third possibility that you might be looking for. So you want to sort the items in B, where column A indicates the rank of each item in B:
C2
=VLOOKUP(SMALL($A$2:$A$7,ROW(1:1)), $A$2:$B$7, 2,FALSE)
Old answer:
If you want a custom sort where column A indicates the rank of the resulting item in C, copy the following formula into C2 and fill down.
C2
=SMALL(B$2:B$7,A2)
If you just want to copy the column sorted "naturally", you don't need any helper column; just type this in the first row and fill down (I used column D in my example image below):
D2:
=SMALL(B$2:B$7,ROW(1:1))

Working with duplicated columns in SparkR

I am working on a problem where I need to load a large number of CSVs and do some aggregations on them with SparkR.
I need to infer the schema whenever I can (so that integers etc. are detected).
I need to assume that I can't hard-code the schema (there is an unknown number of columns in each file, or the schema can't be inferred from the column names alone).
I can't infer the schema from a CSV file with a duplicated header value - it simply won't let you.
I load them like so:
df1 <- read.df(sqlContext, file, "com.databricks.spark.csv", header = "true", delimiter = ",")
It loads OK, but when I try to run any sort of job (even a simple count()) it fails:
java.lang.IllegalArgumentException: The header contains a duplicate entry: # etc
I tried renaming the headers in the schema with:
new <- make.unique(c(names(df1)), sep = "_")
names(df1) <- new
schema(df1) # new column names present in schema
But when I try count() again, I get the same duplicate error as before, which suggests it refers back to the old column names.
I feel like there is a really easy way, apologies in advance if there is. Any suggestions?
The spark-csv package doesn't currently seem to have a way to skip lines by index, and if you don't use header="true", your header with duplicates will become the first data line, which will mess with your schema inference. If you happen to know what character your header with duplicates starts with, and know that no other line will start with that character, you can put it in as the comment character setting and that line will be skipped. For example:
df <- read.df(sqlContext, "cars.csv", "com.databricks.spark.csv", header="false", comment="y", delimiter=",", nullValue="NA", mode="DROPMALFORMED", inferSchema="true")
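Side note, not part of the original answer: in newer Spark versions the same comment-character trick is available in the built-in CSV reader as well. A rough PySpark sketch, with the file name and comment character assumed from the answer above:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read without a header; lines starting with the chosen comment character
# ("y" here, assumed to be the first character of the duplicated header)
# are skipped, so that header never reaches schema inference.
df = (spark.read
      .option("header", "false")
      .option("comment", "y")
      .option("nullValue", "NA")
      .option("mode", "DROPMALFORMED")
      .option("inferSchema", "true")
      .csv("cars.csv"))
df.count()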

Ignore error in SSIS

I am getting a "Violation of UNIQUE KEY constraint 'AK_User'. Cannot insert duplicate key in object 'dbo.tblUsers'" error when trying to copy data from an Excel file to a SQL database using SSIS.
Is there any way of ignoring this error and letting the package continue to the next record without stopping?
What I need is: if it inserts three records and the first record is a duplicate, instead of failing it should continue with the other records and insert them.
There is a system variable called Propagate which can be used to continue or stop the execution of the package.
1. Create an OnError event handler for the task that is failing. Generally it is created for the entire Data Flow Task.
2. Press F4 to get the list of all variables and click on the icon at the top to show system variables. By default the Propagate variable will be True; you need to change it to False, which basically means that SSIS won't propagate the error to other components and will let the execution continue.
Update 1:
To skip the bad rows there are basically two ways to do so:
1. Use a Lookup.
Try to match the primary key column values in the source and the destination, then connect the Lookup No Match Output to your destination. If the value doesn't match the destination, insert the rows; otherwise just skip them, or redirect them to some table or flat file using the Lookup Match Output.
Example
For more details on Lookup, refer to this article.
2. Or you can redirect the error rows to a flat file or a table. Every SSIS Data Flow component has an Error Output.
For example, for the Derived Column component, the error output dialog box is:
But this approach may not be helpful in your case, as redirecting error rows at the destination doesn't work properly. If an error occurs, it redirects the entire batch without inserting any rows in the destination. I think this happens because the OLE DB destination does a bulk insert, or inserts data using transactions. So try to use a Lookup to achieve your functionality.
