I have a CSV file and I need to delete the columns whose values are all empty.
date,A1,A2,A3,A4,A5
2020-07-14 00:00:00.0,,,,10,,
Any suggestions, please?
According to the H2 documentation for CSVREAD:
If the column names are specified (a list of column names separated with the fieldSeparator), those are used, otherwise (or if they are set to NULL) the first line of the file is interpreted as the column names.
I'd expect that reading the CSV file
id,name,label,origin,destination,length
81,foobar,,19,11,27.4
like this
insert into route select * from csvread ('routes.csv',null,'charset=UTF-8')
would work. Instead, a JdbcSQLIntegrityConstraintViolationException is thrown, saying NULL not allowed for column "ORIGIN" (error code 23502).
If I explicitly add the column names to the insert statement like so,
insert into route (id,name,label,origin,destination,length) select * from csvread ('routes.csv',null,'charset=UTF-8')
it works fine. However, I'd prefer not to repeat myself - following the DRY principle :)
Using version 2.1.212.
The CSVREAD function produces a virtual table. Its column names can be specified in parameters or in the CSV file.
An INSERT command with a query doesn't map column names from the query to column names of the target table; it uses their ordinal positions instead. The value from the first column of the query is inserted into the first column of the insert column list (or into the first column of the target table if no list is specified), the second into the second column, and so on.
You can omit the insert column list only if your table was defined with the same columns, in the same order, as the source query (in your case, the CSV file). If your table declares its columns in a different order, or has additional columns, you need to specify the list.
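For example, here is a minimal sketch with a hypothetical column order that would reproduce the failure and the fix:

create table route (id int primary key, name varchar, origin int not null, destination int not null, length double, label varchar);
-- select * returns the columns in CSV header order: id, name, label, origin, destination, length.
-- Without an insert column list, the empty label value lands in origin by position, violating NOT NULL.
-- Naming the columns maps each query column onto the right table column:
insert into route (id, name, label, origin, destination, length) select * from csvread ('routes.csv', null, 'charset=UTF-8');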
I have an issue where one of the columns loaded into a Hive table contains a junk character sequence ("~) appended to the actual value (ABC). So the value visible in this column is ABC"~.
This column can hold either ABC (or any such string) or NULL. The table is huge, and UPDATE is not an option here.
My idea is to create a temp table in which this column contains either the string (ABC) or NULL, stripping the junk characters ("~) completely while copying the data from the original table to the temp table.
How can I remove this junk? I tried the regexp function, but with no success. Any suggestions?
I was not using regexp properly; my fault.
The data initially loaded into the table had the extra characters attached to one column's values. For example: if the column's actual value was Adf452, the cell contained Adf452"~.
So I loaded the data to a temp table like this:
insert overwrite table tempTable select colA, colB, regexp_replace(colC, "\"~", ""), partitionedCol from origTable;
This loaded the data into tempTable without the junk characters.
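A quick sanity check (same hypothetical table and column names as above) that no junk survived the copy:

-- expect a count of 0 if the "~ suffix is gone everywhere
select count(*) from tempTable where colC rlike '"~';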
I loaded a text file into a Hive external table. The file uses / as the delimiter between columns, but some rows also contain a newline character inside one of the columns. Because of that, the data stored in the external table is misaligned. My unique key is row_id, which contains values like 1_234, so row_id should be numeric. But because of the newline characters in the text file, some rows have text in row_id.
Is there any way to delete those rows in Hive, or how can I remove the newline characters from the text file in HDFS?
You will have to write a Hadoop job (streaming is an option) to clean your data before loading it into Hive.
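If simply dropping the misaligned rows is enough, a filter inside Hive itself may also do; here is a minimal sketch, assuming the table is named src and a valid row_id always matches the digits-underscore-digits pattern:

create table src_clean like src;
-- orphaned fragments produced by the stray newlines end up with non-numeric row_id values and fail the pattern
insert overwrite table src_clean select * from src where row_id rlike '^[0-9]+_[0-9]+$';

Note this only discards the orphaned fragments; the rows that were split still lose whatever followed the newline, so cleaning the file before loading is the more complete fix.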
I am using the dropDuplicates method to remove duplicate entries on columns A and B in a DataFrame, and I am saving the resulting DataFrame to an empty SQL table that has a primary key on columns A and B. Sometimes the new DataFrame still has duplicate values on columns A and B.
newdf = df.dropDuplicates(Seq("A", "B"))
newdf.write.mode("append").jdbc(url,table,prop)
So while inserting into the table, I get a java.sql.BatchUpdateException: Duplicate entry exception.
Isn't dropDuplicates expected to remove all duplicate entries on columns A and B? And how can I run the batch operations under a try/catch so that if one batch fails, the other batches go forward instead of the whole job failing?
dropDuplicates removes duplicates within the current Dataset, but you are using the append writer mode. There is no guarantee that the Dataset doesn't contain duplicates of data that is already in the table.
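One way to guard against that is an anti-join against the keys already in the table before appending; a sketch reusing the names from your snippet (spark, df, url, table, prop), where left_anti drops any row whose (A, B) pair already exists:

// keys already present in the target table
val existing = spark.read.jdbc(url, table, prop).select("A", "B")
// dedupe within the batch, then drop rows whose keys are already in the table
val toInsert = df.dropDuplicates(Seq("A", "B")).join(existing, Seq("A", "B"), "left_anti")
toInsert.write.mode("append").jdbc(url, table, prop)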
I have a text file which looks like the one below:
ID1~name1~city1~zipcode1~position1
ID2~name2~city2~zipcode2~position2
ID3~name3~city3~zipcode3~position3
ID4~name4~city4~zipcode4~position4
... and so on.
This text file is the source. I want to split each line on the delimiter (~) and compare the records with a table by ID.
If the ID is not in the table, an insert should be performed.
If the ID is in the table but the other column values differ, the table needs to be updated.
If the ID is not in the text file but is in the table, the record should be deleted.
I did google it, but I could only find the page below:
https://www.experts-exchange.com/questions/27419804/VBScript-compare-differences-in-two-record-sets.html
Please help me with how to proceed in VBScript.
Whose leg are you trying to pull? The desired resulting table is exactly the contents of the input file, so use LOAD DATA INFILE to import it.
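A minimal sketch, assuming MySQL and a hypothetical table person(id, name, city, zipcode, position); because the file is the desired end state, the insert/update/delete logic collapses into a plain reload:

-- the file represents the whole desired table, so clear it out and reload
truncate table person;
load data infile '/path/to/source.txt' into table person fields terminated by '~' lines terminated by '\n';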