Replace specific junk characters from column in hive - hadoop

I have an issue where one of the columns loaded into a Hive table contains a junk character sequence ("~) appended to the actual value (ABC). So the value that's visible for this column is ABC"~.
This column can hold either ABC (or any such string) or NULL. The table is huge and UPDATE is not an option here.
My idea is to create a temp table whose column contains either the string (ABC) or NULL, i.e. to strip the junk characters ("~) completely while copying the data from the original table to this temp table.
Any help on how I can remove this junk? I tried using the regexp functions, but with no success. Any suggestions?

I was not using regexp_replace properly; my fault.
The data initially loaded into the table had the extra characters attached to the column's values. For example: if the column's actual value was Adf452, then the cell contained Adf452"~.
So I loaded the data into a temp table like this:
insert overwrite table tempTable select colA, colB, regexp_replace(colC, "\"~", ""), partitionedCol from origTable;
This loaded the data into tempTable without those junk characters.
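A quick way to sanity-check the pattern before rewriting a huge table is to try it on a literal first (recent Hive versions allow a FROM-less SELECT; the value here is just the example from above):
select regexp_replace('Adf452"~', '"~', '');   -- returns Adf452
With single quotes the pattern needs no escaping; inside a double-quoted string the quote must be escaped as \" as in the insert above.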

Related

Why does CSVREAD not work as expected when it is supposed to read the column names from the csv file?

According to the H2 documentation for CSVREAD
If the column names are specified (a list of column names separated with the fieldSeparator), those are used, otherwise (or if they are set to NULL) the first line of the file is interpreted as the column names.
I'd expect reading the csv file
id,name,label,origin,destination,length
81,foobar,,19,11,27.4
like this
insert into route select * from csvread ('routes.csv',null,'charset=UTF-8')
would work. However, a JdbcSQLIntegrityConstraintViolationException is actually thrown, saying NULL not allowed for column "ORIGIN" and indicating error code 23502.
If I explicitly add the column names to the insert statement like so,
insert into route (id,name,label,origin,destination,length) select * from csvread ('routes.csv',null,'charset=UTF-8')
it works fine. However, I'd prefer not to repeat myself - following the DRY principle :)
Using version 2.1.212.
The CSVREAD function produces a virtual table. Its column names can be specified in parameters or in the CSV file.
The INSERT command with a query doesn't map column names from the query to the column names of the target table; it uses their ordinal positions instead. The value from the first column of the query is inserted into the first column of the insert column list (or into the first column of the target table if no insert column list is specified), the second into the second column, and so on.
You can omit the insert column list only if your table was defined with the same columns, in the same order, as in the source query (in your case, the CSV file). If your table declares its columns in a different order, or has additional columns, you need to specify the list.
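For illustration, here is a sketch of a table definition that would reproduce the error; the column layout is an assumption, and the point is only the positional mapping:
create table route (
    id int primary key,
    name varchar(100),
    origin int not null,      -- third in the table, while LABEL is third in the CSV
    destination int not null,
    label varchar(100),
    length double
);
-- Fails: the empty LABEL field (third in the CSV) is matched by position to ORIGIN.
insert into route select * from csvread('routes.csv', null, 'charset=UTF-8');
-- Works: the explicit column list maps each query column to the intended target column.
insert into route (id, name, label, origin, destination, length)
    select * from csvread('routes.csv', null, 'charset=UTF-8');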

Delete row in hive external table

I loaded a text file into a Hive external table. The text file uses / as the delimiter between columns. Additionally, one of the columns contains newline characters, and because of that there is a mismatch in the data stored in the external table. In my case the unique key is row_id, which contains values like 1_234, i.e. numeric parts joined by an underscore. But because of the newline characters in the text file, some rows ended up with arbitrary text in row_id.
Is there any way to delete those rows in hive or how can I remove the new line character in text file in hdfs?
You will have to write a Hadoop job (streaming is an option) to clean your data before loading it into Hive.
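If the malformed rows can simply be discarded rather than repaired, a Hive-only alternative is to copy just the good rows into a clean table, keeping only row_ids that match the expected pattern (table and column names here are hypothetical):
insert overwrite table clean_table
select * from ext_table
where row_id rlike '^[0-9]+_[0-9]+$';
This drops the rows whose keys were corrupted by the embedded newlines; actually repairing those rows still means cleaning the source file in HDFS.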

Insert part of data from csv into oracle table

I have a CSV (pipe-delimited) file as below
ID|NAME|DES
1|A|B
2|C|D
3|E|F
I need to insert the data into a temp table where I already have SQL*Loader in place, but my table has only one column. Below is the control file configuration for loading from the CSV.
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE EMPLOYEE
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
NAME
)
How do I select the data from only the 2nd column of the CSV and insert it into the single column in the table EMPLOYEE?
Please let me know if you have any questions.
If you use a filler field, you don't need a matching column in the database table - that's the point, really - and as long as you know the field you're interested in is always the second one, you don't need to modify the control file when there are extra fields in the file; you just never specify them.
So this works, with just a filler ID field added and the three-field data file you showed:
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE EMPLOYEE
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
ID FILLER,
NAME
)
Demoed with:
SQL> create table employee (name varchar2(30));
$ sqlldr ...
Commit point reached - logical record count 3
SQL> select * from employee;
NAME
------------------------------
A
C
E
Adding more fields to the data file makes no difference, as long as they are after the field you are actually interested in. The same thing works for external tables, which can be more convenient for temporary/staging tables, as long as the CSV file is available on the database server.
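For comparison, a sketch of that external-table route (directory, table, and file names are hypothetical, and this relies on the ORACLE_LOADER driver ignoring data fields, like id and des here, that match no table column):
CREATE OR REPLACE DIRECTORY data_dir AS '/path/to/csv';
CREATE TABLE employee_ext (
  name VARCHAR2(30)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    SKIP 1
    FIELDS TERMINATED BY '|'
    MISSING FIELD VALUES ARE NULL
    (id, name, des)
  )
  LOCATION ('employees.csv')
);
-- The staging data is then just a query away:
SELECT name FROM employee_ext;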
Columns in the data file which need to be excluded from the load can be defined as FILLER.
In the given example, list all incoming fields and add FILLER to those columns that need to be ignored, e.g.
(
ID FILLER,
NAME,
DES FILLER
)
Another issue here is to ignore the header line of the CSV, so just use the OPTIONS clause, e.g.
OPTIONS(SKIP=1)
LOAD DATA ...
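Putting those two pieces together, the complete control file for this approach would look something like this (keeping the options from the question):
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET UTF8
TRUNCATE
INTO TABLE EMPLOYEE
FIELDS TERMINATED BY '|'
TRAILING NULLCOLS
(
ID FILLER,
NAME,
DES FILLER
)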

update table based on concatenated column value

I have a table with only 4 columns
First column - the concatenated column values for each row from another table. The columns are concatenated based on column id from the metadata table; the order of concatenation follows the order of the column ids.
Second column - the comma-separated primary key columns.
Now, based on the primary keys in the second column, I need to update the third column, which should hold the values extracted for those primary keys from the concatenated first column.
Fourth column - it has the table name.
I am using a cursor and string functions, and it works perfectly fine, but when I tested it against millions of rows it failed and the performance was very poor.
Could anyone please give me a single update query for the same?
There is a comparison tool which compares the data between 2 tables in different databases but with the same data structure, and it dumps the mismatched rows into a table with all the columns concatenated (pipe separated). The columns are in the same order as the column ids, and I know the primary keys for that table (concatenated, but pipe separated). So, based on this data, I need to extract the primary key values for the rows with a data mismatch.
I need to do something like
Update column4 (primary key values, pipe separated, extracted from column2)
Check this LINK, maybe it will be useful. With that query you can concatenate values with whatever separator you need (this works from version 11gR2 on; for earlier versions use the xmlagg, xmlelement, extract method).
CREATE TABLE TEST (FIELD INT);
INSERT INTO TEST VALUES(1);
INSERT INTO TEST VALUES(2);
INSERT INTO TEST VALUES(3);
INSERT INTO TEST VALUES(4);
SELECT listagg(FIELD,',' ) WITHIN GROUP (ORDER BY FIELD)
FROM TEST
Returns '1,2,3,4'
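For pre-11gR2 databases, the xmlagg route mentioned above looks roughly like this (same TEST table; use getclobval() instead of getstringval() if the result may exceed 4000 characters):
SELECT rtrim(
         xmlagg(xmlelement(e, FIELD || ',') ORDER BY FIELD)
           .extract('//text()').getstringval(),
         ',') AS concatenated
FROM TEST
Also returns '1,2,3,4'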

Difference Between Insert and Append statement in SQL Loader?

Can anyone tell me the difference between the INSERT and APPEND statements in SQL*Loader? Consider the below example:
Here is my control file
load_1.ctl
load data
infile 'load_1.dat' "str '\r\n'"
insert into table sql_loader_1 -- or: append into table sql_loader_1
(
load_time sysdate,
field_2 position( 1:10),
field_1 position(11:20)
)
Here is my data file
load_1.dat
0123456789abcdefghij
**********##########
foo bar
here comes a very long line
and the next is
short
The documentation is fairly clear; use INSERT when you're loading into an empty table, and APPEND when adding rows to a table that (might) contain data (that you want to keep).
APPEND will still work if your table is empty. INSERT might be safer if you're expecting the table to be empty, as it will raise an error if that isn't true, possibly avoiding unexpected results (particularly if you don't notice them and don't get other errors, like unique index constraint violations) and/or a post-load data cleanse.
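A quick way to see the difference (a sketch reusing the table and data file from the question; the exact message may vary by version, but an INSERT load into a non-empty table fails with an error along the lines of "SQL*Loader-601: For INSERT option, table must be empty"):
-- insert.ctl: succeeds only while sql_loader_1 is empty
load data
infile 'load_1.dat' "str '\r\n'"
insert into table sql_loader_1
(load_time sysdate, field_2 position( 1:10), field_1 position(11:20))
-- append.ctl: the same load, but it also succeeds when rows already exist
load data
infile 'load_1.dat' "str '\r\n'"
append into table sql_loader_1
(load_time sysdate, field_2 position( 1:10), field_1 position(11:20))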
The difference comes down to two points:
append only adds new records after the data already in the table; existing rows are left untouched.
insert assumes the whole table is free, i.e. that it is loading into an empty table.
Unlike a SQL INSERT statement, where you can target a subset of columns, both options load whole records as described by the control file; the choice only affects what happens to rows already in the table.
So it's also true that you cannot use insert if your table already has data; only when it's empty can you use insert.
hope it helps
