I am trying to load data from a flat file into a table. The flat file contains blank lines (bare LF, linefeed, characters), and for every such line one empty row is inserted into the table.
I tried commands such as TRIM and REPLACE, but none of them work here. Could anyone please help?
sample flat file
test.txt
ID NAME
1 abc
2 def
(linefeed)
3 hij
(linefeed)
(linefeed)
4 klm
control file
test.ctl
OPTIONS (SKIP=1)
LOAD DATA
CHARACTERSET WE8ISO8859P1 length semantics char
TRUNCATE
INTO TABLE test
TRAILING NULLCOLS
( ID char(4) ,
NAME char(18))
command used
sqlldr CONTROL=test.ctl log=test.log data=test.txt USERID=appdata/app#orcl direct=true
table data
| ID   | NAME |
| ---- | ---- |
| 1    | abc  |
| 2    | def  |
| NULL | NULL |
| 3    | hij  |
| NULL | NULL |
| NULL | NULL |
| 4    | klm  |
While loading data into the table, I need to prevent these empty rows from being inserted.
Notepad++ is a free text editor, which contains the following command in the Edit menu:
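Independently of the editor route, the blank lines can be filtered out with a short script before the file reaches SQL*Loader. The following is a minimal sketch, not part of the original answer: the function names and the output file name `test_clean.txt` are hypothetical, and the `latin-1` encoding is chosen on the assumption that it matches the WE8ISO8859P1 character set named in the control file.

```python
from typing import Iterable, Iterator


def drop_blank_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that contain something other than whitespace."""
    for line in lines:
        if line.strip():
            yield line


def clean_file(src_path: str, dst_path: str) -> None:
    """Copy src_path to dst_path, leaving out blank lines.

    latin-1 is an assumption based on CHARACTERSET WE8ISO8859P1.
    """
    with open(src_path, encoding="latin-1") as src, \
         open(dst_path, "w", encoding="latin-1") as dst:
        dst.writelines(drop_blank_lines(src))


# In-memory demonstration using the sample file from the question.
sample = ["ID NAME\n", "1 abc\n", "2 def\n", "\n", "3 hij\n", "\n", "\n", "4 klm\n"]
cleaned = list(drop_blank_lines(sample))
print("".join(cleaned), end="")
```

With the real file, one would call `clean_file("test.txt", "test_clean.txt")` and then pass `data=test_clean.txt` to sqlldr.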
This is my table's create script:
CREATE TABLE IF NOT EXISTS replacing_test (
addr String,
ver UInt64,
stt String,
time DateTime
) engine=ReplacingMergeTree(ver)
PARTITION BY toYYYYMM(time)
PRIMARY KEY addr
ORDER BY addr
I have 2 rows as follows:
ABC | 0 | NEW | 2020-04-17 12:39:52
ABC | 2 | DONE | 2020-04-17 12:40:52
When I insert the 2 rows above at separate times, in the order shown, I get this after merging:
ABC | 2 | DONE | 2020-04-17 12:40:52
This is what I expect.
But when I insert these 2 rows at the same time, by reading them from a backup in random order, the result is:
ABC | 0 | DONE | 2020-04-17 12:39:52
Is there some behavior here that I don't know about?
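For reference, the expectation stated in the question can be written down as a small model. This is only a sketch of the documented ReplacingMergeTree rule (among rows with the same sorting key in one partition, a merge keeps the row with the highest `ver`, and on a tie the row inserted last). It is not ClickHouse code, the function name is hypothetical, and it does not by itself explain the mixed row reported above.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, int, str, str]  # (addr, ver, stt, time)


def replacing_merge(rows: List[Row]) -> List[Row]:
    """Keep, per sorting key (addr), the whole row with the highest ver.

    On equal versions the row seen later replaces the earlier one.
    """
    best: Dict[str, Row] = {}
    for row in rows:
        addr, ver = row[0], row[1]
        if addr not in best or ver >= best[addr][1]:
            best[addr] = row
    return list(best.values())


new = ("ABC", 0, "NEW", "2020-04-17 12:39:52")
done = ("ABC", 2, "DONE", "2020-04-17 12:40:52")

# Insert order should not matter under this rule.
print(replacing_merge([new, done]))
print(replacing_merge([done, new]))
```

Under this model a surviving row is always one of the inserted rows as a whole, so a result that combines `ver` 0 with `stt` DONE would have to come from the data that was actually inserted rather than from the merge.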
I am trying to load the table below, which has two array-typed columns, into Hive.
Base table:
Array<int> col1 Array<string> col2
[1,2] ['a','b','c']
[3,4] ['d','e','f']
I have created the table in hive as below:
create table base(col1 array<int>,col2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
And then loaded the data as below:
load data local inpath '/home/hduser/Desktop/batch/hiveip/basetable' into table base;
I then ran the query below:
select * from base;
I got the following output:
[null,null] ["['a'","'b'","'c']"]
[null,null] ["['d'","'e'","'f']"]
The data is not coming out in the correct format.
Please help me find where I am going wrong.
You can change the datatype of col1 to array of string instead of array of int; then you will get the data for col1.
With col1 datatype as Array(string):-
hive>create table base(col1 array<string>,col2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
hive>select * from base;
+--------------+------------------------+--+
| col1 | col2 |
+--------------+------------------------+--+
| ["[1","2]"] | ["['a'","'b'","'c']"] |
| ["[3","4]"] | ["['d'","'e'","'f']"] |
+--------------+------------------------+--+
The reason for this behaviour is that Hive cannot read the values inside the array as integers, because the values 1,2 are enclosed in [] and the brackets end up as part of the elements.
Accessing col1 elements:-
hive>select col1[0],col1[1] from base;
+------+------+--+
| _c0 | _c1 |
+------+------+--+
| [1 | 2] |
| [3 | 4] |
+------+------+--+
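The effect can be reproduced outside Hive with a rough model of the delimited row format: split the line on the field delimiter (tab), split each field on the collection delimiter (comma), and convert items to int where the column type asks for it. This is an illustrative sketch with hypothetical function names, not Hive's actual parsing code.

```python
from typing import List, Optional


def parse_int(text: str) -> Optional[int]:
    """Return the integer value of text, or None when it is not a plain integer."""
    try:
        return int(text)
    except ValueError:
        return None


def split_row(line: str) -> List[List[str]]:
    """Split a line into fields on tab, then each field into items on comma."""
    return [field.split(",") for field in line.rstrip("\n").split("\t")]


line = "[1,2]\t['a','b','c']"
col1_items, col2_items = split_row(line)

print(col1_items)                           # ['[1', '2]']
print([parse_int(x) for x in col1_items])   # [None, None]
print(col2_items)                           # ["['a'", "'b'", "'c']"]
```

The brackets stay attached to the first and last items, so neither item of col1 is a valid integer, which corresponds to the `[null,null]` seen in the question.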
(or)
With col1 datatype as Array(int type):-
If you do not want to change the datatype, then the input file has to hold the array values (i.e. col1) without the [] square brackets, as below.
1,2 ['a','b','c']
3,4 ['d','e','f']
Then create the table exactly as in the question, and Hive can read the leading 1,2 as array elements of type int.
hive> create table base(col1 array<int>,col2 array<string>) row format delimited fields terminated by '\t' collection items terminated by ',';
hive> select * from base;
+--------+------------------------+--+
| col1 | col2 |
+--------+------------------------+--+
| [1,2] | ["['a'","'b'","'c']"] |
| [3,4] | ["['d'","'e'","'f']"] |
+--------+------------------------+--+
Accessing array elements:-
hive> select col1[0] from base;
+------+--+
| _c0 |
+------+--+
| 1 |
| 3 |
+------+--+
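If the input file cannot be produced without brackets in the first place, they can be stripped in a pre-processing step. The sketch below is an assumption-laden illustration rather than part of the answer: the function name is hypothetical, and it removes the surrounding brackets from every tab-separated field (so from col2 as well, which goes one step further than the file shown above). The single quotes around the col2 items are left untouched.

```python
def strip_brackets(line: str) -> str:
    """Remove one pair of surrounding [ ] from each tab-separated field."""
    cleaned = []
    for field in line.rstrip("\n").split("\t"):
        if field.startswith("[") and field.endswith("]"):
            field = field[1:-1]
        cleaned.append(field)
    return "\t".join(cleaned)


rows = ["[1,2]\t['a','b','c']\n", "[3,4]\t['d','e','f']\n"]
for row in rows:
    print(strip_brackets(row))
```

Writing the returned lines to a new file and loading that file keeps the `array<int>` definition from the question unchanged.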
I have a table that I load with an SCD Type 2 job. The first run loaded the data below:
no | name | loc
---|------|--------
1  | abc  | hyd
2  | def  | bang
3  | ghi  | chennai
The second run then loaded the data below:
no | name | loc
---|------|--------
1  | abc  | hyd
2  | def  | bang
3  | ghi  | chennai
1  | abc  | bang
There are no dates, flags, or run IDs in the table.
How can I find the updated record from the second run in this situation?
Thanks
I don't think you'll be able to distinguish between the updated record and the original record.
A dimension table using Type 2 SCD requires additional columns that describe the period in which the record is valid (or current), exactly for this reason.
The solution is to ensure your dimension table has these columns (typically ValidFrom and ValidTo dates or date/times, and sometimes an IsCurrent flag for good measure). Your ETL process would then populate these columns as part of making the Type 2 updates.
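How an ETL step might maintain those columns can be sketched in a few lines. This is a minimal in-memory illustration, not a real ETL job: the class and function names, the snake_case column names, and the load dates are all hypothetical, and `valid_to = None` is used here to mean "still open".

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    no: int
    name: str
    loc: str
    valid_from: date
    valid_to: Optional[date] = None   # None means the version is still open
    is_current: bool = True


def apply_type2(dim: List[DimRow], no: int, name: str, loc: str, load_date: date) -> None:
    """Insert a new version of a record, closing the previous one if it changed."""
    current = next((r for r in dim if r.no == no and r.is_current), None)
    if current is not None:
        if (current.name, current.loc) == (name, loc):
            return  # unchanged: keep the existing current version
        current.valid_to = load_date
        current.is_current = False
    dim.append(DimRow(no, name, loc, valid_from=load_date))


dim: List[DimRow] = []
first_load = date(2024, 1, 1)    # hypothetical load dates
second_load = date(2024, 2, 1)

for no, name, loc in [(1, "abc", "hyd"), (2, "def", "bang"), (3, "ghi", "chennai")]:
    apply_type2(dim, no, name, loc, first_load)
for no, name, loc in [(1, "abc", "bang"), (2, "def", "bang"), (3, "ghi", "chennai")]:
    apply_type2(dim, no, name, loc, second_load)

for row in dim:
    print(row)
```

With these columns in place, the updated record is simply the current row for `no = 1`, and the original is the row whose validity period was closed on the second load.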
I cannot get the genre column values to display when I load the data below:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale 1995)::Comedy|Drama
5::Father of the Bride Part II (1995)::Comedy
6::Heat (1995)::Action|Crime|Thriller
7::Sabrina (1995)::Comedy|Romance
I have tried creating the table with this DDL:
create table movie(movie_id int, movie_name string , genre string) row format delimited fields terminated by '::';
And
create table movie(movie_id int, movie_name string , genre array<string>) row format delimited fields terminated by '::' collection items terminated by '|';
But the genre column still comes out blank.
Please help me create the table correctly so that the data loads and I get the desired result.
The output is:
hive> select * from movie;
OK
1 Toy Story (1995)
2 Jumanji (1995)
3 Grumpier Old Men (1995)
4 Waiting to Exhale 1995)
5 Father of the Bride Part II (1995)
6 Heat (1995)
7 Sabrina (1995)
You have to use the multi-delimiter SerDe (MultiDelimitSerDe).
P.S.
You might want to consider loading genre as array<string> instead of string.
create external table movie
(
movie_id int
,movie_name string
,genre array<string>
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
with serdeproperties
(
"field.delim" = "::"
,"collection.delim" = "|"
)
;
select * from movie
;
+----------+------------------------------------+--------------------------------------+
| movie_id | movie_name | genre |
+----------+------------------------------------+--------------------------------------+
| 1 | Toy Story (1995) | ["Animation","Children's","Comedy"] |
| 2 | Jumanji (1995) | ["Adventure","Children's","Fantasy"] |
| 3 | Grumpier Old Men (1995) | ["Comedy","Romance"] |
| 4 | Waiting to Exhale 1995) | ["Comedy","Drama"] |
| 5 | Father of the Bride Part II (1995) | ["Comedy"] |
| 6 | Heat (1995) | ["Action","Crime","Thriller"] |
| 7 | Sabrina (1995) | ["Comedy","Romance"] |
+----------+------------------------------------+--------------------------------------+
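The difference between the two table definitions can be modelled with plain string splitting. The sketch below assumes that the plain `row format delimited` form honours only the first character of the field delimiter, which is consistent with the output in the question; the function names are hypothetical and this is not Hive's code.

```python
from typing import List


def split_single_char(line: str, delim: str) -> List[str]:
    """Split on the first character of delim only (assumed plain-delimited behaviour)."""
    return line.split(delim[0])


def split_multi_char(line: str, delim: str) -> List[str]:
    """Split on the whole delimiter string, as a multi-delimiter SerDe would."""
    return line.split(delim)


line = "1::Toy Story (1995)::Animation|Children's|Comedy"

# With ':' as the effective delimiter, an empty field appears after the id,
# so the title lands in the third column and the genres are cut off.
print(split_single_char(line, "::")[:3])   # ['1', '', 'Toy Story (1995)']

fields = split_multi_char(line, "::")
genres = fields[2].split("|")
print(fields[:2], genres)
```

Under that assumption the three table columns receive the id, an empty string, and the title, which would make it look as if one column were blank.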
I have a Data file that looks like this:
1 2 3 4 5 6
FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980
FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980
FirstName3 | LastName3 | 2344327 | Address1 | PhoneNumber1 | 1/1/1980
FirstName4 | LastName4 | 5998943 | Address1 | PhoneNumber1 | 1/1/1980
FirstName5 | LastName5 | 9854531 | Address1 | PhoneNumber1 | 1/1/1980
My DB has 2 tables, one for PERSON and one for ADDRESS, so I need to store columns 1, 2, 3 and 6 in PERSON and columns 4 and 5 in ADDRESS. All examples provided in the SQL*Loader documentation address this case, but only for fixed-size columns, and my data file is pipe delimited (and splitting it into 2 different data files is not an option).
Does anyone know how to do this?
As always help will be deeply appreciated.
Another option may be to set up the file as an external table and then run inserts selecting the columns you want from the external table.
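-- skip=1 assumes the data file starts with a header line
-- column names below are placeholders; use the real PERSON and ADDRESS column names
options (skip=1)
load data
infile "data file path"
insert
into table person
fields terminated by '|'
trailing nullcols
(first_name, last_name, person_no, addr_skip filler, phone_skip filler, birth_date)
-- position(1) makes the second table re-read each record from its start
into table address
fields terminated by '|'
trailing nullcols
(f1 filler position(1), f2 filler, f3 filler, address, phone_number)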
Even if SQL*Loader doesn't support this (I'm not sure), nothing stops you from pre-processing the file with, say, awk and then loading the results. For example:
awk -F '|' -v OFS='|' '{print $1, $2, $3, $6}' 1.dat > person.dat
awk -F '|' -v OFS='|' '{print $4, $5}' 1.dat > address.dat
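The same split can be done in one pass with a short script, which also trims the spaces around the pipes. This is a sketch under stated assumptions: the function name and the two constants are hypothetical, every data line is assumed to have six pipe-separated fields, and any header line is assumed to have been removed already.

```python
from typing import Iterable, List, Tuple

PERSON_COLS = (0, 1, 2, 5)   # columns 1, 2, 3 and 6 of the data file
ADDRESS_COLS = (3, 4)        # columns 4 and 5 of the data file


def split_rows(lines: Iterable[str]) -> Tuple[List[List[str]], List[List[str]]]:
    """Split pipe-delimited lines into PERSON rows and ADDRESS rows."""
    person, address = [], []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = [f.strip() for f in line.split("|")]
        person.append([fields[i] for i in PERSON_COLS])
        address.append([fields[i] for i in ADDRESS_COLS])
    return person, address


sample = [
    "FirstName1 | LastName1 | 4224423 | Address1 | PhoneNumber1 | 1/1/1980\n",
    "FirstName2 | LastName2 | 4008933 | Address1 | PhoneNumber1 | 1/1/1980\n",
]
person_rows, address_rows = split_rows(sample)

for row in person_rows:
    print("|".join(row))
for row in address_rows:
    print("|".join(row))
```

Each list can then be written to its own file and loaded with a single-table control file.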