I have a CSV file with the column header inside the file, e.g.:
Column1 Column2 Column3
value1 value2 value 3
value1 value2 value 3
value1 value2 value 3
value1 value2 value 3
Now I want to create a Hive table using this header, and then load the rest of the file, without the header line, into the table.
Can anyone please suggest what approach should be followed in this case?
You can specify
tblproperties ("skip.header.line.count"="1");
see this SO question (Hive External table-CSV File- Header row)
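For example, a complete DDL might look like the following (a minimal sketch; the table name, delimiter, and HDFS location are assumptions, since the question does not state them):
CREATE EXTERNAL TABLE my_table (                -- hypothetical table name
  Column1 STRING,
  Column2 STRING,
  Column3 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','   -- adjust to the file's actual delimiter
LOCATION '/path/to/csv/directory'               -- hypothetical HDFS directory
TBLPROPERTIES ("skip.header.line.count"="1");   -- skip the header row on read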
You should remove the header line before loading the data into HDFS; there are no other options here.
Related
If I have a CSV like -
colName1,colName2
col1Value,col2Value
and a hive ddl like -
CREATE EXTERNAL TABLE tableName (
col2 STRING,
col1 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://location/to/testcsv/directory'
tblproperties ("skip.header.line.count"="1");
-- select col2 from tableName; gives col1Value
This is obviously because, for text files, Hive maps columns to data fields by ordinal position. If the underlying file is Parquet, the match is done by column name instead.
I was wondering whether there is a Hive SerDe someone has written, or perhaps a SerDe property I am missing, that tells Hive to map data field names to Hive table column names, so that in the above example it would return "col2Value" when col2 is queried, even though the ordinal position of col2 in the Hive table and in the data file does not match.
Thanks in advance!
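As a hedged workaround sketch consistent with the name-based matching described above (not a SerDe; the staging and copy table names are made up), one could rewrite the text data as Parquet, after which the column order in the Hive DDL no longer matters:
-- Staging table declared in the file's ordinal order
CREATE EXTERNAL TABLE staging_csv (
  colName1 STRING,
  colName2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'hdfs://location/to/testcsv/directory'
TBLPROPERTIES ("skip.header.line.count"="1");

-- Parquet copy: from here on, Hive resolves columns by name, not position
CREATE TABLE parquet_copy STORED AS PARQUET AS
SELECT colName1, colName2 FROM staging_csv;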
My understanding of the following property is that if a blank or empty string is inserted into a Hive column, it will be treated as NULL.
TBLPROPERTIES('serialization.null.format'='')
To test this functionality, I created a table and inserted '' into field3. When I query for NULLs on field3, there are no rows matching that criteria.
Is my understanding that a blank string becomes NULL correct?
CREATE TABLE CDR
(
field1 string,
field2 string,
field3 string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES('serialization.null.format'='');
insert overwrite table emmtest.cdr select field1,field2,'' from emmtest.cdr_non_orc;
select * from emmtest.cdr where field3 is null;
The last statement returned no rows, but I am expecting all rows to be returned, since field3 contains a blank string in every row.
TBLPROPERTIES('serialization.null.format'='') means the following:
An empty field in the data files will be treated as NULL when you query the table
When inserting rows into the table, NULL values will be written to the data files as empty fields
You are doing something else - you are inserting an empty string into the table from a query.
It is treated as is - an empty string.
Demo
bash
hdfs dfs -mkdir /user/hive/warehouse/mytable
echo Hello,,World | hdfs dfs -put - /user/hive/warehouse/mytable/data.txt
hive
create table mytable (s1 string,s2 string,s3 string)
row format delimited
fields terminated by ','
;
hive> select * from mytable;
OK
s1 s2 s3
Hello World
hive> alter table mytable set tblproperties ('serialization.null.format'='');
OK
hive> select * from mytable;
OK
s1 s2 s3
Hello NULL World
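The write side of the property can be sketched the same way (continuing the demo above; the inserted values are arbitrary). Once the property is set, a NULL inserted through a query should produce something like this, with the NULL written to the data file as an empty field:
hive> insert into table mytable values ('a', null, 'c');
hive> dfs -cat /user/hive/warehouse/mytable/*;
Hello,,World
a,,c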
You can use the following in your table's ROW FORMAT DELIMITED clause:
NULL DEFINED AS ''
or any character inside the quotes.
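A sketch of the CDR table from the question with this clause (available in Hive 0.13 and later):
CREATE TABLE cdr (
  field1 STRING,
  field2 STRING,
  field3 STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS '';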
I want to load input data into a Hive table. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column. But the surrounding double quotes trouble me. I have created the following table:
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
but the quotes around the fields also become part of the field values, as shown below.
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. Also, I want the third field to be read as INT; currently it becomes NULL when I declare the third column as INT while creating the table.
You will have to use the CSV SerDe (OpenCSVSerde) for this:
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
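One caveat: OpenCSVSerde treats all columns as strings regardless of the declared types, so the third field still needs an explicit cast at query time. A hedged sketch using a view (the view name is made up; Table is backtick-quoted because it is a keyword):
CREATE VIEW table_typed AS
SELECT A, B, CAST(C AS INT) AS C
FROM `Table`;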
Multiple ways to achieve this:
1. Use the CSV SerDe (as above).
2. Use the regex SerDe with the pattern "\"(.*)\"\;\"(.*)\"\;\"(.*)\"" (a sketch follows the code below).
3. Load the data into an external table, then strip the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS SELECT REGEXP_REPLACE(a,'"','') AS a, REGEXP_REPLACE(b,'"','') AS b, CAST(REGEXP_REPLACE(c,'"','') AS BIGINT) AS c FROM source;
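For option 2, a sketch with the built-in RegexSerDe (the table name is made up; in the old Hive CLI, semicolons inside the pattern may need escaping as \;, which is why the pattern above writes them that way). Each capturing group becomes one column value, leaving the quotes outside:
CREATE EXTERNAL TABLE source_regex (
  a STRING,
  b STRING,
  c STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"(.*)\";\"(.*)\";\"(.*)\""
)
LOCATION 'xyz';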
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc
I have created a Parquet file in Pig (in the directory outputset):
grunt> STORE extracted INTO './outputset' USING ParquetStorer;
The file has one record, as shown below:
grunt> mydata = LOAD './outputset/part-r-00000.parquet' using ParquetLoader;
grunt> dump mydata;
(val1,val2,val3)
grunt> describe mydata;
mydata: {val_0: chararray,val_1: chararray,val_2: chararray}
After this, I created an external table in Hive to read this file:
CREATE EXTERNAL TABLE parquet_test (
field1 string,
field2 string,
field3 string)
STORED AS PARQUET
LOCATION '/home/.../outputset';
When I query the table I am able to retrieve the one record, but all the fields are NULL, as shown below:
hive> select * from parquet_test;
NULL NULL NULL
What am I missing here?
PS :
Pig version : 0.15.0
Hive version : 1.2.1
You need to match the Hive column names exactly to the field names in the Parquet schema (as the describe output above shows, they are val_0, val_1, val_2).
So your Hive DDL should look like:
CREATE EXTERNAL TABLE parquet_test (
val_0 string,
val_1 string,
val_2 string)
STORED AS PARQUET
LOCATION '/home/.../outputset';
In order to load data (from a CSV file) into an Oracle database, I use SQL*Loader.
In the table that receives these data, there is a VARCHAR2(500) column called COMMENTS.
For certain reasons, I want to ignore this information from the CSV file.
Thus, I wrote this control file:
Options (BindSize=10000000,Readsize=10000000,Rows=5000,Errors=100)
Load Data
Infile 'XXX.txt'
Append into table T_XXX
Fields Terminated By ';'
TRAILING NULLCOLS
(
...
COMMENTS FILLER,
...
)
This code seems to work correctly, as the COMMENTS column in the database is always set to NULL.
However, if my CSV file has a record where the corresponding COMMENTS field exceeds the 500-character limit, I get an error from SQL*Loader:
Record 2: Rejected - Error on table T_XXX, column COMMENTS.
Field in data file exceeds maximum length
Is there a way to really exclude the processing of my COMMENTS field?
I can't reproduce your problem. I'm using Oracle 10.2.0.3.0 with SQL*Loader 10.2.0.1.
Here is my test case:
SQL> CREATE TABLE test_sqlldr (
2 ID NUMBER,
3 comments VARCHAR2(20),
4 id2 NUMBER
5 );
Table created
Control file:
LOAD DATA
INFILE test.data
INTO TABLE test_sqlldr
APPEND
FIELDS TERMINATED BY ';'
TRAILING NULLCOLS
( id,
comments filler,
id2
)
Data file:
1;aaa;2
3;abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz;4
5;bbb;6
I'm using the command sqlldr userid=xxx/yyy@zzz control=test.ctl and I'm getting all the rows without errors:
SQL> select * from test_sqlldr;
ID COMMENTS ID2
---------- -------------------- ----------
1 2
3 4
5 6
You may try another approach, I'm getting the same desired result with the following control file:
LOAD DATA
INFILE test.data
INTO TABLE test_sqlldr
APPEND
FIELDS TERMINATED BY ';'
TRAILING NULLCOLS
( id,
comments "substr(:comments,1,0)",
id2
)
Update following Romaintaz's comment: I looked into it again and managed to get the same error as you when the size of the column exceeded 255 characters. This is because the default datatype of SQL*Loader is char(255). If you have a column with more data you will have to specify the length. The following control file solved the problem for a column with 300 characters:
LOAD DATA
INFILE test.data
INTO TABLE test_sqlldr
APPEND
FIELDS TERMINATED BY ';'
TRAILING NULLCOLS
( id,
comments filler char(4000),
id2
)
Hope this helps,
--
Vincent
Just to suggest a tiny improvement, you might try something like:
LOAD DATA
INFILE test.data INTO TABLE test_sqlldr
APPEND
FIELDS TERMINATED BY ';'
TRAILING NULLCOLS
(
id,
comments char(4000) "substr(:comments, 1, 200)",
id2)
Now you'll grab the first 200 characters (or any number you specify in its place) of all comments - unless some of your input records have comments values that exceed 4000 characters, in which case they'll be rejected by the loader with the 'exceeds maximum length' error noted earlier. But assuming that's rare or not the case, all the records will load, with some of the comments truncated to 200 characters.
If you go over char(4000) you'll get a SQL*Loader error - there's a limit to how far you can push the beast.