Collecting data from file using Hive - hadoop

I have my data in this format:
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in double quotes and delimited by ';'.
The book name may also contain a ';'.
Can you tell me how to load this data from a file into a Hive table?
The query below, which I am using now, obviously does not work:
create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible, I want the userid and year fields to be stored as INT.
Please help
Thanks,
Harish

What you are missing is RegexSerDe. It is very helpful for extracting only part of the text from each input line. Your DDL would look like this:
create table books ( isbn string, title string, year string, publisher string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(?:\")(\\d*)(?:\"\;\")([^\"]*)(?:\"\;\")(\\d*)(?:\"\;\")([^\"]*)\"(?:\;)" ,
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE;
The regex may look complex at first sight because of the escaping and the non-capturing groups. It actually contains just two capturing groups, (\d*) and ([^"]*), each used twice in alternation. The non-capturing groups (?:...) only consume the surrounding quotes and delimiters. The group ([^"]*) also takes care of a ';' inside the title field.
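To see what the pattern captures, here is a minimal sketch in Python rather than Hive. It assumes the same pattern with the Hive string escaping removed, and it requires the whole line to match. The names BOOK_LINE and parse_book are made up for this illustration, and returning None for a non-matching line is a choice of this sketch.

```python
import re

# The answer's input.regex with the Hive string escaping stripped off.
# Two capturing groups, (\d*) and ([^"]*), alternate twice; the (?:...)
# groups only consume the quotes and the ';' delimiters.
BOOK_LINE = re.compile(
    r'(?:")(\d*)(?:";")([^"]*)(?:";")(\d*)(?:";")([^"]*)"(?:;)'
)

# Sample rows copied from the question.
sample_lines = [
    '"123";"mybook1";"2002";"publisher1";',
    '"456";"mybook2;the best seller";"2004";"publisher2";',
    '"789";"mybook3";"2002";"publisher1";',
]


def parse_book(line):
    """Return (isbn, title, year, publisher) as strings, or None if no match."""
    match = BOOK_LINE.fullmatch(line)
    if match is None:
        return None
    return match.groups()


parsed = [parse_book(line) for line in sample_lines]
for row in parsed:
    print(row)
```

The second row keeps its embedded ';' inside the title because ([^"]*) stops only at the closing quote.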
But nothing comes without a cost. Despite all of its features, RegexSerDe supports only string fields. All you can do is call Hive's built-in cast function to convert the value when selecting data from the table, e.g. (actual syntax may vary a bit):
SELECT cast( year as int ) from books;
Hope this helps.

Related

SQL* Loader mapping DataFile Fields in different tables columns

I'm trying to load data from a datafile into different tables. I have read a lot about field declaration and delimitation (POSITION(n:n), TERMINATED BY). The point is that I'm not sure how to do what I need to do. Let me explain with an example.
I have two tables (person, phone):
person_table( person_id_pk, person_name) - phone_table(person_id_pk, phone)
I have a datafile with:
$ datafile.txt
1,jack pierson,+13526985442
2,Katherine McLaren,+15264586548
My question is: when declaring my ConfigFile.ctl, how do I specify that field number 3 (the phone field) should be inserted or appended into "phone_table", and the other two fields (person_id, person_name) should be inserted or appended into "person_table"?
Since the fields are not fixed length, my reference is the field position in the datafile.
I was thinking of trying something like:
$configfile.ctl
LOAD DATA
INFILE datafile.txt
APPEND
INTO TABLE person_table
(
person_id_pk POSITION (*) INTEGER EXTERNAL TERMINATED BY "," ,
person_name POSITION(*+1) CHAR(30) TERMINATED BY ","
)
INTO TABLE phone_table
(
person_id_fk POSITION (*) INTEGER EXTERNAL TERMINATED BY ","
phone ------> This is my question: how can I tell SQL*Loader that this
should be field number 3 from the datafile?
)
I hope you get my point. It is a HUGE issue for me, because I'm dealing with CSV files which contain 60, 80, even 100 fields (columns from an Excel file), and every field or group of fields could belong to a different table.
I really appreciate any guidance and help you can give me. I'm probably wrong about my example and control file declarations; I haven't implemented anything yet, so I'm open to every suggestion.
Your control file should look like this. The second "INTO TABLE" uses POSITION(1) to move the logical "pointer" back to the start of the current line so it can be read again. The name is then skipped by defining it as a FILLER.
LOAD DATA
INFILE datafile.txt
APPEND
INTO TABLE person_table
FIELDS TERMINATED BY "," TRAILING NULLCOLS
(
person_id_pk INTEGER EXTERNAL,
person_name CHAR(30)
)
INTO TABLE phone_table
FIELDS TERMINATED BY "," TRAILING NULLCOLS
(
person_id_fk POSITION(1) INTEGER EXTERNAL,
x_name FILLER,
phone CHAR(12)
)
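The effect of the two INTO TABLE clauses can be modelled in a few lines of Python. This is only a sketch of the routing logic, not of SQL*Loader itself: each record is read once for person_table, then conceptually re-read from the first field for phone_table, with the name ignored the way the FILLER field is. The function name split_record and the in-memory lists are hypothetical.

```python
import csv
import io

# Sample records copied from the question's datafile.txt.
datafile = (
    "1,jack pierson,+13526985442\n"
    "2,Katherine McLaren,+15264586548\n"
)


def split_record(fields):
    """Route one comma-separated record to the two target tables."""
    person_id, person_name, phone = fields
    # First INTO TABLE: fields 1 and 2.
    person_row = {"person_id_pk": int(person_id), "person_name": person_name}
    # Second INTO TABLE: start again at field 1 (POSITION(1)),
    # skip the name (FILLER), keep the phone.
    phone_row = {"person_id_fk": int(person_id), "phone": phone}
    return person_row, phone_row


person_table, phone_table = [], []
for fields in csv.reader(io.StringIO(datafile)):
    person_row, phone_row = split_record(fields)
    person_table.append(person_row)
    phone_table.append(phone_row)

print(person_table)
print(phone_table)
```

For files with many more columns the same idea applies: every additional target table starts again from the first field and marks the columns it does not need as fillers.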

remove surrounding quotes from fields while loading data into hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes cause trouble. I have created the following table:
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
But the quotes around each field also become part of the field, as shown below:
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT. Currently it becomes NULL when I declare the third field as INT when creating the table.
You will have to use the CSV SerDe for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
Multiple ways to achieve this:
1. Use the CSV SerDe.
2. Use the regex SerDe with the regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\""
3. Load the data into an external table, then remove the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b string,
c string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS
SELECT REGEXP_REPLACE(a,'"','') AS a,
REGEXP_REPLACE(b,'"','') AS b,
CAST(REGEXP_REPLACE(c,'"','') AS BIGINT) AS c
FROM source;
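The strip-then-cast approach can be checked on the sample rows with a short Python sketch. It is an analogue of the REGEXP_REPLACE and CAST calls, not Hive code, and it assumes that no field contains a ';' (true for the sample data in this question). The helper name clean_row is made up for the example.

```python
# Sample rows copied from the question.
sample_lines = [
    '"153662";"0002241447";"0"',
    '"153662";"000647036X";"0"',
    '"153662";"0020434901";"0"',
]


def clean_row(line):
    """Split on ';', drop the double quotes, convert the third field to int."""
    a, b, c = (field.replace('"', "") for field in line.split(";"))
    return a, b, int(c)


# With the quotes still attached, the third field is not a valid integer,
# which mirrors the NULL the asker saw for the INT column.
try:
    int('"0"')
    quoted_value_parses = True
except ValueError:
    quoted_value_parses = False

cleaned = [clean_row(line) for line in sample_lines]
for row in cleaned:
    print(row)
print("quoted value parses as int:", quoted_value_parses)
```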
Hive query to remove the double quotes around a string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc

loading data in table using SQL Loader

I'm loading data into my table through SQL*Loader.
The load is successful, but I'm getting a garbage (repeated) value in one particular column for all rows.
After inserting:
the column TERM_AGREEMENT gets the value '806158336' for every record.
My CSV file contains at most 3-digit data for that column, but I'm forced to set my column definition to NUMBER(10).
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION,
TERM_AGREEMENT INTEGER
)
create table LOAN_BALANCE_MASTER_INT
(
ACCOUNT_NO NUMBER(30),
CUSTOMER_NAME VARCHAR2(70),
LIMIT NUMBER(30),
PRODUCT_DESC VARCHAR2(30),
SUBPRODUCT_CODE NUMBER,
ARREARS_INT NUMBER(20,2),
IRREGULARITY NUMBER(20,2),
PRINCIPLE_IRREGULARITY NUMBER(20,2),
TERM_AGREEMENT NUMBER(10)
)
INTEGER is a binary datatype in SQL*Loader. If you're importing a CSV file, the numbers are presumably stored as plain text, so you should use INTEGER EXTERNAL. The EXTERNAL clause specifies character data that represents a number.
Edit:
The issue seems to be the record terminator of the file. You should be able to solve it by editing the INFILE line this way:
INFILE '/ipoapplication/utl_file/LBR_HE_Mar16.csv' "STR X'5E204D'"
where '5E204D' is the hexadecimal for '^ M'. To get the hexadecimal value you can use the following query:
SELECT utl_raw.cast_to_raw ('^ M') AS hexadecimal FROM dual;
Hope this helps.
I actually solved this issue on my own.
First, thanks to #Gary_W and #Alessandro for their input. I really appreciate your help, and I learned some new things in the process.
Here's the new fragment, which worked and gave me the correct data for the last column:
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION,
TERM_AGREEMENT INTEGER TERMINATED BY WHITESPACE
)
'Terminated by whitespace': I went through some SQL*Loader threads and used 'terminated by whitespace' on the last column of my ctl file. It worked; this time I didn't even have to use 'INTEGER', 'EXTERNAL' or EXPRESSION '..' for conversion.
Just one thing: can you let me know what could have been causing the issue? What was in that column of my CSV file, and how did adding this clause solve it?
Thanks.
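One arithmetic observation that may bear on this last question, offered as a guess rather than a diagnosis: 806158336 is 0x300D0000, which is what you get when the two characters "0" and carriage return are read as the high bytes of a 4-byte binary integer. That would fit a CSV with Windows line endings whose last column held 0, read with the binary INTEGER type, and it would also fit the fix, since a carriage return counts as whitespace. The big-endian byte order and the zero padding below are assumptions made for this sketch, not something the thread confirms.

```python
# Assumption: the last field of each record was the text "0" followed by
# a carriage return left over from a Windows (CR LF) line ending.
raw_field = b"0\r"

# Assumption: the field was consumed as a 4-byte binary integer, with the
# missing bytes zero-filled and the most significant byte first.
padded = raw_field.ljust(4, b"\x00")
as_binary = int.from_bytes(padded, byteorder="big")
print(as_binary, hex(as_binary))

# Read as character data instead (the idea behind INTEGER EXTERNAL, or
# ending the field at whitespace), the same bytes give the expected value.
as_text = int(raw_field.decode("ascii").strip())
print(as_text)
```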

How to load CSV data with enclosed by double quotes and separated by tab into HIVE table?

I am trying to load data from a CSV file in which the values are enclosed in double quotes '"' and separated by tabs '\t'.
When I load it into Hive, no error is thrown, but I think all the data is being loaded into a single column, and most of the values show as NULL.
Below is my create table statement:
CREATE TABLE example
(
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '"'
STORED AS TEXTFILE;
Input file sample:
"Organization" "Order" "Created on" "issue_date" "qty"
"GB" "111223" "2015/02/06 00:00:00" "2015/05/15 00:00:00" "5"
"UK" "1110" "2015/05/06 00:00:00" "2015/06/1 00:00:00" "51"
and the load statement used to push the data into the Hive table:
LOAD DATA INPATH '/user/example.csv' OVERWRITE INTO TABLE example
What could be the issue, and how can I ignore the header of the file?
If I remove ESCAPED BY '"' from the create statement, the data loads into the respective columns, but all the values are enclosed in double quotes.
How can I remove the double quotes from the values and ignore the header of the file?
You can now use OpenCSVSerde, which allows you to define the separator character and easily strip the surrounding double quotes:
CREATE EXTERNAL TABLE example (
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "\""
)
LOCATION '/your/folder/location/';
You don't want to use ESCAPED BY; that's for escape characters, not quote characters. I don't think that Hive actually has support for quote characters. You might want to take a look at this csv serde, which accepts a quotechar property.
Also, if you have HUE, you can use the metastore manager webapp to load the CSV; it will deal with the header row, column datatypes and so on.
Use the CSV SerDe to create the table. I've created a table in Hive as follows, and it works like a charm.
CREATE EXTERNAL TABLE IF NOT EXISTS myTable (
id STRING,
url STRING,
name STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ("separatorChar" = "\t")
LOCATION '<folder location>';
"Hive now includes an OpenCSVSerde which will properly parse those quoted fields without adding additional jars or error prone and slow regex."
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
source = Ben Doerr
How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive
You can use a CSV SerDe, "csv-serde-1.1.2.jar", to load the file without the double quotes.
Download link:
http://ogrodnek.github.io/csv-serde/
The create table statement is:
CREATE TABLE <table_name> (col_name_1 type1, col_name_2 type2, ...) row format serde 'com.bizo.hive.serde.csv.CSVSerde';
You can skip the header with the following property in the create table statement:
tblproperties ("skip.header.line.count"="1");
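What the separator, quote character and header-skip settings do to the sample file can be sketched with Python's csv module. This models the parsing only and is not Hive code. It assumes the sample rows really are tab-separated, and the function parse_rows and its parameter names are invented for the example.

```python
import csv
import io

# Sample rows from the question, written with explicit tab separators
# (assumed; the tabs are not visible in the question as posted).
sample = (
    '"Organization"\t"Order"\t"Created on"\t"issue_date"\t"qty"\n'
    '"GB"\t"111223"\t"2015/02/06 00:00:00"\t"2015/05/15 00:00:00"\t"5"\n'
    '"UK"\t"1110"\t"2015/05/06 00:00:00"\t"2015/06/1 00:00:00"\t"51"\n'
)


def parse_rows(text, separator="\t", quote='"', skip_header_lines=1):
    """Parse quoted, delimited text and drop the leading header lines."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote)
    rows = list(reader)
    return rows[skip_header_lines:]


rows = parse_rows(sample)
for row in rows:
    print(row)
```

Every value comes back as a string with the quotes removed, so numeric or timestamp columns still need a conversion step afterwards.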

How to specify custom string for NULL values in Hive table stored as text?

When storing a Hive table in text format, for example this table:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
fields with null values are represented as '\N' strings, and for numeric columns NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify a custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN'. I know this substitution can be done with streaming, but is there any way to do it cleanly in Hive instead of writing extra code?
Other info:
I'm using Hive 0.8 if that matters...
Use this property when creating the table:
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
Oh, sorry, I did not read your question carefully.
If you want an empty string instead of '\N', you can use the COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;
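The effect of a configurable null marker on a text row can be pictured with a tiny Python model. It is only a sketch of the idea behind the "serialization.null.format" property, not Hive's serializer. The function serialize_row and its parameters are hypothetical, and the NaN case from the question is not covered.

```python
def serialize_row(values, separator="\t", null_format="\\N"):
    """Join values into one text line, writing null_format for None."""
    return separator.join(
        null_format if value is None else str(value) for value in values
    )


row = [1, None, 7, "books"]

# Default marker: the two-character string \N, as described in the question.
default_line = serialize_row(row)

# Custom marker: an empty string, the choice made in the TBLPROPERTIES answer.
custom_line = serialize_row(row, null_format="")

print(repr(default_line))
print(repr(custom_line))
```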
