Loading data into Hive table - hadoop

CREATE TABLE IF NOT EXISTS TestingTable2
(
USER_ID BIGINT,
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>>
) ROW FORMAT
DELIMITED FIELDS TERMINATED BY '-'
collection items terminated by ','
map keys terminated by ':'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/rkost/output2';
Below is my data which is in only one row data that I need to upload it in above table.
1015826235-[{"product_id":220003038067,"timestamps":"1340321132000"},{"product_id":300003861266,"timestamps":"1340271857000"},{"product_id":140002997245,"timestamps":"1339694926000"},{"product_id":200002448035,"timestamps":"1339172659000"},{"product_id":260003553381,"timestamps":"1339072514000"}]-
After uploading the data when I do select query, I am not seeing data correctly. I should be getting only one row as below but I am not getting the below result in the table
**USER_ID** **PURCHASED_ITEM**
1015826235 [{"product_id":220003038067,"timestamps":"1340321132000"}, {"product_id":300003861266,"timestamps":"1340271857000"}, {"product_id":140002997245,"timestamps":"1339694926000"}, {"product_id":200002448035,"timestamps":"1339172659000"}, {"product_id":260003553381,"timestamps":"1339072514000"}]
Instead of above data, I am getting something like this in my table data after I do select query. Anything wrong with the delimeter?
1015826235 [{"product_id":null,"timestamps":" 220003038067"},{"product_id":null,"timestamps":" \"1340321132000\"}"},{"product_id":null,"timestamps":"
300003861266"},{"product_id":null,"timestamps":" \"1340271857000\"}"},{"product_id":null,"timestamps":" 140002997245"},
{"product_id":null,"timestamps":" \"1339694926000\"}"},{"product_id":null,"timestamps":" 200002448035"},
{"product_id":null,"timestamps":" \"1339172659000\"}"},{"product_id":null,"timestamps":" 260003553381"},
{"product_id":null,"timestamps":" \"1339072514000\"}]"}]
Can anyone point me what wrong I am doing?

Add double quotes to product id
1015826235-[{"product_id":"220003038067","timestamps":"1340321132000"},{"product_id":"300003861266","timestamps":"1340271857000"},{"product_id":"140002997245","timestamps":"1339694926000"},{"product_id":"200002448035","timestamps":"1339172659000"},{"product_id":"260003553381","timestamps":"1339072514000"}]-

I have figured it out on my own. The whole data that needs to be loaded should be somehow like this-
1015826235-220003038067:1340321132000,300003861266:1340271857000,140002997245:1339694926000,200002448035:1339172659000,260003553381:1339072514000

Related

HIVE - Cannot partition a table: semantic exception failure

I'm not able to import data on partitioned table in Hive.
Here is how I create the table
CREATE TABLE IF NOT EXISTS title_ratings
(
tconst STRING,
averageRating DOUBLE,
numVotes INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
TBLPROPERTIES("skip.header.line.count"="1");
And then I load the data into it : LOAD DATA INPATH '/title.ratings.tsv.gz' INTO TABLE eval_hive_db.title_ratings;
It works fine till here. Now I want to create a dynamic partitioned table. First of all, I setup theses params:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
I now create my partitioned table:
CREATE TABLE IF NOT EXISTS title_ratings_part
(
tconst STRING,
numVotes INT
)
PARTITIONED BY (averageRating DOUBLE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\n'
STORED AS TEXTFILE;
insert into title_ratings_part partition(title_ratings) select tconst, averageRating, numVotes from title_ratings;
(I also tried with numVotes instead by the way)
And I receive this error: FAILED: ValidationFailureSemanticException eval_hive_db.title_ratings_part: Partition spec {title_ratings=null} contains non-partition columns
Someone can help me please?
Ideally, I want to partition my table by averageRating (less than 2, between 2 and 4, and greater than 4)
You can run this command to check if there are null values or not.
select count(averageRating) from title_ratings group by averageRating;
Now, if there are null values in this column then you will get the count, which you have to fill then apply partitioning again.
Partition column is stored as last column in a table so while inserting you need to maintain correct order in select statement.
Pls change order of columns in select.
insert into title_ratings_part partition(title_ratings)
Select
Tconst,
numVotes,
averageRating --orderwise this should always be last column
from title_ratings

SQL* Loader mapping DataFile Fields in different tables columns

I'm trying to load data from a Datafile in different tables, I read a lot about field declaration and delimitation(Position(n:n), terminated by ). The point is than I'm not sure how to do what I need to do. Let me explain this with an example.
I have two tables (person, phone):
person_table( person_id_pk, person_name) - phone_table(person_id_pk, phone)
I have a datafile with:
$ datafile.txt
1,jack pierson,+13526985442
2,Katherine McLaren,+15264586548
My point is, when I'm declaring my ConfigFile.ctl, how do I specify than the field number 3 (phone field) should be insert or append into "phone_table", and the others two fields (person_id, person_name) should be insert or append into "person_table"
Considering than the fields are not fixed length, my reference is the field position. (Field datafile position)
I was thinking to try something like
$configfile.ctl
LOAD DATA
INFILE datafile.txt
APPEND
INTO TABLE person_table
(
person_id_pk POSITION (*) INTEGER EXTERNAL TERMINATED BY "," ,
person_name POSITION(*+1) CHAR(30) TERMINATED BY ","
)
INTO TABLE phone_table
(
person_id_fk POSITION (*) INTEGER EXTERNAL TERMINATED BY ","
phone ------> Right here is my point, how can I specify to SQL Loader than here
should be the field number 3 from datafile
)
I hope you guys get my point. it is a HUGE issue for me, because i'm dealing with CSV files which contains 60, 80, even 100 fields (columns based on Excel File). And every fields or group of fields could be in different tables.
I really appreciate the guide and help you could grant me. I'm probably wrong about my example and controlfile declarations, I haven't implemented anything yet. So I'm open to every suggest you could give me.
Your control file should look like this. The second "INTO TABLE" Uses POSITION(1) to move the logical "pointer" back to the start of the current line so it can be read again. then the name is skipped by defining it as a FILLER.
LOAD DATA
INFILE datafile.txt
APPEND
INTO TABLE person_table
FIELDS TERMINATED BY "," TRAILING NULLCOLS
(
person_id_pk INTEGER EXTERNAL,
person_name CHAR(30)
)
INTO TABLE phone_table
FIELDS TERMINATED BY "," TRAILING NULLCOLS
(
person_id_fk POSITION(1) INTEGER EXTERNAL,
x_name FILLER,
phone CHAR(12)
)

Hive External table retrieve query (New to Hive )

I created below mention external table..
create external table if not exists sensor.building1 (BuildingID int,BuildingMgr string , BuildingAge string, HVACproduct string , Country string) row format delimited fields terminated by ',';
Loaded the table by using below query..
load data inpath '/user/cloudera/sensor/SensorFiles/building.csv' into table sensor.building1;
When I am trying to retrieve the buildingID column using below query, but I am getting null value..
select a.BuildingID
from sensor.building1 as a
limit 10;
Please guide me where I am doing something wrong
You are trying to load a CSV file into hive table but hive's default field delimiter is '\001'
So while you tring to load data from csv (I am assuming its ',' separated) its get failed.
You can create table like :
create external table test1(country string, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';

loading data in table using SQL Loader

I'm loading data into my table through SQL Loader
data loading is successful but i''m getting garbage(repetitive) value in a particular column for all rows
After inserting :
column TERM_AGREEMENT is getting value '806158336' for every record
My csv file contains atmost 3 digit data for that column,but i'm forced to set my column definition to Number(10).
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION,
**TERM_AGREEMENT INTEGER**
)
create table LOAN_BALANCE_MASTER_INT
(
ACCOUNT_NO NUMBER(30),
CUSTOMER_NAME VARCHAR2(70),
LIMIT NUMBER(30),
PRODUCT_DESC VARCHAR2(30),
SUBPRODUCT_CODE NUMBER,
ARREARS_INT NUMBER(20,2),
IRREGULARITY NUMBER(20,2),
PRINCIPLE_IRREGULARITY NUMBER(20,2),
**TERM_AGREEMENT NUMBER(10)**
)
INTEGER is for binary data type. If you're importing a csv file, I suppose the numbers are stored as plain text, so you should use INTEGER EXTERNAL. The EXTERNAL clause specifies character data that represents a number.
Edit:
The issue seems to be the termination character of the file. You should be able to solve this issue by editing the INFILE line this way:
INFILE'/ipoapplication/utl_file/LBR_HE_Mar16.csv' "STR X'5E204D'"
Where '5E204D' is the hexadecimal for '^ M'. To get the hexadecimal value you can use the following query:
SELECT utl_raw.cast_to_raw ('^ M') AS hexadecimal FROM dual;
Hope this helps.
I actually solved this issue on my own.
Firstly, thanks to #Gary_W AND #Alessandro for their inputs.Really appreciate your help guys,learned some new things in the process.
Here's the new fragment which worked and i got the correct data for the last column
LOAD DATA
infile '/ipoapplication/utl_file/LBR_HE_Mar16.csv'
REPLACE
INTO TABLE LOAN_BALANCE_MASTER_INT
fields terminated by ',' optionally enclosed by '"'
(
ACCOUNT_NO,
CUSTOMER_NAME,
LIMIT,
REGION,
**TERM_AGREEMENT INTEGER Terminated by Whitspace**
)
'Terminated by whitespace' - I went through some threads of SQL Loader and i used 'terminated by whitespace' in the last column of his ctl file. it worked ,this time i didn't even had to use 'INTEGER' or 'EXTERNAL' or EXPRESSION '..' for conversion.
Just one thing, now can you guys let me now what could possibly be creating issue ?what was there in my csv file in that column and how by adding this thing solved the issue ?
Thanks.

How to specify custom string for NULL values in Hive table stored as text?

When storing Hive table in text format, for example this table:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
Fields with null values are represented as '\N' strings, also for numbers NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN' - I know this substitution can be done with streaming, but is there any way to it cleanly using Hive instead of writing extra code?
Other info:
I'm using Hive 0.8 if that matters...
Use this property while creating the table
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
oh, sorry. I read your question not clear
If you want to represented empty string instead of '\N', you can using COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;

Resources