Why doesn't Hive have FIELDS ENCLOSED BY like MySQL? - hadoop

Here is my case.
Input lines:
"vijay" <\t> "a-b-c","a-c-d","a-d-c"
"kumar" <\t> "a-b-c","b-c-d"
I created the table like this (I need path to be an array):
hive> create table user_infos(name string, path ARRAY<String>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS
TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE ;
Output received:
hive> select * from user_infos;
"vijay" ["\"a-b-c\"","\"a-c-d\"","\"a-d-c\""]
"kumar" ["\"a-b-c\"","\"b-c-d\""]
The problem is that I don't want the double quotes, i.e. the \" characters.
Required output:
vijay ["a-b-c","a-c-d","a-d-c"]
kumar ["a-b-c","b-c-d"]
Is there any way to achieve this without using a custom SerDe? Is there anything like ENCLOSED BY in MySQL?

I was stuck on the same issue: my fields are enclosed in double quotes and separated by semicolons (;). My table name is employee1.
I searched around and found a solution that worked for me.
@ramisetty.vijay: Yes, we have to use a SerDe for this. Please download the SerDe jar using this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar
Then follow the steps below at the Hive prompt:
add jar path/to/csv-serde.jar;
create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;
Then load the data from your path using the query below:
load data local inpath 'path/xyz.csv' into table employee1;
Then run:
select * from employee1;
Thanks.
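
To make the effect of the two SerDe properties concrete, here is a minimal Python sketch of quote-aware parsing. It is only an illustration of what a separator character and a quote character mean, not the SerDe's actual implementation; the sample rows and the function name parse_quoted_rows are hypothetical.

```python
import csv
import io

def parse_quoted_rows(text, separator=";", quote='"'):
    """Split each line on `separator`, dropping the enclosing `quote` characters."""
    reader = csv.reader(io.StringIO(text), delimiter=separator, quotechar=quote)
    return [row for row in reader]

# Hypothetical rows shaped like the employee1 table (id, name, addr).
sample = '"1";"vijay";"12 Main St; Apt 4"\n"2";"kumar";"34 Oak Ave"\n'

for row in parse_quoted_rows(sample):
    print(row)
```

Note that the semicolon inside the quoted address stays part of the field, which a plain FIELDS TERMINATED BY split cannot do.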

Related

Hive External table retrieve query (New to Hive )

I created the external table below:
create external table if not exists sensor.building1 (BuildingID int,BuildingMgr string , BuildingAge string, HVACproduct string , Country string) row format delimited fields terminated by ',';
I loaded the table using the query below:
load data inpath '/user/cloudera/sensor/SensorFiles/building.csv' into table sensor.building1;
When I try to retrieve the BuildingID column using the query below, I get NULL values:
select a.BuildingID
from sensor.building1 as a
limit 10;
Please guide me on what I am doing wrong.
You are trying to load a CSV file into a Hive table, but Hive's default field delimiter is '\001'.
So when you load data from the CSV (I am assuming it is ',' separated), the lines are not split into columns as you expect.
You can create the table like this:
create external table test1(country string, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
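
The delimiter point in the answer above can be sketched in Python. This is a rough model rather than Hive's parser: it assumes a value that cannot be parsed as an integer is returned as NULL (None here), and the sample row is hypothetical.

```python
def split_fields(line, delimiter="\x01"):
    """Split one text line on the field delimiter ('\\001' is Hive's default)."""
    return line.rstrip("\n").split(delimiter)

def to_int_or_none(value):
    """Model an INT column: unparseable text becomes None (NULL)."""
    try:
        return int(value)
    except ValueError:
        return None

line = "1,M1,25,AC1000,USA"  # hypothetical CSV row

# With the default delimiter the whole line is one field, so the INT column is NULL.
print(to_int_or_none(split_fields(line)[0]))
# With ',' declared as the delimiter the first field parses as an integer.
print(to_int_or_none(split_fields(line, ",")[0]))
```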

remove surrounding quotes from fields while loading data into hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes are causing trouble. I have created the following table:
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
But the quotes around each field also become part of the field, as shown below:
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT. Currently it becomes NULL when I declare the third column as INT when creating the table.
You will have to use a CSV SerDe for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
There are multiple ways to achieve this:
Use the CSV SerDe.
Use the regex SerDe with the regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\"".
Load the data into an external table, then remove the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS SELECT REGEXP_REPLACE(a,'"',''), REGEXP_REPLACE(b,'"',''), CAST ( REGEXP_REPLACE(c,'"','') AS BIGINT) FROM source;
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc
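
The regex and quote-stripping approaches above can be tried out in Python before touching a table. This is a sketch only: it uses Python's re module rather than Hive's regex SerDe or REGEXP_REPLACE, and the function names are made up for illustration.

```python
import re

# Same shape as the regex suggested above: three quoted groups separated by ';'.
ROW_PATTERN = re.compile(r'"(.*)";"(.*)";"(.*)"')

def parse_with_groups(line):
    """Return the three captured fields, or None if the line does not match."""
    match = ROW_PATTERN.fullmatch(line)
    return match.groups() if match else None

def strip_quotes_and_cast(fields):
    """Remove every double quote from each field, then cast the third to int."""
    cleaned = [field.replace('"', "") for field in fields]
    return cleaned[0], cleaned[1], int(cleaned[2])

line = '"153662";"0002241447";"0"'
print(parse_with_groups(line))
print(strip_quotes_and_cast(line.split(";")))
```

Stripping the quotes first is what allows the third field to be cast to a number; with the quotes still attached, the cast has nothing numeric to parse.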

How to load CSV data with enclosed by double quotes and separated by tab into HIVE table?

I am trying to load data from a CSV file in which the values are enclosed in double quotes (") and separated by tabs (\t).
When I load it into Hive, no error is thrown and the load completes, but I think all the data is going into a single column, and most of the values show as NULL.
Below is my create table statement:
CREATE TABLE example
(
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '"'
STORED AS TEXTFILE;
Input file sample:
"Organization" "Order" "Created on" "issue_date" "qty"
"GB" "111223" "2015/02/06 00:00:00" "2015/05/15 00:00:00" "5"
"UK" "1110" "2015/05/06 00:00:00" "2015/06/1 00:00:00" "51"
And the load statement to push the data into the Hive table:
LOAD DATA INPATH '/user/example.csv' OVERWRITE INTO TABLE example
What could be the issue, and how can I ignore the header of the file?
If I remove ESCAPED BY '"' from the create statement, the data loads into the respective columns, but all the values are enclosed in double quotes.
How can I remove the double quotes from the values and ignore the header of the file?
You can now use OpenCSVSerde, which lets you define the separator character and strips the surrounding double quotes:
CREATE EXTERNAL TABLE example (
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "\""
)
LOCATION '/your/folder/location/';
You don't want to use ESCAPED BY; that's for escape characters, not quote characters. I don't think Hive's delimited row format supports quote characters. You might want to take a look at this csv serde, which accepts a quoteChar property.
Also, if you have HUE, you can use the metastore manager web app to load the CSV; it will deal with the header row, column datatypes and so on.
Use the CSV SerDe to create the table. I've created a table in Hive as follows, and it works like a charm.
CREATE EXTERNAL TABLE IF NOT EXISTS myTable (
id STRING,
url STRING,
name STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ("separatorChar" = "\t")
LOCATION '<folder location>';
"Hive now includes an OpenCSVSerde which will properly parse those quoted fields without adding additional jars or error prone and slow regex."
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
Source: Ben Doerr
How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive
You can use a CSV SerDe, "csv-serde-1.1.2.jar", to load the file without the double quotes.
Download link:
http://ogrodnek.github.io/csv-serde/
The create table statement is:
CREATE TABLE <table_name> (col_name_1 type1, col_name_2 type2, ...) row format serde 'com.bizo.hive.serde.csv.CSVSerde';
You can skip the header with the following property in the create table statement:
tblproperties ("skip.header.line.count"="1");
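
As a rough picture of what the combination of a quote-aware SerDe and skip.header.line.count does to the sample file from the question, here is a Python sketch. It assumes the fields really are tab-separated, and it converts the numeric columns explicitly because a CSV parser hands every value back as a string (the Hive documentation says the same of OpenCSVSerde). The function names are hypothetical.

```python
import csv
import io

SAMPLE = (
    '"Organization"\t"Order"\t"Created on"\t"issue_date"\t"qty"\n'
    '"GB"\t"111223"\t"2015/02/06 00:00:00"\t"2015/05/15 00:00:00"\t"5"\n'
    '"UK"\t"1110"\t"2015/05/06 00:00:00"\t"2015/06/1 00:00:00"\t"51"\n'
)

def load_rows(text, skip_header_lines=1):
    """Parse tab-separated, double-quoted text and drop the leading header lines."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quotechar='"')
    return list(reader)[skip_header_lines:]

def to_typed(row):
    """Every parsed value is a string, so cast the numeric columns explicitly."""
    organization, order, created_on, issue_date, qty = row
    return organization, int(order), created_on, issue_date, int(qty)

for row in load_rows(SAMPLE):
    print(to_typed(row))
```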

How can I do a double delimiter(||) in Hive?

I am trying to load data into Hive tables where the data is delimited by a double pipe (||). This is what I tried:
Sample input:
1405983600000||111.111.82.41||806065581||session-id
Creating the table in Hive:
create table test_hive(k1 string, k2 string, k3 string, k4 string) row format delimited fields terminated by '||' stored as textfile;
Loading data from the text file:
load data local inpath '/Desktop/input.txt' into table test_hive;
When I do this, it stores the data in the following format:
1405983600000 tabspace-as-second-column 111.111.82.41 tabspace-as-fourth-column
whereas I am expecting the data in the table to be:
1405983600000 111.111.82.41 806065581 session-id
Kindly help me out. I have tried different options but have been unable to resolve it.
Multi-character delimiters such as || are not supported in Hive up to version 0.13, so FIELDS TERMINATED BY '||' won't work. There is an alternative for this.
CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User',
country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054'
SERDE serde_name WITH SERDEPROPERTIES (field.delim='||')
STORED AS TEXTFILE
LOCATION '<hdfs_location>';
The default SerDe can be used. Multi-character delimiters can be used for field, line and escape characters by specifying them in the SerDe properties.
This issue has been resolved in Hive 0.14 with the MultiDelimitSerDe. Please find the documentation here:
https://cwiki.apache.org/confluence/display/Hive/MultiDelimitSerDe
You could do this if you don't want to use an alternate SerDe or have an earlier version of Hive:
create external table my_table (line string) location '/path/file';
Then create a view on top:
create view my_view as select split(line,'\\|\\|')[0] as column_1
, split(line,'\\|\\|')[1] as column_2
, split(line,'\\|\\|')[2] as column_3
from my_table;
Query the view. Good luck.
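
The empty second and fourth columns reported in the question, and the fix used in the view, can both be reproduced in Python. This sketch assumes, as the observed output suggests, that the delimited row format honours only the first character of '||'; the function names are made up for illustration.

```python
import re

def split_single_char(line, delimiter="||"):
    """Mimic a parser that only honours the first character of the delimiter."""
    return line.split(delimiter[0])

def split_multi_char(line):
    """Split on the literal two-character '||', escaped because '|' is a regex operator."""
    return re.split(r"\|\|", line)

line = "1405983600000||111.111.82.41||806065581||session-id"

# First four fields when only '|' is used: every other column is empty.
print(split_single_char(line)[:4])
# Splitting on the full '||' yields the four expected columns.
print(split_multi_char(line))
```

In the HiveQL view the pattern is written '\\|\\|' because the backslash itself has to be escaped inside the string literal.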

Hive load CSV with commas in quoted fields

I am trying to load a CSV file into a Hive table like so:
CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;
The CSV is delimited by a comma (,) and looks like this:
1, "some text, with comma in it", 123, "more text"
This returns corrupt data since there is a ',' inside the first string.
Is there a way to set a text delimiter or make Hive ignore the ',' inside strings?
I can't change the delimiter of the CSV since it gets pulled from an external source.
If you can re-create or pre-process your input data, you can specify an escape character in the CREATE TABLE:
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';
This will accept the following line as 4 fields:
1,some text\, with comma in it,123,more text
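
A small Python sketch of the escape-character idea: split only on commas that are not preceded by a backslash, then drop the backslash from the escaped commas. It is a simplified model (it does not handle an escaped backslash directly before a delimiter) and not Hive's own parsing code; split_escaped is a hypothetical name.

```python
import re

def split_escaped(line, delimiter=",", escape="\\"):
    """Split on `delimiter` unless it is preceded by `escape`, then unescape."""
    # Negative lookbehind: a delimiter that does not follow the escape character.
    pattern = r"(?<!" + re.escape(escape) + r")" + re.escape(delimiter)
    parts = re.split(pattern, line)
    return [part.replace(escape + delimiter, delimiter) for part in parts]

line = r"1,some text\, with comma in it,123,more text"
print(split_escaped(line))
```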
The problem is that Hive doesn't handle quoted text. You either need to pre-process the data by changing the delimiter between the fields (e.g. with a Hadoop streaming job), or you can try a custom CSV SerDe, which uses OpenCSV to parse the files.
As of Hive 0.14, the CSV SerDe is a standard part of the Hive install:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)
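
To see why the plain delimited format corrupts the sample row while a quote-aware parser does not, the two can be compared in Python. This uses Python's csv module as a stand-in for a quote-aware parser; skipinitialspace is a Python option needed here because the sample has a space after each comma, and nothing in this sketch claims how OpenCSVSerde treats that space.

```python
import csv

line = '1, "some text, with comma in it", 123, "more text"'

def naive_split(text):
    """Split on every comma, the way a plain delimited format would."""
    return [field.strip() for field in text.split(",")]

def quote_aware_split(text):
    """Treat commas inside double quotes as part of the field."""
    return next(csv.reader([text], skipinitialspace=True))

print(len(naive_split(line)), naive_split(line))
print(len(quote_aware_split(line)), quote_aware_split(line))
```

The naive split produces five fields for a four-column table, which is why the values shift into the wrong columns.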
Keep the delimiter in single quotes and it will work.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
This will work.
Add a backslash in FIELDS TERMINATED BY '\;'.
For example:
CREATE TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;
I have tested it, and it worked.
The org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe worked for me. My delimiter was '|' and one of the columns is enclosed in double quotes.
Query:
CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, DATE STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "|",
"quoteChar" = "\"",
"escapeChar" = "\""
)
STORED AS TEXTFILE location '/user/abc/csv_folder';
