remove surrounding quotes from fields while loading data into hive - hadoop

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column.But the surrounding double quotes trouble me. I have created the following table.
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
but the quotes around the field also become part of field as shown below.
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. Also I want the third field to be read as INT. Currently it becomes NULL when I provide third field as INT while making table.

You will have to use Csv-Serde for this.
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;

Multiple ways to achieve this:
Use CSV serde
Use regex serde- regex "\"(.*)\"\;\"(.*)\"\;\"(.*)\""
Load data to external table then remove double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS SELECT REGEXP_REPLACE(a,'"',''), REGEXP_REPLACE(b,'"',''), CAST ( REGEXP_REPLACE(c,'"','') AS BIGINT) FROM source;

Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc

Related

How do I remove text from columns in HIVE (sql)

I am trying to import data from a CSV file (latlong.csv) and I want to remove all of the quotes from my columns. Please Refer to first image.
First image
This is the code I used to import the data
CREATE TABLE IF NOT EXISTS latlong
(COUNTRY String, ALPHA2 String, ALPHA3 String, NUMERICCODE String,
LATITUDE String, LONGITUDE String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
LOAD DATA LOCAL INPATH '/tmp/project2/latlong.csv' INTO TABLE latlong;
I tried to use the command below but I get an error. Error saying I can only insert into the table and not update it (i think).
Update latlong set country = replace(country, '"', '')
error message
To update table which is not in the transaction mode use INSERT OVERWRITE. Double quote needs shielding. Use ["] or double-slash \\":
insert overwrite table latlong
select regexp_replace(COUNTRY, '["]', '') COUNTRY, --this will remove double-qutes from COUNTRY column
ALPHA2, ALPHA3, NUMERICCODE, LATITUDE, LONGITUDE
from latlong;
This solution is applicable if you have quotes inside strings and you want to remove them. It seems this is not your case.
If you have quoted columns, like in your data example, then use SerDe to remove quotes during de-serialization, this is far more efficient. Just create table with proper SerDe and properties:
drop table latlong;
CREATE TABLE latlong
(COUNTRY String, ALPHA2 String, ALPHA3 String, NUMERICCODE String,
LATITUDE String, LONGITUDE String)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
tblproperties("skip.header.line.count"="1");
;
LOAD DATA LOCAL INPATH '/tmp/project2/latlong.csv' INTO TABLE latlong;
SerDe will remove quotes during select, no need to update the table.

HIVE - create external tables where string itself contains commas

I am new to Hive and am creating external tables on csv file. One of the issues I am coming across are values that contain multiple commas within string itself. For example, the csv file contains the following:
CSV File
When I create an external table in Hive, because there are columns within the "name" column, it shifts the first name to the right adding another column. This throws all of the data off when you view the table in Hive.
External Table result in Hive
Is there anything I can add to my script to keep the commas but also keep first and last name in the same column when the external table is created? Thank you all in advance - I am very new to Hive.
CREATE EXTERNAL TABLE database.table name (
ID INT,
Name String,
City String,
State String
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/xyz/xyz/database/directory/'
TBLPROPERTIES ("skip.header.line.count"="1");
Check this solution - you need to add this line : ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
https://community.cloudera.com/t5/Support-Questions/comma-in-between-data-of-csv-mapped-to-external-table-in/td-p/220193
Complete DDL example:
create table hcc(field1 string,
field2 string,
field3 string,
field4 string,
field5 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"");

How to load CSV data with enclosed by double quotes and separated by tab into HIVE table?

I am trying to load data from a csv file in which the values are enclosed by double quotes '"' and tab separated '\t' .
But when I try to load that into hive its not throwing any error and data is loaded without any error but I think all the data is getting loaded into a single column and most of the values it showing as NULL.
below is my create table statement.
CREATE TABLE example
(
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '"'
STORED AS TEXTFILE;
Input file sample;-
"Organization" "Order" "Created on" "issue_date" "qty"
"GB" "111223" "2015/02/06 00:00:00" "2015/05/15 00:00:00" "5"
"UK" "1110" "2015/05/06 00:00:00" "2015/06/1 00:00:00" "51"
and Load statement to push data into hive table.
LOAD DATA INPATH '/user/example.csv' OVERWRITE INTO TABLE example
What could be the issue and how can I ignore header of the file.
and if I remove ESCAPED BY '"' from create statement its loading in respective columns but all the values are enclosed by double quotes.
How can I remove double quotes from values and ignore header of the file?
You can now use OpenCSVSerde which allows you to define the separator character and easily escape surrounding double-quotes :
CREATE EXTERNAL TABLE example (
organization STRING,
order BIGINT,
created_on TIMESTAMP,
issue_date TIMESTAMP,
qty INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "\""
)
LOCATION '/your/folder/location/';
You don't want to use escaped by, that's for escape characters, not quote characters. I don't think that Hive actually has support for quote characters. You might want to take a look at this csv serde which accepts a quotechar property.
Also if you have HUE, you can use the metastore manager webapp to load the CSV in, this will deal with the header row, column datatypes and so on.
Use CSV Serde to create the table. I've created a table in hive as follows, and it works like charm.
CREATE EXTERNAL TABLE IF NOT EXISTS myTable (
id STRING,
url STRING,
name STRING
)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties ("separatorChar" = "\t")
LOCATION '<folder location>';
"Hive now includes an OpenCSVSerde which will properly parse those quoted fields without adding additional jars or error prone and slow regex."
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
source = Ben Doerr
How to handle fields enclosed within quotes(CSV) in importing data from S3 into DynamoDB using EMR/Hive
You can use a CSV serde " csv-serde-1.1.2.jar " to load the file without double quotes.
download link:
http://ogrodnek.github.io/csv-serde/
and the create table statement as
CREATE TABLE <table_name> (col_name_1 type1, col_name_2 type2, ...) row format serde 'com.bizo.hive.serde.csv.CSVSerde';
you can remove the header with the following property in the create table stmt
tblproperties ("skip.header.line.count"="1");

How to specify custom string for NULL values in Hive table stored as text?

When storing Hive table in text format, for example this table:
CREATE EXTERNAL TABLE clustered_item_info
(
country_id int,
item_id string,
productgroup int,
category string
)
PARTITIONED BY (cluster_id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '${hivevar:table_location}';
Fields with null values are represented as '\N' strings, also for numbers NaNs are represented as 'NaN' strings.
Does Hive provide a way to specify custom string to represent these special values?
I would like to use empty strings instead of '\N' and 0 instead of 'NaN' - I know this substitution can be done with streaming, but is there any way to it cleanly using Hive instead of writing extra code?
Other info:
I'm using Hive 0.8 if that matters...
Use this property while creating the table
CREATE TABLE IF NOT EXISTS abc
(
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
TBLPROPERTIES ("serialization.null.format"="")
oh, sorry. I read your question not clear
If you want to represented empty string instead of '\N', you can using COALESCE function:
INSERT OVERWRITE DIRECTORY 's3://bucket/result/'
SELECT NULL, COALESCE(NULL,"")
FROM data_table;

Hive load CSV with commas in quoted fields

I am trying to load a CSV file into a Hive table like so:
CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;
The csv is delimited by an comma (,) and looks like this:
1, "some text, with comma in it", 123, "more text"
This will return corrupt data since there is a ',' in the first string.
Is there a way to set an text delimiter or make Hive ignore the ',' in strings?
I can't change the delimiter of the csv since it gets pulled from an external source.
If you can re-create or parse your input data, you can specify an escape character for the CREATE TABLE:
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';
Will accept this line as 4 fields
1,some text\, with comma in it,123,more text
The problem is that Hive doesn't handle quoted texts. You either need to pre-process the data by changing the delimiter between the fields (e.g: with a Hadoop-streaming job) or you can also give a try to use a custom CSV SerDe which uses OpenCSV to parse the files.
As of Hive 0.14, the CSV SerDe is a standard part of the Hive install
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)
keep the delimiter in single quotes it will work.
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
This will work
Add a backward slash in FIELDS TERMINATED BY '\;'
For Example:
CREATE TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;
I have tested it, and it worked.
ORG.APACHE.HADOOP.HIVE.SERDE2.OPENCSVSERDE Serde worked for me. My delimiter was '|' and one of the columns is enclosed in double quotes.
Query:
CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, DATE STRING)
ROW FORMAT SERDE 'ORG.APACHE.HADOOP.HIVE.SERDE2.OPENCSVSERDE'
WITH SERDEPROPERTIES (
"SEPARATORCHAR" = "|",
"QUOTECHAR" = "\"",
"ESCAPECHAR" = "\""
)
STORED AS TEXTFILE location '/user/abc/csv_folder';

Resources