Hive external table on data containing newline - hadoop

I have a few txt files on which I want to create an external table.
Unfortunately, the content of the files also contains the string "\n" from time to time. It seems that Hive interprets this as a newline, even though it's not a newline in the original file and is just part of the text.
Can I catch this problem in Hive without having to alter the original txt files?

You can append a different delimiter to the end of each line (anything other than \n and your field separator) and then declare that delimiter in the table definition.
For example, say I have a record like this:
1,2,3,aniit\n,4\n
In this record, aniit\n is a string and \n is a string, so Hive splits it into two records. To avoid this, add another delimiter at the end, like:
1,2,3,aniit\n,4\n||
Here '||' is the line delimiter, and my create table statement looks like this (note that many Hive versions reject any LINES TERMINATED BY value other than '\n' with a SemanticException, so check that your version accepts it):
create external table if not exists table1
(
col1 int,
col2 int,
col3 int,
col4 string,
col5 string
)
row format delimited fields terminated by ','
lines terminated by '||'
stored as textfile
location '/tmp/table1';

Related

How should I write my control file which is used to load text file data into Mysql table using sqlldr command?

I have to load data from a text file into a table. My data in text file is delimited by ',' and each item is present in double quotes (i.e., "").
For example, data in the text file is like below:
"1009","John","NY","USA"
"1010","Ron","AZ","USA"
How should I write my control file in order not to include the double quotes (i.e., "") while loading data into the table.
Assuming that the table structure is like the following:
create table someTable(
colA number,
colB varchar2(100),
colC varchar2(100),
colD varchar2(100)
)
You can use SQL*Loader with a control file like this:
OPTIONS(skip=0)
load data
infile "data.txt"
append into TABLE someTable
fields
terminated by ','
enclosed by '"'
(
colA "to_number(:colA)", /* here you can use a format for numbers, if any */
colB,
colC,
colD
)

remove surrounding quotes from fields while loading data into hive

I want to load a table with input data into hive. I have data in the following format.
"153662";"0002241447";"0"
"153662";"000647036X";"0"
"153662";"0020434901";"0"
"153662";"0020973403";"0"
"153662";"0028604202";"0"
"153662";"0030437512";"0"
I want to load this data into a table with two varchar columns and one int column, but the surrounding double quotes are causing trouble. I have created the following table:
CREATE EXTERNAL TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
However, the quotes around each field become part of the field value, as shown below:
"276725" "034545104X" "0"
"276726" "0155061224" "5"
I want to ignore them. I also want the third field to be read as INT; currently it becomes NULL when I declare the third column as INT in the table definition.
You will have to use the CSV SerDe for this:
CREATE TABLE Table(A varchar(50),B varchar(50),C varchar(50))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ";",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
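OpenCSVSerde exposes every column as a string, whatever type the DDL declares, so the INT column the question asks for still needs an explicit cast. One way to get it is a view on top of the SerDe table. This is only a sketch: quoted_input and quoted_input_typed are hypothetical names, and it assumes the table above was created under a non-reserved name.

```sql
-- Assumes the OpenCSVSerde table above exists under the hypothetical name quoted_input.
-- OpenCSVSerde returns a, b and c as strings, so c is cast explicitly here.
CREATE VIEW quoted_input_typed AS
SELECT
  a,
  b,
  CAST(c AS INT) AS c   -- yields NULL when the value is not a valid integer
FROM quoted_input;
```

Queries can then read from quoted_input_typed and see c as an integer, while the underlying files stay untouched.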
There are multiple ways to achieve this:
Use the CSV SerDe.
Use the regex SerDe with a regex such as "\"(.*)\"\;\"(.*)\"\;\"(.*)\""
Load the data into an external table, then remove the double quotes:
CREATE EXTERNAL TABLE source(
a string,
b String,
c String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LOCATION 'xyz';
CREATE TABLE destination AS
SELECT REGEXP_REPLACE(a,'"','') AS a,
REGEXP_REPLACE(b,'"','') AS b,
CAST(REGEXP_REPLACE(c,'"','') AS BIGINT) AS c
FROM source;
Hive query to remove double quotes around the string.
Example:
col2 value: "my name is, abc"
select col1, (regexp_replace(col2,'"','')) as col2 from table;
Output: my name is, abc
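The pattern '"' above removes every double quote, including any inside the value. If only the surrounding pair should go, the pattern can be anchored to the start and end of the string. A minimal sketch, assuming a Hive version that allows SELECT without a FROM clause (0.13 or later); the sample literal is made up for illustration:

```sql
-- '^"' matches a quote at the very start, '"$' a quote at the very end.
-- Quotes in the middle of the value are left alone.
SELECT regexp_replace('"my "quoted" name"', '^"|"$', '');
```

For this literal the result keeps the inner quotes: my "quoted" name.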

I have a map of inputs inside square brackets and I want to read it in Hive

Input File:
[Tom,123,0,jump]
[jerry,345,1,run]
I want to read the above input in Hive.
My DDL is:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tomjerrry
( name string, id
int, isGood int, activity string )
row format delimited fields terminated by ','
LOCATION '/user/myname/sample.txt'
When I try reading it with
Select name from db1.tomjerrry
I get:
[Tom
[jerry
How do I remove the square bracket from the Hive output?
Add ESCAPED BY '[', i.e.:
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tomjerrry ( name ARRAY<string>, id int, isGood int, activity string )
row format delimited fields terminated by ',' ESCAPED BY '['
LOCATION '/user/myname/sample.txt';
Alternatively, edit the input file to remove the '[' characters.
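If the input file cannot be edited, another option is to leave the raw table as it is and strip the brackets in a view. The following is a sketch under assumptions, not a tested recipe: tomjerry_raw, tomjerry and the directory path are hypothetical names, and the path is written as a directory because Hive generally expects LOCATION to be a directory that contains the data file rather than the file itself.

```sql
-- Hypothetical names: tomjerry_raw, tomjerry, and the directory path are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS db1.tomjerry_raw (
  name     STRING,
  id       INT,
  isGood   INT,
  activity STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/myname/tomjerry/';

-- Strip the leading '[' from the first column and the trailing ']' from the last.
-- '\\[' in the Hive literal becomes the regex \[ (a literal bracket).
CREATE VIEW db1.tomjerry AS
SELECT
  regexp_replace(name, '^\\[', '')     AS name,
  id,
  isGood,
  regexp_replace(activity, '\\]$', '') AS activity
FROM db1.tomjerry_raw;
```

Selecting name from the view should then give Tom and jerry without the bracket.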

Hive: How to delimit rows using a string literal

I need help with a Hive problem.
I have a text file with a single long line, e.g.:
JASON 29\SASHA 24\CHRISTINE 15\ROBERT 20\
I need to create a table in Hive whose rows are delimited by "\" (backslash), so that loading the line above ("JASON 29\SASHA 24....") inserts 4 rows into my table.
In other words, I want my custom character to be the row delimiter instead of the default "\n".
I wrote this DDL:
CREATE TABLE newline_tab
(
name STRING,
age INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\\'
STORED AS TEXTFILE;
But I am unable to create the table, and I get the following error:
FAILED: SemanticException 9:20 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\''
Any help would be appreciated :)
CREATE TABLE IF NOT EXISTS employee ( eid int, name String,
salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
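Since the error message says LINES TERMINATED BY accepts only '\n', one workaround is to load the long line into a single string column and split it into rows inside Hive. This is a sketch under assumptions: people_raw and people_rows are hypothetical names, and name and age are assumed to be separated by whitespace as in the sample line.

```sql
-- Hypothetical names: people_raw and people_rows.
-- With the default field delimiter, each physical line lands whole in one column.
CREATE TABLE people_raw (line STRING);

-- split() takes a regex; '\\\\' in the Hive literal is the regex \\ (one literal backslash).
-- explode() turns the resulting array into one row per record.
CREATE TABLE people_rows AS
SELECT
  split(rec, '\\s+')[0]              AS name,
  CAST(split(rec, '\\s+')[1] AS INT) AS age
FROM people_raw
LATERAL VIEW explode(split(line, '\\\\')) t AS rec
WHERE rec != '';   -- drops the empty piece after the trailing backslash
```

After loading the file into people_raw, the second statement should produce one row per backslash-separated record.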

Hive load CSV with commas in quoted fields

I am trying to load a CSV file into a Hive table like so:
CREATE TABLE mytable
(
num1 INT,
text1 STRING,
num2 INT,
text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable;
The CSV is delimited by a comma (,) and looks like this:
1, "some text, with comma in it", 123, "more text"
This returns corrupt data since there is a ',' inside the first string.
Is there a way to set a text delimiter or make Hive ignore the ',' in strings?
I can't change the delimiter of the CSV since it gets pulled from an external source.
If you can re-create or parse your input data, you can specify an escape character in the CREATE TABLE statement:
ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\';
Hive will then accept this line as 4 fields:
1,some text\, with comma in it,123,more text
The problem is that Hive's delimited row format doesn't handle quoted text. You either need to pre-process the data to change the delimiter between the fields (e.g. with a Hadoop streaming job), or you can try a custom CSV SerDe that uses OpenCSV to parse the files.
As of Hive 0.14, the CSV SerDe is a standard part of the Hive install:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
(See: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde)
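Put together for the table in the question, the SerDe clause might be used as follows. Treat this as a sketch rather than a verified setup: mytable_csv is a hypothetical name, the three properties are spelled out with the defaults documented on the wiki page, and all columns are declared STRING because this SerDe reads everything as a string. How the spaces after the commas in the sample row are handled depends on the parser, so check that against real data.

```sql
-- Hypothetical table name: mytable_csv.
-- The property values shown are the documented defaults, written out for clarity.
CREATE TABLE mytable_csv (
  num1  STRING,
  text1 STRING,
  num2  STRING,
  text2 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "\"",
  "escapeChar"    = "\\"
)
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/data.csv'
OVERWRITE INTO TABLE mytable_csv;
```

Numeric columns can be recovered afterwards with CAST(num1 AS INT) in a query or view.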
Keep the delimiter in single quotes and it will work:
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';
This will work.
For a semicolon delimiter, add a backslash: FIELDS TERMINATED BY '\;'
For Example:
CREATE TABLE demo_table_1_csv
COMMENT 'my_csv_table 1'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\;'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 'your_hdfs_path'
AS
select a.tran_uuid,a.cust_id,a.risk_flag,a.lookback_start_date,a.lookback_end_date,b.scn_name,b.alerted_risk_category,
CASE WHEN (b.activity_id is not null ) THEN 1 ELSE 0 END as Alert_Flag
FROM scn1_rcc1_agg as a LEFT OUTER JOIN scenario_activity_alert as b ON a.tran_uuid = b.activity_id;
I have tested it, and it worked.
The org.apache.hadoop.hive.serde2.OpenCSVSerde SerDe worked for me. My delimiter was '|' and one of the columns is enclosed in double quotes.
Query:
CREATE EXTERNAL TABLE EMAIL(MESSAGE_ID STRING, TEXT STRING, TO_ADDRS STRING, FROM_ADDRS STRING, SUBJECT STRING, DATE STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "|",
"quoteChar" = "\"",
"escapeChar" = "\""
)
STORED AS TEXTFILE location '/user/abc/csv_folder';
