HIVE delimiter \n ^M issue - hadoop

I have a file whose columns are delimited by ^A and whose rows are delimited by the '\n' newline character.
I first uploaded it to HDFS and then created the table in Hive with a command like this:
CREATE EXTERNAL TABLE IF NOT EXISTS html_sample (
    ts string,
    url string,
    html string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
LOCATION '/tmp/directoryname/';
However, when I run a SELECT against that table, the result is a mess.
The table looks like this:
ts url html
10082013 http://url.com/01 <doctype>.....style="padding-top: 10px;
text-align... NULL NULL
text-align... NULL NULL
text-align... NULL NULL
10092013 http://url.com/02 <doctype>.....style="padding-top: 10px;
text-align... NULL NULL
text-align... NULL NULL
text-align... NULL NULL
Then I went back to the text file and found that it contains several ^M characters, which makes Hive treat each ^M as a newline character.
When I first created the file, I intentionally removed all newline characters from the HTML to guarantee that each record is on a single line. However, I just cannot understand how on earth Hive could treat a ^M as a newline character. How can I get around this without modifying my file?
(I know it might be possible to do a global substitution in vi or sed, but it just doesn't make sense to me how Hive could treat ^M as \n.)

^M is a way in which Vim displays Windows line endings.
Here's more on this:
What does ^M character mean in Vim?
Hive, in turn, uses TextInputFormat, which happens to treat it as a valid line terminator.
Depending on the versions of Hadoop and Hive you're using, there are different ways to overcome this (from changing a property in the configuration to a custom InputFormat implementation).
Just find a way to specify the record separator explicitly.
And yeah, LINES TERMINATED BY '\n' does not do what it looks like: I'm using Hive 0.11, and the only value it actually accepts is '\n', but even that is not passed down to TextInputFormat.
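One concrete way to specify the separator explicitly, sketched below on the assumption that your Hadoop version honors the textinputformat.record.delimiter property (Hadoop 0.23+; the exact quoting of the value can vary by Hive version):

-- Hedged sketch: force '\n' to be the only record separator for this
-- session, so TextInputFormat stops splitting on bare carriage returns.
SET textinputformat.record.delimiter='\n';
SELECT ts, url FROM html_sample LIMIT 10;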

Related

Importing CSV via ORACLE SQL LOADER, line break in double-quote

I'm trying to import a CSV via Oracle SQL*Loader, but I have a problem because some of the data has line breaks within the double quotes. For example:
(test.csv)
John,123,New York
Tom,456,Paris
Park,789,"Europe
London, City"
I think SQL*Loader uses the line break character to separate records, and this data generates the error "second enclosure string not present".
I use this control file:
(control.txt)
OPTIONS(LOAD=-1, ERRORS=-1)
LOAD DATA
INFILE 'test.csv'
TRUNCATE
INTO TABLE TMP
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
field1,
field2,
field3
)
and I execute a command like this:
sqlldr 'scott/tiger' control='control.txt' log='result.txt'
I want to import 3 records, not 4.
Can SQL*Loader ignore line breaks within double quotes?
It seems you need to get rid of the carriage return and/or line feed characters. So replace
field3
with
field3 CHAR(4000) "REPLACE(TRIM(:field3),CHR(13)||CHR(10))"
or
field3 CHAR(4000) "REPLACE(REPLACE(TRIM(:field3),CHR(13)),CHR(10))"
where the TRIM() is useful for removing leading and trailing whitespace.
If you would rather preserve the embedded line breaks, construct the control file using the "str" (stream) clause on the INFILE option line to set the end-of-record character. It tells sqlldr that hex 0D (carriage return, or ^M) is the record separator, so it will ignore the linefeeds inside the double quotes:
INFILE 'test.csv' "str x'0D'"
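For completeness, a sketch of that second variant as a full control file, built from the question's own control file; it assumes each record in test.csv really does end in a bare carriage return:

OPTIONS(LOAD=-1, ERRORS=-1)
LOAD DATA
INFILE 'test.csv' "str x'0D'"
TRUNCATE
INTO TABLE TMP
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(
field1,
field2,
-- embedded linefeeds inside the quotes now survive into field3
field3 CHAR(4000)
)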

download the csv file from Hadoop Hue returns unreadable code

I use Apache Hue (a web user interface) to interact with Hadoop and Hive.
I saved the result of a Hive query to an HDFS directory (the result set is really large) and then downloaded the result file with the Hue file browser.
Everything looked fine, but when I opened the CSV file, I found that the separator is some unreadable character, like this:
How can I solve the separator problem?
SOH (start of heading), typed as Ctrl+A and usually displayed as ^A, is the default field delimiter used by Hive, and all the \N values represent NULL.
The solution depends on the version of Hive used:
As of Hive 0.11.0 the separator used can be specified; in earlier versions it was always the ^A character (\001). However, custom separators are only supported for LOCAL writes in Hive versions 0.11.0 to 1.1.0 – this bug is fixed in version 1.2.0.
If you are using Hive >= 1.2.0, you can add a FIELDS TERMINATED BY clause to your INSERT OVERWRITE DIRECTORY statement to choose your delimiter (the ROW FORMAT clause goes before the SELECT):
INSERT OVERWRITE DIRECTORY hdfs_directory
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT statement ...;
Refer to HIVE-3682 and HIVE-5672.
Alternatively, I suggest replacing the noisy SOH separator with "," and dropping the \N markers directly.
If you use Python with pandas, it's essentially a one-liner (\001 is the SOH byte; \N, written \\N in Python, is Hive's NULL marker):
import pandas as pd
pd.read_csv("your_file.csv", sep="\001", na_values="\\N").to_csv("your_new_file.csv", index=False)
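Since the result set is really large, a hedged variant of the same idea streams the file in chunks instead of loading it all into memory (file names and chunk size are illustrative):

import pandas as pd

# Sketch: convert the \001-separated dump chunk by chunk; the header is
# written only once, on the first chunk.
with open("your_new_file.csv", "w") as out:
    for i, chunk in enumerate(pd.read_csv("your_file.csv", sep="\001",
                                          na_values="\\N", chunksize=100000)):
        chunk.to_csv(out, index=False, header=(i == 0))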

Remove spaces and UTF while writing hive table into HDFS files

I am trying to write a Hive table into HDFS files using the following query:
insert overwrite directory '<HDFS Location>'
select customerid, '\t', f1, ',', f2, ',', f3, ',', f4, ',', f5
from sd_cust_product_recomm_all_emailid_model2
where EMAILID is not null;
I am getting ^A separators and stray characters in the file. The output is something like this:
customer1\t^Af1^A,^Af2^A,^Af3^A,^Af4^A,^Af5^A,
I want the output in the following format:
customer1\tf1,f2,f3,f4,f5
customer2\tf1,f2,f3,f4,f5
with no extra separators or control characters.
Thanks for the help
The default delimiter is the issue: data written to the filesystem is serialized as text with columns separated by ^A.
By explicitly specifying the field delimiter (comma) and the row delimiter (\n) you can overcome the issue:
insert overwrite directory '<HDFS Location>'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
select customerid, '\t', f1, f2, f3, f4, f5
from sd_cust_product_recomm_all_emailid_model2
where EMAILID is not null;
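One caveat to hedge: with FIELDS TERMINATED BY ',', the literal '\t' column in that select is itself surrounded by commas in the output. A sketch that instead glues the tab in with concat(), so the result matches the customer1\tf1,f2,... shape exactly (table and column names taken from the question):

insert overwrite directory '<HDFS Location>'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
select concat(customerid, '\t', f1), f2, f3, f4, f5
from sd_cust_product_recomm_all_emailid_model2
where EMAILID is not null;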

How to handle a delimiter in Hive

How do we handle data in Hive when a \t appears inside a value and the field delimiter is also \t? Suppose, for example, there is a column Street, of type string, with the value XXX\tYYY, and the table was created with \t as the field delimiter. How will the delimiter work? Will the \t inside the value also be treated as a delimiter?
If your columns with \t values are enclosed by a quote character like ", then you could use csv-serde to parse the data like this:
Here is a sample dataset that I have loaded:
R1Col1 R1Col2 "R1Col3 MoreData" R1Col4
R2Col2 R2Col2 "R2Col3 MoreData" R2Col4
Register the jar from hive console
hive> add jar /path/to/csv-serde-1.1.2-0.11.0-all.jar;
Create a table with the specified serde and custom properties
hive> create table test_table(c1 string, c2 string, c3 string, c4 string)
> row format serde 'com.bizo.hive.serde.csv.CSVSerde'
> with serdeproperties(
> "separatorChar" = "\t",
> "quoteChar" = "\"",
> "escapeChar" = "\\"
> )
> stored as textfile;
Load your dataset into the table:
hive> load data inpath '/path/to/file/in/hdfs' into table test_table;
Do a select * from test_table to check the results.
You could download the csv-serde from here.
Yes, it will treat it as a delimiter, the same as if you had a semicolon ; in the value and told it to split on semicolons: when the text is scanned, that character is interpreted as the edge of the field.
To get around this, I have used sed to find and replace characters before loading the data into Hive; or created the Hive table with a different delimiter; or left the default ^A (\001) and then, when extracting, used sed on the output to replace the \001 with commas, tabs, or whatever I needed. Running sed -i 's/oldval/newval/g' file on the command line replaces the characters in your file in place.
Is there a reason you chose \t as the delimiter instead of Hive's default field delimiter ^A? Since tab is a fairly common character in text, and Hadoop/Hive is used a lot for handling text, it is tough to find a good delimiter character.
We have faced the same issue in our data loads into Hadoop clusters. What we did: escaped the delimiter (prefixing the \t with \) wherever it appeared inside a data field, and added the following to the table definition:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' ESCAPED BY '\\' LINES TERMINATED BY '\n'
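A sketch of that clause in a complete, illustrative table definition (the table and column names are made up); the ESCAPED BY '\\' part is what lets an escaped tab survive inside a field:

CREATE TABLE addresses (
    street string,
    city string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;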

Loading data using Hive Sed command

I have my data in this format:
"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";
The fields are enclosed in "" and delimited by ;. The book name may also contain ';' in between.
Can you tell me how to load this data from the file into a Hive table?
The query below, which I am using now, obviously does not work:
create table books (isbn string, title string, year string, publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'
If possible, I want the isbn and year fields to be stored as int. Please help.
Also, I don't want to use the regex serde.
How can I use the sed command from Unix to clean the data and get my output?
I tried to learn about sed and found the replace option, so I can remove the " double quotes. But how can I handle the extra semicolon that comes in the middle of the data?
Please help.
I think you can preprocess with sed and then use the MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
sed -r ':a; s/^([^"]*("[^"]*"[^"]*)*);/\1XXXXX/g; t a; s/;/ /g; s/XXXXX/;/g' file
This sed matches the quote pairs so as not to touch what is between quotes, first putting a placeholder on the semicolons outside quoted text. Afterward it replaces the remaining ;'s (those inside the book title) with a space and restores the placeholder semicolons outside the quotes.
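For instance, run against the second sample line, the command should turn
"456";"mybook2;the best seller";"2004";"publisher2";
into
"456";"mybook2 the best seller";"2004";"publisher2";
leaving the field-separating semicolons intact.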
See here for more on how to load data using Hive, including an example of MetadataTypedColumnsetSerDe WITH SERDEPROPERTIES:
https://svn.apache.org/repos/asf/hive/trunk/serde/README.txt
create external table books (isbn int, title string, year int, publisher string)
row format serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ('separatorChar' = '\;', 'quoteChar' = '\"')
location 'S3 path/HDFS path for the file';
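One hedged caveat on that answer: OpenCSVSerde reads every column as a string regardless of the declared type, so if the int typing matters downstream, a view that casts is one way around it (the view name is illustrative):

CREATE VIEW books_typed AS
SELECT CAST(isbn AS INT) AS isbn,
       title,
       CAST(`year` AS INT) AS `year`,
       publisher
FROM books;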
