Hive: CREATE TABLE on unicode csv files - hadoop

On an HDInsight cluster, I am trying to create a Hive table over Unicode CSV files.
Invoke-Hive -Query @"
CREATE EXTERNAL TABLE TestUnicode(Numeric1 INT, Numeric2 INT, Numeric3 INT, Name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'wasb://$containerName@$storageAccountName.blob.core.windows.net/TestUnicode.csv';
"@
But Hive is not recognising the Unicode strings properly, and all integer fields are loaded as NULL.

Change the encoding of TestUnicode.csv to UTF-8. Works for me. Hive's default text SerDe expects UTF-8, so a UTF-16 encoded file shows up as garbled strings and NULL numeric values.

Related

How to add file to Hive

I have a file where all the column delimiters, viewed in Notepad++, are shown as EOT, SOH, ETX, ACK, BEL, BS, ENQ.
I know the schema of the table, but I am totally new to these technologies and I cannot load the file into the table. Can I do it through a UI, like a CSV file, and if so, with what delimiter?
Thank you in advance for your help.
It is pretty easy: as you have mentioned, the file is "," separated.
Let's create a simple table with one column:
CREATE TABLE test1 (col1 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Note the clause FIELDS TERMINATED BY ',': it states that fields are separated by ",". If the columns are separated by tabs instead, we can change it to '\t'.
Once the table is created, we can load the file using the commands below.
If File is on local file system
LOAD DATA LOCAL INPATH '<complete_local_file_path>' INTO table test1;
If File is in HDFS
LOAD DATA INPATH '<complete_HDFS_file_path>' INTO table test1;
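If the delimiters in your file really are control characters (you mentioned they show as SOH, EOT, and so on in Notepad++) rather than commas, the same pattern works with an octal escape for the delimiter. A minimal sketch, assuming the file is SOH (\001) delimited and using placeholder column names:
CREATE TABLE test_ctrl (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001';

LOAD DATA INPATH '<complete_HDFS_file_path>' INTO TABLE test_ctrl;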
Hive is just an abstraction layer over HDFS, so you would add the file to HDFS in some folder, then build an EXTERNAL TABLE over top of it:
CREATE EXTERNAL TABLE name(...)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/path/to/folder/';
Can I do it through a UI, like a CSV file?
If you install Hue, then you can.

How to handle new line characters in hive?

I am exporting a table from Teradata to Hive. The table in Teradata has an address field which contains newline characters (\n). Initially I export the table from Teradata to a mounted filesystem path, and then I load it into Hive. The record counts mismatch between the Teradata table and the Hive table, since the newline characters split records in Hive.
NOTE: I don't want to handle this through Sqoop; I want to handle the newline characters while loading into Hive from the local path.
I got this to work by creating an external table with the following options:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;
Then I created a partition pointing to the directory that contains the data files (my table uses partitions).
i.e.
ALTER TABLE STG_HOLD_CR_LINE_FEED ADD PARTITION (part_key='part_week53') LOCATION '/ifs/test/schema.table/staging/';
NOTE: Be sure that when creating your data file you use '\' as the escape character.
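Put together, these options might look like the following sketch (the column list here is hypothetical; the table name is the one from the ALTER TABLE above):
CREATE EXTERNAL TABLE STG_HOLD_CR_LINE_FEED (
  record_id STRING,
  address STRING
)
PARTITIONED BY (part_key STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'
ESCAPED BY '\\'
STORED AS TEXTFILE;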
The LOAD DATA command in Hive only copies the data files directly into the HDFS table location.
The only reason Hive would split on a newline is that the table is stored as TEXTFILE, which by default uses newlines as record separators, not field separators.
To redefine the table you need something like
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY 'x'
LINES TERMINATED BY 'y'
where x is an escape character for fields containing newlines and y is the record delimiter (note that most Hive versions only accept '\n' for LINES TERMINATED BY).
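As a concrete, hypothetical instance of that snippet, with backslash as the escape character and newline as the record terminator:
CREATE TABLE addresses_raw (cust_id STRING, address STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;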

Does field delimiter matter in binary file formats in Hive?

In TEXTFILE format, data is stored as text with fields separated by the field delimiter; that is why a non-printable delimiter such as Ctrl-A is often preferred.
But does the field delimiter have any effect when creating a Hive table stored as RCFile, ORC, Avro or SequenceFile?
In some Hive tutorials I have seen a delimiter used with these binary file formats too.
Example:
create table olympic_orcfile (
  athelete STRING, age INT, country STRING, year STRING, closing STRING,
  sport STRING, gold INT, silver INT, bronze INT, total INT)
row format delimited fields terminated by '\t'
stored as orcfile;
Is the field delimiter ignored, or does it matter, for binary file formats in Hive?
It is ignored by RCFile, ORC and Avro, but it does matter for SequenceFile.
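For ORC, for example, the delimiter clause can simply be dropped. A minimal sketch using the same columns as the example above:
create table olympic_orc (
  athelete STRING, age INT, country STRING, year STRING, closing STRING,
  sport STRING, gold INT, silver INT, bronze INT, total INT)
stored as orc;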

Sqoop not loading CLOB type data into hive table properly

I am trying to use a Sqoop job to import data from Oracle, and one of the columns in the Oracle table has the data type CLOB and contains newline characters.
In this case, the option --hive-drop-import-delims is not working: the Hive table doesn't read the \n characters properly.
Please suggest how I can import the CLOB data into the target directory with all the characters parsed properly.

Import flat files with key=value pair in Hive

I have raw files in HDFS in the format
name=ABC age=10 Location=QWERTY
name=DEF age=15 Location=IWIORS
How do I import data from these flat files into a Hive table with columns 'name' and 'location' only?
You can do the following.
In table declaration, use:
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ' ' --space
        MAP KEYS TERMINATED BY '='
Your table will then have a single column with data type MAP, and you can retrieve values from that column by key.
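A minimal sketch of that approach (the table and column names are made up, and it uses COLLECTION ITEMS TERMINATED BY ' ' to split the space-separated key=value pairs into map entries, which is an assumption about the layout):
CREATE EXTERNAL TABLE kv_raw (attrs MAP<STRING, STRING>)
ROW FORMAT DELIMITED
  COLLECTION ITEMS TERMINATED BY ' '
  MAP KEYS TERMINATED BY '='
STORED AS TEXTFILE
LOCATION '/path/to/raw/files/';

SELECT attrs['name'] AS name, attrs['Location'] AS location FROM kv_raw;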
Other option:
Write your own SerDe. The link below explains the process for JSON data; I am sure you can customise it for your requirements:
http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
