The Problem
I’m having quite a few problems with my Hive tables that contain special characters (in French) in some of their row values. Basically, every special character (an accented letter or other diacritic) gets turned into gibberish (various weird symbols) when the data is queried, whether via the Hive CLI or other methods. The problem is not with column names, but with the actual row values and content.
For example, instead of printing "Variat°" or any other special character or accent mark, a SELECT statement returns this:
Variat� cancel
Infos & Conf
The Hive table is external, built on a CSV file in HDFS that is encoded in ISO-8859-1. Changing the encoding of the original file doesn’t produce any better result.
I'm using Hortonworks Data Platform 2.2 on Red Hat Enterprise Linux 6. The original CSV displays correctly in Linux.
The Question
I've looked on the web for similar problems, but it seems no one has encountered it. Or at least everybody uses only English when using Hive :) Some JIRAs have addressed issues with special characters in Hive table column names, but my problem is with the actual content of the rows.
How can I deal with this problem in Hive?
Is it not possible to display special characters in Hive?
Is there any "charset" option for Hive?
Any help would be greatly appreciated as I’m currently stuck.
Thank you in advance!
I had a similar issue, but since my source file was small I used Notepad++ to convert it to UTF-8 encoding.
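For a bigger file, another option worth trying (a sketch only, assuming the external table uses the default LazySimpleSerDe and a Hive build recent enough to honor the serialization.encoding SerDe property, roughly Hive 0.14, which is what HDP 2.2 ships; the table, columns, delimiter, and path below are placeholders) is to declare the source encoding on the table itself instead of re-encoding the file:

-- Hypothetical table, columns, delimiter, and HDFS location.
CREATE EXTERNAL TABLE french_data (id INT, label STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim' = ';', 'serialization.encoding' = 'ISO-8859-1')
STORED AS TEXTFILE
LOCATION '/path/to/csv_dir';

-- Or, for an already existing table:
ALTER TABLE french_data SET SERDEPROPERTIES ('serialization.encoding' = 'ISO-8859-1');

With the encoding declared, the SerDe decodes the Latin-1 bytes itself, so the accents should come through, provided the client terminal is set to UTF-8.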
Related
When processing files from source systems, one quite often converts them to UTF-8 CSVs, especially for sources with more or less exotic character sets.
If using Sqoop to access a database, how does that work? I don't see a conversion clause, but I do note that HDFS is UTF-8 by default. Is it automatic? I heard, but could not confirm, that Sqoop converts to UTF-8 as standard.
Is this correct?
Yes, this is so; actual tests bear it out.
Just wondering how anyone has dealt with handling extended ASCII in Hive, for example characters like §.
I see that character in the raw data stored as a string in Hive, but once I query or export the data it does not show up properly. Is there any way to retain the §?
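One way to narrow this down before changing anything (a hedged sketch; my_table and raw_col are placeholders, and encode()/hex() are built-ins from Hive 0.12 onward) is to dump the bytes Hive actually holds for a few rows:

-- Placeholder table/column names; shows the raw bytes behind the string value.
SELECT raw_col, hex(encode(raw_col, 'UTF-8')) AS stored_bytes
FROM my_table
LIMIT 10;

The UTF-8 bytes for § are C2 A7. If that is what shows up, the data inside Hive is intact and it is the export step or the terminal that is mangling it; if you instead see EF BF BD (the replacement character), the byte was already lost when Hive decoded the file as UTF-8, and declaring the file's real encoding on the table's SerDe (for example ISO-8859-1, along the lines of the serialization.encoding sketch earlier) is the usual way to retain the §.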
Question from a relative Hadoop/Hive newbie: How can I pass the contents of a Microsoft Word (binary) document as a parameter to a Hive function?
My goal is to be able to provide the full contents of a binary file (a Microsoft Word document in my particular use case) as a binary parameter to a UDTF. My initial approach has been to slurp the file's contents into a staging table and then provide it to the UDTF in a query later on, and this was how I attempted to build that staging table:
create table worddoc(content BINARY);
load data inpath '/path/to/wordfile' into table worddoc;
Unfortunately, there seem to be newlines in the Word document (or something acting enough like newlines) that result in the staging table having many rows instead of the single comprehensive blob I was hoping for. Is there some way of ensuring that the ingest doesn't get split into multiple rows? I've seen similar questions here on SO regarding other binary data, like image files, so that is why I'm guessing it's the newlines that are tripping me up.
Failing all that, is there a way to skip storing the file's contents in an intermediary Hive table and just provide the content directly to the UDTF at invocation time? Nothing obvious jumped out during my search through Hive's built-in functions, but maybe I am missing something.
Version-wise, the environment is Hive 0.13.1 and Hadoop 1.2.1 (although upgrades to both are pending).
This is a hacky workaround, but what I ended up doing is this:
1) Base64-encode the binary document and put the encoded file into HDFS
2) In Hive:
CREATE TABLE staging_table (content STRING);
LOAD DATA INPATH '/path/to/base64_encoded_file' INTO TABLE staging_table;
CREATE TABLE target_table (content BINARY);
-- unbase64() decodes the staged text back into the original bytes
INSERT INTO TABLE target_table SELECT unbase64(content) FROM staging_table;
Theoretically this should work for any arbitrary binary file that you'd want to squeeze into Hive this way. A gotcha to watch out for is making sure your base64 implementation produces a single-line file (my OS X base64 utility produces one-line output, while the base64 utility in a CentOS 6 VM I was using produced hundreds of lines); if it doesn't, you can join the lines manually before putting the file into HDFS (GNU base64 also accepts -w 0 to disable line wrapping).
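From there, feeding the blob to the UDTF is just a query over the target table. A minimal sketch, where parse_doc is a stand-in name for whatever UDTF you register and field1/field2 stand in for its output columns:

-- parse_doc is a hypothetical UDTF that accepts a BINARY argument.
SELECT t.field1, t.field2
FROM target_table
LATERAL VIEW parse_doc(content) t AS field1, field2;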
I have a Hive table with Unicode data. When I run a simple query, SELECT * FROM table, I get back the correct data in the correct Unicode encoding. However, when I add a filtering criterion such as ... WHERE column = 'some unicode value', the query returns nothing.
Is this a Hive limitation? Or is there any way to make Unicode filtering work with Hive?
Thank you!
You should use UTF-8 encoding when loading the data into the Hive table; then you can query it the way you wrote before, e.g. ... name LIKE '%你好%'.
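If loading as UTF-8 still leaves the filter returning nothing, a quick way to see why (a sketch; my_table, some_column, and the literal are placeholders for the ones in the question, and hex()/encode() are standard built-ins) is to compare the bytes of the stored value with the bytes of the literal as Hive parsed it:

-- If the two hex strings differ, the literal in the query script or session is not
-- encoded the same way as the data, which is why the equality filter matches nothing.
SELECT hex(encode(some_column, 'UTF-8'))          AS stored_bytes,
       hex(encode('some unicode value', 'UTF-8')) AS literal_bytes
FROM my_table
LIMIT 1;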
I am using Cloudera Sqoop to fetch data from an Oracle database into HDFS. Everything is going fine except for some characters like ® and ©, which are being converted to ®© in HDFS. (In Oracle, however, the data is stored without any problems.) Is there any way I can store these characters in HDFS as they are?
Sqoop Version: 1.3
Thanks,
Karthikeya
Which character set are you using in the Oracle database? Because Hadoop uses UTF-8, you should convert the data from the Oracle database if the character sets differ.
I would strongly suggest checking the actual bytes on HDFS rather than looking at the representation. I've seen too many cases where the data was stored just fine (and actually converted to UTF-8 by Sqoop automatically) and it was just the representation, terminal emulator, or whatever else was used for reading the data that was messing with the encoding. Download the file from HDFS and simply hexdump -C it to verify whether the encoding is indeed broken.