Sqoop character set conversion to UTF-8 - sqoop

When processing files from sources with more or less exotic character sets, one quite often converts them to UTF-8 CSVs.
If using Sqoop to access a database, how does that work? I do not see a conversion clause, but I do note that HDFS is UTF-8 by default. Is the conversion automatic? I heard, but could not confirm, that Sqoop converts to UTF-8 as standard.
Is this correct?

Yes, this is correct; actual tests that were executed confirm it.
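If you want to double-check this on your own cluster, pull one of the imported part files out of HDFS and confirm that its bytes decode cleanly as UTF-8. A minimal Python sketch, assuming a hypothetical local copy named part-m-00000:

# Verify that a Sqoop-imported part file is valid UTF-8.
# "part-m-00000" is a placeholder for whatever file Sqoop wrote for your table.
with open("part-m-00000", "rb") as f:
    raw = f.read()

try:
    raw.decode("utf-8")
    print("File decodes cleanly as UTF-8")
except UnicodeDecodeError as err:
    print("Not valid UTF-8:", err)

If the decode succeeds but the data still looks wrong on screen, the problem is the viewer's encoding rather than the file itself.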

Related

JDBC encoding Python: Comma delimited Pandas column names

I attempted to read data over a JDBC connection from Spark using JayDeBeApi, and the DataFrame returned by pandas.read_sql contains columns with comma-delimited names:
e.g. (A,p,p,l,e,s) ... (P,e,a,r,s)
df = pd.read_sql(query, jdbc_conn)
I realize this is an encoding problem, but the JDBC API doesn't have encoding or option methods to set the encoding the way pyodbc does. Is there a way to pass an encoding argument to the URL or the API?
Thanks for your help.
I had the same issue when connecting to an Oracle database through JayDeBeApi. It turned out the path to the JDK was not properly set. Once that was fixed, the parsing issue was gone.
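For reference, a minimal JayDeBeApi connection sketch; the driver class, URL, credentials and jar path below are placeholders, and the key point from the answer above is that the JVM started at connect time must come from a properly installed JDK (ideally export JAVA_HOME in the shell before starting Python):

import os
import jaydebeapi
import pandas as pd

# Placeholder JDK path; setting it here only works if done before the JVM starts.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk")

conn = jaydebeapi.connect(
    "oracle.jdbc.OracleDriver",              # JDBC driver class
    "jdbc:oracle:thin:@//dbhost:1521/ORCL",  # placeholder connection URL
    ["db_user", "db_password"],              # placeholder credentials
    "/path/to/ojdbc8.jar",                   # placeholder driver jar
)

df = pd.read_sql("SELECT * FROM some_table", conn)  # placeholder query
conn.close()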

Charset, Accents, Special Characters in Apache Hive

The Problem
I'm having quite a few problems with my Hive tables that contain special characters (in French) in some of their row values. Basically, everything that is a special character (like an accent on a letter or another diacritic) gets transformed into pure gibberish (various weird symbols) when querying the data (via the Hive CLI or other methods). The problem is not with the column names, but with the actual row values and content.
For example, instead of printing "Variat°" or any other special character or accent mark, I get this as a result (when using a select statement):
Variat� cancel
Infos & Conf
The Hive table is external, built from a CSV file in HDFS that is encoded in ISO-8859-1. Changing the encoding of the original file doesn't produce any better result.
I'm using Hortonworks distribution 2.2 on Red Hat Enterprise Linux 6. The original CSV displays correctly in Linux.
The Question
I've looked on the web for similar problems, but it would seem that no one has encountered it. Or at least everybody uses only English when using Hive :) Some Jiras have addressed issues with special characters in Hive table column names, but my problem is with the actual content of the rows.
How can I deal with this problem in Hive?
Is it not possible to display special characters in Hive?
Is there any "charset" option for Hive?
Any help would be greatly appreciated as I’m currently stuck.
Thank you in advance!
I had a similar issue, but since my source file was small I used Notepad++ to convert it to UTF-8 encoding.
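For files too large for an editor, the same conversion can be scripted; a minimal Python sketch (the file names are placeholders), after which the UTF-8 copy can be pushed back to HDFS for the external table:

# Re-encode an ISO-8859-1 CSV as UTF-8 before loading it into HDFS/Hive.
# The file names below are placeholders.
with open("source_latin1.csv", "r", encoding="iso-8859-1") as src, \
     open("source_utf8.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)

This does as a stream what Notepad++ does in memory, so it also works for large files.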

Hadoop Input Formats - Usage

I know there are different file formats in Hadoop. By default Hadoop uses the text input format. What are the advantages/disadvantages of using the text input format?
What are the advantages/disadvantages of Avro over the text input format?
Also, please help me understand the use cases for the different file formats (Avro, Sequence, TextInput, RCFile).
I believe there are no advantages to Text as the default other than that its contents are human readable and friendly. You can easily view the contents by issuing hadoop fs -cat.
The disadvantages of the Text format are:
It takes more space on disk, which impacts production job efficiency.
Writing/parsing text records takes more time.
There is no way to preserve data types when the text is composed of multiple columns.
The Sequence, Avro and RCFile formats have very significant advantages over the Text format.
Sequence - The key/value objects are stored directly in binary format through Hadoop's native serialization process, by implementing the Writable interface. The data types of the columns are well preserved, and parsing the records back with the correct data types is also easy. It obviously takes less space than Text because of the binary format.
Avro - A very compact binary storage format for Hadoop key/value pairs that reads/writes records through Avro serialization/deserialization. It is very similar to the Sequence file format but additionally provides language interoperability and cell versioning.
You may choose Avro over Sequence only if you need cell versioning, or if the stored data will be used by other applications written in languages other than Java. Avro files can be processed in languages like C, Ruby, Python, PHP and Java, whereas Sequence files are specific to Java.
RCFile - The Record Columnar File format is column oriented. It is a Hive-specific storage format designed to let Hive load data faster and reduce storage space.
Apart from these, you may also consider the ORC and Parquet file formats.
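To make the Avro point concrete, here is a minimal sketch using the avro Python package (the schema, records and file name are made up for illustration); the resulting file carries its schema with it and can be read back from Java, C, Ruby and so on:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# A made-up schema: typed fields survive the round trip, unlike plain text
# where every column comes back as a string.
# (Some versions of the library expose this as avro.schema.Parse.)
schema = avro.schema.parse("""
{"type": "record", "name": "Sale",
 "fields": [{"name": "item", "type": "string"},
            {"name": "qty",  "type": "int"}]}
""")

writer = DataFileWriter(open("sales.avro", "wb"), DatumWriter(), schema)
writer.append({"item": "Apples", "qty": 3})
writer.append({"item": "Pears", "qty": 5})
writer.close()

reader = DataFileReader(open("sales.avro", "rb"), DatumReader())
for record in reader:
    print(record)   # {'item': 'Apples', 'qty': 3} ...
reader.close()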

Issue while loading data containing characters like ® and © from Oracle to HDFS - Hadoop Distributed File System

I am using Cloudera Sqoop to fetch data from an Oracle database into HDFS. Everything is going fine except that some characters like ® and © are being converted to ®© in HDFS. (In Oracle, however, the data is stored without any problems.) Is there any way I can store these characters in HDFS as they are?
Sqoop Version: 1.3
Thanks,
Karthikeya
Which character set are you using in the Oracle database? Because Hadoop uses UTF-8, you should convert the data coming from the Oracle database if the character sets are different.
I would strongly suggest checking the actual bytes on HDFS rather than looking at the rendered representation. I've seen too many cases where the data was stored just fine (and actually converted to UTF-8 by Sqoop automatically) and it was just the representation/terminal emulator/whatever else was used for reading the data that was messing with the encoding. Download the file from HDFS and simply hexdump -C it to verify whether the encoding is indeed broken.
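The symptom described in the question is exactly what correctly stored UTF-8 looks like when something downstream decodes it as Latin-1/Windows-1252 (the exact garbled characters depend on the viewer's code page). A small Python sketch of the diagnosis, with the part file name as a placeholder:

# What a correctly encoded UTF-8 "®" looks like when misread as Latin-1:
print("®".encode("utf-8"))                     # b'\xc2\xae'
print("®".encode("utf-8").decode("latin-1"))   # 'Â®' - mojibake, not corruption

# Inspect the real bytes of the file pulled down from HDFS.
# "part-m-00000" is a placeholder for the Sqoop output file.
with open("part-m-00000", "rb") as f:
    chunk = f.read(200)
print(chunk)   # look for c2 ae / c2 a9 byte pairs: valid UTF-8 for ® and ©

If those byte pairs are present, the data on HDFS is fine and only the tool displaying it needs its encoding fixed.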

Is there any way to convert Oracle data files from Chinese Simplified (HZ) encoding to Unicode (UTF-8) encoding

Does anyone have any idea?
Assuming that the NLS settings on both machines are correct and that the data on the source is properly encoded, since you are just trying to move data from one database to another, just about any Oracle tool should work. You could use the export and import utilities (classic or DataPump versions depending on the version of Oracle). You could also use Streams or materialized views to copy the data from one database to another. Or you could set up a database link and query data over the database link. The Oracle client will automatically convert the data from the source character set to the destination character set.
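As a complement to the Oracle tools listed above, the same client-side conversion can be seen from Python with cx_Oracle: connect requesting UTF-8, fetch the rows, and write them out as UTF-8. This is only an illustrative sketch, not one of the tools the answer recommends; the credentials, DSN, table and column names are placeholders:

import cx_Oracle

# Placeholder credentials/DSN; with the encoding argument the client asks for
# UTF-8 and converts from the database character set automatically.
conn = cx_Oracle.connect("db_user", "db_password", "dbhost:1521/ORCL",
                         encoding="UTF-8")
cur = conn.cursor()
cur.execute("SELECT id, name FROM some_table")   # placeholder table/columns

with open("some_table_utf8.csv", "w", encoding="utf-8") as out:
    for row_id, name in cur:
        out.write(f"{row_id},{name}\n")

cur.close()
conn.close()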
