How to store special characters in Hive? - hadoop

I've been playing around with Spark, Hive and Parquet. I have some data in my Hive table, and here is how it looks (warning: French language ahead):
Bloqu� � l'arriv�e NULL
Probl�me de connexion Bloqu� en hub
Obviously there's something wrong here.
What I do is: I read a Teradata table as a DataFrame with Spark, write it out as a Parquet file, and then use that file to back a Hive table. Here's my create table script:
CREATE TABLE `table`(
`lib` VARCHAR(255),
`libelle_sous_cause` VARCHAR(255)
)
STORED AS PARQUET
LOCATION
'hdfs://location';
I don't really know what causes this; it might be some special encoding issue between Teradata > Parquet or Parquet > Hive, I'm not sure.
Any help will be appreciated, thanks.

I figured it out: the solution was simply to use STRING instead of VARCHAR:
CREATE TABLE `table`(
`lib` STRING,
`libelle_sous_cause` STRING
)
STORED AS PARQUET
LOCATION
'hdfs://location';
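If the table already exists, one possible shortcut (a sketch only; I'm assuming Hive accepts this metadata-only type change for a Parquet-backed table) is to alter the columns in place instead of recreating the table:
-- Sketch: switch the existing columns from VARCHAR to STRING in place.
-- This only changes the table metadata; the Parquet files are untouched.
ALTER TABLE `table` CHANGE `lib` `lib` STRING;
ALTER TABLE `table` CHANGE `libelle_sous_cause` `libelle_sous_cause` STRING;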

I've run into the same problem when doing a Sqoop import from Teradata to Hadoop. When extracting from Teradata, try wrapping the VARCHAR columns that may have this issue in the SELECT, like this:
SELECT
NAME,
AGE,
TRIM(CAST(TRANSLATE(COLUMNOFINTEREST USING latin_to_unicode WITH ERROR) AS VARCHAR(50)))
FROM
TABLENAME;
COLUMNOFINTEREST is your column that would have special characters.
Let me know if that works.

Related

Stop sqoop from converting datetime to bigint

Recently I noticed that whenever I ingest from a SQL database using Sqoop, all datetime fields are converted to a bigint (epoch * 1000) instead of to String.
Important to note: I'm storing as parquet.
I have been trying a bunch of sqoop flags like "--map-column-java" but I don't want to manually define this for hundreds of columns in thousands of tables.
What flag am I missing to prevent this sqoop behaviour?
It seems that sqoop didn't do this when storing in plain text.
Instead of letting Sqoop do its arcane magic on my tables, I decided to do the following (a rough sketch of these steps follows below):
1. Ingest into a temporary table, stored as text.
2. Create a table (if not exists) like the temporary table, stored as Parquet.
3. Insert overwrite from the text-stored temporary table into the Parquet-stored table.
This allows for proper date formatting without the hassle of (maybe non-existent) configuration and settings tweaking in Sqoop.
The only tradeoff is that it's slightly slower.
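Here is that sketch in HiveQL (all table and column names are made up for illustration; the real schema comes from your Sqoop import):
-- 1) Sqoop ingests into a plain-text staging table; datetimes arrive as readable strings
CREATE TABLE IF NOT EXISTS mytable_staging_txt (
  id BIGINT,
  created_at STRING,
  payload STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- 2) Same layout, stored as Parquet
CREATE TABLE IF NOT EXISTS mytable_parquet (
  id BIGINT,
  created_at STRING,
  payload STRING
)
STORED AS PARQUET;

-- 3) Copy the staged rows into the Parquet table
INSERT OVERWRITE TABLE mytable_parquet
SELECT * FROM mytable_staging_txt;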

How to query sorted/indexed columns in Impala

I have to make a POC with Hadoop for a database using interactive queries (a ~300 TB log database). I'm trying Impala, but I didn't find any way to use sorted or indexed data. I'm a newbie, so I don't even know if it is possible.
How to query sorted/indexed columns in Impala ?
By the way, here is my table's code (simplified).
I would like to have fast access on the "column_to_sort" column below.
CREATE TABLE IF NOT EXISTS myTable (
unique_id STRING,
column_to_sort INT,
content STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED AS textfile;

Case sensitive column names in Hive

I am trying to create an external Hive table with partitions. Some of my column names have upper case letters. This caused a problem when creating tables, since the values of columns whose names contain upper case letters were returned as NULL. I then modified the ParquetSerDe to handle this using SERDEPROPERTIES, and this was working with external tables (not partitioned). Now I am trying to create an external table WITH partitions, and whenever I try to access the upper case columns (e.g. FieldName) I get this error:
select FieldName from tablename;
FAILED: RuntimeException java.lang.RuntimeException: cannot find field
FieldName from
[org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@4f45884b,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@8f11f27,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@77e8eb0e,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@1dae4cd,
org.apache.hadoop.hive.serde2.objectinspector.UnionStructObjectInspector$MyField@623e336d
]
Are there any suggestions you can think of? I cannot change the schema of the data source.
This is the command I use to create tables -
CREATE EXTERNAL TABLE tablename (fieldname string)
PARTITIONED BY (partion_name string)
ROW FORMAT SERDE 'path.ModifiedParquetSerDeLatest'
WITH SERDEPROPERTIES ("casesensitive"="FieldName")
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
And then add partition:
ALTER TABLE tablename ADD PARTITION (partition_name='partitionvalue')
LOCATION '/path/to/data'
This is an old question, but the partition column ends up being case sensitive because its name becomes part of the directory path on the Unix filesystem where the data is stored: the path "/columnname=value/" is always different from the path "/columnName=value/" on Unix.
So it should be considered bad practice to rely on case-insensitive column names in Hive.
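To illustrate (a purely hypothetical sketch reusing the table from the question): the column name is baked into the partition directory, so two locations that differ only in case are two different HDFS directories.
-- Hypothetical paths: on a case-sensitive filesystem these are two distinct directories,
-- even though a case-insensitive reading of the column name would treat them as the same.
ALTER TABLE tablename ADD PARTITION (partition_name='a')
LOCATION '/path/to/data/partition_name=a';

ALTER TABLE tablename ADD PARTITION (partition_name='b')
LOCATION '/path/to/data/Partition_Name=b';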

Share data between hive and hadoop streaming-api output

I have several Hadoop streaming API programs that produce output with this output format:
"org.apache.hadoop.mapred.SequenceFileOutputFormat"
The streaming API programs can read those files back with the input format "org.apache.hadoop.mapred.SequenceFileAsTextInputFormat".
The data in the output file looks like this:
val1-1,val1-2,val1-3
val2-1,val2-2,val2-3
val3-1,val3-2,val3-3
Now I want to read the output with Hive. I created a table with this script:
CREATE EXTERNAL
TABLE IF NOT EXISTS table1
(
col1 int,
col2 string,
col3 int
)
PARTITIONED BY (year STRING,month STRING,day STRING,hour STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/hive/table1';
When I query the data with the query
select * from table1
The result will be
val1-2,val1-3
val2-2,val2-3
val3-2,val3-3
It seems the first column has been ignored. I think Hive just uses the values as output, not the keys. Any ideas?
You are correct. One of the limitations of Hive right now is that it ignores the keys from the SequenceFile format. By "right now" I am referring to Hive 0.7, but I believe it's a limitation of Hive 0.8 and Hive 0.9 as well.
To circumvent this, you might have to create a new input format for which the key is null and the value is the combination of your present key and value. Sorry, I know this was not the answer you were looking for!
It should be fields terminated by ',' instead of fields terminated by '\t', I think.
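For reference, here is the question's CREATE TABLE with only that delimiter change applied (everything else copied as-is from the question, so treat it as a sketch rather than a verified fix):
CREATE EXTERNAL TABLE IF NOT EXISTS table1
(
col1 int,
col2 string,
col3 int
)
PARTITIONED BY (year STRING, month STRING, day STRING, hour STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
LOCATION '/hive/table1';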

Importing data from HDFS to Hive table

I have my data in data/2011/01/13/0100/file in HDFS; each of these files contains tab-separated data, say name, ip, url.
I want to create a table in Hive and import the data from HDFS; the table should contain time, name, ip and url.
How can I import these using Hive? Or should the data be in some other format so that I can import the time as well?
You need to create the table to load the files into and then use the LOAD DATA command to load the files into the Hive tables. See the Hive documentation for the precise syntax to use.
Regards,
Jeff
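A minimal sketch of that LOAD DATA approach, assuming a tab-delimited managed table (the table name is illustrative):
-- Create a plain table matching the file layout
CREATE TABLE log_data_flat (name STRING, ip STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Move one HDFS file into the table (LOAD DATA INPATH moves the file into the table's warehouse directory)
LOAD DATA INPATH 'data/2011/01/13/0100/file' INTO TABLE log_data_flat;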
To do this you have to use partitions, read more about them here:
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Add_Partitions
partition column in hive
You can create an external table for such data.
Something like:
CREATE EXTERNAL TABLE log_data (name STRING, ip STRING, url STRING)
PARTITIONED BY (year BIGINT, month BIGINT, day BIGINT, hour BIGINT)
row format delimited fields terminated by '\t' stored as TEXTFILE
location 'data'
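To capture the time as well, you would then register each hourly directory as a partition, something like the following (the values are illustrative, taken from the example path in the question, and assume '0100' in the path means hour 01):
-- Register one hourly directory as a partition of the external table;
-- the partition columns then act as the time columns in queries.
ALTER TABLE log_data ADD PARTITION (year=2011, month=1, day=13, hour=1)
LOCATION 'data/2011/01/13/0100';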
