Ambari Hive UTF-8 problems - hadoop

I have a problem with Cyrillic symbols in Hive tables. Installed versions:
ambari-server 2.4.2.0-136
hive-2-5-3-0-37 1.2.1000.2.5.3.0-37
Ubuntu 14.04
Here is the problem.
The locale is set to ru_RU.UTF-8:
spark@hadoop:~$ locale
LANG=ru_RU.UTF-8
LANGUAGE=ru_RU:ru
LC_CTYPE="ru_RU.UTF-8"
LC_NUMERIC="ru_RU.UTF-8"
LC_TIME="ru_RU.UTF-8"
LC_COLLATE="ru_RU.UTF-8"
LC_MONETARY="ru_RU.UTF-8"
LC_MESSAGES="ru_RU.UTF-8"
LC_PAPER="ru_RU.UTF-8"
LC_NAME="ru_RU.UTF-8"
LC_ADDRESS="ru_RU.UTF-8"
LC_TELEPHONE="ru_RU.UTF-8"
LC_MEASUREMENT="ru_RU.UTF-8"
LC_IDENTIFICATION="ru_RU.UTF-8"
LC_ALL=ru_RU.UTF-8
I connect to Hive and create a test table:
spark@hadoop:~$ beeline -n spark -u jdbc:hive2://spark@hadoop.domain.com:10000/
Connecting to jdbc:hive2://spark@hadoop.domain.com:10000/
Connected to: Apache Hive (version 1.2.1000.2.5.3.0-37)
Driver: Hive JDBC (version 1.2.1000.2.5.3.0-37)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 1.2.1000.2.5.3.0-37 by Apache Hive
0: jdbc:hive2://spark@hadoop.domain.com> CREATE TABLE `test`(`name` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' WITH SERDEPROPERTIES ( 'serialization.encoding'='UTF-8');
No rows affected (0,127 seconds)
Then I insert Cyrillic symbols:
0: jdbc:hive2://spark@hadoop.domain.com> insert into test values('привет');
INFO : Tez session hasn't been created yet. Opening session
INFO : Dag name: insert into test values('привет')(Stage-1)
INFO :
INFO : Status: Running (Executing on YARN cluster with App id application_1490211406894_2481)
INFO : Map 1: -/-
INFO : Map 1: 0/1
INFO : Map 1: 0(+1)/1
INFO : Map 1: 1/1
INFO : Loading data to table default.test from hdfs://hadoop.domain.com:8020/apps/hive/warehouse/test/.hive-staging_hive_2017-03-23_13-41-46_215_3133047104896717605-116/-ext-10000
INFO : Table default.test stats: [numFiles=1, numRows=1, totalSize=7, rawDataSize=6]
No rows affected (6,652 seconds)
Select from the table:
0: jdbc:hive2://spark@hadoop.domain.com> select * from test;
+------------+--+
| test.name |
+------------+--+
| ?#825B |
+------------+--+
1 row selected (0,162 seconds)
I've read through a lot of Apache Hive bug reports and tried Unicode, UTF-8, UTF-16 and several ISO encodings, with no luck.
Can somebody help me with this?
Thanks!

The folks from Hortonworks helped me with this issue. It seems to be a bug:
https://community.hortonworks.com/answers/90989/view.html
https://issues.apache.org/jira/browse/HIVE-13983
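The `?` output is the classic sign of text being forced through a character set that cannot represent Cyrillic somewhere on the write or read path. A minimal Python sketch of that failure mode (an illustration of the mechanism only, not Hive's actual internals):

```python
# Illustration of the failure mode (an assumption about what happens
# inside the serde path, not Hive's actual code): UTF-8 text pushed
# through a charset with no Cyrillic repertoire degrades to '?'.

text = "привет"

# A charset that cannot represent the characters replaces each one
# with '?', which matches what beeline displayed.
lossy = text.encode("ascii", errors="replace")
print(lossy)  # b'??????'

# Round-tripping through UTF-8 itself is lossless:
ok = text.encode("utf-8").decode("utf-8")
print(ok)  # привет
```

So the data is most likely mangled at encode time, not merely displayed wrong, which is consistent with the HIVE-13983 serde encoding bug linked above.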

Related

Hive query is not executing - YARN shutting down

I was running a Hive query; it shows no error, but the data does not get inserted. If I rerun the same query after 10 minutes, the records do get inserted into the table.
My query:
insert into tablename
values('exr','e','r','20220909')
When I checked the HiveServer2 logs I found the following:
hiveserver2.log:2022-06-08T05:31:32,981 INFO [HiveServer2-Background-Pool: Thread-6666]: SessionState (:()) - Status: Running (Executing on YARN cluster with App id application_55555_09090)
hiveserver2.log:2022-06-08T05:32:31,017 INFO [HiveServer2-Handler-Pool: Thread-6666]: client.TezClient (:()) - Shutting down Tez Session, sessionName=HIVE-ceb52736-83c3-4c57-8c28-31bce3ee3791, applicationId=application_55555_09090
When I fetch the YARN logs for this applicationId, the last line shows "Shutting down Tez Session" -- this is the problem.
For an application id whose rows did get inserted into the Hive table, the last line of the logs was:
Completed executing command(queryId=hive_4-eee-444);
Please help me understand why this happens and how to avoid the Tez session being shut down in YARN.
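For what it's worth, two settings that are often involved when a Tez session disappears underneath a query are the Tez DAG-submit timeout and the HiveServer2 idle-session timeout (closing a HiveServer2 session also shuts down its Tez session). The values below are an illustrative assumption, not a confirmed fix for this case:

```
<!-- tez-site.xml: how long an opened Tez session AM waits for a DAG
     to be submitted before shutting itself down (seconds) -->
<property>
  <name>tez.session.am.dag.submit.timeout.secs</name>
  <value>600</value>
</property>

<!-- hive-site.xml: how long an idle HiveServer2 session is kept
     alive (milliseconds) -->
<property>
  <name>hive.server2.idle.session.timeout</name>
  <value>3600000</value>
</property>
```

Checking which component initiated the shutdown (client disconnect vs. timeout) in the HiveServer2 log around 05:32:31 would narrow this down.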

How to use Hadoop from the Spark Thrift Server?

Please consider the following setup.
hadoop version 2.6.4
spark version 2.1.0
OS CentOS Linux release 7.2.1511 (Core)
All software is installed on a single machine as a single-node cluster; Spark is installed in standalone mode.
I am trying to use the Spark Thrift Server.
To start the Spark Thrift Server I run the shell script
start-thriftserver.sh
After starting the Thrift Server, I can run the beeline command-line tool and issue the following commands, which run successfully:
!connect jdbc:hive2://localhost:10000 user_name '' org.apache.hive.jdbc.HiveDriver
create database testdb;
use testdb;
create table names_tab(a int, name string) row format delimited fields terminated by ' ';
My first question is: where on Hadoop (HDFS) are the underlying files/folders for this table/database created?
The problem is that even if Hadoop is stopped using stop-all.sh, the create table/database commands still succeed, which makes me think the table is not created on Hadoop at all.
My second question is: how do I tell Spark where Hadoop is installed, and ask Spark to use Hadoop as the underlying data store for all queries run from beeline?
Am I supposed to install Spark in some other mode?
Thanks in advance.
Thanks in advance.
My objective was to get the beeline command-line utility to work through the Spark Thrift Server with Hadoop as the underlying data store, and I got it to work. My setup was like this:
Hadoop <--> Spark <--> SparkThriftServer <--> beeline
I wanted to configure Spark so that it uses Hadoop for all queries run from the beeline command line.
The trick was to specify the following property in spark-defaults.conf:
spark.sql.warehouse.dir hdfs://localhost:9000/user/hive/warehouse
By default Spark uses Derby for the metastore and the local filesystem for the data itself (called the warehouse in Spark).
To have Spark use Hadoop (HDFS) as the warehouse, I had to add this property.
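Note the file format: spark-defaults.conf takes plain whitespace-separated key/value pairs, not XML. The entry as used in this setup looks like this (the namenode address is specific to this single-node install and will differ elsewhere):

```
# $SPARK_HOME/conf/spark-defaults.conf
spark.sql.warehouse.dir    hdfs://localhost:9000/user/hive/warehouse
```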
Here is a sample output
./beeline
Beeline version 1.0.1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10000 abbasbutt '' org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://localhost:10000
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/abbasbutt/Projects/hadoop_fdw/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/abbasbutt/Projects/hadoop_fdw/apache-hive-1.0.1-bin/lib/hive-jdbc-1.0.1-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Connected to: Spark SQL (version 2.1.0)
Driver: Hive JDBC (version 1.0.1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000>
0: jdbc:hive2://localhost:10000> create database my_test_db;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.379 seconds)
0: jdbc:hive2://localhost:10000> use my_test_db;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.03 seconds)
0: jdbc:hive2://localhost:10000> create table my_names_tab(a int, b string) row format delimited fields terminated by ' ';
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.11 seconds)
0: jdbc:hive2://localhost:10000>
Here are the corresponding files in Hadoop (HDFS):
[abbasbutt@localhost test]$ hadoop fs -ls /user/hive/warehouse/
17/01/19 10:48:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 4 items
drwxrwxr-x - abbasbutt supergroup 0 2017-01-18 23:45 /user/hive/warehouse/fdw_db.db
drwxrwxr-x - abbasbutt supergroup 0 2017-01-18 23:23 /user/hive/warehouse/my_spark_db.db
drwxrwxr-x - abbasbutt supergroup 0 2017-01-19 10:47 /user/hive/warehouse/my_test_db.db
drwxrwxr-x - abbasbutt supergroup 0 2017-01-18 23:45 /user/hive/warehouse/testdb.db
[abbasbutt@localhost test]$ hadoop fs -ls /user/hive/warehouse/my_test_db.db/
17/01/19 10:50:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxrwxr-x - abbasbutt supergroup 0 2017-01-19 10:50 /user/hive/warehouse/my_test_db.db/my_names_tab
[abbasbutt@localhost test]$

Problems with a Hive query in Cloudera

I can do all other queries in hive, but when I do a join it just gets stuck.
hive> select count (*) from tab10 join tab1;
Warning: Map Join MAPJOIN[13][bigTable=tab10] in task 'Stage-2:MAPRED' is a cross product
Query ID = root_20160406145959_b57642e0-7499-41a0-914c-0004774fe4ac
Total jobs = 1
Execution log at: /tmp/root/root_20160406145959_b57642e0-7499-41a0-914c-0004774fe4ac.log
2016-04-06 03:00:03 Starting to launch local task to process map join; maximum memory = 2058354688
2016-04-06 03:00:03 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile01--.hashtable
2016-04-06 03:00:03 Uploaded 1 File to: file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10004/HashTable-Stage-2/MapJoin-mapfile01--.hashtable (280 bytes)
2016-04-06 03:00:03 End of local task; Time Taken: 0.562 sec.
It's hung at this point and doesn't spawn any of the map-reduce tasks at all. What could be wrong?
I did see this in hive.log:
2016-04-06 15:00:00,124 INFO [main]: ql.Driver (Driver.java:launchTask(1643)) - Starting task [Stage-5:MAPREDLOCAL] in serial mode
2016-04-06 15:00:00,125 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(159)) - Generating plan file file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10006/plan.xml
2016-04-06 15:00:00,233 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(288)) - Executing: /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/hadoop/bin/hadoop jar /opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/jars/hive-exec-1.1.0-cdh5.5.2.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10006/plan.xml -jobconffile file:/tmp/root/b71aa45b-f356-4a54-a880-77e57cd53ed3/hive_2016-04-06_14-59-59_858_3722397802100174236-1/-local-10007/jobconf.xml
There is nothing beyond this. Does anyone know how to fix this?
Open the mapred-site.xml file and add this property; it increases the heap memory available to the Hadoop child JVMs:
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
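If more heap alone does not unstick it: the hang occurs inside the local map-join task, so another common workaround (a guess at the cause, not a confirmed diagnosis for this cluster) is to disable automatic map-join conversion and let Hive run an ordinary shuffle join:

```
-- Hive session settings: skip the local-task map join entirely
SET hive.auto.convert.join=false;
select count(*) from tab10 join tab1;
```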

Loading data from HDFS to HBASE

I'm using Apache hadoop 1.1.1 and Apache hbase 0.94.3.I wanted to load data from HDFS to HBASE.
I wrote pig script to serve the purpose. First i created hbase table in habse and next wrote pig script to load the data from HDFS to HBASE. But it is not loading the data into hbase table. Dont know where it's going worng.
Below is the command used to craete hbase table :
create table 'mydata','mycf'
Below is the Pig script to load the data from HDFS into HBase:
A = LOAD '/user/hduser/Dataparse/goodrec1.txt' USING PigStorage(',') as (c1:int, c2:chararray,c3:chararray,c4:int,c5:chararray);
STORE A INTO 'hbase://mydata'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
'mycf:c1,mycf:c2,mycf:c3,mycf:c4,mycf:c5');
After executing the script, it says:
2014-04-29 16:01:06,367 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-04-29 16:01:06,376 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2014-04-29 16:01:06,382 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.1.1 0.12.0 hduser 2014-04-29 15:58:07 2014-04-29 16:01:06 UNKNOWN
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201403142119_0084 A MAP_ONLY Message: Job failed! Error - JobCleanup Task Failure, Task: task_201403142119_0084_m_000001 hbase://mydata,
Input(s):
Failed to read data from "/user/hduser/Dataparse/goodrec1.txt"
Output(s):
Failed to produce result in "hbase://mydata"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_201403142119_0084
2014-04-29 16:01:06,382 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Failed!
Please help; where am I going wrong?
You have specified too many columns in the output to HBase. You have 5 input columns and 5 output columns, but HBaseStorage requires the first column to be the row key, so there should only be 4 in the output.
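A corrected version of the script would look roughly like this (a sketch: it assumes c1 is meant to serve as the row key, which depends on your data model):

```
A = LOAD '/user/hduser/Dataparse/goodrec1.txt' USING PigStorage(',')
    AS (c1:int, c2:chararray, c3:chararray, c4:int, c5:chararray);
-- c1 becomes the HBase row key, so only c2..c5 are listed as columns
STORE A INTO 'hbase://mydata'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'mycf:c2,mycf:c3,mycf:c4,mycf:c5');
```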

Reading hive table using Pig script

I am trying to read a Hive table using a Pig script, but when I run the Pig code to read a table in Hive it gives me the following error:
2014-02-12 15:48:36,143 [main] WARN org.apache.hadoop.hive.conf.HiveConf - hive-site.xml not found on CLASSPATH
2014-02-12 15:49:10,781 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Unable to recreate exception from backed error: Error: Found class org.apache.hadoop.mapreduce.TaskAttemptContext, but interface was expected
(Newlines and whitespace added for readability.)
Hadoop version
1.1.1
Hive version
0.9.0
Pig version
0.10.0
Pig code
a = LOAD '/user/hive/warehouse/test' USING
org.apache.pig.piggybank.storage.HiveColumnarLoader('name string');
Is this due to a version mismatch?
Why not use HCatalog to access the Hive metadata from Pig?
Check this for an example.
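A minimal HCatalog-based load looks roughly like this (a sketch: the package name is the pre-Hive-0.11 one, matching the versions in the question; newer releases moved the loader to org.apache.hive.hcatalog.pig.HCatLoader):

```
-- run with: pig -useHCatalog script.pig
a = LOAD 'default.test' USING org.apache.hcatalog.pig.HCatLoader();
DUMP a;
```

HCatLoader reads the table schema from the Hive metastore, so no column list is needed in the script.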
