pig + hbase + hadoop2 integration - hadoop

Has anyone successfully loaded data into hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0, in a hadoop-2.2.0 + hbase-0.98.0 + pig-0.12.0 environment, without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a log trace line of:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
I searched the web and found a handful of related problems and solutions, but all of them refer to pre-Hadoop 2 and hbase-0.94.x setups, which do not apply to my situation.
I have a 5-node hadoop-2.2.0 cluster, a 3-node hbase-0.98.0 cluster, and a client machine with hadoop-2.2.0, hbase-0.98.0 and pig-0.12.0 installed. Each of them works fine on its own: HDFS, MapReduce, the region servers and Pig all run without problems. To complete a "loading data into hbase from pig" example, I have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
When I tried to run: pig -x local -f loaddata.pig
I got the following error: ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this must be the 100th+ time I have hit it across countless attempts to find a working setup).
The trace log shows: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
The following is my pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have already successfully created the hbase table 'weather'.
Has anyone gotten this combination to work and would be generous enough to share?

Rebuild Pig for Hadoop 2 and HBase 0.95:
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default Pig builds against HBase 0.94; 94 and 95 are the only options.

If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArrayComparable, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig
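If you are not sure which jar actually provides the class, one quick way to find out is to scan the HBase lib directory for it. Here is a minimal Python sketch (the directory path is taken from the question's REGISTER lines; the script itself is an illustration, not part of either answer):

# Scan HBase's lib directory for the jar containing the missing class,
# then pass that jar to pig via -D pig.additional.jars=<jar>.
import glob
import zipfile

MISSING_CLASS = "org/apache/hadoop/hbase/filter/WritableByteArrayComparable.class"

for jar in sorted(glob.glob("/usr/local/hbase/lib/*.jar")):  # path from the question's script
    try:
        with zipfile.ZipFile(jar) as zf:
            if MISSING_CLASS in zf.namelist():
                print(jar)  # candidate for pig.additional.jars
    except zipfile.BadZipFile:
        pass  # skip files that are not valid jar archives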

Related

Can't use put() to add data to hbase with happybase

My Python version is 3.7. After I ran pip3 install happybase, I started the command hbase thrift start and tried to write a brief .py file as follows:
import happybase
connection = happybase.Connection('master')
table = connection.table('jmlr') #'jmlr' is a table in hbase
for i in table.scan():
    print(i)
table.put('001', {'title':'dasds'}) #error here
connection.close()
When it gets to table.put(), it reports this error:
thriftpy2.transport.base.TTransportException: TTransportException(type=4, message='TSocket read 0 bytes')
At the same time, the Thrift server reported an error:
ERROR [thrift-worker-1] thrift.TBoundedThreadPoolServer: Error occurred during processing of message. java.lang.IllegalArgumentException: Invalid famAndQf provided.
But when I ran this Python file again just now, Thrift gave me a different error:
thrift.TBoundedThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Bad version in readMessageBegin
I have tried adding parameters like protocol='compact' and transport='framed', but that didn't work; even table.scan() failed.
Everything works fine in the hbase shell, so I can't figure out what went wrong. I'm about to collapse.
I ran into the same issue and found this solution. You need to include a column qualifier in the column name passed to put(), even an empty one (':' is the delimiter between column family and column qualifier):
table.put('001', {'title:': 'dasds'})
Also, you get a different error message on the second run of the script because the Thrift server has already failed by then.
I hope this helps.
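Putting the fix together with the script from the question, a minimal corrected version might look like this (the host 'master' and table 'jmlr' are taken from the original post; adjust them to your setup):

import happybase

# Connect to the HBase Thrift server started with 'hbase thrift start'
connection = happybase.Connection('master')
table = connection.table('jmlr')  # 'jmlr' is an existing table in hbase

for row in table.scan():
    print(row)

# Note the ':' in the column name: 'title:' means column family 'title'
# with an empty column qualifier, which the Thrift server expects.
table.put('001', {'title:': 'dasds'})

connection.close()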

How to suppress the info message "io.bytes.per.checksum is deprecated" in the grunt shell

I am running Apache Pig 0.17.0 on top of Hadoop 2.7.2 to analyze big data. Every time I run a load command in local mode of the grunt> shell I get the following message:
grunt> A = load '/usr/lib/pig/data.txt' using TextLoader as (date:chararray);
[main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
Is there a way to switch off this message? It becomes very annoying with frequent use of the grunt> shell.
Check if the solution below works for you.
Create a file named nolog.conf with the following content:
log4j.rootLogger=fatal
and then run pig as follows
pig -x local -4 nolog.conf

How to create external table to DynamoDB on Hive 3.1

I'd like to know if it's possible to have an external table pointing to a DynamoDB table on AWS using Hive.
I'm not using AWS EMR; what I'm using is a Hadoop stack configured through Apache Ambari.
Hive version: Hive 3.1.0.3.1.4.0-315
What I did was:
Downloaded the EMR Dynamo-Hive connector JARS directly from the maven repository: https://mvnrepository.com/artifact/com.amazon.emr
I loaded all the JARS in hive.aux.jars.path:
emr-dynamodb-hadoop-4.12.0.jar
emr-dynamodb-hive-4.12.0.jar
emr-dynamodb-tools-4.12.0.jar
hive1.2-shims-4.12.0.jar
hive1-shims-4.12.0.jar
hive2-shims-4.12.0.jar
hive2-shims-4.15.0.jar
shims-common-4.12.0.jar
shims-loader-4.12.0.jar
But when I try to create the table with:
CREATE EXTERNAL TABLE dynamo_LabDynamoHive
(id double, nome string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
"dynamodb.table.name" = "LabDynamoHive",
"dynamodb.column.mapping" = "id:id,nome:nome"
);
I get the following error:
INFO : Starting task [Stage-0:DDL] in serial mode
ERROR : FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Shim class for Hive version 3.1.1000 does not exist
INFO : Completed executing command(queryId=hive_20200422142624_6ebabdc8-8942-4025-84a8-411505d20895); Time taken: 0.203 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Shim class for Hive version 3.1.1000 does not exist (state=08S01,code=1)
I know I'm not loading a shims JAR for Hive 3, but I'd like to know if any of you have tried and succeeded in using an external table with DynamoDB on Hive 3 outside of EMR.
Any help or directions would be greatly appreciated!
The problem, apparently, is that the source code of this EMR connector is somewhat outdated and lacks the Hive 3.x support recently introduced by AWS for EMR 6.0.
However, you can find a working 3.1 implementation here, forked from the official EMR connector: https://github.com/ramsesrm/emr-dynamodb-connector
Installation steps are as follows:
1- Compile the mentioned code (mvn clean package).
2- Install the 3 JARs in your hive.aux.jars.path, along with the aws-java-sdk-core and aws-java-sdk-dynamodb JARs from AWS (the shim JARs are not required), 5 in total.
That's it. Don't forget to specify the region in TBLPROPERTIES if you're not using the default US one.

Required field 'uncompressed_page_size' was not found in serialized data! Parquet

I am getting the error below while trying to save a parquet file from a local directory using pyspark.
I tried Spark 1.6 and 2.2; both give the same error.
It displays the schema properly but fails at the time of writing the file.
base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
df = sqlContext.read.option( "basePath",base_path).parquet(reg_path)
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"
df2 = df.coalesce(5)
df2.printSchema()
df2.write.mode('append').parquet(out_path)
org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
In my own case, I was writing a custom Parquet parser for Apache Tika and I experienced this error. It turned out that if the file is being used by another process, the ParquetReader will not be able to access uncompressed_page_size, hence the error.
Verify that no other process is holding on to the file.
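One way to do that check from Python is sketched below; using psutil here is my own assumption (lsof on the command line works just as well), and the directory path is taken from the question:

# List processes that currently hold files open under the Parquet directory.
# Requires: pip install psutil
import psutil

TARGET_DIR = "/Users/xyz/Documents/Temp/parquet"  # path from the question

for proc in psutil.process_iter(attrs=["pid", "name"]):
    try:
        for f in proc.open_files():
            if f.path.startswith(TARGET_DIR):
                print(proc.info["pid"], proc.info["name"], f.path)
    except (psutil.AccessDenied, psutil.NoSuchProcess):
        pass  # some processes cannot be inspected; skip them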
Temporarily resolved by the Spark config:
"spark.sql.hive.convertMetastoreParquet": "false"
It carries some extra cost, but it is a workaround for now.
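For reference, here is a minimal PySpark sketch of where that setting would go; the session-builder form and the app name are my own assumptions, while the config key comes from the answer and the paths come from the question:

from pyspark.sql import SparkSession

# Build a Spark session with the workaround applied (Spark 2.x style)
spark = (SparkSession.builder
         .appName("parquet-append-workaround")
         .config("spark.sql.hive.convertMetastoreParquet", "false")
         .getOrCreate())

base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"

df = spark.read.option("basePath", base_path).parquet(reg_path)
df.coalesce(5).write.mode("append").parquet(out_path)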

Pig Script useHCatalog flag?

I have written a simple pig script to read data from a hive table.
A = LOAD 'default.movie' USING org.apache.hive.hcatalog.pig.HCatLoader();
DUMP A;
It works when I run it through the Hue Pig user interface, but Hue uses the useHCatalog flag. When I run it from the command line with the same flag it also works:
pig -useHCatalog sample.pig
But how can I run it without this flag, by providing the required jar files and configuration in the pig script itself? I tried the following, but it doesn't work:
REGISTER /usr/lib/hive/lib/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/*.jar
REGISTER /usr/lib/hive-hcatalog/share/hcatalog/storage-handlers/hbase/lib/*.jar
It throws an error when I run it without the flag:
2015-12-15 05:05:55,379 [main] ERROR org.apache.pig.PigServer -
exception during parsing: Error during parsing. Table not found :
default.movie table not found Failed to parse: Can not retrieve schema
from loader org.apache.hive.hcatalog.pig.HCatLoader#25bdba7a
at org.apache.pig.parser.QueryParserDriver.parse(QueryParserDriver.java:198)
at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1678)
at org.apache.pig.PigServer$Graph.access$000(PigServer.java:1411)
at org.apache.pig.PigServer.parseAndBuild(PigServer.java:344)
at org.apache.pig.PigServer.executeBatch(PigServer.java:369)
at org.apache.pig.PigServer.executeBatch(PigServer.java:355)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:140)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:769)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:372)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
I just want to know what is behind the useHCatalog flag, and what I have to register for it to work.
You have to pass the hive configuration as well, namely the file hive-site.xml which will point pig to the metastore. Otherwise pig does not know where to look for the table information.
This page might be helpful: https://cwiki.apache.org/confluence/display/Hive/HCatalog+LoadStore
