Installing Snappy on an HDP Cluster - hadoop

I have an HBase cluster built using Hortonworks Data Platform 2.6.1.
Now I need to apply Snappy compression to HBase tables.
Without installing Snappy, I ran the compression test and it reported success. I used the commands below.
hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test.txt snappy
hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://hbase.primary.namenode:8020/tmp/test1.txt snappy
I got the response below for both commands.
2017-10-30 11:25:18,454 INFO [main] hfile.CacheConfig: CacheConfig:disabled
2017-10-30 11:25:18,671 INFO [main] compress.CodecPool: Got brand-new compressor [.snappy]
2017-10-30 11:25:18,679 INFO [main] compress.CodecPool: Got brand-new compressor [.snappy]
2017-10-30 11:25:21,560 INFO [main] hfile.CacheConfig: CacheConfig:disabled
2017-10-30 11:25:22,366 INFO [main] compress.CodecPool: Got brand-new decompressor [.snappy]
SUCCESS
I also see the libraries below in /usr/hdp/2.6.1.0-129/hadoop/lib/native/.
libhadoop.a
libhadooppipes.a
libhadoop.so
libhadoop.so.1.0.0
libhadooputils.a
libhdfs.a
libsnappy.so
libsnappy.so.1
libsnappy.so.1.1.4
Does HDP support snappy compression by default?
If so, can I compress the HBase tables without installing Snappy?

"Without installing Snappy, I executed the Compression Test and I got a success output."
Ambari installed Snappy during cluster installation, so yes, those commands work.
"Does HDP support snappy compression by default?"
Yes, the HDP-UTILS repository provides the Snappy libraries.
"Can I compress the HBase tables without installing Snappy?"
Yes. Snappy is already present on the cluster, and HBase also provides other compression algorithms.
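As a minimal sketch of what the answer above implies: once the native Snappy library is available, compression is enabled per column family from the HBase shell. The helper name snappy_create_stmt and the table and family names are hypothetical; only the COMPRESSION => 'SNAPPY' attribute is standard HBase shell syntax.

```shell
# Hypothetical helper: print an HBase shell statement that creates a table
# whose column family uses Snappy compression.
# $1 = table name, $2 = column family name (both illustrative).
snappy_create_stmt() {
  printf "create '%s', {NAME => '%s', COMPRESSION => 'SNAPPY'}\n" "$1" "$2"
}

# Print the statement for a made-up table. To run it on a cluster, pipe it
# into the HBase shell:
#   snappy_create_stmt my_table cf | hbase shell
snappy_create_stmt my_table cf
```

For a table that already exists, the same attribute can be set with an alter on the column family; existing HFiles are only rewritten with the new codec after a major compaction.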

Related

Hbase completebulkload stuck on AWS EMR

So the scenario is I am trying to use HBase bulk load to load some data into HBase.
Here's my stack setting:
HBase version 1.3.1
Hadoop version: 2.7.3
EMR version 5.10.
Cluster size: 20 R4.2xlarge instances.
I have an HBase table that was pre-split into 400 regions with HexStringSplit for the row key.
The table contains only one column family, which uses the LZ4 compression algorithm.
I then tried to use bulk load to load some data into the table.
I was able to use the ImportTsv tool to generate HFiles on HDFS; the total file size is about 20 GB.
Then I ran the "completebulkload" tool as follows:
hadoop jar /usr/lib/hbase/lib/hbase-server-1.3.1.jar completebulkload hdfs:///user/hbase/output MyTable
Here "hdfs:///user/hbase/output" is the output directory of the ImportTsv job.
The process started but got stuck; I only see the following output:
17/12/05 19:49:22 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://ip-172-31-19-197.ec2.internal:8020/user/hbase/output/_SUCCESS
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
No further information was printed. It's been almost 1 hour but still nothing. I checked the HBase UI and nothing has been loaded yet. All regions are empty.
Any thoughts on this?
Thanks

PIG command execution

I am learning Hadoop by myself, so I am not sure if what I am asking is even a problem. When I run the command pig -x local to run it locally, I get the following message:
15/10/05 15:23:28 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
15/10/05 15:23:28 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2015-10-05 15:23:28,830 [main] INFO org.apache.pig.Main - Apache Pig version 0.15.0 (r1682971) compiled Jun 01 2015, 11:44:35
2015-10-05 15:23:28,831 [main] INFO org.apache.pig.Main - Logging error messages to: /home/nkhl/pig_1444038808829.log
2015-10-05 15:23:29,050 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /home/nkhl/.pigbootup not found
2015-10-05 15:23:29,333 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2015-10-05 15:23:29,334 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2015-10-05 15:23:29,335 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2015-10-05 15:23:29,562 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
It looks different on my online tutor's screen so I am a little confused.
What concerns me most is the deprecation part. Can someone help me with that please? What is it trying to say? Don't get me wrong, everything works fine. The GRUNT shell loads up, and things execute fine. I just wanted to know what that meant.
It's an Ubuntu machine.
Thanks!
Running Pig in local mode is great, AFAIK, for quick testing, such as displaying the sysout from a UDF.
You can safely ignore the messages above. They say that some of the properties set in the Hadoop *-site.xml configuration files are deprecated.
You can switch off those messages by editing the
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation
logger in the log4j.properties file.
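A rough sketch of that edit, assuming you know which log4j.properties file your Pig or Hadoop installation actually loads (that path is not given here, so it is a parameter). The helper name quiet_deprecation_log is hypothetical. Because the messages are logged at INFO, raising this logger to WARN hides them.

```shell
# Hypothetical helper: raise the deprecation logger to WARN in a given
# log4j.properties file so the INFO-level deprecation messages are hidden.
# $1 = path to the log4j.properties file that Pig/Hadoop actually loads.
quiet_deprecation_log() {
  prop='log4j.logger.org.apache.hadoop.conf.Configuration.deprecation'
  # Append the setting only if the logger is not configured yet.
  if ! grep -q "^${prop}=" "$1" 2>/dev/null; then
    printf '%s=WARN\n' "$prop" >> "$1"
  fi
}

# Demonstrate on a scratch file instead of a real installation.
demo_file=$(mktemp)
quiet_deprecation_log "$demo_file"
cat "$demo_file"
rm -f "$demo_file"
```

Calling the helper twice on the same file leaves a single entry, so it is safe to re-run.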
You have some Hadoop-related variables set, such as HADOOP_HOME or HADOOP_PREFIX or HADOOP_CONF_DIR, which aren't needed if you are running Pig in local mode.
unset HADOOP_HOME
unset HADOOP_PREFIX
unset HADOOP_CONF_DIR
Deprecations aren't scary. They are a reminder that the code is calling on something that will eventually go away in a future version. These specific deprecations are caused by differences between Hadoop 1 vs Hadoop 2. Pig is compatible with both versions. If you happened to be using Hadoop 1.2.1 instead of 2.x, you wouldn't see the warnings. This is because Pig is checking the Hadoop 1 values first.
If you're interested in learning more, you can check out the Pig source code.
https://github.com/apache/pig/blob/release-0.15.0/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java#L219-L222
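Each deprecation message in the question names the old Hadoop 1 property and the Hadoop 2 name that replaces it. A small sketch of that mapping, in case you want to check your own configuration for the old names; replacement_for is a hypothetical helper name, and the property names are copied from the log output in the question.

```shell
# Map the deprecated property names from the log output to the names the
# messages recommend. Returns non-zero for a name that is not in the log.
replacement_for() {
  case "$1" in
    fs.default.name)       echo fs.defaultFS ;;
    mapred.job.tracker)    echo mapreduce.jobtracker.address ;;
    io.bytes.per.checksum) echo dfs.bytes-per-checksum ;;
    *)                     return 1 ;;
  esac
}

# Print the old -> new pairs seen in the question.
for key in fs.default.name mapred.job.tracker io.bytes.per.checksum; do
  echo "$key -> $(replacement_for "$key")"
done
```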

Pig Error: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

I have just upgraded Pig from 0.12.0 to 0.13.0 on Hortonworks HDP 2.1.
I get the error below when I try to use XMLLoader in my script, even though I have already registered piggybank.
Script:
A = load 'EPAXMLDownload.xml' using org.apache.pig.piggybank.storage.XMLLoader('Document') as (x:chararray);
Error:
dump A
2014-08-10 23:08:56,494 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-08-10 23:08:56,496 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-10 23:08:56,651 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-08-10 23:08:56,727 [main] INFO org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, GroupByConstParallelSetter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-08-10 23:08:57,191 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2014-08-10 23:08:57,199 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-08-10 23:08:57,214 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-08-10 23:08:57,223 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2014-08-10 23:08:57,247 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
Note that Pig decides the Hadoop version depending on which environment variable you have set:
HADOOP_HOME -> v1
HADOOP_PREFIX -> v2
If you use Hadoop 2, you need to recompile piggybank (which is compiled for Hadoop 1 by default).
Go to pig/contrib/piggybank/java and run:
$ ant -Dhadoopversion=23
Then copy that jar over pig/lib/piggybank.jar.
A few more details because the other answers didn't work for me:
Git clone the pig git mirror https://github.com/apache/pig
cd into the cloned directory
if you've already built pig in the past in this directory, you should run a clean
ant clean
build pig for hadoop 2
ant -Dhadoopversion=23
cd into piggybank
cd contrib/piggybank/java
again, if you've built piggybank before, make sure to clean out the old build files
ant clean
build piggybank for hadoop 2 (same command, different directory)
ant -Dhadoopversion=23
If you don't build pig first, piggybank will throw a bunch of "symbol not found" exceptions while compiling. In addition, since I had previously built pig for Hadoop 1 (accidentally), without running a clean, I ran into runtime errors.
Sometimes you may get an error like the one below after installing Pig:
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.hcatalog.common.HCatUtil.checkJobContextIfRunningFromBackend(HCatUtil.java:88)
at org.apache.hcatalog.pig.HCatLoader.setLocation(HCatLoader.java:162)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:540)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:322)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:199)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:277)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1367)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1352)
at org.apache.pig.PigServer.execute(PigServer.java:1341)
Many blogs suggest recompiling Pig by running:
ant clean jar-all -Dhadoopversion=23
or recompiling piggybank.jar with the steps below:
cd contrib/piggybank/java
ant clean
ant -Dhadoopversion=23
But this may not solve your problem. The actual cause here is related to HCatalog, so try updating it. In my case, I was using Hive 0.13 and Pig 0.13, with the HCatalog provided with Hive 0.13.
Then I updated Pig to 0.15 and used separate hive-hcatalog-0.13.0.2.1.1.0-385 library jars, and the problem was resolved.
I later identified that it was not Pig that was causing the problem but the Hive-HCatalog libraries. Hope this helps.
I faced the same error with Hadoop 2.2.0.
The workaround is to register the following jar files from the Grunt shell.
The paths below are for hadoop-2.2.0; find the jars that match your version.
/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar
/hadoop-2.2.0/share/hadoop/common/hadoop-common-2.2.0.jar
Use the REGISTER command to register these jars along with piggybank.
Then run the Pig script or command again and report back if you still face any issue.

Sqoop and Java 7

I'm trying to use sqoop to import a MySQL table into HDFS. I'm using JDK 1.7.0_45 and CDH4.4. I'm actually using cloudera's pre-built VM, except I changed the JDK to 1.7 because I wanted to use the pydev plugin for eclipse. My sqoop version is 1.4.3-cdh4.4.0.
When I run sqoop I get this exception:
Error: commodity : Unsupported major.minor version 51.0
I have seen this error in the past when I did this:
1. compiled to java 7
2. ran an application with java 6.
but that is not what I am doing this time. I believe my Sqoop version was compiled for Java 6, and I'm running it with Java 7, which should be perfectly fine. I think Hadoop may be launching mapper processes with JDK 6, but I have no idea how to change that. I skimmed through the mapred configuration documentation and did not see any way to set the Java version to use for map tasks.
Here is the relevant console output:
[cloudera@localhost ~]$ echo $JAVA_HOME
/usr/java/latest
[cloudera@localhost ~]$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
[cloudera@localhost ~]$ sqoop version
Sqoop 1.4.3-cdh4.4.0
git commit id 2cefe4939fd464ba11ef63e81f46bbaabf1f5bc6
Compiled by jenkins on Tue Sep 3 20:41:55 PDT 2013
[cloudera@localhost ~]$ hadoop version
Hadoop 2.0.0-cdh4.4.0
Subversion file:///data/1/jenkins/workspace/generic-package-rhel64-6-0/topdir/BUILD/hadoop-2.0.0-cdh4.4.0/src/hadoop-common-project/hadoop-common -r c0eba6cd38c984557e96a16ccd7356b7de835e79
Compiled by jenkins on Tue Sep 3 19:33:17 PDT 2013
From source with checksum ac7e170aa709b3ace13dc5f775487180
This command was run using /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.4.0.jar
[cloudera@localhost ~]$ cat mysqooper.sh
#!/bin/bash
sqoop import -m 1 --connect jdbc:mysql://localhost/$1 \
--username root --table $2 --target-dir $3
[cloudera@localhost ~]$ ./mysqooper.sh cloud commodity /user/cloudera/commodity/csv/sqooped
14/01/16 16:45:10 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
14/01/16 16:45:10 INFO tool.CodeGenTool: Beginning code generation
14/01/16 16:45:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `commodity` AS t LIMIT 1
14/01/16 16:45:11 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `commodity` AS t LIMIT 1
14/01/16 16:45:11 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-0.20-mapreduce
14/01/16 16:45:11 INFO orm.CompilationManager: Found hadoop core jar at: /usr/lib/hadoop-0.20-mapreduce/hadoop-core.jar
Note: /tmp/sqoop-cloudera/compile/f75bf6f8829e8eff302db41b01f6796a/commodity.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
14/01/16 16:45:15 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/f75bf6f8829e8eff302db41b01f6796a/commodity.jar
14/01/16 16:45:15 WARN manager.MySQLManager: It looks like you are importing from mysql.
14/01/16 16:45:15 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
14/01/16 16:45:15 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
14/01/16 16:45:15 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
14/01/16 16:45:15 INFO mapreduce.ImportJobBase: Beginning import of commodity
14/01/16 16:45:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/01/16 16:45:20 INFO mapred.JobClient: Running job: job_201401161614_0001
14/01/16 16:45:21 INFO mapred.JobClient: map 0% reduce 0%
14/01/16 16:45:38 INFO mapred.JobClient: Task Id : attempt_201401161614_0001_m_000000_0, Status : FAILED
Error: commodity : Unsupported major.minor version 51.0
14/01/16 16:45:46 INFO mapred.JobClient: Task Id : attempt_201401161614_0001_m_000000_1, Status : FAILED
Error: commodity : Unsupported major.minor version 51.0
14/01/16 16:45:54 INFO mapred.JobClient: Task Id : attempt_201401161614_0001_m_000000_2, Status : FAILED
Error: commodity : Unsupported major.minor version 51.0
14/01/16 16:46:07 INFO mapred.JobClient: Job complete: job_201401161614_0001
14/01/16 16:46:07 INFO mapred.JobClient: Counters: 6
14/01/16 16:46:07 INFO mapred.JobClient: Job Counters
14/01/16 16:46:07 INFO mapred.JobClient: Failed map tasks=1
14/01/16 16:46:07 INFO mapred.JobClient: Launched map tasks=4
14/01/16 16:46:07 INFO mapred.JobClient: Total time spent by all maps in occupied slots (ms)=23048
14/01/16 16:46:07 INFO mapred.JobClient: Total time spent by all reduces in occupied slots (ms)=0
14/01/16 16:46:07 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
14/01/16 16:46:07 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
14/01/16 16:46:07 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
14/01/16 16:46:07 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 51.0252 seconds (0 bytes/sec)
14/01/16 16:46:07 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
14/01/16 16:46:07 INFO mapreduce.ImportJobBase: Retrieved 0 records.
14/01/16 16:46:07 ERROR tool.ImportTool: Error during import: Import job failed!
I tried running with JDK 1.6 and it works, but I really don't want to switch back to that every time I need to use sqoop.
Does anybody know what I need to change?
I believe that the root cause of your problem is that your Hadoop distribution is still running on JDK 6 and not JDK 7 as you think.
Sqoop generates Java code and compiles it with the JDK it is currently running on. Therefore, if you execute Sqoop on JDK 7, it will generate and compile code with JDK 7. The generated code is then submitted to your Hadoop cluster as part of a MapReduce job. So if you get this "Unsupported major.minor version" exception while running Sqoop on JDK 7, it is very likely that your Hadoop cluster is running on JDK 6.
I would strongly suggest calling jinfo on your Hadoop daemons to verify which JDK they are running on.
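The 51.0 in the error is the class-file format version: major version 51 is what a Java 7 compiler emits by default, while a Java 6 JVM accepts at most 50. A small sketch for reading that number from a compiled class, assuming standard od and awk are available; the helper names class_major_version and java_release_for are hypothetical.

```shell
# Print the class-file major version of a compiled .class file.
# Bytes 0-3 are the magic number 0xCAFEBABE, bytes 4-5 the minor version,
# and bytes 6-7 the big-endian major version.
class_major_version() {
  od -An -j6 -N2 -tu1 "$1" | awk '{ print $1 * 256 + $2 }'
}

# Translate a major version into the Java release that produces it
# (50 -> Java 6, 51 -> Java 7, 52 -> Java 8).
java_release_for() {
  echo $(( $1 - 44 ))
}

# Demonstrate on a fake 8-byte header with major version 51 (octal 063).
demo_class=$(mktemp)
printf '\312\376\272\276\000\000\000\063' > "$demo_class"
major=$(class_major_version "$demo_class")
echo "major version $major = Java $(java_release_for "$major")"
rm -f "$demo_class"
```

Running class_major_version against the class extracted from the generated commodity.jar would show which JDK compiled it; javap -verbose reports the same number. For the daemons themselves, jps lists the JVM process IDs that jinfo expects.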
This is an old post, but adding some further info as I've had the same issue when running mixed jdks: java7 locally and java6 on the CDH4.4 VM.
The following post by cloudera provides an answer:
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.5.3/Cloudera-Manager-Enterprise-Edition-Installation-Guide/cmeeig_topic_16_2.html
If I were making a change across a real cluster, I'd follow those directions.
But I'm only using a VM and in that document is an important clue:
/usr/lib64/cmf/service/common/cloudera-config.sh has a function
locate_java_home() which shows that /usr/java/jdk1.6* is preferred before /usr/java/jdk1.7*.
This may be fixed in later Quickstart VMs, but I was looking for a fix that was quicker. (It takes some effort for us to set up a new VM for dev.)
I fixed my VM by simply changing the search order in that file and rebooting.
HTH,
Glenn

Hadoop compression : "Loaded native gpl library" but "Failed to load/initialize native-lzo library"

After several attempts at installing LZO compression for Hadoop, I need help because I really have no idea why it doesn't work.
I'm using Hadoop 1.0.4 on CentOS 6. I tried http://opentsdb.net/setup-hbase.html, https://github.com/kevinweil/hadoop-lzo and some others, but I'm still getting this error:
13/07/03 19:52:23 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/03 19:52:23 WARN lzo.LzoCompressor: java.lang.NoSuchFieldError: workingMemoryBuf
13/07/03 19:52:23 ERROR lzo.LzoCodec: Failed to load/initialize native-lzo library
even though the native GPL library is loaded. I've updated my mapred-site and core-site according to the links above, and I've copied the libs into the right paths (again according to the links).
The real problem is that the LZO test works on the namenode:
13/07/03 18:55:47 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
13/07/03 18:55:47 INFO lzo.LzoCodec: Successfully loaded & initialized native-lzo library [hadoop-lzo rev ]
I've tried setting several paths in hadoop-env.sh, but none of them seems to be the right solution.
So, if you have any idea or link, I'm really interested.
[edit] After a week, I'm still trying to make it work.
I tried sudhirvn.blogspot.fr/2010/08/hadoop-lzo-installation-errors-and.html, but removing all LZO and gplcompression libraries and then doing a new install did not help at all.
Is this due to my hadoop-core version? Is it possible to have hadoop-core-0.20 and hadoop-core-1.0.4 at the same time? Should I compile LZO against Hadoop 0.20 in order to use it?
By the way, I already tried compiling hadoop-lzo like this:
CLASSPATH=/usr/lib/hadoop/hadoop-core-1.0.4.jar CFLAGS=-m64 CXXFLAGS=-m64 ant compile-native tar
If it helps, the full error is:
INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
WARN lzo.LzoCompressor: java.lang.NoSuchFieldError: workingMemoryBuf
ERROR lzo.LzoCodec: Failed to load/initialize native-lzo library
INFO lzo.LzoIndexer: [INDEX] LZO Indexing file test/table.lzo, size 0.00 GB...
WARN snappy.LoadSnappy: Snappy native library is available
INFO util.NativeCodeLoader: Loaded the native-hadoop library
INFO snappy.LoadSnappy: Snappy native library loaded
Exception in thread "main" java.lang.RuntimeException: native-lzo library not available
at com.hadoop.compression.lzo.LzopCodec.createDecompressor(LzopCodec.java:87)
at com.hadoop.compression.lzo.LzoIndex.createIndex(LzoIndex.java:229)
at com.hadoop.compression.lzo.LzoIndexer.indexSingleFile(LzoIndexer.java:117)
at com.hadoop.compression.lzo.LzoIndexer.indexInternal(LzoIndexer.java:98)
at com.hadoop.compression.lzo.LzoIndexer.index(LzoIndexer.java:52)
at com.hadoop.compression.lzo.LzoIndexer.main(LzoIndexer.java:137)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I really want to use lzo because I have to deal with very large files on a rather small cluster (5 nodes). Having splittable compressed files could make it run really fast.
Every remark or idea is welcome.
I was having the exact same issue and finally resolved it by randomly choosing a datanode, and checking whether lzop was installed properly.
If it wasn't, I did:
sudo apt-get install lzop
Assuming you are using Debian-based packages.
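A small sketch of the check described above, meant to be run on each datanode rather than only on the namenode. The helper name check_cmd is hypothetical, and it only tests whether the lzop command is on the PATH, not whether the native library loads.

```shell
# Report whether a command is on PATH; intended to be run on each datanode.
# $1 = command name to look for.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: MISSING"
    return 1
  fi
}

# Check for lzop on the local machine; do not abort the script if it is absent.
check_cmd lzop || true
```

To cover a whole cluster, the same test could be run over ssh for each hostname in your own node list.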
I had this same issue on my OSX Machine. The problem was solved when I removed hadoop-lzo.jar (0.4.16) from my classpath and put hadoop-gpl-compression jar instead.
