HBase completebulkload stuck on AWS EMR - hadoop

So the scenario is I am trying to use HBase bulk load to load some data into HBase.
Here's my stack setup:
HBase version: 1.3.1
Hadoop version: 2.7.3
EMR version: 5.10
Cluster size: 20 r4.2xlarge instances
I have an HBase table that was pre-split into 400 regions using HexStringSplit for the row key.
The table contains only one column family and uses the LZ4 compression algorithm.
I then tried to use bulk load to load some data into the table.
I was able to use the ImportTsv tool to generate HFiles on HDFS; the total file size is about 20 GB.
Then I ran the "completebulkload" tool as follows:
hadoop jar /usr/lib/hbase/lib/hbase-server-1.3.1.jar completebulkload hdfs:///user/hbase/output MyTable
Here "hdfs:///user/hbase/output" is the output directory of the import tsv job.
The process started but got stuck; I only see the following output:
17/12/05 19:49:22 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://ip-172-31-19-197.ec2.internal:8020/user/hbase/output/_SUCCESS
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
No further information was printed. It's been almost 1 hour but still nothing. I checked the HBase UI and nothing has been loaded yet. All regions are empty.
Any thoughts on this?
Thanks

Related

Pig in MapReduce mode is stuck dumping HDFS data in Hortonworks HDP

I have some data files in my Hortonworks HDFS location. My requirement is to dump the HDFS data in the Pig shell using Pig's MapReduce mode. After loading the file data from HDFS, when I try to dump the data in the Pig shell using the DUMP command, the MapReduce job gets stuck at 0% and does not complete even after a long time.
I followed the steps below:
1) Start Pig in MapReduce mode:
pig -x mapreduce
2) Load data into Pig from an HDFS directory:
mapdata = load 'hdfs://ip-xxx-xx-xx-xx.us-east-2.compute.internal:8020/user/abc/datadir1' as (a:map[chararray]);
3) Print data:
dump mapdata;
After executing the third step, I get the messages below on the shell:
2018-10-09 07:25:51,099 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2018-10-09 07:25:51,099 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Running jobs are [job_1539066382468_0147]
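While it sits at 0%, I can at least gather more detail with something like the following (the application id is inferred from the job id in the log above, and yarn logs needs log aggregation enabled):

# List running applications and their current state
yarn application -list

# Pull the container logs for the stuck job, if log aggregation is on
yarn logs -applicationId application_1539066382468_0147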

Installing Snappy on an HDP Cluster

I have an HBase cluster built using Hortonworks Data Platform 2.6.1.
Now I need to apply Snappy compression on HBase tables.
Without installing Snappy, I executed the CompressionTest and got a success output. I used the commands below.
hbase org.apache.hadoop.hbase.util.CompressionTest file:///tmp/test.txt snappy
hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://hbase.primary.namenode:8020/tmp/test1.txt snappy
I got the response below for both commands.
2017-10-30 11:25:18,454 INFO [main] hfile.CacheConfig: CacheConfig:disabled
2017-10-30 11:25:18,671 INFO [main] compress.CodecPool: Got brand-new compressor [.snappy]
2017-10-30 11:25:18,679 INFO [main] compress.CodecPool: Got brand-new compressor [.snappy]
2017-10-30 11:25:21,560 INFO [main] hfile.CacheConfig: CacheConfig:disabled
2017-10-30 11:25:22,366 INFO [main] compress.CodecPool: Got brand-new decompressor [.snappy]
SUCCESS
I also see the libraries below in the path /usr/hdp/2.6.1.0-129/hadoop/lib/native/.
libhadoop.a
libhadooppipes.a
libhadoop.so
libhadoop.so.1.0.0
libhadooputils.a
libhdfs.a
libsnappy.so
libsnappy.so.1
libsnappy.so.1.1.4
Does HDP support snappy compression by default?
If so, can I compress the HBase tables without installing Snappy?
"Without installing Snappy, I executed the CompressionTest and I got a success output."
Ambari installed it during cluster installation, so yes, those commands are working.
"Does HDP support snappy compression by default?"
Yes, the HDP-UTILS repository provides the Snappy libraries.
"Can I compress the HBase tables without installing Snappy?"
HBase provides other compression algorithms, so yes.
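If you do want Snappy on a table, a minimal sketch from the command line (the table and column family names here are placeholders) would be:

# Switch the column family to Snappy, then rewrite existing store files
echo "alter 'mytable', {NAME => 'cf', COMPRESSION => 'SNAPPY'}" | hbase shell
echo "major_compact 'mytable'" | hbase shell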

Spark shell throwing exception after trying to integrate s3 / hadoop

I'm working on a Windows machine trying to set up a Spark test stack - the aim is to read/write files to an S3 bucket.
I'm running Spark 1.6.1. When I run spark-shell I now receive an error:
16/03/22 15:19:48 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
16/03/22 15:19:48 INFO HiveMetaStore.audit: ugi=Administrator ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/03/22 15:19:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
Doing some reading led me to believe that I need to add the AWS jars as an argument - the jars are included in the Hadoop distribution.
I then run C:\Spark\hadoop\share\hadoop\tools\lib>spark-shell --jars aws-java-sdk-1.7.4.jar, hadoop-aws-2.7.1.jar
thinking that I'm now including the jars and so it must be ok...how foolish of me - I get the exact same error.
I then tried to include just the hadoop-aws jar, and all kinds of exceptions were thrown: Hive could not be instantiated, s3a could not be instantiated, AWSCredentials wasn't happy, and so on.
I'm at a bit of a loss, if anyone can shed some light on what I might be doing wrong I'll happily buy them a pint :)
EDIT:
I've since updated the core-site.xml file: by removing the fs.defaultFS property with a value of s3n://mybucketname, Spark will now load.
In its stead I have hdfs://0.0.0.0:19000, which is working fine.
So I guess my question changes from 'gaaaaah' to 'gaaaaah, how does one include S3 correctly as a filesystem?'
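For reference, the kind of invocation I believe should pick the jars up correctly - note there is no space after the comma in --jars, and the access keys are placeholders:

# Run from the directory containing the jars, as before
spark-shell --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.1.jar --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem --conf spark.hadoop.fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY --conf spark.hadoop.fs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY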

Wordcount program is stuck in hadoop-2.3.0

I installed hadoop-2.3.0 and tried to run the wordcount example,
but it starts the job and then sits idle.
hadoop#ubuntu:~$ $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.3.0.jar wordcount /myprg outputfile1
14/04/30 13:20:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/04/30 13:20:51 INFO input.FileInputFormat: Total input paths to process : 1
14/04/30 13:20:53 INFO mapreduce.JobSubmitter: number of splits:1
14/04/30 13:21:02 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1398885280814_0004
14/04/30 13:21:07 INFO impl.YarnClientImpl: Submitted application application_1398885280814_0004
14/04/30 13:21:09 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1398885280814_0004/
14/04/30 13:21:09 INFO mapreduce.Job: Running job: job_1398885280814_0004
The url to track the job: application_1398885280814_0004/
With previous versions I did not get such an issue; I was able to run the Hadoop wordcount example before.
I followed these steps for installing hadoop-2.3.0.
Please suggest.
I had the exact same situation a while back while switching to YARN. Basically there was the concept of task slots in MRv1 and containers in MRv2. Both of these differ very much in how the tasks are scheduled and run on the nodes.
The reason that your job is stuck is that it is unable to find/start a container. If you go into the full logs of Resource Manager/Application Master etc daemons, you may find that it is doing nothing after it starts to allocate a new container.
To solve the problem, you have to tweak your memory settings in yarn-site.xml and mapred-site.xml. While doing the same myself, I found these tutorials especially helpful. I would suggest you start with very basic memory settings and optimize them later on. First check with a wordcount example, then go on to more complex ones.
I was facing the same issue. I added the following property to my yarn-site.xml and it solved the issue.
<property>
<name>yarn.resourcemanager.hostname</name>
<value>Hostname-of-your-RM</value>
<description>The hostname of the RM.</description>
</property>
Without the resource manager hostname, things go awry in a multi-node setup, as each node then defaults to trying to find a local resource manager and never announces its resources to the master node. So your MapReduce execution request probably didn't find any mappers in which to run, because the request was being sent to the master and the master didn't know about the slave slots.
Reference : http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/
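To double-check that the NodeManagers actually registered with the ResourceManager after setting this, a quick generic check is:

# Every worker should appear here with its state; an empty list means the
# nodes never announced themselves to the ResourceManager
yarn node -list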

Map reduce job getting stuck at map 0% reduce 0%

I am running the famous wordcount example. I have a local and a prod Hadoop setup. The same example works in prod, but it's not working locally. Can someone tell me what I should look for?
The job is getting stuck. The task logs are:
~/tmp$ hadoop jar wordcount.jar WordCount /testhistory /outputtest/test
Warning: $HADOOP_HOME is deprecated.
13/08/29 16:12:34 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/08/29 16:12:35 INFO input.FileInputFormat: Total input paths to process : 3
13/08/29 16:12:35 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/08/29 16:12:35 WARN snappy.LoadSnappy: Snappy native library not loaded
13/08/29 16:12:35 INFO mapred.JobClient: Running job: job_201308291153_0015
13/08/29 16:12:36 INFO mapred.JobClient: map 0% reduce 0%
Locally, Hadoop is running in pseudo-distributed mode. All three processes (namenode, datanode, jobtracker) are running. Let me know if any extra information is required.
The tasktracker seems to be missing.
Try:
hadoop tasktracker &
In Hadoop 2.x this problem could be related to memory issues; see "MapReduce in Hadoop 2.2.0 not working".
I had the same problem and this page helped me:
http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/
Basically I solved my problem using the following 3 steps. The fact is that I had to configure much more memory than I really have.
1) yarn-site.xml
yarn.resourcemanager.hostname = hostname_of_the_master
yarn.nodemanager.resource.memory-mb = 4000
yarn.nodemanager.resource.cpu-vcores = 2
yarn.scheduler.minimum-allocation-mb = 4000
2) mapred-site.xml
yarn.app.mapreduce.am.resource.mb = 4000
yarn.app.mapreduce.am.command-opts = -Xmx3768m
mapreduce.map.cpu.vcores = 2
mapreduce.reduce.cpu.vcores = 2
3) Send these files across all nodes
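For step 3, a rough sketch of distributing the files, assuming passwordless SSH, a standard $HADOOP_HOME layout, and a slaves file listing the worker hostnames (adjust to your install):

# Copy the updated configs to every worker listed in the slaves file
for host in $(cat $HADOOP_HOME/etc/hadoop/slaves); do
  scp $HADOOP_HOME/etc/hadoop/yarn-site.xml \
      $HADOOP_HOME/etc/hadoop/mapred-site.xml \
      "$host":$HADOOP_HOME/etc/hadoop/
done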
Beyond hadoop tasktracker & and the other suggestions above, please check your code and make sure there is no infinite loop or other bug.
If this problem occurs when running Hive queries, check whether you are joining two very big tables without leveraging partitions. Not using partitions may lead to long-running full table scans and hence the job being stuck at map 0% reduce 0%.
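For example, with two hypothetical large tables sales and customers that are both partitioned by dt, restricting the join to the partitions you actually need lets Hive prune instead of scanning both tables in full:

# Partition filters in the WHERE clause limit each side to a single partition
hive -e "
SELECT s.order_id, c.name
FROM sales s
JOIN customers c ON s.customer_id = c.customer_id
WHERE s.dt = '2018-10-09' AND c.dt = '2018-10-09';
"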
