Spark shell throwing exception after trying to integrate s3 / hadoop - hadoop

I'm working on a Windows machine trying to set up a Spark test stack - the aim is to read/write files to an S3 bucket.
I'm running Spark 1.6.1. When I run spark-shell I now receive an error:
16/03/22 15:19:48 INFO metastore.HiveMetaStore: 0: get_functions: db=default pat=*
16/03/22 15:19:48 INFO HiveMetaStore.audit: ugi=Administrator ip=unknown-ip-addr cmd=get_functions: db=default pat=*
16/03/22 15:19:48 INFO DataNucleus.Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
java.lang.RuntimeException: java.io.IOException: No FileSystem for scheme: s3n
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:204)
Doing some reading led me to believe that I need to add the AWS jars as an argument - the jars are included in the Hadoop distribution.
I then ran C:\Spark\hadoop\share\hadoop\tools\lib>spark-shell --jars aws-java-sdk-1.7.4.jar, hadoop-aws-2.7.1.jar
thinking that I'm now including the jars and so it must be OK... how foolish of me - I get the exact same error.
I then tried to include just the hadoop-aws jar, and all kinds of exceptions were thrown, including Hive failing to instantiate, s3a failing to instantiate, AWSCredentials complaints, and so on.
I'm at a bit of a loss, if anyone can shed some light on what I might be doing wrong I'll happily buy them a pint :)
EDIT:
I've since updated the core-site.xml file by removing the fs.defaultFS property with a value of s3n://mybucketname; Spark will now load.
In its stead I have hdfs://0.0.0.0:19000, which is working fine.
So I guess my question changes from 'gaaaaah' to 'gaaaaah, how does one include S3 correctly as a filesystem?'
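A minimal sketch of one way to wire s3n up from inside spark-shell, assuming the hadoop-aws and aws-java-sdk jars are actually on the classpath (note that --jars takes a comma-separated list with no space after the comma, i.e. --jars aws-java-sdk-1.7.4.jar,hadoop-aws-2.7.1.jar). The bucket name is the one from the question; the credentials and object path are placeholders:
// Register the s3n filesystem implementation shipped in hadoop-aws and supply credentials
sc.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")       // placeholder
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")   // placeholder
val lines = sc.textFile("s3n://mybucketname/some/file.txt")                  // hypothetical object key
lines.take(5).foreach(println)
The same fs.s3n.* properties can equally live in core-site.xml instead of being set on the SparkContext.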

Related

Hbase completebulkload stuck on AWS EMR

So the scenario is I am trying to use HBase bulk load to load some data into HBase.
Here's my stack setup:
HBase version: 1.3.1
Hadoop version: 2.7.3
EMR version: 5.10
Cluster size: 20 r4.2xlarge instances
I have an HBase table which was pre-split into 400 regions with HexStringSplit for the row key.
The table contains only one column family and uses the LZ4 compression algorithm.
I then tried to use bulk load to load some data into the table.
I was able to use the ImportTsv tool to generate HFiles on HDFS; the total file size is about 20 GB.
Then I ran the "completebulkload" tool as follows:
hadoop jar /usr/lib/hbase/lib/hbase-server-1.3.1.jar completebulkload hdfs:///user/hbase/output MyTable
Here "hdfs:///user/hbase/output" is the output directory of the ImportTsv job.
The process started but got stuck; I only see the following output:
17/12/05 19:49:22 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://ip-172-31-19-197.ec2.internal:8020/user/hbase/output/_SUCCESS
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
17/12/05 19:49:23 INFO compress.CodecPool: Got brand-new decompressor [.lz4]
No further information was printed. It's been almost an hour and still nothing. I checked the HBase UI and nothing has been loaded yet. All regions are empty.
Any thoughts on this?
Thanks
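For reference, completebulkload is (as far as I know) just the LoadIncrementalHFiles tool; here is a short Scala sketch of the equivalent programmatic call, using the table name and HFile directory from the question. It only illustrates the mechanism and is not itself a fix for the hang:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
// Client-side configuration, read from hbase-site.xml on the classpath
val conf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(conf)
val tableName = TableName.valueOf("MyTable")
val table = connection.getTable(tableName)
val regionLocator = connection.getRegionLocator(tableName)
// Move the HFiles produced by ImportTsv into the matching regions of MyTable
val loader = new LoadIncrementalHFiles(conf)
loader.doBulkLoad(new Path("hdfs:///user/hbase/output"), connection.getAdmin, table, regionLocator)
connection.close()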

apache pig not connecting to hdfs

I have Hadoop version 2.6.3 and pig-0.6.0
I have all the daemons up and running in a single-node cluster.
After firing the pig command, Pig is only connecting to file:/// and not HDFS.
Could you please tell me how to make it connect to HDFS?
Below is the INFO log that I see:
2016-01-10 20:58:30,431 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2016-01-10 20:58:30,650 [main] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
When I run the command in the Grunt shell:
grunt> ls hdfs://localhost:54310/
2016-01-10 21:05:41,059 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Wrong FS: hdfs://localhost:54310/, expected: file:///
Details at logfile: /home/hguna/pig_1452488310172.log
I have no clue as to why it is expecting file:///.
ERROR 2999: Unexpected internal error. Wrong FS: hdfs://localhost:54310/, expected: file:///
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:54310/, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:305)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:643)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.isContainer(HDataStorage.java:203)
at org.apache.pig.backend.hadoop.datastorage.HDataStorage.asElement(HDataStorage.java:131)
at org.apache.pig.tools.grunt.GruntParser.processLS(GruntParser.java:576)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:304)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:352)
Did I configure Hadoop correctly, or am I going wrong somewhere? Please let me know if there is any file that I need to share. I have done enough researching but could not fix it. Btw, I am a newbie to Hadoop and Pig.
Please help me.
Thanks
Check your configuration in hadoop-site.xml, core-site.xml and mapred-site.xml.
Use PIG_CLASSPATH to specify additional classpath entries, e.g. to add the Hadoop configuration files (hadoop-site.xml, core-site.xml) to the classpath:
export PIG_CLASSPATH=<path_to_hadoop_conf_dir>
To have these entries override the default classpath, set PIG_USER_CLASSPATH_FIRST:
export PIG_USER_CLASSPATH_FIRST=true
After that you should be able to start the Grunt shell.
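The mechanism behind the error, for what it's worth: Hadoop's Configuration only picks up core-site.xml when the conf directory is on the classpath; otherwise fs.defaultFS falls back to file:///, which is exactly what Pig reports. A small illustrative Scala sketch (the hdfs:// value shown is the one from the question):
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
val conf = new Configuration()
// Prints file:/// when core-site.xml is not on the classpath,
// and hdfs://localhost:54310 once the Hadoop conf dir is added (e.g. via PIG_CLASSPATH)
println(conf.get("fs.defaultFS"))
println(FileSystem.get(conf).getUri)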

Hive Internal Error: java.lang.ClassNotFoundException(org.apache.atlas.hive.hook.HiveHook)

I am running a Hive query through Oozie using Hue.
I am creating a table through a Hue-Oozie workflow.
My job is failing, but when I check in Hive the table has been created.
The log shows the error below:
16157 [main] INFO org.apache.hadoop.hive.ql.hooks.ATSHook - Created ATS Hook
2015-09-24 11:05:35,801 INFO [main] hooks.ATSHook (ATSHook.java:<init>(84)) - Created ATS Hook
16159 [main] ERROR org.apache.hadoop.hive.ql.Driver - hive.exec.post.hooks Class not found:org.apache.atlas.hive.hook.HiveHook
2015-09-24 11:05:35,803 ERROR [main] ql.Driver (SessionState.java:printError(960)) - hive.exec.post.hooks Class not found:org.apache.atlas.hive.hook.HiveHook
16159 [main] ERROR org.apache.hadoop.hive.ql.Driver - FAILED: Hive Internal Error: java.lang.ClassNotFoundException(org.apache.atlas.hive.hook.HiveHook)
java.lang.ClassNotFoundException: org.apache.atlas.hive.hook.HiveHook
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
I am not able to identify the issue.
I am using HDP 2.3.1.
Basically this error is due to the missing Atlas jar in the Oozie share lib.
In HDP, the Atlas jar is available in /usr/hdp/2.3.0.0-2557/atlas/.
Put all the jars related to Atlas in the Oozie share lib:
hadoop fs -put /usr/hdp/2.3.0.0-2557/atlas/hook/hive/* /user/oozie/share/lib/lib200344/hive
Add 'export HIVE_AUX_JARS_PATH=<atlas package>/hook/hive' in hive-env.sh.
Copy <atlas package>/conf/application.properties to the Hive conf directory.
Restart the Oozie services. This will solve the problem. If anybody still faces the problem, please comment here so that I can help.
[Comment by Immo Huneke: when using the Hortonworks sandbox VM, I found that just putting the jar files in the share/lib folder under HDFS was enough to resolve the problem. I didn't have to update hive-env.sh or copy the application.properties file. But check the exact path of your share/lib folder by executing the command hdfs dfs -ls /user/oozie/share/lib before copying.]
hive> add jar /usr/hdp//atlas/hook/hive/hive-bridge-${VERSION}.jar
It will be OK.
Hope this helps.
It seems you have a class-not-found exception.
Have you installed the Oozie share lib? If yes, please update all the Hive-dependent jars in the share lib location and check the status.
Also check that the Hive client is available on all the nodes in the cluster and that it is running.
I tried each and every possible solution mentioned in this forum and on Stack Overflow, but they did not resolve my issue.
Finally, I resolved it by copying all the jars in /hook/hive to the lib folder of my Oozie workflow (create a new lib folder at the job.properties level).

getting java.net.SocketTimeoutException when trying to run the Hadoop mapReduce on fresh install of Hortonworks

I have a fresh install of Hortonworks version 2.3_1 for Oracle VirtualBox and I get a java.net.SocketTimeoutException whenever I try to run a MapReduce job. I changed nothing other than the memory and the cores available to the VM.
Full text of the run:
WARNING: Use "yarn jar" to launch YARN applications.
15/09/01 01:15:17 INFO impl.TimelineClientImpl: Timeline service address: http://sandbox.hortonworks.com:8188/ws/v1/timeline/
15/09/01 01:15:20 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/10.0.2.15:8050
15/09/01 01:16:19 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
15/09/01 01:18:09 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block BP-601678901-10.0.2.15-1439987491556:blk_1073742292_1499
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.2.15:52924 remote=/10.0.2.15:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:749)
15/09/01 01:18:11 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/root/.staging/job_1441069639378_0001
Exception in thread "main" java.io.IOException: All datanodes DatanodeInfoWithStorage[10.0.2.15:50010,DS-56099a5f-3cb3-426e-8e1a-ff3b53df9bf2,DISK] are bad. Aborting...
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1117)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:909)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:412)
Full name of the OVA file I am using: Sandbox_HDP_2.3_1_virtualbox.ova
My host is a Windows 7 Home Premium machine with eight lines of execution (four hyperthreaded cores, I think).
The problem was exactly what it seemed: a timeout error. It was fixed by going to the Hadoop config folder and raising all the timeouts as well as the number of retries (although from the log those didn't come into play), and by stopping unnecessary services on both the host and guest operating systems.
Thanks, sunrise76; one of those issues pointed me to the config folder.
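In case it helps anyone hunting for the exact knobs: I believe the timeouts involved here are the standard HDFS client/datanode ones, normally raised in hdfs-site.xml inside that config folder. A sketch of the same settings applied programmatically in Scala (the key names are the stock HDFS ones; the values are only illustrative):
import org.apache.hadoop.conf.Configuration
val conf = new Configuration()
// DFS client read timeout in ms (default 60000; the 65000 ms in the log appears to be this plus a small per-node extension)
conf.setInt("dfs.client.socket-timeout", 300000)
// Datanode write timeout in ms (default 480000)
conf.setInt("dfs.datanode.socket.write.timeout", 600000)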

Hadoop Mapreduce tasktrackers keep ignoring HADOOP_CLASSPATH. Zookeeper trying to connect to localhost rather than cluster address

I have a Hadoop cluster (Cloudera CDH4.2) with 5 datanodes. I'm trying to run a MapReduce job which creates an HBaseConfiguration object. The tasktracker attempts fail, because they're trying to connect to localhost:2181 rather than the address of the actual zookeeper installation.
I'm aware that this is because the tasktrackers are not being supplied with the correct classpath containing the hbase configuration. However, if I run the job like so:
HADOOP_CLASSPATH=`/usr/bin/hbase classpath` hadoop jar myjar.jar
The documentation indicates this should solve the problem. The first entry in hbase classpath is /usr/lib/hbase/conf which is a symlink to /etc/hbase/conf, so in theory, this should add the hbase configuration into the HADOOP_CLASSPATH variable.
However, the logs from the tasktracker show this:
2013-08-14 12:47:24,308 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.class.path=<output of `hadoop classpath`>
....
2013-08-14 12:47:24,309 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=180000 watcher=hconnection
.....
2013-08-14 12:47:24,328 WARN org.apache.zookeeper.ClientCnxn: Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
So, for some reason, the tasktrackers are completely ignoring my effort to set HADOOP_CLASSPATH to hbase classpath. The documentation (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath) states this should just work. What's wrong?
I'm aware I could work around this by explicitly specifying the zookeeper quorum address in the jar code, but I need this jar to be portable, and pick up the local configuration without recompiling, so I don't see hard coding the address as a viable option.
If you are configuring it in Java code:
Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
conf.set("hbase.zookeeper.quorum", "server1,server2,server3");
conf.set("hbase.zookeeper.property.clientPort", "2181");
If you are using the command line, add -Dhbase.zookeeper.quorum:
sudo hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hbase/hbase.jar rowcounter -Dhbase.zookeeper.quorum=server1,server2,server3 hly_temp
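One caveat on the -D route, assuming your own driver rather than the bundled rowcounter: -Dhbase.zookeeper.quorum=... only reaches the job's Configuration if the main class goes through ToolRunner/GenericOptionsParser, which also keeps the jar portable without hard-coding the quorum. A minimal hypothetical Scala driver skeleton:
import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.util.{Tool, ToolRunner}
// Hypothetical driver: ToolRunner parses -D options into getConf, and
// HBaseConfiguration.create(getConf) merges in any hbase-site.xml found on the
// classpath (e.g. via HADOOP_CLASSPATH=`hbase classpath`)
object MyJobDriver extends Configured with Tool {
  override def run(args: Array[String]): Int = {
    val conf = HBaseConfiguration.create(getConf)
    // ... build and submit the MapReduce job with this conf ...
    0
  }
  def main(args: Array[String]): Unit =
    System.exit(ToolRunner.run(new Configuration(), MyJobDriver, args))
}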

Resources