Hadoop Image Processing Interface - hadoop

I have a question about HIPI, as I am new to this field. I am trying to run a simple example, and my command is:
$> bin/hadoop jar /opt/hipi-dev/examples/downloader.jar /user/hduser/hipiFile /user/hduser/outputhipi.hib 1
The hipiFile folder contains one text file with 4 image URLs. The path in my build.xml file is correct. However, it gives me the following error:
14/03/09 09:44:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
Found host successfully: 0
Tried to get 1 nodes, got 1
14/03/09 09:44:01 INFO input.FileInputFormat: Total input paths to process : 2
First n-1 nodes responsible for 4 images
Last node responsible for 4 images
14/03/09 09:44:02 INFO mapred.JobClient: Running job: job_201403090903_0003
14/03/09 09:44:03 INFO mapred.JobClient: map 0% reduce 0%
14/03/09 09:44:13 INFO mapred.JobClient: Task Id : attempt_201403090903_0003_m_000000_0, Status : FAILED
Error: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FSDataInputStream
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at hipi.examples.downloader.Downloader$DownloaderMapper.map(Unknown Source)
at hipi.examples.downloader.Downloader$DownloaderMapper.map(Unknown Source)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Check whether the hadoop-common-*.jar files are on the classpath.
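One way to check what ends up on the Hadoop classpath, and to add the jar if it is missing, is sketched below. This is only an illustration; the jar path is a placeholder that depends on your Hadoop version and install layout:
$ bin/hadoop classpath | tr ':' '\n' | grep hadoop-common
# If nothing matches, add the jar explicitly before resubmitting (replace the
# placeholder path with wherever hadoop-common actually lives in your install):
$ export HADOOP_CLASSPATH=/path/to/hadoop-common-x.y.z.jar:$HADOOP_CLASSPATH
$ bin/hadoop jar /opt/hipi-dev/examples/downloader.jar /user/hduser/hipiFile /user/hduser/outputhipi.hib 1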

Related

ERROR security.UserGroupInformation: PriviledgedActionException in Hadoop 2.2

I am using Hadoop 2.2.0. The hadoop-mapreduce-examples-2.2.0.jar examples run fine on HDFS.
I made a wordcount program in Eclipse, added the jars using Maven, and ran the jar:
ubuntu@ubuntu-linux:~$ yarn jar Sample-0.0.1-SNAPSHOT.jar com.vij.Sample.WordCount /user/ubuntu/wordcount/input/vij.txt user/ubuntu/wordcount/output
It gives the following error:
15/02/17 13:09:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/02/17 13:09:10 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
15/02/17 13:09:11 ERROR security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:54310/user/ubuntu/wordcount/input/vij.txt already exists
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:54310/user/ubuntu/wordcount/input/vij.txt already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)
at org.apache.hadoop.mapreduce.JobSubmitter.checkSpecs(JobSubmitter.java:456)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:342)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1268)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1265)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1265)
at com.vij.Sample.WordCount.main(WordCount.java:33)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
The jar is on my local system; both the input and output paths are on HDFS. No output directory exists at the output path on HDFS.
Please advise.
Thanks.
Actually, the error is:
ERROR security.UserGroupInformation: PriviledgedActionException as:ubuntu (auth:SIMPLE) cause:org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://localhost:54310/user/ubuntu/wordcount/input/vij.txt already exists
Delete the output path that already exists ("vij.txt"), or write the output to a different path.
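As a rough illustration of both options (the paths here are placeholders rather than ones verified for this job, and -rm -r is the Hadoop 2.x shell syntax):
# Option 1: remove the conflicting output path before re-running, but only if it
# really is disposable job output and not your input data:
$ hadoop fs -rm -r /user/ubuntu/wordcount/output
# Option 2: point the job at a fresh output directory that does not exist yet:
$ yarn jar Sample-0.0.1-SNAPSHOT.jar com.vij.Sample.WordCount /user/ubuntu/wordcount/input/vij.txt /user/ubuntu/wordcount/output2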
Or try the following steps:
Download and unzip the WordCount source code from this link under $HADOOP_HOME.
$ cd $HADOOP_HOME
$ wget http://salsahpc.indiana.edu/tutorial/source_code/Hadoop-WordCount.zip
$ unzip Hadoop-WordCount.zip
Then, upload the input files (any text-format files) into the Hadoop Distributed File System (HDFS):
$ bin/hadoop fs -put $HADOOP_HOME/Hadoop-WordCount/input/ input
$ bin/hadoop fs -ls input
Here, $HADOOP_HOME/Hadoop-WordCount/input/ is the local directory where the program inputs are stored. The second "input" represents the remote destination directory on the HDFS.
After uploading the inputs into HDFS, run the WordCount program with the following commands. We assume you have already compiled the word count program.
$ bin/hadoop jar $HADOOP_HOME/Hadoop-WordCount/wordcount.jar WordCount input output
If Hadoop is running correctly, it will print hadoop running messages similar to the following:
WARNING: org.apache.hadoop.metrics.jvm.EventCounter is deprecated.
Please use org.apache.hadoop.log.metrics.EventCounter in all the log4j.properties files.
11/11/02 18:34:46 INFO input.FileInputFormat: Total input paths to process : 1
11/11/02 18:34:46 INFO mapred.JobClient: Running job: job_201111021738_0001
11/11/02 18:34:47 INFO mapred.JobClient: map 0% reduce 0%
11/11/02 18:35:01 INFO mapred.JobClient: map 100% reduce 0%
11/11/02 18:35:13 INFO mapred.JobClient: map 100% reduce 100%
11/11/02 18:35:18 INFO mapred.JobClient: Job complete: job_201111021738_0001
11/11/02 18:35:18 INFO mapred.JobClient: Counters: 25
...
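Once the job finishes, you can inspect the result in HDFS. The exact part-file names depend on the job, so the wildcard below is just a convenient assumption:
$ bin/hadoop fs -ls output
$ bin/hadoop fs -cat output/part-* | head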

reducer always fails and map succeeds

I am running a simple wordcount job on a 1 GB text file. My cluster has 8 datanodes and 1 namenode, each with a storage capacity of 3 GB.
When I run wordcount, the map phase always succeeds, but the reducer throws an error and fails. The error message is below.
14/10/05 15:42:02 INFO mapred.JobClient: map 100% reduce 31%
14/10/05 15:42:07 INFO mapred.JobClient: Task Id : attempt_201410051534_0002_m_000016_0, Status : FAILED
FSError: java.io.IOException: No space left on device
14/10/05 15:42:14 INFO mapred.JobClient: Task Id : attempt_201410051534_0002_r_000000_0, Status : FAILED
java.io.IOException: Task: attempt_201410051534_0002_r_000000_0 - The reduce copier failed
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:390)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for file:/app/hadoop/tmp/mapred/local/taskTracker/hduser/jobcache/job_201410051534_0002/attempt_201410051534_0002_r_000000_0/output/map_18.out
Could you please tell me how I can fix this problem?
Thanks
Navaz

MapReduce working on single-node cluster but not on multi-node cluster

I am running a MapReduce program which works fine on my CDH QuickStart VM, but when I try it on a multi-node cluster it gives the error below:
WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
14/02/12 00:23:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
14/02/12 00:23:06 INFO input.FileInputFormat: Total input paths to process : 1
14/02/12 00:23:07 INFO mapred.JobClient: Running job: job_201401221117_5777
14/02/12 00:23:08 INFO mapred.JobClient: map 0% reduce 0%
14/02/12 00:23:16 INFO mapred.JobClient: Task Id : attempt_201401221117_5777_m_000000_0, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class Mappercsv not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1774)
at org.apache.hadoop.mapreduce.task.JobContextImpl.getMapperClass(JobContextImpl.java:191)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:631)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Caused by: java.lang.ClassNotFoundException: Class Mappercsv not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1680)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1772)
... 8 more
Please help.
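The WARN line "No job jar file set" is usually the key symptom here: the cluster never receives a jar containing Mappercsv, so the task JVMs cannot load it. As a sketch with hypothetical jar, driver, and path names, you could verify the class is packaged and submit that jar itself, and make sure the driver calls Job#setJarByClass, the new-API equivalent of the JobConf#setJar hint in the warning:
$ jar tf myjob.jar | grep Mappercsv
$ hadoop jar myjob.jar MyDriver /input/path /output/path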

How to run large Mahout fuzzy kmeans clustering without running out of memory?

I am running Mahout 0.7 fuzzy k-means clustering on Amazon's EMR (AMI 2.3.1) and
I am running out of memory.
My overall question: how do I get it working most easily?
Here is an invocation:
./bin/mahout fkmeans \
--input s3://.../foo/vectors.seq \
--output s3://.../foo/fuzzyk2 \
--numClusters 128 \
--clusters s3://.../foo/initial_clusters/ \
--maxIter 20 \
--m 2 \
--method mapreduce \
--distanceMeasure org.apache.mahout.common.distance.TanimotoDistanceMeasure
More detailed questions:
How do I tell how much memory I'm using? I'm on c1.xlarge instances. If I believe AWS docs, that sets mapred.child.java.opts=-Xmx512m.
How do I tell how much memory I need? I can just try different sizes, but it gives me no idea of the size of problem I can handle.
How do I change my memory usage? Start up a different workflow with a different class of machine? Try setting mapred.child.java.opts?
My dataset does not seem that large. Is it?
vectors.seq is a collection of sparse vectors with 50225 vectors (50225 things related to 124420 others), a total of 1.2M relationships.
This post says to set --method mapreduce, which I am doing, and which is the default.
This post says all clusters are held in memory on every mapper and reducer. That would be 4*124420=498K things, which doesn't seem too bad.
Here is the stack:
13/04/19 18:12:53 INFO mapred.JobClient: Job complete: job_201304161435_7034
13/04/19 18:12:53 INFO mapred.JobClient: Counters: 7
13/04/19 18:12:53 INFO mapred.JobClient: Job Counters
13/04/19 18:12:53 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=28482
13/04/19 18:12:53 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/04/19 18:12:53 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/04/19 18:12:53 INFO mapred.JobClient: Rack-local map tasks=4
13/04/19 18:12:53 INFO mapred.JobClient: Launched map tasks=4
13/04/19 18:12:53 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
13/04/19 18:12:53 INFO mapred.JobClient: Failed map tasks=1
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing s3://.../foo/fuzzyk2/clusters-1
at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.buildClusters(FuzzyKMeansDriver.java:288)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:221)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.run(FuzzyKMeansDriver.java:110)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.fuzzykmeans.FuzzyKMeansDriver.main(FuzzyKMeansDriver.java:52)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
And here's part of the log of the mapper:
2013-04-19 18:10:38,734 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Received IOException while reading '.../foo/vectors.seq', attempting to reopen.
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at com.sun.net.ssl.internal.ssl.InputRecord.readFully(InputRecord.java:293)
at com.sun.net.ssl.internal.ssl.InputRecord.readV3Record(InputRecord.java:405)
at com.sun.net.ssl.internal.ssl.InputRecord.read(InputRecord.java:360)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:798)
at com.sun.net.ssl.internal.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:755)
at com.sun.net.ssl.internal.ssl.AppInputStream.read(AppInputStream.java:75)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:187)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:164)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
at java.io.FilterInputStream.read(FilterInputStream.java:116)
at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.read(NativeS3FileSystem.java:291)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2060)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2194)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader.nextKeyValue(SequenceFileRecordReader.java:68)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:540)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
2013-04-19 18:10:38,737 INFO org.apache.hadoop.fs.s3native.NativeS3FileSystem (main): Stream for key '.../foo/vectors.seq' seeking to position '62584'
2013-04-19 18:10:42,619 INFO org.apache.hadoop.mapred.TaskLogsTruncater (main): Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2013-04-19 18:10:42,730 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2013-04-19 18:10:42,730 INFO org.apache.hadoop.io.nativeio.NativeIO (main): Got UserName hadoop for UID 106 from the native implementation
2013-04-19 18:10:42,733 FATAL org.apache.hadoop.mapred.Child (main): Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.AbstractVector.assign(AbstractVector.java:560)
at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:253)
at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:241)
at org.apache.mahout.clustering.AbstractCluster.observe(AbstractCluster.java:37)
at org.apache.mahout.clustering.classify.ClusterClassifier.train(ClusterClassifier.java:158)
at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:55)
at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Yes, you're running out of memory. As far as I know, that "memory intensive workload" bootstrap action is long since deprecated, so it may do nothing. See the note on that page.
A c1.xlarge should use 384 MB per mapper by default. When you subtract out all the JVM overhead, room for splits and combining, etc., you probably don't have a whole lot left.
You set Hadoop params in a bootstrap action. If you're using the console, choose the "Configure Hadoop" action instead and set something like --site-key-value mapred.map.child.java.opts=-Xmx1g
(If you're doing this programmatically, and having any trouble, contact me offline; I can provide snippets from Myrrix since it heavily tunes the EMR clusters for speed in its recommend/clustering jobs.)
You can set mapred.map.child.java.opts instead to control mappers separately from reducers. You can also turn down the number of mappers per machine to make more room, or choose a high-memory instance. I usually find m1.xlarge is optimal for EMR given its price-to-I/O ratio, and because most jobs end up being I/O-bound.
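As a concrete sketch with the old elastic-mapreduce CLI and the configure-hadoop bootstrap action; the flag spelling follows the answer above, so double-check it against your EMR version before relying on it:
$ elastic-mapreduce --create --alive \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "--site-key-value,mapred.map.child.java.opts=-Xmx1g"
For reducers, mapred.reduce.child.java.opts can be set the same way.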

Hadoop on EC2 error: could only be replicated to 0 nodes, instead of 1

I'm running a very small Hadoop cluster on EC2.
I'm starting the cluster using Whirr (version whirr-0.2.0-incubating) with 1 (jobtracker + namenode) node and 4 (datanode + tasktracker) nodes:
whirr.hardware-id=c1.medium
whirr.instance-templates=1 jt+nn,4 dn+tt
whirr.provider=ec2
When I run my job I get the following error, not right away but after a while:
.....
11/01/17 17:31:47 INFO mapred.JobClient: map 100% reduce 66%
11/01/17 17:31:49 INFO mapred.JobClient: map 100% reduce 68%
11/01/17 17:31:52 INFO mapred.JobClient: map 100% reduce 70%
11/01/17 17:31:56 INFO mapred.JobClient: map 100% reduce 73%
11/01/17 17:32:01 INFO mapred.JobClient: Task Id : attempt_201101172141_0002_r_000000_0, Status : FAILED
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/root/test/test2/test3/_temporary/_attempt_201101172141_0002_r_000000_0/parts/81/part-00000 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
....
I found the same error in the tasktracker log file :
....
2011-01-17 22:31:36,968 INFO org.apache.hadoop.mapred.JobTracker: Adding task (cleanup)'attempt_201101172141_0002_m_000004_1' to tip task_201101172141_0002_m_000004, for tracker 'tracker_ip-11-222-333-444.ec2.internal:localhost/127.0.0.1:44840'
2011-01-17 22:31:39,972 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201101172141_0002_m_000004_1' from 'tracker_ip-11-222-333-444.ec2.internal:localhost/127.0.0.1:44840'
2011-01-17 22:31:57,985 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201101172141_0002_r_000000_0: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/root/dsp-test/test1/test2/_temporary/_attempt_201101172141_0002_r_000000_0/parts/81/part-00000 could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
In the job's configuration I have:
dfs.replication=2
mapred.child.java.opts=-server -Xmx180m -XX:ErrorFile=/mnt/hadoop/logs/logs/java/java_error.log
Does anyone have an idea why I'm getting this error?
Did you run out of disk space? Sometimes when you run out of disk space, Hadoop complains in this way.
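A quick, rough check on the cluster; the /mnt/hadoop path is taken from the job configuration above, so adjust it to wherever your HDFS data and mapred.local.dir actually live:
$ hadoop dfsadmin -report | grep -i remaining
$ df -h /mnt/hadoop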
