Copy large files into HDFS - hadoop

I'm trying to copy a large file (32 GB) into HDFS. I've never had any trouble copying files into HDFS before, but those were all smaller. I'm using hadoop fs -put <myfile> <myhdfsfile>, and up to 13.7 GB everything goes well, but then I get this exception:
hadoop fs -put * /data/unprocessed/
Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: Input/output error
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:150)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:273)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at java.io.DataInputStream.read(DataInputStream.java:149)
at org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker.java:384)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:217)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:74)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:230)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:191)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1183)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:130)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1762)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
Caused by: java.io.IOException: Input/output error
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:242)
at org.apache.hadoop.fs.RawLocalFileSystem$TrackingFileInputStream.read(RawLocalFileSystem.java:91)
at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStream.read(RawLocalFileSystem.java:144)
... 20 more
When I check the log files (on my NameNode and DataNodes) I see that the lease on the file is removed but there's no reason specified. According to the log files everything went well. Here are the last lines of my NameNode log:
2013-01-28 09:43:34,176 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/unprocessed/AMR_EXPORT.csv. blk_-4784588526865920213_1001
2013-01-28 09:44:16,459 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.6.114:50010 is added to blk_-4784588526865920213_1001 size 30466048
2013-01-28 09:44:16,466 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on file /data/unprocessed/AMR_EXPORT.csv from client DFSClient_1738322483
2013-01-28 09:44:16,472 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /data/unprocessed/AMR_EXPORT.csv is closed by DFSClient_1738322483
2013-01-28 09:44:16,517 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 168 Total time for transactions(ms): 26Number of transactions batched in Syncs: 0 Number of syncs: 0 SyncTimes(ms): 0
Does anyone have a clue what's going on here? I've checked core-default.xml and hdfs-default.xml for properties I could override to extend the lease, but couldn't find one.

Some suggestions:
If you have multiple files to copy, use multiple separate -put sessions rather than one wildcard -put.
If there is only one large file, either compress it before copying, or split the large file into smaller ones and copy those (see the sketch after this list).
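A minimal sketch of the split-then-copy route (assuming the local file is the AMR_EXPORT.csv seen in the NameNode log in the question and the target is /data/unprocessed; adjust names and chunk size as needed):
split -b 1G AMR_EXPORT.csv AMR_EXPORT.csv.part_
for f in AMR_EXPORT.csv.part_*; do
  hadoop fs -put "$f" /data/unprocessed/
done
# or compress first, if whatever consumes the data later can read gzip:
gzip -c AMR_EXPORT.csv > AMR_EXPORT.csv.gz
hadoop fs -put AMR_EXPORT.csv.gz /data/unprocessed/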

This sounds like an issue reading the local file rather than a problem with the HDFS client. The stack trace shows a failure reading the local file that has bubbled all the way up. The lease is dropped because the client dropped the connection due to the IOException it hit while reading the local file.
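One way to confirm it's a local read problem (a sketch; it assumes the source file is the AMR_EXPORT.csv from the question, sitting in the current directory) is to read the file end to end locally, with HDFS out of the picture entirely:
dd if=AMR_EXPORT.csv of=/dev/null bs=1M
# or simply
md5sum AMR_EXPORT.csv
# If either command fails with an I/O error around the same offset (~13.7 GB),
# the culprit is the local disk or filesystem, not HDFS.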

Related

hdfs: Failed to place enough replicas: expected size is 2 but only 0 storage types can be selected

It was working fine; I'm not sure what changed, but I suddenly got this error.
HDFS is running on a Docker cluster (1 RM + 2 nodes).
Copying works fine from inside the containers, so the datanodes themselves are not the problem.
The error is thrown when copying a file from the host machine to HDFS, whether through code or the hdfs commands.
The error stack below is from hadoop-root-namenode-master.log.
2017-10-26 09:18:05,102 WARN org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough replicas: expected size is 2 but only 0 storage types can be selected (replication=2, selected=[], unavailable=[DISK], removed=[DISK, DISK], policy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
2017-10-26 09:18:05,102 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 2 to reach 2 (unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}, newBlock=true) All required storage types are unavailable: unavailableStorages=[DISK], storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]}
2017-10-26 09:18:05,103 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 192.168.1.6:52435 Call#7 Retry#0
java.io.IOException: File /spark/apps/spark-app.jar could only be replicated to 0 nodes instead of minReplication (=1). There are 2 datanode(s) running and 2 node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
Source code is:
System.setProperty("HADOOP_USER_NAME", "root");
Configuration conf = new Configuration();
conf.addResource(new Path("src/main/resources/core-site.xml"));
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("jars/spark-app.jar"), new Path("/spark/apps/spark-app.jar"));
It took me a lot of time searching, but after I changed the log4j level to DEBUG I quickly located the problem. The log shows the client connecting to "172.20.0.3", which is the IP of a datanode and is not reachable from the host. I then added a route to 172.20.0.0/16 (route add -p ...), and it works fine now.
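For reference, the client-side DEBUG logging that exposed the problem can be switched on for the CLI like this (a sketch; for the Java client the equivalent is a log4j.properties change on the classpath):
export HADOOP_ROOT_LOGGER=DEBUG,console
hdfs dfs -copyFromLocal jars/spark-app.jar /spark/apps/spark-app.jar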
When trying to copy a file to my cluster I received a similar message. All I had to do was stop the firewall on my datanodes, using:
service firewalld stop
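If you'd rather not disable the firewall entirely, a narrower alternative (an assumption, not something the answer above did; it presumes firewalld and the default 50010 datanode transfer port) is to open just that port:
firewall-cmd --state
firewall-cmd --permanent --add-port=50010/tcp
firewall-cmd --reload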

How to put large data sets in HDFS?

I've tried to put large datasets (about 200 folders) into HDFS.
But I got these errors:
WARN hdfs.DFSClient: Slow waitForAckedSeqno took 72699ms;
INFO hdfs.DFSClient: Excluding datanode DatanodeInfoWithStorage[192.168.111.3:50010;
java.io.IOException: Got error, status message, ask with firstBadLink as 192.168.111.3:50010
at org.apache.hadoop.hdfs.protocol.datatransfer.DataTransferProtoUtil.checkBlockOpStatus(DataTransferProtoUtil.java:140)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.createBlockOutputStream(DFSOutputStream.java:1363)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1266)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:449)
I used this command once per folder rather than on everything at once: hdfs dfs -put "each folder" /hadoopPath
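For reference, that per-folder copy could be scripted like this (a sketch; it assumes the folders live under ./datasets and the target is /hadoopPath):
for d in ./datasets/*/; do
  hdfs dfs -put "$d" /hadoopPath/
done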
Is there a solution to address these errors?

Shuffle, merger and fetcher errors when processing large files in hadoop

I am running a word-count-like MapReduce job that processes 200 files of 1 GB each. The job runs on a Hadoop cluster of 4 datanodes (2 CPUs each) with 8 GB of memory and about 200 GB of space. I have tried various configuration options, but every time my job fails with InMemory Shuffle, OnDisk Shuffle, InMemory Merger, OnDisk Merger, or Fetcher errors.
The size of the mapper output is comparable to the size of the input files, so to minimise the map output size I am using BZip2 compression for the MapReduce output. However, even with compressed map output I still get errors in the reduce phase. I use 4 reducers. I have therefore tried various configurations of the Hadoop cluster.
The standard configuration of the cluster was:
Default virtual memory for a job's map-task 3328 Mb
Default virtual memory for a job's reduce-task 6656 Mb
Map-side sort buffer memory 205 Mb
Mapreduce Log Dir Prefix /var/log/hadoop-mapreduce
Mapreduce PID Dir Prefix /var/run/hadoop-mapreduce
yarn.app.mapreduce.am.resource.mb 6656
mapreduce.admin.map.child.java.opts -Djava.net.preferIPv4Stack=TRUE -Dhadoop.metrics.log.level=WARN
mapreduce.admin.reduce.child.java.opts -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
mapreduce.admin.user.env LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop/lib/native/`$JAVA_HOME/bin/java -d32 -version &> /dev/null;if [ $? -eq 0 ]; then echo Linux-i386-32; else echo Linux-amd64-64;fi`
mapreduce.am.max-attempts 2
mapreduce.application.classpath $HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*
mapreduce.cluster.administrators hadoop
mapreduce.framework.name yarn
mapreduce.job.reduce.slowstart.completedmaps 0.05
mapreduce.jobhistory.address ip-XXXX.compute.internal:10020
mapreduce.jobhistory.done-dir /mr-history/done
mapreduce.jobhistory.intermediate-done-dir /mr-history/tmp
mapreduce.jobhistory.webapp.address ip-XXXX.compute.internal:19888
mapreduce.map.java.opts -Xmx2662m
mapreduce.map.log.level INFO
mapreduce.map.output.compress true
mapreduce.map.sort.spill.percent 0.7
mapreduce.map.speculative false
mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.type BLOCK
mapreduce.reduce.input.buffer.percent 0.0
mapreduce.reduce.java.opts -Xmx5325m
mapreduce.reduce.log.level INFO
mapreduce.reduce.shuffle.input.buffer.percent 0.7
mapreduce.reduce.shuffle.merge.percent 0.66
mapreduce.reduce.shuffle.parallelcopies 30
mapreduce.reduce.speculative false
mapreduce.shuffle.port 13562
mapreduce.task.io.sort.factor 100
mapreduce.task.timeout 300000
yarn.app.mapreduce.am.admin-command-opts -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN
yarn.app.mapreduce.am.command-opts -Xmx5325m
yarn.app.mapreduce.am.log.level INFO
yarn.app.mapreduce.am.staging-dir /user
mapreduce.map.maxattempts 4
mapreduce.reduce.maxattempts 4
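One thing the listing above does not show is the map-output codec. Since the question mentions BZip2, the compression side presumably also includes properties along these lines (an assumption on my part, shown for reference in the same property/value format):
mapreduce.map.output.compress.codec org.apache.hadoop.io.compress.BZip2Codec
mapreduce.output.fileoutputformat.compress.codec org.apache.hadoop.io.compress.BZip2Codec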
This configuration gave me the following error:
14/05/16 20:20:05 INFO mapreduce.Job: map 20% reduce 3%
14/05/16 20:27:13 INFO mapreduce.Job: map 20% reduce 0%
14/05/16 20:27:13 INFO mapreduce.Job: Task Id : attempt_1399989158376_0049_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in InMemoryMerger - Thread to merge in-memory shuffled map-outputs
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1399989158376_0049_r_000000_0/map_2038.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl$InMemoryMerger.merge(MergeManagerImpl.java:450)
at org.apache.hadoop.mapreduce.task.reduce.MergeThread.run(MergeThread.java:94)
Then I tried changing various options, hoping to reduce the load during the shuffle phase, but I got the same error:
mapreduce.reduce.shuffle.parallelcopies 5
mapreduce.task.io.sort.factor 10
or
mapreduce.reduce.shuffle.parallelcopies 10
mapreduce.task.io.sort.factor 20
Then I realised that the tmp directories on my datanodes did not exist, so all the merging and shuffling was happening in memory; I therefore created them manually on each datanode.
I've kept the initial configuration but increased the time delay before the reducer starts in order to limit the load on the datanode.
mapreduce.job.reduce.slowstart.completedmaps 0.7
I've also tried increasing the io.sort.mb:
mapreduce.task.io.sort.mb from 205 to 512.
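For reference, settings like these can also be overridden per job on the command line, assuming the driver uses ToolRunner/GenericOptionsParser (the jar and class names here are placeholders):
hadoop jar wordcount.jar WordCount \
  -D mapreduce.job.reduce.slowstart.completedmaps=0.7 \
  -D mapreduce.task.io.sort.mb=512 \
  /input /output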
However, now I get the following OnDiskMerger error:
14/05/26 12:17:08 INFO mapreduce.Job: map 62% reduce 21%
14/05/26 12:20:13 INFO mapreduce.Job: Task Id : attempt_1400958508328_0021_r_000000_0, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in OnDiskMerger - Thread to merge on-disk map-outputs
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for hadoop/yarn/local/usercache/eoc21/appcache/application_1400958508328_0021/output/attempt_1400958508328_0021_r_000000_0/map_590.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl$OnDiskMerger.merge(MergeManagerImpl.java:536)
at org.apache.hadoop.mapreduce.task.reduce.MergeThread.run(MergeThread.java:94)
The reducer dropped down to 0% and when it got back to 17% I got the following error:
14/05/26 12:32:03 INFO mapreduce.Job: Task Id : attempt_1400958508328_0021_r_000000_1, Status : FAILED
Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#22
at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:121)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/attempt_1400958508328_0021_r_000000_1/map_1015.out
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:398)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:150)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:131)
at org.apache.hadoop.mapred.YarnOutputFiles.getInputFileForWrite(YarnOutputFiles.java:213)
at org.apache.hadoop.mapreduce.task.reduce.OnDiskMapOutput.<init>(OnDiskMapOutput.java:61)
at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:257)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:411)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:341)
at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
I read around and it seems that "Could not find any valid local directory for output/attempt_1400958508328_0021_r_000000_1/map_1015.out" is related to not having enough space on the node for the spill. However, I checked the datanode and it seems there is enough space:
Filesystem Size Used Avail Use% Mounted on
/dev/xvde1 40G 22G 18G 56% /
none 3.6G 0 3.6G 0% /dev/shm
/dev/xvdj 1008G 758G 199G 80% /hadoop/hdfs/data
So I'm not sure what to try any more. Is the cluster too small to process such jobs? Do I need more space on the datanodes? Is there a way to find an optimal configuration for the job on Hadoop? Any suggestion is highly appreciated!
It could be one of four things I know of, the most likely being the point you made in your question about disk space, or a similar problem (inodes):
Files being deleted by another process (unlikely, unless you remember doing this yourself)
Disk error (unlikely)
Not enough disk space
Not enough inodes (run df -i)
Even if you run df -h and df -i before and after the job, you don't know how much is being eaten up and cleaned away during the job. So while your job is running, I suggest watching these numbers, logging them to a file, graphing them, etc. For example:
watch "df -h && df -i"
You need to specify some temp directories to store the intermediate map and reduce output.
It may be that you have not specified any temp directories, so Hadoop could not find any valid directory in which to store the intermediate data.
You can do this by editing mapred-site.xml:
<property>
<name>mapred.local.dir</name>
<value>/temp1,/temp2,/temp3</value>
</property>
This is a comma-separated list of paths on the local filesystem where temporary MapReduce data is written. Multiple paths help spread disk I/O.
After these temp directories are specified, intermediate map and reduce output will be stored in them using one of the strategies below:
random: In this case, the intermediate data for reduce tasks is stored at a data location chosen at random.
max: In this case, the intermediate data for reduce tasks is stored at a data location with the most available space.
roundrobin: In this case, the mappers and reducers pick disks through round-robin scheduling for storing intermediate data at the job level within the number of local disks. The job ID is used to create unique sub directories on the local disks to store the intermediate data for each job.
You can set this strategy with the following property in mapred-site.xml, for example:
<property>
<name>mapreduce.job.local.dir.locator</name>
<value>max</value>
</property>
By default in Hadoop it is roundrobin.
The directories themselves come from "mapreduce.cluster.local.dir" (old deprecated name: mapred.local.dir), specified in mapred-site.xml.

HBase distributed log-splitting keeps failing because unable to get a lease

We used up all the free space on our test HDFS cluster, so HBase crashed. After cleaning up some space, we were able to restart HBase, but after startup a distributed log-splitting job keeps failing.
The job looks like this:
Splitting log file hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 into a temporary staging area.
The RegionServers try to acquire a lease on the file for some time:
2013-10-24 11:50:47,662 DEBUG org.apache.hadoop.hbase.regionserver.SplitLogWorker: tasks arrived or departed
2013-10-24 11:50:47,671 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: worker host-4,60020,1382614844870 acquired task /hbase/splitlog/hdfs%3A%2F%2F192.168.249.1%3A9000%2Fhdfs%2Fhbase%2F.logs%2Fhost-3%2C60020%2C1382113928374-splitting%2Fhost-3%252C60020%252C1382113928374.1382523937002
2013-10-24 11:50:47,672 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002, length=41274332
2013-10-24 11:50:47,672 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering lease on dfs file hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002
2013-10-24 11:50:47,673 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=0 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 1ms
2013-10-24 11:50:50,674 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=1 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 3002ms
2013-10-24 11:50:51,674 DEBUG org.apache.hadoop.hbase.util.FSHDFSUtils: isFileClosed not available
2013-10-24 11:51:51,680 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: recoverLease=false, attempt=2 on file=hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 after 64008ms
Then the master aborts the job:
2013-10-24 11:55:48,685 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: Sending interrupt to stop the worker thread
2013-10-24 11:55:48,687 WARN org.apache.hadoop.hbase.regionserver.SplitLogWorker: log splitting of hdfs://192.168.249.1:9000/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 interrupted, resigning
java.io.InterruptedIOException
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:136)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverFileLease(FSHDFSUtils.java:54)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:780)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:414)
at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:112)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:280)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:211)
at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:179)
at java.lang.Thread.run(Thread.java:724)
Caused by: java.lang.InterruptedException: sleep interrupted
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.hbase.util.FSHDFSUtils.recoverDFSFileLease(FSHDFSUtils.java:118)
... 9 more
It seems to me that the problem is that the RegionServers are unable to get a lease on this file because it's already open, so I checked with sudo -u hdfs hadoop fsck /hdfs/hbase/.logs/ -openforwrite, and it confirms this:
OPENFORWRITE: /hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002 41274332 bytes, 1 block(s), OPENFORWRITE:
/hdfs/hbase/.logs/host-3,60020,1382113928374-splitting/host-3%2C60020%2C1382113928374.1382523937002: Under replicated blk_1073337163743094520_3534698. Target Replicas is 3 but found 2 replica(s).
I tried to shut down HBase, but the file stays OPENFORWRITE. How can I remove this flag?
PS: Hadoop 1.0.1, HBase 0.94.12

ERROR namenode.FSNamesystem: FSNamesystem initialization failed

I'm running Hadoop in pseudo-distributed mode on an Ubuntu VM. I recently decided to increase the RAM and the number of cores available to the VM, and that seems to have completely broken HDFS. First, it was stuck in safe mode, which I manually released using:
hadoop dfsadmin -safemode leave
Then I ran:
hadoop fsck -blocks
and practically every block was corrupt or missing. Since this is just for my learning, I deleted everything in "/user/msknapp" and everything in "/var/lib/hadoop-0.20/cache/mapred/mapred/.settings", and the block errors were gone. Then I tried:
hadoop fs -put myfile myfile
and got this (abridged):
12/01/07 15:05:29 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/msknapp/myfile could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1490)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:653)
at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
12/01/07 15:05:29 WARN hdfs.DFSClient: Error Recovery for block null bad datanode[0] nodes == null
12/01/07 15:05:29 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/msknapp/myfile" - Aborting...
put: java.io.IOException: File /user/msknapp/myfile could only be replicated to 0 nodes, instead of 1
12/01/07 15:05:29 ERROR hdfs.DFSClient: Exception closing file /user/msknapp/myfile : org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/msknapp/myfile could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1490)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:653)
at ...
org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/msknapp/myfile could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1490)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:653)
at ...
So I tried to stop and restart the namenode and datanode. No luck:
hadoop namenode
12/01/07 15:13:47 ERROR namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.FileNotFoundException: /var/lib/hadoop-0.20/cache/hadoop/dfs/name/image/fsimage (Permission denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.hadoop.hdfs.server.namenode.FSImage.isConversionNeeded(FSImage.java:683)
at org.apache.hadoop.hdfs.server.common.Storage.checkConversionNeeded(Storage.java:690)
at org.apache.hadoop.hdfs.server.common.Storage.access$000(Storage.java:60)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:469)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:297)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)
12/01/07 15:13:47 ERROR namenode.NameNode: java.io.FileNotFoundException: /var/lib/hadoop-0.20/cache/hadoop/dfs/name/image/fsimage (Permission denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
at org.apache.hadoop.hdfs.server.namenode.FSImage.isConversionNeeded(FSImage.java:683)
at org.apache.hadoop.hdfs.server.common.Storage.checkConversionNeeded(Storage.java:690)
at org.apache.hadoop.hdfs.server.common.Storage.access$000(Storage.java:60)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:469)
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:297)
at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:99)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:358)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:327)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:271)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:465)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1239)
at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1248)
12/01/07 15:13:47 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
Would somebody please help me out here? I have been trying to fix this for hours.
Go into where you have configured HDFS, delete everything there, format the namenode, and you are good to go. This usually happens if you don't shut down your cluster properly!
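A rough sketch of that reset for a pseudo-distributed setup like this one (assumptions: a tarball-style install whose dfs name/data directories live under the /var/lib/hadoop-0.20/cache/hadoop/dfs path seen in the log; this wipes all HDFS data, which is acceptable here since it's a learning setup):
# stop the daemons first (on a packaged install, use the service/init scripts instead)
stop-all.sh
# wipe the name and data directories
rm -rf /var/lib/hadoop-0.20/cache/hadoop/dfs/name/* /var/lib/hadoop-0.20/cache/hadoop/dfs/data/*
# reformat the namenode and restart
hadoop namenode -format
start-all.sh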
The following error means the fsimage file does not have the right permissions:
namenode.NameNode: java.io.FileNotFoundException: /var/lib/hadoop-0.20/cache/hadoop/dfs/name/image/fsimage (Permission denied)
So give the fsimage file permissions:
$ chmod -R 777 fsimage
