Can't exit or forceExit from Hadoop safe mode - hadoop

I am trying to delete a folder in hadoop:
hadoop fs -rm -R /seqr-reference-data/GRCh37/all_reference_data
But it is giving me an error:
INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
rm: Cannot delete /seqr-reference-data/GRCh37/all_reference_data. Name node is in safe mode.
I tried the solutions specified here:
Name node is in safe mode. Not able to leave
None of them work. It writes me Safe mode is OFF and then I see the same error. What else could I try doing?
Update
Just found a duplicate thread but it has no answer:
Not able to delete the data from hdfs, even after leaving safemode?

Related

DistCP - Even simple copies result in CRC Exceptions

I'm running into an issue using distcp to copy files - every copy fails with an IO Exception (Checksum mismatch), even if performing a simple copy within the cluster (i.e. hadoop distcp -pbugctrx /foo/bar /foo/baz).
If forced to complete the copy using -skipcrccheck, I can see that the checksum is different ( hdfs dfs -checksum ), but that this isn't being caused by a difference in the actual source data (hdfs dfs -cat | md5sum returns matching checksums for source and destination).
I'm leery of disabling a data integrity check if I don't need to. Is there a better way to address this failing check than just ignoring it.
Both the source and target may be in different encryption zones. In that case also the checksum will fail

HDFS Reoccurring Error: Under-Replicated Blocks

Every day our Hadoop Cluster reports that there are "Under-replicated blocks". It is managed through Cloudera Manager. An example of the health warning is:
! Under-Replicated Blocks
Concerning: 767 under replicated blocks in the cluster. 3,115 total blocks in the cluster. Percentage under replicated blocks: 24.62%. Warning threshold: 10.00%.
I have been running commands that fixes the problem, but the following morning the warning is back and sometimes without any new data being added. One of the temporarily successful commands was
hdfs dfs -setrep -R 2 /*
I have also tried another recommended command
su hdfs
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
for hdfsfile in `cat /tmp/under_replicated_files`; do echo "Fixing $hdfsfile :" ; hadoop fs -setrep 2 $hdfsfile; done
Both do work, but the fix isn't permanent.
In Cloudera Manager the Replication Factor and Minimal Block Replication are both set to 2.
Due to the problem only happening approximately once every 24h it is difficult and time consuming to attempt to fix, with trial and error being my only resort. I have no idea why this error keeps on coming back! Any suggestions would be appreciated. Thanks
Issue solved by setting the following HDFS configuration in Cloudera Manager:
Go to the HDFS service.
Click the Configuration tab.
Select Scope > NameNode.
Filesystem Trash Interval: 0 day(s)
Entering '0' disables the trash feature.
This property can also be configured using fs.trash.interval
Once I had set this I deleted all of the offending unreplicated trash blocks -
shown by looking in the under_replicated_files file produced by running the following command:
hdfs fsck / | grep 'Under replicated' | awk -F':' '{print $1}' >> /tmp/under_replicated_files
I ended up just deleting all of .Trash for the users.
This all then stopped anything else from being moved into .Trash once it was deleted (which I realise might not be an acceptable solution for everybody but that was perfectly fine for my use case). Also the removal of all of the unreplicated blocks meant that the warning disappeared.

Load a folder from LocalSystem to HDFS

I have a folder in my LocalSystem. It contains 1000 files, and I would move or copy him from my LocalSystem to HDFS
I tried by these two commands:
hadoop fs copyFromLocal C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp
And I also tried this command:
hdfs dfs -put /C:/Users/user/Downloads/ProjectSpark/ling-spam
/tmp/ling-spam
It displays an error message which says that my directory not found and yet I'm sure that correct.
I found a function getmerge() to move a folder from HDFS to LocalSystem, but I did not find the inverse.
Please, can you help me?
my VirtualBox on Windows, and i work by HDP2.3.2 with the console secure shell
You can't copy files from your Windows machine to HDFS. You have to first SCP the files into the VM (I recommend WinSCP or Filezilla) and only then can you use hadoop fs to put files onto HDFS.
The error was correct in that C:/Users/user/Downloads does not exist on the HDP sandbox because it's a Linux machine.
As noted, you can also try and use the Ambari HDFS file viewer, but I still standby by note that SCP is the official way because not all Hadoop systems have Ambari (or at least the HDFS file view for Ambari)
I would take the Mutual Information for classification of the word spam or ham. I have this operation: MI(Word)= ∑ Probabi(Occ,Class) * Log2 * (Probabi(Occuren,Class)/Probabi(Occurren) * Probabi(Class)).
I understand the function, I must compute 4 operation (true,ham), (false,ham), (true,spam) and (false,spam).
I do not understand who i do write exactly, in fact, I computed the number of the file in which in occur.
But I do not who exactly I must write in my function.
Thank you very much!
This isthe corps of my function:
def computeMutualInformationFactor(
probaWC:RDD[(String, Double)],// probability of occurrence of the word in a given class.
probaW:RDD[(String, Double)],// probability of occurrence of the word in whether class
probaC: Double, //probability an email appears in class (spam or ham)
probaDefault: Double // default value when a probability is missing
):RDD[(String, Double)] = {

Merging small files in hadoop

I have a directory (Final Dir) in HDFS in which some files(ex :10 mb) are loading every minute.
After some time i want to combine all the small files to a large file(ex :100 mb). But the user is continuously pushing files to Final Dir. it is a continuous process.
So for the first time i need to combine the first 10 files to a large file (ex : large.txt) and save file to Finaldir.
Now my question is how i will get the next 10 files excluding the first 10 files?
can some please help me
Here is one more alternate, this is still the legacy approach pointed out by #Andrew in his comments but with extra steps of making your input folder as a buffer to receive small files pushing them to a tmp directory in a timely fashion and merging them and pushing the result back to input.
step 1 : create a tmp directory
hadoop fs -mkdir tmp
step 2 : move all the small files to the tmp directory at a point of time
hadoop fs -mv input/*.txt tmp
step 3 -merge the small files with the help of hadoop-streaming jar
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/input" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
step 4- move the output to the input folder
hadoop fs -mv output/part-00000 input/large_file.txt
step 5 - remove output
hadoop fs -rm -R output/
step 6 - remove all the files from tmp
hadoop fs -rm tmp/*.txt
Create a shell script from step 2 till step 6 and schedule it to run at regular intervals to merge the smaller files at regular intervals (may be for every minute based on your need)
Steps to schedule a cron job for merging small files
step 1: create a shell script /home/abc/mergejob.sh with the help of above steps (2 to 6)
important note: you need to specify the absolute path of hadoop in the script to be understood by cron
#!/bin/bash
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv input/*.txt tmp
wait
/home/abc/hadoop-2.6.0/bin/hadoop jar /home/abc/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
-Dmapred.reduce.tasks=1 \
-input "/user/abc/input" \
-output "/user/abc/output" \
-mapper cat \
-reducer cat
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -mv output/part-00000 input/large_file.txt
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm -R output/
wait
/home/abc/hadoop-2.6.0/bin/hadoop fs -rm tmp/*.txt
step 2: schedule the script using cron to run every minute using cron expression
a) edit crontab by choosing an editor
>crontab -e
b) add the following line at the end and exit from the editor
* * * * * /bin/bash /home/abc/mergejob.sh > /dev/null 2>&1
The merge job will be scheduled to run for every minute.
Hope this was helpful.
#Andrew pointed you to a solution that was appropriate 6 years ago, in a batch-oriented world.
But it's 2016, you have a micro-batch data flow running and require a non-blocking solution.
That's how I would do it:
create an EXTERNAL table with 3 partitions, mapped on 3 directories
e.g. new_data, reorg and history
feed the new files into new_data
implement a job to run the batch compaction, and run it periodically
Now the batch compaction logic:
make sure that no SELECT query will be executed while the compaction is running, else it would return duplicates
select all files that are ripe for compaction (define your own
criteria) and move them from new_data directory to reorg
merge the content of all these reorg files, into a new file in history dir (feel free to GZip it on the fly, Hive will recognize the .gz extension)
drop the files in reorg
So it's basically the old 2010 story, except that your existing data flow can continue dumping new files into new_data while the compaction is safely running in separate directories. And in case the compaction job crashes, you can safely investigate / clean-up / resume the compaction without compromising the data flow.
By the way, I am not a big fan of the 2010 solution based on a "Hadoop Streaming" job -- on one hand, "streaming" has a very different meaning now; on the second hand, "Hadoop streaming" was useful in the old days but is now out of the radar; on the gripping hand [*] you can do it quite simply with a Hive query e.g.
INSERT INTO TABLE blahblah PARTITION (stage='history')
SELECT a, b, c, d
FROM blahblah
WHERE stage='reorg'
;
With a couple of SET some.property = somevalue before that query, you can define what compression codec will be applied on the result file(s), how many file(s) you want (or more precisely, how big you want the files to be - Hive will run the merge accordingly), etc.
Look into https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties under hive.merge.mapfiles and hive.merge.mapredfiles (or hive.merge.tezfiles if you use TEZ) and hive.merge.smallfiles.avgsize and then hive.exec.compress.output and mapreduce.output.fileoutputformat.compress.codec -- plus hive.hadoop.supports.splittable.combineinputformat to reduce the number of Map containers since your input files are quite small.
[*] very old SF reference here :-)

Hadoop, Mapreduce - Cannot obtain block length for LocatedBlock

I've a file on hdfs in the path 'test/test.txt' which is 1.3G
output of ls and du commands is:
hadoop fs -du test/test.txt -> 1379081672 test/test.txt
hadoop fs -ls test/test.txt ->
Found 1 items
-rw-r--r-- 3 testuser supergroup 1379081672 2014-05-06 20:27 test/test.txt
I want to run a mapreduce job on this file but when i start the mapreduce job on this file the job fails with the following error:
hadoop jar myjar.jar test.TestMapReduceDriver test output
14/05/29 16:42:03 WARN mapred.JobClient: Use GenericOptionsParser for parsing the
arguments. Applications should implement Tool for the same.
14/05/29 16:42:03 INFO input.FileInputFormat: Total input paths to process : 1
14/05/29 16:42:03 INFO mapred.JobClient: Running job: job_201405271131_9661
14/05/29 16:42:04 INFO mapred.JobClient: map 0% reduce 0%
14/05/29 16:42:17 INFO mapred.JobClient: Task Id : attempt_201405271131_9661_m_000004_0, Status : FAILED
java.io.IOException: Cannot obtain block length for LocatedBlock{BP-428948818-namenode-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode4:50010, datanode3:50010, datanode1:50010]}
at org.apache.hadoop.hdfs.DFSInputStream.readBlockLength(DFSInputStream.java:319)
at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:263)
at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:205)
at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:198)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1117)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:249)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:82)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:746)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:83)
at org.apache.hadoop.mapred.Ma`
i tried the following commands:
hadoop fs -cat test/test.txt gives the following error
cat: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode3:50010, datanode1:50010, datanode4:50010]}
additionally i can't copy the file hadoop fs -cp test/test.txt tmp gives same error:
cp: Cannot obtain block length for LocatedBlock{BP-428948818-10.17.56.16-1392736828725:blk_-6790192659948575136_8493225; getBlockSize()=36904392; corrupt=false; offset=1342177280; locs=[datanode1:50010, datanode3:50010, datanode4:50010]}
output of the hdfs fsck /user/testuser/test/test.txt command:
Connecting to namenode via `http://namenode:50070`
FSCK started by testuser (auth:SIMPLE) from /10.17.56.16 for path
/user/testuser/test/test.txt at Thu May 29 17:00:44 EEST 2014
Status: HEALTHY
Total size: 0 B (Total open files size: 1379081672 B)
Total dirs: 0
Total files: 0 (Files currently being written: 1)
Total blocks (validated): 0 (Total open file blocks (not validated): 21)
Minimally replicated blocks: 0
Over-replicated blocks: 0
Under-replicated blocks: 0
Mis-replicated blocks: 0
Default replication factor: 3
Average block replication: 0.0
Corrupt blocks: 0
Missing replicas: 0
Number of data-nodes: 5
Number of racks: 1
FSCK ended at Thu May 29 17:00:44 EEST 2014 in 0 milliseconds
The filesystem under path /user/testuser/test/test.txt is HEALTHY
by the way i can see the content of the test.txt file from the web browser.
hadoop version is: Hadoop 2.0.0-cdh4.5.0
I got the same issue with you and I fixed it by the following steps.
There are some files that opened by flume but never closed (I am not sure about your reason).
You need to find the name of the opened files by the command:
hdfs fsck /directory/of/locked/files/ -files -openforwrite
You can try to recover files as command:
hdfs debug recoverLease -path <path-of-the-file> -retries 3
Or removing them by the command:
hdfs dfs -rmr <path-of-the-file>
I had the same error, but it was not due to the full disk problem, and I think the inverse, where there were files and blocks referenced by in the namenode that did not exist on any datanodes.
Thus, hdfs dfs -ls shows the files, but any operation on them fails, e.g. hdfs dfs -copyToLocal.
In my case, the hard part was isolating which files were listed but corrupted, as they existed in a tree having thousands of files. Oddly, hdfs fsck /path/to/files/ did not report any problems.
My solution was:
Isolate the location using copyToLocal which resulted in copyToLocal: Cannot obtain block length for LocatedBlock{BP-1918381527-10.74.2.77-1420822494740:blk_1120909039_47667041; getBlockSize()=1231; corrupt=false; offset=0; locs=[10.74.2.168:50010, 10.74.2.166:50010, 10.74.2.164:50010]} for several files
Get a list of the local directories using ls -1 > baddirs.out
get rid of the local files from the first copyToLocal
use for files incat baddirs.out;do echo $files; hdfs dfs -copyToLocal $files This will produce a list of directories checks, and errors where files are found.
get rid of the local files again, and now get lists of files from each affected subdirectory. Use that as input to a file-by-file copyToLocal, at which point you can echo each file as it's copied, then see where the error occurs.
use hdfs dfs -rm <file> for each file.
Confirm you got 'em all be removing all local files again, and using the original copyToLocal on the top level directory where you had problems.
A simple two hour process!
You are having some corrupted files with no blocks on datanode but an entry in namenode. Best to follow this:
https://stackoverflow.com/a/19216037/812906
According to this this may be produced by a full disk problem. I came across the same problem recently with an old file and checking my servers metrics it effectively was a full disk problem during the creation of that file. Most solutions just claim to delete the file and prey for it not happening again.

Resources