Does Hadoop copyFromLocal create 2 copies - one inside HDFS and another inside the datanode?

I have installed a pseudo-distributed, standalone Hadoop setup on Ubuntu, running inside VMware on my Windows 10 machine.
I downloaded a file from the internet and copied it into the Ubuntu local directory /lab/data.
I created namenode and datanode folders (not the Hadoop folder) named namenodep and datan1 in Ubuntu. I also created a folder inside HDFS called /input.
When I copied the file from the Ubuntu local filesystem to HDFS, why is the file present in both of the directories below?
$ hadoop fs -copyFromLocal /lab/data/Civil_List_2014.csv /input
$ hadoop fs -ls /input/
input/Civil_List_2014.csv ?????
$ cd lab/hdfs/datan1/current
blk_3621390486220058643 ?????
blk_3621390486220058643_1121.meta
Basically, I want to understand whether it created 2 copies: one inside the datan1 folder and the other inside HDFS.
Thanks

No. Only one copy is created.
When you create a file in HDFS, the contents of the file are stored on one of the disks of the Data Node. The disk location where the Data Node stores the data is determined by the configuration parameter: dfs.datanode.data.dir (present in hdfs-site.xml)
Check the description of this property:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///e:/hdpdatadn/dn</value>
  <description>Determines where on the local filesystem a DFS data node
  should store its blocks. If this is a comma-delimited
  list of directories, then data will be stored in all named
  directories, typically on different devices.
  Directories that do not exist are ignored.
  </description>
  <final>true</final>
</property>
So in your case, the contents of the HDFS file "/input/Civil_List_2014.csv" are stored in the physical location lab/hdfs/datan1/current/blk_3621390486220058643.
"blk_3621390486220058643_1121.meta" contains the checksum of the data stored in "blk_3621390486220058643".
Your file may be small enough to fit in a single block. But if a file is big (say > 256 MB with a Hadoop block size of 256 MB), Hadoop splits the contents of the file into 'n' blocks and stores them on disk. In that case, you will see 'n' "blk_*" files in the data node's data directory.
Also, since the replication factor on a real cluster is typically set to 3, 3 instances of the same block are created (on a pseudo-distributed, single-node setup it is usually 1, which is why you see just one copy).
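If you want to verify this yourself, the two commands below (using the path from the question; use hadoop fsck on older releases, and expect the output format to vary slightly between versions) print the file's replication factor and block size, and then the block-to-datanode mapping:

hadoop fs -stat "%r %o %n" /input/Civil_List_2014.csv
hdfs fsck /input/Civil_List_2014.csv -files -blocks -locations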

The output of the hadoop fs -ls /input/ command shows metadata, not a physical file; it is a logical abstraction over the blocks hosted by the datanodes. This metadata is stored by the NameNode.
The actual physical files are split into blocks and hosted by the datanodes in the path specified in the configuration, in your case lab/hdfs/datan1/current.
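To double-check which local directory your datanode is actually writing blocks to (and confirm it matches datan1), you can query the effective configuration with hdfs getconf on recent Hadoop versions; note that on older 1.x releases the key is dfs.data.dir instead:

hdfs getconf -confKey dfs.datanode.data.dir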

Related

Merging small files into a single file in HDFS

In an HDFS cluster, I receive multiple files on a daily basis, which can be of 3 types:
1) product_info_timestamp
2) user_info_timestamp
3) user_activity_timestamp
Any number of files can be received, but each will belong to one of these 3 categories only.
I want to merge all the files belonging to one category (after checking that each is less than 100 MB) into a single file.
For example, 3 files named product_info_* should be merged into one file named product_info.
How do I achieve this?
You can use getmerge to achieve this, but the result will be stored on your local node (edge node), so you need to be sure you have enough space there.
hadoop fs -getmerge /hdfs_path/product_info_* /local_path/product_inf
You can move it back to HDFS with put:
hadoop fs -put /local_path/product_inf /hdfs_path
You can use a Hadoop archive (.har file) or a sequence file. Both are very simple to use - just search for "hadoop archive" or "sequence file".
Another set of commands along similar lines to those suggested by @SCouto:
hdfs dfs -cat /hdfs_path/product_info_* > /local_path/product_info_combined.txt
hdfs dfs -put /local_path/product_info_combined.txt /hdfs_path/
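If you want to handle all three categories in one pass, a small shell loop like the one below works (a minimal sketch: /hdfs_path and /local_path are the placeholder directories from the answers above, and the 100 MB pre-check from the question is left out):

for category in product_info user_info user_activity; do
  # merge all files of this category to the local disk, then push the single file back
  hdfs dfs -getmerge /hdfs_path/${category}_* /local_path/${category}
  hdfs dfs -put /local_path/${category} /hdfs_path/${category}
done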

Load a folder from LocalSystem to HDFS

I have a folder on my local system. It contains 1000 files, and I would like to move or copy it from my local system to HDFS.
I tried these two commands:
hadoop fs copyFromLocal C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp
And I also tried this command:
hdfs dfs -put /C:/Users/user/Downloads/ProjectSpark/ling-spam /tmp/ling-spam
It displays an error message saying that my directory is not found, yet I'm sure the path is correct.
I found the getmerge command for moving a folder from HDFS to the local system, but I did not find the inverse.
Please, can you help me?
My VM runs in VirtualBox on Windows, and I work with HDP 2.3.2 through a secure shell console.
You can't copy files from your Windows machine to HDFS. You have to first SCP the files into the VM (I recommend WinSCP or Filezilla) and only then can you use hadoop fs to put files onto HDFS.
The error was correct in that C:/Users/user/Downloads does not exist on the HDP sandbox because it's a Linux machine.
As noted, you can also try the Ambari HDFS Files view, but I still stand by my note that SCP is the standard way, because not all Hadoop systems have Ambari (or at least the HDFS Files view for Ambari).
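A minimal sketch of that two-step flow is below; the SSH port, user and host are placeholders for your sandbox's settings (the HDP sandbox commonly maps SSH to a non-standard port), and the Windows path is the one from the question:

# Step 1: from Windows, copy the folder into the VM (WinSCP/Filezilla do the same thing graphically)
scp -r -P <ssh_port> C:/Users/user/Downloads/ProjectSpark/ling-spam <user>@<sandbox_host>:/home/<user>/ling-spam

# Step 2: inside the VM, push the now-local folder into HDFS
hadoop fs -put /home/<user>/ling-spam /tmp/ling-spam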
I want to use Mutual Information to classify a word as spam or ham. The formula is: MI(Word) = sum over (Occurrence, Class) of P(Occurrence, Class) * log2( P(Occurrence, Class) / (P(Occurrence) * P(Class)) ).
I understand the formula: I must compute the 4 terms (true, ham), (false, ham), (true, spam) and (false, spam).
I do not understand what exactly I should write; so far I have computed the number of files in which each word occurs.
But I do not know exactly what I must write in my function.
Thank you very much!
This is the body of my function:
def computeMutualInformationFactor(
  probaWC: RDD[(String, Double)],  // probability of occurrence of the word in a given class
  probaW: RDD[(String, Double)],   // probability of occurrence of the word in either class
  probaC: Double,                  // probability that an email belongs to the class (spam or ham)
  probaDefault: Double             // default value when a probability is missing
): RDD[(String, Double)] = {

Checksum verification in Hadoop

Do we need to verify checksums after we move files to Hadoop (HDFS) from a Linux server through WebHDFS?
I would like to make sure the files on HDFS have no corruption after they are copied. But is checking checksums necessary?
I read that the client computes checksums before data is written to HDFS.
Can somebody help me understand how I can make sure that the source file on the Linux system is the same as the ingested file on HDFS when using WebHDFS?
If your goal is to compare two files residing on HDFS, I would not use "hdfs dfs -checksum URI" as in my case it generates different checksums for files with identical content.
In the below example I am comparing two files with the same content in different locations:
Old-school md5sum method returns the same checksum:
$ hdfs dfs -cat /project1/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
$ hdfs dfs -cat /project2/file.txt | md5sum
b9fdea463b1ce46fabc2958fc5f7644a -
However, checksum generated on the HDFS is different for files with the same content:
$ hdfs dfs -checksum /project1/file.txt
0000020000000000000000003e50be59553b2ddaf401c575f8df6914
$ hdfs dfs -checksum /project2/file.txt
0000020000000000000000001952d653ccba138f0c4cd4209fbf8e2e
A bit puzzling, as I would expect identical checksums for identical content. (A likely explanation: hdfs dfs -checksum returns an MD5-of-MD5-of-CRC composite that also depends on the block size and bytes-per-checksum the file was written with, so byte-identical files can still report different checksums if those settings differ.)
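A quick way to test that hypothesis is to compare the block sizes the two files were written with; if %o differs, the composite checksums will differ even for identical bytes (the paths are the ones from the example above):

$ hdfs dfs -stat "%o %n" /project1/file.txt /project2/file.txt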
Checksum for a file can be calculated using hadoop fs command.
Usage: hadoop fs -checksum URI
Returns the checksum information of a file.
Example:
hadoop fs -checksum hdfs://nn1.example.com/file1
hadoop fs -checksum file:///path/in/linux/file1
Refer to the Hadoop documentation for more details.
So if you want to compare file1 on both Linux and HDFS, you can use the above utility.
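Note, though, that the value returned by hadoop fs -checksum is a composite HDFS checksum and is not directly comparable to a plain md5sum of the Linux file. A simple, version-independent comparison is to hash the byte streams on both sides (a minimal sketch; /local_path/file1 and /hdfs_path/file1 are placeholder paths):

$ md5sum /local_path/file1
$ hdfs dfs -cat /hdfs_path/file1 | md5sum

If the two digests match, the ingested file is byte-identical to the source.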
I wrote a library with which you can calculate the checksum of a local file, just the way Hadoop does it for HDFS files.
So, you can compare the checksum to cross check.
https://github.com/srch07/HDFSChecksumForLocalfile
If you are doing this check via the API:
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
Option 1: for the value b9fdea463b1ce46fabc2958fc5f7644a
val md5:String = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("/project1/file.txt"))).toString
Option 2: for the value 3e50be59553b2ddaf401c575f8df6914
val md5: String = FileSystem.get(hadoopConfiguration).getFileChecksum(new Path("/project1/file.txt")).toString.split(":")(0)
Hadoop does CRC checks by itself: for each and every file it creates a .crc file to make sure there is no corruption.

Checking filesize and its distribution in HDFS

Is it possible to know a file's size in blocks and its distribution over DataNodes in Hadoop?
Currently I am using:
frolo#A11:~/hadoop> $HADOOP_HOME/bin/hadoop dfs -stat "%b %o %r %n" /user/frolo/input/rmat-*
318339 67108864 1 rmat-10.0
392835957 67108864 1 rmat-20.0
This does not show the actual number of blocks created after uploading the file to HDFS, and I don't know any way to find out their distribution.
Thanks,
Alex
The %r in your stat command shows the replication factor of the queried file. If this is 1, it means there will be only a single replica across the cluster for blocks belonging to this file. The hadoop fs -ls output also shows this value for listed files as one of its numeric columns, as the replication factor is a per-file FS attribute.
If you are looking to find where the blocks reside instead, you are looking for hdfs fsck (or hadoop fsck if using a dated release) instead. The below, for example, will let you see the list of block IDs and their respective set of resident locations, for any file:
hdfs fsck /user/frolo/input/rmat-10.0 -files -blocks -locations
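If you also want a cluster-wide view of how much data each DataNode holds (rather than per-file block locations), the DFS admin report prints per-node capacity and usage; note that on most clusters it requires HDFS superuser privileges:

hdfs dfsadmin -report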

Change block size of dfs file

My map phase is currently inefficient when parsing one particular set of files (2 TB in total). I'd like to change the block size of files in the Hadoop DFS (from 64 MB to 128 MB). I can't find how to do this in the documentation for just one set of files rather than the entire cluster.
Which command changes the block size at upload time (such as when copying from local to the DFS)?
For me, I had to slightly change Bkkbrad's answer to get it to work with my setup, in case anyone else finds this question later on. I've got Hadoop 0.20 running on Ubuntu 10.10:
hadoop fs -D dfs.block.size=134217728 -put local_name remote_location
The setting for me is not fs.local.block.size but rather dfs.block.size
I've changed my answer! You just need to set the fs.local.block.size configuration setting appropriately when you use the command line:
hadoop fs -D fs.local.block.size=134217728 -put local_name remote_location
Original Answer
You can programmatically specify the block size when you create a file with the Hadoop API. Unfortunately, you can't do this on the command line with the hadoop fs -put command. To do what you want, you'll have to write your own code to copy the local file to a remote location; it's not hard: just open a FileInputStream for the local file, create the remote OutputStream with FileSystem.create, and then use something like IOUtils.copy from Apache Commons IO to copy between the two streams.
In the conf/ folder you can change the value of dfs.block.size in the configuration file hdfs-site.xml.
In Hadoop version 1.0 the default size is 64 MB and in version 2.0 the default size is 128 MB.
<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>Block size</description>
</property>
You can also modify the block size in your programs like this:
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 128 * 1024 * 1024);  // 128 MB, in bytes
We can change the block size using the property named dfs.block.size in the hdfs-site.xml file.
Note:
We should specify the size in bytes.
For example:
134217728 bytes = 128 MB.
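To confirm that a freshly uploaded file actually picked up the new block size, you can query it afterwards; the path below is a placeholder, and %o prints the block size in bytes (134217728 for 128 MB):

hadoop fs -stat "%o %n" /path/to/uploaded_file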
