Scheduled data load into Hadoop

Just wondering what the best way is to bulk load data from various sources into HDFS, mainly from FTP locations / file servers, at scheduled times with a regular frequency.
I know the Sqoop / Oozie combination can be used for RDBMS data. However, I'm wondering what the best way is to load unstructured data into HDFS with a scheduling mechanism.

You can do it with shell scripting. I can guide you with some code:
hadoop fs -cp ftp://uname:password@ftp2.xxxxa.com/filename hdfs://IPofhdfs/user/root/Logs/
Some points:
1. Find the new files in the FTP source folder by comparing the HDFS destination with the file names.
2. Pass the new file names to the HDFS copy command.
--- list all files on the FTP server and store the list in AllFiles.txt ---
ftp -in ftp2.xxxx.com << SCRIPTEND
user uname password
lcd /home/Analytics/TempFiles
binary
ls . AllFiles.txt
quit
SCRIPTEND
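Putting those pieces together, here is a minimal sketch of a scheduled loader. The host name, credentials, and paths are placeholders, and the exact ftp client syntax may vary with your environment:

#!/bin/bash
# ftp_to_hdfs.sh -- sketch only; adjust the host, credentials, and paths for your setup.
FTP_HOST="ftp2.example.com"
STAGE="/home/Analytics/TempFiles"
HDFS_DEST="/user/root/Logs"

# 1. Get a plain listing of remote file names into $STAGE/AllFiles.txt.
ftp -in "$FTP_HOST" <<LISTEND
user uname password
lcd $STAGE
nlist . AllFiles.txt
quit
LISTEND

# 2. List the file names already in HDFS and keep only the new ones.
hdfs dfs -ls "$HDFS_DEST" | awk 'NF>=8 {n=split($8,a,"/"); print a[n]}' > "$STAGE/InHdfs.txt"
grep -v -x -f "$STAGE/InHdfs.txt" "$STAGE/AllFiles.txt" > "$STAGE/NewFiles.txt"

# 3. Copy each new file straight from FTP into HDFS.
while read -r f; do
    hadoop fs -cp "ftp://uname:password@$FTP_HOST/$f" "$HDFS_DEST/"
done < "$STAGE/NewFiles.txt"

For the scheduling part, a cron entry such as "0 2 * * * /home/Analytics/ftp_to_hdfs.sh" runs the script every night at 2 AM; an Oozie coordinator with a shell action works just as well if you already run Oozie.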
Let me know if you need any more info.

Related

How can I append multiple files in HDFS to a single file in HDFS without the help of local file system?

I am learning Hadoop and I have run into a problem. I ran a MapReduce job and the output was stored in multiple files, not as a single file. I want to append all of them into a single file in HDFS. I know about the appendToFile and getmerge commands, but they only work from the local file system to HDFS or from HDFS to the local file system, not from HDFS to HDFS. Is there any way to append the output files in HDFS into a single file in HDFS without touching the local file system?
The only way to do this would be to force your MapReduce job to use one reducer, for example by sorting all the results by a single key.
However, this defeats the purpose of having a distributed filesystem and multiple processors. All Hadoop jobs should be able to read a directory of files rather than be limited to processing a single file.
If you need a single file to download from HDFS, then you should use getmerge.
There is no easy way to do this directly in HDFS, but the trick below works. It is not an ideal solution, but it should work if the output is not huge.
hadoop fs -cat source_folder_path/* | hadoop fs -put - target_filename
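With hypothetical paths, the two workarounds look like this; the first stays entirely inside HDFS (although the data still streams through the client), the second goes via the local filesystem with getmerge:

# HDFS-to-HDFS: concatenate all part files and write the stream back as a single file.
hadoop fs -cat /user/me/job_output/part-* | hadoop fs -put - /user/me/merged_output

# Via the local filesystem: merge to a local file, then push it back up.
hdfs dfs -getmerge /user/me/job_output /tmp/merged_output
hdfs dfs -put /tmp/merged_output /user/me/merged_output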

How does the cp command work in Hadoop?

I am reading "Hadoop: The Definitive Guide", and to explain my question let me quote from the book:
distcp is implemented as a MapReduce job where the work of copying is done by the maps that run in parallel across the cluster. There are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations. By default, up to 20 maps are used, but this can be changed by specifying the -m argument to distcp.
and in a footnote
Even for a single file copy, the distcp variant is preferred for large files since hadoop fs -cp copies the file via the client running the command.
I understand why distcp works better for a collection of files, since different mappers work in parallel, each on a single file. But when only a single file is to be copied, why does distcp perform better when the file size is large (according to the footnote)? I am only getting started, so it would be helpful if someone could explain how the cp command in Hadoop works and what is meant by "hadoop fs -cp copies the file via the client running the command." I understand the write process of Hadoop, as explained in the book, where a pipeline of datanodes is formed and each datanode is responsible for writing data to the following datanode in the pipeline.
When a file is copied "via the client", the byte content is streamed from HDFS to the local node running the command and then uploaded back to the destination HDFS location. The copy is not done by simply updating metadata or by handing blocks directly between datanodes, as you might expect.
Compare that to distcp, which creates smaller, parallel copy tasks spread out over multiple hosts.
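As a concrete (hypothetical) comparison, both of the following copy one large file, but the first streams every byte through the machine running the command, while the second runs the copy as map tasks on the cluster; the paths and namenode address are placeholders:

# Client-side copy: data flows HDFS -> this machine -> HDFS.
hadoop fs -cp /data/bigfile.avro /backup/bigfile.avro

# Distributed copy: map tasks on cluster nodes do the reading and writing; -m caps the number of maps.
hadoop distcp -m 20 hdfs://namenode:8020/data/bigfile.avro hdfs://namenode:8020/backup/

For a single file the distcp job still uses only one map, but that map runs on a cluster node instead of funnelling all the bytes through the client, which is why the footnote prefers distcp for large files.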

How to edit a txt file inside HDFS in the terminal?

Is there any way to modify a txt file inside HDFS directly via the terminal?
Assume I have "my_text_file.txt", and I would like to modify it inside HDFS using the command below:
$ hdfs dfs -XXXX user/my_text_file.txt
I am interested to know what "XXXX" should be, if such a command exists.
Please note that I don't want to make the modification locally and then copy the file to HDFS.
You cannot edit files that are already in HDFS; it is not supported. HDFS works on the principle of "write once, read many". So if you want to edit a file, make the changes in your local copy and then move it back to HDFS.
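A minimal sketch of that round trip, with a hypothetical path and editor:

# Pull the file down, edit it locally, then overwrite the HDFS copy (-f forces the overwrite).
hdfs dfs -get /user/my_text_file.txt /tmp/my_text_file.txt
vi /tmp/my_text_file.txt
hdfs dfs -put -f /tmp/my_text_file.txt /user/my_text_file.txt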
Currently, as explained by #BruceWayne, it's not possible. Files in HDFS are split into blocks and distributed across the cluster, so editing them in place via the terminal HDFS commands is not supported; none of the currently available shell commands perform an in-place edit.
You could edit a file by locating its block data on each datanode in the cluster, but that would be troublesome.
Alternatively, you can install HUE. With HUE you can edit files in HDFS through the web UI.
You cannot edit files in HDFS, as it works on the principle of "write once, read many". But nowadays you can edit a file using the Hue file browser in Cloudera.

What is the best way of loading huge files from local to HDFS?

I have a directory which contains multiple folders with N files in each one. A single file can be 15 GB in size. I don't know what the best way is to copy/move the files from local to HDFS.
There are many ways to do this (using traditional methods), for example:
hdfs dfs -put /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -copyFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hdfs dfs -moveFromLocal /path/to/localdir/ hdfs://path/to/hdfsdir
hadoop distcp file:///path/to/localdir/ hdfs://namenode:port/path/to/hdfsdir
Option 1 & 2 are same in your case. There will not be any difference in copy time.
Option 3 might take some more time as it copies the data to HDFS filesystem (same as -put) and then deletes the file from the local filesystem.
Option 4 is a tricky one. It is designed for large inter/intra-cluster copying. But you can use the same command for local files also by providing local file URL with "file://" prefix. It is not the optimal solution w.r.t distcp as the tool is designed to work in parallel (using MapReduce) and as the file is on local, it cannot make use of its strength. (You can try by creating a mount on the cluster nodes which might increase the performance of distcp)
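If the copy time of a plain -put becomes a problem, one simple way to get some parallelism without distcp is to push each top-level folder with its own put, a few at a time. The paths and the degree of parallelism below are only illustrative:

# Run up to four hdfs dfs -put commands in parallel, one per local sub-folder.
ls -d /path/to/localdir/*/ | xargs -P4 -I{} hdfs dfs -put {} /path/to/hdfsdir/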

HDFS Block Split

My Hadoop knowledge is 4 weeks old. I am using a sandbox with Hadoop.
According to the theory, when a file is copied into the HDFS file system, it is split into 128 MB blocks. Each block is then stored on a datanode and replicated to other datanodes.
Question:
When I copy a data file (~500 MB) from the local file system into HDFS (put command), the entire file is still shown in HDFS (-ls command). I was expecting to see the 128 MB blocks. What am I doing wrong here?
Supposing I manage to split & distribute the data file in HDFS, is there a way to combine the blocks and retrieve the original file back to the local file system?
You won't see the individual blocks from the -ls command. These are the logical equivalent of blocks on a hard drive not showing up in Linux's ls or Windows Explorer. You can inspect them on the command line with hdfs fsck /user/me/someFile.avro -files -blocks -locations, or you can use the NameNode UI to see which hosts have the blocks for a file and on which hosts each block is replicated.
Sure. You'd just do something like hdfs dfs -get /user/me/someFile.avro or download the file using HUE or the NameNode UI. All these options will stream the appropriate blocks to you to assemble the logical file back together.
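For example, with a hypothetical ~500 MB file the two commands from these answers look like this:

# Show how the logical file is split into blocks and which datanodes hold each replica.
hdfs fsck /user/me/datafile.csv -files -blocks -locations

# Reassemble the blocks and download the file back to the local filesystem.
hdfs dfs -get /user/me/datafile.csv /tmp/datafile.csv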
