I wrote a job, one of whose roles is to copy a lot of very big files inside HDFS.
I found that using FileUtil.copy() is not efficient.
Is there a more efficient way to do it? I heard about DistCp.java; is it better than FileUtil.copy()? And is there a Cloudera implementation of DistCp.java?
Is there a Cloudera implementation of DistCp.java?
Not sure what you mean by a Cloudera implementation. It's part of the standard Hadoop installation, so it should be part of CDH as well. You could also use the DistCp command directly (hadoop distcp <src> <dst>); the command internally invokes the DistCp.java class to copy the files.
I heard about DistCp.java; is it better than FileUtil.copy()?
The FileUtil.copy() method copies the files sequentially, while DistCp spawns a MapReduce job to copy them, which is more efficient since the copy happens in parallel. Check the DistCp documentation for more details.
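If you need to trigger the copy from your own Java code rather than from the shell, DistCp can also be driven programmatically. A minimal sketch, assuming the Hadoop 2.x API (where DistCpOptions still has a (List<Path>, Path) constructor; Hadoop 3.x uses a DistCpOptions.Builder instead) and placeholder paths; it also needs the hadoop-distcp jar on the classpath:

    import java.util.Collections;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.tools.DistCpOptions;

    public class ParallelCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Source and target are placeholders; adjust to your cluster.
            DistCpOptions options = new DistCpOptions(
                    Collections.singletonList(new Path("/data/src")),
                    new Path("/data/dst"));
            // Submits the MapReduce copy job and waits for it to finish.
            new DistCp(conf, options).execute();
        }
    }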
Related
I am copying a file to HDFS from another script. I cannot tell when the file transfer is done, since the other system is the one performing the transfer to HDFS. I want to perform the next operation as soon as the copy is done. How can I do this?
When you have a chain of commands, it is best to develop a pipeline, which also allows plugging in error-handling or alerting routines if need be.
Have you tried Apache Oozie/Airflow or tools in a similar ecosystem?
Using such a toolset, you can define the first task as the copy, followed by any other tasks in line.
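If a full workflow tool is overkill for your case, one common lightweight workaround (a heuristic, not a guarantee) is to poll the file until its size stops changing between checks, since the transfer is driven by a system you don't control. A minimal sketch, with a hypothetical path and poll interval:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WaitForCopy {
        // Heuristic: treat the file as complete once its length is
        // unchanged across one full poll interval.
        static void waitUntilStable(FileSystem fs, Path file,
                                    long pollMillis) throws Exception {
            long previous = -1;
            while (true) {
                if (fs.exists(file)) {
                    long current = fs.getFileStatus(file).getLen();
                    if (current == previous) {
                        return;
                    }
                    previous = current;
                }
                Thread.sleep(pollMillis);
            }
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            waitUntilStable(fs, new Path("/data/incoming/bigfile.dat"), 5000);
            // The next operation can be kicked off here.
        }
    }

If you can influence the producing system at all, a more reliable convention is to have it write a marker file (for example _SUCCESS) after the transfer and wait for that instead.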
I am doing a project about MapReduce task failures. According to Hadoop Beginner's Guide (Garry Turkington), all of the skip data is stored in the _logs/skip/ folder. The author used Hadoop 1.0. I am working with Hadoop 2.7.4. Although I tested with skip data, neither the output folder nor _logs/skip/ was created. Is the _logs/skip folder related to the Hadoop version? If I want to skip data in Hadoop 2.7.4, what should I do?
The short answer is no, it is not tied to the Hadoop version at all.
Many temporary folders are created at execution time and removed once the execution completes. These include log folders, temporary output folders, and other scratch folders.
You should not be confused by them. The only guarantee is that the job will generate an output folder with a _SUCCESS file, even if there is no output.
I hope it answers your query.
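As for actually enabling skipping in Hadoop 2.7.4: as far as I know, skip mode is only exposed through the old org.apache.hadoop.mapred API, via the SkipBadRecords helper. A minimal sketch, with a hypothetical output path:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SkipBadRecords;

    public class SkipConfig {
        public static JobConf withSkipping(JobConf conf) {
            // Begin skipping after two failed attempts of the same task.
            SkipBadRecords.setAttemptsToStartSkipping(conf, 2);
            // Tolerate at most one skipped record per bad range in the mapper.
            SkipBadRecords.setMapperMaxSkipRecords(conf, 1);
            // Write skipped records to an explicit (hypothetical) location
            // instead of relying on any default folder.
            SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped"));
            return conf;
        }
    }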
Hadoop by default has support for reading .gz compressed files; I want similar support for .zip files. I should be able to read the content of zip files with the hadoop fs -text command.
I am looking for an approach where I don't have to implement an InputFormat and RecordReader for zip files. I want my jobs to be completely agnostic of the format of the input files; they should work irrespective of whether the data is zipped or unzipped, similar to how it works for .gz files.
I'm sorry to say that I only see two ways to do this from "within" Hadoop: either use a custom InputFormat and RecordReader based on ZipInputStream (which you clearly specified you were not interested in), or detect .zip input files and unzip them before launching the job.
I would personally do this from outside Hadoop, converting to gzip (or indexed LZO if I needed splittable files) via a script before running the job, but you most certainly already thought about that...
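For completeness, a minimal sketch of that conversion using only the Java standard library; the input file name is a placeholder, and nested entry paths are not handled:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    public class ZipToGzip {
        public static void main(String[] args) throws Exception {
            // Re-compress every entry of a .zip as a standalone .gz file,
            // which Hadoop can then decompress transparently.
            try (ZipInputStream zin =
                     new ZipInputStream(new FileInputStream("input.zip"))) {
                ZipEntry entry;
                byte[] buf = new byte[8192];
                while ((entry = zin.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        continue;
                    }
                    // Assumes flat entry names; nested paths need mkdirs.
                    try (GZIPOutputStream gzout = new GZIPOutputStream(
                             new FileOutputStream(entry.getName() + ".gz"))) {
                        int n;
                        while ((n = zin.read(buf)) > 0) {
                            gzout.write(buf, 0, n);
                        }
                    }
                }
            }
        }
    }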
I'm also interested to see if someone can come up with an unexpected answer.
I want to run Hadoop to process big files, but the server machines are clustered and share a file system. So even if I log in to different machines, I see the same directories and files.
In this case, I don't know how to get started. I guess the split files don't have to be transferred through HDFS to other nodes, but I'm not sure how to configure this or where to start.
Is there any reference or tutorial for this?
Thanks
I'm trying to unit test a Java program that uses Hadoop's HDFS programmatic interface. I need to create directories and set their times to make sure that my program will "clean up" the directories at the right times. However, FileSystem.setTimes does not seem to work for directories, only for files. Is there any way I can set HDFS directory access/modification times programmatically? I'm using Hadoop 0.20.204.0.
Thanks!
Frank
It looks like this is indeed an HDFS bug, which was marked as resolved recently. Perhaps you should try a newer version or a snapshot build if this is critical for you.
HDFS-2436
Are you trying to unit test Hadoop or your program? If the latter, then the proper way to do it is to abstract away any infrastructure dependencies, such as HDFS, and use a stub/mock in your tests.
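A minimal sketch of such a seam, with hypothetical interface and class names; in tests, a fake implementation can return FileStatus objects constructed with whatever modification times the clean-up rules need to see, without touching HDFS at all:

    import java.io.IOException;

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Hypothetical seam exposing only what the clean-up code needs.
    interface DirectoryStore {
        FileStatus[] list(Path dir) throws IOException;

        void delete(Path dir) throws IOException;
    }

    // Production implementation that delegates to a real FileSystem.
    class HdfsDirectoryStore implements DirectoryStore {
        private final FileSystem fs;

        HdfsDirectoryStore(FileSystem fs) {
            this.fs = fs;
        }

        public FileStatus[] list(Path dir) throws IOException {
            return fs.listStatus(dir);
        }

        public void delete(Path dir) throws IOException {
            fs.delete(dir, true); // recursive delete
        }
    }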