How to set HDFS directory times for unit testing - hadoop

I'm trying to unit test a Java program that uses Hadoop's HDFS programmatic interface. I need to create directories and set their times to make sure that my program will "clean up" the directories at the right times. However, FileSystem.setTimes does not seem to work for directories, only for files. Is there any way I can set HDFS directory access/modification times programmatically? I'm using Hadoop 0.20.204.0.
Thanks!
Frank

This does look like an HDFS bug, which was recently marked as resolved. If this is critical for you, try a newer version or a snapshot build.
HDFS-2436

Are you trying to unit test Hadoop or your program? If the latter, the proper way to do it is to abstract away any infrastructure dependencies, such as HDFS, and use stubs or mocks in your tests.
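A minimal sketch of that abstraction, assuming the cleanup logic only needs to list directories, read a modification time, and delete. The names `DirectoryStore`, `InMemoryDirectoryStore`, and `DirectoryCleaner` are hypothetical and not part of Hadoop; a production implementation of the interface would presumably delegate to Hadoop's FileSystem.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical abstraction over the few directory operations the cleanup code needs.
// A production implementation would delegate to the real file system.
interface DirectoryStore {
    List<String> listDirectories();
    long modificationTime(String path);
    void delete(String path);
}

// In-memory stub: a test can give a directory any modification time it likes.
class InMemoryDirectoryStore implements DirectoryStore {
    private final Map<String, Long> modificationTimes = new LinkedHashMap<>();

    void addDirectory(String path, long modificationTimeMillis) {
        modificationTimes.put(path, modificationTimeMillis);
    }

    @Override
    public List<String> listDirectories() {
        // Return a copy so callers can delete while iterating.
        return new ArrayList<>(modificationTimes.keySet());
    }

    @Override
    public long modificationTime(String path) {
        return modificationTimes.get(path);
    }

    @Override
    public void delete(String path) {
        modificationTimes.remove(path);
    }
}

// Hypothetical code under test: deletes directories last modified before the cutoff.
class DirectoryCleaner {
    private final DirectoryStore store;

    DirectoryCleaner(DirectoryStore store) {
        this.store = store;
    }

    List<String> cleanOlderThan(long cutoffMillis) {
        List<String> deleted = new ArrayList<>();
        for (String dir : store.listDirectories()) {
            if (store.modificationTime(dir) < cutoffMillis) {
                store.delete(dir);
                deleted.add(dir);
            }
        }
        return deleted;
    }
}
```

With this split, a test sets directory times by calling `addDirectory` on the stub, so it never depends on whether the real file system supports setting times on directories.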

Related

Is _logs/skip/ related to hadoop version?

I am doing a project about MapReduce task failures. According to Hadoop Beginner's Guide (Garry Turkington), all of the skip data is stored in the _logs/skip/ folder. The author used Hadoop 1.0. I am working with Hadoop 2.7.4. Although I tested with skip data, neither the output folder nor _logs/skip/ is created. Is the _logs/skip folder related to the Hadoop version? If I want to skip data in Hadoop 2.7.4, what should I do?
The short answer is no, it is not related to the Hadoop version at all.
Many temporary folders are created at execution time and removed once execution completes. These include log folders, temporary output folders and other temporary folders.
You should not be confused by them. The only guarantee is that the job will generate an output folder with a _SUCCESS file, even if there is no output.
I hope it answers your query.

Don't process already processed files?

In our system, we have multiple pig scripts that run against a particular HDFS directory. The pig scripts can run at different times, and are scheduled to run regularly.
Is there a way to point a pig script at the same directory for multiple executions, but make sure that it only processes new files that it hasn't seen before?
I was thinking of using a custom PathFilter for my loader, but I thought I would ask to see if there is already a way to do this, rather than me reinventing the wheel (!).
Have you tried moving files to a "processed" directory once processing has finished?
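The move-after-processing idea can be sketched as follows. This is only an illustration against the local filesystem using `java.nio.file`; the class and method names are hypothetical, and on HDFS the move step would instead be a rename through Hadoop's FileSystem API.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: handles every regular file found in inputDir, then moves
// it to processedDir so that the next run no longer sees it.
class ProcessedFileMover {
    static List<String> processNewFiles(Path inputDir, Path processedDir) throws IOException {
        Files.createDirectories(processedDir);

        // Collect the pending files first so the directory is not modified while listing it.
        List<Path> pending = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir)) {
            for (Path candidate : stream) {
                if (Files.isRegularFile(candidate)) {
                    pending.add(candidate);
                }
            }
        }

        List<String> handled = new ArrayList<>();
        for (Path file : pending) {
            // The real processing of the file would happen here (assumption).
            Files.move(file, processedDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
            handled.add(file.getFileName().toString());
        }
        Collections.sort(handled);
        return handled;
    }
}
```

A second call on the same input directory returns an empty list, because every file handled by the first call now lives under the processed directory.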

How to copy files inside the same FileSystem efficiently

I wrote a job, one of whose roles is to copy a lot of very big files within HDFS.
I found that using FileUtil.copy() is not efficient.
Is there a more efficient way to do it? I have heard about DistCp.java; is it better than FileUtil.copy()? Is there a Cloudera implementation of DistCp.java?
"Is there a Cloudera implementation of DistCp.java?"
I'm not sure what you mean by a Cloudera implementation. DistCp is part of the standard Hadoop installation, so it should be part of CDH as well. You could also use the DistCp command directly. The DistCp command internally invokes the DistCp.java class to copy the files.
"I have heard about DistCp.java; is it better than FileUtil.copy()?"
The FileUtil.copy() method copies the files sequentially, while DistCp spawns a MapReduce job to copy the files, which is more efficient because the copy happens in parallel. Check the DistCp documentation for more details.

How to test file manipulation

I hear that accessing a database is wrong in testing.
But what about file manipulation? Things like, cp, mv, rm and touch methods in FileUtils.
If I write a test and actually run the commands (moving files, renaming files, making directories and so on), I can test them. But I need to "undo" every command I ran before running a test again. So I decided to write all the code to "undo", but it seems a waste of time because I don't really need to "undo".
I really want to see how others do it. How would you test, for example, when you generate a lot of static files?
In your case accessing the files is entirely legitimate: if you are writing file manipulation code, it should be tested on files. The one thing to be careful about is that a failed test should mean that your code is wrong, not that somebody deleted a file the test needs. I would put the directory and the files needed for the tests in a separate folder that is used only for testing. Then, in the test setup, copy the whole folder to a temporary location, run the tests against the copy, and delete the temporary files afterwards. That way each test gets a clean copy of the saved files.
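The copy-then-delete cycle described above could look roughly like this in Java, using only `java.nio.file`. The class name `FixtureSandbox` and its method names are hypothetical; a test framework's setup and teardown hooks would call them.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for tests that need a private, disposable copy of fixture files.
class FixtureSandbox {
    // Copies the whole fixture tree into a fresh temporary directory and returns it.
    static Path copyToTemp(Path fixtureDir) throws IOException {
        Path sandbox = Files.createTempDirectory("fixture-copy");
        List<Path> sources;
        try (Stream<Path> walk = Files.walk(fixtureDir)) {
            // Files.walk lists a directory before its contents, so parents are created first.
            sources = walk.collect(Collectors.toList());
        }
        for (Path source : sources) {
            Path target = sandbox.resolve(fixtureDir.relativize(source).toString());
            if (Files.isDirectory(source)) {
                Files.createDirectories(target);
            } else {
                Files.copy(source, target);
            }
        }
        return sandbox;
    }

    // Deletes the directory and everything below it; this is the only cleanup step needed.
    static void deleteRecursively(Path dir) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path path : paths) {
            Files.delete(path);
        }
    }
}
```

Because every test works on its own copy, nothing it moves, renames, or deletes can affect the saved fixture files or another test.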
"Pure" unit testing shall not access "expensive" resources such as filesystem, DB ...
Now, you may want to run those "integration" tests (or whatever you call them) at the same time as your unit tests and use the same framework, since that is convenient.
You can have a set of files for unit testing that you copy into temporary location as suggested in Janusz' answer, or generate them in your unit tests, or you can use a mock of the FileUtils instead of the real FileUtils when unit testing.
Accessing a database is not "wrong in testing". How else will you test the integration of your code with the database?
The key to repeatable testing is a consistent environment. So long as you start from the same file system or database contents for your tests, you are not doing anything wrong. This is usually handled via a cleanup process at the start of the test suite.
Accessing resources like the database, file system, or SMTP server is a bad idea for unit testing. At some point you obviously do have to try it out with real files, but that is a different kind of test: an integration test. Integration tests are more painful: you have to make sure each test starts from a well-defined state, and they run slower because they access the real file system. On the other hand, you shouldn't have to run them as frequently as you would unit tests.
For unit tests you should be able to take advantage of duck typing to create objects that react to the same methods that the file objects you're working with have. Plus there's nothing to undo with this approach, and the tests will run a lot faster.
If your operating system supports RAM-based filesystems, you could use one of these. This even has the advantage that an occasional `unix command` in your code keeps working.
Maybe you could create a directory named "test_*" inside your test package. Any files your test changes go in this directory (for example, if the test creates a directory, it creates it inside the test directory). At the end of the test you can delete this directory with a single command. That is the only undo operation you need to execute.
Put all the files you need for the test in this test directory inside the test package.

Mock filesystem in integration testing

I'm writing a Ruby program that executes some external command-line utilities. How can I mock the filesystem from my RSpec tests so that I can easily set up a file hierarchy and verify it after testing? Ideally it would be implemented in RAM so that the tests run quickly.
I realize that I may not find a portable solution, as my external utilities are native programs interacting directly with operating system file services. Linux is my primary platform, and a solution for that would suffice.
Have you checked out FakeFS or MockFS?
Note: The original link to MockFS doesn't work. It looks like it's no longer being maintained.
Maybe this won't answer your question directly, but in such cases I tend to create a temporary directory during test setup and remove it on teardown. Of course you also have to ensure the application writes to this temporary directory. I always have a configuration option defining destination directory that I can overwrite during testing.
When it comes to assertions I use plain File.exist? or File.directory?, but of course you can create your own wrappers around them. If you need some initial state, you can build a directory to use as a fixture and copy it to the temporary directory during test setup.
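The answer above is about Ruby; as a rough Java rendering of the same idea, the sketch below makes the destination directory a constructor argument so a test can pass in a temporary directory. `ReportWriter` and the `reports` subdirectory are hypothetical names chosen for illustration, and `Files.exists` and `Files.isDirectory` stand in for File.exist? and File.directory?.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical component under test. The destination directory is supplied by
// the caller, so production code passes the configured location and tests pass
// a temporary directory.
class ReportWriter {
    private final Path destinationDir;

    ReportWriter(Path destinationDir) {
        this.destinationDir = destinationDir;
    }

    // Writes the content to <destinationDir>/reports/<name> and returns the file path.
    Path write(String name, String content) throws IOException {
        Path reportsDir = Files.createDirectories(destinationDir.resolve("reports"));
        Path file = reportsDir.resolve(name);
        Files.write(file, content.getBytes(StandardCharsets.UTF_8));
        return file;
    }
}
```

A test would create a directory with `Files.createTempDirectory`, construct `ReportWriter` with it, call `write`, and then assert with `Files.isDirectory` and `Files.exists` before removing the temporary directory in teardown.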
You can create a big file (the size of your dummy disk) and mount it as a loopback device. You can then create any filesystem and directory structure on this device.
You can even create two of them and run a simple diff to verify data integrity after the tests.
I hope I understand your requirements correctly, since I'm not sure why a simple ramdisk solution is not good enough.
This might be relevant as well.

Resources