Hadoop in a shared file system

I want to run Hadoop to process big files, but the server machines are clustered and share a file system, so whichever machine I log in to, I see the same directories and files.
In this case, I don't know how to get started. I assume the file splits don't have to be transferred between nodes via HDFS, but I'm not sure how to configure or start the cluster.
Is there any reference or tutorial for this?
Thanks

Related

Detect incompatible file location (iCloud, Dropbox, shared folders) for custom file format

I’m designing a custom file format. It will be either a monolithic file or a folder of smaller files. The data is rather large in total, and there is no need to load everything into memory at once; doing so would also make things slower than necessary. One of the files may or may not be a database file, since running SQL queries would be useful.
The user can have many such files. The user might also want to share files with others, even if uploading or downloading them takes some time.
Conceptually I run into issues with shared network folders, Dropbox, iCloud, etc. Such services can cause sync conflicts if the file is not loaded entirely into memory, and the database file can get corrupted.
One solution is to prohibit storing the file on such services, either by keeping it in a user/library folder or by forcing the user to pick a local folder.
Using a folder in the user's library means recreating a file-navigation system like Finder, and it limits where the user's files can end up. Restricting the location to a local folder seems the better choice.
Is there a way to programmatically detect if a folder is local?
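There is no fully portable API for this, but two heuristics can be combined from Java's standard library: check the file-store type for network filesystems, and check the path for folder names that sync services typically use. The sketch below is my own illustration (class and constant names are invented, and the hint lists are incomplete); on macOS, native APIs such as NSURL's ubiquitous-item keys are more reliable than string matching.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Locale;

public class LocalFolderCheck {
    // Folder-name fragments commonly used by sync services (heuristic only;
    // "Mobile Documents" is where iCloud Drive lives on macOS).
    private static final String[] SYNC_HINTS = {
        "dropbox", "onedrive", "google drive", "mobile documents"
    };
    // Filesystem types that indicate a network mount rather than a local disk.
    private static final String[] NETWORK_FS = { "nfs", "cifs", "smb", "afp", "webdav" };

    public static boolean looksLocal(Path dir) throws IOException {
        String p = dir.toAbsolutePath().toString().toLowerCase(Locale.ROOT);
        for (String hint : SYNC_HINTS) {
            if (p.contains(hint)) return false;
        }
        // Only consult the file store for paths that actually exist.
        if (Files.exists(dir)) {
            String fsType = Files.getFileStore(dir).type().toLowerCase(Locale.ROOT);
            for (String t : NETWORK_FS) {
                if (fsType.contains(t)) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("fmt-check");
        System.out.println(tmp + " looks local: " + looksLocal(tmp));
    }
}
```

Treat a negative result as a warning to the user rather than a hard block, since both heuristics can misfire (a renamed Dropbox folder, an exotic filesystem type).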

Using IIB File nodes to move, copy, download and rename (S)FTP files without opening them

I am using IIB, and several of the requirements I have are for message flows that can do the following things:
Download a file from an FTP and/or SFTP server to the local file system, with a different name
Rename a file on the local file system
Move and rename a file on the (S)FTP server
Upload a file from the file system to the (S)FTP server, with a different name
Looking at the available nodes (FileInputNode, FileReadNode, FileOutputNode), it appears that they can read and write files this way, but only by copying them into memory and then physically rewriting them, rather than issuing a copy/move/download-style command that never opens the file.
I've noticed there are options to move or store files locally once a read is complete, however, so perhaps there's a way around it using that functionality? I don't need to open the files into memory at all; I don't care what's in them.
Currently I am doing this with a Java Compute Node and the Apache Commons Net FTP classes, but they don't work for SFTP and the workaround seems too complex, so I was wondering if there is a pure IIB way to do it.
There is no native way to do this in IIB, but it can be done from a JavaCompute node using Apache Commons VFS, which puts FTP and SFTP (via JSch) behind a single API, so the same move/rename code works for both protocols.

Is _logs/skip/ related to hadoop version?

I am doing a project about MapReduce task failures. According to Hadoop Beginner's Guide (Garry Turkington), all of the skip data is stored in the _logs/skip/ folder. The author used Hadoop 1.0. I am working with Hadoop 2.7.4. Although I tested with skip data, neither the output folder nor _logs/skip/ was created. Is the _logs/skip folder related to the Hadoop version? If I want to skip data in Hadoop 2.7.4, what should I do?
The short answer is no, it is not related to Hadoop at all.
Many temporary folders are created during execution and removed once it completes, including log folders, temporary output folders, and others.
Don't be confused by them. The only guarantee is that the job will create an output folder containing a _SUCCESS file, even when there is no other output.
I hope it answers your query.
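The _SUCCESS marker mentioned above is the reliable signal to key off. A minimal sketch of a consumer-side check (the class name is my own; against a real cluster you would use Hadoop's FileSystem API instead of java.nio):

```java
import java.nio.file.*;

public class OutputReady {
    // A successful MapReduce job writes an empty _SUCCESS marker into its
    // output directory; downstream consumers can poll for it before reading.
    public static boolean isJobOutputComplete(Path outputDir) {
        return Files.exists(outputDir.resolve("_SUCCESS"));
    }

    public static void main(String[] args) throws Exception {
        Path out = Files.createTempDirectory("job-output");
        System.out.println(isJobOutputComplete(out)); // no marker yet
        Files.createFile(out.resolve("_SUCCESS"));
        System.out.println(isJobOutputComplete(out)); // marker present
    }
}
```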

Apache Spark - dealing with auto-updating inputs

I'm new to spark and using it a lot recently to do some batch processing.
Currently I have a new requirement and am stuck on how to approach it.
I have a file that has to be processed, but this file gets updated periodically. I want the initial file to be processed, and whenever there is an update to the file, I want Spark operations to be triggered that operate only on the updated parts. Any way to approach this would be helpful.
I'm open to using any other technology in combination with spark. The files will generally sit on a file system and could be several GBs in size.
Spark alone cannot recognize that a file has been updated.
It does its job when reading the file for the first time, and that's all: by default, Spark won't know that a file has changed, let alone which parts of it changed.
You should work with folders instead. Spark can read a whole folder, so drop each update into it as a new file -> sc.textFile(PATH_FOLDER)... Spark Streaming's textFileStream can likewise monitor a directory and process new files as they appear.
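The folder pattern above can be sketched independently of Spark: remember which files have already been seen, and hand only the fresh ones to the batch job on each pass. This is my own illustration (names are invented), not Spark's internal mechanism:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class IncrementalFolderProcessor {
    private final Set<Path> processed = new HashSet<>();

    // Returns only the files that appeared since the last call; the caller
    // points the processing job (Spark or otherwise) at these files only.
    public List<Path> newFiles(Path folder) throws IOException {
        List<Path> fresh = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(folder)) {
            for (Path p : ds) {
                if (Files.isRegularFile(p) && processed.add(p)) {
                    fresh.add(p);
                }
            }
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("incoming");
        IncrementalFolderProcessor proc = new IncrementalFolderProcessor();
        Files.createFile(dir.resolve("part-1.txt"));
        System.out.println("first pass:  " + proc.newFiles(dir).size());
        Files.createFile(dir.resolve("part-2.txt"));
        System.out.println("second pass: " + proc.newFiles(dir).size());
    }
}
```

In production the "seen" set would need to be persisted (or derived from file timestamps) so that a restart does not reprocess everything.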

How to set HDFS directory times for unit testing

I'm trying to unit test a Java program that uses Hadoop's HDFS programmatic interface. I need to create directories and set their times to make sure that my program will "clean up" the directories at the right times. However, FileSystem.setTimes does not seem to work for directories, only for files. Is there any way I can set up HDFS directories access/modification times programmatically? I'm using Hadoop 0.20.204.0.
Thanks!
Frank
It looks like this is indeed an HDFS bug, which was recently marked as resolved. You may need to try a newer version or a snapshot build if this is critical for you.
HDFS-2436
Are you trying to unit test Hadoop, or your program? If the latter, the proper way to do it is to abstract away any infrastructure dependency such as HDFS and use a stub/mock in your tests.
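A minimal sketch of that abstraction, assuming the program's job is to delete old directories (all names here are hypothetical): the clean-up logic depends only on a small interface, the real implementation wraps Hadoop's FileSystem, and the test uses an in-memory stub whose modification times it fully controls, sidestepping the setTimes limitation.

```java
import java.util.*;

// Minimal abstraction over the handful of operations the program needs,
// so tests can use an in-memory stub instead of a real cluster.
interface DirStore {
    List<String> listDirs();
    long modificationTime(String dir);
    void delete(String dir);
}

// In-memory stub: the test sets whatever directory layout and timestamps it wants.
class InMemoryDirStore implements DirStore {
    private final Map<String, Long> dirs = new HashMap<>();
    void addDir(String name, long mtime) { dirs.put(name, mtime); }
    public List<String> listDirs() { return new ArrayList<>(dirs.keySet()); }
    public long modificationTime(String dir) { return dirs.get(dir); }
    public void delete(String dir) { dirs.remove(dir); }
}

public class Cleaner {
    // Deletes every directory older than the cutoff and reports how many
    // were removed; in production, DirStore is backed by Hadoop's FileSystem.
    public static int cleanUp(DirStore store, long cutoffMillis) {
        int deleted = 0;
        for (String dir : store.listDirs()) {
            if (store.modificationTime(dir) < cutoffMillis) {
                store.delete(dir);
                deleted++;
            }
        }
        return deleted;
    }
}
```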
