How to Use third party API in hadoop to read files from hdfs if those API uses only local file system path? - hadoop

I have large mbox files and I am using third party API like mstor to parse messages from mbox file using hadoop. I have uploaded those files in hdfs. But the problem is that this API uses only local file system path , similar to shown below
MessageStoreApi store = new MessageStoreApi(“file location in locl file system”);
I could not find a constructor in this API that would initialize from stream . So I cannot read hdfs stream and initialize it.
Now my question is, should I copy my files from hdfs to local file system and initialize it from local temporary folder? As thats what I have been doing for now:
Currently My Map function receives path of the mbox files.
Map(key=path_of_mbox_file in_hdfs, value=null){
String local_temp_file = CopyToLocalFile(path in hdfs);
MessageStoreApi store = new MessageStoreApi(“local_temp_file”);
//process file
}
Or Is there some other solution? I am expecting some solution like what If I increase the block-size so that single file fits in one block and somehow if I can get the location of those blocks in my map function, as mostly map functions will execute on the same node where those blocks are stored then I may not have to always download to local file system? But I am not sure if that will always work :)
Suggestions , comments are welcome!

For local filesystem path-like access, HDFS offers two options: HDFS NFS (via NFSv3 mounts) and FUSE-mounted HDFS.
The former is documented under the Apache Hadoop docs (CDH users may follow this instead)
The latter is documented at the Apache Hadoop wiki (CDH users may find relevant docs here instead)
The NFS feature is more maintained upstream than the FUSE option, currently.

Related

Apache Spark Streaming from folder (not HDFS)

I was wondering if there is any reliable way for creating spark streams from a physical location? I was using 'textFileStream' but seems it is mainly used if the files are in HDFS. If you see the definition of the function it says "Create an input stream that monitors a Hadoop-compatible filesystem"
Are you implying that HDFS is not a physical location? There are datanode directories that physically exist...
You should be able to use textFile with the file:// URI, but you need to ensure all nodes in the cluster can read from that location.
From the definition of Hadoop compatible filesystem.
The selection of which filesystem to use comes from the URI scheme used to refer to it -the prefix hdfs: on any file path means that it refers to an HDFS filesystem; file: to the local filesystem, s3: to Amazon S3, ftp: FTP, swift: OpenStackSwift, ...etc.
There are other filesystems that provide explicit integration with Hadoop through the relevant Java JAR files, native binaries and configuration parameters needed to add a new schema to Hadoop

Pull a file from remote location (local file system in some remote machine) into Hadoop HDFS

I have files in a machine (say A) which is not part of the Hadoop (OR HDFS) datacenter. So machine A is at remote location from HDFS datacenter.
Is there a script OR command OR program OR tool that can run in machines which are connected to Hadoop (part of the datacenter) and pull-in the file from machine A to HDFS directly ? If yes, what is the best and fastest way to do this ?
I know there are many ways like WebHDFS, Talend but they need to run from Machine A and requirement is to avoid that and run it in machines in datacenter.
There are two ways to achieve this:
You can pull the data using scp and store it in a temporary location, then copy it to hdfs, and delete the temporarily stored data.
If you do not want to keep it as a 2-step process, you can write a program which will read the files from the remote machine, and write it to HDFS directly.
This question along with comments and answers would come in handy for reading the file while, you can use the below snippet to write to HDFS.
outFile = <Path to the the file including name of the new file> //e.g. hdfs://localhost:<port>/foo/bar/baz.txt
FileSystem hdfs =FileSystem.get(new URI("hdfs://<NameNode Host>:<port>"), new Configuration());
Path newFilePath=new Path(outFile);
FSDataOutputStream out = hdfs.create(outFile);
// put in a while loop here which would read until EOF and write to the file using below statement
out.write(buffer);
Let buffer = 50 * 1024, if you have enough IO capicity depending on processor or you could use a much lower value like 10 * 1024 or something
Please tell me if I am getting your Question right way.
1-you want to copy the file in a remote location.
2- client machine is not a part of Hadoop cluster.
3- It is may not contains the required libraries for Hadoop.
Best way is webHDFS i.e. Rest API

Accesing local filesystem without uploading to hdfs

Is there anyway to specify the inputpath in Hadoop outside the HDFS, I am running a single node cluster and want to access files outside the HDFS, so is there any way to do this???
Yes. Just give the complete path of your file on the local FS. Don't forget to add "file://". To be on the safer side, don't add reference to the config file in your code, if you have done so.

How to get data from temp files of hadoop?

I have an application to transfer data from remote systems to HDFS using map reduce . I however am lost when I have to deal with isues like network failure .. That is , when a connection from remote data source is lost and data is no longer accessible to my mapreduce application. I can always restart the job but when data is huge then restarting is an expensive option . I know the mapreduce would create temp folder but will it put data there ? Can I read that data out and then Can I somehow start reading the rest of the data ?
A mapreduce job can write arbitrary files, not only the ones managed by Hadoop.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
out = fs.create(new Path(fileName));
using this code you create arbitrary files which work like normal files in the local filesystem. Then, you manage connection exceptions such that when a source is unaccessible you nicely close the file and record somewhere (e.g. in HDFS itself) that happened an interruption and at which point.
In the case of FTP, you could write just the list of file paths and folders. When a job finish to download a file, write its path on the downloaded list, and when an entire folder is downloaded write the folder path, so in case of resume you will not have to traverse a directory content to check that all files were downloaded.
At the program startup, on the other hand, it will check this file to decide whether the previous attempt failed and, in case, where to start the download.
In general, Hadoop will kill your program if it's not writing/reading anything for a timeout. Your application can tell it to wait but in general is not good to have an idle job, so it's better to end the job nicely instead that waiting for the network to work again.
You can also create your own filewriter, this way:
conf.setOutputFormat(MyOwnOutputFormat.class);
your filewriter could save its own temporary files in the format you prefer, so if the application crashes you know how files are saved.
HDFS saves files with chunks of 64MB by default, and when a job fails you may not even have a temporary file unless you use your own writer.
This is a generic solution, it depends on which is the source of data (ftp, samba, http...) and its support to download resumes.
EDIT: in case of FTP, you could just use csync to syncronize a FTP server with your local filesystem, and hdfs-fuse to mount a HDFS filesystem. It works when you have many small files.
You haven't specified what tool you are using to ingress data into HDFS/Hadoop.
Some of the tools that you can use to ingress data into HDFS/Hadoop which support recoverability are Flume, Scribe & Chukwa (for log files) and they all support various configurable levels of file transfer reliability guarantees, and Sqoop for transferring relational db data into HDFS or Hive, etc.

Writing data to Hadoop

I need to write data in to Hadoop (HDFS) from external sources like a windows box. Right now I have been copying the data onto the namenode and using HDFS's put command to ingest it into the cluster. In my browsing of the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know, what you are looking for *g *
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
Filesystem.get(new JobConf()).create(new Path("however.file"));
Which returns you a stream you can handle with regular JavaIO.
For the problem of loading the data I needed to put into HDFS, I choose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job where the mapper read the file from the file server (in this case via https), then write it directly to HDFS (via the Java API).
The list of files is read from the input. I then have an external script that populates a file with the list of files to fetch, uploads the file into HDFS (using hadoop dfs -put), then start the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced in Cloudera's blog and can be downloaded from a github repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1, it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1 which at time of writing still is not released, Hoop is instead part of Hadoop as its own component, the HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0 and one of its major features is that it provides data locality - when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and what needs you have. If you need a HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the github repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.)
Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most
flavors of Unix) as a standard file system using the mount command.
Once mounted, the user can operate on an instance of hdfs using
standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find',
'grep', or use standard Posix libraries like open, write, read, close
from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource mapR - contains a closed source hdfs compatible file system that supports read/write
NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs
sequentially.
I haven't tried any of these, but I will update the answer soon as I have the same need as the OP
You can now also try to use Talend, which includes components for Hadoop integration.
you can try mounting HDFS on your machine(call it machine_X) where you are executing your code and machine_X should have infiniband connectivity with the HDFS Check this out, https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.

Resources