stream data from flume to collect data from different directories - hadoop

The logs from different network devices are uploaded into a directory structure of the form /appdat/logs/device//devicename.gzip, so each device stores its logs in its respective ZIP code directory. Can an existing Flume source be used to send each newly uploaded file from any of the sub-directories to HDFS, or do I need to write a new custom source? The Cloudera version in use is CDH4.

There is a change proposed by Phil Scala that will do recursive checking. To my knowledge it hasn't been accepted yet.
The current actively developed version is Apache Flume - not the Cloudera version.

Related

Hadoop (HDFS) - file versioning

At the moment I have a user file system in my application (Apache CMIS).
As it grows, I'm wondering whether to move to Hadoop (HDFS), since we also need to run some statistics on it.
The problem:
The current file system provides versioning of the files.
When I read about Hadoop (HDFS) and file versioning, most sources said that I would have to write this versioning layer myself.
Is there already something available to manage versioning of files in HDFS, or do I really have to write it myself? (I don't want to reinvent the wheel, but I can't find a proper solution either.)
Answer
For full details: see comments on answer(s) below
Hadoop (HDFS) doesn't support versioning of files. You can get this functionality when you combine Hadoop with (Amazon) S3:
Hadoop will use S3 as the filesystem (without chunks, but recovery will be provided by S3). This solution comes with the versioning of files that S3 provides.
Hadoop will still use YARN for the distributed processing.
Versioning is not possible with HDFS.
Instead you can use Amazon S3, which provides Versioning and is also compatible with Hadoop.
HDFS supports snapshots. I think that's as close as you can get to "versioning" with HDFS.

How to implement Apache storm to monitor HDFS directory

I have an HDFS directory into which files are copied continuously (streaming) from many sources.
How can I build a topology that monitors the HDFS directory, i.e. so that whenever a new file is created in that directory it gets processed?
You are looking to monitor HDFS file/directory changes.
Take a look at this question, which points to existing support in Oozie and HBase:
How to know that a new data is been added to HDFS?
You can send items into your topology for processing when new files are detected by these tools.
Or you can write your own custom logic in Storm that periodically lists the directory and checks whether new files have been added to HDFS. Check out the tick tuple support in Storm.
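The "list periodically and check for new files" idea can be sketched in plain Java as below. This is only a minimal sketch: NewFileDetector and its poll method are hypothetical names, not part of Storm or Hadoop, and the listing is passed in as plain strings. In a real topology the listing would presumably come from Hadoop's FileSystem.listStatus and poll would be called each time a tick tuple arrives. It does not handle files that are still being written when they are first listed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: remembers which paths it has already reported.
final class NewFileDetector {
    private final Set<String> seen = new HashSet<>();

    /** Returns the paths in the listing that were not seen on earlier polls, in listing order. */
    List<String> poll(Collection<String> currentListing) {
        List<String> fresh = new ArrayList<>();
        for (String path : currentListing) {
            // Set.add returns true only the first time a path is added.
            if (seen.add(path)) {
                fresh.add(path);
            }
        }
        return fresh;
    }

    public static void main(String[] args) {
        NewFileDetector detector = new NewFileDetector();
        // The paths below are made-up stand-ins for a directory listing.
        System.out.println(detector.poll(Arrays.asList("/in/a.log", "/in/b.log")));
        System.out.println(detector.poll(Arrays.asList("/in/a.log", "/in/b.log", "/in/c.log")));
    }
}
```

Each path returned by poll would then be emitted into the topology as a tuple. The set of seen paths lives in memory only, so a restarted worker would report every file again unless that state is persisted somewhere.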

How to Use third party API in hadoop to read files from hdfs if those API uses only local file system path?

I have large mbox files and I am using a third-party API such as mstor to parse messages from them using Hadoop. I have uploaded those files to HDFS. The problem is that this API only accepts a local file system path, similar to what is shown below:
MessageStoreApi store = new MessageStoreApi("file location in local file system");
I could not find a constructor in this API that initializes from a stream, so I cannot read an HDFS stream and initialize it from that.
Now my question is: should I copy my files from HDFS to the local file system and initialize the API from a local temporary folder? That is what I have been doing so far.
Currently my map function receives the path of the mbox file:
Map(key=path_of_mbox_file_in_hdfs, value=null){
// copy the file out of HDFS to a local temporary file
String local_temp_file = CopyToLocalFile(path_of_mbox_file_in_hdfs);
// pass the variable, not a string literal
MessageStoreApi store = new MessageStoreApi(local_temp_file);
//process file
}
Or is there some other solution? For example, what if I increase the block size so that a single file fits in one block, and then somehow get the location of that block in my map function? Since map functions mostly execute on the node where the blocks are stored, I might not always have to download the file to the local file system. But I am not sure that will always work :)
Suggestions and comments are welcome!
For local filesystem path-like access, HDFS offers two options: HDFS NFS (via NFSv3 mounts) and FUSE-mounted HDFS.
The former is documented under the Apache Hadoop docs (CDH users may follow this instead)
The latter is documented at the Apache Hadoop wiki (CDH users may find relevant docs here instead)
Currently, the NFS feature is better maintained upstream than the FUSE option.
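If mounting is not an option, the copy-to-local approach described in the question can be sketched roughly as follows. This is a hedged sketch using only the Java standard library: LocalTempCopy and copyToLocalTemp are hypothetical names, and the ByteArrayInputStream stands in for the stream that, in a real mapper, would presumably come from Hadoop's FileSystem.open(path).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: spills a stream to a local temp file so that
// a path-only API can open it.
final class LocalTempCopy {
    /** Copies the whole stream into a fresh temp file and returns its path. */
    static Path copyToLocalTemp(InputStream in, String suffix) {
        try {
            Path tmp = Files.createTempFile("hdfs-copy-", suffix);
            tmp.toFile().deleteOnExit(); // safety net if the caller forgets to delete
            // createTempFile already created the file, so overwrite it.
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for an HDFS input stream (assumption, see the text above).
        InputStream in = new ByteArrayInputStream(
                "From alice@example.org\n".getBytes(StandardCharsets.UTF_8));
        Path local = copyToLocalTemp(in, ".mbox");
        System.out.println(local + " (" + Files.size(local) + " bytes)");
        // Here the path-only API would be given local.toString().
        Files.delete(local);
    }
}
```

The temp file should be deleted as soon as the file has been processed, because large mbox files can fill the task node's local disk quickly.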

Writing data to Hadoop

I need to write data into Hadoop (HDFS) from external sources such as a Windows box. Right now I copy the data onto the namenode and use HDFS's put command to ingest it into the cluster. While browsing the code I didn't see an API for doing this. I am hoping someone can show me that I am wrong and that there is an easy way to code external clients against HDFS.
There is an API in Java. You can use it by including the Hadoop code in your project.
The JavaDoc is quite helpful in general, but of course you have to know what you are looking for *g*
http://hadoop.apache.org/common/docs/
For your particular problem, have a look at:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html
(this applies to the latest release, consult other JavaDocs for different versions!)
A typical call would be:
FileSystem.get(new JobConf()).create(new Path("however.file"));
This returns a stream that you can handle with regular Java I/O.
For the problem of loading the data I needed to put into HDFS, I chose to turn the problem around.
Instead of uploading the files to HDFS from the server where they resided, I wrote a Java Map/Reduce job in which the mapper reads the file from the file server (in this case via https) and then writes it directly to HDFS (via the Java API).
The list of files is read from the input. An external script populates a file with the list of files to fetch, uploads that file into HDFS (using hadoop dfs -put), and then starts the map/reduce job with a decent number of mappers.
This gives me excellent transfer performance, since multiple files are read/written at the same time.
Maybe not the answer you were looking for, but hopefully helpful anyway :-).
About 2 years after my last answer, there are now two new alternatives - Hoop/HttpFS, and WebHDFS.
Regarding Hoop, it was first announced in Cloudera's blog and can be downloaded from a github repository. I have managed to get this version to talk successfully to at least Hadoop 0.20.1, it can probably talk to slightly older versions as well.
If you're running Hadoop 0.23.1, which at the time of writing is still not released, Hoop is instead part of Hadoop as its own component, HttpFS. This work was done as part of HDFS-2178. Hoop/HttpFS can be a proxy not only to HDFS, but also to other Hadoop-compatible filesystems such as Amazon S3.
Hoop/HttpFS runs as its own standalone service.
There's also WebHDFS which runs as part of the NameNode and DataNode services. It also provides a REST API which, if I understand correctly, is compatible with the HttpFS API. WebHDFS is part of Hadoop 1.0 and one of its major features is that it provides data locality - when you're making a read request, you will be redirected to the WebHDFS component on the datanode where the data resides.
Which component to choose depends a bit on your current setup and your needs. If you need an HTTP REST interface to HDFS now and you're running a version that does not include WebHDFS, starting with Hoop from the github repository seems like the easiest option. If you are running a version that includes WebHDFS, I would go for that unless you need some of the features Hoop has that WebHDFS lacks (access to other filesystems, bandwidth limitation, etc.).
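To give an idea of what an external client sends to WebHDFS, here is a small sketch that only builds the URL for a file-creation request. WebHdfsUrls and createUri are hypothetical names, and the host, port and path in main are made-up example values (the NameNode HTTP port depends on your Hadoop version and configuration). The URL layout, /webhdfs/v1/<path>?op=CREATE, follows the WebHDFS REST API.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper that builds WebHDFS request URLs.
final class WebHdfsUrls {
    /** Builds the URL for a WebHDFS CREATE request against the NameNode. */
    static URI createUri(String host, int port, String hdfsPath, boolean overwrite) {
        if (!hdfsPath.startsWith("/")) {
            throw new IllegalArgumentException("HDFS path must be absolute: " + hdfsPath);
        }
        try {
            // The multi-argument URI constructor percent-encodes illegal characters in the path.
            return new URI("http", null, host, port,
                    "/webhdfs/v1" + hdfsPath, "op=CREATE&overwrite=" + overwrite, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Example values only; substitute your own NameNode host and HTTP port.
        System.out.println(createUri("namenode.example.org", 50070, "/logs/app.log", true));
    }
}
```

Creating a file is a two-step exchange: the client first sends an HTTP PUT without any data to this NameNode URL, gets a redirect pointing at a DataNode, and then sends a second PUT with the file contents to the redirect location. On clusters without Kerberos, a user.name query parameter is typically added as well.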
Install Cygwin, install Hadoop locally (you just need the binary and configs that point at your NN -- no need to actually run the services), run hadoop fs -copyFromLocal /path/to/localfile /hdfs/path/
You can also use the new Cloudera desktop to upload a file via the web UI, though that might not be a good option for giant files.
There's also a WebDAV overlay for HDFS but I don't know how stable/reliable that is.
It seems there is a dedicated page now for this at http://wiki.apache.org/hadoop/MountableHDFS:
These projects (enumerated below) allow HDFS to be mounted (on most
flavors of Unix) as a standard file system using the mount command.
Once mounted, the user can operate on an instance of hdfs using
standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find',
'grep', or use standard Posix libraries like open, write, read, close
from C, C++, Python, Ruby, Perl, Java, bash, etc.
Later it describes these projects
contrib/fuse-dfs is built on fuse, some C glue, libhdfs and the hadoop-dev.jar
fuse-j-hdfs is built on fuse, fuse for java, and the hadoop-dev.jar
hdfs-fuse - a google code project is very similar to contrib/fuse-dfs
webdav - hdfs exposed as a webdav resource
mapR - contains a closed source hdfs compatible file system that supports read/write NFS access
HDFS NFS Proxy - exports HDFS as NFS without use of fuse. Supports Kerberos and re-orders writes so they are written to hdfs sequentially.
I haven't tried any of these, but I will update the answer soon as I have the same need as the OP
You can now also try to use Talend, which includes components for Hadoop integration.
You can try mounting HDFS on the machine where you are executing your code (call it machine_X); machine_X should have InfiniBand connectivity with the HDFS cluster. Check this out: https://wiki.apache.org/hadoop/MountableHDFS
You can also use HadoopDrive (http://hadoopdrive.effisoft.eu). It's a Windows shell extension.

getting data in and out of hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they get it? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file?
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly. It is easy to install and to manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can import data directly from your applications. Here's a Java example to post logs against Fluentd.
Fluentd: Data Import from Java Applications
A hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database etc. so that they can be queried on-line.
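The timestamped output folder mentioned above could be named along these lines. This is just a sketch: ProcessedLogsDir and nameFor are hypothetical names, and the timestamp format is an arbitrary choice, not something the original setup prescribes.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper that names one output folder per job run.
final class ProcessedLogsDir {
    // Compact UTC timestamp without characters that are awkward in paths.
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("yyyyMMdd'T'HHmmss'Z'").withZone(ZoneOffset.UTC);

    /** Returns a folder name such as processed_logs_<UTC timestamp> for the given run time. */
    static String nameFor(Instant runTime) {
        return "processed_logs_" + STAMP.format(runTime);
    }

    public static void main(String[] args) {
        System.out.println(nameFor(Instant.now()));
    }
}
```

Using a distinct folder per run also fits the way MapReduce file output normally works, since a job will usually refuse to write into an output directory that already exists.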
I'd recommend using Flume to collect the log files from your servers into HDFS.

Resources