Configuring flume to read logs from different directories - hadoop

Different applications write their logs to different directory structures. I want to read those logs and put them into a sink (which can be HDFS or a plain file).
How does Flume support multiple sources for a single agent? Is it possible to have multiple sources for a single agent?
Can anyone guide me in this?
Thanks and regards
Chhaya

Configure your Flume agent with multiple sources - one per log directory. They should probably be of the spooling directory source type. Note that once a file lands in the spooling directory it must not change; you need to arrange for the writing application to drop only completed files there (for example, write the file elsewhere and move it in once it is finished).
Those sources can all feed a single channel, which can be drained by a single sink.
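For illustration, a minimal flume.conf sketch along those lines (the agent name, directory paths, and HDFS URL are placeholders for your own setup):

agent1.sources = src1 src2
agent1.channels = ch1
agent1.sinks = hdfsSink

# one spooling directory source per application log directory
agent1.sources.src1.type = spooldir
agent1.sources.src1.spoolDir = /var/log/app1
agent1.sources.src1.channels = ch1

agent1.sources.src2.type = spooldir
agent1.sources.src2.spoolDir = /var/log/app2
agent1.sources.src2.channels = ch1

# single durable channel shared by both sources
agent1.channels.ch1.type = file

# single sink writing everything to HDFS
agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/logs
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.channel = ch1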

Related

Simple deeplearning4J Java based Spark example?

I need to run a simple Java-based deeplearning4j example on a Hadoop cluster, and I found one here. I need to specify the input from the command line (it should be a path on HDFS), and the output should go to HDFS so it can be viewed later.
However, the example does not cover this; it hard-codes the input from the local file system, and the output goes to the local file system as well.
Can anyone help me here?
Maybe some combination of this recent pull request on our examples:
https://github.com/deeplearning4j/dl4j-examples/pull/384
and Spring-hadoop could help you?
http://projects.spring.io/spring-hadoop/
I mean, conceptually all you'd do is change the file system type.
The FileSystem API in Hadoop can point to either the local file system or an HDFS URL, so there shouldn't be much to change.
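As a rough illustration of that idea (the URI comes from the command line and stands in for either a file:// or hdfs:// location):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The same API works for file:///... and hdfs://... URIs;
        // pass the location on the command line instead of hard-coding it.
        FileSystem fs = FileSystem.get(URI.create(args[0]), conf);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path(args[0]))))) {
            System.out.println(reader.readLine());
        }
    }
}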

Logging from mappers into one location

I would like to know what my mappers are doing at any given moment. As I understand it, each of them writes its stdout to a local log file. It's not practical to tail multiple log files on many servers, so I would like to make all mappers write to one place instead (like a specific path on HDFS).
Is there any built-in feature or external library that can help me with that?
In terms of external libraries, you can use Flume (https://flume.apache.org/FlumeUserGuide.html) to transfer all of these logs to a centralized location, either HDFS or a local file.
Basically, a Flume agent runs on each machine, does the equivalent of 'tail -f' on the log files, and forwards them to the central location.
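For example, a per-node agent might look roughly like this (the log path and HDFS URL are placeholders; in practice the per-node agents often forward to a collector agent over Avro rather than writing to HDFS directly):

tailAgent.sources = taskLog
tailAgent.channels = mem
tailAgent.sinks = toHdfs

# turn each new log line into a Flume event
tailAgent.sources.taskLog.type = exec
tailAgent.sources.taskLog.command = tail -F /var/log/hadoop/userlogs/task.log
tailAgent.sources.taskLog.channels = mem

tailAgent.channels.mem.type = memory

# append everything to a central location in HDFS
tailAgent.sinks.toHdfs.type = hdfs
tailAgent.sinks.toHdfs.hdfs.path = hdfs://namenode:8020/central-logs
tailAgent.sinks.toHdfs.hdfs.fileType = DataStream
tailAgent.sinks.toHdfs.channel = mem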

Flume: Send files to HDFS via APIs

I am new to Apache Flume-ng. I want to send files from a client agent to a server agent, which will ultimately write the files to HDFS. I have seen http://cuddletech.com/blog/?p=795 . It is the best resource I have found so far, but it works via scripts rather than APIs. I want to do it via the Flume APIs. Please help me in this regard, and tell me how to start and organize the code.
I think you should maybe explain more about what you want to achieve.
The link you posted appears to cover your needs. You need to start a Flume agent on your client that reads the files and sends them using the Avro sink, and a Flume agent on your server that uses an Avro source to read the events and write them wherever you want.
If you want to send events directly from an application then have a look at the embedded agent in Flume 1.4 or the Flume appender in log4j2 or (worse) the log4j appender in Flume.
Check this http://flume.apache.org/FlumeDeveloperGuide.html
You can write a client that sends events, or use the embedded agent.
As for the code organization, it is up to you.
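As a rough sketch of the client-SDK route (the host name, port, and message body are placeholders; the server-side agent would expose an Avro source on that port):

import java.nio.charset.StandardCharsets;
import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientDemo {
    public static void main(String[] args) throws Exception {
        // connects to an Avro source listening on collector-host:41414
        RpcClient client = RpcClientFactory.getDefaultInstance("collector-host", 41414);
        try {
            Event event = EventBuilder.withBody("hello flume".getBytes(StandardCharsets.UTF_8));
            client.append(event);   // send one event; appendBatch() can send a list
        } finally {
            client.close();
        }
    }
}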

How to use a third-party API in Hadoop to read files from HDFS when the API accepts only local file system paths?

I have large mbox files, and I am using a third-party API (mstor) to parse messages from them with Hadoop. I have uploaded the files to HDFS. The problem is that this API accepts only a local file system path, similar to the following:
MessageStoreApi store = new MessageStoreApi("file location in local file system");
I could not find a constructor in this API that initializes from a stream, so I cannot read an HDFS stream and use it directly.
Now my question is: should I copy my files from HDFS to the local file system and initialize the store from a local temporary folder? That is what I have been doing so far:
Currently my map function receives the path of an mbox file.
public void map(Text key, NullWritable value, Context context) throws IOException {
    // key holds the HDFS path of an mbox file
    Path hdfsPath = new Path(key.toString());
    Path localTempFile = new Path("/tmp/" + hdfsPath.getName());
    FileSystem.get(context.getConfiguration()).copyToLocalFile(hdfsPath, localTempFile);
    MessageStoreApi store = new MessageStoreApi(localTempFile.toString());
    // process file
}
Or is there some other solution? For example, what if I increase the block size so that a single file fits in one block, and somehow get the locations of those blocks in my map function? Since map tasks mostly run on the node where their block is stored, I might not have to copy to the local file system every time. But I am not sure that would always work :)
Suggestions and comments are welcome!
For local-filesystem-style path access, HDFS offers two options: the HDFS NFS gateway (mounted via NFSv3) and FUSE-mounted HDFS.
The former is documented in the Apache Hadoop docs (CDH users can follow the CDH documentation instead).
The latter is documented on the Apache Hadoop wiki (again, CDH provides its own docs for this).
The NFS gateway is currently better maintained upstream than the FUSE option.

getting data in and out of hadoop

I need a system to analyze large log files. A friend pointed me to Hadoop the other day, and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes in my cluster stream data into HDFS as they receive it? Or would each node need to write to a local temp file and submit that file once it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs against that same file at the same time?
The Fluentd log collector has just released its WebHDFS plugin, which lets users stream data into HDFS as it arrives. It's easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can also import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
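A minimal sketch of that approach using the fluent-logger-java library (the tag names, host, and default port 24224 are assumptions about your Fluentd setup):

import java.util.HashMap;
import java.util.Map;
import org.fluentd.logger.FluentLogger;

public class FluentdDemo {
    // assumes a Fluentd daemon listening on localhost:24224
    private static final FluentLogger LOG = FluentLogger.getLogger("myapp", "localhost", 24224);

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        data.put("from", "java");
        data.put("message", "hello fluentd");
        LOG.log("access", data);   // emitted with tag "myapp.access"
        LOG.close();
    }
}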
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. However, you won't be able to process a file until its file handle has been properly closed.
HDFS does not support appends (yet?).
What I do is run the map-reduce job periodically and write the results to a 'processed_logs_#{timestamp}' folder.
Another job can later take these processed logs and push them to a database or similar, so they can be queried online.
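A small sketch of that pattern when setting up the job (the class name and input/output paths are placeholders, and the mapper/reducer are left as the identity defaults):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PeriodicLogJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "log-processing");
        job.setJarByClass(PeriodicLogJob.class);
        FileInputFormat.addInputPath(job, new Path("/raw_logs"));
        // each run gets its own timestamped output folder, so earlier
        // results remain available to downstream jobs or export scripts
        FileOutputFormat.setOutputPath(job,
                new Path("/processed_logs_" + System.currentTimeMillis()));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}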
I'd recommend using Flume to collect the log files from your servers into HDFS.
