Logging from mappers into one location - hadoop

I would like to know what the mappers are doing at a given moment. From my understanding, all of them write stdout to a local log file, and it's not practical to tail multiple log files on many servers. I would like to make all mappers write to one place instead (like a specific path on HDFS).
Is there any built-in feature or external library which can help me with that?

In terms of an external library, you can use Flume (https://flume.apache.org/FlumeUserGuide.html) to transfer all these logs to a centralized location, either HDFS or a local file.
Basically, a Flume agent runs on every machine, does a 'tail -f' on the log files, and transfers them to the central location, as in the sketch below.
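A rough sketch of such an agent, with a hypothetical agent name, log path, and NameNode address (an exec source tailing the local log, a memory channel, and an HDFS sink):

agent1.sources = tasklog
agent1.channels = mem
agent1.sinks = hdfsout

# Tail the local task log; -F keeps following across file rotations
agent1.sources.tasklog.type = exec
agent1.sources.tasklog.command = tail -F /var/log/hadoop/userlogs/attempt.log
agent1.sources.tasklog.channels = mem

agent1.channels.mem.type = memory
agent1.channels.mem.capacity = 10000

# Everything ends up under a single HDFS directory
agent1.sinks.hdfsout.type = hdfs
agent1.sinks.hdfsout.hdfs.path = hdfs://namenode:8020/logs/mappers
agent1.sinks.hdfsout.hdfs.fileType = DataStream
agent1.sinks.hdfsout.channel = mem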

Related

How to archive data stored in HDFS files on another (non-distributed) server?

I have a project folder containing approx. 50 GB of parquet files on a Hadoop cluster (CDH 5.14), which I need to archive and move to another host (non-distributed, with Windows or Linux). This is only a one-time job - I do not plan to bring the data back to HDFS any time soon; however, there should be a way to deploy it back to a distributed file system. What would be the optimal way to do it? Unfortunately, I don't have another Hadoop cluster or a cloud environment where I could place this data.
I would appreciate any hints.
The optimal solution can depend on the actual data (e.g. tables vs. many/few flat files). If you know how the data got in there, looking at the inverse could be a logical first step.
For example, if you just used put to place the files, consider using get.
If you used NiFi to get it in, try NiFi to get it out.
After the data is on your Linux box, you can use scp, or something like FTP or a mounted drive, to move it to the desired computer.
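If an edge node with the Hadoop client libraries is available, the "get" approach can also be scripted with the Java FileSystem API. A minimal sketch, with hypothetical paths:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PullProjectFolder {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml on the edge node
        FileSystem hdfs = FileSystem.get(conf);
        Path source = new Path("/data/project");            // HDFS folder holding the parquet files (hypothetical)
        Path target = new Path("file:///archive/project");  // local staging directory (hypothetical)
        hdfs.copyToLocalFile(false, source, target);        // false = keep the HDFS copy in place
        hdfs.close();
    }
}

From the staging directory you can then scp the tree to the target host, as described above.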

Pull a file from remote location (local file system in some remote machine) into Hadoop HDFS

I have files on a machine (say A) which is not part of the Hadoop (or HDFS) datacenter. So machine A is at a remote location from the HDFS datacenter.
Is there a script, command, program or tool that can run on machines which are connected to Hadoop (part of the datacenter) and pull the file from machine A into HDFS directly? If yes, what is the best and fastest way to do this?
I know there are many ways like WebHDFS or Talend, but they need to run from machine A, and the requirement is to avoid that and run it on machines in the datacenter.
There are two ways to achieve this:
You can pull the data using scp and store it in a temporary location, then copy it to HDFS, and delete the temporarily stored data.
If you do not want to keep it as a 2-step process, you can write a program which will read the files from the remote machine, and write it to HDFS directly.
This question, along with its comments and answers, would come in handy for reading the file, while the snippet below can be used to write to HDFS.
String outFile = <Path to the file including name of the new file>; // e.g. hdfs://localhost:<port>/foo/bar/baz.txt
FileSystem hdfs = FileSystem.get(new URI("hdfs://<NameNode Host>:<port>"), new Configuration());
Path newFilePath = new Path(outFile);
FSDataOutputStream out = hdfs.create(newFilePath);
// 'in' is the InputStream from the remote machine (see the linked question); read until EOF:
byte[] buffer = new byte[50 * 1024];
int bytesRead;
while ((bytesRead = in.read(buffer)) != -1) {
    out.write(buffer, 0, bytesRead);
}
out.close();
A buffer of 50 * 1024 bytes should work if you have enough I/O capacity; otherwise you could use a much lower value like 10 * 1024.
Please tell me if I am understanding your question the right way:
1- you want to copy a file from a remote location.
2- the client machine is not part of the Hadoop cluster.
3- it may not contain the required Hadoop libraries.
The best way is WebHDFS, i.e. the REST API.
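To make that concrete, here is a rough sketch of a WebHDFS upload in plain Java (no Hadoop libraries needed on the client). Host, port, user and paths are hypothetical; the NameNode HTTP port is 9870 on recent Hadoop versions and 50070 on older ones:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WebHdfsUpload {
    public static void main(String[] args) throws Exception {
        String nameNode = "http://namenode.example.com:9870"; // hypothetical
        String hdfsPath = "/user/alice/data/sample.txt";      // hypothetical
        String localFile = "/tmp/sample.txt";                 // hypothetical

        // Step 1: ask the NameNode where to write. It answers with a 307 redirect
        // whose Location header points at a DataNode; we read it instead of following it.
        URL createUrl = new URL(nameNode + "/webhdfs/v1" + hdfsPath
                + "?op=CREATE&user.name=alice&overwrite=true");
        HttpURLConnection nn = (HttpURLConnection) createUrl.openConnection();
        nn.setRequestMethod("PUT");
        nn.setInstanceFollowRedirects(false);
        String dataNodeUrl = nn.getHeaderField("Location");
        nn.disconnect();

        // Step 2: PUT the file contents to the DataNode URL returned above.
        HttpURLConnection dn = (HttpURLConnection) new URL(dataNodeUrl).openConnection();
        dn.setRequestMethod("PUT");
        dn.setDoOutput(true);
        try (OutputStream out = dn.getOutputStream()) {
            Files.copy(Paths.get(localFile), out);
        }
        System.out.println("DataNode responded with HTTP " + dn.getResponseCode()); // 201 on success
    }
}

This sketch assumes simple (user.name) authentication; a Kerberized cluster would need SPNEGO instead.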

Configuring flume to read logs from different directories

Different applications are writing their logs to different directory structures. I want to read those logs and put them into a sink (which can be Hadoop or a physical file).
How does Flume support multiple sources for a single agent? Is it possible to have multiple sources for a single agent?
Can anyone guide me in this?
Thanks and regards
Chhaya
Configure your Flume agent with multiple sources - one for each log file. They should probably be of the spooling directory source type. Note that once the source picks up a file, the file must not change any more - you need to configure things to be sure of this.
Those sources can then go to a single channel, which can have a single sink, as in the sketch below.
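A minimal sketch of such an agent, with hypothetical agent/source names and log directories (one spooling directory source per application, one channel, one HDFS sink):

agent1.sources = app1 app2
agent1.channels = ch1
agent1.sinks = hdfs1

agent1.sources.app1.type = spooldir
agent1.sources.app1.spoolDir = /var/log/app1
agent1.sources.app1.channels = ch1

agent1.sources.app2.type = spooldir
agent1.sources.app2.spoolDir = /var/log/app2
agent1.sources.app2.channels = ch1

agent1.channels.ch1.type = memory

agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/applogs
agent1.sinks.hdfs1.channel = ch1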

How to get data from temp files of hadoop?

I have an application that transfers data from remote systems to HDFS using MapReduce. However, I am lost when I have to deal with issues like network failure - that is, when the connection to the remote data source is lost and the data is no longer accessible to my MapReduce application. I can always restart the job, but when the data is huge, restarting is an expensive option. I know MapReduce creates a temp folder, but will it put data there? Can I read that data out, and can I somehow start reading the rest of the data from there?
A mapreduce job can write arbitrary files, not only the ones managed by Hadoop.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream out = fs.create(new Path(fileName));
Using this code you create arbitrary files which work like normal files in the local filesystem. Then, you handle connection exceptions such that when a source becomes inaccessible you close the file cleanly and record somewhere (e.g. in HDFS itself) that an interruption happened and at which point.
In the case of FTP, you could write just the list of file paths and folders. When the job finishes downloading a file, write its path to the downloaded list, and when an entire folder is downloaded, write the folder path, so that on a resume you will not have to traverse the directory contents to check that all files were downloaded.
At program startup, on the other hand, it will check this file to decide whether the previous attempt failed and, if so, where to resume the download.
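A minimal sketch of that bookkeeping, assuming a hypothetical downloadFromRemote() helper and hypothetical HDFS paths (here one empty marker file per completed download instead of a single list file, which avoids relying on append support):

Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path doneDir = new Path("/ingest/_done");              // one empty marker file per completed download
hdfs.mkdirs(doneDir);

for (String remoteFile : remoteFileList) {             // file list obtained from the remote source
    Path marker = new Path(doneDir, remoteFile.replace('/', '_'));
    if (hdfs.exists(marker)) {
        continue;                                      // already fetched in a previous attempt
    }
    downloadFromRemote(remoteFile, hdfs);              // hypothetical helper: streams the file into HDFS
    hdfs.create(marker).close();                       // record the completion only after success
}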
In general, Hadoop will kill your task if it is not writing or reading anything within a timeout. Your application can report progress to keep it alive, but in general it is not good to have an idle job, so it is better to end the job cleanly rather than waiting for the network to work again.
You can also create your own filewriter, this way:
conf.setOutputFormat(MyOwnOutputFormat.class);
Your file writer could save its own temporary files in the format you prefer, so if the application crashes you know how the files were saved.
HDFS saves files in blocks of 64 MB by default, and when a job fails you may not even have a temporary file unless you use your own writer.
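For illustration, a skeleton of such a custom output format using the old mapred API (class name and record format are hypothetical examples):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class MyOwnOutputFormat extends FileOutputFormat<Text, Text> {

    @Override
    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored, JobConf job,
                                                    String name, Progressable progress) throws IOException {
        Path file = FileOutputFormat.getTaskOutputPath(job, name);
        final FSDataOutputStream out = file.getFileSystem(job).create(file, progress);

        return new RecordWriter<Text, Text>() {
            public void write(Text key, Text value) throws IOException {
                // Write records in whatever recoverable format you prefer,
                // e.g. one "key<TAB>value" line per record.
                out.writeBytes(key + "\t" + value + "\n");
            }
            public void close(Reporter reporter) throws IOException {
                out.close();
            }
        };
    }
}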
This is a generic solution; it depends on what the source of the data is (FTP, Samba, HTTP...) and whether it supports resuming downloads.
EDIT: in the case of FTP, you could just use csync to synchronize an FTP server with your local filesystem, and hdfs-fuse to mount an HDFS filesystem. It works when you have many small files.
You haven't specified what tool you are using to ingest data into HDFS/Hadoop.
Some of the tools that support recoverability when ingesting data into HDFS/Hadoop are Flume, Scribe and Chukwa (for log files), all of which support various configurable levels of file-transfer reliability guarantees, and Sqoop for transferring relational database data into HDFS or Hive, etc.

getting data in and out of hadoop

I need a system to analyze large log files. A friend directed me to Hadoop the other day and it seems perfect for my needs. My question revolves around getting data into Hadoop:
Is it possible to have the nodes on my cluster stream data into HDFS as they get it? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? And is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?
The Fluentd log collector just released its WebHDFS plugin, which allows users to instantly stream data into HDFS. It's really easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can import data directly from your applications. Here's a Java example of posting logs to Fluentd.
Fluentd: Data Import from Java Applications
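For reference, the usual pattern with the fluent-logger-java library looks roughly like this (the tag names and a local Fluentd agent listening on its default port are assumptions):

import java.util.HashMap;
import java.util.Map;
import org.fluentd.logger.FluentLogger;

public class EventLogger {
    // Connects to a local Fluentd agent on the default port (24224)
    private static final FluentLogger LOG = FluentLogger.getLogger("app");

    public void logFollowEvent() {
        Map<String, Object> data = new HashMap<String, Object>();
        data.put("from", "userA");
        data.put("to", "userB");
        LOG.log("follow", data);   // emitted under the tag "app.follow"
    }
}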
A Hadoop job can run over multiple input files, so there's really no need to keep all your data in one file. You won't be able to process a file until its file handle is properly closed, however.
HDFS does not support appends (yet?)
What I do is run the MapReduce job periodically and output results to a "processed_logs_#{timestamp}" folder.
Another job can later take these processed logs and push them to a database etc., so they can be queried online.
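A sketch of the driver side of that pattern (mapper/reducer classes omitted; input and output paths are hypothetical):

import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PeriodicLogJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(PeriodicLogJob.class);
        // job.setMapperClass(...); job.setReducerClass(...); output key/value classes, etc.

        // A job can read many (closed) input files - no need to keep everything in one file.
        FileInputFormat.addInputPath(job, new Path("/logs/webserver"));
        FileInputFormat.addInputPath(job, new Path("/logs/app"));

        // Each periodic run writes to a fresh, timestamped output folder.
        String ts = new SimpleDateFormat("yyyyMMdd_HHmmss").format(new Date());
        FileOutputFormat.setOutputPath(job, new Path("/processed_logs_" + ts));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}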
I'd recommend using Flume to collect the log files from your servers into HDFS.
