Hadoop Operationalization - hadoop

I have all the pieces of a hadoop implementation ready - I have a running cluster, and a client writer that is pushing activity data into HDFS. I have a question about what happens next. I understand that we run jobs against the data that has been dumped into HDFS, but my questions are:
1) First off, I am writing into the stream and flushing periodically - I am writing the files via a thread in the HDFS java client, and I don't see the files appear in HDFS until I kill my server. If I write enough data to fill a block, will that automatically appear in the file system? How do I get to a point where I have files that are ready to be processed by M/R jobs?
2) When do we run M/R jobs? Like I said, I am writing the files via a thread in the HDFS java client, and that thread has a lock on the file for write. At what point should I release that file? How does this interaction work? At what point is it 'safe' to run a job against that data, and what happens to the data in HDFS when its done?

I would try to avoid "hard" synchronization between data insertion into hadoop and processing results. I mean that in many cases it is most practical to have to asynchronious processes:
a) One process putting files into HDFS. In many cases -building directory structure by dates is usefull.
b) Run jobs for all but most recent data.
You can run job on most recent data, but application should not relay on up to the minute results. In any case job usually takes more then a few minutes in any case
Another point - append is not 100% mainstream but advanced thing built for HBase. If you build your app without usage of it - you will be able to work with other DFS's like amazon s3 which do not support append. We are collecting data in local file system, and then copy them to HDFS when file is big enough.

write the data to fill a block , you will see the file in the system
M/R is submitted to the scheduler , which takes care of running it against data, we need not worry abt

Related

How to delete input files after successful mapreduce

We have a system that receives archives on a specified directory and on a regular basis it launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we're hooked into the close() method on our RecordReader to have it deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has been deleted by the record reader from the first one and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
(we're done, delete input files, how?)
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also whatever the back end map reduce processing is doing could be idempotent: processing the same record twice should not be any different than processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
The simplest way would probably be do a multiple input job, read the directory for the files before you run the job and pass those instead of a directory to the job (then delete the files in the list after the job is done).
Based on the situation you are explaining I can suggest the following solution:-
1.The process of data monitoring I.e monitoring the directory into which the archives are landing should be done by a separate process. That separate process can use some metadata table like in mysql to put status entries based on monitoring the directories. The metadata entries can also check for duplicacy.
2. Now based on the metadata entry a separate process can handle the map reduce job triggering part. Some status could be checked in metadata for triggering the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty.

Re-use files in Hadoop Distributed cache

I am wondering if someone can explain how the distributed cache works in Hadoop. I am running a job many times, and after each run I notice that the local distributed cache folder on each node is growing in size.
Is there a way for multiple jobs to re-use the same file in the distributed cache? Or is the distributed cache only valid for the lifetime of any individual job?
The reason I am confused is that the Hadoop documentation mentions that "DistributedCache tracks modification timestamps of the cache files", so this leads me to believe that if the time stamp hasn't changed, then it should not need to re-cache or re-copy the files to the nodes.
I am adding files successfully to the distributed cache using:
DistributedCache.addFileToClassPath(hdfsPath, conf);
DistributedCache uses reference counting to manage the caches. org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread is in charge of cleaning up the CacheDirs whose reference count is 0. It will check every minute (default period is 1 minute, you can set it by "mapreduce.tasktracker.distributedcache.checkperiod").
When a Job finishes or fails, JobTracker will send a org.apache.hadoop.mapred.KillJobAction to the TaskTrackers. Then if a TaskTracker receives a KillJobAction, it puts the action to tasksToCleanup. In the TaskTracker, there is a background Thread called taskCleanupThread which takes the action from tasksToCleanup and do the cleanup work. For a KillJobAction, it will invoke purgeJob to clean up the Job. In this method, it will decrease the reference count used by this Job (rjob.distCacheMgr.release();).
The above analysis bases on hadoop-core-2.0.0-mr1-cdh4.2.1-sources.jar. I also checked the hadoop-core-0.20.2-cdh3u1-sources.jar and found there was a litte difference between this two versions. For example, there was not a org.apache.hadoop.filecache.TrackerDistributedCacheManager.CleanupThread in 0.20.2-cdh3u1. When initializing a Job, TrackerDistributedCacheManager will check if there is enough space to put the new caches files for this Job. If not, it will delete the caches which have 0 reference count.
If you are using cdh4.2.1, you can increase "mapreduce.tasktracker.distributedcache.checkperiod" to let the clean up work delay. Then the probability that multiple Jobs use the same distributed cache is increased.
If you are using cdh3u1, you can increase the limitation of the cache size("local.cache.size", default is 10G) and the max directories for caches("mapreduce.tasktracker.cache.local.numberdirectories", default is 10000). This can be also applied to cdh4.2.1.
If you look closely at what this book says, is that there is a limit of what can be stored in Distributed Cache. By default it's 10GB (configurable). There can be multiple different jobs running in the cluster concurrently. Furthermore, Hadoop kind of guarantees the files stay available in the cache for a single job as it is maintained by reference count done by the tasktracker for different tasks accessing the files in cache. In your case, for subsequent Jobs, the files may not be there as they are already marked for deletion.
Please correct me if you disagree anywhere. I'll be glad to discuss this further.
According to this: http://www.datasalt.com/2011/05/handling-dependencies-and-configuration-in-java-hadoop-projects-efficiently/
You should be able to do this via DistributedCache API instead of "-libjars"

Flume NG and HDFS

I am very new to hadoop , so please excuse the dumb questions.
I have the following knowledge
Best usecase of Hadoop is large files thus helping in efficiency while running mapreduce tasks.
Keeping the above in mind I am somewhat confused about Flume NG.
Assume I am tailing a log file and logs are produced every second, the moment the log gets a new line it will be transferred to hdfs via Flume.
a) Does this mean that flume creates a new file on every line that is logged in the log file I am tailing or does it append to the existing hdfs file ??
b) is append allowed in hdfs in the first place??
c) if the answer to b is true ?? ie contents are appended constantly , how and when should I run my mapreduce application?
Above questions could sound very silly but a answers to the same will be highly appreciated.
PS: I have not yet set up Flume NG or hadoop as yet, just reading the articles to get an understanding and how it could add value to my company.
Flume writes to HDFS by means of HDFS sink. When Flume starts and begins to receive events, the sink opens new file and writes events into it. At some point previously opened file should be closed, and until then data in the current block being written is not visible to other redaers.
As described in the documentation, Flume HDFS sink has several file closing strategies:
each N seconds (specified by rollInterval option)
after writing N bytes (rollSize option)
after writing N received events (rollCount option)
after N seconds of inactivity (idleTimeout option)
So, to your questions:
a) Flume writes events to currently opened file until it is closed (and new file opened).
b) Append is allowed in HDFS, but Flume does not use it. After file is closed, Flume does not append to it any data.
c) To hide currently opened file from mapreduce application use inUsePrefix option - all files with name that starts with . is not visible to MR jobs.

How can I force Flume-NG to process the backlog of events after a sink failed?

I'm trying to setup Flume-NG to collect various kinds of logs from a bunch of servers (mostly running Tomcat instances and Apache Httpd) and dump them into HDFS on a 5-node Hadoop cluster. The setup looks like this:
Each application server tails the relevant logs into a one of the Exec Sources (one for each log type: java, httpd, syslog), which outs them through a FileChannel to an Avro sink. On each server the different sources, channels and sinks are managed by one Agent. Events get picked up by an AvroSource which resides on the Hadoop Cluster (the node that also hosts the SecondaryNameNode and the Jobtracker). For each logtype there is an AvroSource listening on a different port. The events go through the FileChannel into the HDFS Sink, which saves the events using the FlumeEventAvro EventSerializer and Snappy compression.
The problem: The agent on the Hadoop node that manages the HDFS Sinks (again, one for each logtype) failed after some hours because we didn't change the Heap size of the JVM. From then on lots of events were collected in the FileChannel on that node and after that also on the FileChannels on the Application Servers, because the FileChannel on the Hadoop node reached it's maximum capacity. When I fixed the problem, I couldn't get the agent on the Hadoop node to process the backlog speedily enough so it could resume normal operation. The size of the tmp dir where the FileChannel saves the events before sinking them, keeps growing all the time. Also, HDFS writes seem to be real slow.
Is there a way to force Flume to process the backlog first before ingesting new events? Is the following configuration optimal? Maybe related: The files that get written to HDFS are really small, around 1 - 3 MB or so. That's certainly not optimal with the HDFS default blocksize of 64MB and with regards to future MR operations. What settings should I use to collect the events in files large enough for the HDFS blocksize?
I have a feeling the config on the Hadoop node is not right, I'm suspecting the values for BatchSize, RollCount and related params are off, but I'm not sure what the optimal values should be.
Example config on Application Servers:
agent.sources=syslogtail httpdtail javatail
agent.channels=tmpfile-syslog tmpfile-httpd tmpfile-java
agent.sinks=avrosink-syslog avrosink-httpd avrosink-java
agent.sources.syslogtail.type=exec
agent.sources.syslogtail.command=tail -F /var/log/messages
agent.sources.syslogtail.interceptors=ts
agent.sources.syslogtail.interceptors.ts.type=timestamp
agent.sources.syslogtail.channels=tmpfile-syslog
agent.sources.syslogtail.batchSize=1
...
agent.channels.tmpfile-syslog.type=file
agent.channels.tmpfile-syslog.checkpointDir=/tmp/flume/syslog/checkpoint
agent.channels.tmpfile-syslog.dataDirs=/tmp/flume/syslog/data
...
agent.sinks.avrosink-syslog.type=avro
agent.sinks.avrosink-syslog.channel=tmpfile-syslog
agent.sinks.avrosink-syslog.hostname=somehost
agent.sinks.avrosink-syslog.port=XXXXX
agent.sinks.avrosink-syslog.batch-size=1
Example config on Hadoop node
agent.sources=avrosource-httpd avrosource-syslog avrosource-java
agent.channels=tmpfile-httpd tmpfile-syslog tmpfile-java
agent.sinks=hdfssink-httpd hdfssink-syslog hdfssink-java
agent.sources.avrosource-java.type=avro
agent.sources.avrosource-java.channels=tmpfile-java
agent.sources.avrosource-java.bind=0.0.0.0
agent.sources.avrosource-java.port=XXXXX
...
agent.channels.tmpfile-java.type=file
agent.channels.tmpfile-java.checkpointDir=/tmp/flume/java/checkpoint
agent.channels.tmpfile-java.dataDirs=/tmp/flume/java/data
agent.channels.tmpfile-java.write-timeout=10
agent.channels.tmpfile-java.keepalive=5
agent.channels.tmpfile-java.capacity=2000000
...
agent.sinks.hdfssink-java.type=hdfs
agent.sinks.hdfssink-java.channel=tmpfile-java
agent.sinks.hdfssink-java.hdfs.path=/logs/java/avro/%Y%m%d/%H
agent.sinks.hdfssink-java.hdfs.filePrefix=java-
agent.sinks.hdfssink-java.hdfs.fileType=DataStream
agent.sinks.hdfssink-java.hdfs.rollInterval=300
agent.sinks.hdfssink-java.hdfs.rollSize=0
agent.sinks.hdfssink-java.hdfs.rollCount=40000
agent.sinks.hdfssink-java.hdfs.batchSize=20000
agent.sinks.hdfssink-java.hdfs.txnEventMax=20000
agent.sinks.hdfssink-java.hdfs.threadsPoolSize=100
agent.sinks.hdfssink-java.hdfs.rollTimerPoolSize=10
There are a couple of things I see in your configuration that can cause issues:
Your first agent seems to have an avro sink with batch size of 1. You should bump this up to at least 100 or more. This is because the avro source on the second agent would be committing to the channel with batch size of 1. Each commit causes an fsync, causing the file channel performance to be poor. The batch size on the exec source is also 1, causing that channel to be slow as well. You can increase the batch size (or use the Spool Directory Source - more on that later).
You can have multiple HDFS sinks reading from the same channel to improve performance. You should just make sure that each sink writes to a different directory or have different "hdfs.filePrefix", so that multiple HDFS sinks don't try to write to the same files.
Your batch size for the HDFS sink is 20000, which is quite high, and your callTimeout is the default of 10 seconds. You should increase "hdfs.callTimeout" if you want to keep such a huge batch size. I'd recommend reducing the batch size to 1000 or so, and having timeout of about 15-20 seconds. (Note that at the current batch size, each file holds only 2 batches - so reduce the batch size, increase the rollInterval and timeOut)
If you are using tail -F, I'd recommend trying out the new Spool Directory Source. To use this source, rotate out your log files to a directory, which the Spool Directory Source processes. This source will only process files which are immutable, so you need to rotate the log files out. Using tail -F with exec source has issues, as documented in the Flume User Guide.

Does it matter where I submit hadoop jobs from?

Does it have any measurable effect on resources whether I submit a bunch of hadoop jobs from different client servers or all from the same one? I would think not since all the work is done in the cluster. Is this correct?
The only thing which is resource intensive on the client submitting to the Hadoop cluster is the calculation of the input splits. When the input data is huge or when too many jobs are submitted from the same client then because of the input split calculations, the job submission might become a bit slow.
I am not able to recall the Hadoop release or the parameter, but a configurable parameter was included to move the calculation of the input splits from the client submitting a job to the Hadoop cluster.
It really shouldn't matter where you submit your jobs from. The client itself doesn't do much, it uses RPC protocol to contact the services, and then just sits idle until the job is finished.
Also, the most important is what kind of scheduler you use to allocate resource, which is probably going to make the most significant difference and decide which resources to allocate to which job. More on job scheduling here.
I don't think you can move the input split calculation into Job Tracker in 'Classic' version. In YARN, you can move it using
"yarn.app.mapreduce.am.compute-splits-in-cluster"
I am guessing, Hadoop people didn't want to overload Job tracker with input split creation. Similar to the design decision of not assigning too much work for Namenode in HDFS.
In YARN, every job gets its own Application Master, so no worries about overloading a SPOF/bottleneck master like job tracker.
In reference to the original question, the client job would have to reach out to the namenode to get the block locations (I have see parts of code on block storage class calling data node for some meta data...not sure whether these happen during input split creation or in task tracker node) . This can become an issue if you are handling a lot of jobs on the same client node.
If you are using YARN, there would be a slight performance increase if all these communications happen inside the cluster.
Need to check how Oozie handles this issue.
Hopefully, this helps!
Arun

Resources