Flume NG and HDFS - hadoop

I am very new to Hadoop, so please excuse the dumb questions.
I have the following understanding: the best use case for Hadoop is large files, which helps efficiency while running MapReduce tasks.
Keeping the above in mind, I am somewhat confused about Flume NG.
Assume I am tailing a log file and lines are produced every second; the moment the log gets a new line, it is transferred to HDFS via Flume.
a) Does this mean that Flume creates a new file for every line that is logged in the log file I am tailing, or does it append to the existing HDFS file?
b) Is append allowed in HDFS in the first place?
c) If the answer to b) is yes, i.e. contents are appended constantly, how and when should I run my MapReduce application?
These questions could sound very silly, but answers to them would be highly appreciated.
PS: I have not yet set up Flume NG or Hadoop; I am just reading articles to get an understanding of how it could add value to my company.

Flume writes to HDFS by means of the HDFS sink. When Flume starts and begins to receive events, the sink opens a new file and writes events into it. At some point the previously opened file has to be closed, and until then data in the current block being written is not visible to other readers.
As described in the documentation, Flume HDFS sink has several file closing strategies:
every N seconds (specified by the rollInterval option)
after writing N bytes (rollSize option)
after writing N received events (rollCount option)
after N seconds of inactivity (idleTimeout option)
So, to your questions:
a) Flume writes events to the currently open file until it is closed (and a new file is opened).
b) Append is allowed in HDFS, but Flume does not use it. After a file is closed, Flume does not append any data to it.
c) To hide the currently open file from your MapReduce application, use the inUsePrefix option - files whose names start with . are not visible to MR jobs.
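For illustration, here is a minimal sketch of an HDFS sink configuration combining these options (the agent, channel and path names are made up, not taken from the question):
agent.sinks=hdfs-sink
agent.sinks.hdfs-sink.type=hdfs
agent.sinks.hdfs-sink.channel=log-channel
agent.sinks.hdfs-sink.hdfs.path=/flume/logs
# roll a new file every 10 minutes or 128 MB, whichever comes first
agent.sinks.hdfs-sink.hdfs.rollInterval=600
agent.sinks.hdfs-sink.hdfs.rollSize=134217728
agent.sinks.hdfs-sink.hdfs.rollCount=0
# close files that have received no events for 5 minutes
agent.sinks.hdfs-sink.hdfs.idleTimeout=300
# files still being written start with "." so MR jobs ignore them until they are closed
agent.sinks.hdfs-sink.hdfs.inUsePrefix=.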

Related

Synchronize NiFi process groups or flows that don't/can't connect?

Like the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
Eg. I have a process where I want to getFTP->putHDFS->moveHDFS (which ends up actually being getFTP->putHDFS->listHDFS->moveHDFS, see https://stackoverflow.com/a/50166151/8236733). However, listHDFS does not seem to take any incoming connections. Trying to do something with process groups like P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS} also runs into the same problem (listHDFS can't seem to take any incoming connections). We don't want to moveHDFS before we ever even get anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
New to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice on this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename all when some certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather on a timer/CRON schedule. Every time it runs, it will produce n flowfiles, where each references an HDFS file that has been detected to be written to the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.
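As a rough sketch of the resulting layout (processor names as described above; the scheduling values are just placeholders):
Flow segment A:  GetFTP -> PutHDFS
Flow segment B:  ListHDFS (timer/CRON driven, e.g. every 1 min) -> MoveHDFS    (no connection to segment A)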

Spark Streaming app streams files that have already been streamed

We have a Spark Streaming app deployed on a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM.
The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that:
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
//some maps and other logic here
ssc.start()
ssc.awaitTermination()
The purpose of using fileStream instead of textFileStream is to customize the way Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to S3, say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But sometimes when we add a larger number of files, we see strange behavior: the application starts streaming files that have already been streamed.
For example, I add 20 files to S3. The files are processed in 3 batches: the first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files!
Any thoughts what can be causing this?
I've posted a Jira ticket for this issue:
https://issues.apache.org/jira/browse/SPARK-3553
Note the sentence "The files must be created in the dataDirectory by atomically moving or renaming them into the data directory" from the Spark Streaming Programming Guide. The entire file must appear all at once, rather than creating the file empty and appending to it.
One approach is to get CloudBerry to put the files somewhere else, and then run a script periodically that either moves or renames the files into the directory your streaming app is attached to.
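A minimal sketch of such a move step, assuming the staging and watched directories live on a filesystem where rename is atomic (e.g. HDFS, not S3); the paths and object name are hypothetical:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object PromoteStagedFiles {
  def main(args: Array[String]): Unit = {
    val staging = new Path("hdfs:///data/requests-staging") // files land here first
    val watched = new Path("hdfs:///data/requests")         // directory the fileStream monitors
    val fs = FileSystem.get(staging.toUri, new Configuration())
    // rename is atomic on HDFS, so the streaming app never sees a partially written file
    fs.listStatus(staging).filter(_.isFile).foreach { status =>
      fs.rename(status.getPath, new Path(watched, status.getPath.getName))
    }
  }
}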

Hadoop Distributed File System

I have a file.txt that has 3 blocks (block a, block b, block c). How does Hadoop write these blocks to the cluster? My question is: does Hadoop write the blocks in parallel? Does block b have to wait for block a to be written to the cluster, or are blocks a, b and c written to the cluster in parallel?
When you copy a file from the local file system to HDFS or when you create a new file in HDFS: blocks are copied sequentially - first, the first block is copied to a datanode, then the second block is copied to a datanode and so on.
What is done in parallel, however, is replica placement: while a datanode receives data of the block from the client, the datanode saves the data in a file, which represents the block, and, simultaneously re-sends the data to another datanode, which is supposed to create another replica of the block.
When you copy a file from one location to another inside an HDFS cluster or between two HDFS clusters: you do it in parallel using DistCp.
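For example, a typical DistCp invocation looks like this (namenode addresses and paths are placeholders):
hadoop distcp hdfs://namenode1:8020/data/logs hdfs://namenode2:8020/data/logs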
When you attempt to copy a file from a local system to HDFS, or create a new file there, the blocks are copied to the datanodes one after another, much like the elements of an array - a consecutive, sequential arrangement of data blocks.
During this handshake, the moment a datanode receives data for a block it saves it to a file (a kind of savepoint) and forwards the same data on for replication; the process then repeats sequentially for the remaining blocks, which provides redundancy, and the saved state is used for comparison.
When you copy a file from one location to another within the same cluster, or between two different clusters, you use AHDC (Apache Hadoop DistCp).
Hadoop is designed to keep the data state restorable until the transaction has been completed.

How can I force Flume-NG to process the backlog of events after a sink failed?

I'm trying to set up Flume-NG to collect various kinds of logs from a bunch of servers (mostly running Tomcat instances and Apache Httpd) and dump them into HDFS on a 5-node Hadoop cluster. The setup looks like this:
Each application server tails the relevant logs into one of the Exec Sources (one for each log type: java, httpd, syslog), which send them through a FileChannel to an Avro sink. On each server the different sources, channels and sinks are managed by one agent. Events get picked up by an AvroSource which resides on the Hadoop cluster (the node that also hosts the SecondaryNameNode and the JobTracker). For each log type there is an AvroSource listening on a different port. The events go through the FileChannel into the HDFS sink, which saves the events using the FlumeEventAvro EventSerializer and Snappy compression.
The problem: the agent on the Hadoop node that manages the HDFS sinks (again, one for each log type) failed after some hours because we hadn't changed the heap size of the JVM. From then on, lots of events were collected in the FileChannel on that node and, after that, also in the FileChannels on the application servers, because the FileChannel on the Hadoop node reached its maximum capacity. When I fixed the problem, I couldn't get the agent on the Hadoop node to process the backlog quickly enough to resume normal operation. The size of the tmp dir where the FileChannel saves the events before sinking them keeps growing all the time. Also, HDFS writes seem to be really slow.
Is there a way to force Flume to process the backlog first before ingesting new events? Is the following configuration optimal? Maybe related: the files that get written to HDFS are really small, around 1-3 MB or so. That's certainly not optimal with the HDFS default block size of 64 MB and with regard to future MR operations. What settings should I use to collect the events in files large enough for the HDFS block size?
I have a feeling the config on the Hadoop node is not right; I suspect the values for batchSize, rollCount and related params are off, but I'm not sure what the optimal values should be.
Example config on Application Servers:
agent.sources=syslogtail httpdtail javatail
agent.channels=tmpfile-syslog tmpfile-httpd tmpfile-java
agent.sinks=avrosink-syslog avrosink-httpd avrosink-java
agent.sources.syslogtail.type=exec
agent.sources.syslogtail.command=tail -F /var/log/messages
agent.sources.syslogtail.interceptors=ts
agent.sources.syslogtail.interceptors.ts.type=timestamp
agent.sources.syslogtail.channels=tmpfile-syslog
agent.sources.syslogtail.batchSize=1
...
agent.channels.tmpfile-syslog.type=file
agent.channels.tmpfile-syslog.checkpointDir=/tmp/flume/syslog/checkpoint
agent.channels.tmpfile-syslog.dataDirs=/tmp/flume/syslog/data
...
agent.sinks.avrosink-syslog.type=avro
agent.sinks.avrosink-syslog.channel=tmpfile-syslog
agent.sinks.avrosink-syslog.hostname=somehost
agent.sinks.avrosink-syslog.port=XXXXX
agent.sinks.avrosink-syslog.batch-size=1
Example config on Hadoop node
agent.sources=avrosource-httpd avrosource-syslog avrosource-java
agent.channels=tmpfile-httpd tmpfile-syslog tmpfile-java
agent.sinks=hdfssink-httpd hdfssink-syslog hdfssink-java
agent.sources.avrosource-java.type=avro
agent.sources.avrosource-java.channels=tmpfile-java
agent.sources.avrosource-java.bind=0.0.0.0
agent.sources.avrosource-java.port=XXXXX
...
agent.channels.tmpfile-java.type=file
agent.channels.tmpfile-java.checkpointDir=/tmp/flume/java/checkpoint
agent.channels.tmpfile-java.dataDirs=/tmp/flume/java/data
agent.channels.tmpfile-java.write-timeout=10
agent.channels.tmpfile-java.keepalive=5
agent.channels.tmpfile-java.capacity=2000000
...
agent.sinks.hdfssink-java.type=hdfs
agent.sinks.hdfssink-java.channel=tmpfile-java
agent.sinks.hdfssink-java.hdfs.path=/logs/java/avro/%Y%m%d/%H
agent.sinks.hdfssink-java.hdfs.filePrefix=java-
agent.sinks.hdfssink-java.hdfs.fileType=DataStream
agent.sinks.hdfssink-java.hdfs.rollInterval=300
agent.sinks.hdfssink-java.hdfs.rollSize=0
agent.sinks.hdfssink-java.hdfs.rollCount=40000
agent.sinks.hdfssink-java.hdfs.batchSize=20000
agent.sinks.hdfssink-java.hdfs.txnEventMax=20000
agent.sinks.hdfssink-java.hdfs.threadsPoolSize=100
agent.sinks.hdfssink-java.hdfs.rollTimerPoolSize=10
There are a couple of things I see in your configuration that can cause issues:
Your first agent seems to have an avro sink with batch size of 1. You should bump this up to at least 100 or more. This is because the avro source on the second agent would be committing to the channel with batch size of 1. Each commit causes an fsync, causing the file channel performance to be poor. The batch size on the exec source is also 1, causing that channel to be slow as well. You can increase the batch size (or use the Spool Directory Source - more on that later).
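On the application servers that could look like this, using the property names already in your config (100 is just an illustrative value):
agent.sources.syslogtail.batchSize=100
...
agent.sinks.avrosink-syslog.batch-size=100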
You can have multiple HDFS sinks reading from the same channel to improve performance. You should just make sure that each sink writes to a different directory or have different "hdfs.filePrefix", so that multiple HDFS sinks don't try to write to the same files.
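A sketch of two HDFS sinks draining the same channel (the remaining hdfs.* settings are omitted; the prefixes differ so the sinks never write to the same files):
agent.sinks=hdfssink-java-1 hdfssink-java-2
agent.sinks.hdfssink-java-1.channel=tmpfile-java
agent.sinks.hdfssink-java-1.hdfs.filePrefix=java-1-
agent.sinks.hdfssink-java-2.channel=tmpfile-java
agent.sinks.hdfssink-java-2.hdfs.filePrefix=java-2-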
Your batch size for the HDFS sink is 20000, which is quite high, and your callTimeout is the default of 10 seconds. You should increase "hdfs.callTimeout" if you want to keep such a huge batch size. I'd recommend reducing the batch size to 1000 or so, and having timeout of about 15-20 seconds. (Note that at the current batch size, each file holds only 2 batches - so reduce the batch size, increase the rollInterval and timeOut)
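Applied to the Hadoop-node config above, that tuning might look roughly like this (values are only suggestions; callTimeout is in milliseconds):
agent.sinks.hdfssink-java.hdfs.batchSize=1000
agent.sinks.hdfssink-java.hdfs.callTimeout=20000
agent.sinks.hdfssink-java.hdfs.rollInterval=600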
If you are using tail -F, I'd recommend trying out the new Spool Directory Source. To use this source, rotate out your log files to a directory, which the Spool Directory Source processes. This source will only process files which are immutable, so you need to rotate the log files out. Using tail -F with exec source has issues, as documented in the Flume User Guide.
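If you move to the Spooling Directory Source, a minimal source definition could look like this (the spool directory path is hypothetical; rotated, immutable log files would be dropped there):
agent.sources.spool-syslog.type=spooldir
agent.sources.spool-syslog.spoolDir=/var/log/flume-spool/syslog
agent.sources.spool-syslog.channels=tmpfile-syslog
agent.sources.spool-syslog.batchSize=100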

Hadoop Operationalization

I have all the pieces of a Hadoop implementation ready - I have a running cluster, and a client writer that is pushing activity data into HDFS. I have a question about what happens next. I understand that we run jobs against the data that has been dumped into HDFS, but my questions are:
1) First off, I am writing into the stream and flushing periodically - I am writing the files via a thread in the HDFS Java client, and I don't see the files appear in HDFS until I kill my server. If I write enough data to fill a block, will that automatically appear in the file system? How do I get to a point where I have files that are ready to be processed by M/R jobs?
2) When do we run M/R jobs? Like I said, I am writing the files via a thread in the HDFS Java client, and that thread has a lock on the file for writing. At what point should I release that file? How does this interaction work? At what point is it 'safe' to run a job against that data, and what happens to the data in HDFS when it's done?
I would try to avoid "hard" synchronization between data insertion into Hadoop and processing of the results. I mean that in many cases it is most practical to have two asynchronous processes:
a) One process putting files into HDFS. In many cases, building a directory structure by date is useful.
b) Another running jobs over all but the most recent data.
You can run a job on the most recent data, but the application should not rely on up-to-the-minute results; a job usually takes more than a few minutes anyway.
Another point - append is not 100% mainstream; it is an advanced feature built mainly for HBase. If you build your app without it, you will be able to work with other DFSs like Amazon S3, which do not support append. We collect data in the local file system and then copy it to HDFS when the file is big enough.
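A rough sketch of that local-then-copy approach, combined with the date-based directory layout mentioned above (the paths, object name and size threshold are hypothetical):
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object UploadWhenBigEnough {
  def main(args: Array[String]): Unit = {
    val local = new Path("file:///var/data/activity.log")               // file being appended locally
    val threshold = 64L * 1024 * 1024                                   // roughly one HDFS block
    val localFs = FileSystem.getLocal(new Configuration())
    if (localFs.getFileStatus(local).getLen >= threshold) {
      val day = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy/MM/dd"))
      val target = new Path(s"hdfs:///data/activity/$day/activity.log") // date-partitioned target
      val hdfs = FileSystem.get(target.toUri, new Configuration())
      // copyFromLocalFile(delSrc, src, dst): upload the file and remove the local copy
      hdfs.copyFromLocalFile(true, local, target)
    }
  }
}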
If you write enough data to fill a block, you will see the file in the file system.
M/R jobs are submitted to the scheduler, which takes care of running them against the data; you need not worry about that.
