Spark Streaming app streams files that have already been streamed - amazon-ec2

We have a Spark Streaming app deployed on a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM.
The app streams from a directory in S3 which is constantly being written to; these are the lines of code that achieve that:
val ssc = new StreamingContext(sparkConf, Seconds(10))
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true)
//some maps and other logic here
ssc.start()
ssc.awaitTermination()
The purpose of using fileStream instead of textFileStream is to customize the way Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds.
The process goes fine while we add a small number of files to S3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we add a larger number of files, we sometimes face a strange behavior: the application starts streaming files that have already been streamed.
For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these batches endlessly with the same files!
Any thoughts on what could be causing this?
I've posted a Jira ticket for this issue:
https://issues.apache.org/jira/browse/SPARK-3553

Note the sentence "The files must be created in the dataDirectory by atomically moving or renaming them into the data directory" from the Spark Streaming Programming Guide. The entire file must appear all at once, rather than being created empty and appended to.
One approach is to have CloudBerry put the files somewhere else, and then run a script periodically that moves or renames the files into the directory your streaming app is watching.
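For illustration, here is a minimal Java sketch of such a mover using the Hadoop FileSystem API. The class name and directory paths are made up, and on S3 a rename is implemented as copy-then-delete rather than a true atomic rename, but the object still only appears in the watched location once it is complete:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical mover: run this on a schedule (cron, etc.) to move fully-written
// files from a staging directory into the directory the streaming app watches.
public class StagedFileMover {
    public static void main(String[] args) throws Exception {
        Path stagingDir = new Path(args[0]);  // where the uploader drops the files
        Path watchedDir = new Path(args[1]);  // the directory fileStream is monitoring
        FileSystem fs = stagingDir.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(stagingDir)) {
            if (status.isFile()) {
                fs.rename(status.getPath(), new Path(watchedDir, status.getPath().getName()));
            }
        }
    }
}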

Related

Synchronize NiFi process groups or flows that don't/can't connect?

As the question states, is there some way to synchronize NiFi process groups or pipelines that don't/can't connect in the UI?
E.g. I have a process where I want getFTP->putHDFS->moveHDFS (which ends up actually being getFTP->putHDFS->listHDFS->moveHDFS, see https://stackoverflow.com/a/50166151/8236733). However, listHDFS does not seem to take any incoming connections. Trying to do something with process groups like P1{getFTP->putHDFS->outport}->P2{inport->listHDFS->moveHDFS} runs into the same problem (listHDFS can't seem to take any incoming connections). We don't want to moveHDFS before we ever get anything from getFTP, but given the above, I don't see how these actions can be synchronized to occur in the right order.
I'm new to NiFi, but I imagine this is a common use case and there must be some NiFi-ish way of doing this that I am missing. Advice on this would be appreciated. Thanks.
I'm not sure what requirement is preventing you from writing the file retrieved from FTP directly to the desired HDFS location, or if this is a "write n files to HDFS with a . starting the filename and then rename them all when a certain threshold is reached" scenario.
ListHDFS does not take any incoming relationships because it should not be triggered by an incoming event, but rather runs on a timer/CRON schedule. Every time it runs, it produces n flowfiles, each referencing an HDFS file detected on the filesystem since the last execution. To do this, the processor stores local state.
Your flow segments do not need to be connected in this case. You'll have "flow segment A" which performs the FTP -> HDFS writing (GetFTP -> PutHDFS) and you'll have an independent "flow segment B" which lists the HDFS directory, reads the file descriptors (but not the content of the file unless you use FetchHDFS as well) and moves them (ListHDFS -> MoveHDFS). The ListHDFS processor will run constantly, but if it does not detect any new files during a run, it will simply yield and perform a no-op. Once the PutHDFS processor completes the task of writing a file to the HDFS file system, on the next ListHDFS execution, it will detect that file and generate a flowfile describing it.
You can tune the scheduling to your liking, but in general this is a very common pattern in NiFi flows.

How to architecture file processing in laravel

I have a task to observe a folder where files arrive via SFTP. The files are big and processing one file is relatively time consuming. I am looking for the best approach to do it. Here are some ideas, but I am not sure which is the best way:
Run a scheduler every 5 minutes to check for new files.
For each new file, trigger an event.
Create a listener which will listen for this event and process it using queues. In the listener, copy the new file into a processing folder and process it. When processing of a new file starts, insert a record in the DB with status "processing". When processing is done, change the record status and copy the file to a "processed" folder.
In this solution I have 2 copy operations for each file. This is because, if the second scheduler run executes before all files are processed, some files could overlap across 2 processing jobs.
What is the best way to do this? Should I use another approach to avoid the 2 copy operations, for example a database check during scheduler execution to see whether the file is already in the "processing" state?
You should use ->withoutOverlapping(), as stated in the task scheduler manual here.
Using this, you make sure that only one instance of the task runs at any given time.

How to delete input files after successful mapreduce

We have a system that receives archives in a specified directory and, on a regular basis, launches a mapreduce job that opens the archives and processes the files within them. To avoid re-processing the same archives the next time, we hook into the close() method on our RecordReader to have the archive deleted after the last entry is read.
The problem with this approach (we think) is that if a particular mapping fails, the next mapper that makes another attempt at it finds that the original file has already been deleted by the record reader from the first attempt, and it bombs out. We think the way to go is to hold off until all the mapping and reducing is complete and then delete the input archives.
Is this the best way to do this?
If so, how can we obtain a listing of all the input files found by the system from the main program? (we can't just scrub the whole input dir, new files may be present)
i.e.:
. . .
job.waitForCompletion(true);
// we're done; delete the input files, but how?
return 0;
}
Couple comments.
I think this design is heartache-prone. What happens when you discover that someone deployed a messed up algorithm to your MR cluster and you have to backfill a month's worth of archives? They're gone now. What happens when processing takes longer than expected and a new job needs to start before the old one is completely done? Too many files are present and some get reprocessed. What about when the job starts while an archive is still in flight? Etc.
One way out of this trap is to have the archives go to a rotating location based on time, and either purge the records yourself or (in the case of something like S3) establish a retention policy that allows a certain window for operations. Also, whatever the back-end MapReduce processing is doing should be idempotent: processing the same record twice should be no different from processing it once. Something tells me that if you're reducing your dataset, that property will be difficult to guarantee.
At the very least you could rename the files you processed instead of deleting them right away and use a glob expression to define your input that does not include the renamed files. There are still race conditions as I mentioned above.
You could use a queue such as Amazon SQS to record the delivery of an archive, and your InputFormat could pull these entries rather than listing the archive folder when determining the input splits. But reprocessing or backfilling becomes problematic without additional infrastructure.
All that being said, the list of splits is generated by the InputFormat. Write a decorator around that and you can stash the split list wherever you want for use by the master after the job is done.
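As a rough sketch of that idea (the config key is made up, and it subclasses TextInputFormat rather than wrapping an arbitrary InputFormat), something like the following could write the list of input files to a side file that the driver reads back after waitForCompletion():

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class RecordingTextInputFormat extends TextInputFormat {
    // Hypothetical config key; the driver sets it to tell us where to stash the file list.
    public static final String SPLIT_LIST_PATH = "split.list.path";

    @Override
    public List<InputSplit> getSplits(JobContext context) throws IOException {
        List<InputSplit> splits = super.getSplits(context);
        Path listFile = new Path(context.getConfiguration().get(SPLIT_LIST_PATH));
        FileSystem fs = listFile.getFileSystem(context.getConfiguration());
        // Record the file behind every split so the driver can delete exactly those files later.
        try (FSDataOutputStream out = fs.create(listFile, true)) {
            for (InputSplit split : splits) {
                if (split instanceof FileSplit) {
                    out.writeBytes(((FileSplit) split).getPath().toString() + "\n");
                }
            }
        }
        return splits;
    }
}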
The simplest way would probably be to do a multiple-input job: read the directory for the files before you run the job and pass those, instead of a directory, to the job (then delete the files in the list after the job is done).
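A hedged sketch of that approach (the paths and class name are illustrative, and the usual job setup is elided): enumerate the files up front, add exactly those as inputs, and delete the same list only if the job succeeds.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class RunAndCleanDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "archive-processing");
        // ... set jar, mapper, reducer, output path, etc. as usual ...

        FileSystem fs = FileSystem.get(conf);
        Path incoming = new Path("/incoming/archives");  // hypothetical landing directory
        List<Path> inputs = new ArrayList<>();
        for (FileStatus status : fs.listStatus(incoming)) {
            if (status.isFile()) {
                inputs.add(status.getPath());
                FileInputFormat.addInputPath(job, status.getPath());
            }
        }

        boolean ok = job.waitForCompletion(true);
        if (ok) {
            for (Path p : inputs) {
                fs.delete(p, false);  // delete only the files this run actually read
            }
        }
        System.exit(ok ? 0 : 1);
    }
}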
Based on the situation you are explaining, I can suggest the following solution:
1. The process of data monitoring, i.e. monitoring the directory into which the archives are landing, should be done by a separate process. That separate process can use a metadata table, for example in MySQL, to record status entries based on monitoring the directories. The metadata entries can also be checked for duplicates.
2. Based on the metadata entries, a separate process can handle triggering the map reduce job. A status could be checked in the metadata before triggering the jobs.
I think you should use Apache Oozie to manage your workflow. From Oozie's website (bolding is mine):
Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
...
Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Flume NG and HDFS

I am very new to Hadoop, so please excuse the dumb questions.
I have the following understanding:
Hadoop's best use case is large files, which helps efficiency while running MapReduce tasks.
Keeping the above in mind, I am somewhat confused about Flume NG.
Assume I am tailing a log file and logs are produced every second; the moment the log gets a new line, it is transferred to HDFS via Flume.
a) Does this mean that Flume creates a new file for every line that is logged in the log file I am tailing, or does it append to the existing HDFS file?
b) Is append allowed in HDFS in the first place?
c) If the answer to b) is yes, i.e. contents are appended constantly, how and when should I run my MapReduce application?
These questions may sound very silly, but answers to them would be highly appreciated.
PS: I have not yet set up Flume NG or Hadoop; I am just reading articles to get an understanding of how it could add value to my company.
Flume writes to HDFS by means of the HDFS sink. When Flume starts and begins to receive events, the sink opens a new file and writes events into it. At some point the previously opened file has to be closed, and until then data in the current block being written is not visible to other readers.
As described in the documentation, the Flume HDFS sink has several file-closing strategies:
every N seconds (specified by the rollInterval option)
after writing N bytes (rollSize option)
after writing N received events (rollCount option)
after N seconds of inactivity (idleTimeout option)
So, to your questions:
a) Flume writes events to the currently open file until it is closed (and a new file is opened).
b) Append is allowed in HDFS, but Flume does not use it. After a file is closed, Flume does not append any data to it.
c) To hide the currently open file from the MapReduce application, use the inUsePrefix option; all files whose names start with . are not visible to MR jobs.
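For reference, a minimal sketch of an HDFS sink configuration showing these options (the agent, sink and channel names are made up; adjust the values to your own rolling policy):

agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.channel = memChannel
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
# close the current file every 300 seconds or after ~128 MB, whichever comes first
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.rollSize = 134217728
# 0 disables rolling by event count
agent.sinks.hdfsSink.hdfs.rollCount = 0
# also close the file after 60 seconds without new events
agent.sinks.hdfsSink.hdfs.idleTimeout = 60
# prefix in-progress files with a dot so MR jobs ignore them
agent.sinks.hdfsSink.hdfs.inUsePrefix = .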

Hadoop Operationalization

I have all the pieces of a hadoop implementation ready - I have a running cluster, and a client writer that is pushing activity data into HDFS. I have a question about what happens next. I understand that we run jobs against the data that has been dumped into HDFS, but my questions are:
1) First off, I am writing into the stream and flushing periodically. I am writing the files via a thread in the HDFS Java client, and I don't see the files appear in HDFS until I kill my server. If I write enough data to fill a block, will that automatically appear in the file system? How do I get to a point where I have files that are ready to be processed by M/R jobs?
2) When do we run M/R jobs? Like I said, I am writing the files via a thread in the HDFS Java client, and that thread has a lock on the file for writing. At what point should I release that file? How does this interaction work? At what point is it 'safe' to run a job against that data, and what happens to the data in HDFS when it's done?
I would try to avoid "hard" synchronization between data insertion into Hadoop and processing the results. I mean that in many cases it is most practical to have two asynchronous processes:
a) One process putting files into HDFS. In many cases, building a directory structure by date is useful.
b) Another process running jobs over all but the most recent data.
You can run a job on the most recent data, but the application should not rely on up-to-the-minute results. In any case, a job usually takes more than a few minutes.
Another point: append is not 100% mainstream, but rather an advanced feature built for HBase. If you build your app without using it, you will be able to work with other DFSs like Amazon S3, which do not support append. We collect data on the local file system and then copy it to HDFS when a file is big enough.
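A minimal Java sketch of that pattern (the paths are made up): once a local file is finished, copy it into a date-partitioned HDFS directory, so jobs can safely read the directories for completed days.

import java.time.LocalDate;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalToHdfsUploader {
    public static void main(String[] args) throws Exception {
        Path localFile = new Path(args[0]);  // a finished, closed local file
        // Hypothetical date-partitioned layout, e.g. /data/activity/2024-01-31
        Path hdfsDir = new Path("/data/activity/" + LocalDate.now());
        FileSystem fs = FileSystem.get(new Configuration());
        fs.mkdirs(hdfsDir);
        fs.copyFromLocalFile(localFile, new Path(hdfsDir, localFile.getName()));
    }
}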
Write enough data to fill a block and you will see the file in the system.
M/R jobs are submitted to the scheduler, which takes care of running them against the data; we need not worry about that.
