Getting Spark Streaming performance metrics using data from rest API - spark-streaming

I am new to using Rest Api for spark streaming resource usage computation, so I want to know some things regarding the data from rest
In allexecutor end point we have totalDuration and maxMemory , if using dynamic allocation one specific executor was removed and then reallocated for some other task then does duration for this will add to its totalDuration or it would be overwritten.
For getting memory usage for Spark Streaming is it a correct way to use this rest API , or another way to do calculate memory used by Spark Streaming Application.

Related

apache beam stream processing failure recovery

Running a streaming beam pipeline where i stream files/records from gcs using avroIO and then create minutely/hourly buckets to aggregate events and add it to BQ. In case the pipeline fails how can i recover correctly and process the unprocessed events only ? I do not want to double count events .
One approach i was thinking was writing to spanner or bigtable but it may be the case the write to BQ succeeds but the DB fails and vice versa ?
How can i maintain a state in reliable consistent way in streaming pipeline to process only unprocessed events ?
I want to make sure the final aggregated data in BQ is the exact count for different events and not under or over counting ?
How does spark streaming pipeline solve this (I know they have some checkpointing directory for managing state of query and dataframes ) ?
Are there any recommended techniques to solve accurately these kind of problem in streaming pipelines ?
Based on clarification from the comments, this question boils down to 'can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs are start from scratch?'. Short answer is no. Even if the user is willing store some state in external storage, it needs to be committed atomically/consistently with streaming engine internal state. Streaming engines like Dataflow, Flink store required state internally, which is needed for to 'resume' a job. With Flink you could resume from latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors, you need to cancel a job explicitly). Dataflow does provide exactly-once processing guarantee with update.
Some what relaxed guarantees would be feasible with careful use of external storage. The details really depend on specific goals (often it is is no worth the extra complexity).

How can NiFi handle burst data?

If the submitted data to NiFi are not coming in a steady flow (but on bursty) how can NiFi handle them? Does it use a message broker to buffer them? I haven't seen anything like this in its documentation.
NiFi connections (the links between processors) have the capability of buffering FlowFiles (the unit of data that NiFi handles, basically content + metadata about that content), and NiFi also has the feature of backpressure, the ability of a processor to "tell" the upstream flow that it cannot handle any more data at a particular time. The relevant discussion from User Guide is here.
Basically you can set up connection(s) to be as "wide" as you expect the burst to be, or if that is not prudent, you can set it to a more appropriate value, and NiFi will do sort of a "leaky bucket with notification" approach, where it will handle what data it can, and the framework will handle scheduling of upstream processors based on whether they would be able to do their job.
If you are getting data from a source system that does not buffer data, then you can suffer data loss when backpressure is applied; however that is because the source system must push data when NiFi can not prudently accept it, and thus the alterations should be made on the source system vs the NiFi flow.

Amazon Web Services: Spark Streaming or Lambda

I am looking for some high level guidance on an architecture. I have a provider writing "transactions" to a Kinesis pipe (about 1MM/day). I need to pull those transactions off, one at a time, validating data, hitting other SOAP or Rest services for additional information, applying some business logic, and writing the results to S3.
One approach that has been proposed is use Spark job that runs forever, pulling data and processing it within the Spark environment. The benefits were enumerated as shareable cached data, availability of SQL, and in-house knowledge of Spark.
My thought was to have a series of Lambda functions that would process the data. As I understand it, I can have a Lambda watching the Kinesis pipe for new data. I want to run the pulled data through a bunch of small steps (lambdas), each one doing a single step in the process. This seems like an ideal use of Step Functions. With regards to caches, if any are needed, I thought that Redis on ElastiCache could be used.
Can this be done using a combination of Lambda and Step Functions (using lambdas)? If it can be done, is it the best approach? What other alternatives should I consider?
This can be achieved using a combination of Lambda and Step Functions. As you described, the lambda would monitor the stream and kick off a new execution of a state machine, passing the transaction data to it as an input. You can see more documentation around kinesis with lambda here: http://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html.
The state machine would then pass the data from one Lambda function to the next where the data will be processed and written to S3. You need to contact AWS for an increase on the default 2 per second StartExecution API limit to support 1MM/day.
Hope this helps!

Using NiFi for scheduling Hadoop batch processes

According to NiFi's homepage, it "supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic".
I've been playing with NiFi in the last couple of months and can't help wondering why not using it also for scheduling batch processes.
Let's say I have a use case in which data flows into Hadoop, processed by a series of Hive \ MapReduce jobs, then exported to some external NoSql database to be used by some system.
Using NiFi in order to ingest and flow data into Hadoop is a use case that NiFi was built for.
However, using Nifi in order to schedule the jobs on Hadoop ("Oozie-like") is a use case I haven't encountered others implementing, and since it seems completely possible to implement, I'm trying to understand if there are reasons not to do so.
The gains of doing it all on NiFi is that one will get a visual representation of the entire course of data, from source to destination in one place. In case of complicated flows it is very important for maintenance.
In other words - my question is: Are there reasons not to use NiFi as a scheduler\coordinator for batch processes? If so - what problems may arise in such a use case?
PS - I've read this: "Is Nifi having batch processing?" - but my question aims to a different sense of "batch processing in NiFi" than the one raised in the attached question
You are correct that you would have a concise visual representation of all steps in the process were the schedule triggers present on the flow canvas, but NiFi was not designed as a scheduler/coordinator. Here is a comparison of some scheduler options.
Using NiFi to control scheduling feels like a "hammer" solution in search of a problem. It will reduce the ease of defining those schedules programmatically or interacting with them from external tools. Theoretically, you could define the schedule format, read them into NiFi from a file, datasource, endpoint, etc., and use ExecuteStreamCommand, ExecuteScript, or InvokeHTTP processors to kick off the batch process. This feels like introducing an unnecessary intermediate step however. If consolidation & visualization is what you are going for, you could have a monitoring flow segment ingest those schedule definitions from their native format (Oozie, XML, etc.) and display them in NiFi without making NiFi responsible for defining and executing the scheduling.

Amazon Elastic Map Reduce Hadoop Jobs

Im new to Amazon Web Services and Map Reduce staff. My basic problem is I am trying to make an academic project were basically I am processing a large bunch of images and I need to detect a particular object in them. After I need a Map filled by objects made of key = averageRGB and value = BufferedImage of the object detected. I managed to do this application single threaded and that was not a problem. My questions are : If I make a map reduce job can I achieve the Map mentioned earlier? If this is possible..can I use the Map to do something with it before the job finishes so I get the final results? And 1 last question...If I upload my sample data in a single folder in S3 bucket, will the Elastic Map Reduce of Amazon take care to split that data onto the cluster and parallelize the process or I have to split the data myself over the cluster?
Excuse my ignorance but I cannot find the right answers on the net.
Thanks
Yes you can use map as you have mentioned.
In reducer again you will get map for key and values there you can do more calculations before final results are sent.
when you upload you data to s3bucekt. You can use path as s3n for you input. Also specify s3bucket path to store output using s3n
When you provide input path using s3n, the EMR will automatically download files to EMR nodes and split them and distribute over all nodes. We need not do any thing for that purpose.

Resources