I am trying to write some performance-mindful code in Spark and am wondering whether I should write an Aggregator or a user-defined aggregate function (UDAF) for my rollup operations on a DataFrame.
I have not been able to find any data anywhere on how fast each of these methods is or which one you should use for Spark 2.0+.
You should write an Aggregator rather than a UserDefinedAggregateFunction, because UserDefinedAggregateFunction performs inefficient serialization and deserialization for each row. Rewriting a UserDefinedAggregateFunction as an Aggregator can improve performance by anywhere from 25%-30% to 100x, as stated in the pull request replacing UserDefinedAggregateFunction with Aggregator.
Because of those performance issues, the UserDefinedAggregateFunction class has been deprecated in Spark 3.0.
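To give a feel for what an Aggregator asks you to write, here is a minimal plain-Python sketch of the same four-step contract (zero, reduce, merge, finish), computing an average. It is not Spark code: the class, the buffer type, and the `run_partitioned` helper are hypothetical stand-ins, and a real Spark Aggregator is a Scala/Java class that also declares encoders for its buffer and output types.

```python
from dataclasses import dataclass
from functools import reduce as fold


@dataclass
class AvgBuffer:
    """Intermediate state: running sum and row count."""
    total: float = 0.0
    count: int = 0


class AverageAggregator:
    """Hypothetical stand-in that mirrors the Aggregator method names."""

    def zero(self):
        # Empty buffer; merging it with any buffer leaves that buffer unchanged.
        return AvgBuffer()

    def reduce(self, buf, value):
        # Fold one input value into a buffer.
        return AvgBuffer(buf.total + value, buf.count + 1)

    def merge(self, a, b):
        # Combine two partial buffers (for example, from two partitions).
        return AvgBuffer(a.total + b.total, a.count + b.count)

    def finish(self, buf):
        # Turn the final buffer into the output value.
        return buf.total / buf.count if buf.count else None


def run_partitioned(agg, partitions):
    """Simulate a distributed run: reduce per partition, then merge, then finish."""
    partials = [fold(agg.reduce, part, agg.zero()) for part in partitions]
    return agg.finish(fold(agg.merge, partials, agg.zero()))


result = run_partitioned(AverageAggregator(), [[1.0, 2.0], [3.0], []])
```

The buffer here is an ordinary typed object that is updated directly, which is the property the answer credits for the speedup over the row-by-row conversion in UserDefinedAggregateFunction.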
Related
This is a very broad question. I’m new to Flink and looking into the possibility of using it as a replacement for a current analytics engine.
The scenario is this: data is collected from various equipment and received as a JSON-encoded string with the format {"location.attribute":value, "TimeStamp":value}.
For example, a unitary traceability code is received for a location, after which various process parameters are received in a real-time stream. The analysis is to be run over the process parameters; however, the output needs to include a relation to the traceability code. For example: {"location.alarm":value, "location.traceability":value, "TimeStamp":value}
What method does Flink use for caching values, in this case the current traceability code whilst running analysis over other parameters received at a later time?
I’m mainly just looking for the area to research, as so far I’ve been unable to find any examples of this kind of scenario. Perhaps it’s not the kind of process that Flink can handle.
A natural way to do this sort of thing with Flink would be to key the stream by the location, and then use keyed state in a ProcessFunction (or RichFlatMapFunction) to store the partial results until ready to emit the output.
With a keyed stream, you are guaranteed that every event with the same key will be processed by the same instance. You can then use keyed state, which is effectively a sharded key/value store, to store per-key information.
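The idea can be sketched in plain Python, with a dictionary standing in for keyed state. This is only an illustration of the pattern, not the Flink API: the `TraceabilityEnricher` class and the sample location name are hypothetical, and dropping events that arrive before any traceability code is an assumption (a real job might buffer them instead).

```python
import json


class TraceabilityEnricher:
    """Remembers the latest traceability code per location (the 'keyed state')."""

    def __init__(self):
        # location -> most recent traceability code seen for that location
        self._state = {}

    def process(self, raw):
        """Handle one JSON event; return a list of enriched output records."""
        event = json.loads(raw)
        timestamp = event.pop("TimeStamp")
        outputs = []
        for name, value in event.items():
            # Field names look like "location.attribute"; the location is the key.
            location, attribute = name.split(".", 1)
            if attribute == "traceability":
                # Update the stored code for this key; nothing is emitted.
                self._state[location] = value
            elif location in self._state:
                # Attach the stored code to the later process parameter.
                outputs.append({
                    name: value,
                    f"{location}.traceability": self._state[location],
                    "TimeStamp": timestamp,
                })
            # Assumption: parameters seen before any code are dropped.
        return outputs


enricher = TraceabilityEnricher()
first = enricher.process('{"line1.traceability": "ABC123", "TimeStamp": 1}')
second = enricher.process('{"line1.alarm": 7, "TimeStamp": 2}')
other = enricher.process('{"line2.alarm": 1, "TimeStamp": 3}')
```

In Flink, keying the stream by location would route each location to one instance, and the dictionary lookup would be replaced by per-key state that the framework checkpoints for you.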
The Apache Flink training includes some explanatory material on keyed streams and working with keyed state, as well as an exercise or two that explore how to use these mechanisms to do roughly what you need.
Alternatively, you could do this with the Table or SQL API, and implement this as a join of the stream with itself.
I am running a streaming Beam pipeline where I stream files/records from GCS using AvroIO and then create minutely/hourly buckets to aggregate events and add them to BQ. If the pipeline fails, how can I recover correctly and process only the unprocessed events? I do not want to double count events.
One approach I was considering was writing to Spanner or Bigtable, but it may be that the write to BQ succeeds while the DB write fails, or vice versa.
How can I maintain state in a reliable, consistent way in a streaming pipeline so that only unprocessed events are processed?
I want to make sure the final aggregated data in BQ is the exact count for the different events, neither undercounting nor overcounting.
How does a Spark streaming pipeline solve this? (I know it has a checkpointing directory for managing the state of queries and DataFrames.)
Are there any recommended techniques for solving this kind of problem accurately in streaming pipelines?
Based on clarification from the comments, this question boils down to 'can we achieve exactly-once semantics across two successive runs of a streaming job, assuming both runs start from scratch?'. The short answer is no. Even if the user is willing to store some state in external storage, it needs to be committed atomically/consistently with the streaming engine's internal state. Streaming engines such as Dataflow and Flink store the required state internally, which is needed to 'resume' a job. With Flink you could resume from the latest savepoint, and with Dataflow you can 'update' a running pipeline (note that Dataflow does not actually kill your job even when there are errors; you need to cancel a job explicitly). Dataflow does provide an exactly-once processing guarantee with update.
Somewhat relaxed guarantees would be feasible with careful use of external storage. The details really depend on specific goals (often it is not worth the extra complexity).
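One relaxed approach of this kind, offered here as an assumption rather than something the answer prescribes, is to give every aggregate a deterministic key (event type plus window start) and write it with overwrite semantics, so that a rerun replaces rows instead of adding to them. The sketch below uses a hypothetical in-memory `FakeSink` in place of a real table, and it only yields correct counts if the rerun sees the complete input for every window it rewrites.

```python
from collections import Counter


def aggregate(events, window_seconds=60):
    """Count events per (event type, window start) bucket."""
    counts = Counter()
    for event_type, ts in events:
        window_start = ts - ts % window_seconds
        counts[(event_type, window_start)] += 1
    return counts


class FakeSink:
    """Hypothetical stand-in for a table that supports keyed overwrites."""

    def __init__(self):
        self.rows = {}

    def upsert(self, key, count):
        # Overwrite, never increment: writing the same key twice is harmless.
        self.rows[key] = count


def run(events, sink):
    for key, count in aggregate(events).items():
        sink.upsert(key, count)


events = [("click", 5), ("click", 42), ("view", 61), ("click", 75)]
sink = FakeSink()
run(events, sink)
run(events, sink)  # simulated restart that reprocesses the same input
```

After both runs the sink holds the same rows as after the first, whereas an append-style write would have doubled every count.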
According to NiFi's homepage, it "supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic".
I've been playing with NiFi for the last couple of months and can't help wondering why it shouldn't also be used for scheduling batch processes.
Let's say I have a use case in which data flows into Hadoop, is processed by a series of Hive/MapReduce jobs, and is then exported to some external NoSQL database to be used by some system.
Using NiFi in order to ingest and flow data into Hadoop is a use case that NiFi was built for.
However, using NiFi to schedule the jobs on Hadoop ("Oozie-like") is a use case I haven't seen others implement, and since it seems entirely possible to implement, I'm trying to understand whether there are reasons not to do so.
The gain from doing it all in NiFi is that one gets a visual representation of the entire course of the data, from source to destination, in one place. For complicated flows this is very important for maintenance.
In other words - my question is: Are there reasons not to use NiFi as a scheduler\coordinator for batch processes? If so - what problems may arise in such a use case?
PS - I've read "Is Nifi having batch processing?", but my question is about a different sense of "batch processing in NiFi" than the one raised in that question.
You are correct that you would have a concise visual representation of all steps in the process were the schedule triggers present on the flow canvas, but NiFi was not designed as a scheduler/coordinator. Here is a comparison of some scheduler options.
Using NiFi to control scheduling feels like a "hammer" solution in search of a problem. It will reduce the ease of defining those schedules programmatically or interacting with them from external tools. Theoretically, you could define the schedule format, read them into NiFi from a file, datasource, endpoint, etc., and use ExecuteStreamCommand, ExecuteScript, or InvokeHTTP processors to kick off the batch process. This feels like introducing an unnecessary intermediate step however. If consolidation & visualization is what you are going for, you could have a monitoring flow segment ingest those schedule definitions from their native format (Oozie, XML, etc.) and display them in NiFi without making NiFi responsible for defining and executing the scheduling.
My goal is to make a distributed application with publishers and consumers where I use CEP to process streams of data to generate event notifications to event consumers.
What's the difference between Esper and Apache Storm?
Can I achieve what Storm does with only Esper, and when should I consider integrating Esper with Storm?
I'm confused; I thought Esper provided the same functionality.
Storm is a distributed real-time computation system that can be used for general purposes, while Esper is an event stream processing and event correlation engine (Complex Event Processing); Esper is therefore more specific.
Here are some use cases of them:
Storm can be used to consume data from Twitter in real time and calculate the most used hashtag per topic.
Esper can be used to detect an event like this: a train normally runs at 100 miles per hour and its speed is reported every second. If the speed of the train increases to 130 miles per hour within 10 minutes, an event is generated to notify the train operator.
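To make the train example concrete, here is a plain-Python sketch of that kind of rule. It is not Esper or EPL; the `SpeedIncreaseDetector` class is hypothetical, and the rule is one interpretation of the example: alert when a reading reaches 130 mph and some reading in the preceding 10 minutes was at or below the normal 100 mph.

```python
from collections import deque


class SpeedIncreaseDetector:
    """Flags a rise from normal speed to the limit within a time window."""

    def __init__(self, normal=100.0, limit=130.0, window_seconds=600):
        self.normal = normal
        self.limit = limit
        self.window_seconds = window_seconds
        self._readings = deque()  # (timestamp, speed), oldest first

    def on_reading(self, ts, speed):
        """Feed one reading; return True if it should raise an alert."""
        # Evict readings that have fallen out of the time window.
        while self._readings and ts - self._readings[0][0] > self.window_seconds:
            self._readings.popleft()
        # Was the train at normal speed at some point inside the window?
        was_normal = any(s <= self.normal for _, s in self._readings)
        self._readings.append((ts, speed))
        # Note: this fires on every qualifying reading, not just the first.
        return speed >= self.limit and was_normal


detector = SpeedIncreaseDetector()
alerts = [detector.on_reading(ts, speed)
          for ts, speed in [(0, 100.0), (300, 115.0), (540, 130.0)]]

slow_rise = SpeedIncreaseDetector()
slow_alerts = [slow_rise.on_reading(0, 100.0), slow_rise.on_reading(700, 130.0)]
```

The second detector stays silent because the normal-speed reading is more than 10 minutes old by the time the train reaches 130 mph. Window bookkeeping of this sort is what a CEP engine lets you declare as a rule rather than code by hand.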
There are some more criteria you can consider when selecting between them:
Storm is designed from the ground up for distributed processing, while Esper does not seem to be (my team evaluated it in 2016)
Open source license (Storm) vs Commercial license (Esper)
Storm provides abstractions to run pieces of code, and its value is in managing the distribution aspects such as workers, management, and restarts. Whatever piece of code you plug in is your job to write, and your code must worry about how and where it keeps state such as counts or other derived information. Storm is often used for extract-transform-load or for ingesting streams into some store.
Esper provides abstractions for detecting situations among sequences of events by providing an "Event Processing Language (EPL)" that conforms to SQL92. It's an expressive, concise, and extensible means of detecting situations while writing little or no code. EPL can be dynamically injected and managed, so you can add and remove rules at runtime without restarts. If you were to do this with code, you would always have to restart the JVM/workers/topology.
Esper has horizontal scalability by integrating directly with Kafka Streams. But the code for this is not open source, currently.
In my opinion, Storm offers good resource management, such as load balancing, automatic restart when a machine goes down, and an easier way to add machines or tasks, all of which you have to handle yourself if you use only Esper.
I want to chain 2 Map/Reduce jobs. I am trying to use JobControl to achieve the same. My problem is -
JobControl needs org.apache.hadoop.mapred.jobcontrol.Job which in turn needs org.apache.hadoop.mapred.JobConf which is deprecated. How do I get around this problem to chain my Map/Reduce?
Does anyone have any better ideas for chaining (other than Cascading)?
You could use Riffle. It allows you to chain arbitrary processes together (anything you stick its annotations on).
It has a rudimentary dependency scheduler, so it will order and execute your jobs for you. And it's Apache licensed. It's also on the Conjars repo if you're a Maven user.
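What a dependency scheduler does can be sketched in a few lines of plain Python. This is not Riffle's API: the function and job names are hypothetical, and the callables are stand-ins for code that would submit a MapReduce job and wait for it to finish.

```python
def run_in_dependency_order(jobs, depends_on):
    """Run each job after the jobs it depends on; return the execution order.

    jobs: name -> zero-argument callable
    depends_on: name -> list of job names that must finish first
    """
    done, in_progress, order = set(), set(), []

    def visit(name):
        if name in done:
            return
        if name in in_progress:
            raise ValueError(f"dependency cycle at {name!r}")
        in_progress.add(name)
        for dep in depends_on.get(name, ()):
            visit(dep)  # run prerequisites first
        in_progress.discard(name)
        jobs[name]()  # stand-in for "submit the job and block until done"
        done.add(name)
        order.append(name)

    for name in jobs:
        visit(name)
    return order


log = []
jobs = {
    "second_pass": lambda: log.append("second_pass ran"),
    "first_pass": lambda: log.append("first_pass ran"),
}
order = run_in_dependency_order(jobs, {"second_pass": ["first_pass"]})
```

Even though `second_pass` is listed first, it runs only after `first_pass`, which is the behavior needed to chain two Map/Reduce jobs.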
I'm the author, and wrote it so Mahout and other custom applications would be able to have a common tool that was also compatible with Cascading Flows.
I'm also the author of Cascading. But MapReduceFlow + Cascade in Cascading works quite well for most raw MR job chaining.
Cloudera has a workflow tool called Oozie that can help with this sort of chaining. Might be overkill for just getting one job to run after another.