Difference between Spark Streaming and Storm [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am doing some analytics on live Twitter streaming data. I have heard about Spark Streaming. I want to know which is better for analytics on live streaming data, as my data arrives very fast from the source.

I recommend this presentation on the subject:
http://fr.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
In fact, Apache Storm is a true streaming architecture: events are handled one by one, and if you want to group them, you have to design a topology for that purpose. It is the most powerful option in terms of latency and design flexibility. But it is of course complex, and you have to design your topology carefully to get what you want.
On the other hand, Spark Streaming is a micro-batching architecture: it is like Hadoop, but executed every x seconds, producing micro-batches of data over a defined time window. Because it looks like a batch solution, it seems simpler, and it can be enough if you don't need latency below a few seconds.
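To make the contrast concrete, here is a minimal plain-Python sketch of the two processing styles. It uses neither Storm nor Spark; the function names, the sample tweets and the one-second interval are all assumptions made for illustration.

```python
from collections import defaultdict

def process_one_by_one(events, handle):
    """Record-at-a-time style: the handler is called for every single event."""
    return [handle(event) for event in events]

def micro_batches(events, batch_seconds):
    """Micro-batch style: group (timestamp, payload) events into fixed intervals."""
    batches = defaultdict(list)
    for timestamp, payload in events:
        batch_index = int(timestamp // batch_seconds)
        batches[batch_index].append(payload)
    # Return the non-empty batches in time order.
    return [batches[i] for i in sorted(batches)]

# Hypothetical tweets as (arrival time in seconds, text) pairs.
tweets = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]

per_event = process_one_by_one(tweets, lambda event: event[1].upper())
batched = micro_batches(tweets, batch_seconds=1)
```

In the first style each tweet can be acted on as soon as it arrives. In the second, nothing is emitted until the interval closes, which is where the latency floor of a few seconds comes from.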

Apart from the really nice presentation linked by @zenbeni, I would like to add a few specific points based on first-hand experience with both Storm and Spark Streaming, especially regarding your use case (Twitter data).
Twitter itself uses Storm for many parts of their realtime stream processing pipeline. So if the type of processing you want to do is similar, Storm is a good choice.
Storm's multi-language support is great, but it is hard to pass errors around. For example, if you are calling Python code from a Java bolt and an exception happens in the Python code, it's not easy to propagate this exception back to the Java code.
If your analysis is based on a single tweet only, Storm will likely be better. However, if you need to do some aggregate or iterative analytics, you will have to micro-batch in Storm as well. This essentially means you have to store state in a bunch of your bolts.
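As a rough illustration of what "storing state in a bolt" means, here is a plain-Python sketch. The class below is hypothetical and is not the Storm API; it only shows a per-tweet handler that has to carry counts between calls and emits an aggregate every few tweets.

```python
from collections import Counter

class HashtagCountBolt:
    """Hypothetical bolt-like object (not the Storm API) that aggregates hashtags."""

    def __init__(self, flush_every):
        self.flush_every = flush_every  # number of tweets per emitted aggregate
        self.counts = Counter()         # state that survives between calls
        self.seen = 0

    def process(self, tweet):
        """Handle one tweet; return the counts when a batch is full, else None."""
        self.counts.update(
            word.lower() for word in tweet.split() if word.startswith("#")
        )
        self.seen += 1
        if self.seen < self.flush_every:
            return None
        result = dict(self.counts)
        # Reset the state so the next batch starts from zero.
        self.counts.clear()
        self.seen = 0
        return result

bolt = HashtagCountBolt(flush_every=3)
sample = ["go #spark", "#storm rocks", "#Spark again", "#storm"]
outputs = [bolt.process(tweet) for tweet in sample]
```

Note that this state lives in memory only. In a real topology you would also have to decide what happens to it when a worker fails, which is part of the extra design effort mentioned above.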
Finally, one often needs to do both stream and batch processing. Spark shines when you need to mix stream processing with batch, interactive and iterative processing. In fact, it's not clear to me how you would do iterative processing in Storm.

Related

Using operation queues with combine framework [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
With the arrival of the Combine framework, is there still a need to use operation queues? For example, Apple uses operation queues almost everywhere in the WWDC app. So if we use SwiftUI with Combine (asynchronous programming), will there still be a need for operation queues?
Combine is just another asynchronous pattern, but doesn’t supplant operation queues (or dispatch queues). Just as GCD and operation queues happily coexist in our code bases, the same is true with Combine.
GCD is great for writing simple, yet still highly performant, code that dispatches tasks to various queues. So if you have something that might risk blocking the main thread, GCD makes it really easy to dispatch that to a background thread, and then dispatch some completion block back to the main thread. It also handles timers on background threads, data synchronization, highly optimized parallelized code, etc.
Operation queues are great for higher-level tasks (especially those that are, themselves, asynchronous). You can take these pieces of work, wrap them up in discrete objects (for nice separation of responsibilities) and the operation queues manage execution, cancelation, and constrained concurrency, quite elegantly.
Combine shines at writing concise, declarative, composable, asynchronous event handling code. It excels at writing code that outlines how, for example, one’s UI should reflect some event (network task, notification, even UI updates).
This is obviously an oversimplification, but those are a few of the strengths of the various frameworks. There is definitely overlap among the three, but each has its place.

How to build reporting in Microservices Architecture? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
While doing a POC around a microservices architecture, one of the challenges I need to explain is how to obtain reporting data from different services in an efficient way.
I would appreciate being pointed in the right direction.
If the data spans multiple microservices, then it depends on the business use case. In my opinion there are a couple of ways to do it.
Approach 1: Query the microservices' databases (not a preferred approach)
If your microservices are not very load-intensive, you can query the data from all the services' databases at off-peak times and insert the records into your warehouse database. This is not a preferred approach, since you are still putting additional load on the services, but it is easier to set up. Also, the reporting data may not be real-time.
Approach 2: Event sourcing/CQRS
This approach is strongly preferred, since your write and read models are completely separate. In brief, the way it works is that the events generated by your different microservices also update your read models, called materialized views. If your reporting requires near-real-time data, this is the way to go. You can shape your reporting model as you like, and you can create multiple reporting models from the same events. But this is a complex approach and requires the application to be designed accordingly. The benefits are substantial, though. You may want to read more about Event Sourcing and CQRS if you are interested.
Approach 3: Read-only replicas
If you are using cloud services, you can create read-only replicas of your databases and use them for reporting. This is a widely accepted approach, since you are not impacting the transactional databases. But it may be expensive, since you are paying for additional databases.
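For Approach 2, a minimal plain-Python sketch of a read model kept up to date by events might look like the following. The event names, fields and the two services are made up for illustration; in a real system the events would arrive through a message broker rather than a list.

```python
from collections import defaultdict

class SalesReportView:
    """Hypothetical read model (materialized view) built only from events."""

    def __init__(self):
        self.revenue_by_customer = defaultdict(float)
        self.open_orders = {}  # order_id -> customer_id, awaiting payment

    def apply(self, event):
        """Update the view from one event; unknown event types are ignored."""
        kind = event["type"]
        if kind == "OrderPlaced":
            self.open_orders[event["order_id"]] = event["customer_id"]
        elif kind == "PaymentReceived":
            customer = self.open_orders.pop(event["order_id"], None)
            if customer is not None:
                self.revenue_by_customer[customer] += event["amount"]

# Events as they might be published by two separate services (assumed shapes).
events = [
    {"type": "OrderPlaced", "order_id": 1, "customer_id": "alice"},
    {"type": "OrderPlaced", "order_id": 2, "customer_id": "bob"},
    {"type": "PaymentReceived", "order_id": 1, "amount": 30.0},
    {"type": "PaymentReceived", "order_id": 2, "amount": 12.5},
]

view = SalesReportView()
for event in events:
    view.apply(event)

report = dict(view.revenue_by_customer)
```

The reporting query then reads only from the view, so it never touches the order or payment databases. A second view with a different shape could consume the same events.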

Would Hadoop help my situation? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I am in the process of creating a survey engine that will store millions of responses to various large surveys.
There are various agencies that will have 10-100 users each. Each will be able to administer a 3000+ question survey. There will be multiple agencies as well.
If each agency were to have hundreds of thousands of sessions, each with 3000+ responses, I'm thinking that Hadoop would be a good candidate for pulling the sessions and their response data to run various analyses on (aggregations etc.).
The sessions, survey questions, and responses are all currently held in a SQL database. I was thinking that I would keep that and store the data in Hadoop in parallel. So when a new session is taken under an agency, it is then added to the Hadoop 'file', such that when the entire dataset is called up it would be included.
Would this implementation work well with Hadoop, or am I still well within the limits of a relational database?
I don't think anyone is going to be able to tell you definitively yes or no here. I also don't think I fully grasp what your program will be doing from the wording of the question. In general, however, Hadoop Map/Reduce excels at batch processing huge volumes of data. It is not meant to be an interactive (aka real-time) tool. So if your system:
1) Will be running scheduled jobs to analyze survey results, generate trends, summarize data, etc., then yes, M/R would be a good fit for this.
2) Will allow users to search through surveys by specifying what they are interested in and get reports in real time based on their input, then no, M/R would probably not be the best tool for this. You might want to take a look at HBase. I haven't used it yet, but Hive is a query-based tool, although I am not sure how "real-time" it can get. Also, Drill is an up-and-coming project that looks promising for interactively querying big data.

hadoop use cases in real world [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Newbie here with Hadoop. Conceptually it is pretty simple to understand; however, one of the real challenges is how to model the problem to be solved in the map-reduce architecture. Suppose my data contains two parts (all in Oracle):
1. Rather static data that doesn't change much
2. Fresh data collected every day.
Currently the data processing basically reads the fresh data, finds and uses the corresponding static data (or metadata), applies some algorithm to it and dumps the result back to Oracle.
How do I model such an application? Do I store the static data as part of the distributed cache? What if that data is pretty big?
Basically I am looking for more examples like the following:
http://stevekrenzel.com/finding-friends-with-mapreduce
Thanks,
Basically the requirement is to do a join on two data sets. MapReduce programming requires a different way of thinking than normal programming. Here are some references on joins and some other patterns on top of MapReduce:
Data-Intensive Text Processing with MapReduce
MapReduce Design Patterns
Section 8.3 in Hadoop - The Definitive Guide
Coming back to the join, it can be done in multiple ways, depending on the amount of data and how the data is structured. The references above cover this in more detail.
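One of those ways is usually called a reduce-side join: both data sets are mapped to the same key, and the reducer pairs them up. Here is a minimal plain-Python simulation of the idea, with no Hadoop involved. The record fields (`item_id`, `category`, `value`) and the sample data are assumptions made for illustration.

```python
from collections import defaultdict

def map_static(record):
    """Tag a metadata record and key it by the join key."""
    return record["item_id"], ("static", record["category"])

def map_fresh(record):
    """Tag a daily record and key it by the same join key."""
    return record["item_id"], ("fresh", record["value"])

def shuffle(pairs):
    """Group values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_join(key, values):
    """Pair every static value with every fresh value for one key (inner join)."""
    categories = [v for tag, v in values if tag == "static"]
    measurements = [v for tag, v in values if tag == "fresh"]
    return [(key, c, m) for c in categories for m in measurements]

static_data = [
    {"item_id": 1, "category": "books"},
    {"item_id": 2, "category": "music"},
]
fresh_data = [
    {"item_id": 1, "value": 10},
    {"item_id": 1, "value": 7},
    {"item_id": 3, "value": 4},
]

pairs = [map_static(r) for r in static_data] + [map_fresh(r) for r in fresh_data]
joined = []
for key, values in sorted(shuffle(pairs).items()):
    joined.extend(reduce_join(key, values))
```

This variant works even when the static data is too big to fit in memory on each node, at the cost of shuffling both data sets. When the static side is small, a map-side join that loads it from the distributed cache avoids that shuffle.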
We are collecting real-life use cases here: http://hadoopilluminated.com/hadoop_book/Hadoop_Use_Cases.html
We already have good coverage of multiple domains, and will continue to add to it.
(Disclaimer: I am a co-author of this free Hadoop book.)
I would look at the following article about Map/Reduce patterns, which should give you a nice idea of common algorithms and their translation in the Map/Reduce world.
More generally, I don't think there's a magic formula for translating a problem into a set of Map/Reduce jobs. You have to ask yourself questions that vary from dataset to dataset. Looking at existing examples is a good thing, and you should definitely try to implement something on a small toy problem.
Also, if you have trouble abstracting your problem into a set of Map/Reduce jobs, you could use, for example, Hive, which works like a relational database with a few tweaks and generates Map/Reduce jobs for you without your having to worry too much about what happens underneath.

Use cases of hadoop [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I recently started learning Hadoop, and all I found were examples that read text data and calculate a word count. More or less all the examples do the same task. Please help me understand: is that the only use case of Hadoop? Please give me some references to more realistic use cases, or to where I can learn where Hadoop can be used.
Thanks
I can try to outline a few directions, restricting myself to MapReduce:
a) ETL - data transformations. Here Hadoop shines, since latency is not important but scalability is.
b) Hive / Pig. There are cases when we actually need SQL or SQL-like functionality over big data sets, but cannot afford a commercial MPP database.
c) Log processing of different kinds.
d) Deep analytics - when we simply want to run Java code over massive data volumes. Mahout is used in many cases as the machine learning library.
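To show something other than word count, here is a small plain-Python sketch of item c), log processing, written in map/reduce style. The three-field log format (method, path, status) and the sample lines are assumptions; real logs would need a proper parser.

```python
from collections import defaultdict

def map_line(line):
    """Emit (status_code, 1) for a well-formed line, nothing otherwise."""
    parts = line.split()
    if len(parts) == 3 and parts[2].isdigit():
        yield parts[2], 1

def reduce_counts(pairs):
    """Sum the emitted counts per key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Hypothetical access-log lines: "<method> <path> <status>".
log_lines = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "malformed line",
]

pairs = [pair for line in log_lines for pair in map_line(line)]
status_counts = reduce_counts(pairs)
```

The shape is the same as word count, but the mapper does the parsing and filtering, which is where most of the real work in log processing and ETL jobs tends to sit.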
