I am working on a streaming application using Spark Streaming, and I want to index my data into Elasticsearch.
My analysis:
I can directly push data from Spark to Elasticsearch, but I feel that in this case the two components will be tightly coupled.
If this were a Spark Core job, we could write the output to HDFS and use Logstash to pick the data up from HDFS and push it to Elasticsearch.
My proposed solution:
I can push the data from Spark Streaming to Kafka, then read it from Kafka with Logstash and push it to ES.
Please suggest.
First of all, it is great that you have thought through the different approaches.
There are a few questions which you should ask before coming to a good design:
Timelines? Spark -> ES is a breeze and is recommended if you are starting on a PoC.
Operational bandwidth? Introducing more components increases operational concerns. From my personal experience, making sure your Spark Streaming job is stable is itself a time-consuming job. Adding Kafka as well means spending more time getting the monitoring and the other ops concerns right.
Scale? If you expect to operate at larger scale, a persistent message bus can help absorb back-pressure and still scale pretty well.
If I had the time and were dealing with large scale, Spark Streaming -> Kafka -> ES looks like the best bet. That way, when your ES cluster is unstable, you still have the option of replaying from Kafka.
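For the Spark Streaming -> Kafka leg, a minimal sketch might look like the following; the broker address, topic name, and the enriched DStream are assumptions for illustration. A producer is created per partition in foreachPartition rather than per record:

```scala
// Hedged sketch: publish each micro-batch to Kafka from Spark Streaming.
// "enriched" is an assumed DStream[String]; broker/topic are placeholders.
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

enriched.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach { r =>
      producer.send(new ProducerRecord[String, String]("events-for-es", r))
    }
    producer.close() // flushes buffered records before the task ends
  }
}
```

Logstash's Kafka input plugin can then consume that topic and ship it on to ES.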
I am a little hazy on Kafka -> HDFS -> ES, as there could be performance implications of adding a batch layer between the source and the sink. Honestly, I also don't know how well Logstash works with HDFS, so I can't really comment.
Tight coupling is an oft-discussed subject. Some argue against it, citing reusability concerns, but others argue for it, since it can yield a simpler design that makes the whole system easier to reason about; it is also a guard against premature optimisation :) We have had success with Spark -> ES directly at a moderate scale of data inflow, so don't discount the power of a simpler design just like that :)
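For reference, the direct route is only a few lines with the elasticsearch-hadoop (elasticsearch-spark) connector. A minimal sketch, assuming a local socket source and ES endpoint purely for illustration:

```scala
// Hedged sketch of Spark Streaming -> Elasticsearch directly, assuming
// the elasticsearch-spark connector is on the classpath. Host names,
// the source, and the index name are placeholders.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.elasticsearch.spark.streaming._

object StreamToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("stream-to-es")
      .set("es.nodes", "es-host:9200") // assumed ES endpoint
    val ssc = new StreamingContext(conf, Seconds(5))

    // Assumed source; replace with your real input DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val docs = lines.map(line =>
      Map("message" -> line, "ts" -> System.currentTimeMillis))

    docs.saveToEs("events/doc") // "index/type"; created on the fly
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Batching against the ES bulk API is handled by the connector, which is part of why the direct route stays so simple.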
Related
I am planning to spin up a development cluster for an infrastructure-monitoring application I am building, using Spark for analysing failure trends and Cassandra for storing both the incoming data and the analysed results.
Consider collecting performance metrics from around 25,000 machines/servers (probably the same set of applications on different servers). I am expecting metrics at 2 MB/sec from each machine, which I plan to push into a Cassandra table with timestamp and server as the primary key, and application plus some important metrics as clustering keys. I will run Spark jobs on top of this stored data for failure-trend analysis.
Coming to the question: how many nodes (machines), and of what configuration in terms of CPU and memory, do I need to kick-start my cluster given the above scenario?
Cassandra needs a well-planned data model for things to run well. It is very much worth spending time planning things out at this stage, before you have a large data set and find out you would have done better rearranging the data model!
The "general" rule of thumb is you shape your model to the queries, while paying attention to avoiding things like really large rows, large deletes, batches and such the like which can have big performance penalties.
The docs give a good start on planning and testing that you would probably find useful. I would also recommend the cassandra-stress tool: you can use it to push performance tests at your Cassandra cluster to check latencies and catch problems early. It supports your own schema too, which I personally find super useful!
If you are using cloud-based hardware like AWS, it's relatively easy to scale up and down and see what works best for you. You don't need to throw big hardware at Cassandra; it's easier to scale horizontally than vertically. Do the arithmetic on your stated rate first, though: 25,000 machines at 2 MB/sec each is roughly 50 GB/sec of ingest (over 4 PB/day), and that number will dominate any sizing decision.
I'm assuming you are pulling the data back into a separate Spark cluster for the analytics side, so those nodes would run plain Cassandra (lower hardware specs). If instead you are using the DataStax Enterprise version (where you can run nodes in Spark "mode"), you will need beefier hardware to carry the additional load of Spark driver programs, executors and the like. Another good docs link is the DSE hardware recommendations.
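On the analytics side, a minimal sketch of pulling the stored metrics back into a plain Spark job with the open-source spark-cassandra-connector; the host, keyspace, and table names are the hypothetical ones from the sketch above:

```scala
// Hedged sketch: read the metrics table into Spark for trend analysis.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("failure-trends")
  .set("spark.cassandra.connection.host", "cassandra-host") // placeholder
val sc = new SparkContext(conf)

// Count rows per server as a quick starting point for trend analysis.
val perServer = sc.cassandraTable("monitoring", "perf_metrics")
  .map(row => (row.getString("server"), 1L))
  .reduceByKey(_ + _)

perServer.take(10).foreach(println)
```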
I am new to the big-data tech stack in general. I am implementing a real-time analytics infrastructure that will ingest high-volume/high-velocity data from the different services in our microservices backend. The ingested data (and the data stream) will be used to populate dashboards for key business metrics, and for BI queries and machine learning.
All of the backend services write their data events into a Kafka cluster that is now in place. I have started working on a Spark prototype that reads the data from the Kafka cluster and enriches/processes it.
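A minimal sketch of what such a Kafka-reading prototype can look like, assuming the spark-streaming-kafka-0-10 integration; the broker, group id, topic, and enrichment step are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("kafka-enrich")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka-broker:9092",      // placeholder
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "analytics-prototype",
  "auto.offset.reset"  -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

// Placeholder enrichment: tag each event before it is processed/stored.
stream.map(record => s"enriched|${record.value}").print()

ssc.start()
ssc.awaitTermination()
```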
Now I am working on where to store the data at rest. I know that for real-time analytics, technologies like Vertica and Teradata are fairly popular, but they require a non-trivial capital investment upfront.
So I am trying to stick to open source. After a bit of study I decided to use HDFS/Impala for the data at rest, running SQL-on-Hadoop for our real-time BI queries.
I then started wondering whether, instead of HDFS/Impala, it makes more sense to use Cassandra for storing our data at rest. Cassandra scales out and has fast writes and reads, and I have also read some literature where people gave good arguments for using C* for such use cases.
Any comment/feedback is welcome.
We store petabytes of expiring time-series data in Cassandra, and we're very happy with it. In the ingestion pipeline, we're capable of many millions of writes per second, and reading is fast (sub-millisecond) for display/BI. For large ML tasks, you can run Spark on top of Cassandra for analysis.
Our use case is (1) consuming data from ActiveMQ, (2) performing transformations through a general purpose reusable streaming process, and then (3) publishing to Kafka. In our case, step (2) would be a reusable Spark Streaming 'service' that would provide an event_source_id, enrich each record with metadata, and then publish to Kafka.
The straightforward approach I see is ActiveMQ -> Flume -> Spark Streaming -> Kafka.
Flume seems like an unnecessary extra step and extra network traffic. As far as I can tell, a Spark Streaming custom receiver would provide a more general solution for ingestion into Hadoop (step 1) and allow more flexibility in transforming the data, since transformation is an inherent step of Spark Streaming itself; the downside is a loss of coding ease. A sketch of such a receiver is shown below.
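For illustration, here is a skeletal custom receiver under those assumptions; the ActiveMQ/JMS wiring is intentionally omitted, and the class name, broker URL, and queue name are hypothetical:

```scala
// Hedged sketch of a Spark Streaming custom receiver, as an alternative
// to Flume in step (1).
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class ActiveMQReceiver(brokerUrl: String, queue: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Spawn a thread that connects to ActiveMQ (e.g. via the JMS API)
    // and calls store(message) for each message received.
    new Thread("ActiveMQ Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Close the JMS connection here; the receive loop will then exit.
  }

  private def receive(): Unit = {
    while (!isStopped()) {
      // A real implementation would block on a JMS consumer, e.g.:
      // store(consumer.receive().asInstanceOf[TextMessage].getText)
      Thread.sleep(100)
    }
  }
}

// Usage sketch:
// val stream = ssc.receiverStream(
//   new ActiveMQReceiver("tcp://broker:61616", "events"))
```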
I would love to gain some insight from my more experienced peers as we are in the beginning stages of transforming a large data architecture; please help with any suggestions/insights/alternatives you can think of.
Thank you world
In theory, Flume should help you build a more efficient ingestion path into HDFS.
If you use Spark Streaming, then depending on how you size your micro-batches it may not be as efficient; but if your use case needs lower latency, then yes, I think you could do it with Spark Streaming directly.
Most applications will want to store the original data in HDFS so they can refer back to it. Flume would help with that, but if you don't have that need you may want to skip it. You can also always persist your data from within Spark at any point, as sketched below.
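For example, the "keep the raw data" requirement can be met from inside the job itself. A minimal sketch, where rawStream is an assumed DStream[String], parseAndEnrich is a placeholder function, and the HDFS path is illustrative:

```scala
// Archive every micro-batch to HDFS (one directory per batch interval)
// while the same stream continues through the processing pipeline.
rawStream.saveAsTextFiles("hdfs:///archive/events", "txt")

rawStream.map(parseAndEnrich).foreachRDD { rdd =>
  // Downstream processing / publishing goes here.
}
```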
Also, if you want to consume in real time, you may want to look at Storm.
Your use case is weakly defined though, so more info on the constraints (volume, time requirements, how you want to expose this info, etc.) would help in getting more concrete answers.
EDIT: There is a write-up where they go from 1-hour Flume + Hadoop cycles down to 5-second cycles, still using Flume to help with ingestion scalability. So whether to use Flume there depends on your use case; I'd say it makes sense to separate the ingestion layer if you want that data to, for example, be consolidated in a lambda-like architecture.
We have a lot of user-interaction data from various websites stored in Cassandra, such as cookies, page visits, ads viewed, ads clicked, etc., that we would like to report on. Our current Cassandra schema supports basic reporting and querying, but we would also like to run large queries that typically involve joins on large column families (containing millions of rows).
What approach is best suited for this? One possibility is to extract the data out to a relational database such as MySQL and do the data mining there. An alternative could be to use Hadoop with Hive or Pig to run MapReduce queries for this purpose. I must admit I have zero experience with the latter.
Does anyone have experience of the performance differences of one vs. the other? Would you run MapReduce queries on a live Cassandra production instance, or on a backup copy, to prevent the query load from affecting write performance?
In my experience, Cassandra is better suited to processes where you need real-time access to your data, fast random reads, and the ability to handle large traffic loads. However, if you start doing complex analytics on it, the availability of your Cassandra cluster will probably suffer noticeably. From what I've seen, it's generally in your best interest to leave the Cassandra cluster alone for serving and do the analytics elsewhere.
Sounds like you need an analytics platform, and I would definitely advise exporting your reporting data out of Cassandra to use in an offline data-warehouse system.
If you can afford it, a real data warehouse would allow you to run complex queries with complex joins over multiple tables. These data-warehouse systems are widely used for reporting; here is a list of what are, in my opinion, the key players:
Netezza
Aster/TeraData
Vertica
A recent one which is gaining a lot of momentum is Amazon Redshift. It is currently in beta, but if you can get your hands on it, give it a try: it looks like a solid analytics platform with pricing much more attractive than the above solutions.
Alternatives like Hadoop MapReduce/Hive/Pig are also interesting to look at, though probably not a full replacement for dedicated data-warehouse technologies. I would recommend Hive if you have a SQL background, because it will be very easy to understand what you're doing and it scales easily. There are also libraries already integrated with Hadoop, like Apache Mahout, which let you do data mining on a Hadoop cluster; you should definitely give this a try and see if it fits your needs.
To give you an idea, an approach I've used that has been working well so far is to pre-aggregate the results in Hive and then generate the reports themselves in a data warehouse like Netezza, which computes the complex joins. A sketch of the pre-aggregation step is below.
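A hedged sketch of that pre-aggregation idea, expressed here with Spark SQL for brevity rather than the Hive job itself (the HiveQL would be nearly identical); the table and column names are hypothetical:

```scala
// Roll raw click events up to one row per (day, site, ad) before export,
// so the warehouse only joins against the much smaller aggregate.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("preagg")
  .enableHiveSupport() // assumes Hive support is available
  .getOrCreate()

val daily = spark.sql("""
  SELECT to_date(event_ts) AS day, site_id, ad_id,
         COUNT(*)                                  AS views,
         SUM(CASE WHEN clicked THEN 1 ELSE 0 END)  AS clicks
  FROM   raw_ad_events
  GROUP  BY to_date(event_ts), site_id, ad_id
""")

// The aggregate is what gets shipped to the data warehouse.
daily.write.mode("overwrite").saveAsTable("daily_ad_summary")
```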
Disclosure: I'm an engineer at DataStax.
In addition to Charles' suggestions, you might want to look into DataStax Enterprise (DSE), which offers a nice integration of Cassandra with Hadoop, Hive, Pig, and Mahout.
As Charles mentioned, you don't want to run your analytics directly against the Cassandra nodes that are handling your real-time application needs, because it can have a substantial impact on performance. To avoid this, DSE allows you to devote a portion of your cluster strictly to analytics by using multiple virtual "datacenters" (in the NetworkTopologyStrategy sense of the term). Queries performed as part of a Hadoop job will only touch those nodes, essentially leaving your normal Cassandra nodes unaffected. Additionally, you can scale each portion of the cluster up or down separately based on your performance needs. A keyspace sketch of this split is shown below.
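To illustrate the split, a hedged sketch of such a keyspace definition, run here through the DataStax Java driver; the datacenter names and replication counts are hypothetical:

```scala
// Three replicas serve the real-time app; two serve analytics-only nodes.
import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

session.execute(
  """CREATE KEYSPACE IF NOT EXISTS events
    |WITH replication = {'class': 'NetworkTopologyStrategy',
    |                    'Realtime': 3, 'Analytics': 2}""".stripMargin)
```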
There are a couple of upsides to the DSE approach. The first is that you don't need to perform any ETL prior to processing your data; Cassandra's normal replication mechanisms keep the nodes devoted to analytics up to date. Second, you don't need an external Hadoop cluster. DSE includes a drop-in replacement for HDFS called CFS (CassandraFS), so all source data, intermediate results, and final results from a Hadoop job can be stored in the Cassandra cluster.
How does Storm compare to Hadoop? Hadoop seems to be the de facto standard for open-source large-scale batch processing; does Storm have any advantages over Hadoop, or are they completely different?
Please share your opinions.
http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop/
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Twitter Storm has been touted as real-time Hadoop, but that is more a marketing take for easy consumption.
They are superficially similar, since both are distributed application solutions. But apart from the typical distributed architectural elements (master/slave roles, ZooKeeper-based coordination), the comparison falls off a cliff for me.
Storm is more like a pipeline for processing data as it arrives. The pipe is what connects the various computing nodes that receive data, compute, and deliver output (their lingo is spouts and bolts). Extend this analogy to a complex pipeline wiring that can be re-engineered when required, and you get Twitter Storm.
In a nutshell, it processes data as it comes, with very low latency. A minimal sketch of that spout/bolt wiring is below.
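Here is that wiring with Storm's TopologyBuilder; TweetSpout and WordCountBolt stand in for user-defined spout/bolt implementations and are hypothetical:

```scala
// Hedged sketch: wire a spout (where data enters the pipe) to a bolt
// (a computing node on the pipe). TweetSpout and WordCountBolt are
// assumed user-defined classes implementing Storm's spout/bolt APIs.
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.topology.TopologyBuilder

val builder = new TopologyBuilder()
builder.setSpout("tweets", new TweetSpout(), 2)    // 2 parallel spouts
builder.setBolt("counts", new WordCountBolt(), 4)  // 4 parallel bolts
       .shuffleGrouping("tweets")                  // the "pipe" between them

// Rewiring the pipeline is just changing these declarations and resubmitting.
new LocalCluster().submitTopology("demo", new Config(), builder.createTopology())
```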
Hadoop, however, is different in this respect, primarily due to HDFS. It is a solution geared towards distributed storage and tolerance to outages at many scales (disks, machines, racks, etc.).
MapReduce is built to leverage data locality on HDFS to distribute computational jobs. Together, they do not provide a facility for real-time data processing; but that is not always a requirement when you are looking through large data (the needle-in-the-haystack analogy).
In short, Twitter Storm is a distributed real-time data-processing solution, so I don't think we should compare them directly. Twitter built it because it needed a facility to process small tweets, but a humongous number of them, and in real time.
See HStreaming if you are compelled to compare it with something.
Basically, both of them are used for analyzing big data, but Storm is used for real-time processing while Hadoop is used for batch processing.
Rather than being compared, they are meant to supplement each other, giving you batch plus (pseudo-)real-time processing. There is a corresponding video presentation: Ted Dunning on Twitter's Storm.
I used Storm for a while, and I've now left this really good technology for an amazing one: Spark (http://spark.apache.org), which gives developers a unified API for batch and streaming processing (micro-batches), as well as machine learning and graph processing.
Worth a try.
Storm is for fast data (real time) and Hadoop is for big data (pre-existing tons of data). Storm can't process big data, but it can generate big data as its output.
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Since many subsystems exist in the Hadoop ecosystem, we have to choose the right subsystem depending on business requirements and the feasibility of a particular system.
Hadoop MapReduce is efficient for batch processing of one job at a time. This is why Hadoop is used extensively as a data-warehousing tool rather than as a data-analysis tool.
Since the question is related to only "Storm" vs "Hadoop", have a look at Storm use cases - Financial Services, Telecom, Retail, Manufacturing, Transportation.
Hadoop MapReduce is best suited for batch processing.
Storm is a complete stream-processing engine and can be used for real-time data analytics with sub-second latency.
Have a look at this dezyre article for comparison between Hadoop, Storm and Spark. It explains similarities and differences.