Any idea whether SparkR for Spark 1.6 is capable of calling streaming methods, e.g. streaming linear regression with SGD? If yes, can anyone please share references. Thanks!
As of now SparkR doesn't support streaming, and it is rather unlikely it will in the near future. For starters, it would require a lower-level R API, which is still in a design phase. Moreover, R is probably not the best choice for real-time processing.
I have a cluster in Databricks. Before importing the data, I want to choose between Python and Scala: which one is better in terms of reading/writing large data from the source?
For the DataFrame API, performance should be the same: both languages compile to the same execution plans. For the RDD API, Scala is going to be faster, since Python RDDs pay serialization overhead between the JVM and the Python workers.
I would choose Scala; my two cents on this subject:
Scala:
supports multiple concurrency primitives
runs on the JVM, which gives it some speed advantage over Python
Python:
does not support true multithreaded parallelism because of the GIL (it supports heavyweight process forking, so only one thread executes Python bytecode at a time)
is interpreted and dynamically typed, which reduces speed
Also I recommend this article: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
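The process-forking point above can be made concrete with a small, hypothetical example: for CPU-bound work, `multiprocessing` sidesteps the GIL in a way that threads cannot, because each worker process gets its own interpreter.

```python
import math
from multiprocessing import Pool

def count_primes(bounds):
    """CPU-bound work: count primes in [lo, hi) by trial division."""
    lo, hi = bounds
    count = 0
    for n in range(max(lo, 2), hi):
        if all(n % d for d in range(2, math.isqrt(n) + 1)):
            count += 1
    return count

if __name__ == "__main__":
    chunks = [(i, i + 25_000) for i in range(0, 100_000, 25_000)]
    # Processes (not threads) give real parallelism for CPU-bound work,
    # since each worker process has its own interpreter and its own GIL.
    with Pool(4) as pool:
        total = sum(pool.map(count_primes, chunks))
    print(total)  # the number of primes below 100,000
```

With threads instead of processes, the same loop would run no faster than the sequential version, which is the limitation the answer above refers to.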
I am working on a streaming application using Spark Streaming, and I want to index my data into Elasticsearch.
My analysis:
I can directly push data from Spark to Elasticsearch, but I feel that in this case the two components will be tightly coupled.
If this were a Spark core job, we could write the output to HDFS and use Logstash to pick the data up from HDFS and push it to Elasticsearch.
Solution according to me:
I can push the data from Spark Streaming to Kafka, read it from Kafka using Logstash, and push it to ES.
Please suggest.
First of all, it is great that you have thought through the different approaches.
There are a few questions which you should ask before coming to a good design:
Timelines? Spark -> ES is a breeze and is recommended if you are starting on a PoC.
Operational bandwidth? Introducing more components increases operational concerns. From my personal experience, just keeping a Spark Streaming job stable is itself time-consuming. If you add Kafka as well, you need to spend more time getting monitoring and other ops concerns right.
Scale? If you expect scale to grow, a persistent message bus can help absorb back-pressure and still scale pretty well.
If I had the time and were dealing with large scale, Spark Streaming -> Kafka -> ES looks like the best bet. That way, when your ES cluster is unstable, you still have the option of replaying from Kafka.
I am a little hazy on Kafka -> HDFS -> ES, as there could be performance implications from adding a batch layer between the source and the sink. Honestly, I am also not aware of how well Logstash works with HDFS, so I can't really comment.
Tight coupling is an oft-discussed subject. Some argue against it, citing reusability concerns, but others argue for it, since it can sometimes produce a simpler design and make the whole system easier to reason about. Beware premature optimisation, too :) We have had success with Spark -> ES directly at a moderate scale of data inflow, so don't discount the power of a simpler design out of hand :)
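The decoupling argument can be illustrated with a minimal in-process sketch (all names here are hypothetical: a `queue.Queue` stands in for the Kafka topic, plain functions stand in for Spark and ES). The point is structural: the producer only knows about the bus, never about the sink, so a slow or restarted sink can simply drain the buffer later.

```python
import queue
import threading

buffer = queue.Queue()  # stands in for the Kafka topic

def spark_side(events):
    """Producer: writes to the bus; knows nothing about the sink."""
    for e in events:
        buffer.put(e)
    buffer.put(None)  # sentinel marking end of stream

def es_side(sink):
    """Consumer: drains the bus at its own pace (replay-friendly)."""
    while True:
        e = buffer.get()
        if e is None:
            break
        sink.append(e)

indexed = []
producer = threading.Thread(target=spark_side, args=([{"id": i} for i in range(5)],))
consumer = threading.Thread(target=es_side, args=(indexed,))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(len(indexed))  # 5
```

Swapping the sink for a failing one would not change the producer at all, which is exactly the looseness the Kafka tier buys you.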
Can anyone explain the key difference between Apache Hadoop and Google's big-data stack?
Which one is better (Hadoop or Google's)?
The simple answer would be: it depends on what you want to do with your data.
Hadoop is used for massive storage of data and batch processing of that data. It is very mature and popular, and many libraries support the technology. But if you want to do real-time analysis or interactive queries on your data, Hadoop is not suitable.
Google's BigQuery was developed specifically to solve this issue. You can run near real-time, interactive queries on your data using BigQuery.
You can use BigQuery in place of Hadoop, or you can use BigQuery alongside Hadoop to query datasets produced by MapReduce jobs.
So it depends entirely on how you want to process your data. If a batch processing model is required and sufficient, you can use Hadoop; if you want real-time processing, you should choose Google's BigQuery.
Edit: You can also explore other technologies that work with Hadoop, like Spark, Storm, Hive etc. (and choose depending on your use case).
Some useful links for more exploration:
1: gavinbadcock's blog
2: cloudacademy's blog
Is Hadoop a proper solution for jobs that are CPU intensive and need to process a small file of around 500 MB? I have read that Hadoop is aimed to process the so called Big Data, and I wonder how it performs with a small amount of data (but a CPU intensive workload).
I would mainly like to know if a better approach for this scenario exists or instead I should stick to Hadoop.
Hadoop is a distributed computing framework providing a MapReduce engine. If you can express your parallelizable, CPU-intensive application in this paradigm (or any other supported by Hadoop modules), you may take advantage of Hadoop.
A classic example of a Hadoop computation is the calculation of Pi, which doesn't need any input data. As you'll see here, Yahoo managed to determine the two-quadrillionth digit of Pi thanks to Hadoop.
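As a local sketch (not the Hadoop job itself), the same "no input data" property shows up in a Monte Carlo estimate of Pi, which parallelizes simply by splitting the sample count across workers; the function names and worker counts below are illustrative choices.

```python
import random
from multiprocessing import Pool

def hits_in_circle(args):
    """Count random points in the unit square that land inside the quarter circle."""
    seed, samples = args
    rng = random.Random(seed)  # per-worker seed keeps the run reproducible
    return sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def estimate_pi(total_samples=400_000, workers=4):
    per_worker = total_samples // workers
    with Pool(workers) as pool:
        # Each worker draws an independent batch; only the counts are combined.
        hits = sum(pool.map(hits_in_circle, [(i, per_worker) for i in range(workers)]))
    return 4.0 * hits / total_samples

if __name__ == "__main__":
    print(estimate_pi())  # close to 3.14159...
```

The CPU-intensive work dominates and there is no input to shard, which is exactly the shape of job the Pi example demonstrates on Hadoop.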
However, Hadoop is indeed specialized for Big Data in the sense that it was developed for that aim. For instance, it provides a file system designed to hold huge files. These huge files are split into blocks distributed across a large number of nodes, and to ensure data integrity each block is replicated to other nodes.
To conclude, I'd say that if you already have a Hadoop cluster, you may want to take advantage of it.
If that's not the case, then, while I can't recommend anything specific since I don't know exactly what your need is, I think you can find more lightweight frameworks than Hadoop.
Well a lot of companies are moving to Spark, and I personally believe it's the future of parallel processing.
It sounds like what you want to do is use many CPUs, possibly on many nodes. For this you should use a scalable language designed for the problem: in other words, Scala. Using Scala with Spark is much easier and much faster than Hadoop.
If you don't have access to a cluster, it can still be a good idea to use Spark so that you can scale out more easily later. Or just use .par in Scala, which will parallelize your code across all the CPUs on your local machine.
Finally, Hadoop is indeed intended for Big Data, whereas Spark is really just a very general MPP framework.
You have exactly the type of computing issue that we face for data normalization: a need for parallel processing on cheap hardware and software, with ease of use, instead of going through all the special programming of traditional parallel processing. Hadoop was born of hugely distributed data replication with relatively simple computations; indeed, the test application still being distributed, WordCount, is numbingly simplistic. That is because the genesis of Hadoop was handling a tremendous amount of data and concurrent processing for search, with the "Big Data" analytics movement added on afterwards to find a more general-purpose business use case. Thus, Hadoop in its common form is not targeted at the use case you and we have. But Hadoop does offer the key capabilities of cheap, easy, fast parallel processing of "Small Data" with custom and complicated programming logic.
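For reference, the WordCount pattern mentioned above boils down to a map phase and a reduce phase; a minimal in-memory Python analogue (not Hadoop code, just the shape of the paradigm) looks like this:

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.lower().split()]

def reduce_phase(pairs):
    # Reducer: sum the counts per word (Hadoop's shuffle would group
    # the keys between the two phases; a dict does that job here).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be", "be"]
result = reduce_phase(chain.from_iterable(map_phase(l) for l in lines))
print(result)  # {'to': 2, 'be': 3, 'or': 1, 'not': 1}
```

The simplicity is the point: the framework's value is in distributing, shuffling, and replicating, not in the per-record logic.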
In fact, we have tuned Hadoop to do just this. We have a specially built hardware environment, PSIKLOPS, that is powerful for small clusters (1-10 nodes), with enough power at low cost to run 4-20 parallel jobs. We will be showcasing this in a series of webcasts by Inside Analysis titled Tech Lab, in conjunction with Cloudera for the first series, coming in early Aug 2014. We see this capability as a key enabler for people like you. PSIKLOPS is not required to use Hadoop in the manner we will showcase, but it is being configured to maximize ease of use for launching multiple concurrent containers of custom Java.
How does Storm compare to Hadoop? Hadoop seems to be the de facto standard for open-source large-scale batch processing; does Storm have any advantages over Hadoop, or are they completely different?
Why don't you share your opinion?
http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop/
http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
Twitter Storm has been touted as real-time Hadoop. That is more of a marketing take for easy consumption.
They are superficially similar, since both are distributed application solutions. Apart from typical distributed architectural elements like master/slave and ZooKeeper-based coordination, for me the comparison falls off a cliff.
Storm is more like a pipeline for processing data as it comes. The pipe is what connects the various computing nodes that receive data, compute, and deliver output (their lingo is spouts and bolts). Extend this analogy to a complex pipeline wiring that can be re-engineered when required, and you get Twitter Storm.
In a nutshell, it processes data as it comes, with minimal latency.
Hadoop, however, is different in this respect, primarily due to HDFS. It is a solution geared toward distributed storage and tolerance to outages at many scales (disks, machines, racks, etc.).
MapReduce is built to leverage data locality on HDFS to distribute computational jobs. Together, they do not provide a facility for real-time data processing. But that is not always a requirement when you are looking through large data (the needle-in-the-haystack analogy).
In short, Twitter Storm is a distributed real-time data processing solution. I don't think we should compare them; Twitter built Storm because it needed a facility to process small tweets, but a humongous number of them, in real time.
See HStreaming if you are compelled to compare Storm with something.
Basically, both are used for analyzing big data, but Storm is used for real-time processing while Hadoop is used for batch processing.
This is a very good introduction to Storm that I found:
Click here
Rather than being compared, they are meant to supplement each other, giving you batch plus (pseudo-)real-time processing. There is a corresponding video presentation - Ted Dunning on Twitter's Storm.
I used Storm for a while, then left this really good technology for an amazing one: Spark (http://spark.apache.org), which provides developers with a unified API for batch and streaming (micro-batch) processing, as well as machine learning and graph processing.
Worth a try.
Storm is for fast data (real time) and Hadoop is for big data (pre-existing tons of data). Storm can't process big data, but it can generate big data as output.
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
Since many subsystems exist in the Hadoop ecosystem, we have to choose the right subsystem depending on business requirements and the feasibility of a particular system.
Hadoop MapReduce is efficient for batch processing of one job at a time. This is the reason why Hadoop is used extensively as a data warehousing tool rather than a data analysis tool.
Since the question is related to only "Storm" vs "Hadoop", have a look at Storm use cases - Financial Services, Telecom, Retail, Manufacturing, Transportation.
Hadoop MapReduce is best suited for batch processing.
Storm is a complete stream processing engine and can be used for real-time data analytics with sub-second latency.
Have a look at this dezyre article for comparison between Hadoop, Storm and Spark. It explains similarities and differences.