I have a "research" question:
Are there methods that combine batch learning (MapReduce) with stream learning for clustering?
Take a look at Apache Spark or Google Dataflow for programming models that work in both batch and stream mode.
Apache Spark has MLlib for machine learning.
There's some really interesting Spark Streaming/MLlib integration work coming out of the Freeman Lab, performing mini-batch clustering on streams by introducing a "forgetfulness" parameter:
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
https://gist.github.com/freeman-lab/9672685
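For a feel of the API, here is a minimal PySpark sketch of StreamingKMeans with a decay factor (the "forgetfulness" knob). The input directory, vector dimension, and parameter values are assumptions for illustration, not taken from the blog post:

```python
# Minimal sketch: streaming k-means with a decay ("forgetfulness") factor.
# Assumes files landing in a directory, one whitespace-separated vector per line.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.clustering import StreamingKMeans

sc = SparkContext(appName="StreamingKMeansSketch")
ssc = StreamingContext(sc, batchDuration=5)          # 5-second mini-batches

# Hypothetical input: each file holds lines like "0.1 2.3 4.5"
training = (ssc.textFileStream("hdfs:///tmp/training")
               .map(lambda line: Vectors.dense([float(x) for x in line.split()])))

# decayFactor < 1.0 discounts older batches, so clusters can drift over time
model = (StreamingKMeans(k=3, decayFactor=0.5)
            .setRandomCenters(3, 0.0, 42))           # dim=3, weight=0.0, seed=42

model.trainOn(training)                  # centers update with every mini-batch
model.predictOn(training).pprint()       # print cluster assignments per batch

ssc.start()
ssc.awaitTermination()
```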
The docs say that stateful operations like mapGroupsWithState in Structured Streaming are supported only in Scala and Java, but I do need stateful capabilities in Python. What should I do?
If you insist on using PySpark, there are two workarounds:
One option: perform the preprocessing in one Spark job, then store the necessary "state" stream to a file sink. In a second job, read this stream back and perform the output action. There's extra memory/disk/latency overhead involved.
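Here is a hedged sketch of that two-job pattern in PySpark; the paths and the event schema are hypothetical stand-ins for your own:

```python
# Two-job workaround sketch: job1 persists preprocessed "state" to a file
# sink, job2 reads it back as a stream. Run with "job1" or "job2" as argv[1].
import sys
from pyspark.sql import SparkSession, functions as F

SCHEMA = "id STRING, value DOUBLE, ts TIMESTAMP"     # assumed event schema

def job1(spark):
    # Preprocess the raw stream and write the intermediate state to parquet.
    events = spark.readStream.schema(SCHEMA).json("hdfs:///events/in")
    cleaned = events.filter(F.col("value").isNotNull())
    return (cleaned.writeStream.format("parquet")
                   .option("path", "hdfs:///events/state")
                   .option("checkpointLocation", "hdfs:///events/ckpt1")
                   .start())

def job2(spark):
    # Read the persisted state back as a stream and perform the output action.
    state = spark.readStream.schema(SCHEMA).parquet("hdfs:///events/state")
    counts = state.groupBy("id").count()
    return (counts.writeStream.outputMode("complete")
                  .format("console")
                  .option("checkpointLocation", "hdfs:///events/ckpt2")
                  .start())

if __name__ == "__main__":
    spark = SparkSession.builder.appName("stateful-workaround").getOrCreate()
    query = job1(spark) if sys.argv[1] == "job1" else job2(spark)
    query.awaitTermination()
```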
The other option: use the updateStateByKey API instead. This requires the legacy DStreams approach rather than Structured Streaming.
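A minimal sketch of the updateStateByKey route, keeping a running count per key; the socket source and checkpoint directory are placeholders:

```python
# DStreams (legacy Spark Streaming) stateful word count via updateStateByKey.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="UpdateStateSketch")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("hdfs:///tmp/ckpt")        # checkpointing is required for state

lines = ssc.socketTextStream("localhost", 9999)
pairs = lines.flatMap(lambda l: l.split()).map(lambda w: (w, 1))

def update(new_values, running):
    # new_values: counts from this batch; running: state carried from earlier batches
    return sum(new_values) + (running or 0)

pairs.updateStateByKey(update).pprint()

ssc.start()
ssc.awaitTermination()
```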
Neither approach is great. If you need the latest and greatest API features, I'd recommend transitioning to Scala now. As your project progresses, you will run into this problem repeatedly: since Spark is written in Scala, the Python API always lags behind.
I started doing research on data science and machine learning development using Mahout, and I found Hadoop. Both made me confused:
What is the relationship between Hadoop and Mahout?
For data science and machine learning, what is the best way to start?
Hadoop is a framework based on distributed storage and distributed processing concepts, built for processing large data sets. It has a distributed storage layer called the Hadoop Distributed File System (HDFS) and a distributed processing layer called MapReduce. Hadoop is designed to run on commodity hardware and is written in Java.
Mahout is a member of the Hadoop ecosystem that contains implementations of various machine learning algorithms. Mahout uses Hadoop's parallel processing capability to do the heavy lifting, so that end users can apply these algorithms to large data sets without much complexity. Users can reuse the algorithms directly or customize them, without having to worry about the complexities of the underlying MapReduce implementation.
For data science and machine learning, you should first learn about the usage and details of the algorithms themselves; then you can concentrate on Mahout. Since Mahout jobs in distributed mode are MapReduce jobs, you should also learn Hadoop fundamentals and MapReduce programming.
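To get a feel for the MapReduce programming model itself, here is a minimal word count written in Python against Hadoop Streaming. The streaming jar path varies by distribution, so treat the launch command in the comment as a template:

```python
# Minimal MapReduce word count for Hadoop Streaming. Save as wordcount.py:
#   hadoop jar /path/to/hadoop-streaming.jar \
#     -input /in -output /out \
#     -mapper "python wordcount.py map" -reducer "python wordcount.py reduce" \
#     -file wordcount.py
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)                  # emit (word, 1)

def reducer():
    current, count = None, 0
    for line in sys.stdin:                         # Hadoop sorts input by key
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```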
I just started learning Hadoop and have gone through some sites, and I often found the claim that
"Hadoop is not a real-time platform", even here on SO.
I'm confused by this and really can't understand it. Can anyone help me and explain it to me?
Thanks all
Hadoop was initially designed for batch processing: take a large dataset as input all at once, process it, and write a large output. The very concept of MapReduce is geared towards batch, not real-time. But to be honest, this was only true at Hadoop's beginning, and now you have plenty of opportunities to use Hadoop in a more real-time way.
First, I think it's important to define what you mean by real-time. You might be interested in stream processing, or you might want to run queries on your data that return results in real-time.
For stream processing on Hadoop: natively, Hadoop won't provide this kind of capability, but you can integrate other projects with it easily:
Storm-YARN allows you to use Storm on your Hadoop cluster via YARN.
Spark Streaming integrates with HDFS to let you process streaming data in near real-time (a minimal sketch follows this list).
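As a tiny illustration of the Spark route, this sketch (with a hypothetical input directory) picks up files as they land in HDFS and processes each micro-batch:

```python
# Spark Streaming watching an HDFS directory for newly arriving files.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="HdfsStreamSketch")
ssc = StreamingContext(sc, 5)                 # 5-second micro-batches

ssc.textFileStream("hdfs:///incoming/") \
   .count() \
   .pprint()                                  # print record count per batch

ssc.start()
ssc.awaitTermination()
```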
For real-time queries there are also several projects which use Hadoop:
Impala from Cloudera uses HDFS but bypasses MapReduce altogether because there's too much overhead otherwise.
Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
The Stinger project aims to make Hive itself more real-time.
There are probably other projects that would fit into the list of "Making Hadoop real-time", but these are the most well-known ones.
So as you can see, Hadoop is moving more and more in the direction of real-time, and even though it wasn't designed for that, you have plenty of opportunities to extend it for real-time purposes.
I'm trying to implement an algorithm to find connected components in a large graph (comparable in size to a social network) using MapReduce. I'm not familiar with Hadoop, though I've heard it can be used for this. I need some direction with using it.
Look at Apache Giraph. It is a Hadoop-based framework for working with graph algorithms.
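If you also want to see what the raw MapReduce formulation looks like, the classic approach is iterative label propagation: every node starts with its own id as its label, and in each round it adopts the smallest label among its own and its neighbors'; you repeat rounds until no label changes, at which point all nodes in a component share one label. Below is a hedged sketch of one round in Hadoop Streaming style; the input format (node, label, comma-separated neighbors per line) is my own assumption, not a standard:

```python
# One round of connected components via label propagation (Hadoop Streaming
# style). Input lines: "node<TAB>label<TAB>nbr1,nbr2,...". Labels are compared
# as strings here; in practice use zero-padded or numeric ids.
import sys

def mapper():
    for line in sys.stdin:
        node, label, nbrs = line.rstrip("\n").split("\t")
        print("%s\tN\t%s\t%s" % (node, label, nbrs))   # pass the node record through
        for nbr in nbrs.split(","):
            if nbr:
                print("%s\tL\t%s" % (nbr, label))      # offer our label to neighbor

def reducer():
    current, best, nbrs = None, None, ""
    def flush():
        if current is not None:
            print("%s\t%s\t%s" % (current, best, nbrs))
    for line in sys.stdin:                             # grouped/sorted by node id
        parts = line.rstrip("\n").split("\t")
        node, kind, label = parts[0], parts[1], parts[2]
        if node != current:
            flush()
            current, best, nbrs = node, None, ""
        if kind == "N":
            nbrs = parts[3]                            # recover adjacency list
        if best is None or label < best:
            best = label                               # keep the smallest label
    flush()

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```

A driver script would re-submit this job with each round's output as the next round's input; one common way to detect convergence is a job counter incremented whenever a label changes.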
I have an academic course, "Middleware", which covers different aspects of distributed software systems, including an introduction to topics like distributed file systems. It also introduces HBase, Hadoop, MapReduce, HiveQL, and Pig Latin.
I want to know: can I build a small project that tries to integrate the above technologies? For starters, I am aware of the VM provided by Cloudera for getting a feel for Hadoop and playing around using Eclipse.
I was thinking along the lines of implementing an application which accepts a stream of events as input, analyzes them, and produces an output.
I have both Windows and Linux on my machine, with an i7 processor and 4 GB of RAM.
Please let me know how to get started, and any suggestions for a simple example application are welcome.
Here is a blog post on analyzing tweets using Hive/HDFS, and here is another on performing clickstream analytics using Pig and Hive.
Check some of the Big Data use cases here and try to solve an interesting problem.