How can I integrate Hadoop with Mahout?
I want to perform data analytics and need machine learning libraries.
I would start by reviewing the Mahout site and its tutorials; there are lots of useful links: http://mahout.apache.org
There are a number of books that will take you from first principles to producing data analytics. This one is probably a good place to start (http://shop.oreilly.com/product/0636920033400.do) if you know Python.
I use my API logs to extract information like:
How many users did my API have in a given period of time?
Which types of services were called the most in that period?
Almost all the information I extract depends on the timestamp. Currently I use MongoDB, and I added the timestamp as an index (for 80 GB of data, the index size is 12 GB).
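For illustration, the two queries above can be sketched in Python over a handful of in-memory records. The record layout (timestamp, user, service), the function names, and the sample data are all assumptions made for this example, not the real log schema or a MongoDB query.

```python
from collections import Counter
from datetime import datetime

# Hypothetical log records: (timestamp, user_id, service_name).
LOGS = [
    (datetime(2015, 1, 1, 10, 0), "alice", "search"),
    (datetime(2015, 1, 1, 10, 5), "bob", "search"),
    (datetime(2015, 1, 1, 11, 0), "alice", "upload"),
    (datetime(2015, 1, 2, 9, 0), "carol", "search"),
]


def in_period(logs, start, end):
    """Return the records whose timestamp lies in the half-open range [start, end)."""
    return [record for record in logs if start <= record[0] < end]


def count_users(logs, start, end):
    """Number of distinct users that called the API during the period."""
    return len({user for _, user, _ in in_period(logs, start, end)})


def top_services(logs, start, end, n=1):
    """The n most frequently called services during the period, with call counts."""
    counts = Counter(service for _, _, service in in_period(logs, start, end))
    return counts.most_common(n)


if __name__ == "__main__":
    start, end = datetime(2015, 1, 1), datetime(2015, 1, 2)
    print(count_users(LOGS, start, end))
    print(top_services(LOGS, start, end))
```

Both functions filter on the timestamp first, which is why a timestamp index (or a time-ordered row key in a wide-column store) matters so much for this workload.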
A migration to Cassandra or HBase was recommended to me, and I want to know which is better for my use case:
Analysis of time-series data.
Both good write and read performance are required.
Possibility of using Hadoop to do my data analysis.
Thanks for sharing your point of view or your experience.
Advantages of Cassandra:
Cassandra generally shows better performance (though both are excellent).
Cassandra is substantially easier to set up and manage from an operational standpoint (though there are tools that will help either way).
Advantages of HBase:
Native to the Hadoop ecosystem
HBase requires you to install Hadoop anyway, so you get a nice two-for-one. To use Cassandra you will probably need either to use DataStax Enterprise, a commercial, non-open-source product, or to investigate using Spark for your analytics work, which has an open-source connector for Cassandra.
Chocolate or Vanilla ice cream - which is better?
I would suggest that you would be the best decision maker. Set up development environments for each option, and this will tell you much more about operational and tuning issues than, I think, anyone else might be able to give you. :)
Why is Apache Hadoop defined as a Platform as a Service in figure1 of this link (http://www.ibm.com/developerworks/aix/library/au-cloud_apache/#figure2), but as a NoSQL wide-column store database at http://nosql-databases.org?
I mean, when working with Hadoop, do I need a database too?
Thanks in advance.
Hadoop is basically a collection of Java software that fundamentally provides two things:
A distributed file system implementation.
A framework for writing and running MapReduce jobs written in Java.
Many things are built on top of these two pieces (like HBase, which is probably the columnar datastore you have read about).
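To give a feel for the MapReduce programming model, here is a minimal single-process sketch in Python of the classic word count. It is only an illustration of the map, shuffle, and reduce steps; it does not use Hadoop or its Java API, and the function names are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores files", "Hadoop runs jobs"]))
```

In a real cluster the input lines would come from files in the distributed file system, and the map and reduce functions would run in parallel on many machines; the logic of each step stays the same.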
A good resource for learning more about Hadoop is the Apache project page documentation. If that looks confusing, there is also a book called 'Hadoop: The Definitive Guide', which is pretty good reading.
If you want to read about how it all began, I'd recommend reading this Google paper upon which Hadoop is based
Hope that helps.
I have an academic course, "Middleware", which covers different aspects of distributed software systems, including an introduction to topics like distributed file systems. It also involves an introduction to HBase, Hadoop, MapReduce, HiveQL, and Pig Latin.
I want to know whether I can do a small project that tries to integrate the above technologies. For starters, I am aware of the VM provided by Cloudera for getting a feel for Hadoop and playing around using Eclipse.
I was thinking along the lines of implementing an application which accepts a stream of events as input, analyzes it, and produces an output.
I have both Windows and Linux on my machine, with an i7 processor and 4 GB of RAM.
Please let me know how to get started with everything; any suggestions for a simple example application are welcome.
Here is a blog post on analyzing Tweets using Hive/HDFS. And here is a blog post on performing Clickstream analytics using Pig and Hive.
Check some of the Big Data use cases here and try to solve an interesting problem.
I want to learn more about how to build CEP-based applications. So I looked around and found several products (an overview can be found here: http://rulecore.com/CEPblog/?page_id=47).
But as there are quite a few at the moment, I don't know which is the best to start with. And overall I would only consider the ones available for free. The rest are a bit too expensive for just private use ;)
Esper is free, but without Esper Studio it seems quite tedious to develop a CEP app. StreamBase offers a free trial, but I couldn't find out how long you can use it (if only for a month, it is not that helpful for longer research). The Oracle CEP suite seems quite complete, but in the CEP scene, as far as I can see, it is the least recognized compared to Esper or StreamBase.
So do you have any hints on the best way to start with CEP development? Is it worth spending time working through the Oracle documentation, or is it better to start with Esper or StreamBase?
Cheers,
Andreas
Microsoft's CEP offering is StreamInsight, which closely resembles the reactive programming model of the Rx Framework and LINQ.
A Hitchhiker's Guide to StreamInsight Queries is a good place to start.
Some Code Examples
I would recommend using LINQPad, which can connect to StreamInsight, as a canvas for your queries.
The current CEP tools do not solve identical problems! So depending on what you want to do, you would use different tools. In short, my personal choices would be:
For building data driven algorithms, coding in a type of SQL with extensions - The Coral8 engine from Aleri. Free for test and development (it was, anyway, before being bought by Aleri)
For detecting event patterns (situations), no coding (declarative style) but configuration using XML - RuleCore, free test subscription to the (web) service
For a mix of both with low level control and coding in Java - Esper, GPL.
For creating data driven computation logic using graphical boxes-and-arrows style of GUI: StreamBase.
I think the best choice is to compare the solutions that are freely available and then make something with them.
I'm not sure what your end goals are, if it's to learn a technology that you use at work or just to play around with something cool, but for me on a project like this, the deciding factor would be which tool can I use to make something I could share with the world.
In this case, my options would probably be Esper or OpenESB. That way, I could put the project on a resume (especially if I was applying for a job that used CEP tools) and share it with the world.
You could read the blog of Curt Monash (http://www.dbms2.com); he writes about things like CEP.
Would there be any interest in a free subscription to the ruleCore (Cloud, SaaS or whatever these are called today) Service? It would be running on smaller and less reliable (no cluster) hardware and probably only usable for testing out small, low-performance kinds of things. If support#rulecore.com gets a couple of requests of this kind, I'm sure it will be put on the to-do list...
For detecting event patterns I found that rulecore is pretty easy to use. I have only tried to detect patterns of low and medium complexity, and that worked fine. It takes some time to get used to the concepts, but it is actually a very small system, so it was not that bad. And you need to like XML, as everything is done using XML.
If you are trying to create a trading application then StreamBase would be better. But for monitoring stuff rulecore feels better.
If you have continuous streams (market feeds, IoT sensors, Twitter, news, etc), then stream processing technology is the right choice for you. Stream processing / streaming analytics is only a part of different CEP solutions (streams, rules, patterns, etc.).
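As a rough illustration of what stream processing over a continuous feed looks like, here is a minimal Python sketch that counts events in a sliding time window and flags bursts. The class and function names, the window length, and the threshold are hypothetical and chosen for the example; it assumes timestamps arrive in non-decreasing order and is not taken from any of the products mentioned here.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamps fall within the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()

    def add(self, timestamp):
        """Record one event and return the number of events still in the window."""
        self.events.append(timestamp)
        # Evict events that are a full window (or more) older than the newest one.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()
        return len(self.events)


def detect_bursts(timestamps, window_seconds, threshold):
    """Return the timestamps at which the windowed event count reaches the threshold."""
    counter = SlidingWindowCounter(window_seconds)
    return [t for t in timestamps if counter.add(t) >= threshold]


if __name__ == "__main__":
    # Event times in seconds; flag moments with 3 or more events in 5 seconds.
    print(detect_bursts([0, 1, 2, 10, 11, 12, 13], window_seconds=5, threshold=3))
```

Dedicated engines add much more on top of this (declarative queries, joins across streams, pattern matching, fault tolerance), but a windowed aggregate like this one is the basic building block.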
By now there are several open source options for stream processing, e.g. Apache Storm, Apache Spark or Apache Samza, as well as powerful proprietary products such as IBM InfoSphere Streams, TIBCO StreamBase or Software AG's Apama.
Take a look at my blog post and article for more details about different stream processing and streaming analytics solutions (open source and proprietary):
Comparison of Stream Processing and Streaming Analytics Alternatives (Apache Storm, Spark, IBM InfoSphere Streams, TIBCO StreamBase, Software AG Apama)
I would start with the free trial of Aleri Coral8 (now Sybase).