How can I integrate Hadoop with Mahout? - hadoop

How can I integrate Hadoop with Mahout?
I want to perform data analytics and need machine learning libraries.

I would start with the Mahout site and its tutorials; there are lots of useful links at http://mahout.apache.org
There are a number of books that will take you from first principles to producing data analytics. If you know Python, this one is probably a good place to start: http://shop.oreilly.com/product/0636920033400.do

Related

Salesforce vs Hadoop: which is better?

I have 4 years of experience in .NET and would like to learn a new technology. Which would be better for me to learn: Hadoop or Salesforce?
There is no answer to this question. Hadoop and Salesforce are completely different technologies. Hadoop provides distributed storage and processing and is well suited to big data. Salesforce is a cloud-based CRM tool.
The question to ask yourself is what you want next. Are you looking for a steady job? Are you looking for a career in a specific field where one of these technologies would be more helpful? What do you want?

Online recommendation using Mahout

How do I implement online recommendation using Mahout? I want to get recommendations from the Mahout recommendation engine in real time through some mechanism such as a REST API.
Please share any implementation ideas.
Regards.

HBase vs Cassandra: which is better for time-series data storage?

I use my API logs to extract information such as:
How many users did my API have in a given period of time?
Which types of services were called the most in a given period of time?
Almost all the information I extract depends on the timestamp. Currently I use MongoDB, with an index on the timestamp (for 80 GB of data, the index size is 12 GB).
A migration to Cassandra or HBase was recommended to me, and I want to know which is better for my use case:
Analysis of time-series data.
Good write and read performance are both required.
The possibility of using Hadoop for my data analysis.
Thanks for sharing your point of view or your experience.
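For illustration only, the two time-window queries described in the question can be sketched in plain Python. This is an in-memory sketch, not a Cassandra, HBase, or MongoDB query; the record layout (timestamp, user id, service name) and all function names are assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical API log records: (timestamp, user id, service name).
LOGS = [
    (datetime(2015, 1, 1, 10, 0), "alice", "search"),
    (datetime(2015, 1, 1, 10, 5), "bob", "search"),
    (datetime(2015, 1, 1, 11, 0), "alice", "upload"),
    (datetime(2015, 1, 2, 9, 0), "carol", "search"),
]


def in_window(logs, start, end):
    """Return the records whose timestamp satisfies start <= ts < end."""
    return [rec for rec in logs if start <= rec[0] < end]


def distinct_users(logs, start, end):
    """Count the distinct users seen in the time window."""
    return len({user for _, user, _ in in_window(logs, start, end)})


def most_called_services(logs, start, end, n=1):
    """Return the n most frequently called services in the time window."""
    counts = Counter(service for _, _, service in in_window(logs, start, end))
    return counts.most_common(n)
```

Both queries filter on the timestamp first, which is why a store that keeps rows ordered by time tends to suit this workload.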
Advantages of Cassandra:
Cassandra generally shows better performance (though both are excellent).
Cassandra is substantially easier to set up and manage from an operational standpoint (though there are tools that will help either way).
Advantages of HBase:
Native to the Hadoop ecosystem.
HBase requires you to install Hadoop anyway, so you get a nice two-for-one. To run Hadoop-style analytics on Cassandra, you will probably need either DataStax Enterprise, a commercial, non-open-source product, or Spark, which has an open-source connector for Cassandra.
Chocolate or Vanilla ice cream - which is better?
I would suggest that you are the best decision maker. Set up a development environment for each option; this will tell you much more about operational and tuning issues than, I think, anyone else could. :)

What is it exactly?

Why is Apache Hadoop defined as a Platform as a Service in figure 1 at this link (http://www.ibm.com/developerworks/aix/library/au-cloud_apache/#figure2), but as a NoSQL wide-column store database at http://nosql-databases.org?
I mean, when working with Hadoop, do I need a database too?
Thanks in advance.
Hadoop is basically a collection of Java software that fundamentally provides two things:
A distributed file system implementation.
A framework for writing and running MapReduce jobs written in Java.
Many things are built on top of these two pieces (like HBase, which is probably the columnar datastore you have read about).
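To make the MapReduce idea concrete, here is a minimal sketch of the programming model as a word count in plain Python. It does not use any Hadoop API: in a real Hadoop job the mapper and reducer would typically be Java classes, the input would come from the distributed file system, and the framework would do the grouping step across machines. All function names here are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

The point of the model is that each mapper call and each reducer call is independent of the others, so a framework can spread them over many machines.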
A good resource for learning more about Hadoop is the Apache project page documentation. If that looks confusing, there is also a book called 'Hadoop: The Definitive Guide', which is pretty good reading.
If you want to read about how it all began, I'd recommend reading this Google paper upon which Hadoop is based.
Hope that helps.

Example application using HDFS+Map Reduce

I have an academic course, "Middleware", which covers different aspects of distributed software systems, including an introduction to topics like distributed file systems. It also introduces HBase, Hadoop, MapReduce, HiveQL, and Pig Latin.
I want to know whether I can do a small project that integrates the above technologies. For starters, I am aware of the VM provided by Cloudera for getting a feel for Hadoop and experimenting with it using Eclipse.
I was thinking along the lines of implementing an application that accepts a stream of events as input, analyses it, and produces an output.
I have both Windows and Linux on my machine, with an i7 processor and 4 GB of RAM.
Please let me know how to get started; any suggestions for a simple example application are welcome.
Here is a blog post on analyzing Tweets using Hive/HDFS. And here is a blog post on performing Clickstream analytics using Pig and Hive.
Check some of the Big Data use cases here and try to solve an interesting problem.

Resources