Hadoop vs Mahout and Machine learning Issue? - hadoop

I start making research about Data science and machine learning development using mahout, and i found hadoop, Both made me confused :
what is the relationship between hadoop and mahout?
For Data Science and machine learning stuff, what is the best to start ?

Hadoop is a framework based on distributed storage and distributed processing concepts for processing large data. It is having a distributed storage layer called hadoop distributed file system (HDFS) and a distributed processing layer called mapreduce. Hadoop is designed in such a way that it can run on commodity hardware. Hadoop is written in Java.
Mahout is a member in hadoop ecosystem which contains the implementation of various machine learning algorithms. Mahout utilizes hadoop's parallel processing capability to do the processing so that the end user can use this with the large data sets without much complexity. User can either reuse these algorithms directly or use with some customizations, but no need to worry much about the complexities of the mapreduce implementation of the algorithm.
For Data Science and machine learning stuffs, you should learn about the usage and details of the algorithms. Then you can concentrate on mahout. Since mahout jobs in distributed mode are mapreduce jobs, you should learn hadoop fundamentals and mapreduce programming.

Related

Relevance of Hadoop & Streaming solutions when Spark exists

I am starting a big data initiative for my startup. In 2018 is there any reason to use Hadoop at all since Spark is touted to be way faster due to it primarily not writing the intermediate data to disk as Hadoop’s MR.
I realize Spark has a higher need for RAM But that would be just one time CAPEX costs that would pay for itself?
In general unless there are legacy projects why should one pick up Hadoop at all since Spark is available?
Would appreciate real world comparisons of the two, gotchas etc.?
Alternately are there use cases that Hadoop can solve but Spark cannot?
—————-comment below for actual problem————
I would use YARN as the resource manager with HDFS as the file system for Spark.
Also realize that as Spark intersects quiet a bit with Hadoop ecosystem.
Comparos are :
Mapreduce vs Spark code
SparkSQL vs Hive
People mention Pig too but not a whole lot of people want to learn custom querying. And if I had to use Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop?
Also not sure how Spark handles the following:
If data does not fit in RAM then what ? Back to a disk based paradigm (not talking of streaming use cases here..) so no better than Mapreduce? How does Tez make MR2 better?
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Where I am unclear is the plethora of overlapping choices. For e.g. streaming alone has:
Spark streaming
Apache storm
Apache Samza
Kafka streams
CEP commercial tools.(ORacle CEP, TIBCO etc.)
A lot of them use DAG similar to Spark’s core engine so hard to pick one from the other.
Use case:
App sends data to middleware until end of event. Event can end specified on periodicity or due to a business condition being met.
Middleware must show real time addition of a value (simplifying) sent by users from their app instances. Accepted that middleware is the floor of the actual sum of values and real value can be higher. Plan to use Kafka streams here to have a consumer that adds all the inputs with minimal latency the consumer posts to a cache which is polled by apps to show current additive value.
Middleware logs all input
After event ends a big data paradigm scans through log data and database records to get accurate count by comparing all dB values and log entries (audit) and compare them to the Kafka shown value. Value calculated by this scheme is the final value.
Design choices:
I like Kafka because it decouples application middleware and is low latency high throughput messaging. Streams code is easy to write . Happy for someone to counter argue using Spark Streams Or Apache Storm or Apache Samza instead?
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients. Not doing client caching due to explicit liveliness of additive value.
You're confusing Hadoop with just MapReduce. Hadoop is an ecosystem of MapReduce, HDFS, and YARN.
First of all, Spark doesn't have a filesystem. That's primarily why Hadoop is nice, in my book. Sure, you can use S3, or many other cloud storages, or bare metal data stores like Ceph, or GlusterFS, but from what I've researched, HDFS is by far the fastest when processing data.
Maybe you're not familiar with the concept of rack locality that YARN offers. If you use Spark Standalone mode with any file system not mounted under the Spark executors, then all your data requests will need to be pulled over a network connection, therefore saturating the network, and causing a bottleneck, regardless of memory. Compare that to the Spark executors running on the YARN NodeManagers, HDFS datanodes are ideally also NodeManagers.
A similar problem - people say Hive is slow, SparkSQL is faster. Well, that's true if you run Hive with MapReduce instead of Tez or Spark execution modes.
Now, if you're wanting streaming and real-time events rather than the batch world commonly associated with Hadoop. You might want to research the SMACK stack.
Update
Pig as a data scientist why wouldn’t I use say an Apache NiFi with Hadoop
Pig is not comparable to NiFi.
You can use NiFi; nothing is stopping you. It would run closer to real-time than Spark micro batches. And it is a good tool to pair with Kafka.
plethora of overlapping choices
Yes, and you didn't even list them all... It's up to some BigData architect in your company to come up with a solution. You'll find that vendor support from Confluent is mostly for Kafka. I haven't seen them talking about Samza much. Hortonworks will support Storm, Nifi, and Spark, but they aren't running the latest version of Kafka if you want fancy features like KSQL. Streamsets is a similar company offering a tool competing with NiFi which consists of employees with backgrounds in other batch/streaming Apache projects.
Storm and Samza are two ways to do the same thing, as far as I know. I think Flink is more programmer friendly than Storm. I don't have experience with Samza, though I work closely with people who primarily are using Kafka Streams rather than it. And Kafka Streams isn't DAG based - it's just a high level Kafka library, embeddable in any JVM application.
If data does not fit in RAM then what ?
By default, it spills to disk... Spark has parameters to configure if you don't want disk to be touched. In which case, your jobs die of OOM more quickly, obviously.
How does Tez make MR2 better?
Tez isn't MR. It creates more optimized DAGs like Spark does. Go read about it.
Hadoop 3 has support for Erasure coding to reduce data replication. What does Spark do?
Spark has no filesystem. We already covered this. Erasure encoding is primarily for data at-rest, not during processing. I actually don't know if Spark supports Hadoop 3, yet.
Application itself is Java code on Tomcat server with REST end points for iOS/ Android clients
Personally, I would use Kafka Streams here because 1) You are using Java already 2) it's a standalone thread in your code that offers you to read/publish data from Kafka without Hadoop/YARN or Spark Clusters. It's not clear what your question has to do with Hadoop from your listed client-server archictecture, but feel free to string an additional line from a Kafka topic to a database/analytics engine of your choice. The Kafka Connect framework has many connectors for you to choose from.
You could also use NiFi as your mobile REST API to just ExposeHTTP and send requests to it, then route flows based on attributes in the data. Then, manipulate and publish to Kafka as well as other systems.
Spark and Hadoop works pretty similar in the way of solving MapReduce problems.
Hadoop is pretty relevant if you talk about HDFS point of view. The HDFS is a well known used solution for big data storage. But your question is about MapReduce.
Spark is the best option if you are talking about good machines with real good configuration of memory and network throughput. But we know that kind of machines are expensive and sometimes you best option is to use Hadoop to process your data. Spark is great and fast but sometimes you get crazy with Memory issues if you don't have a good cluster in case of fit too much data in the memory. Hadoop in this case can be better. But this problem year after year are less relevant.
So hadoop is here com complement Spark, Hadoop is not only MapReduce Hadoop is an ecosystem. Spark doesn't have a distributed file system, to Spark works well you need one, Spark doesn't have a resource manager, Hadoop has called Yarn. And Spark in a cluster mode need a resource manager.
Conclusion
Hadoop still relevant as an ecosystem but as only mapReduce I can say that is not been used anymore.

Difference between BOINC and Hadoop/Spark/etc

What's the difference between BOINC https://en.wikipedia.org/wiki/Berkeley_Open_Infrastructure_for_Network_Computing
vs. General Hadoop/Spark/etc. big data framework? They all seem to be distributed computing frameworks - are there places where I can read about the differences or BOINC in particular?
Seems the Large Hadron Collider in EU is using BOINC, why not Hadoop?
Thanks.
BOINC is software that can use the unused CPU and GPU cycles on a computer to do scientific computing
BOINC is strictly a single application that enables grid computing using unused computation cycles.
Apache Hadoop is an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common and should be automatically handled by the framework.
The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part called MapReduce.
(emphasis added to framework and it's dual functionality)
Here, you see Hadoop is a framework (also referred to as an ecosystem) that has both storage and computing capabilities. Hadoop vendors such as Cloudera and Hortonworks bundle in additional functionality into that (Hive, Hbase, Pig, Spark, etc) as well as a few security/auditing tools.
Additionally, hardware failure is handled differently by these two clusters. If a BOINC node dies, there is no fault tolerance; those resources are lost. In the case of Hadoop, data is replicated and tasks are re-ran a certain number of times before eventually failing, but these steps are traceable as long as the logging services built into the framework are running.
Seems the Large Hadron Collider in EU is using BOINC, why not Hadoop?
Because BOINC provides a software that anyone in the world can install to join the cluster, they gain a large range of computing power from anywhere practically for free.
They might be using Hadoop internally to do some storage and perhaps Spark to do additional computing, but buying commodity hardware in bulk and building/maintaining that cluster seems cost prohibitive.
What is similar between BOINC and Hadoop is that they exploit that a big problem can be solved in many parts. And both are most associated with distributing data across many computers, not an application.
The difference is the degree of synchronisation between all contributing machines. With Hadoop the synchronisation is very tight and you expect at some point all data to be collected from all machines to then come up with the final analysis. You literally wait for the last one and nothing is returned until that last fraction of the job was completed.
With BOINC, there is no synchronicity at all. You have many thousands of jobs to be run. The BOINC server side run by the project maintainers orchestrates the delivery of jobs to run to the BOINC client sides run by volunteers.
With BOINC, the project maintainers have no control over the clients at all. If a client is not returning a result then the work unit is sent elsewhere again. With Hadoop, the whole cluster is accessible to the project maintainer. With BOINC, the application is provided across different platforms since it is completely uncertain what platform the user offers. With Hadoop everything is well-defined and typically very homogeneous. BOINC's largest projects have many tens of thousands of regular volunteers, Hadoop has what you can afford to buy or rent.

How does Pig integrate with Map Reduce [Research Project]

I am working on a master thesis project which aims at integrating a custom Map Reduce framework with similar MR interface but own implementation and pipeline, with higher level language frameworks as PIG.
Currently, the MR master and workers have been integrated with YARN so that such jobs can be launched on YARN. The framework is written in C++ and it's running OpenCL defined Map and Reduce functions.
The aim is to make the proprietary MR framework available for usage in as many scenarios as possible, maintaining it's own pipeline, with minimum changes to the applications or frameworks which employ Hadoop Map Reduce.
Given the large landscape of Hadoop projects I would need some pointers to resources, literature or documentation of how this has been achieved (Pig can run on top of Hadoop MR and Spark at least) or which options can be considered. (I am conducting reading already into YARN, Pig and so on but some pointers would be really helpful)

Hadoop good example of production implementation [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
I hear lot about Hadoop but when it comes for defining what it is i get confused. Because definition defers form point to point.
Is Hadoop something that serves files from server to client?
Ex: If we implement Hadoop for a MAILDIR where emails are stored, Can Hadoop help in accessing the emails and serving it to client in super fast speed? Is this how it can be used?
Can you tell me in simple words what is Hadoop and its uses?
Dude You are messing this up.
Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. Hadoop is an Apache project being built and used by a global community of contributors and users.
The Apache Hadoop framework is composed of the following modules
Hadoop Common – contains libraries and utilities needed by other Hadoop modules
Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
Hadoop MapReduce – a programming model for large scale data processing.
For the end-users, though MapReduce Java code is common, any programming language can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts of the user's program.Apache Pig, Apache Hive, Apache Spark among other related projects expose higher level user interfaces like Pig Latin and a SQL variant respectively. The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell-scripts.
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file-system written in Java for the Hadoop framework. Each node in a Hadoop instance typically has a single namenode; a cluster of datanodes form the HDFS cluster. The situation is typical because each node does not require a datanode to be present. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure call (RPC) to communicate between each other.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require RAID storage on hosts (but to increase I/O performance some RAID configurations are still useful). With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
The HDFS file system is not restricted to MapReduce jobs.It can be used for other applications including the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can in theory be used for any sort of work that is batch-oriented rather than real-time, that is very data-intensive, and able to work on pieces of the data in parallel.
Commercial applications of Hadoop includes :
Log and/or clickstream analysis of various kinds
Marketing analytics
Machine learning and/or sophisticated data mining
Image processing
Processing of XML messages
Web crawling and/or text processing
General archiving, including of relational/tabular data, e.g. for compliance
You can refer to YDN to have a good startup in understanding the hadoop framework.

why Hadoop is not a real-time platform

i just started to learn Hadoop and have gone through some sites and i often found that
"Hadoop is not a real-time platform" even in SO also
I mess with this and i really cant understand about it . Can any one help me and explain me about this?
Thanks all
Hadoop was initially designed for batch processing. That means, take a large dataset in input all at once, process it, and write a large output. The very concept of MapReduce is geared towards batch and not real-time. But to be honest, this was only the case at Hadoop's beginning, and now you have plenty of opportunities to use Hadoop in a more real-time way.
First I think it's important to define what you mean by real-time. It could be that you're interested in stream processing, or could also be that you want to run queries on your data that return results in real-time.
For stream processing on Hadoop, natively Hadoop won't provide you with this kind of capabilities, but you can integrate some other projects with Hadoop easily:
Storm-YARN allows you to use Storm on your Hadoop cluster via YARN.
Spark integrates with HDFS to allow you to process streaming data in real-time.
For real-time queries there are also several projects which use Hadoop:
Impala from Cloudera uses HDFS but bypasses MapReduce altogether because there's too much overhead otherwise.
Apache Drill is another project that integrates with Hadoop to provide real-time query capabilities.
The Stinger project aims to make Hive itself more real-time.
There are probably other projects that would fit into the list of "Making Hadoop real-time", but these are the most well-known ones.
So as you can see, Hadoop is going more and more towards the direction of real-time and, even if it wasn't designed for that, you have plenty of opportunities to extend it for real-time purposes.

Resources