Can I use Hadoop & MapReduce in Jupyter/IPython? Is there something similar to what PySpark for Spark is?
Of course you can. Many Frameworks like Hadoop Streaming, mrjob and dumbo to name a few. The techical aspect of including these in Jupyter should concist of either subprocess.Popen() calls or typical python imports, depending on the framework.
A nice overview/critique of some of these Frameworks can be found in this cloudera blogpost.
Related
word-counter example with hbase and hadoop
I am new to hadoop and hbase, i am going to implement a real example on a data set and understand the logic behind them.
I have already install hadoop and hbase on my system (ubuntu 17.04).
hadoop-2.8.0
hbase-1.3.1
is there any step-by-step tutorial for implementing word-counter example?
(word-counter example or any basic example exist)
There is comprehensive tutorial provided in HBase reference guide:
http://hbase.apache.org/book.html#mapreduce.example
Note, HBase provides alternative mechanism called Cascading which is similar to Map-Reduce, but allow to write code in simplified way (it's described in ref. guide too).
Are there any dependencies between Spark and Hadoop?
If not, are there any features I'll miss when I run Spark without Hadoop?
Spark is an in-memory distributed computing engine.
Hadoop is a framework for distributed storage (HDFS) and distributed processing (YARN).
Spark can run with or without Hadoop components (HDFS/YARN)
Distributed Storage:
Since Spark does not have its own distributed storage system, it has to depend on one of these storage systems for distributed computing.
S3 – Non-urgent batch jobs. S3 fits very specific use cases when data locality isn’t critical.
Cassandra – Perfect for streaming data analysis and an overkill for batch jobs.
HDFS – Great fit for batch jobs without compromising on data locality.
Distributed processing:
You can run Spark in three different modes: Standalone, YARN and Mesos
Have a look at the below SE question for a detailed explanation about both distributed storage and distributed processing.
Which cluster type should I choose for Spark?
Spark can run without Hadoop but some of its functionality relies on Hadoop's code (e.g. handling of Parquet files). We're running Spark on Mesos and S3 which was a little tricky to set up but works really well once done (you can read a summary of what needed to properly set it here).
(Edit) Note: since version 2.3.0 Spark also added native support for Kubernetes
By default , Spark does not have storage mechanism.
To store data, it needs fast and scalable file system. You can use S3 or HDFS or any other file system. Hadoop is economical option due to low cost.
Additionally if you use Tachyon, it will boost performance with Hadoop. It's highly recommended Hadoop for apache spark processing.
As per Spark documentation, Spark can run without Hadoop.
You may run it as a Standalone mode without any resource manager.
But if you want to run in multi-node setup, you need a resource manager like YARN or Mesos and a distributed file system like HDFS,S3 etc.
Yes, spark can run without hadoop. All core spark features will continue to work, but you'll miss things like easily distributing all your files (code as well as data) to all the nodes in the cluster via hdfs, etc.
Yes, you can install the Spark without the Hadoop.
That would be little tricky
You can refer arnon link to use parquet to configure on S3 as data storage.
http://arnon.me/2015/08/spark-parquet-s3/
Spark is only do processing and it uses dynamic memory to perform the task, but to store the data you need some data storage system. Here hadoop comes in role with Spark, it provide the storage for Spark.
One more reason for using Hadoop with Spark is they are open source and both can integrate with each other easily as compare to other data storage system. For other storage like S3, you should be tricky to configure it like mention in above link.
But Hadoop also have its processing unit called Mapreduce.
Want to know difference in Both?
Check this article: https://www.dezyre.com/article/hadoop-mapreduce-vs-apache-spark-who-wins-the-battle/83
I think this article will help you understand
what to use,
when to use and
how to use !!!
Yes, of course. Spark is an independent computation framework. Hadoop is a distribution storage system(HDFS) with MapReduce computation framework. Spark can get data from HDFS, as well as any other data source such as traditional database(JDBC), kafka or even local disk.
Yes, Spark can run with or without Hadoop installation for more details you can visit -https://spark.apache.org/docs/latest/
Yes spark can run without Hadoop. You can install spark in your local machine with out Hadoop. But Spark lib comes with pre Haddop libraries i.e. are used while installing on your local machine.
You can run spark without hadoop but spark has dependency on hadoop win-utils. so some features may not work, also if you want to read hive tables from spark then you need hadoop.
Not good at english,Forgive me!
TL;DR
Use local(single node) or standalone(cluster) to run spark without Hadoop,but stills need hadoop dependencies for logging and some file process.
Windows is strongly NOT recommend to run spark!
Local mode
There are so many running mode with spark,one of it is called local will running without hadoop dependencies.
So,here is the first question:how to tell spark we want to run on local mode?
After read this official doc,i just give it a try on my linux os:
Must install java and scala,not the core content so skip it.
Download spark package
There are "without hadoop" and "hadoop integrated" 2 type of package
The most important thing is "without hadoop" do NOT mean run without hadoop but just not bundle with hadoop so you can bundle it with your custom hadoop!
Spark can run without hadoop(HDFS and YARN) but need hadoop dependency jar such as parquet/avro etc SerDe class,so strongly recommend to use "integrated" package(and you will found missing some log dependencies like log4j and slfj and other common utils class if chose "without hadoop" package but all this bundled with hadoop integrated pacakge)!
Run on local mode
Most simple way is just run shell,and you will see the welcome log
# as same as ./bin/spark-shell --master local[*]
./bin/spark-shell
Standalone mode
As same as blew,but different with step 3.
# Starup cluster
# if you want run on frontend
# export SPARK_NO_DAEMONIZE=true
./sbin/start-master.sh
# run this on your every worker
./sbin/start-worker.sh spark://VMS110109:7077
# Submit job or just shell
./bin/spark-shell spark://VMS110109:7077
On windows?
I kown so many people run spark on windown just for study,but here is so different on windows and really strongly NOT recommend to use windows.
The most important things is download winutils.exe from here and configure system variable HADOOP_HOME to point where winutils located.
At this moment 3.2.1 is the most latest release version of spark,but a bug is exist.You will got a exception like Illegal character in path at index 32: spark://xxxxxx:63293/D:\classe when run ./bin/spark-shell.cmd,only startup a standalone cluster then use ./bin/sparkshell.cmd or use lower version can temporary fix this.
For more detail and solution you can refer for here
No. It requires full blown Hadoop installation to start working - https://issues.apache.org/jira/browse/SPARK-10944
I have a simple java program that sets up a MR job. I could successfully execute this in Hadoop infrastructure (hadoop 2x) using 'hadoop jar '. But I want to achieve the same thing using java command as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.
As you run your 'hadoop jar' command with the other parameters, same way you can run using java.
check if, this commands evaluates to hadoop class path
$ hadoop classpath
then whatever your custom jar is should be added in class path
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar
I am able to get mine working with this, on my hadoop cluster
I don't think you can find a documentation on this. hadoop command is a script, a lot of classes are used there eg. Class for accessing filesystem FsShell, class used when we run a jar RunJar etc. Adding hadoop related libraries, configuration files to classpath are handled in the hadoop command itself.
You better take a look at the hadoop script.
How can you do that? Any jar file execution means, it has to execute in distributed environment where all daemons work together to complete the execution.
We are not running locally or on local file system. So, it needs be executed as per the norms of hdfs so i don't think we can execute like we do in local file system.
Hadoop is a framework which simplifies the distributed computing. Before hadoop also, programmers know about parallel processing and multi threading concepts. But when you deal with multiple machines you need to know how to
Communicate between machines
Network processing
What if one machine fails? fault tolerance
and many more! which is a huge, that's where hadoop simplifies your job. It takes care of all your operating level stuff and you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer for that. Because by passing parameters the your program doesn't work. You will need to write lot of libraries to deal with distributed computing. If you want to explore them, then I would suggest go ahead and read hadoop source code.
http://hadoop.apache.org/version_control.html
I see a substitution for mapreduce jobs, MapR, which can read data directly from stream and process it. Is my understanding correct?
Are there any samples that I can refer?
Is it commercial?
Is there any catch in using it?
Is it a substitution for flume?
Can we use it with apache hadoop? If yes, then why does the distribution only talk about yarn and mapreduce and not MapR?
Thanks in advance.
MapR is a commercial distribution of Apache Hadoop with HDFS replaced with MapR-FS. Essentially it is the same Hadoop and same Map-Reduce jobs running on top of with, covered with tons of marketing that causes the confusion and questions like yours. Here's the diagram of the components they have in their distribution: https://www.mapr.com/products/mapr-distribution-including-apache-hadoop
For stream processing on top of MapR you can use Apache Spark Streaming, Apache Flume, Apache Storm - it depends on the task you need to solve
Yes, it is commercial, licensed per-node basis as far as I know. You can easily contact their sales guys, they would be glad to explain the prices and terms
Just like the other Hadoop distributions, but personally I would prefer fully open-source platform rather than proprietary MapR-FS, but its up to you to choose
No
Because Apache Hadoop is part of many commercial distributions: Cloudera, MapR, Hortonworks, Pivotal, etc. When you read about Hadoop, you read about the system architecture, and not about the commercial packages that offer its support for enterprises
A project of mine is to compare different variants of Hadoop, it is said that there are many of them out there, but googling didn't work well for me :(
Does anyone know any different variants of Hadoop? The only one I found was Haloop.
I think the more generic term is "map reduce":
http://www.google.com/search?gcx=c&sourceid=chrome&ie=UTF-8&q=map+reduce&safe=active
Not exactly sure what you mean by different variants for Hadoop.
But, there are a lot of companies providing commercial support or providing their own versions of Hadoop (open-source and proprietary). You can find more details here.
For ex., MapR has it's own proprietary implementation of Hadoop, but they claim it's compatible with Apache Hadoop, which is a bit vague because Apache Hadoop is evolving and there are no standards around Hadoop API. Cloudera has it's own version of Hadoop CDH which is based on the Apache Hadoop. HortonWorks has been spun from Yahoo, which provides commercial support for Hadoop.
You can find more information here. Hadoop is evolving very fast, so this might be a bit stale.
This can refer to
- hadoops file system,
- or its effective support for map reduce...
- or even more generally, to the idea of cloud / distributed storage systems.
Best to clarify what aspects of hadoop you are interested In.
Of course when comparing hadoop academically, you must first start looking at GFS- since that is the origin of hadoop.
Taking aside HBase we can see hadoop as two layers - storage layer and map-reduce layer.
Storage layer has the following really different implementation which would be interesting to compare: standard hadoop file system, HDFS over Cassandra (Brisk), HDFS over S3, MapR hadoop implementation.
MapR also have changed Map-reduce implementation.
This site http://www.nosql-database.org/ has a list of a lot of NoSql DBs out there. Maybe it can help you.