
word-counter example with hbase and hadoop
I am new to Hadoop and HBase, and I am going to implement a real example on a data set to understand the logic behind them.
I have already installed Hadoop and HBase on my system (Ubuntu 17.04):
hadoop-2.8.0
hbase-1.3.1
Is there any step-by-step tutorial for implementing a word-counter example?
(A word-counter example or any other basic example would do.)

There is a comprehensive tutorial in the HBase reference guide:
http://hbase.apache.org/book.html#mapreduce.example
Note that the reference guide also describes Cascading, an alternative to writing raw MapReduce that lets you express the same kind of job in a simplified way.
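To give a concrete idea of what the reference-guide example looks like, here is a minimal sketch of a word counter that uses an HBase table as MapReduce input (HBase 1.x / Hadoop 2.x APIs, matching the versions you installed). The table name "articles" and the column "content:text" are assumptions for illustration, not part of the tutorial; adapt them to your own schema.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HBaseWordCount {

        // Mapper: reads one HBase row at a time, splits the "content:text"
        // cell into words, and emits (word, 1) for each.
        static class WordMapper extends TableMapper<Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(ImmutableBytesWritable row, Result value, Context context)
                    throws IOException, InterruptedException {
                byte[] cell = value.getValue(Bytes.toBytes("content"), Bytes.toBytes("text"));
                if (cell == null) return;
                for (String token : Bytes.toString(cell).split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reducer: sums the ones emitted for each word.
        static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "hbase-word-count");
            job.setJarByClass(HBaseWordCount.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // larger scanner caching for MR jobs
            scan.setCacheBlocks(false);  // don't pollute the region server block cache

            TableMapReduceUtil.initTableMapperJob(
                    "articles", scan, WordMapper.class, Text.class, IntWritable.class, job);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path(args[0]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Build it against the hbase-server and hadoop-client artifacts and submit it with hadoop jar; the per-word counts land in the HDFS output directory given as the first argument.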

Related

Can I use hadoop in Jupyter/IPython

Can I use Hadoop & MapReduce in Jupyter/IPython? Is there something similar to what PySpark is for Spark?
Of course you can. There are many frameworks: Hadoop Streaming, mrjob and dumbo, to name a few. The technical side of using them in Jupyter consists of either subprocess.Popen() calls or typical Python imports, depending on the framework.
A nice overview/critique of some of these frameworks can be found in this Cloudera blog post.

How hadoop mapreduce internally works in cloud?

I started working on hadoop mapreduce.
I am a beginner with Java and Hadoop and know the coding for Hadoop MapReduce, but I am interested in learning how it works internally in the cloud.
Can you please share some good links which explain how Hadoop works internally?
How Hadoop works is not related to the cloud: it works the same way on 3 laptops ;-) Hadoop is often linked to cloud computing because it is designed to be used with a lot of cheap machines, so it makes sense to run it in the cloud.
By the way, Hadoop is NOT only map/reduce. It is a distributed file system first, on which we are able to execute distributed tasks, and NOT ONLY map/reduce tasks (since version 2, I think).
It is a very large subject, so if you are starting out, you will have to read many articles before you master it ;-)
My advice. First look for articles about MapReduce:
http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ (short)
https://developer.yahoo.com/hadoop/tutorial/module4.html (long)
Then look for articles about the Hadoop architecture (the file system, then YARN):
http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html
http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-yarn-site/YARN.html
You should have a look at SlideShare too.

Confusion in Apache Nutch, HBase, Hadoop, Solr, Gora

I am new to all these terms and have spent some time trying to understand them, but I still have some confusions. Please correct me if I am wrong.
Nutch: It's for web crawling; using it we can crawl web pages and store them somewhere in a DB.
Solr: Solr can be used for indexing the web pages crawled by Apache Nutch. It helps in searching the indexed web pages.
HBase: It's used as an interface to interact with Hadoop. It helps in getting data in real time from HDFS. It provides a simple SQL-like interface for interacting.
Hadoop: It provides two functionalities: one is HDFS (Hadoop Distributed File System) and the other is the Map-Reduce functionality taken from Google's algorithms. It's basically used for offline data backup, etc.
Gora and ZooKeeper: I am not sure about these.
Confusions:
1) Is HBase a key-value pair DB or just an interface to Hadoop? Or should I rather ask: can HBase exist without Hadoop?
If yes, can you explain a bit more about its usage?
2) Is there any use in crawling data with Apache Nutch without indexing it into Solr?
3) For running Apache Nutch, do we need HBase and Hadoop? If not, how can we make it work without them?
4) Is Hadoop part of HBase?
Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS
Because HBase is built on top of Hadoop, you can't really have HBase without Hadoop.
Yes, you can run Nutch without Solr; there do not seem to be many use cases, however, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be a lot of real-world examples of people doing this.
Yes, Hadoop is part of HBase in the sense that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
Zookeeper is used for configuration, naming, synchronization, etc. in Hadoop stack workflows. Gora is a memory management/persistence framework and is built on top of Hadoop.
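To make the key-value point in (1) concrete, this is roughly what talking to HBase looks like from the Java client (HBase 1.x API): you put and get cells by row key, column family, and qualifier; there is no SQL layer in HBase itself. The "users" table and "info" column family below are hypothetical names for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValueDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write: row key -> column family:qualifier -> value
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                        Bytes.toBytes("alice"));
                table.put(put);

                // Read the cell back by row key
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name)); // prints "alice"
            }
        }
    }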

Variants of Hadoop

A project of mine is to compare different variants of Hadoop. It is said that there are many of them out there, but Googling didn't work well for me :(
Does anyone know of any different variants of Hadoop? The only one I found was HaLoop.
I think the more generic term is "map reduce":
http://www.google.com/search?gcx=c&sourceid=chrome&ie=UTF-8&q=map+reduce&safe=active
Not exactly sure what you mean by different variants of Hadoop.
But there are a lot of companies providing commercial support for, or their own versions of, Hadoop (open-source and proprietary). You can find more details here.
For example, MapR has its own proprietary implementation of Hadoop, but it claims compatibility with Apache Hadoop, which is a bit vague because Apache Hadoop is evolving and there are no standards around the Hadoop API. Cloudera has its own version of Hadoop, CDH, which is based on Apache Hadoop. Hortonworks, which was spun out of Yahoo, provides commercial support for Hadoop.
You can find more information here. Hadoop is evolving very fast, so this might be a bit stale.
This can refer to
- Hadoop's file system,
- or its effective support for map reduce...
- or, even more generally, to the idea of cloud / distributed storage systems.
Best to clarify which aspects of Hadoop you are interested in.
Of course, when comparing Hadoop academically, you must first start by looking at GFS, since that is the origin of Hadoop.
Setting HBase aside, we can see Hadoop as two layers: a storage layer and a map-reduce layer.
The storage layer has several really different implementations which would be interesting to compare: the standard Hadoop file system (HDFS), HDFS over Cassandra (Brisk), HDFS over S3, and the MapR Hadoop implementation.
MapR has also changed the map-reduce implementation.
This site, http://www.nosql-database.org/, has a list of a lot of NoSQL DBs out there. Maybe it can help you.

Any tested Frameworks/Solutions similar to Apache Hadoop?

I am interested in the Apache Hadoop project, but I would like to know if any other tested (please mind the 'tested') projects/frameworks are out there.
I would appreciate any information/links to projects similar to Apache Hadoop, and any comments on the Apache Hadoop project from anyone who has used it.
Regards,
As mentioned in an answer to this question:
https://stackoverflow.com/questions/2168558/is-there-anything-like-hadoop-in-c
MongoDB might be something you could look at. It's a scalable database which allows MapReduce algorithms to be run against it.
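As a rough illustration of that point, the sketch below runs a word-count-style MapReduce inside MongoDB through the 3.x Java driver; the map and reduce functions are JavaScript strings evaluated by the server. The "demo" database, "posts" collection, and "text" field are assumptions for the example (and note that newer driver and server versions steer you toward the aggregation pipeline instead).

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class MongoMapReduceDemo {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> posts =
                        client.getDatabase("demo").getCollection("posts");

                // map: emit (word, 1) for every whitespace-separated token
                String map = "function() {"
                        + "  this.text.split(/\\s+/).forEach(function(w) { emit(w, 1); });"
                        + "}";
                // reduce: sum the 1s emitted for each word
                String reduce = "function(key, values) { return Array.sum(values); }";

                for (Document d : posts.mapReduce(map, reduce)) {
                    System.out.println(d.toJson()); // one {_id: word, value: count} per word
                }
            }
        }
    }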
There are indeed open-source projects utilizing and building on Hadoop.
See Apache Mahout for data mining: http://lucene.apache.org/mahout/
And are you aware of the other MR implementations available?
http://en.wikipedia.org/wiki/MapReduce#Implementations
Maybe. But none of them will have anywhere near the testing and real-world experience that Hadoop does. Companies like Facebook and Yahoo are paying to scale Hadoop, and I know of no similar open-source projects that are really worth looking at.
A possible way is to use org.apache.hadoop.hdfs.MiniDFSCluster and org.apache.hadoop.mapred.MiniMRCluster, which are used in testing Hadoop itself.
What they do is launch a small cluster locally. To test your program, point your hdfs-site.xml settings at the local cluster and add them to your classpath. The local cluster behaves just like any other cluster, only smaller. You can use hadoop/src/test/*-site.xml as templates.
For more examples, take a look at hadoop/src/test/.
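As a minimal sketch of the idea, here is an in-process HDFS started through MiniDFSCluster (Hadoop 2.x builder API; requires the hadoop-hdfs test jar on the classpath; the file path and contents are arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    public class MiniClusterDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Spin up an in-process HDFS with a single DataNode.
            MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
                    .numDataNodes(1)
                    .build();
            try {
                FileSystem fs = cluster.getFileSystem();
                Path file = new Path("/tmp/hello.txt");

                // Write a small file into the mini HDFS and check it exists.
                FSDataOutputStream out = fs.create(file);
                out.writeUTF("hello mini cluster");
                out.close();
                System.out.println("exists? " + fs.exists(file));
            } finally {
                cluster.shutdown(); // always tear the in-process cluster down
            }
        }
    }

Code under test can then be handed fs or the cluster's configuration, so it reads and writes the mini HDFS exactly as it would a real one.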
There is a Hadoop-like framework, built on top of Hadoop, that emphasizes prioritized execution of iterative algorithms.
It is tested; I have run the WordCount example on it. It is very, very similar to Hadoop (especially the installation).
You can find the paper here:
http://rio.ecs.umass.edu/mnilpub/papers/socc11-zhang.pdf
and the code here:
https://code.google.com/p/priter/
Hope this helps
A
