I am trying to set up clustered Hadoop and Cassandra. Many sites I've read use a lot of words and concepts I am slowly grasping but I still need some help.
I have 3 nodes. I want to set up Hadoop and Cassandra on all 3. I am familiar with Hadoop and Cassandra individually, but how do they work together and how do I configure them to work together? Also, how do I set up one node dedicated to, for example, analytics?
So far I have modified my hadoop-env.sh to point to the Cassandra libs, and I have put this change on all of my nodes. Is that correct? What more do I need to do, and how do I run it: do I start the Hadoop cluster first, or Cassandra?
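To be concrete, the kind of change I mean is adding the Cassandra jars to Hadoop's classpath in hadoop-env.sh, roughly like this (the install path is only an example, not necessarily where your Cassandra lives):

    # hadoop-env.sh -- make the Cassandra client jars visible to Hadoop jobs
    # /usr/local/cassandra is just an example path; adjust to your install
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/cassandra/lib/*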
Last little question: do I connect directly to Cassandra or to Hadoop from within my Java client?
Rather than connecting them via your Java client, you need to install Cassandra on top of Hadoop. Please follow the article for step-by-step assistance.
BR
I am new to Cassandra and am considering it for my next big data project.
I have a question: can I host it in a non-Hadoop environment? If so, how many nodes can I connect?
Yes, you can. Cassandra has no dependencies other than the most basic ones, such as Java. You can read the installation guide on the official site.
Your cluster can have as many nodes as you want; there is no documented limit on the number of nodes. I read in this article that there are clusters which contain more than 1000 Cassandra nodes.
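To make the "no Hadoop required" point concrete, a client simply talks to the Cassandra nodes directly. A minimal sketch using the DataStax Java driver (the contact-point address is a placeholder, and the exact API depends on the driver version):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;

    public class CassandraPing {
        public static void main(String[] args) {
            // Any node of the cluster can act as the contact point; 10.0.0.1 is a placeholder.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("10.0.0.1")
                    .build();
            try {
                Session session = cluster.connect();
                // system.local exists on every node, so this works without creating a keyspace first.
                ResultSet rs = session.execute("SELECT release_version FROM system.local");
                System.out.println("Connected, Cassandra version: " + rs.one().getString("release_version"));
            } finally {
                cluster.close(); // also closes any sessions created from this Cluster
            }
        }
    }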
I am new to all these terms and have spent some time trying to understand them, but I still have some confusions. Please correct me if I am wrong.
Nutch: It's for web crawling; using it we can crawl web pages and store them somewhere, e.g. in a database.
Solr: Solr can be used for indexing the web pages crawled by Apache Nutch. It helps in searching the indexed web pages.
HBase: It's used as an interface to interact with Hadoop. It helps in getting data in real time from HDFS. It provides a simple SQL-type interface for interacting.
Hadoop: It provides two functionalities: one is HDFS (Hadoop Distributed File System) and the other is the MapReduce functionality taken from Google's algorithms. It's basically used for offline data backup etc.
Gora and ZooKeeper: I am not sure about these.
Confusions:
1) Is HBase a key-value pair DB or just an interface to Hadoop? Or should I ask: can HBase exist without Hadoop?
If yes, can you explain a bit more about its usage?
2) Is there any use in crawling data using Apache Nutch without indexing it into Solr?
3) For running Apache Nutch, do we need HBase and Hadoop? If not, how can we make it work without them?
4) Is Hadoop part of HBase?
Here is a good short discussion of HBase vs. Hadoop: Difference between HBase and Hadoop/HDFS
Because HBase is built on top of Hadoop, you can't really have HBase without Hadoop.
Yes, you can run Nutch without Solr; there do not seem to be many use cases for that, however, much less living examples in the wild.
Yes, you can run Nutch without Hadoop, but again there don't seem to be a lot of real-world examples of people doing this.
Yes, Hadoop is part of HBase, in the sense that there is no HBase without Hadoop, but of course Hadoop is used for other things as well.
ZooKeeper is used for configuration, naming, synchronization, etc. in Hadoop-stack workflows. Gora provides an in-memory data model and persistence for big data and is used on top of Hadoop.
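On confusion 1, it may help to see that HBase is used through its own key-value style client API rather than SQL: you put and get cells by row key, column family, and qualifier. A rough sketch with the HBase Java client (table and column names are made up; it assumes the hbase-site.xml of a running cluster is on the classpath):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValueDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("webpages"))) { // "webpages" is hypothetical
                // Write one cell: row key -> column family:qualifier -> value
                Put put = new Put(Bytes.toBytes("http://example.com/"));
                put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes("<html>hello</html>"));
                table.put(put);

                // Read the cell back by row key
                Result result = table.get(new Get(Bytes.toBytes("http://example.com/")));
                byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
                System.out.println(Bytes.toString(html));
            }
        }
    }

(This uses the Connection/Table API of newer HBase releases; older releases exposed the same idea through HTable.)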
I am not sure what Hadoop can and cannot do, and how easy things are.
I understand Hadoop is good at running MapReduce jobs and at providing HDFS, its distributed filesystem.
What else is Hadoop good at / easy to use for?
My problem: I would like to serve data that is the result of MapReduce, and as I have a lot of traffic I would need 3 front-end servers. Can Hadoop help me deploy a server on 3 of my n running nodes?
Basically, instead of running MapReduce on n machines, I would like to run a custom executable (my server) on 3 machines, and when one machine fails, have Hadoop take care of starting the job on another available machine.
Am I supposed to run that on the Hadoop cluster? Or should the Hadoop cluster be used only for MapReduce, with a separate cloud to serve the data coming out of the Hadoop cluster?
Thanks for sharing your experience.
P.S. I am just considering Hadoop as a solution right now; I'm not tied to it.
Your question isn't entirely clear, but here is my shot.
You want to display the result of your Hadoop job? Usually a Hadoop job writes its result to HDFS. What you can do is create your own OutputFormat class. You might define an XMLOutputFormat, for example.
But the nice thing is that you can create your own Writable. Take a look at Database Access with Apache Hadoop; in that tutorial the output of a Hadoop job is saved to a database system.
Your frontend can then query the database and show the results.
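As a concrete illustration of the database route, here is a rough sketch using Hadoop's built-in DBOutputFormat (JDBC URL, table, and column names are invented for the example; the tutorial mentioned above covers the details):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;

    // One output record: a word and its count, mapped onto the columns of a "word_counts" table.
    public class WordCountRecord implements Writable, DBWritable {
        private String word;
        private long count;

        public WordCountRecord() { }
        public WordCountRecord(String word, long count) { this.word = word; this.count = count; }

        // DBWritable: how the record is bound to the generated INSERT statement
        public void write(PreparedStatement stmt) throws SQLException {
            stmt.setString(1, word);
            stmt.setLong(2, count);
        }
        public void readFields(ResultSet rs) throws SQLException {
            word = rs.getString(1);
            count = rs.getLong(2);
        }

        // Writable: only exercised if the record is shuffled between map and reduce
        public void write(DataOutput out) throws IOException { out.writeUTF(word); out.writeLong(count); }
        public void readFields(DataInput in) throws IOException { word = in.readUTF(); count = in.readLong(); }
    }

In the job driver you would then point the job at the database instead of an HDFS path, along these lines (connection details are placeholders):

    DBConfiguration.configureDB(job.getConfiguration(),
            "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/results", "user", "password");
    DBOutputFormat.setOutput(job, "word_counts", "word", "count");
    job.setOutputFormatClass(DBOutputFormat.class);
    job.setOutputKeyClass(WordCountRecord.class);   // DBOutputFormat writes the reducer *keys* to the table
    job.setOutputValueClass(NullWritable.class);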
I am new to Hadoop and would like to go into the Hadoop administration line, so I studied the basics of Hadoop, installed it successfully in pseudo-distributed mode, and ran some basic examples. Now I need to improve further, so I want a way to learn Hadoop installation and configuration in a real environment, and I decided to go for an Amazon micro instance. Can anyone please tell me how to install and configure Hadoop in the Amazon cloud?
Thanks in Advance.
I have tried this personally, and you will not really be able to use Hadoop on a single micro instance due to memory restrictions. IMHO you should at least use a medium instance to run Hadoop, or better yet use their Elastic MapReduce API, which runs a modified version of Hadoop. You can run a 3-node cluster for around $0.25 an hour. If you really want to learn big data, this is the way I went.
You should check out their documentation here
http://aws.amazon.com/documentation/elasticmapreduce/
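For what it's worth, starting a small EMR cluster is a single command with the AWS CLI these days; a rough sketch (the cluster name, key pair, instance type, and release label are placeholders, and flags may differ between CLI versions):

    aws emr create-cluster \
        --name "hadoop-learning" \
        --applications Name=Hadoop \
        --release-label emr-5.30.0 \
        --instance-type m4.large \
        --instance-count 3 \
        --use-default-roles \
        --ec2-attributes KeyName=my-key-pair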
I'm trying to set up a multi-node Apache Hadoop cluster. I'm following their tutorial here: http://hadoop.apache.org/docs/stable/cluster_setup.html. I've done a single-node Hadoop set-up on each individual node and installed Sun Java on each as well. Unfortunately, the documentation is not very clear and a bit outdated. Am I supposed to (under the 'Configuring the Hadoop Daemons' sections) update those files (conf/*-site.xml) on every single node?
Also, how am I supposed to locate the host:port/IP:port pair for each node?
Sorry, I am new to all of this. Thank you in advance for your help!
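To make the question concrete: my current guess is that every node gets the same conf/core-site.xml and conf/mapred-site.xml pointing at the master, something like the following (the hostname "master" and the ports are just my guesses). Is that the idea?

    <!-- conf/core-site.xml, identical on every node -->
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
      </property>
    </configuration>

    <!-- conf/mapred-site.xml, identical on every node -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>master:9001</value>
      </property>
    </configuration>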