Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
I'd like to ask what the best way is to keep my Hadoop cluster safe and schedule periodic backups.
Is it possible to do a live backup of the namenode? How do I set up a backup node?
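You can set up a secondary namenode, which periodically checkpoints the namenode metadata by merging the edit log into the fsimage. It is not a hot standby, but if the namenode fails you can use the latest checkpoint from the secondary namenode to regenerate the namenode metadata (edits made after that checkpoint are lost).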
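As for a live backup of the namenode metadata, one option is to pull the latest fsimage from the running namenode with `hdfs dfsadmin -fetchImage`. The following is only a sketch: the local backup path and the dry-run switch are assumptions made for illustration, not part of any Hadoop tooling.

```shell
#!/bin/sh
# Sketch: download the latest fsimage from a running NameNode into a dated directory.
set -eu

BACKUP_ROOT="${BACKUP_ROOT:-/var/backups/namenode}"  # hypothetical local path
DRY_RUN="${DRY_RUN:-1}"                              # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# Fetch the most recent fsimage into $BACKUP_ROOT/<stamp>.
fetch_fsimage() {
  target="$BACKUP_ROOT/$1"
  run mkdir -p "$target"
  run hdfs dfsadmin -fetchImage "$target"
}

fetch_fsimage "$(date +%Y%m%d)"
```

Run with `DRY_RUN=0` from cron (as a user allowed to run dfsadmin commands) to get a periodic copy of the metadata.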
You can also set up HA (high availability) in your cluster, so that if the namenode goes down, the cluster automatically switches to the standby namenode configured for HA. Read more about HA here: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
#amar provided a solution to make sure your cluster is highly available. In addition, you should think about how best to protect the data on the Hadoop cluster against user error, logical corruption and disasters. There are different ways to do that. You can write scripts that use HDFS snapshots and distcp to accomplish what you need. If you don't want to write and maintain scripts, you can use solutions like Cloudera BDR or Talena, which offer comprehensive backup and DR capabilities. Note that I work for Talena.
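A script of the kind mentioned above could look roughly like this. It is a minimal sketch, assuming a source directory `/data` and a second cluster reachable at `hdfs://backup-nn:8020` (both hypothetical), and it defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch: snapshot an HDFS directory and copy the snapshot to a backup cluster.
set -eu

SRC_DIR="${SRC_DIR:-/data}"                                 # hypothetical HDFS directory
DEST_URI="${DEST_URI:-hdfs://backup-nn:8020/backups/data}"  # hypothetical backup cluster
DRY_RUN="${DRY_RUN:-1}"                                     # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

# One-time setup (HDFS superuser): allow snapshots on the source directory.
enable_snapshots() {
  run hdfs dfsadmin -allowSnapshot "$SRC_DIR"
}

# Take a read-only snapshot, then copy that consistent view with distcp.
backup_once() {
  snap="backup-$1"
  run hdfs dfs -createSnapshot "$SRC_DIR" "$snap"
  run hadoop distcp -update "$SRC_DIR/.snapshot/$snap" "$DEST_URI"
}

backup_once "$(date +%Y%m%d-%H%M%S)"
```

Copying from the `.snapshot` path rather than the live directory means distcp sees a consistent view even while jobs keep writing to `/data`.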
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 7 years ago.
My Hadoop multi-node cluster has 3 nodes: one namenode and two datanodes. I am using HBase for storing data. For various reasons I want to change the default SSH port number, which I know how to do, but if I change it, what configuration changes will I have to make in Hadoop and HBase?
I saw a link that explains only the configuration change for Hadoop, but I think the configuration of HBase, ZooKeeper and YARN also needs to be changed. Am I right? If so, what changes do I need to make in Hadoop and HBase?
Hadoop version 2.7.1
HBase version 1.0.1.1
Help Appreciated :)
SSH isn't a Hadoop-managed configuration, so it has nothing to do with Spark, HBase, ZooKeeper or YARN beyond the cluster scripts that use it to reach other nodes (for example when starting daemons or adding new nodes to the cluster).
You'll have to edit /etc/ssh/sshd_config on every node to change any SSH related settings. Then restart all the Hadoop services as well as sshd.
The related line is
Port 22
Change the port number, then do
sudo service sshd restart
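Since the same edit has to be made on every node, it may help to script it. The function below is a sketch of one way to do that (the function name is made up for illustration): it rewrites the first `Port` line, commented out or not, in an sshd_config-style file.

```shell
#!/bin/sh
# Sketch: set the Port directive in an sshd_config-style file.
set -eu

# $1 = path to the config file, $2 = new port number.
set_ssh_port() {
  file="$1"
  port="$2"
  # Replace the first Port line (active or commented out); if there is none,
  # append one at the end, which assumes the file has no Match blocks.
  awk -v port="$port" '
    !seen && /^[ \t]*#?[ \t]*Port[ \t]+[0-9]+/ { print "Port " port; seen = 1; next }
    { print }
    END { if (!seen) print "Port " port }
  ' "$file" > "$file.new"
  # Write back in place so the original file's ownership and mode are kept.
  cat "$file.new" > "$file"
  rm -f "$file.new"
}
```

After calling `set_ssh_port /etc/ssh/sshd_config <num>` as root, running `sshd -t` checks the file for syntax errors before you restart the service.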
In hadoop-env.sh there is the HADOOP_SSH_OPTS environment variable. The cluster start/stop scripts pass its value to ssh when they connect to the other nodes, so you can set the port like so.
export HADOOP_SSH_OPTS="-p <num>"
Likewise, in hbase-env.sh, for the HBase start/stop scripts:
export HBASE_SSH_OPTS="-p <num>"
Once done setting all the configs, restart the Hadoop services
stop-all.sh
start-all.sh
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
Since I am new to Hadoop administration, I am trying to understand how Hadoop clusters are set up in real production systems.
1) As of today, do most projects run on Hadoop v1 or Hadoop v2?
2) Do we have a single cluster or multiple clusters for a single project?
(I heard there are multiple clusters where each cluster is dedicated to specific roles.)
3) Do Hadoop clusters usually run in a cloud such as AWS or Rackspace, or do they run on the client's own network?
The answers to all of your questions depend on the client, the project and a lot of other factors, but here are my 2 cents:
1) Most projects have switched to Hadoop v2.
2) It depends. There will obviously be one or two environments for dev, test, staging and so on before production. In production, either one project has its own environment or one environment handles multiple projects. (Yahoo has a 4,500-node Hadoop cluster.)
3) The number of nodes varies with the amount of data the company handles. Some companies run production on a 4-node cluster, others on a 4,000-node cluster.
4) Again, it depends on the type of data they're storing and processing. Clients with sensitive information, such as banks, won't normally go for the cloud because they feel their data is more secure in their own data centers. Other clients go fully to the cloud because it saves them a lot of money (like the New York Times on AWS).
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
I am doing a POC on ways to import data from a shared network drive to HDFS. The data would be in different folders on the shared drive, and each folder would correspond to a different directory on HDFS. I looked at some popular tools that do this, but most of them are for moving small pieces of data rather than whole files. These are the tools I found. Are there any others?
Apache Flume: If there are only a handful of production servers producing data and the data does not need to be written out in real time, then it might also make sense to just move the data to HDFS via WebHDFS or NFS, especially if the amount of data being written out is relatively small: a few files of a few GB every few hours will not hurt HDFS. In this case, planning, configuring and deploying Flume may not be worth it. Flume is really meant to push events in real time, where the stream of data is continuous and its volume reasonably large. [Flume book from safari online and flume cookbook]
Apache Kafka: Producer-consumer model : Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.
Amazon Kinesis: Paid version for real-time data like Flume
WebHDFS: Submit an HTTP PUT request without automatically following redirects and without sending the file data. Then submit another HTTP PUT request, with the file data to be written, to the URL in the Location header. [http://hadoop.apache.org/docs/r1.0.4/webhdfs.html#CREATE]
Open Source Projects: https://github.com/alexholmes/hdfs-file-slurper
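The two-step WebHDFS upload described in the list above can be sketched with curl as follows. The namenode address is a placeholder, and an unsecured cluster may additionally need a `user.name` query parameter, which is left out here.

```shell
#!/bin/sh
# Sketch: upload a local file to HDFS through the WebHDFS REST API with curl.
set -eu

NAMENODE="${NAMENODE:-http://namenode.example.com:50070}"  # hypothetical NameNode HTTP address

# Build the WebHDFS CREATE URL for an absolute HDFS path.
create_url() {
  echo "$NAMENODE/webhdfs/v1$1?op=CREATE&overwrite=false"
}

# $1 = local file, $2 = absolute HDFS path.
webhdfs_put() {
  local_file="$1"
  hdfs_path="$2"
  # Step 1: PUT without data and without following the redirect;
  # the DataNode URL comes back in the Location header.
  location="$(curl -s -i -X PUT "$(create_url "$hdfs_path")" \
    | tr -d '\r' | awk 'tolower($1) == "location:" { print $2 }')"
  if [ -z "$location" ]; then
    echo "no Location header returned" >&2
    return 1
  fi
  # Step 2: PUT the file data to the URL from the Location header.
  curl -s -i -X PUT -T "$local_file" "$location"
}
```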
My requirements are simple:
Poll a directory for files. When a file arrives, copy it to HDFS and move the file to a "processed" directory.
I need to do this for multiple directories
Give Flume a try with a spooling directory source. You didn't mention your data volume or velocity, but I did a similar POC from a local Linux filesystem to a Kerberized HDFS cluster with good results, using a single Flume agent running on an edge node.
Try dtingest. It supports ingesting data from different sources such as a shared drive, NFS or FTP into HDFS, and it also supports polling directories periodically. It should be available as a free trial download.
It is developed on top of the Apache Apex platform.
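For a POC, the requirement in the question (poll a directory, copy each new file to HDFS, move it to a "processed" directory, for several directories) can also be sketched in plain shell. The function name, paths and the `HDFS_PUT` override below are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: copy new files from a local/shared directory to HDFS, then archive them.
set -eu

# Command used for the upload; overridable so the script can be tried without a cluster.
HDFS_PUT="${HDFS_PUT:-hdfs dfs -put}"

# $1 = source directory on the shared drive, $2 = target HDFS directory.
ingest_dir() {
  src="$1"
  dest="$2"
  mkdir -p "$src/processed"
  for f in "$src"/*; do
    [ -f "$f" ] || continue        # skip sub-directories and the empty-glob case
    # Move the file only if the upload succeeded, so failures are retried next poll.
    if $HDFS_PUT "$f" "$dest/"; then
      mv "$f" "$src/processed/"
    fi
  done
}

# Usage with hypothetical paths, one call per directory pair:
#   while true; do
#     ingest_dir /mnt/share/sales /landing/sales
#     ingest_dir /mnt/share/logs  /landing/logs
#     sleep 60
#   done
```

One thing this sketch does not handle is files that are still being written when the poll runs. A common workaround is to have producers write under a temporary name and rename the file once it is complete.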
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
I'm new to Hadoop, but I have been trying to create a single-node cluster for a college project. My goal is to run MapReduce jobs on the same data using different Hadoop-based software, namely Hive and Pig.
So, I would like to know whether I can install and run both Hive and Pig on the same node. What about in the same cluster, supposing it has more than 10 nodes?
For a college project it is fine to create a single-node cluster (make sure Hadoop is installed in pseudo-distributed mode, with master and slave on the same machine).
You can install Hive and Pig on the same node, as both are just CLIs/shells that launch MR jobs on the Hadoop pseudo-cluster.
Launch them in different terminals as follows:
Hive: $ hive
Pig: $ pig -x mapreduce    (MapReduce/Hadoop mode)
The Pig Grunt shell and the Hive shell are just interfaces for submitting jobs to the cluster (whether it has 1 node or 10), so each acts like a client. It therefore doesn't matter which node you install them on.
Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I have implemented a task in Hive. Currently it is working fine on my single node cluster.
Now I am planning to deploy it on AWS.
I don't know anything about AWS. If I deploy it there, which should I choose: Amazon EC2 or Amazon EMR?
I want to improve the performance of my task. Which one is better and more reliable for me, and how should I approach them? I heard that we can also register our VM settings as they are on AWS. Is that possible?
Please advise as soon as possible.
Many Thanks.
EMR is a collection of EC2 instances with Hadoop (and optionally Hive and/or Pig) installed and configured on them. If you are using your cluster for running Hadoop/Hive/Pig jobs, EMR is the way to go. An EMR instance costs a little extra compared to an EC2 instance. A quick check on Amazon prices today reveals that a small EC2 instance costs $0.08/hour while a small EMR instance costs $0.015/hour extra.
In my opinion, it's totally worth paying that extra money to save yourself the hassle of installing and setting up Hadoop (along with Hive and Pig), and of creating, maintaining and using an AMI. Moreover, EMR's version of Hadoop and Hive has some patches that are not available (at least, not yet) in Apache Hive. If you use EC2, you will probably be using Apache Hadoop and Hive (or maybe the Cloudera distributions) and wouldn't have access to those patches (like native support for S3 or commands like ALTER TABLE my_table RECOVER PARTITIONS).
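To put those numbers in perspective, the arithmetic can be sketched as below. The two prices are the ones quoted in this answer and are long out of date. The node count and hours in the second call are made-up inputs.

```shell
#!/bin/sh
# Sketch: total cost of an EMR cluster from the per-hour prices quoted above.
set -eu

# $1 = EC2 price per hour, $2 = EMR surcharge per hour, $3 = nodes, $4 = hours.
cluster_cost() {
  LC_ALL=C awk -v ec2="$1" -v emr="$2" -v nodes="$3" -v hours="$4" \
    'BEGIN { printf "%.3f\n", (ec2 + emr) * nodes * hours }'
}

cluster_cost 0.08 0.015 1 1      # one small node for one hour
cluster_cost 0.08 0.015 4 720    # hypothetical: 4 small nodes for a 30-day month
```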
References:
http://aws.amazon.com/ec2/pricing/
http://aws.amazon.com/elasticmapreduce/pricing/
I would suggest that you do NOT try and deploy your own Hadoop cluster, unless you have 2-3 months to spare, and you have a hadoop expert handy.
Elastic MapReduce will allow you to get started very quickly by providing a pre-configured hadoop environment. Seeing as you only have a single job, it should be fine.
In general, historically, EMR was pretty far behind the latest versions of Hadoop components, and some were missing entirely. That's the major reason for using another distribution. For example, if you wanted HBase, it wasn't in EMR, but now it is. Today, Spark is absent from EMR. EMR will generally lag.
That said, if you're not using the latest and greatest features, go with EMR.