I have a single node hadoop machine in which hbase is running in the amazon ec2 instance.
Due to some reason the server got restarted. So i need to start hadoop and hbase again. Then its working fine, but the old data in the hbase cannot be accessed through the web-servies.
While i use the shell command its working fine, i am getting the data.
So i created the scenario on my local server machine, but its working fine.
The version details are as follows.
hadoop-0.20.2
hbase-0.90.5
apache-tomcat-7.0.30
in Amazon ec2 medium instance
And I use restful web-services with Orm-hbase to access data.
Related
We started using Amazon EMR recently for one new project (Version emr-5.11.0).We made some architecture changes in the EMR cluster
1) We moved metastore to another Postgres instance instead of default mysql/derby
2) Running metastore service in a different instance (which is not part of amazon EMR cluster) and made the necessary changes in hive-site.xml.
In EMR
stop hive-hcatalog-server
In new instance
hive --service metastore
Everything is working as expected except 's3 external tables' .When I try to create an external s3 table it is giving us an error like below
message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
We tried with s3/s3n/s3a with credentials also for creating the external table .If we are running the metastore service inside the EMR master node and ran the same query .It is working without issues .
Do we need to do any configuration / adding additional libraries in metastore instance this to work ?
Note: The metastore instance has both Apache hadoop and hive latest binaries .We are going with HDFS filesystem .Able to perform all operations other than external s3 tables .Tried everything from beeline and hive CLI
I have installed ranger and the ranger hive plugin, created some policies to enable authorisation of hive objects to certain users. Its working well. However if my machine where ranger is installed (an EC2) goes down then I assume I will need to recreate policies? I have a MySQL RDS (multi az) with regular snapshots so the db data should be available but what else do I need to backup (ie to s3 and then restore to new EC2 if initial EC2 dies)? I assume some json files under /etc/ranger/. Anything else?
env
apache ranger 1.0
hive 2.1.1
hadoop 2.8.3
Note: not using hortonworks
If you want that your ranger server should be always up, deploy your admin/user server behind the ELB or using service discovery.
Also, mysql replication is enough to populate the entire same workspace.
/etc/ranger/{CLUSTER_NAME}/policycache/{.json} -- this contains the authorisation policies which will be applicable, even if your ranger server goes down.
I have a hadoop cluster setup in AWS EC2, but my development setup(spark) is in local windows system. When I am trying to connect AWS Hive thrift server I able to connect , but it is showing some connection refused error when trying to submit a job from my local spark configuration. Please note in windows my user name is different that the user name for which Hadoop eco system is running in AWS server. Can any one explain me how the underlying system works in this setup?
1) When I am submitting a job from my local Spark to HIVE thrift , if it is associated any MR job , ASW Hive setup will submit that job NN with its own identity or it will carry forward my spark setup identity.
2) In my configuration do I need to run spark in local with same user name as I have for hadoop cluster in AWS ?
3) Do I need to configure SSL also to authenticate my local system?
Please note , my local system is not part of hadoop cluster and I can not include also in AWS Hadoop cluster.
Please let me know what will be actual setup for environment where my hadoop cluster is in AWS and spark is running on my local.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP to any spark-client in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you want to just Spark against EC2 locally, then you can set a few properties programmatically.
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be getting your cluster's XML files from HADOOP_CONF_DIR, and copying them over into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
I have created a Graph database in ArangoDB in a 5 machine AWS cluster. I do not have enough space in the Database AWS cluster to store the dump. So I would like to take a dump of the database in an AWS instance in a different cluster. I have the key files to connect to the machines. How to do it using Arangodump ? Thanks.
I do get that correctly that you're using DC/OS clusters on AWS?
The problem with arangoimp is, that it doesn't know howto authenticate with the DC/OS proxy, and thus can't reach the routes it would require to import to arangodb.
The problem is similar to Running Arango Shell on DC/OS cluster - you want to use sshutle as lalitlogical describes to forward the ArangoDB server port (usually 8529) to your target environment.
We are currently using Apache Hadoop (Vanilla Version) in our org. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it), I'm mainly interested in Hadoop administration steps and how master and slave communicates and various configuration configurations. I already checked the AWS EMR documentation but I don't see detailed comparison.
Can someone recommend me a link/tutorial for migrating to AWS EMR from an Apache Hadoop.
During EMR cluster creation, it will ask you to specify Master and Node. a default settings will provision 1 master and two nodes for you. You can also specify what all applications you want to be in the cluster (e.g.: hadoop, hive, spark, zeppelin, hue, etc.).
Once the cluster is created, it will provision all the services. you can click on these services and access them via web, or using ssh into the master. For e.g: to access the ambari interface, go to the service within EMR and click it. a new window will be launched with the ambari monitoring service interface.
Installing these applications is very easy. all you have to do is specify all the services while cluster creation.
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
Technically, Hadoop provided with EMR, can be few releases back. You should check EMR release notes for detailed application provided with each version. EMR takes care application provisioning, setup and configuration. Based on EC2 instance type, Hadoop (and other application configuration) will change. You can override default settings using configure application.
Other than this Hadoop you have on premises and EMR should be the same.