Apache Ranger with Hive plugin - what objects need to be persisted?

I have installed Ranger and the Ranger Hive plugin, and created some policies to enable authorisation of Hive objects for certain users. It's working well. However, if the machine where Ranger is installed (an EC2 instance) goes down, I assume I will need to recreate the policies? I have a MySQL RDS (multi-AZ) with regular snapshots, so the DB data should be available, but what else do I need to back up (i.e. to S3, and then restore to a new EC2 instance if the initial one dies)? I assume some JSON files under /etc/ranger/. Anything else?
Environment:
Apache Ranger 1.0
Hive 2.1.1
Hadoop 2.8.3
Note: not using Hortonworks

If you want your Ranger server to always be up, deploy your Ranger Admin/Usersync server behind an ELB or use service discovery.
Also, MySQL replication is enough to reproduce the same workspace, since the policies are stored in the database.
/etc/ranger/{CLUSTER_NAME}/policycache/*.json -- this cache contains the authorisation policies that remain applicable even if your Ranger server goes down, so it is worth backing up as well (a sketch follows below).
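For completeness, here is a minimal sketch of backing up the plugin's policy cache (and the rest of /etc/ranger/) to S3 and restoring it onto a replacement instance. The bucket name is a placeholder; the policycache path is the one mentioned above:

#!/usr/bin/env bash
# Hypothetical bucket name -- replace with your own.
BUCKET=s3://my-ranger-backup

# On the Hive host: back up the Ranger plugin configuration and policy cache.
aws s3 sync /etc/ranger/ "$BUCKET/etc-ranger/"

# On a replacement EC2 instance: restore it before starting HiveServer2.
aws s3 sync "$BUCKET/etc-ranger/" /etc/ranger/

The Ranger Admin policies themselves live in the MySQL (RDS) database, so the snapshots/replica cover those; the sync above only protects the local cache under /etc/ranger/{CLUSTER_NAME}/policycache/.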

Related

MySQL Server Placement on AWS EC2 or RDS

We are currently setting up AWS hosting for our Web Application.
This Laravel web application will have a schema per company that registers, meaning it will need a fairly large MySQL server.
I have gone through the motions of setting up a VPC with EC2 instances and an RDS instance for this MySQL server.
However we are currently looking at using Laravel Forge as a tool to host.
What Forge does differently is that it puts the MySQL server on the EC2 instance, not on an RDS instance.
The question I have come here to ask is: what are the implications, if any, of having the MySQL server on the EC2 instance rather than on RDS?
Would there be performance issues?
Is it better practice to have an RDS?
Or is Forge's out-of-the-box way of packaging this all together on an EC2 server fine?
By running this on an EC2 instance you will be taking on more of the responsibility for managing the database: not just installation, but also patching, backups, and recovery (a sketch of such a backup job is below). Harder-to-maintain functionality such as replication and HA will also be on you to implement and monitor.
By running on RDS, AWS takes on that heavy lifting and implements a best-practice MySQL deployment, which gives you the flexibility of running a MySQL stack in the cloud without really having to think about the implementation details under the hood, other than deciding whether you want it to be HA and how many replicas you want.
That said, by using RDS you are also giving up the ability to run it however you want: you are limited to the database versions that RDS supports (although these now arrive quite soon after release). In addition, not all plugins or extensions will be available, so check this functionality before deciding.
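To give a concrete feel for the "backups are on you" part, here is a minimal sketch of the kind of nightly dump-to-S3 job you would have to own yourself when MySQL lives on the EC2 instance; the database name, user and bucket are placeholders:

#!/usr/bin/env bash
# Hypothetical names -- replace with your own database, user and bucket.
# Assumes DB_PASSWORD is exported in the environment.
DB=app_db
BUCKET=s3://my-db-backups
DUMP="/tmp/${DB}-$(date +%F).sql.gz"

# Dump the database consistently and compress it.
mysqldump --single-transaction --user=backup_user --password="$DB_PASSWORD" "$DB" | gzip > "$DUMP"

# Ship the dump to S3. With RDS, automated snapshots replace this job entirely.
aws s3 cp "$DUMP" "$BUCKET/"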

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have a Hadoop cluster running on EC2, and the EC2 instances are attached to a role which has access to an S3 bucket, for example "stackoverflow-example".
Several users are placing Spark jobs on the cluster. We used access keys in the past but do not want to continue with that and want to migrate to the role, so any job placed on the Hadoop cluster will use the role associated with the EC2 instances. I did a lot of searching and found 10+ tickets; some of them are still open, some are fixed, and some do not have any comments.
I want to know whether it is still possible to use the IAM role for jobs (Spark, Hive, HDFS, Oozie, etc.) placed on the Hadoop cluster. Most of the tutorials discuss passing keys (fs.s3a.access.key, fs.s3a.secret.key), which is not good enough and not secure either. We also faced issues with the credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363
The first one you link to, HADOOP-13277, says "can we have IAM?", to which the JIRA was closed with "you have this in s3a". The second, HADOOP-9384, was "add IAM to S3n" and was closed as "switch to s3a". And SPARK-16363? An incomplete bug report.
If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.
That's it: it should just work (a minimal example follows).
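As a sketch, "not setting any secrets" is literally all there is to it; if you want to pin the credential chain explicitly, you can point s3a at the instance-profile provider. The job class, jar and paths below are illustrative; only the fs.s3a.aws.credentials.provider property is a real s3a setting:

# No fs.s3a.access.key / fs.s3a.secret.key anywhere -- the EC2 instance profile is used.
# Explicitly pinning the provider is optional; shown here for clarity.
spark-submit \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
  --class com.example.MyJob \
  my-job.jar \
  s3a://stackoverflow-example/input \
  s3a://stackoverflow-example/output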

Apache Hive Installation on pseudo distributed or multi node cluster environment

I have installed Hadoop in a multi-node environment on my PC as below:
1: 4 VirtualBox instances loaded with Ubuntu (14.04)
2: 1 master node, 2 slave nodes, and the remaining VM instance acting as a client
Note: all 4 VMs are running on my PC itself.
I was able to complete the Apache Hadoop 2.6 setup successfully on the setup mentioned above. Now I want to install Hive in order to do some data summarization, querying, and analysis.
But I am not sure how to proceed further. I have a few queries, mentioned below:
Q1: Do I need to install/set up Apache Hive (0.14) on all nodes (master/name node and slave/data nodes), or only on the master node?
Q2: Which mode should be used for the metastore: local mode or remote mode?
Q3: If I want to use MySQL for the Hive metastore, should I install it on the master/name node itself, or do I need to use a separate machine for this?
Could someone also please share the steps to follow to configure the metastore in a multi-node/pseudo-distributed environment?
BR,
San
You need to install the required Hive services (HiveServer2, Metastore, WebHCat) only once. In your lab scenario, you would probably put them on the master. The client can then run Beeline (the HiveServer2 client).
If you configure the Metastore as Local, Hive will use a local Derby database. Again, for your lab setup, this is probably just what you need/want.
In a production scenario, you would
set up a dedicated server for supporting services that should not fight for resources with the namenode process(es)
and use a dedicated database server for your Metastore database, which will be remote (a minimal configuration sketch follows).
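As an illustration of the remote-metastore option, here is a minimal sketch that writes a hive-site.xml pointing Hive at a MySQL metastore database and initialises the schema. The host names, database name, credentials and install path are placeholders; the property names are the standard Hive ones:

# Placeholder install path and credentials -- adjust to your environment.
cat > /usr/local/hive/conf/hive-site.xml <<'EOF'
<configuration>
  <!-- JDBC connection to the MySQL metastore database -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://metastore-db-host:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hiveuser</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hivepassword</value>
  </property>
  <!-- Clients talk to the metastore service rather than the database directly -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://master-host:9083</value>
  </property>
</configuration>
EOF

# Create the metastore schema once, then start the metastore and HiveServer2.
schematool -dbType mysql -initSchema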

AWS EMR Hadoop Administration

We are currently using Apache Hadoop (the vanilla version) in our organisation. We are planning to migrate to AWS EMR. I'm trying to understand how AWS EMR Hadoop works internally (not how to use it); I'm mainly interested in the Hadoop administration steps, how the master and slaves communicate, and the various configurations. I have already checked the AWS EMR documentation, but I don't see a detailed comparison.
Can someone recommend a link/tutorial for migrating to AWS EMR from Apache Hadoop?
During EMR cluster creation, it will ask you to specify the master and the nodes; the default settings will provision 1 master and 2 nodes for you. You can also specify which applications you want in the cluster (e.g. Hadoop, Hive, Spark, Zeppelin, Hue, etc.).
Once the cluster is created, it will provision all those services. You can click on these services and access them via the web, or by SSHing into the master. For example, to access a web UI such as Hue or Zeppelin, go to the service within EMR and click it; a new window will open with that service's interface.
Installing these applications is very easy: all you have to do is specify the services during cluster creation, as in the CLI sketch below.
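A minimal sketch of the same thing with the AWS CLI; the cluster name, release label, key pair and instance types are placeholders:

# Hypothetical names -- adjust the release label, key pair and instance types.
aws emr create-cluster \
  --name "migration-test" \
  --release-label emr-5.20.0 \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Hue Name=Zeppelin \
  --instance-type m4.large \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles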
Amazon Elastic MapReduce uses a mostly standard implementation of Hadoop and associated tools.
See: AMI Versions Supported in Amazon EMR
The benefits of using EMR are in the automated deployment of instances. For example, launching a cluster with an appropriate AMI means that software is already loaded on each instance and HDFS is configured across the core nodes.
The Master and Slave (Core/Task) nodes communicate in exactly the normal way that they communicate in any Hadoop cluster. However, only one Master is supported (with no backup Master).
When migrating to EMR, check that you are using compatible versions of software (eg Hadoop, Hive, Pig, Impala, etc). Also consider using Amazon S3 for storage of data instead of HDFS, especially for storing source data, since data on S3 persists even after the EMR cluster is terminated.
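For example, data copied to S3 once can be referenced by jobs on any later cluster; the bucket and paths below are placeholders (on EMR the s3:// scheme is backed by EMRFS):

# Copy warehouse data from HDFS to S3 so it outlives the cluster.
hadoop distcp hdfs:///user/hive/warehouse/sales s3://my-data-bucket/warehouse/sales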
Technically, the Hadoop provided with EMR can be a few releases behind; you should check the EMR release notes for the detailed application versions provided with each release. EMR takes care of application provisioning, setup, and configuration. Based on the EC2 instance type, the Hadoop (and other application) configuration will change. You can override the default settings by supplying an application configuration (see the sketch below).
Other than this, the Hadoop you have on premises and the one on EMR should be the same.
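As an illustration of overriding the defaults, a configuration object can be passed at cluster creation; the classification name "hive-site" comes from the EMR configuration classifications, and the property shown is only an example:

# Write the override and pass it to create-cluster via --configurations.
cat > configurations.json <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.exec.dynamic.partition.mode": "nonstrict"
    }
  }
]
EOF

aws emr create-cluster \
  --name "migration-test" \
  --release-label emr-5.20.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m4.large --instance-count 3 \
  --use-default-roles \
  --configurations file://configurations.json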

Cannot access data from hbase with amazon ec2

I have a single-node Hadoop machine, on which HBase is running, in an Amazon EC2 instance.
For some reason the server got restarted, so I needed to start Hadoop and HBase again. They are now working fine, but the old data in HBase cannot be accessed through the web services.
When I use the shell commands it works fine and I get the data.
I then recreated the scenario on my local server machine, and there it works fine.
The version details are as follows.
hadoop-0.20.2
hbase-0.90.5
apache-tomcat-7.0.30
on an Amazon EC2 medium instance
I use RESTful web services with Orm-hbase to access the data.
