BigCouch IDs and Backup data on EC2

I have a few questions about BigCouch that I'd like answered before I start using it.
Do I need to choose my shard key carefully, or can I just use an auto-generated GUID? I'm starting with a single server and a replication factor of 1, but I want to be ready for when I need to add another shard.
Is there a GUI for managing the cluster, like Couchbase has, or something similar for administering the DB?
How can I back up the data when hosting BigCouch on EC2 (i.e. snapshots)?
Thanks

Since you have not started using BigCouch yet, and it looks like you need some features that are available out of the box in Couchbase (auto-sharding, administration console, ...),
why not go with Couchbase?
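As for the backup part of the question: on EC2 the usual route is to keep the database's data directory on an EBS volume and snapshot that volume on a schedule. A minimal sketch with boto3; the region and volume ID are placeholders, and a snapshot taken while the node is writing is only crash-consistent:

# Sketch: snapshot the EBS volume that holds the database data directory.
# Assumptions: boto3 has credentials (e.g. an instance role) and the volume ID
# below is a placeholder for the volume mounted at the data directory.
import datetime
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust the region
DATA_VOLUME_ID = "vol-0123456789abcdef0"  # placeholder

def snapshot_data_volume():
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H-%M-%SZ")
    resp = ec2.create_snapshot(
        VolumeId=DATA_VOLUME_ID,
        Description="bigcouch-backup-" + stamp,
    )
    return resp["SnapshotId"]

print(snapshot_data_volume())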

Related

Re-Bootstrap Elastic Cluster

I need guidance on reinstating my Elastic cluster.
I had bootstrapped the Elastic cluster and created 1 superuser and 2 other system users.
Ingest, data, and gateway nodes had also joined the cluster.
Later, I wanted to rename the data nodes, but Google Cloud does not allow renaming, so I created new data nodes with proper names and then deleted the old data nodes.
I have not ingested any data so far, and no index was created.
Now, when I try to see any of the cluster details (say, license information), it does not authenticate any system user.
I tried re-creating the bootstrap password and setting it again, but that did not work either.
I'm seeing the below exception in the Elastic logs:
failed to retrieve password hash for reserved user [username]
org.elasticsearch.action.UnavailableShardsException: at least one primary shard for the index [.security-5] is unavailable
Please suggest whether there is a way to reinstate the existing configuration, or how I can bootstrap it again.
I had not ingested any data so far
If you haven't added any actual data yet, the simplest approach is probably to delete all the current data directories and start the cluster from scratch again.
Also, is this still Elasticsearch 5 (judging by .security-5)? That's a really old version, and a proper reset works differently there than on current versions.
I had sudo access, so I created a system user using file-based auth,
then re-created the other system users with the same passwords,
and then reverted the access type to normal login.
That worked for me.
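A small sketch of checking that recovery path from Python, assuming a file-realm superuser named recovery_admin was created on each node with the users tool (bin/elasticsearch-users on recent versions, bin/x-pack/users on 5.x); the user name, password, and URL are assumptions:

# Sketch: verify the file-realm user can authenticate by reading the license
# info mentioned in the question. On 5.x/6.x the endpoint is /_xpack/license;
# on 7.x and later it is /_license.
import requests

ES = "http://localhost:9200"                    # placeholder URL
AUTH = ("recovery_admin", "recovery_password")  # placeholder credentials

resp = requests.get(ES + "/_xpack/license", auth=AUTH, timeout=10)
resp.raise_for_status()
print(resp.json()["license"]["status"])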

ETL process in AWS using EC2s and EFS

I am a data engineer with experience in designing and creating data integration and ELT processes. Below is my use case; I need to move my process to AWS and would like your opinion.
The files to be processed are in S3. I need to process those files using Hadoop. I have existing logic written in Hive and just need to migrate it to AWS. Is the below approach correct/feasible?
1. Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
2. Create an EFS and mount it on the EC2 instances.
3. Copy the files from S3 to EFS as Hadoop tables.
4. Run Hive queries on top of the data in EFS and create new tables.
5. Once the process is completed, move/export the final reports table from EFS to S3 (somehow). I am not sure whether this is possible; if it is not, then this entire solution is not feasible.
6. Terminate the EFS and EC2 instances.
If the above method is correct, how does the Hadoop orchestration happen using EFS?
Thanks,
KR
Spin up a fleet of EC2 instances, initially say 5, and enable autoscaling.
I'm not sure you need the autoscaling.
Why? Let's say you start a "big" query which takes a lot of time and CPU.
Autoscaling will start more instances, but how will it run a "fraction" of the query on the new machine?
All machines need to be ready before you run the query; just keep that in mind.
Or in other words: only the machines that are available now will handle the query.
Copy the files from S3 to EFS as Hadoop tables.
There isn't any problem with this idea; just keep in mind that you can keep the data in EFS.
If EFS is too pricey for you, please check the options for provisioning EBS magnetic volumes with RAID 0.
You will get great speeds at minimal cost.
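For steps 3 and 6 of the plan, a rough boto3 sketch of the S3-to-EFS copy and the export back to S3; the bucket, prefixes, and the /mnt/efs mount point are placeholders:

# Sketch: pull input files from S3 onto the EFS mount, and push the final
# reports back to S3 once the Hive run is done. All names are placeholders;
# boto3 is assumed to have credentials (e.g. via an instance role).
import os
import boto3

s3 = boto3.client("s3")
BUCKET = "my-etl-bucket"         # placeholder bucket
EFS_ROOT = "/mnt/efs/warehouse"  # placeholder EFS mount path

def download_prefix(prefix):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("/"):  # skip folder markers
                continue
            dest = os.path.join(EFS_ROOT, obj["Key"])
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            s3.download_file(BUCKET, obj["Key"], dest)

def upload_reports(local_dir, prefix):
    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = prefix + "/" + os.path.relpath(path, local_dir)
            s3.upload_file(path, BUCKET, key)

download_prefix("input/")                                            # before the Hive run
upload_reports(os.path.join(EFS_ROOT, "reports"), "output/reports")  # after it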
The rest is okay, and this is one of the ways to do "on demand" interactive analytics.
Please take a look at AWS Athena.
It's a service which allows you to run queries on S3 objects.
You can use JSON and even Parquet (which is much more efficient!).
This service may be enough for your needs.
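If you go that route, a minimal boto3 sketch of submitting an Athena query; the database, table, and result bucket are placeholders and assume the table has already been defined over the S3 data:

# Sketch: run a query directly against files in S3 with Athena, no cluster needed.
import boto3

athena = boto3.client("athena", region_name="us-east-1")  # adjust the region

resp = athena.start_query_execution(
    QueryString="SELECT report_date, COUNT(*) FROM reports GROUP BY report_date",
    QueryExecutionContext={"Database": "etl_db"},  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-etl-bucket/athena-results/"},
)
print("query id:", resp["QueryExecutionId"])
# Poll athena.get_query_execution(QueryExecutionId=...) until the state is
# SUCCEEDED, then read the result files from the output location.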
Good luck!

What's the right way to provide Hadoop/Spark IAM role based access for S3?

We have a Hadoop cluster running on EC2, and the EC2 instances are attached to a role which has access to an S3 bucket, for example "stackoverflow-example".
Several users are placing Spark jobs on the cluster. We used keys in the past but do not want to continue with that and want to migrate to the role, so any jobs placed on the Hadoop cluster will use the role associated with the EC2 instances. I did a lot of searching and found 10+ tickets; some of them are still open, some are fixed, and some do not have any comments.
I want to know whether it's still possible to use an IAM role for jobs (Spark, Hive, HDFS, Oozie, etc.) placed on the Hadoop cluster. Most of the tutorials discuss passing keys (fs.s3a.access.key, fs.s3a.secret.key), which is not good enough and not secure either. We also faced issues with the credential provider with Ambari.
Some references:
https://issues.apache.org/jira/browse/HADOOP-13277
https://issues.apache.org/jira/browse/HADOOP-9384
https://issues.apache.org/jira/browse/SPARK-16363
The first one you link to, HADOOP-13277, says "can we have IAM?", to which the JIRA was closed as "you have this in s3a". The second, HADOOP-9384, was "add IAM to S3n", closed as "switch to s3a". And SPARK-16363? An incomplete bug report.
If you use S3a, and do not set any secrets, then the s3a client will fall back to looking at the special EC2 instance metadata HTTP server, and try to get the secrets from there.
That's it: it should just work.
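For example, a minimal PySpark sketch under that assumption: leave fs.s3a.access.key and fs.s3a.secret.key unset, and optionally pin the credentials provider so the intent is explicit (the exact provider class can vary by Hadoop version). The bucket is the one from the question; the prefix is a placeholder:

# Sketch: read from S3 over s3a using only the EC2 instance role, no keys in config.
# Assumes hadoop-aws and the AWS SDK are on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-iam-role")
    # Optional: make the instance-metadata fallback explicit.
    .config(
        "spark.hadoop.fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .getOrCreate()
)

df = spark.read.text("s3a://stackoverflow-example/some-prefix/")  # prefix is a placeholder
print(df.count())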

Running multiple elasticsearch instances

I need to set up 2 Elasticsearch instances:
one for Kibana logs (my separate application will throw logs at it)
one for search for my production application
My plan is to create separate folders with Elasticsearch in them. They don't talk to each other, which means they are separate databases, and if one goes down, the other still runs. Is this a good solution, or should I use only one Elasticsearch folder with multiple elasticsearch.yml configuration files? What is the best practice for multiple Elasticsearch instances?
The best practice is to NOT run two Elasticsearch instances on the SAME server.
Your production search will probably need a lot of RAM to work fast and stay responsive. You don't want your logging system interfering with that.
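If you do run them as two fully separate installs, a quick sanity check that they really are independent clusters is to compare what each reports on its root endpoint; a sketch assuming the logging instance listens on 9200 and the search instance on 9201:

# Sketch: confirm the two Elasticsearch installs are separate clusters.
import requests

for name, url in [("logs", "http://localhost:9200"), ("search", "http://localhost:9201")]:
    info = requests.get(url, timeout=5).json()
    print(name, info["cluster_name"], info.get("cluster_uuid"))
# Different cluster_name/cluster_uuid values mean the two instances share no
# state, so one going down cannot take the other with it.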

EC2 database server failover strategy

I am planning to deploy my web app to EC2. I have several webserver instances. I have 1 primary database instance. I have 1 failover database instance. I need a strategy to redirect the webservers to the failover database instance IP when the primary database instance fails.
I was hoping I could use an Elastic IP in my connection strings. But, the webservers are not able to access/ping the Elastic IP. I have several brute force ideas to solve the problem. However, I am trying to find the most elegant solution possible.
I am using all .Net and SQL Server. My connection strings are encrypted.
Does anybody have a strategy for failing over a database instance in EC2 using some form of automation or DNS configuration?
Please let me know.
http://alestic.com/2009/06/ec2-elastic-ip-internal
tells you how to use the Elastic IP public DNS.
Haven't used EC2 but surely you need to either:
(a) put your front-end into some custom maintenance mode that you define while you switch the IP over, and have the front-end perform the required steps to manage potential data-integrity and data-loss issues related to the previous server going down and the new server coming up when it enters and leaves your custom maintenance mode;
OR, for a zero-downtime system:
(b) design the system at the object/relational and transaction levels from the ground up to support zero-downtime failover. It's not something you can bolt on quickly to just any application.
(c) use some database support for automatic failover. I am not aware whether SQL Server support for failover suitable for your application exists or is appropriate here. I suggest adding a "sql-server" tag to the question to reach the right audience.
If Elastic IPs don't work (which sounds odd to say the least - shouldn't you talk to EC2 about that?), you may have to be able to instruct your front-end which new database IP to use at the same time as telling it to go from maintenance mode to normal mode.
If you're willing to shell out a bit of extra money, take a look at RightScale's tools; they've built custom server images and supporting tools that handle database failover (among many other things). This link explains how to do it with MySQL, so it will hopefully show you some principles even though it doesn't use SQL Server.
I always thought there was this possibility in the connection string.
This is taken (but not yet tested) from How to add Failover Partner to a connection string in VB.NET:
If you connect with ADO.NET or the SQL Native Client to a database that is being mirrored, your application can take advantage of the driver's ability to automatically redirect connections when a database mirroring failover occurs. You must specify the initial principal server and database in the connection string, as well as the failover partner server.
Data Source=myServerAddress;Failover Partner=myMirrorServerAddress;
Initial Catalog=myDataBase;Integrated Security=True;
There are of course many other ways to write the connection string using database mirroring; this is just one example pointing out the failover functionality. You can combine this with the other connection string options available.
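If the mirroring-aware connection string is not an option, the same idea can be approximated on the client side: try the principal first and fall back to the mirror. A rough sketch in Python with pyodbc; the server names, database, and driver name are placeholders:

# Sketch: connect to the principal, falling back to the failover partner.
import pyodbc

PRIMARY = "my-primary-db.internal"    # placeholder
FAILOVER = "my-failover-db.internal"  # placeholder

def connect():
    for server in (PRIMARY, FAILOVER):
        try:
            return pyodbc.connect(
                "DRIVER={ODBC Driver 17 for SQL Server};"
                "SERVER=" + server + ";DATABASE=myDataBase;Trusted_Connection=yes;",
                timeout=5,  # login timeout in seconds
            )
        except pyodbc.Error:
            continue
    raise RuntimeError("neither the primary nor the failover instance is reachable")

conn = connect()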
To broaden gareth's answer, cloud management software usually solves this type of problem. RightScale is one of them, but you can try enStratus or Scalr (disclaimer: I work at Scalr). These tools provide failover solutions like:
Backups: you can schedule automated snapshots of the EBS volume containing the data
Fault-tolerant database: in the event of a failure, a slave is promoted to master, and the mounted storage is switched if the failed master and new master are in the same AZ, or a snapshot of the volume is taken.
If you want to build your own solution, you could replicate the process detailed below that we use at Scalr:
1. Is there a slave in the same AZ? If so, promote it, switch the EBS volumes (which are limited to a single AZ), switch any Elastic IP you might have (see the sketch after this list), and reconfigure replication of the remaining slaves.
2. If not, is there a slave fully replicated in another AZ? If so, promote it, then do the above.
3. If there are no slaves in the same AZ, and no slave fully replicated in another AZ, then create a snapshot from the master's volume, use this snapshot to create a new volume in an AZ where a slave is running, and then do the above.
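The Elastic IP switch in step 1 can be scripted; a boto3 sketch assuming a VPC-style Elastic IP, with the allocation and instance IDs as placeholders:

# Sketch: re-point the Elastic IP at the promoted slave so the web servers keep
# using the same address in their connection strings.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # adjust the region

ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",  # placeholder Elastic IP allocation
    InstanceId="i-0123456789abcdef0",           # placeholder: the new master
    AllowReassociation=True,
)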
