Elasticsearch deployment environment setup - amazon-ec2

We are setting up our Elasticsearch backend for a production environment. Up until a few weeks ago we were using Solr, but we decided to switch to Elasticsearch for a few reasons, the biggest being the distributed nature of the backend.
With that said, we've been looking for documentation and best practices on deploying Elasticsearch using Amazon's services.
For the moment, we were considering using an extra-large box and then scaling out from there, but we aren't sure that is the best approach. For example, it may be better to have three mediums than one extra-large.
We intend to index around 100K to 150K documents per day up to around ten million docs.
The question is: can anyone provide a general environment / deployment diagram for Elasticsearch, or best practices in general?

There are some Elasticsearch docs that cover EC2 deployment. There's an autodiscovery plugin that can find nodes by EC2 tags or security groups. You can also choose S3 for persistence, although that may not really be necessary.
I'd advise launching it in a VPC so you can have permanent internal IPs. In regular EC2 your internal IPs will change whenever an instance is stopped and started again, even if you're using Elastic IPs.
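As a rough aid for sizing, the figures in the question (100K to 150K documents per day, up to around ten million documents) can be turned into a timeline. This is only back-of-the-envelope arithmetic: the function name is made up for illustration, a steady daily rate is assumed, and it says nothing about document size or query load, which matter just as much when picking instance types.

```python
def days_to_reach(target_docs: int, docs_per_day: int) -> int:
    """Whole days of indexing needed to reach target_docs at a steady rate."""
    if docs_per_day <= 0:
        raise ValueError("docs_per_day must be positive")
    # Integer ceiling division, so a partial last day counts as a full day.
    return -(-target_docs // docs_per_day)


TARGET_DOCS = 10_000_000  # "around ten million docs" from the question

slow = days_to_reach(TARGET_DOCS, 100_000)  # lower bound of the daily rate
fast = days_to_reach(TARGET_DOCS, 150_000)  # upper bound of the daily rate
print(f"Index reaches {TARGET_DOCS:,} docs in roughly {fast} to {slow} days")
```

At these rates the index takes on the order of two to three months to reach its full size, so starting small and adding nodes as it grows is at least plausible.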

Docker for Elasticsearch multi-tenancy SaaS or single instance and proxy?

I am trying to build a prototype of Elasticsearch as a Service. I have thought of two different approaches and I'd like to get opinions on one or the other.
The first is a single installation of Elasticsearch with a proxy layer on top to add user validation (HTTP basic authentication plus a user account to validate the usage).
This approach would be relatively straightforward. The main challenges would be configuring the cluster properly to handle the load, and setting up permissions so that there are no data leaks between users and users don't have access to the cluster management APIs.
The second is to use Docker as a container and have one instance of Elasticsearch for each user. In this case I would be providing the isolation by using the Linux container (Docker). I'd still need to manage authentication.
It probably would be good to implement both, play around and see how things behave. Any opinions about pros and cons of each approach?
Thanks!
Disclaimer: I am the founder of the Elasticsearch service provider Facetflow, which currently offers shared clusters.
I think that both approaches have merit, but they may be suited to different types of customers.
Looking at other SaaS providers, like MongoDB provider MongoLab, they essentially ended up offering both setups (although not using Docker).
So, pros and cons as I see them:
Shared Cluster
Most Elasticsearch as a Service providers operate this way.
Pros:
Far more affordable for the majority of users just looking for good search and analytics.
Simpler maintenance, fewer clusters for you to monitor.
Potentially fewer versions of Elasticsearch to integrate with. If you need to communicate with other systems (which you do) or write your own plugins (we did, for authentication, silos, entitlements, stats etc.), fewer versions will be far easier to maintain.
Cons:
Noisy neighbours have to be monitored and you have to scale and relocate indices to handle this.
Users have to choose from a limited list of versions of Elasticsearch, usually a single version.
Users don't get full cluster admin control.
Private Clusters using Docker
One provider that works this way is Found.
Pros:
Users could potentially deploy a variety of versions of Elasticsearch.
Users can have complete cluster admin access.
Noisy neighbours don't affect their cluster, so less manual intervention is needed from you.
Cons:
Complex monitoring and support. If people can do whatever they want (such as shutting down the cluster over the API), you have to be clear about where your responsibility as a provider ends, and what wakes you up at night.
Complex integration with multiple versions, see shared cluster pros.
More expensive since you have to allocate resources that might not always be used.
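For the shared-cluster option, the proxy layer has to keep each user inside their own indices and away from the cluster management APIs. A minimal sketch of such a check follows. The naming rule (every index of a tenant starts with "<tenant>-") and the function name are assumptions made for illustration, not how any particular provider does it, and a real proxy would also have to inspect request bodies (bulk and multi-search requests can name indices there), which this sketch ignores.

```python
def tenant_may_access(tenant: str, path: str) -> bool:
    """Return True if `tenant` may call the Elasticsearch URL path `path`.

    Hypothetical rule: a tenant only owns indices named "<tenant>-...".
    """
    segments = [s for s in path.split("/") if s]
    if not segments:
        return False  # the root endpoint reports cluster information
    first = segments[0]
    if first.startswith("_"):
        # Top-level "_" endpoints (/_cluster, /_nodes, /_search, ...) are not
        # scoped to one tenant's indices, so the proxy refuses them outright.
        return False
    # The first segment may be a comma-separated list or a wildcard pattern;
    # every entry must stay inside the tenant's namespace.
    prefix = f"{tenant}-"
    return all(name.startswith(prefix) for name in first.split(","))


print(tenant_may_access("acme", "/acme-logs/_search"))
print(tenant_may_access("acme", "/_cluster/health"))
```

Denying by default and allowing only the tenant's own prefix keeps the rule small; anything more permissive would need a per-endpoint allow list.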

Basic AWS questions

I'm a newbie to AWS, and it has so many products (EC2, Load Balancer, EBS, S3, SimpleDB etc.) and so many docs that I can't figure out where to start.
My goal is to be ready for scalability.
Suppose I want to set up a simple webserver which accesses a database in MongoLab. I suppose I need one EC2 instance to run it. At this point, do I need something more (EBS, S3, etc.)?
At some point my app will have reached enough traffic that I must scale it. I was thinking of starting a new copy (instance) of my EC2 machine. But then it will have another IP. So, how is traffic distributed between both EC2 instances? Is that done automatically? Must I hire a Load Balancer service to distribute the traffic? And then will I have to pay for 2 EC2 instances and 1 LB? At this point, do I need something more (e.g. Elastic IP)?
Welcome to the club Sony Santos,
AWS is a very powerful architecture, but with this power comes responsibility. I and presumably many others have learned the hard way building applications using AWS's services.
You ask, where do I start? This is actually a very good question, but you probably won't like my answer. You need to read up on and research all the technologies offered by Amazon and even other providers such as Rackspace, GoGrid, Google's Cloud and Azure. Amazon is not easy to get going with, but it isn't really meant to be; its focus is on being very customizable and having a very extensive API. But let's get back to your question.
To run a simple webserver you need to start an EC2 instance. By default this instance runs on a disk drive called EBS. Essentially an EBS drive is a normal hard drive, except that you can do lots of other cool stuff with it, like taking it off one server and moving it to another. S3 is really more of a file storage system; it's more useful if you have a bunch of images or want to store a lot of database backups, but it's not a requirement for a simple webserver. Just running an EC2 instance is all you need; everything else will happen behind the scenes.
If your app reaches a lot of traffic you have two options. You can scale your machine up by shutting it off and starting it as a larger instance. Generally speaking this is the easiest thing to do, but you'll get to a point where either you cannot handle all the traffic with one instance even at the larger size and decide you need two, or you want a more fault-tolerant application that will still be online in the event of a failure or update.
If you create a second instance you will need some form of load balancing. I recommend Amazon's Elastic Load Balancer, as it's easy to configure and its integration with the cloud is better than using round-robin DNS or an application like HAProxy. Elastic Load Balancers are not expensive; I believe they cost around $18 / month plus the data that passes through the load balancer.
But no, you don't need anything else to scale up your site. Two EC2 instances and an ELB will do the trick.
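To make "distributing traffic between two instances" concrete, here is a minimal sketch of the simplest policy a balancer can apply, round robin, where each new request goes to the next instance in turn. It is an illustration only, not a description of how ELB works internally, and the addresses are hypothetical.

```python
from itertools import cycle
from typing import Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Return an endless iterator over the backends, in order."""
    if not backends:
        raise ValueError("need at least one backend")
    return cycle(backends)


# Hypothetical private addresses of the two EC2 instances.
picker = round_robin(["10.0.0.11", "10.0.0.12"])
first_four = [next(picker) for _ in range(4)]
print(first_four)
```

A managed balancer adds the parts this sketch leaves out, such as health checks and a single stable endpoint for clients.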
Additional questions you didn't ask but probably should have.
How often does an EC2 instance experience hardware failure and crash my server? What can I do if this happens?
It happens frequently, usually in batches. Sometimes I go months without any problems, then a few servers crash at once. It's definitely something you should plan for; I didn't in the beginning and I paid for it. Make sure you create scripts and have backups and a backup plan ready in case your server fails. Either be OK with it being down or have a load-balanced solution from day 1.
What's the hardest part about scalability?
Testing, testing, testing, testing... Don't ever assume anything. Also be prepared for sudden spikes in your traffic. You have to be prepared for anything: if your page goes from 1 to 1000 people overnight, are you prepared to handle it? Have you tested what you "think" will happen?
Best of luck and have fun... I know I have :)

Failover proxy on Amazon aws?

This is a fairly generic question. Suppose I have three ec2 boxes: two app boxes and a box that hosts nginx as a reverse proxy, delegating requests to the two app boxes (my database is hosted elsewhere). Now, the two app machines can absorb a failure amongst themselves, however the third one represents a single point of failure. How can I configure my setup so that if the reverse proxy goes down, the site is still available?
I am looking at keepalived and HAproxy. For me this stuff is non-obvious, and any help for the ears of a beginner is appreciated.
If your nginx does not do much more than proxy HTTP requests, please have a look at Amazon Elastic Load Balancer. You can set up your two (or more) app boxes, leave some spare ones (in order to always keep two or more up, if you need that), set up health checks, have SSL termination at the balancer, make use of sticky sessions, etc.
There are a lot of people, though, who would like to be able to assign Elastic IP addresses to ELBs, and others with good arguments for why it is not needed.
My suggestion is that you take a look at the ELB documentation, as it seems to fit your needs perfectly. I also recommend reading this interesting post for a good discussion on this subject.
I think if you are a beginner with HA and clusters, your best solution is Elastic Load Balancer (ELB), which is maintained by Amazon. ELBs scale up automatically and are implemented as a high-availability cluster of balancers. So by using the ELB service you already mitigate the single point of failure you described. It's also important to keep in mind that an ELB is cheaper than two instances in AWS. And of course it's easier to launch and maintain.
You don't see multiple ELB nodes because it is a service, so you don't have to take care of its availability.
Another important point is that AWS Elastic IPs aren't assigned to the network interface inside your instance's OS, so using virtual IPs the way you would in a classical infrastructure is difficult.
After this explanation, if you still want Nginx as a reverse proxy in AWS for your own reasons, I think you can implement an autoscaling group with a layer composed of Nginx instances. But if you aren't an expert in autoscaling technology, it could be very tricky.
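The health checks mentioned above are what let a balancer stop sending requests to a box that has failed. A minimal sketch of the bookkeeping follows: a backend is taken out of rotation after a number of consecutive failed checks and put back after a passing one. The class name, the threshold of 2 and the backend names are all assumptions for illustration; they are not ELB defaults.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HealthTracker:
    """Marks a backend unhealthy after `threshold` consecutive failed checks."""

    threshold: int = 2  # illustrative value, not an ELB default
    failures: Dict[str, int] = field(default_factory=dict)

    def record(self, backend: str, check_passed: bool) -> None:
        # A passing check resets the counter; a failing one increments it.
        if check_passed:
            self.failures[backend] = 0
        else:
            self.failures[backend] = self.failures.get(backend, 0) + 1

    def healthy(self, backends: List[str]) -> List[str]:
        # Backends that were never checked count as healthy (zero failures).
        return [b for b in backends if self.failures.get(b, 0) < self.threshold]


tracker = HealthTracker()
apps = ["app-1", "app-2"]  # hypothetical names for the two app boxes
tracker.record("app-1", False)
tracker.record("app-1", False)  # second consecutive failure
tracker.record("app-2", True)
print(tracker.healthy(apps))
```

Requiring more than one failure before removing a box avoids dropping it because of a single slow response.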

Amazon EC2 + Windows Server 2008 + Memcached = how?

We are building a system that would benefit greatly from a distributed caching mechanism like Memcached, but I can't get my head around how Memcached daemons and clients find each other in an Amazon data center. Do we manually set up the IP addresses of each Memcached instance (they won't be dedicated; they will run on web servers or worker boxes), or is there an automagic way of getting them to talk to each other? I was looking at Microsoft Windows Server AppFabric Caching, but it seems to need either a file share or a domain to work correctly, and I have neither at the moment. Given that internal IP addresses are transient on Amazon, I am wondering how you get around this.
I haven't set up a cluster of memcached servers before, but Membase is a solution that could take away all of the pain you are experiencing with memcached. Membase is basically memcached with a persistence layer underneath, and it comes with great cluster management software. Clustering servers together is easy, since all you need to do is tell the cluster the IP address of the new node. If you already have an application written for Memcached, it will also work with Membase, since Membase uses the Memcached protocol. It might be worth taking a look at.
I believe you could create an Elastic IP in EC2 for each of the boxes that hold your memcached servers. These Elastic IPs can be dynamically mapped to any EC2 instance. Then your memcached clients just use the Elastic IPs as if they were static IP addresses.
http://alestic.com/2009/06/ec2-elastic-ip-internal
As you seem to have discovered, Route 53 is commonly used for these discovery purposes. For your specific use case, however, I would just use Amazon ElastiCache. Amazon has both Memcached- and Redis-compatible versions of ElastiCache, and they manage the infrastructure for you, including providing you with a DNS entry point. Also, for managing things like ASP.NET session state, you might consider this article on the DynamoDB session state provider.
General rule of thumb: if you are developing a new app, try to leverage what the cloud provides rather than building it yourself. It'll make your life way simpler.
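On the question of getting memcached instances to "talk to each other": memcached servers do not know about one another. Each client holds the list of servers and decides which one owns a key by hashing it. A minimal sketch of one common scheme, consistent hashing, follows. The class name, the hostnames and the choice of SHA-256 with 100 points per node are assumptions for illustration; real client libraries differ in the details.

```python
import bisect
import hashlib
from typing import List


class HashRing:
    """Maps cache keys to nodes so that adding a node moves only some keys."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        if not nodes:
            raise ValueError("need at least one node")
        # Each node is placed on the ring at `replicas` pseudo-random points.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around the end.
        index = bisect.bisect_left(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]


# Hypothetical stable names (DNS entries or Elastic IPs) for the cache boxes.
nodes = ["cache-a.internal", "cache-b.internal", "cache-c.internal"]
ring = HashRing(nodes)
print(ring.node_for("session:42"))
```

Because every client computes the same mapping from the same server list, the only thing that has to stay stable is the name or address of each cache box, which is what the Elastic IP and DNS suggestions above are for.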

Amazon EC2 consideration - redundancy and elastic IPs

I've been tasked with determining if Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo for our host costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something happened to our server.
How reliable is Amazon EC2? Is the redundancy good? I don't see anything about this in the FAQ, and it's a problem on our current system that I'm looking to solve.
Are elastic IPs beneficial? It sounds like you could point DNS to that IP and then on Amazon's end, reroute that IP address to any EC2 instance so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails come from Amazon suggesting that we terminate and boot another instance (presumably because they are detecting possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IPs are amazing and are part of the solution. The other part is being able to rapidly bring up new instances. I've learned that you shouldn't care about instances going down; it's more important to use proper load balancing and be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together a concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using the elastic IP to quickly redirect to another EC2 instance would be a way to minimize downtime.
Yeah, I think moving from an in-house server to Amazon will definitely make a lot of sense economically. EBS-backed instances ensure that the data on disk is not lost even if the machine gets rebooted. And if you have a clear separation between your application and data layers and can put them on different machines, then you can build even better redundancy for your data.
For example, if you use MySQL, you can consider using the Amazon RDS service, which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer can then be made more resilient through load balancing across several smaller instances rather than one larger instance.
The cost you will save on is really hardware maintenance and the cost you would have to incur to build in disaster recovery.
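The case for several load-balanced instances over one large one can be put in numbers. If each instance is up a fraction a of the time and failures are independent, the chance that at least one of n instances is up is 1 - (1 - a)^n. The sketch below evaluates that formula. The 0.99 figure is made up for illustration and is not a number published by Amazon, and independence is an assumption that a shared datacenter outage would break.

```python
def combined_availability(per_instance: float, count: int) -> float:
    """Chance that at least one of `count` independent instances is up."""
    if not 0.0 <= per_instance <= 1.0:
        raise ValueError("per_instance must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - per_instance) ** count


# 0.99 is a hypothetical per-instance availability, chosen only for the example.
for n in (1, 2, 3):
    print(n, f"{combined_availability(0.99, n):.6f}")
```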
