Elasticsearch cluster on a Kubernetes cluster. In or out?

I'm currently working on deploying an Elasticsearch cluster in K8s.
Can anyone help me understand the pros/cons of deploying the ES cluster inside our K8s cluster versus outside it? Thanks in advance!

A big pro is data ingestion. If you have your ES cluster inside your k8s cluster, data ingestion will be faster.
However, a big con is resources. ES will eat away at your resources worse than google-chrome eats your RAM. And I mean, a lot.
And maintaining it can be quite cumbersome. I'm not sure about your use case, but if it is logging (as in most cases), cloud providers usually have their own solution for that.
If not, then:
I would recommend having dedicated nodes for ES in your cluster; otherwise, when ES peaks and starts using a lot of node resources, it might affect other pods.
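If it helps, here is a minimal sketch of the dedicated-node idea using the kubernetes Python client; the node name and the "dedicated=elasticsearch" taint key/value are placeholders I made up, and you would normally do the same thing with kubectl taint plus a tolerations block in your ES pod spec.

```python
# Sketch: taint a node so only pods that explicitly tolerate
# "dedicated=elasticsearch" are scheduled on it. Node name and the
# taint key/value are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

body = {
    "spec": {
        "taints": [
            {"key": "dedicated", "value": "elasticsearch", "effect": "NoSchedule"}
        ]
    }
}
v1.patch_node("es-node-1", body)
```

Your ES pods then need a matching toleration (and usually a nodeSelector) so they land only on those nodes, while everything else stays off them.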
Also make sure to familiarize yourself with hot-warm-cold data tiering and optimize it; it will save you a lot of time and resources.
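For illustration, a sketch of the hot-warm part using shard allocation filtering with the elasticsearch Python client; it assumes your nodes were started with a node.attr.box_type of "hot" or "warm", and the index names are invented.

```python
# Sketch: route an actively-written index to hot nodes, then migrate it
# to warm nodes once it is only read occasionally. Assumes nodes carry
# a "box_type" attribute; index names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Today's index lives on fast (hot) hardware...
es.indices.put_settings(
    index="logs-2019.06.01",
    body={"index.routing.allocation.require.box_type": "hot"},
)

# ...while last month's index is demoted to cheaper (warm) hardware.
es.indices.put_settings(
    index="logs-2019.05.01",
    body={"index.routing.allocation.require.box_type": "warm"},
)
```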
EDIT
I haven't emphasized how important this faster data ingestion is, so it might not seem like a good enough reason to deploy ES inside the cluster. The bottom line is pretty obvious: network latency and bandwidth.
These things can really add up (picking up all those logs from all those pods, then scaling those same pods, then expanding the cluster, then again...), so every unit counts. If your VMs will not suffer from those two (meaning they have the same latency as any other node of the cluster), I think it won't make a huge difference.
On the other hand, I see no big benefit in separating them from the cluster. It is a part of your infrastructure anyway.
What if tomorrow you decide to switch to AWS or GKE? You would have to change your deployments and set the whole thing up again. On the other hand, if it's already part of your cluster, just kubectl apply and 🤷
I can also guess that you will try to set up an ELK stack. If time and goodwill allow, give fluentd a chance (it is 100% compatible with all logstash clients but much more lightweight).

Related

Bluemix Spark and Hadoop Service Configuration

Having run through configuration of both the Hadoop BigInsights and Apache Spark services on Bluemix, I noticed that Hadoop is very configurable. I have a choice of how many nodes there will be in the cluster, as well as the RAM, CPU cores, and hard disk space of those nodes.
But the Spark service seems less configurable. The only choice I have is to choose between 2 and 30 Spark executors.
I am working with Bluemix as part of an IBM IC4 project to evaluate these services, so I have a few questions about this.
Is it possible to configure the Spark service in a similar way to the Hadoop service, i.e. choose the number of nodes, the RAM per node, CPU cores, etc.?
What are Spark executors in this context? Are they nodes? If so, what are their specifications?
Is there a plan to improve the options for Spark's configuration in the future?
Apologies for the questions but I need to know these specifications in order to carry out my work.
The BigInsights service is what some would call a hosted service, which is to say that when you provision an instance of this service you get your own cluster with nodes configured as specified in the chosen plan. Consequently, you'll want to know exactly what each node you're paying for gives you.
On the other hand, the Apache Spark service is a shared compute service, wherein you pay for compute to run your Spark programs. Running Spark is about in-memory compute and creating RDDs over sources of data hosted by other data services. So in this context, what matters is how many concurrent jobs you can run, how many parallel tasks you can run with how much memory, and so on. In the Spark service plan, the executors seem to be an abstraction over this compute horsepower; unfortunately, it's hard for you to map that to physical hardware if you care about that. The plan description needs more elaboration and details about how one translates this abstraction to workload needs.
However, I understand that this should be improved considerably at some point in the near future. There have been rumors about moving to only a single Spark service plan where you can dial in, whenever you want, how much compute you need, and that would take effect when you click "go" for all Spark jobs from that point forward; it seems like you could twiddle the dials until you get what you want, see what that would cost, then lock it in until the next time you need to change it. I can imagine something even more dynamic than that, on a per-job basis. But anyway, that seems like the direction things may be going for this compute service.

Choosing Hadoop solution for Big Data project - Pricing Options

I have to use Hadoop for my research work and I am deciding on the best option to start with. So far I have ended up going with Cloudera. I've downloaded the quick start VM and started learning from different tutorials.
The issue is that my system can't afford to run it; it performs very slowly, and I think it might just stop working once I feed it all the data and run other services.
I was advised to go for a cloud service with a 4-node cluster. Can someone please help me by providing the best option and estimated pricing to consider? A 1-year plan might be enough to complete my research.
Thanks.
If you are a Linux user, just download the individual components (like HDFS, MR1, YARN, HBase, Hive, etc.) from the Cloudera Archives instead of loading the Cloudera QuickStart VM.
If you want to try the 4-node cluster, the easiest option is to use the cloud.
There are plenty of cloud providers. I have personally used AWS, Google Cloud, Microsoft Azure, and IBM SmartCloud. Of these, AWS is the best to start with.
It is pay-as-you-go (you pay for what you use). I can recommend a decent EC2 setup (4 x m3.large machines):
Type: m3.large | vCPU: 2 | RAM: 7.5 GB | Storage: 1 x 32 GB SSD | Price: $0.133 per hour (AWS Pricing)
If you plan to do the research for one year, I recommend you to go for VPC.
Cons of AWS EC2:
If you launch a machine in plain EC2, the moment you restart the machine, your IP and hostname will change.
In AWS VPC, your IP and hostname remain the same.
If you run 4 machines 24x7 for one month, it costs you about $389 (see the calculation below).
You can calculate the AWS cost by yourself
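As a sanity check on the figure above, assuming an average month of about 30.5 days (732 hours):

```python
# Back-of-the-envelope monthly cost for 4 x m3.large at $0.133/hour
hourly_rate = 0.133
machines = 4
hours_per_month = 24 * 30.5  # ~732 hours in an average month

monthly_cost = hourly_rate * machines * hours_per_month
print(f"${monthly_cost:.2f}")  # -> $389.42
```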
As best as I can see, you have two paths:
1. Set up Hadoop yourself with a cloud infrastructure provider (e.g. Amazon's EC2 or Red Hat's OpenShift).
2. Use Hadoop-as-a-service (e.g. Amazon's EMR or Microsoft's HDInsight).
The first path, setting up Hadoop with a cloud infrastructure provider, will require you to become a semi-competent Hadoop administrator. If that's your goal, great! However, you'll spend a great deal of time learning the necessary skills and mindset to become that. I don't suspect that is your goal.
The second path is the one I'd recommend of these two. Using Hadoop-as-a-service you will get up and running faster, but it will cost more up front and on an ongoing, per-hour basis. You'll still probably save money, because you'll be spending less time troubleshooting your Hadoop cluster and more time doing the work you wanted to do in the first place.
I have to wonder: if you can fit your dataset on your laptop, why are you using big data tools in the first place? True, they'll work. However, Big Data is at least partially defined by data sets and computational problems that just don't fit on a single machine.

Docker-Swarm, Kubernetes, Mesos & Core-OS Fleet

I am relatively new to all of these, but I'm having trouble getting a clear picture of the listed technologies.
All of these try to solve different problems, but they do have things in common too. I would like to understand what is common and what is different. It is likely that some combination of them would be a great fit; if so, what is it?
I am listing a few of them along with questions, but it would be great if someone listed all of them in detail and answered the questions.
Kubernetes vs Mesos:
This link, "What's the difference between Apache's Mesos and Google's Kubernetes", provides good insight into the differences, but I'm unable to understand why Kubernetes should run on top of Mesos. Is it more to do with two open-source solutions coming together?
Kubernetes vs CoreOS Fleet:
If I use Kubernetes, is Fleet required?
How does Docker-Swarm fit into all the above?
Disclosure: I'm a lead engineer on Kubernetes
I think that while Mesos and Kubernetes are largely aimed at solving similar problems of running clustered applications, they have different histories and different approaches to solving the problem.
Mesos focuses its energy on very generic scheduling and on plugging in multiple different schedulers. This means that it enables systems like Hadoop and Marathon to co-exist in the same scheduling environment. Mesos is less focused on running containers; it existed prior to widespread interest in containers and has been refactored in parts to support them.
In contrast, Kubernetes was designed from the ground up to be an environment for building distributed applications from containers. It includes replication and service discovery as core primitives, whereas such things are added via frameworks in Mesos. The primary goal of Kubernetes is to be a system for building, running and managing distributed systems.
Fleet is a lower-level task distributor. It is useful for bootstrapping a cluster system; for example, CoreOS uses it to distribute the Kubernetes agents and binaries out to the machines in a cluster in order to turn up a Kubernetes cluster. It is not really intended to solve the same distributed application development problems; think of it more like systemd/init.d/upstart for your cluster. It's not required if you run Kubernetes; you can use other tools (e.g. Salt, Puppet, Ansible, Chef, ...) to accomplish the same binary distribution.
Swarm is an effort by Docker to extend the existing Docker API to make a cluster of machines look like a single Docker API. Fundamentally, our experience at Google and elsewhere indicates that the node API is insufficient for a cluster API. You can see a bunch of discussion on this here: https://github.com/docker/docker/pull/8859 and here: https://github.com/docker/docker/issues/8781
Join us on IRC in #google-containers if you want to talk more.
I think the simplest answer is that there is no simple answer. The swift rise to power of containers, and Docker in particular has left a power vacuum for "container scheduling and orchestration", whatever that might mean. In reality, that means you have a number of technologies that can work in harmony on some levels, but with certain aspects in competition. For example, Kubernetes can be used as a one stop shop for deploying and managing containers on a compute cluster (as Google originally designed it), but could also sit atop Fleet, making use of the resilience tier that Fleet provides on CoreOS.
As this Google video states, Kubernetes is not a complete out-of-the-box container scaling solution, but it is a good place to start. In the same way, you would at some stage expect Apache Mesos to be able to work with Kubernetes, but not with Marathon, inasmuch as Marathon appears to fulfil the same role as Kubernetes. Somewhere I think I've read that these could become part of the same effort, but I could be wrong about that; it's really about the strategic direction of Mesosphere and the corresponding adoption of Kubernetes principles.
In the DockerCon keynote, Solomon Hykes suggested Swarm would be a tier that could provide a common interface onto the many orchestration and scheduling frameworks. From what I can see, Swarm is designed to provide a smooth Docker deployment workflow, working with some existing container workflow frameworks such as Deis, but flexible enough to yield to "heavyweight" deployment and resource management such as Mesos.
Hope this helps; this could be an enormous post. I think the key is that these are young, evolving services that will likely merge and become interoperable, but we need to ride out the next 12 months to see how it plays out. There are some very clever people on the problem, so the future looks very bright.
As far as I understand it:
Mesos, Kubernetes and Fleet are all trying to solve a very similar problem. The idea is that you abstract away all your hardware from developers and the 'cluster management tool' sorts it all out for you. Then all you need to do is give a container to the cluster, give it some info (keep it running permanently, scale up if X happens etc) and the cluster manager will make it happen.
With Mesos, it does all the cluster management for you, but it doesn't include the scheduler. The scheduler is the bit that says: OK, this process needs 2 CPUs and 512 MB RAM, and I have a machine over there with that free, so I'll run it on that machine. There are some plugin schedulers available for Mesos (Marathon and Chronos) and you can write your own. This gives you a lot of power over resource distribution, cluster scaling, etc.
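To make that concrete, here is a rough sketch of handing such a task to Marathon over its documented /v2/apps API; the Marathon host, app id and image are placeholders.

```python
# Sketch: ask Marathon (a pluggable Mesos scheduler) to keep one copy of
# a container running with 2 CPUs and 512 MB RAM. Host, app id and
# image are placeholders.
import requests

app = {
    "id": "/my-service",
    "cpus": 2,
    "mem": 512,  # MB
    "instances": 1,
    "container": {
        "type": "DOCKER",
        "docker": {"image": "nginx:latest"},
    },
}

resp = requests.post("http://marathon.example.com:8080/v2/apps", json=app)
resp.raise_for_status()
```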
Fleet and Kubernetes seem to abstract away those sorts of details (so you don't have to write your own scheduler basically). This means you have to define your tasks and submit them in the format/manner defined by Fleet or Kubernetes and then they take over and schedule the tasks (containers) for you.
So I guess: Using Mesos may mean a bit more work in writing your own scheduler, but potentially provides more flexibility if required.
I think the idea of running Kubernetes on top of Mesos is that Kubernetes acts as the scheduler for Mesos. Personally I'm not sure what benefits this brings over running one or the other on its own though (hopefully someone will jump in and explain!)
As MikeB said, it's early days and it's all up for grabs (keep an eye on Amazon's ECS as well), so there are many competing standards and a lot of overlap!
Edit: I didn't mention Docker Swarm as I don't really have much experience with it.
For anyone coming to this after 2017: fleet is deprecated. Do not use it anymore.
The fleet docs say "fleet is no longer actively developed or maintained by CoreOS" and link to Container orchestration: Moving from fleet to Kubernetes. Fleet was removed from Container Linux (formerly known as CoreOS Linux) and replaced with the Kubernetes kubelet (agent). This coincided with a corporate pivot to offer Tectonic (a Kubernetes distro) as their primary product.

Master/Slave setup in Elasticsearch

I'm trying to bootstrap a product but I'm constrained for money. So I would like to keep the server costs as low as possible.
My requirement is that I need to index millions of records in Elasticsearch that keep coming in at a rate of 20 records per second. I also need to run search queries and percolate queries often. I currently have a basic DigitalOcean droplet serving the website, which also hosts the Elasticsearch node. It has a mere 512 MB of RAM, so I often run into out-of-heap-memory errors with Elasticsearch becoming non-responsive.
I have a few computers at home to spare.
What I would like to do is, setup a master elasticsearch server in my home network, which will index all the data and also handle the percolate queries. It will push periodic updates to a slave elasticsearch node on the web server. The slave node will handle the search queries.
Is this setup possible?
If it is not possible, what is the minimum RAM I would need in the current scenario to keep elasticsearch happy?
Will indexing in bulk (like 100 documents at a time) instead of one document at a time make a difference?
Will switching to Sphinx make a difference for my use case?
(The reasons I chose Elasticsearch over Sphinx: 1. Elasticsearch has a flexible document schema, which was an advantage as the product is still in the definition phase. 2. The percolate feature in Elasticsearch, which I use heavily.)
Thank you very much.
You can manually set up something similar to master/slave using the Elasticsearch Snapshot and Restore mechanism:
Snapshot And Restore
The snapshot and restore module allows to create snapshots of individual indices or an entire cluster into a remote repository. At the time of the initial release only shared file system repository was supported, but now a range of backends are available via officially supported repository plugins.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
Snapshot and Restore lets you back up individual indices or an entire cluster to a remote repository (shared file systems, and via plugins Amazon S3 and Microsoft Azure, are supported) and then restore them. You could take periodic snapshots of the index on your home Elasticsearch cluster, which can then be restored to your search cluster in the cloud. You can control this via the normal REST API, so you could make it happen automatically on a schedule.
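For example, a rough sketch with the elasticsearch Python client; the repository name, S3 bucket, hosts and index name are all placeholders, and the S3 backend assumes the repository plugin is installed on both clusters.

```python
# Sketch: snapshot on the home cluster, restore on the cloud node.
# Repository name, bucket, hosts and index are placeholders.
from elasticsearch import Elasticsearch

home = Elasticsearch("http://home-es:9200")

# One-time: register the shared repository.
home.snapshot.create_repository(
    repository="my_backup",
    body={"type": "s3", "settings": {"bucket": "my-es-snapshots"}},
)

# On a schedule: snapshot the index you want to ship.
home.snapshot.create(
    repository="my_backup",
    snapshot="snap-2015-01-19",
    body={"indices": "products"},
    wait_for_completion=True,
)

# On the cloud search node (pointed at the same repository): restore it.
# The target index must be closed or absent before restoring.
cloud = Elasticsearch("http://search-node:9200")
cloud.snapshot.restore(
    repository="my_backup",
    snapshot="snap-2015-01-19",
    body={"indices": "products"},
)
```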
That addresses the indexing portion of your performance problem, provided you have sufficient resources on your home network (servers with enough memory and a network with enough upload capacity to get your index pushed to the cloud).
Regarding your performance on queries, you need as much memory as you can get. Personally, I'd look at some of Amazon's EC2 memory-optimized instances, which provide more memory at the expense of disk or CPU, as many ES installations (like yours) are primarily memory bound.
I'd also suggest something I've done when dealing with heap issues: a short script that searches the log file for heap issues and, when they occur, restarts jetty or tomcat or whatever servlet container you are using. Not a solution, but it certainly helps when ES dies in the middle of the night.
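A minimal sketch of such a watchdog; the log path and restart command are assumptions, so adjust them to however you run ES.

```python
# Sketch: tail the ES log and restart the service when a heap error
# appears. Log path and restart command are assumptions.
import subprocess
import time

LOG_PATH = "/var/log/elasticsearch/elasticsearch.log"

def follow(path):
    """Yield lines appended to the file, like `tail -f`."""
    with open(path) as f:
        f.seek(0, 2)  # start at the end of the file
        while True:
            line = f.readline()
            if not line:
                time.sleep(1.0)
                continue
            yield line

for line in follow(LOG_PATH):
    if "OutOfMemoryError" in line:
        subprocess.run(["service", "elasticsearch", "restart"], check=True)
```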
ElasticSearch is fantastic at indexing millions of records, but it needs lots of memory to be efficient. Our production servers have 30 GB of memory pinned just for ES. I don't see any way you can index millions of records and expect positive response times with 512 MB.
Perhaps look into using Azure or EC2 to keep your costs down.

Amazon EC2 consideration - redundancy and elastic IPs

I've been tasked with determining if Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo for our host costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something were to happen to our server.
How reliable is Amazon EC2? Is the redundancy good? I don't see anything about this in the FAQ, and it's a problem with our current system that I'm looking to solve.
Are Elastic IPs beneficial? It sounds like you could point DNS at that IP and then, on Amazon's end, reroute that IP address to any EC2 instance, so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails come from Amazon suggesting that we terminate an instance and boot another (presumably because they detected possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IPs are amazing and are part of the solution. The other part is being able to rapidly bring up new instances. I've learned that you shouldn't care about individual instances going down; it's more important to use proper load balancing and be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using an Elastic IP to quickly redirect to another EC2 instance is a way to minimize downtime.
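For instance, a sketch of that failover step with boto3; the region, allocation ID and instance ID are placeholders.

```python
# Sketch: re-point an Elastic IP at a healthy standby instance.
# Region, allocation ID and instance ID are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.associate_address(
    AllocationId="eipalloc-0123456789abcdef0",  # the Elastic IP
    InstanceId="i-0123456789abcdef0",           # the standby instance
    AllowReassociation=True,
)
```

DNS keeps pointing at the same address the whole time, so clients never notice the switch.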
Yeah, I think moving from an in-house server to Amazon will definitely make a lot of sense economically. EBS-backed instances ensure that even if the machine gets rebooted, the data on disk is not lost. And if you have a clear separation between your application and data layers and can have them on different machines, then you can build even better redundancy for your data.
For example, if you use MySQL, you can consider using the Amazon RDS service, which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer can then be made more resilient by having more, smaller instances rather than one larger instance, behind load balancing.
What you really save on is hardware maintenance and the cost you would otherwise incur to build in disaster recovery.
