I'm trying to bootstrap a product but I'm constrained for money. So I would like to keep the server costs as low as possible.
My requirement is that I need to index millions of records in Elasticsearch, which keep coming in at a rate of 20 records per second. I also need to run search queries and percolate queries often. I currently have a basic DigitalOcean droplet serving the website, which also hosts the Elasticsearch node. It has a mere 512 MB of RAM, so I often run into out-of-heap-memory errors with Elasticsearch becoming non-responsive.
I have a few computers at home to spare.
What I would like to do is set up a master Elasticsearch server on my home network, which will index all the data and also handle the percolate queries. It will push periodic updates to a slave Elasticsearch node on the web server. The slave node will handle the search queries.
Is this setup possible?
If it is not possible, what is the minimum RAM I would need in the current scenario to keep elasticsearch happy?
Will indexing in bulk (like 100 documents at a time) instead of one document at a time make a difference?
Will switching to Sphinx make a difference for my use case?
(The reason I chose Elasticsearch over Sphinx was: 1. Elasticsearch has a flexible document schema, which is an advantage as the product is still in the definition phase. 2. The percolate feature in Elasticsearch, which I use heavily.)
Thank you very much.
You can manually set up something similar to master/slave using the Elasticsearch Snapshot and Restore mechanism:
Snapshot And Restore
The snapshot and restore module allows to create snapshots of
individual indices or an entire cluster into a remote repository. At
the time of the initial release only shared file system repository was
supported, but now a range of backends are available via officially
supported repository plugins.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-snapshots.html
Snapshot and Restore lets you back up individual indices or an entire cluster to a repository (a shared file system, or cloud backends such as Amazon S3 and Microsoft Azure via plugins) and then restore them. You could take periodic snapshots of your index from your home Elasticsearch cluster, which can then be restored to your search cluster in the cloud. You can control this via the normal REST API, so you could make it happen automatically on a schedule.
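To make the scheduled approach concrete, here is a minimal sketch of the two REST calls involved, assuming a shared-filesystem repository named home_backup and a locally reachable node; the host, repository name, and path are all placeholders:

```python
import json

ES = "http://localhost:9200"  # assumed address of the node you are snapshotting

def register_repo_request(repo_name, location):
    """Build the (url, body) for the PUT that registers a shared-filesystem
    snapshot repository. 'location' must be listed under path.repo in
    elasticsearch.yml on every node."""
    url = f"{ES}/_snapshot/{repo_name}"
    body = {"type": "fs", "settings": {"location": location}}
    return url, json.dumps(body)

def snapshot_request(repo_name, snapshot_name):
    """Build the URL for the PUT that creates a snapshot. On the cloud node,
    register the same repository and POST <url>/_restore to load it."""
    return f"{ES}/_snapshot/{repo_name}/{snapshot_name}?wait_for_completion=true"
```

On a schedule (e.g. cron), issue the snapshot PUT on the home node, sync the repository directory to the cloud server, then trigger a restore on the search node.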
That addresses the indexing portion of your performance problem, provided you have sufficient resources on your home network (servers with enough memory, and a network with enough upload capacity to get your index pushed to the cloud).
Regarding query performance, you need as much memory as you can get. Personally I'd look at some of Amazon's EC2 memory-optimized instances that provide more memory at the expense of disk or CPU, as many ES installations (like yours) are primarily memory-bound.
I'd also suggest something I've done when dealing with heap issues: a short script that searches the log file for heap errors and, when they occur, restarts Jetty, Tomcat, or whatever servlet container you are using. Not a solution, but it certainly helps when ES dies in the middle of the night.
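The log-scanning part of that watchdog can be very small; a sketch (log path, error patterns, and restart command are assumptions, adjust for your setup):

```python
import re

# Patterns that commonly appear in the log when the JVM exhausts its heap.
HEAP_ERRORS = re.compile(r"OutOfMemoryError|heap space")

def has_heap_error(log_lines):
    """Return True if any of the given log lines mentions a heap error."""
    return any(HEAP_ERRORS.search(line) for line in log_lines)

# A cron job could then tail the last N lines and restart the service, e.g.:
#   if has_heap_error(tail("/var/log/elasticsearch/es.log", 200)):
#       subprocess.run(["service", "elasticsearch", "restart"])
```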
Elasticsearch is fantastic at indexing millions of records, but it needs lots of memory to be efficient. Our production servers have 30 GB of memory pinned just for ES. I don't see any way you can index millions of records and expect positive response times with 512 MB.
Perhaps look into using Azure or EC2 to keep your costs down.
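On the bulk-indexing question asked above: yes, batching (e.g. 100 documents per request) makes a real difference, because a single _bulk call amortizes HTTP and per-request overhead across many documents. A minimal sketch of building the newline-delimited body (the index name records is a placeholder):

```python
import json

def bulk_body(docs, index="records"):
    """Build the newline-delimited JSON body for Elasticsearch's _bulk API.
    Each document is preceded by an action line, and the body must end
    with a trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# POST the body to http://<host>:9200/_bulk, e.g. once per 100 buffered
# documents or every few seconds, whichever comes first.
```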
Related
I am trying to copy data from an Oracle database to Azure Synapse, and it is taking a long time: around 3 days to copy 900 GB of data.
My Oracle database is an on-prem database and I have configured self-hosted IR.
I have configured the staging as well while copying the data from Oracle on-prem database to Azure Synapse.
I'm not sure why it is taking this much time. How can we check and fix this slow data copy?
The issue could be due to various reasons. You may need to host a growing number of concurrent workloads, or you may wish to improve performance at your current workload level. There are two options:
You can scale up the self-hosted IR by increasing the number of concurrent tasks that can execute on a node. Scaling up is only possible if the node's CPU and memory are not already fully occupied.
You can scale out the self-hosted IR by adding extra nodes (machines).
You can specify the parallelism you want the copy activity to use by setting the parallelCopies property. Think of it as the maximum number of threads allowed within the copy operation; the threads work in parallel, reading from the source or writing to the sink data store.
Here are the referenced docs: Self-hosted integration runtime scalability & High availability and scalability
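For reference, both parallelCopies and staging are configured on the copy activity itself. A hedged fragment (the activity and linked-service names are placeholders, and the right parallelCopies value depends on your IR node's cores):

```json
{
  "name": "CopyOracleToSynapse",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "OracleSource",
      "partitionOption": "PhysicalPartitionsOfTable"
    },
    "sink": { "type": "SqlDWSink" },
    "enableStaging": true,
    "stagingSettings": {
      "linkedServiceName": {
        "referenceName": "AzureBlobStaging",
        "type": "LinkedServiceReference"
      }
    },
    "parallelCopies": 8
  }
}
```

The partitionOption on the Oracle source is worth checking too: without partitioned reads, a single connection can easily become the bottleneck for 900 GB.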
Our current software solution uses a local ES installation (1 cluster, 1 node) to store documents so that the user can later search them. Ingestion is not continuous; it happens, say, once a month using bulk requests. The document set isn't huge and the documents are small. This solution has been working correctly without problems on normal laptop PCs (i5 with 8 GB RAM), since the use case does not require high performance.
Now we're facing 2 new requirements for our software solution:
Should be branded for other customers
The same final user (using the same machine) should be able to work with several instances of our solution (from different customers)
With these 2 new requirements the current solution cannot be used, because all documents would be indexed on the same node using the same index, and searches would then show documents from different customers.
A first approach to solve this issue was to index documents per customer, that is, to create an index per customer and index/search documents in the corresponding index. However, we're thinking of another solution that allows us the following:
ES indexed information must be easily removed from the system (i.e. by removing the data folder)
Each customer may want to use a newer version of our solution (i.e. which uses ES 7) whereas other will remain with older versions (i.e. ES 6)
Based on this, I think that the solution would be to have several ES installations on the same PC, each one with its customer dependent configuration:
Different cluster
Different node name and port
Different ES version
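The per-customer configuration could then live entirely in each installation's elasticsearch.yml; a sketch (all names, ports, and paths are placeholders):

```yaml
# elasticsearch.yml for one customer's installation
cluster.name: customer-a
node.name: customer-a-node-1
http.port: 9201                      # each installation gets its own port
path.data: /opt/customer-a/es/data   # removing this folder removes the indexed data
path.logs: /opt/customer-a/es/logs
```

Note the transport port (9300 by default) also needs a distinct value per installation; the setting name differs between ES 6 and ES 7.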
My questions then are: has anyone faced a similar use case? Would there be performance issues from installing several ES instances and leaving their services running continuously at the same time? Which problems could arise from this configuration?
Any help would be appreciated.
UPDATE
Based on the answer received and for possible future answers, I would like to clarify a bit more about the architecture of our solution + ES:
Our solution is a desktop application executed on normal laptop PCs
Single user
Even if more than one customer specific solution is installed in the PC, only 1 will be active at a time
Searches will be executed sporadically when the user wants to search for a specific document (as if someone opens Wikipedia to search for an article)
So topics as ...
Infrastructure failure
Data replication
Performance at high search demand
... are not critical
You can run multiple installations of ES on the same machine in production, but it has a lot of disadvantages.
Ideally, you should have at least 1 replica of each shard, and it should live on another physical machine (node) so that it can recover in case of infrastructure failure; this improves the resiliency of your system.
In production it's common to come across a use case where a single shard is not enough and you need to break your index into multiple primary shards to make it horizontally scalable, but if you use just 1 physical server, having multiple shards will not help you.
Multiple installations also don't help when one installation receives a lot of traffic and consumes all the physical resources (RAM, CPU, disk), bringing the other installations down with it. It also becomes difficult to isolate the root cause and quickly fix the issue, as an ES installation is not stateless: you cannot just start the same installation on another machine without moving all its data and configuration.
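For reference, the shard and replica counts mentioned above are index settings, sent as the body of a PUT to the index (e.g. PUT /my_index). Primaries are fixed at creation time (changing them requires a reindex), while the replica count can be changed later:

```json
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}
```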
Basically, yours is truly a multi-tenant SaaS application, and looking at your requirements, you should design your system considering the points below:
Upgrading the ES version is sometimes not straightforward and can involve a lot of breaking changes in your application code as well; having a single cluster running the latest version will not solve the problem. Hence your application should expose a tenant (your customer) registration API which also records which ES version the customer wants to use, and your code handles that accordingly.
ES indexed information must be easily removed from the system: I don't see the issue here; you can simply delete it using the ES API, which is the recommended way of doing that, instead of doing it manually.
Hope my answer is clear; let me know if I missed any of your requirements or you need further clarification.
Based on the update on the question I am adding below points:
As the OP mentioned, it's a very small desktop application, not a server-side application, so it's very important not to mix the content of each customer. Anybody could install an ES web admin plugin like https://github.com/lmenezes/cerebro and read the data of other customers.
The best solution in your case is a single installation of ES, on the version specified by the customer, with just 1 index belonging to the customer running the desktop application. You can then easily use the delete API as I mentioned earlier.
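As a sketch of that per-customer layout (the index naming scheme and host are assumptions for illustration, not an established convention):

```python
ES = "http://localhost:9200"  # assumed single local node on the desktop

def tenant_index(tenant_id):
    """One index per customer, e.g. 'customer-acme'; the prefix is arbitrary."""
    return f"customer-{tenant_id.lower()}"

def delete_index_url(tenant_id):
    """An HTTP DELETE to this URL removes all of that customer's indexed
    data via the supported API, instead of deleting the data folder by hand."""
    return f"{ES}/{tenant_index(tenant_id)}"
```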
There is no need for multiple installations at all. Even when they are not active, they still consume local disk space (which matters even more for a desktop app) and can cause problems; it is not a clean design to store unnecessary information in a desktop app, and it is also a security issue, which is a much bigger concern in general.
I'm currently working on deploying an Elasticsearch cluster in K8s.
Can anyone help me understand what are the cons/pros of deploying the ES cluster inside our K8s cluster or outside? Thanks in advance!
A big pro is data ingestion. If you have your ES cluster inside your k8s cluster, data ingestion will be faster.
However, a big con is resources. ES will eat away your resources worse than google-chrome eats your ram. And I mean, a lot.
And maintaining it can be quite cumbersome. Not sure about your use case but if it is logging (as in most cases), usually cloud providers have their own solution for that.
If not, then:
I would recommend having dedicated nodes for ES in your cluster, otherwise it might affect other pods if there are peaks and starts using a lot of node resources.
Also make sure to familiarize yourself and optimize your cold-warm-hot data, it will save you a lot of time and resources.
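A hedged fragment of how dedicated nodes are usually enforced, assuming you label and taint the ES nodes yourself (the label/taint names and resource values are placeholders):

```yaml
# Pod spec fragment for the Elasticsearch StatefulSet
spec:
  nodeSelector:
    dedicated: elasticsearch          # only schedule onto nodes labelled for ES
  tolerations:
    - key: dedicated
      operator: Equal
      value: elasticsearch
      effect: NoSchedule              # ES pods tolerate the taint that keeps others off
  containers:
    - name: elasticsearch
      resources:
        requests:
          memory: "4Gi"
          cpu: "1"
        limits:
          memory: "4Gi"               # cap memory so spikes can't starve other pods
```

Pair this with a matching node label and a taint such as kubectl taint nodes <node> dedicated=elasticsearch:NoSchedule.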
EDIT
I haven't emphasized how important this faster data ingestion is, so it might not seem like a good enough reason to deploy it inside the cluster. The bottom line is pretty obvious: network latency and bandwidth.
These things can really add up (picking up all those logs from all those pods, then scaling those same pods, then expanding the cluster, then again...), so every unit counts. If your VMs will not suffer from those two (meaning they have the same latency as any other node of the cluster), I think it won't make a huge difference.
On the other hand, I see no big benefit in separating them from the cluster. It is a part of your infrastructure anyway.
What if tomorrow you decide to switch to AWS or GKE? You would have to change your deployments, setup the whole thing again. On the other hand, if it's already a part of your cluster, just kubectl apply and 🤷
I can also guess that you will try to set up an ELK stack. If time and goodwill allow, give fluentd a chance (it is 100% compatible with all logstash clients but much more lightweight).
We have a memcached cluster running in production. Now we are replacing memcached with a Couchbase cluster as a persistent cache layer. The question is how to implement this cut-over and how to warm up the Couchbase bucket. Obviously we can't simply switch over to the cold Couchbase, since starting with a cold cache would bring the whole site down.
One option I was thinking of is to warm up Couchbase as a memcached node first. That means Couchbase uses a (non-persistent) memcached bucket and gets the cache set/get traffic like any other memcached node. The good thing is that minimal code changes are needed (configure the Moxi proxy to take memcached traffic, and register that node as a memcached node). Later we would convert all memcached buckets to Couchbase buckets. But I'm not sure Couchbase supports conversion between these two bucket types.
The 2nd option is to set up a persistent Couchbase bucket (as opposed to a non-persistent memcached bucket) from the beginning. We change the production cache client to replicate all traffic to both the memcached and Couchbase clusters, monitor the Couchbase bucket, and complete the cut-over once the cached items reach a certain size. A small drawback is the extra complexity of changing the cache client.
Thoughts?
EDIT on Aug 9, 2016
As I later found out, converting a memcached bucket to a Couchbase bucket is not supported in Couchbase, so the first option is not feasible.
In the end we decided to set up the client-side (standalone) Moxi proxy on each application host. We do it incrementally, host by host, to ramp up the cache traffic; that way each change to the site is small enough.
If you want easy, less work, and proven to work well, do the following:
Set up a Moxi client on each application server.
Point Moxi to Couchbase bucket on the Couchbase cluster.
Change your web application servers to point at the local Moxi install.
For your next code revision, start converting your code to use the Couchbase SDK instead of memcached.
Yes, there will be a period when things are not hot in the cache, but it will not take long for Couchbase to get populated. This method is used all the time for switch-overs; it is easy and nearly foolproof. One thing I have seen people do is try to copy things from their existing memcached servers over to Couchbase before cutting over, but what I am not sure of is how they knew the key of each value in memcached.
Also note that Moxi is an interim step to easily get off regular memcached, and it is great, but in the long run it is much better to switch to the SDK. The SDK has many more features than pure memcached.
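Whichever client you end up on (Moxi-backed memcached or the native SDK), the application-side logic is the same cache-aside pattern, so porting is mostly mechanical. A sketch against a hypothetical minimal client interface (any memcached or Couchbase client exposing get/set can be wrapped to fit this shape):

```python
def get_or_compute(cache, key, compute, ttl=300):
    """Cache-aside: return the cached value, or compute it, store it with a
    TTL, and return it. 'cache' is any object with get(key) -> value-or-None
    and set(key, value, ttl); this interface is an assumption, not a real SDK."""
    value = cache.get(key)
    if value is None:
        value = compute()
        cache.set(key, value, ttl)
    return value
```

Keeping this pattern behind one small wrapper is also what makes the "replicate traffic to both clusters" option from the question cheap to implement.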
Do not use the memcached buckets as they have none of the HA, persistence or whatever features of Couchbase.
We currently have a very write-heavy web analytics application which collects a large number of real-time events from a large number of websites and stores them for subsequent analytics and reporting.
Our initial planned architecture involved a cluster of web servers handling requests and writing all of the data into a Cassandra cluster, while simultaneously updating a large number of counters for real-time aggregated reports. We also plan to run Hadoop directly on CassandraFS (a replacement for HDFS, offered by DataStax) to natively run MapReduce jobs on the data residing in Cassandra for more involved analytics. The output of the MapReduce jobs would be written back to column families in Cassandra natively.
Hadoop MapReduce runs on a read-only replica of the main Cassandra cluster, which is write-heavy. The idea was to avoid multiple data hops and have all data for the analytics in one repository.
More recently we have heard about, and faced first-hand, issues managing and growing a Cassandra cluster, with frequent node outages and bad response times. Couchbase seems to be much better with response times and with dynamically growing and managing the cluster, so we are considering replacing Cassandra with Couchbase.
However this brings up a few questions.
Does Couchbase scale well in a mostly sequential, write-heavy scenario? I don't see our scenario making much use of the in-memory caching, as the raw data being written is rarely read back; only aggregated metrics are. Plus, I haven't been able to read much about what happens when Couchbase needs to hit the disk to write back data very frequently (or all the time?). Will it end up performing worse than Cassandra?
What happens to the Hadoop interface? Couchbase has its own map-reduce capabilities, but I understand they are limited in scope. Would I need to transfer data back and forth between Couchbase and HDFS to support all my analytics and reporting from a single database?
I have recently evaluated Cassandra and Couchbase, among other options, for a client requirement, so I can shed some light on both datastores.
Couchbase is incredibly easy to manage, and once you have installed the server on a node you can manage the cluster completely from the dashboard. However, Couchbase does not scale as well as Cassandra as the data size grows. I also did not find a way to integrate Couchbase with HDFS/Hadoop seamlessly.
Cassandra performs very well for super-fast write throughput, but it does not have any server-side aggregation capabilities. Cluster management is slightly more difficult than with Couchbase, as you have to rebalance the cluster every time you add or remove a node. Apart from that, from a performance standpoint, Cassandra runs pretty much seamlessly, as long as you have designed the schema properly.
If you can afford DataStax Enterprise with Hive to do map-reduce for sophisticated analytics, I'd recommend you stay with Cassandra, as Couchbase's map-reduce support is not all that good, and benchmarks show Couchbase performance starts to deteriorate as the cluster size grows.