We have over 100 million records stored in Elasticsearch.
The dataset is too large to be fully loaded into our service's memory.
Each record has a field called amount. The search needs to find a set of records (sometimes more than 10,000 of them) whose amounts sum to a value that equals, or comes close to, an input target.
Below is our current solution:
We merge the 100 million records into 4,000 buckets using ES's bucket aggregation. Each bucket's amount is the sum of the amounts of the records it contains.
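For reference, here is a minimal sketch of that bucketing step (the index name, field name, and interval are assumptions; the interval is just a guess tuned to yield roughly 4,000 buckets):

```python
import requests

# Hedged sketch: a fixed-interval histogram over "amount" with a sum
# sub-aggregation collapses the 100 million records into a few thousand
# buckets. Index name, field name, and interval are placeholders.
query = {
    "size": 0,  # we only need the aggregation, not individual hits
    "aggs": {
        "amount_buckets": {
            "histogram": {"field": "amount", "interval": 25000},
            "aggs": {"bucket_sum": {"sum": {"field": "amount"}}}
        }
    }
}

resp = requests.post("http://localhost:9200/transactions/_search", json=query).json()
buckets = [
    (b["key"], b["doc_count"], b["bucket_sum"]["value"])
    for b in resp["aggregations"]["amount_buckets"]["buckets"]
]
```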
We load the 4,000 buckets into our service and then solve the problem described above over those buckets.
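For illustration, a minimal greedy sketch of that selection step over the bucket sums (simplified; the real selection logic may differ):

```python
def pick_buckets(buckets, target):
    """Greedy sketch: repeatedly take the largest bucket sum that still fits.

    buckets: list of (key, doc_count, bucket_sum) tuples from the aggregation.
    Returns the chosen buckets and the achieved total, which may fall short
    of the target because each bucket is all-or-nothing.
    """
    chosen, total = [], 0.0
    for key, count, bucket_sum in sorted(buckets, key=lambda b: b[2], reverse=True):
        if total + bucket_sum <= target:
            chosen.append((key, count, bucket_sum))
            total += bucket_sum
    return chosen, total
```

Because each bucket is taken or skipped as a whole, the achieved total can miss the target by up to a full bucket's amount, which is where the accuracy problem below comes from.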
The obvious disadvantage is the lack of accuracy. The difference between the sum of results we find and the input target is sometimes quite large.
We are three young developers without much experience, and we would appreciate some guidance.
Generally speaking, what are the tradeoffs (in terms of performance and memory usage) between large and small indexes in Elasticsearch?
Elaborating a little:
Consider a cluster with 8 nodes, each node with 1 shard and 30 GB allocated to the JVM.
Consider also a scenario with 50 million documents per day (all with the same structure and using doc values), retained for 90 days. Each day of documents takes about 35 GB on disk.
I want to run some queries on this cluster covering a total of 12 hours of data.
These queries are composed of nested aggregations: a date histogram, followed by a cardinality and a percentile aggregation.
Considering the amount of data, which is better: daily indexes or a single index?
PS: I know this is a "vague" question. My question is more theoretical.
I want to better understand what occurs during an aggregation and how this relates to the number of indexes.
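For concreteness, here is a sketch of the kind of query I mean (the index pattern and field names are placeholders):

```python
import requests

# Sketch of the nested aggregation described above: a date histogram with
# cardinality and percentiles sub-aggregations, restricted to a 12-hour
# window. "events-*", "user_id", and "latency_ms" are placeholder names.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-12h"}}},
    "aggs": {
        "per_hour": {
            # older ES versions use "interval" instead of "fixed_interval"
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {
                "unique_users": {"cardinality": {"field": "user_id"}},
                "latency_pcts": {"percentiles": {"field": "latency_ms"}}
            }
        }
    }
}

resp = requests.post("http://localhost:9200/events-*/_search", json=query).json()
```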
I'm looking for a set data structure that's optimised for a very low probability that an item is part of the set.
The use case is the Gnip/Twitter compliance firehose where we get approx 1,000 events per second (that's deletions from all of Twitter). We have a table of, let's say 10 million stored tweets (growing by that amount each year), and if an item appears in the firehose I have to delete it. I'm guessing there will be a match every 100,000 seconds (to pull a number out of the air).
I had thought of a Bloom filter, possibly several chained together, but given that there's a very low chance of a hit, I'd always need to go through the entire chain, and things would eventually become linear.
Is there a good sublinear data structure for this?
I don't see the problem. It seems to me that if checking the Bloom filter tells you that you have the tweet stored, you then look up that tweet in your data store. If it's there, you delete it. If it's not there, you don't delete it.
You have 10 million stored tweets, and you expect that to grow by about 10 million per year. So build a Bloom filter with a capacity of a billion and a 0.1% probability of false positives. According to the Bloom filter calculator, that will cost you 1.67 gigabytes.
Understand that the "false positives" number assumes the filter actually contains the 1 billion keys. When your filter is very sparsely populated, the probability of false positives is much lower.
If you're getting a thousand tweets per second and the Bloom filter has a false positive rate of 0.1%, then in the worst case you'll get an average of one false positive per second. So once per second your code will have to hit the database to determine if the tweet is there.
But it'll be many years before you get to that. With only 10 million existing records and a growth rate of 10 million per year, it'll be 10 years before the filter is even 10% full. You could probably drop the filter size to 500 million (860 MB), and still not notice a big hit due to false positives.
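If it helps, here is a minimal self-contained sketch of the sizing math and the check-then-delete flow (the hashing scheme and the delete_if_present() call are purely illustrative; a real deployment would use an established Bloom filter library):

```python
import hashlib
import math

class BloomFilter:
    """Minimal illustrative Bloom filter; not production code."""

    def __init__(self, capacity, error_rate):
        # Standard sizing: m bits and k hash functions for the target error rate.
        self.m = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / capacity * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key):
        # Derive k positions by salting a single hash; illustrative only.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

# Sizing check: 1 billion keys at a 0.1% false-positive rate needs roughly
# 1.67 GiB of bits, matching the calculator figure above.
n, p = 1_000_000_000, 0.001
print(-n * math.log(p) / math.log(2) ** 2 / 8 / 1024 ** 3)

# Populate the filter with the stored tweet IDs, then only hit the data
# store when the filter says "maybe" for a firehose deletion event.
bf = BloomFilter(capacity=10_000_000, error_rate=0.001)
for tweet_id in ("123", "456"):  # stand-ins for the ~10M stored IDs
    bf.add(tweet_id)

def handle_deletion_event(tweet_id, delete_if_present):
    # delete_if_present() is a hypothetical data-store call; the database,
    # not the filter, is authoritative (false positives just cost a lookup).
    if bf.might_contain(tweet_id):
        delete_if_present(tweet_id)
```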
A Bloom Filter should be fine, assuming that it fits in memory. If it won't fit completely in memory, consider using the solution described in this paper.
Alternatively, if you really want to squeeze out a bit of extra performance, you can use a Cuckoo filter, but it will be harder to find an open-source implementation; here is one in Go.
I have content that is about 50 TB in size, with about 250 million documents in the set. The daily increment is not very large; it may be about 10,000 documents of varying sizes, totaling under 50 MB.
The current indexing effort is taking way too long and is guesstimated to complete in 100+ days!!!
So ... is this really that large of a data set? To me, 50 TB of content (in this day and age) is not very large. Do you have content of this size? If you do, how did you improve time taken for one-time indexing? Also, how did you improve time taken by real-time indexing?
If you can answer, great. If you can point me in the right direction, I appreciate that as well.
Thanks in advance.
rd
There are a number of factors to consider.
You can start with the client used to index. Which client are you using? Is it SolrJ, a framework that listens to a database (like Oracle or HBase), or the REST API?
This can make a difference. Solr is good at handling heavy indexing, but the client framework and the data preparation on the client side also need to be optimized. For example, if you use HBase Indexer (which reads from HBase tables and writes to Solr), you can expect a few million documents to be indexed in an hour or so; at that rate, 250 million should not take very long to complete.
After the client, you enter the Solr environment. How many fields are you indexing in your documents? Do you have stored fields or any other overhead in your field types?
Configuration parameters such as autoCommit (based on the number of records or RAM size), softCommit (as mentioned in the comment above), the number of parallel indexing threads, and the hardware itself are some of the points to consider.
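As a rough illustration of the client side, batching documents and deferring commits (letting autoCommit/softCommit handle visibility) is usually the first big win; this sketch uses pysolr with a placeholder URL and placeholder fields:

```python
import pysolr

# Illustrative sketch only: the Solr URL, core name, and field names are
# placeholders. The idea is to send documents in large batches and let
# autoCommit / softCommit in solrconfig.xml control visibility, rather
# than committing on every add.
solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=120)

def index_in_batches(docs, batch_size=1000):
    batch = []
    for doc in docs:                       # docs is any iterable of dicts
        batch.append(doc)
        if len(batch) >= batch_size:
            solr.add(batch, commit=False)  # defer commits to autoCommit settings
            batch = []
    if batch:
        solr.add(batch, commit=False)

index_in_batches({"id": str(i), "title_t": f"doc {i}"} for i in range(10000))
```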
You can find a comprehensive checklist here and verify each item. Happy designing!
The Use Case
I have an index of potentially millions of documents. I want to run around 20,000 searches against a subset of these documents (around 25,000 documents). These 25,000 documents could take up around 100 MB stored in Solr (consisting of stored and indexed text fields).
The Problem
As the number of indexed documents increases, query performance decreases a lot. For example, running 20,000 searches that hit 25,000 documents against a 100,000-document index takes around 4 minutes. Running the same searches against a 200,000-document index takes around 20 minutes.
So is there any way to cache these 25,000 documents in RAM before hitting them with searches?
UPDATE
Some things that really helped:
reducing the returned row count (In almost all cases I had to iterate through the returned results, and in almost all cases there were no more than 100 matching results, but I had set rows to a very large value. Reducing the row count improved performance around 2x. This seemed counterintuitive: if there are only 79 matches and I set the returned row count to 100, it performs better than when there are 79 matches and I set the row count to 1000. In the first case Solr already returns the found item count, and does it fast. Why should there be a performance difference?)
reducing multithreading (I had added multiple threads for querying because the development box had more resources available. On the resource-constrained production box it was slowing things down. Using only one or two threads got me around a 2x speed improvement.)
Some things that did not really help:
splitting up filter queries (I was already using filter queries everywhere it was possible, but I was combining them into one fq per query, e.g. fq=name:a AND type:b. Splitting them up as fq=name:a&fq=type:b caches them separately (see the Apache Solr documentation) and could improve performance, but it did not make a huge difference in this case; see the sketch after this list.)
changing caching settings (in this case filterCache seemed to have the most potential; however, increasing it or changing its settings did not make a huge difference.)
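For reference, this is what the combined and split forms look like with pysolr (the URL, core, and field names are placeholders):

```python
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore")  # placeholder URL/core

# Combined filter: cached as a single filterCache entry.
combined = solr.search("*:*", fq="name:a AND type:b", rows=100)

# Split filters: each fq is cached independently, so "name:a" and "type:b"
# can be reused by other queries that share only one of the two constraints.
split = solr.search("*:*", fq=["name:a", "type:b"], rows=100)
```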
A few things that are recommended for performance:
Have enough spare RAM on the box so index files can be in OS cache
Try to play around with Solr caching settings in solrconfig.xml
Play around with autowarming after commits
Try to develop your queries to limit the result set. Large result sets, especially when using grouping and faceting, will kill performance. Now, a 200,000-document index is really quite small, so you should not have any problems, but I thought I'd mention this for when you scale.
Try to use filter queries (fq) whenever possible. They are much faster than putting field:val in q, plus they are cached.
I have an index with some 10m records.
When I try to find the distinct values of one field (around 2 million of them), my Java client runs out of memory.
Can I implement scan and scroll on this aggregation to retrieve the same data in smaller parts?
Thanks
Check how much RAM you have allocated for Elasticsearch; since it is optimized to be super fast, it likes to consume lots of memory. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
I'm not sure if this applies to cardinality aggregations (or are you using a terms aggregation?), but I had some success using the "doc_values" fielddata format (see http://www.elasticsearch.org/blog/disk-based-field-data-a-k-a-doc-values/); it takes more disk space but keeps less in RAM. How many distinct values do you have? Returning a JSON response for a terms aggregation with a million distinct values is going to be fairly big. A cardinality aggregation just counts the number of distinct values without returning the values themselves.
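For example, a cardinality aggregation request looks like this (index and field names are placeholders); it returns only an approximate distinct count, not the values themselves:

```python
import requests

# Placeholder index/field names. The cardinality aggregation returns an
# approximate distinct count without materializing the individual values,
# so the response stays small even with ~2 million distinct terms.
query = {
    "size": 0,
    "aggs": {"distinct_values": {"cardinality": {"field": "my_field"}}}
}

resp = requests.post("http://localhost:9200/my_index/_search", json=query).json()
print(resp["aggregations"]["distinct_values"]["value"])
```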
You could also try re-indexing your data with a larger number of shards; shards that are too big don't perform as well as a few smaller ones.