Slow N1QL performance on small dataset using covering index - performance

I am using Couchbase 6 Community Edition, and my bucket has about 2 million documents of small size (< 5,000 bytes each).
Each document has a field named country, and I have a GSI on this field.
There are only four unique values for this field; however, a query to get these unique values takes 8 to 10 seconds.
I am not sure why it is this slow.
My query is:
SELECT DISTINCT(country)
FROM test_bucket
USE INDEX(country_index USING GSI)
WHERE country IS NOT MISSING
The memory quota on this bucket is 50 GB, and the machine has 40 cores.
I would like to know what the bottleneck is here, or what could cause a bottleneck in this situation.

You have the right index. Since you have 2 million documents with a country field, the query engine needs to get all 2 million entries from the indexer and eliminate duplicates. Use request profiling, described on page 341 of https://blog.couchbase.com/n1ql-practical-guide-second-edition/
Also check out the technique described at https://dzone.com/articles/count-amp-group-faster-using-n1ql
If you can use the EE (Enterprise Edition) version, you can use index grouping and aggregation, as described at https://blog.couchbase.com/understanding-index-grouping-aggregation-couchbase-n1ql-query/, by changing the query to a GROUP BY like below.
SELECT country
FROM test_bucket
USE INDEX(country_index USING GSI)
WHERE country IS NOT MISSING
GROUP BY country;
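
To see where the time goes, here is a minimal sketch of the request-profiling suggestion above, assuming you run the query from the cbq shell (the profile can also be enabled per request through the query REST API or the Query Workbench):

\set -profile "timings";

SELECT DISTINCT country
FROM test_bucket
USE INDEX(country_index USING GSI)
WHERE country IS NOT MISSING;

The response then includes a profile section with per-operator timings, which should show whether the 8-10 seconds is spent in the index scan itself or in the distinct/duplicate-elimination phase.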

Related

Get recent order counts by phone number in recent 2000 documents in Elasticsearch

There are order documents stored in our Elasticsearch. Each document has a field, let's say phone-number.
What we want to find out is phone number counts (how many times an order was received from a given phone number).
Each (I repeat, each) phone number has made 10 million+ bookings across all our data.
So aggregating across all the data is taking too much time. We came to the conclusion that a limit should be put in place because there is too much data. We introduced terminate_after in the query and the response times improved to the desired level, but then a new problem appeared: if we are going to put in a limit, we want to limit the results to the most recent 2000 documents instead of any 2000 docs.
So, in a nutshell, we want to find order counts by phone number in the most recent 2000 documents.
How can we achieve this in Elasticsearch?
What we have tried so far:
bucket sort (it applies only after the buckets are already created)
the terminate_after property in the query (this does not take into account that we want the counts from the most recent 2000 matches; see the sketch after this list)
the top_hits aggregation (this also works only after the hits are calculated)
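
For reference, a minimal sketch of the terminate_after attempt described above (the index name orders and the field name phone-number are assumptions for illustration): each shard stops collecting after 2000 documents, but there is no guarantee those 2000 are the most recent ones.

POST /orders/_search
{
  "size": 0,
  "terminate_after": 2000,
  "query": { "match_all": {} },
  "aggs": {
    "counts_by_phone": {
      "terms": { "field": "phone-number" }
    }
  }
}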

Elasticsearch caching a single field for quick response

I have a cluster of 10 nodes where I index about 100 million records daily, close to 6 billion records in total. I am constantly loading data. Each record has about 75 fields associated with it. 99% of my queries are based on the same field, essentially select * from table where groupid = 'value'. The majority of the queries bring back about a hundred records.
My queries currently take about 30 seconds to run the first 2 times and then are in the milliseconds. The problem is that all the user queries are searching for a different groupID, so their queries are going to be slow for the most part until they run them a third time.
Is it possible to "cache" the groupid field so that I can get sub-second queries?
My current query looks like this (pseudo-query). (I'm using a non-analyzed field, which I believe is better?)
query : {
  filtered : {
    filter : {
      "term" : { groupID : "valuex" }
    }
  }
}
I"ve researched and not sure how to go about this. I've looked into doc_values = yes and possibly field cache?
I do not care about scoring, aggregates. My only use case is to filter out records and only bringing back the 100 or so out of 5 billion that have the correct groupID.
We have about 64G Memory on each server.
Just looking for help on how to achieve optimal performance/caching? or anything else that would help.
I thought about routing but this would be difficult based on our groupid values.
thanks
Starting from Elasticsearch 2.0 we made some caching changes, like:
it keeps track of the 256 most recently used queries
it only caches those that appear 5 times or more
it does not cache segments which have less than 10,000 documents or 3% of the documents of the index
I'm wondering if you are hitting this last one.
Note that we did this because the file system cache is probably better than internal caching.
Could you also try a bool query instead of a filtered query, BTW? filtered has been deprecated (and is removed in 5.0). See how it performs.
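
For reference, a minimal sketch of the bool form suggested above (the same term on groupID, just moved into a bool filter clause so scoring is skipped and the clause is eligible for caching):

{
  "query": {
    "bool": {
      "filter": [
        { "term": { "groupID": "valuex" } }
      ]
    }
  }
}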

Elasticsearch query to return all records programmatically throws out of memory exception

I need to retrieve all records from Elasticsearch and do statistical analysis on the data. The number of records is not that high: 500,000 records. Each record has 7 columns, 5 of which are of type String (single-word values). So the size of the data does not seem that big to me at all. I am getting an 'out of memory exception' when executing the following:
SearchResponse response = client.prepareSearch(indexFrom).setTypes(typeFrom)
.setQuery(matchAllQuery()).setSize(SIZE)
.execute().actionGet();
SIZE=500000
Any help/suggestions?
I am setting -Xmx10g.
Thanks.
-Vera
If you just need to retrieve all documents, unsorted, like this, you should use a scan and scroll search.
To sum up, it combines the use of:
a search of type scan, which disables sorting of results (and thus saves some memory)
the scroll API, which is quite similar to a DB cursor in that it lets you walk through the results in small batches
I think it could solve your memory problem.
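
A minimal sketch of that scan-and-scroll approach, reusing the client, indexFrom, typeFrom, and the static matchAllQuery() import from the question, and assuming the same 1.x-era transport client API (other imports needed: org.elasticsearch.action.search.SearchResponse, org.elasticsearch.action.search.SearchType, org.elasticsearch.common.unit.TimeValue, org.elasticsearch.search.SearchHit):

// Open a scan search: no scoring or sorting, results are pulled in batches.
SearchResponse scrollResp = client.prepareSearch(indexFrom)
        .setTypes(typeFrom)
        .setSearchType(SearchType.SCAN)      // scan: skip sorting to save memory
        .setScroll(new TimeValue(60000))     // keep the scroll context alive for 60s
        .setQuery(matchAllQuery())
        .setSize(1000)                       // batch size per shard, not the total size
        .execute().actionGet();

// Keep pulling batches until no more hits come back.
while (true) {
    scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
            .setScroll(new TimeValue(60000))
            .execute().actionGet();
    if (scrollResp.getHits().getHits().length == 0) {
        break;                               // all documents have been read
    }
    for (SearchHit hit : scrollResp.getHits().getHits()) {
        // run the statistical analysis on hit.getSourceAsString() here
    }
}

This keeps only one batch of documents in memory at a time instead of materializing all 500,000 hits in a single response.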

SolrCloud: workaround for classic pagination with "start,rows" parameters

I have SolrCloud with 3 shards.
My purpose: to select and process all products from a category.
Current implementation: selecting one portion at a time in a loop.
1st iteration: q=cat:1&start=0&rows=100
2nd iteration: q=cat:1&start=100&rows=100
3rd: q=cat:1&start=200&rows=100
...
But as "start" grows, performance goes down. Explanation here: https://wiki.apache.org/solr/DistributedSearch
Makes it more inefficient to use a high "start" parameter. For example, if you request start=500000&rows=25 on an index with 500,000+ docs per shard, this will currently result in 500,000 records getting sent over the network from the shard to the coordinating Solr instance. If you had a single-shard index, in contrast, only 25 records would ever get sent over the network. (Granted, setting start this high is not something many people need to do.)
Any ideas for how I can walk through all records in the category?
There is another way to do more efficient pagination in Solr - cursors - which use the current place in the sort order instead. This is particularly useful for deep pagination.
See the section about cursors on the "Pagination of Results" wiki page. This should speed up delivery, as the server should be able to sort its local documents, decide where it is in that sequence, and return the 25 documents after that point.
UPDATE: Also a useful link: coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets
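
As a rough sketch of the cursor flow in the same query-parameter style as above (assuming the uniqueKey field is id; the sort must include it):

1st request:  q=cat:1&rows=100&sort=id asc&cursorMark=*
2nd request:  q=cat:1&rows=100&sort=id asc&cursorMark=<nextCursorMark from the 1st response>
...and so on, until the nextCursorMark returned by a response stops changing.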
I think the short answer is "no" - it's a limitation of how Solr does sharding. Instead, can you amass a list of document unique keys outside of Solr - presumably from a backing database - and then retrieve from the index using sets of those keys instead?
e.g. ID:(1 OR 2 OR 3 OR ...very long list...)
Or, if the unique keys are numeric you could use a moving range instead:
ID:[1 TO 1000], then ID:[1001 TO 2000], and so forth.
In both options above you would also restrict by category. They should both avoid the slowdown associated with windowing, however; a combined example follows.
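
For instance, with the category and key values from this question, the moving-range variant would look like:

q=cat:1 AND ID:[1 TO 1000]&rows=1000
q=cat:1 AND ID:[1001 TO 2000]&rows=1000
...continuing until the range passes the highest ID.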

MongoDB Vs Oracle for Real time search

I am building an application where I am tracking user activity changes and showing the activity logs to the users. Here are a few points:
Insert 100 million records per day.
These records need to be indexed and available in search results immediately (within a few seconds).
Users can filter records on any of the 10 fields that are exposed.
I think both Mongo and Oracle will not accomplish what you need. I would recommend offloading the search component from your primary data store to something like ElasticSearch:
http://www.elasticsearch.org/
My recommendation is ElasticSearch, as your primary use case is "filter" (facets in ElasticSearch) and search. It is written to scale up (otherwise Lucene is also good) and with big data in mind.
100 million records a day sounds like you would need a rapidly growing server farm to store the data. I am not familiar with how Oracle would distribute these data, but with MongoDB you would need to shard your data based on the fields that your search queries are using (including the 10 fields used for filtering). If you search only by the shard key, MongoDB is intelligent enough to only hit the machines that contain the correct shard, so it would be like querying a small database on one machine to get what you need back. In addition, if the shard keys can fit into the memory of each machine in your cluster and are indexed with MongoDB's B-tree indexing, then your queries would be nearly instant. A sketch of that shard-key setup follows.
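
A minimal sketch of that shard-key idea in the mongo shell (the database, collection, and field names are hypothetical, chosen only for illustration):

// Enable sharding for the database, then shard the collection on the field
// most queries filter by (the shard key needs an index; MongoDB creates one
// automatically if the collection is empty).
sh.enableSharding("activity")
sh.shardCollection("activity.logs", { userId: 1 })

// A query that includes the shard key is routed only to the shard(s)
// that own that key's chunks, instead of being broadcast to every shard:
db.logs.find({ userId: 12345, action: "login" })

A query that does not include userId would be scatter-gathered across all shards, which is why the choice of shard key matters so much here.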
