Elasticsearch filter vs term query for many ids - elasticsearch

I have an index of documents connected with some product_id. And I would like to find all documents for specific ids (around 100 000 product_ids to be found and 100 million are in total in index).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to chunkify ids and use just terms query or smth else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).

After some testing and more reading I found an answer:
Filter query works much much faster as chunks with just terms query.
But making really big filter can slower getting the result a lot.
In my case, using filter query with chunks of 10 000 ids is 10 times faster, than using filter query with all 100 000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to be taken into account is that filter query is stored in cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.

you can use "paging" or "scrolling" feature of elastic search query for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "From / To" is a more efficient way to go unless you want to return thousands of results each time (which could be many many MB of data so you probably don't want that)
Edit:
You can make a query like this in bulks:
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
}
}
}
If your document Id is sequential or some other number form that you could easily order by, and have a field available you can do a "range query"
GET _search
{
"query": {
"range" : {
"document_id_that_is_a_number" : {
"gte" : 0, // bump this on each query by "lte" step factor
"lte" : 10000 // find a good number here
}
}
}
}

Related

One large Elasticsearch lookup index, or several smaller ones?

I'm creating a lookup index that I'll use solely as a terms filter. So no searching/aggregating, only filtering and GETs.
I'm debating the structure of this lookup index, whether each document should contain all of the fields I want to filter for, or whether I should create an index per field.
For example, let's say each document pertains to a user. Each user has a list of games they've played, books they've read, and movies they've watched. When searching for game/book/movie recommendations, I'll use the term filter to filter out those items they've already interacted with.
I'm wondering if I should have a single lookup index with a document mapping like:
users_index
{
'game_ids': [],
'movie_ids' : [],
'book_ids': []
}
or one index per lookup value, like:
user_games_index
{
'game_ids': []
}
user_movies_index
{
'movie_ids': []
}
user_books_index
{
'book_ids': []
}
Pros for one index:
Each index comes with overhead, so the fewer the better
If I ever want to retrieve all of a user's info, it's all in one index
Pros for multiple indices:
According to the update api docs, updating a document means retrieving the whole thing first. I will be updating each document a lot, and those arrays can become rather large (think thousands of ids). Updating a book id will then retrieve all of the game ids, which takes up memory. If they were in separate indices, I could avoid that.
Just easier to maintain on my end of things
I should note that if I use multiple indices, it'll only be 4 or 5, with about 500k documents per index. Also, only 1 primary shard per index, no replicas, and I'm on a single m5.2xlarge EC2 instance (8 cores, 32G ram).
Are these stats so small that it won't really matter at this point, or should I favor one index or many?
How about a third option?
You have one index and each of your document in the index looks something like this:
{
"user_id" : "some_user",
"document_type" : "movie" or "game" or "book"
"document_id" : "id of movie, game or book"
}
Why? Since you say a user's games, movies or books will be updated often, this approach lets you easily add / delete individual movies, games or books for users.
You also can easily filter the books/movies/games for specific users.
All values are of type "keyword" and filtering should be fast.
PS: A "good" mapping for an ES index will try to minimize the numbers of updates on individual documents and rather work at the level of inserting / deleting documents as ES does this task very well compared to finding & updating documents.
Edit: I have added query examples to illustrate how you can filter out results with bool query.
Example:
I want all movies / games / books a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
}
}
}
}
I want only movies a user X has NOT interacted with.
GET _search
{
"query": {
"bool": {
"must_not":{
"term" : {
"user_id" : "user X"
}
},
"filter":{
"term" : {
"document_type" : "movie"
}
}
}
}
}

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the pervious link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
"query": {
"match": {
"name": "George's system"
}
}
}
GET /telemetry/_search
{
"query": {
"bool":{
"must": {
"multi_match": {
"operator": "and",
"fields": ["systemId"]
, [1] }
}
}
}
}
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?
With elastic there is no way to execute two nested queries like in a relational database where the first query uses the response of the second. The example in the application-side join, means that you are actually making two queries (two different requests to elastic) on the application side.
First query you get the list of ids you need to filter on.
Second query you pass the list of ids that you got to the terms filter.
This works when you have no more than 1024 values for systemId. Because terms query has a limit on the number of terms.
Because this query is not feasible, then you can't visualize it in kibana.
In such case you have to sacrifice a little of space and add the systemId to your mapping.
Good Luck!

Querying large amounts of terms without expanding maxClauseCount

In a data flow of mine, I am trying to retrieve a subset of documents from a previous terms aggregation, but hitting the maxClauseCount limit within my ES cluster. The follow up query is along these lines:
GET dataset/_search
{
"size": 2000,
"query": {
"bool": {
"must": [
(a filter or two)...,
{
"terms":{
"otherid":[
"789e18f2-bacb-4e38-9800-bf8e4c65c206",
"8e6967aa-5b98-483e-b50f-c681c7396a6a",
...
]
}
}
]}
}
}
In my research I've come across a lookup - which sadly we can't use - as well as the ids query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
From experimentation, it appears that the ids query doesn't share the limit the terms query has (potentially it's not converted into terms clauses). Do any of you know if there's a good way to achieve similar functionality to the ids query without using the ids fields.
My version of ES is 5.0.
Thanks!
instead of using terms use the Terms filter it will solve the issue
OR
index.query.bool.max_clause_count: increase to higher value(*Not Recommended)
http://george-stathis.com/2013/10/18/setting-the-booleanquery-maxclausecount-in-elasticsearch/

Nested count queries

i'm looking to add a feature to an existing query. Basically, I run a query that returns say 1000 documents. Those documents all have the same structure, only the values of certain fields vary. What i'd like, is to not only get the full list as a result, but also count how many results have a field X with the value Y, how many results have the same field X with the value Z etc...
Basically get all the results + 4 or 5 "counts" that would act like the SQL "group by", in a way.
The point of this is to allow full text search over all the clients in our database (without filtering), while showing how many of those are active clients, past clients, active prospects etc...
Any way to do this without running additional / separate queries ?
EDIT WITH ANSWER :
Aggregations is the way to go. Here's how I did it, it's so straightforward that I expected much harder work !
{
"query": {
"term": {
"_type":"client"
}
},
"aggregations" : {
"agg1" : {
"terms" : {
"field" : "listType.typeRef.keyword"
}
}
}
}
Note that it's even in a list of terms and not a single field, that's just how easy it was !
I believe what you are looking for is the aggregation query.
The documentation should be clear enough, but if you struggle please give us your ES query and we will help you from there.

Elastic Search Distinct values

I want to know how it's possible to get distinct value of a field in elastic search. I read an article here shows how to do that with facets, but I read facets are deprecated:
http://elasticsearch-users.115913.n3.nabble.com/Getting-Distinct-Values-td3830953.html
Is there any other way to do that? if not is it possible to tell me how to do that? it's abit hard to understand solutions like this: Elastic Search - display all distinct values of an array
Use aggregations:
GET /my_index/my_type/_search?search_type=count
{
"aggs": {
"my_fields": {
"terms": {
"field": "name",
"size": 1000
}
}
}
}
You can use the Cardinality metric
Although the counts returned aren't guaranteed to be 100% accurate, they almost always are for low cardinality terms and the precision is configurable via the precision_threshold param.
http://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html

Resources