How can I query/filter an elasticsearch index by an array of values? - elasticsearch

I have an elasticsearch index with numeric category ids like this:
{
"id": "50958",
"name": "product name",
"description": "product description",
"upc": "00302590602108",
"categories": [
"26",
"39"
],
"price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing if there would be a benefit to using a filter or facet.

As you already discovered, a terms query works. I would suggest a terms filter though, since filters are faster and cacheable.
Facets won't limit the results, but they are excellent tools. They count the hits for specific terms within your total results and can be used for faceted navigation.
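As a rough sketch of the terms filter suggested above, the request body could be built like this. The helper name build_category_search is hypothetical. The bool/filter wrapper and the terms aggregation (the successor to facets) assume a reasonably recent Elasticsearch version and a categories field that can be aggregated on (for example a keyword mapping); older releases used a different filter syntax.

```python
import json


def build_category_search(category_ids, field="categories"):
    """Build a search body matching documents whose `field` holds any of the ids.

    Hypothetical helper: it only assembles the JSON body and sends nothing.
    """
    # The sample document stores ids as strings, so normalise the input.
    ids = [str(category_id) for category_id in category_ids]
    return {
        "query": {
            "bool": {
                # Filter context: no scoring, and the result can be cached.
                "filter": [{"terms": {field: ids}}]
            }
        },
        # Counts hits per category without limiting the result set
        # (assumes the field supports aggregations).
        "aggs": {"per_category": {"terms": {"field": field}}},
    }


if __name__ == "__main__":
    # A parent id together with one of its children.
    print(json.dumps(build_category_search([26, "39"]), indent=2))
```

The printed body can be sent to the index's _search endpoint with any HTTP client.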

Related

Finding all words and their frequencies in an elasticsearch index

Elasticsearch Newbie here. I have an elasticsearch cluster and an index http://localhost:9200/products and each product looks like this:
{
"name": "laptop",
"description" : "Intel Laptop with 16 GB RAM",
"title" : "...."
}
I want all keywords in a field and their frequencies across all documents in an index, e.g.
description : intel -> 2500, laptop -> 40000 etc. I looked at termvectors https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but that only lets me do it for a single document. I want it across all documents for a particular field.
I wrote a plug-in for this, but it is an expensive call (depending on how many terms you want to get and the cardinality of the terms): https://github.com/nirmalc/es-termstat
Currently, there is no way to use term vectors on all documents in an index at once. You can use either the term vectors API to get a single document's term frequencies or the multi term vectors API to get them for multiple documents. A possible workaround looks like this:
make a scan request to get all documents of a given type, and for each page build a multi term vectors request like the following to get the term vectors.
POST /products/_mtermvectors
{
"ids" : ["1", "2"],
"parameters": {
"fields": [
"description"
],
"term_statistics": true
}
}
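The per-page step of that workaround could be sketched as follows. The function names are hypothetical, and the sample response is a hand-written fragment in the shape the multi term vectors API returns (docs, then term_vectors, then the field, then terms with a term_freq each), not real output. Fetching the pages of ids and sending the request are left out.

```python
from collections import Counter


def build_mtermvectors_body(doc_ids, field="description"):
    """Body for POST /<index>/_mtermvectors covering one page of document ids."""
    return {
        "ids": list(doc_ids),
        "parameters": {"fields": [field], "term_statistics": True},
    }


def add_term_counts(totals, response, field="description"):
    """Add every document's term_freq values for `field` to the running totals."""
    for doc in response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats.get("term_freq", 0)
    return totals


# Hand-written response fragment, for illustration only.
sample_response = {
    "docs": [
        {"_id": "1", "term_vectors": {"description": {"terms": {
            "intel": {"term_freq": 1}, "laptop": {"term_freq": 1}}}}},
        {"_id": "2", "term_vectors": {"description": {"terms": {
            "laptop": {"term_freq": 2}}}}},
    ]
}

totals = add_term_counts(Counter(), sample_response)
print(totals.most_common())
```

Calling add_term_counts once per page with the same Counter accumulates the frequencies across the whole index.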

Elasticsearch: search word forms only

I have a collection of docs, and they have a field tags which is an array of strings. Each string is a word.
Example:
[{
"id": 1,
"tags": [ "man", "boy", "people" ]
}, {
"id": 2,
"tags":[ "health", "boys", "people" ]
}, {
"id": 3,
"tags":[ "people", "box", "boxer" ]
}]
Now I need to query only docs which contain the word "boy" and its forms ("boys" in my example). I do not need elasticsearch to return doc number 3 because it does not contain a form of boy.
If I use a fuzzy query I will get all three docs, including doc number 3, which I do not need. As far as I understand, elasticsearch uses Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I will get number 1 only, not both (1, 2).
I wonder whether there is any way to query docs by word form matching. Is there a way to make elastic match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? The "duke" case is more complicated, but I need to support it as well.
Could it be solved using some specific analyzer settings?
With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could be about addressing plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language specific.
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles mutations (changes, transformations, transpositions) independently of any specific language, based solely on the number of edits.
You would need to change the processing of your strings at indexing and at search time to get the language-specific variations addressed via stemming. This can be achieved by configuring a suitable analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
"type": "text",
"analyzer": "english"
}
As you cannot change the type or analyzer of a field in an existing index, you would need to fix your mapping and then re-index everything.
I'm not sure whether Duke and Duchess are considered to be the same word (and therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
See also Elasticsearch Reference: Language Analyzers
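If the stemmer alone does not cover "duke"/"duchess", a customised analyzer with synonyms might look roughly like the sketch below. The names tags_english, tag_synonyms and english_stemmer are made up for illustration, the synonym list is an assumption, and the typeless mappings section assumes Elasticsearch 7 or later.

```python
import json


def build_tags_index(synonyms=("duke, duchess",)):
    """Index-creation body with an English-stemming analyzer plus synonyms.

    All analyzer and filter names here are hypothetical.
    """
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Treats the listed words as equivalent.
                    "tag_synonyms": {"type": "synonym", "synonyms": list(synonyms)},
                    # Reduces English word forms ("boys" -> "boy").
                    "english_stemmer": {"type": "stemmer", "language": "english"},
                },
                "analyzer": {
                    "tags_english": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "tag_synonyms", "english_stemmer"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {"tags": {"type": "text", "analyzer": "tags_english"}}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_tags_index(), indent=2))
```

The body would be sent when creating the new index, before re-indexing the documents. A match query on tags then goes through the same analyzer at search time.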

Sorting by product price considering special prices (client, group, country)

We have a shop with a few products (~5000).
There are, of course, category overview sites which show all products that are in the current category. A requirement is that all products can be sorted by price (ASC and DESC).
This already works, but only partially: our Elasticsearch index currently holds only the "original" price, so product discounts are not considered and the sorting is not correct.
My task now is to fix that.
But I am already struggling with how to persist the "special prices" in Elasticsearch.
The problem is that every product can be discounted in general, on a customer level, on a customer group level and on a country level.
So I imagine a structure like this would be a start:
# current
{
"articleNumber": "12345",
...
"price": 9.99,
...
}
# new
{
"articleNumber": "12345",
...
"price": 9.99,
...
"special_prices": [
{
"customer": "123456",
"client_price": 5.99,
"client_group_price": null,
"country_de": null,
"country_es": null,
...
},
...
]
}
Following thoughts:
The special prices could be stored as a nested object inside the product index (but I am not sure how to do the sorting on it later).
Maybe I could create a second index with prices. Then I would have two queries, but I guess that would be OK? I would have to build a whole matrix of every customer we have (also ~5000) with every product and every possible price. But with a second index I would have to join, and then the sorting might be incorrect.
If possible, I would like to persist prices only if a product has a special price; if not, I don't want to blow up the index.
I tried something with painless to return the special price if one exists for the product and customer, but this gives me this:
...
"script": "if (doc['special_prices.customer'] != null && doc['special_prices.customer'].value == '123456') { return 12.45; } else { return doc['price']; }",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Fielddata is disabled on text fields by default. Set fielddata=true on [special_prices.customer] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
...
Maybe something like SQL ORDER BY CASE WHEN would be an option?
Any ideas on how I should model and persist the special prices? And how can I achieve the sorting?
Is joining a second index a good idea?
Best regards
The error you see occurs because special_prices.customer is not indexed as keyword but as text (which allows full-text search). If you didn't specify a mapping explicitly, Elasticsearch most likely created a keyword sub-field for you. Just try replacing special_prices.customer with special_prices.customer.keyword in your script.
The idea of using a script for sorting is good, given that you only have 5000 documents. Scripts do not have good performance, but in your case this might not matter.
In general this looks like a tough case, because you need some kind of joining between products and prices, and Elasticsearch is not good at joins. It has got some joining options: nested datatype, join datatype (a.k.a. parent-child), and denormalization. The last one you have already considered - when you put different prices in the original product document.
Unfortunately I can't recommend one over another, because there is no single recipe. I would try with scripts, and if performance is not good enough consider remodelling the data.
Hope that helps!
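A minimal sketch of the script-based sort with the .keyword sub-field might look like this. It assumes the default dynamic mapping (so special_prices.customer.keyword exists) and special_prices stored as plain objects rather than nested. Like the script in the question, it takes the special price as a fixed parameter, because a doc-values script on flattened object arrays cannot pair a customer with its own price, and .value reads only a single value. build_price_sort is a hypothetical helper.

```python
import json


def build_price_sort(customer_id, special_price, order="asc"):
    """Sort clause that uses `special_price` for products listing this customer.

    Hypothetical helper; field names follow the question's sketch.
    """
    source = (
        "def customers = doc['special_prices.customer.keyword']; "
        "if (customers.size() > 0 && customers.value == params.customer) "
        "{ return params.special_price; } "
        "return doc['price'].value;"
    )
    return {
        "sort": [
            {
                "_script": {
                    "type": "number",
                    "order": order,
                    "script": {
                        "lang": "painless",
                        "source": source,
                        # Params keep the script text constant across customers.
                        "params": {
                            "customer": customer_id,
                            "special_price": special_price,
                        },
                    },
                }
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(build_price_sort("123456", 12.45), indent=2))
```

If the per-customer price has to come from the document itself, the data would probably need remodelling (nested or denormalized), as discussed above.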

Why does ES recommend to use single mapping per index and doesn't provide any "Join" functionality for this?

As you know, starting from version 6, the Elasticsearch team deprecated multiple types per index as well as parent-child relationships. Proof is here
They recommend using join queries instead of parent-child. But let's look at this join query here. They write:
The join datatype is a special field that creates parent/child
relation within documents of the same index.
They suggest using multiple indexes and restrict each index to a single mapping, _doc, but the join query is designed to work only within the bounds of a single index.
How to live on? How could I create parent-child relationships for separate indexes?
Example:
Index: "City"
{
"name": "Moscow",
"id": 1
}
Index: "Product"
{
"name": "Shirt",
"city": 1,
"id": 1
}
How could I get that "Shirt" above if I know only "Moscow" city name?

Relative Performance of ElasticSearch on inner fields vs outer fields

All other things being equal, including indexing, I'm wondering if it is more performant to search on fields closer to the root of the document.
For example, let's say we have a document with a customer ID. There are two ways to store this:
{
"customer_id": "xyz"
}
and
{
"customer": {
"id": "xyz"
}
}
Will it be any slower to search for documents where "customer.id = 'xyz'" than to search for documents where "customer_id = 'xyz'"?
That's pure syntactic sugar. The second form, i.e. using object type, will be flattened out and internally stored as
"customer.id": "xyz"
Hence, both forms you described are semantically equivalent as far as what gets indexed into ES, i.e.:
"customer_id": "xyz"
"customer.id": "xyz"

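To illustrate the flattening described in the answer above, here is a small sketch of the idea. It is not Elasticsearch's actual code, the flatten helper is hypothetical, and it only mirrors the plain object type, not nested fields.

```python
def flatten(doc, prefix=""):
    """Turn nested objects into dotted field paths."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into inner objects, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


print(flatten({"customer": {"id": "xyz"}}))  # {'customer.id': 'xyz'}
print(flatten({"customer_id": "xyz"}))       # {'customer_id': 'xyz'}
```

Both documents end up as a single top-level field name mapped to the same value. Only the name differs.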