Finding all words and their frequencies in an Elasticsearch index

Elasticsearch newbie here. I have an Elasticsearch cluster with an index at http://localhost:9200/products, and each product looks like this:
{
"name": "laptop",
"description" : "Intel Laptop with 16 GB RAM",
"title" : "...."
}
I want all keywords in a field and their frequencies across all documents in an index. For example:
description : intel -> 2500, laptop -> 40000, etc. I looked at term vectors (https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html), but that only lets me do it for a single document. I want it across all documents for a particular field.

I wrote a plugin for this, but it is an expensive call (the cost depends on how many terms you want to get and on the cardinality of the terms): https://github.com/nirmalc/es-termstat

Currently, there is no way to run term vectors over all documents in an index at once. You can use the term vectors API to get term frequencies for a single document, or the multi term vectors API to get them for several documents at a time. A possible workaround looks like this:
make a scan (scroll) request to fetch all documents of a given type,
and for each page build a multi term vectors request like the one below
to get the term vectors for the documents on that page.
POST /products/_mtermvectors
{
"ids" : ["1", "2"],
"parameters": {
"fields": [
"description"
],
"term_statistics": true
}
}
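A minimal sketch of the client-side part of that workaround, in Python. It only builds the request body for one page of ids and sums `term_freq` over the responses; fetching the ids and sending the requests is left out. The helper names and the sample response are made up for illustration, and the sample is hand-written in the shape the multi term vectors API returns, not real output.

```python
from collections import Counter
from typing import Iterable, List


def mtermvectors_body(ids: List[str], field: str) -> dict:
    """Build the body for POST /<index>/_mtermvectors for one page of ids."""
    return {
        "ids": ids,
        "parameters": {"fields": [field], "term_statistics": True},
    }


def merge_term_counts(responses: Iterable[dict], field: str) -> Counter:
    """Sum term_freq per term over every document in every response."""
    totals: Counter = Counter()
    for response in responses:
        for doc in response.get("docs", []):
            terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
            for term, stats in terms.items():
                totals[term] += stats["term_freq"]
    return totals


# Hand-written sample shaped like an _mtermvectors response (illustrative only).
sample_response = {
    "docs": [
        {"_id": "1", "term_vectors": {"description": {"terms": {
            "intel": {"term_freq": 1},
            "laptop": {"term_freq": 1},
            "ram": {"term_freq": 1},
        }}}},
        {"_id": "2", "term_vectors": {"description": {"terms": {
            "intel": {"term_freq": 1},
            "laptop": {"term_freq": 2},
        }}}},
    ]
}

counts = merge_term_counts([sample_response], "description")
```

Keeping the merge step separate from the HTTP calls means each page of ids can be processed and discarded as it arrives.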

Related

Elastic Search - Sorting & Filtering on nested Documents

I am working on an e-commerce application. Catalog data is served by Elasticsearch.
I have Product documents which are already indexed in Elasticsearch.
A document looks something like this (a few fields are excluded for readability):
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"@timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
}
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
The store object in the above document is where the ES query looks up the price and decides whether the item is in stock or out of stock.
I would like to add more child objects to store (basically, data from multiple inventories). This can go up to more than 150 child objects per product.
Eventually, a product document will look something like this, with data from multiple inventories mapped to a single document.
{
"title" : "Product Name",
"volume" : "200gm",
"brand" : {
"brand_code" : XXXX,
"brand_name" : "Brand Name"
},
"@timestamp" : "2021-08-26T08:08:11.319Z",
"store" : [
{
"physical_unit" : 0,
"default_price" : 115.0,
"_id" : "1234_111",
"product_code" : "1234",
"warehouse_code" : 111,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 125.0,
"_id" : "1234_112",
"product_code" : "1234",
"warehouse_code" : 112,
"available_unit" : 100
},
{
"physical_unit" : 0,
"default_price" : 105.0,
"_id" : "1234_113",
"product_code" : "1234",
"warehouse_code" : 113,
"available_unit" : 100
}
Upto N no of stores
],
"category" : {
"category_code" : 987,
"category_name" : "CategoryName",
"category_url_link" : "CategoryName",
"super_category_name" : "SuperCategoryName",
"parent_category_name" : "ParentCategoryName"
}
}
Functional requirements:
For any product, we should show the lowest price across all warehouses.
For example: if a particular product has 50 stores mapped to it, the Elasticsearch query should look into the nested objects and get the lowest value across all 50 stores where the item is available.
Performance should not be degraded.
Challenges:
If we start storing that many stores for each product, the data size will grow considerably. Will that be a problem?
What would be an efficient way to extract the lowest price from the nested documents?
How would facets work within nested documents? For example, if I apply a price range filter, ES picks up data which was not shown earlier. (It might pick the data from another store that matches the range.)
We are using a template to query ES, and the Elasticsearch version is 6.0.
Thanks in advance!
First, there are improvements to nested document search in version 7.x that are worth the upgrade.
As for version 6.x, there are too many factors involved for me to give you a concrete answer. It also seems you may be misunderstanding the way nested documents work: they are not relational.
In particular, when you say that each product might have 50 stores mapped to it, that sounds like you are implying a relationship, which does not exist with nested documents. Instead, the values from those 50 stores are stored nested under the parent document. Having 50 stores under a product or category does not sound concerning.
Elasticsearch has not really talked in terms of facets since the introduction of the aggregation framework. It's not that they don't exist; it's just not how they are discussed.
So let's try this. Elasticsearch optimizes search and query through a divide-and-conquer mechanism. The data is spread across a configurable number of shards, and each shard is responsible for searching its own data. Further, those shards can be distributed across many machines, so that many CPUs and lots of memory are available for the search. So growing the data doesn't matter if you are willing to grow the cluster, as it is possible to maintain a situation where each machine is doing the same amount of work as before.
Unlike in a relational database, filters and search terms allow Elasticsearch to drastically reduce the data it is looking at, so a larger number of filters tends to improve performance, whereas on a relational database performance declines.
Now back to nested documents. Each nested object is indexed as a separate hidden document, stored alongside its parent in the same index, and a match on a nested document maps back to the parent document's ID. So your nested docs are not exactly part of the parent document, though they are not truly separate either. This means the nested documents should have minimal impact on the performance of queries against the parent documents. But if your data size grows beyond the capacity of your current system, you will still need to increase the cluster size.
As to how you would query: use Elasticsearch aggregations. These allow you to calculate your "facet" counts and identify the best prices. Aggregations are very powerful and very fast. There are caveats that are well documented, but in general they work as you would expect.
In version 6.x, query string queries cannot reach fields inside a nested document, so a nested query must be used instead.
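To make the aggregation idea concrete, here is a rough sketch of a request body for the lowest in-stock price, written as a Python dict. It assumes `store` is mapped with type `nested`, and the aggregation names (`stores`, `in_stock`, `lowest_price`) as well as the helper name are hypothetical. The result is the minimum over all products matched by `query`, so pass a query that selects a single product to get that product's lowest price.

```python
from typing import Optional


def lowest_price_body(query: Optional[dict] = None) -> dict:
    """Search body: min default_price over nested stores that have stock."""
    return {
        "size": 0,  # only the aggregation result is needed
        "query": query or {"match_all": {}},
        "aggs": {
            "stores": {
                # step into the nested store objects
                "nested": {"path": "store"},
                "aggs": {
                    "in_stock": {
                        # keep only stores with units available
                        "filter": {
                            "range": {"store.available_unit": {"gt": 0}}
                        },
                        "aggs": {
                            "lowest_price": {
                                "min": {"field": "store.default_price"}
                            }
                        },
                    }
                },
            }
        },
    }


body = lowest_price_body({"term": {"title": "product"}})
```

The `filter` aggregation sits inside the `nested` aggregation so that the stock check and the price come from the same store object.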
To recap
Functional requirements:
For any product, we should show the lowest price across all warehouses.
For example: if a particular product has 50 stores mapped to it,
the Elasticsearch query should look into the nested objects and get the
lowest value across all 50 stores where the item is available.
Yes, a nested aggregation will do this.
Performance should not be degraded.
Performance will continue to depend on the ratio of the size of the data to the overall cluster size.
Challenges:
If we start storing that many stores for each product, the data size will grow considerably. Will that be a problem?
No, this should not be a problem.
What would be an efficient way to extract the lowest price from the nested documents?
Elasticsearch aggregations.
How would facets work within nested documents? For example, if I apply a price range filter, ES picks up data which was not shown earlier. (It might pick the data from another store that matches the range.)
Yes, filtering works very well with aggregations. The aggregation is computed on the filtered data. In fact, you could have one aggregation for just the minimum price and, in the same query, a second aggregation over your price ranges, which gives you the count of documents that have a store within each price range. You could also add a sub-aggregation showing the stores under each price range.
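A hedged sketch of that price-range "facet" as a request body, again built in Python. It assumes `store` is a `nested` field; the aggregation names and the helpers `price_ranges` and `price_facets_body` are made up for this example. The `reverse_nested` sub-aggregation steps back to the parent level so that the count is in products rather than in store objects, and the `terms` sub-aggregation lists the warehouses in each range.

```python
from typing import List


def price_ranges(bounds: List[float]) -> List[dict]:
    """Turn sorted boundaries such as [100, 120] into open-ended buckets."""
    edges = [None] + list(bounds) + [None]
    ranges = []
    for low, high in zip(edges, edges[1:]):
        bucket = {}
        if low is not None:
            bucket["from"] = low  # inclusive lower bound
        if high is not None:
            bucket["to"] = high  # exclusive upper bound
        ranges.append(bucket)
    return ranges


def price_facets_body(bounds: List[float]) -> dict:
    """Search body: price-range buckets over nested stores."""
    return {
        "size": 0,
        "aggs": {
            "stores": {
                "nested": {"path": "store"},
                "aggs": {
                    "price_ranges": {
                        "range": {
                            "field": "store.default_price",
                            "ranges": price_ranges(bounds),
                        },
                        "aggs": {
                            # number of parent products per price range
                            "products": {"reverse_nested": {}},
                            # which warehouses fall into each range
                            "warehouses": {
                                "terms": {"field": "store.warehouse_code"}
                            },
                        },
                    }
                },
            }
        },
    }


facets = price_facets_body([100, 120])
```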
We are using a template to query ES, and the Elasticsearch version is 6.0. Thanks in advance!
I know nothing about the template. The Elasticsearch API is so dead simple that I do not know why anyone uses additional tools on top of it: they just add weight, increase complexity, and make key features unavailable when the wrapper author did not pass them through.

Elasticsearch: search word forms only

I have a collection of docs, each with a field tags, which is an array of strings. Each string is a word.
Example:
[{
"id": 1,
"tags": [ "man", "boy", "people" ]
}, {
"id": 2,
"tags":[ "health", "boys", "people" ]
}, {
"id": 3,
"tags":[ "people", "box", "boxer" ]
}]
Now I need to query only the docs which contain the word "boy" and its forms ("boys" in my example). I do not need Elasticsearch to return doc number 3, because "box" and "boxer" are not forms of "boy".
If I use a fuzzy query I get all three docs, including doc number 3, which I do not need. As far as I understand, Elasticsearch uses the Levenshtein distance to determine whether a doc is relevant or not.
If I use a match query I get number 1 only, but not both (1 and 2).
I wonder whether there is a way to query docs by word-form matching. Is there a way to make Elasticsearch match "duke", "duchess", "dukes" but not "dikes", "buke", "bike" and so on? The "duke" case is more complicated, but I need to support such cases too.
Could it perhaps be solved with some specific analyzer settings?
With "word-form matching" I guess you are referring to matching morphological variations of the same word. This could be about addressing plural, singular, case, tense, conjugation etc. Bear in mind that the rules for word variations are language specific.
Elasticsearch's implementation of fuzziness is based on the Damerau–Levenshtein distance. It handles mutations (changes, transformations, transpositions) independently of any specific language, based solely on the number of edits.
You would need to change the processing of your strings at indexing and at search time to get the language-specific variations addressed via stemming. This can be achieved by configuring a suitable analyzer for your field that does the language-specific stemming.
Assuming that your tags are all in English, your mapping for tags could look like:
"tags": {
"type": "text",
"analyzer": "english"
}
As you cannot change the type or analyzer of an existing field, you would need to fix your mapping and then re-index everything.
I'm not sure whether "duke" and "duchess" are considered to be the same word (and are therefore addressed by the stemmer). If not, you would need to use a customised analyzer that allows you to configure synonyms.
See also Elasticsearch Reference: Language Analyzers
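For the synonym case, a rough sketch of what such a customised analyzer could look like as an index-creation body, written as a Python dict. The analyzer and filter names (`tags_english`, `tag_synonyms`, `english_stemmer`) and the helper name are hypothetical, the synonym rule is only an example, and the mapping is shown in the typeless form used from 7.x on (on 6.x the properties would sit under a type name).

```python
from typing import Iterable, Optional


def tags_index_body(synonyms: Optional[Iterable[str]] = None) -> dict:
    """Index body: English stemming for tags, with optional synonym rules."""
    filter_defs = {
        "english_stemmer": {"type": "stemmer", "language": "english"},
    }
    chain = ["lowercase"]
    if synonyms:
        # map irregular forms onto one term before stemming
        filter_defs["tag_synonyms"] = {
            "type": "synonym",
            "synonyms": list(synonyms),
        }
        chain.append("tag_synonyms")
    chain.append("english_stemmer")
    return {
        "settings": {
            "analysis": {
                "filter": filter_defs,
                "analyzer": {
                    "tags_english": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": chain,
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                "tags": {"type": "text", "analyzer": "tags_english"}
            }
        },
    }


index_body = tags_index_body(["duchess => duke"])
```

Because the same analyzer is used at index and search time, a plain match query on tags is then enough; no fuzziness is needed.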

Search After (pagination) in Elasticsearch when sorting by score

The search_after values in Elasticsearch must match the sort parameters in count and order. So I was wondering how to get the score from the previous result (for example, page 1) to use it as the search_after value for the next page.
I ran into an issue when using the score of the last document from the previous search. The score was 1.0, and since all documents have a score of 1.0, the result for the next page turned out to be empty.
That actually makes sense, since I am asking Elasticsearch for results that rank lower than 1.0, of which there are none. So which score do I use to get the next page?
Note:
I am sorting by score and then by TieBreakerID, so one possible solution is to use a high value (say 1000) for the score.
What you're doing sounds like it should work, as explained by an Elastic team member. It works for me (in ES 7.7) even with tied scores when using the document ID (copied into another indexed field) as a tiebreaker. It's true that indexing additional documents while paginating will make your scores slightly unstable, but probably not enough to cause a significant problem for an end user. If you need it to be reliable for a batch job, the Scroll API is the better choice.
{
"query": {
...
},
"search_after": [
12.276552,
14173
],
"sort": [
{ "_score": "desc" },
{ "id": "asc" }
]
}
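When a search specifies a sort, each hit in the response carries a `sort` array holding that hit's sort values, so the next page can be requested by copying the last hit's array into search_after instead of guessing a score. A small Python sketch of that step; the helper name and the sample hit are invented for illustration.

```python
import copy


def next_page_body(body: dict, last_hit: dict) -> dict:
    """Return a copy of the search body that continues after last_hit."""
    page = copy.deepcopy(body)
    # sort values of the last hit, in the same order as the "sort" clause
    page["search_after"] = last_hit["sort"]
    return page


first_page = {
    "query": {"match_all": {}},
    "sort": [{"_score": "desc"}, {"id": "asc"}],
}

# Hand-written example of the last hit on page 1 (tied score of 1.0).
last_hit = {"_id": "14173", "_score": 1.0, "sort": [1.0, 14173]}

second_page = next_page_body(first_page, last_hit)
```

With a tied score, the second value (the tiebreaker) is what moves the cursor forward.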

Elasticsearch: Multiple partial words not scored high enough

I'm trying to get good search results out of an Elasticsearch installation.
But I run into problems when I try to do a fuzzy search on some very simple data.
Somehow queries with multiple words (some of them partial) are scored too low, and only get scored higher when more letters of the partial word are present in the search query.
Let me explain:
I have a simple index built with two simple documents.
{
"name": "Product with good qualities and awesome sound system"
},
{
"name": "Another Product that has better acustics than the other one"
}
Now I query the index with these parameters:
{
"query": {
"multi_match": {
"fields": ["name"],
"query": "product acust",
"fuzziness": "auto"
}
}
}
And the results look like this:
"hits": [
{
"_index": "test_products",
"_type": "_doc",
"_id": "1",
"_score": 0.19100355,
"_source": {
"name": "Product with good qualities and awesome sound system"
}
},
{
"_index": "test_products",
"_type": "_doc",
"_id": "2",
"_score": 0.17439455,
"_source": {
"name": "Another Product that has better acustics than the other one"
}
}
]
As you can see, the product with ID 2 is scored lower than the other product, even though it arguably has more similarity with the given query string, because it has 1 full word match and 1 partial word match.
If the query were "product acusti", the results would start to behave correctly.
I've already fiddled around with a bool query, but the results are identical.
Any ideas how I can get the wanted results sooner, without having to type in almost the whole second word?
As far as I know, Elasticsearch does not do partial word matching by default, so the term acust is not matched in either of your documents.
The reason you are getting a higher score for the first document is that your matched term, product, appears in a shorter sentence:
Product with good qualities and awesome sound system
In the second document, product appears in a longer sentence:
Another Product that has better acustics than the other one
So your second document gets a lower score because the ratio of your matched term (product) to the number of terms in the sentence is lower.
In other words, it has a lower field-length normalization factor:
norm = 1/sqrt(numFieldTerms)
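As a quick worked example of that formula on the two sentences from the question (a sketch only: the terms are counted by splitting on whitespace, which happens to match what the standard analyzer would produce here, and this is the classic TF-IDF norm; BM25, the default similarity since Elasticsearch 5.0, also penalizes longer fields but with a different formula):

```python
import math


def field_length_norm(num_field_terms: int) -> float:
    """Classic TF-IDF field-length norm: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_field_terms)


doc1 = "Product with good qualities and awesome sound system"
doc2 = "Another Product that has better acustics than the other one"

norm1 = field_length_norm(len(doc1.split()))  # 8 terms
norm2 = field_length_norm(len(doc2.split()))  # 10 terms

print(f"{norm1:.4f} {norm2:.4f}")  # prints "0.3536 0.3162"
```

The single match on product therefore counts for a little more in the shorter field, which is consistent with document 1 scoring slightly higher.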
Now, if you want to be able to do partial prefix matching, you need to tokenize your terms into edge n-grams. For example, you can create the following n-grams for the term "acoustics":
"ac", "aco", "acou", "acous", "acoust", "acousti", "acoustic", "acoustics"
There are 2 options to achieve this; see the answer by Russ Cam on this question:
1. Use the Analyze API with an analyzer that will tokenize the field into tokens/terms from which you would want to partial prefix match, and index this collection as the input to the completion field. The Standard analyzer may be a good one to start with...
2. Don't use the Completion Suggester here and instead set up your field (name) as a text datatype with multi-fields that include the different ways that name should be analyzed (or not analyzed, with a keyword sub field for example). Spend some time with the Analyze API to build an analyzer that will allow for partial prefix of terms anywhere in the name. As a start, something like the Standard tokenizer, Lowercase token filter, Edgengram token filter and possibly Stop token filter would get you running...
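A rough sketch of the second approach (a text field with an edge n-gram sub-field), in Python. `edge_ngrams` only imitates what an edge n-gram token filter emits for one term, and `name_index_body` builds a possible index-creation body. The names `name_prefix`, `name_edge_ngram` and the `prefix` sub-field are hypothetical, the gram sizes are arbitrary starting values, and the mapping is in the typeless 7.x form.

```python
from typing import List, Optional


def edge_ngrams(term: str, min_gram: int = 2,
                max_gram: Optional[int] = None) -> List[str]:
    """Prefixes of term, from min_gram up to max_gram characters long."""
    upper = len(term) if max_gram is None else min(max_gram, len(term))
    return [term[:n] for n in range(min_gram, upper + 1)]


def name_index_body(min_gram: int = 2, max_gram: int = 15) -> dict:
    """Index body: name plus a name.prefix sub-field with edge n-grams."""
    return {
        "settings": {"analysis": {
            "filter": {"name_edge_ngram": {
                "type": "edge_ngram",
                "min_gram": min_gram,
                "max_gram": max_gram,
            }},
            "analyzer": {"name_prefix": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "name_edge_ngram"],
            }},
        }},
        "mappings": {"properties": {"name": {
            "type": "text",
            "fields": {"prefix": {
                "type": "text",
                "analyzer": "name_prefix",
                # do not n-gram the query itself, only the indexed terms
                "search_analyzer": "standard",
            }},
        }}},
    }


grams = edge_ngrams("acustics")
```

With such a sub-field, a query for "acust" against name.prefix can match the indexed term "acustics" because "acust" is one of its stored prefixes.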
You may also find this guide helpful.

How can I query/filter an elasticsearch index by an array of values?

I have an Elasticsearch index with numeric category ids like this:
{
"id": "50958",
"name": "product name",
"description": "product description",
"upc": "00302590602108",
"categories": [
"26",
"39"
],
"price": "15.95"
}
I want to be able to pass an array of category ids (a parent id with all of its children, for example) and return only results that match one of those categories. I have been trying to get it to work with a term query, but no luck yet.
Also, as a new user of Elasticsearch, I am wondering if I should use a filter/facet for this...
ANSWERED!
I ended up using a terms query (as opposed to term). I'm still interested in knowing whether there would be a benefit to using a filter or facet.
As you already discovered, a terms query works. I would suggest a terms filter though, since filters are faster and cacheable.
Facets won't limit the results, but they are excellent tools. They count, within your total results, the hits for specific terms, and can be used for faceted navigation.
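For reference, a minimal sketch of the filter variant as a request body, in Python. It assumes a reasonably recent Elasticsearch version, where the former standalone filters are expressed as the filter clause of a bool query; the helper name is made up, and the ids are passed as strings because that is how they appear in the example document.

```python
from typing import Iterable


def category_filter_body(category_ids: Iterable[str]) -> dict:
    """Search body matching docs that have at least one of the category ids."""
    return {
        "query": {
            "bool": {
                # filter context: no scoring, results can be cached
                "filter": [
                    {"terms": {"categories": list(category_ids)}}
                ]
            }
        }
    }


body = category_filter_body(["26", "39"])
```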
