How can I influence an Elasticsearch autosuggest query to return an exact match first?

Consider the following job titles that are indexed into 3 separate documents:
[ "Software Developer Analyst, Senior",
"Software Developer and Analyst - iOS, iPad, . Net",
"Software Developer" ]
In the real world, we have hundreds of variations of "software developer", so if the autocomplete only returns 10 documents, the exact match is likely buried in noise.
Is it possible to do any sort of ordering to prefer exact matches?
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html

The completion suggester uses an FST (a special in-memory data structure that is built at index time) for fast searches, so the results of future queries can only be influenced at index time (see the issue related to your question on GitHub).
In your case you can add a context to your suggester. For example, it could be a category field that contains the features of a particular software developer job. These features have to be extracted somehow from the data being indexed:
PUT jobs/_doc/1
{
  "suggest": "Software Developer and Analyst - iOS, iPad, . Net",
  "category": ["apple", "ios", "ipad", "dotnet"]
}
At query time, you should try to extract the same kind of features from the user input before sending it to ES. For example, if the user types "java senior software developer", you transform this input into the query
POST jobs/_search
{
  "suggest": {
    "jobs_suggestion": {
      "prefix": "java senior software developer",
      "completion": {
        "field": "suggest",
        "size": 10,
        "contexts": {
          "category_context": [ "java", "senior" ]
        }
      }
    }
  }
}
Of course, this approach requires complex preliminary analysis of index data and search queries.
Another option is to assign weights to job titles at index time, but in my opinion it does not fit your case.
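The query-time step described above (pulling known features out of the user's input and passing them as contexts) could be sketched roughly as follows. This is only an illustration in Python: the KNOWN_FEATURES vocabulary and the build_suggest_request helper are made up for the example, and it assumes the features are simple lowercase tokens that were also written to the category field at index time.

```python
# Hypothetical vocabulary of features that were also written to the
# "category" field at index time (an assumption, not from the original post).
KNOWN_FEATURES = {"java", "senior", "ios", "ipad", "dotnet", "apple"}


def build_suggest_request(user_input, known_features=KNOWN_FEATURES, size=10):
    """Build a completion-suggester body with contexts taken from the input."""
    contexts = []
    for token in user_input.lower().split():
        # Keep the input order and drop duplicates.
        if token in known_features and token not in contexts:
            contexts.append(token)

    completion = {"field": "suggest", "size": size}
    if contexts:
        # "category_context" is the context name used in the query above.
        completion["contexts"] = {"category_context": contexts}

    return {
        "suggest": {
            "jobs_suggestion": {
                "prefix": user_input,
                "completion": completion,
            }
        }
    }


body = build_suggest_request("java senior software developer")
print(body["suggest"]["jobs_suggestion"]["completion"])
```

The resulting dict can be sent as the body of POST jobs/_search. When no known feature is found the sketch simply leaves contexts out; depending on the Elasticsearch version, a context-enabled field may not accept that, so that case may need its own handling.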

Related

Elasticsearch re-index all vs join

I'm pretty new to Elasticsearch and all its concepts. I would like to understand how I could accomplish what I have in my relational DB in an Elasticsearch architecture.
The scenario is the following:
I have an index "data":
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1",
  "categories": ["A", "A1", "B"]
}
The requirement says that data can be queried by:
a text search in the content field
that belongs to a specific type or category
So far, so simple, so good.
The data will not be complete at creation time. New categories may be added to or removed from the data later, so many data uploads/re-indexes might happen along the way.
For example:
Create the data:
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1",
  "categories": ["A"]
}
Then it was decided that all data with type=T1 must belong to both A & B categories.
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1",
  "categories": ["A", "B"]
}
If I have a billion hits for type=T1, I would have to update/re-index a billion entries. Maybe that is how things should work, and this is where my question lands.
Is it OK to re-index all the data just to add/remove a category, or would it be possible to have a second, much smaller index just for this association and somehow join both indexes at query time?
Something like this:
Data:
{
  "id": "00001",
  "content": "some text here ..",
  "type": "T1"
}
DataCategories:
{
  "type": "T1",
  "categories": ["A", "B"]
}
Is it acceptable/possible?
This is a common scenario - but unfortunately, there is no 1:1 mapping for RDBMS features in text search engines like Lucene/Elasticsearch.
Possible options:
1 - For the best query performance, reindex. It may not be practical, depending on how often your data changes.
2 - Consider parent-child. It's a slower option, but it will often meet performance requirements. The category could be a parent document, each having several thousand children.
3 - If it's only category renaming, consider storing IDs for the categories and translating them to text in the application.
4 - Update the documents in place. Whether that is practical depends on the number of documents to be updated: for a few thousand, run an update query; for more, reindex.
Suggested reading - https://www.elastic.co/blog/managing-relations-inside-elasticsearch
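As a rough illustration of option 4, the body of an update-by-query request that adds category "B" to every document of type T1 might be built like this. It is a sketch only: the build_add_category_request helper is invented for the example, and it assumes type is mapped as a keyword field so that a term query matches it exactly.

```python
def build_add_category_request(doc_type, category):
    """Body for POST data/_update_by_query: add `category` to docs of `doc_type`."""
    return {
        # Assumes "type" is a keyword field, so term matches the exact value.
        "query": {"term": {"type": doc_type}},
        "script": {
            "lang": "painless",
            "source": (
                "if (ctx._source.categories == null) {"
                " ctx._source.categories = [params.category];"
                " } else if (!ctx._source.categories.contains(params.category)) {"
                " ctx._source.categories.add(params.category);"
                " } else { ctx.op = 'noop'; }"  # already tagged: skip the write
            ),
            "params": {"category": category},
        },
    }


body = build_add_category_request("T1", "B")
print(body["query"])
```

Every matching document is still rewritten internally, so for a billion hits this is no cheaper than a reindex; it only saves shipping the documents through the client.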

Finding all words and their frequencies in an elasticsearch index

Elasticsearch Newbie here. I have an elasticsearch cluster and an index http://localhost:9200/products and each product looks like this:
{
  "name": "laptop",
  "description": "Intel Laptop with 16 GB RAM",
  "title": "...."
}
I want all keywords in a field and their frequencies across all documents in an index. For example:
description : intel -> 2500, laptop -> 40000 etc. I looked at termvectors https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but that only lets me do it for a single document. I want it across all documents for a particular field.
I wrote a plug-in for this, but it's an expensive call (depending on how many terms you want to get and on the cardinality of the terms): https://github.com/nirmalc/es-termstat
Currently, there is no way to get term vectors for all documents in an index at once. You can either use the term vectors API for a single document's term frequency counts or the multi term vectors API for the term frequencies of several documents. A possible workaround could look like this:
make a scan (scroll) request to get all documents of a given type, and for each page build a multi term vectors request like the one below to get the term vectors.
POST /products/_mtermvectors
{
  "ids": ["1", "2"],
  "parameters": {
    "fields": [
      "description"
    ],
    "term_statistics": true
  }
}
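The client-side half of this workaround, adding up the per-document counts from each page of results, might look like the sketch below. The add_term_frequencies helper and the sample data are made up for illustration; the sample is hand-written in the shape of a _mtermvectors response (docs, then term_vectors, then the field, then terms with a term_freq each).

```python
from collections import Counter


def add_term_frequencies(totals, mtermvectors_response, field):
    """Add the per-document term_freq values for `field` into `totals`."""
    for doc in mtermvectors_response.get("docs", []):
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats.get("term_freq", 0)
    return totals


# Hand-written sample shaped like one _mtermvectors response (illustrative data).
sample_page = {
    "docs": [
        {"_id": "1", "term_vectors": {"description": {"terms": {
            "intel": {"term_freq": 1},
            "laptop": {"term_freq": 1},
        }}}},
        {"_id": "2", "term_vectors": {"description": {"terms": {
            "laptop": {"term_freq": 2},
            "ram": {"term_freq": 1},
        }}}},
    ]
}

totals = Counter()
add_term_frequencies(totals, sample_page, "description")
print(totals.most_common())
```

In a real run the function would be called once per page of IDs, with the same Counter passed in each time, so the totals accumulate across the whole index.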

Elasticsearch Multi Index Query and Filter

I have 2 indexes, one that stores data about an event and one that stores the availability of that event. I am trying to create a single query that gets events by a query but only returns ones that are available, and I am having difficulty doing so.
The events index stores
{
  "id": "152ce52d-e975-4ebd-849a-0a12f535e644",
  "createdAt": 1.5519999143126902E12,
  "description": "A very not so concise description",
  "geoHash": "dnh00x6x5",
  "name": "a name",
  ...etc...
}
The availability index stores availability like so:
{
  "eventId": "152ce52d-e975-4ebd-849a-0a12f535e644",
  "maxGuests": 8,
  "availability": {
    "lte": "2019-10-18T22:15:00.000Z",
    "gte": "2019-10-18T02:30:00.000Z"
  }
}
I am trying to create a query like below, but what I can't figure out is how to filter by listings that meet the criteria in the events index AND are available in the availability index.
GET events,availability/_search
{
  "size": 5,
  "from": 0,
  "_source": [
    "id"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "geo_distance": {
            "distance": "25mi",
            "geoHash": {
              "lat": 34.0389,
              "lon": -84.3826
            }
          }
        }
      ],
      "should": [],
      "filter": [
        {
          "range": {
            "availability": {
              "gte": "2019-10-31",
              "lte": "2020-11-01",
              "relation": "within"
            }
          }
        }
      ]
    }
  }
}
--
The reason I want to only do one query is that the client is expecting a certain specified number of events. If I filter out the unavailable events after I get the event data then I am likely to be left with fewer events than the client expected and would need to do yet another search to fill the gap.
Also, of course, I could merge the two indices so that the event also stores the availability info, but I originally set them up this way because the availability info may have hundreds or thousands of entries per event.
What you want to accomplish is the equivalent of a SQL join on a foreign key. There is no way to get exactly that, meaning to filter documents from index A by querying index B. Your options are:
1 - As you've mentioned, solve it at the application level (although this causes other problems for you, so it's not really a solution).
2 - Merge the data into one index and duplicate the event information. Although it seems expensive, duplication of data is to be expected in a NoSQL database. If you need a relational model, then maybe you should use a SQL solution.
3 - Use parent/child (the join datatype). The catch is that all of the data has to live in the same index, and parent and child documents are stored on the same shard.
One approach (a bit more complex, though) that I believe would work for you is the nested datatype, which is really a more compact form of option 2 (combine your data in one index, but store the root information only once). Put events at the root and the availability entries underneath as nested objects. To add an availability entry you can use the update API, and when you query you can search by both the root fields and the nested fields. If you need to retrieve the specific availability entries that matched for an event, you can use inner hits.
Multi-index search, which is what you are trying now, will not join your data automatically, so it will not work. Elasticsearch doesn't work that way, and the relational model is not suited to this product.
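To make the nested approach a little more concrete, here is a sketch of how the combined query could be built. It rests on assumptions that are not in the question: that each event document carries a nested field, here called availabilities, whose entries each hold an availability date range, and that geoHash is mapped as a geo_point. The build_available_events_query helper is also just a name for the example.

```python
def build_available_events_query(lat, lon, distance, start, end, size=5):
    """Search body for events near a point with a matching nested availability."""
    return {
        "size": size,
        "_source": ["id"],
        "query": {
            "bool": {
                "must": [
                    {"geo_distance": {
                        "distance": distance,
                        "geoHash": {"lat": lat, "lon": lon},
                    }}
                ],
                "filter": [
                    {"nested": {
                        # "availabilities" is a hypothetical nested field name.
                        "path": "availabilities",
                        "query": {"range": {
                            "availabilities.availability": {
                                "gte": start,
                                "lte": end,
                                "relation": "within",
                            }
                        }},
                        # Return the availability entries that matched.
                        "inner_hits": {},
                    }}
                ],
            }
        },
    }


body = build_available_events_query(34.0389, -84.3826, "25mi",
                                    "2019-10-31", "2020-11-01")
print(body["query"]["bool"]["filter"][0]["nested"]["path"])
```

Because the availability filter is part of the same query, size applies to events that are already known to be available, which is what the client-side filtering could not guarantee.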
One last thing, it's a good thing to plan ahead, but it's a bad thing to try to optimize early on.
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
An interesting read that summarizes the above

How to give more weight to specific keywords while searching for similar text using elasticsearch?

I am using elasticsearch to get relevant blog articles from a database of articles. I want results that contain particular words to be given a higher score than results that do not contain them.
I have tried adding stop words and giving more weight to other fields, but the results are not quite as expected. I am using the developer mode of the Kibana interface of elasticsearch.
GET blog-desc/_search
{
  "query": {
    "more_like_this": {
      "fields": ["Meta description", "Title^5", "Short title^0.5"],
      "like": "Harry had a silver wand he likes to play with! Among his friends he has the most expensive one. The only difference between his wand and his sister's is that in the color",
      "min_term_freq": 1,
      "max_query_terms": 12,
      "minimum_should_match": "30%",
      "stop_words": ["difference", "play", "among"],
      "boost_terms": 1
    }
  }
}
In the sample code above, I would want search results that contain the word "silver" to be given a higher score than articles that do not contain that word.

Boosting in Elasticsearch

I am new to elasticsearch. In elasticsearch we can use the term boost in almost all queries. I understand it's used to modify the score of documents, but I can't find an actual use for it. My question is: if I use boost values in some queries, will it affect the final score of the search, or the boost rank of the docs in the index itself?
And what is the main difference between boosting at index time and boosting at query time?
Thanks in advance!
Query time boost allows you to give more weight to one query than to another. For instance, let's say you are querying the title and body fields for "Quick Brown Fox", you could write it as:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": "Quick Brown Fox"
          }
        },
        {
          "match": {
            "body": "Quick Brown Fox"
          }
        }
      ]
    }
  }
}
But you decide that you want the title field to be more important than the body field, which means you need to boost the query on the title field by (eg) 2:
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "Quick Brown Fox",
              "boost": 2
            }
          }
        },
        {
          "match": {
            "body": "Quick Brown Fox"
          }
        }
      ]
    }
  }
}
(Note how the structure of the match clause changed to accommodate the boost parameter).
The boost value of 2 doesn't double the _score exactly - the scores go through a normalization process. So you should think of boost as "make this query clause relatively more important than the other query clauses".
My doubt is: if I use boost values in some queries, will it affect the final score of the search?
Yes it does, but you shouldn't rely on the actual value of _score anyway. Its only purpose is to allow Elasticsearch to decide which documents are most relevant to this query. If the query changes, the scores change.
Re index-time boosting: don't use it. It's inflexible and error-prone.
Boosting at query time won't modify your index. It only applies a boost factor to fields when searching.
I prefer boosting at query time as it's more flexible. If you need to change your boost rules and you set them at index time, you will probably need to reindex.
Use case for boosting: suppose you are building an e-commerce web app, and your product data is in Elasticsearch. Whenever a customer uses the search bar, you query Elasticsearch and display the results in the web app.
Elasticsearch computes a relevance score for every matching document and returns the results sorted by that score.
Now let's assume a user searches for "samsung phones". Should your web app show only Samsung phones? The answer is no.
Your web app should show other phones as well (the user may like those too), but it should show Samsung phones first, since that is what the user is looking for.
So the question is: how do you write a query where Samsung phones come up first in the results? The answer is the relevance score.
Let's say you query for all mobile phones and boost the clause that matches Samsung phones, so that those documents get a higher relevance score.
Then the results will contain Samsung phones first, followed by the other phones.
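A query of that shape could be sketched as below. The field names category and brand are assumptions about the product mapping, and build_brand_preference_query is a made-up helper: the must clause keeps every phone in the result set, while the boosted should clause only raises the score of the preferred brand.

```python
def build_brand_preference_query(category, brand, boost=2.0):
    """Match every product in `category`, scoring `brand` matches higher."""
    return {
        "query": {
            "bool": {
                # Every hit has to be in the category, e.g. "phones".
                "must": [{"match": {"category": category}}],
                # Optional clause: documents that also match the brand get
                # extra score, so they sort first without excluding the rest.
                "should": [
                    {"match": {"brand": {"query": brand, "boost": boost}}}
                ],
            }
        }
    }


body = build_brand_preference_query("phones", "samsung")
print(body["query"]["bool"]["should"])
```

Since the boost lives in the query and not in the index, the preferred brand and the boost factor can change from request to request without reindexing.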
