ElasticSearch: is it possible to highlight words in the query rather than the results - elasticsearch

We use ElasticSearch in a reverse manner from what I usually see. We store lots of small documents, usually 1 or 2 words, for example, Job Titles like "software engineering", "car mechanics", "architect", etc.
Then we query with a longer string, for example a 1000 word Job Spec. This way we get all Job Titles present in the text of the Job Spec.
It works well. But I was wondering whether I could get ElasticSearch to highlight the matching Job Titles in the Job Spec, i.e. highlight the results in the query. I have tried the highlight keyword, but it doesn't highlight the query text, it highlights the results. I'm not sure how to get the query to be returned in the ElasticSearch response, let alone whether it can be highlighted.
You might wonder why I need ElasticSearch to highlight the query, can't I just pick out all the results from the text and highlight them myself? Yes I can, but there's various things to think about that makes it hard such as stemming and stopword removal. for example "jquery" is stemmed to "jqueri" when doing the tokenising in ElasticSearch, so it's found as a result, but if I want to highlight it myself, I have to unstem it so it matches the original text. Elasticsearch also removes symbols, so terms & conditions would become terms conditions which is problematic if I want to highlight it manually as I have to add back the "&" symbol. There's a hundred other problem cases, hence the question about whether ElasticSearch can do it for me.

I'm quite sure highlighting the query string isn't possible - only highlighting parts of documents in an index.
What you might try is indexing the query string itself in it's own index and then using the results of the first query as the query terms for a second query against the query string (in the second index). You could then have highlighting on the query string. You'll have to make an extra request to ES each time, but I think it'll get what you want.

Related

Elasticsearch show what is matched from a query

I'm implementing a sort of "natural language" search assistant. I have a form with a number of select fields. The list of options in each field can be pretty lengthy. So rather than having to select each item individually, I'm adding a text input box where people can just type what they're looking for and the app will suggest possible searches, based on the options in the select dropdowns.
Let's say my options are:
Color: red, blue, black, yellow, green
Size: very small, kinda medium, super large
Shape: round, square, oblong, cylindrical
Year: 2007, 2008, 2009, 2010
If you typed in "2007 very small star-spangled", the text input would suggest "Search all 2007 very small widgets for 'star-spangled'". It understood that "2007" and "very small" were select options in the form, and that "star-spangled" was not, and suggested a search where "2007" and "very small" are selected, and then left the "star-spangled" bit for a plaintext search.
What I'm working on right now is parsing the search query and picking out the bits that fit into the select fields. I have all the options in Elasticsearch. I was thinking of searching each type individually to see if it matches anything in the search query. That seems straightforward to me. I can easily find matches. However, I don't know which part of the query actually matches each type, which I need in order to find out that e.g. "star-spangled" is the part that didn't match options.
So, in the end, I need to know that only the "2007" substring matched the year, only the "very small" substring matched the size, and "star-spangled" didn't match anything.
My first thought is to split the query into word-grams (e.g. "2007", "2007 very", "2007 very small", "2007 very small star-spangled", "very", "very small", "very small star-spangled", "small", "small star-spangled", "star-spangled") and search each option for each gram. Then I would know for sure which gram matched. However, this could obviously get resource intensive pretty quickly. Also, I know Elasticsearch can do that sort of search internally much faster.
So what I really need is to be able to perform a search and, along with the results, get back which part of the original query actually matched. So if I searched, "2007 verr small" (intentional misspelling) and did a fuzzy search of sizes, passing the entire query string, and I get the "Very Small" size back as a result, it would indicate that "verr small" is the part of the query that matched that size.
Any idea of how to do that? Or possibly some other solutions?
I could do the search and parse the results to see which bits match the string. Though I could see that being resource intensive as well. And if I'm doing a fuzzy search, it wouldn't necessarily be clear which part of the query triggered a match in the result.
I was also thinking that highlighting might work for this, but I don't know enough about Elasticsearch to know for sure.
EDIT: I tested this out using highlighting. It's so close to working. The highlight field comes back with the part of the string that matches. However, it only shows the part of the result that matches. It doesn't show the part of the query that matches. So if I want to allow for fuzzy searches, the highlight field won't match the original query and I won't be able to tell which part of the query matched. For example, a query of "very smaal" will return the size "Very Small", but the highlight field will show <em>very</em> <em>small</em>, not <em>very</em> <em>smaal</em>.
There are 2 types of queries in Elasticsearch, Match Query and Filtered Query. Match query matches your term in the documents and find all the relevant documents with a relevance score. For example when you search for term: "help fixing javascript problem" you are interested in all documents which contain one or more of the search term.
On the other hand, when you are using Filtered Query, a document is either a match or not match... there is no relevance score here... as an example, you want all the products built in year "2007"... here you need to use a filtered query. All the product built in 2007 have the same score and all other years are excluded from the result.
In my opinion, your problem should be dealt with Filter Query...
When using filter query, normally each filter has its own corresponding input in the UI, consider the following screen-shot which is from ebay:
If I have understood your requirement correctly, you want to include all those filters in a single search-box. In my opinion, this is nearly impossible to implement because you have no way to parse user input and decide which word corresponds to which filter...
If you want to go down the filter path, it's better to introduce corresponding UI fields for each filter...
If you want to stick to a single search box, then don't implement the filter functionality and stick to Elasticsearch Multi-match query... you can match the input term across multiple fields but you won't be able to filter out (exclude) result instead you get a relevance score.

Elastic search giving strange results

I am following this tutorial on elastic search.
Two employees have 'about' value as:
"about": "I love to go rock climbing"
"about": "I like to collect rock albums"
I run following query:
GET /megacorp/employee/_search {"query":{"match":{"about":"rock coll"}}}
Both above entries are returned, but surprisingly wit same score:
"_score": 0.2876821
Shouldn't the second one must have higher score as it has 'about' value containing both 'rock' and 'coll' while first one only contains 'rock'?
That totally depends on what analyzer you are using. if you are using standard or english analyzer this result is correct. I recommend you to spend some time working with elasticsearch's Analyze API to get familiar how each analyzer affect your text.
By the way, if you want second document to have higher score, take a look at Partial matching.
When we search on a full-text field, we need to pass the query string through the same analysis process as we have when we index a document, to ensure that we are searching for terms in the same form as those that exist in the index.
Analysis process usually consists of normalization and tokenization (the string is tokenized into individual terms by a tokenizer).
As for match Query:
If you run a match query against a full-text field, it will analyze the query string by using the correct analyzer for that field before executing the search. It just looks for the words that are specified.
So, in your match query Elasticsearch will look for occurrences of the whole separate words: rock or/and coll.
Your 2nd document doesn't contain a separate word coll but was matched by the word rock.
Conclusion: the 2 documents are equivalent in their _score value (they were matched by the same word rock)
Elasticsearch analyzes each text field before storing it. The default analyzer (standard analyzer) splits the text based on whitespaces and lowercases it. The output of analysis process is a list of tokens which are used to match your query tokens. If any of the tokens match exactly the relevant document is returned. That's being said, your second document doesn't contain the token col and that's why you are having the same score for both documents.
Even if you build your custom analyzer and use stemming, the word collect won't be stemmed as coll.
You can build custom analyzers in which you can specify that tokens should be of length 1 character, then Elasticsearch will consider each single character as a token and you can search for the existence of any character in your documents.

Get top 10 most used words in text fields

I have an index containing thousands of documents, each one of them having a full text field.
I want to search through all those fields and fetch the 10 most common words that come back most often.
I would also like a way of visualizing it on Kibana if that's possible.
The most common way to achieve that is to duplicate your full text field with a keyword datatype. That will get you able to make terms aggregation on that field - doc here. Maybe you could consider to do a significant term aggregation - doc here, thus to avoid the presence of stopwords and common words. In ES 6.x you could use also the significant text aggregation - doc here, without create the keyword field, but i never try it, i don't know how it works. Instead if you need to retrieve the frequency of the words for each document, you should use the termvector - doc here

Kibana 4 - Why does my simple query return correct results when using .raw but not without?

I'm trying out Elasticsearch/Kibana 4 and while my simple query:
program.raw:"MYAPPLICATION" AND entityId.raw:"12345-67N"
will return the results I want - i.e. result posts having the program and entityId field and containing the queried terms straight off, as I want.
However, I guess the right way to query this would be:
program:"MYAPPLICATION" AND entityId:"12345-67N"
but that gives my correct results only regarding the program query, and then a lot of hits on terms containing N or n. The entityId-part seems to only query on N?. I'm confused, please explain this. I've read up on the Lucene query syntax and can't find anything explaining this.
The .raw fields are setup by logstash as "not_analyzed" fields in elasticsearch. As such, they are not split into tokens and can be used intact.
To elasticsearch, entityId really looks like ['12345', '67n'], which is why your query doesn't match.
Note that, in your example, program:myapplication should work (since there are no special characters). Lowercase is automatic, IIRC.

Elasticsearch multi term search

I am using Elasticsearch to allow a user to type in a term to search. I have the following property 'name' I'd like to search, for instance:
'name': 'The car is black'
I'd like to have this document returned if the following is used to search black car or car black.
I've tried doing a bool must and doing multiple terms ['black', 'car'] but it seems like it only works if the entire string is a match.
So what I'd really like to do is more of a, does the term contain both words in any order.
Can someone please get me on the right track? I've been banging my head on this one for a while.
If it seems like it only works if the entire string is a match, first make sure that in index mapping your string property name is analysed, i.e. mapping for this property doesn't contain "index": "not_analyzed". If it isn't so, you'll need to reindex your index in order to be able to search for tokens rather than for the whole phrase only.
Once you're sure your strings are analysed you can use:
Terms query with "minimum_should_match" parameter equalling to the number of words entered.
Bool query with must clause containing term queries per each word.
Common terms query which has a nice clean syntax for this purpose (you don't need to break down input string and construct more complex query structure in your app like with previous two) in addition to taking a smarter approach to stopwords analysing.

Resources