Elasticsearch - use a "tags" index to discover all tags in a given string

I have an Elasticsearch v2.x cluster with a "tags" index that contains about 5000 tags: {tagName, tagID}. Given a string, is it possible to query the tags index to get all tags that are found in that string? I want not only exact matches but also fuzzy matches, without being too generous. By "too generous" I mean a tag should only match if all tokens in the tag are found within a certain proximity of each other (say 5 words).
For example, given the string:
Model 22340 Sound Spectrum Analyzer
the following tags should match:
sound analyzer
sound spectrum analyzer
BUT NOT:
sound meter
light spectrum
chemical analyzer

I don't think it's possible to create an accurate Elasticsearch query that will auto-tag a random string. That is essentially a reverse query: the most accurate way to match a tag to a document is to construct a query for the tag and then search the document with it. Obviously this would be terribly inefficient if you had to iterate over every tag to auto-tag a document.
To do a reverse query, you want to use the Elasticsearch Percolator API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
The API is very flexible and allows you to create fairly complex queries into documents with multiple fields.
The basic concept is this (assuming your tags have an app-specific ID field):
1. For each tag, create a query for it and register the query with the percolator, using the tag's ID as the query's ID.
2. To auto-tag a string, pass your string (as a document) to the percolator, which will match it against all registered queries.
3. Iterate over the matches. Each match includes the _id of the query; use the _id to reference the tag.
This is also a good article to read: https://www.elastic.co/blog/percolator-redesign-blog-post
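A minimal sketch of that flow against an ES 2.x cluster (the doc mapping type, the body field, the tag text, and the tag ID 123 are all illustrative; the tags index needs a mapping for body so the percolator can parse the registered queries):

PUT /tags/.percolator/123
{
  "query": {
    "match": {
      "body": {
        "query": "sound spectrum analyzer",
        "operator": "and"
      }
    }
  }
}

GET /tags/doc/_percolate
{
  "doc": {
    "body": "Model 22340 Sound Spectrum Analyzer"
  }
}

The response lists the _id of every registered query that matched, which you can map straight back to your tags.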

"query": {
"match": {
"tagName": {
"query": "Model 22340 Sound Spectrum Analyzer",
"fuzziness": "AUTO",
"operator": "or"
}
}
}
If you want an exact match, so that "sound meter" will not match, you will have to add another field to each tag containing the number of terms in the tag name, add a script to count the terms in the query, and compare the two in the match query; see Finding Multiple Exact Values.
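The terms-count half of that recipe can be stored at index time with the token_count field type; a sketch (modern mapping syntax, field names illustrative), so that tagName.length holds the number of tokens in each tag, ready to be compared against the query's term count:

PUT tags
{
  "mappings": {
    "properties": {
      "tagName": {
        "type": "text",
        "fields": {
          "length": {
            "type": "token_count",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}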
Regarding the proximity issue: since you require fuzziness, you cannot control the proximity, because the match_phrase query is not integrated with fuzziness, as stated in the Elastic docs on fuzzy match queries:
Fuzziness works only with the basic match and multi_match queries. It doesn't work with phrase matching, common terms, or cross_fields matches.
So you need to decide: fuzziness vs. proximity.
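If you give up fuzziness and keep proximity, a minimal sketch of the proximity side: a match_phrase query with slop, shown here matching the tag name "sound spectrum analyzer" with the looser phrase "sound analyzer" (slop bounds how far apart the terms may appear):

GET tags/_search
{
  "query": {
    "match_phrase": {
      "tagName": {
        "query": "sound analyzer",
        "slop": 5
      }
    }
  }
}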

Of course you can. You can achieve what you want using just a match query with the standard analyzer:
curl -XGET "http://localhost:9200/tags/_search?pretty" -d '{
  "query": {
    "match" : {
      "tagName" : "Model 22340 Sound Spectrum Analyzer"
    }
  }
}'

Related

Why does Elasticsearch analyze a document twice?

From what I've understood, when I index a document, say:
PUT <index>/_doc/1
{
  "title": "black white fox cat"
}
Elasticsearch analyzes this via the standard analyzer and turns the title into an array of tokens.
But then when I search for this document, let's say:
POST <index>/_search
{
  "query": {
    "match": {
      "title": "black"
    }
  }
}
it analyzes it again via the same analyzer. Isn't that inefficient?
It may look inefficient, but it's a necessary step to provide the search results.
Let me explain how the index and search processes work under the hood:
1. Indexing tokenizes the text based on the data type and the configured analyzer, and stores the tokens in the inverted index.
2. The search terms are tokenized again, depending on the query type (no tokenization in the case of the term family of queries), and the generated tokens are looked up in the inverted index created at index time (step 1).
3. The token-matching process (matching the query-time tokens against the index-time tokens in the inverted index) is what finds the matching documents and provides the search results. Normally this token match is an exact string match, with exceptions in some cases (prefix query, wildcard query, etc.), and as an exact string match it is a very fast and optimized process.
There are exceptions: when you use the keyword data type, the text is not analyzed, and when you use term-level queries, no search-time analysis happens.
Now, the important thing to note is that the same analyzer used at index time must also be used at search time; otherwise it would generate different tokens, which would not match in step 3 described above.
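You can verify this yourself with the _analyze API; for example, the request below shows the tokens the standard analyzer produces for the title above, and running the search term through the same call shows why "black" finds the document:

POST <index>/_analyze
{
  "analyzer": "standard",
  "text": "black white fox cat"
}

This returns the tokens [black, white, fox, cat], the same tokens stored in the inverted index at index time.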

Elasticsearch - Search in very big text in a document

We ingest log data into Elasticsearch, and there is a business requirement to store the whole original input message as a string. In this so-called original row we must be able to search with wildcards, for partial matches, and for exact matches, and all of this must be done in a case-insensitive manner.
Some logs are so large that they cannot be stored as keyword because of the 32K limit; text is the only option. Do you have any idea how the above can be achieved?
The text is a mix of Hungarian phrases, some hex identifiers, short codes, dates, etc.
I use the Hungarian stemmer, Hungarian stopwords, lowercase, and asciifolding filters in the mapping.
Any hints are highly appreciated.
Thank you!
I don't have experience with the Hungarian language, but you can try Elasticsearch's query_string or match query.
First define an index mapping with the Hungarian analyzer for the text field, and then you can use the query below.
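A minimal sketch of such a mapping (index_name and text_field are placeholders; the built-in hungarian analyzer bundles Hungarian stemming and stopwords, and you can wrap it in a custom analyzer to add lowercase and asciifolding):

PUT index_name
{
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "hungarian"
      }
    }
  }
}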
POST index_name/_search
{
  "query": {
    "query_string": {
      "default_field": "text_field",
      "query": "sample text",
      "default_operator": "AND"
    }
  }
}
You can follow this documentation, which gives examples of all the query types: wildcards, partial matches, exact matches, etc.
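For instance, wildcards, exact phrases, and boolean operators can all be combined in one query_string query; a sketch with illustrative terms (analyze_wildcard asks Elasticsearch to run wildcard terms through the analyzer so they are normalized like the indexed tokens):

POST index_name/_search
{
  "query": {
    "query_string": {
      "default_field": "text_field",
      "query": "5a3f* AND \"sample text\"",
      "analyze_wildcard": true
    }
  }
}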

Elasticsearch: How does search work when using combination of analyzers?

I'm a novice with Elasticsearch (ES), messing around with analyzers. As the documentation states, the analyzer can be specified at "index time" and at "search time", depending on the use case.
My document has a text field title, and I have defined the following mapping, which introduces a sub-field custom:
PUT index/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "custom": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "keyword"
        }
      }
    }
  }
}
So if I have the text "email-id is someid#someprovider.com", the standard analyzer would analyze it into the following tokens during indexing:
[email, id, is, someid, someprovider.com]
However, whenever I try to query on the field title.custom (with different variations of query terms), I get no hits.
This is what I think happens when I query with the keyword email:
1. It gets analyzed by the keyword analyzer.
2. The value of title.custom is also analyzed by the keyword analyzer (analysis on tokens), resulting in the same set of tokens as mentioned earlier.
3. An exact match should happen on the email token, returning the document.
Clearly this is not the case, and there are gaps in my understanding.
I would like to know what exactly happens during search.
On a generic level, I would like to know how analysis and search happen when a combination of search and index analyzers is specified.
search_analyzer is set to "keyword" for title.custom, which makes the whole query string work as a single search keyword.
So, to get a match on title.custom, you need to search for "email-id is someid#someprovider.com", not a part of it.
search_analyzer is applied at search time to override the default behavior of the analyzer applied at indexing time.
Good question. To keep it simple, let me explain the different cases one by one.
Analyzers play a role based on:
1. The type of query (a match query is analyzed, while a term query is not).
2. By default, an analyzed query such as match uses the same analyzer on the search term that was used on the field at index time.
3. If you override the default behavior by specifying a search_analyzer on a field, then at query time that analyzer is used to create the tokens, which are matched against the tokens generated at index time (standard is the default analyzer).
Now, using the above three points and the Explain API, you can figure out what is happening in your case.
Let me know if you need further information; I would be happy to explain further.
The difference between match and term queries, and the Analyze API (to see the tokens), will be helpful as well.
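For example, the Analyze API can show both sides against the mapping above: the first request prints the index-time tokens for title.custom (via its standard analyzer), and the second shows that the keyword search analyzer keeps the whole query text as one single token. Comparing the two outputs makes it clear which query strings can produce a hit:

GET index/_analyze
{
  "field": "title.custom",
  "text": "email-id is someid#someprovider.com"
}

GET index/_analyze
{
  "analyzer": "keyword",
  "text": "email-id is someid#someprovider.com"
}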

Difference between a field and the field.keyword

If I add a document with several fields to an Elasticsearch index, when I view it in Kibana, each field appears twice. One of them is called
some_field
and the other one is called
some_field.keyword
Where does this behaviour come from, and what is the difference between the two?
PS: one of them is aggregatable (not sure what that means) and the other (without .keyword) is not.
Update: a short answer would be that type: text is analyzed, meaning it is broken up into distinct words when stored and allows free-text searches on one or more words in the field. The .keyword field takes the same input and keeps it as one large string, meaning it can be aggregated on, and you can use wildcard searches on it. Aggregatable means you can use it in aggregations in Elasticsearch, which resemble a SQL GROUP BY if you are familiar with that. In Kibana you would typically use the .keyword field with aggregations to count distinct values, etc.
Please take a look at this article about text vs. keyword.
Briefly: in Elasticsearch 5.0 the string type was replaced by the text and keyword types. Since then, when you do not specify an explicit mapping, for a simple document with a string field:
{
  "some_field": "string value"
}
the dynamic mapping below will be created:
{
  "some_field": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
As a consequence, it is possible both to perform full-text search on some_field and to perform keyword searches and aggregations using the some_field.keyword field.
I hope this answers your question.
Look at this issue; there is some explanation of your question in it. Roughly speaking, some_field is analyzed and can be used for full-text search. On the other hand, some_field.keyword is not analyzed and can be used in term queries or in aggregations.
I will try to answer your questions one by one.
Where does this behavior come from?
It was introduced in Elasticsearch 5.0.
What is the difference between the two?
some_field is used for full-text search and some_field.keyword is used for keyword searching.
Full-text searching is used when we want to include individual tokens of a field's value in the search. For instance, searching for all the hotel names that have "farm" in them, such as hay farm house, windy harbour farm house, etc.
Keyword searching is used when we want to include the whole value of the field in the search, not individual tokens from it. For example, suppose you are indexing documents with a city field. Aggregating on the analyzed field would give separate counts for "new" and "york" instead of "new york", which is usually the expected behavior.
From Elasticsearch 5.0 onwards, strings are mapped both as text and keyword by default.
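A short sketch contrasting the two, reusing the city example (index and field names illustrative): the first request is a full-text match on the analyzed field, so it also matches documents containing only "new" or only "york"; the second aggregates on the keyword sub-field, so "new york" is counted as a single value:

GET hotels/_search
{
  "query": {
    "match": { "city": "new york" }
  }
}

GET hotels/_search
{
  "size": 0,
  "aggs": {
    "by_city": {
      "terms": { "field": "city.keyword" }
    }
  }
}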

Elasticsearch exact match performance for very long string

I have a use case:
I need to extract pieces of information from a single URL and save each piece as a separate data unit to be shown on different pages. When a user visits a data unit on a page, I wish to list all other data units from the same original URL.
I intend to define the original URL field as a not_analyzed string field and then use exact matching to get all the pieces extracted from the original URL.
My question is:
The original URL could be very long. How efficient is Elasticsearch at exact matching on very long strings? Does Elasticsearch use some sort of hash algorithm, such as Git's, for long-string exact matching?
This use case will be heavily used, so it is quite important for me to get an answer.
Thanks in advance.
To match exact values in a not_analyzed field, you can use a term query, which will:
Find documents that contain the exact term specified in the inverted index.
For example:
POST _search
{
  "query": {
    "term": { "url": "google.com" }
  }
}
I can't really speak to performance, but this query will match the value exactly as it is; it won't apply any transformation to the URL, since the field is not_analyzed.
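For completeness, a sketch of the mapping this implies, in the pre-5.0 syntax the question uses (pieces and piece are placeholder index and type names; on 5.0+ you would use the keyword type instead of a not_analyzed string):

PUT pieces
{
  "mappings": {
    "piece": {
      "properties": {
        "url": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}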
