ElasticSearch - Fuzzy search in list elements - elasticsearch

I've got some documents stored in ElasticSearch like this:
{
"tag" : ["tag1", "tag2", "tag3"]
...
}
I want to search through the "tag" field. I know that It should work with a query like:
{
"query":
{
"match" : {"tag" : "tag1"}
}
}
But, I don't want to use a match, I want to use a fuzzy search through the list, for example, something like:
{
"query":
{
"fuzzy" : {"tag" : "tagg1"}
}
}
The problem is, the above query doesn't return anything. What should I use instead?

What is the type of tag field in your elasticsearch mapping ?
I have tried with following type for tag field & elastisearch version is 7.2
"tag" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
And working well for me.
Query With elastic fuzzy will be :
{
"query":
{
"fuzzy": {"tag" : "tagg1"}
}
}

Related

Can only use wildcard queries on keyword, text and wildcard fields - not on [id] which is of type [long]

Elasticsearch version 7.13.1
GET test/_mapping
{
"test" : {
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"name" : {
"type" : "text"
}
}
}
}
}
POST test/_doc/101
{
"id":101,
"name":"hello"
}
POST test/_doc/102
{
"id":102,
"name":"hi"
}
Wildcard Search pattern
GET test/_search
{
"query": {
"query_string": {
"query": "*101* *hello*",
"default_operator": "AND",
"fields": [
"id",
"name"
]
}
}
}
Error is : "reason" : "Can only use wildcard queries on keyword, text and wildcard fields - not on [id] which is of type [long]",
It was working fine in version 7.6.0 ..
What is new change in latest ES and what is the resolution of this issue?
It's not directly possible to perform wildcards on numeric data types. It is better to convert those integers to strings.
You need to modify your index mapping to
PUT /my-index
{
"mappings": {
"properties": {
"code": {
"type": "text"
}
}
}
}
Otherwise, if you want to perform a partial search you can use edge n-gram tokenizer

How do I get the occurences and doc_count of every term of an ES Index?

I have an ES Index of the form
{
"adminfile" : {
"mappings" : {
"properties" : {
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
With the field 'title' being the title of the string found in the field 'text'. The titles do not contain any spaces, while the texts are normal texts (sentences with spaces and dots etc).
I want to get all the terms in the index and their doc_count and/or frequency. I found this query in the ES doc: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html
GET /adminfile/_search
{
"size": 10,
"aggs" : {
"text" : {
"terms" : {
"field" : "text.keyword",
"order" : { "_count" : "asc" },
"size": 10
}
}
}
}
This returns all the sources but the aggregation buckets are empty. If I change "text.keyword" to "title.keyword" in that command, it does work and return all the titles as keys.
Why does it not work on the text fields?
Is there a better command to use? I know that this:
GET /adminfile/_search
{
"query" : {
"match" : {"text" : "WordToSearch"}
},
"_source":false,
"aggregations": {
"keywords" : {
"significant_text" : {
"field" : "text",
"filter_duplicate_text": true,
"size": 100
}
}
},
"highlight": {
"fields": {
"text": {}
}
}
}
works to get all occurences of wordToSearch in every document of the index, with the counts and frequency. Is there a way to ask this command to match every word of every doc?
EDIT: I have also tried changing the name of the text field to "contenu" in case ES didn't like have a field of name 'text' and of type 'text'. No effect.
Another option could be using https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html but it _termvectors only works for one specific ID (or _mtermvectors for mutliple specific ID, not all the documents in any case)
EDIT2: I realised that the ignore_above could be a problem. I tried cutting all my texts to 200 chars as a test. The query now runs, except that it returns the entire text as a key instead of cutting it into words.
When you use the keyword version of the field, the content is kept as a single, large token. You're right in assuming that ignore_above is the source of your problem, since these tokens apparently will be longer than 256 characters in your data set.
If you instead aggregate across the tokenized field (the normal text field), instead of the keyword version, you'll get counts for each word (i.e. each token) as processed by the field.

elasticsearch percolator filter fails

I'm using a document query against a percolator that works ok. When I try to filter the percolator queries against which document percolate using queries ids, it doesn't return any result. For example:
{
"doc" : {
"text" : "This is the text within my document"
},
"highlight" : {
"order" : "score",
"pre_tags" : ["<example>"],
"post_tags" : ["</example>"],
"fields" : {
"text" : { "number_of_fragments" : 0 }
}
},
"filter":{"ids":{"values":[11,15]}}
,
"size" : 100
}
I know for sure that those ids are correct, but allways obtain "matches" : [ ]. When I don't use filter, ES retrieves correct matches.
Thanks for your help.
I think I've solved it. It seems that the filter only works on the "metadata" fields, meaning that you have to add customized fields to the queries indexed in the percolator in order to use them to filter when you need.
Using my previous example, I would have to index in percolator queries like:
{
"query" : {
"match_phrase" : {
"text" : "document"
}
},
"id" : 11
}
Adding "manually" a redundant id field in order to use it later as filter reference.
At percolation time, you have to use something like:
{
"doc" : {
"text" : "This is the text within my document"
},
"filter":{"match":{"id":11}},
"highlight" : {
"order" : "score",
"pre_tags" : ["<example>"],
"post_tags" : ["</example>"],
"fields" : {
"text" : { "number_of_fragments" : 0 }
}
},
"size" : 100
}
In order to use only that percolator query. Complementary information can be found here.

Query documents with access control filter

Each document in my Elasticsearch index has two access control lists containing user ids. One is an allow list, the other is a deny list. I am trying to add a filter to a given query that considers these ACLs. I thought I could use a bool query with a must clause for the given query, a filter clause for the allow list, and a must_not clause for the deny list. What I have so far (example for user 1):
{
"bool" : {
"must" : {
[given query]
},
"filter" : [ {
"match" : {
"acl.allow" : {
"query" : "/user/1",
"type" : "boolean"
}
}
}],
"must_not" : [ {
"match" : {
"acl.deny" : {
"query" : "/user/1",
"type" : "boolean"
}
}
}]
}
}
Unfortunately, this query does not return the desired result. It returns objects that have not listed user 1 in their allow list (a behavior I don't understand). Also, it (obviously) ignores objects with empty access control lists (which should be visible to anyone). Any suggestions to fix that?
I figured it out. First of all, using match isn't really a good solution for that kind of query—due to its analyzer. Using term though left me puzzled why I did not get any results. Term queries only return results if the corresponding field is set to not_analyzed. Thus I changed my mapping:
"acl": {
"properties": {
"allow": {
"type": "string",
"index": "not_analyzed"
},
"deny": {
"type": "string",
"index": "not_analyzed"
}
}
}
My second problem—treating objects with empty ACLs as visible to anyone—was solved using exists nested in must_not nested in bool. This is recommended as substitute for the deprecated missing query. My final query looks like this and passed all ACL related tests I could think of.
{
"bool" : {
"must" : {
[given query]
},
"filter" : {
"bool" : {
"should" : [ {
"terms" : {
"acl.allow" : [ "/user/1" ]
}
}, {
"bool" : {
"must_not" : {
"exists" : {
"field" : "acl.allow"
}
}
}
} ]
}
},
"must_not" : {
"terms" : {
"acl.deny" : [ "/user/1" ]
}
}
}
}

Indexing a comma-separated value field in Elastic Search

I'm using Nutch to crawl a site and index it into Elastic search. My site has meta-tags, some of them containing comma-separated list of IDs (that I intend to use for search). For example:
contentTypeIds="2,5,15". (note: no square brackets).
When ES indexes this, I can't search for contentTypeIds:5 and find documents whose contentTypeIds contain 5; this query returns only the documents whose contentTypeIds is exactly "5". However, I do want to find documents whose contentTypeIds contain 5.
In Solr, this is solved by setting the contentTypeIds field to multiValued="true" in the schema.xml. I can't find how to do something similar in ES.
I'm new to ES, so I probably missed something. Thanks for your help!
Create custom analyzer which will split indexed text into tokens by commas.
Then you can try to search. In case you don't care about relevance you can use filter to search through your documents. My example shows how you can attempt search with term filter.
Below you can find how to do this with sense plugin.
DELETE testindex
PUT testindex
{
"index" : {
"analysis" : {
"tokenizer" : {
"comma" : {
"type" : "pattern",
"pattern" : ","
}
},
"analyzer" : {
"comma" : {
"type" : "custom",
"tokenizer" : "comma"
}
}
}
}
}
PUT /testindex/_mapping/yourtype
{
"properties" : {
"contentType" : {
"type" : "string",
"analyzer" : "comma"
}
}
}
PUT /testindex/yourtype/1
{
"contentType" : "1,2,3"
}
PUT /testindex/yourtype/2
{
"contentType" : "3,4"
}
PUT /testindex/yourtype/3
{
"contentType" : "1,6"
}
GET /testindex/_search
{
"query": {"match_all": {}}
}
GET /testindex/_search
{
"filter": {
"term": {
"contentType": "6"
}
}
}
Hope it helps.
POST _analyze
{
"tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
"\n",
","
]
},
"text": "QUICK,brown, fox"
}

Resources