Search using “OR” condition on keyword field that contains spaces - elasticsearch

My data "keywords" contain spaces. So "X AAA" is one "keyword". And "B AAA" is another keyword. My data will only have one of these in the actual field. So the data field will never look like a combination of the two "X AAA B AAA". There will always be just one "keyword" in the field.
Here is a sample data set of 6 rows for the field:
X AAA
Y AAA
Z AAA
X BBB
Y BBB
Z BBB
My mapping looks like this for the field
"mappings" : {
  "properties" : {
    "MYKEYWORDFIELD" : {
      "type" : "keyword"
    },
    ...
When I query MYKEYWORDFIELD for only part of the "keyword", such as "AAA", I don't get any results. This is what I want. My understanding is therefore that the entire contents of the field are treated as a single keyword. Am I understanding this correctly?
Also, I want to query MYKEYWORDFIELD for "X AAA" OR "X BBB" in a single query. Is it possible to do so? If so, how would I do so?
====
1/7/20 Update: To clarify, I don't want the query results to include rows other than the ones I asked for. Therefore I don't believe I can use "should", which I thought only affects result scoring and might therefore allow other rows like "Y BBB" to show up in my results.

You can use a bool query with should clauses. Since a bool query with no must or filter clause requires at least one should clause to match (minimum_should_match defaults to 1), only documents matching "X AAA" or "X BBB" are returned. Also, because your field is already mapped as keyword, query it directly with term queries rather than a .keyword sub-field:
{
  "query" : {
    "bool" : {
      "should" : [
        {
          "term" : {
            "MYKEYWORDFIELD" : "X AAA"
          }
        },
        {
          "term" : {
            "MYKEYWORDFIELD" : "X BBB"
          }
        }
      ]
    }
  }
}
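If you only ever need an exact-value OR over a single keyword field, a terms query is a more compact option. The sketch below builds the request body in Python (the field name MYKEYWORDFIELD is taken from the question; how you send it to the cluster is up to your client):

```python
# Minimal sketch: an exact-value OR over one keyword field via a
# `terms` query. MYKEYWORDFIELD is the field name from the question.
def build_terms_query(field, values):
    """Return a request body matching documents whose `field` equals
    any one of the exact `values` (no analysis is applied)."""
    return {"query": {"terms": {field: values}}}

body = build_terms_query("MYKEYWORDFIELD", ["X AAA", "X BBB"])
```

Like the term query, terms performs no analysis on its input, so "X AAA" only matches documents whose field is exactly "X AAA".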

Related

Elasticsearch fuzziness with multi_match and bool_prefix type

I have a set of search_as_you_type_fields I need to search against. Here is my mapping
"mappings" : {
  "properties" : {
    "description" : {
      "type" : "search_as_you_type",
      "doc_values" : false,
      "max_shingle_size" : 3
    },
    "questions" : {
      "properties" : {
        "content" : {
          "type" : "search_as_you_type",
          "doc_values" : false,
          "max_shingle_size" : 3
        },
        "tags" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        }
      }
    },
    "title" : {
      "type" : "search_as_you_type",
      "doc_values" : false,
      "max_shingle_size" : 3
    }
  }
}
I am using a multi_match query with bool_prefix type.
"query": {
  "multi_match": {
    "query": "triangle",
    "type": "bool_prefix",
    "fields": [
      "title",
      "title._2gram",
      "title._3gram",
      "description",
      "description._2gram",
      "description._3gram",
      "questions.content",
      "questions.content._2gram",
      "questions.content._3gram",
      "questions.tags",
      "questions.tags._2gram",
      "questions.tags._3gram"
    ]
  }
}
So far this works fine. Now I want to add typo tolerance, which is fuzziness in ES. However, bool_prefix seems to conflict with it: if I modify my query to add "fuzziness": "AUTO" and make an error in a word ("triangle" -> "triangld"), I get no results at all.
However, if I am looking for a phrase such as "right triangle", I get different behavior:
even if no typo is made, I get more results with just "fuzziness": "AUTO" (1759 vs 1267)
if I add a typo to the 2nd word ("right triangdd"), it seems to work, but now results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy", etc.) are pushed to the front.
If I make a typo in the 1st word ("righd triangle") or in both ("righd triangdd"), the results seem to be just fine. So this is probably the only correct behavior.
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but bool_prefix seems to be the only one that supports search-as-you-type, and I need to return results as the user starts typing.
Since I make all the requests to ES from our backend, I can also manipulate the query string to build different search query types if needed, for example one type for single-word searches and another for multi-word ones. But I basically need to maintain the current behavior.
I've also tried appending "~" or "~1"/"~2" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and performance (search speed) seems to be worse.
My questions are:
How can I achieve fuzziness for single-word searches, so that the query "triangld" returns documents containing "triangle", etc.?
How can I get correct search results when the typo is in the 2nd (last?) word of the query? As mentioned above it works, but see point 2 above.
Why does just adding fuzziness (see point 1) return more results even when the phrase is correct?
Is there anything I need to change in my analyzers, etc.?
To achieve the desired behavior, we did the following:
changed the query type to "query_string"
added query-string preprocessing on the backend. We split the query string on whitespace and append "~1" or "~2" to each word whose length is more than 4 or 8 characters respectively (~ is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user is typing [t, tr, tri, ..., triangle] there is no fuzziness, but once the input becomes "triangle " it is rewritten to "triangle~2". This is because applying fuzziness to the last, still-incomplete word produces unexpected results.
we also removed all ngram fields from the search fields, since we get the same results and performance is a bit better
added "default_operator": "AND" to the query to constrain the results to one field for phrase queries
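The preprocessing step described above can be sketched as follows. The length thresholds and the trailing-space check are inferred from the answer's example (a word of 8 characters like "triangle", once followed by a space, becomes "triangle~2"), so treat the exact cut-offs as assumptions:

```python
# Minimal sketch of backend query-string preprocessing for fuzziness.
# Thresholds are inferred from the answer's "triangle " -> "triangle~2"
# example; adjust them to taste.
def add_fuzziness(raw_query):
    """Append ~1/~2 fuzziness markers to each finished word of a
    query_string query, leaving the word still being typed untouched."""
    ends_with_space = raw_query.endswith(" ")
    words = raw_query.split()
    out = []
    for i, word in enumerate(words):
        still_typing = (i == len(words) - 1) and not ends_with_space
        if still_typing:
            out.append(word)              # no fuzziness on the current word
        elif len(word) >= 8:
            out.append(word + "~2")       # long word: allow 2 edits
        elif len(word) > 4:
            out.append(word + "~1")       # medium word: allow 1 edit
        else:
            out.append(word)              # short word: exact only
    return " ".join(out)
```

The resulting string is then sent as the "query" of the query_string query, together with "default_operator": "AND" as described above.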

ElasticSearch Multiple words multiple times containing and or in multiple fields

I was trying to build queries where, I can search in multiple fields. Example:
1. "Criminal Law" and "Act 1999 Vol 2" and "Human Rights"
2. "Human Rights Law" or "Labor Law" and "Chan Mia"
Note: Strings inside quotation marks need to be matched exactly.
I was trying with following Query:
"query": {
  "multi_match" : {
    "query": "\"Criminal Law\"" or "\"Act 1999 Vol 2\"" and "\"Human Rights\"",
    "fields": [ "transcript" ],
    "operator": "and"
  }
}
Multi_match doesn't support operators in the query text. You need to use query_string:
{
  "query_string" : {
    "query": "\"Criminal Law\" OR \"Act 1999 Vol 2\" AND \"Human Rights\"",
    "fields": [ "transcript" ]
  }
}
query_string enforces a strict syntax: if your query is malformed it will throw an error, which can be a problem when the input comes directly from users. In that case you can use simple_query_string, which is fault tolerant. It does not provide all the features of query_string, but it doesn't throw errors for invalid input; it simply ignores the invalid parts of the query.
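As a sketch of the fault-tolerant alternative, here is a simple_query_string body for the same search, built as a Python dict (the transcript field comes from the question). Note that simple_query_string uses | for OR and + for AND rather than the uppercase keywords:

```python
# Minimal sketch of a simple_query_string request body. Invalid syntax
# in the query text is ignored rather than raising an error.
body = {
    "query": {
        "simple_query_string": {
            # | is OR, + is AND in simple_query_string syntax
            "query": '"Criminal Law" | "Act 1999 Vol 2" + "Human Rights"',
            "fields": ["transcript"],
        }
    }
}
```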

How to decrease the TF score in elasticsearch?

Two docs: 1. "Some Important Company", 2. "Some Important Company Important branch"
Since "Important" has a high docCount (many docs contain the word "Important"), when searching for "Some Important Company" the 2nd doc gets a higher score, even though the 1st doc is an exact match.
So my question is: how can I boost the score for exact matches, or decrease the TF score?
My query is a multi_match over customerName and usedName, but usedName is all "" in this case.
I assume the field of your document is indexed using a standard text analyzer or something similar. I would combine a match query and a match_phrase query using a dis_max compound query.
This would give something like this:
{
  "query": {
    "dis_max" : {
      "queries" : [
        { "match" : { "myField" : "Some Important Company" }},
        { "match_phrase" : { "myField" : "Some Important Company" }}
      ],
      "tie_breaker" : 0.7
    }
  }
}
There's no notion of "matching an exact phrase" with the match query; for that you need the match_phrase query, which is why the two are combined here. With dis_max, documents that match both queries will get a boost. You can read more about dis_max and match_phrase:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html

How to get the word count for all the documents based on index and type in elasticsearch?

If I have a few documents and would like to get the count of each word across all the documents for a particular field, how do I get it?
ex: doc1: "aaa bbb aaa ccc"
doc2: "aaa ccc"
doc3: "www"
I want it like aaa-3, bbb-1, ccc-2, www-1
If you want the document counts, you can do it by using a terms aggregation like this:
POST your_index/_search
{
  "aggs" : {
    "counts" : {
      "terms" : { "field" : "your_field" }
    }
  }
}
UPDATE
If you want to get the term counts, you need to use the _termvector API; however, you'll only be able to query one document at a time.
GET /your_index/your_type/1/_termvector?fields=your_field
And for doc1 you'll get
aaa: 2
bbb: 1
ccc: 1
The multi-term vectors API can help but you'll still need to specify the documents to get the term vectors from.
POST /your_index/your_type/_mtermvectors
{
  "docs": [
    { "_id": "1" },
    { "_id": "2" },
    { "_id": "3" }
  ]
}
And for your docs you'll get
aaa: 2 + 1
bbb: 1
ccc: 1 + 1
www: 1
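To turn those per-document term vectors into the corpus-wide counts the question asks for (aaa-3, bbb-1, ccc-2, www-1), the client still has to sum them itself. A minimal sketch, assuming the _mtermvectors response shape and the hypothetical your_field name used above:

```python
from collections import Counter

def total_term_counts(mtermvectors_response, field):
    """Sum per-document term frequencies for one field across every
    doc in an _mtermvectors response (term_freq is per document)."""
    totals = Counter()
    for doc in mtermvectors_response["docs"]:
        # Each doc carries its own term_vectors section; missing
        # fields simply contribute nothing.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return dict(totals)
```

Run against the three sample docs from the question, this yields aaa: 3, bbb: 1, ccc: 2, www: 1.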

Aggregation distinct values in ElasticSearch

I'm trying to get the distinct values and their counts in ElasticSearch.
This can be done via:
"distinct_publisher": {
  "terms": {
    "field": "publisher", "size": 0
  }
}
The problem I have is that it counts the terms, but if a publisher value contains a space, e.g.:
"Chicken Dog"
and 5 documents have this value in the publisher field, then I get 5 for chicken and 5 for dog:
"buckets" : [
  {
    "key" : "chicken",
    "doc_count" : 5
  },
  {
    "key" : "dog",
    "doc_count" : 5
  },
  ...
]
But I want to get as the result:
"buckets" : [
  {
    "key" : "Chicken Dog",
    "doc_count" : 5
  }
]
The reason you're getting a bucket for each of chicken and dog is that your documents were analyzed at index time.
This means Elasticsearch did some light processing to turn Chicken Dog into the tokens chicken and dog (lowercasing, and tokenizing on whitespace). You can see how Elasticsearch will analyze a given piece of text into searchable tokens by using the Analyze API, for example:
curl -XGET 'localhost:9200/_analyze?&text=Chicken+Dog'
In order to aggregate over the "raw" distinct values, you need a not_analyzed mapping (in recent versions, the keyword field type) so Elasticsearch skips its usual processing. You may need to reindex your data with that mapping to get the result you want.
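To illustrate the fix, here is a minimal sketch as Python dicts, since the exact index setup isn't shown in the post: a mapping that adds an unanalyzed keyword sub-field (the "raw" sub-field name is an assumption), and a terms aggregation over that sub-field.

```python
# Minimal sketch: a text field with an unanalyzed keyword sub-field
# ("raw" is an assumed sub-field name), and a terms aggregation on it.
mapping = {
    "properties": {
        "publisher": {
            "type": "text",
            "fields": {"raw": {"type": "keyword"}},  # not analyzed
        }
    }
}

aggregation_body = {
    "size": 0,  # we only want the buckets, not the hits
    "aggs": {
        "distinct_publisher": {
            "terms": {"field": "publisher.raw"}
        }
    },
}
```

Aggregating on publisher.raw then yields a single "Chicken Dog" bucket with doc_count 5, as desired.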
