Elasticsearch: what does my index contain, docs or positions?

I've created an ES index using the following command:
curl -X PUT -H 'Content-Type: application/json' -H 'Accept: application/json' -d '{"settings" :{"number_of_shards" : 10, "number_of_replicas" : 0, "analysis":{"analyzer": {"my_analyzer": {"type": "custom", "tokenizer":"whitespace","filter":["lowercase","porter_stem"],"stopwords":[...stopwords here ...]}}}}, "mappings" : {"html" : {"properties" : {"head" : { "type" : "text", "analyzer": "my_analyzer" }, "body" : { "type" : "text", "analyzer": "my_analyzer"}}}}}' localhost:9200/docs
I read here that:
Analyzed string fields use positions as the default, and all other fields use docs as the default.
Since my fields are of text type, are they considered string fields?
My main question is how to find out what my index contains (docs or positions) for each field. I used the /docs/_settings endpoint to get the index settings, but it didn't give me a useful answer.
Any hints?
EDIT:
In addition to @ibexit's answer below, I verified this in practice by issuing phrase queries against the ES indices.

You defined the fields as text without specifying index_options in your mapping. In this case the default for text fields is applied (index_options=positions). The inverted index will therefore contain document numbers, term frequencies, and term positions (or order) for the text fields.
For more in-depth information about inverted indices, please have a look at https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up or https://youtu.be/x37B_lCi_gc
This should be a good starting point for your research.
Cheers!
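As a minimal sketch of the rule described above (written in Python purely for illustration), the helper below resolves the effective index_options for a single field mapping: an explicit index_options value wins, otherwise text fields fall back to positions and every other type to docs. The function name effective_index_options is hypothetical, and the properties dict simply mirrors the head and body fields from the question. In practice the field definitions come from the index mapping (GET /docs/_mapping) rather than from /docs/_settings, and defaults that were never set explicitly do not appear there.

```python
def effective_index_options(field_mapping: dict) -> str:
    """Return the explicit index_options if set, else the documented default."""
    explicit = field_mapping.get("index_options")
    if explicit is not None:
        return explicit
    # Default: analyzed text fields use positions, all other fields use docs.
    return "positions" if field_mapping.get("type") == "text" else "docs"


# Field definitions copied from the mapping in the question.
properties = {
    "head": {"type": "text", "analyzer": "my_analyzer"},
    "body": {"type": "text", "analyzer": "my_analyzer"},
}

for name, mapping in properties.items():
    print(name, effective_index_options(mapping))
```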

Related

Elasticsearch fuzziness with multi_match and bool_prefix type

I have a set of search_as_you_type fields I need to search against. Here is my mapping:
"mappings" : {
"properties" : {
"description" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
"questions" : {
"properties" : {
"content" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
},
"title" : {
"type" : "search_as_you_type",
"doc_values" : false,
"max_shingle_size" : 3
}
}
}
I am using a multi_match query with bool_prefix type.
"query": {
"multi_match": {
"query": "triangle",
"type": "bool_prefix",
"fields": [
"title",
"title._2gram",
"title._3gram",
"description",
"description._2gram",
"description._3gram",
"questions.content",
"questions.content._2gram",
"questions.content._3gram",
"questions.tags",
"questions.tags._2gram",
"questions.tags._3gram"
]
}
}
So far this works fine. Now I want to add typo tolerance, which is called fuzziness in ES. However, bool_prefix seems to conflict with it: if I add "fuzziness": "AUTO" to the query and misspell "triangle" as "triangld", I get no results.
However, if I search for the phrase "right triangle", the behavior is different:
1. Even with no typos, I get more results just by adding "fuzziness": "AUTO" (1759 vs 1267).
2. If I add a typo to the 2nd word ("right triangdd"), it seems to work, but results containing "right" without "triangle" ("The Bill of Rights", "Due process and right to privacy", etc.) are now ranked first.
3. If I make a typo in the 1st word ("righd triangle") or in both ("righd triangdd"), the results seem to be fine. So this is probably the only correct behavior.
I've seen a couple of articles and even GitHub issues saying that fuzziness does not work properly with a multi_match query of type bool_prefix, but I can't find a workaround. I've tried changing the query type, but bool_prefix appears to be the only one that supports search-as-you-type, and I need to return results as soon as the user starts typing.
Since all requests to ES are made from our backend, I can also manipulate the query string to build different query types if needed, for example one type for single-word searches and another for multi-word searches. But I basically need to maintain the current behavior.
I've also tried appending "~", "~1", or "~2" to the string, which seems to be another way of specifying fuzziness, but the results are rather unclear and search speed seems to be worse.
My questions are:
1. How can I achieve fuzziness for single-word searches, so that the query "triangld" returns documents containing "triangle", etc.?
2. How can I get correct results when the typo is in the 2nd (last?) word of the query? As mentioned above it works, but see point 2 above.
3. Why does just adding fuzziness (see point 1) return more results even when the phrase is spelled correctly?
4. Is there anything I need to change in my analyzers, etc.?
So, to achieve the desired behavior, we did the following:
1. Changed the query type to "query_string".
2. Added query string preprocessing on the backend. We split the query string on whitespace and append "~1" or "~2" to each word if its length is more than 4 chars or 8 chars respectively ("~" is the fuzziness syntax in ES). However, we don't add this to the word currently being typed until the user types a whitespace. For example, while the user is typing [t, tr, tri, ... triangle] there is no fuzziness, but once the input is "triangle " it becomes "triangle~2". This is because fuzziness on the last, unfinished word produces unexpected results.
3. Removed all ngram fields from the search fields, as we get the same results but performance is a bit better.
4. Added "default_operator": "AND" to the query to contain the results from one field for phrase queries.
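The preprocessing step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' actual backend code: the function names add_fuzziness and build_search_body are hypothetical, and the threshold for "~2" is taken as 8 characters or more so that the "triangle " => "triangle~2" example from the answer holds. Special characters of the query_string syntax are not escaped here, which real user input would need.

```python
def add_fuzziness(raw_query: str) -> str:
    """Append ~1 / ~2 to finished words; leave the word being typed untouched."""
    words = raw_query.split()
    if not words:
        return ""
    # A trailing whitespace means the user has finished typing the last word.
    last_word_finished = raw_query[-1].isspace()
    result = []
    for position, word in enumerate(words):
        is_last = position == len(words) - 1
        if is_last and not last_word_finished:
            result.append(word)          # still being typed: no fuzziness
        elif len(word) >= 8:             # assumption: 8+ chars get ~2
            result.append(word + "~2")
        elif len(word) > 4:
            result.append(word + "~1")
        else:
            result.append(word)
    return " ".join(result)


def build_search_body(raw_query: str, fields: list) -> dict:
    """Build a query_string request body with AND as the default operator."""
    return {
        "query": {
            "query_string": {
                "query": add_fuzziness(raw_query),
                "fields": fields,
                "default_operator": "AND",
            }
        }
    }


print(add_fuzziness("triangle"))
print(add_fuzziness("triangle "))
print(add_fuzziness("right triangle"))
```

With this sketch, "triangle" stays unchanged while it is being typed, "triangle " becomes "triangle~2", and "right triangle" becomes "right~1 triangle".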

Elasticsearch Document search related

I have an index in Elasticsearch with one document, say with doc ID 01. I then updated the document under a new doc ID, say 02, so now I have two documents.
My question: I want only the latest document (doc ID 02) to be returned by a search query (index/_search).
What would the query be for this scenario?
If you want to get the document with the maximum doc_id value (assuming, as in your example, that you create doc IDs in increasing numerical order), you can use this query:
curl "https://{es_endpoint}/sample_index/_search?pretty" -H 'Content-Type: application/json' -d'
{
"sort" : [
{ "_id" : {"order" : "desc"}}
],
"size": 1
}'
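One caveat with the query above: _id values are compared as strings, so the ordering only matches numeric order when the IDs are zero-padded to the same width (as with 01 and 02). A hedged alternative, assuming each document carries a date field (the name updated_at is hypothetical and not part of the original question), is to sort on that field instead. The Python sketch below only builds the request body and illustrates the string-ordering pitfall; build_latest_query is a made-up helper name.

```python
def build_latest_query(timestamp_field: str = "updated_at") -> dict:
    """Request body returning the single most recent document.

    Assumes documents contain a date field named `timestamp_field`.
    """
    return {"sort": [{timestamp_field: {"order": "desc"}}], "size": 1}


# String comparison pitfall: without zero-padding, "9" sorts after "10".
ids = ["9", "10"]
largest_as_string = max(ids)            # compares character by character
largest_as_number = max(ids, key=int)   # compares numeric values

print(build_latest_query())
print(largest_as_string, largest_as_number)
```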

Elasticsearch: How to achieve a case sensitive term query?

I am trying to query items by a field called unit, which is case sensitive (like kWh), but my term query matches only when I query for kwh (lowercase w). From what I have seen in the docs, term should be the right query for case sensitivity, so I am not sure what I am doing wrong.
## Create an item
curl -X POST "localhost:9200/my_index/my_type/my_id" -H 'Content-Type: application/json' -d'{"point_name" : "my_point_name", "unit" : "kWh"}'
=> {"_index":"my_index","_type":"my_type","_id":"my_id","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"created":true}
## Try to query it by unit with exact match (kWh)
curl -X GET "localhost:9200/my_index/my_type/_search" -H 'Content-Type: application/json' -d'{"query" : { "bool" : {"must" : [{ "term" : {"unit" : "kWh"}}]}}}'
=> {"took":36,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}
## Query with lower case unit kwh
curl -X GET "localhost:9200/my_index/my_type/_search" -H 'Content-Type: application/json' -d'{"query" : { "bool" : {"must" : [{ "term" : {"unit" : "kwh"}}]}}}'
=> {"took":12,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":1,"max_score":0.2876821,"hits":[{"_index":"my_index","_type":"my_type","_id":"my_id","_score":0.2876821,"_source":{"point_name" : "my_point_name", "unit" : "kWh"}}]}}
I don't want to use match here, since I build these queries for other fields as well and I want to ensure exact-match behaviour. Can anyone show me what the correct query would be, and explain why this term query does not work?
I am using this Docker image as my server:
docker.elastic.co/elasticsearch/elasticsearch:6.2.4
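A likely explanation, offered as an assumption because the question does not show the index mapping: with default dynamic mapping in Elasticsearch 6.x, a string value such as "kWh" is indexed as a text field, whose standard analyzer lowercases it to kwh, while a term query does not analyze its input. Dynamic mapping also adds a keyword sub-field that keeps the original value, so an exact, case-sensitive match could target unit.keyword instead. The Python sketch below only builds such a request body; the helper name exact_term_query is hypothetical.

```python
def exact_term_query(field: str, value: str) -> dict:
    """Build a term query against the field's keyword sub-field.

    Assumes the field was dynamically mapped as text with a `.keyword`
    sub-field; check the index mapping before relying on this.
    """
    return {
        "query": {
            "bool": {
                "must": [{"term": {field + ".keyword": value}}]
            }
        }
    }


print(exact_term_query("unit", "kWh"))
```

If the mapping is defined explicitly instead, declaring unit as a keyword field would allow the original term query on unit to match "kWh" directly.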

Can a terms lookup mechanism query be nested?

I want to know whether I can nest a terms lookup mechanism query inside another terms lookup mechanism query.
For instance:
curl -XPUT localhost:9200/users/user/2 -d '{
"tweets" : ["1", "3"]
}'
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
"uuid" : "1",
"comments":["1","2","3"]
}'
curl -XPUT localhost:9200/comments/comment/1 -d '{
"uuid" : "1"
}'
As you know, we can use a terms lookup mechanism query to get tweets which belong to the user:
curl -XGET localhost:9200/tweets/tweet/_search -d'{
"query" : {
"terms" : {
"uuid" : {
"index" : "users",
"type" : "user",
"id" : "2",
"path" : "tweets"
}
}
}
}'
But if I want to get the comments, I must run another query.
However, I have so many documents that this is not a good approach.
So I want to nest terms lookup queries in order to get the comments by user ID in a single query. Is that possible?
I would really appreciate any help. Thank you! :)
At the moment, this is not possible as far as I know, because you expect data from three different indices to be returned in one query, which would equate to a JOIN. The terms lookup query sort of implements JOINs between two indices "only" (which is already quite cool considering that ES does not want to support JOINs in the first place).
One way out of this would be to refactor your data model to get rid of the comments index and use either parent/child and/or nested relationships within the tweet mapping type. Since a comment can only belong to a single tweet and there aren't usually hundreds of comments on a tweet (I'm pretty comfortable with the idea that 99% of the time there are fewer than half a dozen comments per tweet, if any at all), you could add comments either as child documents or as nested documents (my preference), instead of just storing their ids in the comments array. That way you'd get your comments right away with your existing query, without the need for a second query.
curl -XPUT localhost:9200/tweets/tweet/1 -d '{
"uuid" : "1",
"comments":[{
"id": 1,
"content": "Nice tweet!"
},{
"id": 2,
"content": "Way to go!"
},{
"id": 3,
"content": "Sucks!"
}]
}'
Or you can wait for this pull request (#3278) (Terms Lookup by Query/Filter (aka. Join Filter)) to be merged, which would effectively allow what you're asking for, but that PR was created more than 2 years ago and there are still conflicts to be resolved.
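As a rough illustration of the refactoring suggested above, the Python sketch below turns a tweet that only stores comment ids into one that embeds the comment objects, ready to be indexed as nested documents. It is a sketch under assumptions: the helper name embed_comments and the lookup table comments_by_uuid are hypothetical, and the comment contents are sample data in the spirit of the example above.

```python
def embed_comments(tweet: dict, comments_by_uuid: dict) -> dict:
    """Return a copy of the tweet with comment ids replaced by comment objects.

    Ids with no matching comment are skipped rather than raising an error.
    """
    embedded = [
        comments_by_uuid[comment_id]
        for comment_id in tweet.get("comments", [])
        if comment_id in comments_by_uuid
    ]
    return {**tweet, "comments": embedded}


# Sample data: a tweet holding comment ids, plus the comments keyed by uuid.
tweet = {"uuid": "1", "comments": ["1", "2", "3"]}
comments_by_uuid = {
    "1": {"id": 1, "content": "Nice tweet!"},
    "2": {"id": 2, "content": "Way to go!"},
}

print(embed_comments(tweet, comments_by_uuid))
```

The original tweet dict is left unmodified, so the conversion can be run as a one-off migration before reindexing.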

Substring and similarity matching in elasticsearch

I am learning to use Elasticsearch as an alternative to database queries, and I am not able to perform substring matches on the index I built.
The mapping I used to create the index is
"mappings" : {
"user" : {
"properties" : {
"name" : {"type": "string"},
"specialty" : {"type": "string" ,"analyzer":"snowball"},
"address" : {"type": "string" ,"analyzer":"snowball"}
}
}
}
The document I am indexing is
{
"name" : "John Doe",
"speciality": ["pediatrician","Child Doctor"],
"address": ["#123 park road Abbeyville","#423 park road AbbeyTown" ]
}
When I perform a search like
curl -XGET 'localhost:9200/test/user/_search?q=speciality:pediatrician'
I get the correct document.
However, when I search for strings like
curl -XGET 'localhost:9200/test/user/_search?q=speciality:pedia'
curl -XGET 'localhost:9200/test/user/_search?q=speciality:pediatricians'
no results are retrieved.
P.S. I know that wildcards can be used for matching, but I need to be able to search for both whole words and parts of words based on user input, so as to return the most relevant documents.
Did you try reindexing after changing the mapping? Also try setting the search analyzer to snowball in the settings.
UPDATE:
You can use a wildcard search. Prefer a trailing wildcard alone over combining leading and trailing wildcards.
curl -XGET 'localhost:9200/test/user/_search?q=speciality:pedia*'
curl -XGET 'localhost:9200/test/user/_search?q=speciality:pediatricians*'
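One detail worth checking, as an observation that is not part of the original answer: the mapping in the question declares a field named specialty, while the indexed document and the queries use speciality. If that spelling difference is real and not just a typo in the post, the snowball analyzer would never be applied to speciality, since the field would be mapped dynamically with the default analyzer. A small Python sketch for spotting such mismatches follows; the helper name unmapped_fields is hypothetical.

```python
def unmapped_fields(mapping_properties: dict, document: dict) -> list:
    """Return document field names that have no entry in the mapping."""
    return sorted(name for name in document if name not in mapping_properties)


# Field names as they appear in the question's mapping and document.
mapping_properties = {
    "name": {"type": "string"},
    "specialty": {"type": "string", "analyzer": "snowball"},
    "address": {"type": "string", "analyzer": "snowball"},
}
document = {
    "name": "John Doe",
    "speciality": ["pediatrician", "Child Doctor"],
    "address": ["#123 park road Abbeyville", "#423 park road AbbeyTown"],
}

print(unmapped_fields(mapping_properties, document))
```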
