How to get the exact match of using DSL - elasticsearch

How does the mapping affect the search results?
GET courses/_search
The response is below:
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 2,
"relation" : "eq"
},
"max_score" : 1.0226655,
"hits" : [
{
"_index" : "courses",
"_type" : "classroom",
"_id" : "7",
"_score" : 1.0226655,
"_source" : {
"name" : "Computer Internals 250",
"room" : "C8",
"professor" : {
"name" : "Gregg Va",
"department" : "engineering",
"facutly_type" : "part-time",
"email" : "payneg#onuni.com"
},
"students_enrolled" : 33,
"course_publish_date" : "2012-08-20",
"course_description" : "cpt Int 250 gives students an integrated and rigorous picture of applied computer science, as it comes to play in the construction of a simple yet powerful computer system. "
}
},
{
"_index" : "courses",
"_type" : "classroom",
"_id" : "4",
"_score" : 0.2876821,
"_source" : {
"name" : "Computer Science 101",
"room" : "C12",
"professor" : {
"name" : "Gregg Payne",
"department" : "engineering",
"facutly_type" : "full-time",
"email" : "payneg#onuni.com"
},
"students_enrolled" : 33,
"course_publish_date" : "2013-08-27",
"course_description" : "CS 101 is a first year computer science introduction teaching fundamental data structures and algorithms using python. "
}
}
]
}
}
The mapping is below:
{
"courses" : {
"mappings" : {
"properties" : {
"course_description" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"course_publish_date" : {
"type" : "date"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"professor" : {
"properties" : {
"department" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"email" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"facutly_type" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"room" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"students_enrolled" : {
"type" : "long"
}
}
}
}
}
I need to return an exact match on the phrase professor.name=Gregg Payne.
I tried the query below, following the directions at https://www.elastic.co/guide/en/elasticsearch/guide/current/_finding_exact_values.html:
GET courses/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "professor.name": "Gregg Payne"
        }
      }
    }
  }
}

Based on your mapping, here is a query that should work for you:
POST http://localhost:9200/courses/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "professor.name.keyword": "Gregg Payne"
        }
      }
    }
  }
}
Answering your question in the comments: search is always about mappings :) In your case you use a term query, which searches for exact values and therefore needs a keyword field, because text fields get analyzed at index time. From the Elasticsearch documentation:
"Avoid using the term query for text fields. By default, Elasticsearch changes the values of text fields as part of analysis. This can make finding exact matches for text field values difficult. To search text field values, use the match query instead."
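To see why the term query fails here, it can help to simulate, very roughly, what the standard analyzer does to the indexed text value (a simplified sketch; the real analyzer handles many more cases):

```python
import re

def simulate_standard_analyzer(text):
    # Rough stand-in for Elasticsearch's standard analyzer:
    # split on non-word characters and lowercase each token.
    return [t.lower() for t in re.findall(r"\w+", text)]

tokens = simulate_standard_analyzer("Gregg Payne")
print(tokens)  # ['gregg', 'payne']

# A term query looks up the literal value "Gregg Payne" in the
# inverted index, but only the lowercased tokens above were indexed,
# so the exact phrase never matches. The keyword sub-field stores
# the original string untouched, which is why querying
# professor.name.keyword works.
print("Gregg Payne" in tokens)  # False
```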

Related

ElasticSearch - knn search. Sometimes returns _score = null

The longer the array I pass to knn_vector, the more hits come back with _score = null.
For example, when I sent an array of length 2 I got 3 results with valid _score values, but when I sent an array of length 60, every result had _score = null.
Request:
{
  "_source": [],
  "collapse": {
    "field": "id"
  },
  "query": {
    "knn": {
      "vector": {
        "k": 10,
        "vector": [
          0,
          // array size - 46
          0
        ]
      }
    }
  },
  "size": 100,
  "track_scores": false
}
Response (the first and second scores are null, but the third is a float):
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 7,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180",
"_type" : "_doc",
"_id" : "207445df53a7b54c76ff76c0bec352c9",
"_score" : null,
"fields" : {
"id" : [
"377007"
]
}
},
{
"_index" : "sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180",
"_type" : "_doc",
"_id" : "ea374a9b90d83ab93a77fb03226cafd3",
"_score" : null,
"fields" : {
"id" : [
"377009"
]
}
},
{
"_index" : "sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180",
"_type" : "_doc",
"_id" : "1f93035d08e2b7af7d482a89f36e3c7c",
"_score" : 0.134376,
"fields" : {
"id" : [
"377014"
]
}
}
]
}
}
The mapping of my index:
{
"sb_index_images_ba7587a1-35ab-482f-93d8-a433dd132556_1667904180" : {
"mappings" : {
"properties" : {
"colors" : {
"type" : "long"
},
"colors_vector" : {
"type" : "knn_vector",
"dimension" : 9
},
"id" : {
"type" : "keyword"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"params" : {
"properties" : {
"0d23f34d9f2168ab98e5542149eb2f3d" : {
"properties" : {
"name" : {
"type" : "keyword",
"ignore_above" : 256
},
"value" : {
"type" : "keyword",
"eager_global_ordinals" : true,
"ignore_above" : 256,
"fields" : {
"float" : {
"type" : "float",
"ignore_malformed" : true
}
}
}
}
}
}
},
"vector" : {
"type" : "knn_vector",
"dimension" : 2048
}
}
}
}
}

ElasticSearch: What is the param limit in painless scripting?

I will have documents with the following fields:
1. id
2. user_id
3. online_hr
4. offline_hr
My use case is the following: I want to sort users who are active by the online_hr field, and users who are inactive by the offline_hr field.
I am planning to use an Elasticsearch Painless script for this. I will pass two arrays, online_user_list and offline_user_list, in the script params, compare each document's user_id against both lists, and sort accordingly.
I want to know whether there is any limit on the params object, since the user base may number in the hundreds of thousands. Would passing two lists of that size in the script params cause trouble? And is there a better approach?
Query to add data:
POST /products/_doc/1
{
  "id": 1,
  "user_id": "1",
  "online_hr": "1",
  "offline_hr": "2"
}
Sample data -
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "products",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"id" : 1,
"user_id" : "1",
"online_hr" : "1",
"offline_hr" : "2"
}
}
]
}
}
Mapping -
{
"products" : {
"aliases" : { },
"mappings" : {
"properties" : {
"id" : {
"type" : "long"
},
"offline_hr" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"online_hr" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"user_id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1566466257331",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "g2F3UlxQSseHRisVinulYQ",
"version" : {
"created" : "7020099"
},
"provided_name" : "products"
}
}
}
}
I found that Painless scripts have a default size limit of 65,535 bytes, while the Elasticsearch compiler has a limit of 16,834 characters.
References:
https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-walkthrough.html
https://www.elastic.co/guide/en/elasticsearch/painless/current/painless-sort-context.html
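For reference, the approach described in the question could be sketched roughly as the script sort below. The param values and the .keyword sub-field access are assumptions based on the mapping above, not a tested solution. Note that params lists are serialized into every search request, so for hundreds of thousands of users it may be better to index an is_online flag on each document and sort on a regular field instead:

```json
GET products/_search
{
  "sort": {
    "_script": {
      "type": "number",
      "order": "desc",
      "script": {
        "lang": "painless",
        "params": {
          "online_user_list": ["1", "7"],
          "offline_user_list": ["3", "9"]
        },
        "source": "String uid = doc['user_id.keyword'].value; if (params.online_user_list.contains(uid)) { return Double.parseDouble(doc['online_hr.keyword'].value); } else if (params.offline_user_list.contains(uid)) { return Double.parseDouble(doc['offline_hr.keyword'].value); } return 0;"
      }
    }
  }
}
```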

How to sort an Elasticsearch query result by a determined field in DESC?

Let's say I have the following query:
curl -XGET 'localhost:9200/library/document/_search?pretty=true'
That returns me the following example results:
{
"took" : 108,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [
{
"_index" : "library",
"_type" : "document",
"_id" : "5",
"_score" : 1.0,
"_source" : {
"page content" : [
"Page 0:",
"Page 1: something"
],
"publish date" : "2015-12-05",
"keywords" : "sample, example, article, alzheimer",
"author" : "Author name",
"language" : "",
"title" : "Sample article",
"number of pages" : 2
}
},
{
"_index" : "library",
"_type" : "document",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"page content" : [
"Page 1: eBay",
"Page 2: Paypal",
"Page 3: Google"
],
"publish date" : "2017-08-03",
"keywords" : "something, another, thing",
"author" : "Alex",
"language" : "english",
"title" : "Microsoft Word - TL0032.doc",
"number of pages" : 21
}
},
...
I want to order by publish date and by id (in separate queries) so that the most recent document shows first in the list. Is that possible? I know I have to use Elasticsearch's sort option with the desc order, but somehow it is not working for me.
EDIT: Mapping of the fields
curl -XGET 'localhost:9200/library/_mapping/document?pretty'
{
"library" : {
"mappings" : {
"document" : {
"properties" : {
"author" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"keywords" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"language" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"number of pages" : {
"type" : "long"
},
"page content" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"publish date" : {
"type" : "date"
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
First you need a proper mapping, like this:
PUT my_index
{
  "mappings": {
    "documents": {
      "properties": {
        "post_date": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  }
}
And then the search:
GET my_index/_search
{
  "sort": [
    {
      "post_date": {
        "order": "desc"
      }
    }
  ]
}
Thank you everyone. I managed to get it working with this query:
curl -XGET 'localhost:9200/library/document/_search?pretty=true' -d '{"query": {"match_all": {}},"sort": [{"publish date": {"order": "desc"}}]}'
No additional mapping was needed.
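If you also want a deterministic tie-breaker when several documents share the same publish date, a second sort clause can be added (a sketch; sorting on _id may require enabling fielddata on newer Elasticsearch versions, so sorting on your own id-like field is usually safer):

```json
GET library/document/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "publish date": { "order": "desc" } },
    { "_id": { "order": "desc" } }
  ]
}
```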

Should a keyword field be an array or a single string?

I'm an Elasticsearch newbie, so this is a pretty basic question.
If I map a field as a keyword in the index, should the document contain an array of strings for that field, or one string with all the keywords separated by spaces?
If my index looks like this (Keyword is what we're interested in):
{
"esidx_j_cv" : {
"mappings" : {
"j_cv" : {
"properties" : {
"Id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"Keyword" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"Name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
should a document look like this:
{
"_index" : "esidx_j_cv",
"_type" : "j_cv",
"_id" : "2fab7349-c13f-447a-95fa-984df6836c14",
"_score" : 1.0,
"_source" : {
"Id" : "2fab7349-c13f-447a-95fa-984df6836c14",
"Name" : "Jim Bloggs",
"Keyword" : [
"San",
"Andreas",
"Fault",
"California"
]
}
}
or like this:
{
"_index" : "esidx_j_cv",
"_type" : "j_cv",
"_id" : "2fab7349-c13f-447a-95fa-984df6836c14",
"_score" : 1.0,
"_source" : {
"Id" : "2fab7349-c13f-447a-95fa-984df6836c14",
"Name" : "Jim Bloggs",
"Keyword" : "San Andreas Fault California"
}
}
Thanks,
Adam.
Using an array would be the most idiomatic option and is my recommendation. You can query it like this (this query will actually work with either option):
GET index/_search
{
  "query": {
    "match": {
      "Keyword": "California"
    }
  }
}
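Either way, Elasticsearch indexes each array element as a separate value, so with the array form an exact-value lookup on the keyword sub-field also matches individual entries (a sketch against the mapping above):

```json
GET esidx_j_cv/_search
{
  "query": {
    "term": {
      "Keyword.keyword": "California"
    }
  }
}
```

With the single-string form, Keyword.keyword holds the whole string "San Andreas Fault California" as one value, so this term lookup would not match.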

How to get Elasticsearch boolean match working for multiple fields

I need some expert guidance on getting a bool match working. I'd like the query to return a successful search result only if both 'message' matches 'Failed password for' and 'path' matches '/var/log/secure'.
This is my query:
curl -s -XGET 'http://localhost:9200/logstash-2015.05.07/syslog/_search?pretty=true' -d '{
  "filter": { "range": { "@timestamp": { "gte": "now-1h" } } },
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message": "Failed password for" } },
        { "match_phrase": { "path": "/var/log/secure" } }
      ]
    }
  }
}'
Here is the start of the output from the search:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 46,
"max_score" : 13.308596,
"hits" : [ {
"_index" : "logstash-2015.05.07",
"_type" : "syslog",
"_id" : "AU0wzLEqqCKq_IPSp_8k",
"_score" : 13.308596,
"_source":{"message":"May 7 16:53:50 s_local@logstash-02 sshd[17970]: Failed password for fred from 172.28.111.200 port 43487 ssh2","@version":"1","@timestamp":"2015-05-07T16:53:50.554-07:00","type":"syslog","host":"logstash-02","path":"/var/log/secure"}
}, ...
The problem is that if I change '/var/log/secure' to just 'var' and rerun the query, I still get a result, just with a lower score. I understood the bool/must construct to mean that both match clauses must succeed. What I'm after is no result at all if 'path' doesn't exactly match '/var/log/secure'...
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 46,
"max_score" : 10.354593,
"hits" : [ {
"_index" : "logstash-2015.05.07",
"_type" : "syslog",
"_id" : "AU0wzLEqqCKq_IPSp_8k",
"_score" : 10.354593,
"_source":{"message":"May 7 16:53:50 s_local@logstash-02 sshd[17970]: Failed password for fred from 172.28.111.200 port 43487 ssh2","@version":"1","@timestamp":"2015-05-07T16:53:50.554-07:00","type":"syslog","host":"logstash-02","path":"/var/log/secure"}
},...
I checked the mappings for these fields to confirm that they are not analyzed:
curl -X GET 'http://localhost:9200/logstash-2015.05.07/_mapping?pretty=true'
I think these fields are not analyzed, and so I believe the search terms will not be analyzed either (based on some Elasticsearch training documentation I read recently). Here is a snippet of the _mapping output for this index:
....
"message" : {
"type" : "string",
"norms" : {
"enabled" : false
},
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed",
"ignore_above" : 256
}
}
},
"path" : {
"type" : "string",
"norms" : {
"enabled" : false
},
"fields" : {
"raw" : {
"type" : "string",
"index" : "not_analyzed",
"ignore_above" : 256
}
}
},
....
Where am I going wrong, or what am I misunderstanding here?
As mentioned in the OP, you would need to use the "not_analyzed" view of the fields; per the OP's mapping, the non-analyzed versions are message.raw and path.raw.
Example:
{
  "filter": { "range": { "@timestamp": { "gte": "now-1h" } } },
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message.raw": "Failed password for" } },
        { "match_phrase": { "path.raw": "/var/log/secure" } }
      ]
    }
  }
}
The multi-fields documentation gives more insight. To expand further:
The mapping in the OP for path is as follows:
"path" : {
  "type" : "string",
  "norms" : {
    "enabled" : false
  },
  "fields" : {
    "raw" : {
      "type" : "string",
      "index" : "not_analyzed",
      "ignore_above" : 256
    }
  }
}
This specifies that the path field uses the default analyzer, while path.raw is not analyzed.
If you want the path field itself to be not_analyzed instead of raw, it would look something like this:
"path" : {
  "type" : "string",
  "index" : "not_analyzed",
  "norms" : {
    "enabled" : false
  },
  "fields" : {
    "raw" : {
      "type" : "string",
      "index" : <whatever analyzer you want>,
      "ignore_above" : 256
    }
  }
}
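As an aside, given the mapping above, a match_phrase on the analyzed message field combined with an exact term filter on path.raw also achieves the "no result unless path matches exactly" behavior the OP asked for. On current Elasticsearch versions the standalone top-level filter is gone, so the range clause moves into the bool as well (a sketch; field names assumed from the mapping above):

```json
GET logstash-2015.05.07/_search
{
  "query": {
    "bool": {
      "must": [
        { "match_phrase": { "message": "Failed password for" } }
      ],
      "filter": [
        { "term": { "path.raw": "/var/log/secure" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}
```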
