Can spring data elasticsearch join parent and child relationship? - spring-boot

{
"properties":{
"id":{
"type":"text",
"fields":{
"keyword":{
"ignore_above":256,
"type":"keyword"
}
}
},
"username":{
"type":"text",
"fields":{
"keyword":{
"ignore_above":256,
"type":"keyword"
}
}
},
"parentId":{
"type":"text",
"fields":{
"keyword":{
"ignore_above":256,
"type":"keyword"
}
}
}
}
}
For example, I have a user:
id:1,
username: admin,
parentId: null
I have another user:
id:5,
username:manager,
parentId:1
I have another user:
id:10,
username:staff001,
parentId:5
If I query like this:
{
"query": {
"query_string": {
"query": "*staff*",
"default_field": "*"
}
}
}
My expected result is staff001 and his parent's details.
Is it possible to do this with Spring Data Elasticsearch?
I am sure it is possible with Spring Data JPA mappings using @OneToOne or @ManyToOne (for example with MySQL/PostgreSQL).
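For what it's worth, Elasticsearch models parent/child with its join field type, and newer Spring Data Elasticsearch versions (4.1 and later, if I recall correctly) expose it through JoinField and the @JoinTypeRelations annotation, so the relationship itself can be mapped. Below is a minimal sketch under that assumption; the index name users and the relation names parent/child are illustrative, not from the question:

import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;
import org.springframework.data.elasticsearch.annotations.JoinTypeRelation;
import org.springframework.data.elasticsearch.annotations.JoinTypeRelations;
import org.springframework.data.elasticsearch.core.join.JoinField;

@Document(indexName = "users")
public class UserDoc {

    @Id
    private String id;

    @Field(type = FieldType.Text)
    private String username;

    // Join field replacing the flat parentId: each document is either a "parent"
    // or a "child" that stores its parent's id. Children must be indexed with
    // routing set to the parent id so they land on the same shard.
    @JoinTypeRelations(relations = {
            @JoinTypeRelation(parent = "parent", children = { "child" })
    })
    private JoinField<String> relation;

    // getters and setters omitted
}

With that in place, has_child/has_parent queries (optionally with inner_hits) can return a matching child together with its parent, which is the closest Elasticsearch equivalent of the @OneToOne/@ManyToOne joins mentioned above; a plain parentId text field, as in the mapping shown, can only be joined application-side.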

Related

Error while mapping: unknown setting [index.knn.algo_param.m]

I am trying to change a mapping in Elasticsearch and am getting this error:
https://ibb.co/q5LkfWz
"reason": "unknown setting [index.knn.algo_param.m] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
and this is the PUT request I am trying to make:
PUT posting
{
"settings":{
"index":{
"number_of_shards":1,
"number_of_replicas":0,
"knn":{
"algo_param":{
"ef_search":40,
"ef_construction":40,
"m":"4"
}
}
},
"knn":true
},
"mappings":{
"properties":{
"vector":{
"type":"knn_vector",
"dimension":384
},
"title":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"index":false
}
}
},
"company":{
"type":"keyword",
"index":false
},
"location":{
"type":"keyword",
"index":false
},
"salary":{
"type":"keyword",
"index":false
},
"job_description":{
"type":"keyword",
"index":false
}
}
}
}
The error indicates that the k-NN plugin is not installed on the OpenSearch cluster.
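One quick way to confirm this is to list the plugins installed on the cluster (a sketch, assuming the cluster is reachable on localhost:9200); on OpenSearch the k-NN plugin should show up as opensearch-knn:

curl -XGET "localhost:9200/_cat/plugins?v"

If it is missing, either install the k-NN plugin or drop the index.knn settings from the PUT request.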

How to implement space ignored search across a document with multiple fields in Elasticsearch

I'm trying to implement a space-agnostic product catalog search solution using Elasticsearch for a chemical-oriented product database. The use case is as follows:
Consider a chemical with the name: Dimethyl Sulphoxide
Some manufacturers label it as Dimethylsulphoxide and some as Dimethyl Sulphoxide, so I could have two Item entries in my ES index as follows:
Item 1: {"item_name":"Dimethyl Sulphoxide", "brand":"Merck"}
Item 2: {"item_name":"Dimethylsulphoxide","brand":"Spectrochem"}
Now ideally, if the user enters either string (i.e. Dimethyl Sulphoxide or Dimethylsulphoxide), I want both documents to be displayed in the hits.
To achieve this I'm doing two things:
1) At index time, I'm currently running the item_name field through a custom analyzer that consists of the following flow:
Tokenizing with keyword, then filtering with lowercase, then with word_joiner (with catenate_all), then with an edge_ngram filter.
So the string "Dimethyl Sulphoxide" becomes ("Dimethyl","Sulphoxide"), then ("dimethyl","sulphoxide"), then ("dimethyl","sulphoxide","dimethylsulphoxide"), then ("d","di","dim","dime",...,"dimethyl","s","su","sul",...,"sulphoxide","d","di",...,"dimethylsulphoxide").
I'm also running the other fields in the product document, such as the brand field, through the same analyzer.
2) At query time, I'm running the query string through a similar analyzer without the edge_ngram (specified as a custom search_analyzer for each field at index time). So a query string of "Dimethyl Sul" becomes ("Dimethyl","Sul"), then ("dimethyl","sul"), then ("dimethyl","sul","dimethylsul").
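A quick way to verify what either analyzer actually emits is the _analyze API against the index (a sketch; the index and analyzer names come from the mapping shown further below, and the host is assumed to be local):

curl -XPOST "localhost:9200/test-item-summary-5/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "shingleSearchAnalyzer",
  "text": "Dimethyl Sul"
}'

This should return the "dimethyl" / "dimethylsul" / "sul" tokens described above.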
Now I'm able to surface both results when the user searches with or without the space, but this approach gets in the way of my other use cases.
Consider the second use case, where the user should also be able to search for an item by the name plus the brand and other fields, all in one search box. For example, a user could search for one of the items above by entering "dimethyl spectrochem" or "sulphoxide merck".
To allow this, I have tried a multi_match query of type cross_fields and a combined_fields query, both with the operator set to AND. But this combination of word_joiner in the query string with cross_fields/combined_fields is giving me undesired results for the second use case.
When a user enters "dimethyl spectrochem", the search analyzer generates three tokens ("dimethyl", "spectrochem" and "dimethylspectrochem"), and when these are passed to cross_fields/combined_fields it essentially generates a query like:
+("item_name":"dimethyl","brand":"dimethyl")
+("item_name":"spectrochem","brand":"spectrochem")
+("item_name":"dimethylspectrochem","brand":"dimethylspectrochem")
Given the way cross_fields works, it looks for documents in which each of the three query strings is present in at least one of the fields. Since it's unable to find "dimethylspectrochem" in any single field, it returns zero results.
Is there a way I can satisfy both use cases?
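For reference, the kind of query described above would look roughly like this (a sketch; the actual query isn't shown in the question, and the field list here only covers name and brand):

{
  "query": {
    "multi_match": {
      "query": "dimethyl spectrochem",
      "type": "cross_fields",
      "operator": "and",
      "fields": ["name", "brand"]
    }
  }
}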
The mapping that I have specified during index creation is below
curl -XPUT http://localhost:9200/test-item-summary-5 -H 'Content-Type: application/json' -d'
{
"settings":{
"analysis":{
"tokenizer":{
"whitespace":{
"type":"whitespace"
},
"keyword":{
"type":"keyword"
}
},
"filter":{
"lowercase":{
"type":"lowercase"
},
"shingle_word_joiner":{
"type":"shingle",
"token_separator":""
},
"word_joiner":{
"type":"word_delimiter_graph",
"catenate_all":true,
"split_on_numerics":false,
"stem_english_possessive":false
},
"edge_ngram_filter":{
"type":"edge_ngram",
"min_gram":1,
"max_gram":20,
"token_chars":[
"letter",
"digit"
]
}
},
"analyzer":{
"whitespaceWithEdgeNGram":{
"tokenizer":"keyword",
"filter":[
"lowercase",
"word_joiner",
"edge_ngram_filter"
]
},
"spaceIgnoredWithLowerCase":{
"tokenizer":"keyword",
"char_filter":[
"dash_char_filter"
],
"filter":[
"lowercase",
"word_joiner"
]
},
"shingleSearchAnalyzer":{
"tokenizer":"whitespace",
"char_filter":[
"dash_char_filter"
],
"filter":[
"lowercase",
"shingle_word_joiner"
]
},
"whitespaceWithLowerCase":{
"tokenizer":"whitespace",
"char_filter":[
"dash_char_filter"
],
"filter":[
"lowercase"
]
}
},
"char_filter":{
"dash_char_filter":{
"type":"mapping",
"mappings":[
"- => ",
", => ",
". => ",
"( => ",
") => ",
"? => ",
"! => ",
": => ",
"; => ",
"_ => ",
"% => ",
"& => ",
"+ => ",
"\" => ",
"\/ => ",
"\\[ => ",
"\\] => ",
"* => ",
"\u0027 => "
]
}
}
}
},
"mappings":{
"properties":{
"item_code":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"mfr_item_code":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"brand":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"name":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"short_name":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"alias":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"attrs":{
"type":"nested",
"properties":{
"name":{
"type":"text",
"index":"false"
},
"value":{
"type":"text",
"copy_to":"item:attrs:value",
"index":"false"
},
"primaryAttribute":{
"type":"boolean",
"index":"false"
}
}
},
"variant_summaries":{
"type":"nested",
"properties":{
"item_code":{
"type":"text",
"index":"false"
},
"variant_code":{
"type":"text",
"copy_to":"variant:variant_code",
"index":"false"
},
"mfr_item_code":{
"type":"text",
"index":"false"
},
"mfr_variant_code":{
"type":"text",
"copy_to":"variant:mfr_variant_code",
"index":"false"
},
"brand":{
"type":"text",
"index":"false"
},
"unit":{
"type":"text",
"copy_to":"variant:unit",
"index":"false"
},
"unit_mag":{
"type":"float",
"copy_to":"variant:unit",
"index":"false"
},
"primary_alternate_unit":{
"type":"nested",
"properties":{
"unit":{
"type":"text",
"copy_to":"variant:unit",
"index":"false"
},
"unit_mag":{
"type":"float",
"copy_to":"variant:unit",
"index":"false"
}
}
},
"attrs":{
"type":"nested",
"properties":{
"name":{
"type":"text",
"index":"false"
},
"value":{
"type":"text",
"copy_to":"variant:attrs:value",
"index":"false"
},
"primaryAttribute":{
"type":"boolean",
"index":"false"
}
}
},
"image":{
"type":"text",
"index":"false"
},
"in_stock":{
"type":"boolean",
"index":"false"
}
}
},
"added_by":{
"type":"text",
"index":"false"
},
"modified_by":{
"type":"text",
"index":"false"
},
"created_on":{
"type":"date",
"index":"false"
},
"updated_on":{
"type":"date",
"index":"false"
},
"is_deleted":{
"type":"boolean",
"index":"false"
},
"variant:variant_code":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"variant:mfr_variant_code":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"variant:attrs:value":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"variant:unit":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
},
"item:attrs:value":{
"type":"text",
"analyzer":"whitespaceWithEdgeNGram",
"search_analyzer":"shingleSearchAnalyzer"
}
}
}
}'
Any suggestions on implementing a space-ignored search across multiple fields would be highly appreciated.

Error while performing aggregation query in Elasticsearch: "illegal_argument_exception / Fielddata is disabled on text fields by default"

Hi, I am performing a curl request against an Elasticsearch instance. However, I am getting the error below.
curl -X GET "localhost:57457/mep-reports*/_search?pretty&size=0" -H 'Content-Type: application/json' --data-binary @query.txt
Response:
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [status] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
],
"type" : "search_phase_execution_exception",
"reason" : "all shards failed",
"phase" : "query",
"grouped" : true,
"failed_shards" : [
{
"shard" : 0,
"index" : "mep-reports",
"node" : "NJuAFq3YSni4TIK9PzgJxg",
"reason" : {
"type" : "illegal_argument_exception",
"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [status] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
}
],
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [status] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Fielddata is disabled on text fields by default. Set fielddata=true on [status] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead."
}
}
},
"status" : 400
}
Any idea how to fix it? The following is the mapping definition I got:
curl -XGET "localhost:57457/mep-reports*/_mapping/field/*?pretty"
{
"mep-reports":{
"mappings":{
"doc":{
"_index":{
"full_name":"_index",
"mapping":{
}
},
"status.keyword":{
"full_name":"status.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"inventory":{
"full_name":"inventory",
"mapping":{
"inventory":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"flight_name.keyword":{
"full_name":"flight_name.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"nof_segments":{
"full_name":"nof_segments",
"mapping":{
"nof_segments":{
"type":"long"
}
}
},
"_all":{
"full_name":"_all",
"mapping":{
}
},
"_ignored":{
"full_name":"_ignored",
"mapping":{
}
},
"campaign_name":{
"full_name":"campaign_name",
"mapping":{
"campaign_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"_parent":{
"full_name":"_parent",
"mapping":{
}
},
"flight_id.keyword":{
"full_name":"flight_id.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"flight_name":{
"full_name":"flight_name",
"mapping":{
"flight_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"#version":{
"full_name":"#version",
"mapping":{
"#version":{
"type":"long"
}
}
},
"_version":{
"full_name":"_version",
"mapping":{
}
},
"campaign_id":{
"full_name":"campaign_id",
"mapping":{
"campaign_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"_routing":{
"full_name":"_routing",
"mapping":{
}
},
"_type":{
"full_name":"_type",
"mapping":{
}
},
"msg_text":{
"full_name":"msg_text",
"mapping":{
"msg_text":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"delivery_ts":{
"full_name":"delivery_ts",
"mapping":{
"delivery_ts":{
"type":"long"
}
}
},
"sender.keyword":{
"full_name":"sender.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"submission_ts":{
"full_name":"submission_ts",
"mapping":{
"submission_ts":{
"type":"long"
}
}
},
"flight_id":{
"full_name":"flight_id",
"mapping":{
"flight_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"_seq_no":{
"full_name":"_seq_no",
"mapping":{
}
},
"#timestamp":{
"full_name":"#timestamp",
"mapping":{
"#timestamp":{
"type":"date"
}
}
},
"account_id":{
"full_name":"account_id",
"mapping":{
"account_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"_field_names":{
"full_name":"_field_names",
"mapping":{
}
},
"sender":{
"full_name":"sender",
"mapping":{
"sender":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"recipient":{
"full_name":"recipient",
"mapping":{
"recipient":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"account_id.keyword":{
"full_name":"account_id.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"_source":{
"full_name":"_source",
"mapping":{
}
},
"_id":{
"full_name":"_id",
"mapping":{
}
},
"campaign_name.keyword":{
"full_name":"campaign_name.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"_uid":{
"full_name":"_uid",
"mapping":{
}
},
"recipient.keyword":{
"full_name":"recipient.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"inventory.keyword":{
"full_name":"inventory.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"msg_text.keyword":{
"full_name":"msg_text.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"status":{
"full_name":"status",
"mapping":{
"status":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
},
"campaign_id.keyword":{
"full_name":"campaign_id.keyword",
"mapping":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
}
I'd really appreciate it if you can help.
From the error, it appears that you are trying to perform an aggregation on a text field, i.e. status.
Note that you cannot perform aggregations on a text field unless you set fielddata: true on it.
However, this is not recommended, as it can consume a lot of heap space, which is what the error is warning about.
The mapping you've shared has the below details for the status field:
{
"status":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
I see that status has a keyword sub-field, status.keyword.
Just change the aggregation in your query.txt to use status.keyword instead of status and it should fix the issue.
Likewise, if you see more errors like that, you may want to make similar changes for the other fields as well. Note that this change is something you make in your aggregation query.
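For example, if query.txt contains a terms aggregation on status, pointing it at the keyword sub-field would look roughly like this (a sketch; the original query.txt isn't shown, so the aggregation name here is illustrative):

{
  "aggs": {
    "status_counts": {
      "terms": {
        "field": "status.keyword"
      }
    }
  }
}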
Let me know if this helps!

Elasticsearch error: mapper [email.keyword] of different type, current_type [text], merged_type [keyword]

I'm doing an upsert in PHP on a newly created index, so there is no data present. I'm getting an exception that I would expect to see if data were already there, but the index is freshly created. Is there something special I have to do with upserts on newly created indexes as well? The upsert works fine until I add the custom analyzer.
{
"error":{
"root_cause":[
{
"type":"remote_transport_exception",
"reason":"[8902bb997443][127.0.0.1:9300][indices:data/write/update[s]]"
}
],
"type":"illegal_argument_exception",
"reason":"mapper [email.keyword] of different type, current_type [text], merged_type [keyword]"
},
"status":400
}
Listed below is my creation code for the index
{
"index":"myindex",
"body":{
"settings":{
"analysis":{
"analyzer":{
"my_email_analyzer":{
"type":"custom",
"tokenizer":"uax_url_email",
"filter":[
"lowercase",
"stop"
]
}
}
}
},
"mappings":{
"properties":{
"ak_additional_recovery_email":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"ak_city_town":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"ak_first_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"ak_last_name":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"ak_second_additional_recovery_email":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"ak_state":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"email":{
"type":"text",
"fields":{
"keyword":{
"type":"text",
"analyzer":"my_email_analyzer"
}
}
},
"indexedHash":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"uID":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"uName":{
"type":"text",
"fields":{
"keyword":{
"type":"text",
"analyzer":"my_email_analyzer"
}
}
}
}
}
}
}
And here is the PHP code trying to do the upsert
$this->client->update([
'id' => $data['uID'],
'body' => [
'doc' => $data,
'upsert' => [
'uName' => $data['uName'],
'email' => $data['email'],
'ak_first_name' => $data['ak_first_name'],
'ak_last_name' => $data['ak_last_name'],
'ak_city_town' => $data['ak_city_town'],
'ak_state' => $data['ak_state']
]
],
'index' => $this->dbName,
'type' => 'general'
]);
Simple mistake! I was using an incorrect type for the index. I'm not sure why this particular error was raised, though.
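If the error comes back, a quick way to see which definition email.keyword actually ended up with is to fetch the index mapping directly (a minimal check, assuming the index name myindex from the creation code and a local cluster):

curl -XGET "localhost:9200/myindex/_mapping?pretty"

If anything was indexed before the custom mapping took effect (or under a different mapping type), dynamic mapping may already have created email.keyword as a plain keyword field, which would conflict with the text + my_email_analyzer definition above.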

Elasticsearch [match] unknown token [START_OBJECT] after [created_utc]

I am learning how to use Elasticsearch using the 2006 dataset of Reddit comments from pushshift.io.
created_utc is the field with the time a comment was created.
I am trying to get all the posts within a certain time range. I googled a bit and found out that I need to use the "range" keyword.
This is my query right now:
{
"query": {
"match" : {
"range": {
"created_utc": {
"gte": "1/1/2006",
"lte": "31/1/2006",
"format": "dd/MM/yyyy"
}
}
}
}
}
I then tried using a bool query so I can combine the time range with a must_not on edited = false (edited being the boolean field that tells me whether a post has been edited or not):
{
"query": {
"bool" : {
"must" : {
"range" : {
"created_utc": {
"gte" : "01/12/2006", "lte": "31/12/2006", "format": "dd/MM/yyyy"
}
}
},
"must_not": {
"edited": False
}
}
}
}
However, this gave me another error that I can't figure out:
[edited] query malformed, no start_object after query name
I'd appreciate if anyone can help me out with this, thanks!
Here is my mapping for the comment if it helps:
{
"comment":{
"properties":{
"author":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"body":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"controversiality":{
"type":"long"
},
"created_utc":{
"type":"date"
},
"edited":{
"type":"boolean"
},
"gilded":{
"type":"long"
},
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"link_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"parent_id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"score":{
"type":"long"
},
"subreddit":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
If you want to get all the posts within a time range, then you should be using a range query. The problem with your query is that you are using range inside a match query, which is not allowed in Elasticsearch, so your query should look like:
{
"query": {
"range": {
"created_utc": {
"gte": 1136074029,
"lte": 1136076410
}
}
}
}
Given that the created_utc field is stored as an epoch timestamp, you must query it with an epoch value.
For the second query, where you want to find the posts within a range and edited must not be false:
{
"query": {
"bool": {
"must": [
{
"range": {
"created_utc": {
"gte": 1136074029,
"lte": 1136076410
}
}
}
],
"must_not": [
{
"match": {
"edited": false
}
}
]
}
}
}
Note: if your created_utc is stored with a dd/MM/yyyy format, then while querying you should use the strict (zero-padded) form of that format, i.e. instead of 1/1/2006 you should supply 01/01/2006.
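For instance, the first range query from the question would then become (a sketch, valid only if created_utc actually accepts that date format rather than being stored as epoch seconds):

{
  "query": {
    "range": {
      "created_utc": {
        "gte": "01/01/2006",
        "lte": "31/01/2006",
        "format": "dd/MM/yyyy"
      }
    }
  }
}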
Hope this helps!
