How to get the outer object in Elasticsearch from a highlight field

I have some documents like this:
{
  "content": "DocumentFile",
  "title": "no title",
  "post_id": "18",
  "url": "http://localhost/wp/?p=18",
  "attachments": [{
    "content": "Hi this is extrach data of file",
    "hash": "UPR9BC57IW3PUNTQ3LP79Q6UN0V3ZR7AAFJNUFGH",
    "name": "file."
  }],
  "isDeleted": "false",
  "__creationdDate": "1456758952671"
}
and I initialize the Elasticsearch mapping like this:
{
  "post": {
    "properties": {
      "content": {
        "type": "string"
      },
      "title": {
        "type": "string"
      },
      "url": {
        "type": "string"
      },
      "post_id": {
        "type": "string",
        "fields": {
          "raw": {
            "type": "integer",
            "index": "not_analyzed"
          }
        }
      },
      "attachments": {
        "type": "nested",
        "include_in_parent": true,
        "properties": {
          "hash": {
            "type": "string",
            "analyzer": "vira_analyzer"
          },
          "name": {
            "type": "string",
            "analyzer": "vira_analyzer"
          },
          "content": {
            "type": "string",
            "analyzer": "vira_analyzer"
          }
        }
      },
      "_uid": {
        "type": "string",
        "analyzer": "vira_analyzer"
      }
    }
  }
}
I add title, content, attachments.name and attachments.content to the highlight fields. When I search for text that is in attachments.content, Elasticsearch finds it. Now I want to get the hash of that attachment. What should I do? Is there an option to have Elasticsearch give me the whole attachment block whose name or content matched the query? With this:

Text[] highlightFragments = hitHighLights.get(field).getFragments();

Elasticsearch gives me just the content field of that attachment. I want the whole block, something like this:
{
  "content": "Hi this is extrach data of file",
  "hash": "UPR9BC57IW3PUNTQ3LP79Q6UN0V3ZR7AAFJNUFGH",
  "name": "file."
}
(One way is to fetch the _source of the document and search through it myself, but that is not good because it slows things down a lot.)
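For what it's worth, a nested query with inner_hits looks like the right tool here: a sketch, assuming the attachments mapping above and a search term taken from the sample data. Each inner hit carries the complete _source of the matching attachments object (hash included), and highlighting can be requested inside inner_hits, so the document source never has to be re-scanned client-side:

{
  "query": {
    "nested": {
      "path": "attachments",
      "query": {
        "multi_match": {
          "query": "extrach",
          "fields": ["attachments.name", "attachments.content"]
        }
      },
      "inner_hits": {
        "highlight": {
          "fields": {
            "attachments.content": {}
          }
        }
      }
    }
  }
}

In the Java client the matching blocks are then available from each hit via getInnerHits() instead of only the highlight fragments.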

Related

Return document based on nested array matched field count in Elastic search

Using Elastic version 7.15.1
{
  "mappings": {
    "properties": {
      "Activity": {
        "type": "nested",
        "properties": {
          "Data": {
            "type": "text"
          },
          "Type": {
            "type": "keyword"
          },
          "created_at": {
            "type": "date"
          },
          "updated_at": {
            "type": "date"
          }
        }
      },
      "FirstName": {
        "type": "text",
        "analyzer": "standard_autocomplete",
        "search_analyzer": "standard_autocomplete_search"
      }
    }
  }
}
Example Data
{
  "Activity": [
    {
      "Type": "type1",
      "Data": "data",
      "created_at": "2022-08-08T15:23:58.000000Z"
    },
    {
      "Type": "type1",
      "Data": "data",
      "created_at": "2022-08-08T15:25:45.000000Z"
    },
    {
      "Type": "type2",
      "Data": "data",
      "created_at": "2022-08-08T15:26:03.000000Z"
    }
  ],
  "FirstName": "Testtt"
}
I want this document to be returned only if "Activity.Type" is "type1" and the count of "type1" entries is greater than 1.
Also, how can we use created_at in the nested array together with the above constraint?
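One approach that should work here (a sketch, not tested against this exact index): give every matching nested entry a constant score of 1, sum those scores with score_mode, and use a top-level min_score to turn the summed score into a match counter. A created_at constraint goes into the same nested filter; the date bound below is invented for illustration:

{
  "min_score": 2,
  "query": {
    "nested": {
      "path": "Activity",
      "score_mode": "sum",
      "query": {
        "constant_score": {
          "boost": 1,
          "filter": {
            "bool": {
              "must": [
                { "term": { "Activity.Type": "type1" } },
                { "range": { "Activity.created_at": { "gte": "2022-08-01" } } }
              ]
            }
          }
        }
      }
    }
  }
}

Since each qualifying "type1" entry contributes exactly 1 to the document score, "min_score": 2 keeps only documents where the count is greater than 1.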

ES index rate becomes slow after creating index mapping

I write data using the ES BulkProcessor (I tried a python script, storm es-bolt and flink es-sink), but the index rate is very slow after creating the index mapping.
Situation 1: leave all index settings at their defaults and the index rate reaches about 10000+.
Situation 2: just create the index mapping first and the index rate falls to 3000.
I use the same data, same code and same machines.
Result: flink es-sink writing JSON data to ES (screenshot omitted).
My data
I repeatedly write the same document shown below (the message field is the raw log, about 7KB in size; I deleted some of its content to stay within the question size limit):
{
  "_index": "nyc_flink_test997",
  "_type": "doc",
  "_id": "k8uS92cBOH4ugSIjCzmn",
  "_score": 1,
  "_source": {
    "exception": "false",
    "log_id": "8F71AF1606EE46BFA9D57AA2282D8596",
    "offset": "2368",
    "message_length": "2103",
    "level": "INFO",
    "source": "/opt/hadoop/elastic-stack/s_login/Gusermanager.usermanager.s_login.20.log",
    "sessionid": "provider-60-2883b4bd3ff2b",
    "associate_id": "33d081b83a0654a2",
    "message": """
[16:41:33.376][I][ec4edfe0b2584b73]log start:53F9A1A1E71044E281755E930E1B004C
[16:41:33.376][T][ec4edfe0b2584b73]入参0=__REQ__
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4119)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2570)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2731)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2815)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:2155)
at com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2322)
at cn.com.agree.addal.cp.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:46)
at tc.bank.aesb.mbs.MBS_DBIMPL.PyDBGetSel(MBS_DBIMPL.java:1624)
at tc.bank.aesb.mbs.MBS_DBIMPL.PyDBExecOneSQL(MBS_DBIMPL.java:466)
at tc.bank.aesb.mbs.MBS_DBIMPL.PyDBExecGrpSQL(MBS_DBIMPL.java:123)
at tc.bank.aesb.mbs.B_MBS_DataBase.B_DBUnityRptOpr(B_MBS_DataBase.java:121)
at CUST.CustomerInfoQry.TCustomerInfoQry$Step1$Node4.execute(TCustomerInfoQry.java:200)
at CUST.CustomerInfoQry.TCustomerInfoQry$Step1.execute(TCustomerInfoQry.java:113)
at CUST.CustomerInfoQry.TCustomerInfoQry.execute(TCustomerInfoQry.java:76)
at cn.com.agree.afa.svc.javaengine.JavaEngine.execute(JavaEngine.java:237)
at cn.com.agree.afa.svc.handler.TradeHandler.handle(TradeHandler.java:62)
[16:41:33.414][I][ec4edfe0b2584b73]log end:53F9A1A1E71044E281755E930E1B004C
""",
"exec_ip": "10.88.188.167",
"start_time": "2018-12-09 16:46:14.764",
"group_v2": "Gusermanager",
"script_exec_time": "1",
"trade_exec_time": "2"
}
}
index mapping
{
  "mappings": {
    "doc": {
      "dynamic_templates": [
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "norms": false,
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ],
      "properties": {
        "@timestamp": { "type": "date" },
        "@version": { "type": "keyword" },
        "geoip": {
          "dynamic": true,
          "properties": {
            "ip": { "type": "ip" },
            "location": { "type": "geo_point" },
            "latitude": { "type": "half_float" },
            "longitude": { "type": "half_float" }
          }
        },
        "exception": { "type": "boolean" },
        "message": {
          "type": "text",
          "norms": false,
          "analyzer": "ik_max_word"
        },
        "associate_id": { "type": "text", "analyzer": "ik_max_word" },
        "end_time": {
          "type": "date",
          "format": "date_time||yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd||epoch_millis||HH:mm:ss.SSS"
        },
        "start_time": {
          "type": "date",
          "format": "date_time||yyyy-MM-dd HH:mm:ss.SSS||yyyy-MM-dd||epoch_millis||HH:mm:ss.SSS"
        },
        "exec_ip": { "type": "ip" },
        "level": { "type": "keyword" },
        "script_exec_time": { "type": "long" },
        "trade_exec_time": { "type": "long" },
        "sessionid": { "type": "text", "analyzer": "ik_max_word" },
        "log_id": { "type": "text", "analyzer": "ik_max_word" },
        "discard_time": { "type": "long" },
        "scene_code": { "type": "text", "analyzer": "ik_max_word" },
        "service_code": { "type": "text", "analyzer": "ik_max_word" },
        "group": { "type": "text" },
        "group_v2": { "type": "text", "analyzer": "ik_max_word" },
        "message_length": { "type": "long" },
        "log_filename": { "type": "text", "analyzer": "ik_max_word" },
        "ingest_time": { "type": "date" }
      }
    }
  }
}
I tried writing with python scripts and storm es-bolt; the result is the same: the index rate falls after creating the index mapping. Can anyone give me some ideas about it? Thanks in advance.
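One thing worth isolating (an assumption on my part, not something the question establishes): the explicit mapping analyzes the large message field with ik_max_word and gives every dynamic string field an extra keyword sub-field, and both cost CPU at index time compared with the dynamic defaults. A quick test is to create an otherwise identical index whose message uses the standard analyzer, write the same data, and compare rates; the index name here is made up:

curl -XPUT 'localhost:9200/nyc_flink_test_std' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "message": {
          "type": "text",
          "norms": false
        }
      }
    }
  }
}'

If the index rate recovers, the ik_max_word analysis of the ~7KB message field is the main cost; if not, the extra keyword sub-fields produced by the dynamic template are the next thing to test.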

Elasticsearch.js analyzer error using custom analyzer

Using the latest version of elasticsearch.js and trying to create a custom path analyzer when indexing and creating the mapping for some posts.
The goal is to create keywords out of each segment of the path. However, as a start, I'm simply trying to get the analyzer working.
Here is the elasticsearch.js file create_mapped_index.js; you can see the custom analyzer near the top:
var client = require('./connection.js');

client.indices.create({
  index: "wcm-posts",
  body: {
    "settings": {
      "analysis": {
        "analyzer": {
          "wcm_path_analyzer": {
            "tokenizer": "wcm_path_tokenizer",
            "type": "custom"
          }
        },
        "tokenizer": {
          "wcm_path_tokenizer": {
            "type": "pattern",
            "pattern": "/"
          }
        }
      }
    },
    "mappings": {
      "post": {
        "properties": {
          "id": { "type": "string", "index": "not_analyzed" },
          "titles": {
            "type": "object",
            "properties": {
              "main": { "type": "string" },
              "subtitle": { "type": "string" },
              "alternate": { "type": "string" },
              "concise": { "type": "string" },
              "seo": { "type": "string" }
            }
          },
          "tags": {
            "properties": {
              "id": { "type": "string", "index": "not_analyzed" },
              "name": { "type": "string", "index": "not_analyzed" },
              "slug": { "type": "string" }
            }
          },
          "main_taxonomies": {
            "properties": {
              "id": { "type": "string", "index": "not_analyzed" },
              "name": { "type": "string", "index": "not_analyzed" },
              "slug": { "type": "string", "index": "not_analyzed" },
              "path": { "type": "string", "index": "wcm_path_analyzer" }
            }
          },
          "categories": {
            "properties": {
              "id": { "type": "string", "index": "not_analyzed" },
              "name": { "type": "string", "index": "not_analyzed" },
              "slug": { "type": "string", "index": "not_analyzed" },
              "path": { "type": "string", "index": "wcm_path_analyzer" }
            }
          },
          "content_elements": {
            "dynamic": "true",
            "type": "nested",
            "properties": {
              "content": { "type": "string" }
            }
          }
        }
      }
    }
  }
}, function (err, resp, respcode) {
  console.log(err, resp, respcode);
});
If the path field is set to "not_analyzed", or the index option is omitted, then the index creation, mapping and insertion of posts all work.
As soon as I try to use the custom analyzer on the main_taxonomies and categories path fields, as shown in the JSON above, I get this error:
response: '{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"wrong value for index [wcm_path_analyzer] for field [path]"}],"type":"mapper_parsing_exception","reason":"Failed to parse mapping [post]: wrong value for index [wcm_path_analyzer] for field [path]","caused_by":{"type":"mapper_parsing_exception","reason":"wrong value for index [wcm_path_analyzer] for field [path]"}},"status":400}',
toString: [Function],
toJSON: [Function] } { error:
{ root_cause: [ [Object] ],
type: 'mapper_parsing_exception',
reason: 'Failed to parse mapping [post]: wrong value for index [wcm_path_analyzer] for field [path]',
caused_by:
{ type: 'mapper_parsing_exception',
reason: 'wrong value for index [wcm_path_analyzer] for field [path]' } },
status: 400 } 400
Here is an example of the two objects that need the custom analyzer on the path field. I pulled this example after inserting 15 posts into the Elasticsearch index while not using the custom analyzer:

"main_taxonomies": [
  {
    "id": "123",
    "type": "category",
    "name": "News",
    "slug": "news",
    "path": "/News/"
  }
],
"categories": [
  {
    "id": "157",
    "name": "Local News",
    "slug": "local-news",
    "path": "/News/Local News/",
    "main": true
  },
Up to this point I had googled similar questions, and most said that people were missing the analyzers in settings or were not adding the parameters to the body; I believe I have both of those right.
I have also reviewed the elasticsearch.js documentation and tried to create a:
client.indices.putSettings({})
But for this to be used, the index needs to exist with the mappings, or it throws a 'no indices found' error.
Not sure where to go from here? Your suggestions are appreciated.
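For what it's worth, the error message itself hints at the fix (a sketch that mirrors the final mapping further down): the index option only accepts values such as analyzed, not_analyzed or no, so a custom analyzer has to be attached through the analyzer property instead:

"path": { "type": "string", "analyzer": "wcm_path_analyzer" }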
So the final analyzer is:
var client = require('./connection.js');

client.indices.create({
  index: "wcm-posts",
  body: {
    "settings": {
      "analysis": {
        "analyzer": {
          "wcm_path_analyzer": {
            "type": "pattern",
            "lowercase": true,
            "pattern": "/"
          }
        }
      }
    },
    "mappings": {
      "post": {
        "properties": {
          "id": { "type": "string", "index": "not_analyzed" },
          "client_id": { "type": "string", "index": "not_analyzed" },
          "license_id": { "type": "string", "index": "not_analyzed" },
          "origin_id": { "type": "string" },
          ...
          ...
          "origin_slug": { "type": "string" },
          "main_taxonomies_path": { "type": "string", "analyzer": "wcm_path_analyzer", "search_analyzer": "standard" },
          "categories_paths": { "type": "string", "analyzer": "wcm_path_analyzer", "search_analyzer": "standard" },
          "search_tags": { "type": "string" },
          // See the custom analyzer set here ------------------^
I did determine that, at least for the path or pattern analyzers, complex nested objects cannot be used; flattened fields set to "type": "string" were the only way to get this to work.
I ended up not needing a custom tokenizer, as the pattern analyzer is full-featured and already includes a tokenizer.
I chose the pattern analyzer because it breaks on the pattern, leaving individual terms, whereas the path_hierarchy tokenizer segments the path in different ways but does not create individual terms (I hope I'm correct in saying this; I base it on the documentation).
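A quick way to see the difference between the two (a sketch; the _analyze syntax below is the older query-string form and varies by ES version):

curl -XGET 'localhost:9200/wcm-posts/_analyze?analyzer=wcm_path_analyzer' -d '/News/Local News/'
curl -XGET 'localhost:9200/_analyze?tokenizer=path_hierarchy' -d '/News/Local News/'

The first should produce individual lowercased terms such as news and local news, while the second produces cumulative prefixes such as /News and /News/Local News.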
Hope this helps someone else!
Steve
So I got it working ... I think the JSON objects were too complex, or it was the change of adding the analyzer to the field mappings that did the trick.
First I flattened the objects out to:
"main_taxonomies_path": "/News/",
"categories_paths": [ "/News/Local/", "/Business/Local/" ],
"search_tags": [ "montreal-3","laval-4" ],
Then I updated the analyzer to:
"settings": {
"analysis": {
"analyzer": {
"wcm_path_analyzer": {
"tokenizer": "wcm_path_tokenizer",
"type": "custom"
}
},
"tokenizer": {
"wcm_path_tokenizer": {
"type": "pattern",
"pattern": "/",
"replacement": ","
}
}
}
},
Notice that the analyzer 'type' is set to custom.
Then, when mapping these flattened fields:
"main_taxonomies_path": { "type": "string", "analyzer": "wcm_path_analyzer" },
"categories_paths": { "type": "string", "analyzer": "wcm_path_analyzer" },
"search_tags": { "type": "string" },
which, when searching, yields these fields:
"main_taxonomies_path": "/News/",
"categories_paths": [ "/News/Local News/", "/Business/Local Business/" ],
"search_tags": [ "montreal-2", "laval-3" ],
So the custom analyzer does what it was set to do in this situation.
I'm not sure whether I could apply type object to main_taxonomies_path and categories_paths, so I will play around with this and see.
I will be refining the pattern searches to format the results differently, but I'm happy to have this working.
For completeness I will post my final custom pattern analyzer, mapping and results once I've completed this.
Regards,
Steve

Elasticsearch + Kibana, sorting on uri yields no results. (uri isn't analyzed)

I have a log of HTTP requests; one of the fields is a URI field. I want to get the average duration in ms for each URI. I set the y-axis in Kibana to
"Aggregation: Average, Field: durationInMs".
For the x-axis I have
"Aggregation: Terms, Field: uri, Order by: metric: Average durationInMs, Order: descending, Size: 5".
This gives me a result, but it doesn't use the entire URI; instead it splits the URI up and matches parts of it. After a quick google I found "multi-fields" and added a uri.raw field to my index. The analyzed-field warning disappeared, but now I get no results at all.
Any hints or tips?
lsc-logs2 mapping:
{
  "lsc-logs2": {
    "mappings": {
      "httplogentry": {
        "properties": {
          "context": { "type": "string" },
          "durationInMs": { "type": "double" },
          "id": { "type": "long" },
          "method": { "type": "string" },
          "source": { "type": "string" },
          "startTime": {
            "type": "date",
            "format": "strict_date_optional_time||epoch_millis"
          },
          "status": { "type": "long" },
          "uri": {
            "type": "string",
            "fields": {
              "raw": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "username": { "type": "string" },
          "version": { "type": "long" }
        }
      }
    }
  }
}
An example document:
{
  "_index": "lsc-logs2",
  "_type": "httplogentry",
  "_id": "1148440",
  "_score": 1,
  "_source": {
    "startTime": "2016-08-22T10:30:57.2298086+02:00",
    "context": "contexturi",
    "method": "GET",
    "uri": "http://uri/plannings/unassigned?date=2016-08-22T03:58:57.168Z&page=1&pageSize=9999",
    "username": "user",
    "source": "192.168.1.82",
    "durationInMs": 171.83710000000002,
    "status": 200,
    "id": 1148440,
    "version": 1
  }
}
When reindexing data, the httplogentry mapping doesn't get ported from lsc-logs to lsc-logs2; you need to create the destination index and mapping first, and only then reindex.
First, delete the current destination index:
curl -XDELETE localhost:9200/lsc-logs2
Then create it anew by specifying the proper mapping:
curl -XPUT localhost:9200/lsc-logs2 -d '{
  "mappings": {
    "httplogentry": {
      "properties": {
        "context": { "type": "string" },
        "durationInMs": { "type": "double" },
        "id": { "type": "long" },
        "method": { "type": "string" },
        "source": { "type": "string" },
        "startTime": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        "status": { "type": "long" },
        "uri": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "username": { "type": "string" },
        "version": { "type": "long" }
      }
    }
  }
}'
Then you can reindex your data:
curl -XPOST localhost:9200/_reindex -d '{
  "source": {
    "index": "lsc-logs"
  },
  "dest": {
    "index": "lsc-logs2"
  }
}'
Then refresh the fields in your index pattern in Kibana and it should work.
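To sanity-check the new sub-field outside Kibana, the visualization described above can be reproduced directly as a terms aggregation ordered by the metric (a sketch; index and field names follow the mapping above):

curl -XPOST localhost:9200/lsc-logs2/_search -d '{
  "size": 0,
  "aggs": {
    "top_uris": {
      "terms": {
        "field": "uri.raw",
        "size": 5,
        "order": { "avg_duration": "desc" }
      },
      "aggs": {
        "avg_duration": { "avg": { "field": "durationInMs" } }
      }
    }
  }
}'

Each bucket key is then the full, unanalyzed URI with its average durationInMs alongside, which is what the Kibana chart should show once the index pattern picks up uri.raw.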

Elastic get relevant results

I have a query that searches in title, description, author, ean and isbn.
I have boosts for title^3, author^2, ean^100 and isbn^100.
When I get a hit on ean it returns only 1 result (ean is a number).
ISBN is a string, e.g. 978-12-1234-123-8, and I get thousands of results for ISBN; if there is a hit, that one will have a marginally higher score than the others.
I'm using multi_match with type best_fields.
Is there a way to get only relevant results, or do I have to do it myself?
EDIT:
Mappings:
"product": {
"properties": {
"img": {
"type": "string"
},
"dobrovsky_rating": {
"type": "float"
},
"isbn": {
"type": "string"
},
"saleType": {
"type": "string"
},
"rating": {
"type": "float"
},
"description": {
"analyzer": "hunspell_cs",
"type": "string"
},
"availability": {
"type": "string"
},
"priceDph": {
"type": "long"
},
"title": {
"analyzer": "hunspell_cs",
"type": "string"
},
"url": {
"index": "not_analyzed",
"type": "string"
},
"rating_count": {
"type": "long"
},
"ean": {
"type": "string"
},
"serie": {
"analyzer": "hunspell_cs",
"type": "string"
},
"id": {
"type": "long"
},
"category": {
"analyzer": "hunspell_cs",
"type": "string"
},
"authors": {
"analyzer": "hunspell_cs",
"type": "string"
}
}
Data example:
Id: 123
Title: Game of Thrones
Author: George R.R. Martin
Img: www.aaa.cz/got.png
Url: www.aaa.cz/got.html
Description: Game of Thrones is a ...
EAN: 9788071974925
ISBN: 978-80-7197-492-5
...
Try this:

POST /MyIndex/_search
{
  "from": 0,
  "size": 10,
  "_source": {
    "include": ["*"]
  },
  "query": {
    "query_string": {
      "query": "978-12-1234-123-8",
      "fields": [
        "title^3",
        "authors^2",
        "ean^100",
        "isbn^100",
        "description^1"
      ],
      "default_operator": "and"
    }
  }
}
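The default_operator of "and" helps because the analyzed ISBN gets split into several tokens and "and" requires all of them to match rather than any one. An alternative worth considering (my suggestion, not part of the answer above): map the identifier fields as not_analyzed so that only the exact string can match, which removes the partial ISBN hits altogether:

"isbn": { "type": "string", "index": "not_analyzed" },
"ean": { "type": "string", "index": "not_analyzed" }

With that mapping, a hit on isbn or ean is exact by construction, and the ^100 boost then cleanly pins it to the top of the results.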
