extract text from field arrays - elasticsearch

One of the fields called "resources" has the following 2 inner documents.
{
  "type": "AWS::S3::Object",
  "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
  "accountId": "934331768510612",
  "type": "AWS::S3::Bucket",
  "ARN": "arn:aws:s3:::sms_vild"
}
I need to split the ARN field and get the last part of it, i.e. "reports_201706.schema", preferably using a scripted field.
What I have tried:
1) I checked the fields list and found only 2 entries: resources.accountId and resources.type.
2) I tried with a date-time field and it worked correctly in the scripted field option (expression):
doc['eventTime'].value
3) But the same does not work with other text fields, e.g.:
doc['eventType'].value
Getting this error:
"caused_by":{"type":"script_exception","reason":"link error","script_stack":["doc['eventType'].value","^---- HERE"],"script":"doc['eventType'].value","lang":"expression","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [eventType] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."}}},"status":500}
It means I need to change the mapping. Is there any other way to extract text from nested arrays in an object?
Update:
Please visit sample kibana here...
https://search-accountact-phhofxr23bjev4uscghwda4y7m.us-east-1.es.amazonaws.com/_plugin/kibana/
search for "ebs_attach.png" and then check the resources field. You will see 2 nested documents like this...
{
  "type": "AWS::S3::Object",
  "ARN": "arn:aws:s3:::datameetgeo/ebs_attach.png"
},
{
  "accountId": "513469704633",
  "type": "AWS::S3::Bucket",
  "ARN": "arn:aws:s3:::datameetgeo"
}
I need to split the ARN field and extract the last part, which is again "ebs_attach.png".
If I can somehow display it as a scripted field, then I can see the bucket name and the file name side by side on the Discover tab.
Update 2
In other words, I am trying to extract the text shown in this image as a new field on discovery tab.

While you can use scripting for this, I highly encourage you to extract that kind of information at index time. I have provided two examples here, which are far from failsafe (you need to test with different paths, or with this field missing entirely), but they should provide a base to start with.
PUT foo/bar/1
{
  "resources": [
    {
      "type": "AWS::S3::Object",
      "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
    },
    {
      "accountId": "934331768510612",
      "type": "AWS::S3::Bucket",
      "ARN": "arn:aws:s3:::sms_vild"
    }
  ]
}
# this is slow!!!
GET foo/_search
{
  "script_fields": {
    "document": {
      "script": {
        "inline": "return params._source.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
      }
    }
  }
}
# Do this at index time, by adding a pipeline
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script" : {
        "inline": "ctx.filename = ctx.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
      }
    }
  ]
}
# Store the document, specifying the pipeline
PUT foo/bar/1?pipeline=my-pipeline-id
{
  "resources": [
    {
      "type": "AWS::S3::Object",
      "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
    },
    {
      "accountId": "934331768510612",
      "type": "AWS::S3::Bucket",
      "ARN": "arn:aws:s3:::sms_vild"
    }
  ]
}
# Let's check the filename field of the indexed document by getting it
GET foo/bar/1

# We can even search for this file now
GET foo/_search
{
  "query": {
    "match": {
      "filename": "reports_201706.schema"
    }
  }
}
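The extraction logic in the Painless scripts above can be prototyped in plain Python before porting it (a sketch; the function name extract_filename is illustrative):

```python
# Mirror of the Painless script: pick the AWS::S3::Object entry and
# keep everything after the last "/" of its ARN.
def extract_filename(resources):
    for r in resources:
        if r.get("type") == "AWS::S3::Object":
            arn = r["ARN"]
            return arn[arn.rfind("/") + 1:]
    return "NONE"

resources = [
    {"type": "AWS::S3::Object",
     "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"},
    {"accountId": "934331768510612",
     "type": "AWS::S3::Bucket",
     "ARN": "arn:aws:s3:::sms_vild"},
]

print(extract_filename(resources))  # reports_201706.schema
```

Note that rfind returns -1 when there is no "/", so a bucket-only ARN would be returned whole, matching the lastIndexOf behavior in Painless.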

Note: Assuming "resources" is an array:
NSArray *array_ARN_Values = [resources valueForKey:@"ARN"];
Hope it will work for you!

Related

Good practice for ElasticSearch index mapping

I'm new to Elasticsearch and I would like to know if there are any good practices for the use case I have.
I have heterogeneous data sent from an API that I save into a database (as JSON) and then index in Elasticsearch for search purposes. The data is sent in this format (because it's heterogeneous, users can send any type of data: some metadata can be multivalued, other values single, and the key names in the JSON may vary):
{
  "indices": {
    "MultipleIndices": [
      {
        "index": "editors",
        "values": [
          "The Editing House",
          "Volcan Editing"
        ]
      },
      {
        "index": "colors",
        "values": [
          "Red",
          "Blue"
        ]
      }
    ],
    "SimpleIndices": [
      {
        "index": "AuthorName",
        "value": "George R. R. Martin"
      },
      {
        "index": "NumberOfPages",
        "value": "2898"
      },
      {
        "index": "BookType",
        "value": "Fantasy"
      }
    ]
  }
}
Once we receive this JSON, it's formatted in the code and stored as JSON in a database in this format:
{
  "indices": {
    "editors": [
      "The Editing House",
      "Volcan Editing"
    ],
    "colors": [
      "Red",
      "Blue"
    ],
    "AuthorName": "George R. R. Martin",
    "NumberOfPages": "2898",
    "BookType": "Fantasy"
  }
}
I then want to save this data into Elasticsearch. What's the best way to map it? Should I store it as JSON in one field? Will the search be efficient if I do it this way?
You must map each field individually.
You can take a look at the field types to understand which type is ideal for your schema.
Another suggestion is to study text analysis, since it is responsible for structuring the text to optimize search.
My suggestion map:
PUT indices
{
  "mappings": {
    "properties": {
      "editors": {
        "type": "keyword"
      },
      "colors": {
        "type": "keyword"
      },
      "author_name": {
        "type": "text"
      },
      "number_pages": {
        "type": "integer"
      },
      "book_type": {
        "type": "keyword"
      }
    }
  }
}
I think in your case you don't have much choice apart from dynamic mapping, which Elasticsearch will generate for you as soon as the first document is indexed into a particular index.
However, you can improve the process by using dynamic templates so that you can optimize your mapping; there are good examples of that in the official link I provided.

How to search by non-tokenized field length in ElasticSearch

Say I create an index people whose entries will have two properties: name and friends.
PUT /people
{
  "mappings": {
    "properties": {
      "friends": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
and I add two entries, each of which has two friends.
POST /people/_doc
{
  "name": "Jack",
  "friends": [
    "Jill", "John"
  ]
}

POST /people/_doc
{
  "name": "Max",
  "friends": [
    "John", "John"  # Max will have two friends, but both named John
  ]
}
Now I want to search for people that have multiple friends
GET /people/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "source": "doc['friends.keyword'].length > 1"
            }
          }
        }
      ]
    }
  }
}
This will only return Jack and ignore Max. I assume this is because we are actually traversing the inverted index, and "John" and "John" produce only one token, 'john', so the token count is actually 1 here.
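The deduplication can be sketched in plain Python: keyword doc values behave like a sorted set of unique terms per document, so duplicates collapse (a sketch; doc_value_count is an illustrative stand-in for doc['friends.keyword'].length):

```python
# keyword doc values act like a set of unique terms per document,
# so counting them undercounts duplicated values.
def doc_value_count(values):
    return len(set(values))

print(doc_value_count(["Jill", "John"]))  # 2 -> Jack passes the filter
print(doc_value_count(["John", "John"]))  # 1 -> Max is filtered out
```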
Since my index is relatively small and performance is not key, I would like to traverse the source instead of the inverted index:
GET /people/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "script": {
            "script": {
              "source": "ctx._source.friends.length > 1"
            }
          }
        }
      ]
    }
  }
}
But according to https://github.com/elastic/elasticsearch/issues/20068, the source is supported only when updating, not when searching, so I cannot.
One obvious solution seems to be to compute the length of the field and store it in the index, e.g. friends_count: 2, and then filter on that. But that requires reindexing, and this also looks like something that should be solvable in some obvious way I am missing.
Thanks a lot.
There is a new feature in ES 7.11 called runtime fields. A runtime field is a field that is evaluated at query time. Runtime fields enable you to:
Add fields to existing documents without reindexing your data
Start working with your data without understanding how it's structured
Override the value returned from an indexed field at query time
Define fields for a specific use without modifying the underlying schema
You can find more information about runtime fields here; as for how to use them, you can do something like this:
Index Time:
PUT my-index/
{
  "mappings": {
    "runtime": {
      "friends_count": {
        "type": "keyword",
        "script": {
          "source": "doc['friends'].size()"
        }
      }
    },
    "properties": {
      "@timestamp": {"type": "date"}
    }
  }
}
You can also use runtime fields at search time; for more information check here.
Search Time
GET my-index/_search
{
  "runtime_mappings": {
    "friends_count": {
      "type": "keyword",
      "script": {
        "source": "ctx._source.friends.size()"
      }
    }
  }
}
Update:
POST mytest/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": "ctx._source.arrayLength = ctx._source.friends.size()"
  }
}
You can update all of your documents with the query above and then adjust your search to filter on the new field.
For everyone wondering about the same issue, I think @Kaveh's answer is the most likely way to go, but I did not manage to make it work in my case. It seems to me that the source is created after the query is performed, and therefore you cannot access the source for the purposes of filtering a query.
This leaves you with two options:
filter the results at the application level (an ugly and slow solution)
actually save the field length in a separate field, such as friends_count
Possibly there is another option I don't know about(?).
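The second option, precomputing the count client-side before indexing, can be sketched like this (a minimal sketch; with_friends_count and the friends_count field name are illustrative):

```python
# Enrich each document with a friends_count field before indexing,
# so a plain range query can filter on it later without scripting.
def with_friends_count(doc):
    enriched = dict(doc)
    enriched["friends_count"] = len(doc.get("friends", []))
    return enriched

doc = {"name": "Max", "friends": ["John", "John"]}
print(with_friends_count(doc))
# {'name': 'Max', 'friends': ['John', 'John'], 'friends_count': 2}
```

Unlike the doc-values script, this counts raw array entries, so Max (two friends, both named John) is correctly matched by a friends_count > 1 filter.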

ELASTICSEARCH - Include date automatically without a predefined date field

Is it possible to include a "date and time" field in a document that Elasticsearch receives, without it being previously defined?
The date and time should correspond to the moment the JSON is received by Elasticsearch.
This is the mapping:
{
  "mappings": {
    "properties": {
      "entries": {
        "type": "nested"
      }
    }
  }
}
Can something be defined in the mapping so that Elasticsearch includes the current date automatically?
What you can do is define an ingest pipeline that automatically adds a date field when your documents are indexed.
First, create a pipeline like this (_ingest.timestamp is a built-in field that you can access):
PUT _ingest/pipeline/add-current-time
{
  "description" : "automatically add the current time to the documents",
  "processors" : [
    {
      "set" : {
        "field": "@timestamp",
        "value": "_ingest.timestamp"
      }
    }
  ]
}
Then when you index a new document, you need to reference the pipeline, like this:
PUT test-index/_doc/1?pipeline=add-current-time
{
  "my_field": "test"
}
After indexing, the document would look like this:
GET test-index/_doc/1
=>
{
  "@timestamp": "2020-08-12T15:48:00.000Z",
  "my_field": "test"
}
UPDATE:
Since you're using index templates, it's even easier because you can define a default pipeline to be run for each indexed document.
In your index template, you need to add this to the index settings:
{
  "order": 1,
  "index_patterns": [
    "attom"
  ],
  "aliases": {},
  "settings": {
    "index": {
      "number_of_shards": "5",
      "number_of_replicas": "1",
      "default_pipeline": "add-current-time" <--- add this
    }
  },
  ...
Then you can keep indexing documents without referencing the pipeline, it will be automatic.
Note that the value needs to be templated for the timestamp to be resolved:
"value": "{{{_ingest.timestamp}}}"
Source

Fields do not appear in Kibana

The fields are not displayed in the Visualize section, neither under Terms nor Significant Terms, but they are available on the Discover page. What should I change to make the fields available in Visualize?
I have tried both keyword and text as the type.
Mapping:
PUT /vcmts-new
{
  "mappings": {
    "build": {
      "properties": {
        "rel_ver": {
          "type": "text"
        },
        "build_time": {
          "type": "date"
        },
        "rel_status": {
          "type": "text"
        },
        "rel_tags": {
          "type": "text"
        },
        "job_num": {
          "type": "integer"
        }
      }
    }
  }
}
Document Example:
POST /vcmts-new/build
{
  "rel_ver": "2.6.7.1-1-auto14",
  "build_time": "2018-05-21",
  "rel_status": "fail",
  "rel_tags": "PI09",
  "job_num": 22
}
Discover page
Missing fields in Visualize page
PS: The data is added manually; there is no Logstash or other forwarder configured.
Using keyword instead of text as the type helped make those fields aggregatable. One also needs to either refresh the index patterns under Management or restart Kibana to reflect the changes; simply refreshing the Visualize tab won't help.

Mapping in elasticsearch

Good morning. In my code I can't search for data containing separate words; if I search on one word, all is good. I think the problem is in the mapping. I use Postman. When I put http://192.168.1.153:9200/sport_scouts/video/_mapping in the URL and use the GET method, I get:
{
  "sport_scouts": {
    "mappings": {
      "video": {
        "properties": {
          "hashtag": {
            "type": "string"
          },
          "id": {
            "type": "long"
          },
          "sharing_link": {
            "type": "string"
          },
          "source": {
            "type": "string"
          },
          "title": {
            "type": "string"
          },
          "type": {
            "type": "string"
          },
          "user_id": {
            "type": "long"
          },
          "video_preview": {
            "type": "string"
          }
        }
      }
    }
  }
}
The title field has type string, which looks fine, but if I search on two or more words I get an empty array. My code in the trait:
public function search($data) {
    $this->client();
    $params['body']['query']['filtered']['filter']['or'][]['term']['title'] = $data;
    $search = $this->client->search($params)['hits']['hits'];
    dump($search);
}
Then I call it in my Controller. Can you help me with this problem?
The reason your indexed data can't be found is a mismatch between the analysis applied during indexing and the strict term filter used when querying the data.
With your mapping configuration, you are using the default analysis, which (besides many other operations) tokenizes the input. So every multi-word value you insert is split at punctuation and whitespace. If you insert, for example, "some great sentence", Elasticsearch maps the terms "some", "great", and "sentence" to your document, but not the term "great sentence". So if you do a term filter on "great sentence", or on any other part of the original value containing whitespace, you will not get any results.
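The mismatch can be sketched in plain Python (a rough approximation of the standard analyzer; the analyze function here is illustrative, not Elasticsearch's actual implementation):

```python
import re

# Rough approximation of the standard analyzer: lowercase the input
# and split on anything that is not a word character.
def analyze(text):
    return re.findall(r"\w+", text.lower())

indexed_terms = analyze("some great sentence")
print(indexed_terms)  # ['some', 'great', 'sentence']

# A term filter compares the raw query string against single tokens,
# so a multi-word term never matches:
print("great sentence" in indexed_terms)  # False
print("great" in indexed_terms)           # True
```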
Please see the Elasticsearch docs on how to configure your mapping for indexing without analyzing (https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2), or consider doing a match query instead of a term filter on the existing mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html).
Be aware that if you switch to not_analyzed you will be disabling much of the great fuzzy full-text query functionality. Of course, you can set up a mapping that does both, analyzed and not_analyzed, in different fields. Then it's up to you to decide which field to query on.
