Good practice for ElasticSearch index mapping - elasticsearch

Im new to Elasticsearch and I would like to know if there are any good practices for the use case I have.
I have heterogeneous data sent from an API that I save into a database (as a JSON) then save in Elasticsearch for search purposes. The data in sent in this format (because it's heterogeneous, the users can send any type of data, some metadata can be multivalued, other single values and the name of the key in the JSON may vary :)
{
"indices":{
"MultipleIndices":[
{
"index":"editors",
"values":[
"The Editing House",
"Volcan Editing"
]
},
{
"index":"colors",
"values":[
"Red",
"Blue"
]
}
],
"SimpleIndices":[
{
"index":"AuthorName",
"value": "George R. R. Martin"
},
{
"index":"NumberOfPages",
"value":"2898"
},
{
"index":"BookType",
"value":"Fantasy"
}
]
}
}
Once we receive this JSON, its formatted in the code and stored as a JSON in a database with this format :
{
"indices":{
"editors":[
"The Editing House",
"Volcan Editing"
],
"colors":[
"Red",
"Blue"
],
"AuthorName" : "George R. R. Martin"
"NumberOfPages" : "2898",
"BookType" : "Fantasy"
}
}
I then want to save this data into Elasticsearch, what's the best way I can map it ? Store it as a JSON in one field ? Will the search be efficilent if I do it this way ?

You must mapping each field individually.
You can take a look at field types to understand which type is ideal for your schema.
Another suggestion is to study the text analysis because it is responsible for the process of structuring the text to optimize the search.
My suggestion map:
PUT indices
{
"mappings": {
"properties": {
"editors": {
"type": "keyword"
},
"colors":{
"type": "keyword"
},
"author_name":{
"type": "text"
},
"number_pages":{
"type": "integer"
},
"book_type":{
"type": "keyword"
}
}
}
}

I think in your case, you don't have much choice apart from dynamic mapping, which Elasticsearch will generate for you as soon as first document is index in a particular index.
However, you can improve the process by using the dynamic template so that you can optimize your mapping, there is good examples of that in the official link I provided.

Related

elastic search 6 - use one or two type in one index?

How can I use, create two index or what?
I have one entity goods and one entity shop, should I create two index or two type in elastic search 6?
I have tried two mapping two type but it throw Exception.
How Can I do ?
In elacticsearch 6, you cannot create more than one doc type for an index. Earlier for an index company you could have doc type employee, infra, 'building' etc but now you it will throw an error.
In future versions doc type will be completely removed, so you will only have to deal with index.
An index in the elasticsearch is like table in normal database. And every document that you store will be row, and fields of that document will be columns.
Without seeing the data and knowing what you want to accomplish it is pretty hard to suggest how you should plan the schema of elasticsearch, but these information can help you decide.
you can use one of these two options:
1)Index per document type
2)Custom type field
for option 2:
PUT twitter
{
"mappings": {
"goods": {
"properties": {
"field1": { "type": "text" },
"field2": { "type": "keyword" },
}
},
"shop": {
"properties": {
"field1": { "type": "text" },
"field2": { "type": "date" }
}
}
}
}
see this

Elasticsearch custom mapping definition

I have to upload data to elk in the following format:
{
"location":{
"timestamp":1522751098000,
"resources":[
{
"resource":{
"name":"Node1"
},
"probability":0.1
},
{
"resource":{
"name":"Node2"
},
"probability":0.01
}]
}
}
I'm trying to define a mapping this kind of data and I produced he following mapping:
{
"mappings": {
"doc": {
"properties": {
"location": {
"properties" : {
"timestamp": {"type": "date"},
"resources": []
}
}
}
}
}
I have 2 questions:
how can I define the "resources" array in my mapping?
is it possible to define a custom type (e.g. resource) and use this type in my mapping (e.g "resources": [{type:resource}]) ?
There is a lot of things to know about the Elasticsearch mapping. I really highly suggest to read through at least some of their documentation.
Short answers first, in case you don't care:
Elasticsearch automatically allows storing one or multiple values of defined objects, there is no need to specify an array. See Marker 1 or refer to their documentation on array types.
I don't think there is. Since Elasticsearch 6 only 1 type per index is allowed. Nested objects is probably the closest, but you define them in the same file. Nested objects are stored in a separate index (internally).
Long answer and some thoughts
Take a look at the following mapping:
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"timestamp": {
"type": "date"
},
"resources": { [1]
"type": "nested", [2]
"properties": {
"resource": {
"properties": {
"name": { [3]
"type": "text"
}
}
},
"probability": {
"type": "float"
}
}
}
}
}
}
}
}
This is how your mapping could look like. It can be done differently, but I think it makes sense this way - maybe except marker 3. I'll come to these right now:
Marker 1: If you define a field, you usually give it a type. I defined resources as a nested type, but your timestamp is of type date. Elasticsearch automatically allows storing one or multiple values of these objects. timestamp could actually also contain an array of dates, there is no need to specify an array.
Marker 2: I defined resources as a nested type, but it could also be an object like resource a little below (where no type is given). Read about nested objects here. In the end I don't know what your queries would look like, so not sure if you really need the nested type.
Marker 3: I want to address two things here. First, I want to mention again that resource is defined as a normal object with property name. You could do that for resources as well.
Second thing is more a thought-provoking impulse: Don't take it too seriously if something absolutely doesn't fit your case. Just take it as an opinion.
This mapping structure looks very inspired by a relational database approach. I think you usually want to define document structures for elasticsearch more for the expected searches. Redundancy is not a problem, but nested objects can make your queries complicated. I think I would omit the whole resources part and do it something like this:
"mappings": {
"doc": {
"properties": {
"location": {
"properties": {
"timestamp": {
"type": "date"
},
"resource": {
"properties": {
"resourceName": {
"type": "text"
}
"resourceProbability": {
"type": "float"
}
}
}
}
}
}
}
}
Because as I said, in this case resource can contain an array of objects, each having a resourceName and a resourceProbability.

extract text from field arrays

One of the fields called "resources" has the following 2 inner documents.
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
I need to split the ARN field and get the last part of it. i.e. "reports_201706.schema" preferably using scripted field.
What I have tried:
1) I checked the fileds list and found only 2 entries resources.accountId and resources.type
2) I tried with date-time field and it worked correctly in the scripted filed option (expression).
doc['eventTime'].value
3) But the same does not work with other text fields for e.g.
doc['eventType'].value
Getting this error:
"caused_by":{"type":"script_exception","reason":"link error","script_stack":["doc['eventType'].value","^---- HERE"],"script":"doc['eventType'].value","lang":"expression","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [eventType] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."}}},"status":500}
It means I need to change the mapping. Is there any other way to extract text from nested arrays in an object?
Update:
Please visit sample kibana here...
https://search-accountact-phhofxr23bjev4uscghwda4y7m.us-east-1.es.amazonaws.com/_plugin/kibana/
search for "ebs_attach.png" and then check resources field. You will see 2 nested arrays like this...
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::datameetgeo/ebs_attach.png"
},
{
"accountId": "513469704633",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::datameetgeo"
}
I need to split ARN field and extract the last part that is again "ebs_attach.png"
If I can some-how display it as scripted field, then I can see the bucket name and the file name side-by-side on discovery tab.
Update 2
In other words, I am trying to extract the text shown in this image as a new field on discovery tab.
While you can use scripting for this, I highly encourage you to extract those kind of information at index time. I have provided two examples here, which are far from failsafe (you need to test with different path or with this field missing at all), but it should provide a base to start with
PUT foo/bar/1
{
"resources": [
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
]
}
# this is slow!!!
GET foo/_search
{
"script_fields": {
"document": {
"script": {
"inline": "return params._source.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
}
}
}
}
# Do this on index time, by adding a pipeline
PUT _ingest/pipeline/my-pipeline-id
{
"description" : "describe pipeline",
"processors" : [
{
"script" : {
"inline": "ctx.filename = ctx.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
}
}
]
}
# Store the document, specify the pipeline
PUT foo/bar/1?pipeline=my-pipeline-id
{
"resources": [
{
"type": "AWS::S3::Object",
"ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
"accountId": "934331768510612",
"type": "AWS::S3::Bucket",
"ARN": "arn:aws:s3:::sms_vild"
}
]
}
# lets check the filename field of the indexed document by getting it
GET foo/bar/1
# We can even search for this file now
GET foo/_search
{
"query": {
"match": {
"filename": "reports_201706.schema"
}
}
}
Note: Considered "resources" is kind of array
NSArray *array_ARN_Values = [resources valueForKey:#"ARN"];
Hope it will work for you!!!

Mapping in elasticsearch

Good morning, In my code I can't search data which contain separate words. If I search on one word all good. I think problem in mapping. I use postman. When I put in URL http://192.168.1.153:9200/sport_scouts/video/_mapping and use method GET I get:
{
"sport_scouts": {
"mappings": {
"video": {
"properties": {
"hashtag": {
"type": "string"
},
"id": {
"type": "long"
},
"sharing_link": {
"type": "string"
},
"source": {
"type": "string"
},
"title": {
"type": "string"
},
"type": {
"type": "string"
},
"user_id": {
"type": "long"
},
"video_preview": {
"type": "string"
}
}
}
}
}
}
All good title have type string but if I search on two or more words I get empty massive. My code in Trait:
public function search($data) {
$this->client();
$params['body']['query']['filtered']['filter']['or'][]['term']['title'] = $data;
$search = $this->client->search($params)['hits']['hits'];
dump($search);
}
Then I call it in my Controller. Can you help me with this problem?
The reason that your indexed data can't be found is caused by a mismatch of the analyzing during indexing and a strict term filter when querying the data.
With your mapping configuration, you are using the default analyzing which (besides many other operations) does a tokenizing. So every multi-word data you insert is split at punctuation or whitespaces. If you insert for example "some great sentence", elasticsearch maps the following terms to your document: "some", "great", "sentence", but not the term "great sentence". So if you do a term filter on "great sentence" or any other part of the original value containing a whitespace, you will not get any results.
Please see the elasticsearch docs on how to configure your mapping for indexing without analyzing (https://www.elastic.co/guide/en/elasticsearch/guide/current/mapping-intro.html#_index_2) or consider doing a match query instead of a term filter on the existing mapping (https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html).
Please be aware that if you switch to not_analyzed you will be disabling many of the great fuzzy fulltext query functionality. Of course you can set up a mapping that does both, analyzed and not_analyzed in different fields. Then it's up on you to decide on which field you want to query on.

Is it possible to sort nested documents in ElasticSearch?

Lets say I have the following mapping:
"site": {
"properties": {
"title": { "type": "string" },
"description": { "type": "string" },
"category": { "type": "string" },
"tags": { "type": "array" },
"point": { "type": "geo_point" }
"localities": {
type: 'nested',
properties: {
"title": { "type": "string" },
"description": { "type": "string" },
"point": { "type": "geo_point" }
}
}
}
}
I'm then doing an "_geo_distance" sort on the parent document and am able to sort the documents on "site.point". However I would also like the nested localities to be sorted by "_geo_distance", inside the parent document.
Is this possible? If so, how?
Unfortunately, no (at least not yet).
A query in ElasticSearch just identifies which documents match the query, and how well they match.
To understand what nested documents are useful for, consider this example:
{
"title": "My post",
"body": "Text in my body...",
"followers": [
{
"name": "Joe",
"status": "active"
},
{
"name": "Mary",
"status": "pending"
},
]
}
The above JSON, once indexed in ES, is functionally equivalent to the following. Note how the followers field has been flattened:
{
"title": "My post",
"body": "Text in my body...",
"followers.name": ["Joe","Mary"],
"followers.status": ["active","pending"]
}
A search for: followers with status == active and name == Mary would match this document... incorrectly.
Nested fields allow us to work around this limitation. If the followers field is declared to be of type nested instead of type object then its contents are created as a separate (invisible) sub-document internally. That means that we can use a nested query or nested filter to query these nested documents as individual docs.
However, the output from the nested query/filter clauses only tells us if the main doc matches, and how well it matches. It doesn't even tell us which of the nested docs matched. To figure that out, we'd have to write code in our application to check each of the nested docs against our search criteria.
There are a few open issues requesting the addition of these features, but it is not an easy problem to solve.
The only way to achieve what you want is to index your sub-docs as separate documents, and to query and sort them independently. It may be useful to establish a parent-child relationship between the main doc and these separate sub-docs. (see parent-type mapping, the Parent & Child section of the index api docs, and the top-children and has-child queries.
Also, an ES user has mailed the list about a new has_parent filter that they are currently working on in a fork. However, this is not available in the main ES repo yet.

Resources