Create field based on an existing field in Elasticsearch - elasticsearch

I have an Elasticsearch index that stores products and their properties, such as size, color, and material, as a dynamic field:
"raw_properties" : {
"dynamic" : "true",
"properties" : {
"Color" : {
"type" : "text",
"fields" : {
"keyword" : { "type" : "keyword", "ignore_above" : 256 }
}
},
"Size" : {
"type" : "text",
"fields" : {
"keyword" : { "type" : "keyword", "ignore_above" : 256 }
}
}
}
}
An indexed document looks like this:
{
"_index" : "development-products",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"raw_properties" : {
"Size" : ["XS", "S", "XL"],
"Color" : ["blue", "orange"]
}
}
}
The problem is that the values in raw_properties come from various sources, and they differ a lot. For example, the field Color is called Colour in another source, blue might appear as light-blue, and so on.
So I've implemented a normalization step in my app that does a simple mapping like this (for simplicity, the mapping here is just a Ruby Hash; in reality it is read from a database):
PROPERTY_MAPPING = {
"Colour_blue" => ["Color", "blue"],
"Color_light-blue" => ["Color", "blue"],
"Size_46" => ["Size", "S"]
}
When my app indexes a product, it looks into this property mapping and normalizes the property. This keeps the cardinality of the fields low, and the user isn't presented with too many properties to filter by.
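The normalization step described above could be sketched roughly as follows. The method name `normalize_properties` is hypothetical, and the choice to keep unmapped name/value pairs unchanged is an assumption; the real app may handle them differently.

```ruby
# The mapping from the question: "<raw name>_<raw value>" => [normalized name, normalized value].
PROPERTY_MAPPING = {
  "Colour_blue"      => ["Color", "blue"],
  "Color_light-blue" => ["Color", "blue"],
  "Size_46"          => ["Size", "S"]
}.freeze

# Hypothetical helper: returns a new hash with normalized property names and values.
def normalize_properties(raw_properties, mapping = PROPERTY_MAPPING)
  normalized = {}
  raw_properties.each do |name, values|
    Array(values).each do |value|
      # Assumption: pairs without a mapping entry are passed through unchanged.
      target_name, target_value = mapping.fetch("#{name}_#{value}", [name, value])
      list = (normalized[target_name] ||= [])
      # Avoid duplicates when several raw values collapse into the same normalized value.
      list << target_value unless list.include?(target_value)
    end
  end
  normalized
end
```

Because several raw pairs collapse into one normalized pair, changing a single mapping entry can affect many products, which is where the reindexing cost comes from.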
The problem: updating those mappings is pretty slow, because I have to reindex the affected products by applying the new mapping in my app and sending the data to Elasticsearch. I'm dealing with about 3 million products here, and new data with a new normalization comes in every day. I try to find only the products that are affected, but it is still too slow.
So I was wondering whether there is a way to do the normalization inside Elasticsearch. I've read about enriching data (https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-enriching-data.html) and pipelines with processors (https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest-processors.html), and I had a look at Painless.
The main idea would be to only update the mapping, run an update_by_query, and let Elasticsearch take care of the rest.
I'm not sure whether this is possible at all or where I should start. Any advice or hint is appreciated!

Related

What does "mappings" do in Elasticsearch?

I just started learning Elasticsearch. I am trying out creating an index, adding data, deleting data, and searching data.
I also understand the settings of Elasticsearch.
When using "PUT" to apply settings:
{
"settings": {
"index.number_of_shards" : 1,
"index.number_of_replicas" : 0
}
}
When using "GET" to retrieve the settings information:
{
"dsm" : {
"settings" : {
"index" : {
"creation_date" : "1555487684262",
"number_of_shards" : "1",
"number_of_replicas" : "0",
"uuid" : "qsSr69OdTuugP2DUwrMh4g",
"version" : {
"created" : "7000099"
},
"provided_name" : "dsm"
}
}
}
}
However, what does "mappings" do in Elasticsearch?
{
"kibana_sample_data_flights" : {
"aliases" : { },
"mappings" : {
"properties" : {
"AvgTicketPrice" : {
"type" : "float"
},
"Cancelled" : {
"type" : "boolean"
},
"Carrier" : {
"type" : "keyword"
},
"Dest" : {
"type" : "keyword"
},
"DestAirportID" : {
"type" : "keyword"
},
"DestCityName" : {
}, // just part of data
The mapping document is a way of describing the structure of your data and defining the field types, e.g. boolean, text, keyword. These types are important because they determine how your fields are indexed and analysed.
Elasticsearch supports dynamic mapping, so it effectively makes an automatic best guess at the appropriate types, but you may wish to override these.
I found this to be a useful article to explain the mapping process:
https://www.elastic.co/blog/found-elasticsearch-mapping-introduction
Indexing is determined by the field type. For example, where the type is 'keyword' the search engine expects an exact match; where the type is 'text' the search engine tries to determine how well the document matches the query term, and in doing so performs a 'full text search'.
So for example:
- A search for jump should also match jumped, jumps, jumping, and perhaps even leap.
This is a great article describing exact vs full text search and is where I took the jump example: https://www.elastic.co/guide/en/elasticsearch/guide/current/_exact_values_versus_full_text.html
Much of the power of Elasticsearch is in the mapping and analysis.
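To make the keyword versus text distinction concrete, here is a toy model in Ruby. It is not Elasticsearch's real analysis chain: the `analyze` method only lowercases and splits on non-alphanumeric characters, and it does no stemming, so it would not match jump against jumped. All method names are hypothetical.

```ruby
# Toy stand-in for an analyzer: lowercase the text and split it into tokens.
def analyze(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# A 'keyword' field is compared as one exact, unanalysed value.
def keyword_match?(field_value, query)
  field_value == query
end

# A 'text' field is analysed; here a document matches if it shares
# at least one token with the analysed query.
def text_match?(field_value, query)
  (analyze(query) & analyze(field_value)).any?
end

puts keyword_match?("Hello world", "hello")
puts text_match?("Hello world", "hello")
```

The same stored value behaves differently depending only on the field type, which is why the mapping matters before any data is indexed.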
It's the mapping of the index. This means it describes the data that is stored in this index. Take a deeper look here.

How to implement fuzzy field-centric (cross_fields) query on fields with multiple analysers?

Mapping:
{
"articles" : {
"mappings" : {
"data" : {
"properties" : {
"author" : {
"type" : "text",
"analyzer" : "standard"
},
"content" : {
"type" : "text",
"analyzer" : "english"
},
"tags" : {
"type" : "keyword"
},
"title" : {
"type" : "text",
"analyzer" : "english"
}
}
}
}
}
}
Example data:
{
"author": "John Smith",
"title": "Hello world",
"content": "This is some example article",
"tags": ["programming", "life"]
}
So as you can see, I have a mapping with different analysers on different fields. Now I want to search across those fields in the following way:
- only documents matching all search keywords are returned (like multi_match with cross_fields as the type and and as the operator)
- the query should be fuzzy so that it can tolerate some typos
- different fields should have different boost values (e.g. title is more important than content)
For example, the following query should match the document above:
programing worlds john examlpe
How can I do this? According to the documentation, fuzziness won't work with cross_fields, nor with fields that have different analysers.
One way of doing it would be to implement a custom _all field and copy all values there using copy_to, but with this approach I can't assign different weights or use different analysers.
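One possible direction, offered only as a sketch and not as a confirmed solution, is to split the search text into keywords in the application and require each keyword to match at least one field: a bool query with one fuzzy multi_match clause per keyword in must. The method name and the boost values below are illustrative assumptions, and whether the resulting scoring is acceptable would need to be checked against real data.

```ruby
require "json"

# Assumed boosts for illustration only; title weighted above content as the question asks.
BOOSTED_FIELDS = ["title^3", "author^2", "tags^2", "content"].freeze

# Hypothetical helper: builds one fuzzy multi_match clause per keyword.
# Every clause sits in "must", so every keyword has to match some field.
def fuzzy_all_terms_query(search_text, fields = BOOSTED_FIELDS)
  clauses = search_text.split.map do |term|
    {
      "multi_match" => {
        "query"     => term,
        "fields"    => fields,
        "fuzziness" => "AUTO"
      }
    }
  end
  { "query" => { "bool" => { "must" => clauses } } }
end

puts JSON.pretty_generate(fuzzy_all_terms_query("programing worlds john examlpe"))
```

Each clause uses the default multi_match type rather than cross_fields, so every field keeps its own analyser and fuzziness remains available.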

Elasticsearch slow results with IN query and Scoring

I have text document data (approximately 500k documents) saved in Elasticsearch, where the document text is mapped to its corresponding document number.
I am trying to fetch results in batches for "Sample Text" within a particular set of document numbers (approximately 300k) with scoring, and I am facing extreme slowness in the results.
Here is the mapping:
PUT my_index
{
"mappings" : {
"doc_repo" : {
"properties" : {
"doc_number" : {
"type" : "integer"
},
"document" : {
"type" : "string",
"term_vector" : "with_positions_offsets_payloads"
}
}
}
}
}
Here is the request query:
{
"query" : {
"bool" : {
"must" : [
{
"terms" : {
"document" : [
"sample text"
]
}
},
{
"terms" : {
"doc_number" : [1,2,3....,300K] //ArrayOf_300K_DocNumbers
}
}
]
}
},
"fields" : [
"doc_number"
],
"size" : 500,
"from" : 0
}
I tried fetching results in two other ways:
- Results without scoring within a particular set of document numbers (I used filtering for this)
- Results with scoring but without any particular set of document numbers (in batches)
Both of these were pretty quick, but the problem comes when I try to achieve both.
Do I need to change the mapping, the search query, or something else to achieve this?
Thanks in advance.
The issue was specific to Elasticsearch 2.x; upgrading Elasticsearch solved it.
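Separately from the version issue, the combination the question asks for (scoring on the text, restriction to a set of document numbers) can be expressed by keeping the scored clause in must and moving the document-number restriction into the filter clause of the bool query, since filter clauses do not contribute to the score. The sketch below only builds the request body; the method name is hypothetical, the text clause is copied from the question unchanged, the field selection from the original request is left out, and it assumes an Elasticsearch version whose bool query supports filter.

```ruby
require "json"

# Hypothetical helper: scored text clause in "must", unscored restriction in "filter".
def scored_search_in_documents(text_terms, doc_numbers, size: 500, from: 0)
  {
    "query" => {
      "bool" => {
        # Contributes to the relevance score.
        "must"   => [{ "terms" => { "document" => text_terms } }],
        # Only narrows the result set; not scored.
        "filter" => [{ "terms" => { "doc_number" => doc_numbers } }]
      }
    },
    "size" => size,
    "from" => from
  }
end

# Small illustrative ID list; the question uses roughly 300k numbers.
puts JSON.pretty_generate(scored_search_in_documents(["sample text"], [1, 2, 3]))
```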

In elasticsearch, how important is it to fully define a mapping during mapping-creation?

I am creating a mapping like this
"institution" : {
"properties" : {
"InstitutionCode" : {
"type" : "string",
"store" : "yes"
},
"InstitutionID" : {
"type" : "integer",
"store" : "yes"
},
"Name" : {
"type" : "string",
"store" : "yes"
}
}
}
However, when I perform actual indexing operations for institutions, I am adding an Alias property (0 or more aliases per institution)
"institution" : {
"properties" : {
"Aliases" : {
"dynamic" : "true",
"properties" : {
"InstitutionAlias" : {
"type" : "string"
},
"InstitutionAliasTypeID" : {
"type" : "long"
}
}
},
"InstitutionCode" : {
"type" : "string",
"store" : "yes"
},
"InstitutionID" : {
"type" : "integer",
"store" : "yes"
},
"Name" : {
"type" : "string",
"store" : "yes"
}
}
}
This is a simplified example, as I am adding more fields than just Aliases during the actual indexing of records.
How important is it to fully define a mapping during mapping-creation?
Am I going to suffer any penalties by having the mapping automatically adjusted during indexing operations due to the indexing of institution records with additional properties? I expect institutions to gain additional properties over time and I wonder if I need to maintain the mapping-creation code in addition to the institution-indexing code.
I believe the overhead of dynamic mapping is fairly negligible; using it won't hurt indexing speed. However, you can run into some unexpected situations where Elasticsearch auto-detects a field type incorrectly.
A common example is detecting an integer because the first example of a field is a number ("25"), when in reality the rest of the data for that field is a string. Or seeing an integer when the rest of the data is actually a float.
If your data is well standardized, that isn't much of a problem.
Alternatively, you can use dynamic templates to apply mappings to new fields based on a field-name pattern (simple wildcards by default, optionally a regex).
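The mis-detection problem described above can be illustrated with a toy model in Ruby. It is not Elasticsearch's actual detection logic: `guess_type` and `first_conflict` are hypothetical names, and the type labels are simplified. The point is only that the first value seen fixes the type, and later values may not fit it. What real Elasticsearch does with a non-fitting value (rejecting the document or coercing the value) depends on the field type and settings.

```ruby
# Simplified guess of a mapping type from a single JSON-like value.
def guess_type(value)
  case value
  when true, false then "boolean"
  when Integer     then "long"
  when Float       then "float"
  when String      then "text"
  else "object"
  end
end

# The first value fixes the type; returns the first later value that
# would be guessed differently, or nil when all values agree.
def first_conflict(field_values)
  mapped_type = nil
  field_values.each_with_index do |value, index|
    guessed = guess_type(value)
    mapped_type ||= guessed
    return { index: index, mapped: mapped_type, got: guessed } if guessed != mapped_type
  end
  nil
end

p first_conflict([25, 30, "n/a"])
p first_conflict([1, 2, 3])
```

Defining the field explicitly in the mapping, or through a dynamic template, removes the dependence on whichever document happens to be indexed first.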

Storing only selected fields and not storing _all in pyes/elasticsearch

I am trying to use pyes with Elasticsearch as a full-text search engine. I store only UUIDs and indexes of string fields; the actual data is stored in MongoDB and retrieved using UUIDs. Unfortunately, I am unable to create a mapping that doesn't store the original data. I've tried various combinations of "store"/"source" fields and disabling "_all", but I can still get the text of indexed fields. It seems that the documentation is misleading on this topic, as it's just a copy of the original docs.
Can anyone please provide an example of a mapping that would only store some fields and not the original document JSON?
Sure, you could use something like this (with two fields, 'uuid' and 'data'):
{
"mytype" : {
"_source" : {
"enabled" : false
},
"_all" : {
"enabled" : false
},
"properties" : {
"data" : {
"store" : "no",
"type" : "string"
},
"uuid" : {
"store" : "yes",
"type" : "string",
"index" : "not_analyzed"
}
}
}
}
