How to denormalize hierarchy in ElasticSearch?

I am new to ElasticSearch and I have a tree which describes a path to a certain document (not real filesystem paths, just simple text fields categorizing articles, images, and documents alike). Each path entry has a type, like: Group Name, Assembly Name, or even Unknown. The types could be used in queries, for example to skip certain entries in the path.
My source data is stored in SQL Server; the schema looks something like this:
The tree is built by connecting Tree.Id to Tree.ParentId, and each node must have a type. The Documents are connected to a leaf of the Tree.
I am not worried about querying the structure in SQL Server, but I need to find a good approach to denormalize and search it in Elastic. If I flatten the paths and make a list of "descriptors" for a document, I can store each of the Document entries as an Elastic document:
{
    "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
    "descriptors": [
        {
            "name": "NodeNameRoot",
            "type": "type1"
        },
        {
            "name": "NodeNameLevel_1",
            "type": "type1"
        },
        {
            "name": "NodeNameLevel_2",
            "type": "type2"
        },
        {
            "name": "NodeNameLevel_3",
            "type": "type2"
        },
        {
            "name": "NodeNameLevel_4",
            "type": "type3"
        }
    ],
    "document": {
        ...
    }
}
Can I query such a structure in ElasticSearch, or should I denormalize the paths in a different way?
My main questions:
Can I query them based on type or text value (regex matching, for example)? For example: give me all the type2->type3 paths (practically leaving type1 out) where the path contains X?
Is it possible to query based on levels? For example, I would like the paths where there are 4 descriptors.
Can I do the searching with the built-in functionality or do I need to write an extension?
Edit
Based on G Quintana's answer, I made an index like this:
curl -X PUT \
  http://localhost:9200/test \
  -H 'cache-control: no-cache' \
  -H 'content-type: application/json' \
  -d '{
    "mappings": {
        "path": {
            "properties": {
                "names": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "keyword"
                        },
                        "tokens": {
                            "type": "text",
                            "analyzer": "pathname_analyzer"
                        },
                        "depth": {
                            "type": "token_count",
                            "analyzer": "pathname_analyzer"
                        }
                    }
                },
                "types": {
                    "type": "text",
                    "fields": {
                        "raw": {
                            "type": "keyword"
                        },
                        "tokens": {
                            "type": "text",
                            "analyzer": "pathname_analyzer"
                        }
                    }
                }
            }
        }
    },
    "settings": {
        "analysis": {
            "analyzer": {
                "pathname_analyzer": {
                    "type": "pattern",
                    "pattern": "#->>",
                    "lowercase": true
                }
            }
        }
    }
}'
And I could query the depth like this:
curl -X POST \
  http://localhost:9200/test/path/_search \
  -H 'content-type: application/json' \
  -d '{
    "query": {
        "bool": {
            "should": [
                { "match": { "names.depth": 5 } }
            ]
        }
    }
}'
This returns correct results. I will test it a little more.

First of all you should identify all your query patterns to design how you will index your data.
From the example you gave, I would index documents of the form:
{
    "path": "NodeNameRoot/NodeNameLevel_1/NodeNameLevel_2/NodeNameLevel_3/NodeNameLevel_4",
    "types": "type1/type1/type2/type2/type3",
    "document": {
        ...
    }
}
Before indexing, you must configure mapping and analysis:
Field path:
use type text + analyzer based on pattern analyzer to split at / characters
use type token_count + same analyzer to compute path depth. Create a multi field (path.depth)
Field types:
use type text + analyzer based on the pattern analyzer to split at / characters
Then, to answer your questions (a combined example follows below):
"Give me all the type2->type3 paths": use a match_phrase query on the types field
"where the path contains X": use a match query on the path field
"where there are 4 descriptors": use a term query on the path.depth sub-field
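A sketch of such a combined query, assuming an index named documents with the path, path.depth (token_count) and types fields mapped as described above; "X" stands for whatever text you are searching for, and each clause can also be used on its own:
curl -X POST http://localhost:9200/documents/_search \
  -H 'content-type: application/json' \
  -d '{
    "query": {
        "bool": {
            "must": [
                { "match_phrase": { "types": "type2/type3" } },
                { "match": { "path": "X" } },
                { "term": { "path.depth": 4 } }
            ]
        }
    }
}'
Because the types field is analyzed with the pattern analyzer splitting on /, the phrase "type2/type3" is turned into the consecutive tokens [type2, type3] and therefore matches type sequences regardless of what came before them.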
Your descriptors field is not needed.
The path hierarchy tokenizer might be interesting for some use cases.
You can apply multiple analyzers to the same field using multi-fields and then query the sub-fields.

Related

Elasticsearch - searching for punctuation terms over both text and keyword fields

Using elasticsearch 7, I'm trying to use a simple query string query for searches over different fields, both text and keyword. Here's a minimal, reproducible example to show the initial setup and problem:
mapping.json:
{
    "dynamic": false,
    "properties": {
        "publicId": {
            "type": "keyword"
        },
        "eventDate": {
            "type": "date",
            "format": "yyyy-MM-dd",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "name": {
            "type": "text"
        }
    }
}
test-data1.json:
{
    "publicId": "a1b2c3",
    "eventDate": "2022-06-10",
    "name": "Research & Development"
}
test-data2.json:
{
    "publicId": "d4e5f6",
    "eventDate": "2021-05-11",
    "name": "F.inance"
}
Create index on ES running on localhost:19200:
#!/bin/bash -e
host=${1-localhost:19200}
dir=$( dirname `readlink -f $0` )
mapping=$(<${dir}/mapping.json);
param="{ \"mappings\": $mapping}"
curl -XPUT "http://${host}/test/" -H 'Content-Type: application/json' -d "$param"
curl -XPOST "http://${host}/test/_doc/a1b2c3" -H 'Content-Type: application/json' -d #${dir}/test-data1.json
curl -XPOST "http://${host}/test/_doc/d4e5f6" -H 'Content-Type: application/json' -d #${dir}/test-data2.json
Now the task is to support searches like "Research & Development", "Research & Development 2022-06-10", "Finance" (note the removed dot) or simply "a1b2c3". For example, using a query like this:
{
    "from": 0,
    "size": 20,
    "query": {
        "bool": {
            "must": [
                {
                    "simple_query_string": {
                        "query": "Research & Development 2022-06-10",
                        "fields": [
                            "publicId^1.0",
                            "eventDate.keyword^1.0",
                            "name^1.0"
                        ],
                        "flags": -1,
                        "default_operator": "and",
                        "analyze_wildcard": false,
                        "auto_generate_synonyms_phrase_query": true,
                        "fuzzy_prefix_length": 0,
                        "fuzzy_max_expansions": 50,
                        "fuzzy_transpositions": true,
                        "boost": 1.0
                    }
                }
            ],
            "adjust_pure_negative": true,
            "boost": 1.0
        }
    },
    "version": true
}
The problem with this setup is that the standard analyzer on the name text field removes most punctuation, and so also drops the ampersand. The simple query string query splits the query into three tokens [research, &, development] and searches over all fields using the and operator. There are two matches ("Research" and "Development") on the name text field, but no match for the ampersand in any field, so the result is empty.
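You can see this tokenization directly with the _analyze API (a quick check against the setup above):
curl -XPOST "http://localhost:19200/_analyze" -H 'Content-Type: application/json' -d '{
    "analyzer": "standard",
    "text": "Research & Development"
}'
# -> tokens: research, development   (the & is dropped)
curl -XPOST "http://localhost:19200/_analyze" -H 'Content-Type: application/json' -d '{
    "analyzer": "whitespace",
    "text": "Research & Development"
}'
# -> tokens: Research, &, Development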
Now I came up with a solution to add a second field for name with a different analyzer, the whitespace analyzer, that doesn't remove punctuation:
{
    "dynamic": false,
    "properties": {
        "publicId": {
            "type": "keyword"
        },
        "eventDate": {
            "type": "date",
            "format": "yyyy-MM-dd",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "name": {
            "type": "text",
            "fields": {
                "whitespace": {
                    "type": "text",
                    "analyzer": "whitespace"
                }
            }
        }
    }
}
This way all searches work, including "Finance", which matches "F.inance" in the name field. "Research & Development" matches both the name field and name.whitespace, but most crucially the & matches name.whitespace and a result is therefore returned.
My question now is: given the fact that the real setup includes many more fields and a lot of data, adding an additional field and therefore indexing most terms in the same way twice seems quite heavy. Is there a way to only index analyzed terms to name.whitespace that differ from the standard analyzer's terms of name, i.e. that are not in the "parent" field? E.g. "Research & Development" results in the terms [research, development] for name and [research, development, &] for name.whitespace - ideally it would only index [&] for name.whitespace.
Or is there a more elegant/performant solution for this particular problem altogether?
You could define a dynamic template for string fields that applies the whitespace analyzer, since your use case calls for searching on non-standard tokens. In addition, you can explicitly map those fields where you don't need the whitespace analyzer.
This ensures that already mapped fields are analyzed with the standard analyzer while the others (dynamic or unmapped fields) are analyzed with whitespace, reducing the complexity and field duplication.
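A sketch of what such a mapping could look like, using a dynamic template (the template name strings_as_whitespace is arbitrary; dynamic mapping has to be enabled, and name is left unmapped here so the template applies to it):
{
    "dynamic": true,
    "dynamic_templates": [
        {
            "strings_as_whitespace": {
                "match_mapping_type": "string",
                "mapping": {
                    "type": "text",
                    "analyzer": "whitespace"
                }
            }
        }
    ],
    "properties": {
        "publicId": {
            "type": "keyword"
        },
        "eventDate": {
            "type": "date",
            "format": "yyyy-MM-dd",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        }
    }
}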

elasticsearch aggregations separated words

I simply ran an aggregation in a browser plugin (Marvel). As you can see in the picture below, only one doc matches the query, but the aggregation is split by spaces, which doesn't make sense; I want to aggregate per document. In this scenario there should be only one group with count 1 and key "Drow Ranger".
What is the correct way to do this in Elasticsearch?
It's probably because your heroname field is analyzed and thus "Drow Ranger" gets tokenized and indexed as "drow" and "ranger".
One way to get around this is to transform your heroname field to a multi-field with an analyzed part (the one you search on with the wildcard query) and another not_analyzed part (the one you can aggregate on).
You should create your index like this, specifying the proper mapping for your heroname field:
curl -XPUT localhost:9200/dota2 -d '{
    "mappings": {
        "agust": {
            "properties": {
                "heroname": {
                    "type": "string",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                },
                ... your other fields go here
            }
        }
    }
}'
Then you can run your aggregation on the heroname.raw field instead of the heroname field.
UPDATE
If you just want to try it on the heroname field, you can simply modify that field instead of recreating the whole index. Running the following command will add the new heroname.raw sub-field to your existing heroname field. Note that you still have to reindex your data, though:
curl -XPUT localhost:9200/dota2/_mapping/agust -d '{
    "properties": {
        "heroname": {
            "type": "string",
            "fields": {
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}'
Then you can keep using heroname in your wildcard query, but your aggregation will look like this:
{
    "aggs": {
        "asd": {
            "terms": {
                "field": "heroname.raw",   <--- use the raw field here
                "size": 0
            }
        }
    }
}
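Putting both together, a single request that keeps the wildcard query on heroname and aggregates on heroname.raw might look like this (the wildcard pattern is just an example):
curl -XPOST localhost:9200/dota2/agust/_search -d '{
    "query": {
        "wildcard": {
            "heroname": "*ranger*"
        }
    },
    "aggs": {
        "asd": {
            "terms": {
                "field": "heroname.raw",
                "size": 0
            }
        }
    }
}'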

How to sort fields when the numbers are entered as strings, in elasticsearch?

This is a sample of the document I have indexed using elasticsearch.
{
    "itemName": "refridgirator",
    "brand": "samsung",
    "price": "1600",
    "emi": "40",
    "location": "Brisbane"
}
As you can see from the above document, all the numerical fields such as "price" and "emi" are entered as strings. Since I need to sort my indexed documents by these values, I'm not able to do that. How can I get this done?
In the mapping, you need to specify this as long or int.
That way, the number will be stored as a long in the inverted index and the field data cache, and an actual numeric sort will work:
curl -X PUT "http://$hostname:9200/your_index/your_type/_mapping" -d '{
"yout_type": {
"properties": {
"price": {
"type": "long"
}
}
}
}'
You can find more info here.
An alternate method based on parsing string values has been discussed in detail in this blog.
In your mapping, you need to declare the price and emi fields as a numeric type (double, float, integer, etc.) like this:
// remove the current index
curl -XDELETE localhost:9200/your_index

// create the index anew
curl -XPUT localhost:9200/your_index -d '{
    "mappings": {
        "your_type": {
            "properties": {
                "itemName": {
                    "type": "string"
                },
                "brand": {
                    "type": "string"
                },
                "price": {
                    "type": "double"
                },
                "emi": {
                    "type": "integer"
                },
                "location": {
                    "type": "string"
                }
            }
        }
    }
}'
And then when you re-index your documents, make sure the price and emi fields in your JSON are numbers as well and not strings (i.e. without the double quotes):
curl -XPOST localhost:9200/your_index/your_type/123 -d '{
    "itemName": "refridgirator",
    "brand": "samsung",
    "price": 1600,
    "emi": 40,
    "location": "Brisbane"
}'

Elasticsearch: How to forbid the analyzing for a certain data type (e.g. string)

We are gathering data (in JSON format) from user input. I want to know whether there is any way to forbid the analysis process for a certain data type (e.g. string), so that if we detect that the value of some field is a string, we do not analyze it and just leave it as a whole.
thanks!
You can do this using an index template:
curl -XPUT "http://localhost:9200/_template/not_analyzed_template" -d'
{
"template": "*",
"mappings": {
"_default_": {
"dynamic_templates": [
{
"template_1": {
"mapping": {
"index": "not_analyzed",
"type": "string"
},
"match_mapping_type": "string",
"match": "*"
}
}
]
}
}
}'
Just run the above request.
After that, add data to ES; strings will not be analyzed from now on.
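To verify the template took effect, you can index a document into any new index and look at the mapping Elasticsearch generated (index, type, and field names here are just placeholders):
curl -XPOST "http://localhost:9200/some_index/some_type/1" -d '{ "title": "hello world" }'
curl -XGET "http://localhost:9200/some_index/_mapping?pretty"
The title field should show up as a string with "index": "not_analyzed".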
Hope this helps!!

Why Elasticsearch "not_analyzed" field is split into terms?

I have the following field in my mapping definition:
...
"my_field": {
"type": "string",
"index":"not_analyzed"
}
...
When I index a document with value of my_field = 'test-some-another' that value is split into 3 terms: test, some, another.
What am I doing wrong?
I created the following index:
curl -XPUT localhost:9200/my_index -d '{
    "index": {
        "settings": {
            "number_of_shards": 5,
            "number_of_replicas": 2
        },
        "mappings": {
            "my_type": {
                "_all": {
                    "enabled": false
                },
                "_source": {
                    "compressed": true
                },
                "properties": {
                    "my_field": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}'
Then I index the following document:
curl -XPOST localhost:9200/my_index/my_type -d '{
    "my_field": "test-some-another"
}'
Then I use the plugin https://github.com/jprante/elasticsearch-index-termlist with the following API:
curl -XGET localhost:9200/my_index/_termlist
That gives me the following response:
{"ok":true,"_shards":{"total":5,"successful":5,"failed":0},"terms": ["test","some","another"]}
Verify that mapping is actually getting set by running:
curl localhost:9200/my_index/_mapping?pretty=true
The command that creates the index seems to be incorrect. It shouldn't contain "index" : { as a root element. Try this:
curl -XPUT localhost:9200/my_index -d '{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 2
    },
    "mappings": {
        "my_type": {
            "_all": {
                "enabled": false
            },
            "_source": {
                "compressed": true
            },
            "properties": {
                "my_field": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }
    }
}'
In ElasticSearch a field is indexed when it goes into the inverted index, the data structure Lucene uses to provide its fast full-text search capabilities. If you want to search on a field, you have to index it. When you index a field you can decide whether to index it as it is, or to analyze it, which means choosing a tokenizer that will generate a list of tokens (words) and a list of token filters that can modify the generated tokens (even add or delete some). The way you index a field affects how you can search on it. If you index a field but don't analyze it, and its text is composed of multiple words, you'll be able to find that document only by searching for that exact specific text, whitespaces included.
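For example, with the not_analyzed my_field above, only the exact full value matches (a sketch using a term query):
curl -XPOST localhost:9200/my_index/my_type/_search -d '{
    "query": {
        "term": {
            "my_field": "test-some-another"
        }
    }
}'
Searching for just "test" this way would return nothing, because the whole string was indexed as a single term.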
You can have fields that you only want to search on, and never show: indexed and not stored (default in lucene).
You can have fields that you want to search on and also retrieve: indexed and stored.
You can have fields that you don't want to search on, but you do want to retrieve to show them.
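In old-style (pre-2.x) string mappings, those three cases roughly translate to the following options (field names are made up for illustration):
{
    "properties": {
        "searchable_only": { "type": "string", "index": "analyzed", "store": false },
        "search_and_show": { "type": "string", "index": "analyzed", "store": true },
        "display_only":    { "type": "string", "index": "no",       "store": true }
    }
}
In practice, fields are usually retrieved from _source, so explicitly storing them is rarely necessary.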
