Elasticsearch postings highlighter failing for some search strings

I have a search that works well with most search strings but fails spectacularly on others. Experimenting, it appears to fail whenever at least one word in the query doesn't match anything in the index (as with the made-up search phrase below), with the error:
{
"error": "SearchPhaseExecutionException[Failed to execute phase [query_fetch], all shards failed; shardFailures {[w3zfoix_Qi-xwpVGbCbQWw][ia_test][0]: ElasticsearchIllegalArgumentException[the field [content] should be indexed with positions and offsets in the postings list to be used with postings highlighter]}]",
"status": 400
}
The simplest search which gives this error is the one below:
POST /myindex/_search
{
"from" : 0,
"size" : 25,
"query": {
"filtered" : {
"query" : {
"multi_match" : {
"type" : "most_fields",
"fields": ["title", "content", "content.english"],
"query": "Box Fexye"
}
}
}
},
"highlight" : {
"fields" : {
"content" : {
"type" : "postings"
}
}
}
}
My query is more complicated than this, and I need to use the "postings" highlighter to pull out the best matching sentence from a document.
Indexing of the relevant fields looks like:
"properties" : {
"title" : {
"type" : "string",
"fields": {
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
}
},
"content" : {
"type" : "string",
"analyzer" : "standard",
"fields": {
"english": {
"type": "string",
"analyzer": "my_english"
},
"shingles": {
"type": "string",
"analyzer": "my_shingle_analyzer"
}
},
"index_options" : "offsets",
"term_vector" : "with_positions_offsets"
}
}
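One thing worth checking (a guess rather than a confirmed diagnosis): index_options is set only on the top-level content field here, while the multi_match also queries content.english, and index_options cannot be changed on a field that already holds indexed documents. A sketch of a mapping that enables offsets on the subfields as well, which would require creating a new index and reindexing:
"content" : {
  "type" : "string",
  "analyzer" : "standard",
  "index_options" : "offsets",
  "fields": {
    "english": {
      "type": "string",
      "analyzer": "my_english",
      "index_options": "offsets"
    },
    "shingles": {
      "type": "string",
      "analyzer": "my_shingle_analyzer",
      "index_options": "offsets"
    }
  }
}
Note that term_vector with_positions_offsets is only needed by the fast vector highlighter; the postings highlighter relies on index_options: offsets alone.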

Related

How can I use query_string to match both nested and non-nested fields at the same time?

I have an index with a mapping something like this:
"email" : {
"type" : "nested",
"properties" : {
"from" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
},
"subject" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
},
"to" : {
"type" : "text",
"analyzer" : "lowercase_keyword",
"fielddata" : true
}
}
},
"textExact" : {
"type" : "text",
"analyzer" : "lowercase_standard",
"fielddata" : true
}
I want to use query_string to search for matches in both the nested and the non-nested field at the same time, e.g.
email.to:foo@example.com AND textExact:bar
But I can't figure out how to write a query that will search both fields at once. The following doesn't work, because query_string searches do not return nested documents:
"query": {
"query_string": {
"fields": [
"textExact",
"email.to"
],
"query": "email.to:foo#example.com AND textExact:bar"
}
}
I can write a separate nested query, but that will only search against nested fields. Is there any way I can use query_string to match both nested and non-nested fields at the same time?
I am using Elasticsearch 6.8. Cross-posted on the Elasticsearch forums.
Nested documents can only be queried with the nested query.
You can take either of the two approaches below.
1. Combine a nested query and a normal query in a must clause, which works like an "and" across the different queries.
{
"query": {
"bool": {
"must": [
{
"nested": {
"path": "email",
"query": {
"term": {
"email.to": "foo#example.com"
}
}
}
},
{
"match": {
"textExact": "bar"
}
}
]
}
}
}
2. copy_to
The copy_to parameter allows you to copy the values of multiple fields into a group field, which can then be queried as a single field.
{
"mappings": {
"properties": {
"textExact":{
"type": "text"
},
"to_email":{
"type": "keyword"
},
"email":{
"type": "nested",
"properties": {
"to":{
"type":"keyword",
"copy_to": "to_email" --> copies to non-nested field
},
"from":{
"type":"keyword"
}
}
}
}
}
}
Query
{
"query": {
"query_string": {
"fields": [
"textExact",
"to_email"
],
"query": "to_email:foo#example.com AND textExact:bar"
}
}
}
Result
"_source" : {
"textExact" : "bar",
"email" : [
{
"to" : "sdfsd@example.com",
"from" : "a@example.com"
},
{
"to" : "foo@example.com",
"from" : "sdfds@example.com"
}
]
}

Custom indexing template is not being applied

I have a project where I am to analyze and visualize access log data. I use Logstash to send data to Elasticsearch and then visualize some stuff with Kibana.
Everything has worked fine until I discovered that I needed the Path Hierarchy Analyzer to show what I want. I now have a custom template (JSON) and changed the output section of my Logstash configuration. But when I index data, my template is not applied.
(Version 5.2 of both Elasticsearch and Logstash; I can't upgrade since that is the version in use where I work.)
My JSON file is valid. As far as the input and filters go, my Logstash configuration is fine, too. I guess I made a mistake in the output.
I already tried setting manage_template to false. I also tried template_overwrite => "false" just for the sake of it.
I tried creating the index first (Kibana Dev Tools) and populating it after. I created the index template and then the index. That way my template was applied, and when I created the index pattern, everything seemed correct. Then I indexed one of my log files. I ended up with a Courier Fetch Error. http://localhost:9200/_all/_mapping?pretty=1 showed me that while indexing my data a default template was being used instead of my custom one. Nothing was different from before adding a custom template.
I searched the web and read everything I could find on Stack Overflow and in the Elastic forum about custom templates not being applied. I tried all the solutions provided there, which is why I ended up opting for a custom template saved locally, with its path provided in my Logstash output. But I am out of ideas now.
This is the output of my logstash configuration:
output {
elasticsearch {
hosts => ["localhost:9200"]
template => "/etc/logstash/conf.d/template.json"
index => "beam-%{+YYYY.MM.dd}"
manage_template => "true"
template_overwrite => "true"
document_type => "beamlogs"
}
stdout {
codec => rubydebug
}
}
And this is my custom template:
{
"template": "beam_custom",
"index_patterns": "beam-*",
"order" : 5,
"settings": {
"number_of_shards": 1,
"analysis": {
"analyzer": {
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
},
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "/"
},
"custom_hierarchy_reversed": {
"type": "path_hierarchy",
"delimiter": "/",
"reverse": "true"
}
}
}
},
"mappings": {
"beamlogs": {
"properties": {
"object": {
"type": "text",
"fields": {
"tree": {
"type": "text",
"analyzer": "custom_path_tree"
},
"tree_reversed": {
"type": "text",
"analyzer": "custom_path_tree_reversed"
}
}
},
"referral": {
"type": "text",
"fields": {
"tree": {
"type": "text",
"analyzer": "custom_path_tree"
},
"tree_reversed": {
"type": "text",
"analyzer": "custom_path_tree_reversed"
}
}
},
"#timestamp" : {
"type" : "date"
},
"action" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"datetime" : {
"type" : "date",
"format": "time_no_millis",
"fields" : {
"keyword" : {
"type": "keyword"
}
}
},
"id" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"info" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"message" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"page" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"path" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"result" : {
"type" : "long"
},
"s_direct" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"s_limit" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"s_mobile" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"s_terms" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
},
"size" : {
"type" : "long"
},
"sort" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword"
}
}
}
}
}
}
}
After indexing my data this is part of what I get with http://localhost:9200/_all/_mapping?pretty=1
"datetime" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"object" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
datetime should not have the type text. But worse than that, fields like object.tree are not even created.
I really don't care about the wrong mapping for datetime, but I need to get the Path Hierarchy Analyzer to work. I just don't know what to do anymore.
So, what I just tried was creating the index template in Kibana:
PUT _template/beam_custom
// followed by the contents of my template.json
I then checked if the template was created.
GET _template/beam_custom
The output was this:
{
"beam_custom": {
"order": 100,
"template": "beam_custom",
"settings": {
"index": {
"analysis": {
"analyzer": {
"custom_path_tree_reversed": {
"tokenizer": "custom_hierarchy_reversed"
},
"custom_path_tree": {
"tokenizer": "custom_hierarchy"
}
},
"tokenizer": {
"custom_hierarchy": {
"type": "path_hierarchy",
"delimiter": "/"
},
...
So I guess creating the template worked.
Then I created an index:
PUT beam-2019.07.15
But when I checked the index, I got this:
{
"beam-2019.07.15": {
"aliases": {},
"mappings": {},
"settings": {
"index": {
"creation_date": "1563044670605",
"number_of_shards": "5",
"number_of_replicas": "1",
"uuid": "rGzplctSQDmrI_NSlt47hQ",
"version": {
"created": "5061699"
},
"provided_name": "beam-2019.07.15"
}
}
}
}
Shouldn't the index pattern have been recognized? I think this is the heart of the problem. I thought that my template would have been used and the output should have been something like this instead:
{
"beam-2019.07.15": {
"aliases": {},
"mappings": {
"logs": {
"properties": {
"#timestamp": {
"type": "date"
},
"action": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},...
Why doesn't it recognize the pattern?
So, I found the mistake.
When I looked up how to build my own template, at some point I was looking at the documentation for the current version. But in 5.2, "index_patterns" doesn't exist.
"template": "beam_custom",
"index_patterns": "beam-*",
This doesn't work then, of course.
Instead, I dropped the "index_patterns" line and defined my pattern in the template parameter.
"template": "beam-*",
// rest of the template unchanged
This fixed the problem. After that, my pattern was recognized.
Yet I am facing a different problem now. The Path Hierarchy Analyzer is not working properly. object.tree and the rest of the fields I want are not being created.
GET beam-*/_search
{
"query": {
"term": {
"object.tree": "/belletristik/"
}
}
}
yields nothing, though I should have a few hundred hits. Looking at my data, there are no analyzed fields for my paths. Any ideas?
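Two checks that might narrow this down (a diagnostic sketch; the sample path /belletristik/roman is made up). Since a template only applies at index creation time, verify that the analyzer and the tree subfield actually exist on the live index:
GET beam-2019.07.15/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/belletristik/roman"
}

GET beam-2019.07.15/_mapping/beamlogs/field/object.tree
Also note that the path_hierarchy tokenizer emits tokens without a trailing delimiter (/belletristik, /belletristik/roman), so a term query for "/belletristik/" may miss even correctly analyzed documents.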

Querying nested fields with analyzer results in error

I tried to use a synonym analyzer for my already-working Elasticsearch type. Here's the mapping of my serviceEntity:
{
"serviceentity" : {
"properties":{
"ServiceLangProps" : {
"type" : "nested",
"properties" : {
"NAME" : {"type" : "string", "search_analyzer": "synonym"},
"LONG_TEXT" : {"type" : "string", "search_analyzer": "synonym"},
"DESCRIPTION" : {"type" : "string", "search_analyzer": "synonym"},
"MATERIAL" : {"type" : "string", "search_analyzer": "synonym"},
"LANGUAGE_ID" : {"type" : "string", "include_in_all": false}
}
},
"LinkProps" : {
"type" : "nested",
"properties" : {
"TITLE" : {"type" : "string", "search_analyzer": "synonym"},
"LINK" : {"type" : "string"},
"LANGUAGE_ID" : {"type" : "string", "include_in_all": false}
}
},
"MediaProps" : {
"type" : "nested",
"properties" : {
"TITLE" : {"type" : "string", "search_analyzer": "synonym"},
"FILENAME" : {"type" : "string"},
"LANGUAGE_ID" : {"type" : "string", "include_in_all": false}
}
}
}
}
}
And these are my settings:
{
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms": [
"lorep, spaceship",
"ipsum, planet"
]
}
},
"analyzer": {
"synonym": {
"filter": [
"lowercase",
"synonym"
],
"tokenizer": "whitespace"
}
}
}
}
When I try to search for anything, I get this error:
Caused by: org.elasticsearch.index.query.QueryParsingException: [nested] nested object under path [ServiceLangProps] is not of nested type
And I don't understand why. If I don't add any analyzer to my settings, everything works fine.
I'm using the Java API to communicate with the Elasticsearch instance, so my code looks something like this for the multi-match query:
MultiMatchQueryBuilder multiMatchBuilder = QueryBuilders.multiMatchQuery(fulltextSearchString, QUERY_FIELDS).analyzer("synonym");
The query string created by the java API looks like this:
{
"query" : {
"bool" : {
"must" : {
"bool" : {
"should" : [ {
"nested" : {
"query" : {
"bool" : {
"must" : [ {
"match" : {
"ServiceLangProps.LANGUAGE_ID" : {
"query" : "DE",
"type" : "boolean"
}
}
}, {
"multi_match" : {
"query" : "lorem",
"fields" : [ "ServiceLangProps.NAME", "ServiceLangProps.DESCRIPTION", "ServiceLangProps.MATERIALKURZTEXT", "ServiceLangProps.DESCRIPTION_RICHTEXT" ],
"analyzer" : "synonym"
}
} ]
}
},
"path" : "ServiceLangProps"
}
}, {
"nested" : {
"query" : {
"bool" : {
"must" : [ {
"match" : {
"LinkProps.LANGUAGE_ID" : {
"query" : "DE",
"type" : "boolean"
}
}
}, {
"match" : {
"LinkProps.TITLE" : {
"query" : "lorem",
"type" : "boolean"
}
}
} ]
}
},
"path" : "LinkProps"
}
}, {
"nested" : {
"query" : {
"bool" : {
"must" : [ {
"match" : {
"MediaProps.LANGUAGE_ID" : {
"query" : "DE",
"type" : "boolean"
}
}
}, {
"match" : {
"MediaProps.TITLE" : {
"query" : "lorem",
"type" : "boolean"
}
}
} ]
}
},
"path" : "MediaProps"
}
} ]
}
},
"filter" : {
"bool" : { }
}
}
}
}
If I try it on the LinkProps or MediaProps, I get the same error for the respective nested object.
Edit: I'm using version 2.4.6 of elasticsearch
It would be helpful to see the query string as well, and to know which version of ES is being used.
I couldn't see a synonyms_path, and the fact that you are using nested types can also cause that error.
You have probably seen this already, but in case you haven't:
https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-synonym-tokenfilter.html
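For reference, if the synonym list should live in a file instead of the index settings, the filter would look something like this (a sketch; the path is resolved relative to the Elasticsearch config directory, and analysis/synonyms.txt is a made-up location):
"filter": {
  "synonym": {
    "type": "synonym",
    "ignore_case": "true",
    "synonyms_path": "analysis/synonyms.txt"
  }
}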
I created a minimal example of what I'm trying to do.
My mapping looks like this:
{
"serviceentity" : {
"properties":{
"LinkProps" : {
"type" : "nested",
"properties" : {
"TITLE" : {"type" : "string", "search_analyzer": "synonym"},
"LINK" : {"type" : "string"},
"LANGUAGE_ID" : {"type" : "string", "include_in_all": false}
}
}
}
}
}
And my settings for the synonym analyzer in Java code:
XContentBuilder builder = jsonBuilder()
.startObject()
.startObject("analysis")
.startObject("filter")
.startObject("synonym") // The name of the analyzer
.field("type", "synonym") // The type (derivate)
.field("ignore_case", "true")
.array("synonyms", synonyms) // The synonym list
.endObject()
.endObject()
.startObject("analyzer")
.startObject("synonym")
.field("tokenizer", "whitespace")
.array("filter", "lowercase", "synonym")
.endObject()
.endObject()
.endObject()
.endObject();
The metadata which the ElasticSearch Head Chrome plugin spits out looks like this:
{
"analysis": {
"filter": {
"synonym": {
"ignore_case": "true",
"type": "synonym",
"synonyms": [
"Test, foo",
"Title, bar"
]
}
},
"analyzer": {
"synonym": {
"filter": [
"lowercase",
"synonym"
],
"tokenizer": "whitespace"
}
}
}
}
When I now use a search query to look for "Test" I get the same error as mentioned in my first post. Here's the query
{
"query": {
"bool": {
"must": {
"nested": {
"path": "LinkProps",
"query": {
"multi_match": {
"query": "Test",
"fields": [
"LinkProps.TITLE",
"LinkProps.LINK"
],
"analyzer": "synonym"
}
}
}
}
}
}
}
which leads to this error
{
"error": {
"root_cause": [
{
"type": "query_parsing_exception",
"reason": "[nested] nested object under path [LinkProps] is not of nested type",
"index": "minimal",
"line": 1,
"col": 44
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "minimal",
"node": "6AhE4RCIQwywl49h0Q2-yw",
"reason": {
"type": "query_parsing_exception",
"reason": "[nested] nested object under path [LinkProps] is not of nested type",
"index": "minimal",
"line": 1,
"col": 44
}
}
]
},
"status": 400
}
When I check the analyzer with
GET http://localhost:9200/minimal/_analyze?text=foo&analyzer=synonym&pretty=true
I get the correct answer
{
"tokens": [
{
"token": "foo",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "test",
"start_offset": 0,
"end_offset": 3,
"type": "SYNONYM",
"position": 0
}
]
}
So the analyzer seems to be set up correctly. Did I mess up the mappings? I guess the problem isn't that I have nested objects, or is it?
I just tried this
{
"query": {
"bool": {
"must": {
"query": {
"multi_match": {
"query": "foo",
"fields": [
"LinkProps.TITLE",
"LinkProps.LINK"
],
"analyzer": "synonym"
}
}
}
}
}
}
As you can see, I removed the "nested" wrapper
"nested": {
"path": "LinkProps",
...
}
which now at least yields some results (not sure yet whether they are the correct ones). I'm trying to apply this to the original project and will keep you posted on whether it works there too.
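For anyone hitting the same error: "[nested] nested object under path [LinkProps] is not of nested type" means that the mapping Elasticsearch holds at query time has no nested type at that path. That can happen when the index is recreated with the analysis settings but the mapping is never re-applied, or is applied under a different type name. A quick check (index name taken from the minimal example above):
GET minimal/_mapping
If LinkProps shows up without "type": "nested" in the response, the mapping was not applied; in 2.x an object field cannot be changed to nested afterwards, so the index has to be recreated with both the settings and the mapping.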

Elasticsearch query response influenced by _id

I created an index with the following mappings and settings:
{
"settings": {
"analysis": {
"analyzer": {
"case_insensitive_index": {
"type": "custom",
"tokenizer": "filename",
"filter": ["icu_folding", "edge_ngram"]
},
"default_search": {
"type":"standard",
"tokenizer": "filename",
"filter": [
"icu_folding"
]
}
},
"tokenizer" : {
"filename" : {
"pattern" : "[^\\p{L}\\d]+",
"type" : "pattern"
}
},
"filter" : {
"edge_ngram" : {
"side" : "front",
"max_gram" : 20,
"min_gram" : 3,
"type" : "edgeNGram"
}
}
}
},
"mappings": {
"metadata": {
"properties": {
"title": {
"type": "string",
"analyzer": "case_insensitive_index"
}
}
}
}
}
I have the following documents:
{"title":"P-20150531-27332_News.jpg"}
{"title":"P-20150531-27341_News.jpg"}
{"title":"P-20150531-27512_News.jpg"}
{"title":"P-20150531-27343_News.jpg"}
Creating the documents with simple numerical IDs
111
112
113
114
and querying using the query
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title" : {
"query" : "P-20150531-27332_News.jpg",
"type" : "boolean",
"fuzziness" : "AUTO"
}
}
}
}
results in the correct scoring and ordering of the documents returned:
P-20150531-27332_News.jpg -> 2.780985
P-20150531-27341_News.jpg -> 0.8262239
P-20150531-27512_News.jpg -> 0.8120311
P-20150531-27343_News.jpg -> 0.7687101
Strangely, creating the same documents with UUIDs
557eec2e3b00002c03de96bd
557eec0f3b00001b03de96b8
557eec0c3b00001b03de96b7
557eec123b00003a03de96ba
as IDs results in different scorings of the documents:
P-20150531-27341_News.jpg -> 2.646321
P-20150531-27332_News.jpg -> 2.1998127
P-20150531-27512_News.jpg -> 1.7725387
P-20150531-27343_News.jpg -> 1.2718291
Is this intentional behaviour on Elasticsearch's part? If so, how can I preserve the correct ordering regardless of the IDs used?
In the query it looks like you should be using 'default_search' as the analyzer for the match query, unless you actually intended to apply the edge-ngram analyzer to the search query too.
Example :
{
"from" : 0,
"size" : 10,
"query" : {
"match" : {
"title" : {
"query" : "P-20150531-27332_News.jpg",
"type" : "boolean",
"fuzziness" : "AUTO",
"analyzer" : "default_search"
}
}
}
}
default_search is used as the search analyzer only if no explicit search_analyzer or analyzer is specified in the mapping of the field.
The article here gives a good explanation of the rules by which analyzers are applied.
Scores can also differ because documents are routed to shards by _id and IDF is computed per shard by default; different IDs can therefore distribute the same documents differently across shards and change their relative scores. To make IDF take documents across all shards into account, you can use search_type=dfs_query_then_fetch.
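For example (a sketch; my-index is a placeholder, and metadata is the type from the mapping above):
GET /my-index/metadata/_search?search_type=dfs_query_then_fetch
{
  "from" : 0,
  "size" : 10,
  "query" : {
    "match" : {
      "title" : {
        "query" : "P-20150531-27332_News.jpg",
        "fuzziness" : "AUTO",
        "analyzer" : "default_search"
      }
    }
  }
}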

Multiple document types with same mapping in Elasticsearch

I have an index named test which can be associated with any number of document types, named sub_text1 to sub_textn, all of which will have the same mapping.
Is there any way to set up the index so that all document types share the same mapping for their documents? I.e. test/sub_text1/_mapping should be the same as test/sub_text2/_mapping.
Otherwise, if I have, say, 1000 document types, I will end up with 1000 copies of the same mapping, one per document type.
UPDATE:
PUT /test_index/
{
"settings": {
"index.store.type": "default",
"index": {
"number_of_shards": 5,
"number_of_replicas": 1,
"refresh_interval": "60s"
},
"analysis": {
"filter": {
"porter_stemmer_en_EN": {
"type": "stemmer",
"name": "porter"
},
"default_stop_name_en_EN": {
"type": "stop",
"name": "_english_"
},
"snowball_stop_words_en_EN": {
"type": "stop",
"stopwords_path": "snowball.stop"
},
"smart_stop_words_en_EN": {
"type": "stop",
"stopwords_path": "smart.stop"
},
"shingle_filter_en_EN": {
"type": "shingle",
"min_shingle_size": "2",
"max_shingle_size": "2",
"output_unigrams": true
}
}
}
}
}
Intended mapping:
{
"sub_text" : {
"properties" : {
"_id" : {
"include_in_all" : false,
"type" : "string",
"store" : true,
"index" : "not_analyzed"
},
"alternate_id" : {
"include_in_all" : false,
"type" : "string",
"store" : true,
"index" : "not_analyzed"
},
"text" : {
"type" : "multi_field",
"fields" : {
"text" : {
"type" : "string",
"store" : true,
"index" : "analyzed",
},
"pdf": {
"type" : "attachment",
"fields" : {
"pdf" : {
"type" : "string",
"store" : true,
"index" : "analyzed",
}
}
}
}
}
}
}
}
I want each sub_text I create to have its own copy of this mapping, so that I can change it for one sub_text without affecting the others, e.g. I may want to add two custom analyzers to sub_text1 and three analyzers to sub_text3, while the rest stay the same.
UPDATE:
PUT /my-index/document_set/_mapping
{
"properties": {
"type": {
"type": "string",
"index": "not_analyzed"
},
"doc_id": {
"type": "string",
"index": "not_analyzed"
},
"plain_text": {
"type": "string",
"store": true,
"index": "analyzed"
},
"pdf_text": {
"type": "attachment",
"fields": {
"pdf_text": {
"type": "string",
"store": true,
"index": "analyzed"
}
}
}
}
}
POST /my-index/document_set/1
{
"type": "d1",
"doc_id": "1",
"plain_text": "simple text for doc1."
}
POST /my-index/document_set/2
{
"type": "d1",
"doc_id": "2",
"pdf_text": "cGRmIHRleHQgaXMgaGVyZS4="
}
POST /my-index/document_set/3
{
"type": "d2",
"doc_id": "3",
"plain_text": "simple text for doc3 in d2."
}
POST /my-index/document_set/4
{
"type": "d2",
"doc_id": "4",
"pdf_text": "cGRmIHRleHQgaXMgaGVyZSBpbiBkMi4="
}
GET /my-index/document_set/_search
{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"type" : "d1"
}
}
}
}
}
This gives me the documents related to type "d1". How do I add analyzers only to documents of type "d1"?
At the moment a possible solution is to use index templates or dynamic mappings. However, they do not allow wildcard type matching, so you would have to use the _default_ root type to apply the mappings to all types in the index; it would then be up to you to ensure that all your types fit the same dynamic mapping. This template example may work for you:
curl -XPUT localhost:9200/_template/template_1 -d '
{
"template" : "test",
"mappings" : {
"_default_" : {
"dynamic": true,
"properties": {
"field1": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
'
Do not do this.
Otherwise, if I have, say, 1000 document types, I will end up with 1000 copies of the same mapping, one per document type.
You're exactly right. For every additional _type with an identical mapping you are needlessly adding to the size of your index's mapping. They will not be merged, nor will any compression save you.
A much better solution is to simply create a shared _type and to create a field that represents the intended type. This completely avoids having wasted mappings and all of the negatives associated with it, including an unnecessary increase for your cluster state's size.
From there, you can imitate what Elasticsearch is doing for you and filter on your custom type without ballooning your mappings.
$ curl -XPUT localhost:9200/my-index -d '{
"mappings" : {
"my-type" : {
"properties" : {
"type" : {
"type" : "string",
"index" : "not_analyzed"
},
# ... whatever other mappings exist ...
}
}
}
}'
Then, for any search against sub_text1 (etc.), you can use a term (for one) or terms (for more than one) filter to imitate the _type filter that would otherwise happen for you.
$ curl -XGET localhost:9200/my-index/my-type/_search -d '{
"query" : {
"filtered" : {
"filter" : {
"term" : {
"type" : "sub_text1"
}
}
}
}
}'
This is doing the same thing as the _type filter, and you can create aliases that contain the filter if you want the higher-level search capability without exposing the filtering logic to clients.
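For example, a filtered alias per logical type could look like this (a sketch reusing the same hypothetical names):
$ curl -XPOST localhost:9200/_aliases -d '{
  "actions" : [
    {
      "add" : {
        "index" : "my-index",
        "alias" : "sub_text1",
        "filter" : {
          "term" : { "type" : "sub_text1" }
        }
      }
    }
  ]
}'
Searches that go through the sub_text1 alias are then filtered automatically, just as the built-in _type filter would have done.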
