How to read from only one index and set the other one as write when searching an alias in ElasticSearch 7.6?

I know it's possible to define two indices in an alias where one index has is_write_index set to true while the other has it set to false:
POST /_aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "test_index_1",
        "alias" : "my_alias",
        "is_write_index": true
      }
    }
  ]
}
POST /_aliases
{
  "actions" : [
    {
      "add" : {
        "index" : "test_index_2",
        "alias" : "my_alias",
        "is_write_index": false
      }
    }
  ]
}
As you can see, I've defined two indices test_index_1 and test_index_2 where the first one is a write index while the second one isn't.
Now, I want to query my_alias in such a way that searches happen only on test_index_2 (which has is_write_index set to false) while I write data to test_index_1, instead of reading from both indices, which is the default behaviour. In other words, I don't want search results to come from the index where is_write_index is set to true.
Is this possible? I've tried setting index.blocks.read to true on the write index, but then search queries on the alias fail with an exception. Instead, I want reads on the alias to query only the index that has is_write_index set to false.
How can I achieve this?

This can be achieved by using filtered aliases.
The way you do this is to apply a custom filter while adding the write index to the alias. The filter property defines a bool condition based on which data in this index is filtered and presented as a new view of the dataset; all search queries through the alias hit this index only via that view. So, if you want to avoid reading from the index you're currently writing to, apply a filter that no document in your dataset can ever satisfy, such as a must_not exists filter on a field that every document has.
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_index_1",
        "alias": "my_alias",
        "is_write_index": true,
        "filter": {
          "bool": {
            "must_not": {
              "exists": {
                "field": "<field_that_always_exists_in_your_documents>"
              }
            }
          }
        }
      }
    }
  ]
}
Once you're done writing the data, update the alias by removing the filter property to allow reads from both indices.
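For example (a sketch reusing the names above), re-adding the same index to the alias without the filter property should overwrite the filtered definition, so searches on my_alias hit both indices again:
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_index_1",
        "alias": "my_alias",
        "is_write_index": true
      }
    }
  ]
}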

You are using this feature in an incorrect fashion. If you use an alias for search, it will always attempt to read across all underlying indices. is_write_index is provided as a feature to support rollover and index patterns, where writes happen to one index but reads happen across all indices with the same alias or index pattern.
If your intent is to load data into one index while allowing the application to continue reading from the old index while data loading is going on, you should use two separate aliases, one for read and one for write, and devise a strategy to swap the alias pointing to the indices after your data loading is completed, as sketched below.
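A minimal sketch of that strategy (the index and alias names here are hypothetical): write to new_index via my_alias_write while the application keeps reading old_index via my_alias_read, then swap the read alias once loading finishes:
POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "old_index", "alias" : "my_alias_read" } },
    { "add" : { "index" : "new_index", "alias" : "my_alias_write" } }
  ]
}
POST /_aliases
{
  "actions" : [
    { "remove" : { "index" : "old_index", "alias" : "my_alias_read" } },
    { "add" : { "index" : "new_index", "alias" : "my_alias_read" } }
  ]
}
Because all actions in a single _aliases request are applied atomically, readers never see an empty alias during the swap.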

Related

Get most searched terms from indexes that are referenced by an alias

The search is pointing to an alias.
The indexing process creates a new index every 5 minutes.
Then the alias is updated to point to the new index.
The index is recreated to avoid sync problems that can occur if we update item by item when a change is made.
However, I need to keep track of the searched terms to produce a dashboard listing the most searched terms in a period, or even use Kibana to show/extract it.
*The searched terms can be multi-word, such as "white", "white summer night", etc. We are looking to rank the whole term, not the individual words.
I don't have experience with Elasticsearch, and the searches that I have tried did not bring relevant solutions.
Thanks for the help!
{
  "actions" : [
    { "remove" : { "index" : "catalog*", "alias" : "catalog-index" } },
    { "add" : { "index" : "catalog1234566", "alias" : "catalog-index" } }
  ]
}
Mappings:
{
  "mappings": {
    "properties": {
      "created_at": {
        "type": "integer"
      },
      "search_terms_key": {
        "type": "keyword"
      }
    }
  }
}
Query:
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "search_terms_key": {
      "terms": {
        "field": "search_terms_key",
        "value_type": "string"
      }
    }
  }
}
Log the search terms (or entire queries, if necessary), ingest those into Elasticsearch, then analyze them with Kibana. The index alias configuration is not relevant.
You should get the logs either directly from whatever connects to Elasticsearch, or from a proxy between it and Elasticsearch.
You could get Elasticsearch itself to log queries, but that's usually a bad idea in terms of performance.
Since it's the entire term you're after, be sure to use a keyword mapping on the search term.
Once you have search terms ingested, use a terms aggregation to show the most popular searches.
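As a sketch (assuming the ingested logs use the search_terms_key keyword field and created_at timestamp from the question's mapping, and a hypothetical index name search-logs), the most searched terms in a period could be fetched like this:
GET /search-logs/_search
{
  "size": 0,
  "query": {
    "range": {
      "created_at": {
        "gte": 1577836800,
        "lte": 1580515199
      }
    }
  },
  "aggs": {
    "top_search_terms": {
      "terms": {
        "field": "search_terms_key",
        "size": 10
      }
    }
  }
}
"size": 0 skips the hits themselves; the terms aggregation returns the ten most frequent whole terms, multi-word phrases included, since keyword fields are not analyzed.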

Is there a way to give aliases to all the time series indices in ElasticSearch?

I saw that the following is the way to add an alias for one index:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/indices-aliases.html
Time series indices are usually created every day as per configuration, so how can I give an alias to those individual indices while keeping the date part as it is?
If you scroll down the page you linked to a little bit, you'll find that you can do what you want using a glob pattern (e.g. time*):
POST /_aliases
{
  "actions" : [
    { "add" : { "index" : "time*", "alias" : "all_time_indices" } }
  ]
}
Note, however, that if a new time series index is created it won't get the alias automatically. For that you'd need to set up an index template instead:
PUT _template/my-time-series
{
  "index_patterns": ["time*"],
  "aliases": {
    "all_time_indices": {}
  },
  "settings": {
    ...
  },
  "mappings": {
    ...
  }
}
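As a quick sanity check (the index name here is hypothetical), any new index whose name matches the pattern picks up the alias automatically:
PUT time-2020-03-01

GET /_alias/all_time_indices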

Elasticsearch 7 number_format_exception for input value as a String

I have a field in my index with this mapping:
"sequence_number" : {
"type" : "long",
"copy_to" : [
"_custom_all"
]
}
and I am using this search query:
POST /my_index/_search
{
  "query": {
    "term": {
      "sequence_number": {
        "value": "we"
      }
    }
  }
}
I am getting this error message:
,"index_uuid":"FTAW8qoYTPeTj-cbC5iTRw","index":"my_index","caused_by":{"type":"number_format_exception","reason":"For input string: \"we\""}}}]},"status":400}
at org.elasticsearch.client.RestClient.convertResponse(RestClient.java:260) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:238) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:212) ~[elasticsearch-rest-client-7.1.1.jar:7.1.1]
at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1433) ~[elasticsearch-rest-high-level-client-7.1.1.jar:7.1.1]
at
How can I ignore number_format_exception errors, so that the query just doesn't return anything or ignores this filter in particular? Either is acceptable.
Thanks in advance.
What you are looking for is not possible. Ideally, you should have coerce enabled on your numeric fields so that your index doesn't contain dirty data in the first place.
The best solution is to handle this in the application that generates the Elasticsearch query: if you are searching numeric fields, check for a NumberFormatException and reject the query when it occurs, so the bad value never reaches Elasticsearch.
Edit: Another interesting approach is to validate the query before sending it to ES, using the Validate API as suggested by @prakash. The only downside is that it adds another network call, but if your application is not latency-sensitive it can be used as a workaround.
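For reference, a sketch of that check using the Validate API against the query from the question; the response contains "valid": false for the non-numeric value:
GET /my_index/_validate/query
{
  "query": {
    "term": {
      "sequence_number": {
        "value": "we"
      }
    }
  }
}
Adding ?explain=true to the request also returns the underlying number_format_exception message.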

Update configuration for actively used index without data loss

Sometimes, I need to update mappings, settings, or bind default pipelines to an actively used index.
For the time being, I am using a method that involves data loss, as follows:
1. update the index template with the proper mapping (or bind the default pipeline via index.default_pipeline);
2. create a_new_index (matching the template index_patterns);
3. reindex index_to_fix into a_new_index to migrate the data already indexed;
4. use an alias to redirect incoming indexing requests to a_new_index (the alias will have the same name as index_to_fix to ensure the indexing is undisturbed) and delete index_to_fix.
But between step 3 and step 4 there is a time gap, during which newly indexed data is written to the original index_to_fix and lost.
Is there a way to update configurations for an actively used index without any data loss?
Thanks to @LeBigCat for the help; after some discussion, I think this problem can be solved in three steps.
Use Alias for CRUD
First things first: try not to use an index directly; use an alias if possible. Since you can't create an alias with the same name as an existing index, you can't transparently replace an index you address directly, even if it's broken (badly designed). The easiest way is to use a template that includes the index name in the alias name.
PUT _template/test
{
  ...
  "aliases" : {
    "{index}-alias" : {}
  }
}
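With that template in place (assuming its index_patterns matches the name), creating an index automatically creates the matching alias; for example, index_to_fix gets index_to_fix-alias:
PUT index_to_fix

GET /_alias/index_to_fix-alias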
Redirect the Indexing
Since index_to_fix is being actively used, after updating the template and creating a new index a_new_index, we can use the alias to redirect indexing to a_new_index.
POST /_aliases
{
  "actions" : [
    { "add": { "index": "a_new_index", "alias": "index_to_fix-alias" } },
    { "remove": { "index": "index_to_fix", "alias": "index_to_fix-alias" } }
  ]
}
Migrating the Data
Simply use _reindex to migrate all the data from index_to_fix to a_new_index.
POST _reindex
{
  "source": {
    "index": "index_to_fix"
  },
  "dest": {
    "index": "index_to_fix-alias"
  }
}

Find documents in Elasticsearch where `ignore_malformed` was triggered

Elasticsearch by default throws an exception if inserting data to a field which does not fit the existing type. For example, if a field has been created as number type, inserting a document with a string value for that field causes an error.
This behavior can be changed by enabling the ignore_malformed setting, which means such fields are silently ignored for indexing purposes but retained in the _source document, meaning that the invalid values cannot be searched or aggregated but are still included in the returned document.
This is preferable behavior in our use case, but we wish to be able to locate such documents somehow so we can fix them in the future.
Is there any way to somehow flag documents for which some malformed fields were ignored? We control the document insertion process fully, so we can modify all insertion flags, or do a trial insert, or anything, to reach our goal.
You can use the exists query to find documents where this field does not exist; see this example:
PUT foo
{
  "mappings": {
    "bar": {
      "properties": {
        "baz": {
          "type": "integer",
          "ignore_malformed": true
        }
      }
    }
  }
}
PUT foo/bar/1
{
  "baz": "field"
}
GET foo/bar/_search
{
  "query": {
    "bool": {
      "filter": {
        "bool": {
          "must_not": [
            {
              "exists": {
                "field": "baz"
              }
            }
          ]
        }
      }
    }
  }
}
There is no dedicated mechanism though, so this search also finds documents where the field was intentionally left unset.
You cannot: when you search in Elasticsearch, you don't search the document source but the inverted index, which contains the analyzed data.
The ignore_malformed flag is saying "always store the document, analyze if possible".
You can try it yourself: create a malformed document and use the _termvectors API to see how the document was analyzed and stored in the inverted index. In the case of a string field, you can see that an array is stored as an empty string, etc., but the field will exist.
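For example (a sketch reusing the foo/bar/baz example from the first answer), the term vectors of the malformed document should show no indexed terms for the ignored field:
GET /foo/bar/1/_termvectors?fields=baz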
So forget the inverted index; let's use the source!
Scroll over all your data until you find the anomaly. I use a small Python script that runs a scroll search, deserializes each hit, and tests the field type of every document (very slow), but it gives me a list of the wrong document IDs.
Using a script query can be very slow and can crash your cluster, so use it with caution, maybe as a post_filter.
Here I want to retrieve the documents where country_name is not a string:
{
  "_source": false,
  "timeout" : "30s",
  "query" : {
    "query_string" : {
      "query" : "locale:de_ch"
    }
  },
  "post_filter": {
    "script": {
      "script": "!(_source.country_name instanceof String)"
    }
  }
}
"_source:false" => I want only document ID
"timeout" => prevent crash
As you can see, this is a missing feature. I know Logstash tags documents that fail, so Elasticsearch could implement the same thing.
