I am routing data to Elasticsearch using NiFi, and I use NiFi to create indices dynamically based on a set of attributes. I'm using Index Lifecycle Management (ILM) in Elasticsearch, which requires every index to be bootstrapped manually beforehand for the ILM settings to be applied. Since my NiFi flow ingests messages into Elasticsearch automatically, any index it creates on the fly will not have ILM policies applied.
Currently my flow is: NiFi consume from Kafka --> UpdateAttribute --> PutElasticsearchRecord.
A solution (I think) would be to put an InvokeHTTP processor in front of the PutElasticsearchRecord processor to bootstrap the indices dynamically, using the extracted attributes, before ingesting into Elasticsearch. Indices are named dynamically using the syntax: index_${attribute_1}_${attribute_2}. My only concern here is that the InvokeHTTP processor would run for every new flowfile. This could mean thousands of calls to bootstrap an index. And if the index already exists, there could be a collision.
Is this really the best way to do this? Perhaps I could run the QueryElasticsearchRecord processor to get a list of indices and somehow match that against incoming flowfiles on the attribute_1 and attribute_2 fields. But that would still require a continuous query, I think?
What you could do is have InvokeHTTP run if and only if the flowfile carries a specific value or attribute signaling that a new (previously unsent) index is needed in Elasticsearch. Just an idea if you want to head down that route.
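To make the "only when the index is new" idea concrete, here is a minimal Python sketch of the gating logic, not a NiFi configuration. It assumes the ILM bootstrap call is a PUT that creates a first index named <alias>-000001 with the alias marked as write index, and that an index template carrying the ILM policy already exists. The names index_name, bootstrap_request and BootstrapGate are hypothetical, and the in-memory set would be lost on restart.

```python
import json
from typing import Tuple


def index_name(attribute_1: str, attribute_2: str) -> str:
    # Mirrors the naming scheme from the question: index_${attribute_1}_${attribute_2}
    return f"index_{attribute_1}_{attribute_2}"


def bootstrap_request(alias: str) -> Tuple[str, str, str]:
    # Build (method, path, body) for the bootstrap call.
    # Assumption: the first index is <alias>-000001 and the alias is its write index.
    body = {"aliases": {alias: {"is_write_index": True}}}
    return "PUT", f"/{alias}-000001", json.dumps(body)


class BootstrapGate:
    """Remembers which aliases were already bootstrapped (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()

    def needs_bootstrap(self, alias: str) -> bool:
        # True only the first time an alias is seen, so the HTTP call
        # is issued once per index name rather than once per flowfile.
        if alias in self._seen:
            return False
        self._seen.add(alias)
        return True


gate = BootstrapGate()
for attrs in [("app", "prod"), ("app", "prod"), ("app", "dev")]:
    alias = index_name(*attrs)
    if gate.needs_bootstrap(alias):
        method, path, body = bootstrap_request(alias)
        print(method, path, body)
```

With the three sample flowfiles above, only two bootstrap requests are printed. As for the collision worry, Elasticsearch rejects a create request for an index that already exists instead of overwriting it, so a repeated call after a restart fails with an error but does not damage the index.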
Env Details:
Elasticsearch version 7.8.1
The routing param is optional in the index settings.
As per ElasticSearch docs - https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.
We have ended up in this scenario: earlier we were using a custom routing param (let's say customerId), and for some reason we now need to remove custom routing.
This means the document ID will now be used as the default routing value. As a result, an index operation creates a duplicate record with the same ID on a different shard. Before we removed custom routing, the same operation resulted in an update of the existing record (as expected).
I am considering the following approaches to get out of this. Please advise if you have a better approach to suggest; the key here is to AVOID DOWNTIME.
Approach 1:
As we receive an update request, let the duplicate record get created. Once the record without custom routing has been created, issue a delete request for the record with custom routing.
CONS: Records that never receive an update will linger with custom routing. We want to avoid this, as it might result in unforeseen scenarios in the future.
Approach 2:
Use the Reindex API to migrate data to a new index (turning off custom routing during the migration). The application will use the new index after a successful migration.
CONS: Some of our indices are huge and take 12+ hours to reindex. Because the Elasticsearch Reindex API uses a snapshot mechanism, it will not migrate records created during that 12-hour window, so this approach needs downtime.
Please suggest an alternative if you have faced this before.
Thanks @Val. I also found a few other approaches, like writing to both indices and reading from the old one, then switching reads to the new one after reindexing is finished. Something along the following lines:
Create aliases pointing to the old indices (*_v1)
Point the application to these aliases instead of the actual indices
Create new indices (*_v2) with the same mapping
Move data from the old indices to the new ones using reindexing, and make sure we don't retain custom routing during this
After reindexing, change the aliases to point to the new indices instead of the old ones (need to verify this though, but there are easy alternatives if this doesn't work)
Once verification is done, delete the old indices
What do we do in the transition period (the window between reindexing start and reindexing finish)?
Write to both indices (old and new) and read from the old indices via the aliases
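For reference, the reindex and alias-switch steps could look roughly like the request bodies built below. This is a sketch under assumptions: the index and alias names are made up, and it relies on the Reindex API's dest.routing option ("discard" drops the routing value of each copied document) and on the _aliases endpoint applying its actions as one atomic operation. It only builds the JSON and does not send anything.

```python
import json


def reindex_body(old_index: str, new_index: str) -> dict:
    # Body for POST /_reindex.
    # "routing": "discard" tells reindex not to carry over the custom
    # routing value, so documents are routed by _id in the new index.
    return {
        "source": {"index": old_index},
        "dest": {"index": new_index, "routing": "discard"},
    }


def alias_swap_body(alias: str, old_index: str, new_index: str) -> dict:
    # Body for POST /_aliases.
    # Both actions go in one request, so readers never see the alias
    # pointing at neither index or at both.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names, for illustration only.
print(json.dumps(reindex_body("orders_v1", "orders_v2"), indent=2))
print(json.dumps(alias_swap_body("orders", "orders_v1", "orders_v2"), indent=2))
```

The dual-write part during the transition window still has to happen in the application; nothing in these two requests covers it.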
I'm looking to set up my index so that it is partitioned into daily sub-indices whose individual settings I can adjust depending on the age of each index, i.e. indices >= 30 days old should be moved to slower hardware etc. I am aware I can do this with a lifecycle policy.
What I'm unable to join the dots on is how to set up the original index to be partitioned by day. When adding or querying data, do I need to specify the individual daily indices, or is there something in Elasticsearch that will do this for me? If the latter, how does it work for adding and for querying (assuming they are different)? How does it determine which partitions are relevant for a query, or which partition to add a document to? (I'm assuming there is a timestamp field, but I can't see from the docs how it's all linked together.)
I'm using the base Elasticsearch OSS v7.7.1 without any plugins installed.
There's no such thing as sub-indices or partitions in Elasticsearch. If you want to use ILM, which you should, then you are using aliases and multiple indices.
You will need to upgrade from 7.7, which is EOL, and use the default distribution to get access to ILM as well.
Getting back to your conceptual questions, https://www.elastic.co/guide/en/elasticsearch/reference/current/overview-index-lifecycle-management.html and the following few chapters dive into it. But to your questions:
The major assumption of using ILM is that the data being ingested is current, so on a rough level, data from today will end up in an index from today.
If you are indexing historic data then you may want to put that into "traditional" index names, e.g. logs-2021.08.09, and then attach them to the ILM policy as per https://www.elastic.co/guide/en/elasticsearch/reference/current/ilm-with-existing-indices.html
When querying, Elasticsearch will handle accessing all the indices it needs based on the request it receives. It does this via https://www.elastic.co/guide/en/elasticsearch/reference/current/search-field-caps.html
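As an illustration of the "traditional" naming mentioned above, here is a small Python sketch that derives a daily index name from a date and a wildcard pattern that covers all of those indices in a search. The helper names and the "logs" prefix are assumptions for the example; with ILM and rollover you would normally write to the alias and not compute names yourself.

```python
from datetime import date


def daily_index(prefix: str, day: date) -> str:
    # Build a name such as logs-2021.08.09 for historic data.
    return f"{prefix}-{day:%Y.%m.%d}"


def search_target(prefix: str) -> str:
    # A wildcard pattern usable as the index part of a search URL,
    # e.g. GET /logs-*/_search, so the caller does not list every day.
    return f"{prefix}-*"


print(daily_index("logs", date(2021, 8, 9)))  # logs-2021.08.09
print(search_target("logs"))                  # logs-*
```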
I currently have a successful StormCrawler instance crawling about 20 sites, and indexing the content to one Elasticsearch index. Is it possible, either in ES or via StormCrawler, to send each host's content to its own unique content index?
Out of curiosity: why do you need to do that? Having one index per host seems rather wasteful. You can filter the results based on a field like host if you want to provide results for a particular host.
To answer your question, there is no direct way of doing it currently, as the IndexerBolt is connected to one index only. You could declare one IndexerBolt per index you need and add a custom bolt to fan out based on the value of the host metadata, but this is not dynamic and rather heavy-handed. There could be a way of doing it using pipelines in ES, not sure.
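On the pipeline idea, one possibility (untested here against StormCrawler) is an ingest pipeline whose set processor rewrites the _index metadata field from a document field. The sketch below only builds such a pipeline definition in Python. It assumes each document has a lowercase host field and that the pipeline is registered via PUT /_ingest/pipeline/<name> and attached to the original index, for example through its index.default_pipeline setting. The function name and the "content" prefix are hypothetical.

```python
import json


def per_host_pipeline(index_prefix: str = "content", host_field: str = "host") -> dict:
    # The value uses a template snippet like {{host}}, which the ingest node
    # resolves per document. Assumption: the field exists and is a valid
    # (lowercase) index name fragment.
    target = index_prefix + "-{{" + host_field + "}}"
    return {
        "description": "Route each document to an index named after its host",
        "processors": [
            {"set": {"field": "_index", "value": target}}
        ],
    }


print(json.dumps(per_host_pipeline(), indent=2))
```

This keeps a single IndexerBolt, but it still results in one index per host, so the point about that being wasteful applies either way.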
I use Kafka with the following connector in order to send data to Elasticsearch: connector.class=io.confluent.connect.elasticsearch.ElasticsearchSinkConnector
This combination already works well.
My problem:
My environment is fully Dockerized, and I need to set the whole system up multiple times per day. Each time, I need to define the mapping for each index in ES before I can send any data from Kafka. Otherwise, ES uses the wrong data types, and I can't work with the data without reindexing. Sadly, the dynamic mapping function does not work well enough for me. Right now I use a bash script that sets the mappings for each index.
My question:
Is there a way to set/define the mapping in the connector itself, so that I don't need to run my bash script?
Consider the following use case:
I want the information from one particular log line to be indexed into Elasticsearch, as a document X.
I want the information from some log line further down the log file to be indexed into the same document X (not overriding the original, just adding more data).
The first part I can obviously achieve with Filebeat.
For the second, does anyone have any idea how to approach it? Could I still use Filebeat plus some pipeline on an ingest node, for example?
Clearly, I can use the ES API to update the said document, but I was looking for some solution that doesn't require changes to my application - rather, it is all possible to achieve using the log files.
Thanks in advance!
No, this is not something that Beats were intended to accomplish. Enrichment like you describe is one of the things that Logstash can help with.
Logstash has an Elasticsearch input that would allow you to retrieve data from ES and use it in the pipeline for enrichment. And the Elasticsearch output supports upsert operations (update if exists, insert new if not). Using both those features you can enrich and update documents as new data comes in.
You might want to consider ingesting the log lines as-is into Elasticsearch. Then, using Logstash, build a separate index that is entity-specific and driven by data from the logs.
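To show what the upsert mentioned above amounts to at the Elasticsearch API level, here is a minimal Python sketch that builds an Update API request with doc_as_upsert. It is not a Logstash configuration. The index name, document ID and field names are made up, and the request is only constructed, not sent.

```python
import json
from typing import Tuple


def upsert_request(index: str, doc_id: str, fields: dict) -> Tuple[str, str, str]:
    # Build (method, path, body) for a partial update.
    # "doc" is merged into the existing document; with doc_as_upsert
    # the same body is used to create the document if it is missing.
    body = {"doc": fields, "doc_as_upsert": True}
    return "POST", f"/{index}/_update/{doc_id}", json.dumps(body)


# First log line creates document X, a later line adds more fields to it.
print(upsert_request("entities", "X", {"started_at": "2021-08-09T10:00:00Z"}))
print(upsert_request("entities", "X", {"finished_at": "2021-08-09T10:05:00Z"}))
```

Both calls target the same document ID, so the second adds finished_at without overwriting started_at. That only works if each log line carries something from which the ID can be derived.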