What is the best way to enrich near real-time data in ElasticSearch with batch data that may come in later? - elasticsearch

I have two types of indices in my Elasticsearch cluster. The first contains data that is updated in near real time. The second contains data, updated nightly, that I can use to enhance the first. I am new to Elasticsearch and I'm wondering if there are any good patterns that would easily allow me to update the streaming data with the nightly batches.
I've looked at the enrich processor, but that appears to enrich only at index time. The enrichment data I need might already be there, or it might only show up that night.
My goal is to create a dashboard that uses the enrichment index to help identify which documents in the streaming data I care about, and eventually add more fields for detailed exploration from there. In SQL terms: "count the number of documents where the ID of the stream document exists in the enrichment data". But that is pretty much a JOIN, which I believe I should be avoiding given the large size of both indices.

Enrich processors can be run at index time, but also after documents have already been indexed, by using the _update_by_query endpoint with an ingest pipeline.
The idea is this: you index your streaming data in real time. Once your second data set comes in, you can create a new index to store it, then build an enrich index out of it (by creating and executing an enrich policy), and finally update your first data set with the enrich processor.
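For reference, here is a minimal sketch of that flow in Dev Tools syntax. All of the index, policy, pipeline and field names (nightly-data, streaming-data, stream_id, and so on) are placeholders to adapt to your own data.

# 1. Create an enrich policy over the nightly batch index.
PUT _enrich/policy/nightly-enrich-policy
{
  "match": {
    "indices": "nightly-data",
    "match_field": "stream_id",
    "enrich_fields": ["category", "owner"]
  }
}

# 2. Execute the policy to build the internal enrich index.
PUT _enrich/policy/nightly-enrich-policy/_execute

# 3. Wrap an enrich processor in an ingest pipeline.
PUT _ingest/pipeline/nightly-enrich-pipeline
{
  "processors": [
    {
      "enrich": {
        "policy_name": "nightly-enrich-policy",
        "field": "stream_id",
        "target_field": "enrichment"
      }
    }
  ]
}

# 4. Re-run the pipeline over the already-indexed streaming documents,
#    touching only those that have not been enriched yet.
POST streaming-data/_update_by_query?pipeline=nightly-enrich-pipeline
{
  "query": {
    "bool": {
      "must_not": { "exists": { "field": "enrichment" } }
    }
  }
}

Each night, after the batch data is reloaded, you execute the policy again (which rebuilds the enrich index from the updated source) and re-run the _update_by_query; documents that already carry the enrichment field are skipped by the query above.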

Related

Is it more efficient to query multiple ElasticSearch indices at once or one big index

I have an ElasticSearch cluster and my system handles events coming from an API.
Each event is a document stored in an index and I create a new index per source (the company calling the API). Sources come and go, so I have new sources every week and most sources become inactive after a few weeks. Each source sends between 100k and 10M new events every day.
Right now my indices are named api-events-sourcename
The documents contain a datetime field and most of my queries look like "fetch the data for that source between those dates".
I frequently use Kibana and I have configured a filter that matches all my indices (api-events-*) at once, and I then add terms to filter a specific source and specific days.
My requests can be slow at times and they tend to slow down the ingestion of new data.
Given that workflow, should I see any performance benefit from creating an index per source and per day, instead of the index per source only that I use today?
Are there other easy tricks to avoid putting too much strain on the cluster?
Thanks!

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that is producing metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying/sorting. I also only care about the most recent data that's in there. The system producing the data could also deliver it late.
Currently I can think of two options:
I can have one index with a date field that contains the date the metric was created. I am unsure, however, of how to write the query so that if multiple days' worth of data are in the index, I filter it down to just the most recent set.
I could also try to split the data up into different indexes (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be having downtime where the data is being moved and/or added into the recent index.
Thoughts?
A common approach to solving this problem with Elasticsearch would be to store the data in a form that allows historic querying, and then again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
Then it can be indexed into a current values index using a composite key constructed from the document (obviously, for this to work you'd need to be able to construct a composite key from your document!). For example, your identity for this document might be the type and name concatenated. You then leverage the update API with doc_as_upsert to write your updates to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
  "doc": {
    "type": "OperationsPerSecond",
    "name": "Questions",
    "value": 10
  },
  "doc_as_upsert": true
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
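To read the current value back out, a simple search against that index is enough. A minimal sketch, assuming the default dynamic mapping:

GET current_metrics/_search
{
  "query": {
    "match": { "name": "Questions" }
  }
}

Because there is only one document per composite key, each matching metric comes back with just its latest value.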
To store your historic data, you change your primary key strategy; it would probably be most straightforward to use the index API and let Elasticsearch generate a document ID for you.
POST all_metrics/_doc/
{
  "type": "OperationsPerSecond",
  "name": "Questions",
  "value": 10
}
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an elastic range query, such as a field like createdDate with a value that looks like a date time, then you should be able to query historic data.
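For example, a sketch of such a range query (assuming each document also carries that createdDate field, which the indexing call above would need to include):

GET all_metrics/_search
{
  "query": {
    "bool": {
      "filter": [
        { "match": { "name": "Questions" } },
        { "range": { "createdDate": { "gte": "now-7d/d" } } }
      ]
    }
  }
}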
The main thing is: don't worry about duplicating your data for different purposes; Elasticsearch does a good job of compressing this stuff on disk and in memory. Storing data multiple times like this is called denormalization, and it is a pretty common technique in data warehousing and big data.

Update dataset with ElasticSearch Aggregation result

I'd like to automate a feature creation process for a large dataset with Elasticsearch.
I'd like to know if it is possible to create a new field in my dataset that will be the result of an aggregation.
I'm currently working on logs from a network and want to implement a moving average (the mean of a field over the past x days) of the field "bytes_in".
After spending time reading the doc and example, I wasn't able to do so ...
You have two possibilities:
By using the Rollup API you can create a job that will allow you to summarize data on the go and store it in a dedicated index.
A detailed example can be found in this blog article.
By using the Data Frame Transform API, you can pivot your data into a new entity-centric index, aggregate your data in various ways and store the results in a dedicated index.
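A rough sketch of the transform route follows (older releases expose it as _data_frame/transforms, newer ones as _transform; the index names and the host.keyword and @timestamp fields are assumptions about your log mapping):

PUT _transform/bytes-in-daily-avg
{
  "source": { "index": "network-logs-*" },
  "dest": { "index": "bytes-in-daily-avg" },
  "pivot": {
    "group_by": {
      "host": { "terms": { "field": "host.keyword" } },
      "day": { "date_histogram": { "field": "@timestamp", "calendar_interval": "1d" } }
    },
    "aggregations": {
      "avg_bytes_in": { "avg": { "field": "bytes_in" } }
    }
  }
}

POST _transform/bytes-in-daily-avg/_start

Note that this stores one average per host per day rather than a sliding window; the moving average over the past x days would then be computed at query time (or in a second pass) over this much smaller daily index.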

How does ElasticSearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk add items. I currently have a series of flat files with 220 million entries. I am working on Logstash to parse them and add them to Elasticsearch, but I feel that having it all under 1 index would be rough to query. Each row is nothing more than 1-3 properties at most.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it is explaining what to do, but not necessarily all the time explaining why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. If needed, change the index mappings or the queries you're using to get faster results, and add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings. It stops analysis on the field, preventing you from doing full-text searches inside the field and only allowing exact string matches. Useful for fields like a "tags" field or a "status" field with something like ["draft", "review", "published"] values.
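For example, a minimal mapping sketch (the field names are just placeholders, and on versions before 7.0 the mapping would also need a type name):

PUT index-2019-08-13
{
  "mappings": {
    "properties": {
      "status":     { "type": "keyword" },
      "tags":       { "type": "keyword" },
      "message":    { "type": "text" },
      "@timestamp": { "type": "date" }
    }
  }
}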
Good luck!

Solr performance with multiple fields

I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached to it. Each of the metadata fields individually is small (up to 64 characters). Common queries would involve a search term along with multiple metadata fields used to filter the data. So my question is which would provide better performance with respect to search response time (indexing time is not a concern):
a. Index the text data and push all metadata fields into Solr as stored fields, then query Solr for all the fields using a single query. (Effectively Solr does the metadata filtering as well as the search.)
b. Store the metadata fields in a DB like MySQL. Use Solr only for full text, and then use the document IDs returned from Solr as input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine, it's much more. Its filter queries are at least as good/fast as a MySQL select.
b) is just silly. Fetch many IDs from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that list of IDs, then fetch the documents from MySQL or Solr (if you choose to store data in it, not just indexes). I can't imagine a case where this would be faster.
Why complicate things? Especially since indexing time and HD space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
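To illustrate option a): a single Solr request can combine the full-text query with filter queries on the metadata, and Solr caches each fq independently. This is only a sketch; the core name and field names are made up.

http://localhost:8983/solr/documents/select
    ?q=body:"full text search terms"
    &fq=department:engineering
    &fq=status:published
    &fl=id,title,score
    &rows=20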
An exception would be if you had a large amount of text to store (and retrieve) in each document. In that case it would be faster to fetch it from the RDB after you get your search results back. Anyway, no one can tell for sure which one would be faster in your case, so I suggest you test the performance of both approaches (using JMeter, for example).
Also, since you don't care about index time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors
