How to store reusable data in Elasticsearch

I don't want to call the API every time I need certain data (for example an array of 1000 rows), so I would like to store that array in Elasticsearch and read it from there instead. I'm using FOSElasticaBundle. Is this possible, and if so, how?
What I would like to do:
- I have a function that gets this data from the database.
- I would like to save this data in ES after calling php bin/console fos:elastica:populate.
- Use this array in a controller to return it to the view and use it there.

I would suggest defining a type with a mapping that covers a single row of your database. Once you have fetched the 1000 rows from the database, you can index them as 1000 documents in a single bulk index call: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
You can then fetch these 1000 documents for use in the controller.
Alternatively, you can define a mapping with a nested property whose structure is identical to a row in your database. With this, you can create a single document that holds all 1000 rows as an array inside the nested property, and then fetch that single document.
Which strategy is better depends on your requirements. The second makes indexing heavier, while the first makes fetching relatively heavier. In my experience with Elasticsearch, it is better to keep indexing requests light to ensure data consistency. Depending on your data, you can create the 1000 documents with IDs that follow a known pattern; with the IDs known in advance, fetching these documents becomes very efficient.
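As a rough illustration of the first strategy, the sketch below builds the request bodies only and sends nothing. The index name cached_rows and the row-N ID pattern are assumptions made up for this example; the newline-delimited layout is the one the bulk API expects, and the second body is meant for GET /<index>/_mget.

```python
import json

def build_bulk_body(rows, index="cached_rows", id_prefix="row-"):
    """Return an NDJSON body for POST /_bulk, one document per row.

    The index name and ID prefix are hypothetical placeholders.
    """
    lines = []
    for position, row in enumerate(rows, start=1):
        # Action line: which index and _id this document is written to.
        lines.append(json.dumps({"index": {"_index": index, "_id": f"{id_prefix}{position}"}}))
        # Source line: the row itself.
        lines.append(json.dumps(row))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

def build_mget_body(count, id_prefix="row-"):
    """Return a body for GET /<index>/_mget that fetches rows 1..count by ID."""
    return {"ids": [f"{id_prefix}{n}" for n in range(1, count + 1)]}
```

Because the IDs follow a known pattern, the controller can request all rows by ID without running a search.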

Related

Filter result in memory to search in elasticsearch from multiple indexes

I have 2 indexes and they both have one common field (basically a relationship).
Since Elasticsearch does not support filtering across multiple indexes, should we store the results in memory in a variable and filter them in Node.js (which basically means that my application itself would be working as a database server)?
We were previously using MongoDB, which is also a NoSQL database, and we managed this through aggregate queries, but Elasticsearch does not seem to provide that.
So even if we use both databases together, we have to store their results somewhere in order to filter them further, because we give users advanced search functionality that lets them filter data from multiple collections.
So should we store results in memory to filter the data further? We currently offer advanced search over 100 million records to customers, but without the advanced text search that Elasticsearch provides. We are now planning to offer Elasticsearch text search to customers.
What approach do you suggest for using MongoDB and Elasticsearch together? We are using Node.js to serve data.
Or which of these options should we choose?
- Denormalizing: Flatten your data
- Application-side joins: Run multiple queries on normalized data
- Nested objects: Store arrays of objects
- Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/

Storing things client side in memory is not the solution.
The simplest way to solve this problem is to make one combined index, which is trivial to do. Just insert all the documents from index 2 into index 1, and add a prefix such as "idx2" to every field coming from index 2 so that you don't overwrite any fields with the same name. You can use an ingest pipeline to do this, or just do it client side. You will only ever do this once.
After that you can perform aggregations on the single index, since you have all the data in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation so that everything that previously went into index 2 goes into index 1 as well (with the prefixed field names).
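The field renaming described above could look roughly like the following when done client side. This is a sketch only: the idx2_ prefix spelling and the shared parameter (for leaving the common relationship field unprefixed) are assumptions added for illustration, not something the answer prescribes.

```python
def prefix_fields(doc, prefix="idx2_", shared=()):
    """Return a copy of doc with each field renamed to prefix + name.

    Fields listed in `shared` (for example the common relationship field)
    keep their original name. Both parameters are hypothetical choices.
    """
    return {
        (name if name in shared else prefix + name): value
        for name, value in doc.items()
    }
```

Each document read from index 2 would be passed through this function before being indexed into index 1.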
100 million records is trivial for something like Elasticsearch. Doing any kind of "joins" client side is NOT RECOMMENDED, as this negates the entire value of using ES.
If you need any further help on executing this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/PostgreSQL or even MongoDB is to restructure the indices to suit the needs of data querying. Never try to work with multiple indices; ES is not built for that.
HTH.

What can be an alternative to Elasticsearch + DynamoDB being used in combination?

I am new to DynamoDB and I am looking for suggestions and recommendations. We have a use case with a paginated API where we have to search for multiple values of an indexed attribute. Since DynamoDB allows only one value of an indexed attribute to be searched for in a single query, this would need a batch call, and a batch call would complicate the pagination. So currently the required IDs are fetched from Elasticsearch for those multiple values (in a paginated way), and the complete documents are then fetched from DynamoDB using the IDs obtained from Elasticsearch. Is this the correct approach, or is there a better alternative?

Prevent data duplication over multiple indices in Elasticsearch

Data duplication prevention is handled at the index level with the field "_id".
However, to avoid having huge indices, I work with several small indices linked under an alias. Is there a mechanism to check for existing _ids at the alias level (across multiple indices) when a document is inserted, or should this be handled at the application level?

Not natively, no. You'd need to handle this in your own code.

Before inserting your document, you need to first find out which real index contains your document via the alias using
GET alias/_search?q=_id:123456&filter_path=hits.hits._index
In the response you'll get the concrete index name that you can then use to index/update your new document version.
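On the application side, reading the concrete index out of that response might be sketched as below. The function name and the default_index fallback (the index to write to when the ID does not exist yet) are hypothetical, and the sketch assumes the response has already been parsed from JSON into a dict.

```python
def resolve_write_index(search_response, default_index):
    """Pick the concrete index that already holds the document, if any.

    `search_response` is the parsed result of
    GET alias/_search?q=_id:...&filter_path=hits.hits._index
    With filter_path, a search with no hits can come back as an empty
    object, so missing keys are treated as "not found".
    """
    hits = search_response.get("hits", {}).get("hits", [])
    if hits:
        return hits[0]["_index"]
    # No existing copy: fall back to the caller's chosen write index.
    return default_index
```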

ElasticSearch: querying most recent snapshot design

I'm trying to decide how to structure the data in ElasticSearch.
I have a system that produces metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying and sorting. I only care about the most recent data that's in there. The system producing the data could also be late.
Currently I can think of two options:
- I can have one index with a date field that contains the date the metric was created. I am unsure, however, how to write the query so that if multiple days' worth of data are in the index, I filter it down to just the most recent set.
- I could split the data into different indexes (recent and past) and have some process that migrates data from the recent index to the past index. I think the challenge here would be downtime while the data is being moved and/or added into the recent index.
Thoughts?

A common approach to solving this problem with Elasticsearch is to store the data once in a form that allows historic querying, and again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
Then it can be indexed into a current values index using a composite key constructed from the document (obviously, for this to work you need to be able to construct a composite key from your document). For example, the identity of this document might be the type and name concatenated. You then use the update API as an upsert, so that every update is written to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
"doc": {
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
},
"doc_as_upsert": true
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
To store your historic data, you change your primary key strategy. It would probably be most straightforward to use the index API and let Elasticsearch generate a primary key for you.
POST all_metrics/_doc/
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an elastic range query, such as a field like createdDate with a value that looks like a date time, then you should be able to query historic data.
The main thing is, don't worry about duplicating your data for different purposes, elastic does a good job of compressing this stuff on disk and in memory. Storing data multiple times is called denormalization and is a pretty common technique in data warehousing and big data.
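The dual-write pattern from this answer can be sketched as a pair of request builders. Nothing is sent here; the functions only return the method, path and body. The helper names are hypothetical, and the createdDate field follows the answer's suggestion of a date-like field for range queries.

```python
from datetime import datetime, timezone

def composite_id(metric):
    """Build the document ID from type and name, as in the answer."""
    return f"{metric['type']}-{metric['name']}"

def build_writes(metric, created=None):
    """Return (current, history) requests for one metric update."""
    created = created or datetime.now(timezone.utc)
    current = {
        "method": "POST",
        "path": f"/current_metrics/_update/{composite_id(metric)}",
        # doc_as_upsert creates the document if it does not exist yet.
        "body": {"doc": metric, "doc_as_upsert": True},
    }
    history = {
        "method": "POST",
        # No ID in the path, so a new document is created every time.
        "path": "/all_metrics/_doc/",
        "body": {**metric, "createdDate": created.isoformat()},
    }
    return current, history

def build_range_query(start, end):
    """Return a search body selecting history documents in [start, end)."""
    return {
        "query": {
            "range": {
                "createdDate": {"gte": start.isoformat(), "lt": end.isoformat()}
            }
        }
    }
```

A late-arriving update still lands in both indexes; whether it should overwrite a newer value in current_metrics is a decision this sketch does not make.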

Elasticsearch index alias

I am trying to use Elasticsearch to filter millions of documents. All the data is in one index and I want to access it in a 'direct' way.
What do I mean by a direct way?
A direct way means, for example, accessing the 700000th element of this index (not by ID). Is this possible somehow?
What I have tried already:
- from + size works, but does not seem to be fast once the number of elements exceeds 10000.
- Scrolling I haven't tried, but it doesn't seem to be the right thing for my use case.
So, any other ideas?

Scrolling will not work. That will fetch all the data.
I don't think Elasticsearch is the right tool for what you want to do.
It would be better to keep a linked list of the IDs, which lets you look up the ID by position, and then query Elasticsearch to get the data.
If your data never gets modified or deleted, you can add an extra field to the mapping that acts like an auto-increment field in a database. You can then fetch the data using that field.
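A minimal sketch of the auto-increment idea, assuming documents are numbered once at indexing time and never deleted. The field name position and the zero-based numbering are assumptions chosen for this example.

```python
def with_positions(docs, field="position", start=0):
    """Return copies of docs with a sequential counter field added."""
    return [{**doc, field: n} for n, doc in enumerate(docs, start=start)]

def nth_element_query(n, field="position"):
    """Return a search body that fetches the document at position n."""
    # A term query on a numeric field is an exact match, so no deep
    # paging with from + size is needed.
    return {"size": 1, "query": {"term": {field: n}}}
```

With this field in place, fetching element 700000 is a single exact-match search instead of skipping over 700000 hits.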
