Fastest way to join two indexes into a third - elasticsearch

I have the following two es indexes:
index1 = {
"id": 1,
"name": "fred",
"shared_id": 77
}
index2 = {
"id": 89,
"FacebookID": 9288347,
"shared_id": 77
}
I want to merge these two indexes into a third index:
index3 = {
"index1.id": 1,
"index2.id": 89,
"shared_id": 77,
"FacebookID": 9288347,
}
In other words, all objects with the same shared_id will be merged into a third object containing all existing attributes. What would be the most performant way to do this? My current idea is to download all the data from the two indexes and do the merge/upload with either Java or C++. Is there a better way to do this, perhaps something native to ES itself? I would estimate several million objects per index.
I've found this, which suggests the best solution is to search both indexes simultaneously or manually join them: http://elasticsearch-users.115913.n3.nabble.com/Merging-Two-Indexes-td4021708.html.
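For reference, here is a minimal sketch of that client-side merge using the official Python client (elasticsearch-py). Index and field names come from the example above; the host, the in-memory merge and the target index name are assumptions, and with several million documents per index you would want to batch the bulk calls:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed host

# Pull every document from index1 and key it by shared_id.
merged = {}
for hit in helpers.scan(es, index="index1", query={"query": {"match_all": {}}}):
    src = hit["_source"]
    merged[src["shared_id"]] = {"index1.id": src["id"],
                                "name": src["name"],
                                "shared_id": src["shared_id"]}

# Stream index2 and complete the merged records.
actions = []
for hit in helpers.scan(es, index="index2", query={"query": {"match_all": {}}}):
    src = hit["_source"]
    doc = merged.setdefault(src["shared_id"], {"shared_id": src["shared_id"]})
    doc.update({"index2.id": src["id"], "FacebookID": src["FacebookID"]})
    actions.append({"_index": "index3", "_id": src["shared_id"], "_source": doc})

# Bulk-index the joined documents into index3.
helpers.bulk(es, actions)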

Related

restructure elasticsearch index to allow filtering on sum of values

I have an index of products.
Each product has several variants (a few to hundreds; each has a color & size, e.g. Red).
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes e.g. AB, XY, CD etc.
If I had my choice, I'd index it as:
stock: {
Red: {
S: { AB: 100, XY: 200, CD: 20 },
M: { AB: 0, XY: 500, CD: 20 },
2XL: { AB: 5, XY: 0, CD: 9 }
},
Blue: {
...
}
}
Here's a kind of customer query I might receive:
Show me all products that have Red.S in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in Elasticsearch.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
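For completeness, a rough sketch of those two query shapes with a recent Python client, assuming the flattened stock mapping proposed above (host, index name and exact field paths are assumptions); the next paragraph explains why this mapping does not scale:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

# "Red.S in stock (> 100) at warehouses AB & XY"
es.search(index="products", query={
    "bool": {
        "filter": [
            {"range": {"stock.Red.S.AB": {"gt": 100}}},
            {"range": {"stock.Red.S.XY": {"gt": 100}}},
        ]
    }
})

# "sum of inventories at AB & XY should be > 500" via a script filter
es.search(index="products", query={
    "bool": {
        "filter": {
            "script": {
                "script": {
                    "source": "doc['stock.Red.S.AB'].value + doc['stock.Red.S.XY'].value > 500"
                }
            }
        }
    }
})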
The problem is, given 100 warehouses, 100 sizes and 25 colors, this easily needs 100*100*25 = 250k mappings. Elasticsearch simply can't handle that many fields.
The easy answer is to use nested documents, but nested documents pose a particular problem: we cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to solutions outside Elasticsearch as well. We're on a Rails/Postgres stack.
You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "XY",
"quantity": 200
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "CD",
"quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries, because all you'll need is to filter on the fields (product, color, size, warehouse) and simply aggregate on the quantity field, sums, averages or whatever you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
It's also much easier to maintain the stock movements by simply indexing the new quantity for any given combination than having to update the master product document every time an item gets out of the stock.
No script, no nested documents required.
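A rough sketch of what such a query could look like against this stock index, using the Python client: filter on color/size/warehouse, group by product, sum the quantities, and keep only the buckets above the threshold. The bucket_selector pipeline aggregation (a close relative of bucket_script) is used here for the actual threshold check; field names come from the example documents, and color/size/warehouse are assumed to be keyword fields:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

resp = es.search(
    index="stock",
    size=0,
    query={
        "bool": {
            "filter": [
                {"term": {"color": "Red"}},
                {"term": {"size": "S"}},
                {"terms": {"warehouse": ["AB", "XY"]}},
            ]
        }
    },
    aggs={
        "by_product": {
            "terms": {"field": "product"},
            "aggs": {
                "total_qty": {"sum": {"field": "quantity"}},
                "enough_stock": {
                    "bucket_selector": {
                        "buckets_path": {"total": "total_qty"},
                        "script": "params.total > 500",
                    }
                },
            },
        }
    },
)

# Only products whose Red/S quantity at AB + XY exceeds 500 remain:
for bucket in resp["aggregations"]["by_product"]["buckets"]:
    print(bucket["key"], bucket["total_qty"]["value"])
With many products you would page through the results with a composite aggregation rather than a plain terms aggregation, but the shape of the query stays the same.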
The best possible solution would be to create separate indexes for the warehouses, each holding one document per product/size/color/warehouse with the related values, like this:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
This will reduce your mappings to 100 * 25 = 2,500 mappings per index.
For the rest of the operations, I feel @Val has covered it in his answer, which is quite impressive.
Coming to external solutions, the task is to store data, search it and fetch it. Elasticsearch and Apache Solr are the best-known search engines for this kind of task. I have not tried Apache Solr, but I would highly recommend going with Elasticsearch because of its features, active community support and fast searching. Searching can also be made faster using analyzers and tokenizers. It also has features like full-text search and term-level search to tailor searching to the situation or problem statement.

Maps vs Lists in Elasticsearch for optimized query performance

I have some data I will be putting into Elasticsearch, and want to decide on a format that will optimize query performance. The query will be in words: "Is ID X in category Y?". I have a fixed number of categories (small, say, 5), and possibly a large number of IDs to put into each category (currently in the dozens, but of indeterminate size in the future). Each ID will be in at most one category (possibly none).
Format 1:
{
"field1": "value1",
...
"categories": {
"category1": ["id10", "id24", "id38",...],
...
"category5": ["id62", "id19", "id82" ...]
}
}
or
Format 2:
{
"field1": "value1",
...
"categories": {
"id1": "category4",
"id2": "category2",
"id3": "category1",
...
}
}
Which data format would be preferred? The latter format has linear lookup time, but possibly many keys.
I think format 1 is better. There will be more IDs in the future; with format 2 you may need to close the categories index or increase the number of index fields, while with format 1 it is also more convenient to determine the category of a single id (indexOf). There are pros and cons either way; maybe there's a better approach.
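For what it's worth, a minimal sketch of the "Is ID X in category Y?" lookup against format 1, assuming the ids are mapped as keyword (index name and client setup are also assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

# Is "id24" in category1? Arrays need no special mapping in Elasticsearch,
# so this is a plain term query against the category's id list.
resp = es.search(index="my-index", query={
    "term": {"categories.category1": "id24"}
})
found = resp["hits"]["total"]["value"] > 0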

Why does ES recommend to use single mapping per index and doesn't provide any "Join" functionality for this?

As you know, starting from version 6, the Elasticsearch team deprecated multiple types per index as well as parent-child relationships. Proof is here
They recommend using join queries instead of parent-child. But let's look at this join query here. They write:
The join datatype is a special field that creates parent/child relation within documents of the same index.
They tell us to use multiple indexes and restrict each index to a single _doc mapping, yet the join query is designed to work only within the bounds of the same index.
How are we supposed to live with that? How could I create parent-child relationships across separate indexes?
Example:
Index: "City"
{
"name": "Moscow",
"id": 1
}
Index: "Product"
{
"name": "Shirt",
"city": 1,
"id": 1
}
How could I get the "Shirt" above if I only know the city name "Moscow"?
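One common workaround is an application-side join: resolve the city name to its id with one query, then fetch the products that reference it with a second. A rough sketch with the Python client; the field names come from the example, the index names are lowercased (Elasticsearch requires lowercase index names) and the client setup is assumed:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

# 1) Resolve "Moscow" to its id in the city index.
city = es.search(index="city", query={"match": {"name": "Moscow"}})
city_id = city["hits"]["hits"][0]["_source"]["id"]

# 2) Fetch every product referencing that city id.
products = es.search(index="product", query={"term": {"city": city_id}})
for hit in products["hits"]["hits"]:
    print(hit["_source"]["name"])   # -> "Shirt"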

Elastic - Search across object without key specification

I have an index with hundreds of millions of docs, and each of them has a "histogram" object with values for each day:
"_source": {
"proxy": {
"histogram": {
"2017-11-20": 411,
"2017-11-21": 34,
"2017-11-22": 0,
"2017-11-23": 2,
"2017-11-24": 1,
"2017-11-25": 2692,
"2017-11-26": 11673
}
}
}
And I need one of two solutions:
Find docs where any value inside the histogram object is greater than XX
Find docs where the average of the values in the histogram object is greater than XX
For point 1 I can use a range query, but I must specify the exact field name (i.e. proxy.histogram.2017-11-20), and the wildcard version (proxy.histogram.*) does not work.
For point 2 I only found the avg aggregation in ES, but I don't want to aggregate these fields after the query (because of the amount of data); I only want to search for these docs.
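For reference, this is the per-field range query that does work for point 1, with the field name spelled out explicitly (index name and client setup are assumptions):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed host

# Works, but only for one explicitly named day at a time.
es.search(index="my-index", query={
    "range": {"proxy.histogram.2017-11-20": {"gt": 100}}
})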

Using MongoDB to store time series data of arbitrary intervals

I want to store time-series-like data. There are no set intervals for the data like normal time series data. Data points could be as often as every few seconds to as seldom as every few years, all in the same time series. I basically need to store the Date data type and a value, over and over.
I would like the ability to very quickly retrieve the most recent item in the series. I would also like the ability to quickly retrieve all the values within a range between two dates. Writing efficiency is nice but not as important.
My initial thought was to use documents with keys set to dates. Something like this:
{
"entry_last": 52,
"entry_history": {
datetime(2013, 1, 15): 94,
datetime(2014, 12, 23): 25,
datetime(2016, 10, 23, 5, 34, 00): 52
}
}
However, from my understanding, keys have to be strings.
So then I came up with this prototype:
{
"entry_last": 52,
"entry_history": [
[datetime(2013, 1, 15), 94],
[datetime(2014, 12, 23), 25],
[datetime(2016, 10, 23, 5, 34, 00), 52],
]
}
The idea here is to give myself very easy access to the last value with entry_last (the value of which is duplicated in the history), as well as to store each data entry in the most efficient way possible by only storing the date and value in entry_history.
What I'd like to know is whether or not my prototype is an efficient approach to storing my data. Specifically, I'd like to know if this will allow me to efficiently query the most recent value as well as values between two dates. If not, what is a better approach?
You don't have to manually specify the index, you can store only the datetime and use the index of the array.
The main issue I see with your solution is that you have to maintain entry_last manually; if that update ever fails, it no longer works unless you have a few failsafes. If you build another app with a different technology using the same db, you'll have to re-code the same logic. And I don't see how to query between two dates easily and efficiently here, unless you reorder the array every time you insert an element.
If I had to design this kind of data storage, I would create another collection to store the history (linked to your entries by _id) and index the date for fast queries. But it might depend on the quantity of your data.
/* entry */
{
_id: 1234,
"entryName": 'name'
}
/* history */
{
_id: 9876,
"_linkedEntryId": 1234,
"date": new Date(2013, 1, 15)
}
{
_id: 9877,
"_linkedEntryId": 1234,
"date": new Date(2014, 12, 23)
}
{
_id: 9878,
"_linkedEntryId": 1234,
"date": new Date(2016, 10, 23, 5, 34, 00)
}
To give an idea of the performance, I have MongoDB running on my ultrabook (far from a dedicated server's performance) and I can get the most recent document linked to a specific identifier in 5-10 ms. Same speed to get all documents between two dates. I'm querying a modest collection of one million documents; it's not random data, and the average object size is 2050 B.
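A minimal sketch of that layout with pymongo; the collection and field names follow the example above, while the value field, database name and connection details are assumptions:
from datetime import datetime
from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient()["mydb"]  # assumed database name

# Index so both "most recent" and date-range queries stay cheap.
db.history.create_index([("_linkedEntryId", ASCENDING), ("date", DESCENDING)])

db.history.insert_one({
    "_linkedEntryId": 1234,
    "date": datetime(2016, 10, 23, 5, 34, 0),
    "value": 52,  # assumed: the example docs above omit the stored value
})

# Most recent entry for a given entry id:
latest = db.history.find_one({"_linkedEntryId": 1234}, sort=[("date", DESCENDING)])

# All entries between two dates:
in_range = db.history.find({
    "_linkedEntryId": 1234,
    "date": {"$gte": datetime(2013, 1, 1), "$lt": datetime(2015, 1, 1)},
})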
