Using MongoDB to store time series data of arbitrary intervals - performance

I want to store time-series-like data. Unlike normal time series data, there are no set intervals: data points could come as often as every few seconds or as seldom as every few years, all in the same series. I basically need to store the Date data type and a value, over and over.
I would like the ability to very quickly retrieve the most recent item in the series. I would also like the ability to quickly retrieve all the values within a range between two dates. Writing efficiency is nice but not as important.
My initial thought was to use documents with keys set to dates. Something like this:
{
    "entry_last": 52,
    "entry_history": {
        datetime(2013, 1, 15): 94,
        datetime(2014, 12, 23): 25,
        datetime(2016, 10, 23, 5, 34, 00): 52
    }
}
However, from my understanding, keys have to be strings.
So then I came up with this prototype:
{
    "entry_last": 52,
    "entry_history": [
        [datetime(2013, 1, 15), 94],
        [datetime(2014, 12, 23), 25],
        [datetime(2016, 10, 23, 5, 34, 00), 52]
    ]
}
The idea here is to give myself very easy access to the last value with entry_last (the value of which is duplicated in the history), as well as to store each data entry in the most efficient way possible by only storing the date and value in entry_history.
What I'd like to know is whether or not my prototype is an efficient approach to storing my data. Specifically, I'd like to know if this will allow me to efficiently query the most recent value as well as values between two dates. If not, what is a better approach?

You don't have to manually specify the index; you can store only the datetime and use the index of the array.
The main issue I see with your solution is that you have to manually maintain entry_last: if an update ever fails, it no longer holds the right value, unless you have a few failsafes. If you build another app with a different technology using the same DB, you'll have to recode the same logic. And I don't see how to query between two dates easily and efficiently here, unless you reorder the array every time you insert an element.
If I had to design this kind of data storage, I would create another collection to store the history (linked to your entries by _id) and index the date field for fast queries. But it might depend on the quantity of your data.
/* entry */
{
    "_id": 1234,
    "entryName": "name"
}

/* history */
{
    "_id": 9876,
    "_linkedEntryId": 1234,
    "date": new Date(2013, 1, 15)
}
{
    "_id": 9877,
    "_linkedEntryId": 1234,
    "date": new Date(2014, 12, 23)
}
{
    "_id": 9878,
    "_linkedEntryId": 1234,
    "date": new Date(2016, 10, 23, 5, 34, 00)
}
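To make the two retrievals concrete, here is a minimal sketch of the index and queries I have in mind; the collection name history and the compound index are my own choices:
db.history.createIndex({ "_linkedEntryId": 1, "date": 1 })

// most recent data point for entry 1234
db.history.find({ "_linkedEntryId": 1234 }).sort({ "date": -1 }).limit(1)

// all data points for entry 1234 between two dates
db.history.find({
    "_linkedEntryId": 1234,
    "date": { $gte: new Date(2014, 0, 1), $lt: new Date(2016, 0, 1) }
})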
To give an idea of the performance: I have MongoDB running on my ultrabook (far from a dedicated server's performance) and I can get the most recent document linked to a specific identifier in 5-10 ms, and the same to get all documents between two dates, querying a modest collection of one million documents. It's not random data; the average object size is 2050 bytes.

Related

Storing data in ElasticSearch

I'm looking at two ways of storing data in Elasticsearch.
[
    {
        'first': 'dave',
        'last': 'jones',
        'age': 43,
        'height': '6ft'
    },
    {
        'first': 'james',
        'last': 'smith',
        'age': 43,
        'height': '6ft'
    },
    {
        'first': 'bill',
        'last': 'baker',
        'age': 43,
        'height': '6ft'
    }
]
or
[
    {
        'first': ['dave', 'james', 'bill'],
        'last': ['jones', 'smith', 'baker'],
        'age': 43,
        'height': '6ft'
    }
]
(names are 30+ character hashes; nesting would not exceed the above)
My goals are:
Query speed
Disk space
We are talking about the difference between 300 GB and a terabyte.
My question is: can Elasticsearch search nested data just as quickly as flattened-out data?
Elasticsearch will flatten your arrays of objects by default, exactly like you demonstrated in your example:
Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values.
So from the point of view of querying nothing will change. (However, if you need to query individual items of the inner arrays, like to query for dave jones, you may want to explicitly index it as nested data type, which will have poorer performance.)
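For illustration, here is a minimal sketch of what such an explicit nested mapping and query could look like; the index name people, the field name names, and the typeless syntax of a recent Elasticsearch are assumptions of mine:
PUT /people
{
    "mappings": {
        "properties": {
            "names": {
                "type": "nested",
                "properties": {
                    "first": { "type": "keyword" },
                    "last": { "type": "keyword" }
                }
            }
        }
    }
}

POST /people/_search
{
    "query": {
        "nested": {
            "path": "names",
            "query": {
                "bool": {
                    "must": [
                        { "term": { "names.first": "dave" } },
                        { "term": { "names.last": "jones" } }
                    ]
                }
            }
        }
    }
}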
Speaking of size on disk, compression is enabled by default. Here you should keep in mind that Elasticsearch stores your original documents in two ways simultaneously: the original JSON as the _source, and implicitly in the inverted indexes (which are what actually make searching so fast).
If you want to read more about tuning for disk usage, here's a good doc page. For instance, you could enable even more aggressive compression for the source, or not store source on disk at all (although not advised).
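For instance, a sketch of the more aggressive compression option, the best_compression codec; this is a static setting chosen at index creation time, and the index name here is just a placeholder:
PUT /people_compressed
{
    "settings": {
        "index.codec": "best_compression"
    }
}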
Hope that helps!

Elastic - Search across object without key specification

I have an index with hundreds of millions of docs, and each of them has an object "histogram" with values for each day:
"_source": {
"proxy": {
"histogram": {
"2017-11-20": 411,
"2017-11-21": 34,
"2017-11-22": 0,
"2017-11-23": 2,
"2017-11-24": 1,
"2017-11-25": 2692,
"2017-11-26": 11673
}
}
}
And I need one of two solutions:
Find docs where any value inside the histogram object is greater than XX
Find docs where the average of the values in the histogram object is greater than XX
For point 1 I can use a range query, but I must specify the exact field name (e.g. proxy.histogram.2017-11-20), and the wildcard version (proxy.histogram.*) does not work.
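For reference, a sketch of that per-field range query; the index name and the threshold of 100 are placeholders:
POST /myindex/_search
{
    "query": {
        "range": {
            "proxy.histogram.2017-11-20": { "gt": 100 }
        }
    }
}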
For point 2 I only found the average aggregation in ES, but I don't want to aggregate these fields after the query (because of the amount of data); I only want to search for such docs.

Fastest way to join two indexes into a third

I have the following two es indexes:
index1 = {
    "id": 1,
    "name": "fred",
    "shared_id": 77
}
index2 = {
    "id": 89,
    "FacebookID": 9288347,
    "shared_id": 77
}
I want to merge these two indexes into a third index:
index3 = {
    "index1.id": 1,
    "index2.id": 89,
    "shared_id": 77,
    "FacebookID": 9288347
}
In other words, all objects with a shared_id will be merged into a third object with all existing attributes. What would be the most performant way to do this? My current idea is to download all the data from the two indexes and use either Java or C++ to do the merge/upload. Is there a better way to do this, perhaps something native to ES itself? I would estimate several million objects per index.
I've found this, which suggests the best solution is to search both indexes simultaneously or manually join them: http://elasticsearch-users.115913.n3.nabble.com/Merging-Two-Indexes-td4021708.html.
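For what it's worth, here is a sketch of what the upload step could look like once the documents have been merged externally, using the bulk API; choosing shared_id as the document _id and the typeless /_bulk endpoint of a recent Elasticsearch are both assumptions of mine:
POST /index3/_bulk
{ "index": { "_id": 77 } }
{ "index1.id": 1, "index2.id": 89, "shared_id": 77, "FacebookID": 9288347 }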

mongodb - Recommended tree structure for large amount of data points

I'm working on a project which records price history for items across multiple territories, and I'm planning on storing the data in a mongodb collection.
As I'm relatively new to mongodb, I'm curious about what might be a recommended document structure for quite a large amount of data. Here's the situation:
I'm recording the price history for about 90,000 items across 200 or so territories. I'm looking to record the price of each item every hour, and give a 2 week history for any given item. That comes out to around (90000*200*24*14) ~= 6 billion data points, or approximately 67200 per item. A cleanup query will be run once a day to remove records older than 14 days (more specifically, archive it to a gzipped json/text file).
In terms of data that I will be getting out of this, I'm mainly interested in two things: 1) The price history for a specific item in a specific territory, and 2) the price history for a specific item across ALL territories.
Before I actually start importing this data and running benchmarks, I'm hoping someone might be able to give some advice on how I should structure this to allow for quick access to the data through a query.
I'm considering the following structure:
{
    _id: 1234,
    data: [
        {
            territory: "A",
            price: 5678,
            time: 123456789
        },
        {
            territory: "B",
            price: 9876,
            time: 123456789
        }
    ]
}
Each item is its own document, with each territory/price point for that item stored in the data array. The issue I run into with this is retrieving the price history for a particular item. I believe I can accomplish this with the following query:
db.collection.aggregate([
    { $unwind: "$data" },
    { $match: { _id: 1234, "data.territory": "B" } }
])
The other alternative I was considering was to just put every single data point in its own document and put an index on the item and territory.
// Document 1
{
    item: 1234,
    territory: "A",
    price: 5679,
    time: 123456789
}
// Document 2
{
    item: 1234,
    territory: "B",
    price: 9676,
    time: 123456789
}
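For concreteness, a sketch of that index and one of the lookups; the collection name prices and the extra time key in the index are assumptions of mine:
db.prices.createIndex({ item: 1, territory: 1, time: 1 })

// price history for item 1234 in territory "A", oldest first
db.prices.find({ item: 1234, territory: "A" }).sort({ time: 1 })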
I'm just unsure whether having 6 billion documents with 3 indexes, or having 90,000 documents with 67,200 array objects each and using an aggregation, would be better for performance.
Or perhaps there's some other tree structure or handling of this problem that you fine folks and MongoDB wizards can recommend?
I would structure the documents as "prices for a product in a given territory per fixed time interval". The time interval is fixed for the schema as a whole, but different choices result in different schemas, and the best one for your application will probably need to be decided by testing. Choosing the time interval to be 1 hour gives your second schema idea, with ~6 billion documents total. You could choose the time interval to be 2 weeks (don't). In my mind, the best time interval to choose is 1 day, so the documents would look like this:
{
    "_id" : ObjectId(...), // could also use a combination of prod_id, terr_id, and time so you get a free unique index to look up by those 3 values
    "prod_id" : "DEADBEEF",
    "terr_id" : "FEEDBEAD",
    "time" : ISODate("2014-10-22T00:00:00.000Z"), // start of the day this document contains the data for
    "data" : [
        {
            "price" : 1234321,
            "time" : ISODate("2014-10-22T15:00:00.000Z") // start of the hour this data point is for
        },
        ...
    ]
}
I like the time interval of 1 day because it hits a nice balance between the number of documents (mostly relevant because of index sizes), the size of documents (16 MB limit, they have to be piped over the network), and the ease of retiring old docs (hold 15 days, wipe + archive everything from the 15th day at some point each day). If you put an index on { "prod_id" : 1, "terr_id" : 1 }, that should let you fulfill your two main queries efficiently. You can gain an additional bonus performance boost by preallocating the doc for each day so that updates are in-place.
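A minimal sketch of that index and the two main queries against this schema; the collection name prices is a placeholder of mine, and I've extended the index with time (an assumption) so date filtering can use it:
db.prices.createIndex({ "prod_id": 1, "terr_id": 1, "time": 1 })

// 1) price history for a specific product in a specific territory (dates are illustrative)
db.prices.find({
    "prod_id": "DEADBEEF",
    "terr_id": "FEEDBEAD",
    "time": { $gte: ISODate("2014-10-09T00:00:00Z") }
})

// 2) price history for a specific product across ALL territories
db.prices.find({
    "prod_id": "DEADBEEF",
    "time": { $gte: ISODate("2014-10-09T00:00:00Z") }
})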
There's a great blog post about managing time series data like this, based on experience building the MMS monitoring system. I've essentially lifted my ideas from there.

ElasticSearch custom score script does not preserve array ordering

I am using ElasticSearch with a function_score property to retrieve documents sorted by createdOn. The createdOn field is stored as an Array representing date values, i.e.
"createdOn": [ 2014, 4, 24, 22, 11, 47 ]
Where createdOn[0] is the year, createdOn[1] is the month, createdOn[2] is the day, etc. I am testing the following query, which should return documents scored by year. However, the doc['createdOn'] array does not preserve the ordering of the elements: in this query, doc['createdOn'].values[0] returns 4, not 2014.
POST /example/1
{
    "name": "apple",
    "createdOn": [2014, 8, 22, 5, 12, 32]
}
POST /example/2
{
    "name": "apple",
    "createdOn": [2011, 8, 22, 5, 12, 32]
}
POST /example/3
{
    "name": "apple",
    "createdOn": [2013, 8, 22, 5, 12, 32]
}
POST /example/_search
{
    "query": {
        "function_score": {
            "boost_mode": "replace",
            "query": {
                "match_all": {}
            },
            "script_score": {
                "script": "doc['createdOn'].values[0]"
            }
        }
    }
}
It appears that this is due to the way ElasticSearch caches data: http://elasticsearch-users.115913.n3.nabble.com/Accessing-array-field-within-Native-Plugin-td4042848.html:
The docFieldDoubles method gets its values from the in-memory structures of the field data cache. This is done for performance. The field data cache is not loaded from the source of the document (because this would be slow) but from the Lucene index, where the values are sorted (for lookup speed). The get API does work based on the original document source, which is why you see those values in order (note: ES doesn't parse the source for the get API, it just gives you back what you've put in it).
You can access the original document (which will be parsed) using the SourceLookup (available from the source method), but it will be slow as it needs to go to disk for every document.
I'm not sure about the exact semantics of what you are trying to achieve, but did you try looking at nested objects? Those allow you to store a list of objects in a way that keeps values together, like [{"key": "k1", "value": "v1"}, ...].
The only apparent solution, other than using the source method (which is slow), is to use nested queries. Any ideas on how I could rewrite my query using nested queries? It seems like the only efficient way to sort this query by year.
