Rethinkdb multiple level grouping - rethinkdb

Let's say I have a table with documents like:
{
"country": 1,
"merchant": 2
"product": 123,
...
}
Is it possible to group all the documents into a final json structure like:
[
{
<country_id>: {
<merchant_id>: {
<product_id>: <# docs with this product id/merchant_id/country_id>,
... (other product_id and so on)
},
... (other merchant_id_id and so on)
},
... (other country_id and so on)
]
And if yes, what would be the best and most efficient way?
I have more than a million of these documents, on 4 shards with powerful servers (22 Gb cache each)
I have tried this (in the data explorer, in JS, for the moment):
r.db('foo')
.table('bar')
.indexCreate('test1', function(d){
return [d('country'), d('merchant'), d('product')]
})
and then
r.db('foo')
.table('bar')
.group({index: 'test1'})
But the data explorer seems to hang, still working on it as you can see...

.group({index: 'test1'}).count() will do something pretty similar to what you want, except it won't produce the nested document structure. To produce the nested document structure it would probably be easiest to ungroup, then map over the ungrouped values to produce objects of the form you want, then merge all of them.
The problem with group queries on the whole table though is that they won't stream, you'll need to traverse the whole table to get the end result back. The data explorer is meant for small queries, and I think it times out if your query takes more than 5 minutes to return, so if you're traversing a giant table then it would probably be better to run that query from one of the clients.

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns to raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key value pairs:
key1:15, key2:45, key99999:1313123.
Where key is a string and value is a integer on which I would like to sort my results upon on where as if a certain document receives a key it gets sorted by the value.
I ended up creating an object and just put the key value pairs inside so I can match it easy.
For example I have sorting: "object.key".
I was wondering if I just use a simple object with bunch of strings inside that are just there for exact matching should I worry about raising this limit to 10k, or 20k.
Because I now have an issue where there can be more then 1k of these records. I've found I could use nested sorting but it still has a default limit of 10k.
Is there a good design pattern approach for this or should I not be worried by raising the field limits?
Simplified version of the query:
GET products/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"sortingObject.someSortingKey1": {
"order": "desc",
"missing": 2,
"unmapped_type":"float"
}
}
]
}
Point is that I get the sortingKey from request and I use it to sort my results. There are 100k different ways to sort the result for example
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tacke the following use case in elasticsearch
Lets say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be same for more than one docs (usually upto 2-3 docs will have same primary_id). In all other cases the primary_id is not repeted in any other docs.
So on average out of every 10 docs I will have 8 unique primary ids, and 1 primary id same among 2 docs
To ensure uniqueness I tried using the terms aggregation and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternates solutions, and tried the solution in this link as well, https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests use of multiple search requests each specifying the partition number to fetch (dependent upon how many partitions do you divide your result in). But I receive client timeouts even with high timeout settings client side.
Ideally, I want to know what is the best way to go about such data where the variance of the field which forms the bucket is almost equal to the number of docs. The SQL equivalent would be select DISTINCT ( primary_id) from .....
But in elasticsearch, distinct things can only be processed via bucketing (terms aggregation).
I also use top hits as a sub aggregation query under terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregtation.
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite Aggregation: can combine multiple datasources in a single buckets and allow pagination and sorting on it. It can only paginate linearly using after_key i.e you cannot jump from page 1 to page 3. You can fetch "n" records , then pass returned after key and fetch next "n" records.
GET index22/_search
{
"size": 0,
"aggs": {
"ValueCount": {
"value_count": {
"field": "id.keyword"
}
},
"pagination": {
"composite": {
"size": 2,
"sources": [
{
"TradeRef": {
"terms": {
"field": "id.keyword"
}
}
}
]
}
}
}
}
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is
executed after all other non-pipeline aggregations. This means the
sorting only applies to whatever buckets are already returned from the
parent aggregation. For example, if the parent aggregation is terms
and its size is set to 10, the bucket_sort will only sort over those
10 returned term buckets
So this isn't suitable for your case
You can increase the result size to value greater than 10K by updating setting index.max_result_window. Setting too big a size can cause out of memory issue so you need to test it out see how much your hardware can support.
Better option is to use scroll api and perform distinct at client side

How to transform nested JSON-payloads with Kiba-ETL?

I want to transform nested JSON-payloads into relational tables with Kiba-ETL. Here's a simplified pseudo-JSON-payload:
{
"bookings": [
{
"bookingNumber": "1111",
"name": "Booking 1111",
"services": [
{
"serviceNumber": "45",
"serviceName": "Extra Service"
}
]
},
{
"bookingNumber": "2222",
"name": "Booking 2222",
"services": [
{
"serviceNumber": "1",
"serviceName": "Super Service"
},
{
"serviceNumber": "2",
"serviceName": "Bonus Service"
}
]
}
]
}
How can I transform this payload into two tables:
bookings
services (every service belongsTo a booking)
I read a about yielding multiple rows with the help of Kiba::Common::Transforms::EnumerableExploder at wiki, blog, ... etc.
Would you solve my use-case by yielding multiple rows (the booking and multiple services), or would you implement a Destination which receives a whole booking and calls some Sub-Destinations (i.e. to create or update a service)?
Author of Kiba here!
This is a common requirement, but it can (and this is not specific to Kiba) be more or less complex to handle. Here are a few points you'll need to think about.
Handling of foreign keys
The main problem here is that you'll want to keep the relationships between services and bookings, once they are inserted.
Foreign keys using business keys
A first (most easy) way to handle this is to use a foreign-key constraint on "booking number", and make sure to insert that booking number in each service row, so that you can leverage it later in your queries. If you do this (see https://stackoverflow.com/a/18435114/20302) you'll have to set a unique-constraint on "booking number" in the bookings table target.
Foreign keys using primary keys
If you instead prefer to have a booking_id which points to the bookings table id key, things are a bit more complicated.
If this is a one-off import targeting an empty table, I recommend that you arbitrarily force the primary key using something like:
transform do |r|
#row_index ||= 0
#row_index += 1
r.merge(id: #row_index)
end
If this not a one-off import, you will have to:
* Upsert bookings in a first pass
* In a second pass, look-up (via SQL queries) "bookings" to figure out what is the id to store in booking_id, then upsert the services
As you see it's a bit more work, so stick with option 1 if you don't have strong requirements around this (although option 2 is more solid on the long run).
Example implementation (using Kiba Pro & business keys)
The simplest way to achieve this (assuming your target is Postgres) is to use Kiba Pro's SQL Bulk Insert/Upsert destination.
It would go this way (in single pass):
extend Kiba::DSLExtensions::Config
config :kiba, runner: Kiba::StreamingRunner
source Kiba::Common::Sources::Enumerable, -> { Dir["input/*.json"] }
transform { |r| JSON.parse(IO.read(r)).fetch('bookings') }
transform Kiba::Common::Transforms::EnumerableExploder
# SNIP (remapping / renaming of fields etc)
first_destination = nil
destination Kiba::Pro::Destinations::SQLBulkInsert,
row_pre_processor: -> (row) { row.except("services") },
dataset: -> (dataset) {
dataset.insert_conflict(target: :booking_number)
},
after_read: -> (d) { first_destination = d }
destination Kiba::Pro::Destinations::SQLBulkInsert,
row_pre_processor: -> (row) { row.fetch("services") },
dataset: -> (dataset) {
dataset.insert_conflict(target: :service_number)
},
before_flush: -> { first_destination.flush }
Here we iterate over each input file, parsing it and grabbing the "bookings", then generating one row per element of "bookings".
We have 2 destinations, doing "upsert" (insert or update), plus one trick to ensure we'll save the parent rows before we insert the children, to avoid a failure due to missing pointed record.
You can of course implement this yourself, but this is a bit of work though!
If you need to use primary-key based foreign keys, you'll have (likely) to split in 2 pass (one for each destination), then add some form of lookup in the middle.
Conclusion
I know that this is not trivial (depending on what you'll need, & if you'll use Kiba Pro or not), but at least I'm sharing the patterns that I'm using in such situations.
Hope it helps a bit!

For 1 billion documents, Populate data from one field to another fields in the same collection using MongoDB

I need to populate data from one field to multiple fields on the same collection. For example:
Currently I have document like below:
{ _id: 1, temp_data: {temp1: [1,2,3], temp2: "foo bar"} }
I want to populate into two different fields on the same collection as like below:
{ _id: 1, temp1: [1,2,3], temp2: "foo bar" }
I have one billion documents to migrate. Please suggest me the efficient way to update all one billion documents?
In your favorite language, write a tool that runs through all documents, migrates them, and store them in a new database.
Some hints:
When iterating the results, make sure they are sorted (e.g. on the _id) so you can implement resume should your migration code crash at 90%...
Do batch inserts: read, say, 1000 items, migrate them, then write 1000 items in a single batch to the new database. Reads are automatically batched.
Create indexes after the migration, not before. That will be faster and lead to less fragmentation
Here I made a query for you, use following query to migrate your data
db.collection.find().forEach(function(myDoc) {
db.collection_new.update(
{_id: myDoc._id},
{
$unset: {'temp_data': 1},
$set: {
'temp1': myDoc.temp_data.temp1,
'temp2': myDoc.temp_data.temp2
}
},
{ upsert: true }
)
});
To learn more about foreach cursor please visit link
Need $limit and $skip operator to migrate data in batches. In update query i have used upsert beacuse there if already exist it will update otherwise inserted entry wiil be new.
Thanks

mongodb - Recommended tree structure for large amount of data points

I'm working on a project which records price history for items across multiple territories, and I'm planning on storing the data in a mongodb collection.
As I'm relatively new to mongodb, I'm curious about what might be a recommended document structure for quite a large amount of data. Here's the situation:
I'm recording the price history for about 90,000 items across 200 or so territories. I'm looking to record the price of each item every hour, and give a 2 week history for any given item. That comes out to around (90000*200*24*14) ~= 6 billion data points, or approximately 67200 per item. A cleanup query will be run once a day to remove records older than 14 days (more specifically, archive it to a gzipped json/text file).
In terms of data that I will be getting out of this, I'm mainly interested in two things: 1) The price history for a specific item in a specific territory, and 2) the price history for a specific item across ALL territories.
Before I actually start importing this data and running benchmarks, I'm hoping someone might be able to give some advice on how I should structure this to allow for quick access to the data through a query.
I'm considering the following structure:
{
_id: 1234,
data: [
{
territory: "A",
price: 5678,
time: 123456789
},
{
territory: "B",
price: 9876
time: 123456789
}
]
}
Each item is its own document, which each territory/price point for that item in a particular territory. The issue I run into with this is retrieving the price history for a particular item. I believe I can accomplish this with the following query:
db.collection.aggregate(
{$unwind: "$data"},
{$match: {_id: 1234, "data.territory": "B"}}
)
The other alternative I was considering was just put every single data point in its own document and putting an index on the item and territory.
// Document 1
{
item: 1234,
territory: "A",
price: 5679,
time: 123456789
}
// Document 2
{
item: 1234,
territory: "B",
price: 9676,
time: 123456789
}
I'm just unsure of whether having 6 billion documents with 3 indexes or having 90,000 documents with 67200 array objects each and using an aggregate would be better for performance.
Or perhaps there's some other tree structure or handling of this problem that you fine folks and MongoDB wizards can recommend?
I would structure the documents as "prices for a product in a given territory per fixed time interval". The time interval is fixed for the schema as a whole, but different schemas result from different choices and the best one for your application will probably need to be decided by testing. Choosing the time interval to be 1 hour gives your second schema idea, with ~6 billion documents total. You could choose the time interval to be 2 weeks (don't). In my mind, the best time interval to choose is 1 day, so the documents would look like this
{
"_id" : ObjectId(...), // could also use a combination of prod_id, terr_id, and time so you get a free unique index to look up by those 3 values
"prod_id" : "DEADBEEF",
"terr_id" : "FEEDBEAD",
"time" : ISODate("2014-10-22T00:00:00.000Z"), // start of the day this document contains the data for
"data" : [
{
"price" : 1234321,
"time" : ISODate("2014-10-22T15:00:00.000Z") // start of the hour this data point is for
},
...
]
}
I like the time interval of 1 day because it hits a nice balance between number of documents (mostly relevant because of index sizes), size of documents (16MB limit, have to pipe over network), and ease of retiring old docs (hold 15 days, wipe+archive all from 15th day at some point each day). If you put an index on { "prod_id" : 1, "terr_id" : }`, that should let you fulfill your two main queries efficiently. You can gain an additional bonus performance boost by preallocating the doc for each day so that updates are in-place.
There's a great blog post about managing time series data like this, based on experience building the MMS monitoring system. I've essentially lifted my ideas from there.

Resources