Does LMDB support mapping multiple keys to the same value? - leveldb

Is it possible to have multiple keys mapping to the same value? If not, is there a workaround for this feature?

It isn't possible. One workaround I use is to make the value stored under the secondary key a pointer to the primary key; that is, the value of the secondary key is the primary key.
In particular, I make a secondary-keys table (or "Named Database" in LMDB speak) where all the values are primary keys in the primary table. If you look into other databases, this is exactly how they implement indexes.
For example:
Data table:
key: 72E13E60-85A6-4191-A187-F6FA5D3F0975
value: {
  "surrogate-key": "72E13E60-85A6-4191-A187-F6FA5D3F0975",
  "name": "Foo Widget",
  "location": "Atlantis Mall",
  "last-value": 892
}

Name table:
key: "Foo Widget"
value: "72E13E60-85A6-4191-A187-F6FA5D3F0975"

Location table:
key: "Atlantis Mall"
value: "72E13E60-85A6-4191-A187-F6FA5D3F0975"

Related

Filtering all field values per row

I have a table called 'sample'. Based on which algorithm is used, each sample may have different field (property) names.
I need to be able to retrieve all samples which have field values that contain/match a user filter value.
So for instance, if a sample has the following properties:
example 1: "name", "gender", "state"
and another had properties:
example 2: "name", "gender", "rate"
and there would be thousands of such samples with more variation.
If a user is looking at a table with a set of samples from the second example above ("name", "gender", "rate") and uses the filter "foo", I need to query the "sample" table for all rows where any property's value contains or matches "foo" (the value could be, say, "foobar").
If they were looking at a set of samples with the properties from example 1 ("name", "gender", "state"), I need to do the same; however, I cannot hard-code the properties of either.
In SQL I would get the field names and dynamically build a SQL query string, but with ReQL object dot notation I am struggling with how to do it.
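One possible direction is to filter on each row's values server-side rather than building a field list dynamically. A rough, untested sketch with the RethinkDB Python driver, assuming a table named sample whose property values are scalars:

import rethinkdb as r  # pre-2.4 driver import style

conn = r.connect("localhost", 28015)
user_filter = "foo"

# Keep rows where any property's value, coerced to a string,
# matches the user-supplied filter text.
matches = list(
    r.table("sample")
     .filter(lambda doc: doc.values().contains(
         lambda v: v.coerce_to("string").match(user_filter) != None))
     .run(conn)
)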

How to transform nested JSON-payloads with Kiba-ETL?

I want to transform nested JSON payloads into relational tables with Kiba-ETL. Here's a simplified pseudo-JSON payload:
{
  "bookings": [
    {
      "bookingNumber": "1111",
      "name": "Booking 1111",
      "services": [
        {
          "serviceNumber": "45",
          "serviceName": "Extra Service"
        }
      ]
    },
    {
      "bookingNumber": "2222",
      "name": "Booking 2222",
      "services": [
        {
          "serviceNumber": "1",
          "serviceName": "Super Service"
        },
        {
          "serviceNumber": "2",
          "serviceName": "Bonus Service"
        }
      ]
    }
  ]
}
How can I transform this payload into two tables:
bookings
services (every service belongsTo a booking)
I read about yielding multiple rows with the help of Kiba::Common::Transforms::EnumerableExploder in the wiki, blog posts, etc.
Would you solve my use case by yielding multiple rows (the booking and multiple services), or would you implement a Destination which receives a whole booking and calls some sub-destinations (e.g. to create or update a service)?
Author of Kiba here!
This is a common requirement, but it can (and this is not specific to Kiba) be more or less complex to handle. Here are a few points you'll need to think about.
Handling of foreign keys
The main problem here is that you'll want to keep the relationships between services and bookings, once they are inserted.
Foreign keys using business keys
A first (and easiest) way to handle this is to use a foreign key based on "booking number": make sure to insert the booking number in each service row so that you can leverage it later in your queries. If you do this (see https://stackoverflow.com/a/18435114/20302), you'll have to set a unique constraint on "booking number" in the target bookings table.
Foreign keys using primary keys
If you instead prefer to have a booking_id which points to the bookings table id key, things are a bit more complicated.
If this is a one-off import targeting an empty table, I recommend that you arbitrarily force the primary key using something like:
transform do |r|
  @row_index ||= 0
  @row_index += 1
  r.merge(id: @row_index)
end
If this is not a one-off import, you will have to:
* Upsert bookings in a first pass
* In a second pass, look up (via SQL queries) the bookings to figure out which id to store in booking_id, then upsert the services
As you can see, it's a bit more work, so stick with option 1 if you don't have strong requirements around this (although option 2 is more solid in the long run).
Example implementation (using Kiba Pro & business keys)
The simplest way to achieve this (assuming your target is Postgres) is to use Kiba Pro's SQL Bulk Insert/Upsert destination.
It would go this way (in a single pass):
extend Kiba::DSLExtensions::Config
config :kiba, runner: Kiba::StreamingRunner

source Kiba::Common::Sources::Enumerable, -> { Dir["input/*.json"] }

transform { |r| JSON.parse(IO.read(r)).fetch('bookings') }
transform Kiba::Common::Transforms::EnumerableExploder

# SNIP (remapping / renaming of fields etc)

first_destination = nil

destination Kiba::Pro::Destinations::SQLBulkInsert,
  row_pre_processor: -> (row) { row.except("services") },
  dataset: -> (dataset) {
    dataset.insert_conflict(target: :booking_number)
  },
  after_read: -> (d) { first_destination = d }

destination Kiba::Pro::Destinations::SQLBulkInsert,
  row_pre_processor: -> (row) { row.fetch("services") },
  dataset: -> (dataset) {
    dataset.insert_conflict(target: :service_number)
  },
  before_flush: -> { first_destination.flush }
Here we iterate over each input file, parsing it and grabbing the "bookings", then generating one row per element of "bookings".
We have two destinations, each doing an "upsert" (insert or update), plus one trick to ensure we save the parent rows before we insert the children, to avoid a failure due to a missing referenced record.
You can of course implement this yourself, but it is a bit of work!
If you need primary-key-based foreign keys, you'll likely have to split this into two passes (one per destination), then add some form of lookup in the middle.
Conclusion
I know this is not trivial (depending on what you need, and on whether you use Kiba Pro or not), but at least I'm sharing the patterns I use in such situations.
Hope it helps a bit!

Elasticsearch - Unique values in a field of an index

I have an index of the following type:
{
  company: {
    watchlist: [ {id: 1}, {id: 2}, {id: 1} ]
  }
}
In the watchlist array, duplicate values are stored. I want the index not to store duplicate values, as this is increasing the size of my index.
I know that I can get unique values with an aggregation, but what I want here is to store only unique values in the index.
I am using elasticsearch-rails here; it indexes data according to the JSON returned by the 'as_indexed_json' method. The data for the above index is in an SQL database, which I cannot change. I can only create indexes from that database, so I need some 'uniqueness' constraint on the 'watchlist' field.
Is there a way to do it?

MongoDB index: object keys vs array of strings

I'm new to MongoDB and have been researching schema designs and indexing. I know you can index a property regardless of its value (ID, array, subdocument, etc.), but what I don't know is whether there is a performance benefit to indexing an array of strings versus a nested object's keys.
Here's an example of both scenarios that I'm contemplating (in Mongoose):
// schema
mongoose.Schema({
  visibility: {
    usa: Boolean,
    europe: Boolean,
    other: Boolean
  }
});

// query
Model.find({ "visibility.usa": true });

OR

// schema
mongoose.Schema({
  visibility: [String] // strings could be "usa", "europe", and/or "other"
});

// query
Model.find({ visibility: "usa" });
Documents could have one, two, or all three visibility options.
Furthermore, if I went with the Boolean object design, could I simply index the visibility field, or would I need to put an index on usa, europe, and other?
In MongoDB, creating an index on an array of strings results in a multikey index, where every string in the array becomes an index key pointing to the same document. So in your case it would work the same as indexing nested object keys.
If you go with the Boolean design, you can put an index on the visibility field. You can read further about MongoDB multikey indexes.
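For illustration only, here is a small sketch of both designs with PyMongo (the database/collection names are assumptions, and the driver choice is just for brevity; the same ideas apply through Mongoose):

from pymongo import MongoClient

coll = MongoClient()["test"]["products"]  # hypothetical database/collection

# Array-of-strings design: a single index on "visibility" becomes a
# multikey index, one key per string in each document's array.
coll.create_index("visibility")
usa_docs = list(coll.find({"visibility": "usa"}))

# Nested-Boolean design: one option is to index the dotted field(s)
# that actually appear in your queries.
coll.create_index("visibility.usa")
usa_docs = list(coll.find({"visibility.usa": True}))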

DynamoDB: What's the best way to structure and query a sorted list of timestamped logs?

In the interest of better understanding Amazon's DynamoDB, Lambda functions and IAM roles (I'll stick to DynamoDB in this question), I'm setting up a Linux device to listen for new DynamoDB items and audibly read out updates that are being added by other functions at a regular interval. My goal is to query or scan items, returning those items in ascending order since a specific timestamp (the last time the device checked).
Here's the item structure I'm using so far:
{
  "id": {
    "S": "1eb4520d44715b6daa5f9d907fe43aab" // md5sum of "time"
  },
  "message": {
    "S": "I'm creating the audible reporting log now."
  },
  "status": {
    "S": "working"
  },
  "time": {
    "S": "1452297505" // timestamp: should probably add milliseconds for the sake of a unique "id"
  }
}
"id" is the partition key. "time" is the sort key. Looking at this now, I'm guessing I should probably make "time" a number, not a string...
Query or scan? Query seems like the correct option for sorting, but it requires a specific partition ID in the query (at least in the AWS website query tool), so perhaps I'm adding those incorrectly. Scan loads all items, and I'm guessing that the sort is not automatic or an option (at least not in the AWS website query tool). I really only want to load items greater than a timestamp value, sorted.
Where am I off in my thinking? I appreciate the assistance in advance.
UPDATE
After further experimentation with AWS-CLI and DynamoDB, I ended up using a slightly different solution. Since this is a small scale "hello world" type of project, all update items are added to the same table with a single partition key, "SF Reporter", for now. This could scale if I decide to start monitoring additional "reporter"/service updates with separate queries and/or devices.
{
  "datetime": { // sort key
    "S": "2016-01-11T05:15:02"
  },
  "message": {
    "S": "It is all good."
  },
  "reporter": { // primary partition key
    "S": "SF Reporter"
  },
  "status": {
    "S": "ok"
  }
}
The JSON query itself looks something like this (abbreviated node.js example):
var AWS = require("aws-sdk");
AWS.config.credentials = new AWS.SharedIniFileCredentials({ profile: 'default' });
AWS.config.update({ "region": "us-west-2" });

var docClient = new AWS.DynamoDB.DocumentClient();

var params = {
  TableName: "spoken_reports",
  KeyConditionExpression: "#reporter = :reporter and #datetime >= :datetime",
  ExpressionAttributeNames: {
    "#reporter": "reporter",
    "#datetime": "datetime"
  },
  ExpressionAttributeValues: {
    ":reporter": "SF Reporter",
    ":datetime": "2016-01-11T05:15:02"
  }
};

// Define the callback before handing it to query(); with "var" the
// assignment is not hoisted, so referencing it earlier would pass undefined.
var onUpdatesReceived = function(err, data) {
  if (err) {
    console.log(err, err.stack);
  } else {
    console.log(data);
  }
};

docClient.query(params, onUpdatesReceived);
The query gets the latest updates sorted by a string timestamp (defaults to ascending order in this example). This allows for some scaling as I can have multiple devices checking the same table for the latest updates. I would create a scheduled query/function to clear out old updates once in a while to keep things light.
Dead simple way:
Set up a global secondary index with "isNew" as its hash (partition) key and the timestamp as its range (sort) key.
On creation of an entry, set isNew to a UUID or something similar. This makes the item appear in the (sparse) index.
When you need to check for data, scan the secondary index; it will contain only the items that are new. Then, UpdateItem each item you have read in the table itself to delete the isNew attribute. The item is removed from the secondary index, so it is not read again.
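A rough sketch of that flow with boto3; the table name ("spoken_reports"), the index name ("new-items-index") and the key names follow the examples above and are assumptions, not the poster's actual setup:

import uuid
import boto3

table = boto3.resource("dynamodb", region_name="us-west-2").Table("spoken_reports")

# On write: set the flag so the item lands in the sparse GSI keyed on isNew.
table.put_item(Item={
    "reporter": "SF Reporter",
    "datetime": "2016-01-11T05:15:02",
    "message": "It is all good.",
    "isNew": str(uuid.uuid4()),
})

# On read: scan the GSI, which contains only unprocessed items ...
new_items = table.scan(IndexName="new-items-index")["Items"]

# ... then clear the flag so each item drops out of the index.
for item in new_items:
    table.update_item(
        Key={"reporter": item["reporter"], "datetime": item["datetime"]},
        UpdateExpression="REMOVE isNew",
    )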
If you stick with this table design, scanning the entire table is the only option you have, for the reasons you've mentioned: for querying, you need a partition key, which is something your devices have no way of knowing beforehand.
There is another solution that comes to my mind:
Let's say your current table is called T1. Create another table, T2, that has deviceID as the partition key and timestamp as the sort key.
Define an AWS Lambda function on T1's stream that, on any update, pushes that row into T2 as well, once per device.
Now whenever one of your devices wakes up, it queries (not scans) T2 with its own device ID, processes all the rows, and deletes them.
In other words, T2 will always have all the rows that a given device is yet to process.
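The device-side half of this could look roughly like the following boto3 sketch; the table name T2, the key names (deviceID, timestamp) and the device identifier are assumptions taken from the description above:

import boto3
from boto3.dynamodb.conditions import Key

t2 = boto3.resource("dynamodb", region_name="us-west-2").Table("T2")
device_id = "livingroom-pi"  # hypothetical device identifier


def speak(row):
    # Placeholder for the actual text-to-speech step.
    print(row.get("message", ""))


# Query (not scan): only this device's pending rows, sorted by the range key.
rows = t2.query(KeyConditionExpression=Key("deviceID").eq(device_id))["Items"]

for row in rows:
    speak(row)
    t2.delete_item(Key={"deviceID": device_id, "timestamp": row["timestamp"]})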
