Get the top N records from two unconnected data sets - ruby

I have two Rails services that return data from distinct databases. In one data set I have records with fields that are something like this:
query, clicks, impressions
In the second I have records with fields something like this:
query, clicks, visitors
What I want is paged data from the merged set, matched on query. It must also include the records that exist in only one of the two data sets, and the result must be sorted by the 'clicks' column.
In SQL if these two tables were in the same database I'd do this:
SELECT COALESCE(a.query, b.query) AS query, a.clicks, b.clicks, impressions, visitors
FROM a FULL OUTER JOIN b ON a.query = b.query
ORDER BY GREATEST(a.clicks, b.clicks) DESC
LIMIT 100 OFFSET 1
An individual "top 100" to each data set produces incorrect results because 'clicks' in data set 'a' may be significantly higher or lower than in dataset 'b'.
As they aren't in the same database, I'm looking for help with the algorithm that makes this kind of query efficient and clean.
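For reference, the semantics of the query above can be sketched in memory. This is a minimal illustration in Python rather than Ruby, and it assumes both result sets are small enough to load completely, so it does not solve the efficiency problem. The function name `merged_page` and the output field names `a_clicks` and `b_clicks` are hypothetical.

```python
def merged_page(a_rows, b_rows, limit=100, offset=0):
    """Full outer join on 'query', ordered by the larger clicks value, then paged."""
    a_by_query = {row["query"]: row for row in a_rows}
    b_by_query = {row["query"]: row for row in b_rows}

    merged = []
    # The union of keys keeps records that exist in only one data set.
    for query in a_by_query.keys() | b_by_query.keys():
        a = a_by_query.get(query, {})
        b = b_by_query.get(query, {})
        merged.append({
            "query": query,
            "a_clicks": a.get("clicks"),
            "b_clicks": b.get("clicks"),
            "impressions": a.get("impressions"),
            "visitors": b.get("visitors"),
        })

    # Sort descending by the larger click count; a missing side counts as 0.
    # The query string is a tie-breaker so that paging is stable.
    merged.sort(key=lambda r: (-max(r["a_clicks"] or 0, r["b_clicks"] or 0), r["query"]))
    return merged[offset:offset + limit]
```

The sort has to see every merged row before the first page can be returned, which is why a per-data-set "top 100" cannot simply be combined.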

I never found a way to do this outside of a database. In the end, we just used PostgreSQL's Foreign Data Wrapper feature to connect the two databases together and use PostgreSQL for handling the sorting and paging.
One trick for anyone heading down this path: we built VIEWs on the remote server that provided exactly the data needed for data set 'a' above. This was thousands of times faster than trying to join tables across the remote connection, where the value of the indexes was lost.

Related

How do you update an AWS Dynamodb with a condition and not a key

How do you go about updating a DynamoDB table by condition and not a key? I want to set all active flags to false where gameid = xxxx and age > 30.
When you design a DynamoDB schema you need to think differently than when you design a relational schema. Relational databases are good for small datasets, where you can simply go over all the records and update some values in them. However, that approach doesn't scale to millions of records or more, and you need to think differently and use a NoSQL solution such as DynamoDB.
The main power of DynamoDB is the almost unlimited scale of LOOKUP operations that are mostly GET and PUT of a single or a small set of records. The solutions that were offered in the comments to the questions are good and you can:
Query the records that need to change (using PartiQL, for example) with the condition SELECT * FROM "Games" WHERE gameid = 'xxxx' AND age > 30 AND flag = 'Active'
Loop over the records and update each with the relevant value
Nevertheless, you should consider a different design for your tables, and think about the reason for the bulk update. Maybe you should have another table where you can simply update a single record to apply the change that you want. For example, if the records are part of an object called Round, point the records to this object and update the state of this single round record when needed.
It is very easy to read two records (one for the game and one for the round) instead of a single game record, especially if doing so minimizes the complexity and cost of updating many game records with such a flag.
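The query-then-loop approach described above might look roughly like the following sketch. It assumes a boto3 DynamoDB client is passed in as `client`, that the table's primary key is `gameid` plus a sort key named `playerid` (a hypothetical name, since the question does not give the key schema), and that deactivating means setting `flag` to 'Inactive' (also an assumption).

```python
def deactivate_matching(client, game_id, min_age, table="Games"):
    """Select matching items with PartiQL, then update each one by its primary key.

    `client` is expected to behave like boto3.client("dynamodb").
    The key attribute 'playerid' is an assumed name, not taken from the question.
    """
    statement = (
        f'SELECT "gameid", "playerid" FROM "{table}" '
        'WHERE "gameid" = ? AND "age" > ? AND "flag" = ?'
    )
    params = [{"S": game_id}, {"N": str(min_age)}, {"S": "Active"}]

    updated = 0
    next_token = None
    while True:
        kwargs = {"Statement": statement, "Parameters": params}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.execute_statement(**kwargs)

        for item in page.get("Items", []):
            # Every write still addresses exactly one item by its full key.
            client.update_item(
                TableName=table,
                Key={"gameid": item["gameid"], "playerid": item["playerid"]},
                UpdateExpression="SET #f = :inactive",
                ExpressionAttributeNames={"#f": "flag"},
                ExpressionAttributeValues={":inactive": {"S": "Inactive"}},
            )
            updated += 1

        next_token = page.get("NextToken")
        if not next_token:
            return updated
```

The number of write requests grows with the number of matching items, which is the cost that the single "round" record design avoids.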

Oracle 11g - Building a Type 2 SCD based on existing historical data in a relational model

I'm an ETL developer that's currently being tasked with developing a type 2 SCD from existing historical data in a relational database. I'm perfectly capable of creating a type 2 SCD that's responsible for tracking future changes to the data, but I'm completely useless when it comes to the task at hand.
The relational model is in our ODS. Based on that relational model, I'm supposed to build flat records in our DW dimension. There are multiple attributes which need to be monitored for changes, each in specific related tables in the relational model. Historical changes must be kept on a daily basis, and if multiple changes to the same attribute occur on the same day, only the last one is kept.
How can I tackle this? I'm lost. Thanks in advance.
P.S. we're talking tables with 20-30 million rows and multiple attributes that may change at any given time and therefore must result in a new record in the SCD.
This will indeed be painful. I'm assuming from your question that the tables containing the attribute values are currently varying independently (or you wouldn't need to ask the question).
If you have a table 'Table1' containing 'Key', 'Attribute1' and 'Effective From','Effective To' columns, then you can 'explode' that table into a virtual table in the form 'Key','Attribute1','Date', projecting out one row for every date where that attribute was current.
(Note that you probably don't want to do this as a ranged join against your date dimension, because this will be a Triangular Join, i.e. it will perform really badly. You probably need to explode the rows in an ETL tool or programmatically.)
If you perform this process across multiple tables, you will have a set of tables giving you the full day-by-day snapshot of each attribute for every day that you care about. It's then fairly easy to join those tables based on 'FK' and 'Date' to give you the complete daily snapshot across all of the attribute values.
Then, of course, you need to run this through another process to collapse rows with the same Key, contiguous dates and all the same attribute values, i.e. 'unexplode' the rows back into 'effective from', 'effective to' form. Note again that this is fundamentally a row-by-row operation (or at the very least a windowing function), and a set-based approach will perform very badly. Personally I'd just stream it all through some .NET/Java code to achieve this.
Given data volumes this will take a while, but should be achievable.
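The explode, join and collapse steps can be sketched as follows. This is a toy illustration in Python that holds everything in memory and assumes inclusive 'Effective To' dates; with 20-30 million rows the same logic would have to be streamed in key order. The function names and the tuple layout `(key, value, effective_from, effective_to)` are hypothetical.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)

def explode(rows):
    """Turn (key, value, effective_from, effective_to) ranges into {(key, day): value}."""
    daily = {}
    for key, value, start, end in rows:
        day = start
        while day <= end:  # assumes effective_to is inclusive
            daily[(key, day)] = value
            day += ONE_DAY
    return daily

def collapse(daily_rows):
    """Merge contiguous days with identical attributes back into date ranges."""
    ranges = []
    for (key, day), attrs in sorted(daily_rows.items()):
        last = ranges[-1] if ranges else None
        if (last and last["key"] == key and last["attrs"] == attrs
                and last["to"] + ONE_DAY == day):
            last["to"] = day  # extend the open range by one day
        else:
            ranges.append({"key": key, "attrs": attrs, "from": day, "to": day})
    return ranges

def build_scd(attr1_rows, attr2_rows):
    """Join two exploded attribute histories on (key, day), then collapse them."""
    a1, a2 = explode(attr1_rows), explode(attr2_rows)
    # Inner join: only days on which both attributes have a value.
    snapshot = {k: (a1[k], a2[k]) for k in a1.keys() & a2.keys()}
    return collapse(snapshot)
```

Each output range corresponds to one row of the type 2 dimension, with 'from' and 'to' as its effective dates.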

performance issues while processing 2 tables in lockstep based on orderedBy from-to

Title is probably not very clear so let me explain.
I want to perform an in-process join (Node.js) on 2 tables*, Session and SessionAction (1-N).
Since these tables are rather big (millions of records both) my idea was to get slices based on an orderBy sessionId (which they both share), and sort of lock-step walk through both tables in batches.
This, however, proves to be awfully slow. I'm using pseudo code as follows for both tables to get the batches:
table('x').orderBy({index:"sessionId"}).filter(row.sessionId > start && row.sessionId < y)
It seems that even though I'm essentially filtering on an attribute, sessionId, which has an index, the query planner is not smart enough to see this, and every query does a complete table scan to do the orderBy before filtering afterwards (or so it seems).
Of course, this is incredibly wasteful but I don't see another option. E.g.:
Order after filter is not supported by Rethink.
Getting a slice of the ordered table doesn't work either, since slice-enumeration (i.e. the xth until the yth record), for lack of a better word, doesn't add up between the 2 tables.
Questions:
Is my approach indeed expected to be slow, due to having to do a table scan at each iteration/batch?
If so, how could I design my queries to get it working faster?
*) It's too involved to do it using Rethink Reql only.
filter is never indexed in RethinkDB. (In general a particular command will only use a secondary index if you pass index as one of its optional arguments.) You can write that query like this to avoid scanning over the whole table:
r.table('x').orderBy({index: 'sessionId'}).between(start, y, {index: 'sessionId'})
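Once both tables can be read in sessionId order (for example in index-ordered batches fetched with between), the lock-step walk from the question amounts to a merge join. The following is a minimal sketch in Python rather than Node.js. It assumes both inputs are iterables of dicts already sorted by `sessionId`, and the function name `lockstep_join` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def lockstep_join(sessions, actions):
    """Yield (session, [actions]) pairs from two iterables sorted by sessionId."""
    action_groups = groupby(actions, key=itemgetter("sessionId"))
    current = next(action_groups, None)  # (sessionId, iterator of actions) or None

    for session in sessions:
        sid = session["sessionId"]
        # Skip action groups that belong to sessions sorting before this one.
        while current is not None and current[0] < sid:
            current = next(action_groups, None)
        if current is not None and current[0] == sid:
            # Materialise the group before advancing the groupby iterator.
            yield session, list(current[1])
            current = next(action_groups, None)
        else:
            yield session, []
```

Each input is consumed exactly once and only one session's actions are held in memory at a time, so the batches from the two tables do not need to line up.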

Compound rowkey in Azure Table storage

I want to move some of my Azure SQL tables to Table storage. As far as I understand, I can save everything in the same table, separating it using PartitionKey and keeping it unique within each partition using RowKey.
Now, I have a table with a compound key:
ParentId: (uniqueidentifier)
ReportTime: (datetime)
I also understand RowKeys have to be strings. Will I need to combine these in a single string? Or can I combine multiple keys some other way? Do I need to make a new key perhaps?
Any help is appreciated.
UPDATE
My idea is to take data from several (three for now) database tables and put it in the same storage table, separating them with the partition key.
I will query using the ParentId and a WeekNumber (another column). This table has about 1 million rows that are deleted weekly from the db. My two other tables have about 6 million and 3.5 million rows.
This question is pretty broad and there is no right answer.
The specific question: can you use compound keys with Azure Table Storage? Yes, you can. But this involves manually serializing and deserializing your object's properties. You can achieve that by overriding the TableEntity's ReadEntity and WriteEntity methods. Check this detailed blog post on how you can override these methods to use your own custom serialization/deserialization.
I will further discuss my view on your broader question.
First of all, why do you want to put data from 3 (SQL) tables into one (Azure Table)? Just have 3 Azure tables.
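Independent of the SDK used, the combining step itself is plain string composition. The following minimal sketch (in Python, not the .NET TableEntity API mentioned above) assumes the compound key is ParentId plus ReportTime, with second-level time resolution being enough. The function names and the `_` separator are hypothetical choices; the format avoids the characters `/`, `\`, `#` and `?`, which are not allowed in key values.

```python
from datetime import datetime
from uuid import UUID

TIME_FORMAT = "%Y%m%dT%H%M%S"  # fixed width, so string order matches time order

def make_row_key(parent_id: UUID, report_time: datetime) -> str:
    """Combine ParentId and ReportTime into a single RowKey string."""
    return f"{parent_id}_{report_time.strftime(TIME_FORMAT)}"

def parse_row_key(row_key: str):
    """Split a RowKey produced by make_row_key back into its two parts."""
    parent_part, time_part = row_key.split("_", 1)
    return UUID(parent_part), datetime.strptime(time_part, TIME_FORMAT)
```

Putting ParentId first means that all rows for one parent sort together, ordered by report time within that parent.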
The second consideration, as Fabrizio points out, is how you are going to query the records, because the Windows Azure Table service has only one index: the PartitionKey + RowKey properties (columns). If you are pretty sure you will mostly query data by a known PartitionKey and RowKey, then Azure Tables suits you perfectly! However, you say that your combination for RowKey is ParentId + WeekNumber! That means that a record is uniquely identified by this combination! If that is true, then you are even more ready to go.
Next you say you are going to delete records every week! You should know that a DELETE operation acts on a single entity. You can use Entity Group Transactions to delete multiple entities at once, but there are limits: (a) all entities in a batch operation must have the same PartitionKey, (b) the maximum number of entities per batch is 100, and (c) the maximum size of a batch operation is 4MB. Say you have 1M records, as you say. To delete them, you first have to retrieve them in groups of 100, then delete them in groups of 100. In the best possible case, that is 10k operations for retrieval and 10k operations for deletion. Even if it only costs 0.002 USD, think about the time taken to execute 10k operations against a REST API.
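To make the batching arithmetic concrete, here is a minimal sketch of how entities would have to be grouped under the limits quoted above. It is written in Python and makes no storage calls. The function name `delete_batches` and the dict-shaped entities are hypothetical, and the 4MB size limit is not checked.

```python
from collections import defaultdict

MAX_BATCH = 100  # maximum number of entities per batch, as quoted above

def delete_batches(entities):
    """Group entities into delete batches: one PartitionKey per batch, at most 100 each."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity["RowKey"])

    for partition_key, row_keys in by_partition.items():
        for i in range(0, len(row_keys), MAX_BATCH):
            yield partition_key, row_keys[i:i + MAX_BATCH]
```

With 1M entities in a single partition this yields 10,000 batches, and spreading the same entities over many partitions can only increase that number.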
Since you have to delete entities on a regular basis, which is fixed to a WeekNumber let's say, I can suggest that you dynamically create your tables and include the week number in its name. Thus you will achieve:
Even better partitioning of information
Easier and more granular information backup / delete
Deleting millions of entities requires just one operation - delete table.
There is no unique solution to your problem. Yes, you can use ParentId as PartitionKey and ReportTime as RowKey (or invert the assignment). But the two main questions are: how do you query your data, and with what conditions? And how much data do you store: 1,000, 1 million, or 1,000 million items? The total storage usage is important, but it's also very important to consider the number of transactions you will generate against the storage.

Oracle: performance about filtering results from remote view

I have a remote database A which has a view v_myview. I am working on a local database, which has a dblink to access v_myview on database A. If I query the view like this:
select * from v_myview@dblink;
it returns half a million rows. I just want to get some specific rows from the view, e.g. the rows with id=123, so my query is
select * from v_myview@dblink where id=123;
This works as expected. Here is my question: when I run this query, will the remote database generate the half million rows first and then find the rows with id=123 among them? Or will the remote view apply my filter first and query the DB without retrieving the half million rows? How can I tell? Thank you!
Oracle is free to do either. You'd need to look at the query plan to see whether the filtering is being done locally or remotely.
Presumably, in a case as simple as the one you present, the optimizer would expect it to be more efficient to send the filter to the remote server rather than pulling half a million rows over the network only to filter them locally. That calculation may be different if the optimizer expects the unfiltered query to return a single row rather than half a million rows and it may be different if the query gets more complicated doing something like joining to a local table or calling a function on the local server.
