Efficient point-in-time query of group membership - algorithm

We have a scenario like this:
Millions of records (Record 1, Record 2, Record 3...)
Partitioned into millions of small non-intersecting groups (Group A, Group B, Group C...)
Membership gradually changes over time, i.e. a record may be reassigned to another group.
We are redesigning the data schema, and one use case we need to support is given a particular record, find all other records that belonged to the same group at a given point in time. Alternatively, this can be thought of as two separate queries, e.g.:
To which group did Record 15544 belong, three years ago? (Call this Group g).
What records belonged to Group g, three years ago?
Supposing we use a relational database, the association between records and groups is easily modelled using a two-column table of record id and group id. A common approach for allowing historical queries is to add a timestamp column. This allows us to answer the question above as follows:
Find the row for Record 15544 with the most recent timestamp prior to the given date. This tells us Group g.
Find all records that have at any time belonged to Group g.
For each of these records, find the row with the most recent timestamp prior to the given date. If this indicates that the record was in Group g at that time, then add it to the result set.
This is not too bad (assuming the table is separately indexed by both record id and group id), and may even be the optimal algorithm for the naive table structure just described, but it does cost an index lookup for every record found in step 2. Is there an alternative data structure that would answer the query more efficiently?
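For concreteness, here is a sketch of the naive table structure and the three-step query just described (SQLite is used purely for illustration; table and column names are hypothetical):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE membership (
        record_id INTEGER NOT NULL,
        group_id  INTEGER NOT NULL,
        ts        INTEGER NOT NULL   -- time the assignment took effect
    );
    CREATE INDEX ix_membership_record ON membership (record_id, ts);
    CREATE INDEX ix_membership_group  ON membership (group_id, record_id);
""")

def group_members_at(conn, record_id, as_of):
    # Step 1: the record's latest assignment at or before as_of -> group g
    row = conn.execute(
        "SELECT group_id FROM membership WHERE record_id = ? AND ts <= ? "
        "ORDER BY ts DESC LIMIT 1", (record_id, as_of)).fetchone()
    if row is None:
        return set()
    g = row[0]
    # Step 2: every record that has ever belonged to g
    candidates = [r[0] for r in conn.execute(
        "SELECT DISTINCT record_id FROM membership WHERE group_id = ?", (g,))]
    # Step 3: keep candidates whose latest assignment at or before as_of is still g
    members = set()
    for rid in candidates:
        latest = conn.execute(
            "SELECT group_id FROM membership WHERE record_id = ? AND ts <= ? "
            "ORDER BY ts DESC LIMIT 1", (rid, as_of)).fetchone()
        if latest is not None and latest[0] == g:
            members.add(rid)
    return members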
ETA: This is only one of several use cases for the system, so we don't want to speed up this query at the expense of making queries about current groupings slower, nor do we want to pay a huge price in space consumption, etc.

How about creating two tables:
(recordID, time -> groupID) - keyed by (recordID, time), sorted primarily by recordID and secondarily by time (let that be map1)
(groupID, time -> list of recordIDs) - keyed by (groupID, time), sorted primarily by groupID and secondarily by time (let that be map2)
At each record change:
Retrieve the current groupID of the record you are changing
set t <- current time
Add a new entry to map2 for the old group: ((oldGroupID, t) -> list'), where list' is the old group's current list without the record you just moved out of it.
Add a new entry to map2 for the new group: ((newGroupID, t) -> list''), where list'' is the new group's current list with the moved record added to it.
Add a new entry ((recordID, t) -> newGroupID) to map1.
During query:
You need to find the entry in map1 whose key is closest to, but not greater than, (recordID, desired_time) - a classic O(log N) operation on a sorted data structure.
This gives you the group g the record belonged to at the desired time.
Now look in map2 similarly for the entry whose key is closest to, but not greater than, (g, desired_time). Its value is the list of all records that were in the group at the desired time.
This requires quite a bit more space (by a constant factor, though...), but every operation is O(log N), where N is the number of record changes.
An efficient sorted data structure for entries that are mostly stored on disk is a B+ tree, which is also what many relational database implementations use for their indexes.
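A minimal in-memory sketch of this scheme in Python (names are illustrative; an append-only per-key history with binary search stands in for the composite-key B+ tree):

import bisect
from collections import defaultdict

class History:
    """Append-only (time, value) log; as_of(t) is a binary search, O(log N)."""
    def __init__(self):
        self.times = []
        self.values = []

    def append(self, t, value):
        # changes are assumed to arrive in non-decreasing time order
        self.times.append(t)
        self.values.append(value)

    def as_of(self, t):
        """Value of the last entry with time <= t, or None if there is none."""
        i = bisect.bisect_right(self.times, t)
        return self.values[i - 1] if i else None

map1 = defaultdict(History)   # record_id -> history of its group_id
map2 = defaultdict(History)   # group_id  -> history of frozenset(member record_ids)

def move_record(record_id, new_group_id, t):
    """Reassign record_id to new_group_id at time t, updating both maps."""
    old_group_id = map1[record_id].as_of(t)
    if old_group_id is not None and old_group_id != new_group_id:
        old_members = map2[old_group_id].as_of(t) or frozenset()
        map2[old_group_id].append(t, old_members - {record_id})
    new_members = map2[new_group_id].as_of(t) or frozenset()
    map2[new_group_id].append(t, new_members | {record_id})
    map1[record_id].append(t, new_group_id)

def group_members_at(record_id, t):
    """All records that shared record_id's group at time t."""
    g = map1[record_id].as_of(t)
    return map2[g].as_of(t) if g is not None else frozenset()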

Related

Does order of partitioning columns matter in Hive?

Let's say I have a partitioned table with multiple columns as partition keys, e.g.
partitioned by (department string, year int, month int, day int)
So does this specific order really matter? All the online resources refer to the advantage of scanning only specific sub-directories during a search. But ultimately everything is a file in big data; directories seem to be more of a logical grouping. When one specifies a filter on a partitioned column, Hive just needs to know which files are involved and where they are located, so I'm not sure how the directory is going to be useful: it's not as if directories are loaded into memory; files are loaded into memory, and the directory path is more like a label for a given file. If that's the case, no matter which order we specify for partitioning, it shouldn't matter. This is especially evident in HDInsight, where the underlying file system (BLOBs) has no concept of a directory.
Although you're right about directories being logical constructs, if you consider the amount of metadata your HiveServer2 has to fetch and sift through in order to execute an average query, the order does matter. If a query contains ...WHERE department='IT'..., and the partitions are laid out as you show, then given 100 departments in total, the partition-pruning mechanism can eliminate 99 subdirectories from the tree right away. But if the order of partition columns is reversed, the same query will need to retrieve metadata for (30 days x 12 months x N years) partitions from the Hive MetaStore just to figure out whether a /department=IT partition actually exists in each of them. So the order of partition columns can be decided by analyzing the predominant query patterns.
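As a toy illustration of the pruning difference (this just counts path prefixes; it is not how the metastore actually works, and all names and numbers are made up):

from itertools import product

departments = [f"dept_{i:02d}" for i in range(100)]                 # 100 departments
dates = list(product([2021, 2022], range(1, 13), range(1, 29)))     # ~2 years of (y, m, d)

# partitioned by (department, year, month, day): department is the top-level directory
dept_first = [f"/department={d}/year={y}/month={m}/day={dd}"
              for d in departments for (y, m, dd) in dates]
# partitioned by (year, month, day, department): department is the leaf directory
dept_last = [f"/year={y}/month={m}/day={dd}/department={d}"
             for (y, m, dd) in dates for d in departments]

# Filter: WHERE department = 'dept_42'
# Department first: 99 of the 100 top-level subtrees are eliminated immediately.
top_level_dirs = {p.split("/")[1] for p in dept_first}
survivors = [p for p in dept_first if p.startswith("/department=dept_42/")]
# Department last: every (year, month, day) subtree must be checked for dept_42.
date_subtrees = {"/".join(p.split("/")[1:4]) for p in dept_last}

print(len(top_level_dirs), len(survivors), len(date_subtrees))      # 100 672 672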
Another common factor to consider is devops/maintenance, especially if data is loaded into a table incrementally. If one needs to back out of or recover from an unsuccessful load, will they need to drop a partition (day=08) in each department subtree individually, or can all departments' data be cleared at once by dropping partition (day=08)?

Query for Latest Item & Proper Use of Partition Keys in DynamoDB

I am creating a DynamoDB table to support an Alexa Skill for use as a podcast player. The way I envision the table is to use the episode number as the Partition Key and the PublicationDate as the optional Sort Key. I have two concerns about designing my table schema in this way.
First, say I wanted to query the table to get the latest episode - I'm not sure that I can do it in this fashion, as a query requires an equivalence operation on the Partition Key (episode = X), which I wouldn't know in advance. Am I correct in believing that a scan would be quite an expensive operation if the podcast has a large number of episodes (say more than 1000)?
I would need to look at each item in the table, compare its episode number (Partition Key value) to that of the previously returned Item, and update a variable with the more recent Item each time one was found, until all Items in the table had been cycled through in this way.
Secondly, DynamoDB best practices say two things which work incongruently in my use-case (probably a sign that my design is flawed). First, the Partition Key should be unique or close to unique. Second, queries should be expected to be more or less uniformly dispersed amongst the keys. In my case, though, while the Partition Key would indeed be unique, I would expect the vast majority of queries to be targeting the latest Partition Key in the table, for the Item containing data for the latest podcast episode. What would be the impact on performance if, say for example, the skill gets 1000 queries on any given day all aimed at a single Partition Key?
Does anyone have a better table architecture solution for this type of data?
Thanks to everyone in advance!
Question 1:
First, say I wanted to query the table to get the latest episode - I'm
not sure that I can do it in this fashion, as a query requires an
equivalence operation on the Partition Key (episode = X), which I
wouldn't know in advance. Am I correct in believing that a scan would
be quite an expensive operation if the podcast has a large number of
episodes (say more than 1000)?
You are right that you would NOT be able to query for the latest episode, because each episode is in its own Partition. Partitions are almost like different isolated tables, so there is no way to query across all Partitions without Scanning (as you said).
Question 2:
Secondly, DynamoDB best practices say two things which work
incongruently in my use-case (probably a sign that my design is
flawed). First, the Partition Key should be unique or close to unique.
Second, queries should be expected to be more or less uniformly
dispersed amongst the keys. In my case, though, while the Partition
Key would indeed be unique, I would expect the vast majority of
queries to be targeting the latest Partition Key in the table, for the
Item containing data for the latest podcast episode. What would be the
impact on performance if, say for example, the skill gets 1000 queries
on any given day all aimed at a single Partition Key?
The issue here is twofold: AWS expects you to be reading (and writing) equally (or close to equally) across each partition, so basically what is going to happen is that you are going to pay for Write Units (and Read Units) on the partitions you are NOT using.
Exactly how much more that is going to run you will depend on the number of times you QUERY the database; however, reading is much cheaper than writing, and 1000 reads is basically nothing on a table with 1000 items. I.e., you MIGHT be able to get away with it, but it's not ideal.
Alternate Table Schema / Key Design
What other Queries will you make, i.e. other than "check for the latest episode"?
How many episodes are added per day? Per week? Per year?
Are there multiple 'shows' or categories that could be used for Partition Keys that might have more even distribution and could be 'known'?
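For example, if there are multiple shows, keying on the show (as suggested above) with the publication date as the sort key makes "latest episode" a Query rather than a Scan. A boto3 sketch (all table, key, and attribute names here are made up, not part of the original design):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
episodes = dynamodb.Table("PodcastEpisodes")   # partition key: show_id, sort key: published_at

def latest_episode(show_id):
    """Newest item for a show: query its partition, newest sort key first."""
    resp = episodes.query(
        KeyConditionExpression=Key("show_id").eq(show_id),
        ScanIndexForward=False,   # descending by sort key (published_at)
        Limit=1,
    )
    items = resp.get("Items", [])
    return items[0] if items else None

# A specific episode could still be looked up by episode number, e.g. through a
# global secondary index keyed on it, while "latest" stays a one-item Query.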

How to summarize by calculated measure in Power BI?

I have transactional data which contains customer information as well as stores they shopped from. I can count the number of different stores each customer used by a simple DISTINCTCOUNT([Site Name]) measure.
There are millions of customers and I want to make a simple summary table which shows the number of customers who visited X stores - like a histogram. The maximum number of stores they visited is 6, the minimum is 1.
I know there are multiple ways to do this, but I am new to DAX and can't yet do what I have in mind.
The easiest way:
Assuming your DISTINCTCOUNT([Site Name]) measure is called CustomerStoreCount ...
Add a new dimension table, StoreCount, to your model containing a single column, StoreCount. Populate it with the values 1,2,3,4,5,6 (... up to maximum number of stores.)
Create a measure, ThisStoreCount = MAX(StoreCount[StoreCount]).
Create a base customer count measure, TotalCustomers:=DISTINCTCOUNT(CustomerTable[Customer])
Create a contextual measure, CustomersWhoVisitedXNumberOfStores := CALCULATE ( TotalCustomers, FILTER(VALUES(CustomerTable[Customer]), ThisStoreCount = CustomerStoreCount) )
On your pivot table / reporting tool, etc., use StoreCount[StoreCount] on the axes and CustomersWhoVisitedXNumberOfStores as the measure.
So basically the measure walks through the customer list (since there's no relationship between StoreCount and CustomerTable) and compares each customer's CustomerStoreCount with the maximum StoreCount[StoreCount] value, which for each StoreCount[StoreCount] value is ... drum roll ... itself. If they match, the customer is kept; otherwise it is filtered out. You end up with a count of customers whose distinct store count equals the value of StoreCount[StoreCount].
And of course the more general modeling hint: when you want to display a metric by something (i.e. customer count by number of stores visited), that something is an attribute, not a metric.

Designing relational system for large scale

I've been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say mysql). Let's say it allows users to make posts and these are stored in the post table (has fields: postid, posterid, data, timestamp). So, when you go to retrieve all posts by you sorted by recency, you simply get all posts with posterid = you and order by date. Simple enough.
This process will use timestamp as the index since it has the highest cardinality and correctly so. So, beyond looking into the indexes, it'll take literally 1 row fetch from disk to complete this task. Awesome!
But let's say there have been 1 million more posts (in the system) by other users since you last posted. Then, in order to get your latest post, the database will peg the index on timestamp again, and it's not as if we know how many posts have happened since then (or should we at least manually estimate this and set a preferred key?). Then we've wasted looking through a million and one rows just to fetch a single row.
Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.
Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding user's oldest/newest post can be done in a single index seek, which is proportional to the B-Tree height, which is proportional to log(N) where N is number of indexed rows.
To understand why, take a look at Anatomy of an SQL Index.
Clustering
The leaves of a "normal" B-Tree index hold "pointers" (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the "table heap". The heap can be eliminated by storing rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
CREATE TABLE T (
    posterid int,
    `timestamp` DATETIME,
    data VARCHAR(50),
    PRIMARY KEY (posterid, `timestamp`)
);
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven't used the surrogate key (postid) since secondary indexes in clustered tables can be expensive and we already have a natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
For queries like
where posterid = 5
order by timestamp
or
where posterid in (4, 578, 222299, ...etc...)
order by timestamp
make an index on (posterid, timestamp) and the database should pick it all by itself.
Edit - I just tried this with MySQL:
CREATE TABLE `posts` (
    `id` INT(11) NOT NULL,
    `ts` INT NOT NULL,
    `data` VARCHAR(100) NULL DEFAULT NULL,
    INDEX `id_ts` (`id`, `ts`),
    INDEX `id` (`id`),
    INDEX `ts` (`ts`),
    INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB;
I filled it with a lot of data, and
explain
select * from posts where id = 5 order by ts
picks the id_ts index
Assuming you use hash tables to implement your database - yes. Hash tables are not ordered, and you have no way other than iterating over all elements to find the maximal one.
However, if you use some ordered data structure, such as a B+ tree (which is actually pretty well optimized for disks, and thus for databases), it is a different story.
You can store elements in your B+ tree ordered by user (primary order/comparator) and date (secondary comparator, descending). Once you have this structure, finding the first element can be achieved in O(log(n)) disk seeks by finding the first element matching the primary criterion (user id).
I am not familiar with the internals of specific databases, but AFAIK some of them do allow you to create an index based on a B+ tree - and by doing so, you can find the last post of a user more efficiently.
P.S.
To be exact, the concept of a "greatest" element (or of ordering at all) is not well defined in relational algebra: there is no max operator and no sort operator in strict relational algebra (though both exist in SQL). To get the max element of a table R with a single column a, one actually has to build the Cartesian product of the table with itself and subtract away every value that is smaller than some other value.
(Assuming set, and not multiset semantics):
MAX = R \ Project(Select(R x R, R1.a < R2.a),R1.a)
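The same construction played out on a toy single-column relation, with Python sets standing in for relations:

R = {3, 1, 4, 5}                                # set semantics
non_max = {a for a in R for b in R if a < b}    # Project(Select(R x R, R1.a < R2.a), R1.a) = {1, 3, 4}
print(R - non_max)                              # {5} == MAX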

Index needed for max(col)?

I'm currently doing some data loading for a kind of warehouse solution. I get a data export from production each night, which then must be loaded. There are no other updates on the warehouse tables. To load only new items for a certain table I'm currently doing the following steps:
get the current max value y for a specific column (id for journal tables and time for event tables)
load the data via a query like where x > y
To avoid performance issues (I load around 1 million rows per day) I removed most indices from the tables (they are only needed in production, not in the warehouse). But that way the retrieval of the max value takes some time... so my question is:
What is the best way to get the current max value for a column without an index on that column? I just read about using the stats, but I don't know how to handle columns of type 'timestamp with time zone'. Disabling the index before the load and recreating it afterwards takes much too long...
The minimum and maximum values that are computed as part of column-level statistics are estimates. The optimizer only needs them to be reasonably close, not completely accurate. I certainly wouldn't trust them as part of a load process.
Loading a million rows per day isn't terribly much. Do you have an extremely small load window? I'm a bit hard-pressed to believe that you can't afford the cost of maintaining the index you'd need for a min/max index scan.
If you want to avoid indexes, however, you probably want to store the last max value in a separate table that you maintain as part of the load process. After you load rows 1-1000 in table A, you'd update the row in this summary table for table A to indicate that the last row you've processed is row 1000. The next time in, you would read the value from the summary table and start at 1001.
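A minimal sketch of that summary-table approach (SQLite and all names are purely illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE journal    (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE load_state (table_name TEXT PRIMARY KEY, last_loaded_id INTEGER);
    INSERT INTO load_state VALUES ('journal', 0);
""")

def load_new_rows(conn):
    # read the watermark maintained by the previous load
    (watermark,) = conn.execute(
        "SELECT last_loaded_id FROM load_state WHERE table_name = 'journal'").fetchone()
    new_rows = conn.execute(
        "SELECT id, payload FROM journal WHERE id > ? ORDER BY id",
        (watermark,)).fetchall()
    if new_rows:
        # ... load new_rows into the warehouse table here ...
        conn.execute(
            "UPDATE load_state SET last_loaded_id = ? WHERE table_name = 'journal'",
            (new_rows[-1][0],))
        conn.commit()
    return new_rows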
If there is no index on the column, the only way for the DBMS to find the maximum value in the column is a complete table scan, which takes a long time for large tables.
I suppose a DBMS could try to keep track of the minimum and maximum values in the column (storing the values in the system catalog) as it does inserts, updates and deletes - but deletes are why no DBMS I know of tries to keep statistics up to date with per-row operations. If you delete the maximum value, finding the new maximum requires a table scan if the column is not indexed (and if it is indexed, the index makes it trivial to find the maximum value, so the information does not have to be stored in the system catalog). This is why they're called 'statistics'; they're an approximation to the values that apply. But when you request 'SELECT MAX(somecol) FROM sometable', you aren't asking for statistical maximum; you're asking for the actual current maximum.
Have the process that creates the extract file also extract a single-row file with the min/max you want. I assume that piece is scripted on some cron or scheduler, so it shouldn't be too much to ask to add min/max calcs to that script ;)
If not, just do a full scan. A million rows isn't much really, especially in a data warehouse environment.
This code was written for Oracle (ROWNUM is Oracle-specific, though most databases have an equivalent such as LIMIT or FETCH FIRST):
It gets the key of the max(high_val) in the table for the selected range.
select high_val, my_key
  from (select high_val, my_key
          from mytable
         where something = 'avalue'
         order by high_val desc)
 where rownum <= 1
What this says is: sort mytable by high_val descending for values where something = 'avalue', and grab only the top row, which gives you the max(high_val) in the selected range and the my_key for that row.
