While studying for my exam I came across this question.
A website streams movies to customers’ TVs or other devices. Movies are in one of several genres such as action, drama, mystery, etc. Every movie is in exactly one genre (so that if a movie is an action movie as well as a comedy, it is in a genre called “action-comedy”). The site has around 10 million customers and around 25,000 movies, but both are growing rapidly. The site wants to keep track of the most popular movies streamed. You have been hired as the lead engineer to develop a tracking program.
i) Every time a movie is streamed to a customer, its name (e.g. “Harold and Kumar: Escape from Guantanamo Bay”) and genre (“Comedy”) are sent to your program so it can update the data structures it maintains.
(Assume your program can get the current year with a call to an appropriate Java class, in O(1) time.)
ii) Also, every once in a while, customers want to know what were the top k most streamed movies in genre g in year y. (If y is the current year, then accounting is done up to the current date.) For example, what were the top 10 most streamed comedy movies in 2010? Here k = 10, g = “comedy” and y = 2010. This query is sent to your program, which should output the top k movie names.
Describe the data structures and algorithms used to implement both requirements. For (i), analyze the big O running time to update the data structures, and for (ii) the big O running time to output the top k streamed movies.
My thought process was to create a hash table, with every new movie added to a linked list for its genre in the hash table. As for the second part, my only idea is to keep each linked list sorted, but that seems way too expensive. What is a better alternative?
I use a heap to keep track of the top k objects of a class (k fixed). You can find the details of this data structure in any CS text, but basically it's a binary tree in which every node is smaller than both of its children. The main operation, which we will call reheap(node), assumes that both children of node are already heaps; it compares node with the smaller of its two children, swaps them if necessary, and recursively calls reheap on the modified child. The class needs an overloaded operator< or the equivalent defined for this to work.
At any point in time, the heap holds the top k objects, with the smallest of these at the top of the heap. When a new object arrives that is bigger than the top of the heap, it replaces that object and reheap is called on the root. The same operation applies at a node other than the root if an object already on the heap becomes bigger than its smaller child. Another type of update occurs if an object already on the heap becomes smaller than its parent (this probably won't happen in the case you describe): here it gets swapped with its parent, and we then compare recursively against the grandparent, etc.
All of these updates have complexity O(log(k)). If you need to output the heap sorted from the top down, the same structure works well in time O(k log(k)). (This process is known as heapsort.)
Since swapping objects can be expensive, I usually keep the objects in a fixed array somewhere, and implement the heap as an array, A, of pointers, where the children of A[i] are A[2i+1] and A[2i+2].
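For concreteness, here is a minimal Java sketch of such a fixed-size top-k min-heap, using the array-of-indices layout just described (the class and method names are my own, purely illustrative):

/** Tracks the k largest entries of a fixed array of counts.
 *  The heap is an array of indices, so the counted objects are never moved. */
class TopK {
    private final long[] counts;   // the objects' stream counts live here, untouched
    private final int[] heap;      // min-heap of indices; counts[heap[0]] is the smallest of the top k
    private int size = 0;

    TopK(long[] counts, int k) {
        this.counts = counts;
        this.heap = new int[k];
    }

    /** Offer object i for consideration; O(log k). */
    void offer(int i) {
        if (size < heap.length) {                  // heap not yet full: append and bubble up
            heap[size] = i;
            int c = size++;
            while (c > 0 && less(c, (c - 1) / 2)) {
                swap(c, (c - 1) / 2);
                c = (c - 1) / 2;
            }
        } else if (counts[i] > counts[heap[0]]) {  // beats the current minimum: replace the root
            heap[0] = i;
            reheap(0);
        }
    }

    /** reheap(node): both subtrees are heaps; sift node down until the heap property holds. */
    private void reheap(int node) {
        int smallest = node, l = 2 * node + 1, r = 2 * node + 2;
        if (l < size && less(l, smallest)) smallest = l;
        if (r < size && less(r, smallest)) smallest = r;
        if (smallest != node) {
            swap(node, smallest);
            reheap(smallest);
        }
    }

    private boolean less(int a, int b) { return counts[heap[a]] < counts[heap[b]]; }
    private void swap(int a, int b) { int t = heap[a]; heap[a] = heap[b]; heap[b] = t; }
}

Repeatedly swapping the root with the last element and calling reheap on the shrunken heap then yields the k entries in order; that is the O(k log(k)) heapsort step mentioned above.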
You could do this in O(1) using one hash table "HT1" to map from (genre, year, movie_title) to an iterator into a linked list of (num_times_streamed, hash table of movie titles) entries. To record a stream, use the iterator to see whether the next element in the list is for a count one greater: if so, insert your movie title there and remove it from the current entry's table (and if that table becomes empty, remove the entry from the list). Otherwise, if the current entry's hash table holds no other titles, simply increment its num_times_streamed in place; otherwise, insert a new entry with a fresh hash table into the list and add your title to it. Update the iterator recorded in HT1 as necessary.
Note that, as described above, the list operations either work at the end-points or use an existing iterator to step through by no more than one position as the num_times_streamed value is incremented, so they are O(1).
To get the top k titles you'll need a second hash table HT2 from (genre, year) to each of the linked lists: simply iterate backwards from the end of a list and you'll first encounter a hash table holding the movie or movies with the highest streaming count, then the next highest, and so on. If the year has just changed, you may not find k entries; handle that however you like. If a movie title being looked up is found not to exist in HT1, you'd add a new list for that genre and the current year to HT2.
More visually, using { } around hash tables (whether mappings or sets), [ ] around linked lists, and ( ) around grouped struct/tuple data:
HT2 = { "comedy 2015": [ (1, { "title1", "title2" }),  <-------------\
                         (2, { "title3" }),   <-------------------\  |
                         (4, { "title4" }) ],                     |  |
        "drama 2012":  [ (1, { "title5" }),                       |  |
                         (3, { "title6" }) ],                     |  |
        ...                                                       |  |
      };                                                          |  |
                                                                  |  |
HT1 = { "title3", ------------------------------------------------/  |
        "title2", ---------------------------------------------------/
        ...
      };
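Here's a minimal Java sketch of that scheme, assuming a hand-rolled doubly linked list so that stored node references play the role of the iterators. All names are illustrative, and for brevity it always moves a title into a neighbouring bucket rather than using the increment-in-place shortcut described above (still O(1) either way):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StreamCounter {
    /** One list node: a streaming count plus the set of titles currently at that count. */
    static final class Bucket {
        long count;
        final Set<String> titles = new HashSet<>();
        Bucket prev, next;                         // list is kept sorted by ascending count
        Bucket(long count) { this.count = count; }
    }
    /** The per-(genre, year) list; HT2 maps to these. */
    static final class CountList { Bucket head, tail; }

    private final Map<String, Bucket> ht1 = new HashMap<>();     // "genre|year|title" -> its bucket
    private final Map<String, CountList> ht2 = new HashMap<>();  // "genre|year"       -> count list

    /** Record one stream of title in (genre, year); O(1) expected. */
    void streamed(String genre, int year, String title) {
        String listKey = genre + "|" + year;
        String titleKey = listKey + "|" + title;
        CountList list = ht2.computeIfAbsent(listKey, k -> new CountList());

        Bucket cur = ht1.get(titleKey);                   // null on the title's first stream
        long newCount = (cur == null) ? 1 : cur.count + 1;
        Bucket next = (cur == null) ? list.head : cur.next;

        if (next == null || next.count != newCount) {     // no bucket for newCount yet: splice one in
            Bucket b = new Bucket(newCount);
            b.prev = cur;
            b.next = next;
            if (cur == null) list.head = b; else cur.next = b;
            if (next == null) list.tail = b; else next.prev = b;
            next = b;
        }
        next.titles.add(title);
        ht1.put(titleKey, next);

        if (cur != null) {                                // drop the title from its old bucket
            cur.titles.remove(title);
            if (cur.titles.isEmpty()) unlink(list, cur);
        }
    }

    /** Top k titles for (genre, year): walk back from the tail (highest count first). */
    List<String> topK(String genre, int year, int k) {
        List<String> out = new ArrayList<>();
        CountList list = ht2.get(genre + "|" + year);
        for (Bucket b = (list == null) ? null : list.tail; b != null && out.size() < k; b = b.prev)
            for (String t : b.titles) {
                if (out.size() == k) break;
                out.add(t);
            }
        return out;
    }

    private static void unlink(CountList list, Bucket b) {
        if (b.prev == null) list.head = b.next; else b.prev.next = b.next;
        if (b.next == null) list.tail = b.prev; else b.next.prev = b.prev;
    }
}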
If I have a list of names in a sheet, for example:
First Name|Last Name|Something else|
Maria|Miller|...|
John|Doe|...|
Maria|Smith|...|
Marc|Meier|...|
Marc|Park|...|
Maria|Muster|...|
Selene|Mills|...|
Adam|Broker|...|
And then I want a second sheet which shows the list of non-unique first names and their counts, with the list sorted in descending order. So in this example that would be:
First Name|Count
Maria|3
Marc|2
What I found was this example: https://infoinspired.com/google-docs/spreadsheet/sort-by-number-of-occurrences-in-google-sheets/
which sort of partitions the sheet entries by occurrence.
So as of now I have
=UNIQUE(sort(
Names!C3:Names!C12000;
if(len(Names!C3:Names!C12000);countif(Names!C3:Names!C12000;Names!C3:Names!C12000););
0;
2;
1
))
In the first column and
=IF(ISBLANK(A2);;COUNTIF(Names!C3:Names!C12000; A2))
In the second. This does the job somewhat (it still shows the names with count 1), but the second column's formula has to be copied down a cell for each new entry in the first column. Is there a way to tie this up directly in one formula, while also filtering out the names that occur only once?
(Also, the formulas are quite slow. The names sheet has about 11k entries so far, and these formulas make the sheet crash at times, so for now I want to keep them commented out most of the time and only enable them when I need the list. Having the second column be just one formula would be very helpful for that too.)
use:
=QUERY(SORT(QUERY(A2:A, "select A,count(A) group by A"), 2, ), "where Col2>1", )
I think this should work if your headers are in row 1.
=QUERY(QUERY(Sheet1!A:A,"select Col1,COUNT(Col1) where Col1<>'' group by Col1",1),"where Col2>1 label Col2'Count'",1)
I am using RedisGraph with a custom implementation of ioredis.
The query runs 3 to 6 seconds on a database that has millions of nodes. It basically filters (b:brand) by different relationship counts, by adding the following MATCH and WHERE clauses multiple times on different nodes.
(:brand) - 1mil nodes
(:w) - 20mil nodes
(:e) - 10mil nodes
// matching b before this codeblock
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
The full query would look like this.
MATCH (b:brand)
WHERE b.deleted IS NULL
MATCH (b)-[:r1]->(p:p)<-[:r2]-(w:w)
WHERE w.deleted IS NULL
WITH count(DISTINCT w) as count, b
WHERE count >= 0 AND count <= 10
MATCH (c)-[:r3]->(d:d)<-[:r4]-(e:e)
WHERE e.deleted IS NULL
WITH count(DISTINCT e) as count, b
WHERE count >= 0 AND count <= 10
WITH b ORDER by b.name asc
WITH count(b) as totalCount, collect({id: b.id})[$cursor..($cursor+$limit)] AS brands
RETURN brands, totalCount
How can I optimize this query as it's really slow?
A few thoughts:
- Property lookups are expensive; is there a way you can get around all the .deleted checks?
- If possible, can you avoid naming r1, r2, etc.? It's faster when it doesn't have to check the relationship type.
- You're essentially traversing the entire graph several times. If the paths b-->p<--w and c-->d<--e don't overlap, you can include them both in the MATCH statement, separated by a comma, and aggregate both counts at once.
- I don't know if it'll help much, but you don't need to name p and d since you never refer to them.
- This is a very small improvement, but I don't see a reason to check count >= 0.
- Also, I'm sure you have your reasons, but why does the c-->d<--e path matter? This would make more sense to me if it were b-->d<--e to mirror the first portion.
EDIT/UPDATE: A few things I said need clarification:
First bullet:
The fastest lookup is on a node label; up to 4 labels are essentially free. (Well, that's for anchor nodes; it's slower for downstream nodes.)
The second-fastest lookup is on an INDEXED property. My comment above assumed UNINDEXED lookups.
Second bullet: I think I was just wrong here. Relationships are stored as doubly-linked lists grouped by relationship type. Therefore, always specify relationship type for better performance. Similarly, always specify direction.
Third bullet: What I said is generally correct; HOWEVER, beware of Cartesian joins when you have two MATCH patterns separated by a comma. In general, you would only use that structure when the patterns share a common element, like when you want directors, actors, and cinematographers all connected to one movie. Still, there must be no overlap between the paths themselves.
Definitions
A parent record has the type 'P', an ancestor key, and a date interval.
Its child record has the type 'C', an identical ancestor key, and a date interval that matches or falls within its parent's interval.
All records are unique
Parent records can share the same ancestor key, but their date intervals cannot overlap
A parent record can have many child records
Example
Parent records can be:
P, 12345, (1000-01-01, 1000-12-31)
P, 12345, (1001-01-01, 1001-12-31) // No overlapping dates, valid
Valid children for the first parent record can be
C, 12345, (1000-01-01, 1000-12-31) // Matches on everything, valid
C, 12345, (1000-05-05, 1000-09-09) // Matches on ancestor key, is within parent's date interval, valid
Problem
Given a randomized set of records with both parents and children, how can I efficiently categorize the set into different groups of one unique parent and its valid children based on both key and time intervals?
It is guaranteed that for every child, there is one and only one parent. but it is possible that a parent does not have any children.
Brute force solution
Identify all of the parent records in linear time. Then loop through them and pairwise match all of the other records in quadratic time.
Question
Is there a faster approach?
The easiest way is to make a list of P records and a list of C records, and then sort both of them by (ancestor_key, interval.start).
Then you walk through the parent list and, for each parent, extract its children from the child list. Because of the sorting, the parent and child lists will be in corresponding order, so the position of interest in both lists only ever moves forward.
Total complexity is O(n log n), dominated by the sorting.
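A minimal Java sketch of this sort-and-merge approach, with an assumed record layout (all class and field names are illustrative):

import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class RecordGrouper {
    record Rec(char type, long ancestorKey, LocalDate start, LocalDate end) {}
    record Group(Rec parent, List<Rec> children) {}

    static List<Group> group(List<Rec> records) {
        // Split into parents and children in linear time.
        List<Rec> parents = new ArrayList<>(), children = new ArrayList<>();
        for (Rec r : records) (r.type() == 'P' ? parents : children).add(r);

        // Sort both lists by (ancestor_key, interval.start): O(n log n), the dominant cost.
        Comparator<Rec> byKeyAndStart =
                Comparator.comparingLong(Rec::ancestorKey).thenComparing(Rec::start);
        parents.sort(byKeyAndStart);
        children.sort(byKeyAndStart);

        // Walk both lists in lockstep; the child cursor only ever moves forward.
        List<Group> groups = new ArrayList<>();
        int c = 0;
        for (Rec p : parents) {
            // Defensive: skip strays that sort before this parent (shouldn't
            // occur if every child really has exactly one parent).
            while (c < children.size() && byKeyAndStart.compare(children.get(c), p) < 0) c++;
            // Consume the children with this parent's key whose intervals start within it.
            List<Rec> mine = new ArrayList<>();
            while (c < children.size()
                    && children.get(c).ancestorKey() == p.ancestorKey()
                    && !children.get(c).start().isAfter(p.end())) {
                mine.add(children.get(c++));
            }
            groups.add(new Group(p, mine));
        }
        return groups;
    }
}

Because parent intervals with the same key cannot overlap, a child's (key, start) pair is enough to pick out its unique parent once both lists are in this order.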
I see that ClickHouse created multiple directories for each partition key.
Documentation says the directory name format is: partition name, minimum number of the data block, maximum number of the data block, and chunk level. For example, the directory name is 201901_1_11_1.
I think it means that the directory is a part which belongs to partition 201901, has the blocks from 1 to 11 and is on level 1. So we can have another part whose directory is like 201901_12_21_1, which means this part belongs to partition 201901, has the blocks from 12 to 21 and is on level 1.
So I think partition is split into different parts.
Am I right?
Parts -- pieces of a table which store rows. One part = one folder with columns.
Partitions are virtual entities. They don't have physical representation. But you can say that these parts belong to the same partition.
Select does not care about partitions.
Select is not aware of partitioning keys.
BECAUSE each part has special files minmax_{PARTITIONING_KEY_COLUMN}.idx.
These files contain the min and max values of these columns in this part.
These minmax_ values are also stored in memory, in a (C++ vector) list of parts.
create table X (A Int64, B Date, K Int64,C String)
Engine=MergeTree partition by (A, toYYYYMM(B)) order by K;
insert into X values (1, today(), 1, '1');
cd /var/lib/clickhouse/data/default/X/1-202002_1_1_0/
ls -1 *.idx
minmax_A.idx <-----
minmax_B.idx <-----
primary.idx
SET send_logs_level = 'debug';
select * from X where A = 555;
(SelectExecutor): MinMax index condition: (column 0 in [555, 555])
(SelectExecutor): Selected 0 parts by date
SelectExecutor checked the in-memory part list and found 0 parts, because minmax_A.idx = (1,1) and this select needed (555, 555).
CH does not store partitioning key values.
So for example toYYYYMM(today()) = 202002 but this 202002 is not stored in a part or anywhere.
minmax_B.idx stores (18302, 18302) (2020-02-10 == select toInt16(today()))
In my case, I had used groupArray() and arrayEnumerate() for ranking in POPULATE. I thought that POPULATE could run the query over new data per partition (in my case: toStartOfDay(Date)); the total sum of the newly inserted data is correct, but the groupArray() function doesn't work correctly.
I think this happens because when one part is inserted, CH runs groupArray() and ranks each part immediately, and only then merges the parts within one partition, so I won't get exactly the final result of the groupArray() and arrayEnumerate() functions.
To summarize, after a merge:
groupArray(part_1) + groupArray(part_2) is different from groupArray(Partition)
with Partition = part_1 + part_2
The workaround I tried is to insert new data as a single block, e.g. by using groupArray() to reduce the new data to fewer rows than max_insert_block_size=1048576. That works correctly, but it's hard to insert one day's worth of new data as one part, because querying it all while populating one day of data (almost 150M-200M rows) uses too much memory.
But do you have another solution for POPULATE with groupArray() on newly inserted data, such as forcing CH to run POPULATE on each partition rather than on each part, i.e. only after merging all the parts into one partition?
I have a list of nodes representing a history of events for users, forming the following pattern:
()-[:s]->()-[:s]->() and so on
Each of the nodes of the list belongs to a user (is connected via a relationship).
I'm trying to create individual user histories (add a :succeeds_for_user relationship between all events that happened for a particular user, such that each event has only one consecutive event).
I was trying to do something like this to extract nodes that should be in a relationship:
start u = node:class(_class = "User")
match p = shortestPath(n-[:s*..]->m), n-[:belongs_to]-u-[:belongs_to]-m
where n <> m
with n, MIN(length(p)) as l
match p = n-[:s*1..]->m
where length(p) = l
return n._id, m._id, extract(x IN nodes(p): x._id)
but it is painfully slow.
Does anyone know a better way to do it?
Neo4j is calculating a lot of shortest paths there.
Assuming that you have a history start node (which for the purpose of my query has id x), you can get an ordered list of event nodes with the corresponding user id like this:
"START n=node(x) # history start
MATCH p = n-[:FOLLOWS*1..]->(m)<-[:DID]-u # match from start up to user nodes
return u._id,
reduce(id=0,
n in filter(n in nodes(p): n._class != 'User'): n._id)
# get the id of the last node in the path that is not a User
order by length(p) # ordered by path length, thus place in history"
You can then iterate the result in your program and add relationships between nodes belonging to the same user. I don't have a fitting big dataset, but it might be faster.