Sorting in tarantool with min() if several records have equal secondary index

local orders = box.schema.space.create('orders')
box.schema.sequence.create('orderId')
-- primary index, ids generated from the sequence
orders:create_index('id', {sequence='orderId'})
-- non-unique secondary index on the price field (field 2)
orders:create_index('price', {unique=false, parts={2, 'integer'}})
-- best order = the one with the minimal price
local bestOrder = orders.index.price:min()
I'm searching for the best order (the one with the minimal price) with the min() function over a secondary index. How does Tarantool sort records if they have the same price (the same secondary index key)? I tested it, and it looks like they are ordered by primary index. Is this behaviour guaranteed?

Tarantool has several types of indexes [1], and some of them are sorted while others are not. For instance, the tree index [2] is sorted, which means you can do bounded (range) queries, while the hash index [3] is not sorted, which means you cannot get bounds without copying and sorting the data yourself.
In your case you use a tree index, so you can use min, max, bounded selects, etc. on the 'price' index.
UPDATE (after some talks via email)
If several records have an equal secondary-index key, Tarantool returns the first tuple in primary-key order for min() and the last one for max().
This behavior is normal for Tarantool, and you can count on it not being changed in the near future.
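To make the tie-breaking visible, here is a small sketch that assumes the schema from the question (the ids and prices are illustrative; with the sequence attached you could also pass box.NULL instead of an explicit id):
-- Four orders, two pairs with equal prices.
box.space.orders:insert{1, 100}
box.space.orders:insert{2, 100}
box.space.orders:insert{3, 200}
box.space.orders:insert{4, 200}
-- min() over the equal prices 100 returns the tuple with the smallest
-- primary key, max() over the equal prices 200 the one with the largest.
box.space.orders.index.price:min()   -- {1, 100}
box.space.orders.index.price:max()   -- {4, 200}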
[1] https://tarantool.org/en/doc/2.0/book/box/box_index.html?highlight=index#module-box.index
[2] https://en.wikipedia.org/wiki/B-tree
[3] https://en.wikipedia.org/wiki/Hash_table

Related

Oracle Reverse Primary Keys

I am trying to create a table with a reverse index on my primary key column.
The table is created, and I have inserted a lot of data, generating a key value with a sequence.
From what I understand about reverse keys (it takes the nextval, reverses it, then inserts), I was expecting to see the key value reversed in my select statement.
If nextval was 112, when I select from the table I was expecting to see 211. But I still see 112.
Is it still implementing the reverse key index, and Oracle is just displaying in non-reversed format?
Or is something actually wrong?
The SQL i used for the index is
CREATE UNIQUE INDEX "<schema>"."<index_name>" ON "<schema>"."<table_name>" ("SYS_I") REVERSE;
A reverse key index does not change a key value.
Only its physical representation stored on disk is changed.
From the documentation
Reverse Key Indexes
A reverse key index is a type of B-tree index that physically reverses the bytes of each index key while keeping the column order. For example, if the index key is 20, and if the two bytes stored for this key in hexadecimal are C1,15 in a standard B-tree index, then a reverse key index stores the bytes as 15,C1.
Reversing the key solves the problem of contention for leaf blocks in the right side of a B-tree index. This problem can be especially acute in an Oracle Real Application Clusters (Oracle RAC) database in which multiple instances repeatedly modify the same block. For example, in an orders table the primary keys for orders are sequential. One instance in the cluster adds order 20, while another adds 21, with each instance writing its key to the same leaf block on the right-hand side of the index.
In a reverse key index, the reversal of the byte order distributes inserts across all leaf keys in the index. For example, keys such as 20 and 21 that would have been adjacent in a standard key index are now stored far apart in separate blocks. Thus, I/O for insertions of sequential keys is more evenly distributed.
Because the data in the index is not sorted by column key when it is stored, the reverse key arrangement eliminates the ability to run an index range scanning query in some cases. For example, if a user issues a query for order IDs greater than 20, then the database cannot start with the block containing this ID and proceed horizontally through the leaf blocks.
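As a toy illustration of the idea (plain Lua, not Oracle's actual on-disk key format), reversing the bytes of sequential keys spreads them across the key space:
-- Lua strings are plain byte strings, so string.reverse() reverses bytes.
local function reverse_bytes(key)
  return key:reverse()
end
print(reverse_bytes("0020"))   -- "0200"
print(reverse_bytes("0021"))   -- "1200"
-- "0020" and "0021" are adjacent in a standard index, but "0200" and
-- "1200" sort far apart, so inserts of sequential keys hit different
-- leaf blocks.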

How does hash table get(key) work when multiple keys are stored with linked nodes?

I am aware of how a hash table works, but I am not sure about a possible implementation of get(key) when multiple values are stored at the same index with the help of a linked list.
For example:
set(1,'Val1') gets stored at index 7.
set(2,'Val2') also gets stored at index 7 (the internal implementation creates a linked list and stores a pointer to it at index 7; that's understandable).
But suppose I now call get(2). How does the hash table know which value to return? My hash function will resolve this to index 7, but at index 7 there are 2 values.
One possible way is to store both the key and the value in each linked node.
Is there any other possible implementation?
Go through the linked list and do a linear search for the key '2'. The properties of the hash function and the hash table size should guarantee that these lists' lengths are O(1) on average.
I think you missed the fact that hash tables have to store their keys. The hash function is only for speeding up insertion/lookup.
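A minimal separate-chaining sketch in Lua (illustrative, not any particular library's implementation): each node stores both the key and the value, and get() walks the chain comparing keys.
-- Each bucket is a linked list of {key, value, next} nodes.
local HashTable = {}
HashTable.__index = HashTable

function HashTable.new(size)
  return setmetatable({size = size, buckets = {}}, HashTable)
end

local function bucket_index(self, key)
  return key % self.size                 -- toy hash for integer keys
end

function HashTable:set(key, value)
  -- For brevity, set() always prepends a new node; a real implementation
  -- would update the value if the key already exists in the chain.
  local i = bucket_index(self, key)
  self.buckets[i] = {key = key, value = value, next = self.buckets[i]}
end

function HashTable:get(key)
  local node = self.buckets[bucket_index(self, key)]
  while node do                          -- linear search within the bucket
    if node.key == key then return node.value end
    node = node.next
  end
  return nil
end

local t = HashTable.new(7)
t:set(1, 'Val1')                         -- 1 % 7 == 1
t:set(8, 'Val2')                         -- 8 % 7 == 1, same bucket
print(t:get(8))                          -- 'Val2'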

Data Structure for Multilevel Key

What is the best data structure for efficient multilevel search, when the key is a function of multiple values?
Say my Key is [Row Key] [Column Name] [ TimeStamp ]
Maybe you are looking for a multi-dimensional index or aggregate index?
Aggregate indexes combine several keys (timestamp, row key, ...) into a single key which is then stored in a normal search index (Hashmap, B-tree or similar). Aggregation can be done by concatenating the keys or interleaving their bits.
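A small sketch of the concatenation approach in Lua (the separator, padding widths and field order are assumptions): each part is padded so that plain string comparison preserves the intended ordering, and the combined key can then go into any ordinary sorted index.
-- Build one composite key from [Row Key][Column Name][TimeStamp]. The
-- parts are padded so lexicographic order matches per-part order
-- (assuming the strings fit in the chosen widths).
local function composite_key(row_key, column_name, timestamp)
  return string.format("%-16s|%-16s|%020d", row_key, column_name, timestamp)
end
print(composite_key("user42", "balance", 1700000000))
-- Composite keys like this can be range-scanned by prefix, e.g. all
-- columns of one row key, or one (row, column) over a time range.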
Multidimensional indexes allow indexing and searching several attributes at the same time. For example, if you have [ID] [attr1] [attr2] [timestamp] then you could use a 4-dimensional index. You would have to turn all values into integers or floats (for Strings you could turn the first few characters into an integer). For look-up, you can simply leave those dimensions empty where you don't know the value.
Examples of multi-dimensional indexes are the R*-tree, X-tree and PH-tree. There are also the kd-tree and quadtrees; they are quite simple to implement but not really that good for large datasets or many dimensions.

Does MongoDB store counts for non-unique indexes?

Forgive the SQL syntax since I'm brand new to mongo, but if I did the equivalent of
SELECT count(*) FROM table WHERE indexed_field=val
In MongoDB, will it run in O(1) time or O(N) time, where N is the number of matches? It seems the answer is O(N), based on this commit and the fact that this says it only increased performance by 20 times (whereas maintaining a count would be a far greater speed-up), but I'm not sure.
Just wondering if I should cache counts for large counts. It seems like the answer is yes.
MongoDB's indexes currently do not store a count per index (or collection), and it makes no difference whether they are unique or non-unique. In order for MongoDB to find out how many documents match, it needs to do an index traversal, which operates in O(N).
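If you do decide to cache counts, a minimal sketch of the idea in Lua (illustrative; in MongoDB you would keep the counter in its own document and update it alongside the writes):
-- Maintain a running count per indexed value so reads are O(1),
-- at the cost of one extra update on every insert/delete.
local counts = {}                        -- indexed_field value -> count

local function on_insert(value)
  counts[value] = (counts[value] or 0) + 1
end

local function on_delete(value)
  counts[value] = (counts[value] or 0) - 1
end

local function cached_count(value)       -- O(1) instead of an index traversal
  return counts[value] or 0
end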

Designing relational system for large scale

I've been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say mysql). Let's say it allows users to make posts and these are stored in the post table (has fields: postid, posterid, data, timestamp). So, when you go to retrieve all posts by you sorted by recency, you simply get all posts with posterid = you and order by date. Simple enough.
This process will use timestamp as the index, since it has the highest cardinality, and rightly so. So, beyond looking into the index, it will take literally one row fetch from disk to complete this task. Awesome!
But let's say 1 million more posts have been made (in the system) by other users since you last posted. Then, in order to get your latest post, the database will walk the timestamp index again, and it's not like we know how many posts have happened since then (or should we at least manually estimate and set a preferred key)? Then we have wasted looking at a million and one rows just to fetch a single row.
Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query: ... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}.
Finding all posts of a given user is done by a range scan on the index's leading edge (posterid).
Finding user's oldest/newest post can be done in a single index seek, which is proportional to the B-Tree height, which is proportional to log(N) where N is number of indexed rows.
To understand why, take a look at Anatomy of an SQL Index.
Clustering
The leaves of a "normal" B-Tree index hold "pointers" (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the "table heap". The heap can be eliminated by storing rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
CREATE TABLE T (
posterid int,
`timestamp` DATETIME,
data VARCHAR(50),
PRIMARY KEY (posterid, `timestamp`)
);
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven't used the surrogate key (postid), since secondary indexes in clustered tables can be expensive and we already have a natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
For queries like
where posterid = 5
order by timestamp
or
where posterid in (4, 578, 222299, ...etc...)
order by timestamp
make an index on (posterid, timestamp) and the database should pick it all by itself.
edit - I just tried this with MySQL:
CREATE TABLE `posts` (
`id` INT(11) NOT NULL,
`ts` INT NOT NULL,
`data` VARCHAR(100) NULL DEFAULT NULL,
INDEX `id_ts` (`id`, `ts`),
INDEX `id` (`id`),
INDEX `ts` (`ts`),
INDEX `ts_id` (`ts`, `id`)
)
ENGINE=InnoDB
I filled it with a lot of data, and
explain
select * from posts where id = 5 order by ts
picks the id_ts index
Assuming you use hash tables to implement your database - yes. Hash tables are not ordered, and you have no other way but to iterate over all elements in order to find the maximal one.
However, if you use some ordered data structure, such as a B+ tree (which is actually pretty well optimized for disks and thus for databases), it is a different story.
You can store elements in your B+ tree ordered by user (primary comparator) and date (secondary comparator, descending). Once you have this data structure, finding the first element can be achieved in O(log(n)) disk seeks by finding the first element matching the primary criterion (user id).
I am not familiar with the implementations of databases, but AFAIK some of them do allow you to create an index based on a B+ tree, and by doing so you can find the last post of a user more efficiently.
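As an illustration, the same composite ordering can be expressed with the tree index of the Tarantool question at the top of this page (a sketch; the space, field positions and types are assumptions, not part of this question):
-- A tree index on (posterid, timestamp) lets the newest or oldest post
-- of one user be found with a single index seek.
local posts = box.schema.space.create('posts')
posts:create_index('pk', {parts = {1, 'unsigned'}})
posts:create_index('user_ts', {unique = false, parts = {2, 'unsigned', 3, 'unsigned'}})
-- Newest post of user 5: iterate the composite index in reverse,
-- starting from the largest (posterid = 5, timestamp) entry.
local newest = posts.index.user_ts:select({5}, {iterator = 'REQ', limit = 1})[1]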
P.S.
To be exact, the concept of a "greatest" element, or of ordering in general, is not well defined in relational algebra: there is no max and no sort operator in strict relational algebra (though both exist in SQL). To get the max of a table R with a single column a, one can take the Cartesian product of the table with itself and subtract every value that is smaller than some other value.
(Assuming set, and not multiset, semantics, and writing R1 and R2 for the two copies of R in the product):
MAX = R \ Project(Select(R1 x R2, R1.a < R2.a), R1.a)
