Dual-keyed map without additional data, is it possible? - algorithm

Let's assume we have a huge map where each element needs to be accessible by 2 different keys, K1 and K2. We have both K1 and K2 when we add data to the tree, but we need to retrieve data using either K1 or K2. In other words, looking up with (K1, K2=ignored) and looking up with (K1=ignored, K2) should both return the data stored under (K1, K2). Is it possible to do this with a suitable comparison method alone, without duplicating data or using a secondary map to record the relationship between K1 and K2 (these 2 approaches are obvious, but both need secondary data)? What about hash maps, since hash maps need both comparison and hash methods?

You can solve this the following way.
Let's say you maintain a single Map<Key, Value> to store your info.
When you want to store 2 keys K1 and K2 against a value, store them in a hierarchical form with a parent-child relationship between K1 and K2.
For example, the map structure would look like this:
K1 - Value
K2 - <child_of_identifier>K1
So, whenever you want to query using K1 or K2, you can do the following:
if (map.containsKey(key)) {
    if (map.get(key).startsWith("<child_of_identifier>")) {
        // Parse the parent key out of the marker value
        return map.get(parentKey);
    } else {
        return map.get(key);
    }
}
About the solution:
Does not duplicate data
Does not need another auxiliary map
Requires two lookups for child keys
Flat storage of keys instead of merging the 2 keys
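A minimal sketch of this scheme in Java, assuming string values so the marker prefix can be embedded directly (the class, marker constant, and method names are illustrative, not from the original answer):

import java.util.HashMap;
import java.util.Map;

public class DualKeyedMap {
    private static final String CHILD_MARKER = "<child_of_identifier>";
    private final Map<String, String> map = new HashMap<>();

    // Store the value under K1 and record K2 as a pointer to K1 in the same map
    public void put(String k1, String k2, String value) {
        map.put(k1, value);
        map.put(k2, CHILD_MARKER + k1);
    }

    // Retrieve by either key; follow the pointer when the stored entry is a child marker
    public String get(String key) {
        String stored = map.get(key);
        if (stored == null) {
            return null;
        }
        if (stored.startsWith(CHILD_MARKER)) {
            return map.get(stored.substring(CHILD_MARKER.length()));
        }
        return stored;
    }
}

As noted above, a lookup by the child key K2 costs two gets: it first resolves the marker entry and then fetches the value stored under K1.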

Related

Delete Caffeine entries based on a timestamp condition

Is there a way to remove Caffeine entries based on a timestamp condition? E.g.,
At T1 I have the following entries
K1 -> V1
K2 -> V2
K3 -> V3
At time T2 I update only K2 and K3. (I don't know if both entries will have the exact same timestamp; K2 might have T2 but K3 might be T2 + some nanos. But for the sake of this question let's assume they do.)
Now I want Caffeine to invalidate the entry K1 -> V1 because T1 < T2.
One way to do this is to iterate over entries and check if their write timestamp is < T2. Collect such keys and in the end call invalidateKeys(keys).
Maybe there is a non-iterative way?
If you are using expireAfterWrite, then you can obtain a snapshot of entries in timestamp order. As this call requires obtaining the eviction lock, it provides an immutable snapshot rather than an iterator. That is messy, e.g. you have to provide a limit which might not be correct and it depends on expiration.
Duration maxAge = Duration.ofMinutes(1);
cache.policy().expireAfterWrite().ifPresent(policy -> {
  Map<K, V> oldest = policy.oldest(1_000);
  for (K key : oldest.keySet()) {
    // Remove everything written more than 1 minute ago
    policy.ageOf(key)
        .filter(duration -> duration.compareTo(maxAge) > 0)
        .ifPresent(duration -> cache.invalidate(key));
  }
});
If you maintain the timestamp yourself, then an unordered iteration is possible using the cache.asMap() view. That's likely the simplest and fastest approach.
long cutoff = ...
var keys = cache.asMap().entrySet().stream()
    .filter(entry -> entry.getValue().timestamp() < cutoff)
    .collect(toList());
cache.invalidateAll(keys);
An approach that won't work, but worth mentioning to explain why, is variable expiration, expireAfter(expiry). You can set a new duration on every read based on the prior setting. This takes effect after the entry is returned to the caller, so while you can expire immediately it will serve K1 (at least) once.
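For illustration only, a rough sketch of that variable-expiration idea (the value type with its timestamp() accessor and the cutoff parameter are assumptions, not from the original answer). Returning a zero duration on read expires the entry, but only after that read has already been served:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.Expiry;
import java.util.concurrent.TimeUnit;

class VariableExpirationSketch {
  // Hypothetical value type that carries its own write timestamp
  interface Timestamped { long timestamp(); }

  static Cache<String, Timestamped> build(long cutoff) {
    return Caffeine.newBuilder()
        .expireAfter(new Expiry<String, Timestamped>() {
          @Override public long expireAfterCreate(String key, Timestamped value, long currentTime) {
            return TimeUnit.MINUTES.toNanos(1);
          }
          @Override public long expireAfterUpdate(String key, Timestamped value,
              long currentTime, long currentDuration) {
            return TimeUnit.MINUTES.toNanos(1);
          }
          @Override public long expireAfterRead(String key, Timestamped value,
              long currentTime, long currentDuration) {
            // Expire stale entries immediately, but this read has already returned the value
            return value.timestamp() < cutoff ? 0 : currentDuration;
          }
        })
        .build();
  }
}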
Otherwise you could validate at retrieval time outside of the cache and rely on size eviction. The flaw with this approach is that it does pollute the cache with dead entries.
V value = cache.get(key);
if (value.timestamp() < cutoff) {
  cache.asMap().remove(key, value);
  return cache.get(key); // load a new value
}
return value;
Or you could maintain your own write-order queue, etc. All of these get messy the fancier you get. For your case, likely a full iteration is the simplest and least error-prone approach.

Simpler alternative to simultaneously Sort and Filter by column in Google Spreadsheets

I have a spreadsheet (here's a copy) with the following (headered) columns:
A: Indices for a list of groceries;
B: Names for the groceries to be indexed by column A;
C: Check column with "x" for inactive items in column B, empty otherwise;
D: Sorting indices that I want to apply to column B;
Currently, I am getting the sorted AND filtered result with this formula:
=SORT(FILTER(B2:B; C2:C = ""); FILTER(D2:D; C2:C = ""); TRUE)
The problem is that I need to apply the filter twice: once for the items and once for the indices, otherwise I get a mismatch between elements for the SORT function.
I feel that this doesn't scale well since it creates duplication.
Is there a way to get the same results with a simpler formula or another arrangement of columns?
=SORT(FILTER({Itens!B2:B\Itens!G2:G}; Itens!D2:D=""))
=SORT(FILTER({Itens!B2:B\Itens!G2:G}; Itens!D2:D="");2;1)
or maybe: =SORT(FILTER(Itens!B2:B; Itens!D2:D="");2;1)

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is, how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given (user, product) input like the above, how do I convert it to something like
1,1
2,2
1,3
where userA is mapped to row 1 of the matrix, userB is mapped to row 2, productX is mapped to column 1, and so on.
Given the size of data, I would have to use Hadoop Map-Reduce but can't think of a foolproof way of efficiently doing this.
This can be solved if we can do the following:
Dump unique userIds.
Dump unique productIds.
Map each unique userId in (1) to a row index.
Map each unique productId in (2) to a column index.
I can do (1) and (2) easily but am having trouble coming up with an efficient approach to solve (3) ((4) will be solved if we solve (3)).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient since all 100 million userIds would be processed by a single reducer.
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key which is an integer uniformly sampled from 1, 2, 3, ..., N (where N is configurable; N = 100 for example). In a way, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters in the mapping stage to determine how many IDs were assigned to each partition. Use these counters to determine the start and end values for that partition.
Iterate (while counting) over each userId in reduce and generate matrix rowId as start_of_partition + counter.
context.write(userId, matrixRowId)
This method should work, but I am not sure how to handle cases where reducer tasks fail or are killed.
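A minimal sketch (plain Java outside Hadoop, with made-up partition counts) of the offset arithmetic in Solution 2: the per-partition counts from the mapper-side counters are turned into start offsets by a prefix sum, and each ID then gets startOfPartition + localCounter.

import java.util.Map;
import java.util.TreeMap;

public class PartitionOffsets {
    public static void main(String[] args) {
        // Counts of userIds per random partition, as reported by the mapper-side counters
        Map<Integer, Long> countsPerPartition = new TreeMap<>(Map.of(1, 3L, 2, 2L, 3, 4L));

        // Prefix sum: a partition's start offset is the total count of all earlier partitions
        Map<Integer, Long> startOffsets = new TreeMap<>();
        long runningTotal = 0;
        for (Map.Entry<Integer, Long> entry : countsPerPartition.entrySet()) {
            startOffsets.put(entry.getKey(), runningTotal);
            runningTotal += entry.getValue();
        }

        // Inside the reducer for partition 2, the i-th userId it sees gets row id startOffsets.get(2) + i
        for (long localCounter = 0; localCounter < countsPerPartition.get(2); localCounter++) {
            System.out.println("rowId = " + (startOffsets.get(2) + localCounter));
        }
    }
}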
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?

How to retrieve nth element from the sorted keys of sorted TreeMap?

I am using a TreeMap as I want to store sorted keys. I have also passed a comparator to define the sort order. Now, I want to retrieve the 2nd key from the map. How do I go about doing it? The TreeMap is as given below:
private TreeMap<Coupon, LineItem> couponVsDiscountLine = new TreeMap<>((c1, c2) -> c1.weight().compareTo(c2.weight()));
Getting the sorted keys from TreeMap :
TreeSet<Coupon> coupons = (TreeSet<Coupon>) couponVsDiscountLine.keySet();
There is no method in TreeSet to get(index) as the elements in TreeSet are not indexed.
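For illustration (not from the original question, just one common workaround): since the key set is only a sorted view, the nth key has to be reached by iteration, for example by skipping the keys before it.

// Skip ahead in the sorted key view to reach the 2nd key (index 1)
Coupon secondKey = couponVsDiscountLine.keySet().stream()
        .skip(1)
        .findFirst()
        .orElse(null);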
Another question: which Set does the keySet() method of TreeMap return? How does TreeMap store the keys internally?
I read in some post that a TreeMap or TreeSet does not maintain the order if any modification is done on it. Does that mean that retrieving elements may not give them in the order specified by the comparator?

Is relational database able to leverage the way of consistent hashing to do the partition table?

Assume we have a user table to be partitioned by user id, an integer 1, 2, 3, ..., n. Can I use consistent hashing to partition the table?
The benefit would be that if the number of partitions is increased or decreased, most existing rows could keep their old partition assignment.
Question A.
Is it a good idea to use consistent hashing algorithm to do the partition table?
Question B.
Does any relational database have built-in support for this?
I guess some NoSQL databases already use it.
But 'database' here refers to relational databases.
I just encountered this question in an interview. My first reaction was to answer 'mod by the number of partitions', but I was then challenged: if the table is later split into more partitions, that causes problems.
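A small illustration (made-up numbers, not from the interview) of why plain mod-N is problematic: when the partition count changes, most keys land in a different partition, whereas consistent hashing only moves the keys adjacent to the added or removed node.

public class ModReshuffle {
    public static void main(String[] args) {
        int keys = 1_000_000;
        int moved = 0;
        // Count how many keys change partition when going from 4 to 5 partitions
        for (int id = 1; id <= keys; id++) {
            if (id % 4 != id % 5) {
                moved++;
            }
        }
        // Prints 80.0% — almost the whole table would have to be reshuffled
        System.out.printf("%.1f%% of keys move%n", 100.0 * moved / keys);
    }
}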
After researching some wiki reference pages like Partition (database),
I believe my idea belongs to composite partitioning.
Composite partitioning allows for certain combinations of the above partitioning schemes, by for example first applying a range partitioning and then a hash partitioning. Consistent hashing could be considered a composite of hash and list partitioning where the hash reduces the key space to a size that can be listed.
It also introduces related concepts like consistent hashing and hash tables.
But some links like Partition (database) are kind of old; if someone can find a more recent reference, that would be better. My answer is incomplete indeed. I hope someone can answer it better!
UPDATE
Looks like Jonathan Ellis already mentioned this in his blog: the Cassandra distributed database supports two partitioning schemes now, the traditional consistent hashing scheme and an order-preserving partitioner.
http://spyced.blogspot.com/2009/05/consistent-hashing-vs-order-preserving.html
From Tom White's blog, a sample implementation in Java of consistent hashing:
import java.util.Collection;
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHash<T> {

    private final HashFunction hashFunction;
    private final int numberOfReplicas;
    private final SortedMap<Integer, T> circle = new TreeMap<Integer, T>();

    public ConsistentHash(HashFunction hashFunction, int numberOfReplicas,
                          Collection<T> nodes) {
        this.hashFunction = hashFunction;
        this.numberOfReplicas = numberOfReplicas;
        for (T node : nodes) {
            add(node);
        }
    }

    // Place numberOfReplicas virtual points for the node on the hash ring
    public void add(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.put(hashFunction.hash(node.toString() + i), node);
        }
    }

    public void remove(T node) {
        for (int i = 0; i < numberOfReplicas; i++) {
            circle.remove(hashFunction.hash(node.toString() + i));
        }
    }

    // Walk clockwise from the key's hash to the first node on the ring
    public T get(Object key) {
        if (circle.isEmpty()) {
            return null;
        }
        int hash = hashFunction.hash(key);
        if (!circle.containsKey(hash)) {
            SortedMap<Integer, T> tailMap = circle.tailMap(hash);
            hash = tailMap.isEmpty() ? circle.firstKey() : tailMap.firstKey();
        }
        return circle.get(hash);
    }
}
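The HashFunction type is not defined in the snippet above; assuming it is a small single-method interface (the demo class, node names, and the hashCode-based hash below are illustrative), a usage sketch might look like this:

import java.util.List;

// Assumed shape of the hash function used by ConsistentHash
interface HashFunction {
    int hash(Object key);
}

class ConsistentHashDemo {
    public static void main(String[] args) {
        // A deliberately simple hash for illustration; a real ring would want a better-spread hash such as MD5
        HashFunction hashFunction = key -> key.hashCode();

        ConsistentHash<String> ring =
            new ConsistentHash<>(hashFunction, 100, List.of("node-A", "node-B", "node-C"));

        // A key maps to the first node clockwise from its hash on the ring
        System.out.println(ring.get("user:42"));

        // Removing a node only remaps the keys that were owned by it
        ring.remove("node-B");
        System.out.println(ring.get("user:42"));
    }
}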
About Oracle hash partitioning (an excerpt from the Oracle help doc follows below).
After some research, Oracle actually does support consistent hashing through its default hash partitioning, though how it does so is a secret and not published. It effectively leverages a HashMap-like approach but hides some partitions, so when you add or remove a partition, Oracle has very little work to do to adjust the data across partitions. The algorithm only ensures an even split of data when the number of partitions is a power of 2, such as 4; otherwise it merges or splits some partitions.
The magic is that to increase from four partitions to five, it actually splits one partition into two, and to decrease from four partitions to three, it actually merges two partitions into one.
If anyone has more insight, please add a more detailed answer.
Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle applies to the partitioning key that you identify. The hashing algorithm evenly distributes rows among partitions, giving partitions approximately the same size.
Hash partitioning is the ideal method for distributing data evenly across devices. Hash partitioning is also an easy-to-use alternative to range partitioning, especially when the data to be partitioned is not historical or has no obvious partitioning key.
Note:
You cannot change the hashing algorithms used by partitioning.
About MySQL hash partitioning, with excerpts from the MySQL help doc.
It provides two partitioning functions:
One is partition by HASH.
The other is partition by KEY.
Partitioning by key is similar to partitioning by hash, except that where hash partitioning employs a user-defined expression, the hashing function for key partitioning is supplied by the MySQL server. MySQL Cluster uses MD5() for this purpose; for tables using other storage engines, the server employs its own internal hashing function which is based on the same algorithm as PASSWORD().
The syntax rules for CREATE TABLE ... PARTITION BY KEY are similar to those for creating a table that is partitioned by hash.
The major differences are listed here:
•KEY is used rather than HASH.
•KEY takes only a list of one or more column names. Beginning with MySQL 5.1.5, the column or columns used as the partitioning key must comprise part or all of the table's primary key, if the table has one.
CREATE TABLE k1 (
    id INT NOT NULL PRIMARY KEY,
    name VARCHAR(20)
)
PARTITION BY KEY()
PARTITIONS 2;
If there is no primary key but there is a unique key, then the unique key is used for the partitioning key:
CREATE TABLE k1 (
    id INT NOT NULL,
    name VARCHAR(20),
    UNIQUE KEY (id)
)
PARTITION BY KEY()
PARTITIONS 2;
However, if the unique key column were not defined as NOT NULL, then the previous statement would fail.
But it doesn't tell how it partitions; one would have to look into the code.
