Get weight of a key in redis - sorting

I'm implementing a sorting algorithm using zsets in Redis, and I would like to know how much space each key is using.
Is there a Redis command to find out how big (in bytes) a set is?

In Redis v4, you can use the MEMORY USAGE command to do just that.
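For example, from Scala with the Jedis client this might look like the sketch below. It assumes a Jedis version recent enough to expose a memoryUsage helper (older clients would have to send the raw MEMORY USAGE command); the key name is illustrative.

```scala
import redis.clients.jedis.Jedis

object MemoryUsageExample {
  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)
    try {
      // Corresponds to the MEMORY USAGE command (Redis 4.0+).
      // Assumed helper on the client; returns null if the key does not exist.
      val bytes: java.lang.Long = jedis.memoryUsage("myzset")
      println(s"myzset uses about $bytes bytes")
    } finally jedis.close()
  }
}
```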

I think you can estimate it yourself. In Redis, almost everything is stored as a string except integers. In a zset every member has a score: if the score is a 32-bit integer it takes 4 bytes, while a float takes 8 bytes. The member is a string, so you can estimate with the average member length; for example, if the average length is 10, a member is about 10 bytes, giving roughly 14 bytes per entry. You can use ZCARD (or ZCOUNT over the full range) to get the number of members in the zset, and multiplying gives you a minimum for the space taken. Because a zset is maintained with a skip list and a hash table, there will be extra space used by those data structures.
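To turn that estimate into code, here is a minimal sketch in Scala with Jedis, using the same assumptions as above (an assumed average member length, 8 bytes per score, skip-list and hash-table overhead ignored):

```scala
import redis.clients.jedis.Jedis

object ZsetSizeEstimate {
  // Lower-bound estimate: (average member length + 8-byte score) per entry,
  // times the number of entries. Structure overhead is not included.
  def approxBytes(jedis: Jedis, key: String, avgMemberLen: Long = 10L): Long = {
    val members: Long = jedis.zcard(key) // number of (member, score) pairs
    members * (avgMemberLen + 8L)
  }

  def main(args: Array[String]): Unit = {
    val jedis = new Jedis("localhost", 6379)
    try println(s"approx bytes: ${approxBytes(jedis, "myzset")}")
    finally jedis.close()
  }
}
```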

Related

What column type in Laravel is ideal to store values from 0 to 60 and how do I mitigate the excess storage space

What datatype would be ideal for storing values from 0 to 60 in the database? The options I've seen in the documentation are $table->unsignedTinyInteger('votes'); and $table->unsignedSmallInteger('votes'); but even these seem larger than what is needed.
So, is there a way to specify the range of values I can store in a column?
Documentation ref: Laravel 7
Stick with unsigned tiny int. Its size is 1 byte (so it would take over a million entries to reach 1 MB). Trying to over-optimize column datatypes will just lead to a lot of headaches in the future.

Redis GEORADIUS with one ZSET versus a lot of ZSETs of particular size

What will work faster: one big ZSET with geodata that I query with GEORADIUS for a 100m radius,
OR
a lot of ZSETs, where each ZSET is responsible for a 100m x 100m square covering the whole world, named after these 100m squares like:
left_corner1_49_2440000_28_5010000
left_corner2_49_2450000_28_5010000
.......
and have everything within 100 meters to the right and below stored inside each set.
So when searching for the nearest point I'll just drop the redundant digits from the GPS coordinates: 49.2440408, 28.5011694 becomes
49.2440000, 28.5010000, and this way I'll know the name of the ZSET from which to get all the exact values with 100-meter precision.
Or, to put the question in general form: how are ZSET names stored and accessed in Redis? If I have too many ZSETs, will it impact performance when accessing them?
A precise comparison of these approaches could only be done with a benchmark, and it would be specific to your dataset and configuration. But architecturally speaking, the pros and cons are:
BIG ZSET: less bandwidth and fewer operations (CPU cycles) to execute, no problems on borders (with many ZSETs you can get duplicates), can gain throughput with sharding;
MANY ZSETS: lower latency for other operations (while a big ZSET query is running, other commands are waiting), can gain throughput with sharding AND lower latency with clustering.
As for the bottom-line question: I haven't seen your implementation code, but set names are just keys like any other keys you use. This is what the Redis FAQ says about the number of keys:
What is the maximum number of keys a single Redis instance can hold? <...>
Redis can handle up to 2^32 keys, and was tested in practice to handle
at least 250 million keys per instance.
UPDATE:
Look at what the Redis docs say about GEORADIUS:
Time complexity: O(N+log(M)) where N is the number of elements inside
the bounding box of the circular area delimited by center and radius
and M is the number of items inside the index.
This means that items outside your query add only an O(log(M)) cost to it: roughly 17 hops for 10M items or 21 hops for 1B items, which is quite affordable. The question left is whether you will partition between nodes.
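If you go the many-ZSETs route, the bucketing itself is just key-name arithmetic. Below is a minimal sketch in Scala using the truncation scheme and key format from the question; the "bucket_" prefix and the 0.001-degree step are assumptions taken from the question, not anything Redis-specific.

```scala
object GeoBuckets {
  // ~0.001 degrees of latitude is roughly 111 m, which matches the question's
  // truncation (49.2440408 -> 49.2440000).
  private val Step = 0.001

  // Derive the name of the ZSET bucket that covers the given coordinate.
  // Real code would need to handle floating-point rounding at bucket borders
  // and also query neighbouring buckets when the point is near an edge.
  def bucketKey(lat: Double, lon: Double): String = {
    val latTrunc = math.floor(lat / Step) * Step
    val lonTrunc = math.floor(lon / Step) * Step
    f"bucket_${latTrunc}%.7f_${lonTrunc}%.7f".replace('.', '_')
  }
}

// GeoBuckets.bucketKey(49.2440408, 28.5011694) == "bucket_49_2440000_28_5010000"
```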

Using ChronicleMap as a key-value database

I would like to use a ChronicleMap as a memory-mapped key-value database (String to byte[]). It should be able to hold on the order of 100 million entries. Reads/gets will happen much more frequently than writes/puts, with an expected write rate of less than 10 entries/sec. While the keys would be similar in length, the length of the value could vary widely: it could be anything from a few bytes up to tens of MB. Still, the majority of values will have a length between 500 and 1000 bytes.
Having read a bit about ChronicleMap, I am amazed by its features and am wondering why I can't find articles describing it being used as a general key-value database. To me there seem to be a lot of advantages to using ChronicleMap for such a purpose. What am I missing here?
What are the drawbacks of using ChronicleMap for the given boundary conditions?
I voted to close this question because any "drawbacks" would be relative.
As a data structure, Chronicle Map is not sorted, so it doesn't fit when you need to iterate the key-value pairs in sorted order by key.
A limitation of the current implementation is that you need to specify the number of elements that will be stored in the map in advance. If the actual number isn't close to the specified number, you will overuse memory and disk (though not very severely on Linux systems), and if the actual number of entries exceeds the specified number by approximately 20% or more, operation performance starts to degrade, with the performance hit growing linearly as the number of entries grows further. See https://github.com/OpenHFT/Chronicle-Map/issues/105
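For context, this is where that up-front estimate appears in the builder API. A sketch in Scala against Chronicle Map's Java builder, using the figures from the question (100M entries, values mostly 500-1000 bytes); the map name, file, and average sizes are illustrative assumptions:

```scala
import java.io.File
import net.openhft.chronicle.map.ChronicleMap

object ChronicleMapSketch {
  def main(args: Array[String]): Unit = {
    val map = ChronicleMap
      .of(classOf[CharSequence], classOf[Array[Byte]])
      .name("kv-store")
      .entries(100000000L)     // number of entries must be estimated in advance
      .averageKeySize(32)      // assumed average key length in bytes
      .averageValueSize(750)   // middle of the 500-1000 byte range from the question
      .createPersistedTo(new File("kv-store.dat"))

    try {
      map.put("some-key", Array[Byte](1, 2, 3))
      println(map.get("some-key").length)
    } finally map.close()
  }
}
```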

How to get top N elements from an Apache Spark RDD for large N

I have an RDD[(Int, Double)] (where Int is unique) with around 400 million entries and need to get top N. rdd.top(N)(Ordering.by(_._2)) works great for small N (tested up to 100,000), but when I need the top 1 million, I run into this error:
Total size of serialized results of 5634 tasks (1024.1 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
I understand why the error happens (although it is beyond my imagination why 1024 bytes are used to serialize a single pair (Int, Double)) and I also understand that I can overcome it by increasing spark.driver.maxResultSize, but this solution only works up to a certain N and I cannot know whether it will work or not until the whole job crashes.
How can I get the top N entries efficiently without using top or takeOrdered, since they both return Arrays that can get too big for a large N?
Scala solutions are preferred.
So there are a few solutions to this. The simplest is enabling Kryo serialization, which will likely reduce the amount of memory required.
Another would be using sortByKey followed by mapPartitionsWithIndex to get the count of each partition, figuring out which partitions you need to keep, and then working with the resulting RDD (this one is better if you are OK with expressing the rest of your operations on RDDs); see the sketch after this answer.
If you need the top n locally in the driver, you could use sortByKey and then cache the resulting RDD and use toLocalIterator.
Hope that one of these three approaches meets your needs.
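A rough sketch of the second approach (sort, count each partition, keep only the partitions needed), under the assumption that you want the result back as an RDD. It sorts with sortBy on the Double value rather than re-keying for sortByKey, but the idea is the same; this is not the answer's exact code:

```scala
import org.apache.spark.rdd.RDD

object TopN {
  def topNAsRdd(rdd: RDD[(Int, Double)], n: Long): RDD[(Int, Double)] = {
    // Sort by the Double value, descending, so the "top" entries come first.
    val sorted = rdd.sortBy(_._2, ascending = false)

    // Count how many elements each partition holds, in partition order.
    val counts: Array[Long] = sorted
      .mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
      .collect()
      .sortBy(_._1)
      .map(_._2)

    // cumulative(i) = total number of elements in partitions 0..i
    val cumulative = counts.scanLeft(0L)(_ + _).drop(1)
    val lastNeeded = cumulative.indexWhere(_ >= n) match {
      case -1 => counts.length - 1 // fewer than n elements overall: keep everything
      case i  => i
    }
    val before = if (lastNeeded == 0) 0L else cumulative(lastNeeded - 1)

    // Keep whole partitions before the boundary, trim the boundary partition,
    // and drop the rest. No large array is ever collected to the driver.
    sorted.mapPartitionsWithIndex { (i, it) =>
      if (i < lastNeeded) it
      else if (i == lastNeeded) it.take((n - before).toInt)
      else Iterator.empty
    }
  }
}
```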

How does neo4j perform in time and space complexity for given type of nodes, relationships and queries?

Consider that I'm going to have the following things in my graph:
100 Million nodes, more than 1 Billion connections/relationships
Node properties: around 10 properties, a mix of ints, doubles, strings, HashMaps, etc.
Relationship properties: around 10 double values and 2-3 string values (avg. 50 chars each)
Now, suppose I want to update all node and relationship property values by querying the neighbours of each node once, i.e.:
step 1: search for a node, say X, with a given id,
step 2: get its neighbours,
step 3: update the node properties of X and all relationship properties between X and its neighbours.
Repeat these 3 steps once for all nodes.
How much time will one such update of all nodes take (an approximate time is OK for me, whether in seconds, minutes, or hours), given the following system configuration:
Two dual core processors, 3.0 GHz each, 4*4 GB memory, 250 GB Hard disk space.
Approximately how much storage space will be required for the above data?
Please help me with any approximate, sample performance (time and storage) analysis; it will help me visualize my requirements. Thanks.
The size consideration is pretty easy for nodes/relationships: each node record is 9 bytes and each relationship record is 33 bytes.
9B x 100M = 900 Million Bytes =~ 858.3 Megabytes for nodes
33B x 1B = 33 Billion bytes =~ 30.7 Gigabytes for relationships
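As a quick sanity check, here is the same arithmetic in code, using the per-record sizes assumed in this answer (9 bytes per node record, 33 bytes per relationship record). Note this covers only the node and relationship records; the ~10 properties per node and per relationship live in separate property records and will add significantly to the total:

```scala
object Neo4jStoreEstimate {
  def main(args: Array[String]): Unit = {
    val nodes = 100000000L            // 100 million nodes
    val rels  = 1000000000L           // 1 billion relationships
    val nodeMb = nodes * 9L / math.pow(1024, 2)   // 9 bytes per node record
    val relGb  = rels * 33L / math.pow(1024, 3)   // 33 bytes per relationship record
    println(f"nodes: $nodeMb%.1f MB, relationships: $relGb%.1f GB") // ~858.3 MB, ~30.7 GB
  }
}
```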
As for the computation, it's tough to gauge. Neo4j's cache isn't 1-to-1 with what is on disk, so your storage may be ~31 GB, but you'll need much more than that to hold it in cache. The way Neo4j stores the information on disk is efficient for this type of traversal, though: all relationships and properties for a node are kept in a linked list, so accessing them through an iterator is more efficient than searching for one type of relationship.
It would be hard to give you a precise estimate, since it depends on things like going through duplicate relationships and what can fit in RAM vs. on disk, but my guess would be a few hours (< 6 hrs) given your system and size requirements.
