Choosing Redis datatypes for advanced data manipulation in a simple torrent tracker service

I need your advice on Redis datatypes for my project. The project is a torrent tracker (Ruby, simple, Sinatra-based) with a pure in-memory data store for current information about peers. I feel like this is what Redis is made for, but I'm stuck on choosing the right data types. For now I am leaning toward the following setup:
Use a list for seeders. What I really need is a ring buffer, so that I can fetch a sequential range of seeders (given a size and a start position) and save the new start position for next time.
Use a sorted set for leechers. The score for each leecher is downloaded/(downloaded+left), so I can also extract a range for any specific case.
All values in the set and the list are string (bencoded) representations of peer data.
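To make the intended access pattern concrete, here is a minimal sketch in plain Python with no Redis involved. The names `leecher_score` and `ring_window` are made up for illustration. In Redis, the list part would presumably map to LRANGE calls plus a separately stored cursor.

```python
from typing import List, Tuple


def leecher_score(downloaded: int, left: int) -> float:
    """Completion ratio used as the sorted-set score: downloaded / (downloaded + left)."""
    total = downloaded + left
    return downloaded / total if total else 0.0


def ring_window(seeders: List[str], start: int, size: int) -> Tuple[List[str], int]:
    """Return up to `size` seeders starting at `start`, wrapping around the end,
    together with the start position to use next time."""
    if not seeders:
        return [], 0
    n = len(seeders)
    count = min(size, n)  # never hand out the same seeder twice in one window
    window = [seeders[(start + i) % n] for i in range(count)]
    return window, (start + count) % n


seeders = ["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"]
first, cursor = ring_window(seeders, start=3, size=3)
second, cursor2 = ring_window(seeders, start=cursor, size=3)
```

The returned cursor is the shared state that every request has to read and write.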
The problems I see with the setup above are:
The offset for seeders has to be stored, so data access needs synchronization.
There is no obvious way to find a specific seeder in a list. A set would help here, but then I could not extract a range of items at once.
(General problem) I need a TTL for set/list members (in case a client shuts down without sending anything first). One option is to store each peer as an ordinary key (string or hash), give it a TTL, subscribe to its expiration, and then delete the peer from the corresponding list or set.
What could you suggest? Any practical advice?

Related

Maintain a growing list in offchain storage substrate or use StorageValueRef::mutate in extrinsics

I was trying to set up an offchain storage that holds a collection of data (most likely a vector).
I expect this vector to keep growing.
One approach that seemed smooth was to use the StorageValueRef::mutate() function, but I found out only later that it can't be used in an extrinsic (or maybe it can and I am just not aware of how).
Another simple approach is to use the BlockNumber to create a storage key, and to use the BlockNumber from the offchain worker to reference that value.
But for what I am doing, multiple pieces of data will need to be stored within a single block. With this approach I would be restricted to storing only one value per block, which doesn't fit the requirements.

You could create a map like this:
// Vec<u8> is a placeholder item type (an assumption); use your own data type here.
#[pallet::storage]
pub type MyData<T: Config> =
    StorageMap<_, Twox64Concat, T::BlockNumber, Vec<Vec<u8>>>;
Then you can call MyData::<T>::append(block_number, data) in your pallet as often as you want.
But I would propose that you introduce some "pruning" window, say 10 blocks, and keep only the data of the latest 10 blocks in the state. For that you can just call MyData::<T>::remove(block_number - 10) in your on_initialize.
But if it is really just about data that you want to pass from the runtime to the offchain worker, you could use sp_io::offchain_index::set("key", "data");. This is a lower-level interface, though. Here too you could prefix the key with the block number to make it unique per block, but you would need to come up with your own way of storing multiple values per block.
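The append-and-prune idea can be modelled outside of Substrate. The following is a language-neutral sketch in plain Python, not pallet code; the class `BlockData` and its method names are invented for illustration, and a dict stands in for the storage map.

```python
from collections import defaultdict
from typing import DefaultDict, List

PRUNING_WINDOW = 10  # keep data for the latest 10 blocks, as suggested above


class BlockData:
    """In-memory stand-in for a map from block number to a list of values."""

    def __init__(self) -> None:
        self._by_block: DefaultDict[int, List[bytes]] = defaultdict(list)

    def append(self, block_number: int, data: bytes) -> None:
        # Several values may be appended under the same block number.
        self._by_block[block_number].append(data)

    def on_initialize(self, block_number: int) -> None:
        # Drop the entry that has just fallen out of the window.
        # pop() with a default is a no-op when there is nothing to prune.
        self._by_block.pop(block_number - PRUNING_WINDOW, None)

    def blocks(self) -> List[int]:
        return sorted(self._by_block)


store = BlockData()
for block in range(1, 16):
    store.on_initialize(block)
    store.append(block, b"first")
    store.append(block, b"second")
```

After 15 blocks only the entries for blocks 6 to 15 remain, each holding both values appended in that block.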

My company uses memcache as object just fine, can't see need for redis in caching

I'm learning about redis/memcache, and redis is clearly the more popular option. My question is about supported data types. At my company we use the MemCachier library, which is built on memcached. We store temporary user data in memcache while users are making a purchase. We can easily update this object as things are added to the cart or as more info about the user is given. This appears to be the same functionality as a hash in redis. I don't understand how this is only a basic string data type and how it's less powerful than a hash.

If you are using strings, that's fine, but any change involves loading the data into your application, parsing it, modifying it, and serializing it back to Redis/Memcache.
This has two problems: it's slow and it's non-atomic. Two servers modifying the same object at the same time can leave it in an inconsistent state, such as duplicate or missing items in a shopping cart. And again, it's slow.
With a Redis hash key, you can atomically modify specific fields of the object without loading the entire object into memory. Instead of read, parse, modify, save, you just update.
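The difference can be reproduced without any server. This is a minimal sketch in plain Python where a dict stands in for the cache; the key names and helper functions are made up for illustration. In Redis the field write would correspond to a single HSET on a hash key.

```python
import json

# A plain dict stands in for the cache server.
store = {"cart:string": json.dumps({"items": []}), "cart:hash": {}}


def read_cart() -> dict:
    """First steps of read-parse-modify-save: fetch and parse the whole object."""
    return json.loads(store["cart:string"])


def save_cart(cart: dict) -> None:
    """Last step: serialize and overwrite the whole object."""
    store["cart:string"] = json.dumps(cart)


# Two app servers read the same cart before either writes it back.
cart_a = read_cart()
cart_b = read_cart()
cart_a["items"].append("book")
cart_b["items"].append("pen")
save_cart(cart_a)
save_cart(cart_b)  # overwrites server A's write, so "book" is lost

string_items = read_cart()["items"]


def set_field(key: str, field: str, value: int) -> None:
    """Field-level update: touches one field, never the whole object."""
    store[key][field] = value


# The same two updates as independent field writes; their order does not matter.
set_field("cart:hash", "book", 1)
set_field("cart:hash", "pen", 1)
hash_items = sorted(store["cart:hash"])
```

The string variant ends up with only the pen in the cart, while the field-level variant keeps both items.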
Besides, Redis has many data structures that can be combined into very flexible data stores with different properties, whereas Memcache can only store strings.
BTW, Redis has a module that allows you to store JSON objects just as you would a string, and to manipulate them directly and atomically without fetching them to the client. See Rejson.io for details.

Memcached doesn't support complex data structures.
In Redis you have Lists, Sets, Sorted Sets, Hashes, and more.
Each of the data structures mentioned above supports mutating one or more of its elements atomically, without replacing the entire data structure/value.
Memcached, on the other hand, is a simple key-value store. That means every operation involving an attribute change within a complex object is a read-modify-write. If you just go around blindly replacing fields in objects, you risk race conditions and atomicity issues (which you can avoid by using CAS).
If the library abstracts that complexity, well, that's great, but it's still less efficient than mutating only the relevant field(s).
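For illustration, a CAS-guarded read-modify-write could look roughly like this. The `VersionedStore` class is a toy stand-in loosely modelled on memcached's gets/cas command pair, not a real client, and `update_with_cas` is a hypothetical helper.

```python
import json
from typing import Callable, Dict, Tuple


class VersionedStore:
    """Toy key-value store that keeps a version number per key."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, int]] = {}

    def set(self, key: str, value: str) -> None:
        _, version = self._data.get(key, ("", 0))
        self._data[key] = (value, version + 1)

    def gets(self, key: str) -> Tuple[str, int]:
        """Return the value together with the version it was read at."""
        return self._data[key]

    def cas(self, key: str, value: str, version: int) -> bool:
        # Write only if nobody changed the key since `version` was read.
        if self._data[key][1] != version:
            return False
        self._data[key] = (value, version + 1)
        return True


def update_with_cas(store: VersionedStore, key: str,
                    modify: Callable[[dict], None], max_retries: int = 5) -> bool:
    """Read-modify-write, retried until the CAS succeeds or retries run out."""
    for _ in range(max_retries):
        raw, version = store.gets(key)
        obj = json.loads(raw)
        modify(obj)
        if store.cas(key, json.dumps(obj), version):
            return True
    return False


store = VersionedStore()
store.set("cart", json.dumps({"items": []}))

# A stale writer reads the version, then someone else updates the key first.
_, stale_version = store.gets("cart")
updated = update_with_cas(store, "cart", lambda c: c["items"].append("book"))
stale_ok = store.cas("cart", json.dumps({"items": ["pen"]}), stale_version)

items = json.loads(store.gets("cart")[0])["items"]
```

The stale write is rejected instead of silently overwriting the newer value, but every update still ships the whole object in both directions.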
This answer only relates to your use case. Redis has many other advantages over memcached, which are not relevant to this question.

How do you RESTfully get a complicated subset of records?

I have a question about getting 'random' chunks of available content from a RESTful service, without duplicating what the client has already cached. How can I do this in a RESTful way?
I'm serving up a very large number of items (little articles with text and urls). Let's pretend it's:
/api/article/
My (software) clients want to get random chunks of what's available. There are too many to load them all onto the client. They do not have a natural order, so it's not a situation where they can just ask for the latest. Instead, there are around 6-10 attributes that the client may give to 'hint' what type of articles they'd like to see (e.g. popular, recent, trending...).
Over time the clients get more and more content, but at the server I have no idea what they have already, and because they're sent randomly, I can't just pass in the 'most recent' one they have.
I could conceivably send up the GUIDs of what's stored locally. The clients only store 50-100 locally. That's small enough to stuff into a POST variable, but not into the GET query string.
What's a clean way to design this?
Key points:
Data has no logical order
Clients must cache the content locally
Each item has a GUID
Want to avoid pulling down duplicates

You'll never be able to make this work satisfactorily if the data is truly kept in a random order (bear in mind the Dilbert RNG Effect); you need to fix the order for a particular client so that they can page through it properly. That's easy to do though; just make that particular ordering be a resource itself; at that point, you've got a natural (if possibly synthetic) ordering and can use normal paging techniques.
The main thing to watch out for is that you'll be creating a resource in response to a GET when you do the initial query: you probably should use a resource name that is a hash of the query parameters (including the client's identity if that matters) so that if someone does the same query twice in a row, they'll get the same resource (so preserving proper idempotency). You can always delete the resource after some timeout rather than requiring manual disposal…
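One way to derive such a resource name is sketched below. The path prefix `/api/article/orderings/`, the function name, and the `client` parameter are assumptions made for this example, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from urllib.parse import urlencode


def ordering_resource_name(params: dict, client_id: str = "") -> str:
    """Derive a stable resource name from the query parameters.

    Parameters are sorted so that the same query always maps to the same
    name, regardless of the order in which the client sent them.
    """
    items = sorted(params.items())
    if client_id:
        items.append(("client", client_id))  # include identity if it matters
    canonical = urlencode(items)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "/api/article/orderings/" + digest[:16]


a = ordering_resource_name({"popular": "1", "recent": "1"}, client_id="c42")
b = ordering_resource_name({"recent": "1", "popular": "1"}, client_id="c42")
c = ordering_resource_name({"popular": "1"}, client_id="c42")
```

Here `a` and `b` name the same resource, so repeating the query is harmless, while `c` names a different ordering. The client would then page through the named ordering with whatever offset or limit parameters the API defines.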

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all the Associations that each Employee may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{association_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).

You wrote the following at the end of your question:
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think this is your main problem. If you don't solve it, you may be able to reduce the size of the transferred data by a factor of 10, for example, but the main problem remains. Let us think about the question: why should so much data be sent to the client (to the web browser)?
The data on the client side is needed to display some information to the user. No monitor is large enough to show 30,000 items on one page, and no user is able to grasp that much information. So I am sure that you display only a small part of the information. In that case you should send only the small part that you display.
You don't describe how the GUIDs will be used on the client side. If you need the information during row editing, for example, you can transfer the data only when the user starts editing. In that case you need to transfer the data for one association only.
If you need to display the GUIDs directly, you still can't display all the information at once, so you can send the information for one page only. When the user starts to scroll or clicks the "next page" button, you can send the next portion of data. This way you can dramatically reduce the size of the transferred data.
If you have no possibility to redesign this part of the application, you can implement your original suggestion: by replacing a GUID such as "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" with a number like 123, you reduce the size of each GUID from 34 characters (32 hex digits plus the two quotes) to 3. If you additionally send an array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original size of the data from 30000*34 = 1020000 (about 1 MB) to 500*39 + 30000*3 = 19500 + 90000 = 109500 (about 110 KB), using the 500 unique employees from your question. So you can reduce the size of the data by roughly a factor of 10. Compression of dynamic data on the web server can reduce the size further.
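The estimate is easy to redo for other counts. Here is a small sketch of the arithmetic, using the 500 unique employees and 30,000 references given in the question; the constant names are made up, and the character counts follow the formats shown above.

```python
GUID_CHARS = 34           # 32 hex digits plus the two quote characters
SHORT_ID_CHARS = 3        # e.g. 123
MAPPING_ENTRY_CHARS = 39  # 123:"<32 hex digits>",

TOTAL_REFERENCES = 30_000
UNIQUE_EMPLOYEES = 500    # figure given in the question

original = TOTAL_REFERENCES * GUID_CHARS
mapping = UNIQUE_EMPLOYEES * MAPPING_ENTRY_CHARS
references = TOTAL_REFERENCES * SHORT_ID_CHARS
indexed = mapping + references

print(original, indexed, round(original / indexed, 1))
# prints: 1020000 109500 9.3
```

These are raw character counts before any HTTP compression, so they say nothing yet about how the two variants compare once gzip is applied.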
In any case, you should examine why your page is so slow. If the program works in a LAN, then transferring even 1 MB of data can be quick enough. Probably the page is slow while placing the data on the web page. I mean the following: if you modify some element on the page, the position of all existing elements has to be recalculated. If you work with disconnected DOM objects first and then place the whole portion of data on the page, you can improve the performance dramatically. You didn't say in the question which technology you use in your web application, so I don't include any examples. If you use jQuery, for example, I could give an example that makes clearer what I mean.

The lookup index you propose is nothing other than a "custom" compression scheme. As amdmax stated, this will improve your performance if you have a lot of repeated GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg correctly states that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.

if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data removes redundant parts as well as the algorithm allows, and decompression restores them.
To be sure, just generate the data both uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will do better, because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
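Such a comparison could be set up roughly as follows. The data is randomly generated in the shape described in the question, so the numbers only indicate a trend; the split into 300 associations with 100 members each is an assumption made to reach 30,000 references.

```python
import gzip
import json
import random
import uuid

rng = random.Random(0)  # fixed seed so the run is repeatable

# Made-up data: 500 unique employees, 300 associations with 100 members each.
employees = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(500)]
associations = [
    {"association_id": str(uuid.UUID(int=rng.getrandbits(128))),
     "employees": rng.sample(employees, 100)}
    for _ in range(300)
]

# Variant 1: GUIDs repeated in every object.
plain = json.dumps(associations).encode("utf-8")

# Variant 2: lookup index plus small integers.
index = {guid: i for i, guid in enumerate(employees)}
indexed = json.dumps({
    "lookup": employees,  # position in the list is the short id
    "associations": [
        {"association_id": a["association_id"],
         "employees": [index[g] for g in a["employees"]]}
        for a in associations
    ],
}).encode("utf-8")

sizes = {
    "plain": len(plain),
    "plain+gzip": len(gzip.compress(plain)),
    "indexed": len(indexed),
    "indexed+gzip": len(gzip.compress(indexed)),
}
for name, size in sizes.items():
    print(f"{name:>13}: {size:>9,} bytes")
```

Running the same script against a real response body instead of generated data gives the figures that actually matter for the decision.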

So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
Instead of GUIDs, which are 16 bytes long, you will get ints, which are 4 bytes long. And you will get a dictionary of key-value pairs that associates each GUID with some int value, right?
It will decrease your transfer time when many objects use the same id. But it will cost CPU time to compress before the transfer and to decompress after it. So what is the amount of data you transfer? Is it MB / GB / TB? And is there a good reason to compress it before sending?

I do not know how dynamic your data is, but I would
on a first call, send two directories/dictionaries mapping short ids to long GUIDs, one for your associations and one for your employees, e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2: EmpGUID2,...}. These directories may also contain additional information on the Association and Employee instances; I suspect you do not simply display GUIDs
on subsequent calls, just send the index of Employees per Association, { 1: [2,4,5], 3: [2,4], ...}, where the key is the association short id and the array value holds the short ids of the employees. Given your description, building the reverse index (Employee to Associations) may give a better result size-wise (but cost more processing)
Then it's all down to associative array manipulation, which is straightforward in JS.
Again, if your data is (very) dynamic on the server side, the two directories will soon be obsolete, and maintaining synchronization may cost you a lot.
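On the server side, the two directories and the index could be built along these lines. This is a sketch with made-up sample data and a hypothetical helper `build_directory`; short ids are simply assigned in first-seen order.

```python
from typing import Dict, List, Tuple


def build_directory(guids: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Assign short ids 1..N to distinct GUIDs, in first-seen order."""
    by_guid: Dict[str, int] = {}
    for guid in guids:
        by_guid.setdefault(guid, len(by_guid) + 1)
    by_id = {short: guid for guid, short in by_guid.items()}
    return by_id, by_guid


# Made-up sample data: association GUID -> employee GUIDs.
memberships = {
    "AssoGUID1": ["EmpGUID2", "EmpGUID4", "EmpGUID5"],
    "AssoGUID3": ["EmpGUID2", "EmpGUID4"],
}

# Sent on the first call: short id -> GUID, one directory per entity type.
asso_by_id, asso_by_guid = build_directory(list(memberships))
emp_by_id, emp_by_guid = build_directory(
    [emp for emps in memberships.values() for emp in emps]
)

# Sent on subsequent calls: association short id -> employee short ids.
index = {
    asso_by_guid[asso]: [emp_by_guid[emp] for emp in emps]
    for asso, emps in memberships.items()
}

# Reverse index: employee short id -> association short ids.
reverse: Dict[int, List[int]] = {}
for asso_id, emp_ids in index.items():
    for emp_id in emp_ids:
        reverse.setdefault(emp_id, []).append(asso_id)
```

Whether `index` or `reverse` is smaller depends on the data, so it is worth measuring both on a real payload.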

I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The two options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience, if you're not TOO far from the goal, meeting performance requirements is often a matter of trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are not being met. If it is size, you're starting to be limited if you're required to send the entire association list every time. Is that truly a requirement? Can you send the entire list once, and then just updates?

A Persistent Store for Increment/Decrementing Integers Easily and Quickly

Does there exist some sort of persistent key-value-like store that allows for quick and easy incrementing, decrementing, and retrieval of integers (and nothing else)? I know that I could implement something with a SQL database, but I see two drawbacks to that:
It's heavyweight for the task at hand. All I need is the ability to say "server[key].inc()" or "server[key].dec()"
I need the ability to handle potentially thousands of writes to a single key simultaneously. I don't want to deal with excessive resource contention. Change the value and get out - that's all I need.
I know memcached supports inc/dec, but it's not persistent. My strategy at this point is going to be to use a SQL server behind a queueing system of some sort such that there's only one process updating the database. It just seems... harder than it should be.
Is there something someone can recommend?

Redis is a key-value store that supports several data types. Integers are supported, along with the INCR and DECR commands. Redis can also persist its data to disk (snapshots or an append-only file).
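To show what atomic increment and decrement on a single key mean under concurrent writers, here is a tiny in-process model in Python. It only illustrates the semantics: the `Counters` class is invented for this example, and it is neither persistent nor a Redis client. As with INCR, a missing key starts at 0.

```python
import threading
from collections import defaultdict


class Counters:
    """In-process model of atomic per-key increment/decrement."""

    def __init__(self) -> None:
        self._values = defaultdict(int)  # missing keys start at 0
        self._lock = threading.Lock()

    def incr(self, key: str, amount: int = 1) -> int:
        with self._lock:  # the whole read-add-write is one atomic step
            self._values[key] += amount
            return self._values[key]

    def decr(self, key: str, amount: int = 1) -> int:
        return self.incr(key, -amount)

    def get(self, key: str) -> int:
        with self._lock:
            return self._values[key]


counters = Counters()


def worker() -> None:
    for _ in range(1000):
        counters.incr("hits")


# Eight concurrent writers hammer the same key.
threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

counters.decr("hits")
total = counters.get("hits")
```

No update is lost: 8 writers times 1000 increments, minus one decrement, leaves 7999.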
