MD5 cache keys in Memcache - caching

In this reddit blog post, the author talks about MD5ing their cache keys and cites it as a reason they found it very difficult to scale out.
Can someone tell me why one would want to MD5 cache keys? I didn't understand the reason, even though they explained it as follows:
“A few years ago, we decided to md5 all of our cache keys. We did this because at the time memcached (which is what memcachedb is based on) could only take keys of a certain length. In fact, the version it is based on still has this limitation. MD5ing the keys was a good solution to this problem, so we thought.”

The key length limit back then was probably shorter than it is now (currently 250 bytes, and 250 bytes is a pretty huge key name), meaning a sensible key naming convention may not have fit within it, so they kept the sensible naming convention and MD5'd the result.

"We did this because at the time memcached (which is what memcachedb is based on) could only take keys of a certain length"
I guess that since some keys were larger than the maximum length the server allowed, they decided to store an MD5 of the key instead.
However, I'm not sure there is a relation between this and the fact that they can't easily add new servers, since memcached also uses hashing of keys to spread data evenly across servers (maybe memcachedb doesn't).
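In case it helps to picture it, here is a minimal Python sketch of that kind of workaround (the helper name and the ":" separator are my own assumptions, not something from the post): build the readable key first, and only fall back to an MD5 digest when it would exceed memcached's 250-byte key limit.

    import hashlib

    MAX_KEY_LENGTH = 250  # memcached's historical key-length limit

    def make_cache_key(*parts):
        """Build a readable key, hashing it only if it is too long.

        Keeping the readable form ('user:123:score:456') whenever it fits
        makes the cache easier to debug; only oversized keys get MD5'd.
        """
        key = ":".join(str(p) for p in parts)
        if len(key) > MAX_KEY_LENGTH:
            key = hashlib.md5(key.encode("utf-8")).hexdigest()  # 32 hex chars
        return key

One trade-off of hashing every key unconditionally, as the reddit post did, is that you can no longer read or group keys by name when debugging.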

Related

Why do BCrypt hashes store information about version and iterations?

BCrypt hashes usually begin with some repeating symbols.
Let's say, for example, that we see $2a$10 at the beginning of our hash. Every BCrypt hash has something similar to this.
$ is a separator
2a is in this case the version
10 is the cost factor: the number of iterations is 2 to the power of 10
My question is - why is this information in the hash?
There is no dehashing algorithm that might need this information in particular, and when people log in, they generate the same hash using the same version and the same number of iterations, and the result is then compared to what is stored in the database. This means the algorithm doesn't have a built-in comparison function that takes the hash and, based on this information (version and iterations), hashes the password to make the comparison.
Then...why is it so that this information is given away? Who uses this information?
My guess is so that if the version has changed or the number of iterations our program or whatever will know, but...why? I mean that the algorithm must be configured only once and if changes are required then it is the company's job to make the appropriate arrangements so that it knows what version was used and what is used now. Why is it the hash's job to remember the version and number of iterations?
Hashes get leaked every week or so, and with this information someone can easily set up their own BCrypt and get it running with the same configuration of version and iterations... however, if this information weren't visible in the hash and the hash got public... then how would anyone set up their own BCrypt with the same configuration and start comparing against it?
Isn't it safer not to provide this information, so that if the hash alone gets leaked, nobody would know what configuration was used to make it?
It makes bcrypt forward and backwards compatible.
For example, not every bcrypt hash starts with $2a$.
They can start with:
$2$
$2a$
$2x$
$2y$
$2b$
You need to know which version of the hash you're reading, so you can handle it correctly.
Also, you need to know the number of iterations.
not every hash will use 10
not every hash will use the same cost
Why store the version and iterations? Because you have to.
Also, it's an excellent design. In the past, people used to just store a hash, and it was awful.
people used either no salt, or the same salt every time, because storing it was too hard
people used 1 iteration, or a hard-coded number of iterations, because storing it was too hard
BCrypt, SCrypt, and Argon2 use the extraordinarily clever idea of doing all that grunt-work for you, leaving you with only having to call one function, without weakening the security of the system in any way.
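To make that concrete, here is a short sketch using the Python bcrypt package (the package choice is my assumption; the question isn't tied to a language). The point is that verification reads the version, cost and salt back out of the stored hash string itself:

    import bcrypt

    password = b"correct horse battery staple"

    # gensalt() embeds the version and cost in the prefix, e.g. b"$2b$12$..."
    hashed = bcrypt.hashpw(password, bcrypt.gensalt(rounds=12))

    # checkpw() parses the version, cost and salt out of `hashed` itself,
    # which is exactly why they have to be stored alongside the digest.
    assert bcrypt.checkpw(password, hashed)

Nothing else in your application needs to remember which parameters were in force when a given user last changed their password; the hash string carries that for you.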
you're trying to preach security by obscurity. Stop that, it doesn't work.
Not having details in the data is not unusual; in this old hack it was the hackers who mentioned it was SHA1. This is easy for attackers: they, and researchers too, will take the leaked data and simply try all kinds of common algorithms and iteration/work factor counts against a small list of common passwords, like the phpBB list from SkullSecurity; when they find the inevitable terrible passwords that crack, they'll know they've found the algorithm and can break out the full-scale cracking.
having the algorithm stored means you can transition from old to new gradually, and upgrade individual users as they come in, AND have multiple variants in use at once - including transitional types
transitional: you were on salted SHA-1 (BAD), moving to PBKDF2-HMAC-SHA-512 with 200,000 iterations (good), in the middle you actually bulk convert to PBKDF2-HMAC-SHA-512(SHA-1(salted password)), but at each user's login, move them to pure PBKDF2-HMAC-SHA-512(password).
having the iteration count stored means, like the transitional scheme above, you can increase it over time and have different counts for different users, set as they log in (a sketch of that upgrade-on-login flow follows).
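Here is a minimal Python sketch of that upgrade-on-login idea, assuming a hypothetical user record with algorithm, salt, iterations and password_hash fields plus a save() method (all of those names are mine, not from the answer above):

    import hashlib
    import hmac
    import os

    CURRENT_ITERATIONS = 200_000  # target work factor for new hashes

    def hash_password(password: bytes, salt: bytes, iterations: int) -> bytes:
        return hashlib.pbkdf2_hmac("sha512", password, salt, iterations)

    def verify_and_upgrade(user, password: bytes) -> bool:
        """Check the password against whatever scheme the stored record says,
        then transparently re-hash with the current parameters on success.

        `user` is a hypothetical record object; adapt to your own schema.
        """
        if user.algorithm == "pbkdf2-sha512":
            candidate = hash_password(password, user.salt, user.iterations)
        elif user.algorithm == "pbkdf2-sha512-over-sha1":  # transitional scheme
            legacy = hashlib.sha1(user.salt + password).hexdigest().encode()
            candidate = hash_password(legacy, user.salt, user.iterations)
        else:
            return False

        if not hmac.compare_digest(candidate, user.password_hash):
            return False

        # Upgrade anything not on the current scheme or iteration count.
        if user.algorithm != "pbkdf2-sha512" or user.iterations < CURRENT_ITERATIONS:
            user.salt = os.urandom(16)
            user.iterations = CURRENT_ITERATIONS
            user.algorithm = "pbkdf2-sha512"
            user.password_hash = hash_password(password, user.salt, user.iterations)
            user.save()
        return True

Because the parameters live next to each hash, old, transitional and current schemes can coexist in the same table while users migrate one login at a time.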

Performance-wise, is it worth it to rename every mongo key name for production? [duplicate]

This question already has answers here: Is shortening MongoDB property names worthwhile? (7 answers). Closed 5 years ago.
As far as I know, every key name is stored "as-is" in the mongo database. It means that a field "name" will be stored using the 4 letters everywhere it is used.
Would it be wise, if I want my app to be ready to store a large amount of data, to rename every key in my mongo documents? For instance, "name" would become "n" and "description" would become "d".
I expect it to significantly reduce the space used by the database as well as the amount of data sent to the client (though admittedly it makes the mongo documents' content uglier). Am I right?
If I undertake the renaming of every key in my code (no need to rename the existing data, I can rebuild it from scratch), is there a good practice or any additional advice I should know about?
Note: this is mainly speculation, I don't have benchmarking results to back this up
While "minifying" your keys technically would reduce the size of your memory/diskspace footprint, I think the advantages of this are quite minimal if not actually disadvantageous.
The first thing to realize is that data stored in MongoDB is actually not stored in its raw JSON format; it's stored as pure binary using a standard known as BSON. This allows Mongo to do all sorts of internal optimizations, such as compression if you're using WiredTiger as your storage engine (thanks for pointing that out @Jpaljasma).
Second, let's say you do minify your keys. Well, then you need to minify your keys. Every time. Forever. That's a lot of work on your application side. Plus you need to unminify your keys when you read (because users won't know what "n" is). Every time. Forever. All of a sudden your minor memory optimization becomes a major runtime slowdown.
Third, that minifying/unminifying process is kinda complicated. You need to maintain a mapping between the two, keep it tested and up to date, and make sure there is never any overlap (if there is, that's pretty much the end of your data). I wouldn't ever want to work on that.
So overall, I think it's a pretty terrible idea to minify your keys to save a couple of characters. It's important to keep the big picture in mind: the VAST majority of your data will be not in the keys, but in the values. If you want to optimize data size, look there.
The full name of every field is included in every document. So when your field-names are long and your values rather short, you can end up with documents where the majority of the used space is occupied by redundant field names.
This affects the total storage size and decreases the number of documents which can be cached in RAM, which can negatively affect performance. But using descriptive field-names does of course improve readability of the database content and queries, which makes the whole application easier to develop, debug and maintain.
Depending on how flexible your driver is, it might also require quite a lot of boilerplate code to convert between your application field-names and the database field-names.
Whether or not this is worth it depends on how complex your database is and how important performance is to you.
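For a sense of what that boilerplate looks like, here is a minimal Python sketch of a field-name mapping layer (the FIELD_MAP contents are assumptions for illustration; you would fill in your own schema):

    # Hypothetical mapping from application field names to stored short names.
    FIELD_MAP = {"name": "n", "description": "d", "created_at": "c"}
    REVERSE_MAP = {short: full for full, short in FIELD_MAP.items()}
    assert len(REVERSE_MAP) == len(FIELD_MAP), "short names must not collide"

    def minify(doc):
        """Rename keys just before writing a document to MongoDB."""
        return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

    def unminify(doc):
        """Restore readable keys just after reading a document back."""
        return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

Every read and write path has to pass through these two functions (including nested documents and query filters, which this sketch does not handle), which is the maintenance cost the answers above are warning about.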

Ideal hashing method for wide distribution of values?

As part of the rhythm game I'm working on, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the songs and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible; however, cryptographic strength isn't as important here as a wide, uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
Any cryptographic-strength hash algorithm should exhibit no collisions at all. Of course, collisions necessarily exist (there are more possible inputs than possible outputs), but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with work of about 2^(n/2), so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision; they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and provides 128 bits of output, which avoids random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. Even on a basic PC, they already process data faster than a hard disk can provide it: if you hash a file, the file reading will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and consider switching to another function only if some performance-related trouble is duly noticed and measured.
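A minimal sketch of that last suggestion, assuming the uploaded chart is available as bytes (the function name is mine):

    import hashlib

    def chart_id(data: bytes) -> str:
        """Identify an uploaded song/notechart by a truncated SHA-256 digest.

        Truncating to 128 bits (32 hex characters) keeps random collisions
        negligible while halving the length of the identifier.
        """
        return hashlib.sha256(data).hexdigest()[:32]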
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be perfectly satisfactory.
If cryptographic security is not a concern then you can look at this link & this. The fastest and simplest (to implement) would be Pearson hashing, if you are planning to compute a hash of the title/name and later do lookups; or you can have a look at the superfast hash here. It is also very good for non-cryptographic use.
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K to fit in two bytes) and a 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
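If it helps, here is a minimal Python sketch of that 6-byte scheme, using CRC-32 as the 32-bit checksum (CRC-32 is my choice; the answer just says "32-bit checksum"):

    import zlib

    def cheap_hash(data: bytes) -> bytes:
        """6-byte fingerprint: 2 bytes of length (mod 64K) + 4 bytes of CRC-32."""
        length_part = (len(data) % 65536).to_bytes(2, "big")
        crc_part = zlib.crc32(data).to_bytes(4, "big")
        return length_part + crc_part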
Isn't cryptographic hashing overkill in this case, even though I understand that modern computers do this calculation pretty fast? I assume that your users will have a unique userid. When they upload, you just need to increment a number. So, you will represent them internally as userid1_song_1, userid1_song_2, etc. You can store this info in a database with that as the unique key, along with the user-specified name.
You also didn't mention the size of these songs. If it is midi, then file size will be small. If file sizes are big (say 3MB) then sha calculations will not be instantaneous. On my core2-duo laptop, sha256sum of a 3.8 MB file takes 0.25 sec; for sha1sum it is 0.2 seconds.
If you intend to use a cryptographic hash, then sha1 should be more than adequate and you don't need sha256. No collisions --- though they exist --- have been found yet. Git, Mercurial and other distributed version control systems use sha1. Git is a content-based system and uses sha1 to find out if content has been modified.

Ok to use memcache in this way? or need a system re-architecture?

I have a "score" I need to calculate for multiple items for multiple users. Each user has many, many scores unique to them, and calculating them can be time/processor intensive (the slowness isn't on the database end). To deal with this, I'm making extensive use of memcached. Without memcache some pages would take 10 seconds to load! Memcache seems to work well because the scores are very small pieces of information, but take a while to compute. I'm actually setting the keys to never expire, and then I delete them on the occasional circumstance that a score changes.
I'm entering a new phase on this product, and am considering re-architecting the whole thing. There seems to be a way I can calculate the values iteratively and then store them in a local field. It'll be a bit similar to what's happening now, just the value updates will happen faster, the cache will be in the real database, and managing it will be a bit more work (I think I'd still use memcache on top of that, though).
If it matters, it's all in Python/Django.
Is depending on the cache like this bad practice? Is it OK? Why? Should I try to re-architect things?
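For reference, the pattern described above boils down to something like this in Django (compute_score stands in for the existing expensive calculation, and the key format is an assumption):

    from django.core.cache import cache

    def get_score(user_id, item_id):
        """Return the cached score, computing and storing it only on a miss.

        timeout=None keeps the entry until it is explicitly deleted, which
        matches the "never expire, delete on change" approach above.
        """
        key = f"score:{user_id}:{item_id}"
        score = cache.get(key)
        if score is None:
            score = compute_score(user_id, item_id)  # the expensive part
            cache.set(key, score, timeout=None)
        return score

    def invalidate_score(user_id, item_id):
        """Call this from whatever code path changes the underlying data."""
        cache.delete(f"score:{user_id}:{item_id}")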
If it ain't broke... don't fix it ;^) It seems your method is working, so I'd say stick with it. You might look at memcachedb (or Tokyo Cabinet), which is a persistent version of memcache. This way, when the memcache machine crashes and reboots, it doesn't have to recalculate all the values.
You're applying several architectural patterns here, and each of them certainly has a place. There's not enough information here for me to evaluate whether your current solution needs rearchitecting or whether your ideas will work. It does seem likely to me that as your understanding of the users' requirements grows you may want to improve things.
As always, prototype, measure performance, consider the trade off between complexity and performance - you don't need to be as fast as possible, just fast enough.
Caching in various forms is often the key to good performance. The question here is whether there's merit in persisting the calculated, cached values. If they're stable over time then this is often an effective strategy. Whether to persist the cache or make space for the values in your database schema will probably depend upon the access patterns. If there are various query paths, then a carefully designed database schema may be appropriate.
Rather than using memcached, try storing the computed score in the same place as your other data; this may be simpler and require fewer boxes.
Memcached is not necessarily the answer to everything; it's intended for systems which need to read-scale very highly. It sounds like in your case, it doesn't need to, it simply needs to be a bit more efficient.

Best Hash function for detecting data changes?

We have a pricing dataset in which the contained values or the number of records change over time. The number of added or removed records is small compared to the changes in values. The dataset usually has between 50 and 500 items with 8 properties each.
We currently use AJAX to return a JSON structure that represents the dataset and update a webpage using this structure with the new values and where necessary removing or adding items.
We make the request with two hash values, one for the values and another for the records. These are MD5 hashes returned with the JSON structure to be sent with a following request. If there is a change to the hashes we know we need a new JSON structure otherwise the hashes are just returned to save bandwidth and eliminate unnecessary client-side processing.
As MD5 is normally used with encryption, is it the best choice of hashing algorithm for just detecting data changes?
What alternative ways can we detect a change to the values and update as well as detecting added or removed items and manipulating the page DOM accordingly?
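To make the setup above concrete, here is a minimal Python sketch of producing the two hashes server-side (it assumes each item carries an "id" property; that name, and the JSON canonicalisation, are assumptions rather than anything stated in the question):

    import hashlib
    import json

    def dataset_hashes(items):
        """Return (values_hash, records_hash) for a list of item dicts.

        records_hash changes only when items are added or removed;
        values_hash changes whenever any property of any item changes.
        sort_keys and a stable item order keep the serialisation canonical.
        """
        record_ids = sorted(item["id"] for item in items)
        records_hash = hashlib.md5(
            json.dumps(record_ids, default=str).encode()
        ).hexdigest()

        ordered = sorted(items, key=lambda item: item["id"])
        values_hash = hashlib.md5(
            json.dumps(ordered, sort_keys=True, default=str).encode()
        ).hexdigest()
        return values_hash, records_hash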
MD5 is a reasonable algorithm to detect changes to a set of data. However, if you're not concerned with the cryptographic properties, and are very concerned with the performance of the algorithm, you could go with a simpler checksum-style algorithm that isn't designed to be cryptographically secure. (though weaknesses in MD5 have been discovered in recent years, it's still designed to be cryptographically secure, and hence does more work than may be required for your scenario).
However, if you're happy with the computational performance of MD5, I'd just stick with it.
MD5 is just fine. Should it prove too slow, you can try a fast checksum algorithm such as Adler-32.
What you're doing sounds pretty good to me.
If server-side capacity is cheap and minimising network usage is crucial, you could have the server remember, for each client, what its last dataset was, and send only the differences (as a list of insertions, deletions and edits) on each request. If you sort your data rows first, these differences can be calculated fairly efficiently using a differencing algorithm such as that used by diff.
This approach is sensitive to network outages -- if one response is not received by the client, errors will accumulate. However, this can be remedied by having the client send the MD5 hash with each request: if it is different from what the server expects, the entire list will be sent instead of a list of changes.
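If you went that route, the server-side diff could be as simple as the following sketch, assuming both snapshots are held as dicts keyed by a row id (the dict-of-rows representation is my assumption):

    def diff_datasets(old, new):
        """Describe how to turn the `old` snapshot into the `new` one."""
        return {
            "added":   [new[k] for k in new.keys() - old.keys()],
            "removed": sorted(old.keys() - new.keys()),
            "changed": [new[k] for k in new.keys() & old.keys() if new[k] != old[k]],
        }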
I agree with Jonathan's answer regarding MD5. As for alternative ways to detect changes, if you are willing to store (or already store) on the server the time/date of the most recent change, you could pass that back and forth to the client. You avoid the computation entirely and you might even be able to use most of your existing code.
--
bmb
I think that any commonly used hash function will do what you want - provide a unique representation of an entity.
For the problem you are trying to solve, my solution would be to have a backend table that records all changes -- not the changes themselves, but an identifier for each row that has changed. On a periodic basis, call back to the server to get a list of all the objects that have changed, and use this to decide on the client which rows need updating/deleting/adding.
