Hash stable to small changes in text [closed] - algorithm

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Is there a hash function that is stable to small changes in text? I'm looking for the opposite of a cryptographic hash, where small changes in the source lead to huge changes in the result.
Something like a perceptual hash for text. Is there such a thing?
Edited: by "small changes in text" I mean changes in punctuation, correction of orthographic / grammatical mistakes, etc. The text itself is an article, like a Wikipedia entry (but it can be much smaller, like 2 or 3 paragraphs).
Bonus points if somebody can point to a Python implementation.

You're looking for locality sensitive hashing.
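Since the question asks for Python, here is a minimal sketch of one locality-sensitive scheme, SimHash. The choice of SimHash, the word 3-gram features, the 64-bit width, and the function names are all assumptions made for illustration; the answer above only names the general family. Texts that share most of their word shingles should end up with fingerprints that differ in few bits.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Return a SimHash fingerprint of `text` as an integer (illustrative sketch)."""
    # Lowercase and keep only word characters, so punctuation and case are ignored.
    words = re.findall(r"\w+", text.lower())
    # Features are overlapping word 3-grams (an assumed choice, not a requirement).
    shingles = [" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))]
    counts = [0] * bits
    for shingle in shingles:
        digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A bit of the fingerprint is set when the votes for it are positive.
    return sum(1 << bit for bit in range(bits) if counts[bit] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


original = "The quick brown fox jumps over the lazy dog."
repunctuated = "The quick, brown fox jumps over the lazy dog!"
distance = hamming_distance(simhash(original), simhash(repunctuated))
print(distance)  # 0: the two texts tokenize to the same words
```

Spelling corrections change a few shingles rather than none, so they give a small nonzero distance instead of an identical fingerprint; the threshold for "same article" has to be tuned on real data.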

Related

Why is it so difficult to program a true random number generator? [closed]

Closed 10 years ago.
I don't understand why a PRNG is easier to program than a true RNG. Shouldn't a typical processor make short work of producing a truly random number?
Computers are deterministic machines: given the same input (code included), they will produce the same result. To get true randomness you need to introduce something random from the real world, like the exact time of an event, cosmic rays, or something else that you can't predict.
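The determinism is easy to see in a short Python sketch. The seed value 42 and the variable names are arbitrary choices for illustration. Two generators started from the same seed emit the same sequence, while os.urandom returns bytes from the operating system's random source, which is typically fed by entropy gathered from hardware events.

```python
import os
import random

# Two PRNGs with the same seed produce the same "random" sequence.
gen_a = random.Random(42)
gen_b = random.Random(42)
seq_a = [gen_a.randint(0, 99) for _ in range(5)]
seq_b = [gen_b.randint(0, 99) for _ in range(5)]
print(seq_a == seq_b)  # True

# The program cannot compute these bytes itself; it has to ask the
# operating system, which collects unpredictability from outside the code.
entropy = os.urandom(16)
print(entropy.hex())
```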

Draw images for documentation [closed]

Closed 9 years ago.
I need to draw some diagrams for documentation: things like tables, flow charts, trees, etc.
I usually work in Linux environments and use LaTeX to write text, mathematical formulas, and equations. What do you use to draw the things above?
Thanks in advance.
An unsorted list of tools that I generally like:
PGF: A really useful LaTeX macro package for drawing all kinds of professional graphics.
Graphviz: A tool to "program" directed graphs and other things with automatic layout.
Balsamiq: Web tool for prototyping GUIs, with PDF export.
ASCII art tools like Ditaa or aafigure.
Dia Diagram Editor: Old but useful GUI for drawing diagrams.
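To give an idea of what "programming" a graph looks like, here is a small sketch that writes Graphviz DOT source from a list of edges. The helper name to_dot and the example flow are made up for illustration; only the DOT syntax itself comes from Graphviz.

```python
def to_dot(edges, name="G"):
    """Build Graphviz DOT source for a directed graph from (src, dst) pairs."""
    lines = [f"digraph {name} {{"]
    for src, dst in edges:
        # Quoting the node names allows spaces in labels.
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


flow = [("Start", "Parse input"), ("Parse input", "Render"), ("Render", "Done")]
dot_source = to_dot(flow)
print(dot_source)
```

Saving the output as flow.dot and running dot -Tpng flow.dot -o flow.png lets Graphviz choose the layout.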

Which data structure should be used while storing large number of data, but not any RDBMS? [closed]

Closed 11 years ago.
This question was asked in an interview. First, I came up with a B-tree. The interviewer asked me to be more specific and to describe how I would store the data so that it would be easier to retrieve.
Can you please shed some light on this? Thanks in advance.
Your question isn't really clear.
"Good" ways to store the data depend on what you want to do with it.
If you want to access parts of your data, a list of offsets suffices. If you want to search in text, an additional inverted index in combination with a docId->offset mapping is great. If you have frequent updates to your data and reading is rare, none of those make sense. So it really depends.
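The first two options can be sketched in a few lines of Python. This is an in-memory toy under assumed names (blob, offsets, inverted, read_doc, search); a real store would keep the blob on disk and seek to the offsets instead of slicing a string.

```python
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "quick thinking"]

# Store all documents in one blob and remember where each one starts.
blob = "\n".join(docs)
offsets = []
pos = 0
for doc in docs:
    offsets.append(pos)
    pos += len(doc) + 1  # +1 for the newline separator


def read_doc(doc_id):
    """Fetch one document using only the offset list."""
    start = offsets[doc_id]
    end = offsets[doc_id + 1] - 1 if doc_id + 1 < len(offsets) else len(blob)
    return blob[start:end]


# Inverted index: word -> set of docIds containing that word.
inverted = defaultdict(set)
for doc_id, doc in enumerate(docs):
    for word in doc.split():
        inverted[word].add(doc_id)


def search(word):
    """Look up docIds in the inverted index, then fetch each doc by offset."""
    return [read_doc(i) for i in sorted(inverted.get(word, ()))]


print(search("quick"))  # ['the quick brown fox', 'quick thinking']
```

Every insert or edit shifts offsets and touches the index, which is why this layout fits read-heavy data rather than frequently updated data.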
Sounds like an open question, so you can demonstrate your vast experience of ... well, http://en.wikipedia.org/wiki/NoSQL would be my guess, but you could argue that http://en.wikipedia.org/wiki/Dbm answers the question.

Robust image hashing algorithm implementation? [closed]

Closed 10 years ago.
Is there a robust image hashing implementation in any programming language that I can use? At a minimum, it should generate the same hash for images that have been altered in minor ways (resized, rotated, lightly retouched, cropped, etc.).
The best example is Tineye.com. They somehow hash each image and are able to detect duplicate images with minor modifications.
I found some research but no implementation:
http://scholar.google.com/scholar?hl=en&as_sdt=0,10&q=robust+image+hashing

Best Fuzzy Matching Algorithm? [closed]

Closed 11 years ago.
What is the best fuzzy matching algorithm (Fuzzy Logic, N-Gram, Levenshtein, Soundex, ...) to process more than 100000 records in less time?
I suggest you read the articles by Navarro mentioned in the References section of the Wikipedia article titled "Approximate string matching".
Making your decision based on actual research is always better than basing it on suggestions by random strangers, especially if performance on a known set of records is important to you.
It massively depends on your data. Certain records can be matched better than others. For example, a postcode has a defined format, so it can be compared in a different way from normal strings. People can be matched on initials and DOB, or other combinations, etc.
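As a rough illustration of treating fields differently, the sketch below compares postcodes exactly after normalisation and compares free-form names by Levenshtein distance. The function names, the normalisation rule, and the sample values are assumptions for this example; the plain dynamic-programming distance shown here is not tuned for 100000 records.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def same_postcode(a, b):
    """Structured field: strip whitespace, uppercase, then compare exactly."""
    def normalise(s):
        return "".join(s.split()).upper()
    return normalise(a) == normalise(b)


print(levenshtein("kitten", "sitting"))       # 3
print(levenshtein("Jon Smith", "John Smith"))  # 1
print(same_postcode("sw1a 1aa", "SW1A1AA"))   # True
```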
