First Non-Repeating Character: Bit Vector - algorithm

Find the first non-repeating character in a given string. You may assume that the string can contain any character from any language in the world, e.g. an Arabic or Greek character.
I came across a solution to this problem that uses bit vectors. It used a bit vector of size 95000. Can somebody please explain why that size was chosen?

See "How many characters can be mapped with Unicode?" for part of an explanation.
According to that question, 109,384 code points had been allocated as of Unicode 6.0. Depending on how old the solution you found is, 95000 may have been large enough to hold all of the code points allocated at the time, or the author may simply have settled for a "good enough" size.
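
For illustration, here is a minimal sketch in Python of the two-pass bit-vector approach, sized at 0x110000 bits so it covers every possible Unicode code point rather than a snapshot allocation count like 95000 (the names and sizing here are mine, not from the solution you found):

def first_non_repeating(s):
    SIZE = 0x110000                        # one bit per possible code point
    seen_once = bytearray(SIZE // 8 + 1)   # bit set on first occurrence
    seen_twice = bytearray(SIZE // 8 + 1)  # bit set on second occurrence

    def test(bits, i):
        return bits[i >> 3] & (1 << (i & 7))

    def mark(bits, i):
        bits[i >> 3] |= 1 << (i & 7)

    # Pass 1: classify every code point as seen once or more than once.
    for ch in s:
        cp = ord(ch)
        if test(seen_once, cp):
            mark(seen_twice, cp)
        else:
            mark(seen_once, cp)

    # Pass 2: the first character never seen a second time is the answer.
    for ch in s:
        if not test(seen_twice, ord(ch)):
            return ch
    return None

The two bit vectors cost about 136 KB each; a dictionary of counts would often be smaller in practice, but the bit vector gives constant-time lookups with no hashing.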

Related

In a trie for inserting and searching normal paths, are ASCII 1-31 worth considering?

I am working on a trie data structure which inserts and searches for normal paths.
A path can contain any Unicode character, so to represent it completely in UTF-8, each trie node needs an array of next-node pointers for all 256 byte values.
But I am also concerned about the space and insertion time taken by the trie.
The paths my trie handles would only rarely contain non-ASCII bytes (values 128-255), so I simply put in an if condition to reject paths containing any byte above 127. I don't think bytes 1-31 are relevant either, although I am unsure about this. Since characters 1-31 are control characters like carriage return and ESC, can I simply skip them without inserting? That is, is it possible in a real scenario to encounter paths that differ only by ASCII characters 1-31?
Answering this old question: on macOS, ASCII 13 (carriage return) is used in the names of custom icon files, which may appear in many paths. Thanks to @EricPostpischil, who pointed this out in the comments.
All the other characters in the 1-31 range appear only rarely in paths.
Also, macOS paths are mostly case-insensitive, so distinguishing lowercase from uppercase is generally unnecessary. A filtering insert along these lines is sketched below.
PS:
Although this question may look opinion-based, it actually isn't, because it can be answered quite concisely: it asks about the frequency with which these characters appear in paths on macOS. (Sorry for the confusing title; I was a noob at the time, and changing it now would make all the existing comments absurd.)
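
To make the filtering concrete, here is a hypothetical byte-keyed trie insert in Python that rejects non-ASCII paths outright and skips control bytes other than CR (13), per the macOS caveat above; the names are illustrative, not from the questioner's code:

class TrieNode:
    def __init__(self):
        self.children = [None] * 128  # ASCII-only child array
        self.is_path = False

def insert(root, path):
    node = root
    for b in path.encode("utf-8"):
        if b > 127:
            return False               # reject paths with non-ASCII bytes
        if 1 <= b <= 31 and b != 13:
            continue                   # skip control bytes, but keep CR
        if node.children[b] is None:
            node.children[b] = TrieNode()
        node = node.children[b]
    node.is_path = True
    return True

Note that silently skipping control bytes means two paths differing only in those bytes collide in the trie; rejecting such paths instead would be the safer choice if they can actually occur.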

Algorithm for a one-character checksum

I am desperately searching for an algorithm to create a checksum that is at most two characters long and can detect transposed characters in the input sequence. When testing different algorithms, such as Luhn, CRC24 or CRC32, the checksums were always longer than two characters, and if I truncate the checksum to two or even one character, not all transpositions are detected any more.
Does any of you know an algorithm that meets my needs? Even just a name with which I could continue my search would help. I would be very grateful.
Assuming that your data is alphanumeric, that you want to detect all transpositions (in the ideal case), and that you can afford a binary checksum (i.e. the full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by @Paul Hankin in the comments), as it is more information-dense than check-digit algorithms like Luhn or Damm, and more "generic" with respect to the possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT); give it a try and see how it works for you.
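
If you want to experiment locally, here is a minimal bitwise CRC-16-CCITT sketch in Python (the common "FALSE" variant: polynomial 0x1021, initial value 0xFFFF); it favors clarity over speed:

def crc16_ccitt(data, crc=0xFFFF):
    # Bit-by-bit CRC-16-CCITT: polynomial 0x1021, MSB first, no reflection.
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

# Standard check value: crc16_ccitt(b"123456789") == 0x29B1

Keep in mind that no 16-bit checksum can detect every possible transposition of arbitrarily distant characters; a CRC only makes undetected errors improbable, not impossible.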

Algorithm/Hashing/Creative Way To Map Beyond 2 Alphanumeric Characters Combinations

I have a system that is confined to two alphanumeric characters. Some simple math shows that we get 1,296 combinations if we use all possible two-character strings over 0-9 and A-Z. Lowercase letters cannot be distinguished from uppercase, and special characters (including a blank) cannot be used.
Is there any creative mapping, perhaps to an external reference, that would take this two-character field significantly beyond 1,296 combinations?
Examples of identifiers would be `00`, `OO`, `AZ`, `Z4`, etc.
Thanks!
I'm afraid not, no more than you could get a 3-bit number to represent more than 8 different values. If you're interested in the details you can look up information theory or Kolmogorov complexity. Essentially, with only 1,296 combinations you can only label 1,296 distinct pieces of information.
As an example, consider if you had 1,297 things. The two-character combinations would cover the first 1,296, so what combination would be assigned to the next one? It would have to repeat one used earlier.
Shor also has some good material on this, and the implications of this sort of thing form the basis of many file compression systems.
You could maybe squeeze out one more combination if you cheat and let a "null" value represent a different possibility, but that's not really in the spirit of the question.
If you are restricted to two characters taken from an alphabet of 36, then you are limited to 36² = 1,296 distinct symbols, that's it.
More context is required to find workarounds, like stealing bits elsewhere, using symbols in pairs, breaking the case limitation, or exploiting the history of transactions...
The precise meaning of "a system that is confined to two alphanumeric characters" needs to be known before a workaround can be suggested. Is it a space constraint? Do you need the restriction to 2 characters for efficiency? Does it need to interoperate with other code that accepts or generates 2-character indexes?
If you have up to 1,295 identifiers that are used often, and some others that occur only occasionally, you could reserve one identifier, e.g. "ZZ", to indicate that another identifier follows. Then "00" through "ZY" would be 1,295 simple 2-char identifiers, and "ZZ00" through "ZZZZ" would be a further 1,296 combined 4-char identifiers. (Or "ZZ0000" through "ZZZZZZ" for a further 1,296 × 1,296 identifiers...)
This could work under a space constraint. For efficiency, it depends on whether the additional check for the "ZZ" prefix is too expensive.
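
A small Python sketch of this escape-code scheme (the layout is exactly as described above; the names are illustrative):

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n):
    if n < 1295:                # 0..1294 -> "00".."ZY"
        return ALPHABET[n // 36] + ALPHABET[n % 36]
    n -= 1295                   # 1295..2590 -> "ZZ00".."ZZZZ"
    return "ZZ" + ALPHABET[n // 36] + ALPHABET[n % 36]

def decode(s):
    if s.startswith("ZZ"):
        return 1295 + ALPHABET.index(s[2]) * 36 + ALPHABET.index(s[3])
    return ALPHABET.index(s[0]) * 36 + ALPHABET.index(s[1])

This is just a fixed-escape variable-length code, so it has the usual trade-off: common identifiers stay short while rare ones pay two extra characters.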

What is the name of this text compression scheme?

A couple years ago I read about a very lightweight text compression algorithm, and now I can't find a reference or remember its name.
It used the difference between each successive pair of characters. Since, for example, a lowercase letter predicts that the next character will also be a lowercase letter, the differences tend to be small. (It might have thrown out the low-order bits of the preceding character before subtracting; I cannot recall.) Instant complexity reduction. And it's Unicode friendly.
Of course there were a few bells and whistles, and the details of producing a bitstream, but it was super lightweight and suitable for embedded systems. No hefty dictionary to store. I'm pretty sure that the summary I saw was on Wikipedia, but I cannot find anything.
I recall that it was invented at Google, but it was not Snappy.
I think what you're on about is BOCU (Binary-Ordered Compression for Unicode) or one of its predecessors/successors. In particular:
The basic structure of BOCU is simple. In compressing a sequence of code points, you subtract the last code point from the current code point, producing a signed delta value that can range from -10FFFF to 10FFFF. The delta is then encoded in a series of bytes. Small differences are encoded in a small number of bytes; larger differences are encoded in a successively larger number of bytes.
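
As a toy illustration of the delta idea only (real BOCU adds careful byte-serialization rules and state resets, which this omits):

def code_point_deltas(text):
    prev = 0
    deltas = []
    for ch in text:
        cp = ord(ch)
        deltas.append(cp - prev)  # small when neighboring characters are similar
        prev = cp
    return deltas

# code_point_deltas("hello") -> [104, -3, 7, 0, 3]

Runs of characters from the same alphabet produce small deltas, which a variable-length byte code can then pack tightly.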

How to find "equivalent" texts?

I want to find (not generate) two text strings such that, after removing all non-letters and uppercasing, one string can be translated to the other by simple substitution.
The motivation comes from a project I know of that is testing methods for attacking ciphers via probability distributions. I'd like to find a large, coherent plain text that, once encrypted with a simple substitution cipher, can be decrypted to something else that is also coherent.
This breaks into two parts: find the longest such strings in a corpus, and get that corpus.
The first part seems to me to be amenable to some sort of attack with a B-tree keyed off the string after a substitution that makes the sequence of first occurrences sequential:
HELLOWORLDTHISISIT
1233454637819a9a98
A little optimization based on knowing, at each depth of the tree, the maximum value and remaining length of the string, and the rest is just coding.
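
Here is a small Python sketch of that canonical form; two strings are substitution-equivalent exactly when their canonical patterns match, so the search reduces to finding long strings with identical patterns:

def canonical_pattern(s):
    SYMBOLS = "123456789abcdefghijklmnopq"  # 26 symbols, one per distinct letter
    order = {}
    out = []
    for ch in s:
        if ch not in order:
            order[ch] = SYMBOLS[len(order)]  # number letters by first occurrence
        out.append(order[ch])
    return "".join(out)

# canonical_pattern("HELLOWORLDTHISISIT") == "1233454637819a9a98"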
The other part would be quite a bit more involved: how do you gather a large corpus of text to search? Some kind of internet spider seems like the ideal approach, since it would have access to the largest amount of text, but how do you strip it down to just the text?
The question is: any ideas on how to do this better?
Edit: the cipher that was being used is an insanely basic 26-letter substitution cipher.
P.S. This is more a thought experiment than a probable real project for me.
There are 26! different substitution ciphers. That works out to a bit over 88 bits of choice:
>>> import math
>>> math.log(math.factorial(26), 2)
88.38195332701626
The entropy of English text is something like 2 bits per character at least, so those 88 bits of key are "used up" after roughly 44 characters. It seems to me you can't reasonably expect to find passages of more than 45-50 characters that are accidentally equivalent under substitution.
For the large corpus, there's Project Gutenberg and Wikipedia, for a start. You can download a dump of the entire English Wikipedia as XML from their website.
I think you're asking a bit much in wanting a substitution that is also "coherent". Deciding what text is coherent is an AI problem in itself. Also, the longer your text is, the harder it becomes to create a "coherent" result, quickly approaching a point where you need a "key" as long as the text you are encrypting, which defeats the purpose of encrypting it at all.
