Composing data-structures/strings matching a CFG - algorithm

The JSON syntax is an example of a CFG (context-free grammar).
A valid JSON string is a sequence of tokens constrained by a CFG; equivalently, the tokens can be thought of as a structure of nested values.
Valid JSON strings can be sequentially composed to produce a new valid JSON string - but this is not the only way existing strings can be combined. If the first string were an array, the second string could be 'inserted' as a new value at any position in the array - and strings can be composed into an infinite number of distinct new strings with the addition only of arrays. By carefully choosing where to insert one string into an array in another, a large number of other possibilities are revealed.
I'm interested in establishing a taxonomy of the ways in which valid strings can be composed, covering all compositions of values. I accept that a composition may require introducing new characters, such as the ',' needed to insert a value into an array.
It strikes me that this is a sufficiently fundamental question that there is likely (at least one) standard answer. Does anyone know what I should look up?
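To make the array case concrete, here is a minimal sketch (not a standard answer to the taxonomy question) that treats composition as an operation on parsed values rather than on characters. The function names `insert_into_array` and `all_top_level_insertions` are made up for illustration, and only top-level insertion positions are enumerated.

```python
import json


def insert_into_array(outer: str, inner: str, index: int) -> str:
    """Insert the JSON value `inner` into the JSON array `outer` at `index`."""
    array = json.loads(outer)
    if not isinstance(array, list):
        raise TypeError("outer must be a JSON array")
    array.insert(index, json.loads(inner))
    # Serializing the parsed structure adds the required ',' separators.
    return json.dumps(array)


def all_top_level_insertions(outer: str, inner: str) -> list[str]:
    """Every string obtained by inserting `inner` at one top-level position."""
    length = len(json.loads(outer))
    return [insert_into_array(outer, inner, i) for i in range(length + 1)]
```

Working on the parsed tree guarantees the result is valid JSON; nested arrays inside `outer` would give further insertion points that this sketch does not visit.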

Related

Natural sorting of UTF-8 strings in DynamoDB

I'm storing file names (with extension) and directory names as UTF-8 strings in DynamoDB as sort keys.
As far as I know, file names + ext and directory names are unique within a directory, so I can use those strings as unique IDs within the parent directory.
These strings will, being UTF-8, be sorted by byte value rather than in natural order: 10 will come before 2, uppercase before lowercase, and so on.
As I try to represent a file hierarchy, I would like to retrieve the items sorted in a natural order instead.
I could do some magic on the strings to have them sort naturally before I use them as sort keys, but then I would need to keep an attribute with the original name and those are bytes I would like to save, if possible.
If it matters, this is part of a single table design.
Are there any design patterns, hashing algorithms or other approaches I could use to solve this?
I don't know what "magic" you intend to do. Usually people will zero-pad the numbers to some arbitrary max length so that string sorting the numbers matches the numeric sort, for positive integers anyway. If you do that you could remove the padding on display.
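The zero-padding idea can be sketched as follows. The width of 10 digits and the helper names `natural_key` and `display_name` are assumptions for illustration, not anything DynamoDB provides.

```python
import re

PAD_WIDTH = 10  # assumed maximum number of digits in any run


def natural_key(name: str, width: int = PAD_WIDTH) -> str:
    """Zero-pad every run of ASCII digits so string order matches numeric order."""
    return re.sub(r"[0-9]+", lambda m: m.group().zfill(width), name)


def display_name(key: str) -> str:
    """Strip the padding again for display, keeping a lone '0' intact."""
    return re.sub(r"[0-9]+", lambda m: m.group().lstrip("0") or "0", key)
```

Note the limits of this sketch: a name that originally contained leading zeros (such as `007.txt`) does not round-trip through `display_name`, digit runs longer than the chosen width still sort incorrectly, and the uppercase-before-lowercase ordering is not addressed.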

How to assign small numbers efficiently to random strings

I have a datasource that outputs some data with each data piece containing some seemingly random string. Two data pieces can contain the same such random string.
I have to insert the data into a database. Since the random strings are quite long, I want to assign small numbers to them such that two random strings are equal if and only if the same number is assigned to both of them.
How can I do this efficiently?
Since I am not too familiar with programming (except for a little bit of Java), I would insert the random strings into an array assigning to them the indices they have in the array.
Is this efficient?
Thanks for answering! Please be rather concrete with your explanation because I have very little experience.
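The array idea from the question works for handing out numbers, but finding an existing string in a plain array means scanning it. A minimal sketch that pairs the array with a hash map for the lookup is shown below (in Python rather than Java; the class name `StringIds` and its methods are hypothetical).

```python
class StringIds:
    """Assign consecutive small integers to strings; equal strings share an id."""

    def __init__(self) -> None:
        self._ids: dict[str, int] = {}   # string -> id, average O(1) lookup
        self._strings: list[str] = []    # id -> string, the "array" from the question

    def id_for(self, s: str) -> int:
        existing = self._ids.get(s)
        if existing is not None:
            return existing
        new_id = len(self._strings)      # next free index in the array
        self._ids[s] = new_id
        self._strings.append(s)
        return new_id

    def string_for(self, i: int) -> str:
        return self._strings[i]
```

If the mapping has to survive restarts, it would need to live in the database itself, for example as a lookup table with a unique constraint on the string.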

Bad word table for random alpha string?

I'm writing an algorithm to generate a random 6-character string (e.g. customer code XDEJQW). I want to ensure that no bad or offensive words or strings appear within it. I guess I have no choice but to have a database table of those bad words, right? It just seems icky that I'll have to have an add/edit page for someone to go to that has some pretty awful words in it.
Thanks.
No need for a table, you can either use a string array or an enum for this purpose. The advantage is that you do not have to send a request to get the records of the bad word table. It is better for performance. Basically you can randomize the 6-character value until the result does not contain bad words.
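A minimal sketch of that regenerate-until-clean loop follows. The entries in `BLOCKLIST` are harmless placeholders and the function names are made up for illustration.

```python
import random
import string

BLOCKLIST = ("BAD", "UGLY")  # placeholder entries; a real list would be longer


def random_code(length: int = 6, rng=random) -> str:
    """Random string of uppercase ASCII letters."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(length))


def contains_blocked(code: str, blocklist=BLOCKLIST) -> bool:
    """True if any blocked word occurs anywhere inside the code."""
    return any(word in code for word in blocklist)


def clean_code(length: int = 6, blocklist=BLOCKLIST, rng=random) -> str:
    """Keep drawing random codes until one contains no blocked word."""
    while True:
        code = random_code(length, rng)
        if not contains_blocked(code, blocklist):
            return code
```

Because blocked words are rare among random 6-letter strings, the loop almost always finishes on the first draw.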
Depending on the purpose of the value, you can change the random process so that no valid words are generated at all; if no valid words are generated, offensive strings won't be either. For example:
use only consonants
use only vowels
use 3 consecutive consonants and 3 consecutive vowels
etc..
The point is that words are normally made of syllables, and a syllable needs a vowel to be pronounceable, usually paired with one or two (maybe more) consonants before, after, or around it that act as a "modifier" of the sound: bi, ca, do, et, if, or, or get, for, etc. If you can avoid these "patterns", the probability of generating a word is low.
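The "use only consonants" variant might look like the sketch below. Leaving out Y as well as the five vowels is my own assumption, since Y often acts as a vowel.

```python
import random

# Uppercase ASCII letters without A, E, I, O, U and (by assumption) Y.
CONSONANTS = "BCDFGHJKLMNPQRSTVWXZ"


def consonant_code(length: int = 6, rng=random) -> str:
    """Random code that cannot form a syllable because it has no vowels."""
    return "".join(rng.choice(CONSONANTS) for _ in range(length))
```

This shrinks the code space from 26^6 to 20^6 combinations, and vowel-free abbreviations can still slip through, so it lowers the risk rather than removing it.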
On the other hand, if you want to generate pronounceable passwords, you do exactly the opposite and alternate between consonants and vowels to produce syllables, e.g. cidofe, but in that case you do have to validate against a list of "bad words".
In either case, remember that if you are going to validate, don't just validate against full words; also try to filter out partial words, misspellings, and abbreviations to avoid things like SUKMYDIK.

String comparison algorithm, relevancy, how much "alike" 2 strings are

I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The presence of the second, different source, is due to the fact that the 2 sources are updated manually, independently. So what I have is an ID and a company Name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any good ways to do this? (I don't really like the idea of using regex to separate the words in each string and then finding matches for every word in the other string by Levenshtein distance, so I am looking for other ideas.)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
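The four steps above can be sketched as follows, as one possible reading of them. The function names are made up, and the Levenshtein routine is the textbook dynamic-programming version rather than any particular library's.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(name: str) -> str:
    # Steps 1 and 2: punctuation becomes whitespace, then split into words.
    words = re.sub(r"[^\w\s]", " ", name).split()
    # Step 3: words of <= 4 characters go to the end, sorted alphabetically.
    long_words = [w for w in words if len(w) > 4]
    short_words = sorted(w for w in words if len(w) <= 4)
    return " ".join(long_words + short_words)


def name_distance(a: str, b: str) -> int:
    # Step 4: Levenshtein on the normalized forms.
    return levenshtein(normalize(a), normalize(b))
```

With this normalization, `JSC "Foo"` and `Foo JSC.` both become `Foo JSC`, so their distance is 0. Case folding is not part of the listed steps and is left out here.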
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices, each of which is a Levenshtein distance divided by the sum of the lengths of both strings (a relative distance), computed on the following:
Just the 2 strings
The string composed of the result after separating the word sequences, eliminating the non-word chars, ordering ascending and joining with space as separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
Each of these in turn is an integer value between 1 and 1000. The resulting value is the product:
X1^E1 * X2^E2 * X3^E3 * X4^E4
Where X1..X4 are the indices, and E1..E4 are user-provided preferences for how valuable (significant) each index is. To keep the result inside the reasonable range of 1..1000, the vector (E1..E4) is normalized.
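For illustration, the combination step might look like the sketch below. It assumes that "normalized" means the exponents are rescaled to sum to 1, which makes the product a weighted geometric mean and keeps it within the range of the inputs; the original C# code is not shown in this post, so this is not a copy of it.

```python
import math


def combined_index(indices: list[float], weights: list[float]) -> float:
    """Weighted product X1^E1 * ... * Xn^En with exponents normalized to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    exponents = [w / total for w in weights]  # assumed normalization
    return math.prod(x ** e for x, e in zip(indices, exponents))
```

With all four indices between 1 and 1000 and non-negative weights, the result also stays between 1 and 1000, because a weighted geometric mean never leaves the range of its inputs.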
The results are impressive. The whole thing works much faster than I expected (I built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) on non-null values in the whole database is 765. Right down until about 300 there is virtually no matching company name. Around 200 there are companies that have kind of similar names, and some are the same names but written in very different ways, with abbreviations, additional words, etc. When it comes down to 100 and less, practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
Totally works, result is better than I've expected.
I wrote a post on my blog, to share this library in case someone else needs it.

calculating a hash of a data structure?

Let's say I want to calculate a hash of a data structure, using a hash algorithm like MD5 which accepts a serial stream, for the purposes of equivalence checking. (I want to record the hash, then recalculate the hash on the same or an equivalent data structure later, and check the hashes to gauge equivalence with high probability.)
Are there standard methods of doing this?
Issues I can see that are problematic are
if the data structure contains an array of binary strings, I can't just concatenate them since ["abc","defg"] and ["ab","cdefg"] are not equivalent arrays
if the data structure contains a collection that isn't guaranteed to enumerate in the same order, e.g. a key-value dictionary {a: "bc", d: "efg", h: "ijkl"} which should be considered equivalent to the dictionary {d: "efg", h: "ijkl", a: "bc"}.
For the first issue, also hash the lengths of the strings. This will differentiate their hashes.
For the second, sort the keys.
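Both suggestions can be combined in a small sketch. It uses SHA-256 from Python's `hashlib` instead of MD5; the one-byte type tags and the 8-byte length prefix are my own assumptions, and the name `structure_hash` is made up.

```python
import hashlib


def _feed(h, value) -> None:
    """Feed a canonical, length-prefixed encoding of `value` into hash object `h`."""
    if isinstance(value, bytes):
        h.update(b"b" + len(value).to_bytes(8, "big") + value)
    elif isinstance(value, str):
        encoded = value.encode("utf-8")
        h.update(b"s" + len(encoded).to_bytes(8, "big") + encoded)
    elif isinstance(value, (list, tuple)):
        h.update(b"l" + len(value).to_bytes(8, "big"))
        for item in value:
            _feed(h, item)
    elif isinstance(value, dict):
        h.update(b"d" + len(value).to_bytes(8, "big"))
        for key in sorted(value):  # sorted keys make enumeration order irrelevant
            _feed(h, key)
            _feed(h, value[key])
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")


def structure_hash(value) -> str:
    h = hashlib.sha256()
    _feed(h, value)
    return h.hexdigest()
```

The length prefixes separate `["abc", "defg"]` from `["ab", "cdefg"]`, and the sorted keys make the two dictionaries from the question hash identically. Dictionary keys are assumed to be mutually comparable, for example all strings.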
A "standard" way of doing this is to define a serialized form of the data structure, and digest the resulting byte stream.
For example, a TBSCertificate is a data structure comprising a subject name, extensions, and other information. This is converted to a string of octets in a deterministic way and hashed as part of a digital signature operation to produce a certificate.
There is also another problem with structs and it is the alignment of data members on different platforms.
If you want a stable and portable solution, you can solve this by implementing a "serialize" method for your data structure that produces a byte stream (or, more commonly, writes to a byte stream).
Then you can run the hash algorithm over the serialized stream. That way you can solve the problems you mentioned through explicit traversal of your data. As an added benefit, you will be able to save your data to disk or send it over the network.
For the strings, you can implement Pascal type storage where length comes first.
If the strings can't have any nul characters, you can use C strings to guarantee uniqueness, e.g. "abc\0defg\0" is distinct from "ab\0cdefg\0".
For dictionaries, maybe you can sort before hashing.
This also reminds me of an issue I heard of once... I don't know what language you are using, but if you are also hashing C structs without filtering them in any way, be careful about the space between fields that the compiler might have introduced for alignment reasons. Sometimes those will not be zeroed out.
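One way to sidestep the padding problem is to pack the fields explicitly in a fixed byte order before hashing, instead of hashing the in-memory struct. A minimal sketch using Python's `struct` module, with a hypothetical record of one signed byte and one 32-bit integer:

```python
import hashlib
import struct


def record_digest(flag: int, count: int) -> str:
    """Hash a (signed byte, 32-bit int) record from an explicit packed layout."""
    # "<" selects little-endian with no alignment padding: 1 + 4 = 5 bytes.
    payload = struct.pack("<bi", flag, count)
    return hashlib.sha256(payload).hexdigest()


packed_size = struct.calcsize("<bi")  # always 5
native_size = struct.calcsize("@bi")  # platform-dependent, may include padding
```

On many platforms `native_size` is larger than `packed_size` because the compiler-style layout inserts padding after the first field; those padding bytes are exactly what should stay out of the hash.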

Resources