Sorting with underscores - sorting

I'm trying to sort a list of files so that underscore characters are considered "later" than other ASCII characters, as seen in the example below (this is part of porting external software to Python 3). I'd like the sort to order file paths exactly as the original did, so that it produces no diffs against the original ordering.
Requirements: Avoid 3rd party sorting modules if possible
files = sorted( files, key=lambda d: d['name'].lower() )
Example re-ordering which I'm trying to avoid
-/usr/wte/wte_scripts/wfaping.sh
/usr/wte/wte_scripts/wfa_test_cli.sh
+/usr/wte/wte_scripts/wfaping.sh
I've searched for similar examples of sorting and couldn't find anything concrete with the same issue.
Thanks

The easiest way would be to just replace "_" with a character that is "considered later" than letters (e.g. "{", the first ASCII character after "z") in the key function:
sorted(files, key=lambda d: d["name"].lower().replace("_", "{"))
If a sorting collision between "_" and "{" is not acceptable, the solution would be to write a custom comparator function which imposes the desired order. Python 3 no longer accepts a cmp argument in list.sort or sorted, but functools.cmp_to_key converts a comparator into a key function, so you do not need to write the sorting algorithm yourself.
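As a rough sketch of the comparator route, the standard library's functools.cmp_to_key can wrap a comparison function so that sorted accepts it. The helper names underscore_last and compare_names and the sample files list are made up for illustration, and the ranking rule (underscore after every other character, otherwise code-point order) is only an assumption about what the original software did.

```python
from functools import cmp_to_key

def underscore_last(ch):
    # Assumed rule: "_" ranks after every other character;
    # everything else keeps its normal code-point order.
    return (ch == "_", ch)

def compare_names(a, b):
    # Classic cmp contract: negative, zero, or positive.
    ka = [underscore_last(c) for c in a.lower()]
    kb = [underscore_last(c) for c in b.lower()]
    return (ka > kb) - (ka < kb)

# Hypothetical sample data shaped like the question's list of dicts.
files = [
    {"name": "/usr/wte/wte_scripts/wfa_test_cli.sh"},
    {"name": "/usr/wte/wte_scripts/wfaping.sh"},
]

ordered = sorted(
    files,
    key=cmp_to_key(lambda x, y: compare_names(x["name"], y["name"])),
)
print([f["name"] for f in ordered])
```

Because "_" gets its own rank instead of borrowing the code point of "{", a file name that really contains "{" still sorts before one with "_" in the same position.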

Related

What algorithms can group characters into words?

I have some text generated by some lousy OCR software.
The output contains mixture of words and space-separated characters, which should have been grouped into words. For example,
Expr e s s i o n Syntax
S u m m a r y o f T e r minology
should have been
Expression Syntax
Summary of Terminology
What algorithms can group characters into words?
If I program in Python, C#, Java, C or C++, what libraries provide the implementation of the algorithms?
Thanks.
Minimal approach:
In your input, remove the space before any single letter words. Mark the final words created as part of this somehow (prefix them with a symbol not in the input, for example).
Get a dictionary of English words, sorted longest to shortest.
For each marked word in your input, find the longest match and break that off as a word. Repeat on the characters left over in the original "word" until there's nothing left over. (In the case where there's no match just leave it alone.)
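The three steps above could be sketched roughly as follows. The tiny VOCAB set is a stand-in for a real English word list, and the function names are hypothetical. Note that genuine one-letter words such as "a" or "I" would also be glued onto their neighbour; this sketch does not try to handle that.

```python
# Stand-in for a real English word list (assumption: lowercase entries).
VOCAB = {"expression", "express", "syntax", "play", "for", "forth",
         "the", "fun", "of", "it"}

def merge_single_letters(text):
    """Glue each single-letter token onto the token before it.

    Returns (word, marked) pairs; marked is True for words built by merging.
    """
    merged = []
    for token in text.split():
        if len(token) == 1 and merged:
            merged[-1] = (merged[-1][0] + token, True)
        else:
            merged.append((token, False))
    return merged

def greedy_split(word, vocab):
    """Repeatedly break off the longest dictionary word at the front."""
    pieces = []
    rest = word
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end].lower() in vocab:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            # No dictionary word starts here: leave the remainder alone.
            pieces.append(rest)
            break
    return pieces

def regroup(text, vocab=VOCAB):
    words = []
    for word, marked in merge_single_letters(text):
        words.extend(greedy_split(word, vocab) if marked else [word])
    return " ".join(words)

print(regroup("Expr e s s i o n Syntax"))
```

Trying the longest candidate first has the same effect as scanning a dictionary sorted longest to shortest, without needing to sort it.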
More sophisticated, overkill approach:
The problem of splitting words without spaces is a real-world problem in languages commonly written without spaces, such as Chinese and Japanese. I'm familiar with Japanese so I'll mainly speak with reference to that.
Typical approaches use a dictionary and a sequence model. The model is trained to learn transition properties between labels - part of speech tagging, combined with the dictionary, is used to figure out the relative likelihood of different potential places to split words. Then the most likely sequence of splits for a whole sentence is solved for using (for example) the Viterbi algorithm.
Creating a system like this is almost certainly overkill if you're just cleaning OCR data, but if you're interested it may be worth looking into.
A sample case where the more sophisticated approach will work and the simple one won't:
input: Playforthefunofit
simple output: Play forth efunofit (forth is longer than for)
sophisticated output: Play for the fun of it (forth efunofit is a low-frequency - that is, unnatural - transition, while for the is not)
You can work around the issue with the simple approach to some extent by adding common short-word sequences to your dictionary as units. For example, add forthe as a dictionary word, and split it in a post processing step.
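To give a feel for why a scoring model rescues the Playforthefunofit case, here is a deliberately simplified sketch. It is not the tagger-plus-transition model described above: it scores each word on its own (a unigram model) and uses a Viterbi-style dynamic program to pick the cheapest overall split. The costs in WORD_COST are invented for illustration; a real system would estimate them from a corpus.

```python
# Invented costs (lower means more common). Text that is not in the
# table can only be consumed one character at a time, at a heavy cost.
WORD_COST = {"play": 3.0, "for": 1.0, "forth": 4.0, "the": 1.0,
             "fun": 2.0, "of": 1.0, "it": 1.0}
UNKNOWN_COST_PER_CHAR = 10.0
MAX_WORD_LEN = max(len(w) for w in WORD_COST)

def segment(text):
    """Return the cheapest split of text under the unigram costs above."""
    n = len(text)
    best = [0.0] + [float("inf")] * n   # best[i]: cheapest cost of text[:i]
    back = [0] * (n + 1)                # back[i]: start of the last piece
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end].lower()
            if piece in WORD_COST:
                cost = WORD_COST[piece]
            elif end - start == 1:
                cost = UNKNOWN_COST_PER_CHAR
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

print(" ".join(segment("Playforthefunofit")))
```

Choosing "forth" forces the leftover "e" to be paid for as unknown text, so the split through "for" and "the" wins even though "forth" is the longer match.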
Hope that helps - good luck!

Simple compression algorithm - how to avoid marker-strings?

I was trying to come up with a string compression algorithm for plaintext, e.g.
AAAAAAAABB -> A#8BB
where n symbols y are written out like
y#n
The problem is: what if I need to compress the string "A#8" ? That would confuse the decompression algorithm into thinking that the original input was "AAAAAAAA" instead of just "A#8".
How can I solve this problem? I was thinking of using a "marker" character instead of the #, but what if I wanted the algorithm to work with binary data? There is no marker character that can be used in that case, I suppose.
A simple solution is escaping: you could represent each # in the source by ##.
Every time you encounter a #, you look one character ahead and find either a number (repeat the previous character) or another # (it's a literal #).
A variant would be encoding each # as ##1, which would fit nicely into your current scheme and allows encoding n consecutive # as ##n.
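A minimal sketch of the ## escaping idea, under two assumptions that are not part of the answer: the repeat count is always a single digit (longer runs are split into chunks of at most 9) so that a literal digit following a count cannot be mistaken for part of it, and only runs of 4 or more are compressed. The function names encode and decode are hypothetical.

```python
def encode(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        j = i
        while j < len(text) and text[j] == ch:
            j += 1
        run = j - i
        if ch == "#":
            # Escape every literal "#" as "##"; "#" runs are not compressed.
            out.append("##" * run)
        else:
            # Assumption: single-digit counts, so split long runs.
            while run >= 4:
                chunk = min(run, 9)
                out.append(f"{ch}#{chunk}")
                run -= chunk
            out.append(ch * run)
        i = j
    return "".join(out)

def decode(data):
    # Assumes well-formed input produced by encode().
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        if ch == "#":
            nxt = data[i + 1]
            if nxt == "#":
                out.append("#")                     # escaped literal "#"
            else:
                # "y#n": y was already emitted once, add n - 1 more.
                out.append(out[-1][-1] * (int(nxt) - 1))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(encode("AAAAAAAABB"), encode("A#8"))
```

The same one-symbol lookahead works for binary data if "#" is replaced by any fixed byte value, since that byte is always escaped when it appears in the input.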

Is there a method to find the most specific pattern for a string?

I'm wondering whether there is a way to generate the most specific regular expression (if such a thing exists) that matches a given string. Here's an illustration of what I want the method to do:
str = "(17 + 31)"
find_pattern(str)
# => /^\(\d+ \+ \d+\)$/ (or something more specific)
My intuition was to use Regex.new to accumulate the desired pattern by looping through str and checking for known patterns like \d, \s, and so on. I suspect there is an easy way for doing this.
This is in essence a compression problem. The simplest way to match a list of known strings is to use the Regexp.union factory method, but that just tries each string in turn; it does not do anything "clever":
combined_rx = Regexp.union( "(17 + 31)", "(17 + 45)" )
=> /\(17\ \+\ 31\)|\(17\ \+\ 45\)/
This can still be useful to construct multi-stage validators, without you needing to write loops to check them all.
However, a generic pattern matcher that could figure out what you mean to match from examples is not really possible. There are too many ways in which you could consider strings to be similar or not. The closest I could think of would be genetic programming where you supply a large list of should match/should not match strings and the code guesses at the best regex by constructing random Regexp objects (a challenge in itself) and seeing how accurately they match and don't match your examples. The best matchers could be combined and mutated and tried again until you got 100% accuracy. This might be a fun project, but ultimately much more effort for most purposes than writing the regular expressions yourself from a description of the problem.
If your problem is heavily constrained - e.g. any example integer could always be replaced by \d+, any example space by \s+ etc, then you could work through the string replacing "matchable units", in fact using the same regular expressions checked in turn. E.g. if you match \A\d+ then consume the match from the string, and add \d+ to your regex. Then take the remainder of the string and look for next matching pattern. Working this way will have its limitations (you must know the full set of patterns you want to match in advance, and all examples would have to be unambiguous). However, it is more tractable than a genetic program.
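The constrained approach could look something like this sketch, written in Python rather than Ruby. The RULES list (digits and whitespace only) is an assumed example of a "known set of patterns"; any character that no rule consumes is kept as an escaped literal.

```python
import re

# Assumed set of generalisable units, checked in order.
RULES = [
    (re.compile(r"\d+"), r"\d+"),
    (re.compile(r"\s+"), r"\s+"),
]

def find_pattern(text):
    """Build a regex source string by consuming text left to right."""
    parts = []
    pos = 0
    while pos < len(text):
        for matcher, replacement in RULES:
            m = matcher.match(text, pos)
            if m:
                parts.append(replacement)
                pos = m.end()
                break
        else:
            # No rule applies here: keep this character literally.
            parts.append(re.escape(text[pos]))
            pos += 1
    return r"\A" + "".join(parts) + r"\Z"

pattern = find_pattern("(17 + 31)")
print(pattern)
```

Rule order matters as soon as two rules can match at the same position, which is the ambiguity limitation mentioned above.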

Diffing many texts against each other to derive template and data (finding common subsequences)

Suppose there are many texts that are known to be made from a single template (for example, many HTML pages, rendered from a template backed by data from some sort of database). A very simple example:
id:937 name=alice;
id:28 name=bob;
id:925931 name=charlie;
Given only these 3 texts, I'd like to get original template that looks like this:
"id:" + $1 + " name=" + $2 + ";"
and 3 sets of strings that were used with this template:
$1 = 937, $2 = alice
$1 = 28, $2 = bob
$1 = 925931, $2 = charlie
In other words, "template" is a list of the common subsequences encountered in all given texts always in a certain order and everything else except these subsequences should be considered "data".
I guess the general algorithm would be very similar to any LCS (longest common subsequence) algorithm, albeit with different backtracking code, that would somehow separate "template" (characters common for all given texts) and "data strings" (different characters).
Bonus question: are there ready-made solutions to do so?
I agree with the comments about the question being ill-defined. It seems likely that the format is much more specific than your general question indicates.
Having said that, something like RecordBreaker might be a help. You could also Google "wrapper induction" to see if you find some useful leads.
Perform a global multiple sequence alignment, and then call every resulting column that has a constant value part of the template:
                   id:   937 name=alice  ;
                   id:    28 name=bob    ;
                   id:925931 name=charlie;
Inferred template: XXX      XXXXXX       X
Most tools that I'm aware of for multiple sequence alignment require smaller alphabets -- DNA or protein -- but hopefully you can find a tool that works on the alphabet you're using (which presumably is at least all printable ASCII characters). In the worst case, you can of course implement the DP yourself: to align 2 sequences (strings) globally you use the Needleman-Wunsch algorithm, while for more than two sequences there are several approaches, the most common being sum-of-pairs scoring. The exact algorithm for k > 2 sequences unfortunately takes time exponential in k, but the heuristics employed in bioinformatics tools such as MUSCLE are much faster, and produce alignments that are very nearly as good. If they can be persuaded to work with the alphabet you're using, they would be the natural choice.
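For small inputs, a much cruder stand-in for a real multiple sequence alignment is to fold a pairwise longest-common-subsequence over all texts and treat what survives as the template. The sketch below does that; it is not Needleman-Wunsch or a sum-of-pairs aligner. It also assumes every text begins and ends with template characters, and it places the common characters with a greedy leftmost match, which can misplace them on less tidy data. All function names are hypothetical.

```python
import re

def lcs(a, b):
    """Longest common subsequence of two strings (classic DP table)."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < rows and j < cols:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(out)

def align(common, text):
    """Greedy leftmost position of each common character inside text."""
    positions, start = [], 0
    for ch in common:
        start = text.index(ch, start)
        positions.append(start)
        start += 1
    return positions

def infer_template(texts):
    common = texts[0]
    for text in texts[1:]:
        common = lcs(common, text)
    alignments = [align(common, t) for t in texts]
    # Cut the template wherever some text has data between neighbours.
    segments, current = [], common[:1]
    for k in range(1, len(common)):
        if any(pos[k] != pos[k - 1] + 1 for pos in alignments):
            segments.append(current)
            current = ""
        current += common[k]
    segments.append(current)
    return segments

def extract(segments, text):
    """Pull out the data strings that sit between the template segments."""
    pattern = "(.*?)".join(re.escape(s) for s in segments)
    return re.fullmatch(pattern, text, re.DOTALL).groups()

texts = ["id:937 name=alice;", "id:28 name=bob;", "id:925931 name=charlie;"]
segments = infer_template(texts)
print(segments, [extract(segments, t) for t in texts])
```

The fold is quadratic per pair of texts, so it is only practical for short strings; for whole HTML pages the heuristic aligners mentioned above are the more realistic route.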

How to elegantly compute the anagram signature of a word in ruby?

Arising out of this question, I'm looking for an elegant (ruby) way to compute the word signature suggested in this answer.
The idea suggested is to sort the letters in the word, and also run length encode repeated letters. So, for example "mississippi" first becomes "iiiimppssss", and then could be further shortened by encoding as "4impp4s".
I'm relatively new to ruby and though I could hack something together, I'm sure this is a one liner for somebody with more experience of ruby. I'd be interested to see people's approaches and improve my ruby knowledge.
edit: to clarify, performance of computing the signature doesn't much matter for my application. I'm looking to compute the signature so I can store it with each word in a large database of words (450K words), then query for words which have the same signature (i.e. all anagrams of a given word, that are actual english words). Hence the focus on space. The 'elegant' part is just to satisfy my curiosity.
The fastest way to create a sorted list of the letters is this:
"mississippi".unpack("c*").sort.pack("c*")
It is quite a bit faster than split('') and join(). For comparison, it is also best to pack the array back together into a String, so you don't have to compare arrays.
I'm not much of a Ruby person either, but as I noted on the other comment this seems to work for the algorithm described.
s = "mississippi"
s.split('').sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
Of course, you'll want to make sure the word is lowercase, doesn't contain numbers, etc.
As requested, I'll try to explain the code. Please forgive me if I don't get all of the Ruby or reg ex terminology correct, but here goes.
I think the split/sort/join part is pretty straightforward. The interesting part for me starts at the call to gsub. This will replace a substring that matches the regular expression with the return value from the block that follows it. The reg ex finds any character and creates a backreference. That's the "(.)" part. Then, we continue the matching process using the backreference "\1" that evaluates to whatever character was found by the first part of the match. We want that character to be found a minimum of two more times for a total minimum number of occurrences of three. This is done using the quantifier "{2,}".
If a match is found, the matching substring is then passed to the next block of code as an argument thanks to the "|s|" part. Finally, we use the string equivalent of the matching substring's length and append to it whatever character makes up that substring (they should all be the same) and return the concatenated value. The returned value replaces the original matching substring. The whole process continues until nothing is left to match since it's a global substitution on the original string.
I apologize if that's confusing. As is often the case, it's easier for me to visualize the solution than to explain it clearly.
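For readers who find the regex easier to follow in another language, here is a rough Python port of the same idea (sort the letters, then replace every run of three or more identical characters with its length followed by the character). The function name signature is made up, and like the Ruby version it assumes the word contains no digits.

```python
import re

def signature(word):
    letters = "".join(sorted(word.lower()))
    # (.) captures one character; \1{2,} requires at least two more
    # copies of it, so only runs of length three or more are encoded.
    return re.sub(
        r"(.)\1{2,}",
        lambda m: str(len(m.group(0))) + m.group(1),
        letters,
    )

print(signature("mississippi"))
```

Two words are anagrams of each other exactly when their signatures are equal, which is what makes the value usable as a database lookup key.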
I don't see an elegant solution. You could use the split message to get the characters into an array, but then once you've sorted the list I don't see a nice linear-time concatenate primitive to get back to a string. I'm surprised.
Incidentally, run-length encoding is almost certainly a waste of time. I'd have to see some very impressive measurements before I'd think it worth considering. If you avoid run-length encoding, you can anagrammatize any string, not just a string of letters. And if you know you have only letters and are trying to save space, you can pack them 5 bits to a letter.
---Irma Vep
EDIT: the other poster found join which I missed. Nice.
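The 5-bits-per-letter idea could be sketched like this, assuming the word consists only of the letters a to z. The function name pack5 is hypothetical. Letters are numbered 1 to 26 so that the all-zero pattern is never a valid letter.

```python
def pack5(word):
    """Pack the sorted letters of word into bytes, 5 bits per letter."""
    value = 0
    for ch in sorted(word.lower()):
        # Assumption: ch is in "a".."z", giving codes 1..26 (fits in 5 bits).
        value = (value << 5) | (ord(ch) - ord("a") + 1)
    return value.to_bytes((5 * len(word) + 7) // 8, "big")

packed = pack5("mississippi")
print(len(packed))
```

For an 11-letter word this stores 55 bits in 7 bytes instead of 11 characters, and it never needs the run-length step.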
