Lexicographically sort file by line, reading right-to-left

I'm looking for a command-line (ideally) solution that lets me sort the lines in a file by comparing each line from right to left.
For example...
Input:
aabc
caab
bcaa
abca
Output:
bcaa
abca
caab
aabc
I'll select the answer which I think will be the easiest to remember in a year when I've forgotten I posted this question, but I'll also upvote clever/short answers as well.

The easiest to remember would be
reverse < input | sort | reverse
You will have to write a reverse command though. Under Linux, there's rev.
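For example, with the sample input above saved in a file called input (the file name here is just for illustration), the rev-based pipeline produces the desired order:
$ rev input | sort | rev
bcaa
abca
caab
aabc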

Removing blankspace at the start of a line (size of blankspace is not constant)

I am a beginner with sed. I am trying to use it to edit down a uniq -c result, removing the spaces before the numbers so that I can then convert it to a usable .tsv.
The furthest I have gotten is to use:
$ sed 's|\([0-9].*$\)|\1|' comp-c.csv
With the input:
8 Delayed speech and language development
15 Developmental Delay and additional significant developmental and morphological phenotypes referred for genetic testing
4 Developmental delay AND/OR other significant developmental or morphological phenotypes
1 Diaphragmatic eventration
3 Downslanted palpebral fissures
The output from this is identical to the input; it recognises the first number (I have tested this with a simple substitution) but also drags in the preceding blank space for some reason.
To clarify, I would like to remove all spaces before the numbers; hardcoding a simple trim will not work, as some lines contain double- or triple-digit numbers and so do not have the same amount of blank space before the number.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
It's all about writing the correct regex:
sed 's/^ *//' comp-c.csv
That is, replace zero or more spaces at the start of lines (as many as there are) with nothing.
Bonus points for some way to produce a usable uniq -c result without this faffing around with blank space.
The uniq command doesn't have a flag to print its output without the leading blanks. There's no way around stripping them yourself.
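As a rough sketch of the whole pipeline, assuming GNU sed (for the \t escape) and that the original, pre-uniq lines live in a hypothetical file named phenotypes.txt, you can strip the padding and turn the single space after the count into a tab in one pass:
sort phenotypes.txt | uniq -c | sed 's/^ *//; s/ /\t/' > phenotypes.tsv
Only the first space on each line is replaced, which is exactly the space uniq -c puts between the count and the text.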

Sorting on two columns in vim

I have a table that looks something like this:
FirstName SurName;Length;Weight;
I need to sort on length, and if the length is equal for two or more names, I need to sort those on weight. :sort ni sorts only on length; I also tried :sort /.\{-}\ze\dd/, but that didn't work either.
Any help would be greatly appreciated!
This can be done using an external (GNU) sort pretty straightforwardly:
!sort -t ';' -k 2,2n -k 3,3n
This says: split fields by semicolon, sort by 2nd field numerically, then by 3rd field numerically. Probably a lot easier to read and remember than whatever vim-internal command you can cook up.
Much more info on GNU sort here: http://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html
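For illustration, here it is run on some made-up rows (names and numbers invented for the example):
$ printf 'Ann A;180;75;\nBob B;170;90;\nCal C;170;65;\n' | sort -t ';' -k 2,2n -k 3,3n
Cal C;170;65;
Bob B;170;90;
Ann A;180;75;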
Try with the r flag.
Sort on Length:
:%sort rni /.*;\ze\d/
Sort on Weight:
:%sort rni /\d\+\ze;$/
Without this flag, the sorting is performed on what comes after the match, which can be a little cumbersome.
With the r flag, the sorting is done on the match itself which may be easier to define. Here, the pattern matches a series of 1 or more digits just before a semicolon at the end of the line.

Is there a diff-like algorithm that handles moving block of lines?

The diff program, in its various incarnations, is reasonably good at computing the difference between two text files and expressing it more compactly than showing both files in their entirety. It shows the difference as a sequence of inserted and deleted chunks of lines (or changed lines in some cases, but that's equivalent to a deletion followed by an insertion). The same or very similar program or algorithm is used by patch and by source control systems to minimize the storage required to represent the differences between two versions of the same file. The algorithm is discussed here and here.
But it falls down when blocks of text are moved within the file.
Suppose you have the following two files, a.txt and b.txt (imagine that they're both hundreds of lines long rather than just 6):
a.txt b.txt
----- -----
1 4
2 5
3 6
4 1
5 2
6 3
diff a.txt b.txt shows this:
$ diff a.txt b.txt
1,3d0
< 1
< 2
< 3
6a4,6
> 1
> 2
> 3
The change from a.txt to b.txt can be expressed as "Take the first three lines and move them to the end", but diff shows the complete contents of the moved chunk of lines twice, missing an opportunity to describe this large change very briefly.
Note that diff -e shows the block of text only once, but that's because it doesn't show the contents of deleted lines.
Is there a variant of the diff algorithm that (a) retains diff's ability to represent insertions and deletions, and (b) efficiently represents moved blocks of text without having to show their entire contents?
Since you asked for an algorithm and not an application, take a look at "The String-to-String Correction Problem with Block Moves" by Walter Tichy. There are others, but that's the original, so you can look for papers that cite it to find more.
The paper cites Paul Heckel's paper "A technique for isolating differences between files" (mentioned in this answer to this question) and mentions this about its algorithm:
Heckel[3] pointed out similar problems with LCS techniques and proposed a
linear-time algorithm to detect block moves. The algorithm performs adequately
if there are few duplicate symbols in the strings. However, the algorithm gives
poor results otherwise. For example, given the two strings aabb and bbaa,
Heckel's algorithm fails to discover any common substring.
The following method is able to detect block moves:
Paul Heckel: A technique for isolating differences between files
Communications of the ACM 21(4):264 (1978)
http://doi.acm.org/10.1145/359460.359467 (access restricted)
Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)
wikEd diff is a free JavaScript diff library that implements this algorithm and improves on it. It also includes the code to compile a text output with insertions, deletions, moved blocks, and original block positions inserted into the new text version. Please see the project page or the extensively commented code for details. For testing, you can also use the online demo.
Git 2.16 (Q1 2018) will introduce another possibility, by ignoring some specified moved lines.
"git diff" learned a variant of the "--patience" algorithm, to which the user can specify which 'unique' line to be used as anchoring points.
See commit 2477ab2 (27 Nov 2017) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit d7c6c23, 19 Dec 2017)
diff: support anchoring line(s)
Teach diff a new algorithm, one that attempts to prevent user-specified lines from appearing as a deletion or addition in the end result.
The end user can use this by specifying "--anchored=<text>" one or more
times when using Git commands like "diff" and "show".
The documentation for git diff now reads:
--anchored=<text>:
Generate a diff using the "anchored diff" algorithm.
This option may be specified more than once.
If a line exists in both the source and destination, exists only once, and starts with this text, this algorithm attempts to prevent it from appearing as a deletion or addition in the output.
It uses the "patience diff" algorithm internally.
See the tests for some examples:
pre post
a c
b a
c b
Normally, c is moved to produce the smallest diff.
But with:
git diff --no-index --anchored=c pre post
the diff keeps c in place and reports a and b as the moved lines instead.
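A rough way to reproduce that example from a plain shell (file names and contents assumed to match the commit's test):
$ printf 'a\nb\nc\n' > pre
$ printf 'c\na\nb\n' > post
$ git diff --no-index pre post
$ git diff --no-index --anchored=c pre post
The first diff shows c as the moved line; the second keeps c anchored, so a and b show up as the moved lines instead.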
With Git 2.33 (Q3 2021), the command line completion (in contrib/) learned that "git diff"(man) takes the --anchored option.
See commit d1e7c2c (30 May 2021) by Thomas Braun (t-b).
(Merged by Junio C Hamano -- gitster -- in commit 3a7d26b, 08 Jul 2021)
completion: add --anchored to diff's options
Signed-off-by: Thomas Braun
This flag was introduced in 2477ab2 ("diff: support anchoring line(s)", 2017-11-27, Git v2.16.0-rc0 -- merge listed in batch #10) but back then, the bash completion script did not learn about the new flag.
Add it.
Here's a sketch of something that may work. Ignore diff insertions/deletions for the moment for the sake of clarity.
This seems to consist of figuring out the best blocking, similar to text compression. We want to find the common substrings of the two files. One option is to build a generalized suffix tree, iteratively take the maximal common substring, remove it, and repeat until there is no common substring of at least some size $s$. This can be done with a suffix tree in O(N^2) time (https://en.wikipedia.org/wiki/Longest_common_substring_problem#Suffix_tree). Greedily taking the maximal common substring appears to be optimal (as a function of characters compressed), since taking a character sequence from another substring means adding the same number of characters elsewhere.
Each substring would then be replaced by a symbol for that block and displayed once as a sort of 'dictionary'.
$ diff a.txt b.txt
1,3d0
< $
6a4,6
> $
$ = 1,2,3
Now we have to reintroduce diff-like behavior. The simple (possibly non-optimal) answer is to run the diff algorithm first, omit all the text that wouldn't appear in the original diff's output, and then run the above algorithm on what remains.
SemanticMerge, the "semantic scm" tool mentioned in this comment to one of the other answers, includes a "semantic diff" that handles moving a block of lines (for supported programming languages). I haven't found any details about the algorithm, but it's possible the diff algorithm itself isn't particularly interesting, as it relies on the output of a separate parse of the programming language source files. Here's SemanticMerge's documentation on implementing an (external) language parser, which may shed some light on how its diffs work:
External parsers - SemanticMerge
I tested it just now and its diff is fantastic. It's significantly better than the one I produced using the demo of the algorithm mentioned in this answer (and that diff was itself much better than what was produced by Git's default diff algorithm) and I suspect still better than one likely to be produced by the algorithm mentioned in this answer.
Our Smart Differencer tools do exactly this when computing differences between source texts of two programs in the same programming language. Differences are reported in terms of program structures (identifiers, expressions, statements, blocks) precise to line/column number, and in terms of plausible editing operations (delete, insert, move, copy [above and beyond OP's request for mere "copy"], rename-identifier-in-block).
The Smart Differencers require a structured artifact (e.g., a programming language), so they can't do this for arbitrary text. (We could define the structure to be "just lines of text", but we didn't think that would be particularly valuable compared to standard diff.)
For this situation in my real-life coding, when I actually move a whole block of code to another position in the source because it makes more sense either logically or for readability, what I do is this:
clean up all the existing diffs and commit them
so that the file just requires the move that we are looking for
remove the entire block of code from the source
save the file
and stage that change
add the code into the new position
save the file
and stage that change
commit the two staged patches as one commit with a reasonable message
Also check out the online tool simtexter, based on the SIM_TEXT algorithm. It seems by far the best.
You can also have a look at the source code for the JavaScript implementation or the C / Java one.

How to compare all the lines in a sorted file (file size > 1GB) in a very efficient manner

Let's say the input file is:
Hi my name NONE
Hi my name is ABC
Hi my name is ABC
Hi my name is DEF
Hi my name is DEF
Hi my name is XYZ
I have to create the following output:
Hi my name NONE 1
Hi my name is ABC 2
Hi my name is DEF 2
Hi my name is XYZ 1
The number of words in a single line can vary from 2 to 10. File size will be more than 1GB.
How can I get the required output in the minimum possible time? My current implementation uses a C++ program that reads a line from the file and then compares it with the next line. The running time of this implementation will always be O(n), where n is the number of characters in the file.
To improve the running time, the next option is to use mmap. But before implementing it, I just wanted to confirm: is there a faster way to do it? Using any other language or scripting?
uniq -c filename | perl -lane 'print "@F[1..$#F] $F[0]"'
The perl step is only to take the output of uniq (which looks like "2 Hi my name is ABC") and re-order it into "Hi my name is ABC 2". You can use a different language for it, or else leave it off entirely.
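If you'd rather not reach for Perl, an awk one-liner (a sketch that does the same re-ordering) works too:
uniq -c filename | awk '{c=$1; sub(/^ *[0-9]+ /, ""); print $0, c}'
It saves the count, strips it (plus the leading padding) from the line, and prints the remaining text followed by the count.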
As for your question about runtime, big-O seems misplaced here; surely there isn't any chance of scanning the whole file in less than O(n). mmap and strchr seem like possibilities for constant-factor speedups, but a stdio-based approach is probably good enough unless your stdio sucks.
The code for BSD uniq could be illustrative here. It does a very simple job with fgets, strcmp, and a very few variables.
In most cases this operation will be completely I/O bound. (Especially using well-designed C++)
Given that, it's likely the only bottleneck you need to care about is the disk.
I think you will find this to be relevant:
mmap() vs. reading blocks
Ben Collins has a very good answer comparing mmap to standard read/write.
Well, there are two time scales you are comparing, and they aren't really related to each other. The first is algorithmic complexity, which you are expressing in O notation. This has, however, nothing to do with the complexity of reading from a file.
Say, in the ideal case, you have all your data in memory and have to find the duplicates with an algorithm. Depending on how your data is organized (e.g. a simple list, a hash map, etc.), finding duplicates could take O(n^2), O(n), or even O(1) if you have a perfect hash (just for detecting the item).
Reading from a file or mapping it to memory has no relation to "big-Oh" notation at all, so you don't consider it in your complexity calculations. You just pick whichever takes less measured time, nothing more.

Sorting lines with numbers and word characters

I recently wrote a simple utility in Perl to count words in a file to determine their frequency, that is, how many times each appears.
It's all fine, but I'd like to sort the result to make it easier to read. An output example would be:
4:an
2:but
5:does
10:end
2:etc
2:for
As you can see, it's ordered by word, not frequency. But with a little help from :sort I could reorganize that. Using n, numbers like 10 go to the right place (even though it starts with 1), plus a little ! and the order gets reversed, so the word that appears most often comes first.
:sort! n
10:end
5:does
4:an
2:for
2:etc
2:but
The problem is: when the number is repeated it gets sorted by word — which is nice — but remember, the order was reversed!
for -> etc -> but
How can I fix that? Will I have to use some Vim scripting to iterate over each line checking whether it starts with the previous number, and marking relevant lines to sort them after the number changes?
tac | sort -nr
does this, so select the lines with shift+V and use !
From the vim :help sort:
The details about sorting depend on the library function used. There is no
guarantee that sorting is "stable" or obeys the current locale. You will have
to try it out.
As a result, you might want to perform the sorting in your Perl script instead; it wouldn't be hard to extend your Perl sort to be stable, see perldoc sort for entirely too many details.
If you just want this problem finished, then you can replace your :sort command with this:
!sort -rn --stable
(It might be easiest to use Shift-V to visually select the lines first, or use a range for the sort, or something similar, but if you're writing vim scripts, none of this will be news to you. :)
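For a sanity check outside vim, the same data piped through GNU sort (the --stable flag is GNU-specific) gives the order the question asks for:
$ printf '4:an\n2:but\n5:does\n10:end\n2:etc\n2:for\n' | sort -rn --stable
10:end
5:does
4:an
2:but
2:etc
2:for
With --stable, lines whose numbers tie keep their input order, which here is already alphabetical by word.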
