gnu/unix sort numerical only using first column?

With regular strings, if the first field matches, we sort by the next field and so on, and things work as we expect.
echo -e 'a c\na b' | sort #regular string sort
a b
a c
With numbers, if the first field matches, we…switch to string sort on subsequent fields? Why? I would think it would compare each field numerically.
echo -e '1 22\n1 3' | sort -n #numeric sort
1 22
1 3
FYI, using sort (GNU coreutils) 5.97 on RHEL 5.5.
What am I missing here? I know I can use -k to pick the field I want to sort on, but that drastically reduces the flexibility of input allowed, as it requires the user to know the numbers of fields.
Thanks!

Sadly you haven't missed anything. This apparently simple task - split lines into fields and then sort numerically on all of them - can't be done by the unix sort program. You just have to figure out how many columns there are and name them all individually as keys.
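What's happening when you specify -n with no other options is that the whole line is being passed to the "convert string to number" routine, which converts the number at the start of the line and ignores the rest. The split into fields is not done at all. When those numbers tie, sort falls back to a plain string comparison of the whole line, which is why "1 22" comes out before "1 3".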
Your first example, without -n, is also doing whole-line comparison. It's not comparing "a" to "a" then "b" to "c". It's comparing "a b" to "a c".
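A minimal sketch of the workaround described above, assuming the input is known to have exactly two whitespace-separated numeric columns (the sample data is the one from the question):

```shell
# Assumption: exactly two whitespace-separated numeric columns.
# -k1,1n sorts numerically on field 1 only; -k2,2n breaks ties numerically on field 2.
sorted=$(printf '1 22\n1 3\n' | sort -k1,1n -k2,2n)
printf '%s\n' "$sorted"
```

With each field named as its own numeric key, "1 3" should come out before "1 22". Inputs with more columns need one more -k entry per column.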

Related

Regular expression in bash to match multiple conditions

I would like to implement a regular expression in bash that allows me to verify a series of characteristics on a dataset.
A sample is attached below:
id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
id is required to have only 3 digits, date of birth less than 2000, minimum grade point average is 5.60 with the second decimal place being other than 0, and at least one expulsion or serious misconduct.
The result of executing the regular expression should be:
582, 1999-05-12, 8.51, 0, 1
I have tried to implement the following regular expression and it does not give me any result.
grep -E "^\d{0,3},[0-2][0-9][0-9][0-9].*,[1-5].[0-5][1-9],[1-9],[1-9]$"
Any idea?
If it is mandatory to use grep, would you please try:
grep -E '^[0-9]{1,3},1[0-9]{3}(-[0-9]{2}){2},(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9]),([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*)[[:space:]]?$' input_file
Result:
582,1999-05-12,8.51,0,1
[0-9]{1,3} matches if id has 1-3 digits. (I have interpreted only 3 digits like that. If it means differently, tweak the regex accordingly.)
1[0-9]{3}(-[0-9]{2}){2} matches if the birth year is before 2000 (exclusive).
(5\.[6-9][1-9]|[6-9]\.[0-9][1-9]|[1-9][0-9]+\.[0-9][1-9]) matches if grade is greater than 5.60 with the second decimal place being other than 0.
([1-9][0-9]*,[0-9]+|[0-9]+,[1-9][0-9]*) matches if either or both of explusion and serious misdemeanor have non-zero value.
Regular expressions do not understand numeric values, and they certainly do not understand boolean logic. All they know is text. You'll need to use an actual programming language like Awk or Perl to do this.
Here's an example:
$ perl -l -a -F, -E'say if length($F[0])>3 || $F[2] < 5.60' foo.txt
123,2005-01-01,5.36,1,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
This call to perl splits apart the fields on commas, and then prints the line if the length of the first column is over 3, or the value of the third column is less than 5.60.
This is just a starting point, but this is the direction to go.
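Going in that direction, here is a hedged awk sketch of the complete filter. It assumes "only 3 digits" means at most three digits (the same reading as the grep answer), that grades always carry exactly two decimals, and that the first row is a header; the variable name result is only for illustration.

```shell
# Conditions, in order: skip the header, id of at most 3 digits, birth year
# before 2000, grade of at least 5.60, second decimal not 0, and at least
# one non-zero value in the last two columns.
result=$(awk -F, '
  NR > 1 &&
  $1 ~ /^[0-9]+$/ && length($1) <= 3 &&
  substr($2, 1, 4) + 0 < 2000 &&
  $3 + 0 >= 5.60 &&
  $3 !~ /\.[0-9]0$/ &&
  ($4 + 0 > 0 || $5 + 0 > 0)
' <<'EOF'
id, date of birth, grade, explusion, serious misdemeanor
123,2005-01-01,5.36,1,1
582,1999-05-12,8.51,0,1
9274,2001-25-12,9.65,0,0
21,2006-14-05,0.53,4,1
EOF
)
printf '%s\n' "$result"
```

Adding 0 to a value forces a numeric comparison, which is the part a regular expression cannot express directly.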

sort on pipe-delimited fields not behaving as expected

Consider this tiny text file:
ab
a
If we run it through sort(1), we get
a
ab
because of course a comes before ab.
But now consider this file:
ab|c
a|c
If we run it through sort -t'|', we again expect a to sort before ab, but it does not! (Try it under your version of Unix and see.)
What I think is happening here is that the -t option to sort is not really delimiting fields -- it may be changing the way (say) the start of field 2 would be found, but it's not changing the way field 1 ends. a|c sorts after ab|c because '|' comes after 'b' in ASCII. (It's as if the -t'|' argument is ignored, because you get the same result without it.)
So is this a bug in sort or in my understanding of it? And is there a way to sort on the first pipe-delimited field properly?
This question came up in my attempt to answer another SO question, "Join Statement omitting entries".
sort's default behavior is to treat everything from field 1 to the end of the line as the sort key. If you want it to sort on field 1 first, then field 2, you need to specify that explicitly.
$ sort -k1,1 -k2,2 -t'|' <<< $'ab|c\na|c'
a|c
ab|c
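To make the difference visible regardless of the system locale, the following sketch pins the comparison to the C locale with LC_ALL=C (an addition not in the answer above) and runs both variants on the same two lines:

```shell
input='ab|c
a|c'
# Whole line is the key: in the C locale 'b' (0x62) sorts before '|' (0x7c).
whole_line=$(printf '%s\n' "$input" | LC_ALL=C sort -t'|')
# Key limited to field 1: "a" is a prefix of "ab", so it sorts first.
first_field=$(printf '%s\n' "$input" | LC_ALL=C sort -t'|' -k1,1)
printf 'whole line:\n%s\nfirst field:\n%s\n' "$whole_line" "$first_field"
```

Only the second invocation actually ends the key at the delimiter.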

Sorting two files that have the same column gives different sorting

I am sorry for the title but I didn't know how to explain this:
I am trying to sort two files because I want to merge them, they look like this:
test1.txt
rs1010735 224915429
rs1010805 38189142
rs10108 114516330
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
and
test2.txt
rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108 C
rs1010910 A
rs1011124 A
I use sort -k1 test1.txt and sort -k1 test2.txt and got this:
test1_sort.txt
rs1010735 224915429
rs1010805 38189142
rs10108 114516330
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
and
test2_sort.txt
rs1010735 C
rs1010805 T
rs1010863 T
rs1010891 T
rs10108 C
rs1010910 A
rs1011124 A
Why is the sorting different if both first columns have the same values?
I also tried sort -n -s k1,1 but got the same result.
Add spaces:
$ sort -k 1,1 /tmp/2
rs1010735 C
rs10108 C
rs1010805 T
rs1010863 T
rs1010891 T
rs1010910 A
rs1011124 A
$ sort -k 1,1 /tmp/1
rs1010735 224915429
rs10108 114516330
rs1010805 38189142
rs1010863 185432942
rs1010891 110712154
rs1010910 61212213
rs1011124 7533164
There are two issues here.
Locale-aware sorting
At base, the problem here is that you are sorting according to your "locale", which is presumably en_US.UTF-8 (or some other Unicode locale). In theory, a locale-aware sort will produce an ordering which is what would be expected according to the normal sorting rules for that location, while a non-locale-aware sort will sort according to the "arbitrary" character codes for each character.
In a locale-aware sort, for example, it would be common for a word starting with a capital letter to come just before (or just after) the same word starting with a lower-case letter, whereas a non-locale-aware sort will put all the words starting with a capital letter before any word starting with a lower-case letter. Also, in an English-speaking locale, you would probably find words starting with an ä intermingled with words starting with a, whereas in a Swedish locale, you'd find them after words starting with z because in Swedish, ä is the 28th letter (it comes after å and before ö, in case you're interested).
For all that to work, the locale descriptions on your machine need to actually describe the sorting order which would be expected in each locale, and in particular the default locale should correspond to what you would expect. As can be seen from this example, that is sometimes not the case. Indeed, it sometimes produces bizarrely unexpected results.
What is happening in your example is that the locale description for your locale says that whitespace does not participate in sortation. It also indicates that digits come before letters. Now, consider a subset of your data (with both files combined):
rs10108 114516330
rs1010805 38189142
rs1010863 185432942
rs10108 C
rs1010805 T
rs1010863 T
If we eliminate the whitespace altogether, that would be:
rs10108114516330
rs101080538189142
rs1010863185432942
rs10108C
rs1010805T
rs1010863T
And if we then sort that according to normal alphabetic rules, with digits first, we get:
rs101080538189142
rs1010805T
rs10108114516330
rs1010863185432942
rs1010863T
rs10108C
Or, putting the whitespace back:
rs1010805 38189142
rs1010805 T
rs10108 114516330
rs1010863 185432942
rs1010863 T
rs10108 C
Those are the rules sort is following, and the result is that the two lines whose first field is rs10108 do not get sorted together. Counter-intuitive, ¿no?
Probably the correct solution would be to tell whoever built the locale files for your distribution that the normal rule is "nothing (visible) comes before something", which was the alphabetization rule we were taught in school. In other words, a space (nothing visible) comes before any character. Or you could try to fix the collation files yourself.
But in practical terms, the solution is to tell sort to do a non-locale-aware sort by default. I do that by putting:
export LC_COLLATE=C
in my bash startup files. (C is the special name of the locale corresponding to the programming language "C", in which symbols are sorted by their internal character codes.) You could also just type that every time you want to sort something:
LC_COLLATE=C sort test1.txt
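As a small check of that fix, the sketch below sorts a subset of the two files from the question in the C locale and compares the resulting key order. It uses LC_ALL=C rather than LC_COLLATE=C only because LC_ALL, when set, overrides LC_COLLATE; the variable names are illustrative.

```shell
# Subsets of test1.txt and test2.txt, deliberately in different input orders.
file1='rs1010805 38189142
rs10108 114516330
rs1010863 185432942'
file2='rs1010863 T
rs10108 C
rs1010805 T'
# Sort each on the first field only, by byte value, and keep just the keys.
keys1=$(printf '%s\n' "$file1" | LC_ALL=C sort -k1,1 | cut -d' ' -f1)
keys2=$(printf '%s\n' "$file2" | LC_ALL=C sort -k1,1 | cut -d' ' -f1)
printf '%s\n' "$keys1"
```

Both files should now list rs10108 first, since it is a prefix of rs1010805, and the two key columns should line up for a later merge.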
The meaning of the -k argument
The -k argument to sort has the basic syntax:
-kstart[,end]
where the positions start (and optionally end) define a range of text to use as a sort key. If end is not specified, the range continues to the end of the line.
The simplest form of a position is just a field number, such as 1, meaning "the first field". But -k1 does nothing, because it means, precisely, "use the text from the first field to the end of the line", which is essentially the same as saying "use the entire line as a sort key", which is the default. So anytime you see -k1 you should know that it is not doing what is expected.
Explicitly specifying the end would be more precise: -k1,1 means the sort key is the text from the (start of) the first field to the (end of) the first field, or in other words, the first field. That would be better, but it wouldn't provide any hint on how to sort two lines which had the same first field. The standard sort utility is not "stable" by default: lines whose keys compare equal are ordered by a last-resort comparison of the whole line rather than being left in input order. It would generally be better to add more secondary sort fields:
sort -k1,1 -k2,2
which means, effectively, "sort by the first field, but if the first fields are equal, then compare the second fields."
Fields are split at whitespace (even if whitespace is ignored for sortation), so the above is different from sort -k1,2 in that it is guaranteed to put lines with the same value in the first field in consecutive positions.
Appendix: Why locales ignore whitespace in sorting
Unfortunately, sort -k1,1 -k2,2 also might not do what you want, particularly if you do it in the "C" locale, because of the historic definition of sort fields used by sort. Unless an explicit delimiter is specified with the -t option, sort fields start with each whitespace character which follows a non-whitespace character. Consequently, all fields except the first field start with whitespace. That's fine if they all start with the same whitespace, but often fields have been lined up by explicitly adding the right number of space characters. And that almost always produces incorrect sorting on fields other than the first field.
Since that is not generally what is wanted, sort provides a way of suppressing this annoying behaviour: the b sort-key flag. Sort-key flags are appended to a position in the -k specification, and b applies only to the position it is attached to, so it belongs on the start position to skip leading whitespace at the start of the key. Also, you can specify -b as a command-line option before any -k option to specify that all sort keys should be treated as having the b flag. That would suggest that the correct invocation of sort would be:
sort -k1,1 -k2b,2
or
sort -b -k1,1 -k2,2
Some people believe that it is irritating to have to specify b all the time (since it is almost always what you want), and complicated to explain to users why they have to do it. As a consequence, it may appear easier to set up the locale definitions to ignore whitespace, which will certainly cause leading whitespace to be ignored. The problem with that "solution" is that it produces results which are at least as confusing as the results caused by having sort include the spaces between fields in the field definition, but which are rather more difficult to fix because there is no simple way to modify a locale's collation order.
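The effect of leading blanks on a non-first field can be seen with a two-line sketch in the C locale. The data is made up for illustration: the second column is padded with a different number of spaces on each line.

```shell
# Two lines whose second column was aligned with different amounts of padding.
input='y a
x  b'
# Without b, field 2 includes its leading blanks, and "  b" sorts before " a".
with_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k2,2)
# With b on the start position, leading blanks are skipped: "a" before "b".
skip_blanks=$(printf '%s\n' "$input" | LC_ALL=C sort -k2b,2)
printf 'without b:\n%s\nwith b:\n%s\n' "$with_blanks" "$skip_blanks"
```

Passing -b as a global option before the keys should give the same result as the second invocation.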

Sort ignores an apostrophe - sometimes (except when it is the only column used); WHY?

This happens to me both on Linux and on cygwin, so I suspect it is not a bug. Still, I don't understand it. Can anyone explain?
Consider the following file (tab-delimited, and that's a regular apostrophe)
(I create it with cat to ensure that it wasn't non-printing characters that were the source of the problem)
$ cat > temp
cat 1389
cat' 1747
ca't 3175
cat 46848484
ca't 720
$ sort temp
<gives the exact same output as cat temp>
$ sort -k1,1 temp
cat 1389
cat 46848484
cat' 1747
ca't 3456
ca't 720
Why do I have to ignore the second column in order to sort correctly?
I pulled up the manual for sort and noticed the following:
* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
As it turns out, locales actually specify how lexicographic ordering works for a given locale. This makes a lot of sense, but for some reason it trips over multi field files...
(see also:)
Unusual behaviour of linux's sort command
Why does the sort command sort differently if there are trailing fields?
There are a couple of things you can do:
You can sort naively by byte value using
LC_ALL="C" sort temp
This will give a more logical result, but it might not be the one you actually want.
You could try to get sort to do a more basic lexicographical ordering by setting the locale to C and telling it you want dictionary ordering:
LC_ALL="C" sort -d temp
To have sort output your locale information and highlight the sort key, you can use
sort --debug temp
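For reference, a sketch of what the byte-value ordering from the first suggestion produces on the question's data. The rows are rebuilt here with printf and real tab characters, and the 3175 value follows the file as first listed.

```shell
# Rebuild the tab-delimited file from the question and sort it by byte value.
# In the C locale, tab (0x09) < apostrophe (0x27) < 't' (0x74).
sorted=$(printf '%s\t%s\n' \
  'cat' 1389 "cat'" 1747 "ca't" 3175 'cat' 46848484 "ca't" 720 |
  LC_ALL=C sort)
printf '%s\n' "$sorted"
```

The two ca't rows come first, then the two cat rows, then cat', because the apostrophe is compared like any other byte instead of being skipped by the locale's collation rules.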
Personally I'm really curious to know what rule is being specified that makes sort behave unintuitively across multiple fields.
They're supposed to specify correct lexicographic order in the given language and dialect. Do the locales' functions simply not handle the multiple field case at all, or are they taking some kind of different interpretation on the "meaning" of the line?

Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script?

OK, first post.
So I have this assignment to decrypt cryptograms by hand, but I also wanted to automate the process a little, if not all of it then at least a few parts. I browsed around and found some sed and awk one-liners that do some of the things I wanted, but not everything I wanted or needed.
There are some websites that sort of do what I want, but I really want to do it in bash, just because I want to understand it better :)
The script would take a filename as parameter and output another file such as solution$1 when done.
if [ -e "$PWD/$1" ]; then
    echo "$1 exists"
else
    echo "$1 does not exist"
fi
That would start the script by checking whether the file given as a parameter exists.
Then I found this one-liner:
sed -e "s/./\0\n/g" $1 | while read c;do echo -n "$c" ; done
Which works fine, but I would need the number of occurrences per letter, and I really don't see how to do that.
Here is more or less what I'm trying to achieve, for counting unique letter occurrences and such: http://25yearsofprogramming.com/fun/ciphers.htm
I then need to put all letters in lowercase.
After this I see the script doing these things:
- A subscript that scans a dictionary file for words of a certain pattern and size; the bigger the words, the better.
For example: let's say the solution is the word "apparel" and the encrypted word is "zxxzgvk".
Is there a regex way to express the pattern that compares those two words and lists the word "apparel" from a dictionary file, because "appa" and "zxxz" are similar patterns and "zxxzgvk" has the same length as "apparel"?
Can this part be done, and is it realistic to view the problem like this, or is this just far-fetched?
Another subscript that takes the letters found from the previous output word and swaps those letters in the cryptogram.
The swapped letters will be in uppercase to differentiate them over time.
I'll then have to figure out how to proceed, maybe rescanning the newly found words to see whether they are found in a dictionary file partly or fully as well, then swapping more letters or not.
Has anyone seen this problem in the past and tried to solve it with the patterns in words as I described, or is this just too complex?
Should I log any of the swaps?
Maybe just scan through all the encrypted words and swap as I go along, then do another sweep with the constraint from the first sweep of not changing uppercase letters (actually using them as more precise patterns!).
Has anyone done a similar script/program in another language? If so, which one? Maybe I can relate somehow :)
Maybe we can use your insight into how you thought out your code.
I will happily include the cryptograms I have decoded and the one I have yet to decode :)
Again, the focus of my assignment is not to do this script but just to resolve the cryptograms. But doing scripts or at least trying to see how I would do this script does help me understand a little more how to think in terms of code. Feel free to point me in the right directions!
The cryptogram itself is based on simple alphabetic substitution.
I have done a pastebin here with the code to be :) http://pastebin.com/UEQDsbPk
In pseudocode the way I see it is :
call program with an input filename in param and optionally a second filename(dictionary)
verify the input file exists and isnt empty
read the file's content and echo it on screen
transform to lowercase
scan through the text and count the amount of each letter to do a frequency analysis
ask the user what language the text is supposed to be in (english default)
use the response to specify which letter frequencies to use as a baseline
swap letters corresponding to the frequency analysis in uppercase..
print the changed document on screen
ask the user to swap letters in the crypted text
if user had given a dictionary file as the second argument
then scan the cipher for words and find the bigger words
find words with a similar pattern (some repeating letters) in the dictionary file
list on screen the results if any
offer to swap the letters corresponding in the cipher
print modified cipher on screen
ask again to swap letters or find more similar words
That is more or less the way I see the script structured.
Do you see anything that I should add? Did I miss something?
I hope this revised version is more clear for everyone!
TL;DR, to be frank. To the only question I've found, the answer is yes :) Please split it into smaller tasks and we'll be happy to assist you, if you don't find the answers to these smaller questions first.
If you can put it out in pseudocode, it would be easier. There's all kinds of text-manipulating stuff in unix. The means to employ depend on how big are your texts. I believe they are not so big, or you would have used some compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can build the actual strings and pass them to it in each case (that holds for Caesar-like ciphers).
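A rough sketch of both tr steps, using the question's own example word. The letter mapping is a hypothetical guess for illustration (z->A, x->P, g->R, v->E, k->L), written in uppercase so that swapped letters stand out, as the question proposes.

```shell
cipher='ZXXZGVK'
# Step 1: normalise the cryptogram to lowercase.
lower=$(printf '%s\n' "$cipher" | tr '[:upper:]' '[:lower:]')
# Step 2: apply the guessed substitutions; each character of the first
# string is replaced by the character at the same position in the second.
swapped=$(printf '%s\n' "$lower" | tr 'zxgvk' 'APREL')
printf '%s -> %s\n' "$lower" "$swapped"
```

Letters that are not part of the mapping pass through unchanged, so partially solved text stays lowercase where no guess has been made yet.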
grep -o . inputfile | sort | uniq -c | sort -rn
Example:
$ echo 'aAAbbbBBBB123AB' | grep -o . | sort | uniq -c | sort -rn
5 B
3 b
3 A
1 a
1 3
1 2
1 1
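On the word-pattern part of the question ("zxxz" against "appa"), one possible approach is a grep pattern with backreferences. The sketch below hand-writes the pattern for the 7-letter shape ABBA followed by any three letters, and runs it against a tiny made-up word list standing in for a dictionary file. It does not require the two captured letters to differ, so it is looser than a true cipher-pattern match.

```shell
# Hypothetical stand-in for a dictionary file, one word per line.
words='apparel
pattern
already
letters'
# \(.\)\(.\)\2\1 matches an ABBA shape; ...$ demands exactly three more letters.
matches=$(printf '%s\n' "$words" | grep '^\(.\)\(.\)\2\1...$')
printf '%s\n' "$matches"
```

Generating such a pattern automatically from an encrypted word would take more scripting; this only shows that the shape itself can be expressed.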
