Case-sensitive sort in Unix Bash

Here is a screenshot of an issue I'm having with sort:
http://i.imgur.com/cIvAF.png
The objective I want out of this is to put all equal strings on consecutive lines. It works for 99% of the list I'm sorting, but there are a few hitches, such as those in the screenshot.
So all the yahoo.coms should be next to each other, then all the Yahoo.coms, then the YAHOO.coms, yahoo.cmos, yhoo.c, etc. (the typos even getting their own group of lines).
Not entirely sure how to handle this with sort, but I'm certainly trying.
I print all the domains unsorted to a file and then sort it with just a vanilla sort filename.
Would love some advice/input.

You probably need to override the locale; most Linux systems default to a UTF-8 locale, which specifies both case-independent sorting and ignoring punctuation.
LANG=C sort filename
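For example (under the C locale the result is plain byte order, so uppercase sorts before lowercase):
$ printf 'Yahoo.com\nyahoo.com\nYAHOO.com\n' | LANG=C sort
YAHOO.com
Yahoo.com
yahoo.com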

Normalize your input a bit:
tr '[:upper:]' '[:lower:]'
(quote the arguments so the shell can't glob them)
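Putting the two together (assuming the domains sit one per line in filename):
tr '[:upper:]' '[:lower:]' < filename | LANG=C sort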
Try reading "Unix for poets"

Related

Terminal: SORT command; how to sort correctly?

I have written a shell script that gets all the file names from a folder, and all its sub-folders, and copies them to the clipboard after sorting (removing all paths; I just need a simple file list of the thousands of randomly named files within).
What I can’t figure out is how to get the sort command to sort properly, meaning the way a spreadsheet would sort things, or the way the Mac Finder sorts things.
Underscores > numbers > letters (regardless of case)
Anyone know how to do this? sort -n only works for files starting with numbers, sort -f was close but separated the lower case and capitals in a weird way, and anything starting with a number was all over the place. sort -V was the closest, but anything starting with an underscore went to the bottom instead of the top… I’m about to lose my mind. 🤣
I’ve been trying to figure this out for a week, and no combination of anything I have tried gets the sort command to actually, ya know, sort properly.
Help?
If I understand the problem correctly, you want the "natural sort order" as described in Natural sort order - Wikipedia, Sorting for Humans : Natural Sort Order, and macos - How does finder sort folders when they contain digits and characters?.
Using Linux sort(1) you need the -V (--version-sort) option for "natural" sort. You also need the -f (--ignore-case) option to disregard the case of letters. So, assuming that the file names are stored one-per-line in a file called files.txt you can produce a list (mostly) sorted in the way that you want with:
sort -Vf files.txt
However, sort -Vf sorts underscores after digits and letters on my system. I've tried using different locales (see How to set locale in the current terminal's session?), but with no success. I can't see a way to change this with sort options (but I may be missing something).
The characters . and ~ seem to consistently sort before numbers and letters with sort -V. A possible hack to work around the problem is to swap underscore with one of them, sort, and then swap again. For example:
tr '_~' '~_' <files.txt | LC_ALL=C sort -Vf | tr '_~' '~_'
seems to do what you want on my system. I've explicitly set the locale for the sort command with LC_ALL=C ... so it should behave the same on other systems. (See Why doesn't sort sort the same on every machine?.)
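For example, given a files.txt containing these (hypothetical) names:
_readme.txt
beta.txt
1file.txt
Alpha.txt
the pipeline prints underscores first, then digits, then letters with case folded:
_readme.txt
1file.txt
Alpha.txt
beta.txt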
It appears you want to sort in dictionary order and fold case, so it would be sort -df.

Sort ignores an apostrophe - sometimes (except when it is the only column used); WHY?

This happens to me both on Linux and on cygwin, so I suspect it is not a bug. Still, I don't understand it. Can anyone explain?
Consider the following file (tab-delimited, and that's a regular apostrophe)
(I created it with cat to ensure that non-printing characters weren't the source of the problem.)
$ cat > temp
cat 1389
cat' 1747
ca't 3175
cat 46848484
ca't 720
$ sort temp
<gives the exact same output as cat temp>
$ sort -k1,1 temp
cat 1389
cat 46848484
cat' 1747
ca't 3175
ca't 720
Why do I have to ignore the second column in order to sort correctly?
I pulled up the manual for sort and noticed the following:
* WARNING * The locale specified by the environment affects sort
order. Set LC_ALL=C to get the traditional sort order that uses native
byte values.
As it turns out, locales actually specify how lexicographic ordering works for a given locale. This makes a lot of sense, but for some reason it trips over multi-field files...
(see also:)
Unusual behaviour of linux's sort command
Why does the sort command sort differently if there are trailing fields?
There are a couple of things you can do:
You can sort naively by byte value using
LC_ALL="C" sort temp
This will give a more logical result, but it might not be the one you actually want.
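For example, with the file above:
$ LC_ALL=C sort temp
ca't 3175
ca't 720
cat 1389
cat 46848484
cat' 1747
Every byte, including the apostrophe (0x27) and the tab, is now compared by its numeric value, so the punctuation is no longer skipped.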
You could try to get sort to do a more basic lexicographical ordering by setting the locale to C and telling it you want dictionary ordering:
LC_ALL="C" sort -d temp
To have sort output your locale information and highlight the sort key, you can use
sort --debug temp
Personally, I'm really curious to know what rule is being specified that makes sort behave so unintuitively across multiple fields.
Locales are supposed to specify the correct lexicographic order for the given language and dialect. Do the locales' collation rules simply not handle the multiple-field case at all, or do they take some different interpretation of the "meaning" of the line?

How to enumerate unique characters in a UTF-8 document? With sed?

I'm converting some Polish<->English dictionaries from RTF to HTML. The Polish special characters are coming out fine. But IPA (International Phonetic Alphabet) glyphs get changed to funny things, depending on what program I use for conversion. For example, /ˈbiːrɪ/ comes out as /ÈbiùrI/ or /∪βιρΙ/.
I'd like to correct these documents with a search & replace, but I want to make sure I don't miss any characters and don't want to manually pore over dictionary entries. I'd like to output a list of all unique, NON-ascii characters in a document.
I found this thread:
Find Unique Characters in a File
... and I tried the following two proposals:
sed -e "s/./\0\n/g" inputfile | sort -u
sed -e "s/(.)/\1\n/g" inputfile | sort -u
They both work nicely, and seem to both generate the same output. My problem is that they only output standard ASCII characters, and what I'm looking for is exactly the opposite.
The sed tool looks awesome, but I don't have time to learn it right now (though I intend to later). I'm hoping the solution will be clear to someone who's already mastered this tool, and they can save me a lot of time. [-:
Thanks in advance!
This is not a sed solution but a Python solution (Python 2: it relies on unicode and xrange, which are gone in Python 3). It reads the contents of a file, decodes it as UTF-8, turns it into a set (thus throwing away duplicates), discards the ASCII characters (0-127), sorts what's left, and then joins it back together with a newline between each character:
'\n'.join(sorted(set(unicode(open(inputfile).read(), 'utf-8')) - set(chr(i) for i in xrange(128))))
As something you'd run from the command line if you felt so inclined,
python -c "print '\n'.join(sorted(set(unicode(open('inputfile').read(), 'utf-8')) - set(chr(i) for i in xrange(128))))"
(You could also use ''.join instead of '\n'.join which would list the characters without a newline in between.)
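If you'd rather stay in the shell, a rough equivalent is (assuming GNU grep built with PCRE support, hence the -P flag, and a UTF-8 locale):
grep -oP '[^\x00-\x7F]' inputfile | sort -u
This prints every character outside the ASCII range on its own line, deduplicated.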

Bash script frequency analysis of unique letters and repeating letter pairs how should i build this script?

OK, first post...
So I have this assignment to decrypt cryptograms by hand, but I also wanted to automate the process a little, if not all of it then at least a few parts, so I browsed around and found some sed and awk one-liners that do some of the things I wanted done, but not everything I wanted/needed.
There are some websites that sort of do what I want, but I really want to just do it in bash for some reason, just because I want to understand it better and such :)
The script would take a filename as a parameter and output another file such as solution$1 when done.
if [ -e "$PWD/$1" ]; then
echo "$1 exists"
else
echo "$1 doesnt exists"
fi
Would start the script to see if the file in param exists..
Then I found this one-liner:
sed -e "s/./\0\n/g" "$1" | while read c; do echo -n "$c"; done
which works fine, but I would also need the number of occurrences per letter, and I really don't see how to do that.
Here is what I'm trying to achieve, more or less: http://25yearsofprogramming.com/fun/ciphers.htm for counting unique letter occurrences and such.
I then need to put all letters in lowercase.
After this I see the script doing these things:
- a subscript that scans a dictionary file for a certain pattern and size of words;
the longer the words the better.
For example, let's say the solution is the word "apparel" and the encrypted word is "zxxzgvk".
Is there a regex way to express the pattern that compares those two words and lists the word "apparel" from a dictionary file, because "appa" and "zxxz" are similar patterns and "zxxzgvk" has a similar length to "apparel"?
Can this part be done, and is it realistic to view the problem like this, or is it just far-fetched?
- another subscript that takes the letters found in the previous output word and swaps those letters in the cryptogram.
The swapped letters will be in uppercase to differentiate them over time.
I'll then have to figure out how to proceed: maybe rescan the newly found words to see whether they appear in the dictionary file, partly or fully, and then swap more letters or not.
Did anyone see this problem in the past and try to solve it with word patterns like I described, or is this just too complex?
Should I log any of the swaps?
Maybe just scan through all the encrypted words and swap as I go along, then do another sweep, constrained by the first sweep not to change uppercase letters (actually, to use them as more precise patterns!).
Did anyone write a similar script/program in another language? If so, which one? Maybe I can relate somehow :)
Maybe we can use your insight as to how you thought out your code.
I will happily include the cryptograms I have decoded and the ones I have yet to decode :)
Again, the focus of my assignment is not this script, just resolving the cryptograms. But doing scripts, or at least trying to see how I would do this one, does help me understand a little more how to think in terms of code. Feel free to point me in the right directions!
The cryptogram itself is based on simple alphabetic substitution.
I have made a pastebin with the code so far :) http://pastebin.com/UEQDsbPk
In pseudocode, the way I see it is:
call the program with an input filename as a parameter, and optionally a second filename (a dictionary)
verify that the input file exists and isn't empty
read the file's content and echo it on screen
transform it to lowercase
scan through the text and count the occurrences of each letter for a frequency analysis
ask the user what language the text is supposed to be (English by default)
use the response to pick the baseline letter frequencies
swap in uppercase the letters suggested by the frequency analysis
print the changed document on screen
ask the user to swap letters in the encrypted text
if the user gave a dictionary file as the second argument
then scan the cipher for words, preferring the longest ones
find words with a similar pattern (some repeating letters) in the dictionary file
list the results on screen, if any
offer to swap the corresponding letters in the cipher
print the modified cipher on screen
ask again to swap letters or to find more similar words
More or less, that's the way I see the script structured.
Do you see anything I should add? Did I miss something?
I hope this revised version is clearer for everyone!
Tl;dr, to be frank. To the only question I've found, the answer is yes :) Please split this into smaller tasks and we'll be happy to assist you, if you don't find the answers to those smaller questions first.
If you can put it out in pseudocode, it would be easier. There's all kinds of text-manipulating stuff in Unix. The means to employ depend on how big your texts are. I believe they are not so big, or you would have used some compiled language.
For example, the easy but costly gawk way to count frequencies:
awk -F "" '{for(i=1;i<=NF;i++) freq[$i]++;}END{for(i in freq) printf("%c %d\n", i, freq[i]);}'
As for transliterating, there is the tr utility. You can forge the actual strings for each case and then pass them to it (that holds true for Caesar-like ciphers).
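For instance, with a hypothetical partial key: suppose you've decided that the cipher letters z, x, g, v, k stand for the plaintext letters a, p, r, e, l, with uppercase marking them as solved:
tr 'zxgvk' 'APREL' < cipher.txt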
grep -o . inputfile | sort | uniq -c | sort -rn
Example:
$ echo 'aAAbbbBBBB123AB' | grep -o . | sort | uniq -c | sort -rn
5 B
3 b
3 A
1 a
1 3
1 2
1 1
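On the dictionary-pattern part of the question, one way to do it is to canonicalize every word to its letter-repetition pattern: the first distinct letter becomes a, the second b, and so on, so both "apparel" and "zxxzgvk" become abbacde (equal patterns also imply equal length). A rough sketch, assuming a one-word-per-line dictionary.txt (a hypothetical name) and lowercased input:
# Map a word to its letter-repetition pattern, e.g. apparel -> abbacde.
pattern() {
  awk '{ n = 0; out = ""
         for (i = 1; i <= length($0); i++) { c = substr($0, i, 1)
           # First time we see a letter, assign it the next pattern letter.
           if (!(c in seen)) { seen[c] = sprintf("%c", 97 + n); n++ }
           out = out seen[c] }
         print out }' <<< "$1"
}

target=$(pattern "zxxzgvk")    # abbacde
while read -r word; do
  [ "$(pattern "$word")" = "$target" ] && echo "$word"
done < dictionary.txt
This spawns one awk per word, so it's slow on a big dictionary, but it shows the idea.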

Replacing huge blocks with sed

I have two files that are generated elsewhere. The first one is "what to search for", and the second one is the replacement. Both files are huge, about 2-3 MB each.
I need to write a bash script that takes an even bigger file (about 200-300 MB) and replaces all occurrences of file1's contents with file2's contents.
The problem is, file1 and file2 can contain any possible characters, including regex special symbols.
How can I solve this problem using sed?
Thanks in advance.
Maybe have a look at chgrep:
http://www.bmk-it.com/projects/chgrep/
Cheers,
gregx
Since you don't actually need regular expressions, just direct string matching, sed is overkill. What you're really looking for is a fixed-string (maybe even binary) stream editor. Unfortunately, I don't know of one... I hate to suggest possibly reinventing a wheel, but you could write something fairly quickly in C that would do what you want. A rough draft outline:
read search-file into memory
create a buffer of the same size as search-file
read from stdin (or input-file) into the buffer.
For each character, if it does not match the parallel character from search-file, shift the buffer. To find out how much to shift it by, read until you find a match to the first character of search-file, then check whether the rest matches, repeating until you've found a partial match to search-file (or gotten to the end of the buffer). When you shift, print all the non-matching characters to stdout (or output-file).
If the buffer ever fills up, i.e. totally matches search-file, print replacement-file to stdout (or output-file). Depending on the memory vs. speed trade-off, you can keep replacement-file in memory or read it from disk each time.
You could also attempt to automatically escape all regex characters in your search file. This could be done with a horribly ugly list of sed substitutions, like
sed -e 's/\\/\\\\/g' -e 's#/#\\/#g' -e 's/\[/\\[/g' ...
(make sure you do the \ one first!)
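Spelled out, that approach might look like this (a sketch, assuming GNU sed and, crucially, that file1 and file2 each fit on a single line, since sed scripts are line-oriented; multi-megabyte patterns may also run into sed's limits):
from=$(sed 's/[][\.*^$/]/\\&/g' file1)    # escape BRE metacharacters in the search text
to=$(sed 's/[\&/]/\\&/g' file2)           # escape \, & and the delimiter in the replacement
sed "s/$from/$to/g" bigger-file > new-bigger-file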
I don't know about sed, but in Perl you could do (off the top of my head, untested):
perl -0777 -pe 'BEGIN{local $/ = undef; open FROM, "<", shift @ARGV; $from = <FROM>; open TO, "<", shift @ARGV; $to = <TO>} s/\Q$from\E/$to/sog' file1 file2 bigger-file > new-bigger-file
(The \Q...\E quoting takes care of any regex special characters in the search text.)
If you're interested in trying Perl, I could try testing it for you tomorrow.
But it sucks the entire bigger-file into memory, because it ignores line breaks so that your search text can span multiple lines. This means it uses quite a lot of memory!
This answer assumes that the search file is one long search string spanning multiple lines, which must be matched in its entirety, rather than a number of separate search strings, any of which may be matched.
