Change sorting order to AaBbCcDd

I have several files in a directory on my Unix system that I need to sort. The problem I'm having is that the sort -f command sorts them in the order a A b B c C, and so on; ls uses the same ordering. Is there a way I can make it sort with the uppercase letter coming first, i.e. in the order A a B b C c ...?

With multiple passes through the sort command I think this is what you want:
ls | sort -f -r | sort
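To try this on sample names without touching the directory (just a quick sanity check; the ordering produced by the final sort depends on your locale's collation settings):
printf '%s\n' a A b B c C | sort -f -r | sort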

Related

sorting file names ascending where names have a dash in bash

I have a list of files in a folder.
The names are:
1-a
100-a
2-b
20-b
3-x
and I want to sort them like
1-a
2-b
3-x
20-b
100-a
The files are always a number, followed by a dash, followed by anything.
I tried ls combined with col and sort and it works, but I wanted to know if there's a simpler solution.
Forgot to mention: this is bash running on Mac OS X.
Some ls implementations, GNU coreutils' ls among them, support the -v option (natural sort of version numbers within text):
% ls -v
1-a 2-b 3-x 20-b 100-a
or:
% ls -v1
1-a
2-b
3-x
20-b
100-a
Use sort, defining the fields explicitly:
sort -s -t- -k1,1n -k2 filenames.txt
The -t option tells sort to treat - as the field separator in input items. -k1,1n instructs sort to sort on the first field, numerically; -k2 uses the remaining fields as the second key in case the first fields are equal. -s keeps the sort stable (although you could omit it here, since the entire input string is covered by one key or the other).
(Note: I'm assuming the file names do not contain newlines, so that something like ls > filenames.txt is guaranteed to produce a file with one name per line. You could also use ls | sort ... in that case.)
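For instance, feeding the sample names from the question straight into that command (a quick check that does not require the files to exist):
printf '%s\n' 1-a 100-a 2-b 20-b 3-x | sort -s -t- -k1,1n -k2
prints:
1-a
2-b
3-x
20-b
100-a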

What is the difference between 'sort -u' and 'uniq'?

I need a script that sorts a text file and removes the duplicates.
Most, if not all, of the examples out there use the sort file1 | uniq > file2 approach.
In man sort, though, there is a -u option that does this at the time of sorting.
Is there a reason to use one over the other? Maybe availability of the -u option? Or a memory/speed concern?
They should be equivalent in the simple case, but will behave differently if you're using the -k option to define only certain fields of the input line to use as sort keys. In that case, sort -u will suppress lines which have the same key even if other parts of the line differ, whereas uniq will only suppress lines that are exactly identical.
$ cat example
foo baz
quux ping
foo bar
$ sort -k 1,1 --stable example # use just the first word as sort key
foo baz
foo bar
quux ping
$ sort -k 1,1 --stable -u example # suppress lines with the same first word
foo baz
quux ping
but
$ sort -k 1,1 --stable example | uniq
foo baz
foo bar
quux ping
I'm not sure that it's about availability. Most systems I've ever seen have sort and uniq, as they are usually provided by the same package. I just checked a Solaris system from 2001 and its sort has the -u option.
Technically, using a pipe (|) runs each command in a separate process, which is slightly more resource-intensive since it requests additional PIDs from the OS.
If you go to the source code for sort, which comes in the coreutils package, you can see that it actually just skips printing duplicates as it's printing its own sorted list and doesn't make use of the independent uniq code.
To see how it works, look in sort's source for the functions below this comment:
/* If uniquified output is turned on, output only the first of
an identical series of lines. */
Although I believe sort -u should be faster, the performance gains are really going to be minimal unless you're running sort | uniq on huge files, since the uniq stage has to read through the entire sorted output again.
One difference is that 'uniq -c' can count (and print) the number of matches. You lose this ability when you use 'sort -u' for sorting.
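For example, a common idiom for listing distinct lines together with their frequencies (a generic sketch, with file standing in for any input):
sort file | uniq -c | sort -rn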
They should be functionally equivalent, and sort -u should be more efficient.
I'm guessing the examples you're looking at simply didn't consider (or didn't have) "sort -u" as an option.
Does uniq sort?
I do not think so; at least on Ubuntu 18.04 and CentOS 6, it does not. It just removes consecutive duplicates.
You can simply conduct a mini experiment.
Let the file sample.txt be:
a
a
a
b
b
b
a
a
a
b
b
b
cat sample.txt | uniq will output:
a
b
a
b
while cat sample.txt | sort -u will output:
a
b
So sort | uniq is functionally equivalent to sort -u in this simple case.

How to find which line from first file appears most frequently in second file?

I have two lists. I need to determine which word from the first list appears most frequently in the second list. The first, list1.txt, contains a list of words, sorted alphabetically, with no duplicates. I have used some scripts to ensure that each word appears on its own line, e.g.:
canyon
fish
forest
mountain
river
The second file, list2.txt, is in UTF-8 and also contains many items. I have also used some scripts to ensure that each item appears on its own line, but some items are not words, and some might appear many times, e.g.:
fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean
The script should output the most frequently matching item. For example, if run with the two files above, the output would be “fish”, because that word from list1.txt occurs most often in list2.txt.
Here is what I have so far. First, it searches for each word and creates a CSV file with the matches:
#!/bin/bash
while read -r line
do
    # count lines in list2.txt that start with this word
    count=$(grep -c "^$line" list2.txt)
    echo "$line,$count" >> found.csv
done < ./list1.txt
After that, found.csv is sorted descending by the second column. The output is the word appearing on the first line.
I do not think this is a good script, though: it is not very efficient, and it does not handle the cases where there is no single most frequent matching item, for example:
If there is a tie between two or more words, e.g. “fish”, “canyon”, and “forest” each appear 5 times while no other word appears as often, the output should be these three words in alphabetical order, separated by commas, e.g. “canyon,fish,forest”.
If none of the words from list1.txt appears in list2.txt, then the output should simply be the first word from list1.txt, e.g. “canyon”.
How can I create a more efficient script which finds which word from the first list appears most often in the second?
You can use the following pipeline:
grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1
-F tells grep to search for fixed (literal) strings, and -f tells it to read the strings to search for from list1.txt. The rest sorts the matches, counts duplicates, and orders them by the number of occurrences. The last part selects the last line, i.e. the most common one (together with its count).
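Run against the two sample files above, this should print something like:
3 fish
i.e. “fish” together with its count of three occurrences.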
An awk alternative:
awk 'FNR==NR{a[$1]=0;next}($1 in a){a[$1]++}END{for(i in a)print a[i],i}' file1 file2 | sort -rn | head -1
The first block records every word of file1 as an array key, the second counts how often each of those words appears in file2, and the END block prints count and word; sort -rn | head -1 then keeps the most frequent one.
Assuming list1.txt is sorted, I would use Unix join:
sort list2.txt | join -1 1 -2 1 list1.txt - | sort |\
uniq -c | sort -n | tail -n1
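With the sample lists this should again report “fish” with a count of 3: join keeps only the lines of list2.txt whose word also appears in list1.txt, and uniq -c | sort -n | tail -n1 then picks the most frequent of those.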

ls fails to order files in desired way

I have a script that creates files with output_#.root where # is a number. When I do ls in the directory, it chooses to order the files in a weird way:
output_1.root
output_10.root
output_100.root
output_11.root
output_2.root
etc.
How do I make it order the files in the logical order 1, 2, 3, etc.?
Your files are sorted in alphabetical order, which is the normal behavior. If you want to sort them in numerical order, you can try this:
ls *.root | sort -k2 -t_ -n
This splits each name using _ as the separator (-t_) and sorts numerically (-n) based on the second field (-k2).
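A quick check on the example names (using printf instead of real files):
printf '%s\n' output_1.root output_10.root output_100.root output_11.root output_2.root | sort -k2 -t_ -n
prints:
output_1.root
output_2.root
output_10.root
output_11.root
output_100.root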
If you are using ls from GNU coreutils you can use the version-sort switch:
ls -v
Create example files:
touch output_1.root output_10.root output_100.root output_11.root output_2.root
List them:
ls -1v
Output:
output_1.root
output_2.root
output_10.root
output_11.root
output_100.root

Find unmatched items between two lists using bash or DOS

I have two files with two single-column lists:
//file1 - full list of unique values
AAA
BBB
CCC
//file2
AAA
AAA
BBB
BBB
//So the result here would be:
CCC
I need to generate a list of values from file1 that have no matches in file2. I have to use a bash script (preferably without special tools like awk) or a DOS batch file.
Thank you.
Method 1
Looks like a job for grep's -v flag.
grep -v -F -f listtocheck uniques
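With the question's file names (file2 as the list to check against, file1 as the uniques), that would be:
grep -v -F -f file2 file1
which prints CCC. (Add -x if you want whole-line matches only.)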
Method 2
A variation on Drake Clarris's solution (which can be extended to checking against several files, something grep can't do unless they are first merged) would be:
(
sort < file_to_check | uniq
cat reference_file reference_file
) | sort | uniq -u
By doing this, each word in file_to_check contributes exactly one line to the output combined by the subshell in brackets. Words in reference_file will be output at least twice, and words appearing in both files will be output at least three times: once from the first file, plus at least twice from the two copies of the second file.
All that remains is to isolate the words we want, those that appear exactly once, which is what sort | uniq -u does.
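Applied to the question's files (file1 as file_to_check, file2 as reference_file):
( sort file1 | uniq; cat file2 file2 ) | sort | uniq -u
prints CCC.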
Optimization I
If reference_file contains a lot of duplicates, it might be worthwhile to run the heavier
sort < reference_file | uniq
sort < reference_file | uniq
instead of cat reference_file reference_file, in order to produce a smaller output and put less load on the final sort.
Optimization II
This would be even faster if we used temporary files, since merging already-sorted files can be done efficiently (and, in case of repeated checks with different files, we could reuse the same sorted reference file again and again without re-sorting it); therefore:
sort < file_to_check | uniq > .tmp.1
sort < reference_file | uniq > .tmp.2
# "--merge" works way faster, provided we're sure the input files are sorted
sort --merge .tmp.1 .tmp.2 .tmp.2 | uniq -u
rm -f .tmp.1 .tmp.2
Optimization III
Finally, in case of very long runs of identical lines in one file (as may happen with some logging systems, for example), it may also be worthwhile to run uniq twice, once to get rid of the runs (ahem) and once more to deduplicate after sorting, since uniq works in linear time while sort is linearithmic.
uniq < file | sort | uniq > .tmp.1
For a Windows CMD solution (commonly referred to as DOS, but not really):
It should be as simple as
findstr /vlxg:"file2" "file1"
but there is a findstr bug that results in possible missing matches when there are multiple literal search strings.
If a case insensitive search is acceptable, then adding the /I option circumvents the bug.
findstr /vlixg:"file2" "file1"
If you are not restricted to native Windows commands then you can download a utility like grep for Windows. The Gnu utilities for Windows are a good source. Then you could use Isemi's solution on both Windows and 'nix.
It is also easy to write a VBScript or JScript solution for Windows.
cat file1 file2 | sort | uniq -u
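With the example files this also prints CCC. Note that it relies on file1 having no duplicates (as stated in the question); also, a value that appears exactly once in file2 but not at all in file1 would slip into the output as well.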
