I have a file with 2 columns and n rows.
Column 1 contains names and column 2 contains ages.
I want to sort the contents of this file in ascending order based on the age (in the second column).
The result should display the name and age of the youngest person first, then the second youngest person, and so on...
Any suggestions for a one-liner shell or bash script?
You can use the -k (key) option of the sort command, which takes a "field number", so if you want the second column:
sort -k2 -n yourfile
-n, --numeric-sort compare according to string numerical value
For example:
$ cat ages.txt
Bob 12
Jane 48
Mark 3
Tashi 54
$ sort -k2 -n ages.txt
Mark 3
Bob 12
Jane 48
Tashi 54
Solution:
sort -k 2 -n filename
more verbosely written as:
sort --key 2 --numeric-sort filename
Example:
$ cat filename
A 12
B 48
C 3
$ sort --key 2 --numeric-sort filename
C 3
A 12
B 48
Explanation:
-k # - This argument specifies the first column that will be used to sort. (Note that a "column" here means a whitespace-delimited field; the argument -k5 will sort starting with the fifth field in each line, not the fifth character in each line.)
-n - This option specifies a "numeric sort", meaning the column should be interpreted as numbers rather than as text.
More:
Other common options include:
-r - this option reverses the sorting order. It can also be written as --reverse.
-i - This option ignores non-printable characters. It can also be written as --ignore-nonprinting.
-b - This option ignores leading blank spaces, which is handy because whitespace is what delimits the fields. It can also be written as --ignore-leading-blanks.
-f - This option ignores letter case. "A"=="a". It can also be written as --ignore-case.
-t [separator] - This option makes sort split fields on a separator other than whitespace. It can also be written as --field-separator.
There are other options, but these are the most common and helpful ones that I use often; a quick demonstration of -r with the earlier example follows.
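For instance, adding -r to the numeric sort reverses the order, here using the ages.txt file from the example above:
$ sort -k2 -nr ages.txt
Tashi 54
Jane 48
Bob 12
Mark 3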
For tab-separated values, the command below can be used:
sort -t$'\t' -k2 -n
-n sorts numerically.
-k, --key=POS1[,POS2] selects the column (field) to sort on.
-r reverses the order, giving descending output.
For descending order, the command is:
sort -t$'\t' -k2 -rn
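For instance, assuming a hypothetical tab-separated file ages.tsv with the same name/age layout as the earlier example:
$ sort -t$'\t' -k2 -n ages.tsv     # ascending by age
Mark    3
Bob     12
Jane    48
Tashi   54
$ sort -t$'\t' -k2 -rn ages.tsv    # descending by age
Tashi   54
Jane    48
Bob     12
Mark    3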
Use sort.
sort ... -k 2,2 ...
Simply
$ sort -k2,2n <<<"$data"
I'm trying to assign a variable in bash to the file in this directory with the largest number before the '.tar.gz' and I'm drawing a complete blank on the best way to approach this:
ls /dirname | sort
daily-500-12345.tar.gz
daily-500-12345678.tar.gz
daily-500-987654321.tar.gz
weekly-200-1111111.tar.gz
monthly-100-8675309.tar.gz
sort -Vrt - -k3,3
-V Natural sort
-r Reverse, so you can use head -1 to get the first line only (see the variable-assignment sketch after the output below)
-t - Use hyphen as field separator
-k3,3 Sort using only the third field
Output:
daily-500-987654321.tar.gz
daily-500-12345678.tar.gz
monthly-100-8675309.tar.gz
weekly-200-1111111.tar.gz
daily-500-12345.tar.gz
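To capture the newest file in a variable, as the question asks, something like the following should work (a sketch; note that parsing ls output is fragile if filenames contain whitespace):
$ newest=$(ls /dirname | sort -Vrt - -k3,3 | head -1)
$ echo "$newest"
daily-500-987654321.tar.gz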
"sort" correctly reports these two lines are out of order:
> echo "a b\na a" | sort -c
sort: -:2: disorder: a a
How do I tell sort to compare only the first field of each line? I tried:
> echo "a b\na a" | sort -c -k1
sort: -:2: disorder: a a
but it failed, as above.
Can I make sort compare only the first field of each line, or must I
use something like sed to trim the lines before comparing them?
EDIT: I'm using "sort (GNU coreutils) 7.2". I tried using a different field separator but it didn't help:
> echo "a b\na a" | sort -k1 -c -t" "
sort: -:2: disorder: a a
although I'm pretty sure space is the default separator anyway.
The following works as expected:
echo "a b\na a" | sort -s -c -k1,1
There were two problems with your sort invocation:
The argument to -k is a key definition that specifies a start and end position. If end position is omitted, it defaults to the last field of the line, not the start field. -k1,1 specifies both, telling sort not to include the second field in the comparison.
sort is not stable by default, which means it doesn't guarantee not to disturb the order of lines that compare equal. Quoting the documentation:
Finally, as a last resort when all keys compare equal, sort compares
entire lines as if no ordering options other than --reverse (-r)
were specified. The --stable (-s) option disables this
"last-resort comparison" so that lines in which all fields compare
equal are left in their original relative order.
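A quick way to confirm the fix (a sketch using printf so the \n is expanded reliably; sort -c exits 0 and prints nothing when the input is sorted according to the key):
$ printf 'a b\na a\n' | sort -s -c -k1,1
$ echo $?
0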
Sorry for the vague title, I couldn't think of a better one...
I have 2 tab-delimited files with identical first columns (different numbers of total columns). I would like to sort both files by their first column.
I think I could do this either with the -t\t option or the -k1,12 option (since the first column is never longer than 12 characters). Both options produce the same (wrong) output.
Even though both files have the same first column, they are sorted differently. Notice that in file1 I get ...23, ...29, ...2, while in file2 I get ...2, ...23, ...29.
$ head file1 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs10000023
rs10000029
rs1000002
rs10000030
$ head file2 | sort -t\t | cut -f1
rs1000000
rs10000010
rs10000012
rs10000013
rs10000017
rs1000002
rs10000023
rs10000029
rs10000030
How can I sort both files such that the first column is in the same order in each?
Thank you!
sort -t $'\t' -k 1,1
Use $'\t' to have the shell interpret \t as a tab since sort doesn't parse escape sequences. Use -k to tell it to only sort on the first field rather than the entire line.
You might also want the -V flag if you want 2 to sort in between 0 and 10.
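For example, to sort both files consistently (the .sorted names are just placeholders):
$ sort -t $'\t' -k 1,1 file1 -o file1.sorted
$ sort -t $'\t' -k 1,1 file2 -o file2.sorted
$ diff <(cut -f1 file1.sorted) <(cut -f1 file2.sorted)    # no output means the key columns now match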
I have two lists. I need to determine which word from the first list appears most frequently in the second list. The first file, list1.txt, contains a list of words, sorted alphabetically, with no duplicates. I have used some scripts to ensure that each word appears on a unique line, e.g.:
canyon
fish
forest
mountain
river
The second file, list2.txt, is in UTF-8 and also contains many items. I have also used some scripts to ensure that each word appears on a unique line, but some items are not words, and some might appear many times, e.g.:
fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean
The script should output the most frequently matching item. For example, if run with the two files above, the output would be “fish”, because that word from list1.txt occurs most often in list2.txt.
Here is what I have so far. First, it searches for each word and creates a CSV file with the matches:
#!/bin/bash
while read -r line
do
    count=$(grep -c "^$line" list2.txt)
    echo "$line,$count" >> found.csv
done < ./list1.txt
After that, found.csv is sorted descending by the second column. The output is the word appearing on the first line.
I do not think, though, that this is a good script: it is not very efficient, and it is possible that there is no single most frequent matching item, for example:
If there is a tie between 2 or more words, e.g. “fish”, “canyon”, and “forest” each appear 5 times while no other word appears as often, the output should be these 3 words in alphabetical order, separated by commas, e.g.: “canyon,fish,forest”.
If none of the words from list1.txt appears in list2.txt, then the output is simply the first word from the file list1.txt, e.g. “canyon”.
How can I create a more efficient script which finds which word from the first list appears most often in the second?
You can use the following pipeline:
grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1
-F tells grep to search for literal strings, -f tells it to use list1.txt as the list of patterns to search for. The rest sorts the matches, counts duplicates, and sorts them according to the number of occurrences. The last part selects the last line, i.e. the most common one (plus the number of occurrences).
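With the two sample lists above, the last line is the count and word for “fish”. If only the word itself is needed, an extra awk step can strip the count (a sketch that, like the pipeline above, does not handle the tie or no-match cases from the question; adding -x to grep would restrict matches to whole lines):
$ grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1 | awk '{print $2}'
fish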
> awk 'FNR==NR{a[$1]=0;next}($1 in a){a[$1]++}END{for(i in a)print a[i],i}' file1 file2 | sort -rn|head -1
Assuming 'list1.txt' is sorted, I would use Unix join:
sort list2.txt | join -1 1 -2 1 list1.txt - | sort |\
uniq -c | sort -n | tail -n1
If a file contains multiple columns, separated by commas, like this:
aaa,1,4,4,5,7
bbb,1,4,9,1,2
Is there a difference between 'sort -t, -k1 file.txt' and 'sort -t, -k1,1 file.txt'?
With the example above there is no difference, but in one of my project cases there is. The difference shows up when I use the sorted file as input to join: join fails with 'join: file 2 is not in sorted order' when I sort with 'sort -t, -k1 file.txt', but when I use 'sort -t, -k1,1 file.txt' instead, join works fine. Can anybody tell me why?
sort -k1 means sort using everything from field 1 to the end of the line as the key. sort -k1,1 means sort using only field 1 as the key. On my machine, the two make a difference when I specify a stable sort with -s:
~ $ cat test.txt
aaa,1,4,4,5,7
aaa,1,3,9,1,2
~ $ sort -t, -k1 -s test.txt
aaa,1,3,9,1,2
aaa,1,4,4,5,7
~ $ sort -t, -k1,1 -s test.txt
aaa,1,4,4,5,7
aaa,1,3,9,1,2
The second number is where the sort key ends, which defaults to the end of the line. From the manpage:
-k, --key=POS1[,POS2]:
start a key at POS1 (origin 1), end it at POS2 (default end of line)
So, yes, there is a difference, just not for your example data, since the 1,1 sort key has no duplicates.
But where you specify the 1,1 sort key, the two lines:
abc,plugh
abc,xyzzy
can sort in either order. With just 1 (meaning 1,end-of-line), they'll sort in the order given.
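A small demonstration of that last point, using a hypothetical two-line file demo.txt (with -s added so the 1,1 key's behaviour is visible, as in the stable-sort example above):
$ printf 'abc,xyzzy\nabc,plugh\n' > demo.txt
$ sort -t, -k1,1 -s demo.txt    # keys are equal, so the stable sort keeps input order
abc,xyzzy
abc,plugh
$ sort -t, -k1 demo.txt         # key runs to end of line, so the second field decides
abc,plugh
abc,xyzzy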