UNIX sort with -c - sorting

I have a file (3 columns, tab delimited) and I need to check whether it is sorted or not.
Example:
chr1 9999999 10000125 C57T3ANXX:7:2114:14205:58915/2 50 -
chr1 10010918 10011044 C57T3ANXX:7:2310:08814:31632/1 50 +
chr1 10011185 10011311 C57T3ANXX:7:2310:08814:31632/2 50 -
On the above file, I use
cut -f1,2 f | sort -cn
which gives me
sort: -:2: disorder: chr1 10010918.
I am not sure why, as the file is already sorted. I get the same order when I use
sort -k1,1 -k2,2 f

sort -cn treats the entire line as its only key. Because each line starts with a non-numeric character, the numeric comparison falls back to plain text comparison for that key, and as text chr1 9999999 sorts after chr1 10010918, hence the reported disorder.
Enable numeric mode for your keys explicitly:
sort -k1,1 -k2,2 -cn
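With the keys spelled out, the same check on the sample data should pass. A quick sketch (f is the file name from the question, piped through the same cut as before):
cut -f1,2 f | sort -k1,1 -k2,2 -cn
When the input is in order, sort -c prints nothing and exits with status 0.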

Related

How to find the total count of matching lines in two files

I have two text files Like
file1
1018 2
1019 7
1023 4
file2
1018 2
1019 7
1023 4
1026 8
I have a small piece of bash/awk code to find the matches and count them:
awk 'FNR==NR{a[$0]=1; next} $0 in a { count[$0]++ }
END { for( i in a ) print i, count[i]}' file1 file2
The output I get:
1018 2 1
1019 7 1
1023 4 1
I just want the total count, which in this case is 3. It seems simple to print the count after the loop, but I couldn't get it to work. Any solution?
When I have a list of output lines in bash, I'll use wc. wc does a word count, and you can tell it to count the number of lines instead. So, say I want to count the number of files in a directory, I'll do:
ls -lh | wc -l
You could use a combination of sort and uniq to do this. This is what it would look like:
cat file1 file2 | sort | uniq -d | wc -l
Explanation:
cat is used to concatenate the two files
sort is used to sort the merged content
uniq (with option -d) is used to only show duplicated lines
wc (with option -l) counts the remaining lines
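If you prefer to stay in awk, a single pass can print just the total. This is a sketch building on the awk from the question, with the same file1 and file2, and it assumes the lines within each file are unique, as in the example:
awk 'FNR==NR { a[$0]; next } $0 in a { total++ } END { print total }' file1 file2
For the sample files this prints 3.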

How to sort one column and add values later

I would like to first sort on a specific column, which I do using sort -k2 <file>. Then, after it is sorted by the values in the second column, I would like to sum the values from column 1 for each duplicated column 2 entry, delete the duplicates, and keep the sum in column 1.
Example:
2 AAAAAA
3 BBBBBB
1 AAAAAA
2 BBBBBB
1 CCCCCC
sort -k2 <file> does this:
2 AAAAAA
1 AAAAAA
3 BBBBBB
2 BBBBBB
1 CCCCCC
I know uniq -c removes duplicates and outputs how many times each line occurred; however, I don't want the number of occurrences, I just need column 1 to be summed and displayed, so that I would get:
3 AAAAAA
5 BBBBBB
1 CCCCCC
I came up with a solution using two for loops:
The outer loop iterates over all distinct strings in the file (test.txt); for each one, the inner loop finds all the matching numbers in the original file and adds them up. After adding all the numbers we echo the total and the string.
for chars in `sort -k2 test.txt | uniq -f 1 | cut -d' ' -f 2`;
do
    total=0;
    for nr in `grep "$chars" test.txt | cut -d' ' -f 1`;
    do
        total=$(($total+$nr));
    done;
    echo $total $chars
done
-c is your enemy here: you explicitly said you do not want the count. Here is my suggestion:
sort -k2 <file> | uniq -f1 > file2
which gives me
cat file2
1 AAAAAA
2 BBBBBB
1 CCCCCC
If you want only column 2 in the file, then use awk:
sort -k2 <file> | uniq -f1 | awk '{print $2}' > file2
leading to
AAAAAA
BBBBBB
CCCCCC
Now I got it at last. But if you want to sum column 1, then just use awk; of course you cannot do a grouped sum with uniq alone:
awk '{array[$2]+=$1} END { for (i in array) {print array[i], i}}' file |sort -k2
which leads to your solution (even if I sorted afterwards):
3 AAAAAA
5 BBBBBB
1 CCCCCC
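Since sort -k2 groups equal column 2 values together anyway, a streaming variant that prints each group's total as soon as the group ends would also work. This is just a sketch, assuming whitespace-separated columns and a file named file as in the example:
sort -k2 file | awk '
    $2 != prev { if (NR > 1) print sum, prev; sum = 0; prev = $2 }  # new column 2 group: print the finished total
    { sum += $1 }                                                   # add column 1 to the running total
    END { if (NR > 0) print sum, prev }                             # print the total for the last group
'
For the example above it prints the same three lines, already ordered by column 2.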

Finding the biggest number from a text file containing two columns and several rows

For example, I have got a text file containing 2 columns:
0.000000e+00 0.000000e+00
1.958870e-02 1.566242e-02
3.923750e-02 6.509739e-03
4.394830e-01 3.216723e-03
4.594830e-01 2.508868e-03
4.794890e-01 3.813512e-04
4.995070e-01 8.846235e-04
5.997070e-01 1.671057e-03
I want to find the maximum value in column 2, and show the corresponding value of column 1 in the output.
This awk one-liner will do it without sorting:
awk '$2>m{f=$1;m=$2}END{print f}' file
it outputs:
1.958870e-02
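If you also want to see the maximum itself next to the matching column 1 value, a small variant of the same one-liner does it. A sketch, assuming the same whitespace-separated layout:
awk 'NR == 1 || $2 > m { f = $1; m = $2 } END { print f, m }' file
For the sample data this prints 1.958870e-02 1.566242e-02. The NR == 1 clause also keeps it correct if every value in column 2 happens to be negative.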
I was just testing the exact same solution that #jeanrjc just posted as a comment - I think, if I understand your question, it is the correct answer (to get the MAX row):
sort -n -k2 file.dat | tail -1
If you have 1e-2 notation, you need to sort with the -g option:
Max :
sort -k2g file.dat | tail -1
Min :
sort -k2gr file.dat | tail -1
-k2 stands for column 2
-r (or -k2r) for reverse order
If you have a header, you can remove it with awk:
awk 'NR>1' file.dat | sort -k2g | tail -1
You can alternatively use head instead of tail to get the opposite result, e.g.:
sort -k2g file.dat | head -1
will give you the min.
Hope this helps.

Using sort | awk on one column from a csv?

Using p.txt:
$cat p.txt
R 3
R 4
S 1
S 2
R 1
T 1
R 3
The following command sorts based on the second column:
$cat p.txt | sort -k2
R 1
S 1
T 1
S 2
R 3
R 3
R 4
The following command removes repeated values in the second column:
$cat p.txt | sort -k2 | awk '!x[$2]++'
R 1
S 2
R 3
R 4
Now, substituting a comma for the space, we have the following file:
$cat p1.csv
R,3
R,4
S,1
S,2
R,1
T,1
R,3
The following command still sorts based on the second column:
$cat p1.csv | sort -t "," -k2
R,1
S,1
T,1
S,2
R,3
R,3
R,4
Below is NOT the correct output:
$cat p1.csv | sort -t "," -k2 | awk '!x[$2]++'
R,1
Correct output:
R,1
S,2
R,3
R,4
Any suggestions?
Well, since you have already used sort, you don't need the awk at all; sort has -u.
The cat is not needed either:
sort -t, -k2 -u p1.csv
should give you the expected output.
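If you want to be explicit that only field 2 decides uniqueness (here -k2 means "from field 2 to the end of the line", which happens to be the same thing for a two-field file), you can bound the key:
sort -t, -k2,2 -u p1.csv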
Try awk -F, in your last command. So:
cat p1.csv | sort -t "," -k2 | awk -F, '!x[$2]++'
Since your fields are separated by commas, you need to tell awk that the field separator is no longer whitespace, but instead the comma. The -F option to awk does that.
Well, you don't need all of that; sort and uniq are enough here:
sort -t "," -k2 p1.csv | uniq -s 2
uniq -s 2 tells uniq to skip the first 2 characters of each line (i.e. everything up to and including the comma) when comparing, which works here because column 1 is always a single character.

Unix - randomly select lines based on column values

I have a file with ~1000 lines that looks like this:
ABC C5A 1
CFD D5G 4
E1E FDF 3
CFF VBV 1
FGH F4R 2
K8K F9F 3
... etc
I would like to select 100 random lines, but with 10 of each third column value (so 10 random lines from all lines with value "1" in column 3, 10 random lines from all lines with value "2" in column 3, etc.).
Is this possible using bash?
First grep all the lines ending with a certain number, shuffle them, and pick the first 10 using shuf -n 10.
for i in {1..10}; do
grep " ${i}$" file | shuf -n 10
done > randomFile
If you don't have shuf, use sort -R to randomly sort them instead:
for i in {1..10}; do
grep " ${i}$" file | sort -R | head -10
done > randomFile
If you can use awk, you can do the same with a one-liner
sort -R file | awk '{if (count[$3] < 10) {count[$3]++; print $0}}'
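If the set of column 3 values is not known in advance (the loops above hard-code 1 through 10), you could derive it from the file first. A sketch in the same spirit:
for i in $(awk '{print $3}' file | sort -u); do
    grep " ${i}$" file | shuf -n 10
done > randomFile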
