How to sort by the third column - bash

I know there have been some questions about this. I tried the methods they mentioned, but they do not work.
My data is in a file called Book1.csv, like this:
Then I used this bash command: sort -r -n -k 3,3 Book1.csv > sorted.csv
But the outcome is not what I want:
I want the outcome to be like:
In addition, since the first column is Id and the third column is score, I want to print the ids with the highest scores. In this case, it should print the two ids whose scores are 50, like this: TRAAAAY128F42A73F0 TRAAAAV128F421A322 How to achieve it?

Assuming that your csv is comma separated and not another delimiter, this is one way to do it. However, I think there is probably a way to do most of this, if not all of it, in awk; unfortunately my knowledge of awk is limited, so here is how I would do it quickly.
First, according to the comments, the -t flag of sort resolved your sorting issue.
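For reference, with the delimiter set the sort line from the question would presumably become:
sort -t, -r -n -k 3,3 Book1.csv > sorted.csv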
#!/bin/bash
#set the csv file path in a variable
mycsv="/path/csv.csv"
#get the third value of the first line after sorting on the third value, descending
max_val=$(sort -t, -k3,3nr "$mycsv" | head -n1 | cut -d, -f3)
#use awk to check whether the third column equals the max value, then print the first column.
#Note I am setting the delimiter to a comma here with the -F flag
awk -F"," -v awkmax="$max_val" '$3 == awkmax {print $1}' "$mycsv"
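For what it's worth, here is a rough all-awk sketch of the same idea in a single pass, tracking the maximum score and collecting the matching ids as it goes (untested, and it assumes a plain comma-separated file with no header row):
awk -F, 'NR==1 {max=$3+0; ids=$1; next}
         {s=$3+0; if (s>max) {max=s; ids=$1} else if (s==max) ids=ids" "$1}
         END {print ids}' Book1.csv
The $3+0 forces a numeric comparison, which also copes with stray whitespace around the scores.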

While printing all IDs with the highest score can be done in bash with basic Unix commands, I think it's better at this point to switch to an actual scripting language (unless you're in some very limited environment).
Fortunately, perl is everywhere, and this task of printing the ids with the largest scores can be done as one (long) line in perl:
perl -lne 'if (/^([^,]*),[^,]*,\s*([^,]*)/) {push @{$a{$2}},$1; if($2>$m) {$m=$2;}} END {print "@{$a{$m}}";}' Book1.csv

Related

Remove duplicated entries in a table based on first column (which consists of two values sep by colon)

I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split the file on either a space or : as the field separator and group the lines by the word after the colon.
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the line into individual words, and the part !unique[$2]++ builds a hash-table map keyed on the value in $2. The count is incremented every time a value is seen in $2, so when the same value appears again the negation ! in the condition prevents that line from being printed.
Defining the field separator as a regex with the -F flag might not be supported by all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to de-duplicate the file based on the word after the :, but to do it based purely on the first column, just use
awk '!unique[$1]++' file
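As a side note, the sort-based attempt from the question should also work once the uniqueness check is restricted to the first key alone, something like the sketch below; be aware, though, that which of two lines sharing the same key gets kept is not guaranteed, so prefer the awk version if that matters.
sort -t' ' -u -k1,1 input > output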
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This is just going to sort the file and do a uniqueness check on the first 7 characters. The data in the first 7 characters is numeric here; if any alphabetic characters creep in, add -i to the command.

Unix: Find duplicate occurrences in column in csv file, omit one possible value

I am hoping for a line or two of code for a bash script to find and print repeated items in a column of a 2.5G csv file, except for one item that I know is commonly repeated.
The data file has a header, but it is not duplicated, so I'm not worried about code that accounts for the header being present.
Here is an illustration of what the data look like:
header,cat,Everquest,mermaid
1f,2r,7g,8c
xc,7f,66,rp
Kf,87,gH,||
hy,7f,&&,--
rr,2r,89,))
v6,2r,^&,!c
92,#r,hd,m
2r,2r,2r,2r
7f,7f,7f,7f
9,10,11,12
7f,2r,7f,7f
76,#r,88,u|
I am seeking the output:
7f
#r
as both of these are duplicated in column two. As you can see, 2r is also duplicated, but it is commonly duplicated and I know it, so I just want to ignore it.
To be clear, I can't know the values of the duplicates other than the common one, which, in my real data files, is actually the word 'none'. It's '2r' above.
I read here that I can do something like
awk -F, ' ++A[$2] > 1 { print $2; exit 1 } ' input.file
However, I cannot figure out how to skip '2r' nor what ++A means.
I have read the awk manual, but I am afraid I find it a little confusing with respect to the question I am asking.
Additionally,
uniq -d
looks promising based on a few other questions and answers, but I am still unsure how to skip over the value that I want to ignore.
Thank you in advance for your help.
How to skip '2r':
$ awk -F, ' ++a[$2] == 2 && $2 != "2r" { print $2 } ' file
7f
#r
++a[$2] adds an element to a hash array and increases its value by 1, i.e. it counts how many occurrences of each value exist in the second column.
Get only the second column using cut -d, -f2
sort
uniq -d to get repeated lines
grep -Fv 2r to exclude a value, or grep -Fv -e foo -e bar … to exclude multiple values
In other words something like this:
cut -d, -f2 input.csv | sort | uniq -d | grep -Fv 2r
Depending on the data it might be faster if you move grep earlier in the pipeline, but you should verify that with some benchmarking.
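For example, the filter-first variant might look like this (a sketch; -x makes grep match whole lines only, so values that merely contain 2r are not dropped):
cut -d, -f2 input.csv | grep -Fxv 2r | sort | uniq -d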

Can I use grep to extract a single column of a CSV file?

I'm trying to solve a problem I have to finish as soon as possible.
I have a csv file, fields separated by ;.
I'm asked to make a shell command using grep to list only the third column, using regex. I can't use cut. It is an exercise.
My file is like this:
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
2;Wayne;Watkins;22;Lanme Place;Cotoiwi;NC;86578
3;Danny;Vega;25;Fofci Center;Momahbih;MS;21027
4;Larry;Robinson;23;Bammek Boulevard;Gaizatoh;NE;27517
5;Myrtie;Black;20;Savon Square;Gokubpat;PA;92219
6;Nellie;Greene;23;Utebu Plaza;Rotvezri;VA;17526
7;Clyde;Reynolds;19;Lupow Ridge;Kedkuha;WI;29749
8;Calvin;Reyes;47;Paad Loop;Beejdij;KS;29247
9;Douglas;Graves;43;Gouk Square;Sekolim;NY;13226
10;Josephine;Estrada;48;Ocgig Pike;Beheho;WI;87305
11;Eugene;Matthews;26;Daew Drive;Riftemij;ME;93302
12;Stanley;Tucker;54;Cure View;Woocabu;OH;45475
13;Lina;Holloway;41;Sajric River;Furutwe;ME;62184
14;Hettie;Carlson;57;Zuheho Pike;Gokrobo;PA;89098
15;Maud;Phelps;57;Lafni Drive;Gokemu;MD;87066
16;Della;Roberson;53;Zafe Glen;Celoshuv;WV;56749
17;Cory;Roberson;56;Riltav Manor;Uwsupep;LA;07983
18;Stella;Hayes;30;Omki Square;Figjitu;GA;35813
19;Robert;Griffin;22;Kiroc Road;Wiregu;OH;39594
20;Clyde;Reynolds;19;Lupow Ridge;Kedkuha;WI;29749
21;Calvin;Reyes;47;Paad Loop;Beejdij;KS;29247
22;Douglas;Graves;43;Gouk Square;Sekolim;NY;13226
23;Josephine;Estrada;48;Ocgig Pike;Beheho;WI;87305
24;Eugene;Matthews;26;Daew Drive;Riftemij;ME;93302
I think I should use something like: cat < test.csv | grep 'regex'.
Thanks.
Right Tools For The Job: Using awk or cut
Assuming you want to match the third column against a specific value:
awk -F';' '$3 ~ /Foo/ { print $0 }' file.txt
...will print any line where the third field contains Foo. (Changing print $0 to print $3 would print only that third field).
If you just want to print the third column regardless, use cut: cut -d';' -f3 <file.txt
Wrong Tool For The Job: Using GNU grep
On a system where grep has the -o option, you can chain two instances together -- one to trim everything after the fourth column (and remove lines with fewer than four columns), another to take only the last remaining column (thus, the fourth):
str='foo;bar;baz;qux;meh;whatever'
grep -Eo '^[^;]*[;][^;]*[;][^;]*[;][^;]*' <<<"$str" \
| grep -Eo '[^;]+$'
To explain how that works:
^, outside of square brackets, matches only at the beginning of a line.
[^;]* matches any character except ; zero-or-more times.
[;] matches only the character ;.
...thus, each [^;]*[;] in the regex matches a single field, whether or not that field contains text. Putting three of those, plus a final [^;]* with no trailing ;, in the first stage means we're matching exactly the first four fields, and grep -o tells grep to emit only the content it was successfully able to match.
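Applied to the question's file and its third column, the same two-stage idea might look like this (a sketch, using the test.csv name from the question; three field matchers keep columns 1-3 and the second grep then keeps the last of them):
grep -Eo '^[^;]*[;][^;]*[;][^;]*' test.csv | grep -Eo '[^;]+$'
Note that a completely empty third field would be dropped by the trailing + in the second pattern.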
If you just need the 3rd field and it's always properly delimited with ';', why not use 'cut'?
cut -d';' -f3 <filename>
UPDATED:
The OP wasn't clear; maybe they only want to look at the 3rd line?
head -3 <filename> | tail -1
Or maybe they just want a list of the things that appear in the 3rd field? It's not clear what the intended use of 'grep' would be.
cut -d';' -f3 <filename> | sort -u
As the other answers have said, using grep is a bad/unfortunate idea.
The only way I can think of using grep is to pull out a specific row where the 3rd column == some value. E.g.,
grep '^\([^;]*;\)\{2\}Bell;' test.txt
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
Or if the first column is the index (not counting it as a column):
grep '^\([^;]*;\)\{3\}39;' test.txt
1;Evan;Bell;39;Obigod Manor;Ekjipih;TN;25008
Even using grep in this case leads to a pretty ugly solution.
Edit: Didn't see Charles Duffy's answer... that's pretty clever.

Performant way of displaying the number of unique column entries in a set of files?

I'm attempting to pipe a large number of files into a sequence of commands which displays the number of unique entries in a given column of said files. I'm inexperienced with the shell, but after a short while I was able to come up with this:
awk '{print $5}' | sort | uniq | wc -l
This sequence of commands works fine for a small number of files, but takes an unacceptable amount of time to execute on my target set. Is there a set of commands that can accomplish this more efficiently?
You can count unique occurrences of values in the fifth field in a single pass with awk:
awk '{if (!seen[$5]++) ++ctr} END {print ctr}'
This creates an array of the values in the fifth field and increments the ctr variable if the value has never been seen before. The END rule prints the value of the counter.
With GNU awk, you can alternatively just check the length of the associative array in the end:
awk '{seen[$5]++} END {print length(seen)}'
Benjamin has supplied the good oil, but depending on just how much data is to be stored in the array, it may pay to pass the data to wc anyway:
awk '!_[$5]++' file | wc -l
The shortest and fastest version (that I could come up with) using awk, not far from @BenjaminW's earlier version. I think it's a bit faster (the difference would only matter on a very huge file) because the test is made earlier in the process.
awk '!E[$5]++{c++}END{print c}' YourFile
It works with all awk versions.
GNU datamash has a countunique operation for columns:
datamash -W countunique 5
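A quick usage sketch, piping several files through it (the file names are placeholders):
cat file1 file2 file3 | datamash -W countunique 5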

I need to be able to print the largest record value from txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1 or first field) in order for me to increment this unique record and add more records to the file. I seem to not be able to figure this one out.
First, I sort the file by the first field in descending order and then, perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.
largest=$(sort -r file|cut -d"," -f1|head -1)
You need spaces and quotes:
awk -F, 'NR==1{print $1}'
The command is awk, and you need a space after it so bash parses your command line properly; otherwise it thinks the whole thing is the name of the command, which is what the error message is telling you.
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma -F,
What follows in quotes is what you want awk to interpret. It says to match a line with the pattern NR==1; NR is special, it is the record number, so you want it to match the first record. Following that is the action you want awk to take when that pattern matches, {print $1}, which says to print the first (comma-separated) field of the line.
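Put together with the sort step described in the question (sorting numerically and descending on the first comma-separated field), the corrected approach would presumably look something like this:
largest_record=$(sort -t, -k1,1nr student_records.txt | awk -F, 'NR==1{print $1}')
echo "$largest_record"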
A better way to accomplish this would be to use awk to find the largest record for you rather than sorting first. That gives you a solution that is linear in the number of records - you just want the max, so there is no need to do the extra work of sorting the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
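And since the question mentions incrementing that id in order to add new records, here is a rough sketch of capturing the result and bumping it (the variable names are just illustrative):
largest_record=$(awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt)
next_record=$((largest_record + 1))
echo "$next_record"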
For this and other awk "one liners" look here.
