I need to be able to print the largest record value from a txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1, the first field) so that I can increment this unique record number and add more records to the file. I can't seem to figure this one out.
First, I sort the file by the first field in descending order and then perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.

largest=$(sort -t, -k1,1 -nr file | cut -d, -f1 | head -1)

You need spaces and quotes:
awk -F, 'NR==1{print $1}'
The command is awk; you need a space after it so that bash parses your command line properly. Otherwise it thinks the whole thing is the name of the command, which is what the error message is telling you.
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma -F,
What follows in quotes is what you want awk to interpret. It says to match a line with the pattern NR==1; NR is special, it is the record number, so you want it to match the first record. Following that is the action you want awk to take when that pattern matches, {print $1}, which says to print the first (comma-separated) field of the line.
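Putting the corrected command back into the workflow described in the question (a minimal sketch; the get_largest_record function name is made up, and the sort step follows the question's own approach):
get_largest_record() {
    # sort numerically and descending on the first comma-separated field,
    # then take the first field of the first line
    largest_record=$(sort -t, -k1,1 -nr student_records.txt | awk -F, 'NR==1{print $1}')
    echo "$largest_record"
}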
A better way to accomplish this is to let awk find the largest record for you rather than sorting first. That gives you a solution that is linear in the number of records: you just want the max, so there is no need to do the extra work of sorting the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
For this and other awk "one liners" look here.
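Once you have the maximum, the increment-and-append step mentioned in the question could look like this (a minimal sketch; the add_record function name and the positional fName/lName/Grade/email parameters are illustrative assumptions):
add_record() {
    local file="student_records.txt"
    # reuse the linear-scan awk from above to find the current maximum ID
    local largest=$(awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' "$file")
    local next=$((largest + 1))
    # append a new record: ID,fName lName,Grade,email
    printf '%s,%s %s,%s,%s\n' "$next" "$1" "$2" "$3" "$4" >> "$file"
}
# usage: add_record fName lName Grade email@example.com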

Related

Remove duplicated entries in a table based on first column (which consists of two values sep by colon)

I need to sort and remove duplicated entries in my large table (space separated), based on values on the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split the line on either a space or : as the field separator and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ creates a hash-table map keyed on the value of $2. The count is incremented every time a value is seen in $2, so that when the same value turns up again the negation condition ! prevents the line from being printed a second time.
Defining a regex field separator with the -F flag might not be supported on all awk versions. In a POSIX-compliant way, you could do:
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The part above assumes you want to de-duplicate based on the word after the :; to de-duplicate on the whole first column instead, just do:
awk '!unique[$1]++' file
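If you would rather stay with sort, restricting the key (and therefore the uniqueness test) to the first field should also work, since with -u sort compares only the key fields when keys are given (a sketch; note that, unlike the awk version, which duplicate survives is not guaranteed to be the first one from the original file):
sort -k1,1 -u input > output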
Since your input data is pretty simple, the command is going to be very easy:
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters. In this data the first 7 characters are digits (plus the colon); if alphabetic characters turn up, add -i to the command to ignore case.

How to detect some pattern with grep -f on a file in terminal, and extract those lines without the pattern

I'm on mac terminal.
I have a txt file with one column of 9 IDs, allofthem.txt, where every ID starts with "rs":
rs382216
rs11168036
rs9296559
rs9349407
rs10948363
rs9271192
rs11771145
rs11767557
rs11
Also, I have another txt file, useful.txt, with the IDs that were useful in an analysis I did. It looks the same, one column with several rows of IDs, but with fewer of them, only 4.
rs9349407
rs10948363
rs9271192
rs11
Problem: I want to generate a new txt file with the non-useful ones (the ones that appear in allofthem.txt but not in useful.txt).
I want to do the inverse of:
grep -f useful.txt allofthem.txt
I want some systematic way of deleting all the IDs in useful.txt and obtaining a file with the remaining ones. Maybe with awk or sed, but I can't see it. Can you help me, please? Thanks in advance!
Desired output:
rs382216
rs11168036
rs9296559
rs11771145
rs11767557
The -v option does the inverse for you:
grep -vxf useful.txt allofthem.txt > remaining.txt
The -x option matches the whole line in allofthem.txt, not just parts of it.
As @hek2mgl rightly pointed out, you need -F if you want to treat the content of useful.txt as strings and not patterns:
grep -vxFf useful.txt allofthem.txt > remaining.txt
Make sure your files have no leading or trailing white spaces - they could affect the results.
I recommend using awk:
awk 'FNR==NR{patterns[$0];next} !($0 in patterns)' useful.txt allofthem.txt
Explanation:
FNR==NR is true as long as we are reading useful.txt. We create an entry in patterns for every line of useful.txt; next stops further processing of that line.
!($0 in patterns) runs, because of the previous next statement, only on the lines of allofthem.txt. For every line of that file it checks whether the line is a key in patterns; if it is not, awk prints that line, which gives you exactly the non-useful IDs.
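If both files are (or can be) sorted, comm is another option for getting the lines that appear only in allofthem.txt (a sketch, assuming a shell with process substitution such as bash or zsh; note the output comes out in sorted order rather than the original order):
comm -23 <(sort allofthem.txt) <(sort useful.txt) > remaining.txt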

How to strip date in csv output using shell script?

I have a few csv extracts that I am trying to fix up the date on, they are as follows:
"Time Stamp","DBUID"
2016-11-25T08:28:33.000-8:00,"5tSSMImFjIkT0FpiO16LuA"
The first column is always the "Time Stamp", I would like to convert this so it only keeps the date "2016-11-25" and drops the "T08:28:33.000-8:00".
The end result would be..
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
There are plenty of files with different dates.
Is there a way to do this in ksh? Some kind of for-each loop to go through all the files, replace the long timestamp, and leave just the date?
Use sed:
$ sed '2,$s/T[^,]*//' file
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
How it works:
2,$ # Skip the header (first line); removing this would make the
# replacement happen on the first line as well.
s/T[^,]*// # Replace everything between T (inclusive) and , (exclusive)
# `[^,]*' Matches everything but `,' zero or more times
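Since the question also asks for a loop over all the files in ksh, here is a minimal sketch wrapping the same sed command (the *.csv glob is an assumption about how the extracts are named):
for f in *.csv; do
    sed '2,$s/T[^,]*//' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done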
Here's one solution using a standard AIX utility:
awk -F, -v OFS=, 'NR>1{sub(/T.*$/,"",$1)}1' file > file.cln && mv file.cln file
output
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
(But I no longer have access to an AIX environment, so this was only tested with my local awk.)
NR>1 skips the header line, and the sub() is limited to only the first field (up to the first comma). The trailing 1 char is awk shorthand for {print $0}.
If your data layout changes and you get extra commas in your data, this may require fixing.
IHTH
Using sed:
sed -i "s/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\).*,/\1-\2-\3,/" file.csv
Output:
"Time Stamp","DBUID"
2016-11-25,"5tSSMImFjIkT0FpiO16LuA"
-i edit files in place
s substitute
This is a perfect job for awk, but unlike the previous answer, I recommend using the substring function.
awk -F, -v OFS=, 'NR > 1{$1 = substr($1,1,10)} {print $0}' file.txt
Explanation
-F, and -v OFS=,: The -F flag sets the input field separator, in this case a comma; OFS sets the output field separator so the commas are kept when awk rebuilds the modified line
NR > 1: Ignore the first row
$1: Refers to the first field
$1 = substr($1,1,10): Sets the first field to the first 10 characters of the field. In the example, this is the date portion
print $0: This will print the entire row
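Note that this prints to stdout rather than editing the file; awk has no in-place editing by default (GNU awk 4.1+ offers -i inplace), so to update each file you would redirect to a temporary file and move it back, for example:
awk -F, -v OFS=, 'NR > 1{$1 = substr($1,1,10)} {print $0}' file.txt > file.tmp && mv file.tmp file.txt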

how to sort with the third column

I know there have been some questions about this. I tried the methods they mentioned, but they did not work.
My data is in the Book1.csv file (the sample rows and the desired sorted output were posted as images).
Then I used this bash code: sort -r -n -k 3,3 Book1.csv > sorted.csv
But the outcome is not what I want.
In addition, since the first column is the Id and the third column is the score, I want to print the ids with the highest scores. In this case, it should print the two ids whose score is 50, like this: TRAAAAY128F42A73F0 TRAAAAV128F421A322. How can I achieve that?
Assuming that your csv is comma-separated and not using another delimiter, this is one way to do it. I think there is probably a way to do most if not all of this in awk, but unfortunately my awk knowledge is limited, so here is how I would do it quickly.
First, according to the comments, the -t flag of sort resolved your sorting issue.
#!/bin/bash
#set csv file to variable
mycsv="/path/csv.csv"
#get the third value of the first line after sorting on the third value descending.
max_val=$(sort -t, -k3,3nr "$mycsv" | head -n1 | cut -d, -f3)
#use awk to check whether the third column is equal to the max value, then print the first column.
#Note I am setting the delimiter to a comma here with the -F flag
awk -F"," -v awkmax="$max_val" '$3 == awkmax {print $1}' "$mycsv"
While printing all IDs with the highest score can be done in bash with basic unix commands, I think it's better at this point to switch to an actual scripting language (unless you're in some very limited environment).
Fortunately, perl is everywhere, and this task of printing the ids with the largest scores can be done as one (long) line in perl:
perl -lne 'if (/^([^,]*),[^,]*,\s*([^,]*)/) {push @{$a{$2}},$1; if($2>$m) {$m=$2;}} END {print "@{$a{$m}}";}' Book1.csv

is there a way to make awk keep track of changes that it makes

I was wondering if there is a way to make awk output the number of changes it has made, or somehow keep track of what it changes. For instance, if there were a .csv file in which I used awk to replace the word "the" with "it", would there be a way to make awk output how many times it had found and replaced "the" inside the .csv file? I've been searching around on Google and the rest of this site but haven't been able to find anything. I'm not sure if I'm just going about it the wrong way; if someone could help me out with this I would really appreciate it, thanks!
As far as I know, awk cannot report substitution counts automatically; we have to sum the number manually somehow. For example, with your "the" -> "it" replacement:
gawk '{x+=gsub("the","it");print} END{print "total changes:"x}' file.csv
Whenever you replace "the" with "it" in your awk script, increment a counter. Then print the counter at the end of your awk script by putting it in the END block.
A simple example where you only replace "the" within column 1:
awk 'BEGIN{FS=OFS=","}
{if ($1 == "the") {$1 = "it"; counter++; print}}
END{print counter}' input.csv
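If you want to keep every line in the output (not just the ones that changed) and still see the count, a variant like this works (sending the count to stderr so it does not mix with the data is just one choice):
awk 'BEGIN{FS=OFS=","; counter=0}
     $1 == "the" {$1 = "it"; counter++}
     {print}
     END{print "total changes: " counter > "/dev/stderr"}' input.csv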
But if all you are looking to do is count how many times the word/pattern "the" occurs inside the original .csv file, you can simply do:
grep -c "the" input.csv
or depending on the format of your .csv:
grep -c " the," input.csv
