Bash script - remove lines by looking ahead - bash

I have a csv file where some rows have an empty first field, and some rows have content in the first field. The rows with content in the first field are header rows.
I would like to remove every unnecessary header row. The best way I can see of doing this is by deleting every row for which:
First field is not empty
First field in the following row is not empty
I do not necessarily need to keep the data in the same file, so I can see this being possible using grep, awk, or sed, but none of my attempts have come close to working.
Example input:
header1,value1,etc
,value2,etc
header2,value3,etc
header3,value4,etc
,value5,etc
Desired output:
header1,value1,etc
,value2,etc
header3,value4,etc
,value5,etc
Since the header2 line is not followed by a line with an empty field 1, it is an unnecessary header row.

awk -F, '$1{h=$0;next}h{print h;h=""}1' file
-F,: Use comma as a field separator
$1{h=$0;next}: If the first field has data (other than 0), save the line and go on to the next line.
h{print h;h=""}1: If there is a saved header line, print it and forget it. (This can only execute if there is nothing in $1 because of the next above.)
1: print the current line.
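Test (a quick, illustrative run, assuming the sample input above is saved as file):
$ awk -F, '$1{h=$0;next}h{print h;h=""}1' file
header1,value1,etc
,value2,etc
header3,value4,etc
,value5,etc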

These kinds of tasks are often conceptually easier if you reverse the file and check whether the previous line is a header:
tac file |
awk -F, '$1 && have_header {next} {print; have_header = length($1)}' |
tac

Related

Cut string of text after a character from a column each line of a csv, keeping the other columns, and printing to a new file

I have a CSV file with a first column that reads:
/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/BER5_OSSD_F008071.csv.0.01.out.csv
Followed by additional columns listing counts pulled from other CSV files.
What I want is to remove "/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/" from each line without affecting any other part of the file.
I've tried using sed, grep, and cut, but those only print the matched part of the line to the terminal or to a new file, not the rest of the columns. Can I remove the "/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/" and keep everything else the same?
You can use awk to get this job done.
Please see the code below, which replaces /Users/swilki/Desktop/Africa_OSSD/OSSD_Output/ with "" (an empty string) and updates the same file in place via the -i inplace option (a GNU awk feature).
yourfile.csv is the input file.
awk -i inplace '{sub(/\/Users\/swilki\/Desktop\/Africa_OSSD\/OSSD_Output\//,"")}1' yourfile.csv
The above will remove the "/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/" and keep everything else the same.
Output of yourfile.csv:
BER5_OSSD_F008071.csv.0.01.out.csv
Option 2, if you want to write to a new file instead:
The code below writes the replaced contents to the new file your_newfile.csv.
awk '{sub(/\/Users\/swilki\/Desktop\/Africa_OSSD\/OSSD_Output\//,"")}1' yourfile.csv >your_newfile.csv
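If you prefer sed, an equivalent approach (a sketch, not part of the original answer) is to pick a substitution delimiter other than / so the path needs no escaping:
sed 's|/Users/swilki/Desktop/Africa_OSSD/OSSD_Output/||' yourfile.csv > your_newfile.csv
With GNU sed you could add -i to edit yourfile.csv in place instead of writing a new file.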

Remove duplicated entries in a table based on first column (which consists of two values sep by colon)

I need to sort and remove duplicated entries in my large table (space separated), based on the values in the first column (which denote chr:position).
Initial data looks like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10051 rs1326880612
1:10055 rs892501864
Output should look like:
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864
I've tried following this post and variations, but the adapted code did not work:
sort -t' ' -u -k1,1 -k2,2 input > output
Result:
1:10020 rs775809821
Can anyone advise?
Thanks!
It's quite easy to do with awk. Split each line on either a space or : as the field separator and group the lines by the word after the colon:
awk -F'[: ]' '!unique[$2]++' file
The -F'[: ]' defines the field separator used to split the individual words on the line, and the part !unique[$2]++ builds a hash-table map keyed on the value of $2. The counter is incremented every time a value is seen in $2, so the negation ! prevents the line from being printed again when that value next appears.
Defining a regex with the -F flag might not be supported by all awk versions. In a POSIX-compliant way, you could do
awk '{ split($0,a,"[: ]"); val=a[2]; } !unique[val]++ ' file
The above assumes you want to de-duplicate the file based on the word after the :, but to de-duplicate based on the whole first column, just do
awk '!unique[$1]++' file
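Test (a quick, illustrative run, assuming the sample input above is saved as file):
$ awk '!unique[$1]++' file
1:10020 rs775809821
1:10039 rs978760828
1:10043 rs1008829651
1:10051 rs1052373574
1:10055 rs892501864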
Since your input data is pretty simple, the command is going to be very easy.
sort file.txt | uniq -w7
This just sorts the file and runs uniq comparing only the first 7 characters. Here the first 7 characters are digits and a colon (the chr:position values); if any alphabetic characters step in, add -i to the command so that case is ignored.

How to transform a csv file having multiple delimiters using awk

Below is sample data. Please note this operation needs to be done on files with millions of records, hence I need an optimal method. Essentially we are looking to replace the 2nd column with the first two characters of the 4th column joined to the 2nd column minus its first 3 ('_'-delimited) fields.
I have been trying cut and reading the file line by line, which is very time consuming. I need something with awk, something like
awk -F, '{print $1","substr($4,1,2)"_"cut -f4-6 -d'_'($2)","$3","$4","$5","$6}'
Input Data:
234234234,123_33_3_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,123_11_2_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,123_33_3_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,123_33_3_11111_qewf_mkhsdf,01,09_68645,43234532,2
Output is required as:
234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,07_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,08_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,09_11111_qewf_mkhsdf,01,09_68645,43234532,2
You can use awk and printf to re-format the lines:
awk -F"[,_]" '{
printf "%s,%s_%s_%s_%s,%s,%s_%s,%s,%s\n", $1,$9,$5,$6,$7,$8,$9,$10,$11,$12
}' file
you get:
234234234,06_11111_asdf_asadfas,01,06_1234,4325325432,2
234234234,07_234111_aadsvfcvxf_anfews,01,07_4444,423425432,2
234234234,08_11111_mlkvffdg_mlkfgufks,01,08_2342,436876532,2
234234234,09_11111_qewf_mkhsdf,01,09_68645,43234532,2
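If the number of '_'-separated parts in the 2nd column can vary, a sketch of a variant (not from the original answer) that splits only the 2nd column, assuming the first 3 parts are always the ones to drop:
awk -F, 'BEGIN{OFS=","} {
  n = split($2, p, "_")             # split column 2 on "_"
  rest = p[4]                       # rebuild column 2 from parts 4..n
  for (i = 5; i <= n; i++) rest = rest "_" p[i]
  $2 = substr($4, 1, 2) "_" rest    # prefix with the first two characters of column 4
  print
}' file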

Remove all lines except the last which start with the same string

I'm using awk to process a file to filter lines to specific ones of interest. With the output which is generated, I'd like to be able to remove all lines except the last which start with the same string.
Here's an example of what is generated:
this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text
Lines 2 and 3 should be removed because they start with duplicate, as does line 5. Therefore line 5 should be kept, as it is the last line starting with duplicate.
The same follows for line 6, since it begins with example, as does line 7. Therefore line 7 should be kept, as it is the last line which starts with example.
Given the example above, I'd like to produce the following output:
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text
How could I achieve this?
I tried the following, however it doesn't work correctly:
awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' -
Why don't you read the file from the end to the beginning and print the first line containing duplicate? This way you don't have to worry about what was printed or not, hold the line, etc.
tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac
This sets a flag f the first time duplicate is seen. From the second time on, this flag causes the line to be skipped.
If you want to make this generic in a way that every first word is printed just the last time, use an array approach:
tac file | awk '!seen[$1]++' | tac
This keeps track of the first words that have appeared so far. They are stored in the array seen[], so that by saying !seen[$1]++ we make it True just when $1 occurs for the first time; from the second time on, it evaluates as False and the line is not printed.
Test
$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text
You could use an (associative) array to always keep the last occurrence:
awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
Note that for (i in last) does not iterate in any guaranteed order, so this may not preserve the original line order.
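If the original line order must be preserved, a two-pass variant (a sketch, not from the original answers) records the last line number seen for each first word, then prints a line only when it is that last occurrence:
awk 'NR==FNR {last[$1]=FNR; next} FNR==last[$1]' file file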

I need to be able to print the largest record value from txt file using bash

I am new to bash programming and I hit a roadblock.
I need to be able to calculate the largest record number within a txt file and store that into a variable within a function.
Here is the text file:
student_records.txt
12345,fName lName,Grade,email
64674,fName lName,Grade,email
86345,fName lName,Grade,email
I need to be able to get the largest record number ($1 or first field) in order for me to increment this unique record and add more records to the file. I seem to not be able to figure this one out.
First, I sort the file by the first field in descending order and then, perform this operation:
largest_record=$(awk-F,'NR==1{print $1}' student_records.txt)
echo $largest_record
This gives me the following error on the console:
awk-F,NR==1{print $1}: command not found
Any ideas? Also, any suggestions on how to accomplish this in the best way?
Thank you in advance.
largest=$(sort -r file|cut -d"," -f1|head -1)
You need spaces, and quotes
awk -F, 'NR==1{print $1}'
The command is awk; you need a space after it so bash parses your command line properly. Otherwise it thinks the whole thing is the name of the command, which is what the error message is telling you.
Learn how to use the man command so you can learn how to invoke other commands:
man awk
This will tell you what the -F option does:
The -F fs option defines the input field separator to be the regular expression fs.
So in your case the field separator is a comma -F,
What follows in quotes is what you want awk to interpret. It says to match a line with the pattern NR==1; NR is special, it is the record number, so you want it to match the first record. Following that is the action you want awk to take when that pattern matches, {print $1}, which says to print the first (comma-separated) field of the line.
A better way to accomplish this would be to use awk to find the largest record for you rather than sorting the file first. This gives you a solution that is linear in the number of records - you just want the max, so there is no need to do the extra work of sorting the whole file:
awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt
For this and other awk "one liners" look here.
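For example, captured into a shell variable as the question asks, with a sketch of the "increment and append" step the asker describes (the appended record format is illustrative only):
largest_record=$(awk -F, 'BEGIN {max = 0} {if ($1>max) max=$1} END {print max}' student_records.txt)
next_id=$((largest_record + 1))   # next unique record number
echo "${next_id},fName lName,Grade,email" >> student_records.txt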

Resources