How to remove rows from a CSV with no data using AWK - bash

I am working with a large csv in a linux shell that I narrowed down to 3 columns:
Species name, Latitude, and Longitude.
awk -F "\t" '{print $10,","$22,",",$23}' occurance.csv > three_col.csv
The file ends up looking like this:
species | Lat | Long |
Leucoraja erinacea | 41.0748 | 72.9461|
Brevoortia tyrannus | 39.0748 | 70.9461|
Paralichthys dentatus | | 73.2354|
Paralichthys dentatus | | |
Leucoraja erinacea | 41.0748 | |
Brevoortia tyrannus | | |
Brevoortia tyrannus | | |
Paralichthys dentatus | 39.0748 | 70.9461|
Brevoortia tyrannus | 39.0748 | 70.9461|
However, this is what I want it to look like. Notice that all species with no Lat or Long data have been removed:
species | Lat | Long |
Leucoraja erinacea | 41.0748 | 72.9461|
Brevoortia tyrannus | 39.0748 | 70.9461|
Paralichthys dentatus | 39.0748 | 70.9461|
Brevoortia tyrannus | 39.0748 | 70.9461|
I've been trying to remove rows that lack either Lat or Long data, using a line like this:
awk -F "\t" BEGIN '{print $1,$2,$3}' END '{$2!=" " && $3!= " " }' three_col.csv > del_blanks.csv
but it results in this error even with small changes that I make trying to solve the problem
awk: line 1: syntax error at or near end of line
How can I get rid of these rows with missing data, is this something I need a "for" loop for?

Since I don't know what your occurance.csv file looks like, this is a shot in the dark:
awk -F "\t" '$22 && $23 {print $10,","$22,",",$23}' occurance.csv > three_col.csv
The expression $22 && $23 says: both field 22 and field 23 must be non-blank. It is a condition that filters out the lines which don't qualify. It is close to shorthand for $22 != "" && $23 != "", except that a field holding the number 0 is also treated as false.
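The explicit comparison can be sketched as follows. This is a minimal example on made-up tab-separated rows with only three fields (not the real occurance.csv layout); unlike the shorthand, it keeps a row whose coordinate is exactly 0.

```shell
# Made-up sample rows: name, lat, long (tab-separated).
# Row "a" has latitude 0, rows "b" and "c" each miss one value.
rows=$(printf 'a\t0\t10\nb\t\t10\nc\t5\t\n' |
  awk -F '\t' -v OFS=, '$2 != "" && $3 != "" {print $1, $2, $3}')
printf '%s\n' "$rows"
```

With `$2 && $3` as the condition, row "a" would be filtered out as well, because its latitude evaluates to false.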

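awk -F "|" '
{
    if (substr($1,1,1) == "-"){
        e = ""
    } else {
        e = FS                         # close data rows with a trailing separator
    }
    gsub(/[ \t]+$/, "", $2)            # strip trailing blanks so an empty field has length 0
    gsub(/[ \t]+$/, "", $3)
    if(length($2) !=0 && length($3) !=0){
        printf "%s%s%-9s%s%-8s%s\n", $1, FS, $2, FS, $3, e
    }
}' file.txt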
species | Lat | Long |
Leucoraja erinacea | 41.0748 | 72.9461|
Brevoortia tyrannus | 39.0748 | 70.9461|
Paralichthys dentatus | 39.0748 | 70.9461|
Brevoortia tyrannus | 39.0748 | 70.9461|

Perhaps something like this?
mawk '($!NF=$10","$22","$23)!~",,$"' FS='\t' OFS=','
You already know that only fields 10, 22 and 23 need to be printed, so you can first overwrite $0 with just those three columns, already joined by commas ($!NF is $0 whenever NF is nonzero).
A quick regex check then does the filtering: two consecutive separators at the end of the line mean that both $22 and $23 are empty. This saves the print statement and the pattern-action blocks. Note that a row missing only one of the two values still gets through.


Match string in file1 with string in file2

my data examples are
MTQZ3CODT0SQKGE3QE6B | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
desired output | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
I need to match the 1st column of 1.txt against 2.txt and replace it
with the 2nd column of 2.txt.
So far I have tried:
awk 'BEGIN { while((getline < "file2.txt") > 0) a[$1]=$3 } { $1 = a[$1] } 1' file1.txt
It works, but after 12 hours of running it had finished only 1 GB, which is very slow.
INFO: file1.txt=7GB file2.txt=4GB, my memory is 16GB
I'm not sure what causes the slowness, but a faster alternative to my awk approach
would be helpful.
Note: I'm running out of memory. Is there another way to do it,
one that does not need an array at all?
Also, in my case the lines are in random order, so matching keys are not on the same line numbers!
$ join <(sort 2.txt) <(sort 1.txt) | cut -d' ' -f3-
| j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
If that's not all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for.
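A sort-then-join pipeline avoids the in-memory array altogether, since sort can spill to temporary files on disk. The sketch below assumes miniature, made-up inputs with `|` as a bare delimiter (no padding spaces) and hypothetical keys K1 to K3; the real files would need the same delimiter handling on both sides.

```shell
# Hypothetical miniature inputs; the real 1.txt and 2.txt are far larger.
dir=$(mktemp -d)
printf '%s\n' 'K2|beta' 'K1|alpha' > "$dir/2.txt"
printf '%s\n' 'K1|j t|22312' 'K3|no match|0' 'K2|x y|99999' > "$dir/1.txt"

# Sort both files on the key column, using the same collation for sort and join.
LC_ALL=C sort -t'|' -k1,1 "$dir/2.txt" > "$dir/2.sorted"
LC_ALL=C sort -t'|' -k1,1 "$dir/1.txt" > "$dir/1.sorted"

# -o 1.2,2.2,2.3 prints the replacement value from 2.txt,
# followed by the remaining fields of 1.txt.
joined=$(LC_ALL=C join -t'|' -o 1.2,2.2,2.3 "$dir/2.sorted" "$dir/1.sorted")
printf '%s\n' "$joined"
rm -rf "$dir"
```

Lines of 1.txt whose key has no partner in 2.txt (K3 here) are dropped by default; join's `-a` option would keep them.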
You may use this awk:
awk -F ' *\\| *' -v OFS=' | ' '
FNR == NR {
   map[$1] = $2      # 2.txt: remember the replacement for each key
   next
}
$1 in map {
   $1 = map[$1]
} 1' 2.txt 1.txt
| j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05

Join two csv files if value is between interval in file 2

I have two CSV files that I need to join. F1 has millions of lines, F2 (file 2) has thousands of lines. I need to join them where the position in F1 (F1.pos) lies between F2.start and F2.end. Is there a way to do this in bash? I currently have code that goes from Python pandas to sqlite3, and I am looking for something quicker.
Table F1 looks like:
| name | pos |
|------ |------ |
| a | 1020 |
| b | 1200 |
| c | 1800 |
Table F2 looks like:
| interval_name | start | end |
|--------------- |------- |------ |
| int1 | 990 | 1090 |
| int2 | 1100 | 1150 |
| int3 | 500 | 2000 |
Result should look like:
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int1 | 990 | 1090 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
DISCLAIMER: Use dedicated/local tools if available, this is hacking:
There is an apparent error in your desired output: name b should not match int1.
$ tail -n+1 *.csv
==> f1.csv <==
==> f2.csv <==
$ awk -F, -vOFS=, '
BEGIN {
    print "name,pos,interval_name,start,end"
}
FNR==1 {next}                              # skip the header line of each file
NR==FNR {Int[$1] = $2 "," $3; next}        # f2.csv: store "start,end" per interval name
{
    for(i in Int) {
        split(Int[i], I)
        if($2 >= I[1] && $2 <= I[2]) print $0, i, Int[i]
    }
}
' f2.csv f1.csv
This is not particularly efficient in any way; the only sorting used is to ensure that the Int array is parsed in the correct order, which changes if your sample data is not indicative of the actual schema. I would be very interested to know how my solution performs vs pandas.
Here's one in awk. It hashes the smaller file records to arrays and for each of the bigger file records it iterates thru the hashes so it is slow:
$ awk '
NR==FNR {                          # hash f2 records
    start[NR]=$4
    end[NR]=$6
    data[NR]=substr($0,2)
    next
}
FNR<=2 {                           # mind the front matter
    print $0 data[FNR]
    next
}
{                                  # check if in range and output
    for(i in start)
        if($4>start[i] && $4<end[i])
            print $0 data[i]
}' f2 f1
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
I doubt that a bash script would be faster than a python script. Just don't import the files into a database – write a custom join function instead!
The best way to join depends on your input data. If nearly all F1.pos are inside of nearly all intervals then a naive approach would be the fastest. The naive approach in bash would look like this:
#! /bin/bash
join --header -t, -j99 F1 F2 |
sed 's/^,//' |
awk -F, 'NR>1 && $2 >= $4 && $2 <= $5'
# NR>1 is only there to skip the column headers
However, this will be very slow if there are only a few intersections, for instance when the average F1.pos lies in only 5 intervals. In that case the following approach will be much faster. Implement it in a programming language of your choice; bash is not appropriate for this:
1. Sort F1 by pos in ascending order.
2. Sort F2 by start and then by end in ascending order.
3. For each sorted file, keep a pointer to a line, starting at the first line.
4. Repeat until F1's pointer reaches the end:
   a. For the current F1.pos advance F2's pointer until F1.pos ≥ F2.start.
   b. Lock F2's pointer, but continue to read lines until F1.pos ≤ F2.end. Print the read lines in the output format name,pos,interval_name,start,end.
   c. Advance F1's pointer by one line.
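As a rough illustration of why sorting helps, here is a simplified awk sketch. It is not the two-pointer merge described above: it keeps the (small) sorted F2 in memory and, for each F1 row, stops scanning as soon as an interval starts to the right of the position. The file names and the headerless, comma-separated layout are assumptions; the sample rows are taken from the question.

```shell
dir=$(mktemp -d)
# Headerless sample rows from the question; F2 is already sorted by start.
printf '%s\n' 'int3,500,2000' 'int1,990,1090' 'int2,1100,1150' > "$dir/F2-sorted"
printf '%s\n' 'a,1020' 'b,1200' 'c,1800' > "$dir/F1"

matches=$(awk -F, -v OFS=, '
NR == FNR {                       # first file: load the sorted intervals
    name[NR] = $1; lo[NR] = $2 + 0; hi[NR] = $3 + 0; n = NR
    next
}
{
    pos = $2 + 0
    for (i = 1; i <= n; i++) {
        if (lo[i] > pos) break    # every later interval starts even further right
        if (pos <= hi[i]) print $1, $2, name[i], lo[i], hi[i]
    }
}' "$dir/F2-sorted" "$dir/F1")
printf '%s\n' "$matches"
rm -rf "$dir"
```

The early `break` only pays off when most intervals start to the right of a typical position; in the worst case every interval is still checked for every row.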
Only the sorting itself could actually be faster in bash. Here is a script to sort both files.
#! /bin/bash
sort -t, -n -k2 F1-without-headers > F1-sorted
# -k2,2n -k3,3n: numeric sort on start, ties broken numerically on end
sort -t, -k2,2n -k3,3n F2-without-headers > F2-sorted
Consider using LC_ALL=C, -S N% and --parallel N to speed up the sorting process.
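A minimal sketch of those options in use, assuming GNU sort (`-S` and `--parallel` are GNU extensions) and the headerless F2 sample rows from the question. The buffer size and thread count are arbitrary illustration values, not tuned recommendations.

```shell
dir=$(mktemp -d)
printf '%s\n' 'int1,990,1090' 'int2,1100,1150' 'int3,500,2000' > "$dir/F2-without-headers"

# LC_ALL=C      byte-wise comparison instead of locale collation
# -S 50%        allow sort to use up to half of the RAM as its buffer
# --parallel=2  run up to two sorting threads
LC_ALL=C sort -t, -k2,2n -k3,3n -S 50% --parallel=2 \
    "$dir/F2-without-headers" > "$dir/F2-sorted"
sorted=$(cat "$dir/F2-sorted")
printf '%s\n' "$sorted"
rm -rf "$dir"
```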

Combine multiple grep variables in one column-wise file

I have some grep expressions which count the number of lines matching a string, each one for a group of files with different extension:
Nreads_ini=$(grep -c '^>' $WDIR/*_R1.trim.contigs.fasta)
Nreads_align=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.align)
Nreads_preclust=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.fasta)
Nreads_final=$(grep -c '^>' $WDIR/*_R1.trim.contigs.good.unique.filter.unique.precluster.pick.fasta)
Each of these greps outputs the sample name and the number of occurrences, as follows.
The first one:
The second one:
And so on. I need to create a .txt file with these numerical grep outputs as columns taking the sample name as a key column. The sample name is the part of the file name before "_R1" (V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA, V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG...):
Sample | Nreads_ini | Nreads_align |
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589 |
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934 |
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981 |
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896 |
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617 |
Any idea? Is there another easier solution for my problem?
In this answer the variable names are shortened to ini and align.
First, we extract the sample name and count from grep's output. Since we have to do this multiple times, we define the function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
Then we join the extracted data into one file. Lines with the same sample name will be combined.
join -t $'\t' <(e <<< "$ini") <(e <<< "$align")
Now we nearly have the expected output. We only have to add the header and draw lines for the table.
join ... | column -to " | " -N Sample,ini,align
This will print
Sample | ini | align
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617
Adding a horizontal line after the header is left as an exercise for the reader :)
This approach also works with more than two number columns. The join and -N parts have to be extended. join can only work with two files, requiring us to use an unwieldy workaround ...
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
join -t $'\t' <(e <<< "$var1") <(e <<< "$var2") |
join -t $'\t' - <(e <<< "$var3") | ... | join -t $'\t' - <(e <<< "$varN") |
column -to " | " -N Sample,Col1,Col2,...,ColN
... so it would be easier to add another helper function
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }
j2() { join -t $'\t' <(e <<< "$1") <(e <<< "$2"); }
j() { join -t $'\t' - <(e <<< "$1"); }
j2 "$var1" "$var2" | j "$var3" | ... | j "$varN" |
column -to " | " -N Sample,Col1,Col2,...,ColN
Alternatively, if all inputs contain the same samples in the same order, join can be replaced with one single paste command.
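The paste variant could look like the sketch below. The `/work/` path prefix is made up for illustration, the sample names and counts are the ones from the question, and the `\t` in the sed replacement assumes GNU sed. Because paste simply glues line N of each input together, the repeated sample-name column is removed with cut.

```shell
e() { sed -E 's,^.*/(.*)_R1.*:(.*)$,\1\t\2,'; }

# Hypothetical grep -c output; only the /work/ prefix is invented.
ini='/work/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.fasta:13175
/work/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.fasta:14801'
align='/work/V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT_R1.trim.contigs.good.unique.align:12589
/work/V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT_R1.trim.contigs.good.unique.align:13934'

# Columns after paste: name, ini, name, align. Keep 1, 2 and 4.
table=$(paste <(e <<< "$ini") <(e <<< "$align") | cut -f1,2,4)
printf '%s\n' "$table"
```

Unlike join, paste does not check that the sample names on a line actually agree, so this is only safe when every input lists the same samples in the same order.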
Assuming you have files containing the data you want parse:
$ cat file1
$ cat file2
$ cat file3 # This is a copy of file2 but could be different
If there is a key like V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT, you could use awk:
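$ awk -F'[/.:]' '
BEGINFILE { col[FILENAME] }                   # gawk-specific: one column per input file
{ row[$2]; a[FILENAME,$2] = $NF; next }       # remember the key and its count
END {
    for(i in row) {
        printf "%s ",substr(i,1,length(i)-3)
        for(j in col)
            printf "%s ",a[j SUBSEP i]
        printf "\n"
    }
}' file1 file2 file3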
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG 13424 12896 12896
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT 13175 12589 12589
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT 13475 12981 12981
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT 14801 13934 13934
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA 12053 11617 11617
This awk script fills three arrays, col, row and a, which store the column names (file names), the row keys and the values for all files, respectively.
The END statement prints the content of the array a by looping through all rows and columns.
If you need table decoration, use this:
{ printf "Sample Nreads_ini Nreads_align Nreads_align \n"; awk -F'[/.:]' 'BEGINFILE{col[FILENAME]}{row[$2];a[FILENAME,$2]=$NF;next}END{for(i in row) { printf "%s ",substr(i,1,length(i)-3); for(j in col) printf "%s ",a[j SUBSEP i]; printf "\n" }}' file1 file2 file3; } | column -t -s' ' -o ' | '
Could you please try following and let me know if this helps you.
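awk --re-interval -F"[/.:]" '
BEGIN { print "Sample | Nreads_ini | Nreads_align |" }
# first file: remember the count for each sample name
FNR==NR { if (match($2,/.*[A-Z]{10}/)) array[substr($2,RSTART,RLENGTH)]=$NF; next }
match($2,/.*[A-Z]{10}/) && (substr($2,RSTART,RLENGTH) in array){
    print substr($2,RSTART,RLENGTH),array[substr($2,RSTART,RLENGTH)],$NF
}
' OFS=" | " first_one second_one | column -t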
Output will be as follows.
Sample | Nreads_ini | Nreads_align |
V3_F357_N_V4_R805_1_A1_bach1_GTATCGTCGT | 13175 | 12589
V3_F357_N_V4_R805_1_A2_bach2_GAGTGATCGT | 14801 | 13934
V3_F357_N_V4_R805_1_A3_bach3_TGAGCGTGCT | 13475 | 12981
V3_F357_N_V4_R805_1_A4_bach4_TGTGTGCATG | 13424 | 12896
V3_F357_N_V4_R805_1_A5_bach5_TGTGCTCGCA | 12053 | 11617

awk query with numbers vs. strings

I am writing a function in R that will generate an awk script to pull in rows from a csv according to conditions that a user selected through a UI.
This is the example of the string generated by the function:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == "20116688") && ($20 == "Disregard") {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
It doesn’t return anything because $3 is a numeric variable. Neither does:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == 20116688) && ($20 == Disregard) {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
… because $20 is a string.
This returns a portion of the dataset:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == 20116688) && ($20 == "Disregard") {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
| 5058.0 | 20116688.0 | 4162 | Disregard |
| 5060.0 | 20116688.0 | 3622 | Disregard |
| 5060.0 | 20116688.0 | 3619 | Disregard |
| 5061.0 | 20116688.0 | 766 | Disregard |
| 5059.0 | 20116688.0 | 3603 | Disregard |
| 5055.0 | 20116688.0 | 1013 | Disregard |
| 5058.0 | 20116688.0 | 1012 | Disregard |
| 5055.0 | 20116688.0 | 4163 | Disregard |
| 5060.0 | 20116688.0 | 4225 | Disregard |
| 5061.0 | 20116688.0 | 3466 | Disregard |
Unfortunately, I don’t currently have a way of anticipating which of the variables that the user selects through the UI will be string or numerical (I know how to do that, but it will take time that I’d rather not spend if there was a workaround). Is there a way to cast each variable a string before the comparison or have some other way of dealing with this issue?
Edit This is what the raw data look like:
$ csvcut -c15:20 faults_main_only_dp_1_shopFlag.csv | head
-0.8100106,-1.0,3604,25.07.01 11367,2.0,Work Item
-0.81860137,840.0,766,25.07.01 11367,5.0,Disregard
-0.8100140690000001,-1.0,4279,25.07.01 11367,2.0,Work Item
-0.8100509640000001,-2.0,4279,25.07.01 11367,2.0,Work Item
-0.8102342,14.0,3604,25.07.01 11367,2.0,Work Item
-0.8181563620000001,831.0,3604,25.07.01 11367,5.0,Disregard
-0.81022054,11.0,3604,25.07.01 11367,2.0,Work Item
-0.8102272,11.0,4279,25.07.01 11367,2.0,Work Item
-0.8083836999999999,17.0,766,25.07.01 11367,5.0,Disregard
awk can do the int <--> string comparison if the token can be converted. Note that you're using a comma as the field separator, so spaces will be part of the fields. If it's not a decimal point issue and your numbers are integers, check these three cases:
$ echo "42,42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works
$ echo "42, 42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works
$ echo "42 , 42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
does not work
The string interpretation (first field) should not have the space!
You can try setting up your field separator to " *, *"
UPDATE: If your integers get .0 floating-point extensions that you can ignore, convert them to int before the comparison:
$ echo "42.0 , 42" | awk -v FS=" *, *" 'int($1)=="42" && $2=="42"{print "works";next} {print "does not work"}'
Here your generic value is quoted, but the field is converted to int before the string comparison. You still need to know which fields are numeric and which are strings, though.
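If the field types really cannot be known in advance, one possible workaround is to let awk decide per comparison. The sketch below uses a hypothetical helper `eq()` (not part of awk) that compares numerically when both values look like numbers and as strings otherwise; the sample line is shortened to three fields, so the field numbers differ from the real file.

```shell
line='5058.0,20116688.0,Disregard'
verdict=$(printf '%s\n' "$line" | awk -F, '
# eq() is a hypothetical helper: numeric comparison when both values
# look like numbers, plain string comparison otherwise.
function eq(a, b,    num) {
    num = "^[ \t]*[-+]?([0-9]+[.]?[0-9]*|[.][0-9]+)([eE][-+]?[0-9]+)?[ \t]*$"
    if (a ~ num && b ~ num) return (a + 0 == b + 0)
    return ((a "") == (b ""))
}
eq($2, "20116688") && eq($3, "Disregard") { print "match"; next }
{ print "no match" }')
printf '%s\n' "$verdict"
```

With this, the generated script can quote every user-supplied value: "20116688" still matches the field 20116688.0, and "Disregard" is compared as text.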

Can't iterate over array in Bash

I need to add a new column with a (ordinal) number after the last column in my table.
Both input and output files are .CSV tables.
The incoming table has more than 500 000 lines (rows) of data and 7 columns, e.g.
Incoming CSV table (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name |
| 1 | Foo |
| 1 | Foo |
| 1 | Foo |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 4242 | Baz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
| 702131 | Xyz |
Result CSV (this is just an example, so "|" and "-" are here for the sake of clarity):
| id | Name | |
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 4242 | Baz | 1 |
| 4242 | Baz | 2 |
| 4242 | Baz | 3 |
| 4242 | Baz | 4 |
| 702131 | Xyz | 1 |
| 702131 | Xyz | 2 |
| 702131 | Xyz | 3 |
| 702131 | Xyz | 4 |
First column is ID, so I've tried to group all lines with the same ID and iterate over them. Script (I don't know bash scripting, to be honest):
# Delete header and extract IDs and delete non-unique values. Also change \n to ♥, because awk doesn't properly work with it.
IDS_ARRAY=$(awk -v FS="|" '{for (i=1;i<=NF;i++) if ($i=="\"") inQ=!inQ; ORS=(inQ?"♥":"\n") }1' $FILE | awk -F'|' '{if (NR!=1) {print $1}}' | awk '!seen[$0]++')
for id in $IDS_ARRAY; do
# Group $FILE by $id from $IDS_ARRAY.
cat $FILE | grep $id >> temp_mail_group.csv
# Add a number after each row.
# NF+1 — add a column after last existing.
awk -F'|' '{$(NF+1)=++i;}1' OFS="|", $ROW_GROUP >> "numbered_mails_$(date +%Y-%m-%d).csv"
rm -f $PWD/temp_mail_group.csv
Right now this script works almost like I want to, except that it thinks that (for example) ID 2834 and 772834 are the same.
UPD: Although I marked one answer as approved it does not assign correct values to some groups of records with the same ID (right now I don't see a pattern).
You can do everything in a single script:
gawk 'BEGIN { FS="|"; OFS="|";}
/^-/ {print; next;}
$2 ~ /\s*id\s*/ {print $0,""; next;}
{print "", $2, $3, ++a[$2];}' "$FILE"
$1 is the empty field before the first | in the input. I use an empty output column "" to get the leading |.
The trick is ++a[$2] which takes the second field in each row (= the ID column) and looks for it in the associative array a. If there is no entry, the result is 0. By pre-incrementing, we start with 1 and add 1 every time the ID reappears.
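The same counter idea, sketched on a plain comma-separated file. The sample rows, the two-column layout and the new column name `n` are assumptions for illustration. Because the array is keyed on the whole field value, IDs 2834 and 772834 are counted separately, and rows of the same ID do not have to be adjacent.

```shell
numbered=$(printf '%s\n' 'id,Name' '2834,Foo' '772834,Baz' '2834,Foo' '772834,Baz' '772834,Baz' |
  awk -F, -v OFS=, '
NR == 1 { print $0, "n"; next }     # header row: append a made-up column name
{ print $0, ++count[$1] }           # running count per exact id value
')
printf '%s\n' "$numbered"
```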
Every time you write a loop in shell just to manipulate text you have the wrong approach. The guys who invented shell also invented awk for shell to call to manipulate text - don't disappoint them :-).
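$ awk '
BEGIN{ w = 8 }
{
    if (NR==1) {                               # header line: empty cell
        val = sprintf("%*s|",w,"")
    }
    else if (NR==2) {                          # separator line: extend the dashes
        val = sprintf("%*s",w+1,"")
        gsub(/ /,"-",val)
    }
    else {                                     # data line: running count per id
        val = sprintf(" %-*s|",w-1,++cnt[$2])
    }
    print $0 val
}
' file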
| id | Name | |
| 1 | Foo | 1 |
| 1 | Foo | 2 |
| 1 | Foo | 3 |
| 42 | Baz | 1 |
| 42 | Baz | 2 |
| 42 | Baz | 3 |
| 42 | Baz | 4 |
| 70 | Xyz | 1 |
| 70 | Xyz | 2 |
| 70 | Xyz | 3 |
| 70 | Xyz | 4 |
An awk way
Without considering the dotted line being extended.
awk 'NR>2{$0=$0 (++a[$2])"|"}1' file
| id | Name |
| 1 | Foo |1|
| 1 | Foo |2|
| 1 | Foo |3|
| 42 | Baz |1|
| 42 | Baz |2|
| 42 | Baz |3|
| 42 | Baz |4|
| 70 | Xyz |1|
| 70 | Xyz |2|
| 70 | Xyz |3|
| 70 | Xyz |4|
Here's a way to do it with pure Bash:
while IFS= read -r line ; do
    printf '%s' "$line"
    IFS=$'| \t\n' read t1 id name t2 <<<"$line"
    if [[ $line == -* ]] ; then
        printf '%s\n' '---------'
    elif [[ $id == 'id' ]] ; then
        printf ' Number |\n'
    else
        if [[ $id != "$prev_id" ]] ; then
            id_count=0          # new id: restart the counter
            prev_id=$id
        fi
        printf '%2d |\n' "$(( ++id_count ))"
    fi
done <"$inputfile"
