shell : Number of string occurrences in a column using grep - bash

My data file is:
name,age,favourite_person
Adam,19,Helen Keller
Alex,18,Joe Biden
Kyle,18,George Washington
Mary,20,Marie Curie
Jade,16,Marie Kondo
I want to find the number of times "Marie" occurs in the column 'favourite_person' (column 3). My code right now is grep -R "Marie" file | wc -l, but this checks for the word "Marie" in the entire file. I only want it to check the favourite_person column. What should I add in this case?

You can use awk as follows:
awk 'BEGIN { FS = "," } {if ($3 ~ "Marie") { count++ }} END { print count }' file
BEGIN { FS = "," } sets , as the field separator,
{ if ... } reads as "if the third field matches Marie, then increment the variable count",
END { print count } prints count at the end.
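On the sample data above this can be checked directly (a quick sketch; a here-document stands in for file):

```shell
# Count "Marie" occurrences in column 3 of the sample CSV
count=$(awk 'BEGIN { FS = "," } $3 ~ "Marie" { count++ } END { print count }' <<'EOF'
name,age,favourite_person
Adam,19,Helen Keller
Alex,18,Joe Biden
Kyle,18,George Washington
Mary,20,Marie Curie
Jade,16,Marie Kondo
EOF
)
echo "$count"
# -> 2
```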

You can use cut as well as grep:
cut -d "," -f3 file | grep Marie | wc -l
-d means delimiter, and -f3 takes the third column only.
grep Marie checks whether Marie is in the third column, and wc -l counts the occurrences.
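As a side note, grep -c counts matching lines itself, so the wc -l stage can be dropped (equivalent here, since each line holds at most one match):

```shell
# grep -c replaces "| wc -l" by counting matching lines directly
printf '%s\n' 'Helen Keller' 'Joe Biden' 'Marie Curie' 'Marie Kondo' | grep -c Marie
# -> 2
```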

Related

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count: field (so 1 and 5 here), and I wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
This seems very repetitive and not very efficient. Any ideas how to simplify this?
You can use awk once, setting the field separator to |. Then loop over all the fields and split each on :.
If the field starts with count, print the second part of the split value.
This way the count: part can occur anywhere in the string, and may be printed multiple times.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[@]}"
do
echo "$i"
done
Output
1
5
If you want to combine both delimiters, you can use [|:] as a character class and print field number 8 for a positional match, as mentioned in the comments.
Note that this does not check that the field starts with count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
With GNU awk you can use a capture group for a more precise match, where the left and right boundaries are either the start/end of the string or a pipe character. The second group matches one or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated by the ; symbol.
The first command 's/.*count://' removes all characters up to and including 'count:'.
The second command 's/|.*//' removes all characters starting from '|' (inclusive).
The command order is important here.
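A quick check of the two sed commands on the sample lines (a sketch; printf stands in for $grepResult):

```shell
# First substitution strips everything through "count:",
# second strips everything from the next "|" onward.
printf '%s\n' \
  'Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320' \
  'Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320' |
sed 's/.*count://;s/|.*//'
# -> 1
#    5
```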

Cut and sort delimited dates from stdout via pipe

I am trying to split some strings from stdout to get the dates from it, but I have two cases
full.20201004T033103Z.vol93.difftar.gz
full.20201007T033103Z.vol94.difftar.gz
Which should produce 20201007T033103Z, the nearest date to now (newest).
Or:
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200929T033103Z.to.20200908T033103Z.vol10.difftar.gz
Should get the second date (after .to.) not the first one, and print only the newest date: 20200908T033103Z
What I tried:
cat dates_file | awk -F '.to.' 'NF > 1 {print $2}' | cut -d\. -f1 | sort -r -t- -k3.1,3.4 -k2,2 | head -1
This only works for the second case and does not cover the first; also I am not sure about the date-sorting logic.
Here is a sample data
full.20201004T033103Z.vol93.difftar.gz
full.20201004T033103Z.vol94.difftar.gz
full.20201004T033103Z.vol95.difftar.gz
full.20201004T033103Z.vol96.difftar.gz
full.20201004T033103Z.vol97.difftar.gz
full.20201004T033103Z.vol98.difftar.gz
full.20201004T033103Z.vol99.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.manifest
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol10.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol11.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol12.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol13.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol14.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol15.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol16.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol17.difftar.gz
To get the most recent date from your sample data you can use this awk:
awk '{
sub(/^(.*\.to|[^.]+)\./, "")
gsub(/\..+$|[TZ]/, "")
}
$0 > max {
max = $0
}
END {
print max
}' file
20201004033103
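An alternative sketch without state in awk: let sed's greedy .* keep only the last timestamp on each line, then sort lexically (valid because the stamps are fixed-width YYYYMMDDTHHMMSSZ) and take the maximum. This assumes GNU or BSD sed for -E:

```shell
# Subset of the sample file names, inlined for the demo
printf '%s\n' \
  'full.20201004T033103Z.vol93.difftar.gz' \
  'inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz' |
# Greedy .* anchors the capture to the LAST dot-delimited timestamp,
# so "inc....to.<date2>..." yields date2
sed -E 's/.*\.([0-9]{8}T[0-9]{6}Z)\..*/\1/' | sort | tail -n 1
# -> 20201004T033103Z
```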

Using awk to extract two separate strings

MacOS, Unix
So I have a file in the following stockholm format:
# STOCKHOLM 1.0
#=GS WP_002855993.1/5-168 DE [subseq from] MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
#=GS WP_002856586.1/5-166 DE [subseq from] MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
WP_002855993.1/5-168 ------LEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELmkfgKALLT.K...NDFLKTLLECFFKVLGKEGTLLMP-TF---TYSF------CKNE------VYDKVHSKG--KVGVLNEFFRTSGgGVRRTSDPIFSFAVKGAKADIFLKEN--SSCFGKDSVYEILTREGGKFMLLGLNYG-HALTHYAEE-----
#=GR WP_002855993.1/5-168 PP ......6788899999***********************9333344455.6...8999********************.33...3544......4555......799999975..68********98626999****************999865..689*********************9875.456799996.....
WP_002856586.1/5-166 ------LEFENKKYSTYDFIETFYKLGLQKGDTLCVHTEL....FNFGFpLlsrNEFLQTILDCFFEVIGKEGTLIMP-TF---TYSF------CKNE------VYDKINSKT--KMGALNEYFRKQT.GVKRTNDPIFSFAIKGAKEELFLKDT--TSCFGENCVYEVLTKENGKYMTFGGQG--HTLTHYAEE-----
#=GR WP_002856586.1/5-166 PP ......5566677788889999******************....**9953422246679*******************.33...3544......4455......799998876..589**********.******************99999886..689******************999765..5666***96.....
#=GC PP_cons ......6677788899999999*****************9....77675.5...68889*******************.33...3544......4455......799999976..689*******998.8999**************99999876..689******************9998765.466699996.....
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....xxxxx.x...xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WP_002855993.1/5-168 -----------------------------------------------------------------------------------------------------
#=GR WP_002855993.1/5-168 PP .....................................................................................................
WP_002856586.1/5-166 -----------------------------------------------------------------------------------------------------
#=GR WP_002856586.1/5-166 PP .....................................................................................................
#=GC PP_cons .....................................................................................................
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//
And I've created a script to extract the IDs I want, in this case, WP_002855993.1 and WP_002856586.1, and search through another file to extract DNA sequences with the appropriate IDs. The script is as follows:
#!/bin/bash
for fileName in *.sto;
do
protID=$(grep -o "WP_.\{0,11\}" $fileName | sort | uniq)
echo $protID
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
fi
done
And here's an example of the type of file I'm looking through:
>WP_002855993.1 MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
MKYFLEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELMKFGKALLTKNDFLKTLLECFFKVLGKEGTLLMPTFT
>WP_002856586.1 MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
MKYLLEFENKKYSTYDFIETFYKLGLQKGDTLCVHTELFNFGFPLLSRNEFLQTILDCFFEVIGKEGTLIMPTFT
YSFCKNEVYDKINSKTKMGALNEYFRKQTGVKRTNDPIFSFAIKGAKEELFLKDTTSCFGENCVYEVLTKENGKY
>WP_002856595.1 MULTISPECIES: acetyl-CoA carboxylase biotin carboxylase subunit [Campylobacter]
MNQIHKILIANRAEIAVRVIRACRDLHIKSVAVFTEPDRECLHVKIADEAYRIGTDAIRGYLDVARIVEIAKACG
This script works if I have one ID, but in some cases I get two IDs, and I get an error, because I think it's looking for an ID like "WP_002855993.1 WP_002856586.1". Is there a way to modify this script so it looks for two separate occurrences? I guess it's something with the gawk command, but I'm not sure what exactly. Thanks in advance!
an update to the original script:
#!/usr/bin/env bash
for file_sto in *.sto; do
file_faa=$(echo $file_sto | cut -d '_' -f 1,2,3)
file_faa=${file_faa}"_protein.faa"
awk '(NR==FNR) { match($0,/WP_.{0,11}/);
if (RSTART > 0) a[substr($0,RSTART,RLENGTH)]++
next; }
($1 in a){ print RS $0 }' $file_sto RS=">" $file_faa >> sequence_protein.file
done
The awk part can probably even be reduced to:
awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
($1 in a) { print RS $0 }' FS='/' $file_sto FS=" " RS=">" $file_faa
This awk script does the following:
Set the field separator FS to / and read file $file_sto.
When reading $file_sto the record number NR is the same as the file record number FNR.
(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }: this line works only on $file_sto due to the condition in front. It checks if the line starts with WP_. If it does, it stores the first field $1 (separated by FS, which is a /) in an array a; it then skips to the next record in the file (next).
When we have finished reading file $file_sto, we set the field separator back to a single space FS=" " (see section Regular expression) and the record separator RS to > and start reading file $file_faa. The latter implies that $0 will contain all lines between >s and that the first field $1 is the protID.
Reading $file_faa, the file record number FNR is restarted from 1 while NR is not reset. Hence the first awk line is skipped.
($1 in a){ print RS $0 } if the first field is in the array a, print the record with the record separator in front of it.
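The NR==FNR idiom itself can be seen in isolation with two tiny throw-away files (a sketch with made-up IDs):

```shell
# First file: wanted IDs. Second file: "id value" records.
printf 'WP_A\nWP_C\n' > ids.txt
printf 'WP_A seq1\nWP_B seq2\nWP_C seq3\n' > recs.txt
# NR==FNR is true only while the first file is read: collect IDs, then skip.
# On the second file FNR restarts but NR keeps counting, so only the
# membership test runs there.
awk 'NR==FNR { want[$1]; next } ($1 in want) { print $2 }' ids.txt recs.txt
rm -f ids.txt recs.txt
# -> seq1
#    seq3
```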
fixing the original script:
If you want to keep your original script, you could store the protID in a list and then loop the list :
#!/bin/bash
for fileName in *.sto; do
protID_list=( $(grep -o "WP_.\{0,11\}" $fileName | sort | uniq) )
echo "${protID_list[@]}"
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
for protID in "${protID_list[@]}"; do
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
fi
done
done
Considering your output file is text,
using the following command gives you only the IDs:
awk '{print $1}' text | grep 'WP_'
gives me output:
WP_002855993.1/5-168
WP_002856586.1/5-166
WP_002855993.1/5-168
WP_002856586.1/5-166
Do you want output like that?

Include header in grep of specific csv columns

I am trying to extract relevant information from a large csv file for further processing, so I would like to have the column names (header) saved in my output mini-csv files.
I have:
grep "Example" $fixed_file | cut -d ',' -f 4,6 > $outputpath"Example.csv"
which works fine in generating a csv file with two columns, but I would like the header information to also be included in the output file.
Use command grouping and add head -1 to the mix:
{ head -1 "$fixed_file" && grep "Example" "$fixed_file" | cut -d ',' -f 4,6 ;} \
>"$outputpath"Example.csv
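A self-contained run of the grouping (data.csv, the columns, and the values are made up for the demo):

```shell
# Hypothetical input standing in for "$fixed_file"
printf 'id,name,score\n1,Example A,10\n2,Other,20\n' > data.csv
# The { ...; } group lets head and grep share one output redirection
{ head -1 data.csv && grep "Example" data.csv | cut -d ',' -f 1,3 ;} > out.csv
cat out.csv
rm -f data.csv out.csv
# -> id,name,score
#    1,10
```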
My suggestion would be to replace your multiple-command pipeline with a single awk script.
awk '
BEGIN {
OFS=FS=","
}
NR==1;
/Example/ {
print $4,$6
}
' "$fixed_file" > "$outputpath/Example.csv"
If you want your header to contain only header fields 4 and 6, you could change this to:
awk '
BEGIN {
OFS=FS=","
}
NR==1 || /Example/ {
print $4,$6
}
' "$fixed_file" > "$outputpath/Example.csv"
Awk scripts consist of pairs of condition { statement }. A missing statement assumes you want to print the line (which is why NR==1; prints the header).
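That default action is easy to verify; a bare condition with no { } prints any record for which it holds:

```shell
# NR==1 with no action block defaults to { print $0 }
printf 'header\nrow1\nrow2\n' | awk 'NR==1'
# -> header
```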
And of course, you could compact this into a one-liner:
awk -F, 'NR==1||/Example/{print $4 FS $6}' "$fixed_file" > "$outputpath/Example.csv"

awk: division by zero input record number 1, file source line number 1

I'm trying to get the signed log10-transformed t-test P-value by using the sign of log2FoldChange multiplied by the inverse of the p-value:
cat test.xlx | sort -k7g \
| cut -d '_' -f2- \
| awk '!arr[$1]++' \
| awk '{OFS="\t"}
{ if ($6>0) printf "%s\t%4.3e\n", $1, 1/$7; else printf "%s\t%4.3e\n", $1, -1/$7 }' \
| sort -k2gr > result.txt
test.xlx =
ID baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000037692-Ahdc1 2277.002091 1.742481553 0.170388822 10.22650154 1.51e-24 2.13e-20
ENSMUSG00000035561-Aldh1b1 768.4504879 -2.325533089 0.248837002 -9.345608047 9.14e-21 6.45e-17
ENSMUSG00000038932-Tcfl5 556.1693605 -3.742422892 0.402475728 -9.298505809 1.42e-20 6.71e-17
ENSMUSG00000057182-Scn3a 1363.915962 1.621456045 0.175281852 9.250564289 2.23e-20 7.89e-17
ENSMUSG00000038552-Fndc4 378.821132 2.544026087 0.288831276 8.808000721 1.27e-18 3.6e-15
but getting error awk: division by zero
input record number 1, file
source line number 1
As @jas points out in a comment, you need to skip your header line, but your script could stand some more cleanup than that. Try this:
sort -k7g test.xlx |
awk '
BEGIN { OFS="\t" }
{ sub(/^[^_]+_/,"") }
($6~/[0-9]/) && (!seen[$1]++) { printf "%s\t%4.3e\n", $1, ($7?($6>0?1:-1)/$7:0) }
' |
sort -k2gr
ENSMUSG00000035561-Aldh1b1 1.550e+16
ENSMUSG00000037692-Ahdc1 4.695e+19
ENSMUSG00000038552-Fndc4 2.778e+14
ENSMUSG00000038932-Tcfl5 1.490e+16
ENSMUSG00000057182-Scn3a 1.267e+16
The above will print a result of zero instead of failing when $7 is zero.
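The guard in ($7?($6>0?1:-1)/$7:0) can be exercised on its own with made-up numbers (column 2 plays the role of $6, column 3 of $7):

```shell
# Sign comes from the 2nd field, magnitude is 1 over the 3rd field,
# and a zero divisor falls through to a literal 0 instead of erroring.
printf '%s\n' 'x 2 0.5' 'y -3 0.25' 'z 1 0' |
awk '{ printf "%s %g\n", $1, ($3 ? ($2>0?1:-1)/$3 : 0) }'
# -> x 2
#    y -4
#    z 0
```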
What's the point of the cut -d '_' -f2- in your original script though (implemented above with sub())? You don't have any _s in your input file.
