shell : Number of string occurrences in a column using grep - bash

My data file is:
name,age,favourite_person
Adam,19,Helen Keller
Alex,18,Joe Biden
Kyle,18,George Washington
Mary,20,Marie Curie
Jade,16,Marie Kondo
I want to find the number of times "Marie" occurs in the column 'favourite_person' (column 3). My code right now is grep -R "Marie" file | wc -l, but this checks for the word "Marie" in the entire file. I only want it to check the favourite_person column. What should I add in this case?

You can use awk as follows:
awk 'BEGIN { FS = "," } {if ($3 ~ "Marie") { count++ }} END { print count }' file
BEGIN { FS = "," } sets , as the field separator,
{ if ... } reads as "if the third field matches Marie, then increment the variable count",
END { print count } prints count at the end.
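On the sample data above this can be checked directly (a quick sketch; a here-document stands in for file):

```shell
# Count "Marie" occurrences in column 3 of the sample CSV
count=$(awk 'BEGIN { FS = "," } $3 ~ "Marie" { count++ } END { print count }' <<'EOF'
name,age,favourite_person
Adam,19,Helen Keller
Alex,18,Joe Biden
Kyle,18,George Washington
Mary,20,Marie Curie
Jade,16,Marie Kondo
EOF
)
echo "$count"
# -> 2
```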

You can use cut as well as grep:
cut -d "," -f3 file | grep Marie | wc -l
-d means delimiter, and -f3 takes the third column only.
grep Marie checks whether Marie is in the third column, and wc -l counts the occurrences.
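As a side note, grep -c counts matching lines itself, so the wc -l stage can be dropped (equivalent here, since each line holds at most one match):

```shell
# grep -c replaces "| wc -l" by counting matching lines directly
printf '%s\n' 'Helen Keller' 'Joe Biden' 'Marie Curie' 'Marie Kondo' | grep -c Marie
# -> 2
```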

Related

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count: field (so 1 and 5 here), and I wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
This seems very repetitive and not very efficient. Any ideas how to simplify this?
You can use awk once, setting the field separator to |. Then loop over all the fields and split each on :.
If the field starts with count, print the second part of the split value.
This way the count: part can occur anywhere in the string, and may be printed multiple times.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[@]}"
do
echo "$i"
done
Output
1
5
If you want to combine both delimiters, you can use [|:] as a character class and print field number 8 for a positional match, as mentioned in the comments.
Note that this does not check that the field starts with count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
With GNU awk you can use a capture group for a more precise match, where the left and right boundaries are either the start/end of the string or a pipe character. The second group matches one or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated by the ; symbol.
The first command 's/.*count://' removes all characters up to and including 'count:'.
The second command 's/|.*//' removes all characters starting from '|' (inclusive).
The command order is important here.
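A quick check of the two sed commands on the sample lines (a sketch; printf stands in for $grepResult):

```shell
# First substitution strips everything through "count:",
# second strips everything from the next "|" onward.
printf '%s\n' \
  'Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320' \
  'Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320' |
sed 's/.*count://;s/|.*//'
# -> 1
#    5
```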

Cut and sort delimited dates from stdout via pipe

I am trying to split some strings from stdout to get the dates from it, but I have two cases
full.20201004T033103Z.vol93.difftar.gz
full.20201007T033103Z.vol94.difftar.gz
Which should produce 20201007T033103Z, the nearest date to now (newest).
Or:
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200929T033103Z.to.20200908T033103Z.vol10.difftar.gz
Should get the second date (after .to.) not the first one, and print only the newest date: 20200908T033103Z
What I tried:
cat dates_file | awk -F '.to.' 'NF > 1 {print $2}' | cut -d\. -f1 | sort -r -t- -k3.1,3.4 -k2,2 | head -1
This only works for the second case and does not cover the first; also I am not sure about the date-sorting logic.
Here is a sample data
full.20201004T033103Z.vol93.difftar.gz
full.20201004T033103Z.vol94.difftar.gz
full.20201004T033103Z.vol95.difftar.gz
full.20201004T033103Z.vol96.difftar.gz
full.20201004T033103Z.vol97.difftar.gz
full.20201004T033103Z.vol98.difftar.gz
full.20201004T033103Z.vol99.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.manifest
inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol10.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol11.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol12.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol13.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol14.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol15.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol16.difftar.gz
inc.20200830T033103Z.to.20200906T033103Z.vol17.difftar.gz
To get the most recent date from your sample data you can use this awk:
awk '{
sub(/^(.*\.to|[^.]+)\./, "")
gsub(/\..+$|[TZ]/, "")
}
$0 > max {
max = $0
}
END {
print max
}' file
20201004033103
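An alternative sketch without state in awk: let sed's greedy .* keep only the last timestamp on each line, then sort lexically (valid because the stamps are fixed-width YYYYMMDDTHHMMSSZ) and take the maximum. This assumes GNU or BSD sed for -E:

```shell
# Subset of the sample file names, inlined for the demo
printf '%s\n' \
  'full.20201004T033103Z.vol93.difftar.gz' \
  'inc.20200830T033103Z.to.20200906T033103Z.vol1.difftar.gz' |
# Greedy .* anchors the capture to the LAST dot-delimited timestamp,
# so "inc....to.<date2>..." yields date2
sed -E 's/.*\.([0-9]{8}T[0-9]{6}Z)\..*/\1/' | sort | tail -n 1
# -> 20201004T033103Z
```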

Using awk to extract two separate strings

MacOS, Unix
So I have a file in the following stockholm format:
# STOCKHOLM 1.0
#=GS WP_002855993.1/5-168 DE [subseq from] MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
#=GS WP_002856586.1/5-166 DE [subseq from] MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
WP_002855993.1/5-168 ------LEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELmkfgKALLT.K...NDFLKTLLECFFKVLGKEGTLLMP-TF---TYSF------CKNE------VYDKVHSKG--KVGVLNEFFRTSGgGVRRTSDPIFSFAVKGAKADIFLKEN--SSCFGKDSVYEILTREGGKFMLLGLNYG-HALTHYAEE-----
#=GR WP_002855993.1/5-168 PP ......6788899999***********************9333344455.6...8999********************.33...3544......4555......799999975..68********98626999****************999865..689*********************9875.456799996.....
WP_002856586.1/5-166 ------LEFENKKYSTYDFIETFYKLGLQKGDTLCVHTEL....FNFGFpLlsrNEFLQTILDCFFEVIGKEGTLIMP-TF---TYSF------CKNE------VYDKINSKT--KMGALNEYFRKQT.GVKRTNDPIFSFAIKGAKEELFLKDT--TSCFGENCVYEVLTKENGKYMTFGGQG--HTLTHYAEE-----
#=GR WP_002856586.1/5-166 PP ......5566677788889999******************....**9953422246679*******************.33...3544......4455......799998876..589**********.******************99999886..689******************999765..5666***96.....
#=GC PP_cons ......6677788899999999*****************9....77675.5...68889*******************.33...3544......4455......799999976..689*******998.8999**************99999876..689******************9998765.466699996.....
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....xxxxx.x...xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WP_002855993.1/5-168 -----------------------------------------------------------------------------------------------------
#=GR WP_002855993.1/5-168 PP .....................................................................................................
WP_002856586.1/5-166 -----------------------------------------------------------------------------------------------------
#=GR WP_002856586.1/5-166 PP .....................................................................................................
#=GC PP_cons .....................................................................................................
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//
And I've created a script to extract the IDs I want, in this case, WP_002855993.1 and WP_002856586.1, and search through another file to extract DNA sequences with the appropriate IDs. The script is as follows:
#!/bin/bash
for fileName in *.sto;
do
protID=$(grep -o "WP_.\{0,11\}" $fileName | sort | uniq)
echo $protID
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
fi
done
And here's an example of the type of file I'm looking through:
>WP_002855993.1 MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
MKYFLEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELMKFGKALLTKNDFLKTLLECFFKVLGKEGTLLMPTFT
>WP_002856586.1 MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
MKYLLEFENKKYSTYDFIETFYKLGLQKGDTLCVHTELFNFGFPLLSRNEFLQTILDCFFEVIGKEGTLIMPTFT
YSFCKNEVYDKINSKTKMGALNEYFRKQTGVKRTNDPIFSFAIKGAKEELFLKDTTSCFGENCVYEVLTKENGKY
>WP_002856595.1 MULTISPECIES: acetyl-CoA carboxylase biotin carboxylase subunit [Campylobacter]
MNQIHKILIANRAEIAVRVIRACRDLHIKSVAVFTEPDRECLHVKIADEAYRIGTDAIRGYLDVARIVEIAKACG
This script works if I have one ID, but in some cases I get two IDs, and I get an error, because I think it's looking for an ID like "WP_002855993.1 WP_002856586.1". Is there a way to modify this script so it looks for two separate occurrences? I guess it's something with the gawk command, but I'm not sure what exactly. Thanks in advance!
an update to the original script:
#!/usr/bin/env bash
for file_sto in *.sto; do
file_faa=$(echo $file_sto | cut -d '_' -f 1,2,3)
file_faa=${file_faa}"_protein.faa"
awk '(NR==FNR) { match($0,/WP_.{0,11}/);
if (RSTART > 0) a[substr($0,RSTART,RLENGTH)]++
next; }
($1 in a){ print RS $0 }' $file_sto RS=">" $file_faa >> sequence_protein.file
done
The awk part can probably even be reduced to:
awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
($1 in a) { print RS $0 }' FS='/' $file_sto FS=" " RS=">" $file_faa
This awk script does the following:
Set the field separator FS to / and read file $file_sto.
When reading $file_sto the record number NR is the same as the file record number FNR.
(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }: this line works only on $file_sto due to the condition in front. It checks if the line starts with WP_. If it does, it stores the first field $1 (separated by FS, which is a /) in an array a; it then skips to the next record in the file (next).
When we have finished reading file $file_sto, we set the field separator back to a single space FS=" " (see section Regular expression) and the record separator RS to > and start reading file $file_faa. The latter implies that $0 will contain all lines between >s and that the first field $1 is the protID.
Reading $file_faa, the file record number FNR is restarted from 1 while NR is not reset. Hence the first awk line is skipped.
($1 in a){ print RS $0 } if the first field is in the array a, print the record with the record separator in front of it.
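The NR==FNR idiom itself can be seen in isolation with two tiny throw-away files (a sketch with made-up IDs):

```shell
# First file: wanted IDs. Second file: "id value" records.
printf 'WP_A\nWP_C\n' > ids.txt
printf 'WP_A seq1\nWP_B seq2\nWP_C seq3\n' > recs.txt
# NR==FNR is true only while the first file is read: collect IDs, then skip.
# On the second file FNR restarts but NR keeps counting, so only the
# membership test runs there.
awk 'NR==FNR { want[$1]; next } ($1 in want) { print $2 }' ids.txt recs.txt
rm -f ids.txt recs.txt
# -> seq1
#    seq3
```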
fixing the original script:
If you want to keep your original script, you could store the protID in a list and then loop the list :
#!/bin/bash
for fileName in *.sto; do
protID_list=( $(grep -o "WP_.\{0,11\}" $fileName | sort | uniq) )
echo "${protID_list[@]}"
file=$(echo $fileName | cut -d '_' -f 1,2,3)
file=$(echo $file'_protein.faa')
echo $file
for protID in "${protID_list[@]}"; do
if [ -n "$protID" ]; then
gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
fi
done
done
Considering your output file is text,
using the following command gives you only the IDs:
awk '{print $1}' text | grep 'WP_'
gives me output:
WP_002855993.1/5-168
WP_002856586.1/5-166
WP_002855993.1/5-168
WP_002856586.1/5-166
Do you want output like that?

Include header in grep of specific csv columns

I am trying to extract relevant information from a large csv file for further processing, so I would like to have the column names (header) saved in my output mini-csv files.
I have:
grep "Example" $fixed_file | cut -d ',' -f 4,6 > $outputpath"Example.csv"
which works fine in generating a csv file with two columns, but I would like the header information to also be included in the output file.
Use command grouping and add head -1 to the mix:
{ head -1 "$fixed_file" && grep "Example" "$fixed_file" | cut -d ',' -f 4,6 ;} \
>"$outputpath"Example.csv
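A self-contained run of the grouping (data.csv, the columns, and the values are made up for the demo):

```shell
# Hypothetical input standing in for "$fixed_file"
printf 'id,name,score\n1,Example A,10\n2,Other,20\n' > data.csv
# The { ...; } group lets head and grep share one output redirection
{ head -1 data.csv && grep "Example" data.csv | cut -d ',' -f 1,3 ;} > out.csv
cat out.csv
rm -f data.csv out.csv
# -> id,name,score
#    1,10
```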
My suggestion would be to replace your multiple-command pipeline with a single awk script.
awk '
BEGIN {
OFS=FS=","
}
NR==1;
/Example/ {
print $4,$6
}
' "$fixed_file" > "$outputpath/Example.csv"
If you want your header to contain only header fields 4 and 6, you could change this to:
awk '
BEGIN {
OFS=FS=","
}
NR==1 || /Example/ {
print $4,$6
}
' "$fixed_file" > "$outputpath/Example.csv"
Awk scripts consist of pairs of condition { statement }. A missing statement assumes you want to print the line (which is why NR==1; prints the header).
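That default action is easy to verify; a bare condition with no { } prints any record for which it holds:

```shell
# NR==1 with no action block defaults to { print $0 }
printf 'header\nrow1\nrow2\n' | awk 'NR==1'
# -> header
```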
And of course, you could compact this into a one-liner:
awk -F, 'NR==1||/Example/{print $4 FS $6}' "$fixed_file" > "$outputpath/Example.csv"

awk: division by zero input record number 1, file source line number 1

I'm trying to get the signed log10-transformed t-test P-value by using the sign of log2FoldChange multiplied by the inverse of the p-value:
cat test.xlx | sort -k7g \
| cut -d '_' -f2- \
| awk '!arr[$1]++' \
| awk '{OFS="\t"}
{ if ($6>0) printf "%s\t%4.3e\n", $1, 1/$7; else printf "%s\t%4.3e\n", $1, -1/$7 }' \
| sort -k2gr > result.txt
test.xlx =
ID baseMean log2FoldChange lfcSE stat pvalue padj
ENSMUSG00000037692-Ahdc1 2277.002091 1.742481553 0.170388822 10.22650154 1.51e-24 2.13e-20
ENSMUSG00000035561-Aldh1b1 768.4504879 -2.325533089 0.248837002 -9.345608047 9.14e-21 6.45e-17
ENSMUSG00000038932-Tcfl5 556.1693605 -3.742422892 0.402475728 -9.298505809 1.42e-20 6.71e-17
ENSMUSG00000057182-Scn3a 1363.915962 1.621456045 0.175281852 9.250564289 2.23e-20 7.89e-17
ENSMUSG00000038552-Fndc4 378.821132 2.544026087 0.288831276 8.808000721 1.27e-18 3.6e-15
but getting error awk: division by zero
input record number 1, file
source line number 1
As @jas points out in a comment, you need to skip your header line, but your script could stand some more cleanup than that. Try this:
sort -k7g test.xlx |
awk '
BEGIN { OFS="\t" }
{ sub(/^[^_]+_/,"") }
($6~/[0-9]/) && (!seen[$1]++) { printf "%s\t%4.3e\n", $1, ($7?($6>0?1:-1)/$7:0) }
' |
sort -k2gr
ENSMUSG00000035561-Aldh1b1 1.550e+16
ENSMUSG00000037692-Ahdc1 4.695e+19
ENSMUSG00000038552-Fndc4 2.778e+14
ENSMUSG00000038932-Tcfl5 1.490e+16
ENSMUSG00000057182-Scn3a 1.267e+16
The above will print a result of zero instead of failing when $7 is zero.
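The guard in ($7?($6>0?1:-1)/$7:0) can be exercised on its own with made-up numbers (column 2 plays the role of $6, column 3 of $7):

```shell
# Sign comes from the 2nd field, magnitude is 1 over the 3rd field,
# and a zero divisor falls through to a literal 0 instead of erroring.
printf '%s\n' 'x 2 0.5' 'y -3 0.25' 'z 1 0' |
awk '{ printf "%s %g\n", $1, ($3 ? ($2>0?1:-1)/$3 : 0) }'
# -> x 2
#    y -4
#    z 0
```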
What's the point of the cut -d '_' -f2- in your original script though (implemented above with sub())? You don't have any _s in your input file.
