Make grep output more readable - bash

I'm working with grep to find patterns in files with grep -orI "id=\"[^\"]\+\"" . | sort | uniq -d
Which gives an output like the following:
./myFile.html:id="matchingR"
./myFile.html:id="other"
./myFile.html:id="cas"
./otherFile.html:id="what"
./otherFile.html:id="wheras"
./otherFile.html:id="other"
./otherFile.html:id="whatever"
What would be a convenient way to pipe this and have the following as output:
./myFile.html
id="matchingR"
id="other"
id="cas"
./otherFile.html
id="what"
id="wheras"
id="other"
id="whatever"
Basically group results by filename.

Not the prettiest but it works.
awk -F : -v OFS=: 'f!=$1 {f=$1; print f} f==$1 {$1=""; $0=$0; sub(/^:/, " "); print}'
If none of your lines can ever contain a colon then this simpler version also works.
awk -F : 'f!=$1 {f=$1; print f} f==$1 {$1=""; print}'
These both split fields on colons (-F :), print the first field (the filename) when it differs from a saved value (saving the new value), and, when the first field matches the saved value, remove the first field and print the rest. They differ in how they remove the field and print the output. The first attempts to preserve colons in the matched line. The second (and @fedorqui's version ... f==$1 {$0=$2; print}) assumes no other colons were on the line to begin with.
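For example, feeding the grep pipeline from the question through the first awk (with the sample files above) should give the grouped output:
grep -orI "id=\"[^\"]\+\"" . | sort | uniq -d |
awk -F : -v OFS=: 'f!=$1 {f=$1; print f} f==$1 {$1=""; $0=$0; sub(/^:/, " "); print}'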

Pass output to this script:
#!/bin/sh
sed 's/:/ /' | while read FILE TEXT; do
    if [ "$FILE" = "$GROUP" ]; then
        echo " $TEXT"
    else
        GROUP="$FILE"
        echo "$FILE"
        echo " $TEXT"
    fi
done
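Assuming you save it as, say, group.sh (the name is arbitrary), the usage would be:
grep -orI "id=\"[^\"]\+\"" . | sort | uniq -d | sh group.sh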

Here is a short awk:
awk -F: '{print ($1!=f?$1 RS:""),$2;f=$1}' file
./myFile.html
id="matchingR"
id="other"
id="cas"
./otherFile.html
id="what"
id="wheras"
id="other"
id="whatever"

Get last field using awk substr

I am trying to use awk to get the name of a file given the absolute path to the file.
For example, when given the input path /home/parent/child/filename I would like to get filename
I have tried:
awk -F "/" '{print $5}' input
which works perfectly.
However, I am hard coding $5 which would be incorrect if my input has the following structure:
/home/parent/child1/child2/filename
So a generic solution requires always taking the last field (which will be the filename).
Is there a simple way to do this with the awk substr function?
Use the fact that awk splits lines into fields based on a field separator that you can define. Hence, defining the field separator as / you can say:
awk -F "/" '{print $NF}' input
As NF refers to the number of fields of the current record, printing $NF means printing the last one.
So given a file like this:
/home/parent/child1/child2/child3/filename
/home/parent/child1/child2/filename
/home/parent/child1/filename
This would be the output:
$ awk -F"/" '{print $NF}' file
filename
filename
filename
In this case it is better to use basename instead of awk:
$ basename /home/parent/child1/child2/filename
filename
If you're open to a Perl solution, here is one similar to fedorqui's awk solution:
perl -F/ -lane 'print $F[-1]' input
-F/ specifies / as the field separator
$F[-1] is the last element in the @F autosplit array
Another option is to use bash parameter substitution.
$ foo="/home/parent/child/filename"
$ echo ${foo##*/}
filename
$ foo="/home/parent/child/child2/filename"
$ echo ${foo##*/}
filename
About 5 years late, I know; thanks for all the proposals. I used to do this the following way:
$ echo /home/parent/child1/child2/filename | rev | cut -d '/' -f1 | rev
filename
Glad to see there are better ways.
This should be a comment on the basename answer, but I don't have enough points.
If you do not use double quotes, basename will not work with paths that contain a space character:
$ basename /home/foo/bar foo/bar.png
bar
It works with quotes " ":
$ basename "/home/foo/bar foo/bar.png"
bar.png
file example
$ cat a
/home/parent/child 1/child 2/child 3/filename1
/home/parent/child 1/child2/filename2
/home/parent/child1/filename3
$ while read b ; do basename "$b" ; done < a
filename1
filename2
filename3
I know I'm about 3 years late on this, but you should consider parameter expansion; it's built-in and faster.
If your input is in a variable, say $var1, just do ${var1##*/}. Look below:
$ var1='/home/parent/child1/filename'
$ echo ${var1##*/}
filename
$ var1='/home/parent/child1/child2/filename'
$ echo ${var1##*/}
filename
$ var1='/home/parent/child1/child2/child3/filename'
$ echo ${var1##*/}
filename
you can skip all of that complex regex :
echo '/home/parent/child1/child2/filename' |
mawk '$!_=$-_=$NF' FS='[/]'
filename
2nd to last :
mawk '$!--NF=$NF' FS='/'
child2
3rd last field :
echo '/home/parent/child1/child2/filename' |
mawk '$!--NF=$--NF' FS='[/]'
child1
4th-last :
echo '/home/parent/child000/child00/child0/child1/child2/filename' |
mawk '$!--NF=$(--NF-!-FS)' FS='/'
child0
and the same command on the shorter path :
echo '/home/parent/child1/child2/filename' |
mawk '$!--NF=$(--NF-!-FS)' FS='/'
parent
major caveat : `gawk/nawk` has a slight discrepancy with `mawk` regarding how it tracks multiple, and potentially conflicting, decrements to `NF`, so other than the 1st solution regarding the last field, the rest, for now, are only applicable to `mawk-1/2`.
just realized it's much much cleaner this way in mawk/gawk/nawk :
echo '/home/parent/child1/child2/filename' |
awk '++NF' FS='.+/' OFS=    # updated such that root "/" still gets printed
filename
You can also use:
sed -n 's/.*\/\([^\/]\{1,\}\)$/\1/p'
or
sed -n 's/.*\/\([^\/]*\)$/\1/p'
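For example:
$ echo '/home/parent/child1/child2/filename' | sed -n 's/.*\/\([^\/]*\)$/\1/p'
filename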

Using awk to extract two separate strings

MacOS, Unix
So I have a file in the following stockholm format:
# STOCKHOLM 1.0
#=GS WP_002855993.1/5-168 DE [subseq from] MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
#=GS WP_002856586.1/5-166 DE [subseq from] MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
WP_002855993.1/5-168 ------LEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELmkfgKALLT.K...NDFLKTLLECFFKVLGKEGTLLMP-TF---TYSF------CKNE------VYDKVHSKG--KVGVLNEFFRTSGgGVRRTSDPIFSFAVKGAKADIFLKEN--SSCFGKDSVYEILTREGGKFMLLGLNYG-HALTHYAEE-----
#=GR WP_002855993.1/5-168 PP ......6788899999***********************9333344455.6...8999********************.33...3544......4555......799999975..68********98626999****************999865..689*********************9875.456799996.....
WP_002856586.1/5-166 ------LEFENKKYSTYDFIETFYKLGLQKGDTLCVHTEL....FNFGFpLlsrNEFLQTILDCFFEVIGKEGTLIMP-TF---TYSF------CKNE------VYDKINSKT--KMGALNEYFRKQT.GVKRTNDPIFSFAIKGAKEELFLKDT--TSCFGENCVYEVLTKENGKYMTFGGQG--HTLTHYAEE-----
#=GR WP_002856586.1/5-166 PP ......5566677788889999******************....**9953422246679*******************.33...3544......4455......799998876..589**********.******************99999886..689******************999765..5666***96.....
#=GC PP_cons ......6677788899999999*****************9....77675.5...68889*******************.33...3544......4455......799999976..689*******998.8999**************99999876..689******************9998765.466699996.....
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx....xxxxx.x...xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
WP_002855993.1/5-168 -----------------------------------------------------------------------------------------------------
#=GR WP_002855993.1/5-168 PP .....................................................................................................
WP_002856586.1/5-166 -----------------------------------------------------------------------------------------------------
#=GR WP_002856586.1/5-166 PP .....................................................................................................
#=GC PP_cons .....................................................................................................
#=GC RF xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
//
And I've created a script to extract the IDs I want, in this case, WP_002855993.1 and WP_002856586.1, and search through another file to extract DNA sequences with the appropriate IDs. The script is as follows:
#!/bin/bash
for fileName in *.sto; do
    protID=$(grep -o "WP_.\{0,11\}" $fileName | sort | uniq)
    echo $protID
    file=$(echo $fileName | cut -d '_' -f 1,2,3)
    file=$(echo $file'_protein.faa')
    echo $file
    if [ -n "$protID" ]; then
        gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
    fi
done
And here's an example of the type of file I'm looking through:
>WP_002855993.1 MULTISPECIES: AAC(3) family N-acetyltransferase [Campylobacter]
MKYFLEHNGKKYSDKDLIDAFYQLGIKRGDILCVHTELMKFGKALLTKNDFLKTLLECFFKVLGKEGTLLMPTFT
>WP_002856586.1 MULTISPECIES: aminoglycoside N(3)-acetyltransferase [Campylobacter]
MKYLLEFENKKYSTYDFIETFYKLGLQKGDTLCVHTELFNFGFPLLSRNEFLQTILDCFFEVIGKEGTLIMPTFT
YSFCKNEVYDKINSKTKMGALNEYFRKQTGVKRTNDPIFSFAIKGAKEELFLKDTTSCFGENCVYEVLTKENGKY
>WP_002856595.1 MULTISPECIES: acetyl-CoA carboxylase biotin carboxylase subunit [Campylobacter]
MNQIHKILIANRAEIAVRVIRACRDLHIKSVAVFTEPDRECLHVKIADEAYRIGTDAIRGYLDVARIVEIAKACG
This script works if I have one ID, but in some cases I get two IDs, and I get an error, because I think it's looking for an ID like "WP_002855993.1 WP_002856586.1". Is there a way to modify this script so it looks for two separate occurrences? I guess it's something with the gawk command, but I'm not sure what exactly. Thanks in advance!
an update to the original script:
#!/usr/bin/env bash
for file_sto in *.sto; do
    file_faa=$(echo $file_sto | cut -d '_' -f 1,2,3)
    file_faa=${file_faa}"_protein.faa"
    awk '(NR==FNR) { match($0,/WP_.{0,11}/);
                     if (RSTART > 0) a[substr($0,RSTART,RLENGTH)]++
                     next; }
         ($1 in a){ print RS $0 }' $file_sto RS=">" $file_faa >> sequence_protein.file
done
The awk part can probably even be reduced to :
awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
($1 in a) { print RS $0 }' FS='/' $file_sto FS=" " RS=">" $file_faa
This awk script does the following:
Set the field separator FS to / and read file $file_sto.
When reading $file_sto the record number NR is the same as the file record number FNR.
(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }: this line works only on $file_sto due to the condition in front. It checks if the line starts with WP_. If it does, it stores the first field $1 (separated by FS, which is a /) in an array a; it then skips to the next record in the file (next).
When we have finished reading file $file_sto, we set the field separator back to a single space FS=" " (see section Regular expression) and the record separator RS to > and start reading file $file_faa. The latter implies that $0 will contain all lines between > markers, and the first field $1 is the protID.
Reading $file_faa, the file record number FNR is restarted from 1 while NR is not reset. Hence the first awk line is skipped.
($1 in a){ print RS $0 } if the first field is in the array a, print the record with the record separator in front of it.
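To see the two-file idiom in isolation, here is a minimal sketch with made-up file names and IDs (ids.sto and seqs.faa stand in for $file_sto and $file_faa):
$ printf 'WP_A/5-168 seq\nWP_B/5-166 seq\n' > ids.sto
$ printf '>WP_A desc\nSEQ1\n>WP_C desc\nSEQ2\n' > seqs.faa
$ awk '(NR==FNR) { if ($0 ~ /^WP_/) a[$1]++; next }
       ($1 in a) { print RS $0 }' FS='/' ids.sto FS=' ' RS='>' seqs.faa
>WP_A desc
SEQ1
Only the >-separated records whose first field appears in the array built from the first file get printed.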
fixing the original script:
If you want to keep your original script, you could store the protID in a list and then loop the list :
#!/bin/bash
for fileName in *.sto; do
    protID_list=( $(grep -o "WP_.\{0,11\}" $fileName | sort | uniq) )
    echo ${protID_list[@]}
    file=$(echo $fileName | cut -d '_' -f 1,2,3)
    file=$(echo $file'_protein.faa')
    echo $file
    for protID in ${protID_list[@]}; do
        if [ -n "$protID" ]; then
            gawk "/^>/{N=0}/^.*$protID/{N=1} {if(N)print}" $file >> sequence_protein.file
        fi
    done
done
Considering your output file is test:
Using the following command gives you only the IDs:
cat test | awk '{print $1}' | grep '^WP_'
gives me output:
WP_002855993.1/5-168
WP_002856586.1/5-166
WP_002855993.1/5-168
WP_002856586.1/5-166
Do you want output like that?

reverse order of strings except first word in bash

I'm trying to reverse the order of words, excluding the first word in the final output. For example, I have a word db.in.com.example. I'm using this command to reverse the order:
$ basename db.in.com.example | awk -F'.' '{ for (i=NF; i>1; i--) \
printf("%s.",$i); print $1; }'
example.com.in.db
I want to exclude the trailing .db in the output, like this:
example.com.in
I'm having trouble with this. Can this be done using only awk? Can anybody help me with this?
$ echo db.in.com.example | awk -F. '{ # set . as delimiter
for(i=NF;i>1;i--) # loop from last to next-to-first
printf "%s%s", $i, (i==2?ORS:".") # output item and ORS or . after next-to-first
}'
example.com.in
If perl is okay
$ echo 'db.in.com.example' | perl -F'\.' -lane 'print join ".", reverse(@F[1..$#F])'
example.com.in
$ echo '1.2.3.db.in.com.example' | perl -F'\.' -lane 'print join ".", reverse(@F[2..$#F])'
example.com.in.db.3
-F'\.' set . as input field separator and save to @F array
reverse(@F[1..$#F]) will give reversed array of elements from index 1 to last index
similarly, @F[2..$#F] will exclude first and second element
join "." to add . as separator between elements of array
See http://perldoc.perl.org/perlrun.html#Command-Switches for details on command line options
You can use cut, tac, and parameter expansion:
reverse=$(basename db.in.com.example |
cut -d. -f2- --output-delimiter=$'\n' |
tac )
echo ${reverse//$'\n'/.}
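For the example in the question this gives:
$ reverse=$(basename db.in.com.example |
cut -d. -f2- --output-delimiter=$'\n' |
tac )
$ echo ${reverse//$'\n'/.}
example.com.in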
You've got some nice answers here. I am adding one which in my opinion is more readable, of course if ruby is an option for you:
$ echo "db.in.com.example" | ruby -ne 'p ($_.strip.split(".").drop(1).reverse.join("."))'
"example.com.in"
Try the following too; it will reverse the text, and it allows you to remove any string from the output (not only db): just change the variable's value and it should fly then.
echo "db.in.com.example" | awk -v var="db" -F"." '{for(i=NF;i>0;i--){val=$i!=var?(val?val FS $i:$i):val};print val;val=""}'
EDIT: Adding a non-one-liner form of the solution too now.
echo "db.in.com.example" | awk -v var="db" -F"." '{
for(i=NF;i>0;i--){
val=$i!=var?(val?val FS $i:$i):val
}
print val;val=""
}'
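Since the string to drop is in var, the same command can remove a different component; for example, dropping example instead of db:
echo "db.in.com.example" | awk -v var="example" -F"." '{
for(i=NF;i>0;i--){
val=$i!=var?(val?val FS $i:$i):val
}
print val;val=""
}'
com.in.db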

How can I retrieve numeric value from text file in shell script?

The content below has been written to a text file called test.txt. How can I retrieve the pending and completed count values in a shell script?
<p class="pending">Count: 0</p>
<p class="completed">Count: 0</p>
Here's what I tried:
#!/bin/bash
echo
echo 'Fetching job page and write to Jobs.txt file...'
curl -o Jobs.txt https://cms.test.com
completestatus=`grep "completed" /home/Jobs.txt | awk -F "<p|</p>" '{print $2 }' | awk '{print $4 }'`
echo $completestatus
if [ "$completestatus" == 0 ]; then
grep and awk commands can almost always be combined into 1 awk command. And 2 awk commands can almost always be combined to 1 awk command also.
This solves your immediate problem (using a little awk type casting trickery).
completedStatus=$(echo '<p class="pending">Count: 0</p>
<p class="completed">Count: 0</p>' \
| awk -F : '/completed/{var=$2+0.0;print var}' )
echo completedStatus=$completedStatus
The output is
completedStatus=0
Note that you can combine grep and awk with
awk -F : '/completed/' test.txt
This filters down to just the completed line. Output:
<p class="completed">Count: 0</p>
When I added your -F argument, the output didn't change, i.e.
awk -F'<p|</p>' '/completed/' test.txt
output
<p class="completed">Count: 0</p>
So I relied on using : as the -F (field separator). Now the output is
awk -F : '/completed/{print $2}'
output
0</p>
When performing a calculation, awk will read a value "looking" for a number at the front; if it finds a number, it will read the data until it finds a non-numeric character (or until there is nothing left). So ...
awk -F : '/completed/{var=$2+0.0;print var}' test.txt
output
0
Finally we arrive at the solution above, wrap the code in a modern command-substitution, i.e. $( ... cmds ....) and send the output to the completedStatus= assignment.
In case you're thinking that the +0.0 addition is what is being output, you can change your file to show completed count = 10, and the output will be 10.
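For instance, with the count changed to 10:
$ echo '<p class="completed">Count: 10</p>' | awk -F : '/completed/{print $2+0.0}'
10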
IHTH
another awk
completedStatus=$(awk -F'[ :<]' '/completed/{print $(NF-1)}' file)
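A quick check of the field splitting here: space, : and < are each single-character separators, so the count lands in $(NF-1), just before the trailing /p>:
$ echo '<p class="completed">Count: 0</p>' | awk -F'[ :<]' '{print NF, $(NF-1)}'
6 0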
If I got you right, you just want to extract pending or completed and the value. If that is the case, then using sed, please check out the script below; its output is shown after the code.
#!/bin/bash
file="$1"
echo "Simple"
sed 's/^.*=\"\([a-z]*\)\">Count: \([0-9]*\)<.*$/\1=\2/g' "$file"
echo "Pipe Separated"
sed 's/^.*=\"\([a-z]*\)\">Count: \([0-9]*\)<.*$/\1|\2/g' "$file"
echo "CSV Style or comma separated"
sed 's/^.*=\"\([a-z]*\)\">Count: \([0-9]*\)<.*$/\1,\2/g' "$file"
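Assuming the test.txt from the question (both counts 0) and that the script is saved as script.sh (the name is arbitrary), the output should look like this:
$ ./script.sh test.txt
Simple
pending=0
completed=0
Pipe Separated
pending|0
completed|0
CSV Style or comma separated
pending,0
completed,0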

How to get word from text file BASH

I want to get only one word from this txt file: http://pastebin.com/jFDu0Le5 . The word is from last row: WER: 45.67% Correct: 65.87% Acc: 54.33%
I want to get only the value 45.67 and save it to the file value.txt. I want to create a BASH script to get this value. Can you give me an example of how to do it? I am new to Bash and I need it for school. The whole .txt file is saved on my server as text file file.txt.
Try this:
grep WER file.txt | awk '{print $2}' | uniq | sed -e 's/%//' > value.txt
Note that this will overwrite value.txt each time you run the command.
You want grep "WER:" file.txt | cut -???
I have ??? because I do not know the structure of the file. Tab delimited? Fixed width?
Do man cut and you can get the arguments you need.
There are many ways and tools to do the task:
sed
tac file.txt | sed -n '/^WER: /{s///;s/%.*//;p;q}' > value.txt
awk
tac file.txt | awk -F'[ %]' '/^WER:/{print $2;exit}' > value.txt
bash
while read a b c
do
    if [ "$a" = "WER:" ]
    then
        b=${b%\%*}
        echo ${b#* }
        break
    fi
done < <(tac file.txt) > value.txt
If the format is as you said, then this also works
awk -F'[: %]' '/^WER/{print $3}' file.txt > value.txt
Explanation
-F specifies the field separator as any one of the characters in [: %]
/<PATTERN>/ {<ACTION>} refers to: if a line matches some PATTERN, then do some ACTION
in my case,
the PATTERN is: the line starts with (^) the string WER
the ACTION is: print field $3 (as split by the -F field separators)
> sends the output to value.txt
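To see why the value lands in $3 and not $2, note that the : and the following space are two separate single-character separators, which leaves an empty $2:
$ echo 'WER: 45.67% Correct: 65.87% Acc: 54.33%' | awk -F'[: %]' '{print $1 "|" $2 "|" $3}'
WER||45.67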
