Match multiple patterns with grep and print only the matched patterns - bash

I have a file that looks like
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
..<long-text>..."field1":"some-value"...<long-text>...."field2":"some-value"...
I want to extract field1 and field2 from each line of the file in bash, with both fields appearing on the same output line for each input line. So the output should look like:
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
I wrote a grep expression like -
grep -E '"field1":"[a-z]*".*"field2":"[a-z]*"' -o
But because of the .* in between, it also prints all the text between the two fields. I also tried
grep -E '"field1":"[a-z]*"|"field2":"[a-z]*"' -o
But this prints every field1 and field2 match on its own line instead of pairing them up.
How do I get the expected output?

You can use grep with awk to format the result:
grep -oE '"(field1|field2)":"[^"]*"' file | awk 'NR%2{p=$0; next} {print p, $0}'
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"
"field1":"some-value" "field2":"some-value"

use sed:
echo abcdef | sed 's/\(.\).*\(.\)/\1\2/'
# yields: af
for your situation:
sed 's/.*\("field1":"[a-z]*"\).*\("field2":"[a-z]*"\).*/\1 \2/' yourfile
if some lines don't match at all, then do your grep first, e.g.,
grep -Eo '"field1":"[a-z]*".*"field2":"[a-z]*"' yourfile |
sed 's/.*\("field1":"[a-z]*"\).*\("field2":"[a-z]*"\).*/\1 \2/'

Related

GenBank file manipulation with bash

I have this GenBank file and I need your help manipulating it.
I am picking a random part of the file:
CDS complement(1750..1956)
/gene="MAMA_L4"
/note="similar to MIMI_L9"
/codon_start=1
/product="hypothetical protein"
/protein_id="AEQ60146.1"
/translation="MHFLDDDNDESNNCFDDKEKARDKIIIDMLNLIIGKKKTSYKCL
DYILSEQEYKFAILSIVENSIFLF"
misc_feature complement(2020..2235)
/note="MAMA_L5; similar to replication origin binding
protein (fragment)"
gene complement(2461..2718)
/gene="MAMA_L6"
CDS complement(2461..2718)
/gene="MAMA_L6"
/codon_start=1
/product="T5orf172 domain-containing protein"
/protein_id="AEQ60147.1"
/translation="MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDV
IITYFQYTDKAKKVESDLKEKLSKCRITNIKGNLSEWIVID"
My target is to extract the values of /product= and /translation= like the following:
T5orf172 domain-containing protein
MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDVIITYFQYTDKAKKVESDLKEKLSKCRITNIKGNLSEWIVID
*In the original post I marked in bold the wrapped part of the translation, which is the issue I had.
I am trying to write a bash script so I was thinking to apply something like:
grep -w /product= genebank.file | cut -d= -f2 | sed 's/"//g' > File1
grep -w /translation= genebank.file | cut -d= -f2 | sed 's/"//g' > File2
paste File1 File2
The problem is that for the translation entries grep gives me only the first line, so it prints only up to the point where the value wraps, like
T5orf172 domain-containing protein MSNNLAFYIITTNYHQSQNIYKIGIHTGNPYDLITRYITYFPDV
Can anybody help me get past this issue? Thank you in advance!
With GNU sed:
sed -En '/^\s*\/(product|translation)="/{
s///
:a
/"$/! { N; s/\n\s*//; ba; }
s/"$//p
}' file |
sed 'N; s/\n/\t/'
Note: This assumes the second occurrence of the delimiter " is immediately followed by a newline in the input file.
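An awk alternative that does the same joining in one pass (a sketch; it assumes the /product= line precedes its /translation= within each CDS and that every value ends with a closing " at the end of a line):
awk '
/^[[:space:]]*\/(product|translation)="/ {       # a tag line starts a new value
    intrans = /translation/                      # remember which tag this is
    sub(/^[[:space:]]*\/[a-z_]+="/, "")          # drop the tag and the opening quote
    val = ""
    collect = 1
}
collect {
    line = $0
    sub(/^[[:space:]]*/, "", line)               # strip indentation on continuation lines
    val = val line
    if (val ~ /"$/) {                            # closing quote: the value is complete
        sub(/"$/, "", val)
        if (intrans) print prod "\t" val         # a translation closes the pair
        else         prod = val                  # remember the product for this CDS
        collect = 0
    }
}' genebank.file
This prints the product and the joined translation tab-separated, one pair per CDS.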
I haven't fully tested this but if you add -A1 to your grep command you'll get one line after the match.
grep -w /product= genebank.file | cut -d= -f2 | sed 's/"//g' > File1
grep -A1 -w /translation= genebank.file |cut -d= -f2| sed 's/^ *//g' > File2
paste File1 File2
You would need to delete that extra newline but that should get you close.
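One way to glue the wrapped line back on and drop grep's group separators in the same pipeline (a sketch; it assumes GNU grep/sed and that a translation wraps onto exactly one extra line, as in your excerpt):
grep -w /product= genebank.file | cut -d= -f2 | sed 's/"//g' > File1
grep -A1 -w /translation= genebank.file | grep -v '^--$' |
    sed 'N; s/\n[[:space:]]*//' | cut -d= -f2 | sed 's/"//g' > File2
paste File1 File2
The grep -v '^--$' removes the group separators that -A1 inserts, and the sed N joins each match with the line that follows it.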

bash check for words in first file not contained in second file

I have a txt file containing multiple lines of text, for example:
This is a
file containing several
lines of text.
Now I have another file containing just words, like so:
this
contains
containing
text
Now I want to output the words which are in file 1, but not in file 2. I have tried the following:
cat file_1.txt | xargs -n1 | tr -d '[:punct:]' | sort | uniq | comm -i23 - file_2.txt
xargs -n1 to put each space-separated word on its own line.
tr -d '[:punct:]' to remove punctuation.
sort and uniq to make a sorted stream for comm, which I call with the -i flag to make the comparison case insensitive.
But somehow this doesn't work. I've looked around online and found similar questions, however, I wasn't able to figure out what I was doing wrong. Most answers to those questions were working with 2 files which were already sorted, stripped of newlines, spaces, and punctuation while my file_1 may contain any of those at the start.
Desired output:
is
a
file
several
lines
of
paste + grep approach:
grep -Eiv "($(paste -sd'|' <file2.txt))" <(grep -wo '\w*' file1.txt)
The output:
is
a
file
several
lines
of
I would try something more direct:
for A in $(cat file1 | tr -d '[:punct:]'); do grep -iwq "$A" file2 || echo "$A"; done
flags used for grep: i for a case-insensitive match, q for quiet (no output needed), w for a word match
One in awk:
$ awk -F"[^A-Za-z]+" ' # anything but a letter is a field delimiter
NR==FNR { # process the word list
a[tolower($0)]
next
}
{
for(i=1;i<=NF;i++) # loop all fields
if(!(tolower($i) in a)) # if word was not in the word list
print $i # print it. duplicates are printed also.
}' another_file txt_file
Output:
is
a
file
several
lines
of
grep:
$ grep -vwi -f another_file <(cat txt_file | tr -s -c '[a-zA-Z]' '\n')
is
a
file
several
lines
of
This pipeline will take the original file, replace spaces with newlines, convert to lowercase, then use grep to filter (-v) full words (-w) case insensitive (-i) using the lines in the given file (-f file2):
cat file1 | tr ' ' '\n' | tr '[:upper:]' '[:lower:]' | grep -vwif file2
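If you would rather stay close to your original comm idea: comm has no -i option in GNU coreutils, so lower-case both streams before comparing (a sketch; note that the result comes out sorted and de-duplicated rather than in the file's word order):
comm -23 <(tr -d '[:punct:]' < file_1.txt | tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | sort -u) \
         <(tr '[:upper:]' '[:lower:]' < file_2.txt | sort -u)
comm -23 keeps only the lines that appear in the first input and not in the second.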

How to get word from text file BASH

I want to get only one word from this txt file: http://pastebin.com/jFDu0Le5 . The word is in the last row: WER: 45.67% Correct: 65.87% Acc: 54.33%
I want to get only the value 45.67 and save it to the file value.txt. I want to write a bash script to get this value. Can you give me an example of how to do it? I am new to Bash and I need it for school. The whole file is saved on my server as the text file file.txt.
Try this:
grep WER file.txt | awk '{print $2}' | uniq | sed -e 's/%//' > value.txt
Note that this will overwrite value.txt each time you run the command.
You want grep "WER:" file.txt | cut -???
I have ??? because I do not know the structure of the file. Tab delimited? Fixed Width?
Do man cut and you can work out the arguments you need.
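Since the fields in the line you quoted are separated by single spaces, the cut version could look like this (a sketch; it reads from file.txt and assumes WER: is always the first field on its line):
grep "WER:" file.txt | cut -d' ' -f2 | tr -d '%' > value.txt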
There are many ways and tools to do the task:
sed
tac file.txt | sed -n '/^WER: /{s///;s/%.*//;p;q}' > value.txt
awk
tac file.txt | awk -F'[ %]' '/^WER:/{print $2;exit}' > value.txt
bash
while read a b c
do
if [ $a = "WER:" ]
then
b=${b%\%*}
echo ${b#* }
break
fi
done < <(tac file.txt) > value.txt
If the format is as you said, then this also works
awk -F'[: %]' '/^WER/{print $3}' file.txt > value.txt
Explanation
-F specifies the field separator as one of [: %]
/<PATTERN>/ {<ACTION>} refers to: if a line matches some PATTERN, then do some ACTION
in my case,
the PATTERN is: the line starts with (^) the string WER
the ACTION is: print field $3 (as split by the -F field separators)
> sends the output to value.txt

bash scripting removing optional <Integer><colon> prefix

I have a list with all of the content is like:
1:NetworkManager-0.9.9.0-28.git20131003.fc20.x86_64
avahi-0.6.31-21.fc20.x86_64
2:irqbalance-1.0.7-1.fc20.x86_64
abrt-addon-kerneloops-2.1.12-2.fc20.x86_64
mdadm-3.3-4.fc20.x86_64
I need to remove the N: prefix but leave the rest of each string as is.
Have tried:
cat service-rpmu.list | sed -ne "s/#[#:]\+://p" > end.list
cat service-rpmu.list | egrep -o '#[#:]+' > end.list
both result in an empty end.list
(The N: just denotes an epoch version.)
With sed:
sed 's/^[0-9]\+://' your.file
Output:
NetworkManager-0.9.9.0-28.git20131003.fc20.x86_64
avahi-0.6.31-21.fc20.x86_64
irqbalance-1.0.7-1.fc20.x86_64
abrt-addon-kerneloops-2.1.12-2.fc20.x86_64
mdadm-3.3-4.fc20.x86_64
Btw, your list looks like the output of a grep command with the option -n. If this is true, then omit the -n option there. Also it is likely that your whole task can be done with a single sed command.
awk '{ sub(/^[0-9]+:/, ""); print }' sample
Here is another way with awk:
awk -F: '{print $NF}' service-rpmu.list
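If you prefer to stay in plain bash, parameter expansion with an extended glob can strip the prefix as well (a sketch; +([0-9]) needs extglob, and it only removes a leading run of digits followed by a colon, so lines without the prefix pass through unchanged):
shopt -s extglob                          # enable the +([0-9]) pattern
while IFS= read -r line; do
    printf '%s\n' "${line#+([0-9]):}"     # drop a leading '<digits>:' if present
done < service-rpmu.list > end.list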

how to pick specific words from a file and build a new string from them without spaces

I want to read a string from a file.
This string is, for example:
&0001 = 1234 5678 9abc
now I want to take this string and build another string from it which is
123456789abc
I succeeded in reading the string from the end of the file with
read_addr="`awk "END {print}" file.txt`"
echo ${read_addr}
how should I continue to create the string 123456789abc out of the above?
How about this instead:
tail -n 1 file.txt | sed 's/ //g' | sed 's/.*=//'
The tail -n 1 gives you the last line of the file, the first sed 's/ //g' removes the spaces, and the second sed strips everything up to and including the =.
you can just change your awk line a little bit:
awk -F= 'END{gsub(/ /,"",$2);print $2}' file.txt
This awk line does the whole task in a single process: -F= splits the line on =, and in the END block gsub removes the spaces from the second field before printing it.
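Starting from the read_addr you already have, plain parameter expansion can also finish the job (a sketch; it assumes the last line always contains '= ' before the hex groups):
read_addr="$(awk 'END {print}' file.txt)"   # last line, e.g. '&0001 = 1234 5678 9abc'
value=${read_addr#*= }                      # drop everything up to and including '= '
value=${value// /}                          # remove the remaining spaces
echo "$value"                               # -> 123456789abc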
