This question already has answers here:
How to select lines between two marker patterns which may occur multiple times with awk/sed
(10 answers)
Closed 5 years ago.
I am trying to extract a set of lines between specific patterns in bash.
My input file:
=========
a
b
ven
c
d
=========
abc
def
venkata
sad
dada
=========
I am trying to extract only the lines between two ========= markers that contain the pattern venkata, i.e., the second section in the above example (abc ... dada).
I have tried sed, but it does not give what I need exactly.
I tried splitting this task into getting the lines above venkata and the lines below it separately.
Using sed -n -e '/=====/,/venkata/p' prints everything starting from the beginning of the input, which is not what I need.
Any thoughts ?
Edit: The number of lines between the ========= markers can vary, and venkata can be on any line, not necessarily the exact middle. Each line can contain multiple words, numbers, and symbols. This is just a sample.
Edit 2: The accepted answer of How to select lines between two marker patterns which may occur multiple times with awk/sed is close, but it gives output starting from the first match, which is not what I am looking for.
Based on the command in that answer, the flag is set when the first ==== is found.
I need the ==== just before venkata, which need not be the very first match.
That answer does not help me solve my problem.
Using grep you can accomplish this:
grep -A 2 -B 2 "venkata" infile
The options -A and -B print a number of trailing and leading lines respectively.
As pointed out by @Jan Gassen, if you want the same number of lines above and below the matching pattern, you can make it even simpler:
grep -C 2 "venkata" infile
Using gnu-awk you can do this:
awk -v RS='={2,}\n' -v ORS= '/venkata/' file
abc
def
venkata
sad
dada
If you don't have gnu-awk then use:
awk '/={2,}/{if (s && data ~ /venkata/) printf "%s", data; s=1; data=""; next} s{data = data $0 RS}' file
Related
I have a text file of character sequences that consist of two lines: a header, and the sequence itself on the following line. The structure of the file is as follows:
>header1
aaaaaaaaa
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
In another file I have a list of headers of sequences that I would like to remove, like this:
>header1
>header5
>header12
[...]
>header145
The idea is to remove these sequences from the first file, i.e., each of these headers plus the following line. I did it using sed like the following,
while read line; do sed -i "/$line/,+1d" first_file.txt; done < second_file.txt
It works, but it takes quite a long time since I am loading the whole file with sed once per header, and the file is quite big. Any idea how I could speed up this process?
The question you have is easy to answer but will not help you when you handle generic fasta files. Fasta files have a sequence header followed by one or multiple lines which can be concatenated to represent the sequence. The Fasta file-format roughly obeys the following rules:
The description line (defline) or header/identifier line, which begins with the greater-than character (>), gives a name and/or a unique identifier for the sequence, and may also contain additional information.
Following the description line is the actual sequence itself in a standard one-letter character string. Anything other than a valid character would be ignored (including spaces, tabulators, asterisks, etc...).
The sequence can span multiple lines.
A multiple sequence FASTA format would be obtained by concatenating several single sequence FASTA files in a common file, generally by leaving an empty line in between two subsequent sequences.
Most of the presented methods will fail on a multi-fasta file with multi-line sequences.
The following will always work:
awk '(NR==FNR) { toRemove[$1]; next }
/^>/ { p=1; for(h in toRemove) if ($0 ~ h) p=0 }
p' headers.txt file.fasta
This is very similar to the answers of EdMorton and anubhava, but the difference here is that the file headers.txt could contain only a part of the header.
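For example, with a hypothetical headers.txt that lists only the bare name header1, the record is still dropped because the full defline is matched against each stored pattern. A sketch, assuming single-line sequences for brevity:

```shell
# headers.txt lists only a partial name, without the > prefix.
printf 'header1\n' > headers.txt
printf '>header1\naaaa\n>header2\nbbbb\n' > file.fasta

# Each defline is tested against every stored pattern with $0 ~ h,
# so a partial header is enough to drop the whole record.
awk '(NR==FNR) { toRemove[$1]; next }
     /^>/ { p=1; for (h in toRemove) if ($0 ~ h) p=0 }
     p' headers.txt file.fasta
```

Only the >header2 record survives.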
$ awk 'NR==FNR{a[$0];next} $0 in a{c=2} !(c&&c--)' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
c is how many lines you want to skip starting at the one that just matched. See https://stackoverflow.com/a/17914105/1745001.
Alternatively:
$ awk 'NR==FNR{a[$0];next} /^>/{f=($0 in a ? 1 : 0)} !f' list file
>header2
bbbbbbbbbbb
>header3
aaabbbaaaa
[...]
>headerN
aaabbaabaa
f is whether or not the most recently read >... line was found in the target array a[]. f=($0 in a ? 1 : 0) could be abbreviated to just f=($0 in a) but I prefer the ternary expression for clarity.
The first script relies on you knowing how many lines each record is long while the 2nd one relies on every record starting with >. If you know both then which one you use is a style choice.
You may use this awk:
awk 'NR == FNR{seen[$0]; next} /^>/{p = !($0 in seen)} p' hdr.txt details.txt
Create a script with the delete commands from the second file:
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt > commands.sed
Then apply that file to the first
sed -f commands.sed firstFile.txt
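With GNU sed, the two steps can be combined into one pipeline by reading the generated script from standard input with -f - (a GNU extension), so no commands.sed file is left behind. A sketch with hypothetical inputs:

```shell
# Hypothetical inputs: one header to remove, a two-record file.
printf '>header1\n' > secondFile.txt
printf '>header1\naaa\n>header2\nbbb\n' > firstFile.txt

# Generate the delete commands and feed them straight into the
# second sed via -f - (GNU sed reads the script from stdin).
sed 's#\(.*\)#/\1/,+1d#' secondFile.txt | sed -f - firstFile.txt
```

Only the >header2 record is printed; the addr,+N address form is also a GNU extension.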
This awk might work for you:
awk 'FNR==NR{a[$0]=1;next}a[$0]{getline;next}1' input2 input1
One option is to create a long sed expression:
sedcmd=
while read line; do sedcmd+="/^$line\$/,+1d;"; done < second_file.txt
echo "sedcmd:$sedcmd"
sed "$sedcmd" first_file.txt
This will only read the file once. Note that I added the ^ and $ to the sed pattern (so >header1 doesn't match >header123...)
Using a file (as @daniu suggests) might be better if you have thousands of patterns, as you risk exceeding the maximum command-line length with this method.
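To see why the anchors matter, consider a hypothetical file where one header is a prefix of another; without ^ and $, the pattern for >header1 would also wipe out the >header12 record (GNU sed's addr,+N address is assumed here too):

```shell
# One header is a prefix of another; the anchors keep the
# pattern for >header1 from also deleting the >header12 record.
printf '>header1\naaa\n>header12\nbbb\n' > first_file.txt
sed '/^>header1$/,+1d' first_file.txt
```

This leaves >header12 and its sequence line intact.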
Try GNU sed:
sed -E ':s $!N;s/\n/\|/;ts ;s~.*~/&/\{N;d\}~' second_file.txt| sed -E -f - first_file.txt
Prepend the time command to both scripts to compare the speed,
i.e. time while read line; do ...; done and time sed .... In my test the sed version finishes in less than half the time of the OP's loop.
This can easily be done with bbtools. The seqs2remove.txt file should be one header per line exactly as they appear in the large.fasta file.
filterbyname.sh in=large.fasta out=kept.fasta names=seqs2remove.txt
This question already has answers here:
Count the number of times a word appears in a file
(3 answers)
Closed 4 years ago.
I am a Bash & Terminal NEWBIE. I have been given the task of counting the number of entries of a specific area code using a single-line Bash terminal command. Can you please point me in the right direction to achieving this goal? I've been using a bash scripting cheat sheet, but I'm not familiar enough with bash commands to create a script to iterate and count the number of times [213] appears in a file:
If you are looking for the string 123 anywhere in the file, then:
grep -c 123 file # counts 123 4123 41235 etc
If you are looking for the "word" 123, then:
grep -wc 123 file # counts 123 /123/ #123# etc., but not 1234 4123 ...
If you want multiple occurrences of the word on the same line to be counted separately, then use the -o option:
grep -ow 123 file | wc -l
See also:
Confused about word boundary on Unix & Linux Stack Exchange
grep -o '213' filename | wc -l
In the future, you should try searching for general forms of your command. You would have found a number of similar questions.
See man grep. grep has a count option.
So you want to run grep -c 213 file.
The following awk may help you here too (it will look for the string 213 anywhere in the line(s) of Input_file):
awk '/213/{count++} END{print count}' Input_file
In case you want to count only those lines that consist of exactly 213, use the following:
awk '/^213$/{count++} END{print count}' Input_file
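If a line can contain 213 more than once and every occurrence should be counted (like the grep -o approach above), awk's gsub returns the number of replacements it made, which can be accumulated. A sketch with a made-up Input_file:

```shell
# Sample input: 213 occurs twice on the first line.
printf '213x213\nno match\n213\n' > Input_file

# gsub() returns how many replacements it made on each line, so
# this counts every occurrence, not just the lines that match.
awk '{ count += gsub(/213/, "") } END { print count + 0 }' Input_file
```

This prints 3; the + 0 makes an empty input print 0 instead of a blank line.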
How can I remove lines appear only once in a file in bash?
For example, file foo.txt has:
1
2
3
3
4
5
after process the file, only
3
3
will remain.
Note the file is sorted already.
If your duplicated lines are consecutive, you can use uniq:
uniq -D file
from the man pages:
-D print all duplicate lines
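Applied to the sample foo.txt from the question (tested with GNU coreutils uniq; -D is not in POSIX):

```shell
# Recreate the sorted sample file.
printf '1\n2\n3\n3\n4\n5\n' > foo.txt

# -D prints every member of each run of duplicate lines and
# drops the lines that occur only once.
uniq -D foo.txt
```

This prints the two 3 lines and nothing else.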
Just loop the file twice:
$ awk 'FNR==NR {seen[$0]++; next} seen[$0]>1' file file
3
3
The first pass counts how many times each line occurs: seen[$0] keeps track of it in an array.
The second pass prints those that appear more than once.
Using single pass awk:
awk '{freq[$0]++} END{for(i in freq) for (j=1; freq[i]>1 && j<=freq[i]; j++) print i}' file
3
3
Using freq[$0]++ we count and store frequency of each line.
In the END block if frequency is greater than 1 then we print those lines as many times as the frequency.
Using awk, single pass:
$ awk 'a[$0]++ && a[$0]==2 {print} a[$0]>1' foo.txt
3
3
If the file is unordered, the output will happen in the order duplicates are found in the file due to the solution not buffering values.
Here's a POSIX-compliant awk alternative to the GNU-specific uniq -D:
awk '++seen[$0] == 2; seen[$0] >= 2' file
This turned out to be just a shorter reformulation of James Brown's helpful answer.
Unlike uniq, this command doesn't strictly require the duplicates to be grouped, but the output order will only be predictable if they are.
That is, if the duplicates aren't grouped, the output order is determined by the relative ordering of the 2nd instances in each set of duplicates, and in each set the 1st and the 2nd instances will be printed together.
For unsorted (ungrouped) data (and if preserving the input order is also important), consider:
fedorqui's helpful answer (elegant, but requires reading the file twice)
anubhava's helpful answer (single-pass solution, but a little more cumbersome).
cat TEXT | awk -v var=$i -v varB=$j '$1~var , $1~varB {print $1}' > PROBLEM HERE
I am passing two variables from an array to parse a very large text file by range. And it works, kind of.
if I use ">" the output to the file will ONLY be the last three lines as verified by cat and a text editor.
if I use ">>" the output to the file will include one complete read of TEXT and then it will divide the second read into the ranges I want.
if I let the output go through to the shell I get the same problem as above.
Question:
It appears awk is reading every line and printing it. Then it goes back and selects the ranges from the TEXT file. It does not do this if I use constants in the range pattern search.
I understand awk must read all lines to find the ranges I request.
why is it printing the entire document?
How can I get it to ONLY print the ranges selected?
This is the last hurdle in a big project and I am beating my head against the table.
Thanks!
Give this a try; you didn't assign varB in the right way:
yours: awk -v var="$i" -varB="$j" ...
mine : awk -v var="$i" -v varB="$j" ...
^^
Aside from the typo, you can't use variables inside //; instead you have to match with the regular ~ operator. Also quote your shell variables (not strictly needed here, but to set an example). For example:
seq 1 10 | awk -v b="3" -v e="5" '$0 ~ b, $0 ~ e'
should print 3..5 as expected
It sounds like this is what you want:
awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
e.g.
$ cat file
1
2
foo
3
4
bar
5
foo
6
bar
7
$ awk -v var="foo" -v varB="bar" '$1~var{f=1} f{print $1} $1~varB{f=0}' file
foo
3
4
bar
foo
6
bar
but without sample input and expected output it's just a guess and this would not address the SHELL behavior you are seeing wrt use of > vs >>.
Here's what happened. I used an array to input into my variables. I set the counter for what I thought was the total length of the array. When the final iteration of the array was reached, there was a null value returned to awk for the variable. This caused it to print EVERYTHING. Once I correctly had a counter with the correct number of array elements the printing oddity ended.
As far as the > vs >> goes, I don't know. It did stop, but I wasn't as careful in documenting it. I think what happened is that I used $1 in the print command to save time, and with each line it printed at the end it erased the whole file and left the last three identical matches. Something to ponder. Thanks Ed for the honest work. And no thank you to Robo responses.
I'm new to the Linux OS and bash commands and I think someone with more experience could help me. I want to compare 2 different text files with logs of an execution, but some lines (not all of them) begin with a time token like this:
12345 ps line 1 content
23456 ps line 2 content
line 3 content
345 ps line 4 content
Those tokens have different values in each log, but in this comparison I don't care about them; I just want to compare the line contents and ignore the tokens. I could use the sed command to generate new files without those tokens and then compare them, but I intend to do this repeatedly, so it would save me some time to use just one command or one sh file. I've tried to use sed and diff combined, but without success. Would anyone please be able to help me?
You can use the following sed one-liner to remove the numbers from the beginning of each line:
sed 's/^[0-9]* ps//g' file1
To diff two such files (less timestamps) you can use process substitution.
diff <(sed 's/^[0-9]* ps//g' file1) <(sed 's/^[0-9]* ps//g' file2)
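A quick check with two hypothetical logs that differ only in their time tokens (process substitution requires bash, not plain sh): after stripping the tokens on both sides, diff reports no differences and exits 0.

```shell
# Two hypothetical logs that differ only in their time tokens.
printf '12345 ps line 1 content\nline 3 content\n' > log1
printf '999 ps line 1 content\nline 3 content\n' > log2

# After stripping the tokens in both process substitutions,
# diff finds no differences and exits 0.
diff <(sed 's/^[0-9]* ps//g' log1) <(sed 's/^[0-9]* ps//g' log2) && echo 'logs match'
```

This prints logs match, since the filtered contents are identical.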
Untested since you didn't show 2 input files and the expected output but from your description I THINK this would do what you want:
awk '
{ sub(/^[[:digit:]]+[[:space:]]*/,"") }
NR==FNR { file1[FNR] = $0; next }
{ print ($0 == file1[FNR] ? "==" : "!="), $0 }
' file1 file2
If that doesn't do it, post some small sample input and expected output.