Script using sed and grep gives unintended output - bash

I have a "source.fasta" file that contains information in the following format:
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
>TRINITY_DN83_c0_g1_i1 len=371 path=[1:0-173 152:174-370] [-1, 1, 152, -2]
>TRINITY_DN83_c0_g1_i2 len=218 path=[1:0-173 741:174-217] [-1, 1, 741, -2]
>TRINITY_DN99_c0_g1_i1 len=326 path=[1:0-242 221:243-243 652:244-267 246:268-325] [-1, 1, 221, 652, 246, -2]
>TRINITY_DN90_c0_g1_i1 len=1240 path=[1970:0-527 753:528-1239] [-1, 1970, 753, -2]
>TRINITY_DN84_c0_g1_i1 len=301 path=[1:0-220 358:221-300] [-1, 1, 358, -2]
>TRINITY_DN84_c0_g1_i2 len=301 path=[1:0-220 199:221-300] [-1, 1, 199, -2]
>TRINITY_DN72_c0_g1_i1 len=434 path=[412:0-247 847:248-271 661:272-433] [-1, 412, 847, 661, -2]
>TRINITY_DN75_c0_g1_i1 len=478 path=[456:0-477] [-1, 456, -2]
There are close to 400,000 sequences in this file.
I have another file ids.txt in the following format:
I have 100 sequence ids in this file. When I match these ids to the source file I want an output that gives me the match for each of these ids with the entire sequence.
For example, for an id:
I want my output to be:
I want all hundred sequences in this format.
I used this code:
while read p; do
echo ''$p >> out.fasta
grep -A 400000 -w $p source.fasta | sed -n -e '1,/>/ {/>/ !{'p''}} >> out.fasta
done < ids.txt
But my output is different: only the last id has a sequence, and the rest don't have any sequence associated with them:
I am only getting the desired output for the 100th id from my ids.txt. Could someone help me find where my script is wrong? I would like to get all 100 sequences when I run the script.
Thank you
I have added google drive links to the files i am working with:

Repeatedly looping over a large file is inefficient; you really want to avoid running grep (or sed or awk) more than once if you can avoid it. Generally speaking, sed and Awk will often easily allow you to specify actions for individual lines in a file, and then you run the script on the file just once.
For this particular problem, the standard Awk idiom with NR==FNR comes in handy. This is a mechanism which lets you read a number of keys into memory (concretely, when NR==FNR it means that you are processing the first input file, because the overall input line number is equal to the line number within this file) and then check if they are present in subsequent input files.
Recall that Awk reads one line at a time and performs all the actions whose conditions match. The conditions are a simple boolean, and the actions are a set of Awk commands within a pair of braces.
awk 'NR == FNR { s[$0]; next }
# If we fall through to here, we have finished processing the first file.
# If we see a wedge and p is 1, reset it -- this is a new sequence
/^>/ && p { p = 0 }
# If the prefix of this line is in s, we have found a sequence we want.
($1$2 in s) || ($1 in s) || ((substr($1, 1, 1) " " substr($1, 2)) in s) {
if ($1 ~ /^>./) { print $1 } else { print $1 $2 }; p = 1; next }
# If p is true, we want to print this line
p' ids.txt source.fasta >out.fasta
So when we are reading ids.txt, the condition NR==FNR is true, and so we simply store each line in the array s. The next causes the rest of the Awk script to be skipped for this line.
On subsequent reads, when NR!=FNR, we use the variable p to control what to print. When we see a new sequence, we set p to 0 (in case it was 1 from a previous iteration). Then when we see a new sequence, we check if it is in s, and if so, we set p to one. The last line simply prints the line if p is not empty or zero. (An empty action is a shorthand for the action { print }.)
The slightly complex condition to check if $1 is in s might be too complicated -- I put in some normalizations to make sure that a space between the > and the sequence identifier is tolerated, regardless of whether there was one in ids.txt or not. This can probably be simplified if your files are consistently formatted.
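To see the NR==FNR idiom in isolation, here is a minimal sketch on toy input. The three-record FASTA and the ID list below are made up for illustration, and the matching is simplified to an exact comparison of the first field, so it assumes ids.txt holds the IDs with the leading > and no space after it.

```shell
# Toy data (hypothetical): two wanted IDs, three records in the FASTA file.
tmp=$(mktemp -d)
printf '%s\n' '>seq1' '>seq3' > "$tmp/ids.txt"
printf '%s\n' '>seq1 len=4' 'ACGT' '>seq2 len=4' 'TTTT' \
    '>seq3 len=8' 'GGGG' 'CCCC' > "$tmp/source.fasta"

# First file: remember each ID as an array key.
# Later files: on every header line, set p to 1 if the header's first
# field is a wanted ID, else 0. The bare pattern "p" prints while p is 1.
extracted=$(awk 'NR == FNR { wanted[$1]; next }
    /^>/ { p = ($1 in wanted) }
    p' "$tmp/ids.txt" "$tmp/source.fasta")

printf '%s\n' "$extracted"
rm -r "$tmp"
```

Unlike the script above, this sketch prints the whole header line rather than just the ID, and it keeps multi-line sequences (seq3 here) intact because p stays set until the next header.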

Only with GNU grep and sed:
grep -A 1 -w -F -f ids.txt source.fasta | sed 's/ .*//'
See: man grep

$ awk 'NR==FNR{a[$1];next} $1 in a{c=2} c&&c--' ids.txt source.fasta
>TRINITY_DN80_c0_g1_i1 len=723 path=[700:0-350 1417:351-368 1045:369-722] [-1, 700, 1417, 1045, -2]
The above was run using your posted source.fasta and this ids.txt:
$ cat ids.txt
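The c&&c-- part of the one-liner is a countdown: a matching header sets c to 2, and each printed line decrements it, so exactly the header and the line after it are printed. A small sketch on made-up data (the file contents are hypothetical, and it assumes every sequence sits on a single line):

```shell
# Hypothetical toy data: one wanted header, each sequence on one line.
tmp=$(mktemp -d)
printf '%s\n' '>seq2' > "$tmp/ids.txt"
printf '%s\n' '>seq1 len=4' 'ACGT' '>seq2 len=4' 'TTTT' \
    '>seq3 len=4' 'GGGG' > "$tmp/source.fasta"

# "c && c--" is true while c is non-zero and decrements c afterwards,
# so after a match the pattern is true for two consecutive lines.
countdown=$(awk 'NR==FNR { a[$1]; next } $1 in a { c = 2 } c && c--' \
    "$tmp/ids.txt" "$tmp/source.fasta")

printf '%s\n' "$countdown"
rm -r "$tmp"
```

If a sequence spans several lines, the countdown would cut it off after the first one; in that case a flag that is reset on the next header is the safer choice.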

First group all IDs into one expression separated by | like this
cat ids.txt | tr '\n' '|' | awk '{print "\"" $0 "\""}'
Remove the last | symbol from the expression.
Now you can grep using the output you got from previous command like this
grep -E ">TRINITY_DN14840_c10_g1_i1|>TRINITY_DN8506_c0_g1_i1|>TRINITY_DN12276_c0_g2_i1|>TRINITY_DN15434_c5_g3_i1|>TRINITY_DN9323_c8_g3_i5|>TRINITY_DN11957_c1_g7_i1|>TRINITY_DN15373_c1_g1_i1|>TRINITY_DN22913_c0_g1_i1|>TRINITY_DN13029_c4_g5_i1" source.fasta
This will print only the matching lines
Edit, as per tripleee's comments:
Using the following prints the output properly,
assuming the ID and sequence are on different lines:
tr '\n' '|' <ids.txt | sed 's/|$//' | grep -A 1 -E -f - source.fasta

This might work for you (GNU sed):
sed 's#.*#/^&/{s/ .*//;N;p}#' idFile | sed -nf - fastafile
Convert idFile into a sed script and run it against fastafile.
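It may help to look at the intermediate sed script that the first command generates. The sketch below runs both steps separately on made-up files (the names and contents are hypothetical) and, like the answer, assumes GNU sed and one-line sequences.

```shell
tmp=$(mktemp -d)
printf '%s\n' '>seq2' > "$tmp/idFile"
printf '%s\n' '>seq1 len=4' 'ACGT' '>seq2 len=4' 'TTTT' > "$tmp/fastafile"

# Step 1: each ID line becomes one sed command of the form /^ID/{...}.
generated=$(sed 's#.*#/^&/{s/ .*//;N;p}#' "$tmp/idFile")
printf '%s\n' "$generated"

# Step 2: feed the generated script to a second sed on standard input.
# -n suppresses all output except what the p commands print.
picked=$(printf '%s\n' "$generated" | sed -n -f - "$tmp/fastafile")
printf '%s\n' "$picked"
rm -r "$tmp"
```

Note that the generated address /^>seq2/ is a prefix match, so it would also select a header such as >seq20; anchoring the ID with a trailing space or end-of-line would be needed if your IDs can be prefixes of one another.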

The best way to do this is using either Python or Perl. I was able to make a script for extracting the ids using Python as follows.
#script to extract sequences from a source file based on ids in another file
#the source is a fasta file with a header and a sequence that follows in one line
#the ids file contains one id per line
#both the id and source file should contain the character '>' at the beginning that signifies an id
def main():
    #asks the user for the ids file
    file1 = raw_input('ids file: ')
    #opens the ids file
    ids_file = open(file1, 'r')
    #asks the user for the fasta file
    file2 = raw_input('fasta file: ')
    #opens the fasta file; it is read into memory below, so you need more memory than the file size
    fasta_file = open(file2, 'r')
    #ask the user for the file name of output file
    file3 = raw_input('enter the output filename: ')
    #opens the output file for writing; note that 'w' overwrites an existing file, use 'a' to append instead
    output_file = open(file3, 'w')
    #split the ids into an array
    ids_lines = ids_file.read().splitlines()
    #split the fasta file into an array, the first element will be the id followed by the sequence
    fasta_lines = fasta_file.read().splitlines()
    #initializing loop counters
    i = 0
    j = 0
    #while loop to iterate over the length of the ids file as this is the limiter for the program
    while j < len(fasta_lines) and i < len(ids_lines):
        #if statement to match ids from both files and bring matching sequences
        if ids_lines[i] == fasta_lines[j]:
            #output statements including newline characters
            output_file.write(fasta_lines[j] + '\n')
            output_file.write(fasta_lines[j + 1] + '\n')
            #increment i so that we go for the next id
            i += 1
            #reset j so we start all over for the new id
            j = 0
        else:
            #when there is no match check the next id line, skipping the sequence in between, which is at j+1
            j += 2
    output_file.close()

main()
The code is not perfect but works for any number of ids. I have tested it on my samples, one of which contained 5000 ids, and the program worked fine. If you have improvements to the code, please make them; I am relatively new to programming, so the code is a bit crude.


Unix bash - using cut to regex lines in a file, match regex result with another similar line

I have a text file: file.txt, with several thousand lines. It contains a lot of junk lines which I am not interested in, so I use the cut command to regex for the lines I am interested in first. Each entry I am interested in is listed twice in the text file: once in a "definition" section, and once in a "value" section. I want to retrieve the first value from the "definition" section, and then for each entry found there find its corresponding "value" section entry.
The first entry starts with ' gl_ ', while the 2nd entry would look like ' "gl_ ', starting with a '"'.
This is the code I have so far for looping through the text document, which then retrieves the values I am interested in and appends them to a .csv file:
while read -r line
if [[ $line == gl_* ]] ; then (param=$(cut -d'\' -f 1 $line) | def=$(cut -d'\' -f 2 $line) | type=$(cut -d'\' -f 4 $line) | prompt=$(cut -d'\' -f 8 $line))
while read -r glline
if [[ $glline == '"'$param* ]] ; then val=$(cut -d'\' -f 3 $glline) |
"$project";"$param";"$val";"$def";"$type";"$prompt" >> /filepath/file.csv
done < file.txt
done < file.txt
This seems to throw some syntax errors related to unexpected tokens near the first 'done' statement.
Example of text that needs to be parsed, and paired:
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
gl_three\User Defined\1\Time\1\\1\Datetime now
So effectively, the while loop reads each line until it hits the first line that starts with 'gl_', which then stores that value (ie. gl_one) as a variable 'param'.
It then starts the nested while loop that looks for the line that starts with a ' " ' in front of the gl_, and is equivalent to the 'param' value. In other words, the
script should couple the lines gl_one and "gl_one, gl_two and "gl_two, gl_three and "gl_three.
The text file is large, and these are settings that have been defined this way. I need to collect the values for each gl_ parameter, to save them together in a .csv file with their corresponding "gl_ values.
Wanted regex output stored in variables would be something like this:
first while loop:
$param = gl_one, $def = User Defined, $type = String, $prompt = Some Text
second while loop:
$val = Value1
Then it stores these variables to the file.csv, with semi-colon separators.
Currently, I have an error for the first 'done' statement, which seems to indicate an issue with the quotation marks. Apart from this,
I am looking for general ideas and comments to the script. I.e, not entirely sure I am looking for the quotation mark parameters "gl_ correctly, or if the
semi-colons as .csv separators are added correctly.
Edit: Overall, the script runs now, but extremely slow due to the inner while loop. Is there any faster way to match the two lines together and add them to the .csv file?
Any ideas and comments?
This will generate a file containing the data you want:
cat file.txt | grep gl_ | sed -E "s/\"//" | sort | sed '$!N;s/\n/\\/' | awk -F'\' '{print $1"; "$5"; "$7"; "$NF}' > /filepath/file.csv
It uses grep to extract all lines containing 'gl_'
then sed to remove the leading '"' from the lines that contain one [I have assumed there are no further '"' in the line]
The lines are sorted
sed removes the return from each pair of lines
awk then prints the required columns according to your requirements
Output routed to the file.
LANG=C sort -t\\ -sd -k1,1 <file.txt |\
sed '
/^gl_/{ # if definition
N; # append next line to buffer
s/\n"gl_[^\\]*//; # if value, strip first column
t; # and start next loop
}
D; # otherwise, delete the line
' |\
awk -F\\ -v p="$project" -v OFS=\; '{print p,$1,$10,$2,$4,$8 }'
sort lines so gl_... appears immediately before "gl_... (LANG=C fixes the locale, i.e. LC_COLLATE and LC_CTYPE, unless LC_ALL overrides it) - assumes definition appears before value
sed to help ensure matching definition and value (may still fail if duplicate/missing value), and tidy for awk
awk to pull out relevant fields
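If sorting is not wanted, a two-pass awk over the same file avoids the slow inner loop as well. This is only a sketch under assumptions: the value lines below are invented, with a layout ("name\something\value) guessed from the cut -f 3 in the question, and demo stands in for the $project variable.

```shell
tmp=$(mktemp -d)
# Hypothetical sample: definition lines as in the question, plus assumed value lines.
cat > "$tmp/file.txt" <<'EOF'
junk line
gl_one\User Defined\1\String\1\\1\Some Text
gl_two\User Defined\1\String\1\\1\Some Text also
"gl_one\0\Value1
"gl_two\0\Value2
EOF

project=demo   # placeholder for the $project variable used in the question

# Pass 1 (NR == FNR): store the third field of every value line, keyed by
# the parameter name without its leading double quote.
# Pass 2: for each definition line, print it together with the stored value.
paired=$(awk -F'\\' -v p="$project" -v OFS=';' '
    NR == FNR { if ($1 ~ /^"gl_/) val[substr($1, 2)] = $3; next }
    /^gl_/    { print p, $1, val[$1], $2, $4, $8 }
' "$tmp/file.txt" "$tmp/file.txt")

printf '%s\n' "$paired"
rm -r "$tmp"
```

Reading the file twice means the order of definition and value sections does not matter, and junk lines are simply ignored because neither rule matches them.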

Matching pairs using Linux terminal

I have a file named list.txt containing a (supplier,product) pair and I must show the number of products from every supplier and their names using Linux terminal
Sample input:
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
The following pipeline should do the job
< your_input_file sort -t: -k1,1r | sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with common supplier
awk counts the items for supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a newline (\n) in between; this creates a multiline in the pattern space;
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *; it only matches if the line after the newline starts with the same supplier, \1:, in which case the two lines are joined;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the awk-only code I was referring to:
< os awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
    # we use the word before the : as the key of an associative array
    count[$1] += 1 # increment the count for the given supplier
    items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
    for (supp in items) # iterate on the suppliers and print the result
        print supp": " count[supp], "\n"supp":" items[supp]
}' os
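One thing to keep in mind is that for (supp in items) visits the suppliers in an unspecified order. If the output should follow the order in which suppliers first appear, a variant along these lines might do; the list.txt contents are an assumption (one "supplier: product" pair per line), since the sample input is not shown.

```shell
tmp=$(mktemp -d)
# Hypothetical list.txt, assuming one "supplier: product" pair per line.
printf '%s\n' 'stationery: paper' 'grocery: apples' 'stationery: pen' \
    'dairy: milk' 'grocery: pears' 'stationery: rubber' 'dairy: cheese' \
    > "$tmp/list.txt"

# -F': *' splits on the colon plus any spaces that follow it.
# order[] remembers suppliers in first-seen order.
summary=$(awk -F': *' '
    !($1 in count) { order[++n] = $1 }
    { count[$1]++; items[$1] = items[$1] " " $2 }
    END {
        for (i = 1; i <= n; i++) {
            s = order[i]
            print s ": " count[s]
            print s ":" items[s]
        }
    }' "$tmp/list.txt")

printf '%s\n' "$summary"
rm -r "$tmp"
```

No sort is needed here, because the grouping happens in the arrays rather than by making equal suppliers adjacent.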

How can I retrieve the matching records from mentioned file format in bash

I have above file format from which I want to find a matching record. For example, match a number(7789) on line starting with XYZ and once matched look for a matching number (7345) in lines below starting with 1 until it reaches to line starting with 9. retrieve the entire line record. How can I accomplish this using shell script, awk, sed or any combination.
Expected Output:
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
sed -n ' ' # -n disabled automatic printing
/^XYZ.*7789/, # Match line starting with XYZ, and
# containing 7789
/^1.*7345/p # Print line starting with 1 and
# containing 7345, which is coming
# after the previous match
/^9$/ { } # Match line that is 9
range { stuff } will execute stuff when it's inside range, in this case the range is starting at /^XYZ.*7789/ and ending with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern
space. And then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
  for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
    if(index($i, fid)) { # When you find one that contains fid
      print $i # Print it,
      next # and continue with the next record.
    } # Remove the "next" line if you want all matching
  } # fields.
}
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain if BSD awk accepts it. Both GNU awk and mawk do, though.
EDIT: Misread question the first time around.
An extendable awk script can be
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
Set the flag s when a line starts with XYZ and contains 7789; reset it when the line is just 9; print when the flag is set and the line contains 7345.
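Spelled out with a named flag and run on made-up data, the same idea looks roughly like this. The record layout (an XYZ header, detail lines starting with 1, a lone 9 as terminator) is assumed from the description in the question, and in_block is just an illustrative name.

```shell
tmp=$(mktemp -d)
# Hypothetical records: XYZ header, detail lines starting with 1, a lone 9 at the end.
cat > "$tmp/file" <<'EOF'
XYZ 1234
1 7345 not wanted, wrong header
9
XYZ 7789
1 1111 other detail
1 7345 wanted detail
9
EOF

# in_block is 1 only between a matching header and the closing 9.
found=$(awk '
    /^9$/                      { in_block = 0 }
    in_block && /^1/ && /7345/ { print }
    /^XYZ/ && /7789/           { in_block = 1 }
' "$tmp/file")

printf '%s\n' "$found"
rm -r "$tmp"
```

The first record also contains 7345, but it is skipped because its header does not contain 7789, so the flag is never set for it.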
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 will always follow the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

Find Replace using Values in another File

I have a directory of files, myFiles/, and a text file values.txt in which one column is a set of values to find, and the second column is the corresponding replace value.
The goal is to replace all instances of find values (first column of values.txt) with the corresponding replace values (second column of values.txt) in all of the files located in myFiles/.
For example...
Hello Goodbye
Happy Sad
Running the command would replace all instances of "Hello" with "Goodbye" in every file in myFiles/, as well as replace every instance of "Happy" with "Sad" in every file in myFiles/.
I've taken as many attempts at using awk/sed and so on as I can think logical, but have failed to produce a command that performs the action desired.
Any guidance is appreciated. Thank you!
Read each line from values.txt
Split that line into 2 words
Use sed for each line to replace the 1st word with the 2nd word in all files in the myFiles/ directory
Note: I've used bash parameter expansion to split the line (${line% *} etc.), assuming values.txt is a space-separated two-column file. If that's not the case, you may use awk or cut to split the line.
while read -r line;do
    sed -i "s/${line% *}/${line#* }/g" myFiles/* # '-i' edits files in place and 'g' replaces all occurrences of patterns
done < values.txt
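A variation on the same idea is to turn values.txt into a sed script once and run sed a single time per file, instead of once per pair. The sketch below works on throwaway files in a temporary directory (all names and contents are made up) and assumes the words contain no characters that are special to sed, such as /, & or a dot.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/myFiles"
printf '%s\n' 'Hello Goodbye' 'Happy Sad' > "$tmp/values.txt"
printf '%s\n' 'Hello world' 'Happy days' > "$tmp/myFiles/a.txt"

# Turn every "find replace" pair into one sed substitution command.
awk '{ print "s/" $1 "/" $2 "/g" }' "$tmp/values.txt" > "$tmp/replace.sed"

# Apply all substitutions to each file in one sed run, writing through a
# temporary file instead of relying on the non-portable -i option.
for f in "$tmp"/myFiles/*; do
    sed -f "$tmp/replace.sed" "$f" > "$f.new" && mv "$f.new" "$f"
done

replaced=$(cat "$tmp/myFiles/a.txt")
printf '%s\n' "$replaced"
rm -r "$tmp"
```

The substitutions are applied in the order they appear in values.txt, so a later pair can still rewrite text produced by an earlier one; the same is true of the loop above.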
You can do what you want with awk.
#! /usr/bin/awk -f
# snarf in first file, values.txt
FNR == NR {
    subs[$1] = $2
    next
}
# apply replacements to subsequent files
{
    for( old in subs ) {
        out = ""
        # copy everything before each match, then the replacement
        while( (start = index($0, old)) > 0 ) {
            out = out substr($0, 1, start - 1) subs[old]
            $0 = substr($0, start + length(old))
        }
        $0 = out $0
    }
    print
}
When you invoke it, put values.txt as the first file to be processed.
Option One:
Create a Python script.
Using with open('filename', 'r') as infile, read the values.txt file into a Python dict with 'from' as key and 'to' as value. The with block closes the infile for you.
Use os.listdir to list the wanted directory and iterate over the files; for each one, either run sed 's/from/to/g' through popen, or read the file and find/replace on each of its lines.
Option Two:
bash script
read in a from/to pair
perl -p -i -e 's/from/to/g' dirname/*.txt
The second is probably easier to write but has less exception handling.
It's called 'Perl PIE' and it's a relatively famous hack for doing find/replace in lots of files at once.

Remove shortest leading whitespace from all lines

I have some text with some leading whitespace on all lines. I want to remove the whitespace from the shortest line (if it's simpler, this requirement could be changed to the first line) and then remove the same amount of whitespace from all other lines.
E.g. I have this text:
    var flatten = function(result, next_array) {
        console.log('current result', result);
        return result.concat(next_array);
    [1, [2], [3, 4]]
        .reduce(flatten, []);
And I want to end up with this text:
var flatten = function(result, next_array) {
    console.log('current result', result);
    return result.concat(next_array);
[1, [2], [3, 4]]
    .reduce(flatten, []);
Basically, I want to shift the text over until there's at least one line with no whitespace on the left and preserve all other leading whitespace on all other lines.
The use case for this is copying code from the middle of a section of code to paste as an example elsewhere. What I currently do is copy the code, paste into vim with paste insert mode, use << until I get the desired output, and copy the buffer. The same could be done in TextMate with Cmd-[.
What I want is to do this with a shell script so I could, for example, trigger it with a hotkey to take my clipboard contents, remove the desired whitespace, and paste the result.
This awk one-liner could do it for you too. It assumes you want to remove at least one whitespace character (because I see in your example there is an empty line without any leading spaces, but all lines are shifted left anyway).
test with your example:
kent$ cat f
    var flatten = function(result, next_array) {
        console.log('current result', result);
        return result.concat(next_array);
    [1, [2], [3, 4]]
        .reduce(flatten, []);
kent$ awk -F '\\S.*' '{l=length($1);if(l>0){if(NR==1)s=l; else s=s>l?l:s;}a[NR]=$0}END{for(i=1;i<=NR;i++){sub("^ {"s"}","",a[i]);print a[i]}}' f
var flatten = function(result, next_array) {
    console.log('current result', result);
    return result.concat(next_array);
[1, [2], [3, 4]]
    .reduce(flatten, []);
I don't think awk scripts are unreadable, but you have to know the syntax of awk. Anyway, I am adding some explanation:
The awk script has two blocks. The first block is executed for each line of your file as it is read. The END block is executed after the last line of your file has been read. See the comments below for explanation.
awk -F '\\S.*' #using a delimiter '\\S.*'(regex). the first non-empty char till the end of line
#so that each line was separated into two fields,
#the field1: leading spaces
#and the field2: the rest texts
'{ #block 1
l=length($1); #get the length of field1($1), which is the leading spaces, save to l
if(l>0){ #if l >0
if(NR==1)s=l; #NR is the line number, if it was the first line, s was not set yet, then let s=l
else s=s>l?l:s;} #else if l<s, let s=l otherwise keep s value
a[NR]=$0 #after reading each line, save the line in a array with lineNo. as index
} #this block is just for get the number of "shortest" leading spaces, in s
END{for(i=1;i<=NR;i++){ #loop from lineNo 1-last from the array a
sub("^ {"s"}","",a[i]); #replace s number of leading spaces with empty
print a[i]} #print the array element (after replacement)
}' file #file is the input file
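The one-liner above relies on GNU awk for \S and for the {n} interval in the regex. As a rough alternative, here is a sketch that sticks to plain awk features by reading the file twice; the snippet contents are made up, and like the answer it only considers space characters, not tabs.

```shell
tmp=$(mktemp -d)
# Hypothetical indented snippet; the second line is blank on purpose.
printf '%s\n' '    a {' '' '        b' '    }' > "$tmp/snippet.txt"

# Pass 1 finds the smallest indentation among non-blank lines;
# pass 2 (the same file given a second time) strips that many characters.
shifted=$(awk '
    NR == FNR {
        if ($0 ~ /[^ ]/) {        # skip empty and all-space lines
            match($0, /^ */)      # RLENGTH = number of leading spaces
            if (min == "" || RLENGTH < min) min = RLENGTH
        }
        next
    }
    { print substr($0, min + 1) }
' "$tmp/snippet.txt" "$tmp/snippet.txt")

printf '%s\n' "$shifted"
rm -r "$tmp"
```

Because the file is named twice it cannot be fed from a pipe directly; for clipboard use the input would first have to be saved to a temporary file.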
These functions can be defined in your .bash_profile to have access to them anywhere, rather than creating a script file. They don't require first line to match:
shiftleft(){
    len=$(grep -e "^[[:space:]]*$" -v "$1" | sed -E 's/([^ ]).*/x/' | sort -r | head -1 | wc -c)
    cut -c $(($len-1))- "$1"
}
Usage: shiftleft myfile.txt
This works with a file, but would have to be modified to work with pbpaste piped to it...
NOTE: Definitely inspired by the answer of @JoSo, but fixes the errors in there (uses sort -r and cut -c N-, adds the missing $ on len, and doesn't get hung up by blank lines without whitespace).
EDIT: A version to work with the contents of the clipboard on OSX:
shiftclip(){
    len=$(pbpaste | grep -e "^[[:space:]]*$" -v | sed -E 's/([^ ]).*/x/' | sort -r | head -1 | wc -c)
    pbpaste | cut -c $(($len-1))-
}
Usage for this version:
Copy the text of interest and then type shiftclip. To copy output directly back to the clipboard, do shiftclip | pbcopy
len=$(sed 's/[^ ].*//' <"$file"| sort | head -n1 | wc -c)
cut -c "$((len))"- <"$file"
Or, a bit less readable but avoiding the overhead of a sort:
len=$(awk 'BEGIN{m=-1} {sub("[^ ].*", "", $0); l = length($0); m = (m==-1||l<m) ? l : m; } END { print m+1 }' <"$file")
cut -c "$((len))"- <"$file"
Hmmm... this isn't very beautiful, and also assumes you have access to Bash, but if you can live with your "first line" rule:
spaces=`head -1 $file | sed -re 's/(^[ ]+)(.+)/\1/g'`
cat $file | sed -re "s/^$spaces//"
It also assumes only the space character (i.e., you need to tweak for tabs) but you get the idea.
Assuming your example is in a file snippet.txt, put the bash code in a script (e.g., "shiftleft") , "chmod +x" the script, then run with:
shiftleft snippet.txt
