Cut and paste a line with an exact match using sed - shell

I have a text file (~8 GB). Let's call this file A. File A has about 100,000 lines, each with 19 words and integers separated by spaces. I need to cut several lines from file A and paste them into a new file (file B). The lines should be deleted from file A. The lines to cut are those that contain an exact match for a given string.
I then need to repeat this several times, removing lines from file A with a different matching string every time. Each time, file A is getting smaller.
I can do this using "sed" but using two commands, like this:
# Find the lines in file A that contain the matching string and copy them to file B.
sed -ne '/\<matchingString\>/ p' fileA > fileB
# Find the same lines again and delete them,
# writing the lines that were not deleted to a tmp file.
sed '/\<matchingString\>/d' fileA > tmp
# Replace file A with the tmp file.
mv tmp fileA
Here is an example of files A and B. I want to extract all lines containing hg15
File A:
ID pos frac xp mf ...
23 43210 0.1 2 hg15...
...
...
File B:
23 43210 0.1 2 hg15...
I'm fairly new to writing shell scripts and using the Unix tools, but I feel I should be able to do this more elegantly and faster. Can anyone please guide me in improving this script? I don't specifically need to use "sed". I have been searching the web and Stack Overflow without finding a solution to this exact problem. I'm using RedHat and bash.
Thanks.

This might work for you (GNU sed):
sed 's|.*|/\\<&\\>/{w fileB\nd}|' matchingString_file | sed -i.bak -f - fileA
This makes a sed script from the matching strings that writes the matching lines to fileB and deletes them from fileA.
N.B. a backup of fileA is made too.
To make a different file for each exact word match use:
sed 's|.*|/\\<&\\>/{w &.txt\nd}|' matchingString_file | sed -i.bak -f - fileA
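To check what the generated script looks like before letting it edit a large file, the first sed can be run on its own. A minimal sketch, assuming GNU sed and made-up sample data (a two-word list and a four-line fileA) in a scratch directory:

```shell
# Work in a scratch directory so nothing real is modified.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Made-up sample data: a word list and a small fileA.
printf 'hg15\nhg16\n' > matchingString_file
printf '%s\n' 'ID pos frac xp mf' '23 43210 0.1 2 hg15' \
    '24 43211 0.2 3 hg150' '25 43212 0.3 1 hg16' > fileA

# Step 1: print the generated sed script instead of running it.
sed 's|.*|/\\<&\\>/{w fileB\nd}|' matchingString_file

# Step 2: run it. Whole-word matches move to fileB; hg150 stays in fileA.
sed 's|.*|/\\<&\\>/{w fileB\nd}|' matchingString_file | sed -i.bak -f - fileA
```

Step 1 prints one address block per word, each containing a w command and a d command. After step 2, fileA.bak holds the untouched original.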

I'd use grep for this. Apart from that small improvement, what you have is probably already the fastest way to do it, even though it applies the regexp to each line twice:
grep '<matchingString>' A > B
grep -v '<matchingString>' A > tmp
mv tmp A
The next approach would be to read the file line by line, check each line, and write it to either B or tmp depending on the result (and mv tmp A again at the end). But there is no standard Unix tool which does this (AFAIK), and doing it in the shell will probably reduce performance massively:
while IFS='' read -r line
do
    if expr "$line" : '<matchingString>' >/dev/null
    then
        # Matching lines go to B via file descriptor 3.
        echo "$line" 1>&3
    else
        # All other lines go to tmp via stdout.
        echo "$line"
    fi
# Redirect once for the whole loop; redirecting on "fi" would truncate both files on every line.
done < A > tmp 3> B
You could try to do this using Python (or similar scripting languages):
import os
import re

with open('B', 'w') as b:
    with open('tmp', 'w') as tmp:
        with open('A') as a:
            for line in a:
                # re.search matches anywhere in the line; re.match only matches at the start.
                if re.search(r'<matchingString>', line):
                    b.write(line)
                else:
                    tmp.write(line)
os.rename('tmp', 'A')
But this is a little out of scope here (not shell anymore).
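For comparison, awk can route each line to one of two files in a single pass, which avoids both the second scan and the slow shell loop. A minimal sketch, assuming the word to match is hg15 and using made-up sample data; the whole-word test is approximated by padding the line with spaces, which only works for space-separated fields like those described in the question:

```shell
# Work in a scratch directory so nothing real is modified.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

# Made-up sample data.
printf '%s\n' 'ID pos frac xp mf' '23 43210 0.1 2 hg15' \
    '24 43211 0.2 3 hg150' > A

# Create both outputs up front so mv works even if one side gets no lines.
: > B
: > tmp

awk -v word='hg15' '
    # Pad with spaces so the word only matches as a complete field.
    index(" " $0 " ", " " word " ") { print > "B"; next }
    { print > "tmp" }
' A
mv tmp A
```

Each line is tested once: matching lines are written to B, everything else to tmp, which then replaces A.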

Hope this helps:
cat fileA | while read line
do
# Find the lines in file A with the matching string and copy them to file B
sed -ne '/\<matchingString\>/ p' fileA >> fileB
# Find the same lines again and delete them,
# writing the lines that were not deleted to a tmp file
sed '/\<matchingString\>/d' fileA >> tmp
done
# Once the grepping and copying is done, replace file A with the tmp file
mv tmp fileA
PS: I'm appending to file B because we are grepping in a loop whenever the match pattern is found.

Related

How to add an empty line at the end of these commands?

I am in a situation where I have so many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
The output file is a correct FASTA file.
However I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line I found that a file header was put at the end of the previous file sequence. And it turns out like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
So that when I merge they are placed on top of each other correctly and not at the end of the sequence of the previous file.
So how can I add a blank line to the end of the file within the
scripts that convert fastq to fasta?
I would use GNU sed for this. Replace
cat $1 >> final.fasta
using
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means "at the last line ($), append a newline (\n)". The a command queues the text, which is written out after the last line itself has been printed. If you prefer GNU AWK, you can get the same behavior this way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test either solution, as you did not provide enough information for that. I assume the line above is somewhere inside a loop and $1 is always the name of a file that exists in the current working directory.
If the only thing you need is an extra blank line, and the input files are under 1.5 GB in size, then just do this directly:
awk NF=NF RS='^$' FS='\n' OFS='\n'
This should work for mawk 1/2, gawk, and nawk, and maybe others as well. Although it appears to do nothing special, it works because the extra \n comes from ORS.
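If the underlying issue is that some FASTA files lack a final newline, another option is to add one only when it is missing while concatenating. A minimal sketch, assuming two made-up input files; append_with_newline is an illustrative helper name, not part of any existing tool:

```shell
append_with_newline() {
    # $1: file to append, $2: destination file
    cat "$1" >> "$2"
    # Command substitution strips a trailing newline, so a non-empty result
    # means the last byte of the file is something other than a newline.
    if [ -n "$(tail -c 1 "$1")" ]; then
        echo >> "$2"
    fi
}

# Work in a scratch directory with made-up sample files.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf '>read1\nGGCTTAAACAGCATT' > a.fasta     # no final newline
printf '>read2\nACGTACGT\n' > b.fasta          # already ends with a newline

for f in a.fasta b.fasta; do
    append_with_newline "$f" final.fasta
done
cat final.fasta
```

Files that already end with a newline are appended unchanged, so no blank lines are introduced between records.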

How to split a text file content by a string?

Suppose I've got a text file that consists of two parts separated by delimiting string ---
aa
bbb
---
cccc
dd
I am writing a bash script to read the file and assign the first part to var part1 and the second part to var part2:
part1= ... # should be aa\nbbb
part2= ... # should be cccc\ndd
How would you suggest writing this in bash?
You can use awk:
foo="$(awk 'NR==1' RS='---\n' ORS='' file.txt)"
bar="$(awk 'NR==2' RS='---\n' ORS='' file.txt)"
This would read the file twice, but handling text files in the shell, i.e. storing their content in variables should generally be limited to small files. Given that your file is small, this shouldn't be a problem.
Note: Depending on your actual task, you may be able to just use awk for the whole thing. Then you don't need to store the content in shell variables or read the file twice.
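For a small file, the split can also be done with parameter expansion once the content is in a variable, so the file is read only once. A minimal sketch, assuming the delimiter line appears exactly once and is neither the first nor the last line; part1 and part2 are the variable names from the question, and the sample file is recreated in a scratch directory:

```shell
# Work in a scratch directory with the sample file from the question.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf 'aa\nbbb\n---\ncccc\ndd\n' > file.txt

# A literal newline, written portably.
nl='
'
content=$(cat file.txt)

# Remove everything from the delimiter line onwards to get the first part.
part1=${content%%"$nl---$nl"*}
# Remove everything up to and including the delimiter line to get the second part.
part2=${content#*"$nl---$nl"}

printf 'part1:\n%s\npart2:\n%s\n' "$part1" "$part2"
```

The delimiter is matched together with its surrounding newlines, so a line that merely contains --- as part of longer text is not treated as a separator.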
A solution using sed:
foo=$(sed '/^---$/q;p' -n file.txt)
bar=$(sed '1,/^---$/b;p' -n file.txt)
The -n command line option tells sed to not print the input lines as it processes them (by default it prints them). sed runs a script for each input line it processes.
The first sed script
/^---$/q;p
contains two commands (separated by ;):
/^---$/q - quit when you reach the line matching the regex ^---$ (a line that contains exactly three dashes);
p - print the current line.
The second sed script
1,/^---$/b;p
contains two commands:
1,/^---$/b - starting with line 1 until the first line matching the regex ^---$ (a line that contains only ---), branch to the end of the script (i.e. skip the second command);
p - print the current line;
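The two scripts can be tried on the sample file from the question. A minimal sketch, assuming GNU sed and a scratch directory; foo and bar are the variable names used in the commands above:

```shell
# Work in a scratch directory with the sample file from the question.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf 'aa\nbbb\n---\ncccc\ndd\n' > file.txt

# Print lines until the delimiter is reached, then quit without printing it.
foo=$(sed -n '/^---$/q;p' file.txt)
# Skip lines 1 through the delimiter, then print everything after it.
bar=$(sed -n '1,/^---$/b;p' file.txt)

printf 'foo:\n%s\nbar:\n%s\n' "$foo" "$bar"
```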
Using csplit:
csplit --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}" && sed -i '/---/d' foo_bar*
If your coreutils version is 8.22 or later, the --suppress-matched option can be used and the sed post-processing is not required:
csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}"

Use grep only on specific columns in many files?

Basically, I have one file with patterns and I want every line to be searched in all text files in a certain directory. I also only want exact matches. The many files are zipped.
However, I have one more condition. I need the first two columns of a line in the pattern file to match the first two columns of a line in any given text file that is searched. If they match, the output I want is the pattern(the entire line) followed by all the names of the text files that a match was found in with their entire match lines (not just first two columns).
An output such as:
pattern1
file23:"text from entire line in file 23 here"
file37:"text from entire line in file 37 here"
file156:"text from entire line in file 156 here"
pattern2
file12:"text from entire line in file 12 here"
file67:"text from entire line in file 67 here"
file200:"text from entire line in file 200 here"
I know that grep can take an input file, but the problem is that it takes every pattern in the pattern file and searches for them in a given text file before moving onto the next file, which makes the above output more difficult. So I thought it would be better to loop through each line in a file, print the line, and then search for the line in the many files, seeing if the first two columns match.
I thought about this:
cat pattern_file.txt | while read line
do
echo $line >> output.txt
zgrep -w -l $line many_files/*txt >> output.txt
done
But with this code, it doesn't search by the first two columns only. Is there a way to specify the first two columns for both the pattern line and the lines that grep searches through?
What is the best way to do this? Would something other than grep, like awk, be better to use? There were other questions like this, but none that used columns for both the search pattern and the searched file.
Few lines from pattern file:
1 5390182 . A C 40.0 PASS DP=21164;EFF=missense_variant(MODERATE|MISSENSE|Aag/Cag|p.Lys22Gln/c.64A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390200 . G T 40.0 PASS DP=21237;EFF=missense_variant(MODERATE|MISSENSE|Gcc/Tcc|p.Ala28Ser/c.82G>T|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
1 5390228 . A C 40.0 PASS DP=21317;EFF=missense_variant(MODERATE|MISSENSE|gAa/gCa|p.Glu37Ala/c.110A>C|359|AT1G15670|protein_coding|CODING|AT1G15670.1|1|1)
Few lines from a file in searched files:
1 10699576 . G A 36 PASS DP=4 GT:GQ:DP 1|1:36:4
1 10699790 . T C 40 PASS DP=6 GT:GQ:DP 1|1:40:6
1 10699808 . G A 40 PASS DP=7 GT:GQ:DP 1|1:40:7
In reality, both files are much larger.
It sounds like this might be what you want:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile anyfile
If it's not then update your question to provide a clear, simple statement of your requirements and concise, testable sample input and expected output that demonstrates your problem and that we could test a potential solution against.
If anyfile is actually a zip file then you'd do something like:
zcat anyfile | awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' patternfile -
Replace zcat with whatever command you use to produce text from your zip file if that's not what you use.
Per the question in the comments, if both input files are compressed and your shell supports it (e.g. bash) you could do:
awk 'NR==FNR{a[$1,$2]; next} ($1,$2) in a' <(zcat patternfile) <(zcat anyfile)
otherwise just uncompress patternfile to a tmp file first and use that in the awk command.
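To get the grouped layout shown in the question (each pattern line followed by file:line entries), the same two-column key can be used to collect hits per pattern and print them at the end. A minimal sketch, assuming the searched files are already uncompressed plain text and the hits fit in memory; pattern_file.txt and many_files/ are the names from the question, and the sample rows are made up:

```shell
# Work in a scratch directory with made-up sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
mkdir many_files
printf '1 100 . A C\n1 200 . G T\n' > pattern_file.txt
printf '1 100 . A C 36\n1 999 . T C 40\n' > many_files/file1.txt
printf '1 200 . G A 40\n1 100 . A G 12\n' > many_files/file2.txt

awk '
    NR == FNR {                       # first file: remember pattern lines by key
        key = $1 SUBSEP $2
        pattern[key] = $0
        order[++n] = key
        next
    }
    ($1, $2) in pattern {             # later files: collect matching lines
        key = $1 SUBSEP $2
        hits[key] = hits[key] FILENAME ":" $0 "\n"
    }
    END {                             # print each pattern followed by its hits
        for (i = 1; i <= n; i++) {
            print pattern[order[i]]
            printf "%s", hits[order[i]]
        }
    }
' pattern_file.txt many_files/*.txt > output.txt
cat output.txt
```

Only the first two columns decide a match, so a searched line is reported even when its remaining columns differ from the pattern line.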
Use read to parse the pattern file's columns and add an anchor to the zgrep pattern:
while read -r column1 column2 rest_of_the_line
do
echo "$column1 $column2 $rest_of_the_line"
zgrep -w -l "^$column1\s\+$column2" many_files/*txt
done < pattern_file.txt >> output.txt
read can parse a line into multiple variables passed as parameters, with the last one receiving the rest of the line. It splits fields on the characters of $IFS, the Internal Field Separator (by default tabs, spaces and linefeeds; this can be overridden for the read command by using while IFS='...' read ...).
Using -r avoids unwanted backslash escapes and makes the parsing more reliable, and while ... do ... done < file performs a bit better since it avoids a useless use of cat. Since the output of all the commands inside the while loop is redirected, I also put the redirection on the while loop rather than on each individual command.
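A small illustration of how read distributes fields across variables, using made-up input lines; the first set of variable names matches the loop above, and the second read shows IFS overridden for a single command:

```shell
line='1 5390182 . A C 40.0 PASS'

# A here-document feeds the line to read; the last variable receives the remainder.
read -r column1 column2 rest_of_the_line <<EOF
$line
EOF
printf 'column1=%s\ncolumn2=%s\nrest=%s\n' "$column1" "$column2" "$rest_of_the_line"

# Setting IFS only for this read splits on commas instead of whitespace.
IFS=',' read -r first second remainder <<EOF
a,b,c,d
EOF
printf 'first=%s second=%s remainder=%s\n' "$first" "$second" "$remainder"
```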

Bash - read specific line from a file with all sorts of data and store as a variable

I have looked for an answer to what seems like a simple question, but I feel as though all these questions (below) only briefly touch on the matter and/or over-complicate the solution.
Read a file and split each line into two variables with bash program
Bash read from file and store to variables
Need to assign the contents of a text file to a variable in a bash script
What I want to do is read specific lines from a file (titled 'input'), store them in variables and then use them.
For example, in this code, every 9th line after a certain point contains a filename that I want to store as a variable for later use. How can I do that?
steps=49
for((i=1;i<=${steps};i++)); do
...
g=$((9 * $i + 28)) #In.omega filename
For the bigger picture, I basically need to print a specific line (line 9) from the file whose name is specified in the gth line of the file named "input"
sed '1,39p;d' data > temp
sed "9,9p;d" [filename specified in line g of input] >> temp
sed '41,$p;d' data >> temp
mv temp data
Say you want to assign the 49th line of the file $FILE to the variable ARG. You can do:
$ ARG=$(head -n 49 "$FILE" | tail -n 1)
To get line 9 of the file named in the gth line of the file named input:
sed -n 9p "$(sed -n ${g}p input)"
arg=$(cat sample.txt | sed -n '2p')
where arg is the variable, sample.txt is the file, and 2 is the line number.
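For large files it can help to stop reading as soon as the wanted line has been printed. A minimal sketch, assuming a hypothetical helper named nth_line and made-up file contents; the file name input and the variable g follow the question:

```shell
nth_line() {
    # $1: file, $2: line number. Print that line, then quit without reading further.
    sed -n "${2}{p;q;}" "$1"
}

# Work in a scratch directory with made-up sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1
printf 'first.dat\nsecond.dat\nthird.dat\n' > input
printf 'alpha\nbeta\ngamma\n' > second.dat

g=2
# Line g of input names a file; store that name in a variable.
filename=$(nth_line input "$g")
# Then read line 3 of the file it names.
value=$(nth_line "$filename" 3)
printf 'filename=%s value=%s\n' "$filename" "$value"
```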

Using sed to dynamically generate a file name

I have a CSV file that I'd like to split up based on a field in the file. Essentially, there can be two brands, GVA and HBVL. I'd like to split the file into a file for each brand before I import it into a database.
Sample of the CSV file
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0
Part of the problem is the size of the file. It's about 39 MB. My original attempt at this looked like this:
while read line ; do
name=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\2/ p' | tr [:upper:] [:lower:] `
info=`echo $line | sed -n 's/\(.*\)"\(GVA\|HBVL\)",\(.*\)$/\1\3/ p'`
echo "${info}" >> ${BASEDIR}/${today}/${name}.txt
done < ${file}
After about 2.5 hours, only about 1/2 of the file had been processed. I have another file that could potentially be up to 250 MB in size and I can't imagine how long that would take.
What I'd like to do is pull the brand out of the line and write the line to a file named after the brand. I can remove the brand, but I don't know how to use it to create a file. I've started in sed, but I'm not above using another language if it's more appropriate.
The original while loop with multiple commands per line is DIRE!
sed -e '/"GVA"/w gva.file' -e '/"HBVL"/w hbvl.file' -n $file
The sed script says:
write lines that match the GVA tag to gva.file
write lines that match the HBVL tag to hbvl.file
and don't print anything else ('-n')
Note that different versions of sed can handle different numbers of auxiliary files. If you need more than, say, twenty output files at once, you may need to look at other technology (but test what the limit is on your machine). If the file is sorted so that all the GVA records appear together followed by all the HBVL records, you could consider using csplit. Alternatively, a scripting language like Perl could handle more. If you exceed the number of file descriptors allowed to your process, it becomes hard to do the splitting in a single pass over the data file.
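When the output file name should come from the data itself rather than being hard-coded per brand, awk can build it from the brand field. A minimal sketch, assuming the brand is always the fifth comma-separated field and that no field contains an embedded comma (true for the sample rows); it also drops the brand from the written line and lower-cases the file name, as the original loop tried to do. The file name sample.csv and the shortened IDs are made up:

```shell
# Work in a scratch directory with made-up sample data.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir" || exit 1

cat > sample.csv <<'EOF'
"D509",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"BCC7",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
EOF

awk -F',' -v OFS=',' '
{
    brand = $5
    gsub(/"/, "", brand)             # strip the quotes around the brand
    out = tolower(brand) ".txt"      # gva.txt or hbvl.txt for this sample
    line = $1
    for (i = 2; i <= NF; i++)        # rebuild the record without field 5
        if (i != 5) line = line OFS $i
    print line > out
}' sample.csv
```

The whole file is processed in one awk invocation, so no external command is started per line.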
grep '"GVA"' $file >GVA.txt
grep '"HBVL"' $file >HBVL.txt
# awk -F"," '{o=$5;gsub(/\"/,"",o);print $0 > o}' OFS="," file
# more GVA
"D509379D5055821451C3695A3752DCCD",'1900-01-01 01:00:00',"M","1740","GVA",'2009-07-01 13:25:00',0
"159A58BE41012787D531C7157F688D86",'1900-01-01 00:00:00',"V","1880","GVA",'2008-06-06 11:21:00',0
"D0BB5C058794BBE4478DDA536D1E4872",'1900-01-01 00:00:00',"M","9270","GVA",'2007-09-18 13:21:00',0
# more HBVL
"BCC7096803E5E60E05DC12FB9951E0CF",'1900-01-01 00:00:00',"M","3500","HBVL",'2007-09-18 13:21:00',1
"7F85FCE6F13775A8A3054E3438B81599",'1900-01-01 00:00:00',"M","3970","HBVL",'2007-09-18 13:20:00',0

Resources