How to split a text file content by a string? - bash

Suppose I've got a text file that consists of two parts separated by delimiting string ---
aa
bbb
---
cccc
dd
I am writing a bash script to read the file and assign the first part to var part1 and the second part to var part2:
part1= ... # should be aa\nbbb
part2= ... # should be cccc\ndd
How would you suggest writing this in bash?

You can use awk:
foo="$(awk 'NR==1' RS='---\n' ORS='' file.txt)"
bar="$(awk 'NR==2' RS='---\n' ORS='' file.txt)"
This reads the file twice, but handling text files in the shell (i.e. storing their content in variables) should generally be limited to small files anyway. Given that your file is small, this shouldn't be a problem.
Note: Depending on your actual task, you may be able to just use awk for the whole thing. Then you don't need to store the content in shell variables or read the file twice.
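If the file really is small, the split can also be done in the shell itself after a single read, using parameter expansion. The following is only a sketch under some assumptions: the function name split_file is made up for illustration, the delimiter is a line consisting of exactly --- that is neither the first nor the last line, and the results are left in the global variables part1 and part2.

```shell
# split_file FILE : read FILE once and set part1/part2 to the text before
# and after the first line that is exactly "---".
# Assumption: the delimiter line is neither the first nor the last line.
split_file() {
    nl='
'
    content=$(cat -- "$1")
    # Remove the longest suffix starting at "<newline>---<newline>".
    part1=${content%%"${nl}---${nl}"*}
    # Remove the shortest prefix ending at "<newline>---<newline>".
    part2=${content#*"${nl}---${nl}"}
}
```

Usage would be something like split_file file.txt followed by printf '%s\n' "$part1". If the file contains no delimiter line, both variables end up holding the whole content, so a real script may want to check for that case.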

A solution using sed:
foo=$(sed '/^---$/q;p' -n file.txt)
bar=$(sed '1,/^---$/b;p' -n file.txt)
The -n command line option tells sed to not print the input lines as it processes them (by default it prints them). sed runs a script for each input line it processes.
The first sed script
/^---$/q;p
contains two commands (separated by ;):
/^---$/q - quit when you reach the line matching the regex ^---$ (a line that contains exactly three dashes);
p - print the current line.
The second sed script
1,/^---$/b;p
contains two commands:
1,/^---$/b - starting with line 1 until the first line matching the regex ^---$ (a line that contains only ---), branch to the end of the script (i.e. skip the second command);
p - print the current line;

Using csplit:
csplit --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}" && sed -i '/---/d' foo_bar*
If your coreutils version is 8.22 or later, the --suppress-matched option can be used and the sed post-processing is not required:
csplit --suppress-matched --elide-empty-files --quiet --prefix=foo_bar file.txt "/---/" "{*}"

Related

How to add an empty line at the end of these commands?

I am in a situation where I have so many fastq files that I want to convert to fasta.
Since they belong to the same sample, I would like to merge the fasta files to get a single file.
I tried running these two commands:
sed -n '1~4s/^@/>/p;2~4p' INFILE.fastq > OUTFILE.fasta
cat infile.fq | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > file.fa
And the output file is a correct fasta file.
However I get a problem in the next step. When I merge files with this command:
cat $1 >> final.fasta
The final file apparently looks correct. But when I run makeblastdb it gives me the following error:
FASTA-Reader: Ignoring invalid residues at position(s): On line 512: 1040-1043, 1046-1048, 1050-1051, 1053, 1055-1058, 1060-1061, 1063, 1066-1069, 1071-1076
Looking at what's on that line I found that a file header was put at the end of the previous file sequence. And it turns out like this:
GGCTTAAACAGCATT>e45dcf63-78cf-4769-96b7-bf645c130323
So how can I add a blank line to the end of the file within the scripts that convert fastq to fasta?
So that when I merge they are placed on top of each other correctly and not at the end of the sequence of the previous file.
I would use GNU sed, replacing
cat $1 >> final.fasta
with
sed '$a\\n' $1 >> final.fasta
Explanation: the sed expression means: at the last line ($), append a newline (\n). The appended text is written out after the line itself has been printed. If you prefer GNU AWK, you can get the same behavior this way:
awk '{print}END{print ""}' $1 >> final.fasta
Note: I was unable to test any of these solutions, as you do not provide enough information to do so. I assume the line above is somewhere inside a loop and $1 is always the name of a file existing in the current working directory.
If the only thing you need is an extra blank line, and the input files are under 1.5 GB in size, then just do this directly:
awk NF=NF RS='^$' FS='\n' OFS='\n'
This should work for mawk 1/2, gawk, and nawk, and maybe others as well. It works, despite appearing not to do anything special, because the extra \n comes from ORS.
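The symptom in the question (a header glued to the end of the previous sequence) suggests that some input files simply lack a final newline, in which case a blank line is not strictly needed. As a hedged sketch of that idea, the function below merges files while making sure each one ends with a newline. The name merge_fasta and its argument order are assumptions made for illustration; the only thing it relies on is that awk terminates every record it prints with ORS.

```shell
# merge_fasta OUT IN... : concatenate FASTA files into OUT, making sure
# each input ends with a newline so the next header starts on its own line.
merge_fasta() {
    out=$1
    shift
    for f in "$@"; do
        # awk prints every record followed by ORS (a newline by default),
        # so a missing final newline is restored and nothing else changes.
        awk '1' "$f"
    done > "$out"
}
```

It could be called as merge_fasta final.fasta sample_part1.fa sample_part2.fa, where the input names are placeholders. Unlike appending a blank line, this leaves files that already end with a newline byte-for-byte unchanged.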

Why is an empty double quote appearing in the file at the last record? | shell |

I have 10 files, each containing a single column of data, which I consolidated into one file
with the data laid out horizontally
file 1 :
A
B
C
B
file 2 :
P
W
R
S
file 3 :
E
U
C
S
The remaining files look similar.
I consolidated all the files using the script below
cd /path/
#storing all file names to array_list to club data of all into one file
array_list=`( awk -F'/' '{print $2}' )`
for i in ${array_list[@]}
do
sed 's/"/""/g; s/.*/"&"/' /path/$i | paste -s -d, >> /path/consolidate.txt
done
Output obtained from above script :
"A","B","C","B"
"P","W","R","S",""
"E","U","C","S"
Why does the second line have "" as its last entry ("P","W","R","S","")
when there are only four values in file 2? It should be: "P","W","R","S"
Is it happening because of an empty line at the end of file 2?
A solution would be appreciated.
I assume it is indeed from an empty line. You could remove such 'mistakes' by
updating your script to include sed 's/,""$//' like:
sed 's/"/""/g; s/.*/"&"/' /path/$i | paste -s -d, | sed 's/,""$//' >> /path/consolidate.txt
Explanation of the above command, piece by piece
Replace each double quote with two double quotes (the g option means do this
for every match on each line, rather than just the first match):
sed 's/"/""/g;
We use a semi-colon to tell sed that we will issue another command. The next
substitute command to sed matches the entire line, and replaces it with itself,
but surrounded by double quotes (the & represents the matched pattern):
s/.*/"&"/'
This is an argument to the above sed command, expanding the variable i in the
for loop:
/path/$i
The above commands produce some output ('stdout'), which would by default be
sent to the terminal. Instead of that, we use it as input ('stdin') to a
subsequent command (this is called a 'pipeline'):
|
The next command joins the lines of 'stdin' by replacing the newline characters
with , delimiters (by default the delimiter would be a tab):
paste -s -d,
We pipe the 'stdout' of the last command into another command (continuing the
pipeline):
|
The next command is another sed, this time substituting any occurrences of
,"" that happen at the end of the line (in sed, $ means end of line) with
nothing (in effect deleting the matched pattern):
sed 's/,""$//'
The output of the above pipeline is appended to our text file (>> appends,
whilst > overwrites):
>> /path/consolidate.txt
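To check the empty-line theory, here is a small reproduction plus an alternative fix that drops empty lines before they are quoted, rather than removing the ,"" afterwards. This is a sketch only: the function names are invented for illustration, a temporary file stands in for file 2, and it assumes that empty lines never carry data you want to keep.

```shell
# Stand-in for file 2, with an empty line at the end.
tmp=$(mktemp)
printf 'P\nW\nR\nS\n\n' > "$tmp"

quote_join() {
    # Quote every line CSV-style, then join the lines with commas.
    sed 's/"/""/g; s/.*/"&"/' "$1" | paste -s -d, -
}

quote_join_skip_empty() {
    # Same, but delete empty lines first so they never become "" fields.
    sed '/^$/d; s/"/""/g; s/.*/"&"/' "$1" | paste -s -d, -
}

quote_join "$tmp"             # prints "P","W","R","S",""
quote_join_skip_empty "$tmp"  # prints "P","W","R","S"
rm -f "$tmp"
```

Filtering first also handles several empty lines in a row, or empty lines in the middle of a file, which the s/,""$// clean-up does not cover.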

MacOS SED to find a second matching line and insert lines above it

I am looking for a BASH sed script that can open an .mdx file and search line by line and find the second line that has the value I'm searching for: three hyphens like this ---. Then, I'm hoping to insert two lines of redirect information above that second set of hyphens.
The second occurrence of these three hyphens could be on any line following line 1, so I would need a script that is smart enough to search until it finds the second one.
I'll need something that runs on macOS and can do in-place file updates.
Here's my input file:
---
title: Some kind of title
---
I'd like to locate that second instance of three hyphens and insert new text above it like this:
---
title: Some kind of title
redirects:
- /some/kind/of/directory/path
---
In my shell script, I have a variable that contains that redirect path, so I would somehow need to pass that variable along with a hard-coded redirects: to sed.
I looked at a variety of options, including the POSIX option included here, but it just deletes the second occurrence. Perhaps there's an easy way I could modify that to update?
Let me know if you need more to understand what I'm looking for.
This is one way of doing it:
testpath="/some/kind/of/directory/path"
sed "s|---|redirects:\n\ \ -\ $testpath\n---|" file.mdx | sed '/---/,$!d'
This works by adding the "redirects:" lines directly above both "---" lines, then deleting everything before the first "---", which removes the top copy. This will fail miserably if there are more than two "---" lines in the file.
To do it inline:
testpath="/some/kind/of/directory/path"
sed -i .bak "s|---|redirects:\n\ \ -\ $testpath\n---|" test.txt && sed -i _bak2 '/---/,$!d' test.txt
You can find out the line number of the 2nd --- and then insert before that line.
Example (tested on macos):
$ cat file
---
title: Some kind of title
---
$ cat foo.sh
path=/some/kind/of/directory/path
n=$( sed -n '/^---$/=' file | sed -n 2p )
sed -e "$n i\\
redirects:\\
- $path
" file
$ bash foo.sh
---
title: Some kind of title
redirects:
- /some/kind/of/directory/path
---
$
(Use sed -i for updating the file in place.)
Are you hellbent on using sed for this? Generally Awk is both more versatile and more readable.
awk '/^---$/ { print; hyphens=1; next }
hyphens && /^title: / { print; print "redirects:\n - /some/kind/of/directory/path"; next }
{ hyphens=0 } 1' file.mdx >newfile.mdx
In brief, we keep track of whether the previous line was three hyphens; if it was, and the current line matches the regex ^title: , print the additional lines. Otherwise, we reset the state variable and print. (The final 1 is a common Awk idiom to avoid having to say { print } explicitly.)
Unfortunately, standard Awk has no -i option. If you can use GNU Awk, it has an option -i inplace which emulates the (also nonstandard, but common) -i option of sed. Otherwise, just write to a temporary file and move it back onto the original afterwards; that's what -i does behind the scenes, too.
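If the title line cannot be relied on, Awk can also count the --- lines and insert just before the second one, with the path passed in from the shell. This is a minimal sketch: the function name insert_redirect is hypothetical, the two-space indent before the dash follows the sed answer above, and it assumes the path contains no backslashes (awk interprets escape sequences in -v values).

```shell
# insert_redirect FILE PATH : print FILE with a redirects block inserted
# just before the second line that consists of exactly "---".
insert_redirect() {
    awk -v path="$2" '
        # Count delimiter lines; act only when the second one is reached.
        /^---$/ && ++count == 2 {
            print "redirects:"
            print "  - " path
        }
        { print }
    ' "$1"
}
```

For an in-place update, the output can go to a temporary file that is then moved over the original, for example insert_redirect file.mdx "$testpath" > file.mdx.tmp && mv file.mdx.tmp file.mdx.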

Add file content in another file after first match only

Using bash, I have this line of code that adds the content of a temp file into another file, after a specific match:
sed -i "/text_to_match/r ${tmpFile}" ${fileName}
I would like it to add the temp file content only after the FIRST match.
I tried using addresses:
sed -i "0,/text_to_match//text_to_match/r ${tmpFile}" ${fileName}
But it doesn't work, saying that "/" is an unknown command.
I can make addresses work if I use a standard replacement "s/to_replace/with_this/", but I can't make it work with this sed command.
It seems like I can't use addresses if my sed command starts with / instead of a letter.
I'm not stuck with addresses, as long as I can insert the temp file content into another file only once.
You're getting that error because if you have an address range (ADDR1,ADDR2) you can't put another address after it: sed expects a command there and / is not a command.
You'll want to use some braces here:
$ seq 20 > file
$ echo "new content" > tmpFile
$ sed '0,/5/{/5/ r tmpFile
}' file
outputs the new text only after the first line with '5'
1
2
3
4
5
new content
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
I found I needed to put a newline after the filename. I was getting this error otherwise
sed: -e expression #1, char 0: unmatched `{'
It appears that sed takes the whole rest of the line as the filename.
Probably more tidy to write
sed '0,/5/ {
/5/ r tmpFile
}' file
Full transparency: I don't use sed except for very simple tasks. In reality I would use awk for this job
awk '
{print}
!seen && $0 ~ patt {
while (getline line < f) print line
close(f)
seen = 1
}
' patt="5" f=tmpFile file
Glenn Jackman provided an excellent explanation of why the OP's attempt did not work.
Continuing from Glenn Jackman's answer: if you want to have the command on a single line, you should use branching so that the r command is at the end.
Editing commands other than {...}, a, b, c, i, r, t, w, :, and # can be followed by a <semicolon>, optional <blank> characters, and another editing command. However, when an s editing command is used with the w flag, following it with another command in this manner produces undefined results. [source: POSIX sed Standard]
The r,R,w,W commands parse the filename until end of the line. If whitespace, comments or semicolons are found, they will be included in the filename, leading to unexpected results.[source: GNU sed manual]
which gives:
sed -e '1,/pattern/{/pattern/ba};b;:a;r rfile' file
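Since r takes everything up to the end of the line as the filename, another way to stay on one command line is to end a script fragment right after the filename: sed joins multiple -e fragments with newlines. The sketch below assumes GNU sed (the 0,/re/ address is a GNU extension) and reuses the file and tmpFile names from Glenn Jackman's example inside a scratch directory.

```shell
# Work in a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d) && cd "$workdir" || exit 1
seq 20 > file
echo "new content" > tmpFile

# The first -e fragment ends right after the filename, so the newline that
# sed inserts between fragments terminates the r command; the second
# fragment closes the block.
sed -e '0,/5/{/5/r tmpFile' -e '}' file > result

sed -n '5,7p' result   # prints 5, new content, 6 on separate lines
```

Line 15 also contains a 5, but it lies outside the 0,/5/ range, so the file is inserted only once.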
GNU sed also allows s///e to shell out. So there's this one-liner using Glenn's tmpFile and file.
sed '0,/5/{//{p;s/.*/cat tmpFile/e}}' file
// to repeat the previous pattern match (helps if it's longer than /5/)
p to print the matching line
s/.*/cat tmpFile/e to empty the pattern buffer, stick the cat tmpFile shell command in there, and (with e) execute it and dump the output into the stream
You have 2 forward slashes together, right next to each other in the second sed example.

How to extract (read and delete) a line from file with a single command?

I would like to extract the first line from a file, read into a variable and delete right afterwards, with a single command. I know sed can read the first line as follows:
sed '1q' file.txt
or delete it as follows:
sed '1d' file.txt
but can I somehow do both with a single command?
The reason for this is that multiple processes will be reading the first line of the file, and I want to minimize the chances of them getting the same line.
It's impossible.
Unless you read the man page and have GNU sed:
printf '%s\n' {1..3} > input
cat input
1
2
3
sed -n '1p;2,$ Woutput' input
1
cat output
2
3
Explanation:
sed -n '1p;2,$ Woutput' input
-n no output by default
1p; print line 1
2,$ from line 2 until $ last line
W (non posix) Write buffer to file
From the man page gnu sed:
w filename
Write the current pattern space to filename.
W filename
Write the first line of the current pattern space to filename. This is a GNU extension.
However, reading and experimenting takes longer than opening the file in a full-blown office suite and deleting the line by hand, or invoking a text-to-speech framework and training it to do the job.
It doesn't work if invoked in POSIX mode:
sed -n --posix '1p;2,$ Woutput' input
And you still have the manual work of renaming output to input again.
I didn't try to write to input in place, because that could damage my carefully crafted input file; try it at your own risk:
sed -n '1p;2,$ Winput' input
However, you might set up a filesystem notification job that always renames freshly created output files to input again. But I fear you can't do it from within the sed command. Except ... (to be continued)
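For the underlying concern (several processes taking lines from the same file), a different technique from the sed W approach is to serialize the read-and-delete step with a lock. The following is only a sketch under stated assumptions: it needs the flock utility from util-linux and GNU sed's -i option, the function name pop_line and the FILE.lock naming are invented for illustration, and every process must go through the same function for the lock to mean anything.

```shell
# pop_line FILE : print the first line of FILE and remove it from FILE,
# holding an exclusive lock so concurrent callers never get the same line.
pop_line() {
    (
        # Block until this process owns the lock on file descriptor 9.
        flock -x 9 || exit 1
        # Read the first line; give up if the file is empty.
        IFS= read -r first < "$1" || [ -n "$first" ] || exit 1
        # Delete that line in place (GNU sed).
        sed -i '1d' "$1"
        printf '%s\n' "$first"
    ) 9> "$1.lock"
}
```

A worker would then call line=$(pop_line queue.txt), where queue.txt is a placeholder name. This is not a single command, but the lock makes the two steps behave as one from the point of view of the other processes.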

Resources