Using Bash to Manually Edit a Text or Fastq file

I would like to use Bash to edit a Fastq file, applying the same change to many similar lines.
In Fastq files a sequence read starts on line 2 and then appears every fourth line (i.e. lines 2, 6, 10, 14, ...).
I would like to create an edited text file that is identical to the Fastq file except that the first 6 characters of each sequencing read are trimmed off.
Unedited Fastq:
#M03017:21:000000000
GAGAGATCTCTCTCTCTCTCT
+
111>>B1FDFFF
Edited Fastq:
#M03017:21:000000000
TCTCTCTCTCTCTCT
+
111>>B1FDFFF

GNU sed can do that:
sed -i~ '2~4s/^.\{6\}//' file
The address 2~4 means "start on line 2, repeat each 4 lines".
s means replace, ^ matches the line beginning, . matches any character, \{6\} specifies the length (a "quantifier"). The replacement string is empty (//).
-i~ replaces the file in place, leaving a backup with the ~ appended to the filename.
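Since the question asks for a new edited file, a variant without -i may be closer to the goal. The following is a minimal sketch, assuming GNU sed (the first~step address form is a GNU extension). The file names sample.fastq and trimmed.fastq and the second record are made up for illustration.

```shell
# Work in a scratch directory; sample.fastq and trimmed.fastq are hypothetical names.
tmp=$(mktemp -d)

# Two records, so the "every 4th line" repetition is visible.
# The second record is invented sample data.
printf '%s\n' \
    '#M03017:21:000000000' 'GAGAGATCTCTCTCTCTCTCT' '+' '111>>B1FDFFF' \
    '#M03017:21:000000001' 'ACGTACGTACGTACGT' '+' '222>>B1FDFFF' \
    > "$tmp/sample.fastq"

# No -i: the original stays untouched and the result goes to a new file.
sed '2~4s/^.\{6\}//' "$tmp/sample.fastq" > "$tmp/trimmed.fastq"

cat "$tmp/trimmed.fastq"
```

Only lines 2 and 6 of the output differ from the input; the header, separator and quality lines are copied unchanged.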

I guess awk is perfect for this:
$ awk 'NR%4==2 {gsub(/^.{6}/,"")} 1' file
#M03017:21:000000000
TCTCTCTCTCTCTCT
+
111>>B1FDFFF
This removes the first 6 characters from every line at position 4k+2.
Explanation
NR%4==2 {} runs the action only when the record number (the line number) has the form 4k+2.
gsub(/^.{6}/,"") replaces the first 6 characters with the empty string.
1 evaluates to true, so awk prints the line.
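The same idea can be written with substr instead of a regex, which makes the trim length easy to change. This is only a sketch: the variable n and the file name sample.fastq are assumptions introduced for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' '#M03017:21:000000000' 'GAGAGATCTCTCTCTCTCTCT' '+' '111>>B1FDFFF' \
    > "$tmp/sample.fastq"

# n is a hypothetical variable holding the number of leading characters to drop.
# substr($0, n + 1) keeps everything from character n+1 onward.
trimmed=$(awk -v n=6 'NR % 4 == 2 { $0 = substr($0, n + 1) } 1' "$tmp/sample.fastq")

printf '%s\n' "$trimmed"
```

One difference to keep in mind: a read shorter than n characters becomes empty here, whereas the /^.{6}/ version leaves such a line untouched because the regex does not match.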

Related

How to refresh the line numbers in sed

How can you refresh the line numbers of a sed output inside the same sed command?
I have a sed script as follows -
#!/usr/bin/sed -f
/pattern/i #inserting a line
1~10i ####
What this does is that it inserts lines wherever the pattern is matched and then inserts #### every ten lines. The problem is that it inserts the hashes every 10 lines according to the line numbers of the original file before inserting the lines for the matching pattern. I want to refresh the line numbers after inserting the lines and use them for inserting the 4 hashes every 10 lines.
Is there any way this can be done without piping the output into a new sed?
Interesting challenge. If your file is not too large, the following may work for you (tested with GNU sed):
#!/usr/bin/sed -nEf
:a; N; $!ba
{
s/([^\n]*pattern[^\n]*\n)/#inserting a line\n\1/g
s/\n/ \n/g
s/\`/####\n/
:b
s/(.*####\n([^\n]* \n){9}[^\n]*) \n/\1\n####\n/
tb
s/ \n/\n/g
p
}
Explanations, line by line:
No print, extended RE mode (-nE).
Loop around label a to concatenate the whole file in the pattern space (reason why its size matters).
Add #inserting a line\n before each line containing pattern.
Add a space before all endline characters.
Insert ####\n before the first line.
Label b.
Append ####\n to anything followed by ####\n and 10 space-terminated lines, removing the final space (to prevent subsequent matches).
Goto b if there was a substitution.
Remove all spaces at the end of a line.
print.
Note: if your file does not contain NUL characters the -z option of GNU sed saves a few commands:
#!/usr/bin/sed -Ezf
s/([^\n]*pattern[^\n]*\n)/#inserting a line\n\1/g
s/\n/ \n/g
s/\`/####\n/
:a
s/(.*####\n([^\n]* \n){9}[^\n]*) \n/\1\n####\n/
ta
s/ \n/\n/g
Note: with the hold space we could probably do the same on the fly, instead of storing the whole file in the pattern space.
This might work for you (GNU sed):
sed -zE 's/.*pattern/# insert line\n&/mg
s/([^\n]*\n){10}/&####\n/g
s/^/####\n/' file
Slurp the file into memory.
Insert desired text before lines containing pattern.
Insert #### every 10 lines and before the first line.
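If sed is not a hard requirement, the "refreshed" numbering can also be obtained by counting output lines instead of input lines, without slurping the file. Below is a minimal sketch in awk, assuming the intent is that of 1~10i applied to the renumbered text (#### before output lines 1, 11, 21, ..., with the #### lines themselves not counted). The names emit and out are made up for this example.

```shell
# 12 sample lines; the third one contains the word "pattern".
input=$(seq 1 12 | sed '3s/$/ pattern/')

result=$(printf '%s\n' "$input" | awk '
# emit() prints one line and counts it. Before counted lines 1, 11, 21, ...
# it first prints the #### marker.
function emit(line) {
    if (out % 10 == 0) print "####"
    print line
    out++
}
/pattern/ { emit("#inserting a line") }   # inserted lines are counted too
{ emit($0) }
')

printf '%s\n' "$result"
```

Unlike the slurping sed scripts, this variant does not add a trailing #### when the number of lines is an exact multiple of 10.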

BASH: Find newlines in between text and replace with two newlines

I am looking to programmatically edit the newlines of .txt files. The desired behavior is that any single newline in between lines of text will become two newlines.
edit (clarification by #kaan): Lines separated by one newline should be separated by two newlines. Any lines that are already separated by two or more lines should be left as is.
edit (context): I am working with the .fountain syntax and an npm module called afterwriting that exports text files into a script format as a PDF. Lines of text separated by only one newline are not spaced properly when printed to PDF using the package. So I want to automatically convert single newlines into double ones, because I also don't want to have to add two newlines by hand in all of the files I am converting.
For instance an example of an input would look like:
File with text in it
A new line
Another new line
Line with three new lines above
One last new line
would become
File with text in it
A new line
Another new line
Line with three new lines above
One last new line
Any ideas of how this could be achieved in a bash script would be appreciated
This might work for you (GNU sed):
sed '/\S/b;N;//{P;b};:a;n;//!ba' file
This solution appends another line to the first empty line encountered. If the appended line is not empty, it prints the first line and bails out, thus doubling the empty line. Otherwise, if the appended line is empty, it prints them both and then prints any further empty lines until it encounters a non-empty line.
Here is a way to do it using sed:
read the whole file (since normal sed behavior will remove all newlines)
look for a word boundary (\b) followed by two newlines (\n\n – one for ending the current line, then one that's the single blank line), then one more word boundary (\b)
for any matches, add one extra newline in there.
With your sample text inside data.txt, it looks like this:
sed -n 'H; ${x; s/\b\n\n\b/\n\n\n/g; p}' < data.txt | tail -n +2
(Edit: added | tail -n +2 to remove the extra newline that's inserted at the beginning)
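Going by the clarification in the question (adjacent text lines get one blank line between them, existing gaps stay as they are), the rule can also be stated line by line: print an extra empty line whenever a non-empty line directly follows another non-empty line. A minimal awk sketch of that reading follows; the sample input and the variable prev are assumptions made for the example.

```shell
# Made-up sample: "a" and "b" are adjacent, then two blank lines, then "c" and "d".
input=$(printf 'a\nb\n\n\nc\nd')

doubled=$(printf '%s\n' "$input" | awk '
# A non-empty line that directly follows a non-empty line gets a blank line first.
NR > 1 && $0 != "" && prev != "" { print "" }
# Every input line is printed unchanged; prev remembers it for the next line.
{ print; prev = $0 }
')

printf '%s\n' "$doubled"
```

Lines that contain only spaces count as non-empty in this sketch; testing NF instead of comparing with "" would treat them as blank.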

SED's Substituted string is considered as one-line string, whereas it contains newline character

I am testing the sed command to substitute one line with 3 lines and, then, to delete the last line. (I could have substituted it with only the 2 first lines, but this is deliberately stated like this to showcase the main issue).
Let's say that I have the following text :
// ##OPTION_NAME: xxxx
I want to replace the token ##OPTION_NAME by ##OP-NAME and surround it by 2 new lines; Like so :
// ##OP-START
// ##OP-NAME: xxxx
// ##OP-END
To illustrate this, I put this text in a code.c file, and the sed commands in a sed script named script.sed.
Then, I call the following shell command :
Shell command
sed -f script.sed code.c
script.sed
# Begin by replacing patterns by their equivalents, surrounding them with ##OP-START and ##OP-END lines
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g
The problem
Now, I add another sed command in script.sed to delete the line containing ##OP-END. Surprise ! all 3 lines are removed !
# Begin by replacing patterns by their equivalents, surrounding them with ##OP-START and ##OP-END lines
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g
# Last parse; delete ##OP-END
/##OP-END/d
I tried \r\n instead of \n in the substitution command
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g, but it does not work.
I also tested on ##OP-START to see if it makes some difference,
but alas ! All 3 lines were removed too.
It seems that sed is considering it as one line !
This is not a surprise: d operates on the pattern space, not on a per-line basis. After the modification by the s command, your pattern space contains 3 lines. Its content matches the expression, so the whole pattern space is deleted.
To delete this line from the pattern space, you need to use the s command again:
s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/g
s/\n\/\/ ##OP-END//
About pattern and hold space: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html#tag_20_116_13
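The difference between the two approaches can be checked directly on the command line. This is a small sketch, assuming GNU sed (for \n in the replacement and inside brackets); the shell variables line, subst, with_d and with_s are names invented for the demonstration.

```shell
line='// ##OPTION_NAME: xxxx'
subst='s/\(.*\)##OPTION_NAME:\(.*\)/\1##OP-START\n\1##OP-NAME:\2\n\1##OP-END/'

# d discards the whole three-line pattern space, so nothing is printed.
with_d=$(printf '%s\n' "$line" | sed -e "$subst" -e '/##OP-END/d')

# A second s removes only the last embedded line (the newline plus the ##OP-END text).
with_s=$(printf '%s\n' "$line" | sed -e "$subst" -e 's/\n[^\n]*##OP-END$//')

printf 'with d: [%s]\n' "$with_d"
printf 'with s:\n%s\n' "$with_s"
```

The first variable ends up empty, while the second holds the ##OP-START and ##OP-NAME lines.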

use sed to merge lines and add comma

I found several related questions, but none of them fits what I need, and since I am a real beginner, I can't figure it out.
I have a text file with entries like this, separated by a blank line:
example entry &with/ special characters
next line (any characters)

next %*entry
more words
I would like the output to merge the lines, put a comma between them, and delete the empty lines. I.e., the example should look like this:
example entry &with/ special characters, next line (any characters)
next %*entry, more words
I would prefer sed, because I know it a little bit, but am also happy about any other solution on the linux command line.
Improved per Kent's elegant suggestion:
awk 'BEGIN{RS="";FS="\n";OFS=","}{$1=$1}7' file
which allows any number of lines per block, rather than the 2 rigid lines per block I had. Thank you, Kent. Note: The 7 is Kent's trademark... any non-zero expression will cause awk to print the entire record, and he likes 7.
You can do this with awk:
awk 'BEGIN{RS="";FS="\n";OFS=","}{print $1,$2}' file
That sets the record separator to blank lines, the field separator to newlines and the output field separator to a comma.
Output:
example entry &with/ special characters,next line (any characters)
next %*entry,more words
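The desired output in the question has a space after the comma, whereas OFS="," produces none. A small sketch, assuming that space is wanted, combines the any-number-of-lines version with OFS=", ". The file name entries.txt is made up, and the blank line between the two entries is taken from the question's description.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
    'example entry &with/ special characters' \
    'next line (any characters)' \
    '' \
    'next %*entry' \
    'more words' > "$tmp/entries.txt"

# RS="" : records are blocks separated by blank lines (paragraph mode).
# FS="\n": each line of a block is one field.
# $1=$1 : forces awk to rebuild the record using OFS.
merged=$(awk 'BEGIN { RS = ""; FS = "\n"; OFS = ", " } { $1 = $1 } 1' "$tmp/entries.txt")

printf '%s\n' "$merged"
```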
Simple sed command,
sed ':a;N;$!ba;s/\n/, /g;s/, , /\n/g' file
:a;N;$!ba;s/\n/, /g -> According to this answer, this code replaces all the newlines with ", " (comma and space).
So after running only the first command, the output would be
example entry &with/ special characters, next line (any characters), , next %*entry, more words
s/, , /\n/g -> Replacing ", , " with a newline in the above output gives the desired result.
example entry &with/ special characters, next line (any characters)
next %*entry, more words
This might work for you (GNU sed):
sed ':a;$!N;/.\n./s/\n/, /;ta;/^[^\n]/P;D' file
Append the next line to the current line and, if there are characters on either side of the newline, substitute the newline with a comma and a space, then repeat. Eventually an empty line or the end of the file is reached; at that point the first line of the pattern space is printed only if it is not empty.
Another version, a little more sophisticated (allowing for white space in the empty line), would be:
sed ':a;$!N;/^\s*$/M!s/\n/, /;ta;/\`\s*$/M!P;D' file
sed -n '1h;1!H
$ {x
s/\([^[:cntrl:]]\)\n\([^[:cntrl:]]\)/\1, \2/g
s/\(\n\)\n\{1,\}/\1/g
p
}' YourFile
This makes all the changes after loading the whole file into the buffer. It could also be done "on the fly" while reading the file, depending on whether the current line is empty or not.
use -e on GNU sed

BASH - Selective deletion

I have a file which looks like this:
Guest-List 1
All present
Guest-list 2
All present
Guest-List 3
Guest-list 4
All present
Guest-list 5
I want to remove the line containing "All present" and its title (the line just above "All present"). The desired output would be:
Guest-List 3
Guest-list 5
I am interested in implementing this using sed. Because I am a rookie, other possible solutions without sed will be appreciated as well (when answering please provide detailed explanation so I can learn) : )
(I know I can delete a line matching a regex, and could store the line above it by sending it to the hold buffer, something like this: sed '/^.*present$/d; h' ... then the "g" command would copy the hold buffer back to the pattern space... but how do I tell sed to delete that as well?)
Thanks in advance!
You can use fgrep like this:
fgrep -v -f <(fgrep 'All present' -B1 file) file
Guest-List 3
Guest-list 5
sed -n '/All present$/{s/.*//;x;d;};x;p;${x;p;}' file | sed '/^$/d'
Where file is your file.
This is an adapted example from here.
It has a great explanation:
In order to delete the line prior to the pattern, we store every line in a buffer called the hold space. Whenever the pattern matches, we delete the content present in both the pattern space, which contains the current line, and the hold space, which contains the previous line.
Let me explain this command: x;p; gets executed for every line.
x exchanges the content of the pattern space with the hold space. p prints the pattern space. As a result, every time, the current line goes to the hold space, and the previous line comes to the pattern space and gets printed. When the pattern /All present$/ matches, we empty (s/.*//) the pattern space and exchange (x) it with the hold space (as a result of which the hold space becomes empty), then delete (d) the pattern space, which contains the previous line. Hence the current and the previous line both get deleted on encountering the pattern. The ${x;p;} prints the last line, which would otherwise remain in the hold space.
The second part of sed is to remove the empty lines created by the first sed command.
If you are using more than the s, g, and p (with -n) commands in sed then you are using language constructs that became obsolete in the mid-1970s when awk was invented.
sed is an excellent tool for simple substitutions on a single line, for anything else just use awk:
$ cat file
Guest-List 1
All present
Guest-list 2
All present
Guest-List 3
Guest-list 4
All present
Guest-list 5
$ awk 'NR==FNR{ if (/All present/) {skip[FNR-1]; skip[FNR]} next} !(FNR in skip)' file file
Guest-List 3
Guest-list 5
The above just parses the file twice - first time to create an array named skip of the line numbers (FNR) you do not want output, and the second time to print the lines that are not in that array. Simple, clear, maintainable, extensible, ....
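When reading the file twice is not convenient (for example when the input comes from a pipe), a one-pass variant can buffer the previous line, much like the hold-space approach does in sed. This is a minimal sketch; prev, have and guests.txt are names invented for the example.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'Guest-List 1' 'All present' 'Guest-list 2' 'All present' \
    'Guest-List 3' 'Guest-list 4' 'All present' 'Guest-list 5' > "$tmp/guests.txt"

kept=$(awk '
/All present/ { have = 0; next }   # drop this line and forget the buffered title
have          { print prev }       # the buffered line survived, so print it
              { prev = $0; have = 1 }
END           { if (have) print prev }
' "$tmp/guests.txt")

printf '%s\n' "$kept"
```

Each line is held back until the next one shows that it was not a title followed by "All present"; the END block flushes the last buffered line.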

Resources