.cat.fastq to .cat.fasta file conversion problems - bioinformatics

I'm trying to convert FASTQ to FASTA without doing a quality filter first. When I run the conversion with the FASTX-Toolkit, it hits a low-quality base, prints an error (something like "quality score below -30") and terminates, so my converted output ends very early.
I then tried a sed solution posted earlier on this forum for converting to FASTA. The line was this:
sed -n '1~4s/^@/>/p;2~4p'
The line I entered in the terminal was:
sed -n '1~4s/^@/>/p;2~4p' Sample_As_L001_R1.cat.fastq
It produced what I wanted, but printed it directly to the terminal.
How do I get this output to go to a file instead of the terminal, and how do I specify the name of that output file? Thanks.

Redirect the output to a file with >, followed by the file name you want:
sed -n '1~4s/^@/>/p;2~4p' Sample_As_L001_R1.cat.fastq > Sample_As_L001_R1.cat.fasta
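The first~step address form (1~4, 2~4) is a GNU sed extension, so the one-liner may not work with the sed shipped on macOS or BSD. As a rough, portable sketch of the same idea, awk can pick lines by record position. It assumes strict four-line FASTQ records (no wrapped sequences); the sample reads and file names below are made up for illustration.

```shell
# Build a tiny two-read FASTQ file in a scratch directory (made-up data).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\n!!!!\n' > "$workdir/sample.fastq"

# Line 1 of every 4-line record: swap the leading "@" for ">".
# Line 2 of every record: print the sequence unchanged.
# Lines 3 and 4 (the "+" separator and the qualities) are dropped.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' "$workdir/sample.fastq" > "$workdir/sample.fasta"

cat "$workdir/sample.fasta"
```

As with the sed version, the result goes to whatever file name follows the > redirection.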

Related

Linux sed command that generates a new file on every regex match

I have the following Linux command which I am using to extract data from one very large log file.
sed -n "/<trade>/,/<\/trade>/p" Large.log > output.xml
However, the output is generated in a single file output.xml. My intention is to create a new file every time the "/<trade>/,/<\/trade>/p" is matched. Every new file will be named after the <id> tag which is inside the <trade> </trade> tags.
Something like this...
sed -n "/<trade>/,/<\/trade>/p" Large.log > "/<id>/,/<\/id>/p".xml
However, that, of course, does not work and I am not sure how to apply a regex as a naming rule.
P.S At this point, I am also not sure if I should use sed or maybe I should try achieving this with awk
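One possible direction, sketched with awk rather than sed: buffer each <trade> block and write it to a file named after its <id> once the closing tag is seen. This is only a sketch under assumptions that the question does not confirm: each <id>...</id> pair sits on a single line, ids are safe to use as file names, and trade blocks are not nested. The sample log content is invented.

```shell
# Scratch directory with a made-up log file for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
cat > Large.log <<'EOF'
noise before
<trade>
  <id>T100</id>
  <qty>5</qty>
</trade>
noise between
<trade>
  <id>T200</id>
</trade>
EOF

awk '
  /<trade>/ { buf = ""; id = ""; inside = 1 }      # start a new block
  inside {
    buf = buf $0 "\n"                              # collect every line of the block
    if (match($0, /<id>[^<]*<\/id>/))              # pull the text between <id> and </id>
      id = substr($0, RSTART + 4, RLENGTH - 9)
  }
  /<\/trade>/ && inside {
    if (id != "") {
      out = id ".xml"
      printf "%s", buf > out                       # one file per trade
      close(out)                                   # avoid running out of file handles
    }
    inside = 0
  }
' Large.log

ls
```

Calling close() after each block matters on a very large log, since awk otherwise keeps every output file open.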

How to replace the character "F" in a huge .txt file with the return command?

I have a pretty large .txt file with data (8MB) and the data lines are separated with the character F.
To analyze this data I need to replace the letter F with the Return command.
This is how my file looks:
-0.27, -0.21, 9.56, 78.86, 47.79, 0.02F0.07, -0.35, 9.47, 78.73, 47.74, 0.05F-0.20, -0.43, 10.60, 79.00, 47.79, 0.07F-0.49, -0.14, 10.44, 76.84, 47.70, 0.10.. and so on
This is how it should look:
-0.27, -0.21, 9.56, 78.86, 47.79, 0.02
0.07, -0.35, 9.47, 78.73, 47.74, 0.05
-0.20, -0.43, 10.60, 79.00, 47.79, 0.07
-0.49, -0.14, 10.44, 76.84, 47.70, 0.10
... and so on
I have both macOS and Windows available. I already tried it with Excel, but the file seems to be too large and Excel just crashes. Any advice?
Try EditPad Lite on Windows. It's a notepad-style editor that can handle big files.
You have to enable regular expressions (search->search options) for this to work. After that you can open search and replace, and replace F with \r\n (a Windows line break).
You can use TextEdit on a Mac. Use the find and replace option. It was very fast in the test I tried: I used a 5 MB file and it ran in a few seconds. Refer to the earlier Ask Different question 'How to use find and replace to replace a character with new line' to see how to enter a newline character in the find and replace option.
In MacOS, give this a try.
Using translate characters command
tr F '\n' < input.txt > output.txt
The result will be stored in a separate file. If you don't need a new file, just remove > output.txt from the command and the result will be displayed in the console instead.
Using stream editor command
sed -i '' $'s/F/\\\n/g' test.txt
The sed command does the same operation using a regex. This replaces the contents of the original file. To keep a backup of the file, give an extension as the argument to -i (e.g. -i '.backup' creates the backup file test.txt.backup).
For more info, do man tr and man sed in your mac terminal.

Substitution of substring doesn't work in bash (tried sed, ${a/b/c/})

Before writing, I of course read many other similar cases. For example, I already use #!/bin/bash instead of #!/bin/sh.
I have a very simple script that reads lines from a template file and should replace some keywords with real data. For example, the string <NAME> should be replaced with a real name. In this example I want to replace it with the word Giuseppe. I tried 2 solutions but neither works.
#!/bin/bash
#read the template and change variable information
while read LINE
do
sed 'LINE/<NAME>/Giuseppe' #error: sed: -e expression #1, char 2: extra characters after command
${LINE/<NAME>/Giuseppe} #error: WORD(*) command not found
done < template_mail.txt
(*) WORD is the first word found in the line
I am sorry if the question is too basic, but I cannot see the error and the error message is not helping.
EDIT1:
The input file should not be changed; I want to reuse it for every mail. Every time I read it, I will substitute a different name according to the recipient.
EDIT2:
Thanks to your answers I am closer to the solution. My example was a simplified case, and I also want to change other data. I want to make multiple substitutions in the same string, but BASH only seems to let me make one. In every programming language I have used, I could perform substitutions on a string, but BASH makes this very difficult for me. The following lines don't work:
CUSTOM_MAIL=$(sed 's/<NAME>/Giuseppe/' template_mail.txt) # from file it's ok
CUSTOM_MAIL=$(sed 's/<VALUE>/30/' CUSTOM_MAIL) # from variable doesn't work
I want to modify CUSTOM_MAIL several times in order to fill in several pieces of real information.
CUSTOM_MAIL=$(sed 's/<VALUE1>/value1/' template_mail.txt)
${CUSTOM_MAIL/'<VALUE2>'/'value2'}
${CUSTOM_MAIL/'<VALUE3>'/'value3'}
${CUSTOM_MAIL/'<VALUE4>'/'value4'}
What's the way?
There is no need to write the loop manually. The sed command itself runs the expression on each line of the file you give it:
sed 's/<NAME>/Giuseppe/' template_mail.txt > output_file.txt
You might need g modifier if there are more appearances of the <NAME> string on one line: s/<NAME>/Giuseppe/g
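For the multiple-substitution case in EDIT2, two things seem to be going wrong: sed treats its non-option argument as a file name, so a variable's contents have to be piped in on standard input, and an expansion written on a line by itself is executed as a command instead of being stored. A minimal sketch, assuming a made-up two-line template (the file contents and the value 30 are placeholders):

```shell
# Made-up template standing in for template_mail.txt.
workdir=$(mktemp -d)
printf 'Dear <NAME>,\nyour balance is <VALUE>.\n' > "$workdir/template_mail.txt"

# Several substitutions in a single pass: one -e per expression.
# The template file itself is never modified.
CUSTOM_MAIL=$(sed -e 's/<NAME>/Giuseppe/g' \
                  -e 's/<VALUE>/30/g' "$workdir/template_mail.txt")

# Substituting in a variable: pipe its contents into sed and
# assign the result back to the variable.
STEP=$(sed 's/<NAME>/Giuseppe/g' "$workdir/template_mail.txt")
STEP=$(printf '%s\n' "$STEP" | sed 's/<VALUE>/30/g')

printf '%s\n' "$CUSTOM_MAIL"
```

In bash itself the parameter-expansion form also works as long as the result is assigned back, for example CUSTOM_MAIL=${CUSTOM_MAIL//'<VALUE>'/30}; the doubled slash replaces every occurrence.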

bash - reading multiple input files and creating matching output files by name and sequence

I do not know much bash scripting, but I know the task I would like to do would be greatly simplified by it. I would like to test a program against expected output using many test input files.
For example, I have files named "input1.txt, input2.txt, input3.txt..." and expected output in files "output1.txt, output2.txt, output3.txt...". I would like to run my program with each of the input files and write the result to a corresponding "test1.txt, test2.txt, test3.txt...". Then I would do a "cmp output1.txt test1.txt" for each file.
So I think it would start like this.. roughly..
for i in input*;
do
./myprog.py < "$i" > someoutputthing;
done
One question I have is: how would I match the numbers in the filename? Thanks for your help.
If the input file name pattern is inputX.txt, you need to remove input from the beginning. You do not have to remove the extension, as you want to use the same for output:
output=output${i#input}
See Parameter Expansion in man bash.
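Putting the prefix-stripping expansion together with the loop from the question, the whole test run might look roughly like this. It is a sketch only: the fixture files are invented, and the myprog function (which just upper-cases its input) is a hypothetical stand-in for ./myprog.py so the example can run on its own.

```shell
# Scratch directory with made-up fixtures.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'abc\n' > input1.txt; printf 'ABC\n'  > output1.txt
printf 'xyz\n' > input2.txt; printf 'XYZ?\n' > output2.txt   # deliberately wrong expectation

# Hypothetical stand-in for ./myprog.py: upper-cases stdin.
myprog() { tr 'a-z' 'A-Z'; }

failed=""
for i in input*.txt; do
  suffix=${i#input}                    # "input1.txt" -> "1.txt"
  myprog < "$i" > "test$suffix"        # writes test1.txt, test2.txt, ...
  if cmp -s "output$suffix" "test$suffix"; then
    echo "PASS $i"
  else
    echo "FAIL $i"
    failed="$failed $i"
  fi
done
```

cmp -s stays silent and reports the comparison through its exit status, which is what the if statement checks.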

Replace last line of XML file

Looking for help creating a script that will replace the last line of an XML file with a tag. I have a few hundred files so I'm looking for something that will process them in a loop. I've managed to rename the files sequentially like this:
posts1.xml
posts2.xml
posts3.xml
etc...
to make it easier to loop through. But I have no idea how to write a script to do this. I'm open to using either Linux or Windows (but I would guess that Linux is better for this kind of task).
So if you want to append a line to every file:
sed -i '$a<YOUR_SHINY_NEW_TAG>' *xml
To replace the last line:
sed -i '$s/.*/<YOUR_SHINY_NEW_TAG>/' *xml
But do note, sed is not the ideal tool to modify xml.
XMLStarlet is a command-line toolkit for performing XML parsing and manipulations. Note that as an XML-aware toolkit, it'll respect XML structure, character encoding and entity substitution.
Check out its ed (edit) subcommand to see how to modify documents. You can wrap this in a standard bash loop.
e.g. in a doc consisting of a chain of <elem>s, you can add a following <added>5</added>:
mkdir new
for x in *.xml; do
    xmlstarlet ed -a "//elem[count(//elem)]" -t elem -n added -v 5 "$x" > "new/$x"
done
Linux way using sed:
To edit the last line of the file in place, you can use sed:
sed -i '$s_pattern_replacement_' filename
To change the whole line to "replacement" use $s_.*_replacement_. Be sure to escape any _'s in replacement with a \.
To loop over files, just use for:
for f in /path/posts*.xml; do sed -i '$s_.*_replacement_' "$f"; done
This, however, is a crude approach: sed works line by line and is not aware of the XML structure, while XML itself does not care where line breaks fall. You have to be sure the last line of each file contains exactly what you expect it to.
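One way to act on that caveat is to look at the last line before touching the file and skip anything unexpected. The sketch below assumes made-up file contents and hypothetical expected and replacement values. It writes to a temporary file and renames it instead of using sed -i, since the -i syntax differs between GNU and BSD sed.

```shell
# Scratch directory with two made-up files.
workdir=$(mktemp -d)
printf '<posts>\n<post/>\n</feed>\n' > "$workdir/posts1.xml"   # wrong closing tag
printf '<posts>\n<post/>\n'          > "$workdir/posts2.xml"   # unexpected last line

expected='</feed>'        # hypothetical: what the last line should currently be
replacement='</posts>'    # hypothetical: what it should become (no "_", "&" or "\")

for f in "$workdir"/posts*.xml; do
  last=$(tail -n 1 "$f")
  if [ "$last" = "$expected" ]; then
    # Replace only the last line, then swap the new file into place.
    sed '$s_.*_'"$replacement"'_' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  else
    echo "skipping $f: last line is '$last'" >&2
  fi
done
```

Files whose last line is not the expected one are left untouched and reported on standard error.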
It makes little to no difference whether you're on Linux, Windows or MacOS
The question is what language do you want to use?
The following is an example in C# (not optimized, but read it as pseudocode):
string rootDirectory = @"c:\myfiles";
var files = Directory.GetFiles(rootDirectory, "*.xml");
foreach (var file in files)
{
    var lines = File.ReadAllLines(file);
    lines[lines.Length - 1] = "whatever you want here";
    File.WriteAllLines(file, lines);
}
You can compile this and run it on Windows, Linux, etc..
Or you could do the same in Python.
Of course this method does not actually parse the XML,
but you just wanted to replace the last line right?
