Trim Illumina reads in a bam/sam file - bioinformatics

I have found plenty of tools for trimming reads in a fastq format, but are there any available for trimming already aligned reads?

I would personally discourage trimming reads after alignment, especially if the sequences you're trying to trim are adapter sequences.
The presence of adapter sequences will prevent your reads from aligning properly to the genome (in my experience you'll get a much lower percentage of alignments than you should). Since the alignment is already inaccurate, it is quite pointless to trim the sequences after alignment (garbage in, garbage out).
You'll be much better off trimming the fastq files before aligning them.
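For instance, adapter trimming of the raw FASTQ files with a tool like cutadapt might look like the following sketch (the adapter shown is the common Illumina TruSeq prefix used as a placeholder, and the filenames are made up; substitute your actual adapter and read files):
# trim adapters from paired-end reads before alignment (cutadapt must be installed)
cutadapt -a AGATCGGAAGAGC -A AGATCGGAAGAGC \
    -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
    raw_R1.fastq.gz raw_R2.fastq.gz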

Do you want the alignment to inform the trimming protocol, or do you just want to trim on things like quality values? One approach would be to simply convert back to FASTQ and then use any of the myriad conventional trimming tools available. You can do this with Picard:
http://picard.sourceforge.net/command-line-overview.shtml#SamToFastq
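As a sketch of that route (the filenames are placeholders, and the newer single-jar Picard invocation with key=value arguments is assumed):
# dump the aligned reads back to FASTQ, then trim with your favourite FASTQ trimmer
java -jar picard.jar SamToFastq I=aligned.bam FASTQ=reads_R1.fastq SECOND_END_FASTQ=reads_R2.fastq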

One possibility would be to use the GATK toolset, for example ClipReads. If you want to remove adaptors, you can use ReadAdaptorTrimmer. No converting back to fastq needed (documentation: http://www.broadinstitute.org/gatk/gatkdocs/).
Picard is, of course, another possibility.
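A minimal ClipReads sketch, assuming the GATK 3-style command line described in the linked documentation (check the option names and values against your GATK version; the filenames and quality threshold are placeholders):
# clip low-quality read tails directly in the BAM, no round trip through FASTQ
java -jar GenomeAnalysisTK.jar -T ClipReads \
    -R reference.fasta -I input.bam -o clipped.bam \
    -QT 10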

The scenario of trimming reads in a bam file comes up when you want to normalize the reads to the same length after you have already invested a great deal of work in the alignment. Remapping after trimming the fastq reads would waste that effort, so trimming in place from the bam file is the preferable solution.
Try reformat.sh from BBMap, which can trim reads and accepts bam as its input format:
reformat.sh in=test.bam out=test_trim.bam allowidenticalnames=t overwrite=true forcetrimright=74 sam=1.4
## The default output of reformat.sh is SAM version 1.4; however, many tools only recognize version 1.3, so the following step converts 1.4 to 1.3:
reformat.sh in=test_trim.bam out=test_trim_1.3.bam allowidenticalnames=t overwrite=true sam=1.3


Counting lines or enumerating line numbers so I can loop over them - why is this an anti-pattern?

I posted the following code and got scolded. Why is this not acceptable?
numberOfLines=$(wc -l <"$1")
for ((i=1; i<=numberOfLines; i++)); do
    lineN=$(sed -n "$i!d;p;q" "$1")
    # ... do things with "$lineN"
done
We collect the number of lines in the input file into numberOfLines, then loop from 1 to that number, pulling out the next line from the file with sed in each iteration.
The feedback I received complained that reading the same file repeatedly with sed inside the loop to get the next line is inefficient. I guess I could use head -n "$i" "$1" | tail -n 1 but that's hardly more efficient, is it?
Is there a better way to do this? Why would I want to avoid this particular approach?
The shell (and basically every programming language which is above assembly language) already knows how to loop over the lines in a file; it does not need to know how many lines there will be to fetch the next one — strikingly, in your example, sed already does this, so if the shell couldn't do it, you could loop over the output from sed instead.
The proper way to loop over the lines in a file in the shell is with while read. There are a couple of complications — commonly, you reset IFS to avoid having the shell needlessly split the input into tokens, and you use read -r to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of read, which have been retained for backward compatibility.
while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"
Besides being much simpler than your sed script, this avoids the problem that you read the entire file once to obtain the line count, then read the same file again and again in each loop iteration. With a typical modern OS, some repeated reading will be avoided thanks to caching (the disk driver keeps a buffer of recently accessed data in memory, so that reading it again will not actually require fetching it from the disk again), but the basic fact is still that reading information from disk is on the order of 1000x slower than not doing it when you can avoid it. Especially with a large file, the cache will fill up eventually, and so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of wall-clock time spent waiting for the disk to deliver the bytes you read, again and again.
In a shell script, you also want to avoid the overhead of an external process if you can. Invoking sed (or the functionally equivalent but even more expensive two-process head -n "$i" | tail -n 1) thousands of times in a tight loop will add significant overhead for any non-trivial input file. On the other hand, if the body of your loop could be done in e.g. sed or Awk instead, that's going to be a lot more efficient than a native shell while read loop, because of the way read is implemented. This is why while read is also frequently regarded as an antipattern.
And make sure you are reasonably familiar with the standard palette of Unix text processing tools: cut, paste, nl, pr, etc.
In many, many cases you should avoid looping over the lines in a shell script and use an external tool instead. There is basically only one exception to this: when the body of the loop is also significantly using built-in shell commands.
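For example, if all the loop body does is match, count, or transform lines, the whole job can typically be handed to Awk in a single pass over the file; a sketch (the pattern and action are placeholders for whatever your loop body actually does):
# one process, one pass over the file: no shell loop, no per-line sed invocations
awk '/foo/ { count++ } END { print count " lines matched" }' "$1"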
The q in the sed script is a very partial remedy for repeatedly reading the input file; and frequently, you see variations where the sed script will read the entire input file through to the end each time, even if it only wants to fetch one of the very first lines out of the file.
With a small input file, the effects are negligible, but perpetuating this bad practice just because it's not immediately harmful when the input file is small is simply irresponsible. Just don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, for a progress indicator or similar, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe stat the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of line 1/10345234 display something like line 1/approximately 10000000?) ... or use an external tool like pv.
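For instance, pv in line-counting mode gives you a running count without a separate pass to count the lines first; a sketch (pv must be installed, and note that piping into the loop runs its body in a subshell):
# -l makes pv count lines instead of bytes as they stream past
pv -l "$1" | while IFS='' read -r lineN; do
    : # do things with "$lineN"
done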
Tangentially, there is a vaguely related antipattern you want to avoid, too; you don't want to read an entire file into memory when you are only going to process one line at a time. Doing that in a for loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
Another common variation is to find the line you want to modify with grep, only so you can find it with sed ... which already knows full well how to perform a regex search by itself. (See also useless use of grep.)
# XXX FIXME: wrong
line=$(grep "foo" file)
sed -i "s/$line/thing/" file
The correct way to do this would be to simply change the sed script to contain a search condition:
sed -i '/foo/s/.*/thing/' file
This also avoids the complications when the value of $line in the original, faulty script contains something which needs to be escaped in order to actually match itself. (For example, foo\bar* in a regular expression does not match the literal text itself.)

FB_FileGets vs FB_FileRead in TwinCAT

There are two similar functions for reading files in the TwinCAT software from Beckhoff: FB_FileGets and FB_FileRead. I would appreciate it if someone could explain the differences between these functions and when each of them should be used. Do they have the same prerequisites, and are they used the same way in programs? Which one is faster when reading different file formats? Any information that makes them clearer for better programming would help.
FB_FileGets reads the file line by line. So when you call it, you always get one line of the text file as a string. The maximum length of a line is 255 characters. By using this function block it's very easy to read all lines of a file: no need for buffers and memory copying, as long as the 255-character line length limit is acceptable.
FB_FileRead reads a given number of bytes from the file. So you can read files with, for example, 65000 characters on a single line.
I would use FB_FileGets in all cases where you know that the lines are less than 255 characters and you handle the data line by line. It's very simple to use. If you have no idea of the line sizes, you need all the data at once, or the file is very big, I would use FB_FileRead.
I haven't tested it, but I think FB_FileRead is probably faster, as it just copies the bytes into a buffer. And you can read the whole file at once rather than line by line.

Converting text file with spaces between CR & LF

I've never seen this line ending before and I am trying to load the file into a database.
The lines all have a fixed width. After the CSV text which contains the data (the length varies line-by-line), there is a CR followed by multiple spaces and ending with LF. The spaces provide the padding to equalize the line width.
Line1,Data 1,Data 2,Data 3,4,5    0D 20 20 20 20 20 0A
Line2,Data 11,Data 21,Data 31,41,51    0D 20 20 20 0A
Line3,Data12,Data22,Data 32,42,52    0D 20 20 20 20 0A
I am about to handle this with a stream reader / writer in C#, but there are 40 files that come in each month and if there is a way to convert them all at once instead of one line at a time, I would rather do that.
Any thoughts?
Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.
When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
But you may not even have to go through a preprocessing step at all. You didn't indicate what database you will be using or how you plan to load the data, but many databases do include utilities to process fixed-length records. In the past, fixed-format files came with every imaginable kind of bizarre format (and contained all kinds of stuff that had to be stripped out or converted). As a result those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.
If your database doesn't have good bulk import tools, there are a number of open-source or freeware utilities already written that do pretty much exactly what you need; you can find them on GitHub and elsewhere. Examples include NPM's replace package and zzzprojects' findandreplace.
For a quick and dirty approach that lets you preview all the changes while you develop a more robust solution, many text editors can find and replace across multiple files. I've used that approach successfully in the past; for example, Notepad++'s Find in Files dialog lets you use RegEx to remove or change whatever you like in all files matching defined criteria.
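For a scriptable equivalent of that find-and-replace approach, GNU sed can strip the CR-plus-padding tail from every line in all of the monthly files at once; a sketch (assumes GNU sed, that the padding is plain spaces, and that *.txt matches your files):
# delete a carriage return plus any trailing spaces at the end of each line, editing the files in place
sed -i 's/\r *$//' *.txt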

Why is in-place editing of a file slower than making a new file?

As you can see in this answer, it seems that editing a text file in place takes much more time than creating a new file, deleting the old file, and moving a temporary file from another file system and renaming it, let alone creating a new file on the same file system and just renaming it. I was wondering what the reason behind that is.
Because when you edit a file in place you are opening the same file for both writing and reading, whereas when you use another file you only read from one file and write to another.
When you open a file for reading, its contents are moved from disk into memory. Later, when you edit the file, you change its contents on disk, so the copy you have in memory has to be updated to prevent data inconsistency. But when you use a new file, you don't have to update the contents of the first file in memory: you just read the whole file once and write the other file once, and update nothing. Removing a file also takes very little time, because you only remove it from the file system and don't write any bits to the file's location on disk. The same goes for renaming. Moving can also be done very fast depending on the file system, though most likely not as fast as removing and renaming.
There is also another more important reason.
When you remove the numbers from the beginning of the first line, all of the following characters have to be shifted back a little. Then when you remove the numbers from the second line, all of the characters after that point have to be shifted back again, because the characters have to be consecutive. If you only wanted to change some characters without changing the length, editing in place would have been a lot faster. But since you are changing the length of the file on each removal, all of the following characters have to be shifted, and that takes a lot of time. It's not exactly like this, and it's much more complicated depending on the implementation of your operating system and your file system, but this is the idea behind it. It's like an array operation: when you remove a single element from an array you have to shift all of the elements after it, because it is an array. If you were removing an element from a linked list you wouldn't need to shift the other elements, but files are implemented more like arrays, so that is what happens.
While tgwtdt's answer may give a few good insights, it does not explain everything. Here is a counterexample on a 140 MB file:
$ time sed 's/a/b/g' data > newfile
real 0m2.612s
$ time sed -i -- 's/a/b/g' data
real 0m9.906s
Why is this a counterexample, you may ask? Because I replace a with b, which means the replacement text has the same length. Thus no data needs to be moved, yet the in-place edit still took about four times longer.
So while tgwtdt gave good reasoning for why editing in place usually takes longer, this is a question that cannot be answered 100% for the general case, because it is implementation dependent.

True in-place file editing using GNU tools

I have a very large (multiple gigabytes) file that I want to do simple operations on:
Add 5-10 lines in the end of the file.
Add 2-3 lines in the beginning of the file.
Delete a few lines in the beginning, up to a certain substring. Specifically, I need to traverse the file up to a line that says "delete me!\n" and then delete all lines in the file up to and including that line.
I'm struggling to find a tool that can do the editing in place, without creating a temporary file (very long task) that has essentially a copy of my original file. Basically, I want to minimize the number of I/O operations against the disk.
Both sed -i and awk -i do exactly that slow thing (https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands) and are inefficient as a result. What's a better way?
I'm on Debian.
Adding 5-10 lines at the beginning of a multi-GB file will always require fully rewriting the contents of that file, unless you're using an OS and filesystem that provides nonstandard syscalls. (You can avoid needing multiple GB of temporary space by writing back to a point in the file you're modifying from which you've already read to a buffer, but you can't avoid needing to rewrite everything past the point of the edit).
This is because UNIX only permits new content that changes a file's overall size to be added at or past its existing end. You can edit part of a file in-place -- that is to say, you can seek 1GB in and write 1MB of new contents -- but this changes the 1MB of contents that had previously been in that location; it doesn't change the total size of the file. Similarly, you can truncate and rewrite a file at a location of your choice, but everything past the point of truncation needs to be rewritten.
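To illustrate what editing part of a file in place looks like at the byte level, here is a sketch using dd (conv=notrunc keeps the file's size unchanged; the filenames and offsets are made up):
# overwrite 1 MiB of bigfile, starting 1 GiB in, without truncating or growing the file
dd if=patch.bin of=bigfile bs=1M seek=1024 count=1 conv=notrunc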
An example of the nonstandard operations referred to above is the FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE operations, which on very new Linux kernels allow blocks to be inserted into or removed from an existing file. This is unlikely to be helpful to you here:
Only whole blocks (i.e. 4KB -- whatever your filesystem is formatted for) can be inserted or removed, not individual lines of text of arbitrary size.
Only XFS and ext4 are supported.
See the documentation for fallocate(2).
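For reference, the fallocate(1) utility from util-linux exposes these operations from the shell; a sketch (the offset and length must be multiples of the filesystem block size, the filename is a placeholder, and support depends on your kernel and filesystem):
# remove one 4 KiB block starting at offset 4096, shrinking the file
fallocate --collapse-range --offset 4096 --length 4096 bigfile
# insert one 4 KiB block of zeroes at offset 4096, growing the file
fallocate --insert-range --offset 4096 --length 4096 bigfile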
Here is a recommendation for editing large files (adjust the line count and number of suffix digits based on your file's length and the number of sections to work on):
split -l 1000 -a 4 -d bigfile bigfile_
For that you need free space, since bigfile won't be removed.
Insert the header as the first line:
sed -i '1iheader' bigfile_0000
Search for a specific pattern, get the file name, and remove the previous sections:
grep pattern bigfile_*
etc.
Once all editing is done, just cat back the remaining pieces
cat bigfile_* > edited_bigfile
