How to retain only reads of a certain size in a BAM file? - bioinformatics

I have BAM files containing reads of different lengths. I want to create from each of them another BAM file that contains only reads smaller than a certain length N, and I should still be able to run commands such as samtools stats and samtools idxstats on it. Is there a way to do this?

Use bamutils:
bamutils filter filename.bam output.bam -maxlen N
From the docs:
Removes reads from a BAM file based on criteria
Given a BAM file, this script will only allow reads that meet filtering
criteria to be written to output. The output is another BAM file with the
reads not matching the criteria removed.
...
Currently, the available filters are:
-minlen val Remove reads that are smaller than {val}
-maxlen val Remove reads that are larger than {val}
SEE ALSO:
Filtration Of Reads With Length Lower Than 30 From Bam: https://www.biostars.org/p/92889/
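If you would rather do the same filtering programmatically, here is a minimal Java sketch using the htsjdk library (an assumption on my part, not mentioned above); the file names and the cutoff of 50 are purely illustrative. Since the header and read order are preserved, you can index the output and run samtools stats/idxstats on it as usual.

import java.io.File;
import htsjdk.samtools.SAMFileHeader;
import htsjdk.samtools.SAMFileWriter;
import htsjdk.samtools.SAMFileWriterFactory;
import htsjdk.samtools.SAMRecord;
import htsjdk.samtools.SamReader;
import htsjdk.samtools.SamReaderFactory;

public class FilterByReadLength {
    public static void main(String[] args) throws Exception {
        int maxLen = 50; // illustrative cutoff: keep reads shorter than this
        try (SamReader reader = SamReaderFactory.makeDefault().open(new File("filename.bam"))) {
            SAMFileHeader header = reader.getFileHeader();
            try (SAMFileWriter writer =
                     new SAMFileWriterFactory().makeBAMWriter(header, true, new File("output.bam"))) {
                for (SAMRecord rec : reader) {
                    if (rec.getReadLength() < maxLen) {
                        writer.addAlignment(rec); // keep only reads shorter than maxLen
                    }
                }
            }
        }
    }
}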

Related

Problems with degradome sequencing results

I am a beginner in NGS analysis.
I have some cleaned degradome-seq files in FASTA format, provided by a company. I found that the read lengths in the FASTA files vary between 45 and 47 nt.
I used Trimmomatic to remove adapters, but it removed all reads. How could that be possible?
Could the length of degradome-seq reads be more than 25 nt?
Thanks.
FastQC report

Spring Batch Two Files With Different Structure

I have a project in Spring Batch where I must read from two .txt files: one has many lines and the other is a control file containing the number of lines that should be read from the first file. I know that I must use partitioning to process these files, because the first one is very large and I need to divide it and be able to restart the job if it fails, but I don't know how the reader should handle these files, since the two files do not have the same line width.
Neither file has a header or a separator in its lines, so I have to obtain the fields by position (a range), mainly in the first one.
One of my doubts is whether I should read both files with the same reader, and how I should configure the reader's FixedLengthTokenizer and DefaultLineMapper to handle both files if I do use the same reader.
These are examples of the input file and the control file
- input file
09459915032508501149343120020562580292792085100204001530012282921883101
(the txt file can contain up to 50,000 lines)
- control file
00128*
It only has one line
Thanks!
I must read from two .txt files, one has many lines and the other is a control file that has the number of lines that should be read from the first file
Here is a possible way to tackle your use case:
Create a first step (tasklet) that reads the control file and puts the number of lines to read into the job execution context (to share it with the next step).
Create a second step (chunk-oriented) with a step-scoped reader that is configured to read only the number of lines calculated by the first step (get the value from the job execution context). A rough sketch of both steps follows below the link.
You can read more about sharing data between steps here: https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/common-patterns.html#passingDataToFutureSteps
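This sketch assumes Spring Batch 4.x with Java configuration; the bean names, file paths, the "lines.to.read" key, the field names and the column ranges are illustrative placeholders, not taken from your project. The tasklet writes the count to the step execution context, an ExecutionContextPromotionListener (attached to step 1) promotes it to the job context, and the step-scoped reader picks it up via maxItemCount.

import java.nio.file.Files;
import java.nio.file.Paths;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.batch.item.file.transform.FixedLengthTokenizer;
import org.springframework.batch.item.file.transform.Range;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class TwoFileJobConfig {

    // Step 1: read the single control line (e.g. "00128*") and store the count
    // in the step execution context.
    @Bean
    public Tasklet controlFileTasklet() {
        return (StepContribution contribution, ChunkContext chunkContext) -> {
            String line = Files.readAllLines(Paths.get("control.txt")).get(0);
            int linesToRead = Integer.parseInt(line.replace("*", "").trim());
            chunkContext.getStepContext().getStepExecution()
                    .getExecutionContext().putInt("lines.to.read", linesToRead);
            return RepeatStatus.FINISHED;
        };
    }

    // Register this listener on step 1 so "lines.to.read" is promoted to the job context.
    @Bean
    public ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] {"lines.to.read"});
        return listener;
    }

    // Step 2: a step-scoped fixed-length reader that stops after the number of
    // lines calculated by step 1.
    @Bean
    @StepScope
    public FlatFileItemReader<FieldSet> dataFileReader(
            @Value("#{jobExecutionContext['lines.to.read']}") Integer linesToRead) {
        FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
        tokenizer.setNames("field1", "field2");                    // illustrative field names
        tokenizer.setColumns(new Range(1, 10), new Range(11, 73)); // illustrative ranges

        DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

        FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
        reader.setName("dataFileReader");
        reader.setResource(new FileSystemResource("input.txt"));
        reader.setLineMapper(lineMapper);
        reader.setMaxItemCount(linesToRead); // read only this many lines
        return reader;
    }
}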

True in-place file editing using GNU tools

I have a very large (multiple gigabytes) file that I want to do simple operations on:
Add 5-10 lines in the end of the file.
Add 2-3 lines in the beginning of the file.
Delete a few lines in the beginning, up to a certain substring. Specifically, I need to traverse the file up to a line that says "delete me!\n" and then delete all lines in the file up to and including that line.
I'm struggling to find a tool that can do the editing in place, without creating a temporary file (very long task) that has essentially a copy of my original file. Basically, I want to minimize the number of I/O operations against the disk.
Both sed -i and awk -i do exactly that slow thing (https://askubuntu.com/questions/20414/find-and-replace-text-within-a-file-using-commands) and are inefficient as a result. What's a better way?
I'm on Debian.
Adding even a few lines at the beginning of a multi-GB file will always require fully rewriting the contents of that file, unless you're using an OS and filesystem that provides nonstandard syscalls. (You can avoid needing multiple GB of temporary space by writing back to a point in the file you're modifying from which you've already read to a buffer, but you can't avoid needing to rewrite everything past the point of the edit.)
This is because UNIX only permits adding new contents to a file in a manner that changes its overall size at or past its existing end. You can edit part of a file in-place -- that is to say, you can seek 1GB in and write 1MB of new contents -- but this changes the 1MB of contents that had previously been in that location; it doesn't change the total size of the file. Similarly, you can truncate and rewrite a file at a location of your choice, but everything past the point of truncation needs to be rewritten.
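To illustrate the overwrite-in-place point, here is a minimal Java sketch (the path, offset and contents are illustrative): seeking into an existing file and writing replaces the bytes at that position without changing the file's size, which is exactly why insertions and deletions cannot be done this way.

import java.io.RandomAccessFile;

public class OverwriteInPlace {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile f = new RandomAccessFile("bigfile", "rw")) {
            long before = f.length();
            f.seek(1024L * 1024L * 1024L);            // jump 1 GiB into the file
            f.write("new contents\n".getBytes());     // overwrites the bytes already at that offset
            // true as long as we stayed inside the existing file: the size is unchanged
            System.out.println(before == f.length());
        }
    }
}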
Examples of the nonstandard operations referred to above are FALLOC_FL_INSERT_RANGE and FALLOC_FL_COLLAPSE_RANGE, which on sufficiently new Linux kernels allow blocks to be inserted into or removed from an existing file. This is unlikely to be helpful to you here:
Only whole blocks (i.e. 4 KB, or whatever your filesystem is formatted for) can be inserted or removed, not individual lines of text of arbitrary size.
Only XFS and ext4 are supported.
See the documentation for fallocate(2).
Here is a recommendation for editing large files (adjust the number of lines per piece and the number of suffix digits based on your file length and how many sections you need to work on):
split -l 1000 -a 4 -d bigfile bigfile_
For that you need extra space, since bigfile is not removed.
Insert a header as the first line:
sed -i '1iheader' bigfile_0000
Search for a specific pattern, get the file name, and remove the earlier pieces:
grep pattern bigfile_*
etc.
Once all editing is done, just cat back the remaining pieces
cat bigfile_* > edited_bigfile

Is it possible to know the serial number of the block of input data on which the map function is currently working?

I am a novice in Hadoop and I have the following questions:
(1) As I understand it, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
(2) Is there any way to learn, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block, starting from the first block of the input text.
(3) Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
Any help would be appreciated.
As I understand it, the original input file is split into several blocks and distributed over the network. Does a map function always execute on a block in its entirety? Could there be more than one map function executing on data in a single block?
No. A block (a split, to be precise) gets processed by only one mapper.
Is there any way to learn, from within the map function, which section of the original input text the mapper is currently working on? I would like to get something like a serial number, for instance, for each block, starting from the first block of the input text.
You can get some valuable info, such as the file containing the split's data and the position in the file of the first byte to process, with the help of the FileSplit class. You might find it helpful.
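For example, in the new mapreduce API something like the following (the class name is illustrative) gives you the split's file path, starting byte offset and length from within the mapper:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) {
        // The input split this mapper instance is processing.
        FileSplit split = (FileSplit) context.getInputSplit();
        System.out.println("file   = " + split.getPath());
        System.out.println("start  = " + split.getStart());  // byte offset of the split in the file
        System.out.println("length = " + split.getLength()); // number of bytes in the split
    }
}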
Is it possible to make the splits of the input text in such a way that each block has a predefined word count? If possible then how?
You can do that by extending the FileInputFormat class. To begin with, you could do this:
In your getSplits() method, maintain a counter. As you read the file line by line, keep tokenizing each line; for each token, increase the counter by 1. Once the counter reaches the desired value, emit the data read up to that point as one split, then reset the counter and start the next split.
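A rough sketch of that idea, assuming uncompressed text input on HDFS; the class name and the words-per-split value are illustrative, and split boundaries fall on line ends rather than exactly on the Nth word:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.LineReader;

public class WordCountSplitInputFormat extends TextInputFormat {

    private static final int WORDS_PER_SPLIT = 10000; // illustrative target

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<>();
        for (FileStatus status : listStatus(job)) {
            Path path = status.getPath();
            FileSystem fs = path.getFileSystem(job.getConfiguration());
            try (FSDataInputStream in = fs.open(path)) {
                LineReader reader = new LineReader(in, job.getConfiguration());
                Text line = new Text();
                long splitStart = 0; // byte offset where the current split begins
                long pos = 0;        // byte offset after the last line read
                int words = 0;       // words accumulated in the current split
                int read;
                while ((read = reader.readLine(line)) > 0) {
                    pos += read;
                    String s = line.toString().trim();
                    if (!s.isEmpty()) {
                        words += s.split("\\s+").length;
                    }
                    if (words >= WORDS_PER_SPLIT) {
                        splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
                        splitStart = pos;
                        words = 0;
                    }
                }
                if (pos > splitStart) { // remainder of the file
                    splits.add(new FileSplit(path, splitStart, pos - splitStart, null));
                }
            }
        }
        return splits;
    }
}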
HTH
If you define a small maximum split size, you can actually have multiple mappers processing a single HDFS block (say a 32 MB max split for a 128 MB block size: you'll get 4 mappers working on the same HDFS block). With the standard input formats, you'll typically never see two or more mappers processing the same part of the block (the same records).
MapContext.getInputSplit() can usually be cast to a FileSplit, and then you have the Path, offset and length of the file / block being processed.
If your input files are true text files, then you can use the method suggested by Tariq, but note that this is highly inefficient for larger data sources, as the job client has to process each input file to discover the split locations (so you end up reading each file twice). If you really only want each mapper to process a set number of words, you could run a job to re-format the text files into sequence files (or another format) and write the records to disk with a fixed number of words per file (using MultipleOutputs to get a file per number of words, but this again is inefficient). Maybe if you shared the use case for why you want a fixed number of words, we could better understand your needs and come up with alternatives.

Custom input splits for streaming the data in MapReduce

I have a large data set that is ingested into HDFS as sequence files, with the key being the file metadata and the value the entire file contents. I am using SequenceFileInputFormat, and hence my splits are based on the sequence file sync points.
The issue I am facing is that when I ingest really large files, I basically load the entire file into memory in the mapper/reducer, since the value is the entire file content. I am looking for ways to stream the file contents while retaining the sequence file container. I even thought about writing custom splits, but I am not sure how I would retain the sequence file container.
Any ideas would be helpful.
The custom split approach is not suitable for this scenario, for the following two reasons.
1) The entire file gets loaded onto the map node because the map function needs the entire file (value = entire content). If you split the file, the map function receives only a partial record (value) and it would fail.
2) The sequence file container is probably treating your file as a 'single record' file, so it would have at most one sync point, right after the header. So even if you retain the sequence file container's sync points, the whole file still gets loaded onto the map node, just as it is being loaded now.
I had the same concern about losing the sequence file sync points when writing a custom split. I was thinking of modifying the SequenceFileInputFormat/record reader to return chunks of the file contents instead of the entire file, but to return the same key for every chunk.
The chunking strategy would be similar to how file splits are calculated in MapReduce.
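Here is a rough sketch of that record-reader idea (class names and the chunk size are illustrative). Note that, as written, it still materialises each record once inside the reader before slicing it, so it mainly keeps the map function from having to hold the whole file at once; a truly streaming version would need to read the value bytes directly from the underlying sequence file.

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;

public class ChunkingSequenceFileInputFormat
        extends SequenceFileInputFormat<Text, BytesWritable> {

    private static final int CHUNK_SIZE = 8 * 1024 * 1024; // 8 MB chunks, illustrative

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new ChunkingRecordReader();
    }

    static class ChunkingRecordReader extends RecordReader<Text, BytesWritable> {
        private final SequenceFileRecordReader<Text, BytesWritable> delegate =
                new SequenceFileRecordReader<>();
        private Text currentKey;
        private byte[] currentValue;
        private int offset; // next unread byte of the current record's value
        private BytesWritable chunk;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            delegate.initialize(split, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            // If the current record is exhausted, pull the next one from the delegate.
            while (currentValue == null || offset >= currentValue.length) {
                if (!delegate.nextKeyValue()) {
                    return false;
                }
                currentKey = new Text(delegate.getCurrentKey());
                BytesWritable v = delegate.getCurrentValue();
                currentValue = java.util.Arrays.copyOf(v.getBytes(), v.getLength());
                offset = 0;
            }
            // Emit the next slice of the value under the same key.
            int len = Math.min(CHUNK_SIZE, currentValue.length - offset);
            chunk = new BytesWritable(java.util.Arrays.copyOfRange(
                    currentValue, offset, offset + len));
            offset += len;
            return true;
        }

        @Override public Text getCurrentKey() { return currentKey; }
        @Override public BytesWritable getCurrentValue() { return chunk; }
        @Override public float getProgress() throws IOException, InterruptedException {
            return delegate.getProgress();
        }
        @Override public void close() throws IOException { delegate.close(); }
    }
}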
