I am a beginner in NGS analysis.
I have some cleaned degradome-seq files in FASTA format, provided by a company. I found that the read lengths in these FASTA files vary between 45 and 47 nt.
I used Trimmomatic to remove adapters, but it removed all the reads. How could that be possible?
Could degradome-seq reads be longer than 25 nt?
Thanks.
[FastQC report attached]
I have some BAM files that contain reads of different lengths. I want to create, from each one, another BAM file that contains only reads smaller than a certain length N. I should be able to run commands like samtools stats and idxstats on it. Is there a way to do this?
Use bamutils:
bamutils filter filename.bam output.bam -maxlen N
From the docs:
Removes reads from a BAM file based on criteria
Given a BAM file, this script will only allow reads that meet filtering
criteria to be written to output. The output is another BAM file with the
reads not matching the criteria removed.
...
Currently, the available filters are:
-minlen val Remove reads that are smaller than {val}
-maxlen val Remove reads that are larger than {val}
SEE ALSO:
Filtration Of Reads With Length Lower Than 30 From Bam: https://www.biostars.org/p/92889/
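If bamutils is not available, a roughly equivalent approach (a sketch, assuming a modern samtools that auto-detects SAM on stdin, with N=30 as a placeholder) is to filter on the length of the SEQ field with awk:
samtools view -h input.bam | awk '/^@/ || length($10) < 30' | samtools view -b -o filtered.bam -
The /^@/ clause passes the header lines through untouched; every other line is kept only if its sequence is shorter than 30 bases.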
I have a Spring Batch project where I must read from two .txt files: one has many lines, and the other is a control file that holds the number of lines that should be read from the first file. I know that I must use partitioning to process these files, because the first one is very large and I need to divide it and be able to restart it in case it fails, but I don't know how the reader should handle these files, since the two files do not have the same line width.
Neither file has a header or separator in its lines, so I have to extract the fields by position (column ranges), mainly in the first one.
One of my doubts is whether I should read both files in the same reader, and if so, how I should configure the reader's FixedLengthTokenizer and DefaultLineMapper to handle both files.
These are examples of the input file and the control file
- input file
09459915032508501149343120020562580292792085100204001530012282921883101
(the .txt file can contain up to 50,000 lines)
- control file
00128*
It only has one line
Thanks!
I must read from two .txt files, one has many lines and the other is a control file that has the number of lines that should be read from the first file
Here is a possible way to tackle your use case:
Create a first step (a tasklet) that reads the control file and puts the number of lines to read in the job execution context (to share it with the next step)
Create a second step (chunk-oriented) with a step-scoped reader that is configured to read only the number of lines determined by the first step (getting the value from the job execution context)
You can read more about sharing data between steps here: https://docs.spring.io/spring-batch/docs/4.2.x/reference/html/common-patterns.html#passingDataToFutureSteps
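A minimal sketch of that idea in Java config (assuming Spring Batch 4.x; the file names, the "linesToRead" key, the field names and the column ranges are placeholders to adapt to your layout):

import java.nio.file.Files;
import java.nio.file.Paths;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.mapping.DefaultLineMapper;
import org.springframework.batch.item.file.mapping.PassThroughFieldSetMapper;
import org.springframework.batch.item.file.transform.FieldSet;
import org.springframework.batch.item.file.transform.FixedLengthTokenizer;
import org.springframework.batch.item.file.transform.Range;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.core.io.FileSystemResource;

// inside a @Configuration class
@Bean
public Tasklet controlFileTasklet() {
    return (contribution, chunkContext) -> {
        // The control file has a single line like "00128*": strip the trailing '*' and parse the count
        String line = Files.readAllLines(Paths.get("control.txt")).get(0).trim();
        int linesToRead = Integer.parseInt(line.replace("*", ""));
        // Put the count in the job execution context so the next step can see it
        chunkContext.getStepContext().getStepExecution()
                .getJobExecution().getExecutionContext()
                .putInt("linesToRead", linesToRead);
        return RepeatStatus.FINISHED;
    };
}

@Bean
@StepScope
public FlatFileItemReader<FieldSet> inputFileReader(
        @Value("#{jobExecutionContext['linesToRead']}") Integer linesToRead) {
    // Fixed-width lines with no separator: fields are defined purely by column ranges
    FixedLengthTokenizer tokenizer = new FixedLengthTokenizer();
    tokenizer.setNames("fieldA", "fieldB");                    // placeholder field names
    tokenizer.setColumns(new Range(1, 9), new Range(10, 71));  // placeholder ranges

    DefaultLineMapper<FieldSet> lineMapper = new DefaultLineMapper<>();
    lineMapper.setLineTokenizer(tokenizer);
    lineMapper.setFieldSetMapper(new PassThroughFieldSetMapper());

    FlatFileItemReader<FieldSet> reader = new FlatFileItemReader<>();
    reader.setResource(new FileSystemResource("input.txt"));
    reader.setLineMapper(lineMapper);
    reader.setMaxItemCount(linesToRead); // stop after the count read from the control file
    return reader;
}

Note that each file gets its own reader here, so there is no need to make one FixedLengthTokenizer handle two different line widths.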
Is there a way to read the bin and mark files in ClickHouse?
Every column has a .bin and a .mrk file in the ClickHouse data directory. I want to read them for a better understanding.
Reading a mark file is quite easy, while bin files are more complicated: they consist of compressed blocks (several compression codecs are used), and inside each block the raw binary values are simply laid out one after another. The best option for reading them is ClickHouse itself.
For more detail, I recommend reading the sources.
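For the mark file, a minimal sketch (assuming the older non-adaptive .mrk layout, where each mark is a pair of little-endian UInt64 values: the offset of the compressed block in the .bin file and the offset inside the decompressed block; .mrk2 files for adaptive granularity append a third value with the rows count):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MrkDump {
    public static void main(String[] args) throws IOException {
        // args[0]: path to a column's mark file inside a part directory, e.g. .../<part>/column.mrk
        ByteBuffer buf = ByteBuffer.wrap(Files.readAllBytes(Paths.get(args[0])))
                                   .order(ByteOrder.LITTLE_ENDIAN);
        int mark = 0;
        while (buf.remaining() >= 16) {
            long offsetInCompressedFile = buf.getLong();    // where the compressed block starts in the .bin file
            long offsetInDecompressedBlock = buf.getLong(); // offset of the granule inside the decompressed block
            System.out.printf("mark %d: compressed=%d decompressed=%d%n",
                    mark++, offsetInCompressedFile, offsetInDecompressedBlock);
        }
    }
}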
There are two similar function blocks for reading files in TwinCAT software from Beckhoff: FB_FileGets and FB_FileRead. I would appreciate it if someone could explain the differences between these function blocks and clarify when to use each of them. Do they have the same prerequisites, and are they used the same way in programs? Which one is faster (for reading different file formats)? Any information that makes them clearer for better programming would help.
FB_FileGets reads the file line by line, so each call returns one line of the text file as a string. The maximum length of a line is 255 characters. With this function block it's very easy to read all the lines of a file: no buffers or memory copying are needed, as long as the 255-character line limit is acceptable.
FB_FileRead reads a given number of bytes from the file, so you can read files with, for example, 65000 characters on a single line.
I would use FB_FileGets in all cases where you know that the lines are shorter than 255 characters and you handle the data line by line; it's very simple to use. If you have no idea of the line sizes, need all the data at once, or the file is very big, I would use FB_FileRead.
I haven't tested it, but I think FB_FileRead is probably faster, as it just copies bytes into a buffer, and you can read the whole file at once instead of line by line.
I have found plenty of tools for trimming reads in fastq format, but are there any available for trimming already-aligned reads?
I would personally discourage trimming of reads after aligning your reads especially if the sequences you're trying to trim are adapter sequences.
The presence of these adapter sequences will prevent your reads from aligning properly to the genome (in my experience, you'll get a much lower percentage of alignments than you should). Since your alignment is already inaccurate, it is quite pointless to trim the sequences after alignment (garbage in, garbage out).
You'll be much better off trimming the fastq files before aligning them.
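For example, with Trimmomatic in single-end mode (a sketch; the jar name, adapter file and thresholds are placeholders to adapt to your data):
java -jar trimmomatic-0.39.jar SE input.fastq trimmed.fastq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 MINLEN:20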
Do you want the alignment to inform the trimming protocol, or do you want to trim on things like quality values? One approach would be to simply convert back to FASTQ and then use any of the myriad conventional trimming options available. You can do this with Picard:
http://picard.sourceforge.net/command-line-overview.shtml#SamToFastq
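For example (a sketch, assuming a current Picard distribution; the input/output names are placeholders):
java -jar picard.jar SamToFastq I=aligned.bam FASTQ=reads.fastq
For paired-end data, add SECOND_END_FASTQ= for the second mate.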
One possibility would be to use the GATK toolset, for example ClipReads. If you want to remove adapters, you can use ReadAdaptorTrimmer. No converting back to fastq is needed (documentation: http://www.broadinstitute.org/gatk/gatkdocs/).
Picard is, of course, another possibility.
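For GATK 3, a ClipReads invocation looked roughly like this (a sketch; flag names can differ between versions, so check the documentation for yours):
java -jar GenomeAnalysisTK.jar -T ClipReads -R reference.fasta -I input.bam -o clipped.bam -QT 10 -CR SOFTCLIP_BASES
Here -QT clips low-quality tails and -CR controls how the clipping is represented in the output BAM.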
A scenario where you would trim reads in a BAM file is when you want to normalize the reads to the same length after you have already performed a large amount of alignment work. Remapping after trimming the fastq reads is not energy efficient; trimming the reads in place in the BAM file is the preferable solution.
Please try reformat.sh from BBMap, which can trim reads and accepts BAM as an input format.
reformat.sh in=test.bam out=test_trim.bam allowidenticalnames=t overwrite=true forcetrimright=74 sam=1.4
## the default output format of reformat is SAM 1.4; however, many tools only recognize version 1.3, so the following step converts 1.4 to 1.3.
reformat.sh in=test_trim.bam out=test_trim_1.3.bam allowidenticalnames=t overwrite=true sam=1.3