Binning sequence reads by GC content [closed] - bioinformatics

I would like to "bin" (split into separate files) a multi-fasta nucleotide sequence file (e.g. a Roche-454 run of ~500,000 reads average read length 250bp). I would like the bins based on GC content of each read.
The resultant output would be 8 multi-fasta files:
<20% GC content
21-30% GC content
31-40% GC content
41-50% GC content
51-60% GC content
61-70% GC content
71-80% GC content
>80% GC content
Does anyone know of a script or program that does this already?
If not, can someone suggest how to sort the multi-FASTA file by GC content (so that I can then split it into the relevant bins)?

In R / Bioconductor, the tasks would be: (a) load the appropriate library; (b) read the FASTA file; (c) calculate nucleotide usage and GC %; (d) cut the data into bins; and (e) output the original data into separate files. Along the lines of:
## load
library(ShortRead)
## input
fa = readFasta("/path/to/fasta.fasta")
## gc content. 'abc' is a matrix, [, c("G", "C")] selects two columns
abc = alphabetFrequency(sread(fa), baseOnly=TRUE)
gc = rowSums(abc[,c("G", "C")]) / rowSums(abc)
## cut gc content into bins; breaks chosen to match the question's 8 bins
bin = cut(gc, c(0, seq(.2, .8, .1), 1))
## output, [bin == lvl] selects the reads whose 'bin' value is lvl
for (lvl in levels(bin)) {
    fileName = sprintf("%s.fasta", lvl)
    writeFasta(fa[bin == lvl], file=fileName)
}
To get going with R / Bioconductor, see http://bioconductor.org/install. The memory requirements for 454 data of the size indicated are not too bad, and the script here will be fairly speedy (e.g., 7 s for 260k reads).

I suggest using Python with Biopython or Perl with BioPerl to read the FASTA files. There is a script that calculates the GC content of sequences in BioPerl here, and Biopython has a function for it. Then simply store the GC content for each sequence in a dictionary or hash, go through each, and write them into a file depending on how high the GC content is.
Do you need any more specific help?
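For instance, a minimal Biopython sketch (file names and bin labels are made up for illustration; the boundaries follow the question's bins, and GC is computed by hand to stay version-agnostic):
from Bio import SeqIO

def gc_percent(seq):
    s = str(seq).upper()
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin_label(gc):
    # bins follow the question: <20, 21-30, ..., 71-80, >80
    if gc < 20:
        return "lt20"
    if gc > 80:
        return "gt80"
    for upper in (30, 40, 50, 60, 70, 80):
        if gc <= upper:
            return "%d-%d" % (upper - 9, upper)

handles = {}
for record in SeqIO.parse("reads.fasta", "fasta"):
    label = bin_label(gc_percent(record.seq))
    if label not in handles:
        handles[label] = open("reads_bin_%s.fasta" % label, "w")
    SeqIO.write(record, handles[label], "fasta")
for handle in handles.values():
    handle.close()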

If I understand the problem correctly, you need something like the following (Python):
def GC(seq):  # determine the GC content
    s = seq.upper()
    return 100.0 * (s.count('G') + s.count('C')) / len(s)

def bin(gc):  # get the number of the 'bin' for this value of GC content
    if gc < 20:
        return 1
    elif gc > 80:
        return 8
    else:
        return int(gc / 10)
Then you just need to read the entries from the file, calculate the GC content, find the right bin and write the entry to the appropriate file. The following example implements this with a Python package we use in the lab:
from pyteomics import fasta

def split_to_bin_files(multifile):
    """Reads a file and writes the entries to appropriate 'bin' files.
    `multifile` must be a filename (str)"""
    for entry in fasta.read(multifile):
        fasta.write((entry,), multifile + '_bin_' + str(bin(GC(entry[1]))))
Then you just call it like split_to_bin_files('mybig.fasta').

Related

Incrementally writing Parquet dataset from Python

I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even after increasing the batch size as hinted here, I am facing these issues:
RAM usage grows heavily
The writer starts to slow down after a while (write throughput drops more than 5x)
My assumption is that this is because ParquetWriter's metadata management becomes expensive as the number of rows increases. I am thinking that I should switch to datasets, which would allow the writer to close the file in the middle of processing and flush out the metadata.
My questions are:
Is there an example of writing incremental datasets with Python and Parquet?
Are my assumptions correct, and would using datasets help to maintain the writer throughput?
My distilled code:
import os
import time

import psutil
import pyarrow as pa
import pyarrow.parquet as pq

writer = pq.ParquetWriter(
    fname,
    Candle.to_pyarrow_schema(small_candles),
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',  # Highest available format version
    data_page_version='2.0',  # Highest available data page version
)

def writeout():
    nonlocal data
    duration = time.time() - stats["started"]
    throughput = stats["candles_processed"] / duration
    logger.info("Writing Parquet table for candle %s, throughput is %s",
                "{:,}".format(stats["candles_processed"]), throughput)
    writer.write_table(
        pa.Table.from_pydict(
            data,
            writer.schema
        )
    )
    # BUG (see conclusion below): dict.fromkeys() reuses ONE list object
    # for every key, so all columns end up sharing the same buffer
    data = dict.fromkeys(data.keys(), [])
    process = psutil.Process(os.getpid())
    logger.info("Flushed %s writer, the memory usage is %s", bucket, process.memory_info())

# Use massive yield_per() or otherwise we are leaking memory
for item in query.yield_per(100_000):
    frame = construct_frame(row_type, item)
    for key, value in frame.items():
        data[key].append(value)
    stats["candles_processed"] += 1

    # Do regular checkpoints to avoid running out of memory
    # and to log the progress to the console.
    # For fine-tuning the Parquet writer see
    # https://issues.apache.org/jira/browse/ARROW-10052
    if stats["candles_processed"] % 100_000 == 0:
        writeout()
In this case, the reason was the incorrect use of Python lists and dicts as a working buffer, as pointed out by #0x26res.
After making sure the dictionary of lists is cleared correctly, the memory consumption issues become negligible.
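For anyone hitting the same trap, here is a minimal illustration of the buffer bug and its fix (the column names are hypothetical):
cols = ["open", "close"]           # hypothetical column names

shared = dict.fromkeys(cols, [])   # BUG: every key points at the SAME list
shared["open"].append(1)
print(shared["close"])             # [1] -- the other column sees the value too

fresh = {key: [] for key in cols}  # fix: a fresh, independent list per key
fresh["open"].append(1)
print(fresh["close"])              # []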

Print lines around position in the file

I'm importing a big CSV file (5 GB) into BigQuery and I got an error report for the file with a position specified as a byte offset from the start of the file (for example, 134683757). I'd like to look at the lines around this error position.
Some example lines of the file:
field1, field2, field3
abc, bcd, efg
...
dge, hfr, kdf,
dgj, "a""a", fbd # in this line is an invalid csv element and I get error, let's say on the position 134683757
skd, frd, lqw
...
asd, fij, fle
I need some command to show the lines around the error position, like:
dge, hfr, kdf,
dgj, "a""a", fbd
skd, frd, lqw
I tried sed and awk but I didn't find any simple solution.
It was definitely not clear from the original version of the question that you only got a byte offset from the start of the file.
You need to get a better position from the software generating the error; the developer was lazy in reporting an unusable number. It is reasonable to request a line number (and preferably offset within the line), rather than (or as well as) the byte offset from the start.
Assuming that the number is a byte position in the file, that gets tricky. Most Unix utilities work with lines (of variable length). I'd be tempted to write some C code to do the job, but that might be beyond you (and no shame in that).
Failing that, your best bet is likely the dd command. If the number reported is 134683757, then I'd guess that your lines are probably not more than 1 KiB each (adjust numbers if they're bigger, or smaller), and then use:
dd if=big.csv of=extract.csv bs=1 skip=$((134683757 - 3 * 1024)) count=6144
echo >> extract.csv
You'd then look at extract.csv. The raw dd output probably won't have a newline at the end of the last line (the echo >>extract.csv fixes that). The output will probably start part way through a record and end part way through another record. However, you're likely to have the relevant information, as well as some irrelevant information. As I said, adjust the numbers to suit your exact situation.
The trickiest part is identifying exactly where the byte offset is in the file you get. With custom C code, that can be provided easily (more easily). With the output from dd, you have to do the calculation yourself.
awk -v offset=$((134683757 - 3 * 1024)) '
{ printf "%9d: %s\n", offset, $0; offset += length($0) + 1 }
' extract.csv
That takes the starting offset from the dd command, and prefixes the (remnants of) the first line with that number and the data; it then adds the length to the offset plus one for the newline that wasn't counted, and continues to the end of the file. That gives you the start offset for each line in the extracted data. You can see where your actual start was by looking at the offsets — you should be able to identify which record that was.
You could use a variant of this Awk script that reads the whole file line by line, and tracks the offset (as well as the line numbers) and prints the data when it gets to the vicinity of where you have the problem.
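In the same vein, here is a small Python sketch (the file name and target offset are placeholders) that streams the whole file, tracks each line's starting byte offset and line number, and prints a few lines on either side of the target:
target = 134683757   # byte offset reported in the error (placeholder)
context = 3          # lines of context before and after the hit

recent = []          # ring buffer of (line_no, start_offset, line)
offset = 0
trailing = 0
with open("big.csv", "rb") as f:
    for line_no, line in enumerate(f, start=1):
        if trailing:                      # still printing lines after the hit
            print(line_no, offset, line.decode(errors="replace"), end="")
            trailing -= 1
        recent.append((line_no, offset, line))
        if len(recent) > context:
            recent.pop(0)
        end = offset + len(line)
        if offset <= target < end:        # the target offset falls in this line
            for no, off, ln in recent:
                print(no, off, ln.decode(errors="replace"), end="")
            trailing = context
        offset = end
Reading in binary mode keeps the offsets exact regardless of the file's encoding.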
In times long past, I had to deal with data from 1/2 inch mag tapes (those big circular tapes you see in old movies) where the files generated on a mainframe seemed sanely formatted for the first few tens of megabytes, but then the format changed to some alternative format for a few megabytes, and then reverted to the original format once more. I never did find out why; I just learned how to deal with it. Trial and error!

How to handle a for loop with large objects in RStudio?

I have a for loop with large objects. According to my trial and error, I can only load the large object once. If I load the object again, I get the error "Error: cannot allocate vector of size *** Mb". I tried to overcome this issue by removing the object at the end of the for loop. However, I still get "Error: cannot allocate vector of size 699.2 Mb" at the beginning of the second iteration of the for loop.
My for loop has the following structure:
for (i in 1:22) {
    VeryLargeObject <- ...i...
    ...
    .
    .
    .
    ...
    rm(VeryLargeObject)
}
The VeryLargeObjects range from 2-3 GB. My PC has 16 GB of RAM, 8 cores, and 64-bit Windows 10.
Any solution on how I can manage to complete the for loop?
The error "cannot allocate..." likely comes from the fact that rm() does not immediately free memory. So the first object still occupies RAM when you load the second one. Objects that are not assigned to any name (variable) anymore get garbage collected by R at time points that R decides for itself.
Most remedies come from not loading the entire object into RAM:
If you are working with a matrix, create a filebacked.big.matrix() with the bigmemory package. Write your data into this object using var[...,...] syntax like a normal matrix. Then, in a new R session (and a new R script to preserve reproducibility), you can load this matrix from disk and modify it.
The mmap package uses a similar approach, relying on your operating system's ability to map RAM pages to disk. The data appears to a program as if it were in RAM but is read from disk; to keep things fast, the operating system keeps the relevant parts cached in RAM.
If you work with data frames, you can use packages like fst and feather that enable you to load only parts of your data frame into a variable.
Transfer your data frame into a database like SQLite and then access the database from R. The dbplyr package enables you to treat a database as a tidyverse-style data set. Here is the RStudio help page. You can also use raw SQL commands with the DBI package.
Another approach is to not write interactively, but to write an R script that processes only one of your objects:
Write an R script, named, say, processBigObject.R, that gets the file name of your big object from the command line using commandArgs():
#!/usr/bin/env Rscript
#
# Process a big object
#
# Usage: Rscript processBigObject.R <FILENAME>
input_filename <- commandArgs(trailingOnly = TRUE)[1]
output_filename <- commandArgs(trailingOnly = TRUE)[2]
# I'm making up function names here, do what you must for your object
o <- readBigObject(input_filename)
s <- calculateSmallerSummaryOf(o)
writeOutput(s, output_filename)
Then, write a shell script or use system2() to call the script multiple times, with different file names. Because R is terminated after each object, the memory is freed:
system2("Rscript", c("processBigObject.R", "bigObject1.dat", "bigObject1_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject2.dat", "bigObject2_result.dat"))
system2("Rscript", c("processBigObject.R", "bigObject3.dat", "bigObject3_result.dat"))
...

Windows (ReFS,NTFS) file preallocation hint

Assume I have multiple processes writing large files (20 GB+). Each process writes its own file; assume the process writes x MB at a time, then does some processing, writes x MB again, etc.
What happens is that this write pattern causes the files to become heavily fragmented, since blocks of the different files end up allocated interleaved on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then set the correct size before it is closed. But an application that accesses these files remotely and is able to parse the in-progress files then sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" which would process my files and consolidate their blocks.
Another more drastic solution would be to implement a minifilter driver which would report a "fake" filesize.
But obviously both solutions listed above are far from optimal. So I would like to know if there is a way to give the filesystem a file size hint so that it "reserves" consecutive space on the drive, but still reports the correct file size to applications.
Writing larger chunks at a time also helps with fragmentation, but it does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed I made a small test:
#include <windows.h>
#include <cstdio>
#include <iostream>

int main()
{
    LARGE_INTEGER size;
    LARGE_INTEGER a;
    char buf = 'A';
    DWORD written = 0;
    DWORD tstart;

    std::cout << "creating file\n";
    tstart = GetTickCount();
    HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ,
                           NULL, CREATE_ALWAYS, 0, NULL);
    size.QuadPart = 100000000LL;
    SetFilePointerEx(f, size, &a, FILE_BEGIN);  // move the pointer ~100 MB in
    SetEndOfFile(f);                            // extend the file to that size
    printf("file extended, elapsed: %d\n", GetTickCount() - tstart);
    getchar();                                  // pause here to inspect NTFS structures

    printf("writing 'A' at the end\n");
    tstart = GetTickCount();
    SetFilePointer(f, -1, NULL, FILE_END);      // seek to the last byte
    WriteFile(f, &buf, 1, &written, NULL);      // forces zero-fill up to this offset
    printf("written: %d bytes, elapsed: %d\n", written, GetTickCount() - tstart);
    CloseHandle(f);
    return 0;
}
When the application is executed and it waits for a keypress after SetEndOfFile I examined the on disc NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated
When pressing enter to allow the test to continue (and waiting for quite some time since the file was created on slow USB stick), the StreamDataSize field was updated
Since I wrote 1 byte at the end, NTFS now really had to zero everything, so SetEndOfFile does indeed help with the issue that I am "fretting" about.
I would appreciate it very much that answers/comments also provide an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture)
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
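As a rough, untested Python sketch of the SetFileInformationByHandle route via ctypes: the FILE_ALLOCATION_INFO layout and the class value FileAllocationInfo = 5 follow the documented FILE_INFO_BY_HANDLE_CLASS enumeration, while the function name and example path are illustrative.
import ctypes
from ctypes import wintypes
import msvcrt

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
FileAllocationInfo = 5  # FILE_INFO_BY_HANDLE_CLASS value

class FILE_ALLOCATION_INFO(ctypes.Structure):
    _fields_ = [("AllocationSize", wintypes.LARGE_INTEGER)]

def preallocate(pyfile, nbytes):
    """Reserve nbytes of clusters for an open file without moving EndOfFile."""
    handle = msvcrt.get_osfhandle(pyfile.fileno())
    info = FILE_ALLOCATION_INFO(nbytes)
    ok = kernel32.SetFileInformationByHandle(
        wintypes.HANDLE(handle), FileAllocationInfo,
        ctypes.byref(info), ctypes.sizeof(info))
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())

# Remote readers still see the real file size, since EndOfFile is unchanged:
# with open("e:\\test.dat", "wb") as f:
#     preallocate(f, 100_000_000)
Note that, as described above, the extra allocation is truncated back to the end of file when the handle is closed.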
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.

Coding for Input/Output: Speed or Memory priority?

I am currently writing a simple piece of I/O parsing code and am in a dilemma as to how I should code it.
This is the case of a web application, where this particular parsing function may be called multiple times within a second by several users.
Assume that the file size is more than 2 MB and hardware I/O delays are 5 ms for each call.
First Case: Memory
The first case would be to code for memory, at the expense of speed. The function will take in small parts of the file and parse it part by part, using more iterations but less memory.
Pseudo-code:
function parser() {
    Open file and put into handle variable fHandle
    while (file position not past EOF) {
        read 1024 bytes from file using fHandle into variable data
        process(data)
    }
    Close file using handle fHandle
}
Second Case: Speed
The second case would be to code for speed, at the expense of memory usage. The function will load the entire file content into memory and parse it directly.
Pseudo-code:
function parser() {
    read entire file and store into variable data
    declare parsing position variable and set to 0
    while (parsing position not past data length) {
        get position of next token and store into variable pos
        process( substring from current position to pos of data )
    }
}
Note: when reading the entire file, we use the library's directly available functions; no loops are used to read the file on the developer's end.
Third Case: End-user choice
Would it then be advisable to write both, so that whenever the function runs, it detects whether memory is abundant? If there is a lot of free memory, the function will use the memory-intensive version.
Pseudo-code:
function parser() {
    if (memory is too little) {
        Open file and put into handle variable fHandle
        while (file position not past EOF) {
            read 1024 bytes from file using fHandle into variable data
            process(data)
        }
        Close file using handle fHandle
    } else {
        read entire file and store into variable data
        declare parsing position variable and set to 0
        while (parsing position not past data length) {
            get position of next token and store into variable pos
            process( substring from current position to pos of data )
        }
    }
}
Use asynchronous I/O (or a second thread), and process one chunk of data while the drive's busy fetching the next chunk. Best of both worlds.
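Here is a minimal sketch of that overlap in Python (parse and process are illustrative names), using a reader thread and a bounded queue so the next chunk is being fetched while the current one is processed:
import threading
import queue

CHUNK = 1024 * 1024  # 1 MB reads; tune to your hardware

def parse(path, process):
    chunks = queue.Queue(maxsize=2)   # reader stays at most one chunk ahead

    def reader():
        with open(path, "rb") as f:
            while True:
                data = f.read(CHUNK)
                chunks.put(data)      # an empty bytes object signals EOF
                if not data:
                    return

    threading.Thread(target=reader, daemon=True).start()
    while True:
        data = chunks.get()           # process this chunk while the reader
        if not data:                  # is already fetching the next one
            break
        process(data)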
If you need to read the full file either way and it fits into memory without issue, then read it from memory. Will it be the same file every time, or some small set of files? Cache them in memory.
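If it really is the same small set of files, even a trivial in-memory cache goes a long way; a hypothetical sketch:
from functools import lru_cache

@lru_cache(maxsize=8)          # keep up to 8 files' contents in memory
def load(path):
    with open(path, "rb") as f:
        return f.read()        # repeat calls for the same path skip the I/O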
If the input for your parsing comes from I/O, as it usually does, any good parsing technology, like recursive-descent, will be I/O bound.
In other words, the average time to get a character from the I/O should exceed the average time spent processing it, by a healthy factor.
So it really doesn't matter very much.
The only difference will be in how much working storage you glom onto, which is not usually a big deal.
