In bioinformatics, what is a singleton?

I've quickly realized that bioinformatics is not a field whose terms are clearly defined and easily accessible. I have an apparent discrepancy in some of my results.
I used samtools view -b -h -f 8 fileName.bam > mateUnmapped.bam on several BAM files. I am under the impression that this command extracts only reads whose partner does not align to the draft genome (it also includes the header; the output is in BAM format).
When I use samtools 'flagstat' on the resulting files, I get an interesting result: the number of 'singletons' does not match the total number of reads... which seems odd to me.
The only reconciliation I can find is here:
http://seqanswers.com/forums/showthread.php?t=46711
One person who replies to the question posed in this forum claims that singletons are sometimes defined as sequences which do not have a partner read at all. However, that still doesn't explain my result. Flagstat says about 40% of my reads are singletons, but I feel like, based on the 'view' command I used, they should ALL be singletons.
Can a seasoned bioinformatician help me out?

In genome assembly in general, a singleton is a read that did not assemble into a contig or map to a reference. It is a contig of only one read.
In samtools, a singleton refers to a read that mapped but whose mate did not.
Flagstat says about 40% of my reads are singletons, but I feel like
based on the 'view' command I used, they should ALL be singletons.
I'm not a samtools expert, but I think -f 8 means show reads whose mates did not map. That says nothing about the read itself, just its mate. So you are probably getting both reads where neither mate mapped at all (60%) and reads where only one of the mates mapped (40%).
You might want to try running with -f 8 -F 4 to get only the reads that mapped but whose mates did not.
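For example (untested, and the output file name is just illustrative), combined with the command from the question that would be:
samtools view -b -h -f 8 -F 4 fileName.bam > mappedWithUnmappedMate.bam
-f 8 keeps reads whose mate-unmapped flag is set, and -F 4 drops reads that are themselves unmapped, which matches flagstat's definition of a singleton.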

Controlling du ordering [duplicate]

This question already has answers here:
What sort order does Linux use?
Is the order of iteration in bash for loop guaranteed?
Short version: how can I control the order in which du scans directories?
NOTE: Sorting afterwards is not sufficient, I need to change the order that du scans, not the ordering of the results.
For a bit of background; I have a simple rsync-based backup with a bunch of snapshots in a directory, all named by the date/time they were created.
So a sample might look like this:
2019-08
2019-09
2019-09-20
2019-09-21
2019-09-23-161447
2019-09-23-172658
i.e., I've got hourlies (at the end), collapsed into dailies (middle) and then eventually into monthlies (top), which works great since these names sort naturally with ls.
The problem however, is that when I run du . the ordering is unpredictable, or at least it's not the order I want (oldest to newest), so the results aren't as useful as I'd like; sometimes I'll get a daily or even an hourly being scanned ahead of the monthlies, and the ordering only sometimes matches the one I want.
What I want, ideally, is for du to always evaluate the snapshots from oldest to newest, as this way du will give the sizes for all later snapshots as a size difference/delta (since unchanged files are hard-linked, it only counts the changes for later snapshots).
Is it possible to force a natural sort order for the order in which du scans directories?
For some reason it hadn't occurred to me that if I just pass all the snapshots as arguments, then du would still ignore additional hard-links, so a possible answer in my case is to do something like:
du -sh /path/to/snapshots/*-*
Since my parent directory contains no other entries with hyphens, this expands to every single snapshot within it, and seems to be given in a natural sort order. Depending how similar your own use-case is you may need to adjust the pattern, or obtain the list some other way.
Before I mark this answer as correct though, can someone please confirm: is this a reliable way to order the snapshots? It seems to give me a natural sort order, but I haven't been able to find a source confirming that this will always be the case. If it isn't reliable, or even if it is, are there any other/more flexible alternatives?
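For what it's worth, one way to avoid relying on glob ordering at all is to sort the names explicitly and hand them to du in that order; a sketch using only standard tools (adjust the path and pattern to your layout, and it assumes no whitespace in the snapshot names):
find /path/to/snapshots -mindepth 1 -maxdepth 1 -name '*-*' | sort | xargs du -sh
du processes its operands in the order given, and the hard-link de-duplication still applies as long as xargs fits everything into a single du invocation.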

Monitor A File For Additions And Get Last Added Line

I'm having trouble monitoring a file for changes. I need to be able to know when a file changes, and when it does, I need the new line that was added. I intend to parse each line and find ones that match certain criteria, and act on information in those lines. I know the expected number of matching lines ahead of time, but I do not know how many lines in total will be added to the file, or where the matching lines will be.
I've tried two packages so far, to no avail.
fsnotify/fsnotify
As far as I can tell, fsnotify can only tell me when a file is modified, not what the details of the modification were. Since I need to know exactly what was added to the file, this is no good for me.
(As a side-question, can this be run in a loop? The example that I tried exited after just one modification. I need to monitor for multiple modifications.)
hpcloud/tail
This package tries to mimic the Unix tail command, but it seems to have its own issues. The output that I get includes timestamps and other data - I just want the added line, nothing else. Also, it seems to think a file has been modified multiple times, even when it's just one edit. Further, the deal breaker here is that it does not output the last line if the line was not followed by a newline character.
Delegating to tail
I came across this answer, which suggests to delegate this work to the tail command itself, but I need this to work cross-platform (specifically, macOS, Linux and Windows). I don't believe that an equivalent command exists on Windows.
How do I go about tackling this?
@user2515526,
Usually a changed diff is out of scope of a file watcher's functionality, because, you know, you could change an image, and the watcher would need to keep track of several MB of diff in memory, and what if we have thousands of files?
However, as bad as it sounds, this may be exactly the way you want to implement this (sure, it depends on your app, etc., and could be fine for text files), i.e. keeping a map of diffs (one diff per file) since the last modification. I can't say I like it, but it sounds like fsnotify has no support for the changes/diffs that you need.
Also, regarding your question about running in a loop, maybe you can get some hints here: https://github.com/kataras/iris/blob/8370d76910cdd8de043753ed81ae080eae8dc798/utils/file.go
It's a framework for building a server that watches for TypeScript file changes, so it sounds similar to your case/question.
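Regarding the loop question, and getting at what was added rather than just that something changed: below is a minimal, untested sketch using fsnotify (the file name is a placeholder, and it assumes the file is only ever appended to, with complete lines ending in a newline). fsnotify only reports that the file changed, so the sketch keeps its own offset and re-reads from there on every write event.

// Minimal sketch: watch a file with fsnotify and print only the bytes
// appended since the last write event.
package main

import (
	"fmt"
	"io"
	"log"
	"os"
	"strings"

	"github.com/fsnotify/fsnotify"
)

func main() {
	const path = "watched.log" // placeholder file name

	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()
	if err := watcher.Add(path); err != nil {
		log.Fatal(err)
	}

	var offset int64 // bytes of the file we have already consumed

	// The for/select loop keeps the watcher running across many
	// modifications; it only exits when the channels are closed.
	for {
		select {
		case event, ok := <-watcher.Events:
			if !ok {
				return
			}
			if event.Op&fsnotify.Write == 0 {
				continue // ignore chmod, rename, etc.
			}
			f, err := os.Open(path)
			if err != nil {
				log.Println(err)
				continue
			}
			if _, err := f.Seek(offset, io.SeekStart); err == nil {
				if data, err := io.ReadAll(f); err == nil && len(data) > 0 {
					offset += int64(len(data))
					for _, line := range strings.Split(strings.TrimRight(string(data), "\n"), "\n") {
						fmt.Println("new line:", line) // parse/match here
					}
				}
			}
			f.Close()
		case err, ok := <-watcher.Errors:
			if !ok {
				return
			}
			log.Println("watch error:", err)
		}
	}
}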
Cheers,
-D

How to display command output in a whiptail textbox

The whiptail command has an option --textbox that has the following description:
--textbox <file> <height> <width>
The first option requires a file as input; I would like to use the output of a command in its place. It seems like this should be possible in sh or bash. For the sake of the question, let's say I'd like to view the output of ls -l in a whiptail textbox.
Note that process substitution does not appear to work in whiptail (e.g. whiptail --textbox <(ls -l) 40 80 does not work).
This question is a re-asking of this other stackoverflow question, which technically was answered.
For the record, the question says that
whiptail --textbox <(ls -l) 40 80
"doesn't work". It's most certainly worth stating clearly that the nature of the failure is that whiptail displays an empty textbox. (That's mentioned in a comment to an answer to the original question, linked in this question, but that's a pretty obscure place to look for a problem report.)
In 2014, this workaround was available (and was the original contents of this answer):
whiptail --textbox /dev/stdin 40 80 <<<"$(ls -l)"
That will still work in 2022, if ls -l produces enough output (at least 64k on a standard Linux/Bash install).
Another possible workaround is to use a msgbox instead of a textbox:
whiptail --msgbox "$(ls -l)" 40 80
However, that will fail if the output from the command is too large to use as a command-line argument, which might be the case at 128k.
So if you can guess reasonably accurately how big the output will be, one of those solutions will work. Up to around 100k, you can use the msgbox solution; beyond that, you can use the textbox with a here-string.
But that's far from ideal, since it's really hard to reliably guess the size of the output of a command, even within a factor of two.
What will always work is to save the output of the command to a temporary file, then provide the file to whiptail, and then delete the file. (In fact, you can delete the file immediately, since Posix systems don't delete files until there are no open file handles.) But no matter how hard you try, you will occasionally end up with a file which should have been deleted. On Linux, your best bet is to create temporary files in the /tmp directory, which is an in-memory filesystem which is emptied automatically on reboot.
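A sketch of that temporary-file approach (ls -l stands in for the real command; the trap catches most, though not all, of the cases where the file would otherwise be left behind):
tmpfile=$(mktemp) || exit 1
trap 'rm -f "$tmpfile"' EXIT
ls -l > "$tmpfile"
whiptail --textbox "$tmpfile" 40 80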
Why does this happen?
I came up with the solution above, eight years prior to this edit, on the assumption that OP was correct in their guess that the problem had to do with not being able to seek() a process substituted fd. Indeed, it's true that you can't seek() /dev/fd/63. It was also true at the time that bash implemented here-strings and here-docs by creating a temporary file to hold the expanded text, and then redirecting standard input (or whatever fd was specified) to that temporary file. So the suggested workaround did work; it ensured that /dev/stdin was just a regular file.
But in 2022, the same question was asked, this time because the suggested workaround failed. As it turns out, the reason is that Bash v5.1, which was released in late 2020, attempts to optimise small here-strings and here-docs:
c. Here documents and here strings now use pipes for the expanded document if it's smaller than the pipe buffer size, reverting to temporary files if it's larger.
(from the Bash CHANGES file; changes between bash 5.1alpha and bash 5.0, in section 3, New features in Bash.)
So with the new Bash version, unless the here-string is bigger than a pipe buffer (on Linux, 16 pages), it will no longer be a regular file.
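If you want to check which behaviour you are getting on Linux, one quick (if slightly hacky) test is to ask what fd 0 actually points at:
readlink /proc/self/fd/0 <<< 'hello'
Under bash 5.1+ with a small here-string this should print something like pipe:[123456]; under an older bash, or with a here-string bigger than the pipe buffer, it should point at a deleted temporary file instead.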
One slightly confusing aspect of this issue is that whiptail does not, in fact, try to call lseek() on the textbox file. So the initial guess about the nature of the problem was not exact. That's not all that surprising, since using lseek() on a FIFO to position the stream at SEEK_END produces an explicit error, and it's reasonable to expect software to actually report error returns. But whiptail does not attempt to get the filesize by seeking to the end. Instead, it fstat()s the file and gets the file size from the st_size field in the returned struct stat. It then allocates exactly enough memory to hold the contents of the file, and reads the indicated number of bytes.
Of course, fstat cannot report the size of a FIFO, since that's not known until the FIFO is completely drained. But unlike lseek, that's not considered an error. fstat is documented as not filling in the st_size field on FIFOs, sockets, character devices, and other stream types. As it happens, on Linux the st_size field is filled in as 0 for such file descriptors, but Posix actually allows it to be unset. In any case, there is no error indication, and it's essentially impossible to distinguish between a stream which doesn't have a known size and a stream which is known to have size 0, like an empty file. Thus, whiptail treats a FIFO as though it were an empty file, reading 0 bytes and presenting an empty textbox.
What about dialog?
One alternative to whiptail is Dialog, currently maintained by Thomas Dickey. You can often directly substitute dialog for whiptail, and it has some additional widgets which can be useful. However, it does not provide a simple solution in this case.
Unlike whiptail, dialog's textbox attempts to avoid reading the entire file into memory before drawing the widget. As a result, it does depend on lseek() in order to read out of order, and thus cannot work on special files at all. Attempting to use a FIFO with dialog produces an error message, rather than drawing an empty textbox; that makes diagnosis easier but doesn't really solve the underlying problem. Dialog does have a variety of widgets which can read from stdin, but as far as I know none of them allow scrolling, so they're only useful if the command output fits in a single window. But it's possible that I've missed something.
Drawing out a moral: just read until you reach the end
(Skip this section if you're only interested in using command-line utilities, not writing them.)
The tragic part of this complicated tale is that it was all completely unnecessary. Whiptail is going to read the entire file in any case; trying to get the size of the file before reading was either laziness or a misguided attempt to optimise. Had the code just read until an end of file indication, possibly allocating more memory as needed, all these problems would have been avoided. And not just these problems. As it happens, Posix does not guarantee that the st_size field is accurate even for apparently normal files. Stat is only required to report accurate sizes for symlinks (the length of the link itself) and shared memory objects. The Linux documentation indicates that st_size will be returned as 0 for certain automatically-generated files:
For example, the value 0 is returned for many files under the /proc directory, while various files under /sys report a size of 4096 bytes, even though the file content is smaller. For such files, one should simply try to read as many bytes as possible (and append '\0' to the returned buffer if it is to be interpreted as a string). (from man 7 inode).
lseek will also fail on many autogenerated files, as well as sockets, FIFOs and character devices. So the more common idiom for this particular optimization is also unreliable, as well as leading to TOCTOU-like race conditions when the file is truncated or appended to while it is being read.

Implementation of a FIFO in GNU Guile

I would like to do the following:
I want to implement the concept of a FIFO in normal files using Guile.
Two processes should communicate via a normal text file that a third process, if needed, can access.
The subordinate of the original two processes should write to the file, line after line, that is, append. So far so good (implemented in C++).
The master process, however, should treat this file as a FIFO: it should read the first line, do something corresponding to it, and delete the first line, leaving the rest intact.
The problems are:
While the master is accessing the file, the subordinate may come to a point where it must write there, leading to a conflict.
Popping the first line may require reading the whole file into a string, popping the first line of that, and then saving it back, which is memory intensive, and the second saving action may again conflict with the child trying to write there.
I wanted to implement this in Guile because, since it is the official extension language of the GNU system, there might be better ways to address the above two issues.
But on the web I do not find much to orient myself. Please help, and sorry for the less than concrete question, as I don't have a code snippet to show.

Embarrassingly parallel workflow creates too many output files

On a Linux cluster I run many (N > 10^6) independent computations. Each computation takes only a few minutes and the output is a handful of lines. When N was small I was able to store each result in a separate file to be parsed later. With large N however, I find that I am wasting storage space (for the file creation) and simple commands like ls require extra care due to internal limits of bash: -bash: /bin/ls: Argument list too long.
Each computation is required to run through a qsub scheduling algorithm so I am unable to create a master program which simply aggregates the output data to a single file. The simple solution of appending to a single file fails when two programs finish at the same time and interleave their output. I have no admin access to the cluster, so installing a system-wide database is not an option.
How can I collate the output data from embarrassingly parallel computation before it gets unmanageable?
1) As you say, it's not ls which is failing; it's the shell which does glob expansion before starting up ls. You can fix that problem easily enough by using something like
find . -type f -name 'GLOB' | xargs UTILITY
e.g.:
find . -type f -name '*.dat' | xargs ls -l
You might want to sort the output, since find (for efficiency) doesn't sort the filenames (usually). There are many other options to find (like setting directory recursion depth, filtering in more complicated ways, etc.) and to xargs (maximum number of arguments for each invocation, parallel execution, etc.). Read the man pages for details.
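For example, one way to make the order deterministic before it reaches the utility:
find . -type f -name '*.dat' | sort | xargs ls -l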
2) I don't know how you are creating the individual files, so it's a bit hard to provide specific solutions, but here are a couple of ideas:
If you get to create the files yourself, and you can delay the file creation until the end of the job (say, by buffering output), and the files are stored on a filesystem which supports advisory locking or some other locking mechanism like atomic linking, then you can multiplex various jobs into a single file by locking it before spewing the output, and then unlocking. But that's a lot of requirements. In a cluster you might well be able to do that with a single file for all the jobs running on a single host, but then again you might not.
Again, if you get to create the files yourself, you can atomically write each line to a shared file. (Even NFS supports atomic writes but it doesn't support atomic append, see below.) You'd need to prepend a unique job identifier to each line so that you can demultiplex it. However, this won't work if you're using some automatic mechanism such as "my job writes to stdout and then the scheduling framework copies it to a file", which is sadly common. (In essence, this suggestion is pretty similar to the MapReduce strategy. Maybe that's available to you?)
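For example, if each line starts with a job identifier (the job-000123 format below is purely hypothetical), pulling one job back out, or grouping them all, is a one-liner:
grep '^job-000123 ' combined.txt > job-000123.out
sort -s -k1,1 combined.txt > combined.by-job.txt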
Failing everything else, maybe you can just use sub-directories. A few thousand directories of a thousand files each is a lot more manageable than a single directory with a few million files.
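For example, bucketing output by job number keeps each directory to about a thousand entries (JOB_ID is a placeholder for whatever unique number your scheduler gives each computation):
dir="results/$(( JOB_ID / 1000 ))"
mkdir -p "$dir" && mv "output.$JOB_ID" "$dir/"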
Good luck.
Edit As requested, some more details on 2.2:
You need to use Posix I/O functions for this, because, afaik, the C library does not provide atomic write. In Posix, the write function always writes atomically, provided that you specify O_APPEND when you open the file. (Actually, it writes atomically in any case, but if you don't specify O_APPEND then each process retains its own position into the file, so they will end up overwriting each other.)
So what you need to do is:
At the beginning of the program, open a file with options O_WRONLY|O_CREAT|O_APPEND. (Contrary to what I said earlier, this is not guaranteed to work on NFS, because NFS may not handle O_APPEND properly. Newer versions of NFS could theoretically handle append-only files, but they probably don't. Some thoughts about this a bit later.) You probably don't want to always use the same file, so put a random number somewhere into its name so that your various jobs have a variety of alternatives. O_CREAT is always atomic, afaik, even with crappy NFS implementations.
For each output line, sprintf the line to an internal buffer, putting a unique id at the beginning. (Your job must have some sort of unique id; just use that.) [If you're paranoid, start the line with some kind of record separator, followed by the number of bytes in the remaining line -- you'll have to put this value in after formatting -- so the line will look something like ^0274:xx3A7B29992A04:<274 bytes>\n, where ^ is hex 01 or some such.]
write the entire line to the file. Check the return code and the number of bytes written. If the write fails, try again. If the write was short, hopefully you followed the "if you're paranoid" instructions above; either way, just try again.
Really, you shouldn't get short writes, but you never know. Writing the length is pretty simple; demultiplexing is a bit more complicated, but you could cross that bridge when you need to :)
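In case a concrete shape helps, here is a rough sketch of steps 1-3, written in Go rather than C (my substitution, purely for brevity); Go's os.OpenFile maps onto open(2) with the same flags (Go spells O_CREAT as O_CREATE), and the job id and file name below are placeholders:

// Sketch: open shared file with O_APPEND, emit one whole record per write.
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	jobID := os.Getenv("JOB_ID") // placeholder: however your scheduler identifies the job
	path := fmt.Sprintf("results-%d.log", os.Getpid()%8) // step 1: pick one of a few shared files

	// Step 1: O_APPEND makes the kernel position every write at the
	// current end of file, so concurrent writers do not overwrite each
	// other. (os.O_CREATE is Go's spelling of the Posix O_CREAT flag.)
	f, err := os.OpenFile(path, os.O_WRONLY|os.O_CREATE|os.O_APPEND, 0644)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Step 2: format the whole record first, with the job id up front
	// so it can be demultiplexed later.
	line := fmt.Sprintf("%s result-goes-here\n", jobID)

	// Step 3: emit it with ONE write and check the result; a short
	// write would need the "paranoid" record framing described above
	// before blindly retrying is safe.
	n, err := f.WriteString(line)
	if err != nil || n != len(line) {
		log.Fatalf("write failed: wrote %d of %d bytes: %v", n, len(line), err)
	}
}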
The problem with using NFS is a bit more annoying. As with 2.1, the simplest solution is to try to write the file locally, or use some cluster filesystem which properly supports append. (NFSv4 allows you to ask for only "append" permissions and not "write" permissions, which would cause the server to reject the write if some other process already managed to write to the offset you were about to use. In that case, you'd need to seek to the end of the file and try the write again, until eventually it succeeds. However, I have the impression that this feature is not actually implemented. I could be wrong.)
If the filesystem doesn't support append, you'll have another option: decide on a line length, and always write that number of bytes. (Obviously, it's easier if the selected fixed line length is longer than the longest possible line, but it's possible to write multiple fixed-length lines as long as they have a sequence number.) You'll need to guarantee that each job writes at different offsets, which you can do by dividing the job's job number into a file number and an interleave number, and writing all of a particular job's lines at offsets congruent to its interleave number (modulo the number of interleaves), into a file whose name includes the file number. (This is easiest if the jobs are numbered sequentially.) It's OK to write beyond the end of the file, since unix filesystems will -- or at least, should -- either insert NULs or create sparse (discontiguous) files, which waste less space but depend on the blocksize of the file.
Another way to handle filesystems which don't support append but do support advisory byte-range locking (NFSv4 supports this) is to use the fixed-line-length idea, as above, but obtaining a lock on the range about to be written before writing it. Use a non-blocking lock, and if the lock cannot be obtained, try again at the next line-offset multiple. If the lock can be obtained, read the file at that offset to verify that it doesn't have data before writing it; then release the lock.
Hope that helps.
If you are only concerned about space:
parallel --header : --tag computation {foo} {bar} {baz} ::: foo 1 2 ::: bar I II ::: baz . .. | pbzip2 > out.bz2
or shorter:
parallel --tag computation ::: 1 2 ::: I II ::: . .. | pbzip2 > out.bz2
GNU Parallel ensures output is not mixed.
If you are concerned with finding a subset of the results, then look at --results.
Watch the intro videos to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Another possibility would be to use N files, with N greater than or equal to the number of nodes in the cluster, and assign the files to your computations in a round-robin fashion. This should avoid concurrent writes to any of the files, provided you have a reasonable guarantee on the order of execution of your computations.
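As a sketch (JOB_ID, N and ./computation are placeholders for your job number, your chosen file count and your actual program), the assignment could be as simple as:
outfile="results/part.$(( JOB_ID % N )).txt"
./computation >> "$outfile"
with the proviso above that two jobs assigned to the same file should not run at the same time.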
