Strand-specific TopHat/Cufflinks - bioinformatics

I have a strand-specific RNA-seq library (Illumina) to assemble. I would like to use TopHat/Cufflinks. The TopHat manual says:
"--library-type TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol."
Does this mean that TopHat only supports strand-specific protocols? If I run it with the option "--library-type fr-unstranded", does that mean it runs in a strand-specific way? I googled it and asked the developers, but got no answer...
I got this result:
Here the contig is assembled from two groups of reads: the left side consists of reverse reads, while the right side consists of forward reads. (For visualization, I have reverse-complemented the right mate.)
But some of the contigs are assembled purely from reverse or purely from forward reads. If the library is strand-specific, one gene should produce reads in the same direction, so it should not report a result like the image above, right? Or is it possible that one gene is fragmented and then sequenced independently, so that by chance the left part produces reverse reads while the right part produces forward reads? From my understanding, strand specificity is preserved by 3'/5' ligation, so it should hold at the level of whole genes.
What is the problem here? Or did I misunderstand the concept of 'strand specific'? Any help is appreciated.

Tophat/Cufflinks are not for assembly, they are for alignment to an already assembled genome or transcriptome. What are you aligning your reads to?
Also, if you have strand specific data, you shouldn't choose an unstranded library type. You should choose the proper one based on your library preparation method. The XS tag will only be placed on split reads if you choose an unstranded library type.

If you want to do a de novo assembly of your transcriptome you should take a look at assemblers (not mappers) like
Trinity
SOAPdenovo
Oases....

Tophat can deal with both stranded and unstranded libraries. In your snapshot the center region does have both + and - strand reads. The biases at the two ends might be characteristics of your library prep or analytical methods. What is the direction of this gene? It looks a little biased towards the left side. If the left-hand side corresponds to the 3' end, then it is likely that your library prep has 3'-bias features (e.g. dT-primed reverse transcription). The way you fragment your RNA may also affect your read distribution.
I guess we need more information to find the truth. But we should also keep in mind that tophat/cufflinks may have bugs, too.

Related

Mapping Untyped Lisp data into a typed binary format for use in compiled functions

Background: I'm writing a toy Lisp (Scheme) interpreter in Haskell. I'm at the point where I would like to be able to compile code using LLVM. I've spent a couple days dreaming up various ways of feeding untyped Lisp values into compiled functions that expect to know the format of the data coming at them. It occurs to me that I am not the first person to need to solve this problem.
Question: What are some historically successful ways of mapping untyped data into an efficient binary format.
Addendum: In point of fact, I do know which of about a dozen different types the data is, I just don't know which one might be sent to the function at compile time. The function itself needs a way to determine what it got.
Do you mean, "I just don't know which [type] might be sent to the function at runtime"? It's not that the data isn't typed; certainly 1 and '() have different types. Rather, the data is not statically typed, i.e., it's not known at compile time what the type of a given variable will be. This is called dynamic typing.
You're right that you're not the first person to need to solve this problem. The canonical solution is to tag each runtime value with its type. For example, if you have a dozen types, number them like so:
0 = integer
1 = cons pair
2 = vector
etc.
Once you've done this, reserve the first four bits of each word for the tag. Then, every time two objects get passed in to +, first you perform a simple bit mask to verify that both objects' first four bits are 0b0000, i.e., that they are both integers. If they are not, you jump to an error message; otherwise, you proceed with the addition, and make sure that the result is also tagged accordingly.
This technique essentially makes each runtime value a manually-tagged union, which should be familiar to you if you've used C. In fact, it's also just like a Haskell data type, except that in Haskell the taggedness is much more abstract.
I'm guessing that you're familiar with pointers if you're trying to write a Scheme compiler. To avoid limiting your usable memory space, it may be more sensible to use the bottom (least significant) bits rather than the top ones. Better yet, because aligned pointers already have meaningless zero bits at the bottom (two for 4-byte alignment, three for 8-byte alignment), you can simply co-opt those bits for your tag, as long as you dereference the actual address rather than the tagged one.
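For what it's worth, here is a minimal Python sketch of the low-bit tagging idea (Python integers stand in for machine words, and the tag layout and names are purely illustrative, not any particular runtime's scheme):

TAG_BITS = 3
TAG_MASK = (1 << TAG_BITS) - 1   # 0b111

TAG_FIXNUM = 0b000               # integers carry a zero tag
TAG_PAIR   = 0b001
TAG_VECTOR = 0b010

def tag_fixnum(n):
    # store the integer in the upper bits, the tag in the lowest three
    return (n << TAG_BITS) | TAG_FIXNUM

def untag_fixnum(word):
    return word >> TAG_BITS

def checked_add(a, b):
    # the runtime check '+' performs: both operands must carry the fixnum tag
    if (a & TAG_MASK) != TAG_FIXNUM or (b & TAG_MASK) != TAG_FIXNUM:
        raise TypeError("'+' expects two fixnums")
    # adding the tagged words directly is safe because both tags are zero
    return a + b

print(untag_fixnum(checked_add(tag_fixnum(20), tag_fixnum(22))))   # 42

Giving fixnums the all-zero tag is a common choice precisely because tagged addition and subtraction then need no untagging at all.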
Does that help?
Your default solution should be a simple tagged union. If you want to narrow your typing down to more specific types, you can do it - but it won't be that "toy" any more. A thing to look at is called abstract interpretation.
There are a few successful implementations of such an optimisation, with V8 probably being the most widespread. In the Scheme world, the most aggressively optimising implementation is Stalin.

Does a specification/grammar for "Preflop Canonical Form" exist?

I hear this term used quite frequently, but have yet to see it specified (and can't find it by searching). Single starting hands are pretty straightforward. Here, I'm using PokerStove syntax as an example:
XX for a pair (e.g.: 77, 99, TT, KK)
XYo for an off suit combination (e.g.: 72o, 54o, AQo)
XYs for a suited combination (e.g.: 76s, 86s, AKs)
Where things get a bit dicier is when it starts to take on a set-builder-like syntax, with ranges, and unions between ranges. (e.g.: 22+, A9s+, AKo for any pair, suited aces >= A9, and AKo).
As far as I can see, people tend to use slight variations on the PF form, but the term "canonical form" seems to suggest that someone has at least started looking toward simplifying and/or standardizing it.
On one hand, Stove is ubiquitous enough that duplicating Prock's syntax isn't horrible, but I'd like to implement a standard if one exists.
Seems the answer is 'no' and that Pokerstove's syntax is ubiquitous enough that it can be used.
Feel free to join in fleshing out the syntax.
Two card representations are given here.
Square brackets represent single hands, e.g.: [KQ] represents all possibilities of KQ (suited and non-suited).
Intervals can be represented as [76s-23s], which includes all suited connectors from 76s down to 23s. It does not include off-suit hands.
Commas are used to join single hands or intervals.
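To make that concrete, here is a rough Python sketch that expands PokerStove-style tokens like the ones in the question (22+, A9s+, AKo). The grammar is my own reading of the common syntax, not an official specification:

RANKS = "23456789TJQKA"

def expand(token):
    # e.g. "22+", "A9s+", "AKo"
    token = token.strip()
    plus = token.endswith("+")
    if plus:
        token = token[:-1]
    suffix = token[2:]                  # "", "s" or "o"
    a, b = token[0], token[1]
    if a == b:                          # pairs: "22+" means 22 through AA
        start = RANKS.index(a)
        stop = len(RANKS) if plus else start + 1
        return [RANKS[i] * 2 for i in range(start, stop)]
    hi, lo = RANKS.index(a), RANKS.index(b)
    stop = hi if plus else lo + 1       # "A9s+" raises the second rank up to K
    return [a + RANKS[i] + suffix for i in range(lo, stop)]

def expand_range(spec):
    hands = []
    for tok in spec.split(","):
        hands.extend(expand(tok))
    return hands

print(expand_range("22+, A9s+, AKo"))

This only covers the single-token forms from the question; the bracketed interval syntax described above would need its own rule.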

Basic Reed-Solomon Error Correction Question

Does Reed-Solomon error correction work in an instance where there is a dropped byte (or multiple dropped bytes)? For example, let's say it's a (12,8) Reed Solomon code, so theoretically it should be able to correct 2 errors (or 4 erasures if the position is known). But, what happens if only 11 (or 10) bytes are received and one doesn't know which byte(s) were dropped? Will Reed-Solomon error correction work?
Thanks,
Ben
RS decoding for erasures requires the position of the symbols "dropped" or lost. The kind of error you're talking about is due to phase distortion.
You can make it work by simply cycling through the possible positions where the character might be missing and letting the decoder try to correct the result. So, let's say you received 10 characters:
1234567890
Have it correct the following values:
??1234567890
?1?234567890
?12?34567890
:
1??234567890
1?2?34567890
:
1234567890??
Each attempt will probably give you some result, most of which are not the one you want. But I would expect that there should be exactly one result with the minimal number of additional modifications, and that should be the one you want to use as the most likely to be correct answer.
For example, if you run the correction on the first three candidates of the example above, you might get the following results:
    v
361274567890
917234567890
312734569897
        ^  ^
For the first and third case, you have additional corrections made beyond filling in the two blanks (marked with v and ^), whereas in the second case you have only the missing positions filled in and the other characters match the uncorrected input. Therefore, I would choose answer 2 as the most likely to be correct one.
Clearly, the chances that this works depend on whether there are other errors. Unfortunately I'm not able to give you a rigorous set of conditions under which this method will work for sure.
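Here is a rough Python sketch of this brute-force idea for a (12,8) code. Note that rs_decode() is deliberately left as a stub: it stands in for whatever Reed-Solomon library you use, assumed to accept known erasure positions and return the full corrected codeword.

from itertools import combinations

N = 12        # codeword length for the (12,8) code in the question
PAD = 0       # placeholder symbol for the guessed missing positions

def rs_decode(symbols, erasure_positions):
    # Stub: plug in a real RS decoder here that takes erasure positions
    # and returns the corrected full-length codeword.
    raise NotImplementedError

def recover(received):
    # Try every placement of the missing symbols; keep the candidate that
    # needs the fewest corrections outside the guessed gaps.
    missing = N - len(received)
    best = None
    for gaps in combinations(range(N), missing):
        trial, symbols = [], iter(received)
        for i in range(N):
            trial.append(PAD if i in gaps else next(symbols))
        try:
            decoded = rs_decode(trial, list(gaps))
        except Exception:
            continue                      # treat any decoder failure as an uncorrectable placement
        extra = sum(1 for i in range(N)
                    if i not in gaps and decoded[i] != trial[i])
        if best is None or extra < best[0]:
            best = (extra, decoded)
    return best[1] if best else None

As noted above, this only works reliably when there are few or no additional errors on top of the missing symbols.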
Another thing you can do if your message is long enough is to use an interleaving technique to basically have multiple orthogonal RS codes cover your data. That way, if one fails, you might be able to recover with another one. This method is for example used on compact discs (CDs), where it is called CIRC.
No, Reed-Solomon can't automatically correct instances where there are missing bits, because just like most other FEC algorithms, it was only designed to correct bit-flips. If you know the position of the missing bits, you can pad your received signal at those positions so that RS can then work normally.
However, if you don't know the position, you will need to use another algorithm that supports bit-insertion or bit-deletion such as Marker Codes and Watermark Codes.
Also note that RS can handle not only erasures but also a mix of errors and erasures (errata) by using the Forney syndrome.

Coregions in UML

What are coregions in UML sequence diagrams?
Coregions are used when the sequence of events does not matter, that is, they can occur safely in any order.
This is one of the first few pages I found when I searched coregion sequence diagram in Google.
The coregion is a notational/syntax choice for representing parallel CombinedFragments. The UML 2.2 Superstructure spec (14.3.3) says:
Parallel: The interactionOperator par designates that the CombinedFragment represents a parallel merge between the behaviors of the operands. The OccurrenceSpecifications of the different operands can be interleaved in any way as long as the ordering imposed by each operand as such is preserved. A parallel merge defines a set of traces that describes all the ways that OccurrenceSpecifications of the operands may be interleaved without obstructing the order of the OccurrenceSpecifications within the operand.
The answer above is correct; this is just more context.
UML is specified by the OMG in two documents (http://www.omg.org/spec/uml): the UML Infrastructure and the UML Superstructure. Any other documentation may not be official.
In the UML Superstructure, section 14.3.3, it says:
A notational shorthand for parallel combined fragments are available for the common situation where the order of event occurrences (or other nested fragments) on one Lifeline is insignificant. This means that in a given “coregion” area of a Lifeline all the directly contained fragments are considered separate operands of a parallel combined fragment.

Looking for algorithm that reverses the sprintf() function output

I am working on a project that requires the parsing of log files. I am looking for a fast algorithm that would take groups of messages like this:
The temperature at P1 is 35F.
The temperature at P1 is 40F.
The temperature at P3 is 35F.
Logger stopped.
Logger started.
The temperature at P1 is 40F.
and puts out something in the form of a printf():
"The temperature at P%d is %dF.", Int1, Int2"
{(1,35), (1, 40), (3, 35), (1,40)}
The algorithm needs to be generic enough to recognize almost any data load in message groups.
I tried searching for this kind of technology, but I don't even know the correct terms to search for.
I think you might be overlooking fscanf() and sscanf(), which are the opposite of fprintf() and sprintf().
Overview:
A naïve algorithm keeps track of the frequency of words on a per-column basis, where we can assume that each line can be separated into columns by a delimiter.
Example input:
The dog jumped over the moon
The cat jumped over the moon
The moon jumped over the moon
The car jumped over the moon
Frequencies:
Column 1: {The: 4}
Column 2: {car: 1, cat: 1, dog: 1, moon: 1}
Column 3: {jumped: 4}
Column 4: {over: 4}
Column 5: {the: 4}
Column 6: {moon: 4}
We could partition these frequency lists further by grouping based on the total number of fields, but in this simple and convenient example, we are only working with a fixed number of fields (6).
The next step is to iterate through lines which generated these frequency lists, so let's take the first example.
The: meets some hand-wavy criteria and the algorithm decides it must be static.
dog: doesn't appear to be static based on the rest of the frequency list, and thus it must be dynamic as opposed to static text. We loop through a few pre-defined regular expressions and come up with /[a-z]+/i.
over: same deal as #1; it's static, so leave as is.
the: same deal as #1; it's static, so leave as is.
moon: same deal as #1; it's static, so leave as is.
Thus, just from going over the first line we can put together the following regular expression:
/The ([a-z]+?) jumped over the moon/
Considerations:
Obviously one can choose to scan part or the whole document for the first pass, as long as one is confident the frequency lists will be a sufficient sampling of the entire data.
False positives may creep into the results, and it will be up to the filtering algorithm (hand-waving) to provide the best threshold between static and dynamic fields, or some human post-processing.
The overall idea is probably a good one, but the actual implementation will definitely weigh in on the speed and efficiency of this algorithm.
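Here is a rough Python sketch of the column-frequency idea above, using the example lines; the 0.9 static/dynamic threshold and the catch-all [A-Za-z0-9.]+ pattern are arbitrary stand-ins for the hand-wavy criteria:

import re
from collections import Counter

def build_template(lines, static_threshold=0.9):
    rows = [line.split() for line in lines]
    width = len(rows[0])
    rows = [r for r in rows if len(r) == width]   # group by field count
    freqs = [Counter(col) for col in zip(*rows)]

    parts = []
    for token, freq in zip(rows[0], freqs):
        if freq[token] / len(rows) >= static_threshold:
            parts.append(re.escape(token))        # static text
        else:
            parts.append(r"([A-Za-z0-9.]+)")      # dynamic field
    return re.compile(r"\s+".join(parts))

lines = [
    "The dog jumped over the moon",
    "The cat jumped over the moon",
    "The moon jumped over the moon",
    "The car jumped over the moon",
]
template = build_template(lines)
print(template.pattern)   # The\s+([A-Za-z0-9.]+)\s+jumped\s+over\s+the\s+moon
print(template.match("The cow jumped over the moon").groups())   # ('cow',)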
Thanks for all the great suggestions.
Chris is right. I am looking for a generic solution for normalizing any kind of text. Solving the problem boils down to dynamically finding patterns in two or more similar strings.
Almost like predicting the next element in a set, based on the previous two:
1: Everest is 30000 feet high
2: K2 is 28000 feet high
=> What is the pattern?
=> Answer:
[name] is [number] feet high
Now the text file can have millions of lines and thousands of patterns. I would like to parse the files very, very fast, find the patterns and collect the data sets that are associated with each pattern.
I thought about creating some high level semantic hashes to represent the patterns in the message strings.
I would use a tokenizer and give each of the tokens types a specific "weight".
Then I would group the hashes and rate their similarity. Once the grouping is done I would collect the data sets.
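For illustration only, a crude version of that token-type idea might look like this in Python (the token classes are placeholders and the weighting step is omitted):

import re
from collections import defaultdict

def signature(line):
    # reduce a line to a tuple of token types: WORD, NUM or SYM
    sig = []
    for tok in re.findall(r"\w+|\S", line):
        if tok.isdigit():
            sig.append("NUM")
        elif tok.isalnum():
            sig.append("WORD")
        else:
            sig.append("SYM")
    return tuple(sig)

groups = defaultdict(list)
for line in ["Everest is 30000 feet high", "K2 is 28000 feet high"]:
    groups[signature(line)].append(line)

for sig, members in groups.items():
    print(sig, members)   # both lines share (WORD, WORD, NUM, WORD, WORD)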
I was hoping that I didn't have to reinvent the wheel and could reuse something that is already out there.
Klaus
It depends on what you are trying to do. If your goal is to quickly generate sprintf() input, this works. If you are trying to parse data, maybe regular expressions would do too.
You're not going to find a tool that can simply take arbitrary input, guess what data you want from it, and produce the output you want. That sounds like strong AI to me.
Producing something like this, even just to recognize numbers, gets really hairy. For example, is "123.456" one number or two? How about "123,456"? Is "35F" a decimal number followed by an 'F', or is it the hex value 0x35F? You're going to have to build something that parses in the way you need. You can do this with regular expressions, or with sscanf, or some other way, but you're going to have to write something custom.
However, with basic regular expressions, you can do this yourself. It won't be magic, but it's not that much work. Something like this will parse the lines you're interested in and consolidate them (Perl):
my @vals = ();
while (defined(my $line = <>))
{
    # capture the probe number and the temperature from matching lines
    if ($line =~ /The temperature at P(\d+) is (\d+)F\./)
    {
        push(@vals, "($1,$2)");
    }
}
print "The temperature at P%d is %dF. {";
for (my $i = 0; $i < @vals; $i++)
{
    print $vals[$i];
    if ($i < @vals - 1)
    {
        print ",";
    }
}
print "}\n";
The output from this is:
The temperature at P%d is %dF. {(1,35),(1,40),(3,35),(1,40)}
You could do something similar for each type of line you need to parse. You could even read these regular expressions from a file, instead of custom coding each one.
I don't know of any specific tool to do that. What I did when I had a similar problem to solve was trying to guess regular expressions to match lines.
I then processed the files and displayed only the unmatched lines. If a line is unmatched, it means that the pattern is wrong and should be tweaked or another pattern should be added.
After around an hour of work, I succeeded in finding the ~20 patterns to match 10000+ lines.
In your case, you can first "guess" that one pattern is "The temperature at P[1-3] is [0-9]{2}F.". If you reprocess the file removing any matched line, it leaves "only":
Logger stopped.
Logger started.
Which you can then match with "Logger (.+).".
You can then refine the patterns and find new ones to match your whole log.
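A small Python sketch of that workflow, using the two patterns above; the log file name comes from the command line, and the patterns are just the starting guesses:

import re
import sys

patterns = [
    re.compile(r"The temperature at P[1-3] is [0-9]{2}F\."),
    re.compile(r"Logger (.+)\."),
]

with open(sys.argv[1]) as log:
    for line in log:
        # print only the lines no pattern matches, so you know what to add or tweak next
        if not any(p.search(line) for p in patterns):
            print(line, end="")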
@John: I think that the question relates to an algorithm that actually recognises patterns in log files and automatically "guesses" appropriate format strings and data for them. The *scanf family can't do that on its own; it can only be of help once the patterns have been recognised in the first place.
@Derek Park: Well, even a strong AI couldn't be sure it had the right answer.
Perhaps some compression-like mechanism could be used:
Find large, frequent substrings
Find large, frequent substring patterns. (i.e. [pattern:1] [junk] [pattern:2])
Another item to consider might be to group lines by edit-distance. Grouping similar lines should split the problem into one-pattern-per-group chunks.
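For instance, a quick Python sketch of that edit-distance grouping, using difflib's similarity ratio in place of a true edit distance (the 0.7 cutoff is an arbitrary choice):

from difflib import SequenceMatcher

def group_lines(lines, cutoff=0.7):
    groups = []                      # each group is a list of similar lines
    for line in lines:
        for group in groups:
            if SequenceMatcher(None, line, group[0]).ratio() >= cutoff:
                group.append(line)
                break
        else:
            groups.append([line])    # no existing group was close enough
    return groups

lines = [
    "The temperature at P1 is 35F.",
    "The temperature at P1 is 40F.",
    "The temperature at P3 is 35F.",
    "Logger stopped.",
    "Logger started.",
]
for g in group_lines(lines):
    print(g)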
Actually, if you manage to write this, let the whole world know, I think a lot of us would like this tool!
@Anders
Well, even a strong AI couldn't be sure it had the right answer.
I was thinking that sufficiently strong AI could usually figure out the right answer from the context. e.g. Strong AI could recognize that "35F" in this context is a temperature and not a hex number. There are definitely cases where even strong AI would be unable to answer. Those are the same cases where a human would be unable to answer, though (assuming very strong AI).
Of course, it doesn't really matter, since we don't have strong AI. :)
http://www.logparser.com forwards to an IIS forum which seems fairly active. This is the official site for Gabriele Giuseppini's "Log Parser Toolkit". While I have never actually used this tool, I did pick up a cheap copy of the book from Amazon Marketplace - today a copy is as low as $16. Nothing beats a dead-tree-interface for just flipping through pages.
Glancing at this forum, I had not previously heard about the "New GUI tool for MS Log Parser, Log Parser Lizard" at http://www.lizardl.com/.
The key issue, of course, is the complexity of your GRAMMAR. To use any kind of log-parser, as the term is commonly used, you need to know exactly what you're scanning for, i.e., something you can write a BNF for. Many years ago I took a course based on Aho and Ullman's "Dragon Book", and the thoroughly understood LALR technology can give you optimal speed, provided of course that you have that CFG.
On the other hand it does seem you're possibly reaching for something AI-like, which is a different order of complexity entirely.
