Plink - reverse complement - bioinformatics

I am using Plink for the first time and am checking my data against some previously genotyped samples (these are the same samples so the genotypes should match up).
My data is nearly correct as in it has called homs and hets correctly but for some SNPs my data has the reverse complement.
e.g.
What command do I need in plink do I need to tell it to call the reverse complement when needed??

I think I have got to the source of the problem should anyone else stumble across this post with the same issue.
I was using an Illumina SNP chip. Apparently there is some discrepancy in post-processing of genotype data with strand definition. These two links explain it beautifully:
http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
https://academic.oup.com/bioinformatics/article/33/15/2399/3204987

Related

Algorithm for one sign checksum

I am desperate in the search for an algorithm to create a checksum that is a maximum of two characters long and can recognize the confusion of characters in the input sequence. When testing different algorithms, such as Luhn, CRC24 or CRC32, the checksums were always longer than two characters. If I reduce the checksum to two or even one character, then no longer all commutations are recognized.
Does any of you know an algorithm that meets my needs? I already have a name with which I can continue my search. I would be very grateful for your help.
Taking that your data is alphanumeric, you want to detect all the permutations (in the perfect case), and you can afford to use the binary checksum (i.e. full 16 bits), my guess is that you should probably go with CRC-16 (as already suggested by #Paul Hankin in the comments), as it is more information-dense compared to check-digit algorithms like Luhn or Damm, and is more "generic" when it comes to possible types of errors.
Maybe something like CRC-CCITT (CRC-16-CCITT), you can give it a try here, to see how it works for you.

Even when using the same randomseed in Lua, get different results?

I have a large, rather complicated procedural content generation lua project. One thing I want to be able to do, for debugging purposes, is use a random seed so that I can re-run the system & get the same results.
To the end, I print out the seed at the start of a run. The problem is, I still get completely different results each time I run it. Assuming the seed doesn't change anywhere else, this shouldn't be possible, right?
My question is, what other ways are there to influence the output of lua's math.random()? I've searched through all the code in the project, and there's only one place where I call math.randomseed(), and I do that before I do anything else. I don't use the time or date for any calculations, so that wouldn't be influencing the results... What else could I be missing?
Updated on 2/22/16 monkey patching math.random & math.randomseed has, oftentimes (but not always) output the same sequence of random numbers. But still not the same results – so I guess the real question is now: what behavior in lua is indeterminate, and could result in different output when the same code is run in sequence? Noting where it diverges, when it does, is helping me narrow it down, but I still haven't found it. (this code does NOT use coroutines, so I don't think it's a threading / race condition issue)
randomseed is using srandom/srand function, which "sets its argument as the seed for a new sequence of pseudo-random integers to be returned by random()".
I can offer several possible explanations:
you think you call randomseed, but you do not (random will initialize the sequence for you in this case).
you think you call randomseed once, but you call it multiple times (or some other part of the code calls randomseed as well, possibly at different times in your sequence).
some other part of the code calls random (some number of times), which generates different results for your part of the code.
there is nothing wrong with the generated sequence, but you are misinterpreting the results.
your version of Lua has a bug in srandom/random processing.
there is something wrong with srandom or random function in your system.
Having some information about your version of Lua and your system (in addition to the small example demonstrating the issue) would help in figuring out what's causing this.
Updated on 2016/2/22: It should be fairly easy to check; monkeypatch both math.randomseed and math.random and log all the calls and the values returned by the functions for two subsequent runs. Compare the results. If the results differ, you should be able to isolate why they differ and reproduce on a smaller example. You can also look at where the functions are called from using debug.traceback.
Correct, as stated in the documentation, 'equal seeds produce equal sequences of numbers.'
Immediately after setting the seed to a known constant value, output a call to rand - if this varies across runs, you know something is seriously wrong (corrupt library download, whack install, gamma ray hit your drive, etc).
Assuming that the first value matches across runs, add another output midway through the code. From there, you can use a binary search to zero in on where things go wrong (I.E. first half or second half of the code block in question).
While you can & should use some intuition to find the error as you go, keep in mind that if intuition alone was enough, you would have already found it, thus a bit of systematic elimination is warranted.
Revision to cover comment regarding array order:
If possible, use debugging tools. This SO post on detecting when the value of a Lua variable changes might help.
In the absence of tools, here's one way to roll your own for this problem:
A full debugging dump of any sizable array quickly becomes a mess that makes it tough to spot changes. Instead, I'd use a few extra variables & a test function to keep things concise.
Make two deep copies of the array. Let's call them debug01 & debug02 & call the original array original. Next, deliberately swap the order of two elements in debug02.
Next, build a function to compare two arrays & test if their elements match up & return / print the index of the first mismatch if they do not. Immediately after initializing the arrays, test them to ensure:
original & debug01 match
original & debug02 do not match
original & debug02 mismatch where you changed them
I cannot stress enough the insanity of using an unverified (and thus, potentially bugged) test function to track down bugs.
Once you've verified the function works, you can again use a binary search to zero in on where things go off the rails. As before, balance the use of a systematic search with your intuition.

Patterns in a key generation algorithm

I want to reverse-engineer a key generation algorithm which starts from a 4-byte ID, and the output is a 4-byte key. This seems to not be impossible or very difficult, because some patterns can be observed. In the following picture are the inputs and outputs of the algorithm for 8 situations:
As it can be seen, if the bytes from inputs are matching, also the outputs are matching, but with some exceptions (the red marking in the image).
So I think there are some simple arithmetic/binary operations done, and the mismatch could come from a carry of an addition operation.
Until now I ran a C program with some simple operations on the least significant byte of the inputs, with up to 4 variable parameters (0..255, all combinations) and compared with the output LSB, but without success.
Could you please advise me, what else could I try? And what do you think, it's possible what I'm trying to do?
Thank you very much!

Strand specific tophat/cufflinks

I have a strand-specific RNA-seq library to assemble (Illumina). I would like to use TopHat/Cufflinks. From the manual of TopHat, it says,
"--library-type TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol."
Does it mean that TopHat only supports strand-specific protocols? I use option "--library-type fr-unstranded" to run, does it mean it runs in a strand-specific way? I googled it and asked the developers, but got no answer...
I got some result:
Here the contig is assembled by two groups of reads, left side are reverse reads, while right side is forward. (for visualization, i have reverse complement the right mate)
But some of the contigs are assembled purely from reverse or forward reads. If it is strand specific, one gene should produce the reads in the same direction. It should not report the result like the image above, am I right? Or is it possible that one gene is fragmented and then sequence independently, so that happenly left part produce reverse reads while right part produce forward reads? From my understanding, the strand specificity is kept by 3'/5' ligation, so should be in the unit of genes.
What is the problem here? Or did I understand the concept of 'strand specific' wrongly? Any help is appreciated.
Tophat/Cufflinks are not for assembly, they are for alignment to an already assembled genome or transcriptome. What are you aligning your reads to?
Also, if you have strand specific data, you shouldn't choose an unstranded library type. You should choose the proper one based on your library preparation method. The XS tag will only be placed on split reads if you choose an unstranded library type.
If you want to do a de novo assembly of your transcriptome you should take a look at assemblers (not mappers) like
Trinity
SoapDeNovo
Oases....
Tophat can deal with both stranded libraries and unstranded libraries. In your snapshot the center region does have both + and - strand reads. The biases at the two ends might be some characteristics of your library prep or analytical methods. What's the direction of this gene? It looks like a little bit biased towards the left side. If the left hand-side corresponds to 3' end then it's likely that your library prep has 3' bias features (e.g dT-primed Reverse transcription) The way you fragment your RNA may also have effects on your reads distribution.
I guess we need more information to find the truth. But we should also keep in mind that tophat/cufflinks may have bugs, too.

Basic Reed-Solomon Error Correction Question

Does Reed-Solomon error correction work in an instance where there is a dropped byte (or multiple dropped bytes)? For example, let's say it's a (12,8) Reed Solomon code, so theoretically it should be able to correct 2 errors (or 4 erasures if the position is known). But, what happens if only 11 (or 10) bytes are received and one doesn't know which byte(s) were dropped? Will Reed-Solomon error correction work?
Thanks,
Ben
RS decoding for erasures requires the position of the symbols "dropped" or lost. The kind of error you're talking about is due to phase distortion.
You can make it work by simply cycling through the possible positions where the character might be missing and letting it try to correct your result, so let's say you received 10 characters:
1234567890
Have it correct the following values:
??1234567890
?1?234567890
?12?34567890
:
1??234567890
1?2?34567890
:
1234567890??
Each attempt will probably give you some result, most of which are not the one you want. But I would expect that there should be exactly one result with the minimal number of additional modifications, and that should be the one you want to use as the most likely to be correct answer.
For example, if you correct the first three numbers of the example above, you might get the following result:
v
361274567890
917234567890
312734569897
: ^ ^
For the first and third case, you have additional corrections made beyond filling in the two blanks (marked with v and ^), whereas in the second case you have only the missing positions filled in and the other characters match the uncorrected input. Therefore, I would choose answer 2 as the most likely to be correct one.
Clearly, the chances that this works depend on whether there are other errors. Unfortunately I'm not able to give you a rigorous set of conditions under which this method will work for sure.
.
Another thing you can do if your message is long enough is to use an interleaving technique to basically have multiple orthogonal RS codes cover your data. That way, if one fails, you might be able to recover with another one. This method is for example used on compact discs (CDs), where it is called CIRC.
No, Reed-Solomon can't automatically correct instances where there are missing bits, because just like most other FEC algorithms, it was only designed to correct bit-flips. If you know the position of the missing bits, you can pad your received signal at those positions so that RS can then work normally.
However, if you don't know the position, you will need to use another algorithm that supports bit-insertion or bit-deletion such as Marker Codes and Watermark Codes.
Also note that RS can be not only used for erasures but also to process noisy bits using Forney syndrome.

Resources