How to align reads to two SHORT reference sequences and see the percentage that mapped to one or the other reference?

I have PCR-Amplified fastq files of a specific target region from several samples. For each sample, I want to know the percentage of reads that align better to reference sequence #1 or #2 posted below. How should I begin to tackle this question and what tool for alignment is best?
I am working with Illumina paired-end adapter sequences spiked in on a 2x150 run. The two reference amplicons are 173 and 179 bp:
1: aaaaagtataaatataggaccaggcagagcattttatacaacaggagaaataataggagatataagacaagcacattgtaaccttagtagagcaaaatggaatgacactttaaataagatagttataaaattaagagaacaatttgggaataaaacaatagtctttaagcact
2: aaaaagtatccgtatccagaggggaccagggagagcatttgttacaataggaaaaataggaaatatgagacaagcacattgtaacattagtagagcaaaatggaatgccactttaaaacagatagctagcaaattaagagaacaatttggaaataataaaacaataatctttaagcaat
We want to know whether one virus wins over the other after infection, based on the differences between these two sequences; so essentially the percentage of reads that align best to #1 and the percentage that align best to #2.
Thank You,
Sara

Convert your reference amplicons to fasta format.
Choose an aligner, such as bwa mem, bowtie2, etc.
Index the reference for your aligner of choice.
Align the reads to the reference using your aligner of choice.
Use samtools idxstats to find the number of reads aligned to each of the amplicons.
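If it helps to see that last step end to end, below is a minimal Go sketch that reads samtools idxstats output from stdin and prints, for each reference, the mapped-read count and its percentage of all mapped reads. The reference names (amplicon1/amplicon2) and the BAM file name in the comment are placeholders; the idxstats columns (name, length, mapped, unmapped) are the documented format.

// percent.go: turn `samtools idxstats aligned.bam` output into per-reference percentages.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func main() {
    counts := map[string]int64{}
    var total int64

    // Usage (aligned.bam is a placeholder):
    //   samtools idxstats aligned.bam | go run percent.go
    // Each idxstats line is: refName <TAB> refLength <TAB> mappedReads <TAB> unmappedReads
    sc := bufio.NewScanner(os.Stdin)
    for sc.Scan() {
        fields := strings.Split(sc.Text(), "\t")
        if len(fields) < 4 || fields[0] == "*" {
            continue // skip malformed lines and the unmapped ("*") line
        }
        mapped, err := strconv.ParseInt(fields[2], 10, 64)
        if err != nil {
            continue
        }
        counts[fields[0]] += mapped
        total += mapped
    }
    if err := sc.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "read error:", err)
        os.Exit(1)
    }

    // Report e.g. amplicon1 and amplicon2 (whatever names you used in the FASTA).
    for name, n := range counts {
        fmt.Printf("%s\t%d\t%.2f%%\n", name, n, 100*float64(n)/float64(total))
    }
}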
Notes:
It is often a good idea to trim adapters from the reads before you align the reads. A number of good adapter trimmers exist, such as flexbar, skewer, etc.
Many popular bioinformatics packages mentioned above can be easily installed, for example using conda.
REFERENCES:
conda
bwa
bowtie2
samtools
flexbar
skewer

Hex - Search by bytes to get Offset & Search by Offset to get Bytes

I'm currently prototyping a small piece of software and I'm stuck. I'm trying to create a little program that will edit a .bin file, and for this I need to do the following:
Get Bytes by Searching for Offset
Get Offset by searching for Bytes
Write/Update .bin file
I usually use the program HxD to do this manually, but want to get a small automated process in place.
Using hex.EncodeToString returns what I want as the output (like HxD), but I can't find a way to search for values by bytes and offsets.
Could anyone help or have suggestions?
OK, "searching of an offset" is a misnomer because if you have an offset and a medium which supports random access, you just "seek" the known offset there; for files, see os.File.Seek.
Searching is more complex: it consists of converting the user input into something searchable and, well, the searching itself.
Conversion is the process of translating the human operator's input into a slice of bytes; for instance, you'd need to convert the string "00 87" to the byte slice []byte{0x00, 0x87}.
Such conversion can be done using, say, encoding/hex.Decode after removing any whitespace, which can be done using a multitude of ways.
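A small sketch of that conversion step, assuming the input looks like "00 87" (whitespace optional):

package main

import (
    "encoding/hex"
    "fmt"
    "strings"
)

// parsePattern strips all whitespace and decodes the remaining hex digits.
func parsePattern(s string) ([]byte, error) {
    clean := strings.Join(strings.Fields(s), "") // "00 87" -> "0087"
    return hex.DecodeString(clean)
}

func main() {
    pat, err := parsePattern("00 87")
    if err != nil {
        panic(err)
    }
    fmt.Printf("%#v\n", pat) // []byte{0x0, 0x87}
}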
Searching the file given a slice of bytes can be either simple or complex.
If a file is small (a couple megabytes, on today's hardware), you can just slurp it into memory (for instance, using io.ReadAll) and do a simple search using bytes.Index.
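Here is a rough sketch of that slurp-and-search approach; the file name some.bin and the pattern bytes are just placeholders:

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open("some.bin") // placeholder file name
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // Slurp the whole file into memory; fine for small files.
    data, err := io.ReadAll(f)
    if err != nil {
        panic(err)
    }

    pattern := []byte{0x00, 0x87}
    if off := bytes.Index(data, pattern); off >= 0 {
        fmt.Printf("pattern found at offset 0x%X\n", off)
    } else {
        fmt.Println("pattern not found")
    }
}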
If a file is big, the complexity of the task quickly escalates.
For instance, you could read the file from its beginning to its end using chunks of some sensible size and search for your byte slice in each of them.
But you'd need to watch out for two issues: the slice to search for should be smaller than each chunk, and two adjacent chunks might contain the sequence split right across their boundary, so that the Nth chunk contains the first part of the pattern at its end and the (N+1)th chunk contains the rest at its beginning.
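A sketch of that chunked variant which handles both issues: it keeps the last len(pattern)-1 bytes of the previous chunk so a match straddling a boundary is still found, and returns the first absolute offset. The chunk size and file name are placeholders.

package main

import (
    "bytes"
    "errors"
    "fmt"
    "io"
    "os"
)

func findInFile(path string, pattern []byte, chunkSize int) (int64, error) {
    if chunkSize < len(pattern) {
        return -1, errors.New("chunk size must be at least the pattern length")
    }
    f, err := os.Open(path)
    if err != nil {
        return -1, err
    }
    defer f.Close()

    overlap := len(pattern) - 1
    buf := make([]byte, overlap+chunkSize)
    filled := 0       // bytes currently held in buf
    var fileOff int64 // file offset corresponding to buf[0]

    for {
        n, err := f.Read(buf[filled:])
        filled += n
        if i := bytes.Index(buf[:filled], pattern); i >= 0 {
            return fileOff + int64(i), nil
        }
        if err == io.EOF {
            return -1, nil // not found
        }
        if err != nil {
            return -1, err
        }
        // Keep only the trailing overlap bytes and continue reading.
        if filled > overlap {
            copy(buf, buf[filled-overlap:filled])
            fileOff += int64(filled - overlap)
            filled = overlap
        }
    }
}

func main() {
    off, err := findInFile("some.bin", []byte{0x00, 0x87}, 1<<20) // placeholder inputs
    if err != nil {
        panic(err)
    }
    fmt.Println("first match at offset:", off)
}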
There exist more advanced approaches to such searching, for instance so-called "memory-mapped files", but I'd speculate it's a bit too early to tread those lands, given your question.

How to check files for equality before doing a byte-by-byte comparison?

I am writing a program that compares a lot of files.
I first group files by file size, then I compare the grouped files byte by byte. What parameters or properties can I check before the byte-by-byte comparison to minimize its use?
Update:
To get a checksum I need to read the entire file. I'm looking for some property that can cheaply filter out unequal files. I forgot to say that I need the files to be 100% equal; hash functions have collisions.
If the files are recorded as being the same size by the operating system then there is no way to know if they are different other than checking bytes.
For a group of files, once two files are known to be the same, then the comparison only needs to be done for one of the two. It would be wise to sort the files in a group by date for this reason, on the theory that files with similar dates are more likely to be identical. Thus, you should maintain lists of identical files. When a new comparison is done it need only be compared to the head of the list.
You should allocate as much memory as possible up front and keep the list heads in memory.
When the comparison is being done you should not compare single bytes, but words. For example, on a 32-bit machine you would read data in 512-byte blocks from the hard drive and then compare each block 4 bytes at a time. x86 processors also have SIMD ("vector") instruction sets such as MMX, SSE, and AVX; you want to be sure you are using those.
If you are writing in C for an Intel box, use Intel's compiler, not Microsoft's. Double check the assembly to make sure the compiler is not doing something stupid.
You can also increase the speed of the work by parallelizing it. This is done by creating threads. For example, if the code is running on a quad core machine you create 4 threads and divide the work among the 4 threads.
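For illustration, a rough Go sketch of the block-wise comparison described above; the file names are placeholders, and bytes.Equal is already heavily optimized internally, so the loop simply stops at the first differing block:

package main

import (
    "bytes"
    "fmt"
    "io"
    "os"
)

// sameContent compares two files block by block and returns true only if
// every byte matches (the files are assumed to have passed the size check).
func sameContent(pathA, pathB string) (bool, error) {
    a, err := os.Open(pathA)
    if err != nil {
        return false, err
    }
    defer a.Close()
    b, err := os.Open(pathB)
    if err != nil {
        return false, err
    }
    defer b.Close()

    const blockSize = 64 * 1024
    bufA := make([]byte, blockSize)
    bufB := make([]byte, blockSize)
    for {
        nA, errA := io.ReadFull(a, bufA)
        nB, errB := io.ReadFull(b, bufB)
        if nA != nB || !bytes.Equal(bufA[:nA], bufB[:nB]) {
            return false, nil
        }
        if errA == io.EOF || errA == io.ErrUnexpectedEOF {
            // A is exhausted; they are equal only if B is exhausted too.
            return errB == io.EOF || errB == io.ErrUnexpectedEOF, nil
        }
        if errA != nil {
            return false, errA
        }
        if errB != nil {
            return false, errB
        }
    }
}

func main() {
    equal, err := sameContent("a.bin", "b.bin") // placeholder file names
    if err != nil {
        panic(err)
    }
    fmt.Println("identical:", equal)
}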
Check the file's checksum. It was meant for this task.
For Python you can use hashlib. For C you can use, for example, MD5 from OpenSSL. There are similar functions for PHP, MySQL, and probably every other programming language.
You can also use the md5sum utility built into Linux.
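If you go the checksum route, here is a sketch of the prefilter idea in Go (using SHA-256 here instead of MD5; the file names are placeholders): group the same-sized files by digest, and only files that share a digest still need the final byte-by-byte confirmation, since in principle any hash can collide.

package main

import (
    "crypto/sha256"
    "fmt"
    "io"
    "os"
)

// digest streams a file through SHA-256 and returns its 32-byte sum.
func digest(path string) ([sha256.Size]byte, error) {
    var sum [sha256.Size]byte
    f, err := os.Open(path)
    if err != nil {
        return sum, err
    }
    defer f.Close()
    h := sha256.New()
    if _, err := io.Copy(h, f); err != nil {
        return sum, err
    }
    copy(sum[:], h.Sum(nil))
    return sum, nil
}

func main() {
    files := []string{"a.bin", "b.bin", "c.bin"} // placeholder: files already grouped by size
    groups := map[[sha256.Size]byte][]string{}
    for _, p := range files {
        d, err := digest(p)
        if err != nil {
            panic(err)
        }
        groups[d] = append(groups[d], p)
    }
    for _, g := range groups {
        if len(g) > 1 {
            fmt.Println("candidate duplicates (confirm byte-by-byte):", g)
        }
    }
}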

Strand specific tophat/cufflinks

I have a strand-specific RNA-seq library to assemble (Illumina). I would like to use TopHat/Cufflinks. From the manual of TopHat, it says,
"--library-type TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol."
Does it mean that TopHat only supports strand-specific protocols? If I use the option "--library-type fr-unstranded", does it mean it runs in a strand-specific way? I googled it and asked the developers, but got no answer...
I got some results:
Here the contig is assembled from two groups of reads: the left side has reverse reads, while the right side has forward reads (for visualization, I have reverse-complemented the right mate).
But some of the contigs are assembled purely from reverse or from forward reads. If the library is strand-specific, one gene should produce reads in the same direction; it should not give a result like the image above, am I right? Or is it possible that one gene was fragmented and then sequenced independently, so that the left part happens to produce reverse reads while the right part produces forward reads? From my understanding, the strand specificity is kept by 3'/5' ligation, so it should hold at the level of whole genes.
What is the problem here? Or did I understand the concept of 'strand specific' wrongly? Any help is appreciated.
Tophat/Cufflinks are not for assembly; they are for alignment to an already assembled genome or transcriptome. What are you aligning your reads to?
Also, if you have strand specific data, you shouldn't choose an unstranded library type. You should choose the proper one based on your library preparation method. The XS tag will only be placed on split reads if you choose an unstranded library type.
If you want to do a de novo assembly of your transcriptome you should take a look at assemblers (not mappers) like
Trinity
SoapDeNovo
Oases....
Tophat can deal with both stranded and unstranded libraries. In your snapshot the center region does have both + and - strand reads. The biases at the two ends might be characteristics of your library prep or analytical methods. What's the direction of this gene? It looks a little biased towards the left side. If the left-hand side corresponds to the 3' end then it's likely that your library prep has 3'-bias features (e.g. dT-primed reverse transcription). The way you fragment your RNA may also have an effect on your read distribution.
I guess we need more information to find the truth. But we should also keep in mind that tophat/cufflinks may have bugs, too.

Tutorials on the first pass and second pass of an assembler

Are there good tutorials around that explain the first and second pass of an assembler, along with their algorithms? I searched a lot but haven't found satisfying results.
Please link any such tutorials.
I don't know of any tutorials; there really isn't much to it.
one:
inc r0
cmp r0,0
jnz one
call fun
add r0,7
jmp more_fun
fun:
mov r1,r0
ret
more_fun:
The assembler (software), like a human, is going to read the source file from top to bottom, from byte 0 in the file to the end. There are no hard and fast rules as to what you complete in each pass, and it is not necessarily a pass "on the file" but a pass "on the data".
First pass:
As you read each line you parse it, building some sort of data structure that holds the instructions in file order. When you come across a label like one:, you keep track of which instruction it sits in front of, or perhaps you keep a marker between instructions, however you choose to implement it. When you come across an instruction that uses a label you have two choices: you can go look for that label right now, and if it is a backward-looking label then you should have seen it already, as with the jnz one instruction. If you have thus far been keeping track of the number and size (if variable word length) of the instructions, you can choose to encode this instruction now if it uses relative addressing; if the instruction set uses absolute addresses you might have to leave a placeholder anyway.
Now the call fun and jmp more_fun instructions pose a problem: when you get to them you cannot resolve them yet, because you don't know whether these labels are local to this file or live in another file, so you cannot encode these instructions on the first pass. You have to save them for later, and this is the reason for the second pass.
The second pass is likely to be a pass across your data structures and not actually over the file, and this is heavily implementation specific. For example you might have a one-dimensional array of structures with everything in there. You may choose to make many passes over that data: for example, start one index through the array looking for unresolved labels, and when you find one, send a second index through the array looking for the label definition. If you don't find it, then (application specific) does your assembler create objects to be linked later, or does it create a binary and therefore need everything resolved in this one assembly-to-binary step? If it creates objects, you assume the label is external, unless (again application specific) your assembler requires external labels to be declared as external. So whether or not the missing label is an error is application specific. If it is not an error then, again application specific, you should encode for the longest/farthest type of branch, leaving the address or distance details for the linker to fill in.
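To make the two passes concrete, here is a toy sketch in Go over an invented one-word-per-instruction format (not any real assembler): the first pass assigns addresses and records labels, and the second pass walks the stored instructions and resolves, or flags, each branch target.

package main

import (
    "fmt"
    "strings"
)

type inst struct {
    addr   int    // address of this instruction (1 word each)
    text   string // mnemonic and operands
    target string // label operand still to be resolved, "" if none
}

func main() {
    source := []string{
        "one:", "inc r0", "cmp r0,0", "jnz one",
        "call fun", "add r0,7", "jmp more_fun",
        "fun:", "mov r1,r0", "ret", "more_fun:",
    }

    labels := map[string]int{}
    var code []inst

    // First pass: assign addresses, collect labels, note unresolved targets.
    for _, line := range source {
        line = strings.TrimSpace(line)
        if strings.HasSuffix(line, ":") {
            labels[strings.TrimSuffix(line, ":")] = len(code)
            continue
        }
        in := inst{addr: len(code), text: line}
        f := strings.Fields(line)
        if len(f) == 2 && (f[0] == "jmp" || f[0] == "jnz" || f[0] == "call") {
            in.target = f[1]
        }
        code = append(code, in)
    }

    // Second pass: patch every branch/call with the distance to its label.
    for _, in := range code {
        if in.target == "" {
            fmt.Printf("%02d: %s\n", in.addr, in.text)
            continue
        }
        dest, ok := labels[in.target]
        if !ok {
            fmt.Printf("%02d: %s ; unresolved, leave for the linker\n", in.addr, in.text)
            continue
        }
        fmt.Printf("%02d: %s ; relative offset %+d\n", in.addr, in.text, dest-in.addr)
    }
}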
For the labels you have found, you now have a rough idea of how far away they are. Now, depending on the instruction set and/or the features of your assembler, you may need to make several more passes over the data. You need to start encoding the instructions; assuming you have at least one flavor of relative-distance call or branch instruction, you have to decide on the first encoding pass whether to hope for what I assume is the shorter/smaller encoding of the relative-distance branch, or to assume the larger one. You can't really determine whether the smaller one will reach until you make one or a few encoding passes across the instructions.
top:
...
jmp down
...
jnz top
...
down:
As you encode the jmp down, you might optimistically encode it as the smaller (in bytes/words, if variable word length) relative branch, leaving the distance to be determined. When you get to the jnz top, let's say it is, to the byte, just close enough to top to encode using a relative branch. On the second pass, though, you go back to finish the jmp down and find that it won't reach: you need more bytes/words to encode it as a long branch. Now the jnz top has to become a far branch as well (causing down to move again). You have to keep passing over the instructions, recomputing their far/short distances, until you make a pass with no changes. Be careful not to get caught in an infinite loop, where on one pass you get to shorten an instruction, but that causes another to lengthen, and on the next pass the lengthened one forces the first to lengthen again while the second can shorten, and this repeats forever.
We could go back to the top of this: in your first pass you might build more than one data structure; maybe as you go you build a list of found labels and a list of missing labels, and on the second pass you look through the list of missing labels, see whether they are in the found list, and resolve them that way. Or maybe on the first pass (and some might argue this makes it a single-pass assembler), when you find a label, before continuing through the file you look back to see whether anyone was looking for that label (or whether that label had already been defined, in which case declare an error). I would still call this a multi-pass assembler because it still passes through the data many times.
And now let's make it much worse. Look at the ARM instruction set, or any other fixed-length instruction set, as an example. Your relative branches are usually encoded in one instruction, thus the fixed-length instruction set. A far branch normally involves loading the pc from data found at some address, meaning you really need two items: the instruction, and then, somewhere within the relative reach of that instruction, a data word containing the absolute address of where to branch. You can choose to force the user to create these, but the ARM assemblers, for example, can and will do it for you; the simplest example is:
ldr r0,=0x12345678
...
b somewhere
That syntax means load r0 with the value 0x12345678, which does not fit in an ARM instruction. What the assembler does with that syntax is try to find a dead spot in the code within reach of that instruction where it can place the data value, and then it encodes the instruction as a load from a pc-relative address. For example, right after an unconditional branch is a good place to hide data. Sometimes you have to use directives like .pool to encourage or remind the assembler of good places to stick this data. r0 is not the program counter (r15 is), and you could use r15 there to connect this back to the branching discussion above.
Take a look at the assembler I created for this project: http://github.com/dwelch67/lsasim. It targets a fixed-length instruction set, but I force the user to allocate the word and load from it; I don't allow the shortcut the ARM assemblers tend to allow.
I hope this helps explain things. The bottom line is that you cannot resolve labels in one linear pass through the data; you have to go back and connect the dots to the forward-referenced labels. And I would argue you have to make many passes anyway to resolve all of the long/short encodings (unless the instruction set/syntax forces the user to explicitly specify an absolute vs relative branch, and some do: rjmp vs jmp, rjmp vs ljmp, rcall vs call, etc.). Making one pass on the "file" is not a problem. If you allow include-type directives, some tools will create a temporary file into which all the includes are pulled, producing a single file with no includes, and then the tool makes one pass over that (this is how gcc manages includes, for example; save the intermediate files sometime and see what files are produced). (If you report line numbers with warnings/errors then you have to map the temporary file's lines back to the original file names and lines.)
A good place to start is David Solomon's book, Assemblers and Loaders. It's an older book, but the information is still relevant.
You can download a PDF of the book.

Compression/encryption algorithms output guarantees

My question here regards compression/encryption algorithms in general, and probably sounds like a complete newbie one. Now, I understand that "in general" "it all depends", but suppose we're talking about algorithms that all have reference implementations/published specs and are altogether standard. To be more specific, I'm using the .NET implementations of AES-256 and GZip/Deflate.
So here goes: can it be assumed that, given exactly the same input, both types of algorithms will produce exactly the same output?
For example, will the output of aes(gzip("hello"), key, initVector) on .NET be identical to that on a Mac or Linux?
AES is rigorously defined, so given the same input, the same algorithm, and the same key, you will get the same output.
The same cannot be said for zip.
The problem is not the standard. There IS a defined standard: the Deflate stream format is IETF RFC 1951 (RFC 1950 for the zlib wrapper), and the gzip stream format is IETF RFC 1952, so anyone can produce a compatible compressor/decoder starting from these definitions.
But zip belongs to the large family of LZ compressors which, by construction, do not define a single canonical output: starting from a single source, there are many ways to describe the same input, all of them valid although different.
An example.
Let's say my input is: ABCABCABC
Valid outputs can be:
9 literals
3 literals followed by one copy, 6 bytes long, starting at offset -3
3 literals followed by two copies, 3 bytes long each, starting at offset -3
6 literals followed by one copy, 3 bytes long, starting at offset -6
etc.
All these outputs are valid and describe (regenerate) the same input. Obviously, one of them is more efficient (compresses more) than the others. But that's where implementations may differ: some will be more powerful than others. For example, it is known that kzip and 7zip generate better (more compressed) zip files than gzip. Even gzip has a lot of compression options that generate different compressed streams starting from the same input.
Now, if you want to always get exactly the same binary output, you need more than "zip": you need to enforce a precise zip implementation and precise compression parameters. Then you'll be sure that you always generate the same binary.
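To illustrate the point (in Go rather than .NET, but the principle is the same): pin the compressor, its level, the key, and the IV, and the byte stream becomes reproducible for a given library version; change any of them, or the compressor implementation, and the bytes may differ even though they still decompress/decrypt to the same plaintext. The key/IV values below are placeholder demo values only; in real use the IV should be unique per message.

package main

import (
    "bytes"
    "compress/flate"
    "crypto/aes"
    "crypto/cipher"
    "encoding/hex"
    "fmt"
)

// compressThenEncrypt: raw DEFLATE at a fixed level, then AES-256-CTR with a
// fixed key and IV, so the output bytes are deterministic for this library version.
func compressThenEncrypt(plain, key, iv []byte) ([]byte, error) {
    var buf bytes.Buffer
    zw, err := flate.NewWriter(&buf, flate.BestCompression) // explicit, fixed level
    if err != nil {
        return nil, err
    }
    if _, err := zw.Write(plain); err != nil {
        return nil, err
    }
    if err := zw.Close(); err != nil {
        return nil, err
    }

    block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
    if err != nil {
        return nil, err
    }
    out := make([]byte, buf.Len())
    cipher.NewCTR(block, iv).XORKeyStream(out, buf.Bytes())
    return out, nil
}

func main() {
    key := bytes.Repeat([]byte{0x11}, 32) // fixed demo key
    iv := bytes.Repeat([]byte{0x22}, 16)  // fixed demo IV
    out, err := compressThenEncrypt([]byte("hello"), key, iv)
    if err != nil {
        panic(err)
    }
    fmt.Println(hex.EncodeToString(out))
}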
AES is defined to a standard, so any conforming implementation will indeed produce the same output. GZip is a program, so it is possible that different versions of the program will produce different outputs. I would expect a later version to be able to reinflate the output from an earlier version, but the reverse may not be possible.
As others have said, if you are going to compress, then compress the plaintext, not the cyphertext from AES. Cyphertext won't compress well as it is designed to appear random.
