How to find a list of SNPs that are LD and MAF matched to another list of interest - genetics

I am a beginner in genetics. I have a list of SNPs of interest, and I'd like to find another list of SNPs that are matched to it on LD and MAF. Is there any software to do so? Can this be done in PLINK? I don't really know how to use PLINK.

To answer your questions in the comment:
I would recommend using 1000 Genomes, restricting to a specific population if necessary. For the population of origin of each sample, see:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples_v3.20130502.ALL.panel. You can use the --keep option in PLINK to analyze only the subset of interest.
Use PLINK. R will likely be too slow, unless you use optimized libraries.
See answer to your other question on SO here: Minor allele frequency matching?
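As a minimal sketch of the --keep step, the keep file can be built from the panel file linked above with a few lines of Python. This assumes the tab-separated panel format (sample, pop, super_pop, gender) and that the sample ID serves as both FID and IID, which is typical for 1000 Genomes PLINK files; the "EUR" choice and the inline data are examples only:

```python
# Sketch: build a PLINK --keep file from the 1000 Genomes panel file,
# restricted to one super-population. The panel is tab-separated:
# sample, pop, super_pop, gender. "EUR" and the inline data are examples.
import csv

def build_keep_list(panel_lines, super_pop):
    """Return (FID, IID) pairs for samples in the given super-population.

    1000 Genomes PLINK files typically use the sample ID as both FID and IID.
    """
    reader = csv.DictReader(panel_lines, delimiter="\t")
    return [(row["sample"], row["sample"])
            for row in reader if row["super_pop"] == super_pop]

panel = """sample\tpop\tsuper_pop\tgender
HG00096\tGBR\tEUR\tmale
NA19625\tASW\tAFR\tfemale
HG00171\tFIN\tEUR\tfemale
""".splitlines()

for fid, iid in build_keep_list(panel, "EUR"):
    print(fid, iid)
```

Written one pair per line to e.g. keep.txt, this feeds directly into plink --bfile data --keep keep.txt.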

Related

How to align read to two SHORT reference sequences and see percentage that mapped to one or the other reference?

I have PCR-amplified FASTQ files of a specific target region from several samples. For each sample, I want to know the percentage of reads that align better to reference sequence #1 or #2, posted below. How should I begin to tackle this question, and which alignment tool is best?
I am working with Illumina paired-end reads, with adapter sequences spiked in, on a 2x150 run. The two reference amplicons are 173 and 179 bp:
1: aaaaagtataaatataggaccaggcagagcattttatacaacaggagaaataataggagatataagacaagcacattgtaaccttagtagagcaaaatggaatgacactttaaataagatagttataaaattaagagaacaatttgggaataaaacaatagtctttaagcact
2: aaaaagtatccgtatccagaggggaccagggagagcatttgttacaataggaaaaataggaaatatgagacaagcacattgtaacattagtagagcaaaatggaatgccactttaaaacagatagctagcaaattaagagaacaatttggaaataataaaacaataatctttaagcaat
We want to know if one virus wins over another after infection, based on the differences between these two sequences; so essentially the percentage of reads that align best to #1 and the percentage that align best to #2.
Thank You,
Sara
Convert your reference amplicons to fasta format.
Choose an aligner, such as bwa mem, bowtie2, etc.
Index the reference for your aligner of choice.
Align the reads to the reference using your aligner of choice.
Use samtools idxstats to find the number of reads aligned to each of the amplicons.
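For a quick sanity check on toy data, the per-amplicon assignment can also be sketched in pure Python, using difflib similarity as a crude stand-in for a real aligner's score. The amplicon names, truncated sequences, and reads below are illustrative only; for real data use the bwa/bowtie2 plus samtools idxstats workflow above:

```python
# Toy sketch: assign each read to whichever amplicon it resembles more,
# using difflib similarity as a crude stand-in for an aligner's score.
# Amplicon names, truncated sequences, and reads are illustrative only.
from difflib import SequenceMatcher

def best_amplicon(read, refs):
    """Name of the reference sequence the read is most similar to."""
    return max(refs, key=lambda name: SequenceMatcher(None, read,
                                                      refs[name]).ratio())

refs = {
    "amp1": "aaaaagtataaatataggaccaggcagagcatttt",
    "amp2": "aaaaagtatccgtatccagaggggaccagggagag",
}
reads = ["aaaaagtataaatataggacc", "aaaaagtatccgtatccagag"]

counts = {name: 0 for name in refs}
for r in reads:
    counts[best_amplicon(r, refs)] += 1
pct = {name: 100 * n / len(reads) for name, n in counts.items()}
print(counts, pct)
```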
Notes:
It is often a good idea to trim adapters from the reads before alignment. A number of good adapter trimmers exist, such as flexbar, skewer, etc.
Many popular bioinformatics packages mentioned above can be easily installed, for example using conda.
REFERENCES:
conda
bwa
bowtie2
samtools
flexbar
skewer

Plink - reverse complement

I am using Plink for the first time and am checking my data against some previously genotyped samples (these are the same samples so the genotypes should match up).
My data is nearly correct, in that homs and hets are called correctly, but for some SNPs my data has the reverse complement.
What command do I need in PLINK to tell it to call the reverse complement when needed?
I think I have got to the source of the problem, should anyone else stumble across this post with the same issue.
I was using an Illumina SNP chip. Apparently there can be discrepancies in strand definition during post-processing of genotype data. These two links explain it beautifully:
http://gengen.openbioinformatics.org/en/latest/tutorial/coding/
https://academic.oup.com/bioinformatics/article/33/15/2399/3204987
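For anyone diagnosing the same issue, here is a small Python sketch (not PLINK itself) that flags SNPs whose alleles in two call sets only agree after reverse-complementing, i.e. where the chip reported the other strand. Note that ambiguous A/T and C/G SNPs cannot be resolved this way. Once a list of flipped SNP IDs is in hand, PLINK's --flip option applies the actual strand flip:

```python
# Sketch: flag SNPs whose two call sets only agree after reverse-
# complementing the alleles (i.e. the chip reported the other strand).
# Ambiguous A/T and C/G SNPs cannot be resolved this way.
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    return seq.translate(COMP)[::-1]

def needs_flip(alleles_a, alleles_b):
    """True if the allele sets match only after reverse-complementing."""
    a, b = set(alleles_a), set(alleles_b)
    if a == b:
        return False
    return {revcomp(x) for x in a} == b

print(needs_flip(("A", "G"), ("T", "C")))  # opposite strand
print(needs_flip(("A", "G"), ("A", "G")))  # same strand
```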

what is the relationship between LCS and string similarity?

I wanted to know how similar two strings were, and I found a tool on the following page:
https://www.tools4noobs.com/online_tools/string_similarity/
and it says that this tool is based on the article:
"An O(ND) Difference Algorithm and its Variations"
available on:
http://www.xmailserver.org/diff2.pdf
I have read the article, but I have some doubts about how they programmed the tool. For example, the authors say it is based on GNU diff's C code and analyze.c; maybe it refers to this:
https://www.gnu.org/software/diffutils/
and this:
https://github.com/masukomi/dwdiff-annotated/blob/master/src/diff/analyze.c
The problem I have is understanding the relation to the article. From what I read, the article presents an algorithm for finding the LCS (longest common subsequence) of a pair of strings, using a modification of the classic dynamic programming algorithm: it uses a shortest-path formulation to find the LCS with the minimum number of modifications.
At this point I am lost, because I do not know how the authors of the tool used the LCS to measure how similar two sequences are. Also, they have set a limit value of 0.4; what does that mean? Can anybody help me with this? Or have I misunderstood the article?
Thanks
I think the description on the string similarity tool is not entirely honest: I'm pretty sure it has been implemented using the Perl module String::Similarity. The similarity score is normalised to a value between 0 and 1, and as the module page describes, the limit value can be used to abort the comparison early if the similarity falls below it.
If you download the Perl module and expand it, you can read the C source of the algorithm, in the file called fstrcmp.c, which says that it is "Derived from GNU diff 2.7, analyze.c et al.".
The connection between the LCS and string similarity is simply that the characters not in the LCS are precisely the characters you would need to insert or delete (a substitution being a deletion plus an insertion) in order to convert the first string into the second, and the number of these differing characters is commonly used as the difference score, as in the Levenshtein distance.
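The normalisation can be illustrated with Python's difflib, which uses a Ratcliff/Obershelp-style matcher rather than the Myers O(ND) algorithm from the paper, but applies the same idea: the score is twice the number of matched characters divided by the combined length, giving a value between 0 and 1:

```python
# Illustration: difflib's similarity score is 2*M / (len(a) + len(b)),
# where M is the number of matched characters -- the same normalisation
# idea as an LCS-based score. (difflib uses a Ratcliff/Obershelp-style
# matcher, not the Myers O(ND) algorithm from the paper.)
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

a, b = "kitten", "sitting"
matched = sum(block.size
              for block in SequenceMatcher(None, a, b).get_matching_blocks())
print(matched)           # characters in the matched blocks
print(similarity(a, b))  # equals 2 * matched / (len(a) + len(b))
```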

Strand specific tophat/cufflinks

I have a strand-specific RNA-seq library to assemble (Illumina). I would like to use TopHat/Cufflinks. From the manual of TopHat, it says,
"--library-type TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol."
Does this mean that TopHat only supports strand-specific protocols? If I run with the option "--library-type fr-unstranded", does it run in a strand-specific way? I googled it and asked the developers, but got no answer...
I got some result:
Here the contig is assembled from two groups of reads: on the left side are reverse reads, while the right side is forward. (For visualization, I have reverse-complemented the right mate.)
But some of the contigs are assembled purely from reverse or forward reads. If the library is strand-specific, one gene should produce reads in the same direction; it should not give a result like the image above, am I right? Or is it possible that one gene is fragmented and then sequenced independently, so that the left part happens to produce reverse reads while the right part produces forward reads? From my understanding, strand specificity is preserved by 3'/5' ligation, so it should apply at the level of whole genes.
What is the problem here? Or did I understand the concept of 'strand specific' wrongly? Any help is appreciated.
Tophat/Cufflinks are not for assembly, they are for alignment to an already assembled genome or transcriptome. What are you aligning your reads to?
Also, if you have strand specific data, you shouldn't choose an unstranded library type. You should choose the proper one based on your library preparation method. The XS tag will only be placed on split reads if you choose an unstranded library type.
If you want to do a de novo assembly of your transcriptome you should take a look at assemblers (not mappers) like
Trinity
SOAPdenovo
Oases
Tophat can deal with both stranded and unstranded libraries. In your snapshot the center region does have both + and - strand reads. The biases at the two ends might be characteristics of your library prep or analysis methods. What's the direction of this gene? It looks a little biased towards the left side. If the left-hand side corresponds to the 3' end, then it's likely that your library prep has 3'-bias features (e.g. dT-primed reverse transcription). The way you fragment your RNA may also affect the read distribution.
I guess we need more information to find the truth. But we should also keep in mind that tophat/cufflinks may have bugs, too.
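To check how strand-specific the alignments actually are, one can count FLAG bit 0x10 (read reverse-complemented) per reference. This stdlib-only sketch parses SAM text; real BAMs would go through samtools view or pysam, and the contig name and records below are made up:

```python
# Sketch: count forward vs reverse alignments per reference from SAM text
# by testing FLAG bit 0x10 (read reverse-complemented). For real BAM files
# use pysam or `samtools view`; the records below are made up.
from collections import defaultdict

def strand_counts(sam_lines):
    counts = defaultdict(lambda: [0, 0])  # ref -> [forward, reverse]
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.split("\t")
        flag, ref = int(fields[1]), fields[2]
        if flag & 0x4:
            continue  # skip unmapped reads
        counts[ref][1 if flag & 0x10 else 0] += 1
    return dict(counts)

sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "r1\t0\tcontig1\t100\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tcontig1\t300\t60\t50M\t*\t0\t0\t*\t*",
    "r3\t16\tcontig1\t400\t60\t50M\t*\t0\t0\t*\t*",
]
print(strand_counts(sam))
```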

Is there a diff-like algorithm that handles moving block of lines?

The diff program, in its various incarnations, is reasonably good at computing the difference between two text files and expressing it more compactly than showing both files in their entirety. It shows the difference as a sequence of inserted and deleted chunks of lines (or changed lines in some cases, but that's equivalent to a deletion followed by an insertion). The same or very similar program or algorithm is used by patch and by source control systems to minimize the storage required to represent the differences between two versions of the same file. The algorithm is discussed here and here.
But it falls down when blocks of text are moved within the file.
Suppose you have the following two files, a.txt and b.txt (imagine that they're both hundreds of lines long rather than just 6):
a.txt b.txt
----- -----
1 4
2 5
3 6
4 1
5 2
6 3
diff a.txt b.txt shows this:
$ diff a.txt b.txt
1,3d0
< 1
< 2
< 3
6a4,6
> 1
> 2
> 3
The change from a.txt to b.txt can be expressed as "Take the first three lines and move them to the end", but diff shows the complete contents of the moved chunk of lines twice, missing an opportunity to describe this large change very briefly.
Note that diff -e shows the block of text only once, but that's because it doesn't show the contents of deleted lines.
Is there a variant of the diff algorithm that (a) retains diff's ability to represent insertions and deletions, and (b) efficiently represents moved blocks of text without having to show their entire contents?
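The behaviour described above is easy to reproduce with Python's difflib (which may pick the equivalent minimal diff that moves the other block, but either way spells the moved lines out in full twice):

```python
# Reproducing the example with Python's difflib: the moved block of lines
# is reported in full twice, once as deletions and once as insertions.
import difflib

a = ["1", "2", "3", "4", "5", "6"]
b = ["4", "5", "6", "1", "2", "3"]
out = list(difflib.unified_diff(a, b, "a.txt", "b.txt", lineterm=""))
for line in out:
    print(line)
```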
Since you asked for an algorithm and not an application, take a look at "The String-to-String Correction Problem with Block Moves" by Walter Tichy. There are others, but that's the original, so you can look for papers that cite it to find more.
The paper cites Paul Heckel's paper "A technique for isolating differences between files" (mentioned in this answer to this question) and mentions this about its algorithm:
Heckel[3] pointed out similar problems with LCS techniques and proposed a
linear-time algorithm to detect block moves. The algorithm performs adequately
if there are few duplicate symbols in the strings. However, the algorithm gives
poor results otherwise. For example, given the two strings aabb and bbaa,
Heckel's algorithm fails to discover any common substring.
The following method is able to detect block moves:
Paul Heckel: A technique for isolating differences between files
Communications of the ACM 21(4):264 (1978)
http://doi.acm.org/10.1145/359460.359467 (access restricted)
Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)
wikEd diff is a free JavaScript diff library that implements this algorithm and improves on it. It also includes the code to compile a text output with insertions, deletions, moved blocks, and original block positions inserted into the new text version. Please see the project page or the extensively commented code for details. For testing, you can also use the online demo.
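A compact sketch of the heart of Heckel's algorithm (passes 1-3: a symbol table plus direct pairing of lines that are unique in both files) illustrates both the move detection and the quoted limitation; passes 4-5, which extend matches around the unique anchors, are omitted here:

```python
# Sketch of the core of Heckel's algorithm: lines occurring exactly once
# in both files are paired directly, which is what lets block moves be
# detected in linear time -- and also why it finds nothing in degenerate
# inputs like "aabb" vs "bbaa", where no symbol is unique.
from collections import Counter

def heckel_unique_pairs(old, new):
    """Pair positions of lines that are unique in both sequences."""
    oc, nc = Counter(old), Counter(new)
    old_pos = {line: i for i, line in enumerate(old)}
    new_pos = {line: i for i, line in enumerate(new)}
    return [(old_pos[l], new_pos[l]) for l in old
            if oc[l] == 1 and nc[l] == 1]

print(heckel_unique_pairs(list("aabb"), list("bbaa")))  # no unique symbols
print(heckel_unique_pairs(["x", "y", "z"], ["z", "x", "y"]))  # move detected
```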
Git 2.16 (Q1 2018) will introduce another possibility, by ignoring some specified moved lines.
"git diff" learned a variant of the "--patience" algorithm, to which the user can specify which 'unique' line to be used as anchoring points.
See commit 2477ab2 (27 Nov 2017) by Jonathan Tan (jhowtan).
(Merged by Junio C Hamano -- gitster -- in commit d7c6c23, 19 Dec 2017)
diff: support anchoring line(s)
Teach diff a new algorithm, one that attempts to prevent user-specified lines from appearing as a deletion or addition in the end result.
The end user can use this by specifying "--anchored=<text>" one or more
times when using Git commands like "diff" and "show".
The documentation for git diff now reads:
--anchored=<text>:
Generate a diff using the "anchored diff" algorithm.
This option may be specified more than once.
If a line exists in both the source and destination, exists only once, and starts with this text, this algorithm attempts to prevent it from appearing as a deletion or addition in the output.
It uses the "patience diff" algorithm internally.
See the tests for some examples:
pre post
a c
b a
c b
Normally, c is moved to produce the smallest diff.
But with:
git diff --no-index --anchored=c pre post
the diff instead deletes and re-adds a and b, keeping c in place.
With Git 2.33 (Q3 2021), the command line completion (in contrib/) learned that "git diff"(man) takes the --anchored option.
See commit d1e7c2c (30 May 2021) by Thomas Braun (t-b).
(Merged by Junio C Hamano -- gitster -- in commit 3a7d26b, 08 Jul 2021)
completion: add --anchored to diff's options
Signed-off-by: Thomas Braun
This flag was introduced in 2477ab2 ("diff: support anchoring line(s)", 2017-11-27, Git v2.16.0-rc0 -- merge listed in batch #10) but back then, the bash completion script did not learn about the new flag.
Add it.
Here's a sketch of something that may work. Ignore diff insertions/deletions for the moment, for the sake of clarity.
This seems to consist of figuring out the best blocking, similar to text compression. We want to find the common substrings of the two files. One option is to build a generalized suffix tree and iteratively take the maximal common substring, remove it, and repeat until there is no substring of some minimum size $s$. This can be done with a suffix tree in O(N^2) time (https://en.wikipedia.org/wiki/Longest_common_substring_problem#Suffix_tree). Greedily taking the maximal substring appears to be optimal (as a function of characters compressed), since taking a character sequence from another substring means adding the same number of characters elsewhere.
Each substring would then be replaced by a symbol for that block and displayed once as a sort of 'dictionary'.
$ diff a.txt b.txt
1,3d0
< $
6a4,6
> $
$ = 1,2,3
Now we have to reintroduce diff-like behavior. The simple (possibly non-optimal) answer is to simply run the diff algorithm first, omit all the text that wouldn't be output in the original diff and run the above algorithm.
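The greedy extraction step can be sketched with difflib's find_longest_match, used here purely for brevity (a generalized suffix tree would be asymptotically better); the sentinel masking is a simplification:

```python
# Sketch of the greedy "dictionary" step: repeatedly extract the longest
# common block of lines until none of at least min_size remains. difflib
# is used only for brevity; a generalized suffix tree would be faster.
from difflib import SequenceMatcher

def greedy_blocks(a, b, min_size=2):
    """Greedily pull out longest common blocks of at least min_size lines."""
    a, b = list(a), list(b)
    blocks = []
    while True:
        sm = SequenceMatcher(None, a, b, autojunk=False)
        m = sm.find_longest_match(0, len(a), 0, len(b))
        if m.size < min_size:
            return blocks
        blocks.append(a[m.a:m.a + m.size])
        # Mask each matched region with a unique sentinel so it cannot
        # match again on later iterations.
        a[m.a:m.a + m.size] = [object()]
        b[m.b:m.b + m.size] = [object()]

print(greedy_blocks(["1", "2", "3", "4", "5", "6"],
                    ["4", "5", "6", "1", "2", "3"]))
```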
SemanticMerge, the "semantic scm" tool mentioned in this comment to one of the other answers, includes a "semantic diff" that handles moving a block of lines (for supported programming languages). I haven't found any details about the algorithm, but it's possible the diff algorithm itself isn't particularly interesting, as it relies on the output of a separate parse of the programming-language source files themselves. Here's SemanticMerge's documentation on implementing an (external) language parser, which may shed some light on how its diffs work:
External parsers - SemanticMerge
I tested it just now and its diff is fantastic. It's significantly better than the one I produced using the demo of the algorithm mentioned in this answer (and that diff was itself much better than what was produced by Git's default diff algorithm) and I suspect still better than one likely to be produced by the algorithm mentioned in this answer.
Our Smart Differencer tools do exactly this when computing differences between the source texts of two programs in the same programming language. Differences are reported in terms of program structures (identifiers, expressions, statements, blocks), precise to line/column number, and in terms of plausible editing operations (delete, insert, move, copy [above and beyond OP's request for mere "copy"], rename-identifier-in-block).
The Smart Differencers require a structured artifact (e.g., a programming language), so they can't do this for arbitrary text. (We could define the structure to be "just lines of text", but didn't think that would be particularly valuable compared to standard diff.)
For this situation in my real-life coding, when I actually move a whole block of code to another position in the source because it makes more sense either logically or for readability, what I do is this:
clean up all the existing diffs and commit them
so that the file just requires the move that we are looking for
remove the entire block of code from the source
save the file
and stage that change
add the code into the new position
save the file
and stage that change
commit the two staged patches as one commit with a reasonable message
Check also the online tool simtexter, based on the SIM_TEXT algorithm. It seems by far the best.
You can also have a look at the source code of the JavaScript implementation, or the C / Java versions.
