Minor allele frequency matching? - genetics

I have a list of SNPs for which I have already obtained proxy SNPs, with MAF and LD, from the 1000 Genomes pilot 1 data. When people talk about MAF matching, do the MAFs need to be exactly the same?
For example, if the SNP of interest has MAF 0.35 and the proxy SNP has MAF 0.37, can it be used as a good proxy, given that the LD between the two is > 0.8?
Do I absolutely have to choose a proxy whose MAF equals exactly 0.35?

LD as measured by r-squared depends on MAF, i.e. a high r-squared implies the MAFs are similar. See:
https://www.ncbi.nlm.nih.gov/pubmed/18572214
So a very high r-squared usually means the MAFs of the two SNPs are similar.
Also, selection of proxy SNPs typically depends on the LD measure alone. As mentioned, MAF is "incorporated" into the calculation, so matching on it separately is not necessary. If you have multiple proxies with the same LD, choose the one with the most similar MAF.
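One way to see why a high r-squared already constrains MAF: the maximum attainable r-squared between two biallelic SNPs is capped by how different their MAFs are. A minimal sketch (my own illustration, not taken from the linked paper):

```python
def max_r2(maf1, maf2):
    """Upper bound on r^2 between two SNPs given their minor allele frequencies."""
    p, q = maf1, maf2
    # Maximum linkage disequilibrium D attainable at these allele frequencies
    d_max = min(p * (1 - q), (1 - p) * q)
    return d_max**2 / (p * (1 - p) * q * (1 - q))

print(max_r2(0.35, 0.35))  # identical MAFs: the bound is 1.0
print(max_r2(0.35, 0.37))  # the example from the question: still above 0.8
print(max_r2(0.35, 0.10))  # very different MAFs: r^2 can never reach 0.8
```

So observing r-squared > 0.8 between the SNP and its proxy already rules out grossly mismatched MAFs.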

Related

Gene set enrichment analysis (GSEA) for Bio-Rad Bio-Plex human cytokine screening panel

We have analyzed the effects of several peptides, separately, on peripheral blood mononuclear cells (PBMCs), measuring changes in cytokine secretion in response to incubation with the peptides. The assay was performed on a Bio-Rad Bio-Plex platform with a Bio-Plex Pro Human Cytokine 48-plex Screening Panel kit, so we now have information about the changes in the secretion of 48 cytokines by PBMCs in response to incubation with each of the peptides.
I would like to know if there is any way to analyze the results in the style of a gene set enrichment analysis (GSEA), in order to determine, for example, the type of cells that predominantly produce the significantly changed cytokines, or which processes the changed cytokines signal. If there is no such program or web service yet, can someone recommend an explanatory review or a short book on interpreting changes at the cytokine level and turning them into a biological hypothesis about the effect of the tested peptides on the immunocompetent cells of the bloodstream?
What comes to mind is doing a kind of differential expression analysis across your tested peptide conditions, clustering by cytokine secretion, and using a database such as STRING to determine protein-group interactions: https://string-db.org/.
You could hypothetically do some sort of enrichment analysis if you had a cytokine set list for a specific pathway/cell-population vs your ranked cytokine secretion list.
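To make that last suggestion concrete: if you can curate cytokine sets per cell type or pathway (the sets below are illustrative placeholders, not a validated annotation), a one-sided hypergeometric test gives a simple over-representation analysis. A sketch:

```python
from math import comb

def hypergeom_enrichment_p(panel_size, set_size, n_changed, n_overlap):
    """P(observing >= n_overlap set members) if n_changed cytokines were
    drawn at random from a panel containing set_size members of the set."""
    total = comb(panel_size, n_changed)
    return sum(
        comb(set_size, k) * comb(panel_size - set_size, n_changed - k)
        for k in range(n_overlap, min(set_size, n_changed) + 1)
    ) / total

# Illustrative "Th1-associated" set; replace with curated annotations.
th1_set = {"IFN-g", "IL-2", "IL-12", "TNF-a"}
changed = {"IFN-g", "IL-2", "IL-12", "IL-6"}  # significantly changed cytokines
overlap = th1_set & changed
p = hypergeom_enrichment_p(panel_size=48, set_size=len(th1_set),
                           n_changed=len(changed), n_overlap=len(overlap))
print(f"{len(overlap)}/{len(changed)} changed cytokines in the set, p = {p:.2g}")
```

With multiple sets you would also need a multiple-testing correction across the sets tested.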

How to generate a matched set of random SNPs based on distance to gene and MAF?

I have a list of SNPs (rs#) and I would like to generate a matched set of random SNPs based on distance to gene and MAF so I can use it to calibrate background expectations in SNP-based enrichment analysis.
I've tried SNPsnap, but I keep getting an internal server error (code 500).
Can someone suggest other tools, or otherwise help me? I'd really appreciate it!
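In case it helps while SNPsnap is unavailable, the core of the matching can be sketched in a few lines: bin a genome-wide pool of candidate SNPs by MAF and distance to the nearest gene, then draw a random candidate from the same bin as each input SNP. This is a simplified stand-in for SNPsnap, not its actual algorithm; the bin widths and the candidate pool are assumptions you would tune:

```python
import random
from collections import defaultdict

def bin_key(maf, dist, maf_step=0.02, dist_step=10_000):
    """Discretize (MAF, distance-to-gene) into a matching bin."""
    return (int(maf / maf_step), int(dist / dist_step))

def matched_random_snps(input_snps, candidates, seed=0):
    """input_snps / candidates: dicts of rsid -> (maf, distance_to_gene)."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for rsid, props in candidates.items():
        bins[bin_key(*props)].append(rsid)
    # For each input SNP, draw a random candidate from its bin (None if the bin is empty)
    return {rsid: (rng.choice(bins[bin_key(*props)]) if bins[bin_key(*props)] else None)
            for rsid, props in input_snps.items()}

candidates = {"rsA": (0.35, 5_000), "rsB": (0.10, 50_000)}
print(matched_random_snps({"rs1": (0.351, 4_000)}, candidates))  # rs1 -> rsA
```

For repeated draws without replacement, shuffle each bin once and pop from it instead of calling `choice`.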

Most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant for detecting that two differently spelled names actually refer to the same person (the difference being a misspelling)? The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an oversimplified example. Are there programmers or computer scientists who have worked out this issue in more detail?
I can suggest an information-retrieval technique for doing this, but it requires a large collection of documents in order to work properly.
Index your data, using the standard IR techniques. Lucene is a good open source library that can help you with it.
Once you get a name (Obaama, for example): retrieve the set of documents the word Obaama appears in. Call this set D1.
Now, for each word w in D1 [1], search for Obaama AND w (using your IR system). Call the resulting set D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and will most likely be close to 1 for w = Obama [2].
You can manually label a set of examples and find the threshold score above which two words should be considered the same name.
Using a standard lexicographical similarity technique, you can choose to filter out words that are definitely not spelling mistakes (like Barack).
Another solution that is often used requires a query log: find correlations between searched words. If obaama correlates with obama in the query log, they are connected.
[1]: You can improve performance by applying the second (lexicographical) filter first, checking only candidates that are "similar enough" lexicographically.
[2]: Usually a normalization is also applied, because more frequent words are more likely to appear in the same documents as any given word, regardless of whether they are related.
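The |D2|/|D1| score above can be sketched with plain set operations standing in for a real IR index such as Lucene (the toy documents below are made up for illustration):

```python
def cooccurrence_score(term, candidate, docs):
    """docs: list of token sets. Score = |docs containing both| / |docs containing term|."""
    d1 = [d for d in docs if term in d]
    if not d1:
        return 0.0
    d2 = [d for d in d1 if candidate in d]
    return len(d2) / len(d1)

docs = [{"obaama", "obama", "president"},
        {"obama", "election"},
        {"obaama", "obama", "speech"},
        {"ibama", "agency"}]
print(cooccurrence_score("obaama", "obama", docs))  # 1.0: always co-occur
print(cooccurrence_score("obaama", "ibama", docs))  # 0.0: never co-occur
```

A real system would use an inverted index for the lookups rather than scanning every document.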
You can check NerSim (demo), which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.
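A concrete note on the original examples: plain Levenshtein distance cannot separate them, since both pairs are exactly one edit apart. Similarity measures that reward a shared prefix, such as Jaro-Winkler, tend to produce fewer false positives on names. A sketch of Jaro-Winkler:

```python
def jaro(s1, s2):
    """Jaro similarity: matches within a window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Half-transpositions: matched characters out of order between the strings
    t = sum(a != b for a, b in zip(
        (c for i, c in enumerate(s1) if m1[i]),
        (c for j, c in enumerate(s2) if m2[j]))) // 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Boost the Jaro score for a common prefix of up to 4 characters."""
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    j = jaro(s1, s2)
    return j + prefix * p * (1 - j)

print(jaro_winkler("obama", "obaama"))  # high: shared prefix, likely same name
print(jaro_winkler("obama", "ibama"))   # lower: first letter differs
```

The prefix boost is what separates the two cases: a mismatch at the start of a name is much less likely to be a typo than one in the middle.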

Metric for SURF

I'm searching for a usable similarity metric based on SURF: how well one image matches another on a scale of, say, 0 to 1, where 0 means no similarity and 1 means the same image.
SURF provides the following data:
interest points (and their descriptors) in query image (set Q)
interest points (and their descriptors) in target image (set T)
using a nearest-neighbor algorithm, pairs can be created from the two sets above
I have tried a few things so far, but nothing seemed to work well:
a metric using the sizes of the two sets: d = N / min(size(Q), size(T)), where N is the number of matched interest points. This gives a pretty low rating even for quite similar images, e.g. 0.32 when 70 interest points were matched out of about 600 in Q and 200 in T, and I think 70 matches is a really good result. I was thinking about applying some logarithmic scaling so that only really low numbers get low scores, but I can't seem to find the right equation. With d = log(9*d0 + 1) I get 0.59, which is pretty good, but it somewhat blunts the discriminative power of SURF.
a metric using the distances within pairs: I find the K best matches and sum their distances; the smaller the sum, the more similar the two images. The problem is that I don't know the maximum and minimum possible values of a descriptor element, from which the distance is calculated, so I can only rank results relatively (of many inputs, which is best). As I said, I would like the metric to lie exactly between 0 and 1, because I need to compare SURF to other image metrics.
The biggest problem is that each metric excludes the other: one ignores the number of matches, the other the distances between matches. I'm lost.
EDIT: For the first metric, an equation of log(x*10^k)/k, where k is 3 or 4, gives a nice result most of the time. The min term is problematic: it can make d bigger than 1 in some rare cases, but without it the small results come back.
You can easily create a metric that is the weighted sum of both metrics. Use machine learning techniques to learn the appropriate weights.
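A sketch of that weighted-sum idea, combining the question's log-scaled match-count metric with a distance-based score mapped into (0, 1]; the weights and the `scale` parameter are arbitrary placeholders that you would fit on labelled image pairs:

```python
import math

def count_score(n_matched, n_query, n_target):
    # The log-scaled match-count metric from the question, clamped to [0, 1]
    d = n_matched / min(n_query, n_target)
    return min(1.0, math.log10(9 * d + 1))

def distance_score(pair_distances, scale=0.5):
    # Map mean descriptor distance into (0, 1]: smaller distance -> higher score.
    # `scale` is an assumed typical match distance; tune it on your data.
    mean_d = sum(pair_distances) / len(pair_distances)
    return math.exp(-mean_d / scale)

def combined_score(n_matched, n_query, n_target, pair_distances, w=0.5):
    # Weighted sum of the two component scores; learn w from labelled pairs
    return w * count_score(n_matched, n_query, n_target) + \
           (1 - w) * distance_score(pair_distances)

print(combined_score(70, 600, 200, [0.2, 0.3, 0.25]))
```

Since both components are in [0, 1], any convex combination stays in [0, 1], which satisfies the normalization requirement from the question.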
What you're describing is closely related to the field of content-based image retrieval (CBIR), which is a very rich and diverse field; googling it will get you lots of hits. While SURF is an excellent general-purpose low-to-mid-level feature detector, it is far from sufficient on its own. SURF and SIFT (from which SURF was derived) are great at duplicate or near-duplicate detection but not that great at capturing perceptual similarity.
The best-performing CBIR systems usually use an ensemble of features, optimally combined via some training set. Some interesting detectors to try include GIST (a fast and cheap detector best used for distinguishing man-made from natural environments) and Object Bank (a histogram-based detector itself built from the outputs of hundreds of object detectors).

Algorithm to decide if digital audio data is clipping?

Is there an algorithm or some heuristic to decide whether digital audio data is clipping?
The simple answer is that if any sample has the maximum or minimum value (+32767 and -32768 respectively for 16-bit samples), you can consider it clipping. This isn't strictly true, since that value may actually be the correct one: there is no way to tell whether +32767 really should have been +33000.
For a more complicated answer: There is such a thing as sample counting clipping detectors that require x consecutive samples to be at the max/min value for them to be considered clipping (where x may be as high as 7). The theory here is that clipping in just a few samples is not audible.
That said, there is audio equipment that clips quite audibly even at values below the maximum (and above the minimum). The typical advice is to master music to peak at -0.3 dB instead of 0.0 dB for this reason, so you might want to consider any sample above that level to be clipping. It all depends on what you need it for.
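The sample-counting detector described above can be sketched as follows (the run length and thresholds are parameters you would tune):

```python
def detect_clipping(samples, run=3, lo=-32768, hi=32767):
    """Return start indices of runs of >= `run` consecutive full-scale samples."""
    hits, count, start = [], 0, 0
    for i, s in enumerate(samples):
        if s <= lo or s >= hi:
            if count == 0:
                start = i
            count += 1
        else:
            if count >= run:
                hits.append(start)
            count = 0
    if count >= run:          # run extends to the end of the buffer
        hits.append(start)
    return hits

audio = [100, 32767, 32767, 32767, -5, -32768, 200]
print(detect_clipping(audio, run=3))  # [1]: one run of three clipped samples
```

Lowering `lo`/`hi` to around -0.3 dB of full scale implements the stricter threshold mentioned above.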
If you ever receive values at the maximum or minimum, then you are, by definition, clipping. Those values represent their particular value as well as all values beyond them, so they are best used as out-of-bounds detectors.
-Adam
For digital audio data, the term "clipping" doesn't really carry much meaning beyond "maximum amplitude". In the analog world, audio data comes from hardware which usually contains a clipping register, giving you the possibility of a maximum amplitude that isn't clipped.
What might be better suited to digital audio is to set some threshold based on the limitations of your output D/A. If you're doing VOIP, then choose some threshold typical of handsets or cell phones, and call it "clipping" if your digital audio gets above that. If you're outputting to high-end home theater systems, then you probably won't have any "clipping".
I just noticed that there are even some nice implementations. For example, in Audacity:
Analyze → Find Clipping…
What Adam said. You could also add some logic to detect maximum-amplitude values over a period of time and flag only those, but the essence is to determine if and when the signal hits the maximum amplitude.
