How to generate a matched set of random SNPs based on distance to gene and MAF? - genetics

I have a list of SNPs (rs#) and I would like to generate a matched set of random SNPs, based on distance to the nearest gene and minor allele frequency (MAF), so I can use it to calibrate background expectations in SNP-based enrichment analysis.
I've tried SNPsnap, but I keep getting an internal server error (HTTP 500).
Can someone suggest other tools, or help me out otherwise? I'd really appreciate it!
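In case SNPsnap stays down, the matching itself can be done locally once you have an annotation table for the genome-wide SNP pool. Below is a minimal sketch in Python/pandas; the DataFrame columns (rsid, maf, dist_to_gene), the number of quantile bins, and the 1:1 sampling are all assumptions, not the behaviour of any particular tool:

    import numpy as np
    import pandas as pd

    def match_random_snps(snps, query_rsids, n_bins=20, seed=0):
        """Sample background SNPs matched on MAF and distance-to-gene.

        snps: DataFrame with columns rsid, maf, dist_to_gene (genome-wide pool).
        query_rsids: the rs# identifiers of the input SNPs to be matched.
        """
        rng = np.random.default_rng(seed)
        pool = snps.copy()
        # Discretise both matching covariates into quantile bins over the pool
        pool["maf_bin"] = pd.qcut(pool["maf"], n_bins, labels=False, duplicates="drop")
        pool["dist_bin"] = pd.qcut(pool["dist_to_gene"], n_bins, labels=False, duplicates="drop")
        query = pool[pool["rsid"].isin(query_rsids)]
        background = pool[~pool["rsid"].isin(query_rsids)]
        matched = []
        # For every (MAF bin, distance bin) cell of the query set, sample the
        # same number of background SNPs from that cell, without replacement
        for (m, d), grp in query.groupby(["maf_bin", "dist_bin"]):
            cand = background[(background["maf_bin"] == m) & (background["dist_bin"] == d)]
            k = min(len(grp), len(cand))
            if k:
                matched.append(cand.sample(n=k, random_state=int(rng.integers(1 << 31))))
        return pd.concat(matched, ignore_index=True) if matched else pool.iloc[0:0]

Cells with too few background SNPs are under-sampled here, and SNPsnap additionally matches on LD properties and gene density, so treat this only as a rough fallback.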

Related

In BLAST, how to get the HSP corresponding to each word?

In BLAST, I can get lots of matching sequences using its local and online services, but I cannot get the HSP corresponding to each word (seed) I want. According to the principle of BLAST, the query sequence is first divided into multiple words; these words are then located in the database, yielding multiple hits, which are finally extended to the left and right. My question is how to get the HSP corresponding to a single word, instead of only the fully processed result that online BLAST gives me. Hope to get your suggestions, thank you very much. (A flowchart of the BLAST algorithm was attached here.)
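Standalone BLAST does not expose per-word HSPs, so one option is to prototype the seed-and-extend step yourself. The following Python sketch only illustrates the principle from the flowchart (fixed word size, match/mismatch scoring, ungapped drop-off extension); it is not the real BLAST scoring model:

    def ungapped_hsps(query, subject, w=4, match=2, mismatch=-3, dropoff=5):
        """Toy seed-and-extend: for each length-w word of the query, locate
        exact hits in the subject and extend them ungapped in both directions,
        stopping when the running score falls `dropoff` below its maximum."""
        # Index every word position in the subject sequence
        index = {}
        for j in range(len(subject) - w + 1):
            index.setdefault(subject[j:j + w], []).append(j)

        def extend(qi, si, direction):
            # Walk away from position (qi, si); keep the best-scoring endpoint,
            # stop once the running score drops `dropoff` below that best
            best = run = 0
            steps = best_steps = 0
            while True:
                qn = qi + direction * (steps + 1)
                sn = si + direction * (steps + 1)
                if not (0 <= qn < len(query) and 0 <= sn < len(subject)):
                    break
                run += match if query[qn] == subject[sn] else mismatch
                steps += 1
                if run > best:
                    best, best_steps = run, steps
                elif best - run >= dropoff:
                    break
            return best, best_steps

        hsps = []
        for i in range(len(query) - w + 1):
            word = query[i:i + w]
            for j in index.get(word, []):
                lscore, lsteps = extend(i, j, -1)                  # left from word start
                rscore, rsteps = extend(i + w - 1, j + w - 1, +1)  # right from word end
                hsps.append({
                    "word": word,
                    "query_span": (i - lsteps, i + w + rsteps),
                    "subject_span": (j - lsteps, j + w + rsteps),
                    "score": w * match + lscore + rscore,
                })
        return hsps

Each returned record is the HSP grown from one (word, database hit) pair, which is exactly the per-seed view that the finished BLAST report hides.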

SystemML Decision Tree - "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10"

I am trying to run a decision tree on the SystemML standalone version on Windows (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/decision-tree.dml), but I keep receiving the error "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10. THIS NODE IS DECLARED AS LEAF!". It seems the code is not computing any split, although I am able to fit a tree on the same data in R. Has anyone used this algorithm before and has some tips on how to solve the error?
Thank you
This message generally indicates that a split on the best categorical or scale features would not give any additional gain.
I would recommend the following:
1) Investigate the computed gains (best_cat_gain, best_scale_gain).
2) Double-check that the metadata (num_cat_features, num_scale_features) is correctly recognised.
You could simply put additional print statements into the script to do that. In case the metadata is invalid, you might want to check that the optional input R has the right layout, as described in the header of the script.
If this does not help, please share the input arguments, the format of the input data, etc., and we'll have a closer look.
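"No additional gain" means the impurity reduction of the best candidate split is not positive. If you want to sanity-check your data outside SystemML before digging into the DML script, a minimal Gini-gain computation (a generic sketch, not SystemML's exact formula) looks like this:

    import numpy as np

    def gini(y):
        """Gini impurity of a label vector."""
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - (p ** 2).sum()

    def split_gain(x, y, threshold):
        """Impurity reduction from splitting scale feature x at `threshold`."""
        left = x <= threshold
        if left.all() or (~left).all():
            return 0.0  # degenerate split: one side empty, no gain
        n = len(y)
        return (gini(y)
                - left.sum() / n * gini(y[left])
                - (~left).sum() / n * gini(y[~left]))

If split_gain comes out at (or below) zero for every feature and threshold you try, the "declared as leaf" message is expected behaviour rather than a bug.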

How to apply weighting to fuzzy search results

I'm writing a service which should sensibly suggest UK place names based on user-entered text; my data set is just under 2,500 entries. So far I'm applying a slightly modified version of the Damerau-Levenshtein algorithm which ignores the edit distance incurred by the extra characters when comparing against longer strings.
This is giving me a reasonable set of suggestions, but I'd like to manually weight some terms. E.g. currently entering "new" will give New Mills as the top result.
I'd like to weight these results so that major cities appear above towns and villages, e.g. entering "new" would give Newcastle as the top result.
Can anyone suggest either a different search algorithm, or a separate weighting process I can apply to my results, to achieve the weighted results I'm after?
Levenshtein is more for typos; what you want is NLP. You can google "NLP address", or see Detect/Parse Mailing Addresses in Text.
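If you'd rather keep the fuzzy-matching approach than switch to full NLP, a separate weighting pass is straightforward to bolt on: compute the base similarity, then add a boost for important places. A sketch in Python, where the boost table and the use of difflib as a stand-in for the asker's modified Damerau-Levenshtein are assumptions:

    import difflib

    # Hypothetical priority table: bigger boost for major cities; in practice
    # this could be derived from population figures.
    BOOST = {"Newcastle": 0.15, "New Mills": 0.0}

    def score(query, place):
        # Base similarity in [0, 1]; stand-in for a modified Damerau-Levenshtein
        base = difflib.SequenceMatcher(None, query.lower(), place.lower()).ratio()
        return base + BOOST.get(place, 0.0)

    def suggest(query, places, k=5):
        return sorted(places, key=lambda p: score(query, p), reverse=True)[:k]

    print(suggest("new", ["New Mills", "Newcastle", "Newport"]))
    # -> ['Newcastle', 'Newport', 'New Mills'] with the boosts above

Keeping the boost additive and small means it only reorders near-ties; a badly matching city name still loses to a close textual match.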

Find the State given Latitude and Longitude Coordinates

I have a set of 900 latitude and longitude coordinates, and I need a relatively simple method for finding the 'state' referred to by these coordinates. If it helps, the data is in Excel.
Google provides a Geocoding service. Part of this is reverse geocoding, which converts geographic coordinates into a human-readable address, including the state. This demo illustrates what can be done. There are limits on what you can do with this service.
Try using the average (center) values as provided here. With a bit of luck, most of your 900 coordinate pairs belong to the state with the nearest center; a sketch of the lookup follows below. Calculation of distances between longitude/latitude locations is explained here.
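A minimal version of that nearest-center lookup in Python (the three centroid entries shown are rounded, illustrative values only; load the real table from the link above):

    import math

    # Illustrative centroid values only: (state, lat, lon)
    CENTROIDS = [
        ("Kansas", 38.5, -98.4),
        ("Nebraska", 41.5, -99.8),
        ("Oklahoma", 35.6, -97.5),
    ]

    def haversine(lat1, lon1, lat2, lon2):
        """Great-circle distance in km between two lat/lon points."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    def nearest_state(lat, lon):
        return min(CENTROIDS, key=lambda c: haversine(lat, lon, c[1], c[2]))[0]

Points near state borders can of course land on the wrong side with this heuristic; the polygon-containment approach further down is exact.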
An alternative would be to use a ZIP table with US postcodes, as provided here. Once you know the ZIP code, you know the state: ZIP codes are assigned in regional blocks, so each state corresponds to a few ranges of codes. Once you know the ZIP code of a location, you can find the range, and with it the state, it belongs to (a sketch follows below).
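A tiny illustration of that interval lookup; the ranges below are placeholders, and the real ones would come from the postcode table linked above:

    import bisect

    # (range_start, state) pairs sorted by range start -- placeholder values
    ZIP_RANGES = [(10001, "NY"), (30001, "GA"), (60001, "IL"), (90001, "CA")]
    STARTS = [start for start, _ in ZIP_RANGES]

    def state_for_zip(zip_code):
        """Find the last range starting at or below zip_code."""
        i = bisect.bisect_right(STARTS, zip_code) - 1
        return ZIP_RANGES[i][1] if i >= 0 else None

    print(state_for_zip(30342))  # -> GA (with these placeholder ranges)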
A list of coordinates of US locations could help to get a more exact allocation: http://www.bcca.org/bahaivision/fast/latlong_us.html
Find the nearest location in the list and take its state as result.
Google requires that geocoding / reverse geocoding be used with maps that users can see, so if that isn't an option for you, I think the best way is to use a database with spatial functions. First, you'll need the state boundaries, available for free at NationalAtlas.gov. I use SQL Server (you need the 2008 or 2012 version), where you can use the STContains() method to find which state contains each point.
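If a spatial database is overkill, the same containment test can be done in Python with the shapely package. A sketch, assuming the NationalAtlas boundaries have been converted to a GeoJSON file; the file name and the NAME property are assumptions about that conversion:

    import json
    from shapely.geometry import Point, shape

    # Load state polygons from a GeoJSON file (hypothetical path and schema)
    with open("us_states.geojson") as f:
        states = [(feat["properties"]["NAME"], shape(feat["geometry"]))
                  for feat in json.load(f)["features"]]

    def state_of(lat, lon):
        pt = Point(lon, lat)  # GeoJSON coordinate order is (x=lon, y=lat)
        for name, poly in states:
            if poly.contains(pt):
                return name
        return None  # offshore or outside the US

For 900 points a linear scan over 50 polygons is plenty fast; for larger jobs, shapely's STRtree would be the spatial index to reach for.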
A simpler solution would be to just use the ezcmd.com REST API services.
They provide two APIs:
http://ezcmd.com/apps/app_geo_postal_codes#geo_postal_codes_api
1) All you have to do is give it a ZIP code and a country code (for the USA you can use either US or USA) and, optionally, a distance radius and units (miles or km), and it'll return all other ZIP codes, with state and province, that are within the given distance.
2) Free search, where you give it any fuzzy search phrase that includes any of ZIP / city / state / province plus a country, and it returns the best matches for that search phrase.
Hint: you can use #2 to find the ZIP code for a fuzzy (human-readable) address, and pass that ZIP code to #1 to find the nearest places to it.
They also have another API that returns the ZIP code along with full geolocation information for a given IP address:
http://ezcmd.com/apps/app_ezip_locator#ezip_locator_api
Enjoy! I hope this helps.

Most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant for detecting two different person names that are actually the same (different because of a misspelling)? The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an over-simple example. Are there programmers and computer scientists who have worked out this issue in more detail?
I can suggest an information-retrieval technique for doing so, but it requires a large collection of documents in order to work properly.
Index your data using standard IR techniques. Lucene is a good open-source library that can help you with it.
Once you get a name (Obaama, for example): retrieve the set of documents the word Obaama appears in. Let this set be D1.
Now, for each word w in D1 [1], search for Obaama AND w (using your IR system). Let the resulting set be D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and it will most likely be close to 1 for w = Obama [2].
You can manually label a set of examples and learn the threshold above which two words are considered the same.
Using a standard lexicographical similarity technique, you can choose to filter out words that are definitely not spelling mistakes (like Barack).
Another solution that is often used requires a query log: find correlations between searched words. If obaama is correlated with obama in the query log, they are connected.
[1]: You can improve performance by first applying the second filter, checking only candidates that are "similar enough" lexicographically.
[2]: Usually a normalization is also applied, because more frequent words are likely to appear in the same documents as any word, regardless of being related or not.
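A toy version of that |D2|/|D1| score, with a plain Python inverted index standing in for Lucene and a made-up four-document corpus:

    from collections import defaultdict

    docs = [
        "barack obama wins election",
        "obaama obama misspelling in headline",
        "obama obaama correction issued",
        "ibama environmental agency brazil",
    ]

    # Inverted index: word -> set of ids of documents containing it
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)

    def cooccurrence_score(a, b):
        """|D2| / |D1|: fraction of documents containing `a` that also contain `b`."""
        d1 = index[a]
        return len(d1 & index[b]) / len(d1) if d1 else 0.0

    print(cooccurrence_score("obaama", "obama"))  # 1.0 in this toy corpus
    print(cooccurrence_score("obaama", "ibama"))  # 0.0

In a real corpus the score is noisier, hence the thresholding and the frequency normalization from note [2].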
You can check NerSim (demo), which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.
