Coefficient of inbreeding / wrights algorithm / genetics

Coefficient of inbreeding / wrights algorithm / genetics - genetic-algorithm

I am looking for a good pseudo code - or better yet actual code snippets - on implementing wrights algorithm on a genealogy database I have for sheep stored in SQL Server database.
I have a very old C program that worked against a flat text file until the population got so large the algorithm broke - as the entire thing was done in memory, so an implementation against a database would be preferable...
Anyone seen anything like this they can point me to?

If you didn't want to roll your own, you could use the Python pypedal package for calculating the inbreeding coefficient, which supports SQL databases.
Pypedal's code for computing inbreeding coefficient is here.

In genealogy we refer to this as coefficient of relationship (COR). There is some java code for this designed for a Neo4j PlugIn at GitHub; it calls on other function there. It uses the path lengths between individuals to common ancestors to compute the COR. With inbreeding there are multiple common ancestors and "pedigree collapse" which is also much easier to svg visualization of the graphs.

Related

Tree-based algorithm in Scala vs Earth Box

I need to find is point located in a given radius. Now I have a two choices, first is to write my own algorithm for it(or using existing library) second is use postgresql earth_box utility and I can select it directly from db, using stored procedure. What is pros/cons of both in context of web application?

I would think that using the earth_box procedure in postgres would be better for the following reasons:
The database already contains the data and procedures to work with it
The database server , given a properly indexed table, should be quite efficient at executing a spatial query on its own spatial data
Using the server there's no need to query for the spatial information, transfer it to wherever you're processing it, creating a tree structure and other overhead (ties into the first bullet)
You're using code that already exists and, presumably, has been thoroughly tested and vetted
You could reuse the code in other server-side SQL from a broader number of applications such as reporting
I would definitely suggest trying the earthbox approach first and going with a custom solution only if the earthbox absolutely sucks performance-wise.
Here's a more succinct meta-reasoning from a blog post you may want to check out:
[...] the earthbox function allows us to perform a simple compare to
find all records in a certain radius. This is done by the function by
returning the great circle distance between the points, a more
thorough explanation is located at
http://en.wikipedia.org/wiki/Greatcircle.
(By meta-reasoning I mean that the simplicity of the use of earthbox makes using it a no-brainer.)

trouble with recurrent neural network algorithm for structured data classification

TL;DR
I need help understanding some parts of a specific algorithm for structured data classification. I'm also open to suggestions for different algorithms for this purpose.
Hi all!
I'm currently working on a system involving classification of structured data (I'd prefer not to reveal anything more about it) for which I'm using a simple backpropagation through structure (BPTS) algorithm. I'm planning on modifying the code to make use of a GPU for an additional speed boost later, but at the moment I'm looking for better algorithms than BPTS that I could use.
I recently stumbled on this paper -> [1] and I was amazed by the results. I decided to give it a try, but I have some trouble understanding some parts of the algorithm, as its description is not very clear. I've already emailed some of the authors requesting clarification, but haven't heard from them yet, so, I'd really appreciate any insight you guys may have to offer.
The high-level description of the algorithm can be found in page 787. There, in Step 1, the authors randomize the network weights and also "Propagate the input attributes of each node through the data structure from frontier nodes to root forwardly and, hence, obtain the output of root node". My understanding is that Step 1 is never repeated, since it's the initialization step. The part I quote indicates that a one-time activation also takes place here. But, what item in the training dataset is used for this activation of the network? And is this activation really supposed to happen only once? For example, in the BPTS algorithm I'm using, for each item in the training dataset, a new neural network - whose topology depends on the current item (data structure) - is created on the fly and activated. Then, the error backpropagates, the weights are updated and saved, and the temporary neural network is destroyed.
Another thing that troubles me is Step 3b. There, the authors mention that they update the parameters {A, B, C, D} NT times, using equations (17), (30) and (34). My understanding is that NT denotes the number of items in the training dataset. But equations (17), (30) and (34) already involve ALL items in the training dataset, so, what's the point of solving them (specifically) NT times?
Yet another thing I failed to get is how exactly their algorithm takes into account the (possibly) different structure of each item in the training dataset. I know how this works in BPTS (I described it above), but it's very unclear to me how it works with their algorithm.
Okay, that's all for now. If anyone has any idea of what might be going on with this algorithm, I'd be very interested in hearing it (or rather, reading it). Also, if you are aware of other promising algorithms and / or network architectures (could long short term memory (LSTM) be of use here?) for structured data classification, please don't hesitate to post them.
Thanks in advance for any useful input!
[1] http://www.eie.polyu.edu.hk/~wcsiu/paper_store/Journal/2003/2003_J4-IEEETrans-ChoChiSiu&Tsoi.pdf

Separation and pattern matching techniques

I am new to Artificial Neural Networks.
I am interested in an application like this:
I have a significantly large set of objects. Each object has six properties, denoted by P1–P6. Each property has a value which is a symbolic value. In other words, in my example P1–P6 can have a value from the set {A, B, C, D, E, F}. They are not numeric. (Suppose A,B,C,D,E,F are colours; then you will understand my idea.)
Now, there is another property R that I am interested in. Suppose
R = {G1, G2, G3, G4, G5}
I need to train a system for a large set of P1–P6 and the relevant R. Now I want to do the following.
I have an object and I know the values of P1 to P6. I need to find
the R (The Group that the object belongs.)
To get a desired R what is the pattern I need to have in P1–P6.
As an example given that R = G2 I need to figure out any pattern in P1–P6.
My questions are:
What are the theories/technologies/techniques I should read and
learn in order to implement 1 and 2, respectively?
What are the tools/libraries you can recommend to get this
simulated/implemented/tested?

The way you described your problem, you need to look up various machine learning techniques. If it were me, I would try and read about k-NN (k Nearest Neighbours) for the classification. When I say classification, I mean getting the R if you know P1-P6. It is a really simple technique and should be helpful here.
As for the other way around, what you basically need is a representative sample of your population. This is I think not so usual, but you could try something like a k-means Clustering. Clustering methods usually determine the class of an object (property R) by themselves, but k-means Clustering is cool in this situation because you need to give it the number of object classes (e.g. different possible values of R), and in the end you get one representative sample.
You definitely shouldn't go for any really complex techniques (like neural networks) in my opinion since your data doesn't have a precise numerical interpretation and the values can't be interpreted gradually.
The recommended tools really depend on your base programming language. There's a great tool called Orange which is Python-based and it's my tool of choice for these kind of things (especially since it is really easy to connect your Python modules with C/C++). If you prefer Java, there's a quite similar tool called Weka that you could use. I think Weka is a little bit better documented, but I don't like Java so I've never tried it out.
Both of these tools have a graphical clickable interface where you could just load your data and get the classification done, play with the parameters and check what kind of output you get using different techniques and different set-ups. Once you decide that you got the results you need (or if you just don't like graphical interfaces) you can also use both of them as libraries of a kind when programming (Python for Orange and Java for Weka) and make the classification a part of a bigger project.
If you look through the documentation of Orange or Weka, I think it will give you a few ideas about what you could actually do with the data you have and when you know a few techniques that seem interesting to you and applicable to the data, maybe you could get more quality comments and info on a few specific methods here than when just searching for a general advice.

You should check out classification algorithms (a subsection of artificial intelligence), especially the nearest neighbor-algorithms. Your problem may be solved by different techniques, which all have different advantages and disadvantages.
However, I do not know of any method in artificial intelligence, which allows a two-way classification (or in other words, that both implement your prerequisites 1 and 2 simultaneously). As all you want to do so far is having a bidirectional mapping of P1..P6 <=> R, I would suggest to just use a mapping table instead of an artificial intelligence algorithm. An AI would work great if you not exactly know, which of your samples is categorized under A..E in P1..P6.
If you insist on using an AI for it, I'd suggest to first look at a Perceptron. A perceptron consists of input, intermediate and output neurons. For your example, you'd have the input-Neurons P1a..P1e, P2a..P2e, ... and five output neurons R1..R5. After training, you should be able to input P1..P6 and get the appropriate R1..R5 as output.
As for frameworks and technologies, I only know of the Business Intelligence suite for Visual Studio, although there are a lot of other frameworks for AI out there. Since I do not have used any of them (I always coded them myself in C/C++), I can't recommend any.

It seems like a typical classification problem. In case you really have a lot of data have a look at Apache Mahout which provides distributed implementations of machine learning algorithms. If you need something less complex for prototyping TimBL is a nice alternative.

What's a good algorithm for nearest neighbour problem in two dimensions?

I would like to build an app that's going to give you the closest restaurant depending on your location. We'll have a database with all the POI corresponding to the restaurant and we'll get your location with the GPS of your phone...
What algorithm would be appropriate ? Where can I find good doc about it ?
Thanks

Here's an informative presentation: http://dimacs.rutgers.edu/Workshops/MiningTutorial/pindyk-slides.ppt
I would either use a Quadtree or a Kd-tree.
See some benchmarks here: http://www.flegg.net/brett/pubs/spatial/index.html. It really all depends on your data size and range.

The main problem is how do you store and search the data. If you are using a SQL database that doesn't support spatial indexes (let's say SQLite on Android), consider converting the spatial data to a linear Z-order curve. The algorithm is simple, I know about (well, wrote) this implementation.

How can I build an incremental directed acyclic word graph to store and search strings?

I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html It looks extremely, extremely complex and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.
I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there documentation of an algorithm already that features these, or an alternative algorithm with good memory use & speed?

I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.
My advice is to use a standard Trie, with an array index stored in each node. Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to each END_OF_WORD node in a traditional DAWG representing many, many words.
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.

Java
For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.
They have some good examples to get you going quickly and there's usually example code to get you started with most problems.
They have a DAG example with a link at the bottom to the full source code.
C++
If you're using C++, a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.

You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.
I'm suggesting this for a few reasons:
I really don't have a full understanding of your result.
Definitely incremental to build.
Leaf nodes can contain any data you wish.
Subjectively, a simple algorithm.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio