Correlating word proximity - algorithm

Let's say I have a text transcript of a dialogue over a period of aprox. 1 hour. I want to know what words happen in close proximatey to one another. What type of statistical technique would I use to determine what words are clustered together and how close their proximatey to one another is?
I'm suspecting some sort of cluster analysis or PCA.

To determine word proximity, you will have to build a graph:
each word is a vertex (or "node"), and
left and right words are edges
So "I like dogs" would have 2 edges and 3 vertices.
Now, the next step will be to decide based on this model what your definition of "close" is.
This is where the statistics comes in.
To determine "groups" of correlated words
MCL clustering - This will give you a number of clusters which algorithmically have high odds of being seen together.
K MEANS clustering - This will give you "k" groups of words.
Thresholding - this is the most reliable and intuitive method. Plot all the relationships for a small subset of data that you understand (for example, a paragraph from a news clip or article you have read) and run your method to generate a graph, and visualize the graph using a tool such as graphviz or cytoscape. Once you can see the relatedness, you can count how many edges are generally found between different words that clearly cluster together. You might find that, for example, two words that cluster together will have an edge for every 5 instances. Use this as a cutoff and write your own graph analysis script which outputs word-pairs that have at least 1 edge for every 5 instances of the word in your vertex graph.
Evaluating 3 by ROC curves. You can titrate this value of your cutoff higher and higher until you have very few "clusters". If you then run your algorithm against a paragraph with known, expected results (created by a human who already knows what words should be reported as correlated), you can evaluate the precision of your algorithm using a receiver operating characteristic which compares the correlated-words output to a precalculated gold standard.

Related

Markov chains and Random walks on top of biological data

I'm coming from biology's field and thus I have some difficulties in understanding (intuitively?) some of the ideas of that paper. I really tried my best to decipher it step by step by using a lot of google and youtube, but now I feel, it's the time to refer to the professionals in that field.
Before filling out the whole universe with (unordered) questions, let me put the whole thing down and try to introduce you to the subject while at the same time explain to you what I got so far from my research on that.
Microarrays
For those that do not have any idea of what this is, you can imagine, that it is literally an array (matrix) where each cell of it contains a probe for a specific gene. Making the long story short, by the end of the microarray experiment, you have a matrix (in computational terms) with each column representing a sample, each line a different gene while the contents of the matrix represent the expression values of the genes for each sample.
Pathways
In biology pathway / gene-set they call a set of genes that interact with each other forming a small network responsible for a specific function.These pathways are not isolated but they talk/interact with each other too. What that paper does on the first hand, is to expand the initial pathway (let us call it target pathway), by including some other genes from other pathways that might interact with that.
Procedure
1.
Let's assume now that we have a matrix G x S. Where G for genes and S for Samples. We construct a gene co-expression network (G x G) using as weights the Pearson's correlation coefficients between genes' pairs (a). This could also be represented as an undirected weighted graph. .
2.
For each gene (row OR column) we calculate the weighted degree (d) which is nothing more than the sum of all correlation coefficients of that gene.
3.
From the two previous matrices, they construct the transition matrix producing the probabilities (P) to transit from one gene to another by using the
formula
Q1. Why do they call this transition probability? Is there any intuitive way to see this as a probability in the biological context?
4.
Since we have the whole transition matrix, we can define a subnetwork of the initial one, that we want to expand it and it consisted out of let's say 15 genes. In that step, they used formula number 3 (on the paper) which transforms the values of the initial transition matrix as it says. They set the probability of 1 on the nodes that are part of the selected subnetwork because they define them as absorbing states.
Q2. In that same formula (3), I cannot understand what the second condition does. When should the probability be 0? Intuitively, in my opinion, all nodes that didn't exist in subnetwork, should have the P_ij value as a probability.
5.
After that, the newly constructed transition matrix is showed at formula (4) in the paper and I managed to understand it using this excellent article.
6.
Here is where everything is getting more blur for me and where I need the most of the help. What I imagine at that step, is that the algorithm starts randomly from one node and keep walking around the network. In order to construct a relevance function (What that exactly means?), they firstly calculate a probability called joint probability of visiting one node/edge E(i,j) and noted as :
From the other hand they seem to calculate another probability called probability of a walk of length L starting in x and denoted as :
7.
In the next step, they divide the previously calculated probabilities and calculate the number of times a random walk starts in x using the transition from i to j that I don't really understand what this means.
After that step, I lost their reasoning at all :-P.
I'm not expecting an expert to come open my mind and give me understand that procedure. What I'm expecting is some guidelines, hints, ideas, useful resources or more intuitive approaches to understanding the whole procedure. Then when I fully understand it I will try to implement it on R or python.
So any idea / critics is welcome.
Thanks.

What algorithm do I use to calculate voltage across a combination circuit?

I'm trying to programmatically calculate voltage changes over a very large circuit.
*This question may seem geared toward electronics, but it's
more about applying an algorithm over a set of data.
To keep things simple,
here is a complete circuit, with the voltages already calculated:
I'm originally only given the battery voltage and the resistances:
The issue I have is that voltage is calculated differently among parallel and series circuits.
A somewhat similar question asked on SO.
Some formulas:
When resistors are in parallel:
Rtotal = 1/(1/R1 + 1/R2 + 1/R3 ... + 1/Rn)
When resistors are in series:
Rtotal = R1 + R2 + R3 ... + Rn
Ohm's Law:
V = IR
I = V/R
R = V/I
V is voltage (volts)
I is current (amps)
R is resistance(ohms)
Every tutorial I've found on the internet consists of people conceptually grouping together parallel circuits to get the total resistance, and then using that resistance to calculate the resistance in series.
This is fine for small examples, but it's difficult to derive an algorithm out of it for large scale circuits.
My question:
Given a matrix of all complete paths,
is there a way for me to calculate all the voltage drops?
I currently have the system as a graph data structure.
All of the nodes are represented(and can be looked up by) an id number.
So for the example above, if I run the traversals, I'll get back a list of paths like this:
[[0,1,2,4,0]
,[0,1,3,4,0]]
Each number can be used to derive the actual node and it's corresponding data. What kind of transformations/algorithms do I need to perform on this set of data?
It's very likely that portions of the circuit will be compound, and those compound sections may find themselves being in parallel or series with other compound sections.
I think my problem is akin to this:
http://en.wikipedia.org/wiki/Series-parallel_partial_order
Some circuits cannot even be analyzed in terms of series and parallel, for example a circuit which includes the edges of a cube (there's some code at the bottom of that web page that might be helpful; I haven't looked at it). Another example that can't be analyzed into series/parallel is a pentagon/pentagram shape.
A more robust solution than thinking about series and parallel is to use Kirchhoff's laws.
You need to make variables for the currents in each linear section
of the circuit.
Apply Kirchhoff's current law (KCL) to nodes where
linear sections meet.
Apply Kirchhoff's voltage law (KVL) to as many
cycles as you can find.
Use Gaussian elimination to solve the
resulting linear system of equations.
The tricky part is identifying cycles. In the example you give, there are three cycles: through battery and left resistor, battery and right resistor, and through left and right resistors. For planar circuits it's not too hard to find a complete set of cycles; for three dimensional circuits, it can be hard.
You don't actually need all the cycles. In the above example, two would be enough (corresponding to the two bounded regions into which the circuit divides the plane). Then you have three variables (currents in three linear parts of the circuit) and three equations (sum of currents at the top node where three linear segments meet, and voltage drops around two cycles). That is enough to solve the system for currents by Gaussian elimination, then you can calculate voltages from the currents.
If you throw in too many equations (e.g., currents at both nodes in your example, and voltages over three cycles instead of two), things will still work out: Gaussian elimination will just eliminate the redundancies and you'll still get the unique, correct answer. The real problem is if you have too few equations. For example, if you use KCL on the two nodes in your example and KVL around just one cycle, you'll have three equations, but one is redundant, so you'll only really have two independent equations, which is not enough. So I would say throw in every equation you can find and let Gaussian elimination sort it out.
And hopefully you can restrict to planar circuits, for which it is easy to find a nice set of cycles. Otherwise you'll need a graph cycle enumeration algorithm. I'm sure you can find one if you need it.
use a maximum flow algorithm (Dijkstra is your friend).
http://www.cs.princeton.edu/courses/archive/spr04/cos226/lectures/maxflow.4up.pdf
You pretend to be in front of a water flow problem (well, actually it IS a flow problem). You have to compute the flow of water on each segment (the current). Then you can easily compute the voltage drop (water pressure) across every resistor.
I think the way to go here would be something like this:
Sort all your paths into groups of the same length.
While there are more than one group, choose the group with the largest length and:
2a. Find two paths with one item difference.
2b. "Merge" them into a path with the length smaller by one - the merge is dependent on the actual items that are different.
2c. Add the new path into the relevant group.
2d. If there are only paths with more than one item difference, merge the different items so that you have only one different item between the paths.
2e. When there is only one item left, find an item from a "lower" (= length is smaller) with minimum differences, and merge item to match.
When there is one group left with more than one item, keep doing #2 until there is one group left with one item.
Calculate the value of that item directly.
This is very initial, but I think the main idea is clear.
Any improvements are welcome.

clustering algorithm for objects which have multiple feature time series information

I am looking for clustering algorithm which can handle with multiple time series information for each objects.
For example, for company "A" we have time series of 3 features(ex. income, sales, inventory)
At the same way, company "B" also has same time series of same features. and so on..
Then, how we can make cluster between set of company?
Is there some wise way to handle this?
A lot of clustering algorithms ask you to provide some measure of the similarity or distance between two points. It is really up to you to decide what features are important and what the distance really is. One way forwards would be to use the correlation between two time series. This gives you a similarity. If you have to convert this to a distance I would use sqrt(1-r), where r is the correlation, because if you look e.g. at the equation at the bottom of http://www.analytictech.com/mb876/handouts/distance_and_correlation.htm you can see that this is proportional to a distance if you have points in n-dimensional space. If you have three different time series (income, sales, inventory) I would use the sum of the three distances worked out from the correlations between the two time series of the same type.
Another option, especially if the time series are not very long, would be to regard a time series of length n as a point in n-dimensional space and feed this into the clustering algorithm, or use http://en.wikipedia.org/wiki/Principal_component_analysis to reduce the n dimensions down to 1 by looking at the most significant components (while you are doing this, it never hurts to plot the points using the least significant components and investigate points that stand out from the others. Points where the data is in error sometimes stand out here).

Question about Backpropagation Algorithm with Artificial Neural Networks -- Order of updating

Hey everyone, I've been trying to get an ANN I coded to work with the backpropagation algorithm. I have read several papers on them, but I'm noticing a few discrepancies.
Here seems to be the super general format of the algorithm:
Give input
Get output
Calculate error
Calculate change in weights
Repeat steps 3 and 4 until we reach the input level
But here's the problem: The weights need to be updated at some point, obviously. However, because we're back propagating, we need to use the weights of previous layers (ones closer to the output layer, I mean) when calculating the error for layers closer to the input layer. But we already calculated the weight changes for the layers closer to the output layer! So, when we use these weights to calculate the error for layers closer to the input, do we use their old values, or their "updated values"?
In other words, if we were to put the the step of updating the weights in my super general algorithm, would it be:
(Updating the weights immediately)
Give input
Get output
Calculate error
Calculate change in weights
Update these weights
Repeat steps 3,4,5 until we reach the input level
OR
(Using the "old" values of the weights)
Give input
Get output
Calculate error
Calculate change in weights
Store these changes in a matrix, but don't change these weights yet
Repeat steps 3,4,5 until we reach the input level
Update the weights all at once using our stored values
In this paper I read, in both abstract examples (the ones based on figures 3.3 and 3.4), they say to use the old values, not to immediately update the values. However, in their "worked example 3.1", they use the new values (even though what they say they're using are the old values) for calculating the error of the hidden layer.
Also, in my book "Introduction to Machine Learning by Ethem Alpaydin", though there is a lot of abstract stuff I don't yet understand, he says "Note that the change in the first-layer weight delta-w_hj, makes use of the second layer weight v_h. Therefore, we should calculate the changes in both layers and update the first-layer weights, making use of the old value of the second-layer weights, then update the second-layer weights."
To be honest, it really seems like they just made a mistake and all the weights are updated simultaneously at the end, but I want to be sure. My ANN is giving me strange results, and I want to be positive that this isn't the cause.
Anyone know?
Thanks!
As far as I know, you should update weights immediately. The purpose of back-propagation is to find weights that minimize the error of the ANN, and it does so by doing a gradient descent. I think the algorithm description in the Wikipedia page is quite good. You may also double-check its implementation in the joone engine.
You are usually backpropagating deltas not errors. These deltas are calculated from the errors, but they do not mean the same thing. Once you have the deltas for layer n (counting from input to output) you use these deltas and the weigths from the layer n to calculate the deltas for layer n-1 (one closer to input). The deltas only have a meaning for the old state of the network, not for the new state, so you should always use the old weights for propagating the deltas back to the input.
Deltas mean in a sense how much each part of the NN has contributed to the error before, not how much it will contribute to the error in the next step (because you do not know the actual error yet).
As with most machine-learning techniques it will probably still work, if you use the updated, weights, but it might converge slower.
If you simply train it on a single input-output pair my intuition would be to update weights immediately, because the gradient is not constant. But I don't think your book mentions only a single input-output pair. Usually you come up with an ANN because you have many input-output samples from a function you would like to model with the ANN. Thus your loops should repeat from step 1 instead of from step 3.
If we label your two methods as new->online and old->offline, then we have two algorithms.
The online algorithm is good when you don't know how many sample input-output relations you are going to see, and you don't mind some randomness in they way the weights update.
The offline algorithm is good if you want to fit a particular set of data optimally. To avoid overfitting the samples in your data set, you can split it into a training set and a test set. You use the training set to update the weights, and the test set to measure how good a fit you have. When the error on the test set begins to increase, you are done.
Which algorithm is best depends on the purpose of using an ANN. Since you talk about training until you "reach input level", I assume you train until output is exactly as the target value in the data set. In this case the offline algorithm is what you need. If you were building a backgammon playing program, the online algorithm would be a better because you have an unlimited data set.
In this book, the author talks about how the whole point of the backpropagation algorithm is that it allows you to efficiently compute all the weights in one go. In other words, using the "old values" is efficient. Using the new values is more computationally expensive, and so that's why people use the "old values" to update the weights.

matching algorithm

I'm writing an application which divides a population of users into pairs for the purpose of performing a task together. Each user can specify various preferences about their partner, e.g.
gender
language
age
location (typically, within X miles/kilometers from where the user lives)
Ideally, I would like the user to be able to specify whether each of these preferences is a "nice to have" or a "must have", e.g. "I would prefer to be matched with a native English speaker, but I must not be matched with a female".
My objective is to maximise the overall average quality of the matches. For example, assume there are 4 users in the system, A, B, C, D. These users can be matched in 3 ways:
Option 1 Match Score
A-B 5
C-D 4
---
Average 4.5
Option 2 Match Score
A-C 2
B-D 3
---
Average 2.5
Option 3 Match Score
A-D 1
B-C 9
---
Average 5
So in this contrived example, the 3rd option would be chosen because it has the highest overall match quality, even though A and D are not very well matched at all.
Is there an algorithm that can help me to:
calculate the "match scores" shown above
choose the pairings that will maximise the average match score (while respecting each user's absolute constraints)
It is not absolutely necessary that each user is matched, so given a choice between significantly lowering the overall quality of the matches, and leaving a few users without a match, I would choose the latter.
Obviously, I would like the algorithm that calculates the matches to complete as quickly as possible, because the number of users in the system could be quite large.
Finally, this system of computing match scores and maximising the overall average is just a heurisitic I've come up with myself. If there's a much better way to calculate the pairings, please let me know.
Update
The problem I've described seems to be a similar to the stable marriage problem for which there is a well-known solution. However, in this problem I do not require the chosen pairs to be stable. My goal is to choose the pairs so that the average "match score" is maximized
What maximum match algorithms have you been looking at? I read your question too hastily at first: it seems you don't necessarily restrict yourself to a bipartite graph. This seems trickier.
I believe this problem could be represented as a linear programming problem. And then you can use Simplex method to solve it.
To find a maximum matching in an arbitrary graph there is a weighted variant of Edmond's matching algorithm:
http://en.wikipedia.org/wiki/Edmonds's_matching_algorithm#Weighted_matching
See the footnotes there.
I provided a possible solution to a similar problem here. It's an algorithm for measuring dissimilarity--the more similar measured data is to expected data, the smaller your resulting number will be.
For your application you would set a person's preferences as the expected data and each other person you compare against would be the measured data. You would want to filter the 'measured data' to eliminate those cases like "must not be matched with a female", that you mention in your original question, before running the comparison.
Another option could be using a Chi-Square algorithm.
By the looks of it your problem is not bipartite, therefore it would seem to me that you are looking for a maximum weight matching in a general graph. I don't envy the task of writing this as Edmond's blossum shrinking algorithm is not easy to understand or implement efficiently. There are implementations of this algorithm out there, one such example being the C++ library LEMON (http://lemon.cs.elte.hu/trac/lemon). If you want a maximum cardinality maximum weight matching you will have to use the maximum weight matching algorithm and add a large weight (sum of all the weights) to each edge to force maximum cardinality as the first priority.
Alternatively as you mentioned in one of the comments above that your match terms are not linear and so linear programming is out, you could always take a constraint programming approach which does not require that the terms be linear.

Resources