Building a PPI network graph from differentially expressed genes and obtaining an interaction file for cytoHubba - bioinformatics

I used GEO2R to obtain three Excel files, each of which contains a column of differentially expressed genes. I would like to combine these three columns into one column, select a sufficient number of genes to build a PPI network graph that does not resemble an impenetrable ball of yarn, then use the resulting interaction file(?) with the cytoHubba app to obtain the top 10 genes of clinical interest. I believe I can accomplish the first task, but the PPI and interaction-file tasks are slightly elusive. Any constructive feedback would be most welcome.
Thank you.

Already answered on the Cytoscape helpdesk: https://groups.google.com/g/cytoscape-helpdesk/c/GWp0QbTJf2o?hl=en_US%2F

Related

Appropriate clustering method for 1 or 2 dimensional data

I have a set of data I have generated that consists of extracted mass values (well, m/z, but that's not so important) and a time. I extract the data from the file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related, based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be grouped together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am not familiar with clustering algorithms, sadly.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that is the step where you determine which feature, or combination of features, you will be clustering on.
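For instance, a minimal sketch with scikit-learn (assuming it is available; the numbers are just the toy values from the question) of clustering on m/z with DBSCAN, so the number of clusters never has to be specified:

import numpy as np
from sklearn.cluster import DBSCAN

# toy (m/z, time) pairs like the ones above
data = np.array([
    [337.65, 1524.6],
    [337.65, 1524.6],
    [337.65, 1604.3],
    [420.10, 1100.0],
])

# cluster on m/z alone; eps is the largest m/z difference allowed within a group
labels = DBSCAN(eps=0.1, min_samples=1).fit_predict(data[:, [0]])
print(labels)  # points sharing a label belong to the same group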
Why don't you just set a threshold?
If successive values (ordered by time) differ by less than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
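To make that concrete, here is a small sketch of the change-threshold approach (plain Python; the function name and tolerance are illustrative):

def group_by_threshold(points, tol=0.1):
    # points: list of (mz, time) tuples; successive values (by time) whose m/z
    # differs by less than tol end up in the same group
    points = sorted(points, key=lambda p: p[1])
    groups = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if abs(cur[0] - prev[0]) <= tol:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups

print(group_by_threshold([(337.65, 1524.6), (337.65, 1524.6), (337.65, 1604.3)]))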
For the simple one-dimensional case, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One way to select a good K is to plot K against the residual variance and pick the K at which the variance drops "dramatically" (the elbow). Another strategy is to use an information criterion (e.g., the Bayesian Information Criterion).
You can extend K-means to multi-dimensional data easily, but you should be careful about the scaling of the individual dimensions. E.g., among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) with these scales; but once you express the second dimension in metres, the opposite is probably true.
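A rough sketch of both points (scikit-learn assumed; the kg/metre numbers are made up along the lines of the example above), standardising the columns first and then scanning K for the elbow in the residual variance:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 1000.0], [2.0, 2000.0], [1.7, 1400.0], [5.0, 5200.0]])
X_scaled = StandardScaler().fit_transform(X)  # put kg and metres on the same footing

for k in range(1, 4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    print(k, km.inertia_)  # within-cluster sum of squares; look for the elbow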

Which data mining algorithm would you suggest for this particular scenario?

This is not directly a programming-related question, but it is about selecting the right data mining algorithm.
I want to infer the age of people from their first names, from the region they live, and if they have an internet product or not. The idea behind it is that:
there are names that are old-fashioned or popular in a particular decade (celebrities, politicians etc.) (this may not hold in the USA, but in the country of interest that's true),
young people tend to live in highly populated regions whereas old people prefer countrysides, and
Internet is used more by young people than by old people.
I am not sure if those assumptions hold, but I want to test that. So what I have is 100K observations from our customer database with
approx. 500 different names (nominal input variable with too many classes)
20 different regions (nominal input variable)
Internet Yes/No (binary input variable)
91 distinct birthyears (numerical target variable with range: 1910-1992)
Because I have so many nominal inputs, I don't think regression is a good candidate. Because the target is numerical, I don't think a decision tree is a good option either. Can anyone suggest a method that is applicable to such a scenario?
I think you could design discrete variables that reflect the split you are trying to determine. It doesn't seem like you need a regression on their exact age.
One possibility is to cluster the ages, and then treat the clusters as discrete variables. Should this not be appropriate, another possibility is to divide the ages into bins of equal distribution.
One technique that could work very well for your purposes is, instead of clustering or partitioning the ages directly, to cluster or partition the average age per name. That is to say, generate a list of all of the average ages and work with this instead. (There may be some statistical problems in the classifier if the discrete categories here are too fine-grained, though.)
However, the best case is if you have a clear notion of what age range you consider appropriate for 'young' and 'old'. Then, use these directly.
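As a hedged illustration of the binning and average-age-per-name ideas (pandas assumed; the names and years are made up):

import pandas as pd

df = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "birthyear": [1950, 1985, 1947, 1972],
})

# equal-frequency bins of the target
df["age_bin"] = pd.qcut(df["birthyear"], q=2, labels=["older", "younger"])

# average birth year per name, usable as a single numeric feature
avg_by_name = df.groupby("name")["birthyear"].mean()
print(df, avg_by_name, sep="\n\n")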
New answer
I would try using regression, but in the manner that I specify. I would try binarizing each variable (if this is the correct term). The Internet variable is binary, but I would make it into two separate binary values. I will illustrate with an example because I feel it will be more illuminating. For my example, I will just use three names (Gertrude, Jennifer, and Mary) and the internet variable.
I have 4 women. Here are their data:
Gertrude, Internet, 57
Jennifer, Internet, 23
Gertrude, No Internet, 60
Mary, No Internet, 35
I would generate a matrix, A, like this (each row represents a respective woman in my list):
[[1,0,0,1,0],
[0,1,0,1,0],
[1,0,0,0,1],
[0,0,1,0,1]]
The first three columns represent the names and the latter two Internet/No Internet. Thus, the columns represent
[Gertrude, Jennifer, Mary, Internet, No Internet]
You can keep doing this with more names (500 columns for the names), and for the regions (20 columns for those). Then you will just be solving the standard linear algebra problem A*x=b where b for the above example is
b=[[57],
[23],
[60],
[35]]
You may be worried that A will now be a huge matrix, but it is a huge, extremely sparse matrix and thus can be stored very efficiently in sparse matrix form. Each row has three 1's in it and the rest are 0. You can then just solve this with a sparse matrix solver. You will want to do some sort of correlation test on the resulting predicted ages to see how effective it is.
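A minimal sketch of that setup using SciPy's sparse least-squares solver (the matrix and ages are the toy example above):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import lsqr

# columns: [Gertrude, Jennifer, Mary, Internet, No Internet]
A = csr_matrix(np.array([
    [1, 0, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1],
], dtype=float))
b = np.array([57.0, 23.0, 60.0, 35.0])

x = lsqr(A, b)[0]          # least-squares solution of A*x = b
print(np.round(A @ x, 1))  # predicted ages; compare against b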
You might check out the babynamewizard. It shows the changes in name frequency over time and should help convert your names to a numeric input. Also, you should be able to use population density from census.gov data to get a numeric value associated with your regions. I would suggest an additional flag regarding the availability of DSL access - many rural areas don't have DSL coverage. No coverage = less demand for internet services.
My first inclination would be to divide your response into two groups, those very likely to have used computers in school or work and those much less likely. The exposure to computer use at an age early in their career or schooling probably has some effect on their likelihood to use a computer later in their life. Then you might consider regressions on the groups separately. This should eliminate some of the natural correlation of your inputs.
I would use a classification algorithm that accepts nominal attributes and a numeric class, like M5 (for trees or rules). Perhaps I would combine it with the bagging meta-classifier to reduce variance. The original M5 algorithm was invented by R. Quinlan, and Yong Wang later made improvements.
The algorithm is implemented in R (the RWeka library).
It can also be found in the open-source machine learning software Weka.
For more information see:
Ross J. Quinlan: Learning with Continuous Classes. In: 5th Australian Joint Conference on Artificial Intelligence, Singapore, 343-348, 1992.
Y. Wang, I. H. Witten: Induction of model trees for predicting continuous classes. In: Poster papers of the 9th European Conference on Machine Learning, 1997.
I think slightly differently from you: I believe that trees are excellent algorithms for dealing with nominal data, because they help you build a model that you can easily interpret and let you identify the influence of each of these nominal variables and its different values.
You can also use regression with dummy variables to represent the nominal attributes; this is also a good solution.
You can also use other algorithms such as SVM (SMO), with the same prior transformation of the nominal variables into binary dummies as in the regression approach.
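A rough sketch of both routes (pandas/scikit-learn assumed; the toy rows reuse the example from the earlier answer): one-hot encode the nominal inputs, then fit a regression tree or an SVM regressor on the dummies.

import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR

df = pd.DataFrame({
    "name": ["Gertrude", "Jennifer", "Gertrude", "Mary"],
    "region": ["rural", "city", "rural", "city"],
    "internet": ["yes", "yes", "no", "no"],
    "birthyear": [1950, 1985, 1947, 1972],
})
X = pd.get_dummies(df[["name", "region", "internet"]])  # dummy variables
y = df["birthyear"]

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
svm = SVR(kernel="linear").fit(X, y)
print(tree.predict(X), svm.predict(X))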

Building an activity chart combining two kinds of data

I'm building, for example, an application that monitors your health. Each day, you're jogging and doing push-ups and you enter the information on a web site.
What I would like to do is build a chart combining the hours you jogged and the number of push-ups/sit-ups you did. Let's say on the first day you jogged 1 hour and did 10 push-ups, and on the second day you jogged 50 minutes and did 20 push-ups; you would see a progression in your training.
I know it may sound strange, but I want an overall view of your health, not different views for jogging and push-ups. I don't want a double y-axis chart because if I have, for example, 6 runners, I will end up with 12 lines on the chart.
First I would redefine your terms. You are not tracking "health" here, you are tracking level of exertion through exercise.
Max Exertion != Max Health. If you exert yourself to the max and don't eat or drink, you will actually damage your health. :-)
To combine and plot your total "level of exertion" for multiple exercises you need to convert them to a common unit ... something like "calories burned".
I'm pretty sure there are many sources for reference tables with rough conversion factors for how many calories various exercises burn.
Does that help any?
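For example, a hedged sketch of the common-unit idea, using rough illustrative MET values (not authoritative figures) and the usual kcal = MET x weight(kg) x hours rule of thumb:

MET = {"jogging": 7.0, "push_ups": 3.8}  # approximate metabolic equivalents

def calories(activity, minutes, weight_kg=70):
    return MET[activity] * weight_kg * (minutes / 60.0)

# one combined "exertion" number per day to plot (push-up durations are a guess)
day1 = calories("jogging", 60) + calories("push_ups", 2)
day2 = calories("jogging", 50) + calories("push_ups", 4)
print(day1, day2)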
Then you need a model of how push-ups and jogging affect your health, and for that you should be asking a doctor or a fitness expert, not a programmer :-). This question should probably be taken elsewhere.
Sounds like a double y-axis chart.
You can just do a regular Excel-type chart with two lines, scaled appropriately: one for push-ups, one for jogging time. There are graphics libraries that let you do that in the back-end language of your choice. The x-axis is the date.
You may want to have 2 scaled graphs, one for last week and one for last year (ala Yahoo Finance charts for different intervals).
Show the first set of values as a line graph above the x-axis, and the second set below the x-axis. If both sets of values increase over time, this will show as an "expansion" of the graph; it should be easy to recognize if one set is growing but the other is not.
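A small matplotlib sketch of that layout (the numbers are invented): one series drawn above the x-axis and the other mirrored below it, so growth shows as the plot expanding.

import matplotlib.pyplot as plt

days = [1, 2, 3, 4]
jog_minutes = [60, 50, 70, 80]
push_ups = [10, 20, 15, 30]

plt.bar(days, jog_minutes, label="jogging (min)")
plt.bar(days, [-p for p in push_ups], label="push-ups (mirrored)")
plt.axhline(0, color="black", linewidth=0.8)  # the shared x-axis
plt.legend()
plt.show()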
Because the two quantities have no intrinsic relationship, you're stuck with either displaying them independently, such as two curves with two y-axes, or making up a measure that combines them, such as an estimate of calories burned, muscles used, mental anguish from exercising, etc. But it's tricky: taking your example, I suspect one will never approach the calories burned from a 50-mile run by doing push-ups. Combining these in a meaningful way depends not on mathematics but on approximations and knowledge of the quantities that you start with and are interested in.
One compromise might be a graph with a single y-axis that shows some combined quantity, but where the independent values at each point are also graphically represented, for example, by a line where the local color represents the ratio of miles to pushups, or any of the many variants that display information in the shapes or colors in the plot.
Another option is to do a 3D plot, and then rotate it around and look for trends or whatever interests you.
If you want one overall measure of exercise levels, you could try using total exercise time. Another alternative is to define a points system, whereby users score points for each exercise.
I do think that there is virtue in letting the users see how much of each individual exercise they have done - in this case use a different graph for different exercises rather than using dual y-axes, if the scales are not comparable (e.g. time jogging and number of push-ups). There is a very good article on the problems with dual y-axes by business intelligence guru Stephen Few, here (pdf).
If you want to know more about presenting data well, I can also recommend his book "Now you see it", and the classic "The Visual Display of Quantitative Information" by Edward Tufte.

Algorithms to find stuff a user would like based on other users' likes

I'm thinking of writing an app to classify movies in an HTPC based on what the family members like.
I don't know statistics or AI, but the stuff here looks very juicy. I wouldn't know where to start, though.
Here's what I want to accomplish:
Compose a set of samples from each user's likes, rating each sample attribute separately. For example, maybe a user likes western movies a lot, so the western genre would carry a bit more weight for that user (and so on for other attributes, like actors, director, etc.).
A user can get suggestions based on the likes of the other users. For example, if both user A and B like Spielberg (connection between the users), and user B loves Batman Begins, but user A loathes Katie Holmes, weigh the movie for user A accordingly (again, each attribute separately, for example, maybe user A doesn't like action movies so much, so bring the rating down a bit, and since Katie Holmes isn't the main star, don't take that into account as much as the other attributes).
Basically, compare sets from user A that are similar to sets from user B, and come up with a rating for user A.
I have a crude idea about how to implement this, but I'm certain some bright minds have already thought of a far better solution already, so... any suggestions?
Actually, after some quick research, it seems a Bayesian filter would work. If so, would this be the better approach? Would it be as simple as just "normalizing" the movie data, training a classifier for each user, and then just classifying each movie?
If your suggestion includes some brain-melting concepts (I'm not experienced in these subjects, especially in AI), I'd appreciate it if you also included a list of some basics for me to research before diving into the meaty stuff.
Thanks!
Matthew Podwysocki had some interesting articles on this stuff
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/03/30/functional-programming-and-collective-intelligence.aspx
http://codebetter.com/blogs/matthew.podwysocki/archive/2009/04/01/functional-programming-and-collective-intelligence-ii.aspx
http://weblogs.asp.net/podwysocki/archive/2009/04/07/functional-programming-and-collective-intelligence-iii.aspx
This is similar to this question where the OP wanted to build a recommendation system. In a nutshell, we are given a set of training data consisting of users' ratings of movies (a 1-5 star rating, for example) and a set of attributes for each movie (year, genre, actors, ...). We want to build a recommender so that it will output a possible rating for unseen movies. So the input data looks like:
user  movie  year  genre  ... | rating
--------------------------------------
1     1      2006  action     | 5
3     2      2008  drama      | 3.5
...
and for an unrated movie X:
10    20     2009  drama      | ?
we want to predict a rating. Doing this for all unseen movies then sorting by predicted movie rating and outputting the top 10 gives you a recommendation system.
The simplest approach is to use a k-nearest neighbor algorithm. Among the rated movies, search for the "closest" ones to movie X, and combine their ratings to produce a prediction.
This approach has the advantage of being very easy to implement from scratch.
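A bare-bones sketch of that nearest-neighbor idea (NumPy assumed; the feature encoding is purely illustrative): represent each movie as a vector, find the k rated movies closest to the unrated one, and average their ratings.

import numpy as np

def knn_predict(rated_features, ratings, new_features, k=3):
    dists = np.linalg.norm(rated_features - new_features, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest rated movies
    return ratings[nearest].mean()    # predicted rating

# toy vectors: [year (scaled), is_action, is_drama]
rated = np.array([[0.6, 1, 0], [0.8, 0, 1], [0.7, 0, 1]])
stars = np.array([5.0, 3.5, 4.0])
print(knn_predict(rated, stars, np.array([0.9, 0, 1]), k=2))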
Other, more sophisticated approaches exist. For example, you can build a decision tree and fit a set of rules to the training data. You can also use Bayesian networks, artificial neural networks, or support vector machines, among many others... Going through each of these won't be easy for someone without the proper background.
Still, I expect you would be using an external tool/library. You seem to be familiar with Bayesian networks, so a simple naive Bayes net could in fact be very powerful. One advantage is that it allows for prediction under missing data.
The main idea would be somewhat the same; take the input data you have, train a model, then use it to predict the class of new instances.
If you want to play around with different algorithms in a simple, intuitive package that requires no programming, I suggest you take a look at Weka (my first choice), Orange, or RapidMiner. The most difficult part would be preparing the dataset in the required format. The rest is as easy as choosing an algorithm and applying it (all in a few clicks!).
For someone not looking to go into too much detail, I would recommend going with the nearest neighbor method, as it is intuitive and easy to implement. Still, the option of using Weka (or one of the other tools) is worth looking into.
There are a few algorithms that are good for this:
ARTMAP: groups via probability against each other (this isn't fast, but it's the best thing for your problem IMO).
ARTMAP holds a group of common attributes and determines the likelihood of similarity via percentages.
ARTMAP
KMeans: This separates out the vectors by their distance from each other.
KMeans: Wikipedia
PCA: will separate the average of all the values from the varying bits. This is what you would use to do face detection and background subtraction in computer vision.
PCA
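As a short, hedged sketch of the PCA suggestion (scikit-learn assumed; the ratings are invented): project each user's rating vector onto its main directions of variation, so users with similar taste land close together.

import numpy as np
from sklearn.decomposition import PCA

# rows = users, columns = ratings given to five movies
ratings = np.array([
    [5, 4, 1, 1, 2],
    [4, 5, 2, 1, 1],
    [1, 2, 5, 4, 5],
    [2, 1, 4, 5, 4],
])
pca = PCA(n_components=2).fit(ratings)
print(pca.transform(ratings))  # 2-D coordinates per user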
The K-nearest neighbor algorithm may be right up your alley.
Check out some of the work of the top teams for the Netflix Prize.

Algorithm for clustering pictures based on date taken

Anyone know of an algorithm that will group pictures into events based on the date the picture was taken? Obviously I can group by the date, but I'd like something a little more sophisticated that would (might) be able to group pictures spanning multiple days based on the frequency over a certain timespan. Consider the following counts:
1/2/2009 15 photos
1/3/2009 20 photos
1/4/2009 13 photos
1/5/2009 19 photos
1/15/2009 5 photos
Potentially these would be grouped into two groups:
1/2/2009 -> 1/5/2009
1/15/2009
Obviously there will be some tolerance(s) that need to be established.
Is there any well-established way of doing this, other than inventing my own top-down approach?
You can apply pretty much any standard clustering technique to this; it's just a matter of defining your distance function correctly. When you are building your matrix of distances between photos, you should consider a combination of the physical distance between locations - if you have it - and the temporal distance between their creation timestamps. Normalise them, put them on separate dimensions, and you may even just be able to take a regular Euclidean distance.
Best of luck.
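A quick sketch of that distance function (SciPy assumed; the positions and times are invented): z-score the time and location columns so they get comparable weight, then take an ordinary Euclidean distance.

import numpy as np
from scipy.spatial.distance import pdist, squareform

# each photo: (timestamp in hours, x in km, y in km)
photos = np.array([
    [0.0,   0.0,  0.0],
    [1.5,   0.2,  0.1],
    [300.0, 50.0, 40.0],
])
normed = (photos - photos.mean(axis=0)) / photos.std(axis=0)  # normalise each dimension
dist_matrix = squareform(pdist(normed))  # feed this to any standard clusterer
print(np.round(dist_matrix, 2))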
Just group together the pictures that were taken on successive days (i.e., with no intervening days on which no pictures were taken).
You might try to dynamically calculate the tolerance based on how many clusters you want to create, or how big they should be (in absolute terms or as a percentage).
To get a useful clustering of pictures according to date, you require the following:
1) The number of clusters should be variable and not fixed in advance of the clustering.
2) The diameter of each cluster should not exceed a specific amount.
The clustering algorithm that best satisfies both requirements is the QT (quality threshold) clustering algorithm. From Wikipedia:
QT (quality threshold) clustering (Heyer, Kruglyak, Yooseph, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
Although it is mainly used for gene clustering, I think it would fit very well with what you need.
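A much-simplified, one-dimensional sketch of the QT idea on day numbers (plain Python; max_span plays the role of the cluster "diameter" in days): grow a candidate cluster around every remaining day, keep the largest one whose span stays under the threshold, and repeat on the leftovers.

def qt_cluster(days, max_span=3):
    remaining = sorted(days)
    clusters = []
    while remaining:
        best = []
        for seed in remaining:
            cand = [seed]
            # greedily add the closest remaining day while the span stays small
            for d in sorted(remaining, key=lambda x: abs(x - seed)):
                if d not in cand and max(cand + [d]) - min(cand + [d]) <= max_span:
                    cand.append(d)
            if len(cand) > len(best):
                best = cand
        clusters.append(sorted(best))
        remaining = [d for d in remaining if d not in best]
    return clusters

print(qt_cluster([2, 3, 4, 5, 15]))  # -> [[2, 3, 4, 5], [15]]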
Try to detect the Gaps instead of the Clusters.
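A tiny sketch of that gap-detection approach (plain Python; the dates are the ones from the question): sort the dates and cut wherever the gap between consecutive photos exceeds a tolerance.

from datetime import date

def split_on_gaps(dates, max_gap_days=3):
    dates = sorted(dates)
    groups = [[dates[0]]]
    for prev, cur in zip(dates, dates[1:]):
        if (cur - prev).days > max_gap_days:
            groups.append([cur])   # gap too large: start a new event
        else:
            groups[-1].append(cur)
    return groups

photos = [date(2009, 1, d) for d in (2, 3, 4, 5, 15)]
print(split_on_gaps(photos))  # -> [[Jan 2..5], [Jan 15]]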
