Clusters of Orthologous Groups [Software] - bioinformatics

I have about 3500 genes to classify, and I would like to classify them by COGs. Does anyone know a good way to do this, i.e., software or a server?

Related

What algorithm do food delivery apps use to group nearby orders?

I've noticed that food delivery apps say something like 'orders with nearby collection and drop-off points are grouped together for efficiency'. I have a similar problem where delivery jobs can come in in real time or be pre-booked, and the algorithm needs to group jobs to get them done faster. I have data on the distances between locations and on how jobs are grouped manually.
I was wondering what kind of algorithms these big companies (here it's Grab, Foodpanda, Deliveroo, etc.) use to group orders. Is it a secret?
Also, I was told this algorithm has to have AI in it, because it's a buzzword that clients love. I'm scratching my head trying to figure out how to incorporate that. E.g., use supervised learning and treat it as a classification problem of which person to choose for each job, based on distance or something? The 'labels' would be data on how humans grouped jobs, which isn't really optimal, and the client wants an improvement over that as well.
My question is whether the commercial algorithms out there for grouping food orders use AI, whether it's appropriate to use AI and how, and in general any insight into what kind of algorithms they use. Thanks in advance.

Looking at the results of clustering algorithms on Protein Interaction Networks

I am working on a project involving the clustering of protein interaction networks. Having run several clustering algorithms on the graphs of interacting proteins, I am now somewhat confused about how to go about checking whether the clusters created are any good or not.
To put this into context: protein interaction networks represent pairwise connections between proteins, and the goal is to isolate groups of interacting proteins that participate in the same biological processes or that together perform specific functions. This is significant because many proteins and interactions are unlabelled, so their function can be inferred if many labelled proteins for a certain function end up in the same cluster.
Unlike typical supervised machine learning tasks, where a labelled data set can show how many groupings are correct, there is no precedent for a good clustering of proteins and their interactions; hypothetically, a clustering where every protein is in its own cluster is as good as one where all proteins are in a single cluster (though neither carries any informational significance). There are also no feature vectors for distance calculations, only binary information about whether one protein interacts with another or not, so this is quite difficult.
This problem is completely exploratory, and it is hard to see whether a clustering is significant or just bogus.
Most academic papers use cluster analysis techniques to assess how good the clusters and the algorithms are, e.g. whether they are robust to edge or node deletion, cluster correlation, etc. I would like to see whether there is any information one can fish out of protein databases: say, input a large number of interactions (from one cluster) and check whether the labelled proteins tend to be involved in the same metabolic process. If a significantly high number of proteins are involved in one metabolic process, one can surmise that the unlabelled proteins may be involved in a similar process or function, or may similarly belong to the same protein domain.
I have just begun delving into bioinformatics and research in general, so there is a very high chance that this has been done before and I haven't looked around extensively enough. If this is the case, I would be grateful for links. I would appreciate any help, or ideas on how one could think about this problem.
If I understand your question: you would like to know if your clustered protein interaction network identifies biologically relevant protein complexes...
I can think of three ways to do this:
1) Use the primary research literature. Take a cluster, search PubMed for each member of the cluster, and see if there are any reports of interactions with other members of the cluster. This will be time-consuming but is the most rigorous.
2) Submit each cluster to GO term enrichment analysis (DAVID, FuncAssociate, etc.) or pathway analysis (KEGG). If a cluster is biologically relevant, it should be enriched for specific GO/KEGG terms. This will only work if most of your proteins have annotations.
3) Look at expression data. Biological complexes tend to have correlated gene expression patterns. Therefore, the expression of a cluster's members should correlate more strongly with each other than with non-members of the cluster.
I thought of a 4th:
4) Find homologs in an organism with a rich and deep annotation database and look for correlations there. Yeast (S. cerevisiae or S. pombe), fly (D. melanogaster), worm (C. elegans), mouse, and human all have large protein interaction databases (e.g. BioGRID).
And a 5th:
5) Use genetic screen data. In this case, genetic epistasis data will show distinct relationships within complexes: proteins in the same complex tend not to have a genetic interaction, while proteins in separate, independently acting complexes can have a genetic interaction component. See the work of Dr. Charles Boone (Univ. of Toronto) on how this can be modeled.
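To make option 2 above concrete: a cluster-level enrichment test usually boils down to a one-sided hypergeometric (Fisher) test per annotation term. A minimal sketch, where all the counts are hypothetical and a real analysis would also correct for multiple testing:

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value: probability of drawing k or more
    annotated proteins in a cluster of size n, given K annotated proteins
    among N proteins in the background."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1)
    ) / comb(N, n)

# Hypothetical numbers: 12 of 20 cluster members carry a GO term that
# annotates only 50 of 6000 proteins genome-wide.
p = hypergeom_enrichment_p(12, 20, 50, 6000)
# Expected count under the null is 20 * 50/6000 ~ 0.17, so seeing 12 is
# an extremely small p-value, i.e. the cluster is strongly enriched.
```

Tools like DAVID run essentially this test for every GO/KEGG term against each cluster.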
Final thoughts:
A little bit of domain-specific knowledge will go a long way toward helping others believe your results. Do well-known, well-studied complexes form clusters? There has been a lot of work done in this field; PubMed will be your friend. Start at BioGRID and work out from there.
Good luck

Algorithm for cross-selling - like Amazon's "people who bought this"

New to this and a long time since I've done any programming or forums ....
However, this is really getting under my skin.
I've been looking around at the algorithms used by Amazon etc. for the recommendations they make about products which have an affinity with the ones people have selected - clearly this works very well.
Here is what I am wondering....
A - Why would this be limited to affinity? Is there never a situation where a product would be exclusive of the original selection, and perhaps a parallel but not alike product might make sense?
B - Why would a neural network not make sense? Could this not work well to provide a good link, or would you just end up with a few products which have a very low weighting, which then perpetuates their non-selection?
Thanks for your views.
James
Question A: You do not need to limit it to affinity. However, you do need to "package up" all other pertinent information in a way that you can present it to the algorithm. You should read up on "association rules", "frequent itemsets" & recommender algorithms. Most of these algorithms analyze transaction data and learn rules like {peanuts, hot dogs} -> {beer}. As for moving away from affinity, you can produce multiple item sets where you generalize beer to {alcoholic beverage}, then use multiple frequent itemsets at different levels of specificity, and some sort of ensemble algorithm to combine them.
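The frequent-itemset and rule-confidence ideas mentioned above can be sketched in a few lines of brute-force Python (the transaction data is invented; real systems use algorithms like Apriori or FP-growth rather than enumerating pairs):

```python
from itertools import combinations
from collections import Counter

# Toy transaction data (hypothetical): each row is one shopping basket.
transactions = [
    {"peanuts", "hot dogs", "beer"},
    {"peanuts", "hot dogs", "beer", "buns"},
    {"peanuts", "beer"},
    {"hot dogs", "buns"},
    {"peanuts", "hot dogs", "beer"},
]

def frequent_itemsets(transactions, size, min_support):
    """Return itemsets of the given size whose support (fraction of
    baskets containing them) meets min_support."""
    counts = Counter(
        frozenset(c)
        for t in transactions
        for c in combinations(sorted(t), size)
    )
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    has_a = [t for t in transactions if antecedent <= t]
    return sum(1 for t in has_a if consequent <= t) / len(has_a)

pairs = frequent_itemsets(transactions, 2, 0.6)
conf = confidence({"peanuts", "hot dogs"}, {"beer"}, transactions)
# Every basket with peanuts and hot dogs also has beer, so conf is 1.0.
```

The "generalize beer to {alcoholic beverage}" trick just means mapping items to category labels and re-running the same mining on the generalized baskets.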
Question B: A neural network, or any other similar model, would not work due to the dimensionality. Say you are Amazon.com and you want to input an item set of up to 5 items. You might encode that as one input neuron per item, times 5. How many items are on Amazon? I have no idea, but I am guessing it's over 100k. Even if you get creative with dimensionality reduction, this is going to be a MASSIVE neural network - and the same goes for a support vector machine or random forest.

The design of Clustering using MapReduce

I have got a similarity matrix like this: ItemA, ItemB, Similarity.
I want to cluster the dataset using an algorithm such as k-means via MapReduce, but I don't know how many MapReduce jobs I should use or how to design them.
You cannot use k-means with a similarity matrix. End of story: k-means needs distances from the points to the means, not similarities between instances. But there are alternative algorithms. Unfortunately, PAM, for example, scales so badly that it does not pay off to run it on a cluster either.
Other than that, just experiment. Choose as many reducers as you have cores, for example, and as many mappers as your cluster can sustain (unless your data is too tiny - there should be several MB per mapper to make the startup cost pay off).
But I don't think you are ready for that question yet. First figure out what you want to do, then how to set parameters that may or may not arise at all.
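To illustrate why a similarity matrix calls for a medoid-based method: k-medoids (the idea behind PAM) only ever compares actual instances, so a precomputed sim[i][j] is all it needs. A naive single-machine sketch, assuming a small toy matrix (a real PAM uses a smarter BUILD/SWAP procedure, and this O(n^2)-per-step loop is exactly what scales badly):

```python
def k_medoids(sim, k, init=None, n_iter=100):
    """Naive k-medoids on a precomputed similarity matrix sim[i][j]
    (higher = more similar). init gives starting medoid indices;
    defaults to the first k points."""
    n = len(sim)
    medoids = list(init) if init is not None else list(range(k))
    labels = [0] * n
    for _ in range(n_iter):
        # Assign each point to its most similar medoid.
        labels = [max(range(k), key=lambda m: sim[i][medoids[m]])
                  for i in range(n)]
        # Move each medoid to the member maximizing within-cluster similarity.
        new_medoids = []
        for m in range(k):
            members = [i for i in range(n) if labels[i] == m] or [medoids[m]]
            new_medoids.append(
                max(members, key=lambda i: sum(sim[i][j] for j in members)))
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, labels

# Hypothetical matrix with two obvious blocks: items 0-2 and items 3-4.
sim = [
    [1.0, 0.9, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1],
    [0.9, 0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.1, 0.9, 1.0],
]
medoids, labels = k_medoids(sim, 2, init=[0, 3])
# labels separates {0, 1, 2} from {3, 4}
```

The assignment step parallelizes naturally in a mapper (one point per record); the medoid update is the reducer. But as the answer says, get the algorithm choice right before worrying about mapper/reducer counts.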

Social network functionality finding connections you might know

I want to create a connection-suggestion functionality for a social network, where users can have connections and connect to other users.
I think the most basic approach is to look at all of my user's connections, find the most frequently occurring common connection that my user is not yet connected to, and suggest that user to my user.
My questions are:
Is this a good basic approach for an easy connection finder?
Is there a good algorithm I can use for finding the most frequently occurring user that my connections are connected to?
I'd try a machine learning approach for this problem.
I'll suggest two common machine learning approaches to solve this problem. For both of them to work, you need to extract features from the data (for example, look at a subgraph, and treat friendship with each member of the subgraph as a binary feature).
The two approaches are:
Classification. Here, you are trying to find a classifier C: User x User -> Boolean (a classifier that, given two users, gives a boolean answer: should they be friends?). The classification approach requires you to first manually label, or otherwise extract, some classified information (a big enough set of pairs, each with a classification). The algorithm will learn this pattern and use it to predict future inputs.
Clustering (AKA Unsupervised learning). You can try and find clusters in your graph, and suggest users to be friends with all members in their cluster.
I have to admit I have never used either of these methods for friendship suggestion, so I have no idea how accurate they will be. You can use cross-validation to estimate the accuracy of the algorithm.
If you are interested in learning more, a free online machine learning course from Stanford started two weeks ago: https://class.coursera.org/ml-2012-002
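As a baseline to compare the machine learning approaches against, the common-connections heuristic from the question can be coded directly (the toy network and all names are hypothetical):

```python
from collections import Counter

# Hypothetical adjacency list: user -> set of direct connections.
network = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}

def suggest(user, network, top=3):
    """Rank non-connections by how many of the user's connections they
    are connected to (common-neighbour count), highest first."""
    counts = Counter()
    for friend in network[user]:
        for fof in network[friend]:
            if fof != user and fof not in network[user]:
                counts[fof] += 1
    return counts.most_common(top)

suggestions = suggest("alice", network)
# dave is connected to both of alice's connections, so he is suggested.
```

This is essentially the "friends of friends" scoring that link-prediction literature calls the common-neighbours heuristic; it makes a reasonable yardstick before investing in a learned model.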

Resources