Gene set enrichment analysis (GSEA) for Bio-Rad Bio-Plex human cytokine screening panel - bioinformatics

We have analyzed the effects of several peptides, each tested separately, on peripheral blood mononuclear cells (PBMCs), measuring changes in cytokine secretion in response to incubation with the peptides. The assay was performed on a Bio-Rad Bio-Plex platform with a Bio-Plex Pro Human Cytokine 48-plex Screening Panel kit, so we now have the changes in secretion of 48 cytokines by PBMCs in response to each peptide. I would like to know whether there is a way to analyze these results with something like gene set enrichment analysis (GSEA) in order to determine, for example, which cell types predominantly produce the significantly changed cytokines, or which processes the changed cytokines signal. If no such program or web service exists yet, can someone recommend an explanatory review or a short book that would help translate the observed cytokine changes into a biological hypothesis about the effect of the tested peptides on the immunocompetent cells of the bloodstream?

What comes to mind is a kind of differential expression analysis across your tested peptide conditions: cluster the conditions by cytokine secretion and use a database such as STRING to look for interactions among the changed proteins: https://string-db.org/cgi/input?sessionId=bpz0SAZRwcoA&input_page_active_form=multiple_identifiers.
You could hypothetically do some sort of enrichment analysis if you had a cytokine set list for a specific pathway/cell-population vs your ranked cytokine secretion list.
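One simple form of such an enrichment analysis is an over-representation test: given a cytokine set for a pathway or cell population, ask whether it overlaps the list of significantly changed cytokines more than chance would predict. The sketch below uses a one-sided hypergeometric test. The cytokine names in `changed` and `example_set` are made-up placeholders for illustration, not measured results or a curated annotation.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, set_size, n_changed, n_overlap):
    """One-sided p-value P(X >= n_overlap) for drawing n_changed items
    without replacement from universe_size items, of which set_size
    belong to the cytokine set of interest."""
    upper = min(set_size, n_changed)
    total = comb(universe_size, n_changed)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_changed - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Hypothetical inputs (placeholders, NOT real data or a real annotation).
panel_size = 48  # number of cytokines measured by the panel
changed = {"IL-1b", "IL-6", "TNF-a", "IL-8", "MIP-1a", "IL-10"}
example_set = {"IL-1b", "IL-6", "TNF-a", "IL-8", "MIP-1a", "MCP-1", "IL-1ra", "G-CSF"}

overlap = changed & example_set
p_value = hypergeom_enrichment_p(
    panel_size, len(example_set), len(changed), len(overlap)
)
print(f"overlap = {len(overlap)}, p = {p_value:.3g}")
```

With only 48 analytes the universe is small, so any such test has limited power, and p-values should be corrected for the number of sets tested.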

Related

cb_explore input format : Use of providing probability value in training

The cb_explore input format requires specifying action:cost:action_probability for each example.
However the cb algorithms within are already trying to learn the optimal policy i.e. probability for each action from the data. Then, why does it need the probability of each action in the input? Is it just for initialization?
If I understand correctly, you are asking why the label associated with cb_explore is a set of action/probability pairs.
The probability of the label action is used to compute an importance weight for training: the update is weighted by the inverse of that probability. This has the effect of amplifying the updates for actions that are played less frequently, making them less likely to be drowned out by actions played more frequently.
This type of label is also very useful at predict time, because it generates a log that can be used to perform unbiased counterfactual analyses. In other words, by logging the probability of playing each of the actions before sampling (see cb_sample, which implements how a single action/probability vector is sampled, as for example in the ccb reduction: https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/vowpalwabbit/cb_sample.cc#L37), we can then use the log to train another policy and compare how it performs against the original.
See the "A Multi-World Testing Decision Service" paper, which describes the mechanism for unbiased offline experimentation with logged data: https://arxiv.org/pdf/1606.03966v1.pdf
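To make the role of the logged probability concrete, here is a minimal sketch of an inverse-propensity-scored (IPS) estimate, one standard way to evaluate a new policy from logged (context, action, cost, probability) records. It is an illustration of the idea only, not Vowpal Wabbit's implementation. The toy log and the policy functions are hypothetical.

```python
def ips_value(log, target_policy):
    """Inverse-propensity-scored estimate of the average cost that
    target_policy would have incurred on the logged contexts.
    log: list of (context, action, cost, prob), where prob is the
    probability with which the logging policy chose `action`."""
    total = 0.0
    for context, action, cost, prob in log:
        # Only records where the new policy agrees with the logged action
        # contribute; dividing by prob up-weights rarely played actions.
        if target_policy(context) == action:
            total += cost / prob
    return total / len(log)

# Hypothetical toy log: two contexts, two actions, uniform logging policy
# (prob = 0.5); the cost is 0 when the action equals the context, else 1.
log = [
    (0, 0, 0.0, 0.5),
    (0, 1, 1.0, 0.5),
    (1, 0, 1.0, 0.5),
    (1, 1, 0.0, 0.5),
]

matching = ips_value(log, lambda context: context)      # always right
opposite = ips_value(log, lambda context: 1 - context)  # always wrong
always_0 = ips_value(log, lambda context: 0)            # right half the time
print(matching, opposite, always_0)
```

Without the logged probabilities the division by `prob` would not be possible, and the estimate would be biased toward whatever the logging policy happened to play most often.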

How do Statistica's 75%/25% Data Sampling and 10-fold Cross-Validation work together?

I ran an analysis on some data using Dell's Statistica software, and I am using this analysis in a scientific paper. Although data mining is not my primary field, I have taken a data mining class and have some knowledge of it.
I know that the data is either split into training and test parts, e.g. 75%/25% (the numbers may vary), or n-fold cross-validation is used to test model performance.
In Statistica's SVM modeling there are configuration tabs to fill in before the model is run. In the data sampling tab I entered a 75%/25% split, and in the cross-validation tab I entered 10-fold cross-validation. In the output, I see that the data was actually split into training and test sets (model predictions are given for the test values).
There is also a cross-validation error. I will copy the results below. I have difficulty understanding and interpreting this output. I hope someone who knows statistics better than I do and/or has more experience with this tool can explain how it works.
Ferda
Support Vector machine results
SVM type: Regression type 1 (capacity=9.000, epsilon=0.100)
Kernel type: Radial Basis Function (gamma=0.053)
Number of support vectors = 705 (674 bounded)
Cross-validation error = 0.244
Mean error squared = 1.830(Train), 0.193(Test), 1.267(Overall)
S.D. ratio = 0.952(Train), 37076026627971.336(Test), 0.977(Overall)
Correlation coefficient = 0.314(Train), -0.000(Test), 0.272(Overall)
I found out that the Statistica website has an answer to my question. In the Sampling tab the data can be split into training and test sets. In the cross-validation tab, if for example 10 is selected, then 10-fold cross-validation is used to choose suitable values for SVM parameters such as nu, epsilon etc. before the SVM model is fitted.
This explanation cleared up my confusion. I hope it helps people in similar situations...
Ferda
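The combination described above (a 75%/25% split for the final test, with 10-fold cross-validation inside the training part to pick hyperparameters) can be sketched as follows. This is only an illustration of the general procedure, not Statistica's implementation: a tiny k-nearest-neighbour regressor stands in for the SVM, its `k` stands in for parameters such as capacity or epsilon, and the data are synthetic.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, test, k):
    """Mean squared error on `test` of a model fitted on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in test) / len(test)

def cv_error(data, k, n_folds=10):
    """Average held-out error over n_folds folds of `data`."""
    folds = [data[i::n_folds] for i in range(n_folds)]
    errors = []
    for i, held_out in enumerate(folds):
        rest = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(mse(rest, held_out, k))
    return sum(errors) / n_folds

# Synthetic data (an assumption for illustration): y = 2x + noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 5.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.3)))
rng.shuffle(data)

# Step 1: 75%/25% split; the test part is never used for tuning.
split = int(0.75 * len(data))
train, test = data[:split], data[split:]

# Step 2: 10-fold cross-validation on the training part only,
# used to choose the hyperparameter.
candidates = [1, 3, 5, 9, 15]
best_k = min(candidates, key=lambda k: cv_error(train, k))

# Step 3: fit on the whole training part, report error on the test part.
test_error = mse(train, test, best_k)
print(best_k, cv_error(train, best_k), test_error)
```

Under this reading, the reported cross-validation error belongs to the parameter search on the training data, while the Train/Test columns come from the single 75%/25% split.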

Understanding Perceptrons

I just started a Machine learning class and we went over Perceptrons. For homework we are supposed to:
"Choose appropriate training and test data sets of two dimensions (plane). Use 10 data points for training and 5 for testing. " Then we are supposed to write a program that will use a perceptron algorithm and output:
a comment on whether the training data points are linearly separable
a comment on whether the test points are linearly separable
your initial choice of the weights and constants
the final solution equation (decision boundary)
the total number of weight updates that your algorithm made
the total number of iterations made over the training set
the final misclassification error, if any, on the training data and also on the test data
I have read the first chapter of my book several times and I am still having trouble fully understanding perceptrons.
I understand that you change the weights whenever a point is misclassified, until none are misclassified anymore. What I'm having trouble understanding is:
What do I use the test data for and how does that relate to the training data?
How do I know if a point is misclassified?
How do I go about choosing test points, training points, threshold or a bias?
It's really hard for me to know how to make up one of these without my book providing good examples. As you can tell I am pretty lost, any help would be so much appreciated.
What do I use the test data for and how does that relate to the training data?
Think about a perceptron as a young child. You want to teach the child how to distinguish apples from oranges. You show it 5 different apples (all red/yellow) and 5 oranges (of different shapes) while telling it what it sees each time ("this is an apple, this is an orange"). Assuming the child has perfect memory, it will learn what makes an apple an apple and an orange an orange if you show it enough examples. It will eventually start to use meta-features (like shape) without you actually telling it to. This is what a perceptron does. Once you have shown it all the examples, you start again at the beginning; each such pass is called an epoch.
What happens when you want to test the child's knowledge? You show it something new: a green apple (not just yellow/red), a grapefruit, maybe a watermelon. Why not show the child exactly the same data as during training? Because the child has perfect memory, it will only repeat what you told it. You won't see how well it generalizes from known to unseen data unless you have separate test data that you never showed it during training. If the child performs horribly on the test data but scores 100% on the training data, you know that it has learned nothing general: it is simply repeating what it was told during training. You trained it too long, and it memorized your examples without understanding what makes an apple an apple because you gave it too many details. This is called overfitting. To prevent your perceptron from recognizing only (!) the training data, you have to stop training at a reasonable time and find a good balance between the sizes of the training and test sets.
How do I know if a point is misclassified?
If its output is different from what it should be. Let's say an apple has class 0 and an orange has class 1 (here you should start reading about single-layer and multilayer perceptrons and how neural networks of multiple perceptrons work). The network takes your input; how it is encoded is irrelevant here, so let's say the input is the string "apple". Your training set is then {(apple1,0), (apple2,0), (apple3,0), (orange1,1), (orange2,1), ...}. Since you know the class beforehand, you can compare it with the network's output, which is either 1 or 0 for the input "apple1". If it outputs 1, you compute (targetValue - actualValue) = (0 - 1) = -1. A nonzero result (here -1; it would be +1 for an orange classified as 0) means that the network gave a wrong output. Compare this to the delta rule and you will see that this small expression is part of the larger update equation. Whenever you get a nonzero value you perform a weight update. If target and actual value are the same, you always get 0 and you know that the network didn't misclassify the point.
How do I go about choosing test points, training points, threshold or a bias?
In practice the bias and threshold aren't "chosen" per se. The bias is trained like any other weight using a simple "trick": add an extra input unit whose value is always 1. The actual bias value is then encoded in this additional unit's weight, and the training algorithm learns it for us automatically.
Depending on your activation function, the threshold is predetermined. For a simple perceptron, classification works as follows: the weighted sum of the inputs (including the bias input) is compared with the threshold, and the output is 1 if the sum exceeds it and 0 otherwise.
Since we use a binary output (0 or 1), a good starting point is to put the threshold at 0.5, since that's exactly the middle of the range [0,1].
Now to your last question about choosing training and test points: this is quite difficult, and you learn to do it by experience. Where you are now, you start off by implementing simple logical functions like AND, OR, XOR etc. There it's trivial: you put everything in your training set and test with the same values (since for x XOR y etc. there are only 4 possible inputs: 00, 10, 01, 11). Keep in mind that XOR is not linearly separable, so a single perceptron can learn AND and OR but not XOR. For complex data like images, audio etc. you'll have to try and tweak your data and features until the network works with them as well as you want it to.
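Putting the pieces of this answer together, a minimal sketch of the homework program might look like the following. The 10 training and 5 test points are hand-picked assumptions (two well-separated clusters in the plane), as are the zero initial weights, the learning rate and the epoch limit. The bias is handled with the "extra input fixed at 1" trick and the threshold is 0.5, as described above.

```python
THRESHOLD = 0.5

def predict(weights, point):
    """weights[0] is the bias weight; its input is always 1."""
    activation = weights[0] + weights[1] * point[0] + weights[2] * point[1]
    return 1 if activation > THRESHOLD else 0

def train_perceptron(points, labels, weights, rate=0.1, max_epochs=1000):
    """Perceptron rule: update only on misclassified points.
    Returns (weights, number of updates, number of epochs used)."""
    updates = 0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for point, target in zip(points, labels):
            error = target - predict(weights, point)  # -1, 0 or +1
            if error != 0:
                weights[0] += rate * error
                weights[1] += rate * error * point[0]
                weights[2] += rate * error * point[1]
                updates += 1
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            return weights, updates, epoch
    return weights, updates, max_epochs

def count_errors(weights, points, labels):
    return sum(predict(weights, p) != t for p, t in zip(points, labels))

# Hypothetical data: class 0 near the origin, class 1 further out.
train_points = [(1, 1), (2, 1), (1, 2), (2, 2), (1, 3),
                (4, 4), (5, 4), (4, 5), (5, 5), (5, 3)]
train_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
test_points = [(1.5, 1.5), (2, 0.5), (0.5, 2), (4.5, 4.5), (5, 4.5)]
test_labels = [0, 0, 0, 1, 1]

initial = [0.0, 0.0, 0.0]
weights, updates, epochs = train_perceptron(train_points, train_labels, list(initial))
train_errors = count_errors(weights, train_points, train_labels)
test_errors = count_errors(weights, test_points, test_labels)

print("initial weights:", initial)
print(f"boundary: {weights[0]:.2f} + {weights[1]:.2f}*x + {weights[2]:.2f}*y = {THRESHOLD}")
print("updates:", updates, "epochs:", epochs)
print("training errors:", train_errors, "test errors:", test_errors)
```

Because the training points are linearly separable, the perceptron rule is guaranteed to reach zero training errors; the test error then shows whether the learned boundary also works for points it never saw.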
What do I use the test data for and how does that relate to the training data?
Usually, to assess how well a particular algorithm performs, one first trains it and then uses different data to test how well it does on data it has never seen before.
How do I know if a point is misclassified?
Your training data has labels, which means that for each point in the training set you know what class it belongs to. A point is misclassified when the class the perceptron predicts differs from that label.
How do I go about choosing test points, training points, threshold or a bias?
For simple problems, you usually take all the training data and split it around 80/20. You train on the 80% and test against the remaining 20%.

Does Weka test results on a separate holdout set with 10CV?

I used 10-fold cross validation in Weka.
I know this usually means that the data is split in 10 parts, 90% training, 10% test and that this is alternated 10 times.
I am wondering on what Weka calculates the resulting AUC. Is it the average of all 10 test sets? Or (and I hope this is true), does it use a holdout test set? I can't seem to find a description of this in the weka book.
Weka averages the test results. This is a better approach than a holdout set, and I don't understand why you would hope for the latter. If you hold out a test set (of what size?), your test would not be statistically significant; it would only say that, for the best parameters chosen on the training data, you achieved some score on an arbitrarily small part of the data. The whole point of cross-validation (as an evaluation technique) is to use all the data for both training and testing in turns, so that the resulting metric approximates the expected value of the true evaluation measure. With a holdout test the estimate would not converge to the expected value (at least not in a reasonable time) and, even more importantly, you would have to choose another constant (how big should the holdout set be, and why?) and reduce the number of samples used for training (whereas cross-validation was developed precisely because of the problem of datasets too small for both training and testing).
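As an illustration of what averaging the test results over folds means for AUC, here is a minimal sketch. It is not Weka's code: the data are synthetic, and a toy nearest-class-mean scorer stands in for the classifier. For comparison it also computes the AUC over the pooled held-out predictions, another common way to summarise cross-validation, which can differ slightly from the per-fold average.

```python
import random

def auc(scores, labels):
    """Probability that a random positive scores above a random negative
    (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_means(rows):
    """Toy 'classifier': remember the mean feature value of each class."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    return sum(neg) / len(neg), sum(pos) / len(pos)

def score(model, x):
    """Higher when x is closer to the positive mean than to the negative one."""
    mean_neg, mean_pos = model
    return abs(x - mean_neg) - abs(x - mean_pos)

# Synthetic one-feature data (an assumption for illustration).
rng = random.Random(1)
positives = [(rng.gauss(1.0, 1.0), 1) for _ in range(100)]
negatives = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]

# Stratified folds: every fold gets 10 positives and 10 negatives.
n_folds = 10
folds = [positives[i::n_folds] + negatives[i::n_folds] for i in range(n_folds)]

fold_aucs, pooled_scores, pooled_labels = [], [], []
for i, held_out in enumerate(folds):
    training = [row for j, fold in enumerate(folds) if j != i for row in fold]
    model = fit_means(training)
    scores = [score(model, x) for x, _ in held_out]
    labels = [y for _, y in held_out]
    fold_aucs.append(auc(scores, labels))
    pooled_scores.extend(scores)
    pooled_labels.extend(labels)

mean_auc = sum(fold_aucs) / n_folds
pooled_auc = auc(pooled_scores, pooled_labels)
print(f"mean of fold AUCs = {mean_auc:.4f}, pooled AUC = {pooled_auc:.4f}")
```

Either way, every example is scored exactly once by a model that did not see it during training, which is what distinguishes both numbers from the AUC on the original training data.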
I performed cross validation on my own (made my own random folds and created 10 classifiers) and checked the average AUC. I also checked to see if the entire dataset was used to report the AUC (similar as to when Weka outputs a decision tree under 10-fold).
The AUC for the credit dataset with a naive Bayes classifier as found by...
10-fold weka = 0.89559
10-fold mine = 0.89509
original train = 0.90281
There is a slight discrepancy between my average AUC and Weka's, but this could be from a failure in replicating the folds (although I did try to control the seeds).

most efficient edit distance to identify misspellings in names?

Algorithms for edit distance give a measure of the distance between two strings.
Question: which of these measures would be most relevant for detecting two differently spelled person names that actually refer to the same person (the difference being due to a misspelling)? The trick is that it should minimize false positives. Example:
Obaama
Obama
=> should probably be merged
Obama
Ibama
=> should not be merged.
This is just an oversimplified example. Are there programmers and computer scientists who have worked out this issue in more detail?
I can suggest an information-retrieval technique for doing so, but it requires a large collection of documents in order to work properly.
Index your data using standard IR techniques. Lucene is a good open source library that can help you with that.
Once you get a name (Obaama for example), retrieve the set of documents the word Obaama appears in. Let this set be D1.
Now, for each word w in D1 [1], search for Obaama AND w (using your IR system). Let the resulting set be D2.
The score |D2|/|D1| is an estimate of how strongly w is connected to Obaama, and it will most likely be close to 1 for w = Obama [2].
You can manually label a set of examples and use them to find the score threshold above which a word should be treated as the same name.
Using a standard lexicographical similarity technique, you can choose to filter out words that are definitely not spelling mistakes (like Barack).
Another solution that is often used requires a query log - find a correlation between searched words, if obaama has correlation with obama in the query log - they are connected.
1: You can improve performance by first doing the 2nd filter, and check only for candidates who are "similar enough" lexicographically.
2: Usually a normalization is also used, because more frequent words are more likely to be in the same documents with any word, regardless of being related or not.
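The procedure above can be sketched in a few lines. This is a toy illustration under stated assumptions: an in-memory list of token sets stands in for a real index such as Lucene, plain Levenshtein distance stands in for the lexicographical filter, the thresholds `max_dist` and `min_score` are hypothetical, and the normalization from note 2 is omitted.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def cooccurrence_score(docs, name, candidate):
    """|D2| / |D1|: the fraction of documents containing `name`
    that also contain `candidate`."""
    d1 = [doc for doc in docs if name in doc]
    if not d1:
        return 0.0
    d2 = [doc for doc in d1 if candidate in doc]
    return len(d2) / len(d1)

def likely_same(docs, name, candidate, max_dist=1, min_score=0.8):
    """Cheap lexical filter first (note 1), then the co-occurrence check."""
    if levenshtein(name, candidate) > max_dist:
        return False
    return cooccurrence_score(docs, name, candidate) >= min_score

# Hypothetical toy corpus: each document is a set of lowercase tokens.
docs = [
    {"obaama", "obama", "president"},
    {"obaama", "barack", "obama"},
    {"obama", "barack", "senate"},
    {"ibama", "brazil", "environment"},
    {"ibama", "amazon", "brazil"},
]

print(likely_same(docs, "obaama", "obama"))   # close spelling, shared documents
print(likely_same(docs, "ibama", "obama"))    # close spelling, no shared documents
print(likely_same(docs, "obaama", "barack"))  # shared document, different spelling
```

Both "obaama"/"obama" and "ibama"/"obama" are at edit distance 1, so edit distance alone cannot separate them; in this toy corpus only the co-occurrence score tells them apart, which is the point about minimizing false positives.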
You can check NerSim (demo) which also uses SecondString. You can find their corresponding papers, or consider this paper: Robust Similarity Measures for Named Entities Matching.

Resources