Can MappingScore() be used to get an absolute measure of scRNAseq dataset similarity to the reference dataset? - rna-seq

I have been using Seurat v4 Reference Mapping to align some query scRNAseq datasets that come from iPSC-derived cells subjected to several directed cortical differentiation protocols at multiple timepoints. I made the reference dataset by merging several individual fetal cortical sample datasets that I had annotated based on their unsupervised cluster DEGs (following this vignette using the default parameters).
I am interested in seeing which protocol produces cells most similar to those found in the fetal datasets, as well as which fetal timepoints the query datasets tend to map to. I understand that the MappingScore() function can flag query cells that aren't well represented in the reference dataset, so I figured these scores could tell me which datasets are most similar to the reference. However, when comparing the violin plots of mapping scores for a query dataset from one of the differentiation protocols against a query dataset containing only pluripotent cells, cells with high mapping scores appear in both cases (see attached images), even though only the differentiated cells should closely resemble the fetal cortical tissue cells. I attached the code as a .txt file.
My question is whether the mapping score can be used as an absolute measurement of query-to-reference dataset similarity, or whether it is always just a relative measure whose high and low thresholds are set by the query dataset. If the latter, what alternative functions might I use here to get information about absolute similarity?
Thanks.
Attachments:
Pluripotent Cell Mapping Score
Differentiated Cell Mapping Score
Code Used For Mapping

Related

Gensim Doc2Vec model returns different cosine similarity depending on the dataset

I trained two versions of doc2vec models with two datasets.
The first dataset was made of 2400 documents, and the second one was made of 3000 documents, including the documents that were used in the first dataset.
For example:
dataset 1 = doc1, doc2, ... doc2400
dataset 2 = doc1, doc2, ... doc2400, doc2401, ... doc3000
I thought that both doc2vec models should return the same similarity score between doc1 and doc2; however, they returned different scores.
Does a doc2vec model's result change with the dataset, even when the datasets include the same documents?
Yes, any addition to the training set will change the relative results.
Further, as explained in the Gensim FAQ, even re-training with the exact same data will typically result in different end coordinates for each training doc, though each run should be about equivalently useful:
https://github.com/RaRe-Technologies/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vec--doc2vec--etc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism
What should remain roughly the same between runs are the neighborhoods around each document. That is, adding some extra training docs shouldn't change the general result that some candidate doc is "very close" or "closer than other docs" to some target doc - except to the extent that (1) the new docs might include some even-closer docs, and (2) there is a small amount of 'jitter' between runs, per the FAQ answer above.
If in fact you see lots of change in the relative neighborhoods and top-N neighbors of a document, either in repeated runs or runs with small increments of extra data, there's possibly something else wrong in the training.
In particular, 2400 docs is a pretty small dataset for Doc2Vec - smaller datasets might need smaller vector_size and/or more epochs and/or other tweaks to get more reliable results, and even then, might not show off the strengths of this algorithm on larger (tens-of-thousands to millions of docs) datasets.
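As a minimal sketch of how you might check this yourself, the following trains one model per corpus and compares raw pairwise similarities against top-N neighborhood overlap (the corpora here are made-up placeholders; gensim 4.x API):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpora: corpus_b extends corpus_a with extra documents.
texts_a = [f"some example text number {i}".split() for i in range(2400)]
texts_b = texts_a + [f"additional document {i}".split() for i in range(600)]
corpus_a = [TaggedDocument(words, [f"doc{i}"]) for i, words in enumerate(texts_a)]
corpus_b = [TaggedDocument(words, [f"doc{i}"]) for i, words in enumerate(texts_b)]

# Train one model per corpus; small corpora often need more epochs.
model_a = Doc2Vec(corpus_a, vector_size=50, epochs=40, min_count=2)
model_b = Doc2Vec(corpus_b, vector_size=50, epochs=40, min_count=2)

# The raw pairwise similarity will differ between the two models...
print(model_a.dv.similarity("doc1", "doc2"))
print(model_b.dv.similarity("doc1", "doc2"))

# ...but the top-N neighborhood of a document should largely overlap.
top_a = {tag for tag, _ in model_a.dv.most_similar("doc1", topn=10)}
top_b = {tag for tag, _ in model_b.dv.most_similar("doc1", topn=10)}
print("neighborhood overlap:", len(top_a & top_b) / 10)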

Interpret Google AutoML Online Prediction Results

We are using Google AutoML Tables with CSV files as input. We imported the data, linked the schema with nullable columns, trained the model, and then deployed it and used online prediction to predict the value of one column.
The column we targeted has values in the range 44-263 (min-max).
When we deployed and ran the online prediction, it returned values like this:
Prediction result: 0.49457597732543945
95% prediction interval: [-8.209495544433594, 0.9892584085464478]
Most of the result set is in the above format. How can we convert it to values in the range of 44-263? We didn't find much documentation online on this.
We are looking for documentation references and an interpretation of the prediction result, along with the 95% prediction interval.
Actually, to clarify (I'm the PM of AutoML Tables):
AutoML Tables does not do any normalization of the predicted values for your label data, so if your label data has a min/max of 44-263, the output predictions should also be in that range. Two possibilities would make it significantly different:
1) You selected the wrong label column.
2) Your input features for this prediction are dramatically different from what was seen in the training data.
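A quick way to sanity-check this is to compare the returned predictions against the training label range; a rough pandas sketch, where the file and column names are hypothetical:

import pandas as pd

# Hypothetical files: the CSV used for training and an export of the
# online-prediction results.
train = pd.read_csv("training_data.csv")
preds = pd.read_csv("prediction_results.csv")

# Assumed column names: "target" is the label column used at training
# time, "prediction" holds the returned point estimates.
label_min, label_max = train["target"].min(), train["target"].max()
print(f"training label range: [{label_min}, {label_max}]")

# If most predictions fall far outside the training label range, the
# wrong label column was likely selected, or the prediction inputs are
# far from the training distribution.
outside = preds[(preds["prediction"] < label_min) | (preds["prediction"] > label_max)]
print(f"{len(outside)} of {len(preds)} predictions fall outside that range")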
Please feel free to reach out to cloud-automl-tables-discuss#googlegroups.com if you'd like us to help debug further.

Best method to identify and replace outliers in salary columns in Python

What is the best method to identify and replace outliers in the ApplicantIncome, CoapplicantIncome, LoanAmount, and Loan_Amount_Term columns in pandas (Python)?
I tried IQR with a seaborn box plot: I identified the outliers, replaced them with NaN, and then filled those NaN records with the mean of ApplicantIncome.
I would like to try taking group means over combinations of columns such as Gender, Education, Self_Employed, and Property_Area.
My dataframe has the following columns (one sample record shown):
Loan_ID LP001357
Gender Male
Married NaN
Dependents NaN
Education Graduate
Self_Employed No
ApplicantIncome 3816
CoapplicantIncome 754
LoanAmount 160
Loan_Amount_Term 360
Credit_History 1
Property_Area Urban
Loan_Status Y
Outliers
Just like missing values, your data might also contain values that diverge heavily from the majority of your other data. These data points are called "outliers". To find them, you can check the distribution of your single variables by means of a box plot, or you can make a scatter plot of your data to identify data points that don't lie in the "expected" area of the plot.
The causes of outliers in your data can vary, from system errors to people interfering with the data during entry or processing, but it's important to consider the effect they can have on your analysis: they will change summary statistics such as the standard deviation, mean, or median, and they can reduce normality and distort the results of statistical models such as regression or ANOVA.
To deal with outliers, you can delete, transform, or impute them; the decision will again depend on the data context. That's why it's important to understand your data and identify the cause of the outliers:
If the outlier value is due to data entry or data processing errors, you might consider deleting the value.
You can transform the outliers by assigning weights to your observations, or use the natural log to reduce the variation that the outlier values in your data set cause.
Just like the missing values, you can also use imputation methods to replace the extreme values of your data with median, mean, or mode values.
You can use the functions that were described in the above section to deal with outliers in your data.
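For the columns in the question, a minimal sketch of IQR-based flagging followed by group-wise mean imputation might look like this (the file name is hypothetical, and the 1.5 * IQR fences are just the usual convention):

import numpy as np
import pandas as pd

# Hypothetical file; columns follow the record shown in the question.
df = pd.read_csv("loan_data.csv")
value_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]
group_cols = ["Gender", "Education", "Self_Employed", "Property_Area"]

for col in value_cols:
    # Flag values outside the 1.5 * IQR fences as outliers (set to NaN).
    q1, q3 = df[col].quantile([0.25, 0.75])
    fence = 1.5 * (q3 - q1)
    df.loc[(df[col] < q1 - fence) | (df[col] > q3 + fence), col] = np.nan

    # Impute with the mean of similar applicants, falling back to the
    # overall column mean where a group is empty or its keys are missing.
    df[col] = df.groupby(group_cols)[col].transform(lambda s: s.fillna(s.mean()))
    df[col] = df[col].fillna(df[col].mean())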
The following links may be useful:
Python data cleaning
Ways to detect and remove the outliers

How does "Addressing missing data" help KNN function better?

Source:- https://machinelearningmastery.com/k-nearest-neighbors-for-machine-learning/
This page has a section quoting the following passage:
Best Prepare Data for KNN
Rescale Data: KNN performs much better if all of the data has the same scale. Normalizing your data to the range [0, 1] is a good idea. It may also be a good idea to standardize your data if it has a Gaussian distribution.
Address Missing Data: Missing data will mean that the distance between samples cannot be calculated. These samples could be excluded or the missing values could be imputed.
Lower Dimensionality: KNN is suited for lower dimensional data. You can try it on high dimensional data (hundreds or thousands of input variables) but be aware that it may not perform as well as other techniques. KNN can benefit from feature selection that reduces the dimensionality of the input feature space.
Please, can someone explain the second point, i.e. Address Missing Data, in detail?
Missing data in this context means that some samples do not have values for all of the features.
For example:
Suppose you have a database with age and height for a group of individuals.
This would mean that for some persons either the height or the age is missing.
Now, why does this affect KNN?
Given a test sample, KNN finds the samples that are closest to it (i.e., the individuals with similar age and height), and it uses those nearest neighbors to make an inference about the test sample.
If you want to find these neighbors, you must be able to compute the distance between samples, and to compute the distance between two samples you need all of the features for both of them. If some are missing, you won't be able to compute the distance, so you would implicitly be losing the samples with missing data.
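A small scikit-learn sketch of the two options mentioned above, excluding incomplete samples versus imputing them first (the age/height data is made up):

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.neighbors import KNeighborsClassifier

# Toy age/height data with one missing height.
X = np.array([[25, 170.0], [30, np.nan], [22, 160.0], [40, 180.0]])
y = np.array([0, 1, 0, 1])

# Fitting KNN directly on X fails: the distance to the sample with the
# missing height cannot be computed.
# KNeighborsClassifier(n_neighbors=2).fit(X, y)  # raises ValueError on NaN

# Option 1: exclude samples with missing values (loses data).
mask = ~np.isnan(X).any(axis=1)
knn = KNeighborsClassifier(n_neighbors=2).fit(X[mask], y[mask])

# Option 2: impute the missing values first, keeping every sample.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)
knn = KNeighborsClassifier(n_neighbors=2).fit(X_imputed, y)
print(knn.predict([[28, 172.0]]))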

How can I compare two similarities obtained using two different data sets?

I am trying to calculate user-user similarities via cosine similarity using two different data sets (the users are the same; only the features considered for obtaining the similarities differ between the data sets). Now, is there a way I could tell how similar these two data sets are based on the similarity values?
I think the answer here should be no, unless the two data sets share no common features (if they differ only in units, you can normalize them both and use them). For example, you cannot recommend movies to a user using two different data sets, where one contains only the age and gender of the users while the other contains only the users' favorite genres, and then compare the two results.
Also, your query vector should have the same features as the data set that the similarity search algorithm uses.
In your case, if the query has the features of both data sets, you can find the k nearest neighbors in both of them and return them both, i.e., 2k results. But you cannot choose between the two sets of k NNs to decide which is best. I would instead recommend finding a way to merge the two data sets rather than following this approach.
Edit:
I misinterpreted the question. If you have the same users in both data sets, you should merge them (preferably on the User ID column, if there is one) and then use the merged data set to calculate similarity among users.
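A minimal sketch of that merge-then-compare approach (the file and column names are hypothetical, and any non-numeric features would need encoding first):

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data sets: same users, different feature sets.
df_a = pd.read_csv("features_a.csv")  # e.g. user_id, age, ...
df_b = pd.read_csv("features_b.csv")  # e.g. user_id, genre scores, ...

# Merge on the shared user ID so each row holds all features for one user.
merged = df_a.merge(df_b, on="user_id").set_index("user_id")

# User-user cosine similarity over the combined (numeric) feature space.
numeric = merged.select_dtypes(include="number")
sim = pd.DataFrame(cosine_similarity(numeric.values),
                   index=numeric.index, columns=numeric.index)
print(sim.head())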
Your question about the similarity of data sets does not make much sense in this context.
