Mapping microarray probes - bioinformatics

I have analysed some microarray data that had been floating around for a few years. I then noticed in the limma output file that there were many probes whose gene IDs did not match the gene IDs in any known database. I have managed to get the probe sequences from Agilent, but I am having problems assigning them to genes in the genome.
I do not have coordinates for the probes, so I will need to map them first. Do you know of any Bioconductor package that will take the probes in FASTA format and output the genes where they are located, using a GFF or GenBank (gbk) file of the genome? I am working with Acaryochloris marina.
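One possible approach, sketched below under stated assumptions (NCBI BLAST+ installed; probes.fasta, A_marina.fasta and A_marina.gff are placeholder file names): align the probes to the genome with blastn, then intersect the hits with the gene features in the GFF.

    # Align probe sequences to the genome and report overlapping genes.
    import csv
    import subprocess

    GENOME = "A_marina.fasta"   # genome FASTA (placeholder)
    PROBES = "probes.fasta"     # Agilent probe sequences (placeholder)
    GFF = "A_marina.gff"        # genome annotation (placeholder)

    # Build a nucleotide BLAST database and align the probes
    # (for very short probes, adding "-task blastn-short" may help).
    subprocess.run(["makeblastdb", "-in", GENOME, "-dbtype", "nucl"], check=True)
    subprocess.run(["blastn", "-query", PROBES, "-db", GENOME,
                    "-outfmt", "6", "-out", "probe_hits.tsv"], check=True)

    # Load gene coordinates from the GFF (3rd column == "gene").
    genes = []
    with open(GFF) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 9 and cols[2] == "gene":
                genes.append((cols[0], int(cols[3]), int(cols[4]), cols[8]))

    # Report every annotated gene that overlaps each probe hit
    # (outfmt 6 columns 9 and 10 are the subject start/end).
    with open("probe_hits.tsv") as fh:
        for row in csv.reader(fh, delimiter="\t"):
            probe, seqid = row[0], row[1]
            start, end = sorted((int(row[8]), int(row[9])))
            for g_seqid, g_start, g_end, attrs in genes:
                if g_seqid == seqid and start <= g_end and end >= g_start:
                    print(probe, attrs)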

Related

How to create a .GTF file?

I am new to bioinformatics and programming, and I would greatly appreciate step-by-step instructions on how to create a .GTF file.

I have two cancer cell lines, each with a different green fluorescent protein (GFP) variant knocked into its genome. The idea is that GFP expression can be used to distinguish cancer cells from non-cancer cells. I would like to count GFP reads for all cancer cells in a single-cell RNA-seq experiment. The experiment was performed on the 10X Chromium platform, on organoids composed of a mix of these cancer cells and non-cancer cells. Next-generation sequencing was then performed, and the reference genome is the human genome sequence, GRCh38.

To 'map' and count GFP reads, I was told to create a .GTF file that holds the location information; this file will then be used to add GFP to the human genome sequence. I have the FASTA sequences for both GFP variants, which I can upload if requested.

Where do I start with creating the .GTF file? Do I create it in Excel, or with, for example, a Bash script in a terminal? I have a link to an Ensembl page describing the format (https://www.ensembl.org/info/website/upload/gff.html?redirect=no), but it is not clear what practical/programming steps are needed. From my reading it seems a GFF (GFF3?) file is needed as an intermediate step. Step-by-step instructions to create the .GTF file would be very welcome. Thanks in advance.
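As a starting point, a minimal sketch of what such a file can look like, assuming each GFP sequence will be appended to the GRCh38 FASTA as its own contig (all names such as GFP_variant1 are placeholders): a GTF is just a tab-separated text file, so a few lines of Python can generate gene/transcript/exon records spanning the whole GFP contig.

    # Hedged sketch: write a three-line GTF for one GFP variant, treating
    # the whole GFP sequence as a single-exon gene on its own contig.
    from pathlib import Path

    lines = Path("GFP_variant1.fasta").read_text().splitlines()
    seq = "".join(l.strip() for l in lines if not l.startswith(">"))
    length = len(seq)

    attrs = 'gene_id "GFP_variant1"; transcript_id "GFP_variant1";'
    with open("GFP_variant1.gtf", "w") as out:
        for feature in ("gene", "transcript", "exon"):
            out.write("\t".join([
                "GFP_variant1",     # seqname: must match the FASTA header
                "custom",           # source
                feature,
                "1", str(length),   # whole contig, 1-based inclusive
                ".", "+", ".",      # score, strand, frame
                attrs,
            ]) + "\n")

The GFP FASTA would then be concatenated onto the genome FASTA and this GTF onto the genome annotation before rebuilding the reference (for 10X data, e.g. with Cell Ranger's mkref).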

Do I need Proper Pairs in BAM file for gene quantification?

I have human RNA reads that I aligned against the human reference genome (GRCh38) using BWA-MEM and TopHat2. I now want to count reads per gene with HTSeq-count. Do I need to filter out the "non-proper pairs" beforehand, so that I only pass proper pairs to HTSeq-count? If so, how can I do that?
samtools flagstat shows me that all BAM files have ~100% mapped reads, and the percentage of proper pairs is between 75-80%.
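For the filtering itself, a minimal sketch, assuming samtools is installed (file names are placeholders): SAM flag 0x2 marks a read mapped in a proper pair, so samtools view -f 2 keeps only those reads.

    # Keep only reads flagged as properly paired (SAM flag 0x2).
    import subprocess

    subprocess.run(["samtools", "view", "-b", "-f", "2",
                    "-o", "proper_pairs.bam", "input.bam"], check=True)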

How to extract features from plain text?

I am writing a text parser which should extract features from product descriptions.
For example:

    text = "Canon EOS 7D Mark II Digital SLR Camera with 18-135mm IS STM Lens"
    features = extract(text)
    print(features)

which should print something like:

    Brand: Canon
    Model: EOS 7D
    ...
The way I do this is by training the system on structured data and building an inverted index that maps a term to a feature. This mostly works well.
When the text contains measurements like 50ml or 2kg, the inverted index will map, for example, 2kg -> Size and 50ml -> Size.
The problem is that when I get a value I haven't seen before, like 13ml, it won't be processed, even though the pattern clearly matches a size and we could tag it as one.
I was thinking of solving this by preprocessing the tokens from the text and looking for patterns I already know, so that whenever a new pattern is identified, it is added to the preprocessing step.
I was wondering, is this the best way to go about this? Or is there a better way of doing this?
The age-old problem of unseen cases. You could train your scraper to grab any number-like characters preceding certain suffixes (ml, kg, etc.) and treat those as sizes, as in the sketch below. The problem with this is that typos and other poorly formatted text could enter your structured data. There is no right answer for how to handle values you haven't seen before - you'll either have to QC them individually or have rules around them. This depends on your dataset.
As far as identifying patterns goes, you'll either have to enter them manually, or manually classify a lot of records and let an algorithm learn them. Not sure that's very helpful, but a lot of this is very dependent on your data.
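A quick illustration of that suffix rule (the unit list and regex are purely illustrative):

    # Tag number-plus-unit tokens as sizes, even for unseen values.
    import re

    SIZE_PATTERN = re.compile(r"^(\d+(?:\.\d+)?)\s*-?\s*(ml|l|g|kg|oz|mm|cm)$",
                              re.IGNORECASE)

    def tag_size(token):
        match = SIZE_PATTERN.match(token)
        if match:
            return ("Size", match.group(1), match.group(2).lower())
        return None

    print(tag_size("13ml"))  # ('Size', '13', 'ml') - unseen value, still tagged
    print(tag_size("2kg"))   # ('Size', '2', 'kg')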
If you have training data like this:

    word  label
    10ml  size-volume
    20kg  size-weight
    etc.
you could train a classifier based on character n-grams, and it would detect that ml means size-volume even if it sees 11-ml or ml11, etc. You should also convert all numbers to a single placeholder (e.g. 0), so that 11-ml is seen as 0-ml before feature extraction.
For that you'll need a preprocessing module and a reasonably large training sample. For feature extraction you can use scikit-learn's character n-grams, with, for example, an SVM as the classifier.
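A minimal sketch of that pipeline, with a tiny made-up training set just to show the moving parts (the tokens, labels, and n-gram range are illustrative, not prescriptive):

    # Character n-gram features + linear SVM, with digits collapsed to "0"
    # before feature extraction, as described above.
    import re

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    def normalize(token):
        # Map every digit run to "0" so 11-ml and 50ml share features with 0ml.
        return re.sub(r"\d+", "0", token.lower())

    tokens = ["10ml", "20kg", "500g", "2l", "Canon", "EOS"]   # made-up sample
    labels = ["size-volume", "size-weight", "size-weight",
              "size-volume", "brand", "model"]

    model = make_pipeline(
        CountVectorizer(analyzer="char", ngram_range=(1, 3)),
        LinearSVC(),
    )
    model.fit([normalize(t) for t in tokens], labels)

    print(model.predict([normalize("13ml")]))  # expected: ['size-volume']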

Retrieving DNA sequences from a database of protein sequences?

I have thousands of protein sequences in FASTA format, together with their accession numbers. I want to go back into the whole-genome shotgun (WGS) database and retrieve all DNA sequences that encode a protein identical to one in my initial list.
I've tried running tblastn limited to <10 results per sequence, one per query, with an e-value cutoff below 1e-100 (or even an e-value of zero), and I'm not getting any results. I would like to automate this entire process.
Is this something that can be done by running BLAST from the command line with a batch script?
You should get at least one result: the sequence that encodes the original protein. The others, if any, would be pseudogenes, if I follow you.
Anyway, a bit of programming may help; check out Biopython. BioPerl or BioRuby should have similar features.
In particular, you can run BLAST using Biopython.
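A hedged sketch of that approach, combining Biopython with command-line BLAST+, where "proteins.fasta" and "wgs_db" are placeholders for your query file and a locally formatted WGS nucleotide database:

    # Run tblastn once per protein query and save tabular hits per sequence.
    import subprocess
    from Bio import SeqIO

    for record in SeqIO.parse("proteins.fasta", "fasta"):
        safe_id = record.id.replace("|", "_").replace("/", "_")
        query_file = f"{safe_id}.fasta"
        SeqIO.write(record, query_file, "fasta")
        subprocess.run(["tblastn", "-query", query_file, "-db", "wgs_db",
                        "-evalue", "1e-100", "-max_target_seqs", "10",
                        "-outfmt", "6", "-out", f"{safe_id}_hits.tsv"],
                       check=True)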
You might find this link useful:
https://www.biostars.org/p/5403/
A similar question has been asked there, and some reasonable solutions have been posted.

MapReduce way to calculate a user similarity matrix

I have a list of many users (over 10 million), each of which is represented by a user ID followed by 10 floating-point numbers indicating their preferences. I would like to efficiently calculate the user similarity matrix using cosine similarity with MapReduce. However, since the values are floating-point numbers, it is hard to determine a key in the MapReduce framework. Any suggestions?
I think the easiest solution would be the Mahout library. There are a couple of map-reduce similarity matrix jobs in Mahout that might work for your use case.
The first is Mahout's ItemSimilarityJob, part of its recommender-system libraries. The specific info for that job can be found here. You would simply need to provide the input data in the required format and choose your VectorSimilarityMeasure (which for your case would be SIMILARITY_COSINE), along with any additional optimizations. Since you are looking to calculate user-user similarity based on a preference vector of ten floating-point values, what you could do is assign a simple 1-to-10 numeric index to the positions of the vector and generate a simple .csv file of vectorIndex,userID,decimalValue as input for the Mahout item-similarity job (the userID being a numeric int or long value); a sketch of generating that file follows below. The resulting output should be a tab-separated text file of userID,userID,similarity.
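A small sketch of generating that input file, assuming a placeholder users.tsv that holds one "userID v1 ... v10" record per line:

    # Flatten each user's 10-value preference vector into
    # vectorIndex,userID,decimalValue rows for the Mahout job.
    with open("users.tsv") as src, open("mahout_input.csv", "w") as out:
        for line in src:
            fields = line.split()
            user_id, values = fields[0], fields[1:]
            for index, value in enumerate(values, start=1):
                out.write(f"{index},{user_id},{value}\n")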
A second solution might be Mahout's RowSimilarityJob, included in its math library. I've never used it myself, but some info can be found here and in this previous Stack Overflow thread. Rather than a .csv as input, you would need to translate your input data into a DistributedRowMatrix, with the userIDs as the rows of the matrix. The output, I believe, will also be a DistributedRowMatrix sequence file containing the user-user similarity data you are seeking.
I suppose which solution is better depends on what input/output format you prefer. All the best.
