Entrez E-Search results not matching up with online results - bioinformatics

I am using the code below to perform an esearch, but the IDs that I get from IdList are not matching up with the IDs on the online search.
from Bio import Entrez
Entrez.email = "myEmail@gmail.com"
handle = Entrez.esearch(db="nucleotide", term="chordata[orgn] AND chromosome", retmax=10, idtype="acc")
genome_ids = Entrez.read(handle)['IdList']
print(genome_ids)
When I print the IDs, they don't match the ones online. Does anyone know why? These are the IDs I get when I print genome_ids:
['NG_017163.2', 'NM_017553.3', 'NG_059281.1', 'NM_005101.4',
'MH423692.1', 'MH423691.1', 'MH423690.1', 'MH423689.1', 'MH423688.1',
'MH423687.1']
Here is the link to the online search:
https://www.ncbi.nlm.nih.gov/nuccore/?term=chordata%5Borgn%5D+AND+chromosome
Also, does anyone know how I can download the chromosomal and mitochondrial genomes of all the organisms in the phylum Chordata? I want to do it with Biopython through the E-utilities.

How I can download the chromosomal and mitochondrial genome of all the organisms from the chordata phylum
Go to https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi
Enter chordata in the 'search for' box, select complete name in the dropdown list
Enter a high number for the levels (e.g. 30), and select the filter has genome sequence in the dropdown list
Check the nucleotide checkbox
You will now see a full taxonomic tree of the chordata with its subtaxa. The number behind each taxid is the number of sequences for that taxid. So, NCBI contains 84,366,537 different sequences of chordata.
You probably don't have enough space to download them all, so make a selection, click on the number behind the taxid, and choose Send to > File > FASTA.
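Since the question asks for a Biopython route, here is a minimal sketch of a batched download through the E-utilities history server. It is not part of the answer above. The search term, output file name, and batch size are assumptions to adapt, and the helper names batch_starts and download_fasta are made up for this illustration.

```python
def batch_starts(total, batch_size):
    """Return the retstart offsets needed to page through `total` records."""
    return list(range(0, total, batch_size))


def download_fasta(term, out_path, email, batch_size=500):
    """Search the nucleotide database and save all hits as FASTA, batch by batch.

    Assumes Biopython is installed and the machine has network access.
    """
    from Bio import Entrez  # imported here so the helper above works without Biopython

    Entrez.email = email
    # usehistory="y" stores the result set on NCBI's history server,
    # so the IDs do not have to be sent back with every fetch.
    handle = Entrez.esearch(db="nucleotide", term=term, usehistory="y", retmax=0)
    result = Entrez.read(handle)
    handle.close()

    total = int(result["Count"])
    with open(out_path, "w") as out:
        for start in batch_starts(total, batch_size):
            fetch = Entrez.efetch(
                db="nucleotide",
                rettype="fasta",
                retmode="text",
                retstart=start,
                retmax=batch_size,
                webenv=result["WebEnv"],
                query_key=result["QueryKey"],
            )
            out.write(fetch.read())
            fetch.close()
    return total
```

A call might look like download_fasta("chordata[orgn] AND mitochondrion[filter]", "chordata_mito.fasta", "you@example.com"). The filter in that term is an assumption, so check the hit count on the website first. As noted above, the full chordata set is probably too large to download in one go.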


"Different row counts implied by arguments" in attempt to plot BAM file data

I'm attempting to use this tutorial to manipulate and plot ATAC-sequencing data. I have all the libraries listed in that tutorial installed and loaded, except while they use biocLite(BSgenome.Hsapiens.UCSC.hg19) for the human genome, I'm using biocLite(TxDb.Mmusculus.UCSC.mm10.knownGene) for the mouse genome.
Here I have loaded in my BAM file
sorted_AL1.1BAM <- "Sorted_1_S1_L001_R1_001.fastq.gz.subread.bam"
And created an object called TSSs, which holds the transcription start site regions from the mouse genome. I ultimately want to plot the average signal in my read data across mouse transcription start sites.
TSSs <- resize(genes(TxDb.Mmusculus.UCSC.mm10.knownGene), fix = "start", 1)
The problem occurs with the following code:
nucFree <- regionPlot(bamFile = sorted_AL1.1BAM, testRanges = TSSs, style = "point",
                      format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100,
                      forceFragment = 50)
The error is as follows:
Reading Bam header information.....Done
Filtering regions which extend outside of genome boundaries.....Done
Filtered 24528 of 24528 regions
Splitting regions by Watson and Crick strand..Error in DataFrame(..., check.names = FALSE) :
different row counts implied by arguments
I assume my BAM file contains empty values that need to be changed to NAs. My issue is that I'm not sure how to visualize and manipulate BAM files in R in order to do this. Any help would be appreciated.
I tried the following:
data.frame(sorted_AL1.1BAM)
sorted_AL1.1BAM[sorted_AL1.1BAM == ''] <- NA
I expected this to resolve the issue of different row counts, but I get the same error message.

Fast search algorithm

Let's have tons of posts.
As a user, I want to find all posts containing the words "hello" and "world".
Let's say there is a post with this text "Hello world, this place is beautiful".
Now:
a) Find the text if the user searches for "hello",
b) Find the text if the user searches for "hello", "world",
c) Don't find the text if the user searches for "hello", "world", "funny".
To reduce the quantity of possible candidates I was thinking about this:
for each post (
if number_of_search_words == number_of_post_words -> proceed with search logic
if number_of_search_words < number_of_post_words -> proceed with search logic
if number_of_search_words > number_of_post_words -> don't proceed with search logic
)
but that would also require storing the number of words in each post, which adds complexity.
Is there an elegant way of solving this?
You can use bit containers (bitsets) for this, for example BitMagic.
First, assign each post a sequential integer ID, postID.
Then create N bit containers (N = number of distinct searchable words), each as long as the largest postID.
Then build the index: parse each post and, for each term in the post, set the bit at position postID to 1 in that term's container.
To search:
get the bit containers for your words "hello" and "world".
AND those bit containers.
The resulting container has 1 bits at the postIDs of the posts that contain both search terms.
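The indexing and search steps above can be sketched in Python, using arbitrary-precision integers as stand-ins for the bit containers. This is an illustration of the idea, not BitMagic's API. The tokenizer and the function names are assumptions made up for this example.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(posts):
    """Map each term to an int whose bit number post_id is 1 if the post contains it."""
    index = {}
    for post_id, text in enumerate(posts):
        for term in tokenize(text):
            index[term] = index.get(term, 0) | (1 << post_id)
    return index


def search(index, words):
    """Return the IDs of the posts that contain every word in `words`."""
    if not words:
        return []
    result = -1  # all bits set, the neutral element for AND
    for word in words:
        result &= index.get(word.lower(), 0)  # an unknown word clears every bit
    post_ids = []
    post_id = 0
    while result:
        if result & 1:
            post_ids.append(post_id)
        result >>= 1
        post_id += 1
    return post_ids


posts = [
    "Hello world, this place is beautiful",
    "hello there",
    "a funny world",
]
index = build_index(posts)
hits_a = search(index, ["hello"])
hits_b = search(index, ["hello", "world"])
hits_c = search(index, ["hello", "world", "funny"])
```

Each extra search word costs one AND over the bitset, so no per-post word count is needed to prune candidates.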

How can I locate items using xpath from below elements?

I've created some XPath expressions to locate the first item by its "index" after "h4". However, I did something wrong, which is why they don't work at all. I hope someone can take a look and give me a workaround.
I tried with:
//div[@id="schoolDetail"][1]/text() --For the Name
//div[@id="schoolDetail"]//br[0]/text() --For the PO Box
Elements within which items I would like the expression to locate is pasted below:
<div id="schoolDetail" style=""><h4>School Detail: Click here to go back to list</h4> GOLD DUST FLYING SERVICE, INC.<br>PO Box 75<br><br>TALLADEGA AL 36260<br> <br>Airport: TALLADEGA MUNICIPAL (ASN)<br>Manager: JEAN WAGNON<br>Phone: 2563620895<br>Email: golddustflyingse@bellsouth.net<br>Web: <br><br>View in AOPA Airports (Opens in new tab) <br><br></div>
By the way, the resulting values should be:
GOLD DUST FLYING SERVICE, INC.
PO Box 75
Try to locate required text nodes by appropriate index:
//div[@id="schoolDetail"]/text()[1] // For "GOLD DUST FLYING SERVICE, INC."
//div[@id="schoolDetail"]/text()[2] // For "PO Box 75"
Locator to get both elements:
//*[@id='schoolDetail']/text()[position()<3]
Explanation:
[x] - a predicate in square brackets filters the selected nodes.
If x is an integer, it is automatically compared with the node's position, i.e. it is shorthand for [position()=x]:
//div[2] - selects the 2nd div, same as div[position()=2]
If the predicate [x] is not a number, it is automatically converted to a boolean, and only the nodes for which x is true are returned, for example:
div[position() <= 4] - selects the first four div elements: the condition holds for positions 1 to 4 and fails from the 5th element on
Important: please check following locators on this page:
https://www.w3schools.com/tags/ref_httpmessages.asp
//table//tr[1] - will return the 1st row of each table! (12 elements found, the same as the number of tables on the page)
(//table//tr)[1] - will return 1st row in the first found table (1 found element)
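The difference between the two locators can be reproduced with a small sketch in Python's standard library. The two-table document below is made up for this illustration. ElementTree supports only a subset of XPath and has no parenthesised form, so (//table//tr)[1] is emulated by indexing the result list.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-table document, invented for this example.
doc = ET.fromstring(
    "<html><body>"
    "<table><tr><td>a1</td></tr><tr><td>a2</td></tr></table>"
    "<table><tr><td>b1</td></tr><tr><td>b2</td></tr></table>"
    "</body></html>"
)

# Counterpart of //table//tr[1]: the predicate is evaluated per parent,
# so the first row of every table comes back.
first_row_per_table = doc.findall(".//table//tr[1]")
first_cells = [row.find("td").text for row in first_row_per_table]

# Counterpart of (//table//tr)[1]: collect all rows first, then pick one.
all_rows = doc.findall(".//table//tr")
first_row_overall = all_rows[0]
first_cell_overall = first_row_overall.find("td").text
```

Here first_cells holds one entry per table, while first_cell_overall is a single value from the first table only.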

Graphlab: How to avoid manually duplicating functions that differ only in a string variable?

I imported my dataset with SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column for each of the selected words in the products matrix and the entry is the number of times such word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0
products['awesome'] = products['word_count'].apply(awesome_count)
So far so good, but I would need to manually create a similar function for each of the selected words, e.g., great_count, etc. How can I avoid this manual effort and write cleaner code?
I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review': ['this book is awesome',
                                       'I hate this book']})
products['word_count'] = \
    graphlab.text_analytics.count_words(products['review'])
# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)
# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.
I actually found an easier way to do this:
def wordCount_select(wc, selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0
for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))
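The helper above is a dictionary lookup with a default of 0, which plain Python dicts offer as dict.get. Here is a minimal sketch of the same per-word columns without GraphLab, assuming each word_count entry behaves like a dict. The sample data and the count_columns name are made up for this illustration.

```python
selected_words = ["awesome", "hate", "great"]

# Toy bag-of-words data standing in for products['word_count'].
word_counts = [
    {"this": 1, "book": 1, "is": 1, "awesome": 1},
    {"i": 1, "hate": 2, "this": 1, "book": 1},
]


def count_columns(word_counts, selected_words):
    """Build one list of counts per selected word; missing words count as 0."""
    return {
        word: [wc.get(word, 0) for wc in word_counts]
        for word in selected_words
    }


columns = count_columns(word_counts, selected_words)
```

Each key of columns corresponds to one new column, with one count per review.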

Visual Basic Function Procedure

I need help with the following H.W. problem. I have done everything except the instructions I numbered. Please help!
A furniture manufacturer makes two types of furniture—chairs and sofas.
The cost per chair is $350, the cost per sofa is $925, and the sales tax rate is 5%.
Write a Visual Basic program to create an invoice form for an order.
After the data on the left side of the form are entered, the user can display an invoice in a list box by pressing the Process Order button.
The user can click on the Clear Order Form button to clear all text boxes and the list box, and can click on the Quit button to exit the program.
The invoice number consists of the capitalized first two letters of the customer’s last name, followed by the last four digits of the zip code.
The customer name is input with the last name first, followed by a comma, a space, and the first name. However, the name is displayed in the invoice in the proper order.
The generation of the invoice number and the reordering of the first and last names should be carried out by Function procedures.
Seeing as this is homework and you haven't provided any code to show what effort you have made on your own, I'm not going to provide any specific answers, but I will try to point you in the right direction.
Your first 2 numbered items look to be variations on the same theme: string manipulation. Assuming you have the customer's name and address information from the order form, you just need to write 2 separate functions that each take the parts of the name and address they need and return the resulting value (which covers your 3rd item).
To get parts of the name and address to generate the invoice number, you need to think about using the Left() and Right() functions.
Something like:
Dim first as String, last as String, word as String
word = "Foo"
first = Left(word, 1)
last = Right(word, 1)
Debug.Print(first) 'prints "F"
Debug.Print(last) 'prints "o"
Once you get the parts you need, then you just need to worry about joining the parts together in the order you want. The concatenation operator for strings is &. So using the above example, it would go something like:
Dim concat as String
concat = first & last
Debug.Print(concat) 'prints "Fo"
Your final item, using a Function procedure to generate the desired values, is very easily googleable (is that even a word?). The syntax is very simple, so here's a quick example of a common function that is not built into VB6:
Private Function IsOdd(value As Integer) As Boolean
    If (value Mod 2) = 0 Then ' checks whether value is odd or even by testing
                              ' if value divided by 2 leaves a remainder
                              ' (the Mod operator)
        IsOdd = False ' if the remainder is 0, set IsOdd to False
    Else
        IsOdd = True ' otherwise set IsOdd to True
    End If
End Function
Hopefully this gets you going in the right direction.
