How to get a sequence description from a GI number through Biopython? - bioinformatics

I have a list of GI (GenBank identifier) numbers. How can I get the sequence description (e.g. 'Mus musculus hypothetical protein X') for each GI number, so that I can store it in a variable and write it to a file?
Thanks for your help!

This is a script I wrote to pull the entire GenBank file for each GenBank identifier in a file. It should be easy enough to adapt to your application.
#This program will open a file containing NCBI sequence identifiers, find the associated
#information and write the data to *.gb
import os
import sys
from Bio import Entrez

Entrez.email = "yourname@xxx.xxx"  #Always tell NCBI who you are

try:  #checks to make sure the input file is in the folder
    name = raw_input("\nEnter file name with sequence identifiers only: ")
    handle = open(name, 'r')
except IOError:
    print "File does not exist in folder! Check file name and extension."
    quit()

outfile = os.path.splitext(name)[0] + "_GB_Full.gb"
totalhand = open(outfile, 'w')
for line in handle:
    line = line.rstrip()  #strips \n from the line
    print line
    fetch_handle = Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=line)
    data = fetch_handle.read()
    fetch_handle.close()
    totalhand.write(data)
totalhand.close()
handle.close()

So, in case anybody else has the same question, here is the solution:
handle = Entrez.esummary(db="nucleotide", id="gi_or_NCBI_ref_number")  #db can also be "protein", etc.
record = Entrez.read(handle)
handle.close()
description = record[0]["Title"]
print description
This will print the sequence description that corresponds to the identifier.
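For completeness, here is a minimal sketch of the whole task from the original question: read GI numbers from a list, look each one up with Entrez.esummary, and write the descriptions to a file. The identifiers and the output file name are placeholders of my own choosing, and it follows the Python 2 style used above.

from Bio import Entrez

Entrez.email = "yourname@xxx.xxx"  #Always tell NCBI who you are

gi_numbers = ["6273291", "6273290"]  #placeholder GI numbers
out = open("descriptions.txt", "w")  #placeholder output file
for gi in gi_numbers:
    summary_handle = Entrez.esummary(db="nucleotide", id=gi)
    record = Entrez.read(summary_handle)
    summary_handle.close()
    description = record[0]["Title"]  #the sequence description string
    out.write(gi + "\t" + description + "\n")
out.close()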

Related

Python: Opening auto-generated file

As part of my larger program, I want to create a logfile with the current time & date as part of the title. I can create it as follows:
malwareLog = open(datetime.datetime.now().strftime("%Y%m%d - %H.%M " + pcName + " Malware scan log.txt"), "w+")
Now, my app is going to call a number of other functions, so I'll need to open the file, write some output to it and close the file, several times. It doesn't seem to work if I simply go:
malwareLog.open(malwareLog, "a+")
or similar. So how should I open a dynamically created txt file that I don't know the actual filename for...?
When you create the malwareLog object, it has a name attribute which contains the file name.
Here's an example (my test plays the role of your malwareLog):
import random
test = open(str(random.randint(0,999999))+".txt", "w+")
test.write("hello ")
test.close()
test = open(test.name, "a+")
test.write("world!")
test.close()
with open(test.name, "r") as f: print(f.read())
You can also store the file name in a variable before or after creating the file.
###Before
file_name = "123"
malwareLog = open(file_name, "w")
###After
malwareLog = open(str(random.randint(0, 999999)) + ".txt", "w")
file_name = malwareLog.name
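Since the app calls a number of other functions that each append to the log, one option is to store the name once and reopen the file in append mode wherever it is needed. A minimal sketch, where append_to_log and the pcName value are my own illustrative names:

import datetime

def append_to_log(log_name, message):
    #reopen the log by its stored name, append one line, close it again
    with open(log_name, "a") as log:
        log.write(message + "\n")

pcName = "TESTPC"  #placeholder machine name
malwareLog = open(datetime.datetime.now().strftime("%Y%m%d - %H.%M ") + pcName + " Malware scan log.txt", "w+")
log_name = malwareLog.name  #remember the generated file name
malwareLog.close()

append_to_log(log_name, "Scan started.")
append_to_log(log_name, "Scan finished.")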

Change the delimiter in multiple CSV files from same folder and write them into a new folder

I am a newbie programmer in Python and I am trying to read multiple CSV files from a folder, replace the delimiter of every file with a tab delimiter, and then output the files into a new folder with the replaced delimiter. So far I am stuck at the beginning.
Here is the code that I started with; it works for a single file, but I am not able to make it work with multiple files in the same folder.
print("\nWrite same CSV File with different string(Replace ',' with tab delimiter)")
with open('Names.csv','r') as csv_file:
csv_reader = csv.reader(csv_file)
with open('Names_new.csv', 'w') as new_file:
csv_writer = csv.writer(new_file, delimiter = '\t', lineterminator='\r')
for line in csv_reader:
csv_writer.writerow(line)
Please can someone point out some tips?
Thank you in advance!
I don't think the code in your question does quite what you want. However, here's how to embed it in more code that reads the CSV files from a specified folder for processing: listdir takes input_folder and returns a list of all of the files in that folder, and the loop processes only those files whose names end with '.csv'.
from os import listdir
import csv

input_folder = 'catalyst/'
for file_name in listdir(input_folder):
    if file_name.endswith('.csv'):
        print('---> processing input file:', file_name)
        with open(input_folder + file_name, 'r') as csv_file:
            csv_reader = csv.reader(csv_file)
            out_file_name = file_name[:-4] + '_new.csv'  #strip '.csv' before adding the suffix
            print('     creating', out_file_name)
            with open(input_folder + out_file_name, 'w') as new_file:
                csv_writer = csv.writer(new_file, delimiter='\t', lineterminator='\r')
                for line in csv_reader:
                    csv_writer.writerow(line)
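The question also asks for the converted files to end up in a new folder rather than next to the originals. A minimal sketch of that variation; the output folder name catalyst_tab/ is my own placeholder, created on first run if it does not exist:

import os
import csv

input_folder = 'catalyst/'
output_folder = 'catalyst_tab/'  #placeholder destination folder

if not os.path.isdir(output_folder):
    os.makedirs(output_folder)  #create the destination folder on first run

for file_name in os.listdir(input_folder):
    if file_name.endswith('.csv'):
        with open(os.path.join(input_folder, file_name), 'r') as csv_file:
            csv_reader = csv.reader(csv_file)
            with open(os.path.join(output_folder, file_name), 'w') as new_file:
                csv_writer = csv.writer(new_file, delimiter='\t', lineterminator='\r')
                for row in csv_reader:
                    csv_writer.writerow(row)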

Strict searching against two different files

I have two questions regarding the following code:
import subprocess
macSource1 = (r"\\Server\path\name\here\dhcp-dump.txt")
macSource2 = (r"\\Server\path\name\here\dhcp-dump-ops.txt")
with open(r"specific-pcs.txt") as file:
    line = []
    for line in file:
        pcName = line.strip().upper()
        with open(macSource1) as source1, open(macSource2) as source2:
            items = []
            for items in source1:
                if pcName in items:
                    items_split = items.rstrip("\n").split('\t')
                    ip = items_split[0]
                    mac = items_split[4]
                    mac2 = ':'.join(s.encode('hex') for s in mac.decode('hex')).lower()  # Puts the :'s between the pairs.
                    print mac2
                    print pcName
                    print ip
Firstly, as you can see, the script searches the contents of macSource1 for each name in "specific-pcs.txt" to get various details. How do I get it to search BOTH macSource1 and macSource2 (as the details could be in either file)?
And secondly, I need a stricter matching process, as at the moment a machine called 'itroom02' will not only find its own details, but also the details of another machine called '2nd-itroom02'. How would I fix that?
Many thanks for your assistance in advance!
Chris.
Perhaps you should restructure it a bit more like this:
macSources = [r"\\Server\path\name\here\dhcp-dump.txt",
              r"\\Server\path\name\here\dhcp-dump-ops.txt"]

with open(r"specific-pcs.txt") as file:
    for line in file:
        # ....
        for target in macSources:
            with open(target) as source:
                for items in source:
                    # ....
There's no need to do e.g. line = [] immediately before you do for line in ...:.
As far as the "stricter matching" goes, since you don't give examples of the format of your files, I can only guess - but you might want to try something like if items_split[1] == pcName: after you've done the split, instead of the if pcName in items: before you split (assuming the name is in the second column - adjust accordingly if not).
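Putting both suggestions together, here is one possible sketch of the complete loop, written for Python 2 like the code in the question. It assumes the machine name sits in the second tab-separated column, which is a guess since the file format wasn't shown:

macSources = [r"\\Server\path\name\here\dhcp-dump.txt",
              r"\\Server\path\name\here\dhcp-dump-ops.txt"]

with open(r"specific-pcs.txt") as pc_file:
    for line in pc_file:
        pcName = line.strip().upper()
        for target in macSources:
            with open(target) as source:
                for items in source:
                    items_split = items.rstrip("\n").split('\t')
                    #exact comparison on the name column avoids matching '2nd-itroom02'
                    if len(items_split) > 4 and items_split[1] == pcName:
                        ip = items_split[0]
                        mac = items_split[4]
                        mac2 = ':'.join(s.encode('hex') for s in mac.decode('hex')).lower()
                        print mac2
                        print pcName
                        print ip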

Import Multiple Images With Unknown Names

I need to import multiple images (10,000) into Matlab (2013b) from a subdirectory of the predefined Matlab directory.
I don't know the exact names of the images.
I tried this:
file = dir('C:\Users\user\Documents\MATLAB\train');
NF = length(file);
for k = 1 : NF
    img = imread(fullfile('C:\Users\user\Documents\MATLAB\train', file(k).name));
end
But it throws this error even though I ran it with Admin privileges:
Error using imread (line 347)
Can't open file "C:\Users\user\Documents\MATLAB\train\." for reading;
you may not have read permission.
The "dir" command returns the virtual directory elements "." (self directory) and ".." parent, as your error message shows.
A simple fix is to use a more specific dir call, based on your image types, perhaps:
file = dir('C:\Users\user\Documents\MATLAB\train\*.jpg');
Check the output of dir: the first two "files" are . and .., which is similar to the behaviour of the Windows dir command.
file = dir('C:\Users\user\Documents\MATLAB\train');
NF = length(file);
for k = 3 : NF
    img = imread(fullfile('C:\Users\user\Documents\MATLAB\train', file(k).name));
end
In R2013b you would have to do
file = dir('C:\Users\user\Documents\MATLAB\train\*.jpg');
If you have R2014b with the Computer Vision System Toolbox then you can use imageSet:
images = imageSet('C:\Users\user\Documents\MATLAB\train\');
This will create an object containing paths to all image files in the train directory, regardless of format. Then you can read the i-th image like this:
im = read(images, i);

Script working in Python 2 but not in Python 3 (hashlib)

I worked today on a simple script to checksum files with all of the algorithms available in hashlib (md5, sha1, ...). I wrote it and debugged it in Python 2, but when I decided to port it to Python 3 it just wouldn't work. The funny thing is that it works for small files, but not for big files. I thought there was a problem with the way I was buffering the file, but the error message makes me think it is something related to the way I am doing the hexdigest (I think). Here is a copy of my entire script, so feel free to copy it, use it and help me figure out what the problem is with it. The error I get when checksumming a 250 MB file is
"'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte"
I googled it, but can't find anything that fixes it. Also, if you see better ways to optimize it, please let me know. My main goal is to make it work 100% in Python 3. Thanks
#!/usr/local/bin/python33
import hashlib
import argparse

def hashFile(algorithm="md5", filepaths=[], blockSize=4096):
    algorithmType = getattr(hashlib, algorithm.lower())()  #Default: hashlib.md5()
    #Open file and extract data in chunks
    for path in filepaths:
        try:
            with open(path) as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk.encode())
                yield algorithmType.hexdigest()
        except Exception as e:
            print(e)

def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specifies the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5",
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print(hashValue)
    else:
        print("Algorithm {0} is not available in this script".format(algo))

if __name__ == "__main__":
    main()
Here is the code that works in Python 2; I will put it here in case you want to use it without having to modify the one above.
#!/usr/bin/python
import hashlib
import argparse

def hashFile(algorithm="md5", filepaths=[], blockSize=4096):
    '''
    Hashes a file. In order to reduce the amount of memory used by the script, it hashes the file in chunks instead of putting
    the whole file in memory
    '''
    algorithmType = hashlib.new(algorithm)  #getattr(hashlib, algorithm.lower())() #Default: hashlib.md5()
    #Open file and extract data in chunks
    for path in filepaths:
        try:
            with open(path, mode='rb') as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk)
                yield algorithmType.hexdigest()
        except Exception as e:
            print e

def main():
    #DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specifies the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5",
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()
    #Call generator function to yield hash value
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print hashValue
    else:
        print "Algorithm {0} is not available in this script".format(algo)

if __name__ == "__main__":
    main()
I haven't tried it in Python 3, but I get the same error in Python 2.7.5 for binary files (the only difference is that mine is with the ascii codec). Instead of encoding the data chunks, open the file directly in binary mode:
with open(path, 'rb') as f:
    while True:
        dataChunk = f.read(blockSize)
        if not dataChunk:
            break
        algorithmType.update(dataChunk)
    yield algorithmType.hexdigest()
Apart from that, I'd use the method hashlib.new instead of getattr, and hashlib.algorithms_available to check if the argument is valid.
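A minimal sketch of that suggestion, combining binary-mode reads with hashlib.new and the hashlib.algorithms_available check; the file name checksum_me.bin is a placeholder:

import hashlib

def hash_file(algorithm, filepath, block_size=4096):
    #reject names this build of Python does not support
    if algorithm.lower() not in hashlib.algorithms_available:
        raise ValueError("Algorithm {0} is not available".format(algorithm))
    hasher = hashlib.new(algorithm.lower())
    #binary mode means no text decoding is ever attempted
    with open(filepath, 'rb') as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()

print(hash_file("sha256", "checksum_me.bin"))  #placeholder file name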
