use pandas to retrieve files over FTP - ftp

I'm just getting to grips with pandas (which is awesome) and what I need to do is read in compressed genomics type files from ftp sites into a pandas dataframe.
This is what I tried and got a ton of errors:
from pandas.io.parsers import *
chr1 = 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts/chr_1.txt.gz'
CHR1 = read_csv(chr1, sep='\t', compression = 'gzip', skiprows = 10)
print type(CHR1)
print CHR1.head(10)
Ideally I'd like to do something like this:
from pandas.io.data import *
AAPL = DataReader('AAPL', 'yahoo', start = '01/01/2006')

The interesting part of this question is how to stream a (gz) file from ftp, which is discussed here, where it's claimed that the following will work in Python 3.2 (but won't in 2.x, nor will it be backported), and on my system this is the case:
import urllib.request as ur
from gzip import GzipFile
import pandas as pd

req = ur.Request(chr1)  # gz file on ftp (ensure the URL starts with 'ftp://')
z_f = ur.urlopen(req)
# this line *may* work (but I haven't been able to confirm it)
# df = pd.read_csv(z_f, sep='\t', compression='gzip', skiprows=10)
# this works (*)
f = GzipFile(fileobj=z_f, mode="r")  # decompress the stream on the fly
df = pd.read_csv(f, sep='\t', skiprows=10)
(*) Here f is "file-like", in the sense that we can perform a readline (read it line-by-line), rather than having to download/open the entire file.
Note: I couldn't get the ftplib library to readline; it wasn't clear whether it ought to.
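Putting the pieces together, here is a minimal sketch of the same approach wrapped in a function (the function name is mine; it assumes the NCBI URL above is still valid and that the real data starts after 10 header rows):

import urllib.request as ur
from gzip import GzipFile
import pandas as pd

def read_gz_csv_over_ftp(url, sep='\t', skiprows=0):
    # open the (compressed) stream without downloading the whole file first
    raw = ur.urlopen(ur.Request(url))
    # wrap it so pandas sees decompressed, file-like data
    return pd.read_csv(GzipFile(fileobj=raw, mode="r"), sep=sep, skiprows=skiprows)

chr1 = 'ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts/chr_1.txt.gz'
CHR1 = read_gz_csv_over_ftp(chr1, sep='\t', skiprows=10)
print(CHR1.head(10))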

Related

How to convert YOLO annotations to COCO format (JSON)?

I want to convert my labels in yolo format to coco format
I have tried
https://github.com/Taeyoung96/Yolo-to-COCO-format-converter
And
Pylabel
They all have bugs.
I want to train on Detectron2, but it fails to load the dataset because the generated JSON file is wrong.
Thanks everybody
Could you try with this tool (disclaimer: I'm the author)? It is not (yet) a Python package, so you need to download the repo first. This should resemble something like:
from ObjectDetectionEval import *
from pathlib import Path

def main() -> None:
    path = Path("/path/to/annotations/")  # Where the .txt files are
    names_file = Path("/path/to/classes.names")
    save_file = Path("coco.json")

    annotations = AnnotationSet.from_yolo(path)

    # If you need to change the labels
    # names = Annotation.parse_names_file(names_file)
    # annotations.map_labels(names)

    annotations.save_coco(save_file)

if __name__ == "__main__":
    main()
If you need more control (coordinate format, images location and extension, etc.) you should use the more generic AnnotationSet.from_txt(). If it does not suit your needs you can easily implement your own parser using AnnotationSet.from_folder().
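If you do end up writing your own converter instead, the core of the job is just the coordinate change: YOLO stores (x_center, y_center, width, height) normalised to the image size, whereas COCO expects [x_min, y_min, width, height] in absolute pixels. A minimal sketch (the function name and arguments are mine):

def yolo_to_coco_bbox(xc, yc, w, h, img_w, img_h):
    # scale the normalised YOLO box up to pixels
    box_w = w * img_w
    box_h = h * img_h
    # shift from centre-based to top-left-based coordinates
    x_min = xc * img_w - box_w / 2
    y_min = yc * img_h - box_h / 2
    return [x_min, y_min, box_w, box_h]

The rest (image ids, category ids, the 'images'/'annotations'/'categories' lists) is bookkeeping that the tools above handle for you.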

How can I make this script suitable for converting Excel files with more than one sheet inside?

import pandas as pd
from xlsx2csv import Xlsx2csv
from io import StringIO
def read_excel(path: str, sheet_name: str) -> pd.DataFrame:
    buffer = StringIO()  # in-memory buffer to hold the converted CSV
    Xlsx2csv(path, outputencoding="utf-8", sheet_name=sheet_name).convert(buffer)
    buffer.seek(0)
    df = pd.read_csv(buffer)
    return df
How can I make this script suitable for converting Excel files with more than one sheet inside? It works only for xlsx files with one sheet at the moment...
Do you really need to use the xlsx2csv module? If not, you could try this with pandas.
import pandas as pd

for sheet in ['Sheet1', 'Sheet2']:
    df = pd.read_excel('sample.xlsx', sheet_name=sheet)
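If you don't know the sheet names ahead of time, pandas can also load every sheet in one call: passing sheet_name=None to read_excel returns a dict mapping sheet names to DataFrames. A small sketch (the file name is a placeholder):

import pandas as pd

sheets = pd.read_excel('sample.xlsx', sheet_name=None)  # {sheet name: DataFrame}
for name, df in sheets.items():
    df.to_csv(name + '.csv', index=False)  # one csv per sheet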

Multiple Securities Trading Algorithm

I am very new to Python and I am having trouble executing my algorithmic trading strategy on more than one security at a time. I am currently using these lines of code for the stocks:
data_p = pd.read_csv('AAPL_30m.csv', index_col = 0, parse_dates = True)
data_p.drop(columns = ['Adj Close'])
Does anyone know how I would go about properly adding more securities?
Since no data is provided, I can only give you a rough idea on how this can be done. Change directory to the folder with all your data series in csv files:
import pandas as pd
import os
os.chdir(r'C:\Users\username\Downloads\new')
files = os.listdir()
Assume the files in the folder are
['AAPL.csv',
'AMZN.csv',
'GOOG.csv']
Then start with an empty dictionary d and loop through all the files in the directory, reading each one as a pandas dataframe. Eventually combine all of them into one big dataframe (if you find that more useful):
d = {}
for f in files:
    name = f.split('.')[0]
    df = pd.read_csv(f)
    # ....
    # *** Do your processing ***
    # ....
    d[name] = df.copy()

dff = pd.concat(d)
Since I do not know your format and your index, I assume you can do pd.concat(d); alternatively, you may also try out pd.DataFrame(d).
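For example, pd.concat(d) uses the dictionary keys as the outer level of the resulting MultiIndex, so you can still pull a single security back out afterwards. A rough sketch, assuming the csv files above:

dff = pd.concat(d)       # outer index level holds 'AAPL', 'AMZN', 'GOOG'
aapl = dff.loc['AAPL']   # rows belonging to one security only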

How to read the latest image in a folder using python?

I have to read the latest image in a folder using python. How can I do this?
Another similar way, with some pragmatic (non-foolproof) image validation added:
import os
def get_latest_image(dirpath, valid_extensions=('jpg', 'jpeg', 'png')):
    """
    Get the latest image file in the given directory
    """
    # get filepaths of all files and dirs in the given dir
    valid_files = [os.path.join(dirpath, filename) for filename in os.listdir(dirpath)]

    # filter out directories, no-extension, and wrong extension files
    valid_files = [f for f in valid_files if '.' in f and
                   f.rsplit('.', 1)[-1] in valid_extensions and os.path.isfile(f)]

    if not valid_files:
        raise ValueError("No valid images in %s" % dirpath)

    return max(valid_files, key=os.path.getmtime)
Walk over the filenames, get their modification time and keep track of the latest modification time you found:
import os
import glob

ts = 0
found = None
for file_name in glob.glob('/path/to/your/interesting/directory/*'):
    fts = os.path.getmtime(file_name)
    if fts > ts:
        ts = fts
        found = file_name

print(found)
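If you prefer pathlib, the same idea looks like this (a sketch; the directory and the extension list are placeholders):

from pathlib import Path

image_dir = Path('/path/to/your/interesting/directory')
images = [p for p in image_dir.iterdir()
          if p.is_file() and p.suffix.lower() in ('.jpg', '.jpeg', '.png')]
latest = max(images, key=lambda p: p.stat().st_mtime) if images else None
print(latest)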

Script working in Python2 but not in Python 3 (hashlib)

I worked today on a simple script to checksum files with all available hashlib algorithms (md5, sha1, ...). I wrote and debugged it with Python 2, but when I decided to port it to Python 3 it just won't work. The funny thing is that it works for small files, but not for big files. I thought there was a problem with the way I was buffering the file, but the error message makes me think it is related to the way I am doing the hexdigest (I think). Here is a copy of my entire script, so feel free to copy it, use it, and help me figure out what the problem is. The error I get when checksumming a 250 MB file is
"'utf-8' codec can't decode byte 0xf3 in position 10: invalid continuation byte"
I googled it, but can't find anything that fixes it. Also, if you see better ways to optimize it, please let me know. My main goal is to make it work 100% in Python 3. Thanks
#!/usr/local/bin/python33
import hashlib
import argparse

def hashFile(algorithm="md5", filepaths=[], blockSize=4096):
    algorithmType = getattr(hashlib, algorithm.lower())()  # Default: hashlib.md5()
    # Open file and extract data in chunks
    for path in filepaths:
        try:
            with open(path) as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk.encode())
                yield algorithmType.hexdigest()
        except Exception as e:
            print(e)

def main():
    # DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specified the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5",
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()

    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
Here is the code that works in Python 2; I will just put it here in case you want to use it without having to modify the one above.
#!/usr/bin/python
import hashlib
import argparse

def hashFile(algorithm="md5", filepaths=[], blockSize=4096):
    '''
    Hashes a file. In order to reduce the amount of memory used by the script,
    it hashes the file in chunks instead of putting the whole file in memory
    '''
    algorithmType = hashlib.new(algorithm)  # getattr(hashlib, algorithm.lower())()  # Default: hashlib.md5()
    # Open file and extract data in chunks
    for path in filepaths:
        try:
            with open(path, mode='rb') as f:
                while True:
                    dataChunk = f.read(blockSize)
                    if not dataChunk:
                        break
                    algorithmType.update(dataChunk)
                yield algorithmType.hexdigest()
        except Exception as e:
            print e

def main():
    # DEFINE ARGUMENTS
    parser = argparse.ArgumentParser()
    parser.add_argument('filepaths', nargs="+", help='Specified the path of the file(s) to hash')
    parser.add_argument('-a', '--algorithm', action='store', dest='algorithm', default="md5",
                        help='Specifies what algorithm to use ("md5", "sha1", "sha224", "sha384", "sha512")')
    arguments = parser.parse_args()

    # Call generator function to yield hash value
    algo = arguments.algorithm
    if algo.lower() in ("md5", "sha1", "sha224", "sha384", "sha512"):
        for hashValue in hashFile(algo, arguments.filepaths):
            print hashValue
    else:
        print "Algorithm {0} is not available in this script".format(algo)

if __name__ == "__main__":
    main()
I haven't tried it in Python 3, but I get the same error in Python 2.7.5 for binary files (the only difference is that mine is with the ascii codec). Instead of encoding the data chunks, open the file directly in binary mode:
with open(path, 'rb') as f:
    while True:
        dataChunk = f.read(blockSize)
        if not dataChunk:
            break
        algorithmType.update(dataChunk)
    yield algorithmType.hexdigest()
Apart from that, I'd use the method hashlib.new instead of getattr, and hashlib.algorithms_available to check if the argument is valid.
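Putting both suggestions together, a minimal Python 3 sketch of the hashing function might look like this (hashlib.new and hashlib.algorithms_available are standard hashlib APIs; the structure otherwise follows your script, except that a fresh hash object is created per file so each digest covers only that file):

import hashlib

def hash_files(algorithm, filepaths, block_size=4096):
    if algorithm.lower() not in hashlib.algorithms_available:
        raise ValueError("Algorithm %s is not available" % algorithm)
    for path in filepaths:
        h = hashlib.new(algorithm.lower())
        with open(path, 'rb') as f:  # binary mode: raw bytes, no decoding needed
            while True:
                chunk = f.read(block_size)
                if not chunk:
                    break
                h.update(chunk)
        yield h.hexdigest()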
