How to use a for loop to download GFS data from a URL using Siphon - for-loop

I'm trying to loop over downloads of a subset of GFS data using the siphon library. With the code laid out as below I can download one file at a time without problems. How can I download the period January 2020 to December 2022, for forecast hours 003 through 168, so that I don't need to request a single file at a time?
from siphon.catalog import TDSCatalog
from siphon.ncss import NCSS
import numpy as np
import ipywidgets as widgets
from datetime import datetime, timedelta
import xarray as xr
from netCDF4 import num2date
import os
import time

# Download SUBSET GFS - Radiation (6 Hour Average) and PBLH
for i in range(6, 168, 6):
    for day in range(1, 32, 1):
        for month in range(1, 13, 1):
            dir_out = '/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month)
            if not os.path.exists(dir_out):
                os.makedirs(dir_out)
            if not os.path.isfile('/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month) + '/gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.nc'):
                catUrl = "https://rda.ucar.edu/thredds/catalog/files/g/ds084.1/2020/2020" + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "/catalog.xml"
                datasetName = 'gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.grib2'
                time.sleep(0.01)
                catalog = TDSCatalog(catUrl)
                ds = catalog.datasets[datasetName]
                ncss = ds.subset()
                query = ncss.query()
                query.lonlat_box(east=-30, west=-50, south=-20, north=0)
                query.variables(
                    'Downward_Short-Wave_Radiation_Flux_surface_6_Hour_Average',
                    'Planetary_Boundary_Layer_Height_surface').add_lonlat()
                query.accept('netcdf4')
                nc = ncss.get_data(query)
                data = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
                data.to_netcdf('/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month) + '/gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.nc')
The script above does what I need, but after downloading files for a while it dies with the following error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
What could be happening?

Unfortunately, there's no way with THREDDS and NCSS to request based on the model reference time, so there's no way to avoid looping over the files.
I will say that this is a TON of data, so at the very least make sure you're being kind to the publicly available server. Downloading close to 3 years' worth of data is something you should do slowly over time and with care so that you don't impact others' use of this shared, free resource. Setting a wait time of 1/100th of a second is, in my opinion, not doing that. I would wait a minimum of 30 seconds between requests if you're going to be requesting this much data.
I'll also add that you can simplify saving the results of the request to a netCDF file--there's no need to go through xarray since the return from the server is already a netCDF file:
...
query.accept('netcdf4')
with open('/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month) + '/gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.nc', 'wb') as outfile:
    data = ncss.get_data_raw(query)
    outfile.write(data)
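If you keep the loop running for hours against the public server, occasional dropped connections like the one above are to be expected. As a rough sketch (the helper name and the retries/wait values are my own, not from the original code), you could back off and retry each request instead of letting the script die:

import time
from requests.exceptions import ConnectionError

def fetch_with_retries(ncss, query, retries=3, wait=30):
    # Hypothetical helper: retry a subset request a few times, pausing
    # between attempts so the shared THREDDS server isn't hammered.
    for attempt in range(retries):
        try:
            return ncss.get_data_raw(query)
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(wait)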

Related

dask cache delayed function example

A simple dask cache example: the cache does not work as expected. Assume we have a list of data and a series of delayed functions, and expect that when a function encounters the same input it caches/memoizes the result according to the cachey score.
This example demonstrates that this is not the case.
import time
import numpy as np
import dask
from dask.cache import Cache
from dask.diagnostics import visualize
from dask.diagnostics import Profiler, ResourceProfiler, CacheProfiler

def slow_func(x):
    time.sleep(5)
    return x + 1

output = []
data = np.ones((100))
for x in data:
    a = dask.delayed(slow_func)(x)
    output.append(a)
total = dask.delayed(sum)(output)

cache = Cache(2e9)
cache.register()

with Profiler() as prof, ResourceProfiler(dt=0.25) as rprof, CacheProfiler() as cprof:
    total.compute()
visualize([prof, rprof, cprof])
[plot: CacheProfiler output from visualize()]
After the initial parallel execution of the function, I would expect the next call with the same value to use a cached version, but it obviously does not. dask_key_name is for designating the same output, whereas I want to evaluate this function for a variety of inputs and use the cached result whenever the same input appears again. Thanks to the 5-second delay we can easily tell whether this is happening: once the first value is cached, the whole computation should finish in roughly 5 seconds. Instead, this example runs every single delayed call for the full 5 seconds. I am able to create a memoized version using the cachey library directly, but this should work using dask.cache.
In dask.delayed you may need to specify the pure=True keyword.
You can verify that this worked because all of your dask delayed values will have the same key.
You don't need to use Cache for this if they are all in the same dask.compute call.
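A minimal sketch of that change, reusing slow_func and data from the question (pure=True lets dask hash the inputs, so identical inputs share a key):

import dask

output = [dask.delayed(slow_func, pure=True)(x) for x in data]
total = dask.delayed(sum)(output)

# identical inputs now produce identical keys, so slow_func runs once per distinct x
assert output[0].key == output[1].key
total.compute()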

Unable to update GeoJSON files in an application using APScheduler on Heroku

I have 2 GeoJSON files in my application. I have written a Python job using APScheduler to update the 2 GeoJSON files based on changes in the database. The job is configured to run once every 24 hours. Currently, I get the confirmation message that the new GeoJSON file was created, but the job crashes immediately after printing this log statement. I am not sure whether we can write to the Heroku container; is that the reason the job crashes?
What alternatives do I have to make it work? One thing I would try is to write the output of the APScheduler job to Amazon S3. Any suggestions in this regard would be of great help.
I have another job that updates a couple of fields in the DB, and it works fine.
Also, this works fine locally. It replaces the existing GeoJSON based on the changes in the database.
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.schedulers.background import BackgroundScheduler
import psycopg2
from UnoCPI import sqlfiles
import os
import Project_GEOJSON, Partner_GEOJSON

sched = BlockingScheduler()
sched1 = BackgroundScheduler()

# Initializing the sql files
sql = sqlfiles

# Schedules job_function to be run on the third Friday
# of June, July, August, November and December at 00:00, 01:00, 02:00 and 03:00
# sched.add_job(YOURRUNCTIONNAME, 'cron', month='6-8,11-12', day='3rd fri', hour='0-3')
# sched.scheduled_job('cron', day_of_week='mon-sun', hour=23)
# sched.scheduled_job('cron', month='1,6,8', day='1', hour='0')
# sched.scheduled_job('interval', minutes=5)
# sched1.add_job(generateGEOJSON, 'cron', day_of_week='mon-sun', hour=20)

def generateGEOJSON():
    os.system(Partner_GEOJSON)
    os.system(Project_GEOJSON)

def scheduled_job():
    print('This job is ran every day at 11pm.')
    # print('This job is ran every 1st day of the month of January, June and August at 12 AM.')
    # print('This job is ran every minute.')
    global connection
    global cursor
    try:
        # CAT STAGING
        connection = psycopg2.connect(user="heroku cred",
                                      password="postgres password from heroku",
                                      host="heroku host",
                                      port="5432",
                                      database="heroku db",
                                      sslmode="require")
        if connection:
            print("Postgres SQL Database successful connection")
            cursor = connection.cursor()
        # create a temp table with all projects start and end dates
        cursor.execute(sql.start_and_end_dates_temp_table_sql)
        # fetch all community partners to be set to inactive
        cursor.execute(sql.comm_partners_to_be_set_to_inactive)
        inactive_comm_partners = cursor.fetchall()
        print("Here is the list of all projects to be set to inactive", "\n")
        # loop to print all the data
        for i in inactive_comm_partners:
            print(i)
        # fetch all community partners to be set to active
        cursor.execute(sql.comm_partners_to_be_set_to_active)
        active_comm_partners = cursor.fetchall()
        print("Here is the list of all projects to be set to active", "\n")
        # loop to print all the data
        for i in active_comm_partners:
            print(i)
        # UPDATE PROJECT STATUS TO ACTIVE
        cursor.execute(sql.update_project_to_active_sql)
        # UPDATE PROJECT STATUS TO COMPLETED
        cursor.execute(sql.update_project_to_inactive_sql)
        # UPDATE COMMUNITY PARTNER WHEN TIED TO A INACTIVE PROJECTS ONLY TO FALSE(INACTIVE)
        cursor.execute(sql.update_comm_partner_to_inactive_sql)
        # UPDATE COMMUNITY PARTNER WHEN TIED TO A BOTH ACTIVE
        # and / or INACTIVE or JUST ACTIVE PROJECTS ONLY TO TRUE(ACTIVE)
        cursor.execute(sql.update_comm_partner_to_active_sql)
        # drop all_projects_start_and_end_date temp table
        cursor.execute(sql.drop_temp_table_all_projects_start_and_end_dates_sql)
    except (Exception, psycopg2.Error) as error:
        print("Error while connecting to Postgres SQL", error)
    finally:
        # closing database connection
        if connection:
            connection.commit()
            cursor.close()
            connection.close()
            print("Postgres SQL connection is closed")

sched.start()
sched1.start()
I am not sure if we can write into the Heroku container
You can, but your changes will be periodically lost. Heroku's filesystem is dyno-local and ephemeral. Every time your dyno restarts, changes made to the filesystem will be lost. This happens frequently (at least once per day) and unpredictably.
One of the things that I would be trying is to write the output of APScheduler to Amazon S3
That's exactly what Heroku recommends doing with generated files and user uploads:
AWS Simple Storage Service, e.g. S3, is a “highly durable and available store” and can be used to reliably store application content such as media files, static assets and user uploads. It allows you to offload your entire storage infrastructure and offers better scalability, reliability, and speed than just storing files on the filesystem.
AWS S3, or similar storage services, are important when architecting applications for scale and are a perfect complement to Heroku's ephemeral filesystem.
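A rough sketch of that approach (the bucket name and key are placeholders, and it assumes boto3 is installed with AWS credentials supplied via Heroku config vars): write the generated GeoJSON to a temporary file and push it to S3 rather than relying on the dyno's filesystem.

import boto3

def upload_geojson_to_s3(local_path, bucket, key):
    # Illustrative helper: copy the freshly generated GeoJSON to S3 so it
    # survives dyno restarts; credentials are read from the environment.
    s3 = boto3.client('s3')
    s3.upload_file(local_path, bucket, key)

# e.g. at the end of generateGEOJSON():
# upload_geojson_to_s3('/tmp/projects.geojson', 'my-app-bucket', 'geojson/projects.geojson')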

MacOS: Why does Multiprocessing Queue.put stop working?

I have a pandas DataFrame with about 45,000 rows similar to:
from numpy import random
from pandas import DataFrame
df = DataFrame(random.rand(45000, 200))
I am trying to break up all the rows into a multiprocessing Queue like this:
from multiprocessing import Queue

rows = [idx_and_row[1] for idx_and_row in df.iterrows()]

my_queue = Queue(maxsize=0)
for idx, r in enumerate(rows):
    # print(idx)
    my_queue.put(r)
But when I run it, only about 37,000 items get put into my_queue, and then the program raises the following error:
raise Full
queue.Full
What is happening and how can I fix it?
The multiprocessing.Queue is designed for inter-process communication. It is not intended for storing large amounts of data. For that purpose, I'd suggest using Redis or Memcached.
Usually, the queue's maximum size is platform dependent, even if you set it to 0. You have no easy way to work around that.
It seems that on Windows the maximum number of objects in a multiprocessing.Queue is unbounded, but on Linux and macOS the maximum size is 32767, which is 2^15 - 1; here is the significance of that number.
I solved the problem by making an empty Queue object and passing it to all the processes I wanted to pass it to, plus one additional process. The additional process is responsible for filling the queue with 10,000 rows and checking it every few seconds to see whether the queue has been emptied. When it's empty, another 10,000 rows are added. This way, all 45,000 rows get processed.
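A minimal sketch of that feeder pattern (the names, chunk size, and polling interval are illustrative; the worker body stands in for whatever actually consumes the rows):

import time
from multiprocessing import Process, Queue

def feeder(rows, q, chunk=10000):
    # Top the queue up in chunks so it never exceeds the ~32767 item limit.
    for start in range(0, len(rows), chunk):
        for r in rows[start:start + chunk]:
            q.put(r)
        while not q.empty():      # wait for the worker to drain this chunk
            time.sleep(2)
    q.put(None)                   # sentinel: nothing left to process

def worker(q):
    while True:
        item = q.get()
        if item is None:
            break
        # ... process one row here ...

if __name__ == '__main__':
    rows = list(range(45000))     # stand-in for the DataFrame rows
    q = Queue()
    procs = [Process(target=worker, args=(q,)), Process(target=feeder, args=(rows, q))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()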

How to speed up basic pyspark statements

As a new Spark/PySpark user, I have a script running on an AWS t2.small EC2 instance in local mode (for testing purposes only).
For example:
from __future__ import print_function
from pyspark.ml.classification import NaiveBayesModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.sql import SparkSession
import ritc  # (my library)

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("NaiveBayesExample")\
        .getOrCreate()
    ...
    request_dataframe = spark.createDataFrame(ritc.request_parameters, ["features"])
    model = NaiveBayesModel.load(ritc.model_path)
    ...
    prediction = model.transform(ritc.request_dataframe)
    prediction.createOrReplaceTempView("result")
    df = spark.sql("SELECT prediction FROM result")
    p = map(lambda row: row.asDict(), df.collect())
    ...
I have left out code so as to focus on my question, relating to the speed of basic spark statements such as spark = SparkSession...
Using the datetime library (not shown above), I have timings for the three biggest 'culprits':
'spark = SparkSession...' -- 3.7 secs
'spark.createDataFrame()' -- 2.6 secs
'NaiveBayesModel.load()' -- 3.4 secs
Why are these times so long??
To give a little background, I would like to provide the capability to expose scripts such as the above as REST services.
In a supervised context:
- service #1: train a model and save the model in the filesystem
- service #2: load the model from the filesystem and get a prediction for a single instance
(Note: the #2 REST requests would run at different, unanticipated (random) times. The general pattern would be:
-> once: train the model - expecting a long turnaround time
-> multiple times: request a prediction for a single instance - expecting a turnaround time in milliseconds, e.g. < 400 ms.)
Is there a flaw in my thinking? Can I expect to increase performance dramatically to achieve this goal of sub-second turnaround time?
In most every article/video/discussion on spark performance that I have come across, the emphasis has been on 'heavy' tasks. The 'train model' task above may indeed be a 'heavy' one - I expect this will be the case when run in production. But the 'request a prediction for a single instance' needs to be responsive.
Can anyone help?
Thanks in anticipation.
Colin Goldberg
So Apache Spark is designed to be used in this way. You might want to look at Spark Streaming if your goal is to handle streaming input data for predictions. You may also want to look at other options for serving Spark models, such as PMML or MLeap.
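If sub-second predictions are the goal, one option worth considering (sketched below with a placeholder model path; this is not something the code above does yet) is to create the SparkSession and load the model once at service start-up, so only the transform runs in the request path:

from pyspark.sql import SparkSession
from pyspark.ml.classification import NaiveBayesModel

# Paid once at start-up (these are the slow steps measured above):
spark = SparkSession.builder.appName("NaiveBayesService").getOrCreate()
model = NaiveBayesModel.load("/path/to/saved/model")   # placeholder path

def predict(features):
    # Called per request: build a one-row DataFrame and score it.
    request_df = spark.createDataFrame([(features,)], ["features"])
    return model.transform(request_df).select("prediction").collect()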

python 3 requests_futures requests to same server in different processes

I am looking into parallelizing URL requests to a single web server in Python for the first time.
I would like to use requests_futures for this task, since it seems one can really split the work across several cores with the ProcessPoolExecutor.
The example code from the module documentation is:
from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
future_one = session.get('http://httpbin.org/get')
future_two = session.get('http://httpbin.org/get?foo=bar')
response_one = future_one.result()
print('response one status: {0}'.format(response_one.status_code))
print(response_one.content)
response_two = future_two.result()
print('response two status: {0}'.format(response_two.status_code))
print(response_two.content)
The above code works for me; however, I need some help customizing it to my needs.
I want to query the same server, say, 50 times (e.g. 50 different httpbin.org/get?... requests). What would be a good way to split these up onto different futures, other than just defining future_one, ..._two and so on?
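Something like the following is what I have in mind (the query parameters are just placeholders):

from concurrent.futures import ThreadPoolExecutor
from requests_futures.sessions import FuturesSession

session = FuturesSession(executor=ThreadPoolExecutor(max_workers=2))
# build all 50 futures up front instead of naming them one by one
futures = [session.get('http://httpbin.org/get?foo={}'.format(i)) for i in range(50)]
for future in futures:
    response = future.result()
    print('status: {0}'.format(response.status_code))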
I am thinking about using different processes. According to the module documentation, it should be just a change in the first three lines of the above code:
from concurrent.futures import ProcessPoolExecutor
from requests_futures.sessions import FuturesSession
session = FuturesSession(executor=ProcessPoolExecutor(max_workers=2))
If I execute this I get the following error:
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
How do I get this running properly?