Unable to update GeoJSON files in an application using APScheduler on Heroku

I have two GeoJSON files in my application, and I have written a Python job using APScheduler to update them based on changes in the database. The job is configured to run once every 24 hours. Currently I get the confirmation message that the new GeoJSON file was created, but the job crashes immediately after printing that log statement. I am not sure whether we can write to the Heroku container; could that be the reason the job crashes?
What alternatives do I have to make this work? One thing I would try is writing the output of the APScheduler job to Amazon S3. Any suggestions in this regard would be a great help.
I have another job that updates a couple of fields in the DB, and it works fine.
Also, this works fine locally: it replaces the existing GeoJSON files based on the changes in the database.
from apscheduler.schedulers.blocking import BlockingScheduler
from apscheduler.schedulers.background import BackgroundScheduler
import psycopg2
from UnoCPI import sqlfiles
import os
import Project_GEOJSON, Partner_GEOJSON

sched = BlockingScheduler()
sched1 = BackgroundScheduler()

# Initializing the sql files
sql = sqlfiles

# Schedules job_function to be run on the third Friday
# of June, July, August, November and December at 00:00, 01:00, 02:00 and 03:00
# sched.add_job(YOURRUNCTIONNAME, 'cron', month='6-8,11-12', day='3rd fri', hour='0-3')
# sched.scheduled_job('cron', day_of_week='mon-sun', hour=23)
# sched.scheduled_job('cron', month='1,6,8', day='1', hour='0')
# sched.scheduled_job('interval', minutes=5)
# sched1.add_job(generateGEOJSON, 'cron', day_of_week='mon-sun', hour=20)


def generateGEOJSON():
    os.system(Partner_GEOJSON)
    os.system(Project_GEOJSON)


def scheduled_job():
    print('This job is ran every day at 11pm.')
    # print('This job is ran every 1st day of the month of January, June and August at 12 AM.')
    # print('This job is ran every minute.')
    global connection
    global cursor
    try:
        # CAT STAGING
        connection = psycopg2.connect(user="heroku cred",
                                      password="postgres password from heroku",
                                      host="heroku host",
                                      port="5432",
                                      database="heroku db",
                                      sslmode="require")
        if connection:
            print("Postgres SQL Database successful connection")
            cursor = connection.cursor()
            # create a temp table with all projects start and end dates
            cursor.execute(sql.start_and_end_dates_temp_table_sql)
            # fetch all community partners to be set to inactive
            cursor.execute(sql.comm_partners_to_be_set_to_inactive)
            inactive_comm_partners = cursor.fetchall()
            print("Here is the list of all projects to be set to inactive", "\n")
            # loop to print all the data
            for i in inactive_comm_partners:
                print(i)
            # fetch all community partners to be set to active
            cursor.execute(sql.comm_partners_to_be_set_to_active)
            active_comm_partners = cursor.fetchall()
            print("Here is the list of all projects to be set to active", "\n")
            # loop to print all the data
            for i in active_comm_partners:
                print(i)
            # UPDATE PROJECT STATUS TO ACTIVE
            cursor.execute(sql.update_project_to_active_sql)
            # UPDATE PROJECT STATUS TO COMPLETED
            cursor.execute(sql.update_project_to_inactive_sql)
            # UPDATE COMMUNITY PARTNER WHEN TIED TO A INACTIVE PROJECTS ONLY TO FALSE(INACTIVE)
            cursor.execute(sql.update_comm_partner_to_inactive_sql)
            # UPDATE COMMUNITY PARTNER WHEN TIED TO A BOTH ACTIVE
            # and / or INACTIVE or JUST ACTIVE PROJECTS ONLY TO TRUE(ACTIVE)
            cursor.execute(sql.update_comm_partner_to_active_sql)
            # drop all_projects_start_and_end_date temp table
            cursor.execute(sql.drop_temp_table_all_projects_start_and_end_dates_sql)
    except (Exception, psycopg2.Error) as error:
        print("Error while connecting to Postgres SQL", error)
    finally:
        # closing database connection.
        if connection:
            connection.commit()
            cursor.close()
            connection.close()
            print("Postgres SQL connection is closed")


sched.start()
sched1.start()

I am not sure if we can write into the Heroku container
You can, but your changes will be periodically lost. Heroku's filesystem is dyno-local and ephemeral. Every time your dyno restarts, changes made to the filesystem will be lost. This happens frequently (at least once per day) and unpredictably.
One of the things that I would be trying is to write the output of APScheduler to Amazon S3
That's exactly what Heroku recommends doing with generated files and user uploads:
AWS Simple Storage Service, e.g. S3, is a “highly durable and available store” and can be used to reliably store application content such as media files, static assets and user uploads. It allows you to offload your entire storage infrastructure and offers better scalability, reliability, and speed than just storing files on the filesystem.
AWS S3, or similar storage services, are important when architecting applications for scale and are a perfect complement to Heroku's ephemeral filesystem.
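If you go the S3 route, the scheduled job can keep writing the GeoJSON files to the dyno's filesystem (e.g. under /tmp) and then push them to a bucket as its final step. Below is a minimal sketch using boto3; the bucket name, file names, and paths are placeholders, and credentials are assumed to come from Heroku config vars (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY), which boto3 reads from the environment automatically:
import os
import boto3

# Placeholder bucket and file names -- adjust to your own layout.
BUCKET = os.environ.get("GEOJSON_BUCKET", "my-geojson-bucket")
FILES = ["Project.geojson", "Partner.geojson"]


def upload_geojson_to_s3():
    """Upload the freshly generated GeoJSON files to S3."""
    s3 = boto3.client("s3")  # credentials are picked up from the environment
    for name in FILES:
        local_path = os.path.join("/tmp", name)  # written earlier by the generation step
        s3.upload_file(local_path, BUCKET, name,
                       ExtraArgs={"ContentType": "application/geo+json"})
        print("Uploaded {} to s3://{}/{}".format(local_path, BUCKET, name))
The web app then reads the files from S3 (or hands out pre-signed URLs) instead of relying on anything left on the dyno, so restarts no longer lose the output.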

Related

How to use a for loop to download GFS data using Siphon

I'm trying to write a loop to download a subset of GFS data using the Siphon library. The way the code is laid out, I can download one file at a time. How can I download the period from January 2020 to December 2022, forecast hours 003 through 168 of the 00 UTC cycle (as in the code below), so that I don't need to download a single file at a time?
from siphon.catalog import TDSCatalog
from siphon.ncss import NCSS
import numpy as np
import ipywidgets as widgets
from datetime import datetime, timedelta
import xarray as xr
from netCDF4 import num2date
import os
import time

# Download SUBSET GFS - Radiation (6 Hour Average) and PBLH
for i in range(6, 168, 6):
    for day in range(1, 32, 1):
        for month in range(1, 13, 1):
            dir_out = '/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month)
            if not os.path.exists(dir_out):
                os.makedirs(dir_out)
            if not os.path.isfile('/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month) + '/gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.nc'):
                catUrl = "https://rda.ucar.edu/thredds/catalog/files/g/ds084.1/2020/2020" + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "/catalog.xml"
                datasetName = 'gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.grib2'
                time.sleep(0.01)
                catalog = TDSCatalog(catUrl)
                ds = catalog.datasets[datasetName]
                ds.name
                ncss = ds.subset()
                query = ncss.query()
                query.lonlat_box(east=-30, west=-50, south=-20, north=0)
                query.variables(
                    'Downward_Short-Wave_Radiation_Flux_surface_6_Hour_Average',
                    'Planetary_Boundary_Layer_Height_surface').add_lonlat()
                query.accept('netcdf4')
                nc = ncss.get_data(query)
                data = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
                data.to_netcdf('/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month) + '/gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.nc')
The above script does what I need; however, after downloading files for a while, the code dies with the following error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
What could be happening?
Unfortunately, there's no way with THREDDS and NCSS to request based on the model reference time, so there's no way to avoid looping over the files.
I will say that this is a TON of data, so at the very least make sure you're being kind to the publicly available server. Downloading close to 3 years' worth of data is something you should do slowly over time and with care so that you don't impact others' use of this shared, free resource. Setting a wait time of 1/100th of a second is, in my opinion, not doing that. I would wait a minimum of 30 seconds between requests if you're going to be requesting this much data.
I'll also add that you can simplify saving the results of the request to a netCDF file--there's no need to go through xarray since the return from the server is already a netCDF file:
...
query.accept('netcdf4')
with open('/home/william/GFS_Siphon/2020' + '{:0>2}'.format(month) + '/gfs.0p25.2020' + '{:0>2}'.format(month) + '{:0>2}'.format(day) + "00.f" + '{:0>3}'.format(i) + '.nc', 'wb') as outfile:
    data = ncss.get_data_raw(query)
    outfile.write(data)
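On the ConnectionError itself: long scraping runs against a shared THREDDS server will occasionally have connections dropped, so it is worth pacing the requests and retrying a failed one instead of letting the whole loop die. A rough sketch of that pattern (the pause length and retry count are illustrative, not official recommendations):
import time


def fetch_with_retry(ncss, query, out_path, retries=3, pause=30):
    # Request one subset, retrying when the connection is dropped.
    # `pause` both spaces out retries and throttles successive requests.
    for attempt in range(1, retries + 1):
        try:
            data = ncss.get_data_raw(query)
            with open(out_path, 'wb') as outfile:
                outfile.write(data)
            return True
        except Exception as err:  # e.g. ConnectionError / RemoteDisconnected
            print("Attempt {} for {} failed: {}".format(attempt, out_path, err))
            time.sleep(pause)
    return False
Inside the loop you would call fetch_with_retry(ncss, query, filename) in place of the get_data_raw/open block, and still sleep between successful downloads.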

Time-based conditional query

Suppose I have the following Server data model:
Server
-> created_at Timestamp
-> last_ping Timestamp
A "stale" Server is defined as a Server whose last_ping occurred more than one hour ago (i.e., last_ping < Time.now - 1 hour). It should be destroyed if there exists another non-stale server that has come online (created_at) within one hour of the last_ping of the stale server.
How can I find all the Servers that should be destroyed? What would a query look like for this?
Something like…
def clean_stale_servers
  return unless Server.exists?(last_ping: 1.hour.ago..)
  Server.where(last_ping: ...1.hour.ago)
        .destroy_all # .delete_all is faster, use that if possible
end
Then you can call the clean_stale_servers method periodically, e.g. from a cron job.

Limit size tinyproxy logfile

I am setting up tinyproxy on a CentOS 6.5 server in the cloud. I have installed it successfully. However, because of storage limits in the cloud, we want to limit the size of the log file (/var/log/tinyproxy.log). I need to configure the log file so that it only keeps the last hour of logs. For example, if it is now 5:30 PM, the file should only contain data back to 4:30 PM. I have read the tinyproxy documentation and couldn't find a log file limit parameter. I'd be very thankful if somebody gave me a clue how to do that. Thanks.
I don't believe Tinyproxy has a feature for limiting log size, but it would be pretty simple to write a script for this separately.
An example script using Python, running automatically every hour using Linux crontab:
import os
import shutil

# Remove old logs from the storage location
os.remove("/[DESTINATION]")

# Copy current logs to storage
shutil.copyfile("/var/log/tinyproxy.log", "/[DESTINATION]")

# Remove primary logs
os.remove("/var/log/tinyproxy.log")
(This is just an example. You may have to clear tinyproxy.log instead of deleting it. You may even want to set it up so you copy the old logs one more time, so that you don't end up with only 1-2 minutes of logs when you need them.)
And add this to crontab using crontab -e (make sure you have the right permissions to edit the log file!). This will run your script every hour, on the hour:
01 * * * * python /[Python Path]/logLimit.py
I found crontab very useful for this task.
30 * * * * /usr/sbin/logrotate /etc/logrotate.d/tinyproxy
It rotates my log file every hour.
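For reference, the /etc/logrotate.d/tinyproxy file that the cron line invokes would contain something along these lines. This is an illustrative sketch, not the poster's actual configuration; the size threshold is an assumption, and copytruncate keeps tinyproxy writing to the same file without a restart:
/var/log/tinyproxy.log {
    # rotate as soon as the log grows past 10 MB (assumed threshold)
    size 10M
    # keep only one old copy
    rotate 1
    missingok
    notifempty
    # copy then truncate in place, so tinyproxy does not need to be restarted
    copytruncate
}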

Redis sync fails. Redis copy keys and values works

I have two redis instances, both running on the same machine on win64. The version is the one from https://github.com/MSOpenTech/redis with no amendments; the binaries are running as downloaded from GitHub (i.e. version 2.6.12).
I would like to create a slave and sync it to the master. I am doing this on the same machine to ensure it works before creating a slave on a WAN-located machine, which will take around an hour to transfer the data that exists in the primary.
However, I get the following error:
[4100] 15 May 18:54:04.620 * Connecting to MASTER...
[4100] 15 May 18:54:04.620 * MASTER <-> SLAVE sync started
[4100] 15 May 18:54:04.620 * Non blocking connect for SYNC fired the event.
[4100] 15 May 18:54:04.620 * Master replied to PING, replication can continue...
[4100] 15 May 18:54:28.364 * MASTER <-> SLAVE sync: receiving 2147483647 bytes from master
[4100] 15 May 18:55:05.772 * MASTER <-> SLAVE sync: Loading DB in memory
[4100] 15 May 18:55:14.508 # Short read or OOM loading DB. Unrecoverable error, aborting now.
The only way I can sync up is via a mini script, something along the lines of:
import orm.model

if __name__ == "__main__":
    src = orm.model.caching.Redis(**{"host": "source_host", "port": 6379})
    dest = orm.model.caching.Redis(**{"host": "source_host", "port": 7777})
    ks = src.handle.keys()
    for i, k in enumerate(ks):
        if i % 1000 == 0:
            print i, "%2.1f %%" % ((i * 100.0) / len(ks))
        dest.handle.set(k, src.handle.get(k))
where orm.model.caching.* are my middleware cache implementation bits (which for redis is just creating a self.handle instance variable).
Firstly, I am very suspicious of the number of bytes being received, as 2147483647 is 2^31 - 1: a very strange coincidence. Secondly, OOM can mean out of memory, yet I can fire up a second process and sync it via the script, whereas doing the same via redis --slaveof fails with what appears to be an out-of-memory error. Surely this can't be right?
redis-check-dump does not run as this is the windows implementation.
Unfortunately there is sensitive data in the keys I am syncing so I can't offer it to anybody to investigate. Sorry about that.
I am definitely running the 64 bit version as it states this upon startup in the header.
I don't mind syncing via my mini script and then just enabling slave mode, but I don't think that is possible as the moment slaveof is executed, it drops all known data and resyncs from scratch (and then fails).
Any ideas ??
I have also seen this error before, but the latest bits from 2.8.4 seem to have resolved it: https://github.com/MSOpenTech/redis/tree/2.8.4_msopen
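If you do stay with a manual copy while replication is broken, note that the get/set loop in the question only handles string values. A sketch of a more general copy using SCAN plus DUMP/RESTORE, which preserves non-string types and TTLs (this assumes the redis-py client and a server new enough to support SCAN, i.e. 2.8+; hosts and ports are the same placeholders as above):
import redis

# Placeholder connection details, mirroring the mini script above.
src = redis.Redis(host="source_host", port=6379)
dest = redis.Redis(host="source_host", port=7777)

copied = 0
for key in src.scan_iter(count=1000):  # SCAN avoids blocking the server like KEYS does
    payload = src.dump(key)            # serialized value, works for any data type
    if payload is None:                # key expired or was removed mid-scan
        continue
    ttl = src.pttl(key)                # remaining TTL in milliseconds
    if ttl is None or ttl < 0:
        ttl = 0                        # 0 means "no expiry" for RESTORE
    dest.delete(key)                   # RESTORE fails if the key already exists (pre-3.0, no REPLACE)
    dest.restore(key, ttl, payload)
    copied += 1
    if copied % 1000 == 0:
        print("copied %d keys" % copied)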

Ruby IMAP "changes" since last check

I'm working on an IMAP client using Ruby and Rails. I can successfully import messages, mailboxes, and more... However, after the initial import, how can I detect any changes that have occurred since my last sync?
Currently I am storing the UIDs and UID validity values in the database, comparing them, and searching appropriately. This works, but it doesn't detect deleted messages or changes to message flags, etc.
Do I have to pull all messages every time to detect these changes? How do other IMAP clients do it so quickly (e.g. Apple Mail and Postbox)? My script already takes 10+ seconds per account, even with very few emails:
# select ourself as the current mailbox
@imap_connection.examine(self.location)

# grab all new messages and update them in the database
# if the uid's are still valid, we will just fetch the newest UIDs
# otherwise, we need to search when we last synced, which is slower :(
if self.uid_validity.nil? || uid_validity == self.uid_validity
  # for some IMAP servers, if a mailbox is empty, a uid_fetch will fail, so then
  begin
    messages = @imap_connection.uid_fetch(uid_range, ['UID', 'RFC822', 'FLAGS'])
  rescue
    # gmail cries if the folder is empty
    uids = @imap_connection.uid_search(['ALL'])
    messages = @imap_connection.uid_fetch(uids, ['UID', 'RFC822', 'FLAGS']) unless uids.empty?
  end

  messages.each do |imap_message|
    Message.create_from_imap!(imap_message, self.id)
  end unless messages.nil?
else
  query = self.last_synced.nil? ? ['All'] : ['SINCE', Net::IMAP.format_datetime(self.last_synced)]

  @imap_connection.search(query).each do |message_id|
    imap_message = @imap_connection.fetch(message_id, ['RFC822', 'FLAGS', 'UID'])[0]
    # don't mark the messages as read
    # @imap_connection.store(message_id, '-FLAGS', [:Seen])
    Message.create_from_imap!(imap_message, self.id)
  end
end

# now assume all UIDs are valid
self.uid_validity = uid_validity

# now remember that we just fetched all those messages
self.last_synced = Time.now
self.save!
There is an IMAP extension for Quick Flag Changes Resynchronization (RFC-4551, CONDSTORE). With this extension it is possible to search for all messages that have been changed since the last synchronization, based on a per-mailbox modification sequence (MODSEQ) rather than a timestamp. However, as far as I know this extension is not widely supported.
There is an informational RFC that describes how IMAP clients should do synchronization (RFC-4549, section 4.3). The text recommends issuing the following two commands:
tag1 UID FETCH <lastseenuid+1>:* <descriptors>
tag2 UID FETCH 1:<lastseenuid> FLAGS
The first command is used to fetch the required information for all unknown mails (without knowing how many mails there are). The second command is used to synchronize the flags for the already seen mails.
AFAIK this method is widely used. Therefore, many IMAP servers contain optimizations in order to provide this information quickly. Typically, the network bandwidth is the limiting factor.
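For illustration, here is what that two-fetch resynchronization looks like in code. The question uses Ruby's Net::IMAP, but the commands are the same on the wire; this sketch uses Python's imaplib, and the host, credentials, and last-seen UID are made up:
import imaplib

# Hypothetical connection details, purely for illustration.
M = imaplib.IMAP4_SSL("imap.example.com")
M.login("user@example.com", "app-password")
M.select("INBOX", readonly=True)

last_seen_uid = 4500  # highest UID recorded during the previous sync

# 1) Fetch everything newer than what we already know about.
typ, new_msgs = M.uid("FETCH", "%d:*" % (last_seen_uid + 1),
                      "(UID FLAGS BODY.PEEK[HEADER])")

# 2) Re-fetch only the FLAGS of the already-known messages to pick up
#    read/answered/deleted changes; UIDs missing from this response
#    correspond to messages that have been expunged since the last sync.
typ, flag_updates = M.uid("FETCH", "1:%d" % last_seen_uid, "(UID FLAGS)")

M.logout()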
The IMAP protocol is brain dead this way, unfortunately. IDLE really should be able to return this kind of stuff while connected, for example. The FETCH FLAGS suggestion above is the only way to do it.
One thing to be careful of, however, is that UIDs are only valid for a given session per the spec. You should not store them, even if some servers persist them.
