Python: the requests module does not cache, so why this error? - caching

I have a link to a raw txt file on GitHub, of the form https://raw.githubusercontent.com/XXX/YYY/master/txtfile, where I periodically put a new version number so that a Python script knows it must update itself. The script (Python 3.5) uses an infinite while loop and the requests module:
import requests
from time import sleep

while True:
    try:
        # Fetch the version file; give up after 10 seconds.
        r = requests.get('https://raw.githubusercontent.com/XXX/YYY/master/txtfile', timeout=10)
        required_version = r.text
    except requests.RequestException:
        # On any request error, fall back to 0.
        required_version = 0
    log_in_txt_file(required_version)
    sleep(10)
This script runs under Windows. However, I notice that even though the version has been updated on the server, the log still shows the request getting the previous version! If I fetch the URL from a browser (Chrome) the same thing happens, but after a few F5 refreshes the new version appears (in the browser and in the log); the script, though, still sometimes logs the old version and sometimes the new one! I tried to make the URL vary with:
https://raw.githubusercontent.com/XXX/YYY/master/txtfile?_=time.time
But the problem remains. I'm using an Amazon WorkSpace and I'm pretty sure it's an OS issue. My question: how can I work around this in Python? Any idea?

This is not a client-side caching issue. In effect, GitHub's servers are caching the content, serving you the old version until they themselves have been updated.
Github serves your data from a series of webservers, distributed geographically to ease loading times. These servers don't all update at the same time; until a change has propagated to all servers you'll see old and new content returned on that URL, depending on what machine served you the content for a specific request.
You can't really use GitHub to detect when a new version has been released, not reliably. Instead, generate a unique filename (a GUID, perhaps) that at a future time will contain the new version information. Give that filename out with the current version, and poll it. Releasing a new version then consists of generating the filename for the version after it, and publishing the new version information at the current 'new version' URL. Each version links to the next file, and when it appears you only need to load it once.
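A rough sketch of that scheme in Python (the names and URL here are illustrative, not from the original post): each released version file contains the version info plus the randomly generated name of the next file, which is polled until it appears.

import uuid
import requests
from time import sleep

BASE_URL = 'https://raw.githubusercontent.com/XXX/YYY/master/'

def wait_for_next_version(next_filename):
    # Poll the pre-announced 'next version' file until it is published.
    while True:
        r = requests.get(BASE_URL + next_filename, timeout=10)
        if r.status_code == 200:
            return r.text  # version info for the new release
        sleep(10)

# Releasing version N means: generate a fresh name for version N+1's file
# (e.g. str(uuid.uuid4())), put that name inside version N's file, and never
# rewrite a file once published, so each file only has to be fetched once.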

Related

Google Drive Files/Spreadsheets REST API: How to avoid delay when tracking file changes?

I have an application where I need to keep local copies of some of my user's Google Spreadsheets, and this copy should be in sync with the Google Drive version. I've been testing two methods of tracking changes to a Google Spreadsheet: (1) the file version polling method and (2) the files.watch method.
1) The file version polling method
In this method, whenever I need the most recent version of a Spreadsheet (for instance, when the user wants to download the file from my application), I retrieve the file version from Google using:
GET https://www.googleapis.com/drive/v3/files/FILE_ID?fields=version
If the version is greater than the version I have stored on my end, I know that changes have been made and the file on my end is outdated. So I download the file and update my copy.
The problem is that it takes a while for the file version number to be updated on Google's end. Ideally, after editing a Google Spreadsheet cell, my application should be able to detect this change within less than 10 seconds. However, after editing a cell and seeing the Saved to Drive confirmation at the top, sometimes it takes seconds, other times it takes minutes before the version number gets updated, so it is very inconsistent.
Aside from the version number, I've also tried polling the modifiedTime value to see if it changed sooner, but it didn't. So I tried another method.
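For reference, a minimal sketch of this polling check using google-api-python-client (the client library is my assumption; the post only shows raw REST calls):

from googleapiclient.discovery import build

def remote_version(creds, file_id):
    drive = build('drive', 'v3', credentials=creds)
    # files.get with a fields mask returns only the version number.
    meta = drive.files().get(fileId=file_id, fields='version').execute()
    return int(meta['version'])

# A remote version greater than the locally stored one means the local copy
# is outdated and must be re-downloaded.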
2) The files.watch method
In this method, I keep track of the file changes by registering a webhook to receive change notifications from Google:
POST https://www.googleapis.com/drive/v3/files/FILE_ID/watch
Whenever I receive a change notification, I know that I need to update my local copy.
Unfortunately, the change notifications also don't happen as quick as I would like. It also has very inconsistent delays: sometimes taking a few seconds, sometimes taking more than a minute.
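Registering such a watch channel looks roughly like this (a sketch under the same client-library assumption; it needs a publicly reachable HTTPS endpoint to receive the notifications):

import uuid
from googleapiclient.discovery import build

def watch_file(creds, file_id, webhook_url):
    drive = build('drive', 'v3', credentials=creds)
    channel = {
        'id': str(uuid.uuid4()),   # a unique channel id
        'type': 'web_hook',
        'address': webhook_url,    # your HTTPS notification endpoint
    }
    return drive.files().watch(fileId=file_id, body=channel).execute()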
UPDATE: (3) The 'always export' / 'never cache' method?
To complicate matters, it seems that even if I ignore my local copy and always try to download the latest version of the file directly from Google, the downloaded file will not necessarily be the absolute latest version that the user sees on the Spreadsheets editor. I tried that using
GET https://www.googleapis.com/drive/v3/files/FILE_ID/export?mimeType=application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
but it would often return the outdated version, and sometimes would only return the latest version after a few minutes.
Is there something else that I can try? The above methods use the Google Drive Files API, but if there is a way to detect changes sooner using the Google Spreadsheets API, I would like to know.
1. How to detect file changes as soon as possible?
After a file change (in my case, a change in a Google Spreadsheet), the version does not get updated immediately, and when you watch for file changes with the files.watch API you will also not get a notification immediately.
What does get updated immediately is the list of revisions of the file, which can be retrieved with the revisions.list API:
GET https://www.googleapis.com/drive/v3/files/FILE_ID/revisions
This returns a list of all revisions of the file FILE_ID. The last item in the list is the most recent revision (the "head" revision). In order to know whether a file has changed, I retrieve this list. If the id of the head revision is different from the id stored on my end, it means that my local copy is outdated, so I have to update the file and its stored revision id.
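A minimal sketch of that check with google-api-python-client (again, the client library is my assumption, not shown in the post):

from googleapiclient.discovery import build

def head_revision_id(creds, file_id):
    drive = build('drive', 'v3', credentials=creds)
    revisions = drive.revisions().list(fileId=file_id).execute()
    # The last entry is the most recent ("head") revision.
    return revisions['revisions'][-1]['id']

# Compare against the revision id stored locally; a different id means the
# local copy is outdated and should be refreshed.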
However, if you call files.export, the file version returned will not necessarily be the absolute most recent version (i.e., what the user currently sees in the Spreadsheets editor in their browser). And in the case of Google editor documents, it is not possible to retrieve the most recent revision using the revisions.get API. What can you do then?
2. How to retrieve the most recent revision of a Google Sheet?
(I bet it works for other Google editor documents as well).
Before calling files.export, you have to "touch" the file using the files.update API, updating its modifiedTime:
PATCH https://www.googleapis.com/drive/v3/files/fileId
{
"modifiedTime": "TIMESTAMP"
}
Where TIMESTAMP is a date with the format 2022-04-16T22:00:00Z.
For some reason, touching the file like this "forces" Google to return the head revision of the file the next time you call files.export:
GET https://www.googleapis.com/drive/v3/files/FILE_ID/export?mimeType=MIMETYPE
In my case, MIMETYPE is application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.
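Putting the two calls together, a sketch of the 'touch then export' sequence (same client-library assumption as above; the helper name is mine):

import datetime
import io
from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload

XLSX = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'

def export_head_revision(creds, file_id, out_path):
    drive = build('drive', 'v3', credentials=creds)
    # "Touch" the file so the next export returns the head revision.
    now = datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')
    drive.files().update(fileId=file_id, body={'modifiedTime': now}).execute()
    # Export the spreadsheet as an .xlsx file and stream it to disk.
    request = drive.files().export_media(fileId=file_id, mimeType=XLSX)
    with io.FileIO(out_path, 'wb') as fh:
        downloader = MediaIoBaseDownload(fh, request)
        done = False
        while not done:
            _, done = downloader.next_chunk()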
That's it. So far, this has been working for me.

Google API service account authentication can't find JSON credentials file

I can't get the Google API to find my service account's credentials. I downloaded the necessary JSON file with the right name into the proper place, and I'm using Python code straight off the API documentation:
import gspread
gc = gspread.service_account()
sh = gc.open("Example spreadsheet (I'll replace this with my actual sheet name later)")
print(sh.sheet1.get('A1'))
The code stops at gc = gspread.service_account() with a FileNotFoundError. I discovered via an error message that this is because it's looking at a completely wrong file path (I think it thinks I'm on a Mac when I'm actually on a Windows PC??). Overriding the file name, i.e.
gc = gspread.service_account(filename="insert\actual\path\here.json"),
does not work either, which is the mystifying part. I copied that path straight out of my file explorer, doubled the backslashes so Python doesn't treat them as escape sequences (that happened once), and tried every modification of the file path I could think of (%APPDATA%\gspread\service_account.json instead of the whole thing, etc.) - what could be going wrong?
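For what it's worth, on a local Windows setup the key file is usually pointed at either with a raw string or via gspread's default lookup; a small sketch (the path below is a placeholder, not from the original post):

import gspread

# Option 1: raw string, so the backslashes are not treated as escape sequences.
gc = gspread.service_account(filename=r"C:\path\to\service_account.json")

# Option 2: rely on the default location mentioned above
# (%APPDATA%\gspread\service_account.json on Windows) and pass no argument.
gc = gspread.service_account()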
Edit: #mods, feel free to close the question! I found the issue, which is that I was using the Repl.it online coding environment instead of a local one. I ported everything over to IDLE and it worked fine. I strongly suspect Repl.it just couldn't access my local files at all (I also tried it on Repl.it with a random screenshot in a different place, and it threw the same error).

Issues with Swashbuckle

I have a WebAPI service, written in ASP.NET (not Core), for which I am trying to generate documentation, in order to allow other devs to use it. I found Swashbuckle, and installed it. Then, since I also use OData for some of my services, I added Swashbuckle.OData. Then, I modified the CustomProvider setting in SwaggerConfig to use the ODataSwaggerProvider. I also set ResolveConflictingActions(apiDescriptions => apiDescriptions.First()) because I had a few Actions with the same URL path, differing only by query string (I'll need to address that later). So far so good.
Then, I tested it. I started my web app, then added "/swagger/" to the end of the URL. I got a message stating that it was loading the resource info. However, after several minutes, I got a browser error debug popup stating "Error: Not enough storage is available to complete this operation." It asks if I want to debug, and if I do, it takes me to the debugger in IE (the browser I'm using). The only code in the stack is either from jquery-1.8.0.min.js or swagger-ui.min.js (this part confuses me, as there is no "swagger-ui.min.js" file in my project; I'm assuming it's embedded in the dll). No part of the stack trace leads back up to my code, and all the code there is minified, so it's very difficult to debug.
However, I do know that it is at least partially working, as three of the controllers do show up in the resulting page after you close the error popup. You can navigate through them, and all the GETs, POSTs, PUTs, and DELETEs seem to be there, and you can test them.
Is it the case that whenever you navigate to the "/swagger/" url, Swagger hits all the URLs in the service, in order to generate the documentation? I'm wondering if maybe it is hitting an action that is taking a particularly long time to run, or possibly its generated documentation is taking too much disk space (I have plenty of space on my disk, but maybe it is referring to RAM?).
Anyway, even if that were not an issue, how can I get it to generate something, some kind of document file, that I can send off to someone? I see no new files added to my folders, so it would seem that it re-does the whole process every time you navigate to the swagger URL.
When I tried the Chrome browser, I no longer had the issue (I was using IE11 before). Not sure what the problem was, but this was the workaround.

Selenium - Retaining firefox cache and history files

Is there a way to disable Selenium creating a temporary directory and profile when it starts Firefox?
I fully understand why Selenium does things as it does. I am just experimenting with it as I try to create Firefox caches and histories for computer forensic training purposes. To this end, I have set up a clean virtual machine with a pristine user account. I can now run a Python script with the Selenium API to start Firefox, visit a couple of web pages, and shut down.
The problem is, it leaves nothing behind. This is of course excellent if you are using Selenium for its original purpose, but it thwarts my work by deleting everything.
So is there a way to disable the temporary profile creation and just start Firefox as it would start if run by the user without Selenium?
Addition 5:34PM:
The Java API documentation mentions a system property webdriver.reap_profile that should prevent deletion of temporary files. I went to the source of the problem, and it appears this option does not exist in the Python WebDriver class:
def quit(self):
    """Quits the driver and close every associated window."""
    try:
        RemoteWebDriver.quit(self)
    except (http_client.BadStatusLine, socket.error):
        # Happens if Firefox shutsdown before we've read the response from
        # the socket.
        pass
    self.binary.kill()
    try:
        shutil.rmtree(self.profile.path)
        if self.profile.tempfolder is not None:
            shutil.rmtree(self.profile.tempfolder)
    except Exception as e:
        print(str(e))
Deletion of files upon quit appears to be unconditional. I will solve this in my case by injecting
return self.profile.path
in /usr/local/lib/python2.7/dist-packages/selenium/webdriver/firefox/webdriver.py, just after self.binary.kill(). This probably breaks all sorts of things and is a horrible thing to do, but it appears to do exactly what I want. The return value tells the calling function the random name of the temporary directory under /tmp. Not elegant, but it appears to work.
If a more elegant solution exists, I would be happy to flag that as the correct one.
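One less invasive alternative (my own sketch, not from the original posts, written against the same Selenium version whose quit() is quoted above): subclass the Firefox driver and override quit() so the profile directory is simply never removed, instead of patching the installed library files.

from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver as RemoteWebDriver

class ForensicFirefox(webdriver.Firefox):
    """Firefox driver that leaves its temporary profile on disk after quit()."""

    def quit(self):
        # Remember where the temporary profile lives before shutting down.
        self.kept_profile_path = self.profile.path
        try:
            # End the session via the remote protocol only...
            RemoteWebDriver.quit(self)
        except Exception:
            pass
        # ...and kill the browser process, skipping the shutil.rmtree calls
        # that the stock driver performs on the profile directory.
        self.binary.kill()

driver = ForensicFirefox()
driver.get('http://example.com')
driver.quit()
print(driver.kept_profile_path)  # a random directory under /tmp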

Play! Framework 2.1.3 pdf problems

I am working on a school project in which we have designed a web application that takes in a lot of user info and creates a PDF, then should display that PDF to the user so they can print or save it. We are using Play! Framework 2.1.3 as our framework and server, and Java on the server side. I create the PDF with Apache's PDFBox library. Everything works as it should in development mode, i.e. launching on localhost with Play's run command. The issue is that when we put it up on the server and launch with Play's start command, it seems to take a snapshot of the directory (or at least the assets/public folder), which is where I am housing the output.pdf file(s) (I have attempted to move the file elsewhere, but that still results in a 404 error). Initially I believed this to be something with the Linux machine we were deploying to that was creating a caching problem, and I have tried many of the tricks to stop the browser from caching the PDF,
like using JavaScript to append a timestamp to the filename,
or using the cache-control directive from the Play! documentation,
"assets.cache./public/stylesheets/output.pdf"="max-age=0".
Then I tried to save the PDF under a different filename each time, pass back that filename, and reference it directly through the file structure in the HTML,
which also works fine with the run command but not with start.
Finally I came to the conclusion that when the start command is issued, Play bundles up the files, so only the files that are present at that moment can be seen.
I read the documentation here:
http://www.playframework.com/documentation/2.1.x/Production
and noticed this part:
When you run the start command, Play forks a new JVM and runs the
default Netty HTTP server. The standard output stream is redirected to
the Play console, so you can monitor its status.
So it looks like the fact that it forks a new JVM is what is causing my pain.
So my question really is: can this be worked around in some way, so that the web app can create and display a PDF form? (If I cannot get this to work, the only solution I can see is to simulate the form with HTML and fill it out from there, which I really think is a bad way to do this.)
This seems like something that should have a solution, but I cannot find or come up with one. Please help.
I have looked here:
http://www.playframework.com/documentation/2.1.x/JavaStream
The answer may be in there, but I'm not getting it to work; I'm still pretty new to the Play! Framework.
You are trying to deliver the generated PDF file to the user by placing it in the assets directory, and putting a link to it in the HTML. This works in development mode because Play finds the assets in the directory. It won't work in production because the project is wrapped up into a jar file when you do play dist, and the contents of the jar file can't be modified by the Play application. (In dev mode, Play has a classpath entry for the directory. In production, the classpath points to the jar file).
You are on the right lines with JavaStream. The way forward is:
Generate the PDF somewhere in your local filesystem (I recommend the temp directory).
Write a new Action in your Application object that opens the file you generated, and serves it instead of a web page.
Check out the Play docs for serving files. This approach also has the advantage that you can specify the filename that the user sees. There is an overloaded function Controller.ok(File file, String filename) for doing this. (When you generate the file, you should give it a unique name, otherwise each request will overwrite the file from a previous request. But you don't want the user to see the unique name).
