No hash (checksum or SHA) provided by the website to verify file integrity after downloading with python urllib.request.urlretrieve

I am downloading a file using urllib.request.urlretrieve; the file I receive is an HDF5 file. Reading the HDF5 file sometimes results in OSError: Can't read data (inflate() failed) due to a corrupted file. At random times, though, the read succeeds.
I don't want to waste a run of my program on a corrupted file. Is there any way to detect a corrupted file early, even without a provided hash string? Thanks.
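In the absence of a server-provided checksum, two cheap checks are (a) comparing the size on disk against the Content-Length header, if the server sends one, and (b) fully reading the HDF5 file once before the real run so that inflate() errors surface up front. A minimal sketch, assuming the file is read with h5py (an assumption; the question doesn't name the library):

```python
import os
import urllib.request

import h5py  # assumed HDF5 library; the question doesn't say which one is used


def download_and_check(url, filename):
    """Download `url` to `filename` and run some cheap integrity checks."""
    local_path, headers = urllib.request.urlretrieve(url, filename)

    # Check 1: compare the size on disk with the Content-Length header,
    # if the server provides one (not all servers do).
    expected = headers.get("Content-Length")
    if expected is not None:
        actual = os.path.getsize(local_path)
        if actual != int(expected):
            raise IOError(f"size mismatch: expected {expected}, got {actual}")

    # Check 2: force-read every dataset once; a truncated or corrupted
    # download usually raises the inflate() error here, before the real run.
    # Note this reads each dataset fully into memory, so it costs time.
    with h5py.File(local_path, "r") as f:
        def _touch(name, obj):
            if isinstance(obj, h5py.Dataset):
                obj[()]
        f.visititems(_touch)

    return local_path
```

If either check fails, the file can simply be re-downloaded before the expensive part of the program starts.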

Related

Garry's mod cache

For starters, I've been looking around for other Lua programmers who work on GMod on the Source Engine and know what they're talking about, so if anyone knows any of this stuff, please tell me. I've got a question regarding my Garry's Mod cache folder. It's in the directory (this is an abbreviated path) Steam/steamapps/GarrysMod/garrysmod/Cache. Inside this folder are two subfolders named "lua" and "workshop" respectively. Inside those are several Lua files filled with a strange type of code (presumably addresses for assets used in-game), and I've noticed that during some sessions the game will generate a lot of these files and at other times won't generate any at all. What are these files? What are they used for? And what dictates when a new cache Lua file is generated?
The cache in Garry's Mod houses clientside/shared Lua files that have previously been downloaded from a server. When you join a server that is using a Lua file identical to one already in your cache, the game skips downloading that specific file and uses the cached copy instead; only Lua files that aren't in your cache are downloaded when joining a server. Cached Lua files are LZMA compressed, which is why you cannot read them directly as Lua when you open them.
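As an illustration, Python's standard lzma module can be pointed at one of these files. This is only a sketch under the assumption that the cache file is a plain LZMA stream with no game-specific header in front of it; in practice there may be extra bytes to skip first, and the path below is a placeholder:

```python
import lzma

# Hypothetical path to one cached Lua file; the real path depends on your install.
cache_file = "Steam/steamapps/GarrysMod/garrysmod/cache/lua/example.lua"

with open(cache_file, "rb") as f:
    compressed = f.read()

# If the file really is a bare LZMA/XZ stream, this yields the original Lua source.
lua_source = lzma.decompress(compressed)
print(lua_source.decode("utf-8", errors="replace"))
```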

Creating a variable zip archive on the fly, estimating file size for content-length

I'm maintaining a site where users can place pictures and other files in a kind of shopping cart. After selecting all the content he wishes to download, the user can check out. Until now, an archive was generated beforehand and the user got an email with a link to the file once generation finished.
I've changed this now by using Web API and a push stream to generate the archive on the fly. My code offers either a zip, a zip64, or a .tar.gz dynamically, depending on the estimated file size and the operating system. For performance reasons, compression is set to best speed ('none' would make the zip archives incompatible with Mac OS, and the gzip library I'm using doesn't offer 'none').
This works great so far, but the user no longer gets a progress bar while downloading the file because I'm not setting the Content-Length. What are the best ways to get around this? I've tried to guess the resulting file size, but either the browsers cancel the download too early or they stop at 99.x% and wait for the missing bytes resulting from the difference between the estimated and the actual file size.
My first thought was to always guess the resulting file size a little too big and fill the rest with zeros?
I've seen many file hosters offer the possibility to select files from a folder and put them into a zip file, and all of them show the correct (?) file size along with a progress bar. Any best practices? Thanks!
These are just some thoughts, maybe you can use them :)
Using Web API/HTTP, the normal way to go about this is that the response contains the length of the file. Since the response is only received after the call has finished, the time spent generating the file will not show a progress bar in any browser, just a Windows wait cursor.
What you could do is use a two-step approach:
1. Generating the zip file: create a duplex-like channel using SignalR to give feedback on the file generation.
2. Downloading the zip file: after the file is generated you know its size, and the browser will show a progress bar while downloading.
It looks like this problem should have been addressed by chunk extensions, but it seems that never got further than a draft.
So I guess you are stuck with either no progress bar or sending the file size up front.
It seems that generating zip archives with an exact, known size up front is trickier than just adding zero padding.
Another option might be to pre-generate the zip file, without storing it, just to determine the size.
But I am just wondering: why not just use tar? It has no compression, so it is easy to determine its final size up front from the sizes of the individual files, and it should also be supported by both OS X and Linux. And Windows should be able to handle uncompressed zip archives, so a similar trick might work there as well.
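For the tar suggestion, the final size really can be computed before writing a single byte, because the format is just 512-byte headers plus zero-padded content. A sketch in Python, under the assumption that every member fits in a single ustar header (names under 100 characters, no PAX extensions) and that the archive is padded to the usual 10240-byte record size, as both GNU tar and Python's tarfile do by default:

```python
import os

BLOCK = 512            # tar works in 512-byte blocks
RECORD = 20 * BLOCK    # archives are padded to a multiple of this by default


def predicted_tar_size(paths):
    """Predict the size of an uncompressed tar archive containing `paths`."""
    total = 0
    for path in paths:
        size = os.path.getsize(path)
        total += BLOCK                                  # one header block per member
        total += (size + BLOCK - 1) // BLOCK * BLOCK    # content, zero-padded to 512
    total += 2 * BLOCK                                  # two zero blocks mark the end
    # pad the whole archive up to the record size
    return (total + RECORD - 1) // RECORD * RECORD
```

The number returned here could then be sent as the Content-Length while the archive itself is streamed, as long as the generator sticks to the same header format and record size assumed above.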

Merging PDFs skipping corrupted PDFs

Currently I am using Ghostscript to merge a list of PDFs which are downloaded. The issue is that if any one of the PDFs is corrupted, it stops the merging of the rest of the PDFs.
Is there any command I can use so that it will skip the corrupted PDFs and merge the others?
I have also tested with pdftk but am facing the same issue.
Or is there any other command-line-based PDF merging utility that I can use for this?
You could try MuPDF; you could also try using MuPDF's 'clean' to repair files before you try merging them. However, if a PDF file is so badly corrupted that Ghostscript can't even repair it, that probably won't work either.
There is no facility to ignore PDF files which are so badly corrupted they can't even be repaired. It's hard to see how this could work in the current scheme, since Ghostscript doesn't 'merge' files anyway: it interprets them, creating a brand-new PDF file from the sequence of graphic operations. When a file is corrupted badly enough to provoke an error, we abort, because we may have already written whatever parts of the file we could, and if we tried to ignore the error and continue, both the interpreter and the output PDF file would be in an indeterminate state.
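Given that, one workaround is to test each PDF on its own before the merge and drop the ones Ghostscript cannot process. A minimal sketch using Python's subprocess module; the gs flags are standard Ghostscript options, but the script as a whole is an illustration, not a built-in Ghostscript feature:

```python
import subprocess


def is_readable_pdf(path):
    """Return True if Ghostscript can interpret `path` without error."""
    result = subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-dQUIET", "-sDEVICE=nullpage", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def merge_pdfs(paths, output):
    """Merge only the PDFs that pass the check above; return the ones kept."""
    good = [p for p in paths if is_readable_pdf(p)]
    subprocess.run(
        ["gs", "-dNOPAUSE", "-dBATCH", "-dQUIET",
         "-sDEVICE=pdfwrite", f"-sOutputFile={output}", *good],
        check=True,
    )
    return good
```

The pre-check interprets every page of every file, so it roughly doubles the processing time, but it keeps one bad download from aborting the whole merge.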

Why doesn't a file transfer fail when the file is deleted?

I need to upload files to my server. I use ASIHTTPRequest for this job. But if I add the upload job to the ASINetworkQueue and immediately delete the source file, the upload job still completes successfully.
I thought the job would fail because I deleted the file. Can somebody explain why it still succeeds, even though the file was deleted?
This is the same problem you find when you delete a large log file while a process is still writing to it, expecting to recover some disk space.
UNIX systems tend to separate the directory entries for a file from the actual data of the file.
It's the data that consumes the space which is why you can have hard links in UNIX, with many directory entries pointing at the same file content.
The actual data for a file is not deleted until the last process closes it, and this is almost certainly what's causing your file transfer to continue.
Deleting the file only removes the directory entry. The data is still there because the file transfer program has it open.
Once it gets closed, the data will be removed as well.
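You can see this behaviour directly on any UNIX-like system: the data stays readable through an open file object even after the directory entry is gone. A small demonstration (in Python, unrelated to ASIHTTPRequest itself, and using a throwaway file name):

```python
import os

# Create a file with some content.
with open("demo.txt", "w") as f:
    f.write("still here after unlink\n")

fh = open("demo.txt", "r")
os.remove("demo.txt")              # removes only the directory entry

print(os.path.exists("demo.txt"))  # False: the name is gone
print(fh.read())                   # "still here after unlink": the data is not

fh.close()                         # last reference closed; now the blocks are freed
```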

How to know in Ruby if a file is completely downloaded

Our issue is that our project downloads files to the file system using wget. We are using Ruby to read the downloaded files for data.
How is it possible to tell whether a file is completely downloaded, so we don't read a half-complete file?
I asked a very similar question and got some good answers... in summary, use some combination of one or more of the following:
download the file to a holding area and finally copy to your input directory;
use a marker file, created once the download completes, to signal readiness;
poll the file twice and see if its size has stopped increasing;
check the file's permissions (some download processes block reads during download);
use another method like in-thread download by the Ruby process.
To quote Martin Cowie, "This is a middleware problem as old as the hills"...
The typical approach to this is to download the file to a temporary location and when finished 'move' it to the final destination for processing.
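A minimal sketch of that temp-then-move pattern (in Python rather than Ruby, but the idea carries over directly; the directory names are placeholders). Because the rename is atomic within one filesystem, the reader either sees no file or the complete file, never a half-written one:

```python
import os
import tempfile
import urllib.request

INCOMING = "/data/incoming"            # placeholder: directory the reader polls
HOLDING = "/data/incoming/.partial"    # placeholder: holding area on the same filesystem


def download_atomically(url, filename):
    os.makedirs(HOLDING, exist_ok=True)

    # Download into the holding area first...
    fd, tmp_path = tempfile.mkstemp(dir=HOLDING, suffix=".part")
    os.close(fd)
    urllib.request.urlretrieve(url, tmp_path)

    # ...then move it into place in one atomic step once it is complete.
    final_path = os.path.join(INCOMING, filename)
    os.rename(tmp_path, final_path)
    return final_path
```

The same structure works with wget writing into the holding directory and a Ruby FileUtils.mv (or mv in a shell script) doing the final move.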
