how to determine if a file is completely downloaded using kqueue? - macos

I want to implement a function which monitor a directory and perform some action when a new file is downloaded from the Internet, but found it difficult to determine if the file is completely downloaded, is there a way to do that?

Usually tools that show the hash of a file will give the state of a file - this should be compared to the hash of another file - if identical then we know the file has downloaded successfully.
md5 (native to bsd) is available - but is only practical on a local file -
If you are retrieving the remote file via HTTP , then there is no way to get the hash of the file without downloading it first (whether it is to STDOUT or piped to file , using wget -O- or curl )
If the file host has a second file that contains the md5 hash of the file being downloaded - then a comparison of the locally downloaded hash is comparable to the hash provided by the file provider.
To do anything more swish will require a comprehensive program to be written - such as the combination of this question and accepted answer :
Python Compare local and remote file MD5 Hash

Besides MD5, there is a simple way to do this:
Partially downloaded file usually has a temporary filename, and it will be renamed to original filename after fully downloaded. You can make your program to ignore or monitor only certain filename extensions.

Related

Temp file not being deleted

I'm trying to create a temporary file in my pipeline, then use that file in another rule.
For example, I have two rules in a .smk file:
#Unzip adapter trimmed fastq file
rule unzip_fastq:
input:
'{sample}.adapterTrim.round2.fastq.gz',
output:
temp('{sample}.adapterTrim.round2.fastq')
conda:
'../envs/rep_element.yaml'
shell:
'gunzip -c {input[0]} > {output[0]}'
#Run bowtie2 to align to rep elements and parse output
rule parse_bowtie2_output_realtime:
input:
'{sample}.adapterTrim.round2.fastq'
output:
'rep_element_pipeline/{sample}.fastq.gz.mapped_vs_' + config["ref"]["bt2_index"] + '.sam'
params:
bt2=config["ref"]["bt2_index_path"], eid=config["ref"]["enst2id"]
conda:
'../envs/rep_element.yaml'
shell:
'perl ../scripts/parse_bowtie2_output_realtime_includemultifamily.pl '
'{input[0]} {params.bt2} {output[0]} {params.eid}'
{sample}.adapterTrim.round2.fastq is used once and should ultimately be deleted upon completion. However, I'm finding that this file is uploaded to Amazon S3, even with the addition of temp(). I'm also finding that this file is removed locally, but still persists on S3.
Am I doing this correctly? '{sample}.adapterTrim.round2.fastq' is not currently written in the rule-all of the Snakefile.
We ultimately need to prevent this file from being uploaded to S3, so if there is a way to specify not to upload this file in the rule, that would be useful.
It seems that the snippet in the question is not consistent with actual use, since for S3 files one would need to wrap file names in remote.
However, as a general solution, documentation contains the following:
The remote() wrapper is mutually-exclusive with the temp() and protected() wrappers.
Hence, if you intend to use a temp file, make sure it's not wrapped in remote, or explicitly wrap the file in local.

Reinstalling packages from a list generated by command: ado dir

I am recovering Stata following a Windows upgrade. I have a list of my packages generated from ado dir in the following format:
[1] package mdesc from http://fmwww.bc.edu/RePEc/bocode/m
'MDESC': module to tabulate prevalence of missing values
[2] package univar from http://fmwww.bc.edu/RePEc/bocode/u
'UNIVAR': module to generate univariate summary with box-and-whiskers plot
[3] package tabmiss from http://www.ats.ucla.edu/stat/stata/ado/analysis
tabmiss. Shows tabulation of number of missing and non-missing values
I have many packages and would like to reinstall them without having to designate each directory/url via net cd. While using net cd along with net install or ssc install along with package names in a loop is trivial (as below), it would seem that an automated method for this task might be available.
net cd http://www.ats.ucla.edu/stat/stata/ado/analysis
local ucla tabmiss csgof powerlog ldfbeta
foreach x of local ucla {
net install `x'
}
To my knowledge, there is no built-in or automated method of tracking and managing your installed packages outside of what is available through ado or net.
I would also tend to agree with #Nick Cox that this task seems strange and I can't imagine how a new Stata install or reinstall could know what was installed previously, but I find the question interesting for other reasons.
The main reason being for users who have Stata installed on multiple machines who need the same packages on both machines. I faced a similar issue when I purchased a new computer and installed Stata but wanted all of the packages I use to be available as well. Outside of moving the ado directory or selected contents I'm not aware of any quick solution.
Here it would be possible to use the output of ado dir on one machine to determine what you need to install on a second machine with a new Stata install.
The method you propose using a foreach loop could save you time from having to type in or copy/paste a lot of packages and URLs. At the same time however, this is only beneficial if you have many packages from only a few repositories because you will need to net cd to the URL each time as you show in your example.
An alternative solution is the programmatic solution. As you know, ado dir will list each installed package, the URL and a short description of the package. Using this, a log file, and the built in I/O functionality, a short program could be written to automate the process and dynamically build a do file that contains the commands to install the already installed packages.
The code below generates a do file containing commands (in this case, net describe package, from(url)) for each package I have installed on my computer.
clear *
tempfile log1
log using "`log1'", text name(mylog)
ado dir
log close mylog
tempname logfile
file open `logfile' using "`log1'", read
file read `logfile' line
file open dfh using "path/to/your/dofile.do", write replace
local pckage "package"
while r(eof) == 0 {
if `: list pckage in line' {
local packageName : word 3 of `line'
local dirName : word 5 of `line'
di "`packageName' `dirName'"
file write dfh "net describe `packageName', from(`dirName')"
file write dfh _newline
}
file read `logfile' line
}
file close `logfile'
file close dfh
In the above code, I create a temp file to write a .txt log file to and store the contents of ado dir in that file.
Then, I open the log file using file open and read it line by line in the while loop.
Above the loop, I'm creating a do file at /path/to/your/dofile.do to hold the output of the loop - the dynamically created commands relating to the installed packages on my machine.
The loop will iterate so long as r(eof) = 0, where r(eof) is an end of file marker. I use an if statement to sort out lines of the log file which contain the word package, as I'm only interested in those lines with the package name and URL in them.
Inside of the if block, I parse the local macro line to pull the package name and the URL/directory name.
this is important: this section of code assumes that the 3rd and 5th words in the macro will always be the package name and URL respectively - Confirm this from the output of ado dir before executing.
You will also need to change the command that is being written to the file handle dfh inside of the loop to what you want (net install, etc) when you are ready to execute.
For more help on using file, locals, and tempfiles execute any of the following in Stata:
help file
help extended_fcn
help macrolists
There may be nicer ways to parse the contents of ado dir but this has worked for me. And of course I'd always advise that you take the time to understand what the code is doing so that you can make any necessary tweaks to fit your particular situation.

Writing to popen and reading back several files in Ruby

I need to run some shell commands on a number of files and sometimes I get back more than one file in response. The question is: How can I read back several files from IO.popen in Ruby?
For instance, imagine the following case:
file = grid.get(record['_id']) # fetch a file from database
IO.popen('tar -Oxmz', 'ab') {|pipe| pipe.write(file.read)} # pass to tar and extract
This necessitates that I reread all the extracted files from the filesystem. I figured out this is the speed bottleneck of my script and I wonder if I can accomplish the same task in-memroy. I tried the following:
file = grid.get(record['_id'])
IO.popen('tar -Oxmz', 'w+b') do |pipe|
pipe.write(file.read)
pipe.close_write
output = pipe.read
end
It works, but I get the whole response, here including several extracted files, in one piece (in variable output). I need the files separate from each other and possibly with their names. Is there any way to do this?
By the way, the resulting files are most of the time text, but sometimes binary. Running a pipe for each output file is not a solution, because the actual overhead of running the commands for each file outweights the benefits of doing the transformation in-memory.
P.S. The actual use case does not rely on tar only. I use software that do not have Ruby wrappers.

bash scripting de-dupe

I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the downloaded file if it already exists.
Easiest way to do this?
Thanks!
Do you really need to compress the file ?
wget provides -N, --timestamping which obviously, turns on time-stamping. What that does is say your file is located at www.example.com/file.txt
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive and the old. What matters in that case is, how big is the downloaded file ? is it worth compressing it first then checking the hashes ? is it worth decompressing the old archive and comparing the hashes ? is it better to store the old hash in a txt file ? do all these have an advantage over overwriting the old file ?
You only know that, make some tests.
So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
oldfilesum="$(xzcat file.txt.xz | sha256sum)"
if [[ $newfilesum != $oldfilesum ]]; then
xz -f file.txt # overwrite with the new compressed data
else
rm file.txt
fi
and that's done;
Calculate a hash of the content of the file and check against the new one. Use for instance md5sum. You only have to save the last MD5 sum to check if the file changed.
Also, take into account that the web is evolving to give more information on pages, that is, metadata. A well-founded web site should include file version and/or date of modification (or a valid, expires header) as part of the response headers. This, and quite other things, is what makes up the scalability of Web 2.0.
How about downloading the file, and checking it against a "last saved" file?
For example, the first time it downloads myfile, and saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile pointing to myfile-[date]. The next time the script runs, it can check if the contents of whatever lastfile points to is the same as the new downloaded file.
Don't know if this would work well, but it's what I could think of.
You can compare the new file with the last one using the sum command. This takes the checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 that takes the md5 fingerprint, but the sum command is on all systems.

How can I modify .xfdl files? (Update #1)

The .XFDL file extension identifies XFDL Formatted Document files. These belong to the XML-based document and template formatting standard. This format is exactly like the XML file format however, contains a level of encryption for use in secure communications.
I know how to view XFDL files using a file viewer I found here. I can also modify and save these files by doing File:Save/Save As. I'd like, however, to modify these files on the fly. Any suggestions? Is this even possible?
Update #1: I have now successfully decoded and unziped a .xfdl into an XML file which I can then edit. Now, I am looking for a way to re-encode the modified XML file back into base64-gzip (using Ruby or the command line)
If the encoding is base64 then this is the solution I've stumbled upon on the web:
"Decoding XDFL files saved with 'encoding=base64'.
Files saved with:
application/vnd.xfdl;content-encoding="base64-gzip"
are simple base64-encoded gzip files. They can be easily restored to XML by first decoding and then unzipping them. This can be done as follows on Ubuntu:
sudo apt-get install uudeview
uudeview -i yourform.xfdl
gunzip -S "" < UNKNOWN.001 > yourform-unpacked.xfdl
The first command will install uudeview, a package that can decode base64, among others. You can skip this step once it is installed.
Assuming your form is saved as 'yourform.xfdl', the uudeview command will decode the contents as 'UNKNOWN.001', since the xfdl file doesn't contain a file name. The '-i' option makes uudeview uninteractive, remove that option for more control.
The last command gunzips the decoded file into a file named 'yourform-unpacked.xfdl'.
Another possible solution - here
Side Note: Block quoted < code > doesn't work for long strings of code
The only answer I can think of right now is - read the manual for uudeview.
As much as I would like to help you, I am not an expert in this area, so you'll have to wait for someone more knowledgable to come down here and help you.
Meanwhile I can give you links to some documents that might help you:
UUDeview Home Page
Using XDFLengine
Gettting started with the XDFL Engine
Sorry if this doesn't help you.
You don't have to get out of Ruby to do this, can use the Base64 module in Ruby to encode the document like this:
irb(main):005:0> require 'base64'
=> true
irb(main):007:0> Base64.encode64("Hello World")
=> "SGVsbG8gV29ybGQ=\n"
irb(main):008:0> Base64.decode64("SGVsbG8gV29ybGQ=\n")
=> "Hello World"
And you can call gzip/gunzip using Kernel#system:
system("gzip foo.something")
system("gunzip foo.something.gz")

Resources