I'm attempting to use wget (on Windows) on a URL that hosts a few thousand files. If I sit at my PC, it seems to work fine all day long. If I leave my PC overnight, the connection just drops randomly with no error or timeout reported. Because of this, I never end up capturing the full directory, which means I have to start all over again. Is there a way I can start wget from a specific file onwards?
Example:
Directory contains files "A, B, C, D, ..., X, Y, Z".
Wget gets files A, B and C.
Can I start wget at file C (or after) to grab only items that I haven't downloaded yet? I'm trying to avoid wget looking through my current directory of files and comparing them to the URL to see whether or not I have them.
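To illustrate, assuming I had a plain list of the remaining file names in order (filelist.txt, the base URL and the name "C.dat" below are all made up), something like this could skip everything up to a given file and hand the rest to wget:
# Hypothetical sketch: print every name after the last completed file,
# prefix the base URL, and let wget fetch only those files.
awk 'found; /^C\.dat$/ {found=1}' filelist.txt \
    | sed 's|^|http://example.com/files/|' \
    | wget -i -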
Related
I'm currently using a bash script to download several images using wget.
Unfortunately the server I am downloading from is less than reliable and therefore sometimes when I'm downloading a file, the server will disconnect and the script will move onto the next file, leaving the previous one incomplete.
To remedy this, I've tried adding a second line after the script that fetches all the incomplete files using:
wget -c myurl.com/image{1..3}.png
This seems to work, as wget goes back and completes the download of the files, but the problem then comes from this: ImageMagick, which I use to stitch the images into a PDF, claims there are errors in the headers of the images.
My thought on how to delete the incomplete files is:
wget myurl.com/image{1..3}.png
wget -rmincompletefiles
wget -N myurl.com/image{1..3}.png
convert *.png mypdf.pdf
So the question is, what can I use in place of -rmincompletefiles that actually exists, or is there a better way I should be approaching this issue?
I made a surprising discovery when attempting to implement tvm's suggestion.
It turns out, and this is something I didn't realize, that when you run wget -N, wget actually checks the file sizes and verifies they are the same. If they are not, the files are deleted and then downloaded again.
So cool tip if you're having the same issue I am!
I've found this solution to work for my use case.
From the answer:
wget http://www.example.com/mysql.zip -O mysql.zip || rm -f mysql.zip
This way, the file will only be deleted if an error or cancellation occurred.
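Applied to the numbered images from the question, that idea could look roughly like this (the URL and the retry interval are just placeholders):
# Sketch: retry each image until wget exits cleanly; after any failed
# attempt, delete the partial file so the next try starts from scratch.
for i in {1..3}; do
    until wget "http://myurl.com/image$i.png" -O "image$i.png"; do
        rm -f "image$i.png"     # drop the incomplete download
        sleep 5                 # assumed back-off before retrying
    done
done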
Well, I would try hard to download the files with wget (you can specify extra parameters like a larger --timeout to give the server some extra time). wget makes certain assumptions about partial downloads, and even with a proper resume they can sometimes end up mangled (unless you verify them by other means, e.g. by their MD5 sums).
Since you are using convert and bash, another tool from the ImageMagick package will most likely be available, namely identify.
While certain features are surely poorly documented, it has one awesome capability: it can identify broken (or partially downloaded) images.
➜ ~ identify b.jpg; echo $?
identify.im6: Invalid JPEG file structure: ...
1
It returns exit status 1 if you call it on an inconsistent image. You can remove these inconsistent images using a simple loop such as:
for i in *.png; do
    identify "$i" || rm -f "$i"
done
Then I would try to download the broken files again.
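Putting the identify check and the re-download together, a rough sequence might be (the URL pattern is taken from the question; -nc is just one way to make wget skip files that are still present):
# Sketch: delete any image identify considers broken, then ask wget to
# fetch whatever is now missing, and rebuild the PDF.
for i in *.png; do
    identify "$i" >/dev/null 2>&1 || rm -f "$i"
done
wget -nc http://myurl.com/image{1..3}.png
convert *.png mypdf.pdf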
I am trying to create a script for a cron job. I have a folder of around 8 GB containing thousands of files. I am trying to create a bash script which first tars the folder and then transfers the tarred file to an FTP server.
But I am not sure what happens if, while tar is tarring the folder, some other process is accessing or writing to files inside it.
It is fine for me if the tarred file does not contain the recent changes made while tar was running.
Please suggest the proper way to do this. Thanks.
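For context, the sort of cron script I have in mind is roughly this (all paths and the FTP destination are placeholders):
#!/bin/bash
# Rough sketch of the intended cron job: tar the folder, then push the
# archive to the FTP server with curl.
SRC=/var/data/myfolder
ARCHIVE=/tmp/myfolder-$(date +%F).tar.gz
tar czf "$ARCHIVE" -C "$(dirname "$SRC")" "$(basename "$SRC")"
curl -T "$ARCHIVE" "ftp://user:password@ftp.example.com/backups/"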
tar will happily tar "whatever it can", but you will probably have some surprises when untarring, as tar records the size of each file before tarring it.
A very unpleasant surprise would be this: if the file was truncated, tar will "fill" it with NUL characters to match its recorded size, which can have very unpleasant side effects. In some cases, tar will say nothing when untarring and silently add as many NUL characters as it needs to match the size (in fact, on Unix it doesn't even need to do that: the OS does it, see "sparse files"). In other cases, if the truncation occurred while the file was being tarred, tar will complain about an unexpected end of file when untarring (as it expected XXX bytes but read fewer than that), but will still report that the file should be XXX bytes (and Unix OSes will then create it as a sparse file, with NUL characters magically appended at the end to match the expected size when you read it).
(To see the NUL characters, an easy way is to run less thefile, or cat -v thefile | more on a very old Unix, and look for any ^@.)
On the other hand, if files are only appended to (logs, etc.), the side effect is less problematic: you will only miss some parts of them (which you say you're OK with), and you won't have that unpleasant "fill with NUL characters" side effect. tar may complain when untarring the file, but it will untar it.
I think tar fails (and does not create the archive) when an archived file is modified during archiving. As Etan said, the solution depends on what you finally want in the tarball.
To avoid a tar failure, you can simply COPY the folder elsewhere before calling tar. But in that case, you cannot be confident in the consistency of the backed-up directory: the copy is NOT an atomic operation, so some files will be up to date while others will be outdated. Whether that is a severe issue depends on your situation.
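A minimal sketch of that copy-then-tar approach (the directory names are assumptions):
# Copy first, then tar the copy, so tar never sees files changing under
# it. Note that the copy itself is still not atomic.
cp -a /var/data/myfolder /tmp/myfolder.snapshot
tar czf /tmp/myfolder.tar.gz -C /tmp myfolder.snapshot
rm -rf /tmp/myfolder.snapshot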
If you can, I suggest you control how these files are created. For example: "only recent files are appended to; files older than one day are never changed". In that case you can easily back up only the old files, and the backup will be consistent.
More generally, you have to accept losing the most recent data AND being inconsistent (each file is backed up at a different time), or you have to act at a different level. I suggest:
Configure the software that produces the data so that it can give you a consistent state to back up
Or use OS/virtualization features. For example, it is possible to take a consistent snapshot of some kinds of virtual storage...
I have to download all the log files from a virtual directory within a site. Access to the virtual directory is forbidden, but the files themselves are accessible.
I have manually entered the file names to download:
dir="Mar"
for ((i=1;i<100;i++)); do
wget http://sz.dsyn.com/2014/$dir/log_$i.txt
done
The problem is that the script is not generic, and most of the time I need to find out how many files there are and tweak the for loop. Is there a way to get wget to fetch all the files without me having to specify the exact count?
Note:
If I use the browser to view http://sz.dsyn.com/2014/$dir, it is 403 Forbidden. I can't pull all the files via a browser tool/extension.
First of all, check this similar question. If this is not what you are looking for, you need to generate a file of URLs and feed it to wget, e.g.:
wget --input-file=http://sz.dsyn.com/2014/$dir/filelist.txt
wget will have the same problem your browser has: it cannot read the directory. Just pull until your first failure, then stop.
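A rough version of that "pull until the first failure" loop, reusing the URL pattern from the question, could be:
dir="Mar"
i=1
# Sketch: keep fetching log_1.txt, log_2.txt, ... and stop at the first
# failure (e.g. a 404 once the files run out).
while wget "http://sz.dsyn.com/2014/$dir/log_$i.txt"; do
    i=$((i + 1))
done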
I was making a bash script for my server which packs some directories with RAR and uploads them to another FTP server. Some folders are big, so I have to rar them in parts and wait for all the parts to be rared before uploading them, which consumes a lot of time and space.
So I want to do it faster: upload every rared part as soon as it completes and delete it automatically afterwards, without waiting for all the other parts. I know there is a risk of the data getting corrupted, but that's what I want to do, and I'm confused about where to start the script.
Server OS: Ubuntu 9.10
thanks
Kevin
Start rar in the background
Check whether rar has finished a part by calling fuser foo.part2.rar.
If fuser does not return anything, you can transfer that part to the remote FTP server.
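A rough sketch of those steps (the archive name, volume size, FTP host and credentials are assumptions, and curl stands in for whatever upload tool you prefer):
#!/bin/bash
# Create the RAR volumes in the background, upload each part as soon as
# rar is no longer writing it, then delete the part locally.

upload_finished_parts() {
    for part in backup.part*.rar; do
        [ -e "$part" ] || continue                 # glob matched nothing yet
        fuser "$part" >/dev/null 2>&1 && continue  # rar is still writing this part
        curl -T "$part" "ftp://user:password@ftp.example.com/backups/" \
            && rm -f "$part"
    done
}

rar a -v500m backup.rar /path/to/folder &
RAR_PID=$!

# While rar is still running, ship every part it has already closed.
while kill -0 "$RAR_PID" 2>/dev/null; do
    upload_finished_parts
    sleep 10
done
wait "$RAR_PID"

# Final sweep for the last part(s) written just before rar exited.
upload_finished_parts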
I have a project where I'm forced to use FTP as a means of deploying the files to the live server.
I'm developing on linux, so I hacked together a bash script that makes a backup of the ftp server's contents,
deletes all the files on the ftp, and uploads all the fresh files from the mercurial repository.
(and taking care of user uploaded files and folders, and making post-deploy changes, etc)
It's working well, but the project is starting to get big enough to make the deployment process too long.
I'd like to modify the script to look up which files have changed, and only deploy the modified files. (the backup is fine atm as it is)
I'm using mercurial as a VCS, so my idea is to somehow request the changed files between two revisions from it, iterate over the changed files,
and upload each modified file, and delete each removed file.
I can use hg log -vr rev1:rev2, and from the output, I can carve out the changed files with grep/sed/etc.
Two problems:
I have heard the horror stories that parsing the output of ls leads to insanity, so my guess is that the same applies here:
if I try to parse the output of hg log, the variables will undergo word-splitting and all kinds of transformations.
hg log doesn't tell me whether a file was modified, added, or deleted. Differentiating between modified and deleted files is the least I need.
So, what would be the correct way to do this? I'm using yafc as an FTP client, in case it matters, but I'm willing to switch.
You could use a custom style that does the parsing for you.
hg log --rev rev1:rev2 --style mystyle
Then pipe it to sort -u to get a unique list of files. The file "mystyle" would look like this:
changeset = '{file_mods}{file_adds}\n'
file_mod = '{file_mod}\n'
file_add = '{file_add}\n'
The mods and adds templates list the files that were modified or added. There are similar file_dels and file_del templates for deleted files.
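Usage would then be along these lines (rev1 and rev2 are placeholders):
# Unique list of files modified or added between the two revisions
# (grep drops the blank line the changeset template emits per changeset).
hg log --rev rev1:rev2 --style mystyle | grep -v '^$' | sort -u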
Alternatively, you could use hg status -ma --rev rev1-1:rev2, which adds an M or an A before modified/added files. You need to pass a different revision range, one less than rev1, because status is reported relative to that "baseline". Deleted files are handled similarly: you need the -d flag, and a D is added before each deleted file.
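Building on that hg status variant, a deployment loop could be sketched like this (the revision numbers, the FTP URL and the use of curl instead of yafc are all assumptions):
#!/bin/bash
# Sketch: walk the status output between the two revisions and act on
# each file: upload modified/added files, delete removed ones.
REV1=100
REV2=110
FTP=ftp://user:password@ftp.example.com/site

hg status -mard --rev "$((REV1 - 1)):$REV2" | while read -r state file; do
    case "$state" in
        M|A) curl -T "$file" "$FTP/$file" ;;   # upload the new version
        R|D) curl "$FTP/" -Q "DELE $file" ;;   # crude FTP delete via curl's --quote
    esac
done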