Mirror multiple page site with lftp - bash

I need to mirror data hosted on a web site on a regular basis. I am trying to use lftp (version 4.0.9), as it usually does a great job for this task. However, the site I am downloading from has multiple pages (I intend to loop over the most recent n pages in a bash script that will run several times a day), and I can't work out how to get lftp to accept the page parameter. I've had no luck searching for a solution online, and what I have tried has failed so far.
This works perfectly:
lftp -c 'mirror -v -i "S1A" -P 4 https://qc.sentinel1.eo.esa.int/aux_resorb/'
This does not:
lftp -c 'mirror -v -i "S1A" -P 4 https://qc.sentinel1.eo.esa.int/aux_resorb/?page=2'
It gives error:
mirror: Access failed: 404 NOT FOUND (/aux_resorb/?page=2)
I also tried passing the new URL in as a variable but that didn't work either. I'd be grateful for suggestions to solve this issue.
Before it is suggested: I know wget is an option and its pagination works (I tested it), but I don't want to use it here because it is less appropriate for this task. It wastes a lot of time fetching all the "index.html?param=value" listing files and then removing them, and given the number of pages that isn't feasible.
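For reference, this is roughly the loop I'm aiming for (a sketch; n and the page range are placeholders):
#!/bin/bash
# Sketch of the intended loop; the ?page=N form is exactly what fails
# with the 404 above.
n=5
for page in $(seq 1 "$n"); do
    lftp -c "mirror -v -i S1A -P 4 https://qc.sentinel1.eo.esa.int/aux_resorb/?page=${page}"
done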

The problem with lftp's mirror command is that it appends a slash to the given URL when requesting the page (see the tests below). So it boils down to how the remote end handles URLs and whether it gets upset by the trailing slash. In my tests, Drupal sites for example do not like the trailing slash and return a 404, but some other sites worked fine. Unfortunately I was not able to figure out a workaround if you insist on using lftp.
Tests
I tried the following requests against a web server:
1. lftp -c 'mirror -v http://example/path'
2. lftp -c 'mirror -v http://example/path/?page=2'
3. lftp -c 'mirror -v http://example/path/file'
4. lftp -c 'mirror -v http://example/path/file?page=2'
These commands resulted in the following HEAD requests, as seen by the web server:
1. HEAD /path/
2. HEAD /path/%3Fpage=2/
3. HEAD /path/file/
4. HEAD /path/file%3Fpage=2/
Note that there is always a trailing slash in the request; %3F is just the URL-encoded ? character.
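Since lftp itself doesn't seem to offer a way around this, one possible fallback is to fetch each listing page to stdout with wget and extract the file links yourself, so nothing named index.html?page=N ever has to be saved and cleaned up. This is only a sketch and assumes the listing pages contain relative hrefs to the files, which may not match the real markup:
#!/bin/bash
# Sketch only: pull S1A links out of the first few listing pages and
# download each file, skipping ones that are already present (-nc).
base='https://qc.sentinel1.eo.esa.int/aux_resorb/'
for page in 1 2 3; do
    wget -qO- "${base}?page=${page}" \
      | grep -oE 'href="[^"]*S1A[^"]*"' \
      | sed -e 's/^href="//' -e 's/"$//' \
      | while read -r link; do
            # assumes hrefs are relative to the listing URL; adjust if they
            # turn out to be absolute paths
            wget -nc "${base}${link}"
        done
done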

Related

`wget -i files.txt` gives Scheme Missing error on VPS

I have gathered a few .min.map file URLs and stored them in mapfiles.txt, as shown below:
https://example.com/app.bundle.min.map
https://example.com/app.bundle.min.map
https://example.com/app.bundle.min.map
I read the wget man page and found that to download files from a list we need to use
wget -i filename.txt
So I tried the same thing and observed some weird behavior.
On local system,
wget -i mapfiles.txt
This command started downloading .map files
On VPS,
wget -i mapfiles.txt
Got this error
https://example/app.bundle.min.map: Scheme missing
No URLs found in mapfiles.txt.
Could you guys help me figure out where I am going wrong?
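A "Scheme missing" error that shows up on one machine but not another often points at invisible characters in the list file (CRLF line endings or a UTF-8 BOM picked up when the file was copied to the VPS) rather than at wget itself; that is an assumption here, but it is cheap to check. A sketch, assuming GNU sed on the VPS:
# Show hidden characters: CRLF line endings appear as ^M before the $,
# a UTF-8 BOM as M-oM-;M-? at the very start of the first line.
cat -A mapfiles.txt

# Strip carriage returns in place, then retry the download.
sed -i 's/\r$//' mapfiles.txt
wget -i mapfiles.txt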

Wget mirror site via ftp - timestamps issue

A site I'm working on requires a large number of files to be downloaded from an external system via FTP, daily. This is not of my design; it is the only solution offered by the external provider (I cannot use SSH/SFTP/SCP).
I've solved this by using wget, run inside a cron task:
wget -m -v -o log.txt --no-parent -nH -P /exampledirectory/ --user username --password password ftp://www.example.com/
Unfortunately, wget does not seem to see the timestamp differences, so when a file is modified, it still returns:
Remote file no newer than local file
`/xxx/data/data.file'
-- not retrieving.
When I manually connect via FTP, I can see differences in the timestamps, so it should be getting the updated file. I'm not able to access or control the target server via any other means.
Is there anything I can do to get around this? Can I force wget to mirror while ignoring timestamps? (I understand this defeats the point of mirroring)...
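One thing worth trying, as a sketch: -m is shorthand for -r -N -l inf --no-remove-listing, and it is the -N (timestamping) part that makes wget skip files it considers no newer. Spelling the options out without -N makes wget re-fetch every file and overwrite the local copy, at the cost of re-downloading the whole tree on each run:
wget -r -l inf --no-remove-listing -v -o log.txt --no-parent -nH \
     -P /exampledirectory/ --user username --password password \
     ftp://www.example.com/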

DD-WRT wget returns a cached file

I'm developing an installer for my YAMon script for *WRT routers (see http://www.dd-wrt.com/phpBB2/viewtopic.php?t=289324).
I'm currently testing on a TP-Link TL-WR1043ND with DD-WRT v3.0-r28647 std (01/02/16). Like many others, this firmware variant does not include curl so I (gracefully) fall back to a wget call. But, it appears that DD-WRT includes a cut-down version of wget so the -C and --no-cache options are not recognized.
Long & short, my wget calls insist on downloading cached versions of the requested files.
BTW - I'm using: wget "$src" -qO "$dst"
where src is the source file on my remote server and dst is the destination on the local router.
So far I've unsuccessfully tried to:
1. append a timestamp to the request URL
2. reboot the router
3. run stopservice dnsmasq & startservice dnsmasq
None have changed the fact that I'm still getting a cached version of the file.
I'm beating my head against the wall... any suggestions? Thx!
Al
Not really an answer but a seemingly viable workaround...
After a lot of experimentation, I found that wget seems to always return the latest version of the file from the remote server if the extension on the requested file is '.html'; but if it is something else (e.g., '.txt' or '.sh'), it does not.
I have no clue why this happens or where they are cached.
But now that I do, all of the files required by my installer have an .html extension on the remote server, and the script saves them with the proper extension locally. (Sigh... several days of my life that I won't get back.)
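To make that workaround concrete, here is a minimal sketch of what the installer now does (the server name and file names below are placeholders, not the real ones):
# The copy on the remote server carries an .html extension (the only kind
# this cut-down wget would reliably fetch fresh for me); the script saves
# it locally under its real name.
src="http://updates.example.net/yamon/setup.sh.html"   # placeholder URL
dst="/tmp/setup.sh"
wget "$src" -qO "$dst"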
Al
I had the same problem. While getting images from a camera, the HTTP server on the camera always sent the same image.
wget --no-http-keep-alive ..
solved my problem, and my full line is:
wget --no-check-certificate --no-cache --no-cookies --no-http-keep-alive $URL -O img.jpg -o wget_last.log

What does this command do: 'wget -qO- 127.0.0.1'?

In the provisioning part of the Vagrant guide, there is a command wget -qO- 127.0.0.1 to check if Apache is installed properly.
Can anyone explain the command in more detail? I don't understand what the -qO- option does. Also, what is the meaning of pointing wget at 127.0.0.1?
Thanks!
The dash after the O sends the downloaded output to standard output instead of to a file.
The q option means the command should be "quiet" and not print its usual progress and status messages.
The two options together mean the command can be used nicely in a pipeline.
As for using 127.0.0.1 as the target of the wget, that is there to make sure you have a local web server running. Running wget on the command line is faster than bringing up a browser.
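For example, the output can be piped straight into another command to confirm that Apache answered. A small sketch, assuming the default page mentions Apache somewhere in its HTML:
# Fetch the local web root quietly, write it to stdout, and check it.
wget -qO- 127.0.0.1 | grep -qi "apache" && echo "Apache is serving pages"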

Webpage monitoring script is returning false positives

I am trying to automate a process which previously was consuming a full-time job: monitoring a series of websites for new posts. This seemed like a relatively simple scripting problem, so I tackled it, wrote a bash script, and set it to run every minute in the crontab. It works great, but after the page changes, it keeps returning false positives for an hour or so, and I can't for the life of me figure out why. It resolves itself after a while, but I don't want to deploy the script until I understand what's happening. Here's my code:
#!/bin/bash
SITENAME=example
wget http://web.site.url/apache/folder/$(date +%Y)/$(date +%m)-$(date +%B) -O $SITENAME.backend.new --no-cache
touch $SITENAME.backend.old
diff $SITENAME.backend.new $SITENAME.backend.old > $SITENAME.backend.diff
if [ -s $SITENAME.backend.diff ]
then sendemail -xu myaddress@mydomain.com -xp password -f myaddress@mydomain.com -t myaddress@mydomain.com -s smtpout.secureserver.net -u $SITENAME -m backend \
&& cp $SITENAME.backend.new $SITENAME.backend.old \
&& echo true
fi
If the only differences in the diff are between absolute and relative links, consider using the --convert-links switch for wget, as the man page says:
-k
--convert-links
After the download is complete, convert the links in the document to make them suitable for local viewing. This affects not only the visible hyperlinks, but any part of the document that links to external content, such as embedded images, links to style sheets, hyperlinks to non-HTML content, etc.
This will convert links to absolute links.
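In the script above, that would mean changing the wget line to something like this (a sketch; with -O and a single document, -k just rewrites relative URIs as absolute ones, which should keep the diffs stable):
wget -k http://web.site.url/apache/folder/$(date +%Y)/$(date +%m)-$(date +%B) \
     -O $SITENAME.backend.new --no-cache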
