download all images on the page with WGET

I'm trying to download all the images that appear on a page with wget. Everything seems fine, but the command actually downloads only the first 6 images and no more, and I can't figure out why.
The command I used:
wget -nd -r -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
It's downloading only the first 6 relevant images of the page, plus other stuff that I don't need. Look at the page; any idea why it's only getting the first 6 relevant images?
Thanks in advance.

I think the main problem is that there are only 6 jpegs on that site; all the others are gifs. Example:
<img src="http://www.edpeers.com/wp-content/themes/prophoto5/images/blank.gif"
data-lazyload-src="http://www.edpeers.com/wp-content/uploads/2013/11/aa_umbria-italy-wedding_075.jpg"
class="alignnone size-full wp-image-12934 aligncenter" width="666" height="444"
alt="Umbria wedding photographer" title="Umbria wedding photographer" />
data-lazyload-src is an attribute used by a jQuery lazy-loading plugin; the real jpegs are swapped in by JavaScript, so wget never sees them. See http://www.appelsiini.net/projects/lazyload
Try -p instead of -r
wget -nd -p -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
see http://explainshell.com:
-p
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML
page. This includes such things as inlined images, sounds, and referenced stylesheets.
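Since wget only ever sees the blank.gif placeholders, another option is to scrape the real URLs out of the data-lazyload-src attributes yourself and feed them back to wget. A rough sketch, assuming GNU grep and that every lazy-loaded image uses that attribute:

# Extract the lazy-load URLs from the HTML and download them;
# -i - makes the second wget read its URL list from stdin.
wget -qO- 'http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/' \
  | grep -Eo 'data-lazyload-src="[^"]+\.jpe?g"' \
  | cut -d'"' -f2 \
  | wget -nd -P . -i -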

Related

Why can I sometimes download pictures with curl and sometimes not?

So I have gotten a link to an image from Google:
https://media.istockphoto.com/photos/pile-of-euro-notes-picture-id471843075?k=20&m=471843075&s=612x612&w=0&h=aEFb1spFMtvSnsNvkpgA2tULw-cmcBC4nwbCvDFYN9c=
I got this by right-clicking on the image and copying the image's URL.
This is another image URL I got in the same way:
https://m.media-amazon.com/images/I/61RzcieEZpL._AC_SX522_.jpg
Both images show up fine when I paste the links into the browser.
When I use curl to download the second image, it works without issues:
curl -O 'https://m.media-amazon.com/images/I/61RzcieEZpL._AC_SX522_.jpg'
However, for the first one...
curl -O 'https://media.istockphoto.com/photos/pile-of-euro-notes-picture-id471843075?k=20&m=471843075&s=612x612&w=0&h=aEFb1spFMtvSnsNvkpgA2tULw-cmcBC4nwbCvDFYN9c='
The downloaded file is just this strange-looking text file:
ˇÿˇ‡JFIF,,ˇ·òExifII*[&òÇÅPile of euro notes. This notes are miniatures, made by myself. More money? In my portfolio.Kerstin Waurickˇ·îhttp://ns.adobe.com/xap/1.0/<?xpacket begin="Ôªø" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="" xmlns:photoshop="http://ns.adobe.com/photoshop/1.0/" xmlns:Iptc4xmpCore="http://iptc.org/std/Iptc4xmpCore/1.0/xmlns/" xmlns:GettyImagesGIFT="http://xmp.gettyimages.com/gift/1.0/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:plus="http://ns.useplus.org/ldf/xmp/1.0/" xmlns:iptcExt="http://iptc.org/std/Iptc4xmpExt/2008-02-29/" xmlns:xmpRights="http://ns.adobe.com/xap/1.0/rights/" dc:Rights="Kerstin Waurick" photoshop:Credit="Getty Images/iStockphoto" GettyImagesGIFT:AssetID="471843075" xmpRights:WebStatement="https://www.istockphoto.com/legal/license-agreement?utm_medium=organic&utm_source=google&utm_campaign=iptcurl" >
<dc:creator><rdf:Seq><rdf:li>Kerrick</rdf:li></rdf:Seq></dc:creator><dc:description><rdf:Alt><rdf:li xml:lang="x-default">Pile of euro notes. This notes are miniatures, made by myself. More money? In my portfolio.</rdf:li></rdf:Alt></dc:description>
<plus:Licensor><rdf:Seq><rdf:li rdf:parseType='Resource'><plus:LicensorURL>https://www.istockphoto.com/photo/license-gm471843075-?utm_medium=organic&utm_source=google&utm_campaign=iptcurl</plus:LicensorURL></rdf:li></rdf:Seq></plus:Licensor>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>
ˇÌ∫Photoshop 3.08BIMùPKerrickx[Pile of euro notes. This notes are miniatures, made by myself. More money? In my portfolio.tKerstin WauricknGetty Images/iStockphotoˇ€C
#%$""!&+7/&)4)!"0A149;>>>%.DIC<H7=>;ˇ€C
Can anyone tell me why this is happening?
Your viewer just failed to fathom out that the file is a JPEG image because it has the wrong extension. Try saving it under a name with a proper extension, like this (-O is dropped because it writes to the URL-derived name rather than stdout):
curl 'https://media.istockphoto.com/photos/pile-of-euro-notes-picture-id471843075?k=20&m=471843075&s=612x612&w=0&h=aEFb1spFMtvSnsNvkpgA2tULw-cmcBC4nwbCvDFYN9c=' > image.jpg
If you might be downloading PNGs and GIFs and stuff other than JPEG, you can use file to get a sensible extension:
curl ... > UnknownThing
Then:
file -b --extension UnknownThing
jpeg/jpg/jpe/jfif
So maybe something along the lines of:
curl ... > UnknownThing
ext=$(file -b --extension UnknownThing | sed 's|/.*||')
mv UnknownThing image.${ext}
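Putting the pieces together for the iStockphoto URL above (a sketch; the base name image is arbitrary):

# Download to a placeholder name, sniff the type, then rename.
curl -sS -o UnknownThing 'https://media.istockphoto.com/photos/pile-of-euro-notes-picture-id471843075?k=20&m=471843075&s=612x612&w=0&h=aEFb1spFMtvSnsNvkpgA2tULw-cmcBC4nwbCvDFYN9c='
ext=$(file -b --extension UnknownThing | sed 's|/.*||')  # first entry of "jpeg/jpg/jpe/jfif"
mv UnknownThing "image.${ext}"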

How to pipe multiple files into tesseract-ocr from a loop

I am looking for a way to sequentially add files (PNG input files) to an OCR'ed PDF (via tesseract-3).
The idea is to scan a PNG, optimize it (optipng) and feed it via a stream to tesseract, which adds it to an ever-growing PDF.
The time between scans is 20-40 seconds, and the scans go into the hundreds, which is why I want to use the wait time between scans to do the OCR.
I imagine this to work like this:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition  # all this works fine already
    sleep 30s
    # do some magic piping into a single tesseract instance here
done  # or here?
The inspiration for this comes from here:
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-ocr-streaming-images-to-pdf-using-tesseract
Thanks very much for any hint,
Joost
Edits:
OS: openSUSE Tumbleweed
Scan: more of a series of "image acquisitions" resulting in a single PNG each (not a real scanner); going on for several hours at least.
FollowUp:
This kind of works when doing:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition  # all this works fine already
    sleep 30s
    echo "$scannumber.png"
done | tesseract -l deu+eng -c stream_filelist=true - Result pdf
However, the PDF is corrupted if you try to open it between scan additions or stop the loop with e.g. Ctrl-C. I do not see a way to get an uncorrupted PDF.
Try this:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition  # all this works fine already
    sleep 30s
    echo "$scannumber.png"  # feed the filename to tesseract via the pipe
done | tesseract -c stream_filelist=true - - pdf > output.pdf
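If you need a readable PDF while the loop is still running, one possible workaround (a sketch, not from the original answer, assuming poppler-utils for pdfunite) is to OCR each scan into its own single-page PDF and merge a snapshot on demand:

while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng $scannumber.png
    check_for_finishing_condition
    # One self-contained PDF per page, so nothing is ever half-written.
    tesseract -l deu+eng "$scannumber.png" "page_$scannumber" pdf
    sleep 30s
done
# Merge whenever you want a snapshot; assumes zero-padded scan numbers
# so the shell glob sorts the pages in order.
pdfunite page_*.pdf Result.pdf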

How to download some specific files with some keywords from different directories using wget?

I am trying to download data from the TRMM satellite data archive using the following command:
wget -r --no-parent ftp://arthurhou.pps.eosdis.nasa.gov/pub/trmmdata/ByDate/V07/2008/01/01 --user=<user> --password=<password>
2008 is the year, the first 01 is the month (January) and the second 01 is the date. Within this date folder, there are plenty of data files
(e.g 1A01.20080101.57701.7.gz, 2A21.20080101.57711.7.HDF.gz, 2A23.20080101.57702.7.HDF.gz).
I want to download only the files under the "2A23" category from every folder (i.e. for every year, month and date), but with my wget command all the files get downloaded. Is there a way to specify a keyword so that only those files are fetched?
Thank you in advance for your help.
The solution is here, in case someone else is stuck on the same question later.
wget -r --no-parent -A 'pattern' 'URL' --user=<user> --password=<password>
In my case the pattern was 2a23*.gz.
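Applied to the archive above, that would look something like the following (credentials redacted; note that -A patterns are case-sensitive unless you also pass --ignore-case, and the listing above shows the files as upper-case 2A23):

# Recurse from the date folder but keep only the 2A23 products.
wget -r --no-parent --ignore-case -A '2A23*.gz' \
    --user=<user> --password=<password> \
    ftp://arthurhou.pps.eosdis.nasa.gov/pub/trmmdata/ByDate/V07/2008/01/01/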

Image downloaded with wget has size of 4 bytes

I have a problem with downloading a certain image.
I'm trying to download the image and save it to disk.
Here is the wget command that I'm using; it works perfectly fine with almost every image, for example this URL:
wget -O test.gif http://www.fmwconcepts.com/misc_tests/animation_example/lena_anim2.gif
Almost, because when I try to download the image from this URL: http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif
it fails: the downloaded file's size is 4 bytes. I tried curl instead of wget, but the results are the same.
I think that the second image (the one not working) might be generated on the fly (the image changes automatically, depending on store reviews). I believe that has something to do with this issue.
Looks like some kind of misconfiguration on the server side. It won't return the image unless you specify that you accept gzip compressed content. Most web browsers nowadays do this by default, so the image is working fine in browser, but for wget or curl you need to add accept-encoding header manually. This way you will get gzip compressed image. Then you can pipe it to gunzip and get a normal, uncompressed image.
You could save the image using:
wget --header='Accept-Encoding: gzip' -O- http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif | gunzip - > ranking.gif
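The curl equivalent is even shorter, since curl's --compressed flag both sends the Accept-Encoding header and transparently decodes the gzipped response:

curl --compressed -o ranking.gif 'http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif'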

Download GD-JPEG image with correct dimensions from cURL CLI

I need help downloading a series of GD-generated images with their correct dimensions.
I'm using this command in the cURL CLI to download a range of items:
curl "http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=4[0001-9999]" -o "Clothes\4#1.jpg" --create-dirs
But the downloaded images are smaller than the ones shown on the website: the website's images are 640*882, while cURL's output images are 232*320.
Original Image
cURL Output Image
Why is this, and can anything be added to the command to fix this?
I figured out it was because I left out a user agent:
curl "http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=4[0001-9999]" -o "Clothes\4#1.jpg" --create-dirs --user-agent "Android"
Odd, huh?
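Presumably the script sizes the image based on the User-Agent header and falls back to a small variant for clients it doesn't recognize; that is an assumption, but easy to check by comparing body sizes with and without a User-Agent (-w '%{size_download}' prints the number of bytes fetched; i_id=40001 is just the first item from the range above):

curl -s -o /dev/null -w '%{size_download}\n' 'http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=40001'
curl -s -A 'Android' -o /dev/null -w '%{size_download}\n' 'http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=40001'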
