Image downloaded with wget has size of 4 bytes

I have a problem downloading a certain image.
I'm trying to download the image and save it to disk.
Here is the wget command that I'm using. It works perfectly fine with almost every image, including the URL below:
wget -O test.gif http://www.fmwconcepts.com/misc_tests/animation_example/lena_anim2.gif
I say almost, because when I try to download the image from this URL: http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif
it fails: the downloaded file is only 4 bytes. I tried the same thing with curl instead of wget, but the results are the same.
I think that the second image (the one that doesn't work) might be generated on the fly (it changes automatically depending on store reviews). I believe that has something to do with this issue.

Looks like some kind of misconfiguration on the server side. It won't return the image unless you specify that you accept gzip-compressed content. Most web browsers do this by default, so the image displays fine in a browser, but for wget or curl you need to add the Accept-Encoding header manually. That way you get the gzip-compressed image, which you can then pipe to gunzip to obtain a normal, uncompressed image.
You could save the image using:
wget --header='Accept-Encoding: gzip' -O- http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif | gunzip - > ranking.gif
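If you prefer curl, its --compressed flag both sends the Accept-Encoding header and decompresses the response for you, so no explicit gunzip is needed. A minimal sketch, assuming a curl build with zlib support:
curl --compressed -o ranking.gif http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif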

Related

How to pipe multiple files into tesseract-ocr from a loop

I am looking for a way to sequentially add files (PNG input files) to an OCR'ed PDF (via tesseract-3).
The idea is to scan a PNG, optimize it (optipng) and feed it via a stream to tesseract, which adds it to an ever-growing PDF.
The time between scans is 20-40 seconds, and the scans go into the hundreds, which is why I want to use the wait time between scans to do the OCR already.
I imagine this to work like this:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng "$scannumber.png"
    check_for_finishing_condition   # all this works fine already
    sleep 30s
    # do some magic piping into a single tesseract instance here
done    # or here?
The inspiration for this comes from here:
https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-ocr-streaming-images-to-pdf-using-tesseract
Thanks very much for any hint,
Joost
Edits:
OS: OpenSuse Tumbleweed
Scan: more of a series of "image acquisitions", each resulting in a single PNG (not a real scanner); this goes on for at least several hours.
FollowUp:
This kind of works when doing
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng "$scannumber.png"
    check_for_finishing_condition   # all this works fine already
    sleep 30s
    echo "$scannumber.png"          # hand the new file name to tesseract's file list
done | tesseract -l deu+eng -c stream_filelist=true - Result pdf
However, the PDF is corrupted if you try to open it between scan additions or stop the loop with e.g. Ctrl-C. I do not see a way to get an uncorrupted PDF.
Try this:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng "$scannumber.png"
    check_for_finishing_condition   # all this works fine already
    sleep 30s
    echo "$scannumber.png"          # tesseract reads the file names from stdin
done | tesseract -c stream_filelist=true - - pdf > output.pdf
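The streamed PDF only becomes valid once tesseract finishes and closes it, which is why it looks corrupted mid-run. If you need a PDF you can open while scans are still coming in, a possible workaround, sketched here under the assumption that poppler-utils (for pdfunite) is installed and that scan numbers are zero-padded so the glob sorts correctly, is to OCR each page into its own small PDF and merge whatever has been produced so far:
while ! $finished
do
    get_scanned_image_to_png_named_scannumber
    optipng "$scannumber.png"
    check_for_finishing_condition
    sleep 30s
    # one self-contained, always-openable PDF per page
    tesseract "$scannumber.png" "page_$scannumber" -l deu+eng pdf
    # merge everything produced so far into a snapshot you can open at any time
    pdfunite page_*.pdf Result_so_far.pdf
done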

download all images on the page with WGET

I'm trying to download all the images that appear on a page with wget. Everything seems fine, but the command actually downloads only the first 6 images and no more. I can't figure out why.
The command I used:
wget -nd -r -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
It downloads only the first 6 relevant images of the page, plus other stuff that I don't need. Any idea why it only gets the first 6 relevant images?
Thanks in advance.
I think the main problem is that there are only 6 JPEGs on that site; all the others are GIFs, for example:
<img src="http://www.edpeers.com/wp-content/themes/prophoto5/images/blank.gif"
data-lazyload-src="http://www.edpeers.com/wp-content/uploads/2013/11/aa_umbria-italy-wedding_075.jpg"
class="alignnone size-full wp-image-12934 aligncenter" width="666" height="444"
alt="Umbria wedding photographer" title="Umbria wedding photographer" /
data-lazyload-src is a jquery plugin, which wouldn't download the jpegs, see http://www.appelsiini.net/projects/lazyload
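If you do want those lazy-loaded JPEGs, one workaround, sketched here based on the attribute name in the snippet above (the page markup may of course change), is to extract the data-lazyload-src URLs yourself and feed them to wget:
wget -qO- http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/ \
  | grep -o 'data-lazyload-src="[^"]*\.jpg"' \
  | cut -d'"' -f2 \
  | wget -nd -P . -i -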
Try -p instead of -r
wget -nd -p -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
see http://explainshell.com:
-p
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML
page. This includes such things as inlined images, sounds, and referenced stylesheets.

jpg won't optimize (jpegtran, jpegoptim)

I have an image and it's a jpg.
I tried running it through jpegtran with the following command:
$ jpegtran -copy none -optimize image.jpg > out.jpg
The file is written, but the image seems unmodified (no change in size).
I tried jpegoptim:
$ jpegoptim image.jpg
image.jpg 4475x2984 24bit P JFIF [OK] 1679488 --> 1679488 bytes (0.00%), skipped.
I get the same results when I use --force with jpegoptim, except that it reports the file as optimized even though there is no change in file size.
Here is the image in question: http://i.imgur.com/NAuigj0.jpg
But I can't seem to get it to work with any other JPEGs I have either (I've only tried a couple, though).
Am I doing something wrong?
I downloaded your image from imgur, but the size is 189,056 bytes. Is it possible that imgur did something to your image?
Anyway, I managed to optimize it to 165,920 bytes using Leanify (I'm the author), and it's lossless.
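For reference, two things worth trying on the original file. This is only a sketch: the jpegtran call uses its documented lossless options, and the Leanify call assumes the default command-line build, which overwrites the file in place:
# plain Huffman-table optimization often saves nothing on an already-optimized JPEG;
# converting to progressive is also lossless and frequently shaves a few percent
jpegtran -copy none -progressive image.jpg > out.jpg
# Leanify (https://github.com/JayXon/Leanify) modifies the file in place
leanify image.jpg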

Download GD-JPEG image with correct dimensions from cURL CLI

I need help downloading a series of GD-generated images at their correct dimensions.
I'm using this command in the cURL CLI to download a range of items:
curl "http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=4[0001-9999]" -o "Clothes\4#1.jpg" --create-dirs
But the downloaded images are smaller than the ones shown on the website. The image on the website is 640x882, but cURL's output image is 232x320.
Original Image
cURL Output Image
Why is this, and can anything be added to the command to fix this?
I figured out it was because I left out a user agent:
curl "http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=4[0001-9999]" -o "Clothes\4#1.jpg" --create-dirs --user-agent "Android"
Odd, huh?
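The server apparently scales the generated image based on the client, so the User-Agent header changes what comes back. A quick way to compare the two responses, sketched here with a single ID from the range and assuming the endpoint answers HEAD requests:
# headers only; compare the Content-Length with and without a mobile User-Agent
curl -sI "http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=40001"
curl -sI -A "Android" "http://apl-moe-eng-www.ai-mi.jp/img/php/fitsample.php?&i_id=40001"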

Using CURL to download file and view headers and status code

I'm writing a Bash script to download image files from Snapito's web page snapshot API. The API can return a variety of responses indicated by different HTTP response codes and/or some custom headers. My script is intended to be run as an automated Cron job that pulls URLs from a MySQL database and saves the screenshots to local disk.
I'm using curl, and I'd like to do these three things with a single curl command:
Extract the HTTP response code
Extract the headers
Save the file locally (if the request was successful)
I could do this using multiple curl requests, but I want to minimize the number of times I hit Snapito's servers. Any curl experts out there?
Or if someone has a Bash script that can respond to the full documented set of Snapito API responses, that'd be awesome. Here's their API documentation.
Thanks!
Use the dump-header option:
curl -D /tmp/headers.txt http://server.com
Use curl -i (include HTTP header) - which will yield the headers, followed by a blank line, followed by the content.
You can then split out the headers / content (or use -D to save directly to file, as suggested above).
There are three relevant options: -i, -I, and -D.
> curl --help | egrep '^ +\-[iID]'
-D, --dump-header FILE Write the headers to FILE
-I, --head Show document info only
-i, --include Include protocol headers in the output (H/F)
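To cover all three requirements in one request, you can combine -D (headers to a file), -o (body to a file) and -w '%{http_code}' (status code to stdout). A sketch only, with a placeholder URL variable rather than Snapito's actual endpoint:
# $url is a placeholder for the snapshot URL pulled from the database
status=$(curl -sS -D headers.txt -o screenshot.png -w '%{http_code}' "$url")
if [ "$status" = "200" ]; then
    echo "saved screenshot.png"
else
    # note: on errors curl still writes the response body to screenshot.png
    echo "request failed with HTTP $status" >&2
fi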
