wget Wikimedia image? - bash

I am trying to download an image from Wikimedia Commons by using a URL to a page in the file namespace:
wget http://commons.wikimedia.org/wiki/File:A_golden_tree_during_the_golden_season.JPG
All I get is a "JPG" file that I cannot open. When you go to the link you actually see a page rather than the image itself, but there is a link called "Full resolution" that sends you to the real image URL, which is: http://upload.wikimedia.org/wikipedia/commons/9/92/A_golden_tree_during_the_golden_season.JPG
How can I download this file given only the first link?

You can try the following:
wget http://commons.wikimedia.org/wiki/File:A_golden_tree_during_the_golden_season.JPG -O output.html
wget $(grep fullMedia output.html | sed 's/\(.*href="\/\/\)\([^ ]*\)\(" class.*\)/\2/g')
The first wget fetches the page you specify. I browsed a few pages and found that the high-resolution image links live under a div with class="fullMedia". The second command parses the image URL out of the saved page and then fetches the image.
PS: As suggested above, bash is not a neat way of doing this. You should look at something that parses DOM trees.
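If xmllint (from libxml2) is installed, it can query the DOM directly instead of grep/sed. A minimal sketch, assuming the full-resolution link still sits inside a div with class="fullMedia" as in the answer above:
wget -q http://commons.wikimedia.org/wiki/File:A_golden_tree_during_the_golden_season.JPG -O page.html
# pull the href of the first link inside the fullMedia div; stderr silenced because
# xmllint warns about non-well-formed HTML
href=$(xmllint --html --xpath 'string(//div[@class="fullMedia"]//a/@href)' page.html 2>/dev/null)
wget "https:$href"    # the extracted href is protocol-relative (//upload.wikimedia.org/...)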

Extract the title without the namespace prefix (A_golden_tree_during_the_golden_season.JPG) and pass it to Special:Redirect.
wget http://commons.wikimedia.org/wiki/Special:Redirect/file/$( echo 'http://commons.wikimedia.org/wiki/File:A_golden_tree_during_the_golden_season.JPG' | sed 's/.*\/File\:\(.*\)/\1/g' )
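The sed can also be replaced with plain bash parameter expansion (a sketch; assumes the page URL always contains "File:"):
url='http://commons.wikimedia.org/wiki/File:A_golden_tree_during_the_golden_season.JPG'
# strip everything up to and including "File:", leaving just the filename
wget "http://commons.wikimedia.org/wiki/Special:Redirect/file/${url##*File:}"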

wget http://upload.wikimedia.org/wikipedia/commons/9/92/A_golden_tree_during_the_golden_season.JPG
You were fetching the web page, not the image itself.

You can use the following link to retrieve it: https://upload.wikimedia.org/wikipedia/commons/9/92/A_golden_tree_during_the_golden_season.JPG
I had the same problem. Click on the image on the file page and you will get the link above. I hope this helps.

Related

Get list of php links with payload on website using Bash

I want to download certain files from a website, but I struggle with the download links.
If the links were "simple" I would get a list like this:
#!/bin/bash
urL="http://..."
wget -O webpage.txt "$urL"
LC_ALL=C sed -n 's/.*href="\([^"]*\).*/\1/p' webpage.txt > links.txt
But unfortunately the links look like this:
<a href='/appl/ics.php?apid=5464646&from=2022-06-23%2006%3A30%3A00&to=2022-06-23%2006%3A30%3A000' class='icalLink'> LinkName </a>
How can I collect these "bigger" URLs and save each one as LinkName.ics? Maybe even using only curl to download all the files at once?
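One possible approach, sketched below: extend the sed from above to capture both the single-quoted href and the link text, then download each file with curl. This assumes one link per line and exactly the markup shown in the example; example.com stands in for the real host.
urL="http://example.com"            # hypothetical base URL; substitute the real one
wget -O webpage.txt "$urL"
# capture the href (group 1) and the link name (group 2) from lines shaped like the example
sed -n "s#.*href='\([^']*\)'[^>]*> *\([^ <]*\) *</a>.*#\1 \2#p" webpage.txt |
while read -r path name; do
    curl -o "${name}.ics" "${urL}${path}"    # the href is relative, so prepend the host
done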

WGET saves with wrong file and extension name possibly due to BASH

I've tried what a few forum threads suggest already.
However, I keep getting the same failure as a result.
To replicate the problem:
Here is a URL leading to a forum thread with 6 pages.
http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1
What I typed into the console was:
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/{1..6}/vc/1"
And here is what I got:
--2018-06-14 10:44:17-- http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/%7B1..6%7D/vc/1
Resolving forex.kbpauk.ru (forex.kbpauk.ru)... 185.68.152.1
Connecting to forex.kbpauk.ru (forex.kbpauk.ru)|185.68.152.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '1'
1 [ <=> ] 19.50K 58.7KB/s in 0.3s
2018-06-14 10:44:17 (58.7 KB/s) - '1' saved [19970]
The file was saved simply as "1", with no extension, it seems.
My expectation was that the file would be saved with an .html extension, because it's a web page.
I'm trying to get wget to work, but if it's possible to do what I want with curl then I would also accept that as an answer.
Well, there are a couple of issues with what you're trying to do.
The double quotes around your URL prevent Bash brace expansion, so you're not really downloading 6 files, but a single URL with "{1..6}" in it (which wget percent-encodes into the %7B1..6%7D you see in the log). You probably want to drop the quotes around the URL so that bash expands it into 6 different arguments; see the loop sketch after the option list below.
I also notice the download was saved as "1", irrespective of the actual page number: wget derives the output filename from the last component of the URL path, which here is always 1 (.../vc/1). That makes it hard for wget or any other tool to save each page under a distinct name.
The real way to create a mirror of the forum would be to use this command line:
$ wget -m --no-parent -k --adjust-extension http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1
Let me explain what this command does:
-m --mirror activates the mirror mode (recursion)
--no-parent asks Wget to not go above the directory it starts from
-k --convert-links will edit the HTML pages you download so that the links in them will point to the other local pages you have also downloaded. This allows you to browse the forum pages locally without needing to be online
--adjust-extension This is the option you were originally looking for. It will cause Wget to save the file with a .html extension if it downloads a text/html file but the server did not provide an extension.
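If you only want those six thread pages rather than a full mirror, a short loop also works (a sketch, assuming the fpart values 1 through 6 are all valid pages):
for i in {1..6}; do
    # each iteration names its own output file, sidestepping the "everything is called 1" problem
    wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/$i/vc/1" -O "page-$i.html"
done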
Simply use the -O switch to specify the output filename; otherwise wget derives a default name from the URL, which in your case is 1.
So if you wanted to call your file what-i-want-to-call-it.html (note this only makes sense for a single page, so the brace range is dropped here), you would do:
wget "http://forex.kbpauk.ru/showflat.php/Cat/0/Number/107623/page/0/fpart/1/vc/1" -O what-i-want-to-call-it.html
If you type wget --help into the console you will get a full list of all the options that wget provides.
To verify it has worked, output the file with:
cat what-i-want-to-call-it.html

How to parse html with wget to download an artifact using pattern matching against Jenkins

I am trying to download an artifact from Jenkins where I need the latest build. If I curl jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build it brings me to the page that contains the artifact I need to download, which in my case is myCompany-1234.ipa.
So, changing curl to wget with --auth-no-challenge https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ downloads the index.html file.
If I add --reject index.html it stops index.html from downloading.
If I append the name of the artifact with a wildcard, like MyCompany-*.ipa, it downloads a 14k file named MyCompany-*.ipa, not the MyCompany-1234.ipa I was hoping for. Keep in mind the page I am requesting only has one MyCompany-1234.ipa, so there will never be multiple matches.
If I use a flag to pattern match, -A "*.ipa", like so: wget --auth-no-challenge -A "*.ipa" https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/ it still doesn't download the artifact.
It works if I input the exact URL: wget --auth-no-challenge https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/MyCompany-1234.ipa
The problem here is that the build number is not always going to be 1234; tomorrow it will be 1235, and so on. How can I either parse the HTML or use the wildcard correctly in wget to ensure I am always getting the latest?
Never mind; working with another engineer here at my work, we came up with a super elegant solution: parse the JSON from the Jenkins API.
Install Chrome and get the plugin JSONView
Call the Jenkins API in your Chrome browser using https://$domain/$job/lastSuccessfulBuild/api/json
This will print out the key/value pairs in the JSON. Note the key you need; for me it was number.
brew install jq
In a bash script, create a variable that stores the dynamic value as follows.
This stores the latest build number in latest:
latest=$(curl --silent --show-error https://userIsMe:123myapikey321@jenkins.mycompany.com/job/build_ios/lastSuccessfulBuild/api/json | jq '.number')
Print it to screen if you would like:
echo $latest
Now, with some string interpolation, pass the latest variable to your wget call:
wget --auth-no-challenge https://userIsMe:123myapikey321@jenkins.mycompany.com/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-$latest.ipa
Hope this helps someone out, as there is limited info out there that is clear and concise, especially given that wget has been around for an eternity.
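Put together, the steps above fit in a short script (a sketch reusing the placeholder credentials and hostname from this question):
#!/bin/bash
# userIsMe:123myapikey321 and jenkins.mycompany.com are the question's placeholders
base="https://userIsMe:123myapikey321@jenkins.mycompany.com"
# ask the Jenkins JSON API for the last successful build number
latest=$(curl --silent --show-error "$base/job/build_ios/lastSuccessfulBuild/api/json" | jq '.number')
# fetch the matching artifact
wget --auth-no-challenge "$base/view/iOS/job/build_ios/lastSuccessfulBuild/artifact/build/myCompany-$latest.ipa"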

download list of images from urls

I need to find (preferably) or build an app to download a lot of images.
Each image has a distinct URL. There are many thousands, so doing it manually is a huge effort.
The list is currently in a csv file. (It is essentially a list of products, each with identifying info (name, brand, barcode, etc.) and a link to a product image.)
I'd like to loop through the list, and download each image file. Ideally I'd like to rename each one - something like barcode.jpg.
I've looked at a number of image scrapers, but haven't found one that works quite this way.
Very appreciative of any leads to the right tool, or ideas...
Are you on Windows or Mac/Linux? On Windows you can use a PowerShell script for this; on Mac/Linux, a shell script with about 1-5 lines of code.
Here's one way to do this:
# show what's inside the file
cat urlsofproducts.csv
http://bit.ly/noexist/obj101.jpg, screwdriver, blackndecker
http://bit.ly/noexist/obj102.jpg, screwdriver, acme
# this one-liner will GENERATE one download-command per item, but will not execute them
perl -MFile::Basename -F", " -anlE "say qq(wget -q \$F[0] -O '\$F[1]--\$F[2]--). basename(\$F[0]) .q(')" urlsofproducts.csv
# Output:
wget -q http://bit.ly/noexist/obj101.jpg -O 'screwdriver--blackndecker--obj101.jpg'
wget -q http://bit.ly/noexist/obj102.jpg -O 'screwdriver--acme--obj102.jpg'
Now paste the generated wget commands back into the shell (or pipe them straight to bash, as below).
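If the generated commands look correct, they can be executed directly by piping them to a shell (run the generator once without the pipe first to check its dry-run output):
perl -MFile::Basename -F", " -anlE "say qq(wget -q \$F[0] -O '\$F[1]--\$F[2]--). basename(\$F[0]) .q(')" urlsofproducts.csv | bash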
If possible, use Google Sheets to run a function for this kind of work. I was also puzzled by this one and found a way by which the images are not only downloaded but also renamed in real time.
Kindly reply if you want the code.

How to extract links behind a text tag of web page (using either curl,wget or userscript)

I'm trying to extract the href links under an <a> tag.
Refer to the attachment: I want to save every link under the text "PDF".
http://tinypic.com/r/2n9erdj/8 (Sorry, I'm not allowed to post pictures yet.)
Specifically, the href details appear as arnumber=60940cc, as shown in the red circle.
Can someone suggest how to implement this? I intend to use either a userscript or bash commands.
The HTML element details relevant to a single PDF are shown below.
<a aria-label="Download or View the PDF: IEEE Transactions on Power Electronics publication information" href="/stampPDF/getPDF.jsp?tp=&arnumber=6094072"><img class="button" src="http://staticieeexplore.ieee.org/assets/img/iconPdf.png" alt="PDF file icon" title="Download or View the PDF">PDF</a>
The web page I'm testing is
http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6088512
The objective is to extract the links labelled "PDF" and their URLs.
Try this, tweaking the sed part if you don't want "http://ieeexplore.ieee.org" inserted just before the stamp part:
wget http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6088512 -O file.html
grep -o "href.*stamp.*\"><" file.html | sed 's#"#"http://ieeexplore.ieee.org#;s#><##'
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=6094070"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=6094072"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=6094110"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=6088513"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5680978"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5985544"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5723758"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5716681"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5936741"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5934597"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5734858"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5756244"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5759746"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5958614"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5999721"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=6021380"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5961632"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5951783"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5983448"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5934423"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5957306"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5898425"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5959991"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5776690"
href="http://ieeexplorer.ieee.org/stamp/stamp.jsp?tp=&arnumber=5953525"
OR
$ grep -o "href.*stamp.*\"><" file.html | sed 's#href="#ieeexplore.ieee.org#;s#"><##'
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6094070
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6094072
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6094110
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6088513
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5680978
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5985544
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5723758
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5716681
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5936741
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5934597
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5734858
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5756244
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5759746
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5958614
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5999721
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6021380
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5961632
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5951783
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5983448
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5934423
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5957306
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5898425
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5959991
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5776690
ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5953525
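To actually fetch the PDFs, the cleaned-up URLs can be piped straight into wget, which reads a URL list from stdin with -i - (a sketch; note that IEEE Xplore usually requires a subscription or session cookies, so the downloads may fail without authentication):
grep -o "href.*stamp.*\"><" file.html | sed 's#href="#http://ieeexplore.ieee.org#;s#"><##' | wget -i -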
