Download all linked files from Wikipedia page - bash

I would like to use this Wikipedia page - http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives
It contains several links to .jpg images, and I would like to download all of the images into a folder. I am on a Mac.
I have tried using wget but so far have been unsuccessful.
EDIT: To clarify, I would like a script to follow every link on the page and then download from the page it lands on, because each link redirects to another page first.

You can use xmlstarlet for this purpose:
xmlstarlet sel --net --html -t -m "//img" -v "@src" -n 'http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives'
will give you all the src fields of the img tags in the page at http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives.
You'll notice that the output lines are missing a leading http:, so we'll have to add it.
Then:
while IFS= read -r line; do
    [[ $line = //* ]] && line="http:$line"
    wget "$line"
done < <(
    xmlstarlet sel --net --html -t -m "//img" -v "@src" -n 'http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives'
)
should retrieve the image files.
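Since you want the files collected into a folder, you can also hand wget a target directory with -P; for example, change the wget line in the loop above to the following (the images directory name is just an example, wget creates it if needed):
wget -P images "$line"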
From your comment I now understand your requirement: you want to get all the href fields of the a nodes that contain an img node. An xpath that fulfills this requirement is:
//a[img]
Hence,
xmlstarlet sel --net --html -t -m "//a[img]" -v "@href" -n 'http://en.wikipedia.org/wiki/Current_members_of_the_United_States_House_of_Representatives'
will get you these hrefs.
Now, the URL that is retrieved is not directly the image you want to download; instead it's another HTML page that contains links to the images you want. I've selected the image in these pages with the following xpath:
//div[@class='fullImageLink']/a
that is, the a nodes inside a div node with class="fullImageLink". This seems ok, heuristically.
Then, this should do:
#!/bin/bash
base="http://en.wikipedia.org"
get_image() {
    local url=$base$1
    printf "*** %s: " "$url"
    IFS= read -r imglink < <(xmlstarlet sel --net --html -t -m "//div[@class='fullImageLink']/a" -v "@href" -n "$url")
    if [[ -z $imglink ]]; then
        echo " ERROR ***"
        return 1
    fi
    imglink="http:$imglink"
    echo " Downloading"
    wget -q "$imglink" &
}
while IFS= read -r url; do
    [[ $url = /wiki/File:* ]] || continue
    get_image "$url"
done < <(
    xmlstarlet sel --net --html -t -m "//a[img]" -v "@href" -n "$base/wiki/Current_members_of_the_United_States_House_of_Representatives"
)
You'll get a little bit more than what you want, but it's a good basis :).
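Two small additions you may want on top of that, since the goal is a single folder of images and each wget runs in the background: point wget at a target directory with -P, and wait for the downloads to finish before the script exits. A sketch of the same script with those tweaks (the images directory name is just an example):
#!/bin/bash
base="http://en.wikipedia.org"
outdir="images"     # example name; use whatever folder you like
mkdir -p "$outdir"
get_image() {
    local url=$base$1
    printf "*** %s: " "$url"
    IFS= read -r imglink < <(xmlstarlet sel --net --html -t -m "//div[@class='fullImageLink']/a" -v "@href" -n "$url")
    if [[ -z $imglink ]]; then
        echo " ERROR ***"
        return 1
    fi
    echo " Downloading"
    wget -q -P "$outdir" "http:$imglink" &
}
while IFS= read -r url; do
    [[ $url = /wiki/File:* ]] || continue
    get_image "$url"
done < <(
    xmlstarlet sel --net --html -t -m "//a[img]" -v "@href" -n "$base/wiki/Current_members_of_the_United_States_House_of_Representatives"
)
wait    # let the backgrounded wget jobs finish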

Related

Use XMLStarlet to insert a single value too long to fit on a command line

Suppose I have an xml file:
<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
<string name="a"></string>
</map>
And I want to set the value of the string element with attribute name="a" to something big:
$ xmlstarlet ed -u '/map/string[@name="a"]' -v $(for ((i=0;i<200000;i++)); do echo -n a; done) example.xml > o.xml
This fails with the bash error "Argument list too long". I was unable to find an option in xmlstarlet that accepts input from a file. So, how would I set the value of an xml tag with 200 KB+ of data?
Solution
After trying to feed chunks into xmlstarlet with the -a (append) argument, I ran into additional difficulties, such as escaping special characters and the order in which xmlstarlet accepts the chunks.
Eventually I reverted to simpler tools like xml2/sed/2xml. I am posting the code separately below.
Here is a workaround for your own example, which bombs because of the ARG_MAX limit:
#!/bin/bash
# (remove the 'echo' commands and the quotes around the '>' characters when it looks good)
echo xmlstarlet ed -u '/map/string[@name="a"]' -v '' example.xml '>' o.xml
for ((i = 0; i < 100; i++))
do
    echo xmlstarlet ed -u '/map/string[@name="a"]' -a -v $(for ((i=0;i<2000;i++)); do echo -n a; done) example.xml '>>' o.xml
done
SOLUTION
I am not proud of it, but at least it works.
a.xml - what was proposed as an example in the starting post
source.txt - what has to be inserted into a.xml as xml tag
b.xml - output
#!/usr/bin/env bash
ixml="a.xml"
oxml="b.xml"
s="source.txt"
echo "$ixml --> $oxml"
t="$ixml.xml2"
t2="$ixml.xml2.edited"
t3="$ixml.2xml"
# Convert xml into simple string representation
cat "$ixml" | xml2 > "$t"
# Get the line number of the xml tag of interest, increment it by one and delete everything from that point on
# For this to work, the tag of interest should be at the very end of the xml file
cat "$t" | grep -n -E 'string.*name=.*a' | cut -f1 -d: | xargs -I{} echo "{}+1" | bc | xargs -I{} sed '{},$d' "$t" > "$t2"
# Rebuild the deleted end of the xml2-file with the escaped content of s-file and convert everything back to xml
# * The apostrophe escape is necessary for apk xml files
sed "s:':\\\':g" "$s" | sed -e 's:^:/map/string=:' >> "$t2"
cat "$t2" | 2xml > "$t3"
# Make xml more readable
xmllint --pretty 1 --encode utf-8 "$t3" > "$oxml"
# Delete temporary files
rm -f "$t"
rm -f "$t2"
rm -f "$t3"

Download URLs from CSV into subdirectory given in first field

I want to export my products to my new website. I have a csv file with this data:
product id,image1,image2,image3,image4,image5
1,https://img.url/img1-1.png,https://img.url/img1-2.png,https://img.url/img1-3.png,https://img.url/img1-4.png,https://img.url/img1-5.png
2,https://img.url/img2-1.png,https://img.url/img2-2.png,https://img.url/img2-3.png,https://img.url/img2-4.png,https://img.url/img2-5.png
What I want to do is make a script that reads from that file, makes a directory named after the product id, downloads the product's images and puts them inside their own folder (folder 1 => image1-image5 of product id 1, folder 2 => image1-image5 of product id 2, and so on).
I can make a plain text file instead of using the excel format if that's easier. Thanks in advance.
Sorry I'm really new here. I haven't done the code yet because I'm clueless, but what I want to do is something like this:
for id in $product_id; do
mkdir $id && cd $id && curl -o $img1 $img2 $img3 $img4 $img5 && cd ..
done
Here is a quick and dirty attempt which should hopefully at least give you an idea of how to handle this.
#!/bin/bash
tr ',' ' ' <products.csv |
while read -r prod urls; do
    mkdir -p "$prod"
    # Potential bug: urls mustn't contain shell metacharacters
    for url in $urls; do
        wget -P "$prod" "$url"
    done
done
You could equivalently do ( cd "$prod" && curl -O "$url" ) if you prefer curl; I generally do, though the availability of an option to set the output directory with wget is convenient.
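For reference, a minimal curl-based sketch of the same loop (same caveat about shell metacharacters in the URLs; -O saves each file under its remote name, and -sS keeps curl quiet while still reporting errors):
tr ',' ' ' <products.csv |
while read -r prod urls; do
    mkdir -p "$prod"
    for url in $urls; do
        # download into the product's directory, keeping the remote file name
        ( cd "$prod" && curl -sS -O "$url" )
    done
done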
If your CSV contains quotes around the fields or you need to handle URLs which contain shell metacharacters (irregular spaces, wildcards which happen to match files in the current directory, etc; but most prominently & which means to run a shell command in the background) perhaps try something like
while IFS=, read -r prod url1 url2 url3 url4 url5; do
    mkdir -p "$prod"
    wget -P "$prod" "$url1"
    wget -P "$prod" "$url2"
    : etc
done <products.csv
which (modulo the fixed quoting) is pretty close to your attempt.
Or perhaps switch to a less wacky input format, maybe generate it on the fly from the CSV with
awk -F , 'function trim (value) {
    # Trim leading and trailing double quotes
    sub(/^"/, "", value); sub(/"$/, "", value);
    return value; }
  { prod=trim($1);
    for(i=2; i<=NF; ++i) {
      # print space-separated prod, url
      print prod, trim($i) } }' products.csv |
while read -r prod url; do
    mkdir -p "$prod"
    wget -P "$prod" "$url"
done
which splits the CSV into repeated lines with the same product ID and one URL each, with any CSV quoting removed, and then just loops over that instead. mkdir with the -p option helpfully doesn't mind if the directory already exists.
If you followed the good advice that @Aaron gave you, this code can help you. As you seem to be new to bash, I commented the code for better comprehension.
#!/bin/bash
# your csv file
myFile=products.csv
# number of lines of the file
nLines=$(wc -l $myFile | awk '{print $1}')
echo "Total Lines=$nLines"
# loop over the data lines of the file (line 1 is the header, so start at 2)
for i in `seq 2 $nLines`;
do
    line=$(sed -n ${i}p $myFile)
    # first column value (the product id)
    id=$(echo $line | awk -F "," '{print $1}')
    # create the folder if it does not exist
    mkdir $id 2>/dev/null
    # number of fields in the line (field 1 is the id, the rest are image urls)
    nFields=$(echo $line | awk -F "," '{print NF}')
    # go to the id folder
    cd $id
    # loop over the image urls of the line
    for j in `seq 2 $nFields`;
    do
        # get the image url and download it
        img=$(echo $line | cut -d "," -f $j)
        echo "Downloading image $img";echo
        wget $img
    done
    # go back
    cd ..
done

Quick search to find active urls

I'm trying to use cURL to find active redirections and save the results to a file. I know the URL is active when it redirects at least once to a specific website. So I came up with:
if (( $( curl -I -L https://mywebpage.com/id=00001&somehashnumber&si=0 | grep -c "/something/" ) > 1 )) ; then echo https://mywebpage.com/id=00001&somehashnumber&si=0 | grep -o -P 'id=.{0,5}' >> id.txt; else echo 404; fi
And it works, but how to modify it to check id range from 00001 to 99999?
You'll want to wrap the whole operation in a for loop and use a formatted sequence to print the ids you'd like to test. Without knowing too much about the task at hand, I would write something like this to test the ids:
$ for i in $(seq -f "%05g" 1 99999); do curl --silent "example.com/id=$i" --write-out "$i %{response_code}\n" --output /dev/null; done
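If you also want to keep your original redirect check and collect the matching ids in id.txt, here is a sketch along the same lines; the URL shape and the "/something/" marker are taken from your one-liner, and the URL is quoted so the & characters are not treated as shell operators:
for i in $(seq -f "%05g" 1 99999); do
    url="https://mywebpage.com/id=$i&somehashnumber&si=0"
    # count how many header lines mention /something/ across the redirect chain
    if (( $(curl -sIL "$url" | grep -c "/something/") > 1 )); then
        echo "id=$i" >> id.txt
    else
        echo "404 for id=$i"
    fi
done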

bash How to get html code from links in txt file

I have a text document that contains URLs, each written on its own line like this:
https://google.com
https://youtube.com
This code should read each line and get the HTTP status for each URL in the file, but it can't find the URL, I guess:
exec 0<$1 #(Where $1 is param to input the file)
while IFS='' read -r line
response=$(curl --write-out %{http_code} --silent --output /dev/null $line)
[[ -n "$line" ]]
do
echo "Text read from file: $line"
Save the code below as HtmlStatus.sh.
Then create a file, for example test.txt, containing:
https://google.com
https://youtube.com
https://facebook.com
#!/bin/bash
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo -n "Read Url : $line"
    curl -sI "$line" | grep HTTP
done < "$1"
This code prints the HTTP status line for each URL in test.txt.
Run it in a terminal:
chmod +x HtmlStatus.sh
./HtmlStatus.sh test.txt
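If you would rather print just the numeric status code, as in your own attempt with --write-out, a variant of the same loop could look like this (a sketch, not the only way to do it):
#!/bin/bash
# print "URL : status code" for every non-empty line of the file given as $1
while IFS='' read -r line || [[ -n "$line" ]]; do
    [[ -z "$line" ]] && continue
    code=$(curl --silent --output /dev/null --write-out '%{http_code}' "$line")
    echo "$line : $code"
done < "$1"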

xmlstarlet to add exactly one element to a file

I have hundreds of xml files to process - some have a particular desired tag, some don't. If I just add the tag to all files then some files get 2 tags (no surprises there!). How do I do it in xmlstarlet without a clumsy grep to select the files to work on? eg:
I have this in some files:
...
<parent_tag>
<another_tag/>  <!-- but not in all files -->
</parent_tag>
I want this (but some files already have it):
...
<parent_tag>
<good_tag>foobar</good_tag>
<another_tag/>
</parent_tag>
eg this works but I wish I could do it entirely in xmlstarlet without the grep:
grep -L good_tag */config.xml | while read i; do
    xmlstarlet ed -P -S -s //parent_tag -t elem -n good_tag -v "" $i > tmp || break
    \cp tmp $i
done
I got myself tangled up in some XPATH exoticism like:
xmlstarlet sel --text --template --match //parent_tag --match "//parent_tag/node()[not(self::good_tag)]" -f --nl */config.xml
... but it's not doing what I had hoped ...
Just select only <parent_tag/> elements which do not contain a <good_tag/> for inserting:
xmlstarlet ed -P -S -s '//parent_tag[not(good_tag)]' -t elem -n good_tag -v ""
If you also want to test for the right contents of the tag:
xmlstarlet ed -P -S -s '//parent_tag[not(good_tag[.="foobar"])]' -t elem -n good_tag -v ""
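Applied to all your files, this lets you drop both the grep and the tmp-file copy, since xmlstarlet ed has a -L flag for editing files in place; a sketch, so try it on a copy of the files first:
for f in */config.xml; do
    # only files whose parent_tag lacks a good_tag are actually modified
    xmlstarlet ed -L -P -S -s '//parent_tag[not(good_tag)]' -t elem -n good_tag -v "" "$f"
done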
