I am trying to download a PDF file from a website. I know the name of the file, e.g. foo.pdf, but its location changes every few weeks:
e.g.
www.server.com/media/123456/foo.pdf
changes into
www.server.com/media/245415/foo.pdf
The number is always a six-digit number, so I tried using a bash script to go through all one million of them, but that obviously takes a lot of time:
i=0
until [ "$RC" == "0" ] || [ "$i" == "1000000" ]
do
    b=$(printf %06d "$i")
    wget -q "http://www.server.com/media/${b}/foo.pdf" -O bar.pdf
    RC=$?
    i=$((i + 1))
done
For wrong addresses I just get 404 errors.
I tested it on numbers around the currently correct address, and it works.
Does anyone know a faster way to solve this problem?
If that page is linked from anywhere else, then you can get the link from there, and just get the file. If it's not, you are probably out of luck.
Note that most server operators would consider hitting the webserver 1,000,000 times abuse, and might ban your IP for even trying.
Track the values over time and work out whether they follow a pattern. As zigdon said above though, if you have the page that links to the file, just wget that first and follow the link to the PDF.
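For example, a minimal sketch of that approach, assuming the PDF is linked from a hypothetical overview page at http://www.server.com/downloads.html:

# Fetch the page that links to the PDF, extract the current /media/NNNNNN/foo.pdf path,
# then download the file itself. The page URL here is only a placeholder.
link=$(wget -qO- http://www.server.com/downloads.html | grep -oE '/media/[0-9]{6}/foo\.pdf' | head -n 1)
wget -q "http://www.server.com${link}" -O foo.pdf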
Related
I am downloading files using wget (curl would work too) like so,
wget somesite.com/files/{1..1000}.txt
I only want to download the files that are larger than a minimum size. File size is the only criterion I can use to determine whether I want a file; the file names are not descriptive, and they all have the same extension.
As I understand it, when the request is made to the server, it returns the size of the file before the download starts, so it should be possible to reject the file without needing to download it.
Is there a flag for wget or curl that can do this, or a script that adds this functionality? I found two similar questions, here and here for curl & wget respectively, but neither had an answer that met these requirements. I am looking to avoid downloading the file and then rejecting it afterwards.
Alternatively, is there another terminal-based tool I can use that can do this?
Yes, you could take a look at xidel:
$ xidel -s --method=HEAD https://www.somesite.com/files/{1..1000}.txt \
-f '$url[substring-after($headers,"Content-Length: ") gt 51200]' \
--download .
--method=HEAD prevents the entire content of these text-files from being read.
-f "follows" / opens the content of urls. In this case only those $urls that, for instance, are larger than 50KB (51200 bytes).
--download downloads those text-files to the current dir.
Alternatively you can do everything with an extraction-query:
$ xidel -se '
for $x in (1 to 1000) ! x"https://www.somesite.com/files/{.}.txt"
where substring-after(
x:request({"method":"HEAD","url":$x})/headers,
"Content-Length: "
) gt 51200
return
x:request({"url":$x})/file:write-binary(
extract(url,".+/(.+)",1),
string-to-base64Binary(raw)
)
'
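If xidel isn't available, a rough equivalent with curl and wget (just a sketch, and it assumes the server actually returns a Content-Length header for HEAD requests):

# Request only the headers (-I sends a HEAD request), read Content-Length,
# and download the file only when it is larger than 50 KB (51200 bytes).
for i in $(seq 1 1000); do
    url="https://www.somesite.com/files/${i}.txt"
    size=$(curl -sI "$url" | awk 'tolower($1) == "content-length:" {print $2}' | tr -d '\r')
    if [ -n "$size" ] && [ "$size" -gt 51200 ]; then
        wget -q "$url"
    fi
done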
I am very new to coding or doing anything like this. I have a list of several thousand URLs in Excel. Each URL is associated with one of approximately 300 numbers; one column is the URL and the next column is the number that URL is associated with. For example, I have five URLs associated with the number 1, four URLs associated with the number 2, etc. I am trying to download the files found at those URLs while maintaining the organization I have through the associated numbers: all of the files from URLs associated with 1 go into one folder, all of the files from URLs associated with 2 go into a separate folder, etc.
I believe that using bash scripting and wget is the pathway towards this, but I am struggling to figure out the correct series of commands. I would appreciate any help people could give me.
I don't expect anybody to just do this for me, but I would appreciate any helpful hints or useful resources or guides that people could point me towards. Thanks!
I believe that saving my Excel sheet as a CSV would be part of the correct path forward, but I have very little idea of what I am doing.
Generally folks are expected to post what they've tried so far. But since you're brand new here, let's see if we can at least get you off the ground.
#!/bin/bash
# Example input file urls.csv
# http://foo.com,2
# http://bar.com,7
# Reference for the "wget" command I used - https://www.guyrutenberg.com/2014/05/02/make-offline-mirror-of-a-site-using-wget/
#
# Split the file on the comma and loop through the url / ID pairs
#
awk -F, '{print $1" "$2}' urls.csv | while read -r url id
do
    echo "Getting url $url ID $id"
    #
    # Make the directory if it doesn't exist, and change directory into it
    #
    mkdir -p "$id"
    cd "$id" || continue
    #
    # Execute the wget
    #
    wget --mirror --convert-links --adjust-extension --page-requisites --no-parent "$url"
    #
    # Change directory back up to the parent
    #
    cd ..
done
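If each URL points to a single file rather than a whole page to mirror, a simpler variant (a sketch under the same urls.csv assumption) skips the directory changes entirely by using wget's -P option:

#!/bin/bash
# Sketch: download each URL into a directory named after its ID column.
while IFS=, read -r url id
do
    mkdir -p "$id"
    wget -q -P "$id" "$url"   # -P / --directory-prefix sets where the file is saved
done < urls.csv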
I've been trying to make this if-then-else script but it's wrong every time.
I am brand new at this, and maybe I have bitten off more than I can chew, but I am trying to make a script that runs at launch and does this:
If file exists /Users/Bob/Library/Safari/Bookmarks.plist
then do nothing
if the file does not exist (I assume this is "else"), then move the file from /Library/Application\ Support/Bookmarks.plist to /Users/Bob/Library/Safari/Bookmarks.plist
Any feedback is greatly appreciated, thanks in advance.
What you described could be written like this in pseudocode:
if file_exists("/Users/Bob/Library/Safari/Bookmarks.plist")
exit
if not file_exists("/Users/Bob/Library/Safari/Bookmarks.plist")
move("/Library/Application\ Support/Bookmarks.plist", "/Users/Bob/Library/Safari/Bookmarks.plist")
As you can see, the first "branch" is not really used, as you plan to do nothing in that case anyway.
But if we assume that in the first case you actually want to do something, like print a nice message, you could write it like this:
if file_exists("/Users/Bob/Library/Safari/Bookmarks.plist")
print("Nothing to do!")
else
move("/Library/Application\ Support/Bookmarks.plist", "/Users/Bob/Library/Safari/Bookmarks.plist")
As you can see, in this example the "else" part is equivalent to saying "if not file_exists(...)". You could also reverse the order by using "not":
if not file_exists("/Users/Bob/Library/Safari/Bookmarks.plist")
move("/Library/Application\ Support/Bookmarks.plist", "/Users/Bob/Library/Safari/Bookmarks.plist")
else
print("Nothing to do!")
DST="/Users/Bob/Library/Safari/Bookmarks.plist"
SRC="/Library/Application\ Support/Bookmarks.plist"
if [ ! -f "$DST" ]; then
mv "$SRC" "$DST"
fi
This will see if the destination exists. If not, it will move the file. Realistically there should be additional error checking, and I'm not certain a move is the best approach, since there would no longer be an "original" file.
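If keeping the original around matters, copying instead of moving is a small change; a sketch along the same lines:

DST="/Users/Bob/Library/Safari/Bookmarks.plist"
SRC="/Library/Application Support/Bookmarks.plist"
if [ ! -f "$DST" ]; then
    cp -p "$SRC" "$DST"   # -p preserves the original file's mode and timestamps
fi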
I need to find (preferably) or build an app for a lot of images.
Each image has a distinct URL. There are many thousands, so doing it manually is a huge effort.
The list is currently in a CSV file. It is essentially a list of products, each with identifying info (name, brand, barcode, etc.) and a link to a product image.
I'd like to loop through the list, and download each image file. Ideally I'd like to rename each one - something like barcode.jpg.
I've looked at a number of image scrapers, but haven't found one that works quite this way.
Very appreciative of any leads to the right tool, or ideas...
Are you on Windows or Mac/Linux? On Windows you can use a PowerShell script for this; on Mac/Linux, a shell script with about 1-5 lines of code.
Here's one way to do this:
# show what's inside the file
cat urlsofproducts.csv
http://bit.ly/noexist/obj101.jpg, screwdriver, blackndecker
http://bit.ly/noexist/obj102.jpg, screwdriver, acme
# this one-liner will GENERATE one download-command per item, but will not execute them
perl -MFile::Basename -F", " -anlE "say qq(wget -q \$F[0] -O '\$F[1]--\$F[2]--). basename(\$F[0]) .q(')" urlsofproducts.csv
# Output :
wget -q http://bit.ly/noexist/obj101.jpg -O ' screwdriver-- blackndecker--obj101.jpg'
wget -q http://bit.ly/noexist/obj102.jpg -O ' screwdriver-- acme--obj102.jpg'
Now feed the generated wget commands back into the shell, for example by piping the one-liner's output to bash.
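If your CSV also carries the barcode in its own column (the four-column url,name,brand,barcode layout below is an assumption; adjust the field order to match your file), a plain bash loop can do the download and the barcode.jpg renaming directly:

# Sketch: assumes lines like  url,name,brand,barcode  with no spaces after the commas.
while IFS=, read -r url name brand barcode
do
    wget -q "$url" -O "${barcode}.jpg"
done < urlsofproducts.csv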
If possible, use Google Sheets to run a function for this kind of work. I was also puzzled by this one and eventually found a way in which the images are not only downloaded but also renamed in real time.
Reply if you want the code.
I have a shell script. A cron job runs it once a day. At the moment it just downloads a file from the web using wget, appends a timestamp to the filename, then compresses it. Basic stuff.
This file doesn't change very frequently though, so I want to discard the newly downloaded copy if its content is identical to what I already have.
Easiest way to do this?
Thanks!
Do you really need to compress the file?
wget provides -N, --timestamping, which turns on time-stamping. Say your file is located at www.example.com/file.txt.
The first time you do:
$ wget -N www.example.com/file.txt
[...]
[...] file.txt saved [..size..]
The next time it'll be like this:
$ wget -N www.example.com/file.txt
Server file no newer than local file “file.txt” -- not retrieving.
Except if the file on the server was updated.
That would solve your problem, if you didn't compress the file.
If you really need to compress it, then I guess I'd go with comparing the hash of the new file/archive against the old one. What matters in that case is: how big is the downloaded file? Is it worth compressing it first and then checking the hashes? Is it worth decompressing the old archive and comparing the hashes? Is it better to store the old hash in a text file? Does any of this have an advantage over just overwriting the old file?
Only you know that; run some tests.
So if you go the hash way, consider sha256 and xz (lzma2 algorithm) compression.
I would do something like this (in Bash):
# Download to stdout, keep a copy in file.txt, and hash it in one pass
newfilesum="$(wget -q www.example.com/file.txt -O- | tee file.txt | sha256sum)"
# Hash of the previously downloaded (compressed) copy
oldfilesum="$(xzcat file.txt.xz | sha256sum)"
if [[ $newfilesum != $oldfilesum ]]; then
    xz -f file.txt # overwrite with the new compressed data
else
    rm file.txt
fi
and that's done.
Calculate a hash of the content of the file and check it against the new one, for instance with md5sum. You only have to save the last MD5 sum to check whether the file changed.
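A minimal sketch of that idea (the URL, the file names, and the choice of gzip for the compression step are placeholders):

# Download, compare against the stored checksum, and keep the file only if it changed.
wget -q http://www.example.com/file.txt -O file.new
newsum=$(md5sum file.new | awk '{print $1}')
oldsum=$(cat last.md5 2>/dev/null)
if [ "$newsum" != "$oldsum" ]; then
    echo "$newsum" > last.md5
    mv file.new "file-$(date +%F).txt"
    gzip "file-$(date +%F).txt"   # compress however the existing script already does; gzip is just an example
else
    rm file.new
fi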
Also, take into account that the web is evolving to give more information on pages, that is, metadata. A well-built web site should include the file version and/or date of modification (or a valid Expires header) as part of the response headers. This, among other things, is what makes up the scalability of Web 2.0.
How about downloading the file, and checking it against a "last saved" file?
For example, the first time, the script downloads myfile, saves it as myfile-[date], and compresses it. It also adds a symbolic link, such as lastfile, pointing to myfile-[date]. The next time the script runs, it can check whether the contents of whatever lastfile points to are the same as the newly downloaded file.
Don't know if this would work well, but it's what I could think of.
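A rough sketch of that approach (the names and URL are placeholders; the uncompressed copy is kept so the next run has something to compare against):

# Compare the fresh download with whatever "lastfile" points to.
wget -q http://www.example.com/myfile -O myfile.new
if [ -e lastfile ] && cmp -s myfile.new lastfile; then
    rm myfile.new                        # unchanged, discard the new copy
else
    mv myfile.new "myfile-$(date +%F)"
    gzip -k "myfile-$(date +%F)"         # -k keeps the uncompressed file for the next comparison
    ln -sfn "myfile-$(date +%F)" lastfile
fi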
You can compare the new file with the last one using the sum command, which computes a checksum of the file. If both files have the same checksum, they are very, very likely to be exactly the same. There's another command called md5 that computes the MD5 fingerprint, but the sum command is available on all systems.