I'm using wget in terminal to download a large list of images.
example — $ wget -i images.txt
I have all the image URLs in the images.txt file.
However, the image URLs tend to be like example.com/uniqueNumber/images/main_250.jpg
which means that all the images come out named main_250.jpg
What I really need is for each image to be saved with its entire URL as the filename, so that the 'unique number' is part of the filename.
Any suggestions?
Presuming the URLs for the images are in a text file named images.txt with one URL per line, you can run
cat images.txt | sed 'p;s/\//-/g' | sed 'N;s/\n/ -O /' | xargs wget
to download each and every image with a filename formed out of its URL.
Now for the explanation:
in this example I'll use https://www.newton.ac.uk/files/covers/968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
as images.txt (you can add as many images as you like to your file, as long as they are in this same format).
cat images.txt pipes the content of the file to standard output
sed 'p;s/\//-/g' prints the file to stdout with the url on one line and then the intended filename on the next line, like so:
https://www.newton.ac.uk/files/covers/968361.jpg
https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
sed 'N;s/\n/ -O /' combines the two lines of each image (the URL and the intended filename) into one line and adds the -O option in between (this is how wget knows that the second argument is the intended filename). The result for this part looks like this:
https://www.newton.ac.uk/files/covers/968361.jpg -O https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY -O https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
and finally xargs wget runs wget with each line as its arguments. The end result in this example is two images in the current directory named https:--www.newton.ac.uk-files-covers-968361.jpg and https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY respectively.
With GNU Parallel you can do:
cat images.txt | parallel wget -O '{= s:/:-:g; =}' {}
Here {} is the unchanged URL and {= s:/:-:g; =} is a perl expression applied to the input line, replacing every / with - to form the output filename.
I have a not so elegant solution, that may not work everywhere.
You probably know that if your URL ends in a query, wget will use that query in the filename. e.g. if you have http://domain/page?q=blabla, you will get a file called page?q=blabla after download. Usually, this is annoying, but you can turn it to your advantage.
Suppose you wanted to download some index.html pages and keep track of their origin, as well as avoid ending up with index.html, index.html.1, index.html.2, etc. in your download folder. Your input file urls.txt may look something like the following:
https://google.com/
https://bing.com/
https://duckduckgo.com/
If you call wget -i urls.txt you end up with those numbered index.html files. But if you "doctor" your urls with a fake query, you get useful file names.
Write a script that appends each url as a query to itself, e.g.
https://google.com/?url=https://google.com/
https://bing.com/?url=https://bing.com/
https://duckduckgo.com/?url=https://duckduckgo.com/
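One way to do that "doctoring" is a one-liner like this (just a sketch, assuming urls.txt holds one URL per line and none of the URLs already carries a query string; GNU sed's -i edits the file in place):
# append each URL to itself as a fake ?url= query
sed -i 's|.*|&?url=&|' urls.txt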
Looks cheesy, right? But if you now execute wget -i urls.txt, you get the following files:
index.html?url=https:%2F%2Fbing.com%2F
index.html?url=https:%2F%2Fduckduckgo.com%2F
index.html?url=https:%2F%2Fgoogle.com%2F
instead of non-descript numbered index.htmls. Sure, they look ugly, but you can clean up the filenames, and voilà! Each file will have its origin as its name.
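If you want to clean them up afterwards, a rough sketch (assuming the files were produced by the trick above; the target naming is just one arbitrary choice):
for f in "index.html?url="*; do
  # drop the "index.html?url=" prefix and turn the %2F escapes into a filename-safe character
  mv -- "$f" "$(printf '%s' "${f#*url=}" | sed 's/%2F/-/g').html"
done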
The approach probably has some limitations, e.g. if the site you are downloading from actually executes the query and parses the parameters, etc.
Otherwise, you'll have to solve the file name/source url problem outside of wget, either with a bash script or in other programming languages.
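A minimal sketch of that "outside of wget" route (assuming urls.txt holds one URL per line; the name mangling here is just one possible choice):
while read -r url; do
  # build a filesystem-safe name by replacing / and ? with -
  wget -O "$(printf '%s' "$url" | tr '/?' '--')" "$url"
done < urls.txt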
I want to download files with wget in a bash script. The URLs look like this:
https://xxx.blob.core.windows.net/somefolder/$somefile.webm?sv=xxx&sp=r..
The problem is the dollar sign ($) in the URL.
When I download the file with the URL in double quotes I get a 403, because the $ sign is probably interpreted:
wget "https://xxx.blob.core.windows.net/somefolder/$somefile.webm?sv=xxx&sp=r.."
When I single quote the url and download the file everything goes well:
wget 'https://xxx.blob.core.windows.net/somefolder/$somefile.webm?sv=xxx&sp=r..'
But the URL should come from a line in a text file. So I read the file and pass the lines as URLs:
files=($(< file.txt))
# Read through the url.txt file and execute wget command for every filename
while IFS='=| ' read -r param uri; do
for file in "${files[@]}"; do
wget "${file}"
done
done < file.txt
I get the 403 here as well and don't know how to prevent the terminal from interpreting the dollar sign. How can I achieve this?
But the url should come from a line in a text file.
If you have a file with 1 URL per line, or are able to easily alter your file to hold 1 URL per line, then you might use the -i file option. From the wget man page:
-i file
--input-file=file
Read URLs from a local or external file. If - is specified as file, URLs are read from the standard input. (Use ./- to
read from a file literally named -.)
If this function is used, no URLs need be present on the command line.
If there are URLs both on the command line and in an input file, those
on the command lines will be the first ones to be retrieved. If
--force-html is not specified, then file should consist of a series of URLs, one per line.(...)
So if you have a single file, say urls.txt, you might use it like so
wget -i urls.txt
and if you have a few files you might concatenate them and pipe them through standard input like so
cat urls1.txt urls2.txt urls3.txt | wget -i -
If the file(s) contain additional data, then remember to process them so GNU wget gets only URLs.
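For instance, if each line looked like somename=URL (just an assumed format for illustration), you could strip everything up to the first = before feeding the result to wget:
# keep only what follows the first = on each line; cut rejoins the remaining fields with =
cut -d= -f2- file.txt | wget -i -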
I want to import a number of files into my server using wget; the 492 files are here:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736
I want to copy the URLs of all the files in the "File Name" column, save them into a file, and import them with wget.
So how can I copy all those URLs from that column?
Thanks for reading :)
Since you've tagged bash, this should work.
wget -O- is used to output the data to the standard output, where it's greppable. (curl would do that by default.)
grep -oE is used to capture the URLs (which happily are in a regular enough format that a simple regexp works).
Then, wget -i is used to read URLs from the file generated. You might wish to add -nc or other suitable partial-fetch flags; those files are pretty hefty.
wget -O- https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 | grep -oE 'http://ftp.sra.ebi.ac.uk/[^"]+' > urls.txt
wget -i urls.txt
First, I recommend using a more specific and robust implementation...
but, in case you are up against a wall and in a hurry -
$: curl -s https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 |
sed -En '/href="http:\/\/.*clean.fastq.gz"/{s/^.*href="([^"]+)".*/\1/;p;}' |
while read url; do wget "$url"; done
This is a quick and dirty rough first pass, but it will give you something to work with.
If you aren't in a screaming hurry, try writing something more robust and step-wise in perl or python.
1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) CSV (as fast/simultaneously as possible) and name each output after a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried mainly two styles.
1. Using the download tool's built-in support
Take aria2c as an example: it supports the -i option to import a file of URLs to download, and (I think) it will process them in parallel for maximum speed. It does have a --force-sequential option to force downloads in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into pieces and run a script like the following to process each piece:
#!/bin/bash
INPUT=$1
while IFS=, read serino url
do
aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this means that aria2c is restarted for each line, which seems to cost time and lower the speed.
Though one can run the script multiple times to get 'shell-level' parallelism, that does not seem to be the best way.
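A rough sketch of what I mean by that (assuming the script above is saved as fetch.sh and the CSV is input.csv):
# split the CSV into 4 pieces without breaking lines (GNU split), named chunk_aa, chunk_ab, ...
split -n l/4 input.csv chunk_
# run the script on each chunk in the background, then wait for all of them
for f in chunk_*; do bash fetch.sh "$f" & done
wait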
Any suggestions?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However this won't create files 001.jpg, 002.jpg, … but 001, 002, … since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option lets xargs run as many commands in parallel as possible; -n 2 passes each line's serial number and URL as $1 and $2 to the bash -c command.
I am trying to download some files using wget. I have stored all the links in a .txt file. When I read that file with the command wget -i <filename>.txt, the download starts, but a notice is generated saying that the file name is too long, after which the download process is terminated.
How can I rename the files so that the file names stay within an acceptable length and the download continues?
Is there something like wget -O <target filename> <URL> for renaming files when they are downloaded from a .txt file?
I do not believe that this functionality exists in wget. You should probably loop through the file in a Perl or shell script, or something similar.
The example below is modified from one at ubuntuforums.org. With minor modifications you could adapt the output file names to your needs. As written, it limits the file name to the first 50 characters of the URL.
#!/bin/bash
# read each URL from links.txt and save it under a name made of its first 50 characters
while read -r link
do
    output=$(echo "$link" | cut -c 1-50)
    wget "$link" -O "$output"
done < ./links.txt
Using bash as a helper
for line in $(cat input.txt); do wget "$line"; done
You'll have to determine what you want the output names to be yourself; otherwise wget will download them to whatever filename is in the URL (e.g. blah.html) or to index.html (if the URL ends in a slash).
Dump all the files to one monolithic file
There is another option with wget, which is to use --output-document=file. It concatenates all the downloaded files into one file.
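For example (just a sketch; links.txt and all.html are assumed names):
wget --output-document=all.html -i links.txt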
I am running this:
cat Human_List.txt | xargs -I "%" curl -O http://www.pdb.org/pdb/files/"%".pdb
As you can see, it takes each file name from my Human_List.txt and substitutes it into the URL. As a save option I have -O, which takes the file name from the URL and saves the download using that line's name from the Human_List.
WHAT I NEED is to save each one of those .pdb files with a Human_"foobar".pdb file name so I can differentiate between my different list downloads. Otherwise I cannot tell which of the downloads came from which list.
Thank You
$ while read -r LINE;
do curl -o "Human_${LINE}.pdb" "http://www.pdb.org/pdb/files/${LINE}.pdb";
done < Human_List.txt
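If you would rather keep the xargs style from the question, something like this should also work (a sketch using the same URL pattern):
cat Human_List.txt | xargs -I % curl -o Human_%.pdb http://www.pdb.org/pdb/files/%.pdb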