How to download URLs in a CSV and name outputs based on a column value - bash

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) CSV as fast as possible (ideally in parallel), naming each output file after a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
A folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I tried two main approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports an -i option to import a file of URLs to download, and (I think) it processes them in parallel for maximum speed. It does have a --force-sequential option to force downloads in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into pieces and run a script like the following on each piece:
#!/bin/bash
INPUT=$1
while IFS=, read -r serino url
do
    aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this restarts aria2c for every line, which costs time and lowers the overall speed.
One could run the script several times in parallel to get 'shell-level' parallelism, but that does not seem like the best approach.
Any suggestions?
Thank you,

aria2c supports so called option lines in input files. From man aria2c
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However, this won't create files 001.jpg, 002.jpg, … but 001, 002, …, since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
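If some URLs might lack an extension, a slightly more defensive conversion could fall back to a fixed suffix. The following awk sketch is an illustration only; it assumes a plain two-column CSV and defaults to .jpg when no extension can be found:
awk -F, '{
    ext = ".jpg"                       # assumed default when the URL has no extension
    if (match($2, /\.[A-Za-z0-9]+$/))  # otherwise take the extension from the end of the URL
        ext = substr($2, RSTART)
    print $2
    print " out=" $1 ext
}' file.csv | aria2c -i -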

Using only standard utilities, you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option tells GNU xargs to run as many commands in parallel as possible.
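If an unlimited number of simultaneous downloads is too aggressive for the server or your connection, the same pipeline can be capped; the limit of 8 jobs below is an arbitrary choice, not part of the original answer:
tr '\n' ',' < file.csv |
xargs -P 8 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -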

Related

curl: (3) URL using bad/illegal format or missing URL in bash Windows

I am trying to download PDF files from a list of URLs in a .txt file, with one URL per line. ('urls.txt')
When I use the following command, where the URL I used is an exact copy-paste of the first line of the .txt file:
$ curl http://www.isuresults.com/results/season1617/gpchn2016/gpchn2016_protocol.pdf -o 'test.pdf'
the pdf downloads perfectly. However when I use this command:
xargs -n 1 curl -O < urls.txt
Then I receive a 'curl: (3) URL using bad/illegal format or missing URL' error once for each URL listed in the .txt file. I have tested many of the URLs individually, and they all seem to download properly.
How can I fix this?
Edit - the first three lines of urls.txt reads as follows:
http://www.isuresults.com/results/season1718/gpf1718/gpf2017_protocol.pdf
http://www.isuresults.com/results/season1718/gpcan2017/gpcan2017_protocol.pdf
http://www.isuresults.com/results/season1718/gprus2017/gprus2017_protocol.pdf
SOLVED: As per the comment below, the issue was that the .txt file was in DOS/Windows format. I converted it using the line:
$ dos2unix urls.txt
and then the files downloaded perfectly using my original line of code. See this thread for more info: Are shell scripts sensitive to encoding and line endings?
Thank you to all who responded!
Try using
xargs -n 1 -t -a urls.txt curl -O
here the -a option reads list from a file rather than standard input
EDIT:
As @GordonDavisson mentioned, it looks like you may have a file with DOS line endings. You can clean these up using sed before passing the list to xargs:
sed 's/\r//' < urls.txt | xargs -n 1 -t curl -O

xargs -a [file] mv -t [new-directory] gives me mv: cannot stat `filename*': No such file or directory error

I have been trying to run this command (that I have run before in a different directory), and everything I've read on the message boards has not solved my unknown issue.
Of note: 1) the files exist in this directory 2) I have proper permissions to move these files around 3) I have run this exact line of code before and it has worked. 4) I tried listing the files with and without the wildcard to capture all the files (see below). 5) I also tried to list each file as 'Sample1', but that did not work.
xargs -a [filename.txt] mv -t [new-directory]
I have file beginnings (I have ~5 file for each beginning), and I want to move all the files associated with that beginning.
Example: Sample1.bam Sample1.sorted.bam, etc
The lines in the file are listed as such:
Sample1*
Sample2*
Sample3* ...etc.
What am I doing incorrectly and how can I fix it?
TIA!
When you execute a command using 'xargs', the arguments are passed directly to the called program ('mv' in your case). Wildcard patterns in the input are not expanded - 'Sample1*' is passed as-is to 'mv', which issues an error message about not having a file named 'Sample1*'.
To get file name expansion, you want to use the shell. One way to handle this situation is
xargs -a FILENAME.TXT -I__ sh -c "mv -t NEW-FOLDER -- __"
Security note: the code provides some protection against command-line injection (e.g., a file name starting with '-'). However, other attacks are still possible. A safer version is
cat FILENAME.txt | grep '^[A-Za-z0-9][A-Za-z0-9._-]*$' | xargs -I__ sh -c "mv -t NEW-FOLDER -- __"
which limits the input to names made of alphanumeric characters (plus '.', '_' and '-'); note that this also filters out wildcard patterns such as 'Sample1*'. The grep pattern can be extended as needed.
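Another option along the same lines (a sketch, not from the original answer) is to let a small shell loop do the glob expansion; it assumes the same FILENAME.txt and NEW-FOLDER placeholders used above:
while IFS= read -r pattern; do
    # the variable is deliberately left unquoted so the shell expands the wildcard
    mv -t NEW-FOLDER -- $pattern
done < FILENAME.txt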
With GNU Parallel you would do something like:
cat FILENAME.txt | parallel mv {} NEW-FOLDER
One of the benefits of GNU Parallel is that it deals correctly with file names like:
My brother's 12" records cost > $1000.txt

Use full URL as saved file name with wget

I'm using wget in terminal to download a large list of images.
example — $ wget -i images.txt
I have all the image URLs in the images.txt file.
However, the image URLs tend to be like example.com/uniqueNumber/images/main_250.jpg
which means that all the images come out named main_250.jpg
What I really need is for each image to be saved with its entire URL as the file name, so that the 'unique number' is part of the filename.
Any suggestions?
Presuming the URLs for the images are in a text file named images.txt with one URL per line, you can run
cat images.txt | sed 'p;s/\//-/g' | sed 'N;s/\n/ -O /' | xargs -L 1 wget
to download each and every image with a filename formed out of its URL.
Now for the explanation:
in this example I'll use the following as images.txt (you can add as many images as you like to your file, as long as they are in this same format):
https://www.newton.ac.uk/files/covers/968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
cat images.txt pipes the content of the file to standard output
sed 'p;s/\//-/g' prints the file to stdout with the url on one line and then the intended filename on the next line, like so:
https://www.newton.ac.uk/files/covers/968361.jpg
https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
sed 'N;s/\n/ -O /' combines the two lines of each image (the URL and the intended filename) into one line and adds the -O option in between (this tells wget that the second argument is the intended filename). The result for this part looks like this:
https://www.newton.ac.uk/files/covers/968361.jpg -O https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY -O https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
and finally xargs -L 1 wget runs wget once per line. The end result in this example is two images in the current directory named https:--www.newton.ac.uk-files-covers-968361.jpg and https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY respectively.
With GNU Parallel you can do:
cat images.txt | parallel wget -O '{= s:/:-:g; =}' {}
Here the {= s:/:-:g; =} replacement string applies a Perl substitution that turns every / in the URL into -, just like the sed pipeline above.
I have a not so elegant solution, that may not work everywhere.
You probably know that if your URL ends in a query, wget will use that query in the filename. For example, if you have http://domain/page?q=blabla, you will get a file called page?q=blabla after the download. Usually this is annoying, but you can turn it to your advantage.
Suppose you wanted to download some index.html pages, wanted to keep track of their origin, and wanted to avoid ending up with index.html, index.html.1, index.html.2, etc. in your download folder. Your input file urls.txt may look something like the following:
https://google.com/
https://bing.com/
https://duckduckgo.com/
If you call wget -i urls.txt you end up with those numbered index.html files. But if you "doctor" your urls with a fake query, you get useful file names.
Write a script that appends each url as a query to itself, e.g.
https://google.com/?url=https://google.com/
https://bing.com/?url=https://bing.com/
https://duckduckgo.com/?url=https://duckduckgo.com/
Looks cheesy, right? But if you now execute wget -i urls.txt, you get the following files:
index.html?url=https:%2F%2Fbing.com%2F
index.html?url=https:%2F%2Fduckduckgo.com%2F
index.html?url=https:%2F%2Fgoogle.com%2F
instead of non-descript numbered index.htmls. Sure, they look ugly, but you can clean up the filenames, and voilà! Each file will have its origin as its name.
The approach probably has some limitations, e.g. if the site you are downloading from actually executes the query and parses the parameters, etc.
Otherwise, you'll have to solve the file name/source url problem outside of wget, either with a bash script or in other programming languages.
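For completeness, the 'doctoring' step itself can be a one-liner; this is only a sketch and assumes the URLs carry no query string of their own:
sed -E 's|(.*)|\1?url=\1|' urls.txt > doctored.txt
wget -i doctored.txt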

Is it possible to split a huge text file (based on number of lines) while unpacking a .tar.gz archive, if I cannot extract that file as a whole?

I have a .tar.gz file. It contains one 20 GB text file with 20.5 million lines. I cannot extract this file as a whole and save it to disk. I must do one of the following:
Specify a number of lines in each file - say, 1 million, - and get 21 files. This would be a preferred option.
Extract a part of that file based on line numbers, that is, say, from 1000001 to 2000001, to get a file with 1M lines. I will have to repeat this step 21 times with different parameters, which is very bad.
Is it possible at all?
This answer - bash: extract only part of tar.gz archive - describes a different problem.
To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:
tar Oxzf f.tar.gz | split -l1000000
The above will name the output files by the default method. If you prefer the output files to be named prefix.nn where nn is a sequence number, then use:
tar Oxzf f.tar.gz | split -dl1000000 - prefix.
Under this approach:
The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.
The .tar.gz file is read only once.
split, through its many options, has a great deal of flexibility.
Explanation
For the tar command:
O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.
x tells tar to extract the file (as opposed to, say, creating an archive).
z tells tar that the archive is in gzip format. On modern tars, this is optional.
f tells tar to use, as input, the file name specified.
For the split command:
-l tells split to split files limited by number of lines (as opposed to, say, bytes).
-d tells split to use numeric suffixes for the output files.
- tells split to get its input from stdin
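As an illustration of that flexibility, GNU split can also control the suffix style and length; the names below are only an example:
# three-digit numeric suffixes plus a .txt extension: prefix.000.txt, prefix.001.txt, ...
tar Oxzf f.tar.gz | split -l1000000 -d -a 3 --additional-suffix=.txt - prefix.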
You can use the --to-stdout (or -O) option in tar to send the output to stdout.
Then use sed to specify which set of lines you want.
#!/bin/bash
l=1                 # first line of the current chunk
inc=1000000         # lines per chunk
p=1                 # part number
while test $l -lt 21000000; do
    e=$(($l + $inc - 1))    # last line of the current chunk
    tar -xzf myfile.tar.gz --to-stdout file-to-extract.txt |
        sed -n -e "$l,$e p" > part$p.txt
    l=$(($l + $inc))
    p=$(($p + 1))
done
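Note that this loop decompresses the whole archive once per chunk (21 times for a 20 GB file). A single-pass alternative, sketched here under the same hypothetical file names, writes all the parts in one go with awk:
tar --to-stdout -xzf myfile.tar.gz file-to-extract.txt |
awk -v inc=1000000 '{
    part = int((NR - 1) / inc) + 1     # which chunk this line belongs to
    print > ("part" part ".txt")
}'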
Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.
#!/usr/bin/env bash
set -eu
mkdir -p out               # make sure the output directory exists
filenum=1
chunksize=1000000
ii=0
while IFS= read -r line
do
    if [ "$ii" -ge "$chunksize" ]
    then
        ii=0
        filenum=$(($filenum + 1))
        > "out/file.$filenum"          # start the next output file empty
    fi
    printf '%s\n' "$line" >> "out/file.$filenum"
    ii=$(($ii + 1))
done
This will take any lines from stdin and create files like out/file.1 with the first million lines, out/file.2 with the second million lines, etc. Then all you need is to feed the input to the above script, like this:
tar xfzO big.tar.gz | ./split.sh
This will never save any intermediate file on disk, or even in memory. It is entirely a streaming solution. It's somewhat wasteful of time, but very efficient in terms of space. It's also very portable, and should work in shells other than Bash, and on ancient systems with little change.
You can use
sed -n 1,20p /Your/file/Path
where you give the first line number and the last line number. In practice this could look like
sed -n 1,20p /Your/file/Path >> file1
with the start and end line numbers kept in variables and used accordingly (see the sketch below).
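A minimal sketch of that idea, reusing the f.tar.gz archive and chunk size from the first answer above (the variable names are just placeholders):
start=1000001
end=2000000
tar Oxzf f.tar.gz | sed -n "${start},${end}p" > part2.txt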

Create files using grep and wildcards with input file

This should be a no-brainer, but apparently I have no brain today.
I have 50 20-gig logs that contain entries from multiple apps, one of which adds a transaction ID to its log lines. I have 42 transaction IDs I need to review, and I'd like to parse out the appropriate lines into separate files.
To do a single file, the command would be simply,
grep CDBBDEADBEEF2020X02393 server.log* > CDBBDEADBEEF2020X02393.log
that creates a log isolated to that transaction, from all 50 server.logs.
Now, I have a file with 42 txnIDs (shortening to 4 here):
CDBBDEADBEEF2020X02393
CDBBDEADBEEF6548X02302
CDBBDE15644F2020X02354
ABBDEADBEEF21014777811
And I wrote:
#!/bin/sh
grep $1 server.\* > $1.log
But that is not working. Changing the shebang to #!/bin/bash -xv gives me this weird output (obviously I'm playing with what the correct escape magic must be):
$ ./xtrakt.sh B7F6E465E006B1F1A
#!/bin/bash -xv
grep - ./server\.\*
' grep - './server.*
: No such file or directory
I have also tried the command line
grep - server.* < txids.txt > $1
But OBVIOUSLY that $1 is pointless and I have no idea how to get a file named per txid using the input redirect form of the command.
Thanks in advance for any ideas. I haven't gone the route of doing a foreach in the shell script, because I want grep to put the original filename in the output lines so I can examine context later if I need to.
Also - it would be great to have the server.* files ordered numerically (server.log.1, server.log.2 NOT server.log.1, server.log.10...)
try this:
while read -r txid
do
grep "$txid" server.* > "$txid.log"
done < txids.txt
And for the file ordering: rename the files that have a one-digit suffix to two digits, with a leading zero, e.g. mv server.log.1 server.log.01 (see the sketch below).
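A small sketch of that renaming step, assuming the logs are named server.log.N in the current directory:
for f in server.log.?; do           # matches only the single-digit suffixes
    n=${f##*.}                      # the digit after the last dot
    mv -- "$f" "server.log.0$n"
done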
