I have a text file with two columns: the first column is the name to save the file as, and the second column is the URL of the resource.
10000899567110806314.jpg 'http://lifestyle.inquirer.net/files/2018/07/t0724cheekee-marcopolo_1-e1532358505274-620x298.jpg'
10001149035013559957.jpg 'https://www.politico.eu/wp-content/uploads/2018/07/GettyImages-1004567890.jpg'
10001268622353586394.jpg 'http://www.channelnewsasia.com/image/10549912/16x9/991/529/a7afd249388308118058689b0060a978/Zv/tour-de-france-5.jpg'
10001360495981714191.jpg 'https://media.breitbart.com/media/2018/07/Dan-Coats.jpg'
The file contains thousands of lines, so I wanted a quick way to download and rename these images.
I read multiple posts on SO and came up with this solution:
cat list.txt | xargs -n 1 -P 4 -d '\n' wget -O
This uses xargs to download in parallel, with wget's -O option to name each downloaded file. When I run a single wget command, this works well. Example:
wget -O 10000899567110806314.jpg 'http://lifestyle.inquirer.net/files/2018/07/t0724cheekee-marcopolo_1-e1532358505274-620x298.jpg'
but when running the command with xargs to download in parallel, I get this error:
wget: missing URL
Usage: wget [OPTION]... [URL]...
Try `wget --help' for more options.
If I generate a file with just a single column of newline-delimited URLs and run the following command, it works great.
cat list.txt | xargs -n 1 -P 4 -d '\n' wget
But, I don't want to download the files first and then do the rename operation.
The error you are getting is because you are only passing one argument at a time (-n 1). To make it work you need to pass both arguments; try this:
cat list.txt | xargs -n 2 -P 4 wget -O
To use the full line as an argument, as @PesaThe suggested, you could use the -L 1 option, for example:
xargs < list.txt -P 4 -L 1 wget -O
From the man page:
-L number
Call utility for every number non-empty lines read.
A line ending with a space continues to the next non-empty line.
If EOF is reached and fewer lines have been read than number then utility
will be called with the available lines. The -L and -n options are
mutually-exclusive; the last one given will be used.
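In practice, you can preview how -L 1 splits the input by substituting echo for wget; this dry run simply prints the commands that would be executed:
xargs < list.txt -L 1 echo wget -O
Each printed line should look like wget -O 10000899567110806314.jpg http://... (xargs itself strips the single quotes when -d is not used), confirming that the file name and the URL reach wget as two separate arguments.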
tag is a command line executable that allows macOS users to add a "tag" to a file.
tag -a | --add <tags> <path>... Add tags to file
I am trying to pipe in a list of files from a text document, but after a few days of failing I badly need help with the syntax. The only clue on the description page is this:
If you plan to pipe the output of tag through xargs, you might want to
use the -0 option of each.
I have tried dozens of commands (xargs, gnu-xargs, for, while) and cannot get this to add tags to the files in the list. Researching online, I thought maybe there was an issue with line endings.
I installed dos2unix and ran it on the file to fix possible line ending issues, but that didn't appear to resolve anything.
If you cd into the directory of the files you are attempting to tag, you don't have to use the complete path to each file.
$ gxargs -0 -t --arg-file="/Users/john/Desktop/diffremove.txt" | tag -0 -a red
echo '13.Hours.The.Secret.Soldiers.of.Benghazi.2016.1080p.BluRay.x265'$'\n''1941.1979.DC.1080p.BluRay.x265'$'\n'...
Not understanding how xargs passes the lines, I thought I needed to put a variable after the command, where it looks for the file: tag -0 -a red <variable here>.
I tried:
$ for i in $(/Users/john/Desktop/diffremove.txt) do `tag -a red "$1"`
I installed gnu xargs and tried this command.
$ </Users/john/Desktop/diffremove.txt gxargs -0 sh -c 'for arg do tag -0 -a red "$arg"'
EDIT:
Trying this command results in no errors but the files aren't tagged.
$ </Users/john/Desktop/diffremove.txt xargs -0 sh -c 'for arg do `tag -0 -a red '$arg'`;done'
Try this:
xargs -d '\n' -a /Users/john/Desktop/diffremove.txt -I % tag -a red %
Here we use xargs to read from the file (using -a), substitute each line via the replacement string % (using -I %), and execute tag -a red {filename} for each one. You may need -d '\n' (setting the delimiter to newline) so the input is split on newlines only, which keeps names containing spaces intact.
However, the classic way to read and process the lines of a file is the shell builtin read:
$ while IFS= read -r p; do tag -a red "$p"; done < /Users/john/Desktop/diffremove.txt
where "$p" is quoted so that names with spaces stay intact, and the IFS= part can be skipped if the file names have no leading or trailing whitespace.
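If you prefer the NUL-delimited route hinted at in the tag description, here is a minimal sketch (assuming the list itself is newline-delimited, so it is converted with tr first):
tr '\n' '\0' < /Users/john/Desktop/diffremove.txt | xargs -0 -I % tag -a red %
Here tr rewrites each newline as a NUL byte and xargs -0 splits on NULs, so no file name short of one containing a newline can be mis-split.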
I have a bash script which needs to go through the files in a directory on an iOS device and remove them one by one.
To list files from command line I use the following command:
ios-deploy --id UUID --bundle_id BUNDLE -l | grep Documents
and to visit each file one by one I use the following for loop in my script:
for line in $(ios-deploy --id UUID --bundle_id BUNDLE -l | grep Documents); do
echo "${line}"
done
Now the problem is that there are files whose names contain spaces, and in such cases the for loop treats them as two separate lines.
How can I escape that whitespace in for loop definition so that I get one line per each file?
This might solve your issue:
while IFS= read -r -d $'\n'
do
echo "${REPLY}"
done < <(ios-deploy --id UUID --bundle_id BUNDLE -l | grep Documents)
Edit per Charles Duffy recommendation:
while IFS= read -r line
do
echo "${line}"
done < <(ios-deploy --id UUID --bundle_id BUNDLE -l | grep Documents)
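Note the process substitution (< <(...)): unlike piping into the loop, it keeps the while body in the current shell, so any variables set inside the loop remain visible after it finishes.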
I'm trying to write a small shell script to find the most recently-added file in a directory and then move that file elsewhere. If I use:
ls -t ~/directory | head -1
and then store this in the variable VARIABLE_NAME, why can't I then move this to ~/otherdirectory via:
mv ~/directory/$VARIABLE_NAME ~/otherdirectory
I've searched around here and Googled, but there doesn't seem to be any information on using variables in file paths. Is there a better way to do this?
Edit: Here's the portion of the script:
ls -t ~/downloads | head -1
read diags
mv ~/downloads/$diags ~/desktop/testfolder
You can do the following in your script:
diags=$(ls -t ~/downloads | head -1)
mv ~/downloads/"$diags" ~/desktop/testfolder
In this case, diags is assigned the output of ls -t ~/downloads | head -1, which mv can then use.
The following commands
ls -t ~/downloads | head -1
read diags
are probably not what you intend: the read command does not receive its input from the command before it. Instead, it waits for input on stdin, which is why the script appears to hang. Maybe you wanted to do the following (at least this was my first, erroneous attempt at a better solution):
ls -t ~/downloads | head -1 | read diags
However, this will (as mentioned by alvits) also not work, because each element of the pipeline runs as a separate process: the variable diags is therefore set in a subshell, not in the parent shell.
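You can see this subshell behaviour with a quick test:
echo hello | read x
echo "x is '$x'"    # prints: x is '' (read ran in a subshell)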
The proper solution therefore is:
diags=$(ls -t ~/downloads | head -1)
There are, however, further possible problems, which would make the subsequent mv command fail:
The directory might be empty.
The file name might contain spaces, newlines etc.
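A slightly more defensive sketch, assuming the goal is simply to move the newest file out of ~/downloads (the quoting handles spaces; names containing newlines would still defeat any ls-based approach):
diags=$(ls -t ~/downloads | head -1)
if [ -n "$diags" ]; then
    mv ~/downloads/"$diags" ~/desktop/testfolder/
else
    echo "no files in ~/downloads" >&2
fi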
Here's what I'm currently running to wget files:
parallel -a list.txt --jobs 100 wget -P /home/files/
The list.txt file contains a list of files such as this:
example.com/test.html
anotherexample.com/test.html
sample.com/test.html
However, it wants to save every file as test.html, obviously.
What I'm trying to do is figure out how to edit the above command to save each file as the domain name. So it should save it as the text before the / symbol. Like this:
example.com
anotherexample.com
sample.com
Does anyone know of any easy way to do this so I can still run it in parallel?
You could first transform the addresses in list.txt and specify wget's output file explicitly, i.e., something like this:
parallel -a list.txt --jobs 100 'g=$(echo {1} | tr "/" "_");wget -P /home/files -O $g {1}'
Here, {1} stands for the argument extracted by parallel from the input list, and all / characters are simply replaced with _. To keep only the characters before the first /, one could do:
parallel -a list.txt --jobs 100 'g=$(echo {1} | sed "s#/.*##");wget -P /home/files -O $g {1}'
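GNU parallel can also do this without a subshell: its built-in replacement string {//} expands to the input with the last path component removed (the "dirname"), which for lines like example.com/test.html is exactly the domain. A shorter variant, assuming every line has the form domain/file:
parallel -a list.txt --jobs 100 wget -O /home/files/{//} {}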
I'm trying to write a bash script that downloads all the .txt files from a website 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'.
So far I have wget -A txt -r -l 1 -nd 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/' but I'm struggling to find a way to print the name of each file to the screen (when downloading). That's the part I'm really stuck on. How would one print the names?
Thoughts?
EDIT: This is what I have done so far, but I'm trying to remove a lot of the leftover markup, like ghcnd-inventory.txt</a></td><td align=...
wget -O- $LINK | tr '"' '\n' | grep -e .txt | while read line; do
echo Downloading $LINK$line ...
wget $LINK$line
done
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -O- $LINK | tr '"' '\n' | grep -e .txt | grep -v align | while read line; do
echo Downloading $LINK$line ...
wget -nv $LINK$line
done
Slight optimization of Sundeep's answer:
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -q -O- $LINK | sed -E '/.*href="[^"]*\.txt".*/!d;s/.*href="([^"]*\.txt)".*/\1/' | wget -nv -i- -B$LINK
The sed command eliminates all lines not matching href="xxx.txt" and extracts only the xxx.txt part of the others. It then passes the result to another wget that uses it as the list of files to retrieve. The -nv option tells wget to be as non-verbose as possible: it will thus print the name of each file as it downloads it, but almost nothing else. Warning: this works only for this particular web site and does not descend into sub-directories.
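Alternatively, sticking with the recursive wget from the original question, adding -nv there too makes it print one line per retrieved file (it will also briefly fetch the index pages it needs for traversal and delete them afterwards, since they don't match -A txt):
wget -A txt -r -l 1 -nd -nv 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'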