How to wget files and save as domain name? - shell

Here's what I'm currently running to wget files:
parallel -a list.txt --jobs 100 wget -P /home/files/
The list.txt file contains a list of files such as this:
example.com/test.html
anotherexample.com/test.html
sample.com/test.html
However, it wants to save every file as test.html, obviously.
What I'm trying to do is figure out how to edit the above command to save each file as the domain name. So it should save it as the text before the / symbol. Like this:
example.com
anotherexample.com
sample.com
Does anyone know of any easy way to do this so I can still run it in parallel?

You could first transform the addresses in list.txt and specify the output file of wget explicitly, i.e., something like this:
parallel -a list.txt --jobs 100 'g=$(echo {1} | tr "/" "_"); wget -O "/home/files/$g" {1}'
Here, {1} stands for the argument extracted by parallel from the input list and all / are simply replaced with _. To keep only characters before the first /, one could do
parallel -a list.txt --jobs 100 'g=$(echo {1} | sed "s#/.*##"); wget -O "/home/files/$g" {1}'
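If you want to avoid spawning echo and sed for every URL, a small variant of the same idea (a sketch using plain shell parameter expansion; ${url%%/*} strips everything from the first / onward, leaving just the domain):
parallel -a list.txt --jobs 100 'url={1}; wget -O "/home/files/${url%%/*}" "$url"'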

Related

piping paths via xargs to `tag` command line tool

tag is a command line executable that allows macOS users to add a "tag" to a file.
tag -a | --add <tags> <path>... Add tags to file
I am trying to pipe in a list of files from a text document, but after a few days of failing badly I need help with the syntax. The only clue on the description page is this:
If you plan to pipe the output of tag through xargs, you might want to
use the -0 option of each.
I have tried dozens of commands: xargs, gnu-xargs, for, while, and cannot get this to add tags to the files in the list. Researching the net I thought maybe there was an issue with line endings.
I installed dos2unix and ran it on the file to fix possible line ending issues, but that didn't appear to resolve anything.
If you cd into the directory of the files you are attempting to tag you don't have to use the complete path to the file.
$ gxargs -0 -t --arg-file="/Users/john/Desktop/diffremove.txt" | tag -0 -a red
echo '13.Hours.The.Secret.Soldiers.of.Benghazi.2016.1080p.BluRay.x265'$'\n''1941.1979.DC.1080p.BluRay.x265'$'\n'...
Not understanding how xargs passes the lines, I thought I needed to put a variable after the command, in the position where it looks for the file: tag -0 -a red <variable here>
I tried:
$ for i in $(/Users/john/Desktop/diffremove.txt) do `tag -a red "$1"`
I installed gnu xargs and tried this command.
$ </Users/john/Desktop/diffremove.txt gxargs -0 sh -c 'for arg do tag -0 -a red "$arg"'
EDIT:
Trying this command results in no errors but the files aren't tagged.
$ </Users/john/Desktop/diffremove.txt xargs -0 sh -c 'for arg do `tag -0 -a red '$arg'`;done'
Try this
xargs -d '\n' -a /Users/john/Desktop/diffremove.txt -I % tag -a red %
Here we use xargs to read from the file (with -a), define a replacement string (with -I %), and execute tag -a red {filename} for each line. You may need -d '\n' (setting the delimiter to newline) so the lines are split correctly.
However, the classic way to read and process the lines of a file is the shell builtin read:
$ while IFS= read -r p; do tag -a red "$p"; done < /Users/john/Desktop/diffremove.txt
where the IFS= part can be skipped if the file names have no leading or trailing spaces.
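If you want to follow the hint from the description page about NUL-delimited input, a whitespace-safe variant (a sketch; it assumes one path per line in the list and uses only the -0 option, which both BSD and GNU xargs support) converts the newlines to NULs and lets xargs hand the paths to tag in batches:
tr '\n' '\0' < /Users/john/Desktop/diffremove.txt | xargs -0 tag -a red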

Replacing strings using sed and a file as input

I have a bunch of files, in a lot of different folders that I need to edit.
Each folder has a subfolder that has another subfolder nested in it.
I have the folder names in genome_list.txt, and the files that need to be edited have the same name in all the folders.
I want to use a list of strings from one file to delete those strings from another file.
Here is what my failed attempt looks like:
for dir in $(cat genome_list.txt)
do
    cd ${dir}/ref_map_${dir}
    for samplename in $(cat remove_from_samples.txt)
    do
        sed -i 's/${samplename}//g' ../samples.txt
    done
    cd ../..
done
Files look like this:
cat remove_from_samples.txt
-s CHJ111.fq
-s CHJ727.fq
cat samples.txt
-s CHJ062.fq -s CHJ111.fq -s CHJ522.fq -s CHJ_528.fq -s CHJ727.fq
#Desired output:
-s CHJ062.fq -s CHJ522.fq -s CHJ_528.fq
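One likely culprit: the sed script is single-quoted, so ${samplename} is never expanded, and the unquoted $(cat remove_from_samples.txt) splits each "-s CHJ111.fq" entry at the space. A minimal sketch of the loop with those two issues addressed (assuming, as in the original loop, that remove_from_samples.txt sits inside each ref_map directory):
while IFS= read -r dir; do
    (
        cd "${dir}/ref_map_${dir}" || exit 1
        # read whole lines so "-s CHJ111.fq" stays a single pattern;
        # double quotes let ${samplename} expand inside the sed script
        while IFS= read -r samplename; do
            sed -i "s/${samplename}//g" ../samples.txt
        done < remove_from_samples.txt
    )   # the subshell makes the trailing cd ../.. unnecessary
done < genome_list.txt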

wget download files in parallel and rename

I have a text file with two columns: the first column is the name to be saved as, and the second column is the url address to the resource.
10000899567110806314.jpg 'http://lifestyle.inquirer.net/files/2018/07/t0724cheekee-marcopolo_1-e1532358505274-620x298.jpg'
10001149035013559957.jpg 'https://www.politico.eu/wp-content/uploads/2018/07/GettyImages-1004567890.jpg'
10001268622353586394.jpg 'http://www.channelnewsasia.com/image/10549912/16x9/991/529/a7afd249388308118058689b0060a978/Zv/tour-de-france-5.jpg'
10001360495981714191.jpg 'https://media.breitbart.com/media/2018/07/Dan-Coats.jpg'
The file contains thousands of lines, so I wanted a quick way to download and rename these images.
I read multiple posts on SO and came up with this solution:
cat list.txt | xargs -n 1 -P 4 -d '\n' wget -O
This uses xargs to download in parallel. I want to use wget's -O option to rename the downloaded file. When I run a single wget command, this works well. Example:
wget -O 10000899567110806314.jpg 'http://lifestyle.inquirer.net/files/2018/07/t0724cheekee-marcopolo_1-e1532358505274-620x298.jpg'
but when running the command with xargs to download in parallel, I get this error:
Try `wget --help' for more options.
wget: missing URL
Usage: wget [OPTION]... [URL]...
If I generate a file with just (single col) newline delimited urls and run the following command, it works great.
cat list.txt | xargs -n 1 -P 4 -d '\n' wget
But, I don't want to download the files first and then do the rename operation.
The error you are getting is because you are only passing one argument (-n 1). To make it work, you need to pass two arguments per invocation; try this:
cat list.txt | xargs -n 2 -P 4 wget -O
To use the full line as an argument, as @PesaThe suggested, you could use the option -L 1, for example:
xargs < list.txt -P 4 -L 1 wget -O
From the man:
-L number
Call utility for every number non-empty lines read.
A line ending with a space continues to the next non-empty line.
If EOF is reached and fewer lines have been read than number then utility
will be called with the available lines. The -L and -n options are
mutually-exclusive; the last one given will be used.
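To check how the arguments are being paired before downloading anything, a quick dry run (a sketch of the same command with echo prefixed) prints each wget invocation instead of executing it:
xargs < list.txt -L 1 echo wget -O
Each printed line should show wget -O followed by the target name and its URL; once the pairing looks right, drop the echo and put -P 4 back.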

How to Copy and Rename multiple files using shell

I want to copy only the 20180721 files from the Outgoing to the Incoming folder. I also want to remove the leading numbers from the file name and rename the trailing -1 to -3. I want to keep my commands to a minimum, so I am using the pax command below.
Filename:
216118105741_MOM-09330-20180721_102408-1.jar
Output expected:
MOM-09330-20180721_102408-3.jar
I have tried this command and it's doing most of the work apart from removing the number coming in front of the file name. Can anyone help?
Command used:
pax -rw -pe -s/-1/-3/ ./*20180721*.jar ../Incoming/
Try this simple script using just parameter expansion:
for file in *20180721*.jar; do
    new=${file#*_}    # strip everything up to and including the first "_"
    cp -- "$file" "/path/to/destination/${new%-*}-3.jar"    # drop the trailing "-1.jar", append "-3.jar"
done
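For the sample name from the question, the two expansions work out like this (a sketch you can paste into a shell to verify):
file=216118105741_MOM-09330-20180721_102408-1.jar
new=${file#*_}            # MOM-09330-20180721_102408-1.jar
echo "${new%-*}-3.jar"    # MOM-09330-20180721_102408-3.jar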
You can try this.
In general:
for i in files-to-copy-*; do
    cp "$i" "$(echo "$i" | sed "s/rename-from/rename-to/g")"
done
In your case:
for i in *_MOM*; do
    cp "$i" "$(echo "$i" | sed "s/^[0-9]*_//; s/-1\.jar$/-3.jar/")"
done
pax applies only the first successful substitution to each file name, even if the -s option is specified more than once, so the two rewrites have to happen in separate pax instances; you can pipe the output of one to the other. The first -s strips everything up to and including the first underscore, and the second turns the trailing 1.jar into 3.jar:
pax -w -s ':^[^_]*_::p' *20180721*.jar | (builtin cd ../Incoming; pax -r -s ':1[.]jar$:3.jar:p')

how to print names of files being downloaded

I'm trying to write a bash script that downloads all the .txt files from a website 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'.
So far I have wget -A txt -r -l 1 -nd 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/' but I'm struggling to find a way to print the name of each file to the screen (when downloading). That's the part I'm really stuck on. How would one print the names?
Thoughts?
EDIT: this is what I have done so far, but I'm trying to remove a lot of extra stuff like ghcnd-inventory.txt</a></td><td align=...
wget -O- $LINK | tr '"' '\n' | grep -e .txt | while read line; do
    echo Downloading $LINK$line ...
    wget $LINK$line
done
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -O- $LINK | tr '"' '\n' | grep -e .txt | grep -v align | while read line; do
    echo Downloading $LINK$line ...
    wget -nv $LINK$line
done
Slight optimization of Sundeep's answer:
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -q -O- $LINK | sed -E '/.*href="[^"]*\.txt".*/!d;s/.*href="([^"]*\.txt)".*/\1/' | wget -nv -i- -B$LINK
The sed command eliminates all lines not matching href="xxx.txt" and extracts only the xxx.txt part of the others. It then passes the result to another wget that uses it as the list of files to retrieve. The -nv option tells wget to be as quiet as possible while still printing the name of each file it downloads, so you see the current file name but almost nothing else. Warning: this works only for this particular web site and does not descend into subdirectories.
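As another option (a sketch building on the command already in the question), simply adding -nv to the original recursive invocation makes wget print one short line per file as it is saved, which covers the "print the name while downloading" requirement:
wget -nv -A txt -r -l 1 -nd 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'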
