I'm trying to write a bash script that downloads all the .txt files from a website 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'.
So far I have wget -A txt -r -l 1 -nd 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/', but I'm struggling to find a way to print the name of each file to the screen as it downloads. That's the part I'm really stuck on. How would one print the names?
Thoughts?
EDIT: this is what I have done so far, but I'm trying to remove a lot of leftover HTML junk like ghcnd-inventory.txt</a></td><td align=...
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -O- "$LINK" | tr '"' '\n' | grep -e '\.txt' | while read -r line; do
    echo "Downloading $LINK$line ..."
    wget "$LINK$line"
done
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -O- "$LINK" | tr '"' '\n' | grep -e '\.txt' | grep -v align | while read -r line; do
    echo "Downloading $LINK$line ..."
    wget -nv "$LINK$line"
done
Slight optimization of Sundeep's answer:
LINK='http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'
wget -q -O- "$LINK" | sed -E '/.*href="[^"]*\.txt".*/!d;s/.*href="([^"]*\.txt)".*/\1/' | wget -nv -i- -B "$LINK"
The sed command deletes every line not matching href="xxx.txt" and extracts only the xxx.txt part of the rest. The result is passed to a second wget that uses it as the list of files to retrieve. The -nv option makes wget as terse as possible: it still prints the name of each file as it downloads it, but almost nothing else. Warning: this works only for this particular web site and does not descend into subdirectories.
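Note that the recursive command from the question can also be made to print each file name, simply by adding -nv (a sketch, untested against the live site):
wget -nv -A txt -r -l 1 -nd 'http://www1.ncdc.noaa.gov/pub/data/ghcn/daily/'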
I usually download with filezilla into the directory /Public/Downloads on my NAS.
I made a script, executed by filezilla when the download queue is finished, that moves all my downloads to /Public/Downloads/Completed. My directory /Public/Downloads also contains two files and three directories that must not be moved:
folder.jpg
log.txt
Temp
Cache
Completed
I tried this command:
find /Public/Downloads/* -maxdepth 1 | grep -v Completed | grep -v Cache | grep -v Temp | grep -v log.txt | grep -v folder.jpg | xargs -i mv {} /Public/Downloads/Completed
This works for downloaded files and folders whose names contain no special characters: they are moved to /Public/Downloads/Completed.
But when a name contains a space, an à, or some other special character, xargs complains: unmatched single quote; by default quotes are special to xargs unless you use the -0 option.
I've searched for a solution myself but haven't found anything combining find, grep and xargs that handles both files and directories for my needs.
How do I have to modify my command?
This is just a suggestion to change strategy, not about xargs: you only need the bash shell, with mv as the sole external tool.
#!/usr/bin/env bash
shopt -s nullglob extglob
array=(
folder.jpg
log.txt
Temp
Cache
Completed
)
to_skip=$(IFS='|'; printf '%s' "*@(${array[*]})")
for item in /Public/Downloads/*; do
[[ $item == $to_skip ]] && continue
echo mv -v "$item" /Public/Downloads/Completed/ || exit
done
Remove the echo if you think that the output is correct.
To see what the code is doing, add the -x option, either as set -x after the shebang or by running bash -x my_script, where my_script is the name of your script.
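For reference, the pattern that ends up in to_skip looks like this, so the [[ $item == $to_skip ]] test is an extglob match against any path ending in one of the listed names:
array=(folder.jpg log.txt Temp Cache Completed)
to_skip=$(IFS='|'; printf '%s' "*@(${array[*]})")
echo "$to_skip"    # prints: *@(folder.jpg|log.txt|Temp|Cache|Completed)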
So I've changed my strategy to a loop using a temporary text file:
#!/bin/bash
ls -p /Public/Downloads | grep -v "Cache/" | grep -v "Temp/" | grep -v "Completed/" | grep -v 'log.txt' | grep -v 'folder.jpg' >/Public/Downloads/Completed/temp.txt
cat /Public/Downloads/Completed/temp.txt |\
while IFS='' read -r CURRENT || [ -n "$CURRENT" ]; do
mv /Public/Downloads/"$CURRENT" /Public/Downloads/Completed
done
rm /Public/Downloads/Completed/temp.txt
1) I write a list of the directories and files to be moved, produced with "ls", into "temp.txt".
2) Each line of "temp.txt" is read into the $CURRENT variable, so each file and directory is moved one by one with "mv". $CURRENT is double-quoted in the mv command in case a directory or file name contains a space character.
3) "temp.txt" is deleted.
Based on the OP's input, the main issue is file names with "special" characters, including spaces. Two options:
If none of the input file names has an embedded newline (which is the case here), the problem can be addressed by explicitly setting the delimiter to newline (-d '\n'). See below.
If any of the file names could contain a newline, the whole pipeline would have to use zero-terminated strings (see the sketch after the command below). That does not seem to be the case here.
find /Public/Downloads/* -maxdepth 1 |
grep -v Completed |
grep -v Cache |
grep -v Temp |
grep -v log.txt |
grep -v folder.jpg |
xargs -d '\n' -I{} mv {} /Public/Downloads/Completed
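For completeness, a zero-terminated variant might look like this (a sketch; it moves the filtering into find itself, so the grep stages disappear):
find /Public/Downloads -mindepth 1 -maxdepth 1 \
    ! -name Completed ! -name Cache ! -name Temp \
    ! -name log.txt ! -name folder.jpg -print0 |
xargs -0 -I{} mv {} /Public/Downloads/Completed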
I'm trying to search for files and separate the path and the version into variables, because each will be needed later to create a directory and to unzip a .jar into the desired path.
file=$(find /home/user/Documents/test/ -path *.jar)
version=$(echo "$file" | grep -P -o '[0-9].[0-9].[0-9].[0-9]')
path=$(echo "$file" | sed 's/\(.*\)[/].*/\1/')
newpath=$(echo "${path}/${version}")
echo "$newpath"
Result:
/home/user/Documents/test/gb0500
/home/user/Documents/test/gb0500 /home/user/Documents/test/gb0500
/home/user/Documents/test /home/user/Documents/test/1.3.2.0
1.3.2.1
1.3.2.2
1.2.0.0
1.3.0.0
It's hilarious that it only works for one line.
What else I tried:
file=$(find /home/v990549/Dokumente/test/ -path *.jar)
version=$(grep -P -o '[0-9].[0-9].[0-9].[0-9]')
path=$(sed 's/\(.*\)[/].*/\1/')
while read $file
do
echo "$path$version"
done
I have no experience in scripting; this is just what I figured out over the last few days. I am practicing and trying to make life easier.
find output:
/home/user/Documents/test/gb0500/gb0500-koetlin-log4j2-web-1.3.2.0-javadoc.jar
/home/user/Documents/test/gb0500/gb0500-koetlin-log4j2-web-1.3.2.1-javadoc.jar
/home/user/Documents/test/gb0500/gb0500-koetlin-log4j2-web-1.3.2.2-javadoc.jar
/home/user/Documents/test/gb0500-co-log4j2-web-1.2.0.0-javadoc.jar
/home/user/Documents/test/gb0500-commons-log4j2-web-1.3.0.0-javadoc.jar
As both variables version and path are newline-separated, how about:
file=$(find /home/user/Documents/test/ -path '*.jar')
version=$(echo "$file" | grep -P -o '[0-9]\.[0-9]\.[0-9]\.[0-9]')
path=$(echo "$file" | sed 's/\(.*\)[/].*/\1/')
paste -d "/" <(echo "$path") <(echo "$version")
Result:
/home/user/Documents/test/gb0500/1.3.2.0
/home/user/Documents/test/gb0500/1.3.2.1
/home/user/Documents/test/gb0500/1.3.2.2
/home/user/Documents/test/1.2.0.0
/home/user/Documents/test/1.3.0.0
BTW I do not recommend storing multiple filenames in a single newline-separated variable, for several reasons:
Filenames may contain a newline character.
It is not easy to manipulate the values of each line.
For instance, if file contained just one name, you could simply write the third line as path=${file%/*}.
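A per-file loop avoids the intermediate variable entirely (a sketch, assuming GNU find and grep):
while IFS='' read -r -d '' f; do
    version=$(grep -P -o '[0-9]+(\.[0-9]+){3}' <<< "${f##*/}")
    path=${f%/*}
    echo "$path/$version"
done < <(find /home/user/Documents/test/ -name '*.jar' -print0)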
Hope this helps.
I have written a bash script that finds any executable files in our scripts directory, then performs a grep on the resulting files to display a description, if it was included in the file.
A "description" is identified in each file as a line beginning with "# DESC:"
For some reason, the script also includes the grep command that is being run (but only once). Does anyone know why this is?
Script and output shown below. Why does the second line in the output happen?
Script
#!/bin/bash
# Find any FILES that are EXECUTABLE in the SCRIPTS
# directory and display any description, if there is one
find /opt/scripts/. -perm -111 -type f -maxdepth 1 | while read line ;
do
file=$(basename "$line")
printf "\033[1m%10s\033[0m : " $file
grep "# DESC:" "$line" | cut -c 9-
done
Output
desc : Displays all the scripts and their descriptions
DESC:" "$line" | cut -c 9-
showhelp : Displays the script help file
test : Script to perform system testing
Reason
Presumably your grepping script is also in /opt/scripts?
So it finds itself, and finds the grep subject '#DESC' and prints that.
You could fix that by adding a # DESC: line to the top of your grep script, and by outputting only the first match each grep finds, using grep -m1:
grep -m1 '# DESC' "$line" | cut -c 9-
<humour>Otherwise it's just turtles all the way down... ;-)</humour>
Alternative Fix
You could also improve the grep by anchoring the pattern to the beginning of the line (plain grep is enough here; no extended regular expression is needed):
grep -m1 '^# DESC' "$line" | cut -c 9-
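Putting it together, the loop might read as follows (a sketch; -maxdepth is also moved before the other tests, since GNU find warns about option order otherwise):
find /opt/scripts/. -maxdepth 1 -perm -111 -type f | while read -r line; do
    file=$(basename "$line")
    printf '\033[1m%10s\033[0m : ' "$file"
    grep -m1 '^# DESC:' "$line" | cut -c 9-
done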
I want to create a shell script that reads from a .diz file, where information about the various source files needed to compile a certain piece of software (ImageMagick in this case) is stored. I am using Mac OS X Leopard 10.5 for these examples.
Basically I want an easy way to maintain these .diz files that hold the information for up-to-date source packages. I would just need to update them with URLs, version information and file checksums.
Example line:
libpng:1.2.42:libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks:http://downloads.sourceforge.net/project/libpng/00-libpng-stable/1.2.42/libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks:9a5cbe9798927fdf528f3186a8840ebe
script part:
while IFS=: read app version file url md5
do
echo "Downloading $app Version: $version"
curl -L -v -O $url 2>> logfile.txt
$calculated_md5=`/sbin/md5 $file | /usr/bin/cut -f 2 -d "="`
echo $calculated_md5
done < "files.diz"
Actually I have more than just one question concerning this:
How best to calculate and compare the checksums? I wanted to store MD5 checksums in the .diz file and compare them by string comparison, "cut"ting the checksum out of the md5 output.
Is there a way to tell curl another filename to save to? (In my case the filename gets ugly: libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks.)
I seem to have issues with the backticks that should direct the output of the piped md5 and cut into the variable $calculated_md5. Is the syntax wrong?
Thanks!
The following is a practical one-liner:
curl -s -L <url> | tee <destination-file> |
sha256sum -c <(echo "a748a107dd0c6146e7f8a40f9d0fde29e19b3e8234d2de7e522a1fea15048e70  -") ||
rm -f <destination-file>
Wrapping it up in a function taking 3 arguments:
- the url
- the destination file
- the sha256 checksum
download() {
    curl -s -L "$1" | tee "$2" | sha256sum -c <(echo "$3  -") || rm -f "$2"
}
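Hypothetical usage (the URL and file name are placeholders; the hash is the one from the one-liner above):
download 'http://example.com/pkg.tar.gz' pkg.tar.gz \
    a748a107dd0c6146e7f8a40f9d0fde29e19b3e8234d2de7e522a1fea15048e70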
while IFS=: read app version file url md5
do
echo "Downloading $app Version: $version"
#use -o for output file. define $outputfile yourself
curl -L -v "$url" -o "$outputfile" 2>> logfile.txt
# use $(..) instead of backticks.
calculated_md5=$(/sbin/md5 "$file" | /usr/bin/cut -f 2 -d "=")
# compare md5
case "$calculated_md5" in
"$md5" )
echo "md5 ok"
echo "do something else here";;
esac
done < "files.diz"
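A plain test would work just as well in place of the case statement (a sketch):
if [ "$calculated_md5" = "$md5" ]; then
    echo "md5 ok"
    echo "do something else here"
fi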
My curl has a -o (--output) option to specify an output file. There's also a problem with your assignment to $calculated_md5: it shouldn't have the dollar sign at the front when you assign to it. I don't have /sbin/md5 here, so I can't comment on that; what I do have is md5sum. If you have it too, you might consider it as an alternative. In particular, it has a --check option that works from a file listing of md5sums, which might be handy for your situation. HTH.
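For example, md5sum --check reads hash/filename pairs (two spaces between them); using the checksum from the example .diz line above, a check might look like this (a sketch, assuming GNU md5sum is installed alongside BSD md5):
echo "9a5cbe9798927fdf528f3186a8840ebe  libpng-1.2.42.tar.bz2" | md5sum -c -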
I have a list of URLs which I would like to feed into wget using --input-file.
However I can't work out how to control the --output-document value at the same time,
which is simple if you issue the commands one by one.
I would like to save each document as the MD5 of its URL.
cat url-list.txt | xargs -P 4 wget
And xargs is there because I also want to make use of the max-procs feature for parallel downloads.
Don't use cat. You can have xargs read from a file. From the man page:
--arg-file=file
-a file
Read items from file instead of standard input. If you use this
option, stdin remains unchanged when commands are run. Otherwise,
stdin is redirected from /dev/null.
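Applied to the question, that might look like this (-n 1 hands one URL to each wget so that -P 4 can actually run four downloads at once):
xargs -a url-list.txt -P 4 -n 1 wget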
How about using a loop?
while read -r line
do
    md5=$(echo "$line" | md5sum | cut -d' ' -f1)
    wget ... "$line" ... --output-document "$md5" ......
done < url-list.txt
In your question you use -P 4, which suggests you want your solution to run in parallel. GNU Parallel http://www.gnu.org/software/parallel/ may help you:
cat url-list.txt | parallel 'wget {} --output-document "$(echo {} | md5sum | cut -c1-32)"'
You can do that like this:
cat url-list.txt | while read -r url
do
    wget "$url" -O "$(echo "$url" | md5)"
done
good luck