aria2c - Any way to keep only list of failed downloads? - terminal

I am using aria2c to download quite a large list of URLs (~6000) organized in a text file.
Based on this gist, I am using the following script to download all the files:
#!/bin/bash
aria2c -j5 -i list.txt -c --save-session out.txt
has_error=`wc -l < out.txt`
while [ $has_error -gt 0 ]
do
    echo "still has $has_error errors, rerun aria2 to download ..."
    aria2c -j5 -i list.txt -c --save-session out.txt
    has_error=`wc -l < out.txt`
    sleep 10
done
### PS: one line solution, just loop 1000 times
### seq 1000 | parallel -j1 aria2c -i list.txt -c
which saves the aria2c session in a text file and, if there was at least one download error, tries to download all the URLs again.
The problem is that, since my URL list is so large, this becomes pretty inefficient. If 30 files fail to download (say, because of a server timeout), the whole 6000-URL list gets fed to aria2c again just to retry those 30 files.
So, the question is: is there any way to tell aria2c to save only the failed downloads, and then try to re-download only those files?
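For what it's worth, aria2c's --save-session option is documented to write only the error/unfinished downloads, in aria2's own input-file format, so one possible refinement of the script above is to feed the session file back in as the input list on each retry. A minimal, untested sketch of that variation:
#!/bin/bash
# Start from the full list, then feed the session file (which contains only
# the failed/unfinished entries) back in as the input list on each retry.
aria2c -j5 -i list.txt -c --save-session out.txt
has_error=`wc -l < out.txt`
while [ $has_error -gt 0 ]
do
    echo "still has $has_error errors, retrying only the failed downloads ..."
    cp out.txt retry.txt        # out.txt gets overwritten by the next --save-session
    aria2c -j5 -i retry.txt -c --save-session out.txt
    has_error=`wc -l < out.txt`
    sleep 10
done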

Related

Efficient parallel downloading and decompressing with matching pattern for list of files on server

Every 6 hours I have to download bz2 files from a web server, decompress them and merge them into a single file. This needs to be as quick and efficient as possible, since I have to wait for the download and decompression phase to complete before proceeding with the merging.
I have written some bash functions which take some strings as input and build the URL of the files to be downloaded as a matching pattern. This way I can pass the matching pattern directly to wget, without having to build the server's contents list locally and then pass it to wget with -i. My function looks something like this:
parallelized_extraction(){
    i=0
    until [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 -o $i -ge 30 ]; do
        ((i++))
        sleep 1
    done
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l` -gt 0 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="file_${year}${month}${day}${run}_*_${1}.grib2"
    wget -b -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    parallelized_extraction ${filename}
    # do the merging
    rm ${filename}
}
which I call as download_merge_2d_variable name_of_variable
I was able to speed up the code by writing the function parallelized_extraction, which takes care of decompressing the downloaded files while wget is running in the background. To do this I first wait for the first .bz2 file to appear, then keep running the parallel extraction as long as there are .bz2 files left to decompress (this is what the until and while loops are doing).
I'm pretty happy with this approach; however, I think it could be improved. Here are my questions:
How can I launch multiple instances of wget to perform parallel downloads when my list of files is given as a matching pattern? Do I have to write multiple matching patterns, each covering a "chunk" of the data, or do I necessarily have to download a contents list from the server, split it, and feed the pieces to wget as input?
parallelized_extraction may fail if the download is really slow: it will not find any new .bz2 file to extract and will exit the loop on the next iteration, even though wget is still running in the background. This has never happened to me, but it is a possibility. To guard against it I tried to add a condition to the second while loop, using the PID of the backgrounded wget to check whether it is still running, but somehow it is not working:
parallelized_extraction(){
    # ...................
    # same as before ....
    # ...................
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l ` -gt 0 -a kill -0 ${2} >/dev/null 2>&1 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="ifile_${year}${month}${day}${run}_*_${1}.grib2"
    wget -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/" &
    # get ID of process running in background
    PROC_ID=$!
    parallelized_extraction ${filename} ${PROC_ID}
    # do the merging
    rm ${filename}
}
Any clue as to why this is not working? Any suggestions on how to improve my code?
Thanks
UPDATE
I'm posting my working solution here, based on the accepted answer, in case someone is interested.
# Extract a plain list of URLs by using the --spider option and filtering
# only the URLs from the output
listurls() {
    filename="$1"
    url="$2"
    wget --spider -r -nH -np -nv -nd --reject "index.html" --cut-dirs=3 \
        -A $filename.bz2 $url 2>&1 \
        | grep -Eo '(http|https)://(.*).bz2'
}

# Extract each file by redirecting the stdout of wget to bzip2.
# Note that I get the filename directly from the URL with basename,
# removing the .bz2 extension at the end.
get_and_extract_one() {
    url="$1"
    file=`basename $url | sed 's/\.bz2//g'`
    wget -q -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one

# Here is the main calling function
download_merge_2d_variable()
{
    filename="filename.grib2"
    url="url/where/the/file/is/"
    listurls $filename $url | parallel get_and_extract_one {}
    # merging and processing
}
export -f download_merge_2d_variable
Can you list the URLs to download?
listurls() {
    # do something that lists the urls without downloading them
    # Possibly something like:
    #   lynx -listonly -image_links -dump "$starturl"
    # or
    #   wget --spider -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    # or
    #   seq 100 | parallel echo ${url}${year}${month}${day}${run}_{}_${id}.grib2
}

get_and_extract_one() {
    url="$1"
    file="$2"
    wget -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one

# {=s:/:_:g; =} will generate a file name from the URL with / replaced by _
# You probably want something nicer.
# Possibly just {/.}
listurls | parallel get_and_extract_one {} '{=s:/:_:g; =}'
This way you decompress while downloading, and everything runs in parallel.
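If the server objects to too many simultaneous connections, the number of parallel jobs can be capped with parallel's -j option; the value below is only an example:
# limit to 4 simultaneous download+decompress jobs (value is arbitrary)
listurls | parallel -j4 get_and_extract_one {} '{=s:/:_:g; =}'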

Tesseract OCR large number of files

I have around 135,000 .TIF files (1.2 KB to 1.4 KB each) sitting on my hard drive. I need to extract text from those files. If I run tesseract as a cron job I get 500 to 600 per hour at most. Can anyone suggest strategies so I can process at least 500 per minute?
UPDATE:
Below is my code after implementing the suggestions given by Mark, but I still don't seem to get beyond 20 files per minute.
#!/bin/bash
cd /mnt/ramdisk/input
function tess()
{
    if [ -f /mnt/ramdisk/output/$2.txt ]
    then
        echo skipping $2
        return
    fi
    tesseract --tessdata-dir /mnt/ramdisk/tessdata -l eng+kan $1 /mnt/ramdisk/output/$2 > /dev/null 2>&1
}
export -f tess
find . -name \*.tif -print0 | parallel -0 -j100 --progress tess {/} {/.}
You need GNU Parallel. Here I process 500 TIF files of 3kB each in 37s on an iMac. By way of comparison, the same processing takes 160s if done in a sequential for loop.
The basic command looks like this:
parallel --bar 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif
which will show a progress bar and use all available cores on your machine.
If you want to see what it would do without actually doing anything, use parallel --dry-run.
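For example, this prints the tesseract commands that would be run without executing any of them:
parallel --dry-run 'tesseract {} {.} > /dev/null 2>&1' ::: *.tif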
As you have 135,000 files, it will probably overflow the maximum command-line length; you can check the limit with sysctl like this:
sysctl -a kern.argmax
kern.argmax: 262144
So you need to pump the filenames into GNU Parallel on its stdin and separate them with null characters so you don't get problems with spaces:
find . -iname \*.tif -print0 | parallel -0 --bar 'tesseract {} {.} > /dev/null 2>&1'
If you are dealing with very large numbers of files, you probably need to consider the possibility of being interrupted and restarted. You could either mv each TIF file into a subdirectory called processed after it has been handled, so that it won't get done again on restarting (a sketch of that variant follows the next example), or you could test for the existence of the corresponding .txt file before processing any TIF, like this:
#!/bin/bash
doit() {
    if [ -f "${2}.txt" ]; then
        echo Skipping $1...
        return
    fi
    tesseract "$1" "$2" > /dev/null 2>&1
}
export -f doit
time parallel --bar doit {} {.} ::: *.tif
If you run that twice in a row, you will see it is near instantaneous the second time because all the processing was done the first time.
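For completeness, a minimal sketch of the other restart strategy mentioned above (moving each TIF into a processed subdirectory once it is done) might look like this; the directory name is just an example:
#!/bin/bash
mkdir -p processed
doit() {
    # only move the TIF out of the way if tesseract succeeded
    tesseract "$1" "$2" > /dev/null 2>&1 && mv "$1" processed/
}
export -f doit
parallel --bar doit {} {.} ::: *.tif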
If you have millions of files, you could consider using multiple machines in parallel. Just make sure you have ssh logins to each of the machines on your network, and then run across 4 machines, including the localhost, like this:
parallel -S :,remote1,remote2,remote3 ...
where : is shorthand for the machine on which you are running.
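A rough sketch of what such a multi-machine run might look like, assuming password-less ssh to hosts called remote1, remote2 and remote3 (placeholder names) and tesseract installed on each; --trc transfers each TIF to the remote host, returns the matching .txt and cleans up the temporary copies:
# remote1/remote2/remote3 are placeholder hostnames
find . -iname \*.tif -print0 | \
    parallel -0 --bar -S :,remote1,remote2,remote3 --trc {.}.txt \
    'tesseract {} {.} > /dev/null 2>&1'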

aria2c timeout error on large file, but downloads most of it

I posted this on superuser yesterday, but maybe it's better here. I apologize if I am incorrect.
I am trying to troubleshoot the aria2c command below, run on Ubuntu 14.04. Basically the download starts, gets to about 30 minutes remaining, and then errors out with a timeout error. The file being downloaded is 25 GB, and often multiple files are downloaded using a loop. Any suggestions to make this more efficient and stable? Currently each file takes about 4 hours to download, which is OK as long as there are no errors. I am also left with an .aria2 control file next to the partially downloaded file. Thank you :).
aria2c -x 4 -l log.txt -c -d /home/cmccabe/Desktop/download --http-user "xxxxx"  --http-passwd xxxx xxx://www.example.com/x/x/xxx/"file"
I apologize for the tag, as I am not able to create a new one; that was the closest.
You can write a loop to rerun aria2c until all files are downloaded (see this gist).
Basically, you can put all the links in a file (e.g. list.txt):
http://foo/file1
http://foo/file2
...
Then run loop_aria2.sh:
#!/bin/bash
aria2c -j5 -i list.txt -c --save-session out.txt
has_error=`wc -l < out.txt`
while [ $has_error -gt 0 ]
do
    echo "still has $has_error errors, rerun aria2 to download ..."
    aria2c -j5 -i list.txt -c --save-session out.txt
    has_error=`wc -l < out.txt`
    sleep 10
done
aria2c -j5 -i list.txt -c --save-session out.txt downloads 5 files in parallel (-j5) and writes the failed downloads to out.txt. If out.txt is not empty ($has_error -gt 0), the same command is rerun to continue downloading. The -c option makes aria2c skip files that have already completed.
PS: a simpler solution (without error checking) is to just run aria2c 1000 times, if you don't mind :)
seq 1000 | parallel -j1 aria2c -i list.txt -c
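Independently of the retry loop, each aria2c invocation can be made more tolerant of slow servers; the options below are standard aria2c flags, but the values are only illustrative:
# retry each file up to 10 times, wait 30 s between retries,
# and allow 120 s of network inactivity before timing out
aria2c -j5 -i list.txt -c --save-session out.txt \
    --max-tries=10 --retry-wait=30 --timeout=120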

How to rename wget-downloaded files sequentially?

Let's say I am downloading image files from a website with wget.
wget -H -p -w 2 -nd -nc -A jpg,jpeg -R gif "forum.foo.com/showthread.php?t=12345"
There are 20 images on that page. When downloaded, the images are saved under their original file names.
I want to rename the first image downloaded by wget as 001-original_filename.jpg, the second one as 002-original_filename.jpg, and so on.
What should I do? Is bash or curl needed for this?
Note: I am on Windows.
If you have bash installed, run this after the files have been downloaded:
i=1
ls -crt | while read file; do
    newfile=$(printf "%.3d-%s\n" $i "$file")
    mv "$file" "$newfile"
    i=$((i+1))
done
ls -crt: lists the files sorted by timestamp, oldest first (-t sorts by time, -r reverses the order, -c uses the status-change time).
The .3 in printf's %.3d pads the number to 3 digits with leading zeros.
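For example, with the counter at 7:
printf "%.3d-%s\n" 7 original_filename.jpg
007-original_filename.jpg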

program starting next line of code when gzip is still running

In a shell script I first generate a file, then compress it using gzip, and then transfer it to a remote machine using scp. The issue is that execution moves on to the next line of code before gzip has completed successfully, and because of this I end up with a partial .gz file on the remote machine.
What I mean is: the gzip command starts compressing, but before the compression is finished (the file is big, so it takes some time), the next line of code gets executed, which is the scp, and so I get a partial file transfer on the remote machine.
My question is: which gzip option can I specify so that execution does not move on to the next line of code before the compression has completed successfully?
gzip is used like this in my code:
gzip -c <filename> > <zip_filename> 2>&1 | tee -a <log_filename>
Please suggest.
First of all, if I do this
gzip -cr /home/username > homefolder.gz
bash waits for the command to finish before executing the next one (even when used in a script).
However, if I'm mistaken or you ran the command in the background, you can wait until gzip is finished using the following code:
#!/bin/bash
gzip -cr "$1" > "$1.gz" &

while true; do
    if fuser -s "$1" "$1.gz"; then
        echo "gzip still compressing '$1'"
    else
        echo "gzip finished compressing '$1' (archive is '$1.gz')"
        break
    fi
done

exit 0
Or if you just want to wait and do nothing more, it's even simpler:
gzip -cr "$1" > "$1.gz" &   # run in the background; the loop below waits for it
while fuser -s "$1" "$1.gz"; do :; done
# copy your file wherever you want
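If the compression really is being started in the background somewhere, another option is the shell builtin wait, which blocks until the background job finishes; a minimal sketch with placeholder file and host names:
#!/bin/bash
gzip -c bigfile > bigfile.gz &              # compress in the background (placeholder name)
wait $!                                     # block until that gzip job exits
scp bigfile.gz user@remotehost:/some/dir/   # placeholder destination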
