aria2c timeout error on large file, but downloads most of it - aria2

I posted this on Super User yesterday, but maybe it's better here. I apologize if I am incorrect.
I am trying to troubleshoot the aria2c command below, run on Ubuntu 14.04. Basically the download starts, gets to about 30 minutes remaining, then fails with a timeout error. The file being downloaded is 25 GB, and there are often multiple files downloaded in a loop. Any suggestions to make this more efficient and stable? Currently each file takes about 4 hours to download, which is OK as long as there are no errors. I am also left with an .aria2 control file alongside the partially downloaded file. Thank you :).
aria2c -x 4 -l log.txt -c -d /home/cmccabe/Desktop/download --http-user "xxxxx"  --http-passwd xxxx xxx://www.example.com/x/x/xxx/"file"
I apologize for the tag; I am not able to create a new one, and that was the closest.

You can write a loop that reruns aria2c until all files are downloaded (see this gist).
Basically, you can put all the links in a file (e.g. list.txt):
http://foo/file1
http://foo/file2
...
Then run loop_aria2.sh:
#!/bin/bash
aria2c -j5 -i list.txt -c --save-session out.txt
has_error=$(wc -l < out.txt)
while [ "$has_error" -gt 0 ]
do
    echo "still has $has_error errors, rerun aria2 to download ..."
    aria2c -j5 -i list.txt -c --save-session out.txt
    has_error=$(wc -l < out.txt)
    sleep 10
done
aria2c -j5 -i list.txt -c --save-session out.txt will download 5 files in parallel (-j5) and write the failed downloads into out.txt. If out.txt is not empty ($has_error -gt 0), the loop reruns the same command to continue downloading. The -c option of aria2c skips files that are already complete.
PS: another, simpler solution (without checking for errors) is to just run aria2 1000 times, if you don't mind :)
seq 1000 | parallel -j1 aria2c -i list.txt -c
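For the timeout itself, it may also help to loosen aria2c's retry and timeout settings so that a single stalled connection does not abort a 25 GB transfer. A hedged sketch, reusing the command from the question (the option names are from the aria2c manual; check your version's defaults):
# Hypothetical variant of the original command with more forgiving retry/timeout
# behaviour: --max-tries=0 retries without limit, --retry-wait pauses between
# attempts, and --timeout/--connect-timeout give slow servers more room.
aria2c -x 4 -c -l log.txt -d /home/cmccabe/Desktop/download \
    --timeout=120 --connect-timeout=60 --max-tries=0 --retry-wait=30 \
    --http-user "xxxxx" --http-passwd xxxx \
    xxx://www.example.com/x/x/xxx/"file"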

Related

aria2c - Any way to keep only list of failed downloads?

I am using aria2c to download a fairly large list of URLs (~6000) organized in a text file.
Based on this gist, I am using the following script to download all the files:
#!/bin/bash
aria2c -j5 -i list.txt -c --save-session out.txt
has_error=$(wc -l < out.txt)
while [ "$has_error" -gt 0 ]
do
    echo "still has $has_error errors, rerun aria2 to download ..."
    aria2c -j5 -i list.txt -c --save-session out.txt
    has_error=$(wc -l < out.txt)
    sleep 10
done
### PS: one line solution, just loop 1000 times
### seq 1000 | parallel -j1 aria2c -i list.txt -c
which saves the aria2c session (i.e. the unfinished downloads) in a text file and, in case of at least one download error, tries to download all the URLs again.
The problem is: since my URL list is so large, this becomes pretty inefficient. If 30 files result in a download error (say, because of a server timeout), then the whole list will be looped over again and again just for those 30 files.
So, the question is: is there any way to tell aria2c to save only the failed downloads, and then try to re-download only those files?
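One approach that seems to fit (hedged, since I have not tested it against a 6000-URL list): the file written by --save-session is in aria2c's input-file format and, per the aria2c manual, can be fed back with -i, so the retry passes only need to touch the failed entries. A minimal sketch; the file names failed.txt and retry.txt are made up for the example:
#!/bin/bash
# First pass over the full list; --save-session records errored/unfinished
# downloads in aria2c's input-file format.
aria2c -j5 -i list.txt -c --save-session failed.txt
# Retry only what failed; [ -s failed.txt ] is true while the file is non-empty.
while [ -s failed.txt ]
do
    echo "retrying the downloads recorded in failed.txt ..."
    cp failed.txt retry.txt      # work on a copy so aria2c can rewrite the session file
    aria2c -j5 -i retry.txt -c --save-session failed.txt
    sleep 10
done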

Script to download, gunzip, merge files and gzip the file again?

My scripting skills are really limited, so I wonder if someone here could help me. I would like to download 2 files, run gunzip, run a command called tv_merge, and then gzip the new file. Here's what I would like to run from a script.
I would like to download two files (.gz) with wget:
wget -O /some/where/file1.gz http://some.url.com/data/
wget -O /some/where/file2.gz http://some.url.com/data/
Then gunzip the 2 files:
gunzip /some/where/file1.gz
gunzip /some/where/file2.gz
After that run a command called Tv_merge:
tv_merge -i /some/where/file1 -m /some/where/file2 -o newmaster.xml
After tv_merge, I would like to gzip the file:
gzip newmaster.xml
I would like to run all these commands in that order from a script, and I would like to schedule it to run, say, every 8 hours via crontab.
I'm assuming that the file names are static. With the provided information, this should get you going.
#!/bin/bash
echo "Downloading first file"
wget -O /some/where/file1.gz http://some.url.com/data/
echo "First Download Completed"
echo "Downloading Second file"
wget -O /some/where/file2.gz http://some.url.com/data/
echo "Second Download Completed"
gunzip /some/where/file1.gz
gunzip /some/where/file2.gz
echo "Running tv_merge"
tv_merge -i /some/where/file1 -m /some/where/file2 -o newmaster.xml
gzip -c newmaster.xml > /some/where/newmaster.xml.gz
echo "newmaster.xml.gz is ready at /some/where/newmaster.xml.gz"
Save this to a file, for example script.sh, then chmod +x script.sh, and you can run it with bash script.sh.
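For the every-8-hours part, a hedged example of a crontab entry (assuming the script is saved as /some/where/script.sh; adjust the path and log location as needed):
# run at 00:00, 08:00 and 16:00, appending output to a log file
0 */8 * * * /bin/bash /some/where/script.sh >> /some/where/script.log 2>&1
Add it with crontab -e.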

youtube-dl problems (scripting)

Okay, so I've got this small problem with a bash script that I'm writing.
This script is supposed to be run like this:
bash script.sh https://www.youtube.com/user/<channel name>
OR
bash script.sh https://www.youtube.com/user/<random characters that make up a youtube channel ID>
It downloads an entire YouTube channel to a folder named
<uploader>{<uploader_id>}/
Or, at least it SHOULD...
The problem I'm getting is that the archive.txt file that youtube-dl creates is not created in the same directory as the videos; it's created in the directory from which the script is run.
Is there a grep or sed command that I could use to get the archive.txt file to the video folder?
Or maybe create the folder FIRST, then cd into it, and run the command from there?
I dunno
Here is my script:
#!/bin/bash
pwd
sleep 1
echo "You entered: $1 for the URL"
sleep 1
echo "Now downloading all videos from URL "$1""
youtube-dl -iw \
--no-continue $1 \
-f bestvideo+bestaudio --merge-output-format mkv \
-o "%(uploader)s{%(uploader_id)s}/[%(upload_date)s] %(title)s" \
--add-metadata --download-archive archive.txt
exit 0
I ended up solving it with this:
uploader="$(youtube-dl -i -J $URL --playlist-items 1 | grep -Po '(?<="uploader": ")[^"]*')"
uploader_id="$(youtube-dl -i -J $URL --playlist-items 1 | grep -Po '(?<="uploader_id": ")[^"]*')"
uploaderandid="$uploader{$uploader_id}"
echo "Uploader: $uploader"
echo "Uploader ID: $uploader_id"
echo "Folder Name: $uploaderandid"
echo "Now downloading all videos from URL "$URL" to the folder "$DIR/$uploaderandid""
Basically I had to parse the JSON with grep, since the youtube-dl devs said that implementing -o type variables into any other variable would clog up the code and make it bloated.
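Putting it together, a hedged sketch of how the parsed names could feed back into the original youtube-dl call: the folder is created first and --download-archive is pointed into it. The flags are the ones from the script above; the channel URL is still taken from $1.
#!/bin/bash
URL="$1"
# Parse the uploader name and ID from the first playlist item, as above.
uploader="$(youtube-dl -i -J "$URL" --playlist-items 1 | grep -Po '(?<="uploader": ")[^"]*')"
uploader_id="$(youtube-dl -i -J "$URL" --playlist-items 1 | grep -Po '(?<="uploader_id": ")[^"]*')"
uploaderandid="$uploader{$uploader_id}"
# Create the target folder first, then keep archive.txt inside it.
mkdir -p "$uploaderandid"
youtube-dl -iw \
    --no-continue "$URL" \
    -f bestvideo+bestaudio --merge-output-format mkv \
    -o "$uploaderandid/[%(upload_date)s] %(title)s" \
    --add-metadata --download-archive "$uploaderandid/archive.txt"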

program starting next line of code when gzip is still running

In a shell script, I first generate a file, then zip it using gzip, and then transfer it to a remote machine using scp. The issue is that the script moves on to the next line of code before gzip has completed successfully, and because of this I end up with a partial .gz file on the remote machine.
What I mean is that gzip starts the compression, but before it completes (the file is big, so it takes some time), the next line of code is executed; that line is the scp, so I get a partial file transfer on the remote machine.
My question is: what gzip option can I specify so that the script does not move on to the next line of code before the zip process has completed successfully?
gzip is used like this in my code:
gzip -c <filename> > <zip_filename> 2>&1 | tee -a <log_filename>
Please suggest.
First of all, if I do this
gzip -cr /home/username > homefolder.gz
bash waits until the end of the command before executing the next one (even if used in a script).
However, if I'm mistaken or you ran the command in the background, you can wait until gzip is finished using the following code:
#!/bin/bash
gzip -cr "$1" > "$1.gz" &
while true; do
    if fuser -s "$1" "$1.gz"; then
        echo "gzip still compressing '$1'"
        sleep 1    # avoid a tight busy loop while polling
    else
        echo "gzip finished compressing '$1' (archive is '$1.gz')"
        break
    fi
done
exit 0
Or if you just want to wait and do nothing more, it's even simpler:
gzip -cr "$1" > "$1.gz"
while fuser -s "$1" "$1.gz"; do :; done
# copy your file wherever you want
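If the compression really does run in the background, another option (not part of the answer above, so take it as a hedged alternative) is the shell's wait builtin, which blocks until the background job has exited:
#!/bin/bash
# Sketch: compress in the background, block on the job with `wait`, then transfer.
# The file name, user and host are placeholders.
gzip -c bigfile > bigfile.gz &
gzip_pid=$!
wait "$gzip_pid"     # returns only once the gzip job has exited
scp bigfile.gz user@remotehost:/destination/path/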

How to get files list downloaded with scp -r

Is it possible to get the list of files that were downloaded using scp -r?
Example:
$ scp -r $USERNAME@HOSTNAME:~/backups/ .
3.tar 100% 5 0.0KB/s 00:00
2.tar 100% 5 0.0KB/s 00:00
1.tar 100% 4 0.0KB/s 00:00
Expected result:
3.tar
2.tar
1.tar
The output that scp generates does not seem to come out on any of the standard streams (stdout or stderr), so capturing it directly may be difficult. One way you could do this would be to make scp output verbose information (by using the -v switch) and then capture and process this information. The verbose information is output on stderr, so you will need to capture it using the 2> redirection operator.
For example, to capture the verbose output do:
scp -rv $USERNAME@HOSTNAME:~/backups/ . 2> scp.output
Then you will be able to filter this output with something like this:
awk '/Sending file/ {print $NF}' scp.output
The awk command simply prints the last word on the relevant line. If you have spaces in your filenames then you may need to come up with a more robust filter.
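If file names may contain spaces, a hedged sketch of a more tolerant filter, assuming the verbose lines look like "Sending file modes: C0644 7864 a.sql" (as quoted in the answer further down): strip the fixed prefix instead of printing the last field, so spaces in the name survive.
# Print everything after the mode and size fields; file names with spaces stay intact.
sed -n 's/^.*Sending file modes: C[0-7]* [0-9]* //p' scp.output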
I realise that you asked the question about scp, but I will give you an alternative solution to the same problem: copying files recursively from a server using ssh, and getting the names of the files that are copied.
The scp solution has at least one problem: if you copy lots of files, it takes a while, as each file generates a transaction. Instead of scp, I use ssh and tar:
ssh $USERNAME@HOSTNAME "cd ~/backups/ && tar -cf - ." | tar -xf -
With that, adding a tee and a tar -t gives you what you need:
ssh $USERNAME@HOSTNAME "cd ~/backups/ && tar -cf - ." | tee >(tar -xf -) | tar -tf - > file_list
Note that it might not work in all shells (bash is OK), as the >(...) construct (process substitution) is not universally available. If you do not have it in your shell, you could use a fifo (process substitution is essentially a shorter way of writing the same thing):
mkfifo tmp4tar
(tar -xf tmp4tar ; rm tmp4tar;) &
ssh $USERNAME@HOSTNAME "cd ~/backups/ && tar -cf - ." | tee -a tmp4tar | tar -tf - > file_list
Another option is to parse the verbose output directly:
scp -v -r yourdir orczhou@targethost:/home/orczhou/ \
    2> >(awk '{if($0 ~ "Sending file modes:")print $6}')
With -v, lines like "Sending file modes: C0644 7864 a.sql" are output to stderr; the awk picks the file name out of each of those lines to build the file list.
