Shell script to batch download files using curl + cookie and merge those files - shell

I have a list of URLs to files that I want to download and join. They can only be accessed when authenticated.
So first I call:
curl -c cookie.txt http://url.to.authenticate
Then I can download a file file1 using the cookie:
curl -b cookie.txt -O http://url.to.file1
At the end I would just use cat:
cat file1 file2 file3 ... > file_merged
I have 320 of these URLs stored in a text file and want to create a script with the URLs included in it, so that all I need to do is copy the script to a remote computer and execute it.
I am not that good at shell scripting and would love it if someone could help me out a bit.
Maybe something a little more fail-proof than
#!/bin/sh
curl -c cookie.txt http://url.to.authenticate
curl -b cookie.txt -O http://url.to.file1
curl -b cookie.txt -O http://url.to.file2
curl -b cookie.txt -O http://url.to.file3
...
cat file1 file2 file3 ... file320 > file_merged

So, something like (if your list of files is stored in files.txt):
#!/bin/sh
curl -c cookie.txt http://url.to.authenticate
while read -r f; do
    curl -b cookie.txt -o "$f" "http://url.to.$f"
    cat "$f" >> file_merged
    rm -f "$f"
done < files.txt
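If you want something closer to "fail-proof", here is a sketch along the same lines (assuming files.txt holds one file name per line; curl's -f flag makes it return a non-zero exit code on HTTP errors instead of saving an error page):
#!/bin/sh
curl -c cookie.txt http://url.to.authenticate || exit 1
: > file_merged                      # start with an empty merged file
while read -r f; do
    # -f: fail on HTTP errors so we don't append an error page
    if curl -f -b cookie.txt -o "$f" "http://url.to.$f"; then
        cat "$f" >> file_merged
        rm -f "$f"
    else
        echo "download of $f failed, stopping" >&2
        exit 1
    fi
done < files.txt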

Related

Why does curl -o output contain sequences like "^[[38;5;250m", when "surf" output looks fine?

I want to write the output of wttr.in to a file with curl. The problem is that the output isn't how it looks when I just surf wttr.in.
What I did is:
curl wttr.in -o ~/wt.tex and curl wttr.in -o ~/wt
The output is like: <output>
It should be https://wttr.in.
I solved it myself:
less -r -f -L wt.tex
-r outputs the raw control characters, so the ANSI colour escape sequences are rendered instead of shown literally.
-f forces less to open the file without asking.
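If you want the saved file itself to be plain text rather than rendering the colours at view time, one option is to strip the ANSI colour sequences before saving (a sketch; the \x1b escape assumes GNU sed):
curl wttr.in | sed 's/\x1b\[[0-9;]*m//g' > ~/wt.txt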

wget to parse a webpage in shell

I am trying to extract URLs from a webpage using wget. I tried this:
wget -r -l2 --reject=gif -O out.html www.google.com | sed -n 's/.*href="\([^"]*\).*/\1/p'
It is displaying FINISHED
Downloaded: 18,472 bytes in 1 files
But it is not displaying the web links. If I try to do it separately:
wget -r -l2 --reject=gif -O out.html www.google.com
sed -n 's/.*href="\([^"]*\).*/\1/p' < out.html
Output
http://www.google.com/intl/en/options/
/intl/en/policies/terms/
It is not displaying all of the links, for example:
http://www.google.com
http://maps.google.com
https://play.google.com
http://www.youtube.com
http://news.google.com
https://mail.google.com
https://drive.google.com
http://www.google.com
http://www.google.com
http://www.google.com
https://www.google.com
https://plus.google.com
And moreover, I want to get links from the 2nd level and deeper. Can anyone give a solution for this?
Thanks in advance
The -O file option captures the output of wget and writes it to the specified file, so there is no output going through the pipe to sed.
You can say -O - to direct wget output to standard output.
If you don't want to use grep, you can try
sed -n "/href/ s/.*href=['\"]\([^'\"]*\)['\"].*/\1/gp"

Passing curl results to wget with bash

I have a small script that I'd like to use with cron.
Purpose: Get webpage with links, extract dates from link and download files.
The script below is not working 100% and I can't see the problem.
#!/bin/bash
for i in $(curl http://107.155.72.213/anarirecap.php 2>&1 | grep -o -E 'href="([^"#]+)"' | cut -d'"' -f2 | grep '_whole_1_3000.mp4'); do
GAMEDAY=$(echo "$i" | grep -Eo '[[:digit:]]{4}/[[:digit:]]{2}/[[:digit:]]{2}')
wget "$i" --output-document="$GAMEDAY.mp4"
done
It gets the webpage ("curl http://..." etc.) - works.
$GAMEDAY - extracts the date - works.
The wget part is not working when I add $GAMEDAY. Am I blind ... what am I missing?
Look at your output format here:
wget "$i" -O 2015/05/12.mp4
This is looking for a directory named 2015 with a subdirectory named 05 in which to place the file 12.mp4. Those directories don't exist, so you get 2015/05/12.mp4: No such file or directory.
If you want to replace the /s with underscores:
wget -O "${GAMEDAY//\//_}" "$i"
Alternately, if you want to create the directories if they don't exist:
mkdir -p -- "(dirname "$GAMEDAY")"
wget -O "$GAMEDAY" "$i"

How to download multiple URLs using wget using a single command?

I am using the following command to download a single webpage with all its images and JS using wget in Windows 7:
wget -E -H -k -K -p -e robots=off -P /Downloads/ http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
It is downloading the HTML as required, but when I tried to pass a text file with a list of 3 URLs to download, it didn't give any output. Below is the command I am using:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt -B 'http://'
I tried this also:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
This text file had the URLs with http:// prepended in it.
list.txt contains a list of 3 URLs which I need to download using a single command. Please help me resolve this issue.
From man wget:
2 Invoking
By default, Wget is very simple to invoke. The basic syntax is:
wget [option]... [URL]...
So, just use multiple URLs:
wget URL1 URL2
Or using the links from comments:
$ cat list.txt
http://www.vodafone.de/privat/tarife/red-smartphone-tarife.html
http://www.verizonwireless.com/smartphones-2.shtml
http://www.att.com/shop/wireless/devices/smartphones.html
and your command line:
wget -E -H -k -K -p -e robots=off -P /Downloads/ -i ./list.txt
works as expected.
First create a text file with the URLs that you need to download.
eg: download.txt
download.txt will look as below:
http://www.google.com
http://www.yahoo.com
then use the command wget -i download.txt to download the files. You can add many URLs to the text file.
If you have a list of URLs separated on multiple lines like this:
http://example.com/a
http://example.com/b
http://example.com/c
but you don't want to create a file and point wget to it, you can do this:
wget -i - <<< 'http://example.com/a
http://example.com/b
http://example.com/c'
pedantic version:
for x in {'url1','url2'}; do wget "$x"; done
the advantage is that you can treat it as a single wget command for multiple URLs
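With real URLs (hypothetical ones here) the brace expansion produces two separate wget calls:
for x in {'http://example.com/a','http://example.com/b'}; do wget "$x"; done
# equivalent to: wget http://example.com/a; wget http://example.com/b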

curl upload command using bash & terminal

When I use bash to upload files to Dropbox, it works fine, but when I manually use the command line it does not work.
I'm thinking it might be the & in the URL... I'm not sure.
Bash code:
CURL_BIN="/usr/bin/curl"
#Note: This option explicitly allows curl to perform "insecure" SSL connections and transfers.
#CURL_ACCEPT_CERTIFICATES="-k"
CURL_PARAMETERS="--progress-bar"
APPKEY="zrwv8z3bycfk3m8"
OAUTH_ACCESS_TOKEN="aaaaaaaa"
APPSECRET="aaaaaaaaaa"
OAUTH_ACCESS_TOKEN_SECRET="aaaaaaaaa"
ACCESS_LEVEL="dropbox"
API_UPLOAD_URL="https://api-content.dropbox.com/1/files_put"
RESPONSE_FILE="temp2.txt"
FILE_SRC="temp.txt"
$CURL_BIN $CURL_ACCEPT_CERTIFICATES $CURL_PARAMETERS -v -i -o "$RESPONSE_FILE" --upload-file "$FILE_SRC" "$API_UPLOAD_URL/$ACCESS_LEVEL/$FILE_DST?oauth_consumer_key=$APPKEY&oauth_token=$OAUTH_ACCESS_TOKEN&oauth_signature_method=PLAINTEXT&oauth_signature=$APPSECRET%26$OAUTH_ACCESS_TOKEN_SECRET"
Manual code:
curl --insecure --progress-bar -v -i -o temp2.txt --upload-file temp.txt https://api-content.dropbox.com/1/files_put/dropbox/attachments/temp.txt?oauth_consumer_key=aaaaaaaaaa&oauth_token=aaaaaaaaa&oauth_signature_method=PLAINTEXT&oauth_signature=aaaaaaaaa%26aaaaaaaaaa
curl --insecure --progress-bar -v -i -o temp2.txt --upload-file temp.txt "https://api-content.dropbox.com/1/files_put/dropbox/attachments/temp.txt?oauth_consumer_key=aaaaaaaaaa&oauth_token=aaaaaaaaa&oauth_signature_method=PLAINTEXT&oauth_signature=aaaaaaaaa%26aaaaaaaaaa"
The solution is to wrap the URL in double quotes (the second command above). Without them, the shell treats each unquoted & as a control operator that backgrounds the command so far, so curl never receives the rest of the URL.
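An alternative to quoting the whole URL is to escape each & individually, though quoting is usually cleaner:
curl --insecure --progress-bar -v -i -o temp2.txt --upload-file temp.txt https://api-content.dropbox.com/1/files_put/dropbox/attachments/temp.txt?oauth_consumer_key=aaaaaaaaaa\&oauth_token=aaaaaaaaa\&oauth_signature_method=PLAINTEXT\&oauth_signature=aaaaaaaaa%26aaaaaaaaaa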
