Requesting data indefinitely in curl - bash

I have a 200MB file to download. I don't want to download it in one go by passing the URL straight to cURL, because my college blocks requests larger than 150MB.
So I can download the data in 10MB chunks by passing range parameters to cURL, but I don't know how many 10MB chunks to download. Is there a way in cURL to keep downloading data indefinitely? Something like:
while(next byte present)
download byte;
Thanks :)

Command-line curl lets you specify a byte range to download, so for your 150MB max you'd do something like:
curl http://example.com/200_meg_file -r 0-104857600 > the_file
curl http://example.com/200_meg_file -r 104857601-209715200 >> the_file
and so on until the entire thing's downloaded, grabbing 100meg chunks at a time and appending each chunk to the local copy.
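If you'd rather not work out the ranges by hand, a small wrapper can keep requesting fixed-size chunks until the server has nothing left. This is only a sketch (the URL, output name and chunk size are placeholders, and it assumes the server honors Range requests and answers a request past the end of the file with an HTTP error such as 416):
#!/bin/bash
url="http://example.com/200_meg_file"
out="the_file"
chunk=$((100 * 1024 * 1024))   # 100MB per request, staying under the 150MB limit
offset=0
: > "$out"                     # start from an empty file
while :; do
  # -f makes curl exit non-zero on HTTP errors (e.g. 416 once we ask past the end)
  curl -f -s -r "$offset-$((offset + chunk - 1))" "$url" >> "$out" || break
  # If the last chunk came back short, we have reached the end of the file.
  size=$(stat -c %s "$out" 2>/dev/null || stat -f %z "$out")
  [ "$size" -lt $((offset + chunk)) ] && break
  offset=$((offset + chunk))
done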

Curl already has the ability to resume a download. Just run like this:
$> curl -C - $url -o $output_file
Of course this won't figure out when to stop, per se. However it would be pretty easy to write a wrapper. Something like this:
#!/bin/bash
url="http://someurl/somefile"
out="outfile"
touch "$out"
last_size=-1
while [ "`du -b $out | sed 's/\W.*//'`" -ne "$last_size" ]; do
curl -C - "$url" -o "$out"
last_size=`du -b $out | sed 's/\W.*//'`
done
I should note that curl outputs a fun-looking error:
curl: (18) transfer closed with outstanding read data remaining
However I tested this on a rather large ISO file, and the md5 still matched up even though the above error was shown.

curl: (23) Failed writing body

It works ok as a single tool:
curl "someURL"
curl -o - "someURL"
but it doesn't work in a pipeline:
curl "someURL" | tr -d '\n'
curl -o - "someURL" | tr -d '\n'
it returns:
(23) Failed writing body
What is the problem with piping the cURL output? How can I buffer the whole cURL output and then handle it?
This happens when a piped program (e.g. grep) closes the read pipe before the previous program is finished writing the whole page.
In curl "url" | grep -qs foo, as soon as grep has what it wants it will close the read stream from curl. cURL doesn't expect this and emits the "Failed writing body" error.
A workaround is to pipe the stream through an intermediary program that always reads the whole page before feeding it to the next program.
E.g.
curl "url" | tac | tac | grep -qs foo
tac is a simple Unix program that reads the entire input and reverses the line order (hence we run it twice). Because it has to read the whole input to find the last line, it will not output anything to grep until cURL is finished. grep will still close the read stream when it has what it's looking for, but that only affects tac, which doesn't emit an error.
For completeness and future searches:
It's a matter of how cURL manages its output buffer: you can disable buffering with the -N option.
Example:
curl -s -N "URL" | grep -q Welcome
Another possibility, if you are using the -o (output file) option: the destination directory does not exist.
E.g. if you have -o /tmp/download/abc.txt and /tmp/download does not exist.
Hence, ensure any required directories exist beforehand, or use the --create-dirs option together with -o.
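For example (same hypothetical path as above):
curl --create-dirs -o /tmp/download/abc.txt "someURL"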
The server ran out of disk space, in my case.
Check for it with df -k .
I was alerted to the lack of disk space when I tried piping through tac twice, as described in one of the other answers: https://stackoverflow.com/a/28879552/336694. It showed me the error message write error: No space left on device.
You can do this instead of using the -o option:
curl [url] > [file]
In my case it was a problem of encoding; iconv solves it:
curl 'http://www.multitran.ru/c/m.exe?CL=1&s=hello&l1=1' | iconv -f windows-1251 | tr -dc '[:print:]' | ...
If you are trying something like source <( curl -sS $url ) and getting the (23) Failed writing body error, it is because sourcing a process substitution doesn't work in Bash 3.2 (the default on macOS).
Instead, you can use this workaround.
source /dev/stdin <<<"$( curl -sS $url )"
Trying the command with sudo worked for me. For example:
sudo curl -O -k 'https url here'
Note: -O (that is a capital letter O, not a zero) saves the output under the remote file name, and -k allows insecure connections for an https URL.
I had the same error, but for a different reason. In my case I had a tmpfs partition with only 1GB of space, and I was downloading a big file that eventually filled all the space on that partition, which produced the same error.
I encountered the same problem when doing:
curl -L https://packagecloud.io/golang-migrate/migrate/gpgkey | apt-key add -
The apt-key add at the end of the pipeline needs to be executed with root privileges.
Writing it in the following way solved the issue for me:
curl -L https://packagecloud.io/golang-migrate/migrate/gpgkey | sudo apt-key add -
If you put sudo before curl instead, you will get the Failed writing body error.
For me, it was a permissions issue. docker run is called with a user profile, but root is the user inside the container. The solution was to make curl write to /tmp, since that has write permission for all users, not just root.
I used the -o option.
-o /tmp/file_to_download
In my case, I was doing:
curl <blabla> | jq | grep <blibli>
With jq . it worked: curl <blabla> | jq . | grep <blibli>
I encountered this error message while trying to install Varnish Cache on Ubuntu. A Google search landed me here for the error (23) Failed writing body, hence posting the solution that worked for me.
The bug is encountered while running the command as root:
curl -L https://packagecloud.io/varnishcache/varnish5/gpgkey | apt-key add -
The solution is to run apt-key add as non-root:
curl -L https://packagecloud.io/varnishcache/varnish5/gpgkey | apt-key add -
The explanation here by @Kaworu is great: https://stackoverflow.com/a/28879552/198219
This happens when a piped program (e.g. grep) closes the read pipe before the previous program is finished writing the whole page. cURL doesn't expect this and emits the "Failed writing body" error.
A workaround is to pipe the stream through an intermediary program that always reads the whole page before feeding it to the next program.
I believe the more correct implementation would be to use sponge, as already suggested by @nisetama in the comments:
curl "url" | sponge | grep -qs foo
I got this error trying to use jq when I didn't have jq installed. So... make sure jq is installed if you're trying to use it.
In Bash and zsh (and perhaps other shells), you can use process substitution to create a file on the fly, and then use it as input to the next process in the pipeline chain.
For example, I was trying to parse JSON output from cURL using jq and less, but was getting the Failed writing body error.
# Note: this does NOT work
curl https://gitlab.com/api/v4/projects/ | jq | less
When I rewrote it using process substitution, it worked!
# this works!
jq "" <(curl https://gitlab.com/api/v4/projects/) | less
Note: jq uses its 2nd argument to specify an input file
Bonus: If you're using jq like me and want to keep the colorized output in less, use the following command line instead:
jq -C "" <(curl https://gitlab.com/api/v4/projects/) | less -r
(Thanks to @Kaworu for the explanation of why Failed writing body was occurring. However, the solution of using tac twice didn't work for me. I also wanted a solution that would scale better for large files and avoid the other issues noted in the comments on that answer.)
I was getting curl: (23) Failed writing body. Later I noticed that I did not have sufficient space for downloading an RPM package via curl, and that was the reason for the issue. I freed up some space and the issue was resolved.
I had the same error because of my own typo: I piped into echo with | where I meant ||, and since echo never reads its input, curl could not write its output:
# fails because of reasons mentioned above
curl -I -fail https://www.google.com | echo $?
curl: (23) Failed writing body
# success
curl -I -fail https://www.google.com || echo $?
I added the -s flag and it did the job, e.g.: curl -o- -s https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.1/install.sh | bash

Check if a remote file exists in bash

I am downloading files with this script:
parallel --progress -j16 -a ./temp/img-url.txt 'wget -nc -q -P ./images/ {}; wget -nc -q -P ./images/ {.}_{001..005}.jpg'
Would it be possible not to download the files, but just check on the remote side whether they exist and, if so, create a dummy file instead of downloading?
Something like:
if wget --spider $url 2>/dev/null; then
#touch img.file
fi
should work, but I don't know how to combine this code with GNU Parallel.
Edit:
Based on Ole's answer I wrote this piece of code:
#!/bin/bash
do_url() {
url="$1"
wget -q -nc --method HEAD "$url" && touch ./images/${url##*/}
#get filename from $url
url2=${url##*/}
wget -q -nc --method HEAD ${url%.jpg}_{001..005}.jpg && touch ./images/${url2%.jpg}_{001..005}.jpg
}
export -f do_url
parallel --progress -a urls.txt do_url {}
It works, but it fails for some files. I cannot find any consistency in why it works for some files and fails for others. Maybe it has something to do with the last filename. The second wget tries to access the correct URL, but the touch command after it simply does not create the desired file. The first wget always (correctly) downloads the main image without the _001.jpg, _002.jpg suffix.
Example urls.txt:
http://host.com/092401.jpg (works correctly, _001.jpg.._005.jpg are downloaded)
http://host.com/HT11019.jpg (does not work, only the main image is downloaded)
It is pretty hard to understand what it is you really want to accomplish. Let me try to rephrase your question.
I have urls.txt containing:
http://example.com/dira/foo.jpg
http://example.com/dira/bar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.org/dira/foo.jpg
On example.com these URLs exist:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_005.jpg
http://example.com/dira/bar_000.jpg
http://example.com/dira/bar_002.jpg
http://example.com/dira/bar_004.jpg
http://example.com/dira/fubar.jpg
http://example.com/dirb/foo.jpg
http://example.com/dirb/baz.jpg
http://example.com/dirb/baz_001.jpg
http://example.com/dirb/baz_005.jpg
On example.org these URLs exist:
http://example.org/dira/foo_001.jpg
Given urls.txt I want to generate the combinations with _001.jpg .. _005.jpg in addition to the original URL. E.g.:
http://example.com/dira/foo.jpg
becomes:
http://example.com/dira/foo.jpg
http://example.com/dira/foo_001.jpg
http://example.com/dira/foo_002.jpg
http://example.com/dira/foo_003.jpg
http://example.com/dira/foo_004.jpg
http://example.com/dira/foo_005.jpg
Then I want to test if these URLs exist without downloading the file. As there are many URLs I want to do this in parallel.
If the URL exists I want an empty file created.
(Version 1): I want the empty file created in a similar directory structure under the dir images. This is needed because some of the images have the same name, but in different dirs.
So the files created should be:
images/http:/example.com/dira/foo.jpg
images/http:/example.com/dira/foo_001.jpg
images/http:/example.com/dira/foo_003.jpg
images/http:/example.com/dira/foo_005.jpg
images/http:/example.com/dira/bar_000.jpg
images/http:/example.com/dira/bar_002.jpg
images/http:/example.com/dira/bar_004.jpg
images/http:/example.com/dirb/foo.jpg
images/http:/example.com/dirb/baz.jpg
images/http:/example.com/dirb/baz_001.jpg
images/http:/example.com/dirb/baz_005.jpg
images/http:/example.org/dira/foo_001.jpg
(Version 2): I want the empty file created in the dir images. This can be done because all the images have unique names.
So the files created should be:
images/foo.jpg
images/foo_001.jpg
images/foo_003.jpg
images/foo_005.jpg
images/bar_000.jpg
images/bar_002.jpg
images/bar_004.jpg
images/baz.jpg
images/baz_001.jpg
images/baz_005.jpg
(Version 3): I want the empty file created in the dir images, named as in urls.txt. This can be done because only one of _001.jpg .. _005.jpg exists.
images/foo.jpg
images/bar.jpg
images/baz.jpg
#!/bin/bash
do_url() {
url="$1"
# Version 1:
# If you want to keep the folder structure from the server (similar to wget -m):
wget -q --method HEAD "$url" && mkdir -p images/"$2" && touch images/"$url"
# Version 2:
# If all the images have unique names and you want all images in a single dir
wget -q --method HEAD "$url" && touch images/"$3"
# Version 3:
# If all the images have unique names when _###.jpg is removed and you want all images in a single dir
wget -q --method HEAD "$url" && touch images/"$4"
}
export -f do_url
parallel do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
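(For reference, the replacement strings come from GNU Parallel: {1} and {2} refer to the values from the first and second input source, {1.} is the first value with its extension removed, {1/} its basename, {1//} its directory part, and {1/.} its basename without the extension; see the GNU Parallel manual for details.)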
GNU Parallel takes a few ms per job. When your jobs are this short, the overhead will affect the timing. If none of your CPU cores are running at 100% you can run more jobs in parallel:
parallel -j0 do_url {1.}{2} {1//} {1/.}{2} {1/} :::: urls.txt ::: .jpg _{001..005}.jpg
You can also "unroll" the loop. This will save 5 overheads per URL:
do_url() {
url="$1"
# Version 2:
# If all the images have unique names and you want all images in a single dir
wget -q --method HEAD "$url".jpg && touch images/"$url".jpg
wget -q --method HEAD "$url"_001.jpg && touch images/"$url"_001.jpg
wget -q --method HEAD "$url"_002.jpg && touch images/"$url"_002.jpg
wget -q --method HEAD "$url"_003.jpg && touch images/"$url"_003.jpg
wget -q --method HEAD "$url"_004.jpg && touch images/"$url"_004.jpg
wget -q --method HEAD "$url"_005.jpg && touch images/"$url"_005.jpg
}
export -f do_url
parallel -j0 do_url {.} :::: urls.txt
Finally you can run more than 250 jobs: https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Running-more-than-250-jobs-workaround
You may use curl instead to check whether the URLs you are parsing exist, without downloading any file as such:
if curl --head --fail --silent "$url" >/dev/null; then
touch ./images/"${url##*/}"
fi
Explanation:
--fail will make the exit status nonzero on a failed request.
--head will avoid downloading the file contents.
--silent will avoid status or errors from being emitted by the check itself.
To solve the "looping" issue, you can do:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
if curl --head --silent --fail "$url" > /dev/null; then
touch ./images/"${url##*/}"
fi
done
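If you want to fold this check into GNU Parallel, as the question asks, a sketch along these lines should work (do_check is a made-up name; it reuses urls.txt and the ./images/ directory from the question, and checks the base URL plus the _001.._005 variants for each input line):
#!/bin/bash
do_check() {
  url="$1"
  base="${url%.jpg}"
  # Check the original URL and each numbered variant; touch a dummy file for the ones that exist.
  for u in "$url" "$base"_{001..005}.jpg; do
    if curl --head --fail --silent "$u" >/dev/null; then
      touch ./images/"${u##*/}"
    fi
  done
}
export -f do_check
parallel --progress do_check :::: urls.txt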
From what I can see, your question isn't really about how to use wget to test for the existence of a file, but rather about how to perform correct looping in a shell script.
Here is a simple solution for that:
urls=( "${url%.jpg}"_{001..005}.jpg )
for url in "${urls[@]}"; do
if wget -q --method=HEAD "$url"; then
touch ./images/"${url##*/}"
fi
done
This invokes Wget with the --method=HEAD option. With a HEAD request, the server will simply report back whether the file exists or not, without returning any data.
Of course, with a large data set this is pretty inefficient. You're creating a new connection to the server for every file you test. Instead, as suggested in the other answer, you could use GNU Wget2. With wget2, you can test all of these in parallel, and use the new --stats-server option to get a list of all the files and the specific return code that the server provided. For example:
$ wget2 --spider --progress=none -q --stats-site example.com/{,1,2,3}
Site Statistics:
http://example.com:
Status No. of docs
404 3
http://example.com/3 0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
http://example.com/1 0 bytes (gzip) : 0 bytes (decompressed), 241ms (transfer) : 241ms (response)
http://example.com/2 0 bytes (identity) : 0 bytes (decompressed), 238ms (transfer) : 238ms (response)
200 1
http://example.com/ 0 bytes (identity) : 0 bytes (decompressed), 231ms (transfer) : 231ms (response)
You can even get this data printed as CSV or JSON for easier parsing.
Just loop over the names?
for uname in ${url%.jpg}_{001..005}.jpg
do
if wget --spider "$uname" 2>/dev/null; then
touch ./images/"${uname##*/}"
fi
done
You could send a command via ssh to see if the remote file exists and cat it if it does:
ssh your_host 'test -e "somefile" && cat "somefile"' > somefile
You could also try scp, which supports glob expressions and recursion.
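For example (quoting the pattern so the local shell does not expand it; host and path are placeholders):
scp 'your_host:/remote/dir/somefile*' .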

curl: (26) couldn't open file when the file is a variable

I am trying to upload a list of files to a server. This is the script that I have
files=$(shopt -s nullglob dotglob; echo /media/USB/*) > /dev/null 2>&1
if (( ${#files} ))
then
for file in $files
do
echo "Filename"
echo $file
curl -i -X POST -F files=@$file 192.168.1.122:5000/upload
done
fi
Basically I am trying to take all of the files on a USB drive and upload them to my local server. The curl command is giving me trouble. I can move these files to drives that I mount on this system but I haven't been able to send them with the curl command. I have tried variations on @"$file" and @\"$file\" based on other related questions but I haven't been able to get this to work. However what is annoying is that when I do this:
curl -i -X POST -F files=@/absolute/path/to/my/file.txt 192.168.1.122:5000/upload
It works as I expect. How can I get this to work in my loop?
So I ended up figuring out a solution that I will share in case anyone else is having this problem. I am not sure exactly why this fixed it but I simply had to put quotes around the files=@$file in the curl command:
curl -i -X POST -F "files=@$file" 192.168.1.122:5000/upload
Leaving this here in case it is useful to someone down the line.
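For what it's worth, the quotes matter because without them the expansion of $file is word-split, so any path containing a space breaks the -F argument. A slightly more defensive variant of the script (only a sketch, with the same hypothetical upload endpoint) keeps the matches in a bash array instead of a single string:
#!/bin/bash
# Collect all files on the USB drive, including dotfiles; an empty array means nothing matched.
shopt -s nullglob dotglob
files=(/media/USB/*)
if (( ${#files[@]} )); then
  for file in "${files[@]}"; do
    echo "Filename: $file"
    curl -i -X POST -F "files=@$file" 192.168.1.122:5000/upload
  done
fi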

Using all files in a directory with curl?

This is my script:
#!/bin/bash
curl -X POST -T /this/is/my/path/system.log https://whatever;
As you can see, I am using a file called system.log. How can I do that for every file in /this/is/my/path/ in a loop? There are about 50 files in /this/is/my/path/ which I want to use with curl.
Thanks!
You can upload multiple files using curl's brace syntax:
$ curl -u ftpuser:ftppass -T "{file1,file2}" ftp://ftp.testserver.com
A robust solution is to iterate with a for loop. It also lets you insert echo commands, deletions, or whatever other per-file commands you want.
#!/bin/bash
for file in /this/is/my/path/*
do
curl -X POST -T "$file" https://whatever;
done

How to download a file with wget and save it according to the http-reported filename?

When you request a file with wget and that file is served by some dynamic page (e.g. PHP), wget will try to name it after the path of that dynamic page (usually looking as if an angry child got hold of your keyboard: index.php?a8s7df6a8s=d6fa8sd6f90v78wg&l45i87ylqwiu45h=j76h2g461k326v).
However, such pages usually send a Content-Disposition HTTP header along with the file so that user agents can use a sensible file name. How do I get wget to honor that header and use it (instead of the URL) to determine the name under which to save the file?
I found that one way to do this is to use the --server-response flag together with --spider and invoke wget twice (there is certainly room for improvement there!).
Assume the url to be in $link:
wget --quiet --server-response --spider -O /dev/null -- "$link" 2>&1 \
| sed -n 's/^.*filename=\([^;]*\)\(;.*\)\?$/\1/p' \
| while read name; do
wget -O "$name" -- "$link"
break
done
Seems to work like a charm for me.
Possibly, there is a direct way, though. This creates (completely unnecessarily) two connections to the server.
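For the record, recent wget builds also have a --content-disposition option, which tells wget to honor the server-supplied filename directly; the manual describes the feature as experimental, so it is worth verifying on your version before relying on it:
wget --content-disposition -- "$link"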
