Bash script with curl operation in parallel - bash

I have a list of URLs which I'd like to load with curl and do some operations on the result with a bash script.
Since it is almost 100k requests, I'd like to run this in parallel.
I already looked into GNU parallel, but how am I going to glue it all together? Thanks!
The bash script:
while read URL; do
    curl -L -H "Accept: application/unixref+xml" $URL > temp.xml;
    YEAR=$(xmllint --xpath '//year' temp.xml);
    MONTH=$(xmllint --xpath '(//date/month)[1]' temp.xml);
    echo "$URL;$YEAR;$MONTH" >> results.csv;
    sed -i '1d' urls.txt;
done < urls.txt;

You shouldn't be modifying the input list of URLs as you make each HTTP request. And having multiple appenders writing to the same output file from different processes will likely end in tears.
Put most of your commands in a separate script (named, say, geturl.sh) that can be invoked with the URL as a parameter and writes its line of output to standard out:
#!/usr/bin/env bash
URL="${1}"
curl -L -H "Accept: application/unixref+xml" "${URL}" > /tmp/$$.xml
YEAR="$(xmllint --xpath '//year' /tmp/.xml)"
MONTH="$(xmllint --xpath '(//date/month)[1]' /tmp/$$.xml)"
rm -f /tmp/$$.xml
echo "${URL};${YEAR};${MONTH}"
Then invoke as follows (here we let parallel merge the outputs from the various threads line by line):
parallel --line-buffer geturl.sh < urls.txt > results.csv
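If you would rather skip the temp file entirely, here is a minimal sketch of the same helper (the flags are the ones from the question; reading the response into a variable and querying it via stdin is the only change):
#!/usr/bin/env bash
# Variant of geturl.sh without a temp file: fetch once, then run both
# XPath queries against the response held in a shell variable.
URL="${1}"
XML="$(curl -sL -H "Accept: application/unixref+xml" "${URL}")"
YEAR="$(xmllint --xpath '//year' - <<< "${XML}")"
MONTH="$(xmllint --xpath '(//date/month)[1]' - <<< "${XML}")"
echo "${URL};${YEAR};${MONTH}"
The invocation with parallel stays exactly the same.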

Related

How to use parallel with curl?

How do I use GNU parallel to make this process faster?
#!/bin/bash
for (( c=1; c<=100; c++ ))
do
    curl -sS 'https://example.com' \
        --data 'value='$c'' /dev/null
    echo $c
done
You can use parallel, or xargs
seq 100 | parallel curl -sS 'https://example.com' --data value='{}' /dev/null
seq 100 | xargs -I{} curl -sS 'https://example.com' --data value='{}' /dev/null
As the script stands, output will be sent to stdout. With xargs, this can result in output from different calls being interleaved. Consider redirecting output to files for additional processing if needed (a sketch of this follows below).
You can add options for max parallel (-Pn, etc.) as needed
I'm not sure why '/dev/null' is needed. Consider reordering:
curl -sS --data value='{}' https://example.com
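Picking up the note above about interleaved output: a sketch that gives every call its own output file, so nothing is mixed on stdout (the out.N.txt naming and the job count of 4 are just illustrative):
# Each curl writes to its own file; xargs -I passes the sequence number
# as $1 to the inner shell, which also builds the file name from it.
seq 100 | xargs -P 4 -I{} sh -c 'curl -sS --data "value=$1" https://example.com > "out.$1.txt"' _ {}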

Run multiple curl commands in parallel

I have the following shell script. The issue is that I want to run the transactions in parallel/concurrently, without waiting for one request to finish before going to the next. For example, if I make 20 requests, I want them all to be executed at the same time.
for ((request=1;request<=20;request++))
do
    for ((x=1;x<=20;x++))
    do
        time curl -X POST --header "http://localhost:5000/example"
    done
done
Any guide?
You can use xargs with -P option to run any command in parallel:
seq 1 200 | xargs -n1 -P10 curl "http://localhost:5000/example"
This will run the curl command 200 times with a maximum of 10 jobs in parallel.
Using xargs -P option, you can run any command in parallel:
xargs -I % -P 8 curl -X POST --header "http://localhost:5000/example" \
< <(printf '%s\n' {1..400})
This will run the given curl command 400 times with a maximum of 8 jobs in parallel.
Update 2020:
Curl can now fetch several websites in parallel:
curl --parallel --parallel-immediate --parallel-max 3 --config websites.txt
websites.txt file:
url = "website1.com"
url = "website2.com"
url = "website3.com"
This is an addition to @saeed's answer.
I faced an issue where it made unnecessary requests to the following hosts
0.0.0.1, 0.0.0.2 .... 0.0.0.N
The reason was that xargs was appending each argument to the curl command, and curl treated those numbers as additional URLs. To prevent the argument from being appended, we can use the -I flag to specify a placeholder character that the argument replaces.
So we will use it as,
... xargs -I '$' command ...
Now xargs will substitute the argument wherever the $ literal is found, and if it is not found, the argument is not passed at all. Using this, the final command becomes:
seq 1 200 | xargs -I $ -n1 -P10 curl "http://localhost:5000/example"
Note: If you are using $ in your command try to replace it with some other character that is not being used.
Adding to @saeed's answer, I created a generic function that uses its arguments to fire a command a total of N times with at most M jobs in parallel.
function conc(){
    cmd=("${@:3}")
    seq 1 "$1" | xargs -n1 -P"$2" "${cmd[@]}"
}
$ conc N M cmd
$ conc 10 2 curl --location --request GET 'http://google.com/'
This will fire 10 curl commands at a max parallelism of two each.
Adding this function to your .bashrc or .bash_profile makes it easier to reuse.
Add "wait" at the end, and background them.
for ((request=1;request<=20;request++))
do
    for ((x=1;x<=20;x++))
    do
        time curl -X POST --header "http://localhost:5000/example" &
    done
done
wait
They will all output to the same stdout, but you can redirect the result of the time (and stdout and stderr) to a named file:
time curl -X POST --header "http://localhost:5000/example" > output.${x}.${request}.out 2>&1 &
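If firing all 400 background jobs at once is too aggressive, here is a sketch that caps the number of concurrent curls with plain bash; it assumes bash 4.3+ for wait -n, and drops the stray --header flag from the original command:
# Cap the number of simultaneous background curls at max_jobs.
max_jobs=10
for ((request=1; request<=20; request++)); do
    for ((x=1; x<=20; x++)); do
        curl -s -X POST "http://localhost:5000/example" > "output.${x}.${request}.out" 2>&1 &
        # Whenever the cap is reached, wait for any one job to finish.
        while (( $(jobs -rp | wc -l) >= max_jobs )); do
            wait -n
        done
    done
done
wait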
Wanted to share my example of how I utilised xargs to run curl in parallel.
The advantage of using xargs is that you can specify how many processes will be used to parallelise curl, rather than backgrounding every curl with "&", which would schedule all (say, 10,000) curls simultaneously.
Hope it will be helpful to somebody:
#!/bin/sh
url=/any-url
currentDate=$(date +%Y-%m-%d)
payload='{"field1":"value1", "field2":{},"timestamp":"'$currentDate'"}'
threadCount=10
cat "$1" | \
xargs -P "$threadCount" -I {} curl -sw 'url = %{url_effective}, http_status_code = %{http_code}, time_total = %{time_total} seconds\n' -H "Content-Type: application/json" -H "Accept: application/json" -X POST "$url" --max-time 60 -d "$payload"
The .csv file (passed as $1) has one value per row, which is meant to be inserted into the JSON payload.
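If each row really should land inside the JSON body, one hedged way to do it (the field name value is hypothetical; $url, $currentDate and $threadCount come from the script above) is to put the xargs placeholder directly into the -d argument:
# xargs -I substitutes {} inside the JSON body for every input row.
xargs -P "$threadCount" -I {} curl -s -X POST "$url" --max-time 60 \
    -H "Content-Type: application/json" -H "Accept: application/json" \
    -d '{"value":"{}","timestamp":"'"$currentDate"'"}' < "$1"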
Based on the solution provided by @isopropylcyanide and the comment by @Dario Seidl, I find this to be the best response as it handles both curl and httpie.
# conc N M cmd - fire (N) commands at a max parallelism of (M) each
function conc(){
    cmd=("${@:3}")
    seq 1 "$1" | xargs -I'$XARGI' -P"$2" "${cmd[@]}"
}
For example:
conc 10 3 curl -L -X POST https://httpbin.org/post -H 'Authorization: Basic dXNlcjpwYXNz' -H 'Content-Type: application/json' -d '{"url":"http://google.com/","foo":"bar"}'
conc 10 3 http --ignore-stdin -F -a user:pass httpbin.org/post url=http://google.com/ foo=bar

Why use -Lo- with curl when piping to bash?

In the janus project, they use curl to download and pipe a bootstrap script into bash.
https://github.com/carlhuda/janus
It looks like this:
$ curl -Lo- https://bit.ly/janus-bootstrap | bash
Why would one want to use the args -Lo-?
-o is supposed to be for output, but wouldn't that happen anyway (i.e. to stdout)?
It's all in the man pages:
-L: in case the page has moved (a 3xx response), curl will follow the redirect to the new address.
-o: write output to a file instead of stdout (usually the screen). Here the file name is -, which curl treats as stdout anyway, so the flag is redundant: the output is piped to bash (for execution), not written to a file.
The -o is redundant, they produce the exact same output:
$ curl --silent example.com | sha256sum
3587cb776ce0e4e8237f215800b7dffba0f25865cb84550e87ea8bbac838c423 *-
$ curl --silent --output - example.com | sha256sum
3587cb776ce0e4e8237f215800b7dffba0f25865cb84550e87ea8bbac838c423 *-
They have used that syntax since that line was first introduced in 2011.
You might ask Wael Nasreddine (@kalbasit on GitHub) why he did it. He
is still active on that repo.

Using all files in a directory with curl?

This is my script:
#!/bin/bash
curl -X POST -T /this/is/my/path/system.log https://whatever;
As you see, I am using a file called system.log. How can I do that for the complete /this/is/my/path/ path in a loop? There are about 50 files in /this/is/my/path/ which I want to use with curl.
Thanks!
You can upload multiple files using curl's brace syntax:
$ curl -u ftpuser:ftppass -T "{file1,file2}" ftp://ftp.testserver.com
A more flexible solution is to iterate with a for loop. That also lets you add echo commands, delete files, or run whatever other command you want per file.
#!/bin/bash
for file in /this/is/my/path/*
do
    curl -X POST -T "$file" https://whatever;
done
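If the uploads should not run strictly one after another, a sketch that reuses the xargs -P approach from the earlier answers (the job count of 5 is arbitrary; flags and endpoint are the ones from the question):
# Upload every file in the directory with up to 5 curls in flight.
printf '%s\n' /this/is/my/path/* | xargs -P 5 -I{} curl -X POST -T "{}" https://whatever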

Script to get the HTTP status code of a list of urls?

I have a list of URLS that I need to check, to see if they still work or not. I would like to write a bash script that does that for me.
I only need the returned HTTP status code, i.e. 200, 404, 500 and so forth. Nothing more.
EDIT Note that there is an issue if the page says "404 not found" but returns a 200 OK message. It's a misconfigured web server, but you may have to consider this case.
For more on this, see Check if a URL goes to a page containing the text "404"
Curl has a specific option, --write-out, for this:
$ curl -o /dev/null --silent --head --write-out '%{http_code}\n' <url>
200
-o /dev/null throws away the usual output
--silent throws away the progress meter
--head makes a HEAD HTTP request, instead of GET
--write-out '%{http_code}\n' prints the required status code
To wrap this up in a complete Bash script:
#!/bin/bash
while read -r LINE; do
    curl -o /dev/null --silent --head --write-out "%{http_code} $LINE\n" "$LINE"
done < url-list.txt
(Eagle-eyed readers will notice that this uses one curl process per URL, which imposes fork and TCP connection penalties. It would be faster if multiple URLs were combined in a single curl invocation, but there isn't space to write out the monstrous repetition of options that curl requires to do this.)
wget --spider -S "http://url/to/be/checked" 2>&1 | grep "HTTP/" | awk '{print $2}'
prints only the status code for you
Extending the answer already provided by Phil: adding parallelism to it is a no-brainer in bash if you use xargs for the call.
Here the code:
xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' < url.lst
-n1: use just one value (from the list) as argument to the curl call
-P10: Keep 10 curl processes alive at any time (i.e. 10 parallel connections)
Check the --write-out option in the curl manual for more data you can extract with it (timings, etc.).
In case it helps someone this is the call I'm currently using:
xargs -n1 -P 10 curl -o /dev/null --silent --head --write-out '%{url_effective};%{http_code};%{time_total};%{time_namelookup};%{time_connect};%{size_download};%{speed_download}\n' < url.lst | tee results.csv
It just outputs a bunch of data into a csv file that can be imported into any office tool.
This relies on widely available wget, present almost everywhere, even on Alpine Linux.
wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'
The explanations are as follow :
--quiet
Turn off Wget's output.
Source - wget man pages
--spider
[ ... ] it will not download the pages, just check that they are there. [ ... ]
Source - wget man pages
--server-response
Print the headers sent by HTTP servers and responses sent by FTP servers.
Source - wget man pages
What they don't say about --server-response is that those headers are printed to standard error (stderr), hence the need for the 2>&1 redirect to standard output.
With the output on standard output, we can pipe it to awk to extract the HTTP status code. That code is:
the second ($2) non-blank group of characters: {$2}
on the very first line of the header: NR==1
And because we want to print it... {print $2}.
wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'
Use curl to fetch the HTTP-header only (not the whole file) and parse it:
$ curl -I --stderr /dev/null http://www.google.co.uk/index.html | head -1 | cut -d' ' -f2
200
wget -S -i file will get you the headers from each URL in a file.
Filter through grep for the status code specifically (a sketch follows below).
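A sketch of that filter, combined with --spider so nothing is actually downloaded (the list file name is assumed):
# -S sends the response headers to stderr, hence the 2>&1 before grep;
# awk then picks the status code out of each HTTP/ line.
wget --spider -S -i url-list.txt 2>&1 | grep "HTTP/" | awk '{print $2}'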
I found a tool called "webchk", written in Python, that returns a status code for a list of URLs.
https://pypi.org/project/webchk/
Output looks like this:
▶ webchk -i ./dxieu.txt | grep '200'
http://salesforce-case-status.dxi.eu/login ... 200 OK (0.108)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.389)
https://support.dxi.eu/hc/en-gb ... 200 OK (0.401)
Hope that helps!
Keeping in mind that curl is not always available (particularly in containers), there are issues with this solution:
wget --server-response --spider --quiet "${url}" 2>&1 | awk 'NR==1{print $2}'
which will return an exit status of 0 even if the URL doesn't exist.
Alternatively, here is a reasonable container health-check for using wget:
wget -S --spider -q -t 1 "${url}" 2>&1 | grep "200 OK" > /dev/null
While it may not give you the exact status code, it will at least give you a valid exit-code-based health response (even with redirects on the endpoint).
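A related sketch: wget's own exit status is already non-zero when the server answers with an error such as 404, so a health check can also skip parsing output entirely (the echo messages are placeholders):
# Rely on wget's exit status instead of grepping the headers.
if wget --spider -q -t 1 "${url}"; then
    echo "healthy"
else
    echo "unhealthy"
fi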
Due to https://mywiki.wooledge.org/BashPitfalls#Non-atomic_writes_with_xargs_-P (output from parallel jobs in xargs risks being mixed), I would use GNU Parallel instead of xargs to parallelize:
cat url.lst |
parallel -P0 -q curl -o /dev/null --silent --head --write-out '%{url_effective}: %{http_code}\n' > outfile
In this particular case it may be safe to use xargs because the output is so short; the problem with using xargs is rather that if someone later changes the code to do something bigger, it will no longer be safe. Or if someone reads this question and thinks they can replace curl with something else, then that may also not be safe.
