I have a JSON file with entries containing URLs (among other things), which I retrieve using curl.
I'd like to be able to run the loop several times at once to go faster, but also to limit the number of parallel curls, to avoid being kicked out by the distant server.
For now, my code looks like this:
jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' $fileIndexFeed | \
while read unitId; do
    read -r unitUrl
    if ! in_array tabAnnoncesExistantesIds $unitId; then
        fullUnitUrl="$unitUrlBase$unitUrl"
        unitFile="$unitFileBase$unitId.json"
        if [ ! -f $unitFile ]; then
            curl -H "Authorization:$authMethod $encodedHeader" -X GET $fullUnitUrl -o $unitFile
        fi
    fi
done
If I use a simple & at the end of my curl, it will run lots of concurrent requests, and I could get kicked.
So the question is, I suppose: how do I know that a curl run with & has finished its job? If I can detect that, then I guess I can test, increment and decrement a variable tracking the number of running curls.
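For what it's worth, the kind of counter I have in mind would look something like this rough, untested sketch (it leaves out the in_array and existing-file checks for brevity; wait -n needs bash 4.3 or newer):
maxJobs=8
jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' "$fileIndexFeed" | {
    while read -r unitId; do
        read -r unitUrl
        # throttle: block until fewer than $maxJobs curls are still running
        while [ "$(jobs -rp | wc -l)" -ge "$maxJobs" ]; do
            wait -n
        done
        curl -H "Authorization:$authMethod $encodedHeader" -X GET "$unitUrlBase$unitUrl" \
            -o "$unitFileBase$unitId.json" &
    done
    wait   # let the last batch of curls finish
}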
Thanks
Use GNU Parallel to control the number of parallel jobs. Either write your curl commands to a file so you can look at them and check them:
commands.txt
curl "something" "somehow" "toSomewhere"
curl "somethingelse" "someotherway" "toSomewhereElse"
Then, if you want no more than 8 jobs running at a time, run:
parallel -j 8 --eta -a commands.txt
Or you can just write the commands to GNU Parallel's stdin:
jq ... | while read ...; do
printf "curl ..."
done | parallel -j 8
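Applied to the loop from the question, that pipeline could look roughly like this (an untested sketch; printf %q is used on the assumption that its quoting keeps the URLs and the Authorization header intact):
jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' "$fileIndexFeed" | \
while read -r unitId; do
    read -r unitUrl
    if ! in_array tabAnnoncesExistantesIds "$unitId"; then
        unitFile="$unitFileBase$unitId.json"
        # print one ready-to-run curl command per line instead of running it
        [ -f "$unitFile" ] || printf 'curl -H %q -X GET %q -o %q\n' \
            "Authorization:$authMethod $encodedHeader" "$unitUrlBase$unitUrl" "$unitFile"
    fi
done | parallel -j 8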
Use a Bash function:
doit() {
    unitId="$1"
    unitUrl="$2"
    if ! in_array tabAnnoncesExistantesIds $unitId; then
        fullUnitUrl="$unitUrlBase$unitUrl"
        unitFile="$unitFileBase$unitId.json"
        if [ ! -f $unitFile ]; then
            curl -H "Authorization:$authMethod $encodedHeader" -X GET $fullUnitUrl -o $unitFile
        fi
    fi
}
jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' $fileIndexFeed |
env_parallel -N2 doit
env_parallel will import the environment, so all shell variables are available.
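Note that env_parallel usually has to be activated once before it can be used; for bash this is typically done with one of the following (as far as I know, the first form simply appends the second line to your ~/.bashrc):
env_parallel --install              # one-time setup for future shells
. "$(which env_parallel.bash)"      # or activate it in the current shell only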
Related
Every day, every 6 hours, I have to download bz2 files from a web server, decompress them and merge them into a single file. This needs to be as efficient and quick as possible, as I have to wait for the download and decompress phase to complete before proceeding with the merge.
I have written some bash functions which take some strings as input to construct a URL of the files to be downloaded as a matching pattern. This way I can pass the matching pattern directly to wget, without having to build the server's contents list locally and then pass it as a list with -i to wget. My function looks something like this:
parallelized_extraction(){
    i=0
    until [ `ls -1 ${1}.bz2 2>/dev/null | wc -l ` -gt 0 -o $i -ge 30 ]; do
        ((i++))
        sleep 1
    done
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l ` -gt 0 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="file_${year}${month}${day}${run}_*_${1}.grib2"
    wget -b -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    parallelized_extraction ${filename}
    # do the merging
    rm ${filename}
}
which I call as download_merge_2d_variable name_of_variable
I was able to speed up the code by writing the function parallelized_extraction, which takes care of decompressing the downloaded files while wget is running in the background. To do this I first wait for the first .bz2 file to appear, then keep running the parallelized extraction for as long as .bz2 files keep appearing locally (this is what the until and while loops are doing).
I'm pretty happy with this approach, however I think it could be improved. Here are my questions:
how can I launch multiple instances of wget to perform parallel downloads as well, if my list of files is given as a matching pattern? Do I have to write multiple matching patterns with "chunks" of data inside, or do I necessarily have to download a contents list from the server, split this list and then feed it to wget?
parallelized_extraction may fail if the download of the files is really slow, as it will not find any new bz2 file to extract and will exit the loop at the next iteration even though wget is still running in the background. Although this has never happened to me, it is a possibility. To take care of that, I tried to add a condition to the second while loop by getting the PID of the wget running in the background and checking whether it is still there, but somehow it is not working:
parallelized_extraction(){
    # ...................
    # same as before ....
    # ...................
    while [ `ls -1 ${1}.bz2 2>/dev/null | wc -l ` -gt 0 -a kill -0 ${2} >/dev/null 2>&1 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}
download_merge_2d_variable()
{
    filename="ifile_${year}${month}${day}${run}_*_${1}.grib2"
    wget -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/" &
    # get ID of process running in background
    PROC_ID=$!
    parallelized_extraction ${filename} ${PROC_ID}
    # do the merging
    rm ${filename}
}
Any clue to why this is not working? Any suggestions on how to improve my code?
Thanks
UPDATE
I'm posting here my working solution based on the accepted answer in case someone is interested.
# Extract a plain list of URLs by using --spider option and filtering
# only URLs from the output
listurls() {
    filename="$1"
    url="$2"
    wget --spider -r -nH -np -nv -nd --reject "index.html" --cut-dirs=3 \
        -A $filename.bz2 $url 2>&1 \
        | grep -Eo '(http|https)://(.*).bz2'
}
# Extract each file by redirecting the stdout of wget to bzip2
# note that I get the filename from the URL directly with
# basename and by removing the bz2 extension at the end
get_and_extract_one() {
    url="$1"
    file=`basename $url | sed 's/\.bz2//g'`
    wget -q -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one
# Here the main calling function
download_merge_2d_variable()
{
    filename="filename.grib2"
    url="url/where/the/file/is/"
    listurls $filename $url | parallel get_and_extract_one {}
    # merging and processing
}
export -f download_merge_2d_variable
Can you list the urls to download?
listurls() {
    # do something that lists the urls without downloading them
    # Possibly something like:
    #   lynx -listonly -image_links -dump "$starturl"
    # or
    #   wget --spider -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    # or
    #   seq 100 | parallel echo ${url}${year}${month}${day}${run}_{}_${id}.grib2
    :   # placeholder command so the empty stub is valid shell
}
get_and_extract_one() {
    url="$1"
    file="$2"
    wget -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one
# {=s:/:_:g; =} will generate a file name from the URL with / replaced by _
# You probably want something nicer.
# Possibly just {/.}
listurls | parallel get_and_extract_one {} '{=s:/:_:g; =}'
This way you will decompress while downloading and doing all in parallel.
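And if the remote server objects to too many simultaneous connections, the number of parallel downloads can be capped with -j, e.g.:
listurls | parallel -j 4 get_and_extract_one {} '{=s:/:_:g; =}'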
I want to write a while loop in a GitLab CI file; here is the syntax I've tried, but it does not seem to work.
Are while loops allowed in GitLab CI / YAML files? Or are there other ways to write them?
Here is where I used it:
- while ($(curl -X GET ${URL} | jq -r '.task.status') != "SUCCESS")
ANALYSIS_ID=$(curl -X GET ${URL} | jq -r '.task.analysisId')
Why don't you write yourself a shell/python/whatever script and just run it from the CI?
YAML is not a suitable language for such things (e.g. while loops, large conditions, for loops) and should not be used that way...
So I did this to resolve my issue: I created a script containing the while loop, which returns the value I need, and then I call this script in my gitlab_ci file as below:
- ANALYSIS_ID=$(./checkUrl.sh $URL)
And, in case it is useful as an example, here is the script I used:
#!/bin/bash
success="SUCCESS"
condition="$(curl -X GET "$1" | jq -r '.task.status')"
while [ "$condition" != "$success" ]
do
    sleep 5   # avoid hammering the server while polling
    condition="$(curl -X GET "$1" | jq -r '.task.status')"
done
ANALYSIS_Id="$(curl -X GET "$1" | jq -r '.task.analysisId')"
# a script cannot "return" a value to its caller; print it so it can be captured
echo "$ANALYSIS_Id"
I'm trying to kick off multiple processes to work through some test suites. In my bash script I have the following
printf "%s\0" "${SUITE_ARRAY[#]}" | xargs -P 2 -0 bash -c 'run_test_suite "$#" ${EXTRA_ARG}'
Below is the defined script, cut down to its basics.
SUITE_ARRAY will be a list of suites that may have 1 or more, {Suite 1, Suite 2, ..., Suite n}
EXTRA_ARG will be something like a specific name used to store values in another script
#!/bin/bash
run_test_suite(){
    suite=$1
    someArg=$2
    someSaveDir=someArg"/"suite
    # some preprocess work happens here, but isn't relevant to running
    runSomeScript.sh suite someSaveDir
}
export -f run_test_suite
SUITES=$1
EXTRA_ARG=$2
IFS=','
SUITECOUNT=0
for csuite in ${SUITES}; do
    SUITE_ARRAY[$SUITECOUNT]=$csuite
    SUITECOUNT=$(($SUITECOUNT+1))
done
unset IFS
printf "%s\0" "${SUITE_ARRAY[#]}" | xargs -P 2 -0 bash -c 'run_test_suite "$#" ${EXTRA_ARG}'
The issue I'm having is how to get the ${EXTRA_ARG} passed into xargs. From how I've come to understand it, xargs will take whatever is piped into it, so the way I have it doesn't seem correct.
Any suggestions on how to correctly pass the values? Thanks in advance
If you want EXTRA_ARG to be available to the subshell, you need to export it. You can do that either explicitly, with the export keyword, or by putting the var=value assignment in the same simple command as xargs itself:
#!/bin/bash
run_test_suite(){
    suite=$1
    someArg=$2
    someSaveDir=someArg"/"suite
    # some preprocess work happens here, but isn't relevant to running
    runSomeScript.sh suite someSaveDir
}
export -f run_test_suite
# assuming that the "array" in $1 is comma-separated:
IFS=, read -r -a suite_array <<<"$1"
# see the EXTRA_ARG="$2" just before xargs on the same line; this exports the variable
printf "%s\0" "${suite_array[#]}" | \
EXTRA_ARG="$2" xargs -P 2 -0 bash -c 'run_test_suite "$#" "${EXTRA_ARG}"' _
The _ prevents the first argument passed from xargs to bash from becoming $0 and thus being left out of "$@".
Note also that I changed "${suite_array[@]}" to be assigned by splitting $1 on commas. This or something like it (you could use IFS=$'\n' to split on newlines instead, for example) is necessary, as $1 cannot contain a literal array; every shell command-line argument is only a single string.
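For reference, the explicit export variant mentioned above would look like this (same behavior, with the variable exported up front):
export EXTRA_ARG="$2"
printf "%s\0" "${suite_array[@]}" | \
    xargs -P 2 -0 bash -c 'run_test_suite "$@" "${EXTRA_ARG}"' _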
This is something of a guess:
#!/bin/bash
run_test_suite(){
    suite="$1"
    someArg="$2"
    someSaveDir="${someArg}/${suite}"
    # some preprocess work happens here, but isn't relevant to running
    runSomeScript.sh "${suite}" "${someSaveDir}"
}
export -f run_test_suite
SUITE_ARRAY="$1"
EXTRA_ARG="$2"
printf "%s\0" "${SUITE_ARRAY[#]}" |
xargs -n 1 -I '{}' -P 2 -0 bash -c 'run_test_suite {} '"${EXTRA_ARG}"
Using GNU Parallel it looks like this:
#!/bin/bash
run_test_suite(){
    suite="$1"
    someArg="$2"
    someSaveDir="$someArg"/"$suite"
    # some preprocess work happens here, but isn't relevant to running
    echo runSomeScript.sh "$suite" "$someSaveDir"
}
export -f run_test_suite
EXTRA_ARG="$2"
parallel -d, -q run_test_suite {} "$EXTRA_ARG" ::: "$1"
Called as:
mytester 'suite 1,suite 2,suite "three"' 'extra "quoted" args here'
If you have the suites in an array:
parallel -q run_test_suite {} "$EXTRA_ARG" ::: "${SUITE_ARRAY[@]}"
Added bonus: Any output from the jobs will not be mixed, so you will not have to deal with http://mywiki.wooledge.org/BashPitfalls#Using_output_from_xargs_-P
I'm looking to get a simple listing of all the objects in a public S3 bucket.
I'm aware of how to get a listing with curl for up to 1000 results, though I do not understand how to paginate the results in order to get a full listing. I think marker is a clue.
I do not want to use a SDK / library or authenticate. I'm looking for a couple of lines of shell to do this.
#!/bin/sh
# setting max-keys higher than 1000 is not effective
s3url=http://mr2011.s3-ap-southeast-1.amazonaws.com?max-keys=1000
s3ns=http://s3.amazonaws.com/doc/2006-03-01/
i=0
s3get=$s3url
while :; do
    curl -s $s3get > "listing$i.xml"
    nextkey=$(xml sel -T -N "w=$s3ns" -t \
        --if '/w:ListBucketResult/w:IsTruncated="true"' \
        -v 'str:encode-uri(/w:ListBucketResult/w:Contents[last()]/w:Key, true())' \
        -b -n "listing$i.xml")
    # -b -n adds a newline to the result unconditionally,
    # this avoids the "no XPaths matched" message; $() drops newlines.
    if [ -n "$nextkey" ] ; then
        s3get=$s3url"&marker=$nextkey"
        i=$((i+1))
    else
        break
    fi
done
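If you then want a single flat list of keys out of the saved pages, something like this should do it (untested, reusing the same xmlstarlet namespace binding $s3ns from the script above):
for f in listing*.xml; do
    xml sel -T -N "w=$s3ns" -t -m '/w:ListBucketResult/w:Contents' -v 'w:Key' -n "$f"
done > all-keys.txt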
I have a list of URLs which I would like to feed into wget using --input-file.
However I can't work out how to control the --output-document value at the same time,
which is simple if you issue the commands one by one.
I would like to save each document as the MD5 of its URL.
cat url-list.txt | xargs -P 4 wget
And xargs is there because I also want to make use of the max-procs feature for parallel downloads.
Don't use cat. You can have xargs read from a file. From the man page:
--arg-file=file
-a file
       Read items from file instead of standard input. If you use this
       option, stdin remains unchanged when commands are run. Otherwise,
       stdin is redirected from /dev/null.
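With that, the command from the question could become something like the following (this still leaves the MD5 naming to be handled):
# -n 1 hands each wget a single URL so that -P 4 can run four downloads at once
xargs -a url-list.txt -P 4 -n 1 wget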
how about using a loop?
while read -r line
do
    # md5sum prints "<hash>  -" for stdin; keep only the hash itself
    md5=$(echo "$line" | md5sum | cut -d' ' -f1)
    wget ... "$line" ... --output-document "$md5" ......
done < url-list.txt
In your question you use -P 4 which suggests you want your solution to run in parallel. GNU Parallel http://www.gnu.org/software/parallel/ may help you:
cat url-list.txt | parallel 'wget {} --output-document "`echo {} | md5sum | cut -c1-32`"'
You can do that like this :
cat url-list.txt | while read -r url;
do
    wget "$url" -O "$( echo "$url" | md5 )";
done
good luck