I'm looking to get a simple listing of all the objects in a public S3 bucket.
I know how to get a listing with curl for up to 1000 results, but I do not understand how to paginate the results to get a full listing. I think marker is a clue.
I do not want to use an SDK/library or authenticate. I'm looking for a couple of lines of shell to do this.
#!/bin/sh
# setting max-keys higher than 1000 is not effective
s3url=http://mr2011.s3-ap-southeast-1.amazonaws.com?max-keys=1000
s3ns=http://s3.amazonaws.com/doc/2006-03-01/

i=0
s3get=$s3url
while :; do
    curl -s "$s3get" > "listing$i.xml"
    nextkey=$(xml sel -T -N "w=$s3ns" -t \
        --if '/w:ListBucketResult/w:IsTruncated="true"' \
        -v 'str:encode-uri(/w:ListBucketResult/w:Contents[last()]/w:Key, true())' \
        -b -n "listing$i.xml")
    # -b -n adds a newline to the result unconditionally,
    # this avoids the "no XPaths matched" message; $() drops newlines.
    if [ -n "$nextkey" ]; then
        s3get="$s3url&marker=$nextkey"
        i=$((i+1))
    else
        break
    fi
done
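Once the loop exits, the keys from all the saved pages can be pulled into a single plain-text listing. A minimal sketch using the same xmlstarlet namespace binding as above (all-keys.txt is just an arbitrary output name):

for f in listing*.xml; do
    xml sel -T -N "w=$s3ns" -t -m '/w:ListBucketResult/w:Contents' -v 'w:Key' -n "$f"
done > all-keys.txt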
Every day, at 6-hour intervals, I have to download bz2 files from a web server, decompress them, and merge them into a single file. This needs to be as efficient and quick as possible, since I have to wait for the download and decompress phase to complete before proceeding with the merging.
I have written some bash functions which take some strings as input to construct the URL of the files to be downloaded as a matching pattern. This way I can pass the matching pattern directly to wget without having to build the server's contents list locally and then pass it to wget with -i. My functions look something like this:
parallelized_extraction(){
    i=0
    until [ $(ls -1 ${1}.bz2 2>/dev/null | wc -l) -gt 0 -o $i -ge 30 ]; do
        ((i++))
        sleep 1
    done
    while [ $(ls -1 ${1}.bz2 2>/dev/null | wc -l) -gt 0 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}

download_merge_2d_variable()
{
    filename="file_${year}${month}${day}${run}_*_${1}.grib2"
    wget -b -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    parallelized_extraction ${filename}
    # do the merging
    rm ${filename}
}
which I call as download_merge_2d_variable name_of_variable
I was able to speed up the code by writing the function parallelized_extraction, which takes care of decompressing the downloaded files while wget is running in the background. To do this I first wait for the first .bz2 file to appear, then run the parallelized extraction as long as .bz2 files are still present on disk (this is what the until and while loops are doing).
I'm pretty happy with this approach, however I think it could be improved. Here are my questions:
How can I launch multiple instances of wget to also perform parallel downloads, given that my list of files is specified as a matching pattern? Do I have to write multiple matching patterns with "chunks" of data inside, or do I necessarily have to download a contents list from the server, split it, and then feed it to wget?
parallelized_extraction may fail if the download is really slow: it will not find any new bz2 file to extract and will exit the loop on the next iteration, even though wget is still running in the background. This has never happened to me, but it is a possibility. To handle it, I tried to add a condition to the second while loop, using the PID of the wget running in the background to check whether it is still alive, but somehow it is not working:
parallelized_extraction(){
    # ...................
    # same as before ....
    # ...................
    while [ $(ls -1 ${1}.bz2 2>/dev/null | wc -l) -gt 0 -a kill -0 ${2} >/dev/null 2>&1 ]; do
        ls ${1}.bz2 | parallel -j+0 bzip2 -d '{}'
        sleep 1
    done
}

download_merge_2d_variable()
{
    filename="ifile_${year}${month}${day}${run}_*_${1}.grib2"
    wget -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/" &
    # get ID of process running in background
    PROC_ID=$!
    parallelized_extraction ${filename} ${PROC_ID}
    # do the merging
    rm ${filename}
}
Any clue as to why this is not working? Any suggestions on how to improve my code?
Thanks
UPDATE
I'm posting my working solution here, based on the accepted answer, in case someone is interested.
# Extract a plain list of URLs by using --spider option and filtering
# only URLs from the output
listurls() {
    filename="$1"
    url="$2"
    wget --spider -r -nH -np -nv -nd --reject "index.html" --cut-dirs=3 \
        -A "$filename.bz2" "$url" 2>&1 \
        | grep -Eo '(http|https)://(.*)\.bz2'
}
# Extract each file by redirecting the stdout of wget to bzip2
# note that I get the filename from the URL directly with
# basename and by removing the bz2 extension at the end
get_and_extract_one() {
    url="$1"
    file=$(basename "$url" | sed 's/\.bz2//g')
    wget -q -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one
# Here is the main calling function
download_merge_2d_variable()
{
    filename="filename.grib2"
    url="url/where/the/file/is/"
    listurls "$filename" "$url" | parallel get_and_extract_one {}
    # merging and processing
}
export -f download_merge_2d_variable
Can you list the urls to download?
listurls() {
    # do something that lists the urls without downloading them
    # Possibly something like:
    #   lynx -listonly -image_links -dump "$starturl"
    # or
    #   wget --spider -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
    # or
    #   seq 100 | parallel echo ${url}${year}${month}${day}${run}_{}_${id}.grib2
}
get_and_extract_one() {
    url="$1"
    file="$2"
    wget -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one
# {=s:/:_:g; =} will generate a file name from the URL with / replaced by _
# You probably want something nicer.
# Possibly just {/.}
listurls | parallel get_and_extract_one {} '{=s:/:_:g; =}'
This way you decompress while downloading, and everything runs in parallel.
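For instance, if the basename of each URL with the .bz2 extension stripped is an acceptable output filename, the {/.} replacement string mentioned above can be passed as the second argument (a sketch, assuming listurls prints one URL per line):

listurls | parallel get_and_extract_one {} {/.}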
I'm running into a situation where the curl requests I make to a server to fetch some information intermittently break; instead of populating the resulting file with actual information, they populate it with the error they got and then move on to the next datapoint (in a for loop). So I decided to catch the error and retry (using bash). But for some reason it's not doing what it should, and the generated file turns out to be empty. Below is what I have written; if you have a better/easier way to approach this, or see a problem in it, please go ahead and share. Thanks in advance.
for i in $(cat master_db_list)
do
    curl -sN --negotiate -u foo:bar "${URL}/$i/table" > $table_info/${i}
    # Seek error catch
    result=$(cat $table_info/${i} | grep -i "Error")
    # lowercase comparison, since error can be in any case
    echo -e "result is: $result \n" >> $job_log 2>&1
    while [[ "${result,,}" =~ error ]]
    do
        echo "Retrying table list generation for ${i}" >> $job_log 2>&1
        curl -sN --negotiate -u foo:bar "${URL}/$i/table" > $table_info/${i}
        result=$(cat $table_info/${i} | grep -i "Error")
    done
done
It turns out there was nothing wrong with the above; some of the curl calls simply returned empty values. I decreased the request frequency with sleeps in between and it is working fine now.
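For reference, a bounded retry with a pause between attempts avoids looping forever on a persistent server-side error. A minimal sketch (the value of MAX_RETRIES and the sleep interval are arbitrary choices here, not part of the original script):

MAX_RETRIES=5
attempt=0
while grep -qi "error" "$table_info/${i}" && [ "$attempt" -lt "$MAX_RETRIES" ]; do
    attempt=$((attempt+1))
    echo "Retry $attempt of table list generation for ${i}" >> "$job_log"
    sleep 10
    curl -sN --negotiate -u foo:bar "${URL}/$i/table" > "$table_info/${i}"
done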
I have a JSON file with entries containing URLs (among other things), which I retrieve using curl.
I'd like to be able to run the loop several times at once to go faster, but also to limit the number of parallel curls, to avoid being kicked out by the remote server.
For now, my code looks like this:
jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' $fileIndexFeed | \
while read unitId; do
    read -r unitUrl
    if ! in_array tabAnnoncesExistantesIds $unitId; then
        fullUnitUrl="$unitUrlBase$unitUrl"
        unitFile="$unitFileBase$unitId.json"
        if [ ! -f $unitFile ]; then
            curl -H "Authorization:$authMethod $encodedHeader" -X GET $fullUnitUrl -o $unitFile
        fi
    fi
done
If I use a simple & at the end of my curl, it will run lots of concurrent requests, and I could get kicked out.
So the question would be (I suppose): how do I know that a curl run with & has finished its job? If I can detect that, then I guess I can test, increment, and decrement a variable tracking the number of running curls.
Thanks
Use GNU Parallel to control the number of parallel jobs. Either write your curl commands to a file so you can look at them and check them:
commands.txt
curl "something" "somehow" "toSomewhere"
curl "somethingelse" "someotherway" "toSomewhereElse"
Then, if you want no more than 8 jobs running at a time, run:
parallel -j 8 --eta -a commands.txt
Or you can just write the commands to GNU Parallel's stdin:
jq ... | while read ...; do
    printf "curl ...\n"
done | parallel -j 8
Use a Bash function:
doit() {
    unitId="$1"
    unitUrl="$2"
    if ! in_array tabAnnoncesExistantesIds $unitId; then
        fullUnitUrl="$unitUrlBase$unitUrl"
        unitFile="$unitFileBase$unitId.json"
        if [ ! -f $unitFile ]; then
            curl -H "Authorization:$authMethod $encodedHeader" -X GET $fullUnitUrl -o $unitFile
        fi
    fi
}
jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' $fileIndexFeed |
env_parallel -N2 doit
env_parallel will import the environment, so all shell variables are available.
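Since the original concern was capping the number of concurrent curls, note that env_parallel takes the same -j option as parallel, so the two approaches combine; a sketch reusing the jq filter from the question:

jq -r '.entries[] | select(.enabled != false) | .id,.unitUrl' $fileIndexFeed |
    env_parallel -j 8 -N2 doit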
What I want to do: I want to find all the products (URLs) which are not redirected.
To get the final URL after redirection I'm using curl command as follows:
curl -Ls -o /dev/null -w %{url_effective} "$URL"
This works fine. Now I want to iterate over the URLs, find the ones that are not redirected, and display them as the output of the program. I have the following code:
result=""
productIds=$1
for productId in $(echo $productIds | sed "s/,/ /g")
do
echo "Checking product: $productId";
URL="http://localhost/?go=$productId";
updatedURL=`curl -Ls -o /dev/null -w %{url_effective} "$URL"`
echo "URL : $URL, UpdatedUrl: $updatedURL";
if [ "$URL" == "$updatedURL" ]
then
result="$result$productId,";
fi
done
The curl command works only for the first product. From the second product onward, URL and updatedURL are always the same. I can't understand why; the productId changes on every iteration, so I don't think it can be related to caching.
I've also tried the following variants of the curl call:
updatedURL=$(curl -Ls -o /dev/null -w %{url_effective} "$URL")
updatedURL="$(curl -Ls -o /dev/null -w %{url_effective} "$URL")"
Edit: After trying debug mode and a lot of different approaches, I noticed a pattern: if I manually run the following in the terminal:
curl -Ls -o /dev/null -w %{url_effective} "http://localhost/?go=32123"
then those URLs work fine in the shell script. But if I don't hit them manually first, curl does not work for those products via the shell script either.
Just add #!/bin/bash as the first line of the script. It then produces the required output. The invocation should look like this: bash file.sh 123,456,789,221
Invocation via the Bourne shell (sh file.sh 123,456,789,221) would require code changes. Do let me know if you need that too :)
I would suggest changing your loop to something like this:
IFS=, read -ra productIds <<<"$1"

for productId in "${productIds[@]}"; do
    url=http://localhost/?go=$productId
    num_redirects=$(curl -Ls -o /dev/null -w %{num_redirects} "$url")

    if [ "$num_redirects" -eq 0 ]; then
        printf 'productId %s has no redirects\n' "$productId"
    fi
done
This splits the first argument passed to the script into an array, using a comma as the delimiter. The number of redirects is stored in a variable. When the number is zero, the message is printed.
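For example, saved as check_redirects.sh (a hypothetical filename) and given a comma-separated list of product IDs, it prints one line per product whose URL returned no redirect:

bash check_redirects.sh 123,456,789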
I have to admit that I can't see anything inherently broken with your original approach so it's possible that there is something extra going on that we're not aware of. If you could provide a reproducible test case then we would be able to help you more effectively.
I want to create a shell script that reads from a .diz file, which stores information about the various source files needed to compile a certain piece of software (ImageMagick in this case). I am using Mac OS X Leopard 10.5 for these examples.
Basically I want an easy way to maintain these .diz files that hold the information for up-to-date source packages. I would just need to update them with URLs, version information, and file checksums.
Example line:
libpng:1.2.42:libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks:http://downloads.sourceforge.net/project/libpng/00-libpng-stable/1.2.42/libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks:9a5cbe9798927fdf528f3186a8840ebe
script part:
while IFS=: read app version file url md5
do
    echo "Downloading $app Version: $version"
    curl -L -v -O $url 2>> logfile.txt
    $calculated_md5=`/sbin/md5 $file | /usr/bin/cut -f 2 -d "="`
    echo $calculated_md5
done < "files.diz"
Actually I have more than just one question concerning this.
How best to calculate and compare the checksums? I wanted to store md5 checksums in the .diz file and compare them via string comparison, "cut"ting the checksum out of the md5 output.
Is there a way to tell curl another filename to save to? (In my case the filename gets ugly: libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks.)
I seem to have issues with the backticks that should capture the output of the piped md5 and cut into the variable $calculated_md5. Is the syntax wrong?
Thanks!
The following is a practical one-liner:
curl -s -L <url> | tee <destination-file> |
sha256sum -c <(echo "a748a107dd0c6146e7f8a40f9d0fde29e19b3e8234d2de7e522a1fea15048e70 -") ||
rm -f <destination-file>
wrapping it up in a function taking 3 arguments:
- the url
- the destination
- the sha256
download() {
    curl -s -L "$1" | tee "$2" | sha256sum -c <(echo "$3 -") || rm -f "$2"
}
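A call would then look something like this (the URL is one of the example files from the question; the checksum argument is a placeholder, not a real value):

download "http://downloads.sourceforge.net/project/libpng/00-libpng-stable/1.2.42/libpng-1.2.42.tar.bz2?use_mirror=biznetnetworks" \
    "libpng-1.2.42.tar.bz2" \
    "<expected-sha256>"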
while IFS=: read app version file url md5
do
    echo "Downloading $app Version: $version"
    # use -o for output file. define $outputfile yourself
    curl -L -v $url -o $outputfile 2>> logfile.txt

    # use $(..) instead of backticks.
    calculated_md5=$(/sbin/md5 "$file" | /usr/bin/cut -f 2 -d "=")

    # compare md5
    case "$calculated_md5" in
        "$md5" )
            echo "md5 ok"
            echo "do something else here";;
    esac
done < "files.diz"
My curl has a -o (--output) option to specify an output file. There's also a problem with your assignment to $calculated_md5: it shouldn't have the dollar sign at the front when you assign to it. I don't have /sbin/md5 here so I can't comment on that, but what I do have is md5sum. If you have it too, you might consider it as an alternative. In particular, it has a --check option that works from a file listing of md5sums, which might be handy for your situation. HTH.
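For example, with GNU md5sum the comparison can be delegated to the tool itself by feeding it a "checksum  filename" line on stdin; a sketch, assuming $md5 and $outputfile come from the loop above:

if echo "$md5  $outputfile" | md5sum -c --status -; then
    echo "md5 ok"
else
    echo "md5 mismatch for $outputfile" >&2
fi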