How to check if PDF files are online? - bash

I would like to iterate through a number of PDFs, from 18001.pdf up to N.pdf (adding 1 to the basename each time), and stop the loop as soon as a file is no longer available online. Below is the code that I guess is closest to what a solution might look like, but several things are not working properly; the command in the while condition causes a syntax error, for example.
#!/bin/bash
path=http://dip21.bundestag.de/dip21/btp/18/
n=18001
while [ wget -q --spider $path$n.pdf ]
do
n=$(($n+1))
done
echo $n
Note: my question is not about debugging this specific code; it mostly serves to illustrate what I would like to do. That said, I would appreciate a solution using a loop and wget.

If you want to test the success of a command, don't put it inside [ -- that's used to test the value of a conditional expression.
while wget -q --spider $path$n.pdf
do
...
done
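Putting that together with the rest of the original script, a minimal sketch (same URL and start number as in the question, with the expansions quoted) might look like this:
#!/bin/bash
path=http://dip21.bundestag.de/dip21/btp/18/
n=18001
# --spider only checks that the URL exists; it does not download the PDF.
while wget -q --spider "${path}${n}.pdf"
do
    n=$((n+1))
done
echo "$n"   # the first number for which no PDF was found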

Related

How to download the best track JTWC data files from 1945 to 2020 using loop?

I want to download the data files from the URLs using a loop from 1945 to 2020, only one number changes in the URL,
The URLs are given below
https://www.metoc.navy.mil/jtwc/products/best-tracks/1945/1945s-bio/bio1945.zip
https://www.metoc.navy.mil/jtwc/products/best-tracks/1984/1984s-bio/bio1984.zip
https://www.metoc.navy.mil/jtwc/products/best-tracks/2020/2020s-bio/bio2020.zip
I tried the following code, but it throws an error
for i in {1945..2020}
do
wget "https://www.metoc.navy.mil/jtwc/products/best-tracks/$i/$is-bio/bio$i.zip"
done
I changed your code slightly:
for i in {1945..1947}
do
echo "https://www.metoc.navy.mil/jtwc/products/best-tracks/$i/$is-bio/bio$i.zip"
done
when run it does output
https://www.metoc.navy.mil/jtwc/products/best-tracks/1945/-bio/bio1945.zip
https://www.metoc.navy.mil/jtwc/products/best-tracks/1946/-bio/bio1946.zip
https://www.metoc.navy.mil/jtwc/products/best-tracks/1947/-bio/bio1947.zip
Notice that the first one is not https://www.metoc.navy.mil/jtwc/products/best-tracks/1945/1945s-bio/bio1945.zip as you might expect: the second $i did not work as intended, because it is followed by s it was understood as the (undefined) variable $is. Enclose variable names in { } to avoid that confusion; this code
for i in {1945..1947}
do
echo "https://www.metoc.navy.mil/jtwc/products/best-tracks/${i}/${i}s-bio/bio${i}.zip"
done
when run does output
https://www.metoc.navy.mil/jtwc/products/best-tracks/1945/1945s-bio/bio1945.zip
https://www.metoc.navy.mil/jtwc/products/best-tracks/1946/1946s-bio/bio1946.zip
https://www.metoc.navy.mil/jtwc/products/best-tracks/1947/1947s-bio/bio1947.zip
which matches the example you gave. Now you can either replace echo with wget, or save the output of the echo version to a file named, say, urls.txt and then use the -i option of wget as follows:
wget -i urls.txt
Note: for brevity's sake I used 1945..1947 in place of 1945..2020.
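For example, a sketch of the second variant (writing the full 1945..2020 list to urls.txt and then letting wget read it) could look like this:
#!/bin/bash
# Build the list of URLs first, then fetch them all in one wget run.
for i in {1945..2020}
do
    echo "https://www.metoc.navy.mil/jtwc/products/best-tracks/${i}/${i}s-bio/bio${i}.zip"
done > urls.txt
wget -i urls.txt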
It worked directly, thanks @Daweo:
for i in {1945..2020}
do
wget "https://www.metoc.navy.mil/jtwc/products/best-tracks/${i}/${i}s-bio/bio${i}.zip"
done

append output of multiple curl requests to a file from shell script

I'm trying to fetch the JSON output by an internal API and add 100 to a parameter value between cURL requests. I need to loop through because it restricts the maximum number of results per request to 100. I was told to "increment and you should be able to get what you need".
Anyway, here's what I wrote:
#!/bin/bash
COUNTER=100
until [ COUNTER -gt 30000 ]; do
curl -vs "http://example.com/locations/city?limit=100&offset=$COUNTER" >> cities.json
let COUNTER=COUNTER+100
done
The problem is that I get a bunch of weird messages in the terminal, and the file I'm trying to redirect the output to still contains its original 100 objects. I feel like I'm probably missing something terrifically obvious. Any thoughts? I did use a somewhat old tutorial on the until loop, so maybe it's a syntax issue?
Thank you in advance!
EDIT: I'm not opposed to a completely alternate method, but I had hoped this would be somewhat straightforward. I figured my lack of experience was the main limiter.
You might find you can do this faster, and pretty easily with GNU Parallel:
parallel -k curl -vs "http://example.com/locations/city?limit=100\&offset={}" ::: $(seq 100 100 30000) > cities.json
If you want to overwrite the file's content only once, for your entire loop...
#!/bin/bash
# ^-- NOT /bin/sh, as this uses bash-only syntax
for (( counter=100; counter<=30000; counter+=100 )); do
curl -vs "http://example.com/locations/city?limit=100&offset=$counter"
done >cities.json
This is actually more efficient than putting >>cities.json on each curl command, as it only opens the output file once, and has the side effect (which you appear to want) of clearing the file's former contents when the loop is started.
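For what it's worth, the original until loop also works once the counter is actually expanded inside the test brackets ([ COUNTER -gt 30000 ] tests the literal string COUNTER rather than the variable's value, so the comparison never succeeds). A minimal sketch keeping the rest of the posted script:
#!/bin/bash
COUNTER=100
# "$COUNTER" (with the dollar sign) is needed for the numeric comparison to work.
until [ "$COUNTER" -gt 30000 ]; do
    curl -vs "http://example.com/locations/city?limit=100&offset=$COUNTER"
    COUNTER=$((COUNTER + 100))
done > cities.json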

Bash: Slow redirection and filter

I have a bash script that calls a program which generates a humongous amount of output. A lot of this data comes from a Python package that I did not create and whose output I can't really control, nor does it interest me.
I tried to filter the output generated by that external Python package and redirect the "cleaned" output to a log file. When I used regular pipes and grep expressions, I lost many chunks of information. I read that this is something that can actually happen with redirections (1 and 2).
In order to fix that, I made the redirections like this:
#!/bin/bash
regexTxnFilterer="\[txn\.-[[:digit:]]+\]"
regexThreadPoolFilterer="\[paste\.httpserver\.ThreadPool\]"
bin/paster serve --reload --pid-file="/var/run/myServer//server.pid" parts/etc/debug.ini 2>&1 < "/dev/null" | while IFS='' read -r thingy ; do
if [[ ! "$thingy" =~ $regexTxnFilterer ]] && [[ ! "$thingy" =~ $regexThreadPoolFilterer ]]; then
echo "$thingy" >> "/var/log/myOutput.log"
fi
done
Which doesn't lose any information (at least not that I could tell) and filters the strings I don't need (using the two regular expressions above).
The issue is that it has rendered the application (the bin/paster thing I'm executing) unbearably slow. Is there any way to achieve the same effect but with a better performance?
Thank you in advance!
Update 2012-04-13: As shellter pointed out in one of the comments on this question, it may be useful to provide examples of the output I want to filter. Here's a bunch of them:
2012-04-13 19:30:37,996 DEBUG [txn.-1220917568] new transaction
2012-04-13 19:30:37,997 DEBUG [txn.-1220917568] commit <zope.sqlalchemy.datamanager.SessionDataManager object at 0xbf4062c>
2012-04-13 19:30:37,997 DEBUG [txn.-1220917568] commit
Starting server in PID 18262.
2012-04-13 19:30:38,292 DEBUG [paste.httpserver.ThreadPool] Started new worker -1269716112: Initial worker pool
2012-04-13 19:33:08,158 DEBUG [txn.-1244144784] new transaction
2012-04-13 19:33:08,158 DEBUG [txn.-1244144784] commit
2012-04-13 19:32:06,980 DEBUG [paste.httpserver.ThreadPool] Added task (0 tasks queued)
2012-04-13 19:32:06,980 INFO [paste.httpserver.ThreadPool] kill_hung_threads status: 10 threads (0 working, 10 idle, 0 starting) ave time N/A, max time 0.00sec, killed 0 workers
There are a few more kinds of messages involving the ThreadPool, but I couldn't catch any others.
For one thing -- you're reopening the log file every time you want to append a line. That's silly.
Instead of this:
while ...; do
echo "foo" >>filename
done
Do this (which opens the output file on a new, non-stdout file handle, such that you still have a clear line to stdout should you wish to write to it):
exec 4>>filename
while ...; do
echo "foo" >&4
done
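When the loop has finished, the extra descriptor can be closed again if you want to be tidy:
exec 4>&-   # close file descriptor 4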
It's also possible to redirect stdout for the whole loop:
while ...; do
echo "foo"
done >filename
...notably, this will impact more than just the "echo" line, and thus have slightly different semantics from the original.
Or, better yet -- Configure the Python logging module to filter output to only what you care about, and don't bother with shell-script postprocessing at all.
If the version of Paste you're using is sufficiently similar to modern Pyramid, you can put this in your ini file (currently parts/etc/debug.ini):
[logger_paste.httpserver.ThreadPool]
level = INFO
[logger_txn]
level = INFO
...and anything below INFO level (including the DEBUG messages) will be excluded.
It may be faster to use a grep-based solution to this
#!/bin/bash
regexTxnFilterer="\[txn\.-[[:digit:]]+\]"
regexThreadPoolFilterer="\[paste\.httpserver\.ThreadPool\]"
bin/paster serve --reload --pid-file="/var/run/myServer//server.pid" parts/etc/debug.ini 2>&1 < "/dev/null" | grep -vEf <(echo "$regexTxnFilterer"; echo "$regexThreadPoolFilterer") >> "/var/log/myOutput.log"
Your loop may be slow because the echo "$thingy" >> "/var/log/myOutput.log" line is opening and closing the log file every time it executes. I wouldn't expect a big performance difference between grep's regex matching and bash's, but it wouldn't surprise me if there were one.
Late Edit
There's a far simpler way to fix the performance issue caused by opening/closing the output once per line. Why this didn't occur to me before, I have no idea. Just move the >> to outside your loop
#!/bin/bash
regexTxnFilterer="\[txn\.-[[:digit:]]+\]"
regexThreadPoolFilterer="\[paste\.httpserver\.ThreadPool\]"
bin/paster serve --reload --pid-file="/var/run/myServer//server.pid" parts/etc/debug.ini 2>&1 < "/dev/null" | while IFS='' read -r thingy ; do
if [[ ! "$thingy" =~ $regexTxnFilterer ]] && [[ ! "$thingy" =~ $regexThreadPoolFilterer ]]; then
echo "$thingy"
fi
done >> "/var/log/myOutput.log"
I can't see any compelling reason why this would be either faster or slower than the grep solution, but it's a lot closer to the original code and a little less cryptic.

Best way for testing compiled code to return expected output/errors

How do you test if compiled code returns the expected output or fails as expected?
I have worked out a working example below, but it is not easily extendable. Every additional test would require additional nested parentheses. Of course I could split this into other files, but do you have any suggestions on how to improve this? Also, I'm planning to use this from a make test stanza in a makefile, so I do not expect other people to install something that isn't installed by default just for testing it. And stdout should also remain interleaved with stderr.
simplified example:
./testFoo || echo execution failed
./testBar && echo expected failure
(./testBaz && (./testBaz 2>&1 | cmp -s - foo.tst && ( ./testFoo && echo and so on
|| echo testFoo's execution failed ) || echo testBaz's does not match )
|| echo testBaz's execution failed )
my current tester looks like this (for one test):
#!/bin/bash
compiler1 $1 && (compiler2 -E --make $(echo $1 | sed 's/^\(.\)\(.*\)\..*$/\l\1\2/') && (./$(echo $1 | sed 's/^\(.\)\(.*\)\..*$/\l\1\2/') || echo execution failed) || less $(echo $1 | sed 's/^\(.\)\(.*\)\..*$/\l\1\2/').err) || echo compile failed
I suggest starting to look for patterns here. For example, you could use the file name as the pattern and then create some additional files that encode the expected result.
You can then use a simple script to run the command and verify the result (instead of repeating the test code again and again).
For example, a file testFoo.exec with the content 0 means that it must succeed (or at least return with 0) while testBar.exec would contain 1.
testBaz.out would then contain the expected output. You don't need to call testBaz several times; you can redirect the output in the first call and then look at $? to see if the call succeeded or not. If it did, then you can directly verify the output (without starting the command again).
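A sketch of such a runner (the script name runtest.sh and the exact file layout are just illustrative assumptions):
#!/bin/bash
# runtest.sh TESTNAME -- run ./TESTNAME and check it against TESTNAME.exec / TESTNAME.out
t=$1

# Run once, keeping stdout and stderr interleaved, and remember the exit status.
out=$("./$t" 2>&1)
status=$?

# Expected exit status comes from TESTNAME.exec; default to 0 if the file is missing.
expected=$(cat "$t.exec" 2>/dev/null || echo 0)
if [ "$status" -ne "$expected" ]; then
    echo "$t: expected exit status $expected, got $status"
    exit 1
fi

# Compare the captured output against TESTNAME.out, if such a reference file exists.
if [ -f "$t.out" ] && ! printf '%s\n' "$out" | cmp -s - "$t.out"; then
    echo "$t: output differs from $t.out"
    exit 1
fi

echo "$t: ok"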
My own simple minded test harness works like this:
- every test is represented by a bash script with an extension .test; these all live in the same directory
- when I create a test, I run the test script and examine the output carefully; if it looks good, it goes into a directory called good_results, in a file with the same name as the test that generated it
- the main testing script finds all the .test scripts and executes each of them in turn, producing a temporary output file; this is diff'd with the matching file in the good_results directory and any differences reported
It took me about half an hour to write this and get it working, but it has proved invaluable!
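A sketch of what that main script could look like, assuming the .test extension and good_results directory described above:
#!/bin/bash
# Run every *.test script and diff its output against the saved good result.
fail=0
for t in *.test; do
    tmp=$(mktemp)
    ./"$t" > "$tmp" 2>&1
    if ! diff -u "good_results/$t" "$tmp"; then
        echo "FAIL: $t"
        fail=1
    fi
    rm -f "$tmp"
done
exit "$fail"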

How to deal with NFS latency in shell scripts

I'm writing shell scripts where, quite regularly, some data is written to a file, after which an application is executed that reads that file. I find that across our company the network latency differs vastly, so a simple sleep 2, for example, will not be robust enough.
I tried to write a (configurable) timeout loop like this:
waitLoop()
{
local timeout=$1
local test="$2"
if ! $test
then
local counter=0
while ! $test && [ $counter -lt $timeout ]
do
sleep 1
((counter++))
done
if ! $test
then
exit 1
fi
fi
}
This works for test="[ -e $somefilename ]". However, testing for existence is not enough; I sometimes need to test whether a certain string was written to the file. I tried
test="grep -sq \"^sometext$\" $somefilename", but this did not work. Can someone tell me why?
Are there other, less verbose options to perform such a test?
You can set your test variable this way:
test=$(grep -sq "^sometext$" $somefilename)
The reason your grep isn't working is that quotes are really hard to pass in arguments. You'll need to use eval:
if ! eval $test
I'd say the way to check for a string in a text file is grep.
What's your exact problem with it?
Also, you might adjust your NFS mount parameters to address the root of the problem. A sync might also help. See the NFS docs.
If you're wanting to use waitLoop in an "if", you might want to change the "exit" to a "return", so the rest of the script can handle the error situation (there's not even a message to the user about what failed before the script dies otherwise).
The other issue is that keeping a command in "$test" means the shell does not reparse the quotes when the variable is actually executed, only when it is assigned. So if you say test="grep \"foo\" \"bar baz\"", rather than looking for the three-letter string foo in the file with the seven-character name bar baz, it'll look for the five-character string "foo" in the nine-character file "bar baz".
So you can either decide you don't need the shell magic, and set test='grep -sq ^sometext$ somefilename', or you can get the shell to handle the quoting explicitly with something like:
if /bin/sh -c "$test"
then
...
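Put together with the grep predicate from the question, that could look like this sketch (the file name is a made-up placeholder, and the variable is exported so the sub-shell started by sh -c can see it):
somefilename=/tmp/output.log                      # hypothetical file
export somefilename
test='grep -sq "^sometext$" "$somefilename"'
if /bin/sh -c "$test"
then
    echo "found the line in $somefilename"
fi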
Try using the file modification time to detect when it is written without opening it. Something like
old_mtime=`stat --format="%Z" file`
# Write to file.
new_mtime=$old_mtime
while [[ "$old_mtime" -eq "$new_mtime" ]]; do
sleep 2;
new_mtime=`stat --format="%Z" file`
done
This won't work, however, if multiple processes try to access the file at the same time.
I just had the exact same problem. I used a similar approach to the timeout wait that you include in your OP; however, I also included a file-size check. I reset my timeout timer if the file had increased in size since last it was checked. The files I'm writing can be a few gig, so they take a while to write across NFS.
This may be overkill for your particular case, but I also had my writing process calculate a hash of the file after it was done writing. I used md5, but something like crc32 would work, too. This hash was broadcast from the writer to the (multiple) readers, and the reader waits until a) the file size stops increasing and b) the (freshly computed) hash of the file matches the hash sent by the writer.
We have a similar issue, but for different reasons. We are reading a file which is sent to an SFTP server. The machine running the script is not the SFTP server.
What I have done is set it up in cron (although a loop with a sleep would work too) to do a cksum of the file. When the old cksum matches the current cksum (the file has not changed for the determined amount of time) we know that the writes are complete, and transfer the file.
Just to be extra safe, we never overwrite a local file before making a backup, and only transfer at all when the remote file has two cksums in a row that match, and that cksum does not match the local file.
If you need code examples, I am sure I can dig them up.
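A sketch of that idea as a polling loop rather than a cron job (cksum prints the checksum, size and file name, so the first field is enough; the 10-second interval is arbitrary):
#!/bin/bash
file=$1
old=""
while :; do
    new=$(cksum "$file" 2>/dev/null | awk '{print $1}')
    if [ -n "$new" ] && [ "$new" = "$old" ]; then
        break   # two consecutive matching checksums: the writer seems to be done
    fi
    old=$new
    sleep 10
done
echo "$file looks stable, safe to transfer"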
The shell was splitting your predicate into words. Grab it all with $* as in the code below:
#! /bin/bash
waitFor()
{
local tries=$1
shift
local predicate="$*"
while [ $tries -ge 1 ]; do
(( tries-- ))
if $predicate >/dev/null 2>&1; then
return
else
[ $tries -gt 0 ] && sleep 1
fi
done
exit 1
}
pred='[ -e /etc/passwd ]'
waitFor 5 $pred
echo "$pred satisfied"
rm -f /tmp/baz
(sleep 2; echo blahblah >>/tmp/baz) &
(sleep 4; echo hasfoo >>/tmp/baz) &
pred='grep ^hasfoo /tmp/baz'
waitFor 5 $pred
echo "$pred satisfied"
Output:
$ ./waitngo
[ -e /etc/passwd ] satisfied
grep ^hasfoo /tmp/baz satisfied
Too bad the typescript isn't as interesting as watching it in real time.
Ok...this is a bit whacky...
If you have control over the file: you might be able to create a 'named pipe' here.
So (depending on how the writing program works) you can monitor the file in a synchronized fashion.
At its simplest:
Create the named pipe:
mkfifo file.txt
Set up the sync'd receiver:
while :
do
process.sh < file.txt
done
Create a test sender:
echo "Hello There" > file.txt
The 'process.sh' is where your logic goes: this will block until the sender has written its output. In theory the writer program won't need modifying...
WARNING: if the receiver is not running for some reason, you may end up blocking the sender!
Not sure it fits your requirement here, but might be worth looking into.
Or, to avoid the synchronization, try 'lsof'?
http://en.wikipedia.org/wiki/Lsof
Assuming that you only want to read from the file when nothing else is writing to it (i.e., the writing process has finished), you could check whether anything else still has a file handle open to it.
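A sketch of that check with lsof, assuming lsof is installed and you are allowed to see the writer's process (lsof exits non-zero when no process has the file open):
file=/path/to/file.txt      # hypothetical path
# Wait until no process holds the file open any more, then read it.
while lsof "$file" >/dev/null 2>&1; do
    sleep 1
done
cat "$file"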
