Header/footer gone or wrong at random - wkhtmltopdf

I generate 100 pages using this library, but unfortunately the header and footer go missing at random. These are the arguments I pass to the library:
"--javascript-delay 5000 --enable-javascript --debug-javascript " +
"-T 47mm -B 27mm -L 13mm -R 13mm --header-spacing 8 --header-html " +
SomeHeaderURL +
" --footer-spacing 5 --footer-html " +
SomeFooterURL
For example,
the header is missing for the first 8 pages, and the footer only shows on roughly the last 10 pages, while on the other pages the footer either does not show at all or its page number is not substituted.
Can anyone advise what is happening?
Note:
I am using wkhtmltopdf 0.12.5 (with patched qt).
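For reference, the assembled command line corresponding to those arguments would look something like this (the header/footer URLs and the input/output paths are placeholders):
wkhtmltopdf --javascript-delay 5000 --enable-javascript --debug-javascript \
  -T 47mm -B 27mm -L 13mm -R 13mm \
  --header-spacing 8 --header-html http://example.com/header.html \
  --footer-spacing 5 --footer-html http://example.com/footer.html \
  input.html output.pdf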

How can I extract URLs from the source code of a webpage?

I am trying to set up a stream for some scanners I have found on Broadcastify. The problem is that the URLs they use are dynamic and only stay the same for a few hours at a time. I would like to create a shell script that can simply scan the page from which the stream is accessed (which does have a static URL) and return the current URL of the stream, which can then be fed to the audio player.
For instance, right now the following stream at https://www.broadcastify.com/listen/feed/30185/web has a stream at http://audio12.broadcastify.com/kq2ydfr1jz98shw.mp3
However, that stream link will only work for a short period of time. I need an MP3 stream like the one above.
I only have minor experience with shell scripting, so I'm wondering what the best approach would be here. Specifically, my first problem is that if I simply "View page source" and search for "mp3", there are no results. I can only find the URL by inspecting elements (F12 developer tools) and, in Chrome for instance, going to Application → Frames → Media. I thought I could do a "view frame source" on the audio player in the past, but that option isn't there now.
I imagine I could use grep if I were able to curl the source code, but I'm not sure what I would need to curl here, if that makes sense.
UPDATE
Thanks mk12 for the insight. Based on that, here is my shell script:
#!/bin/bash
curl "https://www.broadcastify.com/listen/feed/$1/web" | grep webAuth > /var/tmp/broadcastifyauth$1.txt
pta=`cat /var/tmp/broadcastifyauth$1.txt | sed -i 's/$.ajaxSetup({ headers: { "webAuth": "//g' /var/tmp/broadcastifyauth$1.txt`
pta=`cat /var/tmp/broadcastifyauth$1.txt | sed -i 's/" }});//g' /var/tmp/broadcastifyauth$1.txt`
auth=`cat /var/tmp/broadcastifyauth$1.txt`
echo $auth
curl "https://www.broadcastify.com/listen/webpl.php?feedId=$1" --request POST --header "webAuth: $auth" --data 't=14' >/var/tmp/broadcastify$1.txt
pta=`cat /var/tmp/broadcastify$1.txt | grep -o 'http://[^"]*' > /var/tmp/broadcastify$1.b.txt`
pta=`cat /var/tmp/broadcastify$1.b.txt`
echo $pta
#pta=`cat /var/tmp/broadcastify$1.txt | sed -n '/<audio/s/^.*<audio width="300px" id="mePlayer_$1" src="\([^"]*\)".*/\1/p' > /var/tmp/broadcastify$1.b.txt`
#ptb=`cat /var/tmp/broadcastify$1.b.txt`
#echo $ptb
Here is its output:
root@na01:/etc/asterisk/scripts/music# ./broadcastify.sh 30185
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9175 100 9175 0 0 51843 0 --:--:-- --:--:-- --:--:-- 52130
74f440ad812f0cc2192ab782e27608cc
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 946 0 942 100 4 3851 16 --:--:-- --:--:-- --:--:-- 3844
http://relay.broadcastify.com/b94hfrp5k1s0tvy.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ
Works!
The mp3 URL is not present in the original HTML document — it's added to the DOM later by JavaScript code. That's why you can't find it in "View page source," but you can with "Inspect element."
If you run curl https://www.broadcastify.com/listen/feed/30185/web, you will see the following somewhere in the middle:
<div id="fp" width="300px"></div>
<script>
$.ajaxSetup({ headers: { "webAuth": "74f440ad812f0cc2192ab782e27608cc" }});
$('#fp').load('/listen/webpl.php?feedId=30185',{t:14});
</script>
Note in particular that it loads content (using jQuery .load) into the initially empty <div id="fp"> just above. When you use "Inspect element" to find the audio player, you'll find it gets placed inside that div.
Before trying to reproduce this request with curl, I looked in the Network tab of the developer tools to see what the browser did. Filtering for "listen," I found the webpl.php request. Here is the relevant information from the "Headers" tab:
URL: https://www.broadcastify.com/listen/webpl.php?feedId=30185
Request
POST /listen/webpl.php HTTP/1.1
Content-Type: application/x-www-form-urlencoded
webAuth: 74f440ad812f0cc2192ab782e27608cc
Query String Parameters
feedId: 30185
Request Data
MIME Type: application/x-www-form-urlencoded
t: 14
Let's reproduce this request with curl:
curl 'https://www.broadcastify.com/listen/webpl.php?feedId=30185' \
--request POST \
--header 'webAuth: 74f440ad812f0cc2192ab782e27608cc' \
--data 't=14'
Here's the result:
<script src="/scripts/me_4.2.9/mediaelement-and-player.min.js"></script>
<link rel="stylesheet" href="/scripts/me_4.2.9/mediaelementplayer.min.css"/>
<audio width="300px" id="mePlayer_30185" src="http://relay.broadcastify.com/9wzfd3hrpyctvqx.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ" type="audio/mp3" controls="controls"
autoplay="true">
</audio>
<script>
$('audio').mediaelementplayer({
features: ['playpause', 'current', 'volume'],
error: function () {
alert("Feed has disconnected from the server. This could be due to a power outage, network connection problem, or server problem. Click OK to restart the player. If the player fails to connect then the feed might be down for an extended timeframe.");
location.reload();
}
});
</script>
<br />
<div class="c">If the feed does not automatically play, click or touch the play icon in the player above.</div>
There's your mp3 link, in the src attribute of the <audio> tag. If we try to get it:
$ curl http://relay.broadcastify.com/9wzfd3hrpyctvqx.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ
Moved Temporarily. Redirecting to http://audio13.broadcastify.com/9wzfd3hrpyctvqx.mp3?nocache=2623053&xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ
If you try to access that URL (or the original one with -L, instructing curl to follow redirects), the mp3 stream will start printing to your terminal as a bunch of nonsense characters.
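To capture or play the stream instead of dumping it to the terminal, let curl follow the redirect and send the bytes somewhere useful. A sketch (the relay URL is the expired one from above; mpv is just one example of a player that reads from stdin):
# save the stream to a file
curl -sL 'http://relay.broadcastify.com/9wzfd3hrpyctvqx.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ' -o stream.mp3
# or pipe it straight into a player
curl -sL 'http://relay.broadcastify.com/9wzfd3hrpyctvqx.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ' | mpv -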
So your shell script should hit the /listen/webpl.php endpoint instead of trying to scrape the web player HTML page; or rather, scrape the page only to get the webAuth token first, then hit the endpoint.
Update
In response to your update with the shell script, here is a simplified script that does the same thing and also strips the "Moved Temporarily" prefix to get just the audio URL. Note that there's no need to use a temporary file, and the $(...) syntax is preferred over the `...` syntax:
#!/bin/bash
# I always start my scripts with this. See https://sipb.mit.edu/doc/safe-shell/
set -eufo pipefail
# Extract the webAuth token from the player page.
auth=$(curl -s "https://www.broadcastify.com/listen/feed/$1/web" \
  | grep webAuth \
  | head -n 1 \
  | sed 's/^.*"webAuth": "//;s/".*$//')
# POST to the player endpoint and pull the relay URL out of the returned HTML.
relay_url=$(curl -s "https://www.broadcastify.com/listen/webpl.php?feedId=$1" \
  -H "webAuth: $auth" -d 't=14' \
  | grep -o 'http://[^"]*')
# The relay replies "Moved Temporarily. Redirecting to <url>"; keep the 5th field.
audio_url=$(curl -s "$relay_url" | cut -d' ' -f5)
echo "$audio_url"

download all images on the page with WGET

I'm trying to download all the images that appear on a page with wget. It seems that everything is fine, but the command is actually downloading only the first 6 images and no more. I can't figure out why.
The command I used:
wget -nd -r -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
It's downloading only the first 6 relevant images from the page, plus a lot of other stuff that I don't need. Look at the page; any idea why it's only getting the first 6 relevant images?
Thanks in advance.
I think the main problem is that there are only 6 JPEGs referenced directly on that page; all the others are GIF placeholders. For example:
<img src="http://www.edpeers.com/wp-content/themes/prophoto5/images/blank.gif"
data-lazyload-src="http://www.edpeers.com/wp-content/uploads/2013/11/aa_umbria-italy-wedding_075.jpg"
class="alignnone size-full wp-image-12934 aligncenter" width="666" height="444"
alt="Umbria wedding photographer" title="Umbria wedding photographer" /
data-lazyload-src is set by a jQuery lazy-loading plugin (see http://www.appelsiini.net/projects/lazyload); the real JPEG URL is only swapped into src by JavaScript at runtime, so wget never sees it.
Try -p instead of -r
wget -nd -p -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
see http://explainshell.com:
-p
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML
page. This includes such things as inlined images, sounds, and referenced stylesheets.
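Since wget does not execute JavaScript, another workaround is to extract the data-lazyload-src URLs yourself and hand them to wget. A rough sketch, untested against the live page (it assumes the attribute appears exactly as in the markup above):
#!/bin/bash
# Pull the lazy-loaded image URLs out of the page and download each one.
url='http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/'
curl -s "$url" \
  | grep -o 'data-lazyload-src="[^"]*\.jpg"' \
  | sed 's/^data-lazyload-src="//;s/"$//' \
  | xargs -n 1 wget -nd -P .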

Shell script Email bad formatting?

My script is perfectly fine and produces a file. The file is plain text and is formatted like the expected results shown below. However, when I try to send the file to my email, the formatting is completely wrong.
The line of code I am using to send my email:
cat ReportEmail | mail -s 'Report' bob@aol.com
The result I am getting in my email:
30129 22.65 253
96187 72.32 294
109525 82.35 295
10235 7.7 105
5906 4.44 106
76096 57.22 251
My expected results should look like this:
30129    22.65    253
96187    72.32    294
109525   82.35    295
10235    7.7      105
5906     4.44     106
76096    57.22    251
Your source file achieves the column alignment with a combination of tabs and spaces. The width assigned to a tab, however, can vary from program to program; widths of 4, 5, or 8 spaces, for example, are common. If you want consistent formatting in plain text from one viewer to the next, use only spaces.
As a workaround, you can expand the tabs to spaces before passing the file to mail, using the expand utility:
expand -t 8 ReportEmail.txt | mail -s 'Report' bob@aol.com
The option -t 8 tells expand to treat tabs as 8 spaces wide. Change the 8 to whatever number consistently makes the format in ReportEmail.txt work properly.
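Alternatively, if you control the script that generates ReportEmail, you can avoid tabs entirely by printing fixed-width columns with printf. A sketch (the field widths are arbitrary; pick ones wide enough for your data):
# %-8s left-aligns each field in an 8-character column using spaces only
printf '%-8s %-8s %s\n' 30129 22.65 253
printf '%-8s %-8s %s\n' 96187 72.32 294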

Print a postscript document with CUPS and a thermal printer

I installed an Epson TM-T20 on Ubuntu 12.04, using the official driver. This is a thermal printer, and I'm using 80 mm paper.
My problem: when I print an image (using a PostScript document), it wastes a lot of paper, because the image takes around 5 cm but the printer feeds out 25 cm of blank paper before it.
I use the following command to send the document to the printer:
lpr -P tm-t20 document.ps
The printer prints the image (a 200x200 image), but first feeds out a lot of unprinted paper.
The printer wasn't recognized by CUPS (using the web interface at localhost:631), so I installed it with the following procedure:
sudo lpadmin -p tm-t20 -E -v serial:/dev/ttyUSB0 -P /usr/share/ppd/epson-tm-t20-rastertotmt.ppd
Then the printer appeared in the CUPS web interface and I configured it (baud rate, bit parity, etc).
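As a side note, the serial settings can also be encoded directly in the device URI when running lpadmin, instead of being configured afterwards in the web interface. A sketch (the parameter values here are assumptions; check the CUPS serial backend documentation for the names and values your version supports):
sudo lpadmin -p tm-t20 -E \
  -v 'serial:/dev/ttyUSB0?baud=38400+bits=8+parity=none+flow=none' \
  -P /usr/share/ppd/epson-tm-t20-rastertotmt.ppd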
The printer works ok when I send some text.
Here is part of the printer ppd:
*DefaultPageRegion:RP80x297
*PageRegion RP80x297/Roll Paper 80 x 297 mm: "<</PageSize[204 841.8]/ImagingBBox null>>setpagedevice"
*PageRegion RP58x297/Roll Paper 58 x 297 mm: "<</PageSize[141.7 841.8]/ImagingBBox null>>setpagedevice"
*CloseUI: *PageRegion
*DefaultImageableArea: RP80x297
*ImageableArea RP80x297/Roll Paper 80 x 297 mm: "0 0 204 841.8"
*ImageableArea RP58x297/Roll Paper 58 x 297 mm: "0 0 141.7 841.8"
*DefaultPaperDimension: RP80x297
*PaperDimension RP80x297/Roll Paper 80 x 297 mm: "204 841.8"
*PaperDimension RP58x297/Roll Paper 58 x 297 mm: "141.7 841.8"
I suppose this waste of paper is due to the 297 mm length that appears in the PPD file, so I tried adding another configuration with 100 mm instead of 297 mm, but the problem persists.
I also tried adding the %%DocumentMedia tag to the PS file, but the problem is the same:
%!PS-Adobe-3.0
%%Creator: GIMP PostScript file plugin V 1.17 by Peter Kirchgessner
%%Title: yay.ps
%%CreationDate: Thu Sep 13 13:44:26 2012
%%DocumentData: Clean7Bit
%%LanguageLevel: 2
%%Pages: 1
%%BoundingBox: 14 14 215 215
%%
%%EndComments
%%DocumentMedia: Plain 72 72 0 white Plain
%%BeginProlog
% Use own dictionary to avoid conflicts
10 dict begin
%%EndProlog
%%Page: 1 1
% Translate for offset
14.173228346456694 14.173228346456694 translate
% Translate to begin of first scanline
0 199.99999999999997 translate
199.99999999999997 -199.99999999999997 scale
% Image geometry
200 200 8
% Transformation matrix
[ 200 0 0 200 0 0 ]
% Strings to hold RGB-samples per scanline
/rstr 200 string def
/gstr 200 string def
/bstr 200 string def
{currentfile /ASCII85Decode filter /RunLengthDecode filter rstr readstring pop}
{currentfile /ASCII85Decode filter /RunLengthDecode filter gstr readstring pop}
{currentfile /ASCII85Decode filter /RunLengthDecode filter bstr readstring pop}
true 3
%%BeginData: 14759 ASCII Bytes
Any idea?
Finally, after a lot of pain, I discovered that the problem was the serial-to-USB cable (used to connect the serial printer to a USB port). I tried two different serial-to-USB cables, but the problem persisted, and I concluded that the printer works erratically if it is not connected to a "real" serial port. I tested the printer under identical conditions on a PC with a serial port and it worked perfectly, just by installing the driver provided by Epson and running chmod 777 on /dev/ttyS0. In the job list I sometimes see the error "/usr/lib/cups/filter/pstopdf failed", but the printer prints fine, as if no error had occurred.
I had to chmod 777 /dev/ttyUSB0 to get the printer working (even when running the commands with sudo).
I'm getting acceptable results (the text is not centered) with the option media=B8:
lp -d tm-t20 -o media=B8 document.ps
I also tried with
lp -d tm-t20 -o media=Custom.80x90mm document.ps
But the printer doesn't print, and the job appears as completed in the CUPS web interface.
If I try with
lp -d tm-t20 -o media=Custom.200x190 document.ps
The printer prints (not correctly centered; I guess I need to try different values until I get the desired result). The paper dimensions in points are listed on this site: http://paulbourke.net/dataformats/postscript/
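The conversion is 72 points per inch, i.e. millimetres × 72 / 25.4. A quick check with bc (note that 204 pt, the width in the PPD above, corresponds to the 72 mm printable area of the 80 mm roll, which is presumably why widths near 200 work):
echo '72 * 72 / 25.4' | bc -l   # 72 mm printable width -> ~204 pt
echo '90 * 72 / 25.4' | bc -l   # 90 mm paper length    -> ~255 pt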
The printer isn't cutting the paper, and I don't know how to specify that option (print and then cut the paper).
The options accepted by the printer are:
lpoptions -p tm-t20 -l
PageSize/Media Size: *RP80x297 RP58x297 Custom.WIDTHxHEIGHT
Resolution/Resolution: *203x203dpi
TmtSpeed/Printing Speed: *Auto 1 2 3 4
TmtPaperReduction/Paper Reduction: Off Top *Bottom Both
TmtPaperSource/Paper Source: *DocFeedCut DocFeedNoCut DocNoFeedCut DocNoFeedNoCut PageFeedCut PageFeedNoCut PageNoFeedCut
TmtBuzzerControl/Buzzer: *Off Before After
TmtSoundPattern/Sound Pattern: *A B C D E
TmtBuzzerRepeat/Buzzer Repeat: *1 2 3 5
TmtDrawer1/Cash Drawer #1: *Off Before After
How can I make the printer print and then cut the paper? I need to do it from the console, so that I can use it from a custom C++ program. If you have any other experience with this kind of printer under Linux, please give me some advice. My goal is to drive the printer from a C++ program; I didn't find a quick way to do that (there is no official documentation for sending raw ESC/POS commands to the printer under Linux), so I'm working with CUPS from the console.
Paper CUT SOLVED:
lp -d tm-t20 -o media=Custom.200x258 -o source=DocFeedCut document.ps
I don't know why this works, because as shown in the options above, DocFeedCut is already the default.
Now I just need to center the text correctly.
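As an aside on the raw ESC/POS route: many Epson thermal printers accept the cut command GS V written directly to the serial device, bypassing CUPS entirely. An untested sketch (the byte sequence 0x1D 0x56 0x00 is the standard ESC/POS full cut; verify against the TM-T20 manual before relying on it):
# GS V 0: full paper cut; printf writes the raw bytes to the device
printf '\x1d\x56\x00' > /dev/ttyUSB0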

wkhtmltopdf with full page background

I am using wkhtmltopdf to generate a PDF file that is going to a printer, and I'm having some trouble making the content fill up an entire page in the resulting PDF.
In the CSS I've set the width and height to 2480 × 3508 pixels (A4 at 300 dpi), and when creating the PDF I use 0 for the margins, but I still end up with a small white border to the right and bottom. I also tried using mm and percentages, with the same result.
I'd need someone to please provide an example of how to style the HTML and what options to use on the command line so that the resulting PDF pages fill the entire background. One way might be to include bleed (this might be necessary anyway), but any tips are welcome. At the moment I am creating one big HTML page (without CSS page breaks; might they help?), but if needed it would be fine to generate each page separately and then feed them all to wkhtmltopdf.
wkhtmltopdf v 0.11.0 rc2
What ended up working:
wkhtmltopdf --margin-top 0 --margin-bottom 0 --margin-left 0 --margin-right 0 <url> <output>
shortens to
wkhtmltopdf -T 0 -B 0 -L 0 -R 0 <url> <output>
Using HTML from stdin (note the dash):
echo "<h1>Testing Some Html</h1>" | wkhtmltopdf -T 0 -B 0 -L 0 -R 0 - <output>
echo "Testing Some Html" | wkhtmltopdf -T 0 -B 0 -L 0 -R 0 - test.pdf
Using HTML from stdin to stdout:
echo "Testing Some Html" | wkhtmltopdf -T 0 -B 0 -L 0 -R 0 - - > test.pdf
What did not work:
Using --dpi
Using --page-width and --page-height
Using --zoom
We just solved the same problem by using the --disable-smart-shrinking option.
I realize this is old and cold, but just in case someone finds this and has the same or a similar problem, here's a workaround that worked for me after some trial and error.
I created a simple filler.html as:
<!DOCTYPE html>
<html>
<head>
</head>
<body style="margin: 0; padding: 0;">
<div style="height: 30mm; background-color: #F7EBD4;">
</div>
</body>
</html>
Use valid HTML (the !DOCTYPE is important) and only inline styles. Match the background color to that of the main document and use a height equal to or bigger than your margins.
I run version 0.12.0 with the following arguments:
wkhtmltopdf --print-media-type --orientation portrait --page-size A4 \
  --encoding UTF-8 -T 10mm -B 10mm -L 0mm -R 0mm \
  --header-html filler.html --footer-html filler.html - - <file.html >file.pdf
Hoping this helps someone...
I'm using version 0.12.2.1 and setting:
body { padding: 0; margin: 0; }
div.page-layout { height: 295.5mm; width: 209mm;}
worked for me.
Of course, you also need to set 0 margins with:
wkhtmltopdf -T 0 -B 0 -L 0 -R 0
At http://code.google.com/p/wkhtmltopdf/issues/detail?id=359 I found that more people 'suffer' from this bug. The --dpi 300 workaround did not work for me; I had to set --zoom 1.045 to zoom in a bit, which made the extra right and bottom border disappear.
Works fine for me with the -B 0 -L 0 -R 0 -T 0 options and using your trick of setting up an A4-sized div.
Did you remember to use body { margin: 0; padding: 0; } at the top of your CSS?
I cannot help you with CSS page breaks, as I have not trialled and errored those yet. However, you can run scripts on the page to do clever things. Here is a jQuery example of how to split content into page-sized chunks based on the length of the content. If you can get it adapted to work with wkhtmltopdf then please post here!
http://www.script-tutorials.com/demos/79/index.html
What you are experiencing is a bug.
You'll need to set the --dpi option when converting the file. In your case you will probably want --dpi 300, but it can be set lower.
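For example (a sketch; the paths are placeholders, and the zero margins come from the workaround shown earlier):
wkhtmltopdf --dpi 300 -T 0 -B 0 -L 0 -R 0 input.html output.pdf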
Solved it by increasing the DPI. I'm working with an A4 size in portrait mode and had white space to the right. I noticed that as the DPI increased, the white space got thinner: at 300 dpi it is not visible in Chrome's PDF view even at the maximum 500% zoom, though in Adobe Reader it is still visible. It got better at 600 dpi, and at 1200 dpi it became invisible even at 6500% zoom. There's no disadvantage to this as far as I observed: every DPI setting generates the same file size and runs at the same speed (tested on one page).
Effectively, my settings are as follows:
echo "<html style='padding=0;margin=0'><body style='background-color:black;padding=0;margin=0'></html>" | wkhtmltopdf -T 0 -B 0 -L 0 -R 0 --disable-smart-shrinking --orientation portrait --page-size A4 --dpi 1200 - happy.pdf
If you use an unscaled PNG image (which will therefore be pixel-perfect), the default ratio for an A4 page needs to be 120 ppi, i.e. at 210 mm the image must be 993 pixels wide by 1404 pixels high. Whether the source is tagged 72 or 300 dpi makes no difference for a default placement; it is the 993 pixels that are counted as 210 mm.
No heights, no widths, no stretching or shrinking; just place the image as an unscaled background by default.
wkhtmltopdf --enable-local-file-access -T "0mm" -L "0mm" -R "0mm" -B "0mm" test.html test.pdf
Here is such an image reduced onto an A4 PDF page at two different densities with the same number of pixels:
If you use scaling you can use different density values, but this is all that is needed by default, since PDF works on overall pixel dimensions, not DPI as such. Note that the PNG as inserted in the PDF is actually smaller than the source JPG, which was over 372 KB.
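The 993 × 1404 figure follows directly from the 120 ppi default; a quick check (bc just does the millimetre-to-pixel arithmetic):
echo '210 / 25.4 * 120' | bc -l   # A4 width at 120 ppi:  ~992 px
echo '297 / 25.4 * 120' | bc -l   # A4 height at 120 ppi: ~1403 px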
