wicked_pdf header and footer not rendered - wkhtmltopdf

I am using wicked_pdf in my Sinatra application. When I try to add a header or footer, it is not shown in the PDF; only the body element is rendered. Why are the header and footer not applied?
Here is a very simple code example:
get '/api/v1/admin/fcb/pdf/schedules/:id' do
headers['Content-Type'] = 'application/pdf'
WickedPdf.new.pdf_from_string("<!DOCTYPE html><p>body<p>", header: {content: "<!DOCTYPE html><h1>header</h1>"})
end
This results in this PDF:
Versions:
wicked_pdf 2.1.0
wkhtmltopdf 0.12.6
Running on Debian based docker image ruby:2.6-slim

I could not figure out why it did not work as expected, so I ended up handling the header and footer myself: I calculated the body length in my code and added the header and footer where needed.

Ruby: how to generate HTML from Markdown like GitHub's or BitBucket's?

On the main page of every repository in GitHub or BitBucket it shows the Readme.md in a very pretty format.
Is there a way to do the same thing with Ruby? I have already found some gems like Redcarpet, but the output never looks pretty. I've followed these instructions for Redcarpet.
Edit:
After I tried Github's markup ruby gem, the same thing is happening.
What is shown is this:
And what I want is this:
And I'm sure it's not only the CSS that is missing, because after the three backquotes (```) I write the syntax name, like json or bash, and in the first image that name shows up as plain text.
Edit2:
This code here:
renderer = Redcarpet::Render::HTML.new(prettify: true)
markdown = Redcarpet::Markdown.new(renderer, fenced_code_blocks: true)
html = markdown.render(source_text)
'<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>'+html
Generated this:
Github provides its own ruby gem to do so: https://github.com/github/markup.
You just need to install the right dependencies and you're good to go.
You need to enable a few nonstandard features.
Fenced code blocks
Fenced code blocks are nonstandard and are not enabled by default on most Markdown parsers (some older ones don't support them at all). According to Redcarpet's docs, you want to enable the fenced_code_blocks extension:
:fenced_code_blocks: parse fenced code blocks, PHP-Markdown style. Blocks delimited with 3 or more ~ or backticks will be considered as code, without the need to be indented. An optional language name may be added at the end of the opening fence for the code block.
Syntax Highlighting
Most Markdown parsers do not do syntax highlighting of code blocks, and those that do offer it only as an option. Even then, you will still need to provide your own CSS styles to have the code blocks styled properly. As it turns out, Redcarpet does include support for a prettify option to the HTML renderer:
:prettify: add prettyprint classes to <code> tags for google-code-prettify.
You will need to get the Javascript and CSS from the google-code-prettify project to include in your pages.
Solution
In the end you'll need something like this:
renderer = Redcarpet::Render::HTML.new(prettify: true)
markdown = Redcarpet::Markdown.new(renderer, fenced_code_blocks: true)
html = markdown.render(source_text)
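The rendered fragment is not a full page, and the prettyprint classes have no visible effect until the google-code-prettify assets are loaded. A minimal sketch of wrapping the fragment, assuming the run_prettify.js loader URL from the question is still reachable (it may not be, in which case point it at your own copy). The helper name wrap_with_prettify and the sample fragment are hypothetical:

```ruby
# Loader URL taken from the question; assumed reachable, replace with your own copy if not.
PRETTIFY_LOADER = "https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"

# Hypothetical helper: wraps an HTML fragment (for example the output of
# markdown.render) in a complete document that loads the prettify loader.
def wrap_with_prettify(fragment)
  <<~HTML
    <!DOCTYPE html>
    <html>
    <head>
    <meta charset="utf-8">
    <script src="#{PRETTIFY_LOADER}"></script>
    </head>
    <body>
    #{fragment}
    </body>
    </html>
  HTML
end

# Sample fragment standing in for the renderer's output.
page = wrap_with_prettify('<pre><code class="prettyprint">{"a": 1}</code></pre>')
puts page
```

Putting the script tag in the head rather than prepending it to the fragment keeps the output a well-formed document.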
As @yoones said, GitHub shares their way of doing it, but to be more precise they use the gem "commonmarker" for Markdown. As far as I can tell, though, it does not produce a complete HTML file, only a fragment that you insert into <body>. So you can do it like I did:
require "commonmarker"
puts <<~HEREDOC
<!DOCTYPE html>
<html>
<head>
<style>#{File.read "markdown.css"}</style>
</head>
<body class="markdown-body Box-body">
#{CommonMarker.render_html ARGF.read, %i{ DEFAULT UNSAFE }, %i{ table }}
</body>
</html>
HEREDOC
Where did I get the markdown.css? I just stole the CSS files from an arbitrary Github page with README rendered and applied UNCSS to it -- resulted in a 26kb file, you can find it in the same repo I just linked.
Why the table and UNSAFE? I need this to render an index.html for GitHub Pages, because their Markdown renderer can't handle newlines within table cells, etc. So instead of asking it to render my README.md, I generate the index.html myself.

Web scraping from youtube with nokogiri

I want to scrape all the names of the users who commented below a youtube video.
I'm using ruby and nokogiri.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "https://www.youtube.com/watch?v=tntOCGkgt98"
doc = Nokogiri::HTML(open(url))
doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
name = comment.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
But it's not working, I'm not getting any output, no error either.
I won't be able to give you a solution, but at least I can give you a couple of hints that may help you to move forward.
The code you have is not working because the comments section is loaded via an AJAX call after the page is loaded. If you do a hard reload in your browser, you will see a spinner icon and a Loading... text in the comments section while the content is being fetched. When Nokogiri gets the page via the HTTP request, it gets the HTML content as it is before the comments are loaded. As a matter of fact, the place where the comments will later be added looks like this:
<div id="watch-discussion" class="branded-page-box yt-card">
<div id="comment-section-renderer"
class="comment-section-renderer vve-check"
data-visibility-tracking="CCsQuy8iEwjr3P3u1uzNAhXIepAKHRV9D8Ao-B0=">
<div class="action-panel-loading">
<p class="yt-spinner ">
<span class="yt-spinner-img yt-sprite" title="Loading icon">
</span>
<span class="yt-spinner-message">Loading...</span>
</p>
</div>
</div>
</div>
That is the reason why you won't find the divs you are looking for, because they aren't part of the html you have.
Looking at the network console in the browser, it seems that the ajax request to get the comments data is being sent to https://www.youtube.com/watch_fragments_ajax?v=tntOCGkgt98&tr=time&distiller=1&ctoken=EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%253D&frags=comments&spf=load. As you can see the v parameter is the video id, however there are a couple of caveats:
There is a ctoken param, which you can get by scraping the original page contents. It is inside a <script> tag, in the form of
'COMMENTS_TOKEN': "<token>".
However, you still need to send a session_token as form data in the body of the AJAX request (which is a POST). I don't know where that one comes from :(.
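The ctoken step can be sketched with a plain regular expression over the page source. This assumes the token appears exactly in the 'COMMENTS_TOKEN': "<token>" form described above; the helper name extract_comments_token and the sample markup are made up for illustration, and the session_token problem remains unsolved:

```ruby
# Hypothetical helper: pulls the comments token out of the raw watch-page HTML.
# Assumes the token appears as 'COMMENTS_TOKEN': "<token>" inside a <script> tag;
# returns nil when the pattern is absent.
def extract_comments_token(html)
  match = html.match(/'COMMENTS_TOKEN':\s*"([^"]+)"/)
  match && match[1]
end

# Made-up sample standing in for the page source; the token value is a placeholder.
sample = %q(<script>var cfg = {'COMMENTS_TOKEN': "EXAMPLE_TOKEN%3D", 'OTHER': "x"};</script>)
puts extract_comments_token(sample).inspect
puts extract_comments_token("<html></html>").inspect
```

In the original script, the html argument would be the body returned by open(url).read.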
I think that you will be pushing the limits of Nokogiri here, as AFAIK it is not intended to follow AJAX requests or handle JavaScript. Maybe the Ruby Selenium driver is better suited for this.
HTH
I think you need name.css("#comment-section..."
The each statement will iterate over the elements, using the variable name.
You may want to use node instead of name:
doc.css(".comment-thread-renderer > .comment-renderer").each do |node|
name = node.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
I wrote this rails app using nokogiri to see all the tags that a page has before any javascript is run in the browser. The source code is here, so you can adjust it if you need to add more info about the node in the view.
That can easily tell you if the particular tag element that you are looking for is something you can retrieve without having to do some JS eval.
Most web crawlers don't support client-side rendering, which gives you an idea that it's not a trivial task to execute JS when scraping content.
YouTube is a dynamically rendered JavaScript website, though it can be parsed with Nokogiri without using Selenium or another package. Try opening the Network tab in dev tools, scrolling to the comment section, and seeing which request is being sent.
You need to make a POST request in order to fetch the comments data. You can preview the output in the "Preview" tab.
Preview output:
Which is equivalent to this comment:
Note: Since this comment brings very little value, this answer will be updated with the attached code once there will be an available solution.

How to extract HTML from updated DOM using Capybara Webkit driver?

I have a page that injects some text into the DOM: 'Success!'.
The Javascript code works because I see the expected text in the screenshot, and the spec passes:
page.visit '/'
save_and_open_screenshot
expect( page).to have_content 'Success!'
puts page.html
However, the page.html is not updated. It does not have the injected text.
How do I get the HTML for the updated DOM?
EDIT: I found that the issue is caused by an iframe. The iframe is not added to the page.html, but it is added to the page.
EDIT #2: It turns out that the 'Success!' content is not in the iframe. So maybe the context is switching to the iframe.
Found one workaround which is OK:
html = page.evaluate_script( 'document.documentElement.innerHTML' )
I guess one could use JS or jQuery finder to find the expected <div>.
For the entire page body you can do this:
page.body
For any element in particular
page.find(".my-div").base.inner_html
Check out the full API here: https://github.com/thoughtbot/capybara-webkit/blob/master/lib/capybara/webkit/node.rb

Header HTML has 100% height with wkhtmltopdf 0.12

I am using wkhtmltopdf 0.12 with wicked_pdf or pdfkit and the header takes almost 100% of the page height.
It creates these problems :
Pages are almost empty
There are many more pages than there should be
Solved this by adding
<!DOCTYPE html>
at the top of the header HTML file.
Somewhat capricious, I know...
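If the header markup is built as a string in Ruby, the same fix can be applied defensively before handing it to wicked_pdf or pdfkit. A minimal sketch; ensure_doctype is a hypothetical helper, not part of either gem:

```ruby
# Hypothetical helper: prepends <!DOCTYPE html> unless the markup already
# starts with a doctype declaration.
def ensure_doctype(html)
  return html if html.lstrip.match?(/\A<!DOCTYPE\b/i)
  "<!DOCTYPE html>\n#{html}"
end

header_html = ensure_doctype("<h1>header</h1>")
puts header_html
```

Calling it twice leaves the string unchanged, so it is safe to apply to templates that may or may not already carry the doctype.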

Put HTML in javascript using Ruby

Note: This is a very strange and unique use case so I apologise in advance if it seems a bit ass-backwards.
I have a haml file content.haml and a coffeescript file main.coffee.
I wish to somehow get the html resulting from rendering content.haml into a variable in the coffeescript/resulting javascript.
The end result should be a javascript file rendered to the browser.
let's say they look like this:
# content.haml
.container
.some_content
blah blah blah
-
# main.coffee
html_content = ???
do_something_with_html_content(html_content)
I know, this sounds ridiculous, 'use templates', 'fetch the HTML via ajax' etc. In this instance however, it's not possible, everything needs to be served via one JS file and I cannot fetch other resources from the server. Weird, I know.
Short of manually reconstructing the haml in the coffeescript file by joining an array of strings like this:
html_content = [
'<div class="container">',
'<div class="some_content">',
'blah blah blah',
'</div>',
'</div>',
]
I'm not sure the best way of doing this.
Another way I thought of was to put something like this in the coffee file:
html_content = '###CONTENT###'
Then render the haml to html in ruby, render the coffeescript to js and then replace ###CONTENT### with the rendered html before serving to the client. However the html is a multi-line string so it completely destroys the javascript.
I'm convinced there must be some other nice way of rendering the haml into html in a variable such that it forms valid javascript, but my brain has gone blank.
Perhaps you can try something like this in one of your views:
:javascript
  html_content = "#{escape_javascript(render partial: 'content')}";
  // your own logic follows here...
Wouldn't it be better to use a custom html data attribute and then fetch the content of it in js?
<div data-mycontent="YOUR CONTENT GOES HERE"></div>
And then in coffee, use the dataset attribute / data via jquery, if it is available.
If you set a var via writing the file directly it will render your js file uncacheable, among other drawbacks.
You can do that by using the sprockets gem, like Rails does. You just need to rename your CoffeeScript file to main.coffee.erb and use it as you would e.g. a haml template. Pass in your rendered html with an instance variable:
html_content = '<%= @html_content %>'
Edit: Added missing quotes.
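For the placeholder approach from the question, the multi-line problem goes away if the HTML is encoded as a string literal before substitution. A minimal sketch in plain Ruby, assuming the compiled JavaScript contains the quoted placeholder '###CONTENT###'; the helper name embed_html is hypothetical, and JSON encoding is one possible choice of escaping, not something the answers above prescribe:

```ruby
require "json"

# Hypothetical helper: replaces the quoted placeholder in the compiled JS with
# a JSON-encoded string literal, which is also a valid JavaScript string.
# The block form of sub is used so backslashes in the literal are inserted as-is.
def embed_html(js_source, html, placeholder: "'###CONTENT###'")
  js_source.sub(placeholder) { html.to_json }
end

js = "var html_content = '###CONTENT###';\ndo_something_with_html_content(html_content);"
html = "<div class=\"container\">\n  <div class=\"some_content\">blah blah blah</div>\n</div>"
puts embed_html(js, html)
```

Newlines and double quotes in the HTML end up as escape sequences inside a single-line literal, so the surrounding JavaScript stays syntactically valid.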
