How to extract HTML from updated DOM using Capybara Webkit driver?

I have a page that injects some text into the DOM: 'Success!'.
The Javascript code works because I see the expected text in the screenshot, and the spec passes:
page.visit '/'
save_and_open_screenshot
expect(page).to have_content 'Success!'
puts page.html
However, page.html is not updated: it does not contain the injected text.
How do I get the HTML for the updated DOM?
EDIT: I found that the issue is caused by an iframe. The iframe is not added to the page.html, but it is added to the page.
EDIT #2: It turns out that the 'Success!' content is not in the iframe. So maybe the context is switching to the iframe.

Found one workaround which is OK:
html = page.evaluate_script( 'document.documentElement.innerHTML' )
I guess one could use JS or jQuery finder to find the expected <div>.
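The workaround above can be wrapped in a small helper. This is only a sketch: live_dom_html is a hypothetical name, not part of Capybara, and it assumes the object passed in responds to evaluate_script the way a Capybara session with a JavaScript-capable driver does. It uses outerHTML rather than innerHTML so that the <html> element itself is included.

```ruby
# Hypothetical helper (not part of Capybara): asks the browser to serialize
# the live DOM instead of relying on page.html.
# Assumption: `session` responds to evaluate_script, as a Capybara session does.
def live_dom_html(session)
  # outerHTML includes the <html> element; innerHTML would omit it.
  session.evaluate_script('document.documentElement.outerHTML')
end
```

In a spec this would be called as puts live_dom_html(page) after the assertion that waits for the injected text.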

For the entire page body you can do this:
page.body
For any element in particular
page.find(".my-div").base.inner_html
Check out the full API here: https://github.com/thoughtbot/capybara-webkit/blob/master/lib/capybara/webkit/node.rb

Related

How can I get an anchor tag with no or empty href in typo3 ckeditor

In TYPO3 8.7, I'm trying to create an anchor tag to open a modal, in a regular text element, like this:
<a class="someclass" data-open="myModal">Click me</a>
But Typo3 will automatically add an href attribute linking to the current page. When I click the tag, the modal opens, but the page immediately reloads.
I've tried adding href="#", but that turns into href="/mypage/#", and href="#mymodal" becomes href="/mypage/#mymodal", both of which trigger a reload.
In my ckeditor setup, I have set allowedContent: true
How can I make an <a> tag without the href being altered?

If you have a click handler on an <a> tag, you need to return false from the JavaScript handler to stop further processing. Following the link is the last step of that processing.
Even if you manage to reduce the href to #, your page may reload or jump to the top.
Maybe you can fool your browser if you use href="javascript:void(0)".

Web scraping from youtube with nokogiri

I want to scrape all the names of the users who commented below a youtube video.
I'm using ruby and nokogiri.
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "https://www.youtube.com/watch?v=tntOCGkgt98"
doc = Nokogiri::HTML(open(url))
doc.css(".comment-thread-renderer > .comment-renderer").each do |comment|
name = comment.css("#comment-section-renderer-items .g-hovercard").text
puts name
end
But it's not working, I'm not getting any output, no error either.

I won't be able to give you a solution, but at least I can give you a couple of hints that may help you to move forward.
The code you have is not working because the comments section is loaded via an AJAX call after the page is loaded. If you do a hard reload in your browser, you will see a spinner icon and a Loading... text in the comments section while it waits for the content to be loaded. When Nokogiri gets the page via the HTTP request, it gets the HTML content that you see before the comments are loaded. As a matter of fact, the place where the comments will later be added looks like:
<div id="watch-discussion" class="branded-page-box yt-card">
<div id="comment-section-renderer"
class="comment-section-renderer vve-check"
data-visibility-tracking="CCsQuy8iEwjr3P3u1uzNAhXIepAKHRV9D8Ao-B0=">
<div class="action-panel-loading">
<p class="yt-spinner ">
<span class="yt-spinner-img yt-sprite" title="Loading icon">
</span>
<span class="yt-spinner-message">Loading...</span>
</p>
</div>
</div>
</div>
That is the reason why you won't find the divs you are looking for, because they aren't part of the html you have.
Looking at the network console in the browser, it seems that the ajax request to get the comments data is being sent to https://www.youtube.com/watch_fragments_ajax?v=tntOCGkgt98&tr=time&distiller=1&ctoken=EhYSC3RudE9DR2tndDk4wAEAyAEA4AEBGAY%253D&frags=comments&spf=load. As you can see the v parameter is the video id, however there are a couple of caveats:
There is a ctoken param, which you can get by scraping the original page contents. It is inside a <script> tag, in the form of
'COMMENTS_TOKEN': "<token>".
However, you still need to send a session_token as form data in the body of the AJAX request (which is a POST). I don't know where that token comes from :(.
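As a rough sketch of the first caveat, the token could be pulled out of the page source with a regular expression. The sample HTML, the token value and the extract_comments_token name are all made up for illustration; only the 'COMMENTS_TOKEN': "<token>" shape comes from the description above, and the real page may well differ.

```ruby
# Made-up sample of the inline script described above; the token is a placeholder.
html = <<~HTML
  <script>
    var config = { 'COMMENTS_TOKEN': "EXAMPLE_TOKEN%3D", 'OTHER': "x" };
  </script>
HTML

# Hypothetical helper: returns the token string, or nil if the pattern is absent.
def extract_comments_token(html)
  match = html.match(/'COMMENTS_TOKEN':\s*"([^"]+)"/)
  match && match[1]
end

token = extract_comments_token(html)
puts token
```

This only covers the ctoken part; the session_token mentioned above would still be missing from the request.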
I think that you will be pushing the limits of Nokogiri here, as AFAIK it is not intended to follow AJAX requests or handle JavaScript. Maybe the Ruby Selenium driver is better suited for this.
HTH

I think you need name.css("#comment-section..."
The each statement will iterate over the elements, using the variable name.
You may want to use node instead of name:
doc.css(".comment-thread-renderer > .comment-renderer").each do |node|
name = node.css("#comment-section-renderer-items .g-hovercard").text
puts name
end

I wrote this rails app using nokogiri to see all the tags that a page has before any javascript is run in the browser. The source code is here, so you can adjust it if you need to add more info about the node in the view.
That can easily tell you if the particular tag element that you are looking for is something you can retrieve without having to do some JS eval.
Most web crawlers don't support client-side rendering, which gives you an idea that it's not a trivial task to execute JS when scraping content.

YouTube is a dynamically rendered JavaScript website, though it could be parsed with Nokogiri without using Selenium or another package. Try opening the Network tab in dev tools, scroll to the comment section, and see which request is being sent.
You need to make a POST request in order to fetch the comments data. You can preview the output in the "Preview" tab.
Note: Since this comment brings very little value, this answer will be updated with the attached code once a solution is available.

Ckeditor inline editor <p> tags being added on init despite presence of <h2> tag

SOLVED. Update - I was mistaken in my original assumption. See my answer below.
I have an app where I initialise inline ckeditors on various contenteditable divs.
I am well aware that CKEditor needs to add
<p><br></p>
to the markup of an empty editor to prevent content collapse. However, I have a specific situation where a contenteditable div that contains ONLY this HTML:
<h2>This is a heading</h2>
Has its markup modified to this:
<p><br></p><h2>This is a heading</h2><p><br></p>
When I call
CKEDITOR.inline(element, config);
Where element is the contenteditable div
I am using 4.4.1
This only happens when the markup in the contenteditable div is purely a heading. If there is also a paragraph in the markup this does not happen.
It appears that CKEditor is ignoring the heading when determining whether or not it needs to add content to an empty editor.
To be clear everything else works as I would expect, just this very specific issue.
Any ideas how to fix this?

Ok I figured out this was not ckeditor at all but some of my own code that was adding the tags.
I had some script which was checking whether the innerHTML of the element was a p tag, and if not, it was wrapping the whole thing in p tags.
The reason this was not more obvious is that the p tags were empty and hence collapsed. Only when calling CKEDITOR.inline(element, config) on the element did CKEditor do its thing and fillEmptyBlocks, which gave the p tags their height. This made it seem that they only appeared when the editor was instantiated.
In fact they were there already.

Unexpected result loading partial view into an IE8 DOM using jQuery Ajax

I have a strange result happening when loading a Partial View using jQuery Ajax into the DOM, but only when viewing the result using IE8.
The Partial View in question (for example purposes only) looks like this;
<aside>Example test</aside>
When the result comes back from the Ajax call it appears to look exactly as above. However when viewing the DOM using the developer tools in IE8 the result looks like this;
<aside/>
Example test
</aside/>
As a result the element is not recognised and the text does not sit within it. Also the style sheet class for 'aside' is not being applied. This only happens in IE8 as far as I can see.
Has anyone got any suggestions, other than not using custom tags?
Thanks in advance

You need to make sure the DOM recognizes HTML5 elements. Essentially you will have to do:
document.createElement('aside');
Have a look at this link. Without creating the element, older IE browsers will not see or style the elements.
The common practice around these issues is to load an HTML5 fix JavaScript file within a conditional comment block. The script does a createElement on all new HTML5 node types.
<!--[if lte IE 8]>
<script src="html5.js" type="text/javascript"></script>
<![endif]-->

Groovy htmlunit getByXPath

I'm currently using HtmlUnit to attempt to grab an href out of a page and am having some trouble.
The XPath is:
/html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a
On the webpage it looks like:
<a class="t" title="This Brush" href="http://domain.com/this/that">Brush Set</a>
In my code I am doing:
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']")
However, this is returning everything in there instead of just the url that I want.
Can someone explain what I must add to get the href? (also it doesn't end with .html)

You are selecting the a element. You want to select its href attribute, a/@href.
hrefs = page.getByXPath("//html/body/div[2]/div/div/table/tbody/tr/td[2]/div/div[5]/div/div[2]/span/a[@class='t']/@href")
