Nokogiri Scraping Misses HTML

Nokogiri isn't grabbing anything beneath the iframe tag.
doc.search("iframe") returns only the iframe tag. doc.search("body.content-frame") returns empty. doc.errors returns empty also. Why isn't Nokogiri registering the HTML beneath the iframe? How can I grab it?
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body onunload="clearMyTimeInterval()">
<iframe id="content-frame" frameborder="0" src="/sportsbook/betting-lines/baseball/2014-08-21/?range=day" onload="javascript:checkLoadedFrame(this);" style="background-color: rgb(34, 34, 34); height: 1875px;" name="content-frame" scrolling="no" border="0">
#document
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body class="content-frame">
#ETC.......

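That's because the contents of the iframe are not part of the page. They live at a completely different location (note the src attribute of the iframe). The "#document" node in your paste is what the browser's inspector shows after it has loaded that src; it is not in the HTML that Nokogiri receives. You'll have to fetch that content separately, which is how a browser does it.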

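Here is code that handles it with Mechanize:
require 'mechanize'
page = Mechanize.new.get "http://page_u_need"
# iframe_with finds the frame; content fetches the frame's src as its own page
page.iframe_with(id: 'content-frame').content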

Related

<img> after <h1> does not validate as HTML 4.01 Strict

My markup is
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>klaymen - About</title>
</head>
<body>
<h1>Klaymen</h1>
<img src="resources/klaymen-about.jpg" width="200" alt="Klaymen's about picture">
</body>
When I test the document with this validator https://validator.w3.org/#validate_by_input I get the following error:
document type does not allow element "IMG" here; missing one of "P", "H1", "H2", "H3", "H4", "H5", "H6", "DIV", "ADDRESS" start-tag
Clearly I have an H1, and img is a flow content element which is supposed to be allowed in this location, so what is the problem?

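You can use a div container.
<div>
<img src="" alt="">
<span style="display: block">Text below the image</span>
</div>
This works because in HTML 4.01 Strict the body may only contain block-level elements (plus script, ins, and del) as direct children. <img> is an inline element, so it has to sit inside a block element such as <div> or <p>. The validator message is listing the block-level start tags it expected to see before the image.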

document type does not allow element "div" here; assuming missing "object" start-tag

Error Line 18, Column 19: document type does not allow element "div" here; assuming missing "object" start-tag
Please see the page source below
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta name="keywords" content="Karnataka,Bangalore_Rural,Healthcare,Office_Assistant,Kerala,Ernakulam,IT_Hardware_Networking,Engineer,Sales___Marketing,Executive,Maharashtra,Mumbai_City,Retailing,Manager,Kollam,CRM_CallCentres_BPO_ITES_Med.Trans,Customer_Care,Hotel_Travel_Tourism_Airlines_Hospitality,Front_Office_Staff,Andhra_Pradesh,Hyderabad,IT_Software,Java_Developer,Pathanamthitta,Manufacturing_Industrial,Educational_Training,Teacher,Engineering_Projects"/>
<meta name="description" content="The best job oriented resume sharing system. Create and Publish your online resumes for FREE. Search and apply your dream jobs for FREE. Post your jobs for FREE."/>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<div id="fb-root"></div>

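You need to put the <div> inside the <body> section. In the source shown, <div id="fb-root"> comes right after the <meta> tags, so it is still inside <head>, where a div is not allowed. The validator guesses a missing <object> because that is a head-level element that could contain a div.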

Javascript getElementById from parent window to child window

Hello and thank you for reading my post.
Here is what I basically want to do:
in a first HTML page ("parent.html"), there is a button ;
when a user clicks the button a new window pops up ("child.html")
AND the contents of a "div" element in the child window is updated.
The final action is unsuccessful under "Firefox" and "Chrome".
parent.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Parent window document</title>
</head>
<body>
<input
type="button"
value="Open child window document"
onclick="openChildWindow()" />
<script type="text/javascript">
function openChildWindow()
{
var s_url = "http://localhost:8080/projectroot/child.html";
var s_name = "ChildWindowDocument";
var s_specs = "resizable=yes,scrollbars=yes,toolbar=0,status=0";
var childWnd = window.open(s_url, s_name, s_specs);
var div = childWnd.document.getElementById("child_wnd_doc_div_id");
div.innerHTML = "Hello from parent wnd";
}
</script>
</body>
</html>
child.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Parent window document</title>
</head>
<body>
<div id="child_wnd_doc_div_id">child window</div>
</body>
</html>
IE9 => it works.
Firefox 13.0.1 => it doesn't work. Error message: "div is null".
Chrome 20.0.1132.47 m => doesn't work.
Do you understand that behaviour?
Can you help me make it work in these three cases?
Thank you and best regards.

I think the child window's document has not finished loading when you try to access its elements. window.open returns right away, so getElementById runs before the div exists, which is why Firefox reports "div is null". You can wait for the child's load event instead:
// Runs once the child document has loaded
childWnd.onload = function() {
var div = childWnd.document.getElementById("child_wnd_doc_div_id");
div.innerHTML = "Hello from parent wnd";
};
Also take a look at the MDN docs.
A better approach may be to make the change from the child window itself. The child can reach the parent window through window.opener. Keep in mind that the parent window may have been closed by then, so consider passing the data through some kind of local storage (e.g. a cookie).

Firebug: cannot find an object in DOM (using FireFox)

I simply cannot find an object in DOM using Firebug (FF).
I want to see the elrteOptions object in the DOM. I right-click the page, choose "Inspect Element", go to the DOM tab, and type elrteOptions in the search box. No results.
How do I see it? =)
Thanks.
Code is as simple as:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>some</title>
<script src='js/jquery.min.js' type='text/javascript'></script>
<script type='text/javascript'>
$(document).ready(function() {
var elrteOptions = {
cssClass : 'el-rte',
lang : 'ru',
toolbar : 'maxi',
cssfiles : ['styles/elrte-inner.css']
}
});
</script>
</head>
<body>
test
</body>
</html>

Looks like it had something to do with the browser (Firefox 9.0.1).
After a complete reinstall, objects started to appear normally.

Nokogiri -- preserve doctype and meta tags

I'm using Nokogiri to open an existing HTML file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Foo</title>
</head>
<body>
<!-- stuff -->
</body>
</html>
Then I change the contents of the body tag like this:
html_file = Nokogiri::HTML("path/to/html/file")
html_file.css('body').first.inner_html = "new body content"
Then I write this new document to a file like this:
File.open("path/to/new/html/file", 'w') {|f| f.write html_file}
And this is my resulting html file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
new body content
</body></html>
My question is whether it's possible to tell Nokogiri to preserve the original HTML file's doctype and meta tags, since they appear to be lost or changed when I open the document with Nokogiri and write it back to a file.
Any help would be much appreciated. Thanks!

Finally figured it out.
I just changed the line:
html_file = Nokogiri::HTML("path/to/html/file")
to
html_file = Nokogiri::HTML(File.read("path/to/html/file"))
and now it works as I expect. Nokogiri::HTML takes the markup itself (a string or an IO object), not a file name, so the original call parsed the literal text "path/to/html/file" as the document and Nokogiri wrapped it in a default doctype.
Thanks for all of the suggestions, @ezkl!
