Parsing XHTML with XPATH using Microsoft.XMLHTTP in VBScript

I want to parse an XHTML document with XPath in VBScript, using Microsoft.XMLHTTP. The document has the following structure. How can I get an array of the URLs?
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Local index</title>
</head>
<body>
<table>
<tr>
<td>
url1<br/>
url2<br/>
url3
</td><td>
url1-1<br/>
url2-1<br/>
url3-1
</td>
</tr>
</table>
</body>
</html>

Are you sure you need the antiquated ProgID Microsoft.XMLHTTP? Both MSXML 3 and MSXML 6 ship with the operating system, or with one of its supported service packs, on everything since Windows XP.
As for using XPath and MSXML 3, here is an example:
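Dim doc, link
Set doc = CreateObject("Msxml2.DOMDocument.3.0")
' Load synchronously so that load() returns only after parsing has finished
doc.async = False
' Do not validate against, or download, the external XHTML DTD
doc.validateOnParse = False
doc.resolveExternals = False
If doc.load("file.xml") Then
    doc.setProperty "SelectionLanguage", "XPath"
    ' Bind a prefix to the XHTML namespace; XPath 1.0 cannot match a default namespace without one
    doc.setProperty "SelectionNamespaces", "xmlns:xhtml='http://www.w3.org/1999/xhtml'"
    For Each link In doc.selectNodes("//xhtml:a")
        WScript.Echo link.getAttribute("href") & ": " & link.text
    Next
Else
    WScript.Echo doc.parseError.reason
End If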
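
As the sample document is shown in the question, the URLs appear as bare text inside each td, separated by br elements, rather than inside a elements. In that case the text nodes themselves have to be collected (in MSXML an XPath such as //xhtml:td/text() should do it). The following is only a rough sketch of the same namespace-aware idea in Python's standard library, chosen here for illustration because it is easy to run. The names extract_urls and SAMPLE are made up for this example, and the sample is a trimmed copy of the question's document without the DOCTYPE.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Trimmed copy of the document from the question (DOCTYPE and head omitted).
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<table>
<tr>
<td>
url1<br/>
url2<br/>
url3
</td><td>
url1-1<br/>
url2-1<br/>
url3-1
</td>
</tr>
</table>
</body>
</html>"""


def extract_urls(xml_text):
    """Return the non-empty text chunks found inside every XHTML td element."""
    root = ET.fromstring(xml_text)
    urls = []
    # Elements in the default XHTML namespace must be addressed by their
    # qualified name, just as the VBScript code needs the xhtml: prefix.
    for td in root.iter("{%s}td" % XHTML_NS):
        # itertext() yields the td's own text and the text following each br.
        for chunk in td.itertext():
            chunk = chunk.strip()
            if chunk:
                urls.append(chunk)
    return urls


print(extract_urls(SAMPLE))
# ['url1', 'url2', 'url3', 'url1-1', 'url2-1', 'url3-1']
```

The point carried over from the VBScript answer is the namespace handling: without the qualified name, the lookup for td finds nothing.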

Related

Nokogiri Scraping Misses HTML

Nokogiri isn't grabbing anything beneath the iframe tag.
doc.search("iframe") returns only the iframe tag. doc.search("body.content-frame") returns empty. doc.errors returns empty also. Why isn't Nokogiri registering the HTML beneath the iframe? How can I grab it?
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body onunload="clearMyTimeInterval()">
<iframe id="content-frame" frameborder="0" src="/sportsbook/betting-lines/baseball/2014-08-21/?range=day" onload="javascript:checkLoadedFrame(this);" style="background-color: rgb(34, 34, 34); height: 1875px;" name="content-frame" scrolling="no" border="0">
#document
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head></head>
<body class="content-frame">
#ETC.......

That's because the contents of the iframe are not part of the page. In fact, they are in a completely different location (note the src attribute of the iframe). You'll have to fetch that content separately, which is how a browser would do it.
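
To make the "fetch it separately" step concrete, here is a minimal sketch in Python's standard library (used only for illustration; the question itself uses Nokogiri). It finds each iframe's src and resolves it against the URL of the outer page, which is the address that then has to be requested on its own. The host example.com, the constants PAGE_URL and PAGE_HTML, and the helper names are assumptions made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeSrcCollector(HTMLParser):
    """Collects the src attribute of every iframe start tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_html, page_url):
    """Return absolute URLs for all iframes found in page_html."""
    collector = IframeSrcCollector()
    collector.feed(page_html)
    # A relative src is resolved against the outer page, as a browser would do.
    return [urljoin(page_url, src) for src in collector.sources]


# Hypothetical outer page URL; the real host is not given in the question.
PAGE_URL = "http://example.com/sportsbook/"
PAGE_HTML = (
    '<html><body><iframe id="content-frame" '
    'src="/sportsbook/betting-lines/baseball/2014-08-21/?range=day">'
    "</iframe></body></html>"
)

print(iframe_urls(PAGE_HTML, PAGE_URL))
```

Each returned URL can then be downloaded and parsed as a document of its own, for example with urllib.request.urlopen.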

Here is code that handles it with Mechanize (the id is the one used by the iframe in the question):
require 'mechanize'
page = Mechanize.new.get "http://page_u_need"
page.iframe_with(id: 'content-frame').content

Javascript getElementById from parent window to child window

Hello and thank you for reading my post.
Here is basically what I want to do:
in a first HTML page ("parent.html"), there is a button;
when a user clicks the button, a new window pops up ("child.html"),
and the contents of a "div" element in the child window are updated.
The final step fails in Firefox and Chrome.
parent.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Parent window document</title>
</head>
<body>
<input
type="button"
value="Open child window document"
onclick="openChildWindow()" />
<script type="text/javascript">
function openChildWindow()
{
var s_url = "http://localhost:8080/projectroot/child.html";
var s_name = "ChildWindowDocument";
var s_specs = "resizable=yes,scrollbars=yes,toolbar=0,status=0";
var childWnd = window.open(s_url, s_name, s_specs);
var div = childWnd.document.getElementById("child_wnd_doc_div_id");
div.innerHTML = "Hello from parent wnd";
}
</script>
</body>
</html>
child.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Parent window document</title>
</head>
<body>
<div id="child_wnd_doc_div_id">child window</div>
</body>
</html>
IE9 => it works.
Firefox 13.0.1 => it doesn't work. Error message: "div is null".
Chrome 20.0.1132.47 m => doesn't work.
Do you understand that behaviour?
Can you help me make it work in these three cases?
Thank you and best regards.

I think the child window's document has not finished loading at the time you try to access its elements: window.open returns right away, before child.html has been parsed. You can wait for the child's load event instead, something like
childWnd.onload = function () {
    var div = childWnd.document.getElementById("child_wnd_doc_div_id");
    div.innerHTML = "Hello from parent wnd";
};
You can also take a look at the MDN documentation.
A better approach may be to make the change from the child itself. The child can reach the parent window through window.opener. Keep in mind, though, that the parent window may already have been closed, so consider passing the data through some kind of storage instead (e.g. a cookie).

DOCTYPE not working in Firefox

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<Script language = JavaScript>
function addOptionList(selectbox,text,value )
{
var optn = document.createElement('OPTION');
optn.text = text;
optn.value = value;
selectbox.options.add(optn);
}
function removeOptionList(listbox,i){
listbox.remove(i);
}
function addOption_list(fromvar,tovar){
for(i=fromvar.options.length-1;i>=0;i--) {
var userlist=fromvar;
if(fromvar[i].selected){
addOptionList(tovar, fromvar[i].value, fromvar[i].value);
removeOptionList(userlist,i);
}
}
}
</Script>
<table align='center'>
<tr>
<td ><select multiple name='userlist' id='userlist' >
<option value='aaa'>aaa</option>
<option value='bbb'>bbb</option>
</select></td>
<td align='center' valign='middle'>
<input value='-->'
onClick='addOption_list(userlist,pouser);' type='button'>
<br><input value='<--'
onClick='addOption_list(pouser,userlist);' type='button'></td>
<td><select multiple name='pouser' id='pouser'>
<option id='test' value='ccc'>ccc</option>
</select></td>
</tr>
</table>
</body>
</HTML>
I am using the code above to select a name in the left box and move it to the right box. The code works in IE with or without the DOCTYPE, but when I add the DOCTYPE it stops working in Firefox. I have spent a lot of time on it and still can't figure out the problem. I am also a novice in JavaScript, so please explain what is wrong with the code above when the DOCTYPE is present. Thanks in advance for your help!

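You're relying on elements with ids showing up as global properties on the window (e.g. userlist). Firefox only does that in quirks mode, which is why the doctype matters. Look the elements up explicitly instead, e.g. pass document.getElementById('userlist') and document.getElementById('pouser') to addOption_list.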

Your markup does not match the DOCTYPE. That is, you are not using valid XHTML 1.0 markup.
Paste your code into the XHTML validator and it will show you what's wrong.

XPath: select nodes with explicit 'xmlns' attribute

Could anyone please provide an XPath expression that selects all nodes that have an explicit 'xmlns' attribute, e.g. <html xmlns="http://www.w3.org/1999/xhtml">? //*[@xmlns] does not work because (as it turned out) xmlns is not treated as an attribute by XPath.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=edge"/>
<title>Информация по счетам, картам</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<meta http-equiv="cache-control" content="no-cache"/>
<meta http-equiv="pragma" content="no-cache"/>
.......
I need only 'html' node here.

The technically correct answer is that it's...
Not possible. You need to distinguish between the abstract document that the source text represents and the actual source text itself. XPath operates on the abstraction, not on the source text, and the location of the xmlns pseudo-attribute is only relevant in the latter.
However...
You could sort of fake it with the following XPath 2.0 expression:
//*[not(namespace-uri()=ancestor::*/namespace-uri())]
This selects any element that does not have an ancestor in the same namespace, which theoretically means that it selects all elements where the namespace is declared. However, it won't catch namespaces that are re-declared. For example, consider this document:
<html xmlns="http://www.w3.org/1999/xhtml">
<head/>
<body>
<p xmlns="http://something">
<p xmlns="http://something"/>
</p>
</body>
</html>
The expression above selects the html element and the first p. The second p has an ancestor in the same namespace, so it's not selected, even though it specifies an xmlns.
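
The declarations are visible one level below XPath, while the source text is still being parsed. As a hedged illustration of that distinction (not an XPath solution), the sketch below uses Python's standard xml.etree.ElementTree.iterparse, which reports a start-ns event for each namespace declaration just before the start event of the element that carries it. The function name elements_declaring_namespaces and the constant SAMPLE_XML are invented for this example.

```python
import io
import xml.etree.ElementTree as ET

# The example document from the answer above.
SAMPLE_XML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head/>
  <body>
    <p xmlns="http://something">
      <p xmlns="http://something"/>
    </p>
  </body>
</html>"""


def elements_declaring_namespaces(xml_text):
    """Return the qualified tags of elements that carry a namespace declaration."""
    declaring = []
    saw_declaration = False
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, payload in ET.iterparse(source, events=("start-ns", "start")):
        if event == "start-ns":
            # payload is a (prefix, uri) pair; it belongs to the next element.
            saw_declaration = True
        else:
            # payload is the element whose start tag was just read.
            if saw_declaration:
                declaring.append(payload.tag)
            saw_declaration = False
    return declaring


for tag in elements_declaring_namespaces(SAMPLE_XML):
    print(tag)
```

For the sample document this reports the html element and both p elements, including the inner p that the XPath expression misses, because the parser still sees the source text rather than the abstract tree.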

This should not be possible, because
<a xmlns="http://www.org/1"> <b/> </a>
is equivalent to
<a xmlns="http://www.org/1"> <b xmlns="http://www.org/1"/> </a>

Nokogiri -- preserve doctype and meta tags

I'm using nokogiri to open an existing html file that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Foo</title>
</head>
<body>
<!-- stuff -->
</body>
</html>
Then I change the contents of the body tag like this:
html_file = Nokogiri::HTML("path/to/html/file")
html_file.css('body').first.inner_html = "new body content"
Then I write this new document to a file like this:
File.open("path/to/new/html/file", 'w') {|f| f.write html_file}
And this is my resulting html file:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
new body content
</body></html>
My question is whether it's possible to tell Nokogiri to preserve the original HTML file's doctype and meta tags, since they appear to be lost or changed when I open the document with Nokogiri and write it back to a file.
Any help would be much appreciated. Thanks!

Finally figured it out:
I just changed the line:
html_file = Nokogiri::HTML("path/to/html/file")
to
html_file = Nokogiri::HTML(File.read("path/to/html/file"))
and now it works like I'm expecting it to. It seemed inconsistent at first, but Nokogiri::HTML expects the markup itself (a string or an IO object), not a file name, so the original line parsed the literal path string as if it were the HTML.
Thanks for all of the suggestions @ezkl!
