Disappearing entities in XML fragment with nokogiri - ruby

I'm using Nokogiri to process fragments of XHTML documents, and am running into some behavior I cannot explain or work around. I'm not sure if it's a bug or something I don't understand.
Consider the following two lines, showcasing a reduced version of the problem I'm running into:
puts Nokogiri::XML::DocumentFragment.parse("&nbsp;<pre>&lt;div&gt;foo&lt;/div&gt;</pre>")
puts Nokogiri::XML::DocumentFragment.parse("<pre>&lt;div&gt;foo&lt;/div&gt;</pre>")
This is the output:
<pre>div&gt;foo/div&gt;</pre>
<pre>&lt;div&gt;foo&lt;/div&gt;</pre>
The second line is what I expect, but the first one puzzles me. Where did the &nbsp; go? Why does its presence cause the &lt; to disappear?

Based on matt's suggestion, I'm parsing the fragment by wrapping it in a full XHTML file, as that allows Nokogiri to know about the XHTML entities.
fragment = "&nbsp;<pre>&lt;div&gt;foo&lt;/div&gt;</pre>"
head = <<HERE
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
<meta charset="UTF-8" />
</head>
<body>
HERE
foot = <<HERE
</body>
</html>
HERE
puts Nokogiri::XML.parse( head + fragment + foot).css("body").children.to_xml
Feels a bit heavy-handed, but it works.
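An alternative that avoids the wrapper document, offered as a sketch rather than as part of the original answer: XML itself predefines only five entities (&lt;, &gt;, &amp;, &apos; and &quot;), so named XHTML entities such as &nbsp; can be rewritten as numeric character references before the fragment is parsed. The helper name numeric_entities and the small lookup table below are hypothetical and cover only a few entities; a real table would need every entity the input can contain.

```ruby
# Hypothetical lookup table: a few XHTML named entities and their code points.
# XML's five predefined entities (lt, gt, amp, apos, quot) are deliberately absent.
NAMED_TO_NUMERIC = {
  'nbsp'  => 160,
  'copy'  => 169,
  'mdash' => 8212
}.freeze

# Replace known named entities with numeric character references.
# Entities not listed in the table are left untouched.
def numeric_entities(fragment)
  fragment.gsub(/&([A-Za-z][A-Za-z0-9]*);/) do |match|
    code = NAMED_TO_NUMERIC[Regexp.last_match(1)]
    code ? format('&#%d;', code) : match
  end
end

safe = numeric_entities('&nbsp;<pre>&lt;div&gt;foo&lt;/div&gt;</pre>')
puts safe
```

The converted string can then be passed to Nokogiri::XML::DocumentFragment.parse as before.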

Related

How to format HTML returned by Verify.PlayWright for better comparison

I am using Verify.PlayWright to take HTML element snapshots. When the diff tool opens, all the HTML is on one line, which makes it hard to see the differences. Is there a way to format the HTML to get a nicer comparison?
var root = await page.QuerySelectorAsync("#sectionContainer .tree-root");
await Verifier.Verify(root);
You can use Verify.AngleSharp. It has a feature that pretty-prints HTML (https://github.com/VerifyTests/Verify.AngleSharp#pretty-print) for comparison purposes.
Install https://nuget.org/packages/Verify.AngleSharp/
Call VerifyAngleSharpDiffing.Initialize() once at assembly load time.
Use PrettyPrintHtml in your test:
[Test]
public Task PrettyPrintHtml()
{
    var html = @"<!DOCTYPE html>
<html><body><h1>My First Heading</h1>
<p>My first paragraph.</p></body></html>";
    return Verifier.Verify(html)
        .UseExtension("html")
        .PrettyPrintHtml();
}
which will produce a verified file containing
<!DOCTYPE html>
<html>
<head></head>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
</body>
</html>

Recover data from an XML file with Ruby for manipulation

I have a problem. I need to retrieve data from an XML file served by my printer, because I want to know the number of copies made each day, but the device only reports the total number of copies since it was first put into service.
My script should fetch the running total from the XML file every day and subtract the previous day's total to get the daily count.
I have already tried a small script like the following:
require 'net/http'
require 'uri'

uri = URI.parse("http://192.168.1.80/wcd/system_counter.xml")
params = { 'q' => 'cheese' }
# Encode the parameters into the query string of the GET request
uri.query = URI.encode_www_form(params)
http = Net::HTTP.new(uri.host, uri.port)
# request_uri is the path plus the query string
request = Net::HTTP::Get.new(uri.request_uri)
response = http.request(request)
puts response.body
When I try this with other HTML pages I get the correct response and can see the page's source, but with my printer's page I get this HTML as the response:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN">
<HTML lang="en">
<HEAD>
<TITLE></TITLE>
<meta http-equiv="Expires" content="0">
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<meta content="text/javascript" http-equiv="Content-Script-Type">
<noscript>
<meta http-equiv="refresh" content="0; URL=/wcd/js_error.xml">
</noscript>
</HEAD>
<BODY BGCOLOR="#ffffff" LINK="#000000" ALINK="#ff0000" VLINK="#000000" onload="location.replace('/wcd/index.html?access=SYS_COU');" >
</BODY>
</HTML>
When I open the URL in a browser, I can see the copy counts correctly.
In your experience, what is the correct way to get past this restriction with Ruby code?
Thanks for any help.
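The day-by-day subtraction described in the question can be sketched independently of how the totals are fetched. This is only an illustration: the readings hash, its dates and its numbers are made up, and daily_copies is a hypothetical helper name. It assumes one cumulative total has been stored per day.

```ruby
require 'date'

# Hypothetical stored readings: cumulative copy totals reported by the
# printer, keyed by the date on which each reading was taken.
readings = {
  Date.new(2024, 1, 1) => 10_500,
  Date.new(2024, 1, 2) => 10_620,
  Date.new(2024, 1, 3) => 10_700
}

# For each pair of consecutive readings, the copies made on the later day
# are the later total minus the earlier total.
def daily_copies(readings)
  readings.sort.each_cons(2).map do |(_prev_day, prev_total), (day, total)|
    [day, total - prev_total]
  end.to_h
end

daily_copies(readings).each do |day, copies|
  puts "#{day}: #{copies} copies"
end
```

The first reading has no predecessor, so it only serves as the baseline and does not appear in the result.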

Replace file content in Ruby

I would like to replace some text that contains newlines and spaces in all files, using Ruby.
toReplace = [
'<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">
<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"pl\" xml:lang=\"pl\">
<head>'
]
replacement = [
'<!DOCTYPE html>
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">'
]
I use gsub for this, but it doesn't work; there seems to be a problem with the newlines and spaces.
contents.gsub! toReplace[i], replacement[i]
How can I do that?
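You could try escaping the first string to avoid any characters being treated as special, then building a regular expression from it (Regexp.escape alone returns a String, which gsub! would match literally, backslashes included):
REPLACE = Regexp.new(Regexp.escape(%Q[<!DOCTYPE...
]))
WITH = %Q[
...
]
contents.gsub!(REPLACE, WITH)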
Note that you should be using either a string or a regular expression, not an array as you have in your code.
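Two details may be worth checking here, shown below as a sketch and not as part of the original answer. First, in a single-quoted Ruby literal the sequence \" stays as two characters (a backslash and a quote), so a pattern written that way would not match plain double quotes in the file. Second, gsub with a plain String pattern matches literally, newlines included, so no escaping is needed in that case. The sample contents string is made up, and real files may differ in line endings (for example CRLF) or indentation.

```ruby
# In single quotes, \" is NOT an escape: the backslash is kept.
single = '\"'
puts single.length # 2 characters: backslash and double quote

# Heredocs avoid quote escaping altogether; .chomp drops the final newline.
old_head = <<~OLD.chomp
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" lang="pl" xml:lang="pl">
  <head>
OLD

new_head = <<~NEW.chomp
  <!DOCTYPE html>
  <html>
  <head>
  <meta name="viewport" content="width=device-width, initial-scale=1">
NEW

# Made-up file content that starts with the old header.
contents = "#{old_head}\n<title>Demo</title>\n</head>\n"

# A String pattern is matched literally, across line breaks.
updated = contents.gsub(old_head, new_head)
puts updated
```

If the replacement still does not happen on real files, comparing contents.inspect with old_head.inspect should reveal any difference in whitespace or line endings.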

HTMLUnit HtmlPage.getBody() returns null even though the response contains a <body> tag

I'm using HTMLUnit in Java to extract information from a website.
I ran into a strange phenomenon where the page is not fully parsed into the DOM tree.
After the following:
HtmlPage lineHours = (HtmlPage) _webClient.getTopLevelWindows().get(1).getEnclosedPage();
Watching the expression lineHours.asXml() results in the following (... marks omitted sensitive data):
<?xml version="1.0" encoding="UTF-8"?>
<html>
<head>
<script ...>
</script>
</head>
</html>
While printing lineHours.getWebResponse().getContentAsString() results in the following:
<html>
<head>
<script ...>
</script>
</head>
</html>
<body>
<div> ...
In short, the body tag is not parsed into the DOM tree, and therefore all XPath queries and helper methods such as HtmlPage.getBody() fail. In a regular browser the page renders fine.
Any ideas?
Thanks
Tomer
This was eventually solved by parsing the DOM tree using a Xerces parser and retrieving the result from it.

MVC 3 Razor View Engine - script blocks appear before DOCTYPE

I have a very strange problem. I have migrated my views from the Webforms view engine to Razor. I am now finding that when the HTML for my page is rendered, it doesn't render the DOCTYPE at the top (as it should), but rather renders some JavaScript script blocks before the DOCTYPE tag. I have no clue what is causing this. The result is that the browser displays the page in Quirks mode. This manifests as the font-size in my tables not conforming to the font-size set for the body tag.
I must also mention that I am using Telerik MVC extensions version 2011Q1.
Below is a portion of the page source from the beginning of the html page to the end of the head tag. Any help on why this is happening will be appreciated.
<script type="text/javascript" src="/asset.axd?id=PQEAAB-LCAAAAAAABADsvQdgHEmWJSYvbcp7f0r1StfgdKEIgGATJNiQQBDswYjN5pLsHWlHIymrKoHKZVZlXWYWQMztnbz33nvvvffee--997o7nU4n99__P1xmZAFs9s5K2smeIYCqyB8_fnwfPyJ-8Uezjx597xd_tPro0Uevp3WxapuPRh-d82dL-uynf9E6r6-3d8f3x7vjn8Z31UePdn7JSL5t8zKvi7fjabVYVEv7_W73-zZ_106qd7bBXrfBRV3M7Lf3zLfS-fgyK4tZ1ua2wX709XxWtMXywra638MimzQtDdG2-PSXfP-XfH_00bTlRu_auz-dXWYNU4EaXNKnezu7uzTwe7v36YMpkerep_fpl48etfU6_yX_TwAAAP__IQbpFT0BAAA%3d"></script>
<script type="text/javascript">
//<![CDATA[
jQuery(document).ready(function(){
if (!jQuery.telerik) jQuery.telerik = {};
jQuery.telerik.cultureInfo={"shortDate":"dd/MM/yyyy","longDate":"dd MMMM yyyy","longTime":"HH:mm:ss","shortTime":"HH:mm","fullDateTime":"dd MMMM yyyy HH:mm:ss","sortableDateTime":"yyyy\u0027-\u0027MM\u0027-\u0027dd\u0027T\u0027HH\u0027:\u0027mm\u0027:\u0027ss","universalSortableDateTime":"yyyy\u0027-\u0027MM\u0027-\u0027dd HH\u0027:\u0027mm\u0027:\u0027ss\u0027Z\u0027","generalDateShortTime":"dd/MM/yyyy HH:mm","generalDateTime":"dd/MM/yyyy HH:mm:ss","monthDay":"dd MMMM","monthYear":"MMMM yyyy","days":["Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"],"abbrDays":["Sun","Mon","Tue","Wed","Thu","Fri","Sat"],"abbrMonths":["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec",""],"months":["January","February","March","April","May","June","July","August","September","October","November","December",""],"am":"AM","pm":"PM","dateSeparator":"/","timeSeparator":":","firstDayOfWeek":1,"currencydecimaldigits":2,"currencydecimalseparator":".","currencygroupseparator":",","currencygroupsize":3,"currencynegative":1,"currencypositive":0,"currencysymbol":"£","numericdecimaldigits":2,"numericdecimalseparator":".","numericgroupseparator":",","numericgroupsize":3,"numericnegative":1,"percentdecimaldigits":2,"percentdecimalseparator":".","percentgroupseparator":",","percentgroupsize":3,"percentnegative":0,"percentpositive":0,"percentsymbol":"%"};
jQuery('#CoursesGrid').tGrid({columns:[{"title":"Id","member":"Id","type":"Number","editor":null},{"title":"Course Title:","member":"Title","type":"String","editor":null},{"title":"Completion Category","member":"CompletionCategory","type":"String","editor":null},{"title":"Expiry Months (0 to 100):","member":"ExpiryMonths","type":"Number","editor":null},{"title":"Commands","commands":[{"name":"edit","buttonType":"Image"},{"name":"delete","buttonType":"Image"}]}], plugins:["editing"], editing:{"mode":"InForm","editor":"\r\n\r\n\u003cdiv\u003e\r\n \u003cfieldset class=\"editfieldset\"\u003e\r\n \u003clegend class=\"titlelegend\"\u003eCourse Details\u003c/legend\u003e\r\n \u003col\u003e\r\n \u003cli\u003e\r\n \u003clabel for=\"Title\"\u003eCourse Title:\u003c/label\u003e \r\n \u003cinput id=\"Title\" name=\"Title\" type=\"text\" value=\"\" /\u003e \r\n \u003cspan class=\"field-validation-valid\" id=\"Title_validationMessage\"\u003e\u003c/span\u003e \r\n \u003c/li\u003e\r\n \u003cli\u003e\r\n \u003clabel for=\"Description\"\u003eDescription:\u003c/label\u003e \r\n \u003ctextarea cols=\"20\" id=\"Description\" name=\"Description\" rows=\"2\"\u003e\r\n\u003c/textarea\u003e \r\n \u003cspan class=\"field-validation-valid\" id=\"Description_validationMessage\"\u003e\u003c/span\u003e \r\n \u003c/li\u003e\r\n \u003cli\u003e\r\n \u003clabel for=\"ExpiryMonths\"\u003eExpiry Months (0 to 100):\u003c/label\u003e \r\n \u003cdiv class=\"t-widget t-numerictextbox\"\u003e\u003cinput class=\"t-input\" id=\"ExpiryMonths\" name=\"ExpiryMonths\" style=\"width:100%\" value=\"0\" /\u003e\u003ca class=\"t-link t-icon t-arrow-up\" href=\"#\" tabindex=\"-1\" title=\"Increase value\"\u003eIncrement\u003c/a\u003e\u003ca class=\"t-link t-icon t-arrow-down\" href=\"#\" tabindex=\"-1\" title=\"Decrease value\"\u003eDecrement\u003c/a\u003e\u003c/div\u003e\u003cscript type=\"text/javascript\"\u003e\r\n\tjQuery(\u0027#ExpiryMonths\u0027).tTextBox({val:0, step:1, minValue:-2147483648, 
maxValue:2147483647, digits:0, groupSize:3, negative:1, text:\u0027Enter value\u0027, type:\u0027numeric\u0027});\r\n\u003c/script\u003e\r\n \r\n \u003cspan class=\"field-validation-valid\" id=\"ExpiryMonths_validationMessage\"\u003e\u003c/span\u003e \r\n \u003c/li\u003e\r\n \u003c/ol\u003e\r\n \u003c/fieldset\u003e\r\n\u003c/div\u003e\r\n","defaultDataItem":{"Id":0,"Title":null,"Description":null,"CompletionCategory":null,"ReminderId":0,"ExpiryMonths":0,"Deleted":false,"ScheduledCourses":[]}}, dataKeys:{"Id":"id"}, validationMetadata:{"Fields":[{"FieldName":"Title","ReplaceValidationMessageContents":true,"ValidationMessageId":"Title_validationMessage","ValidationRules":[{"ErrorMessage":"Course Title is required.","ValidationParameters":{},"ValidationType":"required"}]},{"FieldName":"Description","ReplaceValidationMessageContents":true,"ValidationMessageId":"Description_validationMessage","ValidationRules":[]},{"FieldName":"ExpiryMonths","ReplaceValidationMessageContents":true,"ValidationMessageId":"ExpiryMonths_validationMessage","ValidationRules":[{"ErrorMessage":"The Expiry Months (0 to 100): field is required.","ValidationParameters":{},"ValidationType":"required"},{"ErrorMessage":"The field Expiry Months (0 to 100): must be a number.","ValidationParameters":{},"ValidationType":"number"}]}],"FormId":"CoursesGridform"}, pageSize:0, sortMode:'single', ajax:{"selectUrl":"/Course/_IndexAjax","insertUrl":"/Course/_InsertAjax","updateUrl":"/Course/_UpdateAjax","deleteUrl":"/Course/_DeleteAjax"}, localization:{"addNew":"Add new record","delete":"Delete","cancel":"Cancel","update":"Update","insert":"Insert","edit":"Edit","select":"Select","page":"Page ","displayingItems":"Displaying items {0} - {1} of {2}","pageOf":"of {0}","filter":"Filter","filterAnd":"And","filterClear":"Clear Filter","filterDateEq":"Is equal to","filterDateGe":"Is after or equal to","filterDateGt":"Is after","filterDateLe":"Is before or equal to","filterDateLt":"Is before","filterDateNe":"Is not 
equal to","filterNumberEq":"Is equal to","filterNumberGe":"Is greater than or equal to","filterNumberGt":"Is greater than","filterNumberLe":"Is less than or equal to","filterNumberLt":"Is less than","filterNumberNe":"Is not equal to","filterShowRows":"Show rows with value that","filterStringEndsWith":"Ends with","filterStringEq":"Is equal to","filterStringNe":"Is not equal to","filterStringStartsWith":"Starts with","filterStringSubstringOf":"Contains","groupHint":"Drag a column header and drop it here to group by that column","filterEnumEq":"Is equal to","filterEnumNe":"Is not equal to","deleteConfirmation":"Are you sure you want to delete this record?","filterSelectValue":"-Select value-","filterBoolIsFalse":"is false","filterBoolIsTrue":"is true","noRecords":"No records to display.","cancelChanges":"Cancel changes","saveChanges":"Save changes","refresh":"Refresh","sortedAsc":"sorted ascending","sortedDesc":"sorted descending","unGroup":"ungroup"}, noRecordsTemplate:'No records to display.'});
jQuery('#TabStrip').tTabStrip();});
//]]>
</script>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="Head1">
<meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
<title>Vigilaris Solutions</title>
<link rel="stylesheet" href="/Content/Themes/Shared/vway-backend.css" type="text/css" />
<link rel="stylesheet" href="/Content/Themes/Green/branding.css" type="text/css" />
<link rel="stylesheet" href="/Content/Themes/Shared/StatusBar.css.css" type="text/css" />
<link type="text/css" href="/asset.axd?id=lAAAAB-LCAAAAAAABADsvQdgHEmWJSYvbcp7f0r1StfgdKEIgGATJNiQQBDswYjN5pLsHWlHIymrKoHKZVZlXWYWQMztnbz33nvvvffee--997o7nU4n99__P1xmZAFs9s5K2smeIYCqyB8_fnwfPyJ-8Uezjx597xd_tPro0Ucn1bLNl-1Ho4_O-bMlfdbmZV4Xb8fTarGoluNp09DX1UePdn7JKGxwVSxn1VXzwDXZ_SXf_yXfH300bbnVu_aufHNJf-7t7O6Od8f3du_TB1PC4N6n9-mXjx619Tr_Jf9PAAAA__9JtaUdlAAAAA%3d%3d" rel="stylesheet"/>
</head>
Okay. I have resolved the problem. It is, in fact, a Telerik related issue. The issue is as follows:
In my layout view, my migrated razor view engine script registrar code looked as follows:
@{Html.Telerik().ScriptRegistrar()
    .Globalization(true)
    .DefaultGroup(g => g.Combined(true).Compress(true))
    .Render();
}
This is a code block approach and results in the javascript script blocks being rendered right at the beginning of the page source before the DOCTYPE declaration, thus causing the browser to go into quirks mode (uggglllyyy!).
So, I changed the Telerik code in my layout view to the following:
@(Html.Telerik().ScriptRegistrar()
    .Globalization(true)
    .DefaultGroup(g => g.Combined(true).Compress(true))
)
The script blocks now correctly get rendered at the end of the page source and the browser no longer operates in quirks mode.
I really hope this can help other developers.
This is caused by something writing those scripts directly to the output stream (Response.Write). Razor uses the Writer property of ViewContext for output before flushing everything to Response.
I take it you are using the Telerik MVC components, more specifically the tabs?
Check your view templates, specifically _Layout.cshtml, and ensure you aren't calling the Telerik library before your markup.
