Parsing XML with Go, ignoring nested elements?

I am trying to parse an HTML document with the Go xml parser. I have managed to extract all the <li> elements, but if an element contains a link <a>, the content of the link is ignored. I would like to just ignore the nested <a> and display its content as plain text, but I don't know how.
Here is my code:
d := xml.NewDecoder(resp.Body)
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity
type list_item struct {
    Data string `xml:",chardata"`
}
for {
    t, _ := d.Token()
    if t == nil {
        break
    }
    switch se := t.(type) {
    case xml.StartElement:
        if se.Name.Local == "li" {
            var q list_item
            d.DecodeElement(&q, &se)
            c.Infof("%+v\n", q)
        }
    }
}
Is there any way to just ignore nested elements and display their content?

Consider using a specialized package for parsing HTML. In general, HTML is not XML (XHTML 1.0 is, but documents formatted using it are not very common, and that standard has been deprecated).
An even better approach, in my opinion, given your apparent use case, would be to use XPath to extract the necessary information with a query.
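For illustration, here is a minimal, untested sketch of the XPath route in Go. The answer does not name a library; github.com/antchfx/htmlquery is just one possible choice, and the URL is a placeholder:

package main

import (
    "fmt"
    "log"
    "net/http"

    "github.com/antchfx/htmlquery"
)

func main() {
    resp, err := http.Get("http://example.com/list.html") // placeholder URL
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // Tolerant HTML parse into a node tree that XPath can be evaluated against.
    doc, err := htmlquery.Parse(resp.Body)
    if err != nil {
        log.Fatal(err)
    }

    // One query pulls out every <li>; InnerText flattens nested elements
    // such as <a> into plain text.
    for _, li := range htmlquery.Find(doc, "//li") {
        fmt.Println(htmlquery.InnerText(li))
    }
}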
As to the question as stated, I think there's no built-in way to do what you want: xml.Decoder implements the Skip() method, but that only lets you skip over unneeded content; there's nothing that returns "inner XML" as is. You could roll this yourself using xml.Decoder's RawToken(): immediately render whatever it returns until it returns something denoting the end element you're looking for (you'll have to implement support for handling nested elements).
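To make that concrete, here is a minimal, untested sketch of the idea. It uses Token() rather than RawToken() so the AutoClose and Entity settings from the question still apply, and the helper name innerText, as well as the sample markup, are mine:

package main

import (
    "encoding/xml"
    "fmt"
    "os"
    "strings"
)

// innerText assumes the decoder has just consumed a StartElement (e.g. <li>).
// It keeps reading tokens until the matching EndElement, collecting only
// character data, so nested elements such as <a> contribute their text but
// not their markup.
func innerText(d *xml.Decoder) (string, error) {
    var sb strings.Builder
    depth := 1
    for depth > 0 {
        t, err := d.Token()
        if err != nil {
            return "", err
        }
        switch tok := t.(type) {
        case xml.StartElement:
            depth++ // descend into nested elements like <a>
        case xml.EndElement:
            depth--
        case xml.CharData:
            sb.WriteString(string(tok))
        }
    }
    return sb.String(), nil
}

func main() {
    body := `<ul><li>plain item</li><li>item with <a href="/x">a link</a> inside</li></ul>`
    d := xml.NewDecoder(strings.NewReader(body))
    d.Strict = false
    d.AutoClose = xml.HTMLAutoClose
    d.Entity = xml.HTMLEntity
    for {
        t, err := d.Token()
        if err != nil {
            break // io.EOF when the document is exhausted
        }
        if se, ok := t.(xml.StartElement); ok && se.Name.Local == "li" {
            text, err := innerText(d)
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                return
            }
            fmt.Println(text) // "plain item", then "item with a link inside"
        }
    }
}

In the loop from the question you would call innerText(d) right after matching the <li> start element, in place of d.DecodeElement(&q, &se).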

I found a library that uses the jQuery style of getting HTML information: http://godoc.org/github.com/PuerkitoBio/goquery
I used that and it solved the problem.
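For reference, a minimal, untested sketch of what that looks like with goquery, reusing resp.Body from the question (with fmt, log, and github.com/PuerkitoBio/goquery imported):

doc, err := goquery.NewDocumentFromReader(resp.Body)
if err != nil {
    log.Fatal(err)
}
// Text() returns the combined text of a selection and all of its descendants,
// so the content of nested <a> elements comes back as plain text.
doc.Find("li").Each(func(i int, s *goquery.Selection) {
    fmt.Println(s.Text())
})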

Related

How to check that either the headline or the content contains the "FIFA" keyword?

How do I write this test in Cypress?
I have to confirm that either the headline or the paragraph contains the keyword.
It's better to have jQuery elements in hand before the assertion so that you can use jQuery methods on them. cy.get() acts just like $(...) in jQuery, so it is enough to get the elements.
Once you have the elements, i.e. $el1 and $el2 below, you can get their text via the .text() method and then write your assertion.
Instead of separate assertions, I used a single one below and checked whether either of them includes the desired text by using the || operator.
cy.get('first-el').then($el1 => {
  cy.get('second-el').then($el2 => {
    const inEl1 = $el1.text().includes('FIFA');
    const inEl2 = $el2.text().includes('FIFA');
    expect(inEl1 || inEl2).to.be.true;
  });
});

How to invoke "dynamic" XPaths with count() in Karate? [duplicate]

I'm trying to invoke a "dynamic" XPath in Karate that uses the XPath count() function to return a number (or string representation).
[With Karate 0.9.2] I'm trying to invoke "dynamic" XPath expressions (originally read from a JSON-based configuration file) on an XML document.
There are (potentially) multiple XPath expressions so I am using Karate's karate.forEach() to invoke an XPath utility Javascript function repeatedly within Karate.
Within the embedded Javascript function I use karate.xmlPath() to invoke the "dynamic" XPath expression string.
This works fine for retrieving single nodes, node lists, etc., but it fails when the expression uses XPath's count() function, as the result is a number rather than an XML node or NodeList.
Feature: General XPath based evaluator

Scenario: ....
    # Omitting details around performing HTTP request to obtain XML response....
    * xml payload = ..... $.requests[0].body ...
    #
    # A JS function to invoke each XPath query in our query dictionary
    #
    # queryDictionaryItem has a single XPath query in it with an expected value:
    # { "xpath": <query>, "expectedValue": <string> }
    #
    * def checkXPathQueryFn =
    """
    function(queryDictionaryItem) {
      var requestXML = karate.get("payload");
      var xpathQuery = queryDictionaryItem.xpath;
      var expectedValue = queryDictionaryItem.expectedValue;
      // [!!] This will blow up if the xpathQuery is of the form:
      // "count(........)"
      // --> Cannot return a NUMERIC value rather than a NODELIST
      var actualValue = karate.xmlPath( requestXML, xpathQuery );
      var match = karate.match( actualValue, expectedValue );
      if (!match.pass) {
        karate.abort("Failed to match expectation...");
      }
    }
    """
    # queryDictionary is a list of JSON objects of the form:
    # { "xpath": <query>, "expectedValue": <string> }
    * eval karate.forEach(queryDictionary, checkXPathQueryFn)
Expected result:
Receive a String/Number when an XPath based on count() is dynamically invoked.
Actual outcome:
Error:
javax.xml.xpath.XPathExpressionException: com.sun.org.apache.xpath.internal.XPathException: Can not convert #NUMBER to a NodeList!
javascript evaluation failed: karate.forEach(requestExpectations, oldCheckExpectation), javax.xml.xpath.XPathExpressionException: com.sun.org.apache.xpath.internal.XPathException: Can not convert #NUMBER to a NodeList!
For the Intuit Karate developers (@ptrthomas):
In the v0.9.2 version of karate-core, there are provisions for use of count() in XPaths within Script#evalXmlPathOnXmlNode():
https://github.com/intuit/karate/blob/master/karate-core/src/main/java/com/intuit/karate/Script.java#L367
but as we're using dynamic XPath, the call sequence does not use that "safeguard" and instead uses ScriptBridge#xmlPath()
https://github.com/intuit/karate/blob/master/karate-core/src/main/java/com/intuit/karate/core/ScriptBridge.java#L230
This method has the line:
Node result = XmlUtils.getNodeByPath((Node) o, path, false);
which throws RuntimeExceptions when XPath expressions do not return NODESET shaped data.
https://github.com/intuit/karate/blob/master/karate-core/src/main/java/com/intuit/karate/XmlUtils.java#L152
Confirming this Karate Framework issue is fixed with the latest (as of 2019-04-23) Karate Core (development branch) build.
This fix is scheduled for release in Intuit Karate v0.9.3.
Java/Karate source further detailing the issue, and an alternative (interim) Java-native workaround via direct Java XPath interop, is listed at:
https://github.com/mhavilah/karateDynamicXPath
NB: The behaviour of the XPathHelper in the above project is slightly different to that of the karate.xmlPath() DSL service.
In particular, for retrieving single XML elements, the Karate DSL auto-extracts the underlying text() node, whereas the Java-native helper requires an explicit reference to the embedded text node within the XML element.

xerces-c 3.1 XPath evaluation

I could not find many examples of evaluating XPath using Xerces-C 3.1.
Given the following sample XML input:
<abc>
<def>AAA BBB CCC</def>
</abc>
I need to retrieve the "AAA BBB CCC" string by the XPath "/abc/def/text()[0]".
The following code works:
XMLPlatformUtils::Initialize();
// create the DOM parser
XercesDOMParser *parser = new XercesDOMParser;
parser->setValidationScheme(XercesDOMParser::Val_Never);
parser->parse("test.xml");
// get the DOM representation
DOMDocument *doc = parser->getDocument();
// get the root element
DOMElement* root = doc->getDocumentElement();
// evaluate the xpath
DOMXPathResult* result=doc->evaluate(
XMLString::transcode("/abc/def"), // "/abc/def/text()[0]"
root,
NULL,
DOMXPathResult::ORDERED_NODE_SNAPSHOT_TYPE, //DOMXPathResult::ANY_UNORDERED_NODE_TYPE, //DOMXPathResult::STRING_TYPE,
NULL);
// look into the xpart evaluate result
result->snapshotItem(0);
std::cout<<StrX(result->getNodeValue()->getFirstChild()->getNodeValue())<<std::endl;;
XMLPlatformUtils::Terminate();
return 0;
But I really hate that:
result->getNodeValue()->getFirstChild()->getNodeValue()
Does it have to be a node set instead of the exact node I want?
I tried other forms of XPath such as "/abc/def/text()[0]", and other result types such as DOMXPathResult::STRING_TYPE; Xerces always threw an exception.
What did I do wrong?
I don't code with Xerces C++, but it seems to implement the W3C DOM Level 3 API, so based on that I would suggest selecting an element node with a path like /abc/def and then simply accessing result->getNodeValue()->getTextContent() to get the contents of the element (e.g. AAA BBB CCC).
As far as I understand the DOM APIs, if you want a string value then you need to use a path like string(/abc/def), and then result->getStringValue() should work (if the evaluate method requests ANY_TYPE or STRING_TYPE as the result type).
As another approach, if you know you are only interested in the first node in document order, you could evaluate /abc/def with FIRST_ORDERED_NODE_TYPE and then access result->getNodeValue()->getTextContent().

Google AJAX Transliteration API: Is it possible to make all input fields in the page transliteratable?

I've used the "Google AJAX Transliteration API" and it's working well for me.
http://code.google.com/apis/ajaxlanguage/documentation/referenceTransliteration.html
Currently I have a project in which I need all input fields on every page (input & textarea tags) to be transliteratable, while these input fields differ from page to page (dynamic).
As far as I know, I have to call the makeTransliteratable(elementIds, opt_options) method in the API to define which input fields to make transliteratable, and in my case I can't predefine those fields manually. Is there a way to achieve this?
Thanks in advance
Rephrasing what you are asking for: you would like to collect together all the inputs on the page which match a certain criterion, and then pass them into an API.
A quick look at the API reference says that makeTransliteratable will accept an array of id strings or an array of elements. Since we don't know the ids of the elements beforehand, we shall pass an array of elements.
So, how to get the array of elements?
I'll show you two ways: a hard way and an easy way.
First, we can get all of the text areas using the document.getElementsByTagName API:
var textareas = document.getElementsByTagName("textarea");
Getting the list of inputs is slightly harder, since we don't want to include checkboxes, radio buttons, etc. We can distinguish them by their type attribute, so let's write a quick function to make that distinction:
function selectElementsWithTypeAttribute(elements, type)
{
    var results = [];
    for (var i = 0; i < elements.length; i++)
    {
        if (elements[i].getAttribute("type") == type)
        {
            results.push(elements[i]);
        }
    }
    return results;
}
Now we can use this function to get the inputs, like this:
var inputs = document.getElementsByTagName("input");
var textInputs = selectElementsWithTypeAttribute(inputs, "text");
Now that we have references to all of the text boxes, we can concatenate them into one array (copying the live HTMLCollection into a real array first) and pass that to the API:
var allTextBoxes = Array.prototype.slice.call(textareas).concat(textInputs);
makeTransliteratable(allTextBoxes, /* options here */);
So, this should all work, but we can make it easier with judicious use of library methods. If you were to download jQuery (google it), then you could write this more compact code instead:
var allTextBoxes = $("input[type='text'], textarea").toArray();
makeTransliteratable(allTextBoxes, /* options here */);
This uses a CSS selector to find all of the inputs with a type attribute of "text", and all textareas. There is a handy toArray method which puts all of the inputs into an array, ready to pass to makeTransliteratable.
I hope this helped,
Douglas

Image tag not closing with HTMLAgilityPack

Using the HtmlAgilityPack to write out a new image node, it seems to remove the closing slash of the image tag, e.g. it should be <img ... /> but when you check the outer HTML, it has <img ... >.
string strIMG = "<img src='" + imgPath + "' height='" + pubImg.Height + "px' width='" + pubImg.Width + "px' />";
HtmlNode newNode = HtmlNode.Create(strIMG);
This breaks XHTML.
Telling it to output XML as Micky suggests works, but if you have other reasons not to want XML, try this:
doc.OptionWriteEmptyNodes = true;
Edit 1: Here is how to fix an HTML Agility Pack document to correctly display image (img) tags:
if (HtmlNode.ElementsFlags.ContainsKey("img"))
{
    HtmlNode.ElementsFlags["img"] = HtmlElementFlag.Closed;
}
else
{
    HtmlNode.ElementsFlags.Add("img", HtmlElementFlag.Closed);
}
replace "img" for any other tag to fix them as well (input, select, and option come up frequently). Repeat as needed. Keep in mind that this will produce rather than , because of the HAP bug preventing the "closed" and "empty" flags from being set simultaneously.
Source: Mike Bridge
Original answer:
Having just labored over solutions to this issue, and not finding any sufficient answers (doctype set properly, using Output as XML, Check Syntax, AutoCloseOnEnd, and Write Empty Node options), I was able to solve this with a dirty hack.
This will certainly not solve the issue outright for everyone, but for anyone returning their generated html/xml as a string (EG via a web service), the simple solution is to use fake tags that the agility pack doesn't know to break.
Once you have finished doing everything you need to do on your document, call the following method once for each tag giving you a headache (notable examples being option, input, and img). Immediately after, render your final string, do a simple replace for each tag prefixed with some string (in this case "fix_"), and return your string.
This is only marginally better in my opinion than the regex solution proposed in another question I cannot locate at the moment (something along the lines of )
private void fixHAPUnclosedTags(ref HtmlDocument doc, string tagName, bool hasInnerText = false)
{
    HtmlNode tagReplacement = null;
    foreach (var tag in doc.DocumentNode.SelectNodes("//" + tagName))
    {
        tagReplacement = HtmlTextNode.CreateNode("<fix_" + tagName + "></fix_" + tagName + ">");
        foreach (var attr in tag.Attributes)
        {
            tagReplacement.SetAttributeValue(attr.Name, attr.Value);
        }
        if (hasInnerText) // for option tags and other non-empty nodes, the next (text) node will be its inner HTML
        {
            tagReplacement.InnerHtml = tag.InnerHtml + tag.NextSibling.InnerHtml;
            tag.NextSibling.Remove();
        }
        tag.ParentNode.ReplaceChild(tagReplacement, tag);
    }
}
As a note, if I were a betting man I would guess that MikeBridge's answer above inadvertently identifies the source of this bug in the pack: something is causing the "closed" and "empty" flags to be mutually exclusive.
Additionally, after a bit more digging, I don't appear to be the only one who has taken this approach:
HtmlAgilityPack Drops Option End Tags
Furthermore, in cases where you ONLY need non-empty elements, there is a very simple fix listed in that same question, as well as in the HAP CodePlex discussion: it essentially sets the empty-node flag option listed in Mike Bridge's answer above permanently, everywhere.
There is an option to turn on XML output that makes this issue go away.
var htmlDoc = new HtmlDocument();
htmlDoc.OptionOutputAsXml = true;
htmlDoc.LoadHtml(rawHtml);
This seems to be a bug with HtmlAgilityPack. There are many ways to reproduce this, for example:
Debug.WriteLine(HtmlNode.CreateNode("<img id=\"bla\"></img>").OuterHtml);
Outputs malformed HTML. Using the suggested fixes in the other answers does nothing.
HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
HtmlNode node = doc.CreateElement("x");
node.InnerHtml = "<img id=\"bla\"></img>";
doc.DocumentNode.AppendChild(node);
Debug.WriteLine(doc.DocumentNode.OuterHtml);
Produces malformed XML / XHTML like <x><img id="bla"></x>
I have created an issue on CodePlex for this.
