How to apply two XPath expressions successively with libxml?

To sum up, I am a complete beginner with libxml and I have to work with existing source code. The main idea is to apply a first XPath expression to extract a set of nodes from an XML file. Then, for each node, a second XPath expression is applied to extract some values.
The existing source code is:
    int xt_parseXmlResult(xmlDocPtr doc, const char *xpath, assoc_arrayc_t expr, arrayc_t *result)
    {
        xmlXPathContextPtr xpathCtx = xmlXPathNewContext(doc);
        // Register namespaces ...
        /*
         * Evaluate main xpath expression
         */
        xmlXPathObjectPtr xpathNodes = xmlXPathEvalExpression((xmlChar *)xpath, xpathCtx);
        /*
         * Now we apply the xpath expressions on each node returned by the first xpath request
         */
        // First loop is on the XML document as we have to create a new context each
        // time we change the document
        int nbDocs = xpathNodes->nodesetval->nodeNr;
        for (row = 0; row < nbDocs; row++)
        {
            xmlXPathContextPtr subCtx = xmlXPathNewContext(doc);
            // Register namespaces ...
            // Update context to use the nodeset related to this row
            subCtx->node = xpathNodes->nodesetval->nodeTab[row];
            for (col = 0; col < expr.nbItems; col++)
            {
                // Evaluate expression
                xpathRows = xmlXPathEvalExpression((xmlChar *)expr.itemList[col].val, subCtx);
                result->data[(row + 1) * result->nbCols + col] = strdup((char *)xmlXPathCastToString(xpathRows));
                xmlXPathFreeObject(xpathRows);
            }
            xmlXPathFreeContext(subCtx);
            subCtx = NULL;
        }
        xmlFreeDoc(doc);
        xmlXPathFreeContext(xpathCtx);
        xmlXPathFreeObject(xpathNodes);
        return 0;
    }
I think the problem comes from these lines:
    // Update context to use the nodeset related to this row
    subCtx->node = xpathNodes->nodesetval->nodeTab[row];
because the second XPath expression is applied from the root of the XML file, not from each node returned by the first expression.
Any idea how to do this?

You could concatenate your XPath expressions.
Edit:
//FORECAST/DAY/descendant::content/meteo/desc should work.

As far as I can see, xmlXPathContext::node is meant for internal library use, so we should not set it directly. Recent libxml2 releases also expose xmlXPathSetContextNode() and xmlXPathNodeEval() for evaluating an expression relative to a given node.
xmlXPtrNewContext would probably help too, but I have not been able to use it.
I currently work around this by concatenating both XPaths and querying the whole document.
The new XPath is: "(" + xpath1 + ")" + "[num]" + xpath2,
where num can be any number between 1 and the size of the xpath1 result set.
It seems to work.
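
The concatenation trick can be sketched as a small string-building helper. This is Python used purely for illustration, not libxml code. The function name `nth_match_xpath` is hypothetical, and the sketch assumes the second expression starts with "/" or "//" so that it chains onto the indexed first expression. The sample paths are borrowed from the FORECAST/DAY hint earlier in the thread.

```python
def nth_match_xpath(xpath1: str, num: int, xpath2: str) -> str:
    """Build "(xpath1)[num]xpath2": xpath2 applied to the num-th match of xpath1."""
    if num < 1:
        # XPath positions start at 1, not 0
        raise ValueError("XPath positions are 1-based")
    return f"({xpath1})[{num}]{xpath2}"


# Hypothetical expressions: second DAY node, then its description
combined = nth_match_xpath("//FORECAST/DAY", 2, "/content/meteo/desc")
print(combined)
```

The parentheses matter: without them, `[num]` would bind to the last step of xpath1 only, rather than to the whole result set.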

Here is some sample code; modify it to suit your needs and language. It is C#, but the idea should be largely the same. Notice that the second XPath does not start with a "/" and is evaluated on the node instance returned by the first one. Neither XPath ends in a "/".
    XmlDocument doc = new XmlDocument();
    doc.Load(docfile);
    XmlNodeList items = doc.SelectNodes("/part1/part2");
    foreach (XmlNode item in items)
    {
        // Relative path: evaluated from the current item, not from the document root
        XmlNodeList parts = item.SelectNodes("part3");
        // Do stuff
    }
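
The same two-step idea (select rows with one expression, then evaluate relative expressions against each row) can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration of the technique, not the libxml API. The sample document and the `extract` helper are hypothetical, and ElementTree supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, loosely modelled on the FORECAST/DAY path above
XML = """
<FORECAST>
  <DAY><content><meteo><desc>sunny</desc></meteo></content></DAY>
  <DAY><content><meteo><desc>rain</desc></meteo></content></DAY>
</FORECAST>
"""


def extract(xml_text, row_path, column_paths):
    """Select rows with row_path, then evaluate each column path relative to each row."""
    root = ET.fromstring(xml_text)
    rows = []
    for node in root.findall(row_path):
        # Paths without a leading "/" are resolved from `node`, not from the root
        rows.append([node.findtext(path) for path in column_paths])
    return rows


rows = extract(XML, "./DAY", ["content/meteo/desc"])
print(rows)
```

The key point carries over to any XPath engine: if the per-row expressions begin with "/" or "//", they are evaluated from the document root again, whatever the context node is.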

Related

VTD fails to evaluate a "find all empty nodes with no attributes" xpath

I think I found a bug in version 2.13.4 of vtd-xml. In short, I have the following code snippet:
    String test = "<catalog><description></description></catalog>";
    VTDGen vg = new VTDGen();
    vg.setDoc(test.getBytes("UTF-8"));
    vg.parse(true);
    VTDNav vn = vg.getNav();
    // get nodes with no children, no text and no attributes
    String xpath = "/catalog//*[not(child::node()) and not(child::text()) and count(@*)=0]";
    AutoPilot ap = new AutoPilot(vn);
    ap.selectXPath(xpath);
    // the block inside the while loop is never executed
    while (ap.evalXPath() != -1) {
        System.out.println("current node " + vn.toRawString(vn.getCurrentIndex()));
    }
This doesn't work: it does not find any node, whereas it should find "description". The code above works if I use a self-closed tag:
    String test = "<catalog><description/></catalog>";
The point is that every other XPath evaluator works with both versions of the XML. Sadly, I receive the XML from an external source, so I have no control over it.
Breaking the XPath down, I noticed that evaluating both
    /catalog//*[not(child::node())]
and
    /catalog//*[not(child::text())]
gives false as a result. In addition, I tried something like:
    String xpath = "/catalog/description/text()";
    ap.selectXPath(xpath);
    if (ap.evalXPath() != -1)
        System.out.println(vn.toRawString(vn.getCurrentIndex()));
This prints empty space, so VTD somehow "thinks" the node has text (empty, but still text), whereas I expect no match. Any hints?
TL;DR
When I faced this issue, I was left with three main options (see below). I went for the second one: use XMLModifier to fix the VTDNav. At the bottom of my answer, you'll find an implementation of this option and a sample output.
The long story...
I faced the same issue. Here are the three main options I first thought of (in order of difficulty):
1. Turn empty elements into self-closed tags in the XML source.
   This option isn't always possible (as in the OP's case). Moreover, it may be difficult to pre-process the XML beforehand.
2. Use XMLModifier to fix the VTDNav.
   Find the empty elements with an XPath expression, replace them with self-closed tags and rebuild the VTDNav.
2.bis Use XMLModifier#removeToken.
   A lower-level variant of the preceding solution would consist of looping over the tokens in the VTDNav and removing unnecessary tokens with XMLModifier#removeToken.
3. Patch the vtd-xml code directly.
   Taking this path may require more effort and more time. IMO, the optimized vtd-xml code isn't easy to grasp at first sight.
Option 1 wasn't feasible in my case. I failed to implement Option 2.bis: the "unnecessary" tokens still remained. I didn't look at Option 3 because I didn't want to fix some (rather complex) third-party code.
I was left with Option 2. Here is an implementation:
Code
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import com.ximpleware.AutoPilot;
    import com.ximpleware.NavException;
    import com.ximpleware.VTDException;
    import com.ximpleware.VTDGen;
    import com.ximpleware.VTDNav;
    import com.ximpleware.XMLModifier;
    import org.junit.Test; // assumes JUnit 4; use org.junit.jupiter.api.Test for JUnit 5
    @Test
    public void turnEmptyElementsIntoSelfClosedTags() throws VTDException, IOException {
        // STEP 1: Load XML into VTDNav
        // * Convert the initial XML into a byte array
        String xml = "<root><empty-element></empty-element><self-closed/><empty-element2 foo='bar'></empty-element2></root>";
        byte[] ba = xml.getBytes(StandardCharsets.UTF_8);
        // * Build VTDNav and dump it to screen
        VTDGen vg = new VTDGen();
        vg.setDoc(ba);
        vg.parse(false); // Use `true' to activate namespace support
        VTDNav nav = vg.getNav();
        dump("BEFORE", nav);
        // STEP 2: Prepare to fix the VTDNav
        // * Prepare an autopilot to find empty elements
        AutoPilot ap = new AutoPilot(nav);
        ap.selectXPath("//*[count(child::node())=1][text()='']");
        // * Prepare a simple regex matcher to create self-closed tags
        Matcher elementReducer = Pattern.compile("^<(.+)></.+>$").matcher("");
        // STEP 3: Fix the VTDNav
        // * Instantiate an XMLModifier on the VTDNav
        XMLModifier xm = new XMLModifier(nav);
        ByteArrayOutputStream baos = new ByteArrayOutputStream(); // baos will hold the elements to fix
        String utf8 = StandardCharsets.UTF_8.name();
        // * Find all empty elements and replace them
        while (ap.evalXPath() != -1) {
            nav.dumpFragment(baos);
            String emptyElementXml = baos.toString(utf8);
            String selfClosingTagXml = elementReducer.reset(emptyElementXml).replaceFirst("<$1/>");
            xm.remove();
            xm.insertAfterElement(selfClosingTagXml);
            baos.reset();
        }
        // * Rebuild VTDNav and dump it to screen
        nav = xm.outputAndReparse(); // You MUST call this method to save all your changes
        dump("AFTER", nav);
    }
    private void dump(String msg, VTDNav nav) throws NavException, IOException {
        System.out.print(msg + ":\n ");
        nav.dumpFragment(System.out);
        System.out.print("\n\n");
    }
Output
BEFORE:
<root><empty-element></empty-element><self-closed/><empty-element2 foo='bar'></empty-element2></root>
AFTER:
<root><empty-element/><self-closed/><empty-element2 foo='bar'/></root>
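
The regex step of the answer (turning an empty element fragment into a self-closed tag) can be tried in isolation. The sketch below is Python rather than Java and does not touch vtd-xml at all. It reuses the same pattern, and the helper name `to_self_closing` is hypothetical.

```python
import re

# Same pattern as in the Java snippet: capture everything inside the opening tag
EMPTY_ELEMENT = re.compile(r"^<(.+)></.+>$")


def to_self_closing(fragment: str) -> str:
    """Rewrite '<tag attrs></tag>' as '<tag attrs/>'; leave other fragments untouched."""
    return EMPTY_ELEMENT.sub(r"<\1/>", fragment, count=1)


for fragment in ("<empty-element></empty-element>",
                 "<empty-element2 foo='bar'></empty-element2>",
                 "<self-closed/>"):
    print(fragment, "->", to_self_closing(fragment))
```

The pattern is only safe on fragments that really are a single empty element, which is what the XPath selection is meant to guarantee; applied to a fragment with nested elements, the greedy group would produce malformed output.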

Node Selection - Two Tags Deep from Current Node

I am using HTML Agility Pack. I have an HTMLNode which has the following InnerHtml:
"Item: <b>Link Text</b>"
From this node, I want to select the "Link Text" from within the "a" tag. I have not been able to do this. I have tried this:
    System.Diagnostics.Debug.WriteLine(node.InnerHtml);
    // The above line prints "Item: <b>Link Text</b>"
    HtmlNode boldTag = node.SelectSingleNode("b");
    if (boldTag != null)
    {
        HtmlNode linkTag = boldTag.SelectSingleNode("a");
        // This is always null!
        if (linkTag != null)
        {
            return linkTag.InnerHtml;
        }
    }
Any help to get the selection correct would be appreciated.
SelectSingleNode expects an XPath expression, so you need:
    var b = htmlDoc.DocumentNode.SelectSingleNode("//b");
    var a = b.SelectSingleNode("./a");
    var text = a.InnerText;
Or, in one line:
    var text = htmlDoc.DocumentNode.SelectSingleNode("//b/a").InnerText;
Note that at the beginning of the XPath:
    //   looks anywhere in the DocumentNode
    .//  looks for a descendant of the current node
    /    looks for a child of the DocumentNode
    ./   looks for a child of the current node
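
The difference between "./" (child) and ".//" (descendant) can be checked quickly outside HTML Agility Pack. The sketch below uses Python's `xml.etree.ElementTree`, which supports a subset of XPath. The markup is a hypothetical version of the question's fragment with an explicit `<a>` inside the `<b>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the question's fragment with an <a> inside the <b>
node = ET.fromstring('<div>Item: <b><a href="#">Link Text</a></b></div>')

direct_child = node.find("./a")   # <a> is not a direct child of <div>
descendant = node.find(".//a")    # searches all descendants of the current node
via_parent = node.find("./b/a")   # explicit child-of-child path

print(direct_child)               # nothing found at the child level
print(descendant.text)
print(via_parent is descendant)
```

A lookup that "is always null" is often the first case: the expression asks for a child, while the target sits one or more levels deeper.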

How to add multiple values to a parameter in jMeter

How can I add multiple values (extracted with a Regular Expression Extractor) to a single parameter?
I have the following test:
Using the regex extractor I get the following:
Now I'm using a BeanShell PreProcessor that contains the following code:
    int count = Integer.parseInt(vars.get("articleID_matchNr"));
    for (int i = 1; i <= count; i++) { // regex counts are 1-based
        sampler.addArgument("articleIds", "[" + vars.get("articleID_" + i) + "]");
    }
Using this will generate the following request:
This adds multiple parameters with the same name (articleIds), which causes an error when I run the test. The correct form of the parameter should be:
    articleIds=["148437", "148720"]
The number of articleIds differs from one user to another.
That's totally expected, as you're adding one argument per match. You need to amend your code as follows to get the desired behavior:
    StringBuilder sb = new StringBuilder("[");
    int count = Integer.parseInt(vars.get("articleID_matchNr"));
    for (int i = 1; i <= count; i++) { // regex match indices are 1-based
        if (i > 1) {
            sb.append(", "); // separator between values, never after the last one
        }
        sb.append("\"").append(vars.get("articleID_" + i)).append("\"");
    }
    sb.append("]"); // also yields a valid "[]" when there are no matches
    sampler.addArgument("articleIds", sb.toString());
See the How to Use BeanShell guide for more details; it serves as a kind of JMeter BeanShell scripting cookbook.
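
To make the target format concrete, here is a minimal sketch in Python of the same joining logic, with a plain dictionary standing in for JMeter's `vars` object. The function name `build_article_ids` and the dictionary are assumptions for illustration only; nothing here is JMeter API.

```python
import json


def build_article_ids(variables: dict) -> str:
    """Join articleID_1..articleID_N into one JSON-style array string."""
    count = int(variables.get("articleID_matchNr", "0"))
    # Regex extractor matches are numbered from 1
    ids = [variables[f"articleID_{i}"] for i in range(1, count + 1)]
    return json.dumps(ids)


# Stand-in for the variables a regex extractor would produce
vars_ = {"articleID_matchNr": "2", "articleID_1": "148437", "articleID_2": "148720"}
print("articleIds=" + build_article_ids(vars_))
```

Letting a JSON serializer produce the string avoids hand-placing quotes and commas, and it still yields a valid empty array when there are no matches.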

xpath: check if current elements position is second in order

Background:
I have an XML document with the following structure:
<body>
<section>content</section>
<section>content</section>
<section>content</section>
<section>content</section>
</body>
Using xpath I want to check if a <section> element is the second element and if so apply some function.
Question:
How do I check if a <section> element is the second element in the body element?
../section[position()=2]
If you want to know whether the second element in the body is named section, you can do this (XPath 2.0 syntax):
    local-name(/body/child::element()[2]) eq "section"
That returns either true or false.
However, you then asked how to check this and, if it is true, apply some function. In XPath you cannot author your own functions; you can only do that in XQuery or XSLT. So let me assume for a moment that you want to call an existing XPath function on the value of the second element if it is a section. Here is an example that applies the lower-case function:
    if (local-name(/body/child::element()[2]) eq "section") then
        lower-case(/body/child::element()[2])
    else ()
However, this can be simplified, because lower-case and many other functions accept an argument with a minimum cardinality of zero. This means that you can just apply the function to a path expression, and if the path does not match anything, the function typically returns an empty sequence, just as a path that did not match would. So this is semantically equivalent to the above:
    lower-case(/body/child::element()[2][local-name(.) eq "section"])
If you are in XQuery or XSLT and are writing your own functions, I would encourage you to write functions that accept a minimum cardinality of zero, just like lower-case does. By doing this you can chain functions together, and if there is no input data (i.e. from a path expression that does not match anything), there is no output data. This leads to a very nice functional programming style.
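
As a rough illustration of the position check outside an XPath 2.0 engine, the following Python sketch uses the standard `xml.etree.ElementTree` module, which supports positional predicates such as `section[2]`. The helper name `is_second_child` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body><section>A</section><section>B</section>"
    "<section>C</section><section>D</section></body>"
)


def is_second_child(parent, element):
    """True if `element` is the second child element of `parent`, whatever its name."""
    children = list(parent)
    return len(children) >= 2 and children[1] is element


second = body.find("section[2]")  # second <section> child; positions are 1-based
if second is not None and is_second_child(body, second):
    # Apply some function only to the matching element
    print(second.text.lower())
```

Note the distinction the two checks draw: `section[2]` means "the second child named section", while `is_second_child` tests "the second child of any name". They coincide here only because all children are sections.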
Question: How do I check if a <section> element is the second element in the body element?
Using C#, you can use the XPathNodeIterator class to traverse the node data, and its CurrentPosition property to check the position of the current node:
XPathNodeIterator.CurrentPosition
Example:
    const string xmlStr = @"<body>
    <section>1</section>
    <section>2</section>
    <section>3</section>
    <section>4</section>
    </body>";
    using (var stream = new StringReader(xmlStr))
    {
        var document = new XPathDocument(stream);
        XPathNavigator navigator = document.CreateNavigator();
        XPathNodeIterator nodes = navigator.Select("/body/section");
        // Walk the selected <section> nodes; CurrentPosition is 1-based
        while (nodes.MoveNext())
        {
            if (nodes.CurrentPosition == 2)
            {
                // Do something with the value at this position
                var currentValue = nodes.Current.Value;
            }
        }
    }

Remove HTML formatting in Razor MVC 3

I am using MVC 3 and Razor View engine.
What I am trying to do
I am making a blog using MVC 3, and I want to remove all HTML formatting tags such as <p>, <b>, <i>, etc.
For this I am using the following code (it does work):
    @{
        post.PostContent = post.PostContent.Replace("<p>", " ");
        post.PostContent = post.PostContent.Replace("</p>", " ");
        post.PostContent = post.PostContent.Replace("<b>", " ");
        post.PostContent = post.PostContent.Replace("</b>", " ");
        post.PostContent = post.PostContent.Replace("<i>", " ");
        post.PostContent = post.PostContent.Replace("</i>", " ");
    }
I feel that there definitely has to be a better way to do this. Can anyone please guide me on this.
Thanks, Alex Yaroshevich.
Here is what I use now:
    post.PostContent = Regex.Replace(post.PostContent, @"<[^>]*>", String.Empty);
The regular expression is slow. Use this instead; it's faster:
    public static string StripHtmlTagByCharArray(string htmlString)
    {
        char[] array = new char[htmlString.Length];
        int arrayIndex = 0;
        bool inside = false; // true while scanning between '<' and '>'
        for (int i = 0; i < htmlString.Length; i++)
        {
            char let = htmlString[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
You can take a look at http://www.dotnetperls.com/remove-html-tags
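
For comparison, here is a small Python sketch of both approaches from this thread: the `<[^>]*>` regex and the character scan. It is an illustration only, with hypothetical function names, and it makes no claim about relative speed, which would need to be measured on real input.

```python
import re

TAG = re.compile(r"<[^>]*>")


def strip_tags_regex(html: str) -> str:
    """Remove everything that looks like <...>."""
    return TAG.sub("", html)


def strip_tags_scan(html: str) -> str:
    """Character scan: drop '<', '>' and everything between them."""
    out = []
    inside = False
    for ch in html:
        if ch == "<":
            inside = True
        elif ch == ">":
            inside = False
        elif not inside:
            out.append(ch)
    return "".join(out)


sample = "<p>Hello, <b>World</b></p>"
print(strip_tags_regex(sample))
print(strip_tags_scan(sample))
```

The two are not strictly equivalent: the scan also drops a stray ">" that is not part of a tag (as in "1 > 0"), whereas the regex leaves it in place. Neither decodes entities or handles script contents.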
Just in case you want to use regex in .NET to strip the HTML tags, the following seems to work pretty well on the source code for this very page. It's better than some of the other answers on this page because it looks for actual HTML tags instead of blindly removing everything between < and >. Back in the BBS days, we typed <grin> a lot instead of :), so removing <grin> is not an option. :)
This solution only removes the tags. It does not remove the contents of those tags in situations where that might be important -- a script tag, for example. You'd see the script, but the script wouldn't execute because the script tag itself gets removed. Removing the contents of an HTML tag is VERY tricky, and practically requires that the HTML fragment be well formed...
Also note the RegexOptions.Singleline option. That's very important for any block of HTML, as there's nothing wrong with opening an HTML tag on one line and closing it on another.
    string strRegex = @"</{0,1}(!DOCTYPE|a|abbr|acronym|address|applet|area|article|aside|audio|b|base|basefont|bdi|bdo|big|blockquote|body|br|button|canvas|caption|center|cite|code|col|colgroup|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frame|frameset|h1|h2|h3|h4|h5|h6|head|header|hr|html|i|iframe|img|input|ins|kbd|keygen|label|legend|li|link|main|map|mark|menu|menuitem|meta|meter|nav|noframes|noscript|object|ol|optgroup|option|output|p|param|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|source|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video|wbr){1}(\s*/{0,1}>|\s+.*?/{0,1}>)";
    Regex myRegex = new Regex(strRegex, RegexOptions.Singleline);
    string strTargetString = @"<p>Hello, World</p>";
    string strReplace = @"";
    return myRegex.Replace(strTargetString, strReplace);
I'm not saying this is the best answer. It's just an option and it worked great for me.
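
The allowlist idea (strip only known HTML tag names and leave things like <grin> alone) can be sketched in Python as follows. The tag list is deliberately shortened and hypothetical, and the pattern is a simplified variant, not a translation of the C# regex above.

```python
import re

# Hypothetical, shortened list; a real one would enumerate all HTML tag names
KNOWN_TAGS = ("p", "b", "i", "a", "br", "script")

# A tag name must be followed by whitespace plus attributes, "/", or ">".
# [^>] already matches newlines, so no DOTALL flag is needed here.
KNOWN_TAG = re.compile(
    r"</?(?:" + "|".join(KNOWN_TAGS) + r")(?:\s[^>]*)?/?>",
    re.IGNORECASE,
)


def strip_known_tags(text: str) -> str:
    """Remove only tags whose name is in KNOWN_TAGS; keep their contents."""
    return KNOWN_TAG.sub("", text)


print(strip_known_tags("<p>Hello <grin> World</p>"))
print(strip_known_tags('<a\nhref="x">link</a><br/>'))
```

As with the original, only the tags are removed: the contents of a script element would remain in the output as plain text.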
