I'm working on a project where I need to harvest some data from a website, so I'm using Web-Harvest.
I'm running into a problem: the data I'm harvesting (comments from news websites) sometimes spans more than one page. I'm trying to configure the script to look for the link to the second page of comments with an XPath query on the page. The problem is that if I use an if test, the condition always passes, and if I use a try statement, the try body always succeeds. As a result, when an article has only one page of comments, my script extracts the comments from that page twice. Articles with two pages of comments work beautifully, however. So my question is about the syntax of if conditions and try statements. The Web-Harvest documentation is scant with regard to these functions.
Here's what I'm trying. First, the if test:
<var-def name="secondPageLink">
  <xpath expression="/a[@class='next']/@href">
    <var name="firstPage"/>
  </xpath>
</var-def>
<case>
  <if condition="${secondPageLink != null}">
    [ process second page ]
  </if>
</case>
Second, the try/catch:
<try>
  <body>
    <var-def name="secondPageLink">
      <xpath expression="/a[@class='next']/@href">
        <var name="firstPage"/>
      </xpath>
    </var-def>
    [ continue to process page ]
  </body>
  <catch>
  </catch>
</try>
The problem with the if test is that, even though the variable is empty when no second page exists (which I can see from the debugging output in the GUI), the if seems to return true and runs its body.
I can more easily see why the try/catch doesn't work properly: an XPath that returns no value (because the second page doesn't exist) doesn't constitute an 'error' as such, so the try still succeeds. A further difficulty is that the @href of the next-page link is relative, so it needs to be appended to the URL of the first page (or the base URL of the article, actually, but that is the same thing here). My html-to-xml therefore takes the URL ${firstPage}${secondPageLink}, which ends up simply being the first page URL again, and Web-Harvest thus processes the first page a second time.
If someone can reformulate my if test to return false when the secondPageLink xpath returns an empty value, I'd be very appreciative!
Found an answer.
This person had a similar problem with if, and an answer there suggested using the syntax: condition="${variable.toString().length() > 0}".
So in my code, replacing the if test with:
<case>
<if condition="${secondPageLink.toString().length() > 0}">
<var-def name="secondPageFull">
<html-to-xml>
<http url="${commentedArticleURL}${secondPageLink}"/>
</html-to-xml>
[...]
produced the correct result.
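For readers who want to see the same guard outside Web-Harvest, here is a minimal Python sketch of the logic the fix relies on: treat an empty XPath result as "no second page", and resolve the relative href against the article URL. The function name second_page_url and the example.com URLs are made up for illustration; this is not Web-Harvest code.

```python
from urllib.parse import urljoin


def second_page_url(base_url, next_href):
    """Return the absolute URL of the next comments page, or None if there is none."""
    # An XPath that matches nothing usually yields an empty value rather than null,
    # so test for emptiness instead of comparing against None only.
    if next_href is None or not str(next_href).strip():
        return None
    # Resolve the relative href against the article URL.
    return urljoin(base_url, str(next_href).strip())


print(second_page_url("http://example.com/news/article-1", ""))
print(second_page_url("http://example.com/news/article-1", "comments?page=2"))
```

With an empty href the function returns None, so the caller can skip the second fetch instead of requesting the first page again.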
I added a Response Assertion to my test to hit the home page of our local site. I added this to the "Patterns to Test" in a Response Assertion:
Email
This worked. ( To get that label, I did View Source in Firefox and copied the code including all white space. I then clicked "Add" for the Response Assertion and pasted the copied code directly into JMeter this way. ) When I run my test, my test will pass with just this label as a Pattern to Test. It shows no red errors after running it in JMeter.
However, when I add the following span tag by clicking on "Add" to get a new entry in the same Response Assertion, the test will fail.
1.7.0.147
So, to be clear, I had 2 entries for the same Response Assertion...one for the "Email" label and one for the "footerVer" span. Each of these had their own separate line under the same Response Assertion.
Also, for most tests that passed and did not pass, I had "Main Sample only", "Text Response", and "Contains" selected. I did try to change to "Matches" and "Equals" but I just ended up with different errors. So, I wanted to stay on "Contains" for now since my other entry for the "Email" label worked when I had "Contains" selected.
Under the "View Results Tree", JMeter tells me about this failure when I add the span tag:
Assertion error: false
Assertion failure: true
Assertion failure message: Test failed: text expected to contain /
1.7.0.147
/
I have also had success with several other tags along the way.
Only the span tag seems to be giving me a problem right now. Any ideas?
===============================
Added config:
I am not able to add the full response since it is not my code, but the company's code. But I can try to get something on here that may be useful in a different way.
This is the response dealing with the version copied verbatim from the response tab within JMeter:
<span class="footerVer">
1.7.0.147
</span>
Hope that helps
I would suggest using XPath assertions for multi-line HTML: the page source may vary, and flaky HTML is a headache to match as text.
The following XPath expression checks whether the inner text of the span with class footerVer equals 1.7.0.147, ignoring the whitespace around it:
normalize-space(//span[@class='footerVer'])='1.7.0.147'
Use Substring instead of Contains for Pattern Matching rules:
http://jmeter.apache.org/usermanual/component_reference.html#Response_Assertion
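As far as I understand the JMeter documentation, "Contains" treats each pattern as a regular expression, while "Substring" compares plain text. If that is right, a whitespace-tolerant regular expression is another option. Below is a small Python sketch of the idea; the response string is a stand-in modelled on the fragment quoted in the question, and Python's re module is not the regex engine JMeter uses, so treat the pattern as a starting point to try rather than a verified JMeter setting.

```python
import re

# Stand-in for the relevant part of the response (indentation is assumed).
response = '<span class="footerVer">\n    1.7.0.147\n</span>'

# \s* absorbs the line breaks and indentation around the version text;
# the dots are escaped so they only match literal dots.
pattern = re.compile(r'<span class="footerVer">\s*1\.7\.0\.147\s*</span>')


def contains(pattern, text):
    """Mimic a 'Contains'-style check: the pattern may match anywhere in the text."""
    return pattern.search(text) is not None


print(contains(pattern, response))
```

Because the whitespace is matched by \s*, one entry can cover the whole span regardless of how the page is indented.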
So, I found one way around this. Although, I do not think this is the most efficient way to verify the test. I split the span into 3 individual lines in the Response Assertion.
<span class="copyright marginLeft_100">
© Copyright 2002-2013 Turning Technologies, LLC. All Rights Reserved.
</span>
==========================
I do not really mind the first 2 lines. But the third line is so generic that it does no good unless it is combined with the opening tag.
Well, for now, I can at least confirm something. Also, I left it on "Contains", even though I took a look at the other link posted above, because all of my other tags presented no problem when it was on "Contains". Hope this helps someone else also.
I'm using OpenReports, which uses FreeMarker for its templates.
The following:
<@display.table name="results" class="displayTag" sort="list" export=true pagesize=10 requestURI="queryReportResult.action">
  <@display.column property="first_name" title="First Name" sortable=true headerClass="sortable" />
  <@display.column property="last_name" title="Last Name" sortable=true headerClass="sortable"/>
</@display.table>
The data is automatically grabbed using a stored procedure.
This will create a sortable table. Does anyone know how I could access just the first row of data? I intend to save it into a variable and output it in some part of the page.
The reason I want to do this is that we have a basic report, and what would make it perfect is if I could print some values from it toward the top of the page, above the report.
I know a lot of people aren't familiar with OpenReports, but I figured FreeMarker does have a pretty good following. I understand if this is pretty obscure.
From what I can see from here, the @display.table call prints the whole table at once, so there's nowhere to insert FreeMarker code to catch the first row. Of course, you should check the documentation of @display.table to see if it offers any helpful options, but I suppose you have already done that. So as a last resort, you can capture the whole table into a variable with <#assign tableHTML><@display.table ...>...</@display.table></#assign> and then extract the first row with a regular expression (or something like that) from the value of the tableHTML variable.
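To make the regular-expression step concrete, here is a rough sketch written in Python rather than FreeMarker, only to show the shape of the pattern. The table markup and the sample names are invented; the real output of the display.table macro may be laid out differently, so the pattern would need adjusting.

```python
import re

# Invented stand-in for the captured table HTML.
table_html = """
<table class="displayTag">
<thead><tr><th>First Name</th><th>Last Name</th></tr></thead>
<tbody>
<tr class="odd"><td>Ada</td><td>Lovelace</td></tr>
<tr class="even"><td>Alan</td><td>Turing</td></tr>
</tbody>
</table>
"""


def first_row_cells(html):
    """Return the cell texts of the first row that contains <td> cells."""
    for row in re.findall(r"<tr\b[^>]*>(.*?)</tr>", html, flags=re.S | re.I):
        cells = re.findall(r"<td\b[^>]*>(.*?)</td>", row, flags=re.S | re.I)
        if cells:  # header rows only hold <th> cells, so they are skipped
            return [cell.strip() for cell in cells]
    return []


print(first_row_cells(table_html))
```

The same two patterns should translate to whatever regular-expression support the template language offers, though regular expressions stay fragile against nested tables or markup inside cells.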
I have the following line in a long loop
page = Nokogiri::HTML(open(topic[:url].first)).xpath('//ul[@class = "pages"]//li').first
Sometimes my Ruby application crashes, raising the "End of file reached" exception on this line.
How can I resolve this problem? Just a begin;raise;end block?
It is a script that performs a forum backup, so it is important that it doesn't skip any thread.
Thanks in advance.
In addition to @Phrogz's excellent advice (in particular about at_css with the simpler expression), I would pull the raw XML [content] separately:
page = if (content = open(topic[:url].first).read).strip.length > 0
  Nokogiri::HTML(content).xpath('//ul[@class = "pages"]//li').first
end
I would suggest that you first fix the underlying issue so that you do not get this error.
Does the same URL always cause the problem? (Output it in your log files.) If so, perhaps you need to URI encode the URL.
Is it random, and therefore likely related to a connection hiccup or server problem? If so, you should rescue the specific error and then retry one or more times to get the crucial data.
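The "rescue the specific error and retry" idea looks roughly like this. The sketch is in Python rather than Ruby, and fetch_with_retry, flaky_fetch, and the chosen error types are placeholders for illustration; in the Ruby script the rescued class would be whatever the "End of file reached" error actually is.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=1.0, errors=(EOFError, ConnectionError)):
    """Call fetch() until it succeeds, retrying only on the listed error types."""
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except errors:
            if attempt == retries:
                raise  # out of attempts: surface the error rather than skip silently
            time.sleep(delay)


# Stand-in fetcher that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise EOFError("end of file reached")
    return "<ul class='pages'><li>1</li></ul>"


print(fetch_with_retry(flaky_fetch, delay=0))
```

Re-raising after the last attempt matters for a backup script: a thread that cannot be fetched stops the run visibly instead of being dropped.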
Secondarily, you should know that the CSS syntax for that query is far simpler:
page = Nokogiri.HTML(...).at_css('ul.pages li')
Not only is this less than half the bytes, it allows for cases like <ul class="foo pages"> that the XPath would miss.
Using at_css (or at_xpath) is the same as .css(...).first, but is faster and simpler.
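The difference between the exact attribute comparison and the class-token match can be seen in a few lines. This is a Python sketch using the standard library's ElementTree instead of Nokogiri, with invented markup; the list comprehension only imitates what a CSS class selector does.

```python
import xml.etree.ElementTree as ET

# Invented markup: the second list carries an extra class alongside "pages".
doc = ET.fromstring(
    "<div>"
    "<ul class='pages'><li>a</li></ul>"
    "<ul class='foo pages'><li>b</li></ul>"
    "</div>"
)

# Exact attribute comparison, as in the XPath from the question.
exact = doc.findall(".//ul[@class='pages']")

# Token-based comparison, which is how a CSS class selector behaves.
by_token = [ul for ul in doc.iter("ul") if "pages" in ul.get("class", "").split()]

print(len(exact), len(by_token))
```

The exact comparison finds only the first list, while the token-based check finds both.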
I am coding with Groovy; however, I don't believe it's a language-specific set of questions.
I actually have two questions
First Question
I've run into an issue while using HtmlUnit. It is telling me that what I am trying to grab is null.
The page I'm testing it on is:
http://browse.deviantart.com/resources/applications/psbrushes/?order=9&offset=0#/dbwam4
My code:
client = new WebClient(BrowserVersion.FIREFOX_3)
client.javaScriptEnabled = false
page = client.getPage(url)
//coming up as null
title = page.getByXPath("//html/body/div[4]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a")
println title
This simply prints out: []
Is this because the page uses onclick()? If so, how would I get around that? Enabling javascript creates a mess in my cmd prompt.
Second Question
I also want to get the image, but I am having trouble because when I get the XPath (via Firebug) it shows up as: //*[@id="gmi-ResViewSizer_img"]
How do I handle that?
First Answer:
/html/body/div[3]/div/div[3]/div/div/div/div/div/div/div/div/div/div/h1/a
Your XPATH was off by one in the predicate filter for the 4th div of the body; it should be the 3rd div. It appears the HTML for the site can change (and has changed) since you originally snagged the XPATH using Firebug. You may need to adjust your XPATH to accommodate potential changes and be less sensitive to differences in document structure.
Maybe something like this:
/html/body//div/h1/a
Second Answer: The XPATH that you listed will work. It may look odd or short (and may not be the most efficient), but // starts at the root node and looks through every node in the tree, * matches any element (including the img), and the [] predicate filter restricts it to those that have an id attribute whose value equals "gmi-ResViewSizer_img".
There are many other options for XPATHs that could work as well. It will also depend on how often the HTML structure changes. This is one that also works for the page referenced to select that img:
/html/body/div/div/div/div/img[1]
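To illustrate how the id-based expression from the second answer behaves, here is a small Python sketch using the standard library's ElementTree (not HtmlUnit). The markup and the src values are invented, and ElementTree needs the leading dot (.//) where full XPath accepts a bare //.

```python
import xml.etree.ElementTree as ET

# Invented page fragment with two images, only one of which has the wanted id.
page = ET.fromstring(
    "<html><body><div><div>"
    "<img id='gmi-ResViewSizer_img' src='brush.png'/>"
    "<img id='other' src='x.png'/>"
    "</div></div></body></html>"
)

# * matches any element; the predicate keeps only the one with the given id.
matches = page.findall(".//*[@id='gmi-ResViewSizer_img']")

print([m.tag for m in matches], matches[0].get("src"))
```

Because the lookup depends only on the id, it keeps working when the surrounding div nesting changes, which is the weakness of the long positional path.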
I had the same problem. I solved it when I noticed the iframe tags on the page. Try calling
((HtmlPage)current_page.getFrames()[n].getEnclosedPage()).getElementByXPath(...
where n is the position of the frame in the iframe collection. It worked for me!
Thanks a lot.
How to access the comments inside the XML document using XPath?
For example:
<table>
<length> 12 </length>
<!--Some comment here-->
</table>
I want to access the "Some comment here".
Thanks...
EDIT: I am using MSXML DOM ActiveX and the command comment() seems to be failing... Any idea why?
With the path
/foo/bar/comment()
you can select all comments in the /foo/bar element. May depend on your language of choice, of course. But generally this is how you do it.
Use the comment() function, for example:
/table/length/following::comment()[1]
selects the first comment that follows the length element.
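For comparison, here is the same lookup done by walking the DOM instead of using XPath, as a Python sketch with the standard library's minidom. The helper name first_comment_after is made up, and it only looks at the siblings that follow the element, which is narrower than the following:: axis but enough for the sample document.

```python
from xml.dom import Node, minidom


def first_comment_after(element):
    """Return the text of the first comment among the siblings after element, or ""."""
    node = element.nextSibling
    while node is not None:
        if node.nodeType == Node.COMMENT_NODE:
            return node.data
        node = node.nextSibling
    return ""


doc = minidom.parseString("<table><length> 12 </length><!--Some comment here--></table>")
length = doc.getElementsByTagName("length")[0]
print(first_comment_after(length))
```

This prints the comment text, Some comment here, for the sample document.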
Edit
Manoj asks in a comment to this answer why this isn't working in MSXML. The likely reason is that you are using MSXML3. By default, MSXML3 does not use XPath as its selection language; it defaults to an earlier, much weaker language (XSL Pattern). You need to set XPath as the selection language via the DOMDocument's setProperty method, e.g. (in JScript):
var dom = new ActiveXObject("MSXML2.DOMDocument.3.0");
dom.setProperty("SelectionLanguage", "XPath");
Now the full XPath language will work in your queries (note one breaking change is indexer predicates are 1 based in XPath whereas they were 0 based in XSL Pattern).
Based on the OP's comments to posted answers (and my curiosity as to why this simple thing would not work), here is my suggestion:
Using the XPath expression suggested by @Anthony, I was able to successfully load the comment node with the following JS function:
function SelectComment(s)
{
    var xDoc = new ActiveXObject("MSXML2.DOMDocument.6.0");
    if (xDoc)
    {
        xDoc.loadXML(s);
        var selNode = xDoc.selectSingleNode("/table/length/following::comment()[1]");
        if (selNode != null)
            return selNode.text;
        else
            return "";
    }
}
Sample invocation:
SelectComment("<table><length> 12</length><!--Some comment here--></table>");
Output:
"Some comment here"
Notes:
a. Your MSXML version may vary. Please use appropriately.
b. This kind of code is definitely not recommended because it works only on IE. However, since this is your explicitly stated requirement, I have used the ActiveXObject.
c. You have not mentioned in your comments what fails in the suggested XPath expressions. My guess is that you are not querying the text property of the retrieved node. Keep in mind that selectSingleNode always returns an IXMLDOMNode, and you need to query its data or text properties.
Maybe this could help.
This sample removes comments:
XmlNodeList list = xmlDoc.SelectNodes("//comment()");
foreach(XmlNode node in list)
node.ParentNode.RemoveChild(node);
Learned from here.
<adjustment>
<!-- krishna k -->
<name>FX Update USD</name>
<!-- Since this plan updates existing adj's no ajd's will be created using this id -->
<id>7206</id>
I am facing a similar issue: my application is reading comments, which causes a stack crash. How can I avoid reading comments with DOM?