How do I use the XPath tokenizer function in Nokogiri? - ruby

I am attempting to extract information from the following HTML using Nokogiri and XPath.
<p>Friday, February 1<br><strong>Apple <br> Orange</strong></p>
e.xpath('./text()[following-sibling::br]')
Gives me the date just fine. I want to then grab the text inside the strong node and split on br. There may be many fruits separated by br or there may just be one with no br. I would ideally like to accomplish this in xpath instead of code since I'm essentially defining a bunch of parsers via JSON.
Right now I'm thinking that I should use the tokenizer function and pass the text in the strong tag. I thought that should look like this:
e.xpath('./strong[fn::tokenize(.,"<br>")]')
and have also tried
e.xpath('fn::tokenize(./strong,"<br>")')
but I am getting:
.../gems/nokogiri-1.5.6/lib/nokogiri/xml/node.rb:159:in `evaluate': Invalid expression: ./strong/text()[fn::tokenize(.,"br")] (Nokogiri::XML::XPath::SyntaxError)
I'm modeling my usage after the documentation for the method that the error occurs in (line 139):
node.xpath('.//title[regex(., "\w+")]',...
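A possible direction, offered only as a sketch: `tokenize` belongs to XPath 2.0, while Nokogiri evaluates XPath 1.0, so the function is not available there. Since every br already splits the contents of strong into separate text nodes, selecting those text nodes may give one result per fruit without any tokenizing. The example below uses REXML (bundled with Ruby) instead of Nokogiri so that it runs without gems, and it assumes the markup is written with `<br/>` because REXML needs well-formed XML. With Nokogiri, the equivalent relative expression would presumably be `./strong/text()`.

```ruby
require 'rexml/document'

# Sample markup from the question, with <br> closed so REXML can parse it.
xml = '<p>Friday, February 1<br/><strong>Apple <br/> Orange</strong></p>'
doc = REXML::Document.new(xml)

# Each <br/> separates the contents of <strong> into its own text node,
# so an XPath 1.0 text() step yields one node per fruit.
fruits = REXML::XPath.match(doc, '/p/strong/text()')
                     .map { |node| node.value.strip }
                     .reject(&:empty?)

puts fruits.inspect
```

If there is only one fruit and no br, the same expression simply returns a single text node.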

Related

Using a regex to get a Nokogiri node

I'm parsing an XML file with Nokogiri.
Currently, I'm using the following to get the value I need (the document includes multiple Phase nodes):
xml.xpath("//Phase[@text=' = STER P=P(T) ']")
But now, the uploaded XML file can have a text attribute with a different value. Thus, I'm trying to update my code using a regular expression since the value always contains STER.
After looking at a few questions on SO, I tried
xml.xpath("//Phase[@text~=/STER/]")
However, when I run it, I get
ERROR: Invalid predicate: //Phase[@text~=/STER/] (Nokogiri::XML::XPath::SyntaxError)
What am I missing here?
Alternatively, is there an XPath function similar to starts-with that looks for the substring within the entire value and not just at the beginning of it?
There are two problems with your code: first off, there is no =~ operator in XPath. The way to test whether text matches a regex is using the matches function:
//Phase[matches(@text, 'STER')]
Secondly, regex matching is a feature of XPath 2.0, but Nokogiri implements XPath 1.0.
Luckily, you are not actually using any regex features, you are simply checking for a fixed string, which can be done with XPath 1.0 using the contains function:
//Phase[contains(@text, 'STER')]
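As a rough illustration of the contains approach, here is a small sketch. The document and its Root element are made up for the example, and REXML is used only so it runs without gems; the same XPath 1.0 expression should be usable with Nokogiri's xpath method.

```ruby
require 'rexml/document'

# Hypothetical input: only the Phase elements and their text attribute
# come from the question; the Root wrapper is invented.
xml = <<~XML
  <Root>
    <Phase text=" = STER P=P(T) "/>
    <Phase text=" = OTHER "/>
    <Phase text="STER variant"/>
  </Root>
XML
doc = REXML::Document.new(xml)

# contains() is plain XPath 1.0, so no regex support is needed.
phases = REXML::XPath.match(doc, "//Phase[contains(@text, 'STER')]")
texts = phases.map { |phase| phase.attributes['text'] }

puts texts.inspect
```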

How to write Xpath expressions to distinguish between results?

I am new to XPath expressions and need help with an issue.
Consider the following Document :
<tbody><tr>
<td>By <strong>Bec</strong></td>
<td><strong>Great Support</strong></td>
</tr></tbody>
In this document I have to find the text inside each of the strong tags separately.
Following is my xpath expression:
//tbody//td//strong/text();
It evaluates output as expected:
Bec
Great Support
How can I write XPath expressions to distinguish between the results, i.e. Bec and Great Support?
It's rather unclear what you're trying to do, but the following should succeed in selecting them separately:
//tbody/tr/td[1]/strong
and
//tbody/tr/td[2]/strong
Note that the text() you had at the end is most likely not needed in this case.
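A minimal sketch of selecting the two values separately, using the sample document from the question. REXML stands in for whatever XPath engine is actually in use, and the variable names are just for illustration.

```ruby
require 'rexml/document'

xml = <<~XML
  <tbody><tr>
  <td>By <strong>Bec</strong></td>
  <td><strong>Great Support</strong></td>
  </tr></tbody>
XML
doc = REXML::Document.new(xml)

# td[1] and td[2] pick the first and second cell of the row,
# so each expression returns exactly one strong element.
first_text  = REXML::XPath.first(doc, '//tbody/tr/td[1]/strong').text
second_text = REXML::XPath.first(doc, '//tbody/tr/td[2]/strong').text

puts first_text
puts second_text
```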
Not sure I understand 100%, but if you're trying to get the text of the first and the second strong tags, you can use position (1-based index):
//tbody/tr/td[position()=1]/strong/text() //first text
//tbody/tr/td[position()=2]/strong/text() //second text
This solution only applies to the current sample though, where your strong tags are inside either the first or second td tag.
Not sure this is what you're looking for. Anyway, assuming you're asking how to retrieve a node based on its text, you can match on the text content with something like:
//tbody//td//strong/text()[.="Bec"]
PS
In [.=""] the dot is an alias for self::node(), not text() (thanks JLRishe for pointing out the mistake).

How to use Regular Expression to insert text in between text?

I have an unusual scenario. There is a web application that acts as a simulator for sending data as XML, getting data back as XML, and verifying a few details in that XML.
The XML I am sending contains a lot of details. I have to insert into it a parameter that I have defined in my test, and I can't work out how to put that parameter into the XML before sending it.
The XML structure looks like this:
id='12345'><version>1.3.4<</version><accno>1234567890</accno>add<address details</> ..........
In this XML structure I have parameterized <accno>1234567890</accno>, meaning that at the beginning of the script I declare accno='1234567890'.
Now I want to use accno as a parameter in the XML instead of the hard-coded value. Please suggest how to do this.
XML is not regular, but context-free. Use a proper parser like Nokogiri instead of regex. See RegEx match open tags except XHTML self-contained tags.
As answer, as requested.
I will say that editing XML with a regex is a bad idea.
But just to answer the direct question, use gsub, e.g.
str.gsub(/reg_match/, newstring)
A better way of doing it would be to use hpricot.
Or you can use Ruby templates:
require 'erb'
require 'ostruct'
data = {:accno => "1234567890"}
variables = OpenStruct.new(data)
template = "<id='12345'><version>1.3.4</version><accno><%= accno%></accno>"
res = ERB.new(template).result(variables.instance_eval { binding })
puts res
First identify the pattern, then replace it using gsub!
xml_data.gsub!(pattern, replacement)
http://ruby-doc.org/docs/ProgrammingRuby/html/ref_c_string.html#String.gsub_oh
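To make the gsub suggestion concrete, here is a small sketch. The PLACEHOLDER text and the shortened XML string are assumptions made for the example; the pattern simply replaces whatever currently sits between the accno tags.

```ruby
# Value declared at the start of the test script, as in the question.
accno = '1234567890'

# Shortened, hypothetical version of the payload.
xml_data = '<version>1.3.4</version><accno>PLACEHOLDER</accno>'

# Non-greedy match so only the contents of this one element are replaced.
pattern = %r{<accno>.*?</accno>}
updated = xml_data.gsub(pattern, "<accno>#{accno}</accno>")

puts updated
```

This is fine for a one-off, but it breaks as soon as the element gains attributes or whitespace, which is why the answers here lean towards a parser.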
The fast way to do it is with gsub (like Rajkaran says). The right way to do it is rexml or some other xml library. Investment should be related to how much you will use this kind of thing in the future.
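For comparison, a sketch of the REXML route. The request root element and the PLACEHOLDER text are invented for the example, since the full payload isn't shown in the question.

```ruby
require 'rexml/document'

accno = '1234567890'

# Hypothetical well-formed payload; only version and accno come from the question.
xml = "<request id='12345'><version>1.3.4</version><accno>PLACEHOLDER</accno></request>"
doc = REXML::Document.new(xml)

# Locate the element with XPath and overwrite its text content.
REXML::XPath.first(doc, '//accno').text = accno

result = doc.to_s
puts result
```

Because the parser works on the element rather than on the raw string, this keeps working if the surrounding markup changes.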

Trouble using Xpath "starts with" to parse xhtml

I'm trying to parse a webpage to get posts from a forum.
The start of each message starts with the following format
<div id="post_message_somenumber">
and I only want to get the first one
I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success
I'm still learning this, anyone have suggestions
I think I have a solution that does not require dealing with namespaces.
Here is one that selects all matching div's:
//div[@id[starts-with(.,"post_message")]]
But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:
(//div[@id[starts-with(.,"post_message")]])[1]
These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.
It works great for me in PowerShell:
# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'
# Run the xpath selection of all matching div's
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')
Result:
id
--
post_message_somenumber
post_message_somenumber2
Or, for just the first match:
# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')
Result:
id
--
post_message_somenumber
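The same idea sketched in Ruby, using the sample document from the PowerShell example. REXML is an assumption made so the example runs without gems. Instead of wrapping the expression in (...)[1], this version asks REXML for the first match directly, which should return the same element.

```ruby
require 'rexml/document'

xml = '<root><div id="post_message_somenumber"/>' \
      '<div id="not_post_message"/>' \
      '<div id="post_message_somenumber2"/></root>'
doc = REXML::Document.new(xml)

expr = '//div[starts-with(@id, "post_message")]'

# All divs whose id begins with the prefix, in document order.
all_ids = REXML::XPath.match(doc, expr).map { |div| div.attributes['id'] }

# Only the first such div.
first_id = REXML::XPath.first(doc, expr).attributes['id']

puts all_ids.inspect
puts first_id
```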
I tried xpath='//div[starts-with(@id, '"post_message_')]' in yql without success I'm still learning this, anyone have suggestions
If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.
Specifying names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or on the internet, you'll find many sources with the correct solution.
Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression, every element name "someName" must be replaced by "x:someName".
Here is a good answer how to do this in C#.
Read the documentation of your language/xpath-engine how something similar should be done in your specific environment.
@FindBy(xpath = "//div[starts-with(@id,'expiredUserDetails') and contains(text(), 'Details')]")
private WebElementFacade ListOfExpiredUsersDetails;
This selects every div on the page whose id starts with expiredUserDetails and whose text contains Details.

XPath expression?

I want to extract "Date: 2009-09-25, 1:54PM EDT" from this webpage
http://auburn.craigslist.org/sha/1392067187.html
But I don't understand how to write an XPath expression for that.
Can anyone help me with that?
I am already extracting other fields from this page.
Why don't you just run a regexp like the one below?
'Date:\s+([0-9]{4}-[0-9]{2}-[0-9]{2}.+?\<)'
It seems to be the easiest way. And if you don't want to work on the raw text, you can use XPath 2.0, which has support for regexps (fn:matches).
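A sketch of the regex idea in Ruby. The HTML fragment is hypothetical, since the real page markup isn't shown here, and the closing < is moved outside the capture group so that the captured value doesn't include it.

```ruby
# Hypothetical fragment; the actual markup of the page may differ.
html = "<hr>\nDate: 2009-09-25, 1:54PM EDT<br>\nReply to: see below<br>"

# Capture from the date up to, but not including, the next tag.
date = html[/Date:\s+(\d{4}-\d{2}-\d{2}.+?)</, 1]

puts "Date: #{date}"
```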
Are you running the HTML through TIDY or some other process to turn it into XHTML? Or how are you able to execute XPATH against that HTML?
If the document was well-formed, then you could probably use the following XPATH:
/html/body/hr[1]/following-sibling::text()[1]
It finds the first HR element in the document, then selects the first text() node following it (which contains the string "Date: 2009-09-25, 1:54PM EDT").
