I'm trying to parse a webpage to get posts from a forum.
Each message starts with the following format:
<div id="post_message_somenumber">
and I only want to get the first one.
I tried xpath='//div[starts-with(@id, '"post_message_')]' in YQL without success.
I'm still learning this. Does anyone have suggestions?
I think I have a solution that does not require dealing with namespaces.
Here is one that selects all matching divs:
//div[@id[starts-with(.,"post_message")]]
But you said you wanted just the "first one" (I assume you mean the first "hit" in the whole page?). Here is a slight modification that selects just the first matching result:
(//div[@id[starts-with(.,"post_message")]])[1]
These use the dot to represent the id's value within the starts-with() function. You may have to escape special characters in your language.
It works great for me in PowerShell:
# Load a sample xml document
$xml = [xml]'<root><div id="post_message_somenumber"/><div id="not_post_message"/><div id="post_message_somenumber2"/></root>'
# Run the xpath selection of all matching divs
$xml.selectnodes('//div[@id[starts-with(.,"post_message")]]')
Result:
id
--
post_message_somenumber
post_message_somenumber2
Or, for just the first match:
# Run the xpath selection of the first matching div
$xml.selectnodes('(//div[@id[starts-with(.,"post_message")]])[1]')
Result:
id
--
post_message_somenumber
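For comparison, here is a minimal Python sketch of the same selection on the same sample document. It assumes the standard-library xml.etree.ElementTree module, whose limited XPath subset has no starts-with() function, so the prefix test is done in Python rather than inside the expression. This is an alternative to the XPath above, not a translation of it.

```python
import xml.etree.ElementTree as ET

# Same sample document as in the PowerShell example
root = ET.fromstring(
    '<root><div id="post_message_somenumber"/>'
    '<div id="not_post_message"/>'
    '<div id="post_message_somenumber2"/></root>'
)

# iter("div") walks the tree in document order, much like //div;
# the prefix check replaces starts-with(., "post_message")
matches = [d for d in root.iter("div")
           if d.get("id", "").startswith("post_message")]

all_ids = [d.get("id") for d in matches]
first_id = matches[0].get("id") if matches else None

print(all_ids)   # every matching div
print(first_id)  # only the first hit in the document
```

With a full XPath 1.0 engine the original expressions can be used unchanged.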
I tried xpath='//div[starts-with(@id,
'"post_message_')]' in yql without
success I'm still learning this,
anyone have suggestions
If the problem isn't due to the many nested apostrophes and the unclosed double-quote, then the most likely cause (we can only guess without being shown the XML document) is that a default namespace is used.
Specifying names of elements that are in a default namespace is the most frequently asked question about XPath. If you search for "XPath default namespace" on SO or on the internet, you'll find many sources with the correct solution.
Generally, a special method must be called that binds a prefix (say "x:") to the default namespace. Then, in the XPath expression, every element name "someName" must be replaced by "x:someName".
Here is a good answer on how to do this in C#.
Read the documentation of your language or XPath engine to see how something similar should be done in your specific environment.
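As an illustration of the prefix binding described above, here is a minimal sketch in Python using the standard-library xml.etree.ElementTree module. The sample document and the prefix "x" are assumptions made for this example; the actual page in the question was not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose elements live in a default namespace
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="post_message_1">first</div>'
    '<div id="post_message_2">second</div>'
    '</body></html>'
)
root = ET.fromstring(doc)

# Unprefixed name: selects nothing, because the divs are namespaced
unprefixed = root.findall(".//div")

# Bind the prefix "x" to the namespace URI and use it in the path
ns = {"x": "http://www.w3.org/1999/xhtml"}
prefixed = root.findall(".//x:div", ns)

print(len(unprefixed), len(prefixed))
```

The prefix itself is arbitrary; only the namespace URI it is bound to has to match the document.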
@FindBy(xpath = "//div[starts-with(@id,'expiredUserDetails') and contains(text(), 'Details')]")
private WebElementFacade ListOfExpiredUsersDetails;
This one matches every div on the page whose id starts with expiredUserDetails and whose text contains Details (in XPath 1.0, contains(text(), ...) checks only the element's first text node).
I have a few XPaths as below:
//*[@id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p
//*[@id="729c0860-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
//*[@id="2555ab30-bb84-11ea-9e8b-277e7f6208b2"]/div/div/div[1]/div/p
//*[@id="7e100250-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
//*[@id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
All of the above are used to extract text from a single web page, since the text is located in different viewports, but I would like a single XPath that extracts the text for all of them. Is it possible to use 'and' with multiple IDs to extract all of it through one XPath?
Any other suggestions would be appreciated.
You can use the or operator for the last four, since they share the same trailing path,
and the union operator | to add the first one.
So to select all 5 expressions in one, use the following expression:
//*[@id="904735f0-bb82-11ea-a473-6d0f51688222"]/div/p | //*[@id="729c0860-a71d-11ea-b994-53a3e91a35c2" or @id="2555ab30-bb84-11ea-9e8b-277e7f6208b2" or @id="7e100250-a71d-11ea-b994-53a3e91a35c2" or @id="811727d0-a71d-11ea-b994-53a3e91a35c2"]/div/div/div[1]/div/p
A shorter and more generic solution could be:
(//div/div/div[1]/div/p|//div/p)[parent::*[string-length(@id)=36 and substring(@id,24,1)="-"]]
The first part, in parentheses, specifies the end of each path. Since the @id attributes all have the same length, we test that length inside the predicate. We also verify the presence of a - at a specific position with substring().
I am new to XPath expressions and need help with an issue.
Consider the following document:
<tbody><tr>
<td>By <strong>Bec</strong></td>
<td><strong>Great Support</strong></td>
</tr></tbody>
In this document I have to get the text inside the strong tags separately.
Following is my xpath expression:
//tbody//td//strong/text();
It evaluates output as expected:
Bec
Great Support
How can I write XPath expressions that distinguish between the results, i.e. Bec and Great Support?
It's rather unclear what you're trying to do, but the following should succeed in selecting them separately:
//tbody/tr/td[1]/strong
and
//tbody/tr/td[2]/strong
Note that the text() you had at the end is most likely not needed in this case.
Not sure I understand 100%, but if you're trying to get the text of the first and the second strong tags, you can use position (1-based index):
//tbody/tr/td[position()=1]/strong/text() //first text
//tbody/tr/td[position()=2]/strong/text() //second text
This solution only applies to the current sample though, where your strong tags are inside either the first or second td tag.
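The positional approach can be tried out with a small Python sketch. It assumes the standard-library xml.etree.ElementTree module and wraps the sample rows in a table element (an assumption, since the question shows only the tbody fragment). ElementTree has no text() step, so the text is read from the selected element's .text attribute instead.

```python
import xml.etree.ElementTree as ET

# Sample from the question, wrapped in a <table> for this sketch
table = ET.fromstring(
    "<table><tbody><tr>"
    "<td>By <strong>Bec</strong></td>"
    "<td><strong>Great Support</strong></td>"
    "</tr></tbody></table>"
)

# td[1] and td[2] are 1-based positions among the td children of tr
author = table.find(".//tbody/tr/td[1]/strong").text
title = table.find(".//tbody/tr/td[2]/strong").text

print(author)
print(title)
```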
Not sure this is what you're looking for. Anyway, assuming you're asking how to retrieve a node based on its text, you can match on text content by doing something like:
//tbody//td//strong/text()[.="Bec"]
PS
In [.=""] the dot is an alias for self::node(), not text() as I originally wrote (thanks JLRishe for pointing out the mistake).
I'm currently trying to extract the blurb, or summary from any given Wikipedia page, using XPath. Now, there are many places online where this has already been done: http://jimblackler.net/blog/?p=13, How to use XPath or xgrep to find information in Wikipedia?.
But, when I try to use similar XPath expressions, on a variety of pages, the returned results are strange. For the sake of this question, let's assume I'm trying to retrieve the very first paragraph in the printable Wikipedia page on Boston: http://en.wikipedia.org/w/index.php?title=Boston&printable=yes.
When I try to use this expression /html/body/div[@id='content']/div[@id='bodyContent']//p, only the last four words of the paragraph, "in the United States.", are returned.
Actually, the expression used above could be simplified to //div/p, but the results are the same.
Strangely, the pages I linked to previously seem to use similar methods and return great results. Originally, I imagined this was due to Wikipedia changing the formatting of their pages in recent years, but honestly, I can't find what's wrong with either expression.
Does anyone have any idea about this?
When I try to use this expression
/html/body/div[@id='content']/div[@id='bodyContent']//p, only the
last four words of the paragraph, "in the United States.", are
returned.
There are a few problems here:
1. The XML document is in a default namespace. Writing XPath expressions to select nodes in a document that is in a default namespace is the most frequently asked question about XPath; search for "XPath and default namespace". In short, any unprefixed name will most probably cause nothing to be selected. One must register the default namespace and associate a specific prefix with this namespace. Then any element name in the XPath expression must be written with this prefix. So, the expression above will become:
/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p
where the "x:" prefix is associated with the "http://www.w3.org/1999/xhtml" namespace.
2. Even the above expression doesn't select (only) the wanted node. In order to select only the first x:p from the above, the XPath expression must be specified as (note the brackets):
(/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1]
3. As you want the text of the paragraph, an easy way to do this is to use the standard XPath function string():
string((/x:html/x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p)[1])
When this XPath expression is evaluated, I get the text of the paragraph, for example in the XPath Visualizer I wrote some years ago.
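The three steps can be sketched in Python with the standard-library xml.etree.ElementTree module. The markup below is a made-up stand-in for the Wikipedia page, not its real content, and the "x" prefix is arbitrary. ElementTree has no string() function, so joining itertext() is used here to obtain the same string value (all descendant text in document order).

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the printable page
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="content"><div id="bodyContent">'
    '<p><b>Boston</b> is the capital of <a href="#">Massachusetts</a>,'
    ' in the United States.</p>'
    '<p>Second paragraph.</p>'
    '</div></div></body></html>'
)
root = ET.fromstring(doc)

# Step 1: bind a prefix to the XHTML namespace
ns = {"x": "http://www.w3.org/1999/xhtml"}

# Step 2: find() returns only the first match, like (...)[1]
path = "x:body/x:div[@id='content']/x:div[@id='bodyContent']//x:p"
first_p = root.find(path, ns)

# Step 3: concatenate every text node under the paragraph, like string()
text = "".join(first_p.itertext())
print(text)
```

Reading a single text node of such a paragraph would yield only a fragment, because the inline b and a elements split its text into several nodes.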
I used Firebug's Inspect Element to capture the XPath in a webpage, and it gave me something like:
//*[@id="Search_Fields_profile_docno_input"]
I used the Bookmarklets technique in IE to capture the XPath of the same object, and I got something like:
//INPUT[@id='Search_Fields_profile_docno_input']
Notice that the first one does not have INPUT and instead has an asterisk (*). Why am I getting different XPath expressions? Does it matter which one I use for my tests, like:
Selenium.Click(//*[@id="Search_Fields_profile_docno_input"]);
OR
Selenium.Click(//INPUT[@id='Search_Fields_profile_docno_input']);
//*[@id=...] can match any element, while the second expression tells Selenium to look ONLY at INPUT elements whose id is Search_Fields_profile_docno_input. The second XPath is better for the following reasons:
It takes more time to find the element using * as IDs of all elements should be matched.
If your HTML code is not "well written" there could be other elements which have the same id and this could cause your test to fail.
The first one matches any element with a matching ID, whereas the second one restricts matches to <input> elements. If these were CSS expressions it'd be the difference between #Search_Fields_profile_docno_input and input#Search_Fields_profile_docno_input.
Assuming you only use this ID once in your web page, the two XPaths are effectively equivalent. They'll both match the <input id="Search_Fields_profile_docno_input"> element and no other.
There are some good answers to your "why?" question here, but for Selenium use, there's an even better alternative. Since your page element has an ID attribute, use Selenium's ID locator instead of XPath or CSS:
Selenium.Click("id=Search_Fields_profile_docno_input");
This will go directly to the element, and will run quicker than just about any other locator. Note that the syntax is id=value, not id="value".
Given any element in your document, there's an infinite number of XPath expressions that will select it uniquely. Therefore it's entirely reasonable for two different products to generate two different paths.
Google has just released Wicked Good XPath, a rewrite of Cybozu Labs' famous JavaScript-XPath. Link: https://code.google.com/p/wicked-good-xpath/ The rewritten version is 40% smaller and about 30% faster than the original implementation.
You can check this out and replace the one being used in Selenium.
I have an HTML file with the following content:
<a class="bf" title="Link to book" href="/book/229920/">book name</a>
Please help me construct an XPath expression to get the link text (book name).
I tried /a, but the expression evaluates without results.
If the context is the entire document you should probably use // instead of /. Also you may (not sure about that) need to get down one more level to retrieve the text.
I think it should look like this
//a/text()
EDIT: As Tomalak pointed out it's text() not text
Have you tried
//a
?
More specific is better:
//a[@class='bf' and starts-with(@href, '/book/')]
Note that this selects the <a> element. In your host environment it's easy to extract the text value of that node via standard DOM methods (like the .textContent property).
To select the actual text node, see the other answers in this thread.
It depends also on the rest of your document. If you use // in the beginning all the matching nodes will be returned, which might be too many results in case you have other links in your document.
Apart from that a possible xpath expression is //a/text().
The /a you tried only returns the a-tag itself, if it is the root element. To get the link text you need to append the /text() part.
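To tie the answers together, here is a minimal Python sketch using the standard-library xml.etree.ElementTree module. The surrounding div and the second link are assumptions added so the fragment parses as a document. ElementTree does not support the text() step, so the element is selected with a relative // path and its text is read from the .text attribute, which corresponds to the "select the element, then read its text in the host environment" approach.

```python
import xml.etree.ElementTree as ET

# The link from the question inside a hypothetical wrapper element
page = ET.fromstring(
    '<div>'
    '<a class="bf" title="Link to book" href="/book/229920/">book name</a>'
    '<a class="other" href="/about/">about</a>'
    '</div>'
)

# .//a searches all levels; the predicate narrows it to the book link
link = page.find(".//a[@class='bf']")

link_text = link.text       # the link text
href = link.get("href")     # the target, if it is needed as well

print(link_text)
print(href)
```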