How can I match on an attribute that contains a certain string? - xpath

I am having a problem selecting nodes by attribute when the attributes contains more than one word. For example:
<div class="atag btag" />
This is my xpath expression:
//*[#class='atag']
The expression works with
<div class="atag" />
but not for the previous example. How can I select the <div>?

Here's an example that finds div elements whose className contains atag:
//div[contains(#class, 'atag')]
Here's an example that finds div elements whose className contains atag and btag:
//div[contains(#class, 'atag') and contains(#class ,'btag')]
However, it will also find partial matches like class="catag bobtag".
If you don't want partial matches, see bobince's answer below.

mjv's answer is a good start but will fail if atag is not the first classname listed.
The usual approach is the rather unwieldy:
//*[contains(concat(' ', #class, ' '), ' atag ')]
this works as long as classes are separated by spaces only, and not other forms of whitespace. This is almost always the case. If it might not be, you have to make it more unwieldy still:
//*[contains(concat(' ', normalize-space(#class), ' '), ' atag ')]
(Selecting by classname-like space-separated strings is such a common case it's surprising there isn't a specific XPath function for it, like CSS3's '[class~="atag"]'.)

try this: //*[contains(#class, 'atag')]

EDIT: see bobince's solution which uses contains rather than start-with, along with a trick to ensure the comparison is done at the level of a complete token (lest the 'atag' pattern be found as part of another 'tag').
"atag btag" is an odd value for the class attribute, but never the less, try:
//*[starts-with(#class,"atag")]

A 2.0 XPath that works:
//*[tokenize(#class,'\s+')='atag']
or with a variable:
//*[tokenize(#class,'\s+')=$classname]

Be aware that bobince's answer might be overly complicated if you can assume that the class name you are interested in is not a substring of another possible class name. If this is true, you can simply use substring matching via the contains function. The following will match any element whose class contains the substring 'atag':
//*[contains(#class,'atag')]
If the assumption above does not hold, a substring match will match elements you don't intend. In this case, you have to find the word boundaries. By using the space delimiters to find the class name boundaries, bobince's second answer finds the exact matches:
//*[contains(concat(' ', normalize-space(#class), ' '), ' atag ')]
This will match atag and not matag.

To add onto bobince's answer...
If whatever tool/library you using uses Xpath 2.0, you can also do this:
//*[count(index-of(tokenize(#class, '\s+' ), $classname)) = 1]
count() is apparently needed because index-of() returns a sequence of each index it has a match at in the string.

You can try the following
By.CssSelector("div.atag.btag")

I came here searching solution for Ranorex Studio 9.0.1. There is no contains() there yet. Instead we can use regex like:
div[#class~'atag']

For the links which contains common url have to console in a variable. Then attempt it sequentially.
webelements allLinks=driver.findelements(By.xpath("//a[contains(#href,'http://122.11.38.214/dl/appdl/application/apk')]"));
int linkCount=allLinks.length();
for(int i=0; <linkCount;i++)
{
driver.findelement(allLinks[i]).click();
}

Related

Xpath: why normalize-space could not remove the empty space and \n?

For the following code:
<a class="title" href="the link">
Low price
<strong>computer</strong>
you should not miss
</a>
I used this xpath code to scrapy:
response.xpath('.//a[#class="title"]//text()[normalize-space()]').extract()
I got the following result:
u'\n \n Low price ', u'computer', u' you should not miss'
Why two \n and many empty spaces before low price was not removed by normalize-space() for this example?
Another question: how to combine the 3 parts as one scraped item as u'Low price computer you should not miss'?
Please try this:
'normalize-space(.//a[#class="title"])'
I already had the same problem, try this:
[item.strip() for item in response.xpath('.//a[#class="title"]//text()').extract()]
Your call to normalize-space() is in a predicate. That means you are selecting text nodes where (the effective boolean value of) normalize-space() is true. You aren't selecting the result of normalize-space: for that you would want
.//a[#class="title"]//text()/normalize-space()
(which needs XPath 2.0)
The second part of your question: just use
string(.//a[#class="title"])
(assuming scrapy-spider allows you to use an XPath expression that returns a string, rather than one that returns nodes).

Is it safe to concatenate two XPath 1.0 queries?

If I have two XPath queries where the second one is meant to further drill down the result of the first, can I safely let my script combine them into a single query by...
placing parenthesis around the first query,
prefixing the second query with with a slash, and then
simply concatenating the two strings ?
Context
The concrete usecase that sparked this question involves extracting information from XML/XHTML documents according to externally supplied pairs of "CSS selector + attribute name" using XPath behind the scenes.
For example the script may get the following as input:
selector: a#home, a.chapter
attribute: href
It then compiles the selector to an XPath query using the HTML::Selector::XPath Perl module, and the attribute by simply prefixing a # ... which in this case would yield:
XPath query 1: //a[#id='home'] | //a[contains(concat(' ', #class, ' '), ' chapter ')]
XPath query 2: #href
And then it repeatedly passes those queries to libxml2's XPath engine to extract the requested information (in this example, a list of URLs) from the XML documents in question.
It works, but I would prefer to combine the two queries into a single one, which would simplify the code for invoking them and reduce the performance overhead:
XPath query: (//a[#id='home'] | //a[contains(concat(' ', #class, ' '), ' chapter ')])/#href
(note the added parenthesis and slash)
But is this safe to do programmatically, for arbitrary input queries?
In general, no, you can't concatenate two arbitrary XPath expressions in this way, especially not in XPath 1.0. It's easy to find counter-examples: in XPath 1.0 you can't even have a union expression on the RHS of '/', so concatenating "/a" and "(b|c)" would fail.
In XPath 2.0, the result will always be syntactically valid, but in may contain type errors, e.g. if the expressions are "count(a)" and "b". The LHS operand of "/" must evaluate to a sequence of nodes.
Sure, this should work. However, you will always have to respect the correct context. If the elements in your example in the first query have no href attribute, you will get an empty result set.
Also, you will have to take care of e.g. a leading slash in front of your second query, so that you don't end up with a descendant-or-self axis step, which might not be what you want. Apart from that, this should always work - The worst that can happen that it is not logical correct (i.e. you don't get the expected result), but it should always be valid XPath.

Replacing partial regex matches in place with Ruby

I want to transform the following text
This is a ![foto](foto.jpeg), here is another ![foto](foto.png)
into
This is a ![foto](/folder1/foto.jpeg), here is another ![foto](/folder2/foto.png)
In other words I want to find all the image paths that are enclosed between brackets (the text is in Markdown syntax) and replace them with other paths. The string containing the new path is returned by a separate real_path function.
I would like to do this using String#gsub in its block version. Currently my code looks like this:
re = /!\[.*?\]\((.*?)\)/
rel_content = content.gsub(re) do |path|
real_path(path)
end
The problem with this regex is that it will match ![foto](foto.jpeg) instead of just foto.jpeg. I also tried other regexen like (?>\!\[.*?\]\()(.*?)(?>\)) but to no avail.
My current workaround is to split the path and reassemble it later.
Is there a Ruby regex that matches only the path inside the brackets and not all the contextual required characters?
Post-answers update: The main problem here is that Ruby's regexen have no way to specify zero-width lookbehinds. The most generic solution is to group what the part of regexp before and the one after the real matching part, i.e. /(pre)(matching-part)(post)/, and reconstruct the full string afterwards.
In this case the solution would be
re = /(!\[.*?\]\()(.*?)(\))/
rel_content = content.gsub(re) do
$1 + real_path($2) + $3
end
A quick solution (adjust as necessary):
s = 'This is a ![foto](foto.jpeg)'
s.sub!(/!(\[.*?\])\((.*?)\)/, '\1(/folder1/\2)' )
p s # This is a [foto](/folder1/foto.jpeg)
You can always do it in two steps - first extract the whole image expression out and then second replace the link:
str = "This is a ![foto](foto.jpeg), here is another ![foto](foto.png)"
str.gsub(/\!\[[^\]]*\]\(([^)]*)\)/) do |image|
image.gsub(/(?<=\()(.*)(?=\))/) do |link|
"/a/new/path/" + link
end
end
#=> "This is a ![foto](/a/new/path/foto.jpeg), here is another ![foto](/a/new/path/foto.png)"
I changed the first regex a bit, but you can use the same one you had before in its place. image is the image expression like ![foto](foto.jpeg), and link is just the path like foto.jpeg.
[EDIT] Clarification: Ruby does have lookbehinds (and they are used in my answer):
You can create lookbehinds with (?<=regex) for positive and (?<!regex) for negative, where regex is an arbitrary regex expression subject to the following condition. Regexp expressions in lookbehinds they have to be fixed width due to limitations on the regex implementation, which means that they can't include expressions with an unknown number of repetitions or alternations with different-width choices. If you try to do that, you'll get an error. (The restriction doesn't apply to lookaheads though).
In your case, the [foto] part has a variable width (foto can be any string) so it can't go into a lookbehind due to the above. However, lookbehind is exactly what we need since it's a zero-width match, and we take advantage of that in the second regex which only needs to worry about (fixed-length) compulsory open parentheses.
Obviously you can put real_path in from here, but I just wanted a test-able example.
I think that this approach is more flexible and more readable than reconstructing the string through the match group variables
In your block, use $1 to access the first capture group ($2 for the second and so on).
From the documentation:
In the block form, the current match string is passed in as a parameter, and variables such as $1, $2, $`, $&, and $' will be set appropriately. The value returned by the block will be substituted for the match on each call.
As a side note, some people think '\1' inappropriate for situations where an unconfirmed number of characters are matched. For example, if you want to match and modify the middle content, how can you protect the characters on both sides?
It's easy. Put a bracket around something else.
For example, I hope replace a-ruby-porgramming-book-531070.png to a-ruby-porgramming-book.png. Remove context between last "-" and last ".".
I can use /.*(-.*?)\./ match -531070. Now how should I replace it? Notice
everything else does not have a definite format.
The answer is to put brackets around something else, then protect them:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1.')
# => "a-ruby-porgramming-book.png"
If you want add something before matched content, you can use:
"a-ruby-porgramming-book-531070.png".sub(/(.*)(-.*?)\./, '\1-2019\2.')
# => "a-ruby-porgramming-book-2019-531070.png"

A better way of selecting multiple elements by attribute name in XPath

I'm looking to select a collection of elements based on an array of ID names. I'm currently using a giant OR statement essentially:
//*[#id='apple']|//*[#id='orange']|//*[#id='banana']
But building that string manually seems messy. Is there something like a nice SQL-esque "WHERE IN [a,b,c]" operator that I could be using?
I am using the HTTPAgilityPack for ASP.Net which I think equates to XPath1.o (feel free to correct me on that.)
Thanks.
First, you could simplify this by using or. This avoids repeating the //* multiple times although you till specify the #id= part multiple times:
//*[#id='apple' or #id='orange' or #id='banana']
A more elegant solution is to check against a list of acceptable ids. Now if you're using XPath 1.x then you'll have to do a bit of gymnastics to get contains() to do your bidding. Specifically, notice that I've got spaces on both ends of the first string, and then concatenate spaces to each end of #id before looking for a match. This is to prevent an #id of "range" from matching, for example.
//*[contains(' apple orange banana ', concat(' ', #id, ' '))]
If you have are using XPath 2.0 then the way forward is simpler thanks to the addition of sequences to the language:
//*[exists(index-of(('apple', 'orange', 'banana'), #id))]
Use:
//*[contains('|apple|banana|orange|', concat('|',#id, '|'))]
In case some of the id attributes may contain the "|" character, use another instead, that is known not to be present in the value of any of the id attributes.
An XPath 2.0 solution:
//*[#id=('apple', 'orange', 'banana')]

Locating the node by value containing whitespaces using XPath

I need to locate the node within an xml file by its value using XPath.
The problem araises when the node to find contains value with whitespaces inside.
F.e.:
<Root>
<Child>value</Child>
<Child>value with spaces</Child>
</Root>
I can not construct the XPath locating the second Child node.
Simple XPath /Root/Child perfectly works for both children, but /Root[Child=value with spaces] returns an empty collection.
I have already tried masking spaces with %20, & #20;, & nbsp; and using quotes and double quotes.
Still no luck.
Does anybody have an idea?
Depending on your exact situation, there are different XPath expressions that will select the node, whose value contains some whitespace.
First, let us recall that any one of these characters is "whitespace":
-- the Tab
-- newline
-- carriage return
' ' or -- the space
If you know the exact value of the node, say it is "Hello World" with a space, then a most direct XPath expression:
/top/aChild[. = 'Hello World']
will select this node.
The difficulties with specifying a value that contains whitespace, however, come from the fact that we see all whitespace characters just as ... well, whitespace and don't know if a it is a group of spaces or a single tab.
In XPath 2.0 one may use regular expressions and they provide a simple and convenient solution. Thus we can use an XPath 2.0 expression as the one below:
/*/aChild[matches(., "Hello\sWorld")]
to select any child of the top node, whose value is the string "Hello" followed by whitespace followed by the string "World". Note the use of the matches() function and of the "\s" pattern that matches whitespace.
In XPath 1.0 a convenient test if a given string contains any whitespace characters is:
not(string-length(.)= stringlength(translate(., '
','')))
Here we use the translate() function to eliminate any of the four whitespace characters, and compare the length of the resulting string to that of the original string.
So, if in a text editor a node's value is displayed as
"Hello World",
we can safely select this node with the XPath expression:
/*/aChild[translate(., '
','') = 'HelloWorld']
In many cases we can also use the XPath function normalize-space(), which from its string argument produces another string in which the groups of leading and trailing whitespace is cut, and every whitespace within the string is replaced by a single space.
In the above case, we will simply use the following XPath expression:
/*/aChild[normalize-space() = 'Hello World']
Try either this:
/Root/Child[normalize-space(text())=value without spaces]
or
/Root/Child[contains(text(),value without spaces)]
or (since it looks like your test value may be the issue)
/Root/Child[normalize-space(text())=normalize-space(value with spaces)]
Haven't actually executed any of these so the syntax may be wonky.
Locating the Attribute by value containing whitespaces using XPath
I have a input type element with value containing white space.
eg:
<input type="button" value="Import Selected File">
I solved this by using this xpath expression.
//input[contains(#value,'Import') and contains(#value ,'Selected')and contains(#value ,'File')]
Hope this will help you guys.
"x0020" worked for me on a jackrabbit based CQ5/AEM repository in which the property names had spaces. Below would work for a property "Record ID"-
[(jcr:contains(jcr:content/#Record_x0020_ID, 'test'))]
did you try #x20 ?
i've googled this up like on the second link:
try to replace the space using "x0020"
this seems to work for the guy.
All of the above solutions didn't really work for me.
However, there's a much simpler solution.
When you create the XMLDocument, make sure you set PreserveWhiteSpace property to true;
XmlDocument xmldoc = new XmlDocument();
xmldoc.PreserveWhitespace = true;
xmldoc.Load(xmlCollection);

Resources