How can I do a wildcard search with Nokogiri? - ruby

I'm trying to get rid of all data- attributes in a document. If my document looks like
<div id="person" data-name="John Smith" data-age="32" data-location="UK">...</div>
I want to strip out the data to just leave
<div id="person">...</div>
I've tried a lot of combinations, and can a least get to things like data-name with
doc.xpath('//#data-name')
but sometimes, there will be more data-? properties, and I'd like to avoid manually adding them all. This SO answer is close, but it's not returning anything for me when I try
doc.xpath("//*[#*[contains(., 'data-')]]")

To select all attributes whose name starts with data-:
//#*[starts-with(name(), 'data-')]

That answer you found was too deeply nested, try it this way:
doc.xpath('//*[contains(., "data-")]')

Related

Is there a way to use xpath to search for a value where a part of the value dynamically changes?

I’m trying to match a value where I don’t necessarily know the whole value every time i.e. it's randomly generated. Is there a way to search for a value where a part of the value dynamically changes?
Please see my example of value I'm trying to find and my attempted xPath:
<div class="target" testid="target”>
<h2>Hi, random user</h2>
<p>To get the xpath <b>target</b> of <b>[text I don’t know]</b> in <b>[text I don’t know]</b>, you need to do the following</p>
</div>
I’ve tried the following xpath I picked up from another question but it don’t get a match:
//p[matches(.,'^To get the xpath <b>target</b> of <b>.*</b> in <b>.*</b>, you need to do the following$')]
I’ve tried different combinations with and without the bold tag but can’t seem to get it to match. truthfully I'm not sure I've got the right syntax...
Try the plain text in the second argument of matches e.g.
//p[matches(., '^To get the xpath target of .*? in .*?, you need to do the following$')]
Online sample here.
Why not to use contains() method using the fixed attribute value?
Something like:
//p[contains(.,'you need to do the following')]

Find HTML Tags in Properties

My current issue is to find HTML-Tags inside of property values. I thought it would be easy to search with a query like /jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by #jcr:score
It looks like there is a problem with the chars < and > because this query finds everything which has strong in it's property. It finds <strong>Some Text</strong> but also This is a strong man.
Also the Query Builder API didn't helped me.
Is there a possibility to solve it with a XPath or SQL Query or do I have to iterate through the whole content?
I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
If you have an specific jcr:primaryType and the targeted properties you can do something like this
select * from nt:unstructured where text like '%<strong>%'
I tested it , but you need to know the properties you are intererested in.
This is jcr-sql syntax
Start using predicates like a champ this way all of this will make sense to you!
HTML Encode <strong>
HTML Decimal <strong>
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(#text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'
Although I'm sure it's entirely possible with a string of predicates, it's possibly heading down the wrong route. Ideally it would be better to parse the HTML when it is stored or published.
The required information would be stored on simple properties on the node in question. The query will then be a lot simpler with just a property = value query, than lots of overly complex query syntax.
It will probably be faster too.
So if you read in your HTML with something like HTMLClient and then parse it with a OSGI service, that can accurately save these properties for you. Every time the HTML is changed the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.

How to use substring() with Import.io?

I'm having some issues with XPath and import.io and I hope you'll be able to help me. :)
The html code:
<a href="page.php?var=12345">
For the moment, I manage to extract the content of the href ( page.php?var=12345 ) with this:
./td[3]/a[1]/#href
Though, I would like to just collect: 12345
substring might be the solution but it does not seem to work on import.io as I use it...
substring(./td[3]/a[1]/#href,13)
Any ideas of what the problem is?
Thank's a lot in advance!
Try using this for the xpath: (Have the field selected as Text)
.//*[#class='oeil']/a/#href
Then use this for your regex:
([^=]*)$
This will get you the ISBN number you are looking for.
import.io only support functions in XPath when they return a node list
Your path expression is fine, but perhaps it should be
substring(./td[3]/a[1]/#href,14)
"Does not seem to work" is not a very clear description of what is wrong. Do you get error messages? Is the output wrong? Do you have any code surrounding the path expression you could show?
You can use substring, but using substring-after() would be even better.
substring-after(/a/#href,'=')
assuming as input the tiny snippet you have shown:
<a href="page.php?var=12345"/>
will select
12345
and taking into account the structure of your input
substring-after(./td[3]/a[1]/#href,'=')
A leading . in a path expression selects only immediate child td nodes of the current context node. I trust you know what you are doing.

Make 1 page objects Two Elements ID's to 1 page object Variable

I am using the page object Gem with Watir. During testing I found that I have a field that has the same contents that show in the same location but have separate unique ID's. The difference is before you get to the page.
I tried using Xpaths:
select_list(:selectionSpecial, :xpath => "//select[#id='t_id9' OR #id='t_id7']")
But was met with a script error.
They are static ID's but I want to force them into one variable since that would allow me to use "populate_page_with" feature.
I have a long winded way currently, but I am fishing for a more efficient way that works with the page object Features.
Does anyone know of a way to do this?
Your approach of using xpath can work. The problem is the syntax errors in the xpath selector. It should be:
"//select[#id='t_id9' or #id='t_id7']"
Note:
The start should be a // rather than a \
Using or is case-sensitive; it has to be lower case
There was also a missing closing ' for the first id attribute
Personally, I find css and xpath selectors harder to use. I would go with the id locator with a regex. The following gives the same results, but some will find it easier to read.
select_list(:selectionSpecial, :id => /^t_id(7|9)$/)

XPath Find full HTML element ID from partial ID

I am looking to write an XPath query to return the full element ID from a partial ID that I have constructed. Does anyone know how I could do this? From the following HTML (I have cut this down to remove work specific content) I am looking to extract f41_txtResponse from putting f41_txt into my query.
<input id="f41_txtResponse" class="GTTextField BGLQSTextField2 txtResponse" value="asdasdadfgasdfg" name="f41_txtResponse" title="" tabindex="21"/>
Cheers
You can use contains to select the element:
//*[contains(#id, 'f41_txt')]
Thanks to Thomas Jung I have been able to figure this out. If I use:
//*[contains(./#id, 'f41_txt')]/#id
This will return just the ID I am looking for.
I suggest to not use numbers from Id , when you are composing xpath's using partial id. Those number reprezent DINAMIC elements. And dinamic elements change over the next deploys / releases in the System Under Test.The pourpose is to UNIQUE identify elements.
Using this may be a better option or something like this, yo got the idea:
//input[contains(#id, '_txtResponse')]/#id
It worked for me like below
//*[contains(./#id, 'f41_txt')]

Resources