Find HTML Tags in Properties - xpath

I need to find HTML tags inside property values. I thought it would be easy to search with a query like /jcr:root/content/xgermany//*[jcr:contains(., '<strong>')] order by @jcr:score
The characters < and > seem to cause a problem, because this query finds everything that has strong in its properties. It finds <strong>Some Text</strong> but also This is a strong man.
The Query Builder API didn't help me either.
Is it possible to solve this with an XPath or SQL query, or do I have to iterate through the whole content?

I don't fully understand why it finds This is a strong man as a result for '<strong>', but it sounds like the unexpected behavior comes from the "simple search-engine syntax" for the second argument to jcr:contains(). Apparently the < > are just being ignored as "meaningless" punctuation.
You could try quoting the search term:
/jcr:root/content/xgermany//*[jcr:contains(., '"<strong>"')]
though you may have to tweak that if your whole XPath expression is enclosed in double quotes.
Of course this will not be very robust even if it works, since you're trying to find HTML elements by searching for fixed strings, instead of actually parsing the HTML.
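To illustrate what "actually parsing the HTML" could look like once you have a property value in hand, here is a minimal sketch in Python using only the standard library. It is not JCR or AEM code; the names TagFinder and contains_tag are made up for this example, and it assumes you have already read the property value as a string.

```python
from html.parser import HTMLParser


class TagFinder(HTMLParser):
    """Collects the names of all start tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = set()

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case
        self.tags.add(tag)


def contains_tag(value, tag):
    """Return True if `value` contains an opening HTML tag named `tag`."""
    finder = TagFinder()
    finder.feed(value)
    finder.close()
    return tag.lower() in finder.tags


print(contains_tag("<strong>Some Text</strong>", "strong"))  # True
print(contains_tag("This is a strong man.", "strong"))       # False
```

Unlike a fixed-string search, this also matches variants such as <STRONG> or <strong class="x">, and it ignores the plain word "strong".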

If you know the specific jcr:primaryType and the properties you are targeting, you can do something like this:
select * from nt:unstructured where text like '%<strong>%'
I tested it, but you need to know which properties you are interested in.
This is JCR-SQL syntax.

Start using predicates like a champ; this way all of this will make sense to you!
HTML Encode &lt;strong&gt;
HTML Decimal &#60;strong&#62;
Query builder is your friend:
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%<strong>%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%3Cstrong%3E%25
Predicates: (like a CHAMP!)
path=/content/geometrixx
type=nt:unstructured
property=text
property.operation=like
property.value=%&lt;strong&gt;%
Have a go here:
http://localhost:4502/libs/cq/search/content/querydebug.html?charset=UTF-8&query=path%3D%2Fcontent%2Fgeometrixx%0D%0Atype%3Dnt%3Aunstructured%0D%0Aproperty%3Dtext%0D%0Aproperty.operation%3Dlike%0D%0Aproperty.value%3D%25%26lt%3Bstrong%26gt%3B%25
XPath:
/jcr:root/content/geometrixx//element(*, nt:unstructured)
[
jcr:like(@text, '%<strong>%')
]
SQL2 (already covered... NASTY YUK..)
SELECT * FROM [nt:unstructured] AS s WHERE ISDESCENDANTNODE([/content/geometrixx]) and text like '%<strong>%'

Although I'm sure it's entirely possible with a string of predicates, this may be heading down the wrong route. It would be better to parse the HTML when it is stored or published.
The required information would then be stored in simple properties on the node in question. The query becomes a plain property = value query instead of overly complex query syntax.
It will probably be faster too.
So if you read in your HTML with something like HTMLClient and then parse it with an OSGi service, that service can accurately save these properties for you. Every time the HTML is changed, the process would update these properties as necessary. Just some thoughts if your SQL is getting too much.

Related

XPath fails because Namespace colon in Title

I'm generating an XML report, using the JDF standard for PDFs going into a printing workflow.
There are 3 "DPart" sections, and I can use an xPath query to recognize them, but I want to grab the "Separation" attribute of each "cip4:Part". I can also get a query to find that, but it does not distinguish between the multiple "DPart"s.
<DPart End="0" ID="0003" ParentRef="0002" Start="0">
  <DPM>
    <cip4:Root>
      <cip4:Intent cip4:ProductType="ProductPart"/>
      <cip4:Production>
        <cip4:Resource>
          <cip4:Part Separation="K1"/>
          <cip4:Color cip4:ActualColorName="Black" cip4:ColorType="Normal"/>
        </cip4:Resource>
        <cip4:Resource>
          <cip4:Part Separation="S1"/>
          <cip4:Color cip4:ActualColorName="Dieline" cip4:ColorType="Normal"/>
        </cip4:Resource>
        <cip4:Resource>
          <cip4:ColorantControl ColorantOrder="K1 S1" ColorantParams="K1 S1"/>
        </cip4:Resource>
        <cip4:Resource>
          <eg:InkCoverage>
            <eg:InkCov eg:Mm2="0.000000" eg:Pct="0.000000" eg:Separation="K1"/>
            <eg:InkCov eg:Mm2="182.337538" eg:Pct="0.721209" eg:Separation="S1"/>
          </eg:InkCoverage>
        </cip4:Resource>
      </cip4:Production>
    </cip4:Root>
  </DPM>
</DPart>
I want to do something like:
/DPM[2]/*[name ()='cip4:Part'], but it's not working.
I'm in a low-code pre-press environment (Esko Automation Engine), but the system gives me tools to parse an xPath, and throw some JavaScript at it.
There are at least three reasons your XPath selects nothing:
DPM is not an immediate child of the root node
There is only one DPM, so DPM[2] won't select anything
There is no child of a DPM whose name is cip4:Part.
You also say in the narrative that there are three DParts, which implies that DPart is not actually the outermost element, as it appears to be in your sample. This makes it difficult to provide the correct XPath. However, you might be able to make a start with
(//DPM)[2]//*[name()='cip4:Part']
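If the goal is to keep the Separation values grouped by DPart, one option is to loop over the DPart elements and run a relative query inside each one. Here is a rough sketch in Python with the standard xml.etree.ElementTree module, purely as an illustration of the idea. The namespace URI, the Wrapper element, and the second DPart (ID 0004, Separation C1) are assumptions made up for this example; use whatever your real file declares.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder namespace URI, replace with the one declared in the real JDF file.
NS = {"cip4": "http://example.com/cip4"}

# Illustrative sample: the Wrapper element and the second DPart are invented.
SAMPLE = """
<Wrapper xmlns:cip4="http://example.com/cip4">
  <DPart ID="0003">
    <DPM><cip4:Root><cip4:Production>
      <cip4:Resource><cip4:Part Separation="K1"/></cip4:Resource>
      <cip4:Resource><cip4:Part Separation="S1"/></cip4:Resource>
    </cip4:Production></cip4:Root></DPM>
  </DPart>
  <DPart ID="0004">
    <DPM><cip4:Root><cip4:Production>
      <cip4:Resource><cip4:Part Separation="C1"/></cip4:Resource>
    </cip4:Production></cip4:Root></DPM>
  </DPart>
</Wrapper>
"""


def separations_by_dpart(xml_text):
    """Map each DPart ID to the Separation values of the cip4:Part elements below it."""
    root = ET.fromstring(xml_text)
    result = {}
    for dpart in root.iter("DPart"):
        # Relative search: only cip4:Part descendants of this DPart.
        # If DParts are nested, a parent would also list its children's parts.
        parts = dpart.findall(".//cip4:Part", NS)
        result[dpart.get("ID")] = [p.get("Separation") for p in parts]
    return result


print(separations_by_dpart(SAMPLE))
# {'0003': ['K1', 'S1'], '0004': ['C1']}
```

Binding the cip4 prefix to its namespace, rather than comparing name() against a string, keeps the query working even if the file uses a different prefix for the same namespace.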

DMQL2 Query Syntax for PHRets v2 Search() to include filter arguments?

(It's been a while since I've been here.)
I've been using PHRets v1 for years and understood it well enough to get by, but now I'm trying to understand the advantages of v2.6.2. I've got it all installed and the basics are working fine. My issues are mostly with the fine points of the query syntax that goes into the $rets->Search() statement. (I'm much more familiar with SQL statements.) Specifically, I'd like a query to return a list of properties, EXCLUDING those which already have the status of "Sold".
Here's where I am stuck: If I start with this
`$results = $rets->Search('Property', 'A','*',['Select' => 'LIST_8,LIST_105,LIST_15,LIST_19,listing_office_shortid']);`
That works well enough. BUT I'd like to fit in a filter like:
"LIST_15 != Sold", or "NOT LIST_15=Sold"...something like that. I don't get how to fit/type that into a PHRets Search().
I like PHRets but it is so hard to find well-organized/complete documentation about specific things like this. Thanks in advance.
As in my comment above, I've figured out that the filter goes in the third argument position ('*' in the original question). The tricky part was finding the specific "sold" code for each class of properties and placing it in that position, like so: '(LIST_15=~B4ZIT1Y75TZ)' (notice the =~ combination of characters, which means "does not equal" in this context). I've found the code strings for each of the property types (it's not clear WHY they need to be unique for each type of property: "Sold" is Sold for any type, after all). The correct query for a single-family residential property (type 'A'), at least for the MLS in which I have to search, is:
$results = $rets->Search('Property', 'A','(LIST_15=~B4ZIT1Y75TZ)',['Select' => 'LIST_8,LIST_105,LIST_15,LIST_19,listing_office_shortid']);
(again, the code to go with LIST_15 will be different for the different types of properties.) I think there is a better answer that involves more naturalistic language, but this works and I guess I will have to be satisfied with it for now. I hope this is of some use to anyone else struggling with this stuff.

How to grab a piece of data which has a different xpath on different webpages?

I am trying to grab a piece of data that is displayed at a different XPath on different pages.
If you look at the XPath of the IPA pronunciation on Wiktionary, for example at https://en.wiktionary.org/wiki/foo, you will see that it is
//*[@id="mw-content-text"]/ul[1]/li[1]/span[4]
but if I go to another word, like https://en.wiktionary.org/wiki/bar, then the XPath is
//*[@id="mw-content-text"]/ul[1]/li[2]/span[5]
I cannot think of any way to reconcile these, is there something that I am missing?
The answer is simple. Never let a tool write any XPath for you. All tools get it wrong.
Look at the document's HTML source and write the appropriate XPath yourself.
var result = document.evaluate("//*[@class = 'IPA']", document),
    elem;
while (elem = result.iterateNext()) {
    console.log(elem);
}
The above shows the simplest variant. It selects two occurrences of <span class="IPA"> on https://en.wiktionary.org/wiki/foo and quite a few more on https://en.wiktionary.org/wiki/bar.
Use a more specific expression to narrow down the results.

Map two element IDs to one page object variable

I am using the page-object gem with Watir. During testing I found a field that has the same contents and shows in the same location, but can have either of two unique IDs. Which ID it gets depends on what happens before you get to the page.
I tried using Xpaths:
select_list(:selectionSpecial, :xpath => "//select[@id='t_id9' OR @id='t_id7']")
But was met with a script error.
They are static IDs, but I want to force them into one variable, since that would allow me to use the "populate_page_with" feature.
I currently have a long-winded way of doing this, but I am fishing for a more efficient way that works with the page-object features.
Does anyone know of a way to do this?
Your approach of using xpath can work. The problem is the syntax errors in the xpath selector. It should be:
"//select[@id='t_id9' or @id='t_id7']"
Note:
The start should be a // rather than a \
Using or is case-sensitive; it has to be lower case
There was also a missing closing ' for the first id attribute
Personally, I find css and xpath selectors harder to use. I would go with the id locator with a regex. The following gives the same results, but some will find it easier to read.
select_list(:selectionSpecial, :id => /^t_id(7|9)$/)

Retrieve an xpath text contains using text()

I've been hacking away at this one for hours and I just can't figure it out. Using XPath to find text values is tricky and this problem has too many moving parts.
I have a webpage with a large table, and a section of this table contains a list of users (assignees) that are assigned to a particular unit. There are nearly always multiple users assigned to a unit, and I need to make sure a particular user is assigned to any of the units in the table. I've used XPath for nearly all of my selectors and I'm halfway there on this one. I just can't seem to figure out how to use contains with text() in this context.
Here's what I have so far:
//td[@id='unit']/span[text()='asdfasdfasdfasdfasdf (Primary); asdfasdfasdfasdfasdf, asdfasdfasdfasdf; 456, 3456'; testuser]
The XPath Query above captures all text in the particular section I am looking at, which is great. However, I only need to know if testuser is in that section.
text() gets you a set of text nodes. I tend to use it more in a context of //span//text() or something.
If you are trying to check if the text inside an element contains something you should use contains on the element rather than the result of text() like this:
span[contains(., 'testuser')]
XPath is pretty good with context. If you know exactly what text a node should have you can do:
span[.='full text in this span']
But if you want to do something like regular expressions (using exslt for example) you'll need to use the string() function:
span[regexp:test(string(.), 'testuser')]
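For comparison, here is roughly what contains(., 'testuser') does, spelled out in Python with the standard xml.etree.ElementTree module. ElementTree's XPath subset has no contains() function, so this sketch selects the spans by path and then tests the concatenated text of each one by hand. The sample row and the function name unit_has_user are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative row: the user names and the nested <b> element are invented.
ROW = (
    '<tr><td id="unit"><span>alice (Primary); bob, '
    "<b>testuser</b>; 456</span></td></tr>"
)


def unit_has_user(row_xml, user):
    """Return True if any span under td[@id='unit'] contains `user` in its text."""
    row = ET.fromstring(row_xml)
    for span in row.findall(".//td[@id='unit']/span"):
        # Like the XPath string value of ".", this joins the text of the
        # element and all of its descendants before testing for the substring.
        if user in "".join(span.itertext()):
            return True
    return False


print(unit_has_user(ROW, "testuser"))  # True
print(unit_has_user(ROW, "nobody"))    # False
```

Note that this is a plain substring test, so 'testuser' would also match 'testuser2'; the same is true of contains() in XPath.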

Resources