Filtering by multiple values using XPath - ruby

I am trying to filter an XML document of Jobs by the Company name.
I am able to pull all items that match specific Company names using:
doc.xpath("/source/job[company[text() = 'BigCorp' or text() = 'MegaCorp']]")
I am unable to do the opposite and exclude by these values, using something like:
doc.xpath("/source/job[company[text() != 'Hodes' or text() != 'Scurri']]")
Where am I going wrong? Is there a way to provide a comma-separated list of values?

Try changing the or to and:
doc.xpath("/source/job[company[text() != 'Hodes' and text() != 'Scurri']]")
If you use or, it's always going to return the job.
For example, it would return the job with the company Hodes because text() != 'Scurri' is true (and vice versa).
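The difference between the two predicates is easier to see outside XPath. The following is a minimal plain-Ruby sketch (no XML involved; the company list is made up for illustration) that applies the same "or" and "and" logic to a few strings:

```ruby
# Plain-Ruby sketch of the two predicates; the sample names are illustrative.
companies = ['BigCorp', 'Hodes', 'Scurri', 'MegaCorp']
excluded  = ['Hodes', 'Scurri']

# "!= a or != b": no value can equal both names at once,
# so at least one side is always true and nothing is filtered out.
kept_with_or = companies.select { |c| c != 'Hodes' || c != 'Scurri' }

# "!= a and != b": true only when the value matches neither name.
kept_with_and = companies.select { |c| c != 'Hodes' && c != 'Scurri' }

# The same exclusion written against a list of names.
kept_with_list = companies.reject { |c| excluded.include?(c) }

p kept_with_or    # ["BigCorp", "Hodes", "Scurri", "MegaCorp"]
p kept_with_and   # ["BigCorp", "MegaCorp"]
p kept_with_list  # ["BigCorp", "MegaCorp"]
```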
Regarding the following comment:
so normalize-space() did it!
doc.xpath("/source/job[company[normalize-space() != 'Hodes' and normalize-space() != 'Scurri']]") not sure why?
The reason normalize-space() worked is that text() also returns any whitespace surrounding the value.
For example, if you have an element like:
<company>
Hodes
</company>
or:
<company> Hodes </company>
the text() would equal "_Hodes_". (I replaced the spaces with _ to make them easier to see.)
Because of the whitespace, "_Hodes_" doesn't equal "Hodes".
Using normalize-space() strips the leading and trailing whitespace and collapses each internal run of whitespace into a single space.
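To illustrate why the raw comparison fails, here is a small Ruby sketch. The normalize_space helper is hypothetical: it only approximates what XPath 1.0 normalize-space() does to a string and is not part of Nokogiri or any other library.

```ruby
# Hypothetical helper approximating XPath 1.0 normalize-space():
# collapse runs of space, tab, CR and LF into one space, then trim both ends.
def normalize_space(str)
  str.gsub(/[ \t\r\n]+/, ' ').gsub(/\A | \z/, '')
end

# The same company name with different surrounding whitespace.
raw_values = ["\nHodes\n", ' Hodes ', 'Hodes']

raw_values.each do |raw|
  raw_match        = (raw == 'Hodes')
  normalized_match = (normalize_space(raw) == 'Hodes')
  puts "#{raw.inspect}: raw=#{raw_match}, normalized=#{normalized_match}"
end
```

Only the last value matches in its raw form; all three match once normalized.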

Related

Need XPath and XQuery query

I'm working on Xpath/Xquery to return values of multiple child nodes based on a sibling node value in a single query. My XML looks like this
<FilterResults>
<FilterResult>
<ID>535</ID>
<Analysis>
<Name>ZZZZ</Name>
<Identifier>asdfg</Identifier>
<Result>High</Result>
<Score>0</Score>
</Analysis>
<Analysis>
<Name>XXXX</Name>
<Identifier>qwerty</Identifier>
<Result>Medium</Result>
<Score>0</Score>
</Analysis>
</FilterResult>
<FilterResult>
<ID>745</ID>
<Analysis>
<Name>XXXX</Name>
<Identifier>xyz</Identifier>
<Result>Critical</Result>
<Score>0</Score>
</Analysis>
<Analysis>
<Name>YYYY</Name>
<Identifier>qwerty</Identifier>
<Result>Medium</Result>
<Score>0</Score>
</Analysis>
</FilterResult>
</FilterResults>
I need to get the values of Score and Identifier based on the Name value. I'm currently trying the query below, but it is not working as desired:
fn:string-join((
for $Identifier in fn:distinct-values(FilterResults/FilterResult/Analysis[Name="XXXX"])
return fn:string-join((//Identifier,//Score),'-')),',')
The output I'm looking for is this:
qwerty-0,xyz-0
Your question suggests some fundamental misunderstandings about XQuery, generally. It's hard to explain everything in a single answer, but 1) that is not how distinct-values works (it returns string values, not nodes), and 2) the double slash selections in your return statement are returning everything because they are not constrained by anything. The XPath you use inside the distinct-values call is very close, however.
Instead of calling distinct-values, you can assign the Analysis results of that XPath to a variable, iterate over them, and generate concatenated strings. Then use string-join to comma separate the full sequence. Note that in the return statement, the variable $a is used to concat only one pair of values at a time.
string-join(
let $analyses := FilterResults/FilterResult/Analysis[Name="XXXX"]
for $a in $analyses
return $a/concat(Identifier, '-', Score),
',')
=> qwerty-0,xyz-0
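The same select-then-concatenate pattern can be sketched in Ruby for readers who are not running an XQuery processor. This is an illustration only: the question does not name a host language, so the use of REXML here is an assumption, and the sample XML is a trimmed copy of the one in the question.

```ruby
require 'rexml/document'

# Trimmed copy of the question's XML (Result elements omitted).
xml = <<~XML
  <FilterResults>
    <FilterResult>
      <ID>535</ID>
      <Analysis><Name>ZZZZ</Name><Identifier>asdfg</Identifier><Score>0</Score></Analysis>
      <Analysis><Name>XXXX</Name><Identifier>qwerty</Identifier><Score>0</Score></Analysis>
    </FilterResult>
    <FilterResult>
      <ID>745</ID>
      <Analysis><Name>XXXX</Name><Identifier>xyz</Identifier><Score>0</Score></Analysis>
      <Analysis><Name>YYYY</Name><Identifier>qwerty</Identifier><Score>0</Score></Analysis>
    </FilterResult>
  </FilterResults>
XML

doc = REXML::Document.new(xml)

# Select only the Analysis elements whose Name is XXXX.
analyses = REXML::XPath.match(doc, "/FilterResults/FilterResult/Analysis[Name='XXXX']")

# Build one "Identifier-Score" string per selected element, relative to that element.
pairs = analyses.map do |a|
  "#{a.elements['Identifier'].text}-#{a.elements['Score'].text}"
end

result = pairs.join(',')
puts result
```

As in the XQuery version, each pair is built relative to a single Analysis element rather than with an unconstrained // selection.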

XPath HTML finding nodes

I am using HtmlAgilityPack to try to find HTML 'A' nodes that have a href attribute that contains a certain string, in my case the string '/groups/':
HtmlNodeCollection groups = source.DocumentNode.SelectNodes("//a[contains(@href, '/groups/')]");
Although the source code contains about 20 such nodes my code above is returning none which leads me to believe maybe I'm doing it incorrectly.
Is what I'm doing correct, and if not how can I select nodes that have a certain attribute that has a value that contains a certain string?
Your expression seems correct to me.
You didn't post your source document (or at least a part of it), so I'll be guessing.
The thing is, XPath 1.0 has no built-in case-insensitive comparison. If you have an <a> tag whose href attribute contains e.g. /Groups/ or /GROUPS/, it won't be matched. There is a workaround for this:
//a[contains(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '/groups/')]
As another option you could use LINQ with StringComparison.OrdinalIgnoreCase:
source.DocumentNode.Descendants("a")
.Where(a => a.GetAttributeValue("href", string.Empty)
.IndexOf("/groups/", StringComparison.OrdinalIgnoreCase) != -1
);
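What the translate() workaround does to each attribute value can be sketched in a few lines of Ruby. This is an illustration only: the href values are made up, and ascii_downcase is a hypothetical helper that mirrors the ASCII-only mapping in the XPath expression.

```ruby
# ASCII-only lowercasing, mirroring translate(@href, 'ABC...XYZ', 'abc...xyz').
def ascii_downcase(str)
  str.tr('A-Z', 'a-z')
end

# Made-up sample attribute values.
hrefs = ['/groups/ruby', '/Groups/xpath', '/GROUPS/dotnet', '/users/42']

# Plain contains(): case-sensitive, so only the all-lowercase value matches.
case_sensitive = hrefs.select { |h| h.include?('/groups/') }

# contains(translate(...)): lowercase first, then test.
case_insensitive = hrefs.select { |h| ascii_downcase(h).include?('/groups/') }

p case_sensitive    # ["/groups/ruby"]
p case_insensitive  # ["/groups/ruby", "/Groups/xpath", "/GROUPS/dotnet"]
```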

How to get H1,H2,H3,... using a single xpath expression

How can I get H1,H2,H3 contents in one single xpath expression?
I know I could do this.
//html/body/h1/text()
//html/body/h2/text()
//html/body/h3/text()
and so on.
Use:
/html/body/*[self::h1 or self::h2 or self::h3]/text()
The following expression is incorrect:
//html/body/*[local-name() = "h1"
or local-name() = "h2"
or local-name() = "h3"]/text()
because it may select text nodes that are children of unwanted:h1, different:h2, someWeirdNamespace:h3.
Another recommendation: Always avoid using // when the structure of the XML document is statically known. Using // most often results in significant inefficiencies because it causes the complete document (sub)tree rooted in the context node to be traversed.

Is it possible to exclude some of the string used to match from Ruby regexp data?

I have a bunch of strings that look, for example, like this:
<option value="Spain">Spain</option>
And I want to extract the name of the country from inside.
The easiest way I could think of to do this in Ruby was to use a regular expression of this form:
country = line.match(/>(.+)</)
However, this returns >Spain<. So I did this:
line.match(/>(.+)</).to_s.gsub!(/<|>/,"")
Works well enough, but I'd be surprised if there's not a more elegant way to do this? It seems like using a regular expression to declare how to find the thing you want, without actually wanting the enclosing strings that were used to match it to be part of the data that gets returned.
Is there a conventional approach to this problem?
The right way to deal with that string is to use an HTML parser, for example:
country = Nokogiri::HTML('<option value="Spain">Spain</option>').at('option').text
And if you have several such strings, paste them together and use search:
html = '<option value="Spain">Spain</option><option value="Canada">Canada</option>'
countries = Nokogiri::HTML(html).search('option').map(&:text)
# ["Spain", "Canada"]
But if you must use a regex, then:
country = '<option value="Spain">Spain</option>'.match('>([^<]+)<')[1]
Keep in mind that match actually returns a MatchData object and MatchData#to_s:
Returns the entire matched string.
But you can access the captured groups using MatchData#[]. And if you don't like counting, you could use a named capture group as well:
country = '<option value="Spain">Spain</option>'.match('>(?<name>[^<]+)<')['name']
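A short side-by-side sketch of the accessors mentioned above, using the sample line from the question (the variable names are illustrative):

```ruby
line = '<option value="Spain">Spain</option>'

# Numbered capture group.
numbered = line.match(/>([^<]+)</)
whole    = numbered.to_s   # entire match, delimiters included
first    = numbered[1]     # only the first capture group

# Named capture group.
named   = line.match(/>(?<name>[^<]+)</)
by_name = named[:name]

# String#[] with a regexp and a group index skips the MatchData object.
direct = line[/>([^<]+)</, 1]

puts whole    # >Spain<
puts first    # Spain
puts by_name  # Spain
puts direct   # Spain
```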

Use Xpath to find the appropriate element based on the element value

I have the following xml snippet
<ZMARA01 SEGMENT="1">
<CHARACTERISTICS_01>X,001,COLOR_ATTRIBUTE_FR,BRUN ÉCORCE,TMBR,French C</CHARACTERISTICS_01>
<CHARACTERISTICS_02>X,001,COLOR_ATTRIBUTE,Timber Brown,TMBR,Color Attr</CHARACTERISTICS_02>
</ZMARA01>
I am looking for an xpath expression that will match based on COLOR_ATTRIBUTE. It will not always be in CHARACTERISTIC_02. It could be CHARACTERISTIC_XX. Also I don't want to match COLOR_ATTRIBUTE_FR. I have been using this:
Transaction.Input_XML{/ZMAT/IDOC/E1MARAM/ZMARA01/*[starts-with(local-name(.), 'CHARACTERISTIC_')][contains(.,'COLOR_ATTRIBUTE')]}
This gets me mostly there but it matches both COLOR_ATTRIBUTE and COLOR_ATTRIBUTE_FR
Use:
contains(concat(',', ., ','), ',COLOR_ATTRIBUTE,')
This first surrounds the string value of the context node with commas, then simply tests whether the resulting string contains ',COLOR_ATTRIBUTE,'.
Thus we treat all cases (pattern at the start of the string, pattern at the end of the string, and pattern at neither the start nor the end) in the same single way.
If COLOR_ATTRIBUTE is guaranteed not to be in the first or last position, you could use [contains(.,',COLOR_ATTRIBUTE,')], otherwise you could use something like [contains(.,'COLOR_ATTRIBUTE') and not(contains(.,'COLOR_ATTRIBUTE_FR'))].
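The comma-wrapping test from the first answer can be checked quickly outside XPath. The following Ruby sketch applies the same idea to the two values from the question; has_token? is a hypothetical helper name introduced for illustration.

```ruby
# Hypothetical helper: wrap the value in commas, then look for the
# comma-delimited token, as in contains(concat(',', ., ','), ',TOKEN,').
def has_token?(csv, token)
  ",#{csv},".include?(",#{token},")
end

fr = 'X,001,COLOR_ATTRIBUTE_FR,BRUN ÉCORCE,TMBR,French C'
en = 'X,001,COLOR_ATTRIBUTE,Timber Brown,TMBR,Color Attr'

p has_token?(fr, 'COLOR_ATTRIBUTE')                  # false
p has_token?(en, 'COLOR_ATTRIBUTE')                  # true
p has_token?('COLOR_ATTRIBUTE,x', 'COLOR_ATTRIBUTE') # true (token in first position)
p has_token?('x,COLOR_ATTRIBUTE', 'COLOR_ATTRIBUTE') # true (token in last position)
```

Because both the value and the token are wrapped in commas, a match can only occur on a whole field, which is why COLOR_ATTRIBUTE_FR is rejected.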
