How do I exclude text within a node, when it's relative? - xpath

I need to exclude text after a pipe written in a node.
Ex:
<meta property="og:xx" content=" this is my text | this is text I need to exclude">
I'm not sure how to exclude the text after |.
I know what I would say to include all the text in the div, but no idea how I would only include the text preceding the pipe, or exclude the text after/including the pipe.
//meta[@property='og:xx']/@content
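One possible approach, assuming the engine supports XPath 1.0 string functions: substring-before(//meta[@property='og:xx']/@content, '|') returns the part of the attribute value before the first pipe, and wrapping it in normalize-space() trims the surrounding spaces. Python's standard-library ElementTree does not evaluate XPath functions, so the sketch below selects the attribute and does the split in Python instead. The sample markup is a self-closed copy of the tag from the question, and the helper name text_before_pipe is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed (self-closed) copy of the <meta> tag from the question.
SAMPLE = (
    '<head><meta property="og:xx" '
    'content=" this is my text | this is text I need to exclude"/></head>'
)

def text_before_pipe(markup):
    """Return the content attribute up to (not including) the first pipe."""
    root = ET.fromstring(markup)
    meta = root.find(".//meta[@property='og:xx']")
    content = meta.get("content", "")
    # Keep everything before the first "|" and trim surrounding whitespace.
    return content.split("|", 1)[0].strip()

print(text_before_pipe(SAMPLE))  # this is my text
```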

Related

Extract text from HTML, exclude text in <small> tags

I want to extract text from HTML, without the <small> tag:
<h1>THE BIG TEXT<small>the small text</small></h1>
I can extract "THE BIG TEXT the small text" with //h1//text(), but how can I extract "THE BIG TEXT" only, without "the small text"?
What XPath do I have to use?
The following XPath should work:
//h1/text()
It will find the immediate text inside the h1 tag not the child tag.
It extracts "THE BIG TEXT".
But if you want to extract all text inside h1 including the child tags:
//h1//text()
It extracts "THE BIG TEXT the small text".
Note the difference between the single and double slashes. A single / selects only direct children, while a double // selects descendants at any depth, including text nested inside child elements.
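The same distinction can be sketched with Python's standard-library ElementTree, used here as a stand-in because it does not support the text() node test directly. In this sketch, .text plays the role of //h1/text() for this particular markup (it holds only the text before the first child element), and itertext() plays the role of //h1//text().

```python
import xml.etree.ElementTree as ET

h1 = ET.fromstring("<h1>THE BIG TEXT<small>the small text</small></h1>")

# Text directly inside <h1>, before its first child element.
direct_text = h1.text

# Every piece of text inside <h1>, including text in nested elements.
all_text = list(h1.itertext())

print(direct_text)  # THE BIG TEXT
print(all_text)     # ['THE BIG TEXT', 'the small text']
```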

How to prevent trimming whitespaces in JSP page

Example
String var="welcome to JSP";
<c:out value="${test}"/>
The c:out tag above appears to trim the spaces from the string var. I also tried displaying the same variable on the JSP without JSTL, and the whitespace was still removed.
The JSTL doesn't trim white spaces at all. Look at the generated HTML by right-clicking in the page and choosing "view page source", and you'll see that the white spaces are there.
HTML does that. One white space or 100 successive ones are rendered the same way in HTML (as a single space), unless you use an element or CSS style that makes them significant, such as <pre> or white-space: pre. For example:
<pre> Now white space is
relevant
</pre>

What XPath selects CDATA content when some child elements exist

Let's say I have an XML that looks like this:
<a>
<b>
<![CDATA[some text]]>
<c>xxx</c>
<d>yyy</d>
</b>
</a>
I can't find a way to get "some text". Any idea?
If I'm using "a/b" it returns also xxx and yyy
If I'm using "a/b/text()" it returns nothing
You can't actually select a CDATA section: CDATA is just a way of telling the parser to avoid unescaping special characters, and your input document looks to XPath exactly the same as:
<a>
<b>
some text
<c>xxx</c>
<d>yyy</d>
</b>
</a>
(Having said that, if you're using DOM, then some DOM XPath engines fail to implement the spec correctly, and treat the CDATA content as a separate text node from the text outside the CDATA section).
The XPath expression a/b/text() should select three text nodes, of which the first contains "some text" along with surrounding whitespace.
With the XPath data model the path /a/b/text()[1] should select a text node with the string value
some text
that is a line break, some spaces, the text some text followed by a line break and some spaces.
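A minimal sketch of this behaviour, using Python's standard-library ElementTree as an assumed stand-in for an XPath engine: the parser hands the CDATA content over as ordinary character data, so it ends up in the first text node of b together with the surrounding line breaks.

```python
import xml.etree.ElementTree as ET

XML = """<a>
<b>
<![CDATA[some text]]>
<c>xxx</c>
<d>yyy</d>
</b>
</a>"""

root = ET.fromstring(XML)
b = root.find("b")

# The CDATA section is not a separate node: its content is merged with
# the whitespace around it into the text that precedes <c>.
first_text = b.text

print(repr(first_text))
print(first_text.strip())  # some text
```

Stripping the whitespace in the host language (or with normalize-space() where the engine supports it) leaves just the CDATA content.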

scrapy: Remove elements from an xpath selector

I'm using scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and subelements of a certain div, EXCEPT a few at the beginning, and a few at the end.
Here's the gist.
<div id="easy-id">
<stuff I don't want>
text I don't want
<div id="another-easy-id" more stuff I don't want>
text I want
<stuff I want>
...
<more stuff I want>
text I want
...
<div id="one-more-easy-id" more stuff I *don't* want>
<more stuff I *don't* want>
NB: The indenting implies closing tags, so everything here is a child of the first div -- the one with id="easy-id"
Because text and nodes are mixed, I haven't been able to figure out a simple xpath selector to grab the stuff I want. At this point, I'm wondering if it's possible to retrieve the result from xpath as an lxml.etree.elementTree, and then hack at it using the .remove() method.
Any suggestions?
I am guessing you want everything from the div with ID another-easy-id up to but not including the one-more-easy-id div.
Stack Overflow has not preserved the indenting, so I do not know where the first div element ends, but I'm going to guess it ends before the text.
In that case you might want
//div[@id = 'another-easy-id']/following::node()
[not(preceding::div[@id = 'one-more-easy-id']) and not(@id = 'one-more-easy-id')]
If this is XHTML you'll need to bind some prefix, h, say, to the XHTML namespace and use h:div in both places.
EDIT: Here's the syntax I went with in the end. (See comments for the reasons.)
//div[@id='easy-id']/div[@id='one-more-easy-id']/preceding-sibling::node()[preceding-sibling::div[@id='another-easy-id']]
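The same "everything between two marker siblings" idea can be sketched without XPath axes, using plain iteration over Python's standard-library ElementTree. This swaps the XPath technique for a loop, and it is not the scrapy or lxml API. The sample markup is a hypothetical, well-formed stand-in for the page in the question, and the helper name between is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page described in the question.
doc = ET.fromstring(
    '<div id="easy-id">'
    '<span>unwanted</span>text I do not want'
    '<div id="another-easy-id">unwanted</div>text I want'
    '<p>stuff I want</p>more text I want'
    '<div id="one-more-easy-id">unwanted</div>'
    '<span>unwanted</span>'
    '</div>'
)

def between(parent, start_id, end_id):
    """Collect child elements and text strictly between two marker children."""
    wanted = []
    inside = False
    for child in parent:
        if child.get("id") == end_id:
            break
        if inside:
            wanted.append(child)
        elif child.get("id") == start_id:
            inside = True
        # ElementTree stores the text that follows an element in its .tail.
        if inside and child.tail and child.tail.strip():
            wanted.append(child.tail.strip())
    return wanted

result = between(doc, "another-easy-id", "one-more-easy-id")
labels = [r if isinstance(r, str) else r.tag for r in result]
print(labels)  # ['text I want', 'p', 'more text I want']
```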

Parsing through text to find html tags in Ruby 1.9.x

I want to be able to match text in between two tags, starting at an opening tag and ending in a closing tag.
Say I have this block of text in a variable called 'text':
some text some text some text some text some text
<some_tag>
some text some text some text some text some text
</some_tag>
some text some text some text some text some text
I want to parse the contents of 'text', doing nothing until it finds an opening tag (in this case 'some_tag'). Once it finds an opening tag, I want it to capture everything until the tag closes.
I've been fooling around with blocks and regular expressions for about an hour now and cannot seem to figure out a good way to work this out.
I'd appreciate any and all pointers, thanks!
You should use a parser for HTML. Regex and HTML tend to make a volatile mix that leads to insanity in large doses.
Using Nokogiri:
require 'nokogiri'
html = <<EOT
some text some text some text some text some text
<p>
some text some text some text some text some text
</p>
some text some text some text some text some text
EOT
doc = Nokogiri::HTML::DocumentFragment.parse(html)
puts doc.search('p').map { |n| n.inner_text }
>> some text some text some text some text some text
This is searching through the HTML fragment, looking for <p> tags. For each one it finds it'll extract the inner text.
I'm using Nokogiri's CSS mode, by using "p". I could use XPath instead, but CSS is understood by more people.
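For comparison, here is a rough Python sketch of the same parser-based idea using the standard library's html.parser module rather than Nokogiri. The class name TagTextCollector is hypothetical. It ignores everything until the requested opening tag appears, then captures the text until that tag closes.

```python
from html.parser import HTMLParser

class TagTextCollector(HTMLParser):
    """Collect the text found inside each occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0      # how many matching tags are currently open
        self.chunks = []    # one string per outermost matching element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Text outside the tag is ignored; text inside is accumulated.
        if self.depth > 0:
            self.chunks[-1] += data

text = (
    "outer text before\n"
    "<some_tag>\n"
    "inner text\n"
    "</some_tag>\n"
    "outer text after\n"
)

collector = TagTextCollector("some_tag")
collector.feed(text)
collector.close()
captured = [chunk.strip() for chunk in collector.chunks]
print(captured)  # ['inner text']
```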
