XPATH - how to pick up text in each html element regardless of classes - xpath

I am trying to grab some content from webpages that are not structured in a uniform fashion. What I want to do is tell the XPath to grab any content within HTML tags in the order it sees them and return the results, without having to specify div names etc., as they are different and not very uniform.
So I need to know how to say 'return any HTML content in the order that it's found from within tags', regardless of whether they are classes, ems, strong tags etc. The only experience I have had with XPath is specifying actual div names, for example:
//div[@id='tab_info']

This XPath,
string(/)
will return the string value of the entire XML or HTML document. That is, it'll return a single string of all of the text in document order, as requested.
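As a rough illustration, here is a minimal Java sketch using the JDK's built-in javax.xml.xpath API. It assumes the page is already well-formed XML (real-world HTML usually needs a tolerant parser first); the class name, method name, and sample markup are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class AllTextDemo {
    // Returns the string value of the document root: every text node,
    // concatenated in document order, whatever elements or classes wrap it.
    static String allText(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate("string(/)", new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample markup, not taken from any real page.
        String xml = "<html><body><div class=\"a\"><em>Hello</em>, "
                + "<strong>world</strong></div><p>Bye</p></body></html>";
        System.out.println(allText(xml)); // prints: Hello, worldBye
    }
}
```

Note that the text of adjacent elements runs together ("worldBye"), because string(/) adds no separators. If you need the pieces individually, selecting //text() and iterating over the resulting nodes is probably a better fit.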

Related

Are there cases of editing HTML output by Aspose.Word with CKEditor?

I am having a problem: sentences edited in CKEditor are not output to Word because they inherit the “-aw-import:ignore” attribute.
As I understand it, a tag with this attribute carries information about the original Word document for the HTML-to-Word conversion; it is treated as metadata and is not itself output to Word.
If a sentence entered in CKEditor inherits the attribute, it is mistakenly left out of the Word output.
Aspose.Words writes this "-aw-import:ignore" only when it needs to make certain elements visible in HTML that would otherwise be collapsed and hidden by web browsers e.g. empty paragraphs, space sequences, etc.
Currently we mark only the following elements with “-aw-import:ignore”:
Sequences of spaces and non-breaking spaces that are used to simulate padding on native list item (<li>) elements.
Non-breaking spaces that are used to prevent empty paragraphs from collapsing.
However, note that this list is not fixed and we may add more cases to it in the future.
Also, please note that Aspose.Words writes &#xa0; instead of &nbsp; because &nbsp; is not defined in XML. By default, Aspose.Words generates XHTML documents (i.e. HTML documents that comply with XML rules).
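To illustrate the XML point, here is a small sketch using the JDK's standard XML parser (not Aspose.Words itself). It checks whether a fragment is well-formed XML: the named entity &nbsp; is only predefined in HTML, so a plain XML parser rejects it, while the numeric character reference &#xa0; is always valid. The class and method names are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NbspDemo {
    // Returns true if the fragment parses as well-formed XML.
    static boolean parsesAsXml(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parsesAsXml("<p>&#xa0;</p>")); // numeric reference
        System.out.println(parsesAsXml("<p>&nbsp;</p>")); // undeclared named entity
    }
}
```

The first call should report true and the second false, since no DTD declares the nbsp entity here.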
I work with Aspose as Developer Evangelist.
Please find below the list of custom styles that Aspose.Words uses to save extra information in output HTML; this information is usually used for the Aspose.Words-HTML-Aspose.Words round trip. We will add a description of these entities to the documentation as soon as possible.
-aw-comment-author
-aw-comment-datetime
-aw-comment-initial
-aw-comment-start
-aw-comment-end
-aw-footnote-type
-aw-footnote-numberstyle
-aw-footnote-startnumber
-aw-footnote-isauto
-aw-headerfooter-type
-aw-bookmark-start
-aw-bookmark-end
-aw-different-first-page
-aw-tabstop-align
-aw-tabstop-pos
-aw-tabstop-leader
-aw-field-code
-aw-wrap-type
-aw-left-pos
-aw-top-pos
-aw-rel-hpos
-aw-rel-vpos
-aw-revision-author
-aw-revision-datetime

XPATH - how to get the text if an element contains a certain class

How do I grab this text here?
I am trying to grab the text here based on the href containing "#faq-default".
I tried this first, but it doesn't grab the text, only the href value itself, which is pointless:
//a/@href[contains(., '#faq-default-2')]
There will be many of these hrefs, such as default-2 and default-3, so I need to do some kind of contains query, I'd guess?
You are selecting the @href attribute node instead of the a element. So try this instead:
//a[contains(@href, '#faq-default-2')]
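Here is a hedged sketch of the same idea with the JDK's javax.xml.xpath API, matching every link whose href contains a given fragment and returning the link text rather than the attribute. It assumes well-formed markup and a fragment without single quotes; the class name, method name, and sample links are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LinkTextDemo {
    // Selects the <a> elements (not their @href attributes) and collects their text.
    static List<String> linkTexts(String xml, String hrefFragment) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Assumption: hrefFragment contains no single quote, so it can be inlined.
        String expr = "//a[contains(@href, '" + hrefFragment + "')]";
        NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample markup.
        String xml = "<div><a href=\"#faq-default-2\">Two</a>"
                + "<a href=\"#other\">Other</a>"
                + "<a href=\"#faq-default-3\">Three</a></div>";
        System.out.println(linkTexts(xml, "#faq-default")); // prints: [Two, Three]
    }
}
```

Passing the shorter fragment "#faq-default" covers default-2, default-3 and so on in one query.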

WebDriver select element that has ::before

I have 2 elements that have the same attributes but are shown one at a time on the page (when one is shown, the other disappears). The only difference between the two is that the element which is displayed will have the '::before' selector. Is it possible to use an XPath or CSS selector to retrieve the element based on its id and whether or not it has ::before?
I would also try the JavaScript approach.
::after and ::before are pseudo-elements, which allow you to insert content onto a page from CSS (without it needing to be in the HTML). While the end result is not actually in the DOM, it appears on the page as if it were: you can see it, but you can't locate it with XPath, for example (https://css-tricks.com/almanac/selectors/a/after-and-before/).
If possible, I would also suggest giving the elements different IDs. Or, if they sit in different places in the DOM, build a more specific XPath using the elements above or below them and then check whether the element is displayed.
// Read the CSS "content" value of the ::after pseudo-element via JavaScript.
String script = "return window.getComputedStyle(document.querySelector('.analyzer_search_inner.tooltipstered'),':after').getPropertyValue('content')";
Thread.sleep(3000); // crude pause so the page can settle; an explicit wait is more robust
JavascriptExecutor js = (JavascriptExecutor) driver;
String content = (String) js.executeScript(script);
System.out.println(content);

CKEDITOR How to find and wrap text in span

I am writing a CKEDITOR plugin that needs to wrap certain pieces of text in a tag. From a webservice, I have an array of items that need to be wrapped. The array is just plain text strings, such as:
["best buy", "horrible migraine", "eat cake"]
I need to find the instances of this text in the editor and wrap them in a span tag.
This is further complicated because the text may be marked up. So the HTML for "best buy" might be
"<strong>best</strong> buy"
but the text returned from the web service is stripped of any markup.
I started trying to use a CKEDITOR.htmlParser() object, and that seems like it is moderately successful. I am able to catch the parser.onText event and check if the text contains anything in my array.
But then I cannot modify that text. Modifications are not persisted back to the source html. So I think using the htmlParser() is a dead-end.
What is the best way to accomplish this task?
Oh, and as a bonus, I also do not want to lose my user's current cursor position when the changes are displayed.
Here is what I wound up doing and it seems to be working so far.
I created a text filter rule that searches through my array of items for any item that is contained (or partially contained) in the text. If so, it wraps the element in my span.
A drawback here is that I wind up with two spans for items with markup. But in my usecase, this is tolerable.
Then I set the results using:
editor.document.getBody().setHtml(results);
Because of this, I also have to strip this markup back out when this text gets read. I do this using an elements filter on editor.dataProcessor.htmlFilter.
This seems to be working well for my (so far limited) test cases.
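The core of that text rule, stripped of anything CKEditor-specific, can be sketched roughly like this. It is written in Java only to show the idea, not as plugin code, and it handles just the simple case where a phrase sits entirely inside one run of text. The class name, method name, and the "flagged" span class are placeholders I made up.

```java
import java.util.List;

class PhraseWrapper {
    // Wraps every literal occurrence of each phrase in a span.
    // "flagged" is a placeholder class name, not something CKEditor defines.
    static String wrapPhrases(String text, List<String> phrases) {
        String result = text;
        for (String phrase : phrases) {
            if (phrase.isEmpty()) {
                continue; // an empty phrase would match between every character
            }
            // String.replace is a literal (non-regex) replacement.
            result = result.replace(phrase, "<span class=\"flagged\">" + phrase + "</span>");
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> phrases = List.of("best buy", "horrible migraine", "eat cake");
        System.out.println(wrapPhrases("I went to best buy to eat cake.", phrases));
    }
}
```

A phrase split by markup, such as "<strong>best</strong> buy", reaches a text rule as two separate pieces, so this sketch would not match it; that is the partial-match case mentioned above. It also does not guard against a phrase that happens to match the inserted span markup itself.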

XPath "Not". Ignore branches with a specific tag

I have loaded a web page into the HTML Agility Pack and have a DOM. I want to use XPATH to pull out all of the text on the page (but not the javascript found within <script> tags).
I figure I need a //text() and then a 'not' to ignore any tag within the branch that has a <script> in it.
I have tried
doc.DocumentNode.SelectNodes("//text()[not(self::script)]")
and
doc.DocumentNode.SelectNodes("//text()[not(script)]")
but neither works. An example of the XPath property of a node that they return is (notice the script):
/html[1]/body[1]/div[2]/div[4]/div[1]/div[1]/div[1]/div[3]/script[1]/#text[1]
I have consulted with both of these posts.
Is it possible to do 'not' matching in XPath?
Grab all text from html with Html Agility Pack (This is a good post but it brings out the JS)
Any suggestions?
Your first attempt rejects all text nodes that are script elements, and your second rejects all text nodes that have script node children. Of course, in both cases, the condition is never true.
You haven't explained your requirements clearly, but I guess you want to reject all text nodes that have script elements as their parent, which would be
//text()[not(parent::script)]
or
//*[not(self::script)]/text()
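As a quick check of the first expression, here is a minimal sketch that uses Java's built-in javax.xml.xpath engine rather than the HTML Agility Pack. It assumes well-formed markup, and it drops whitespace-only text nodes for readability; the class name, method name, and sample page are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NonScriptTextDemo {
    // Collects every text node whose parent is not a <script> element.
    static List<String> visibleText(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//text()[not(parent::script)]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            String value = nodes.item(i).getNodeValue().trim();
            if (!value.isEmpty()) {
                texts.add(value); // skip whitespace-only nodes
            }
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample page.
        String xml = "<html><body><p>Hi</p><script>var x = 1;</script>"
                + "<div>There</div></body></html>";
        System.out.println(visibleText(xml)); // prints: [Hi, There]
    }
}
```

The predicate tests the parent axis because a text node can never itself be a script element or have element children, which is why the two original attempts filtered nothing out.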
