XPath function to combine text nodes and return them in one Google Spreadsheet cell

I'm a complete rookie in XPath (I don't even know how to paste proper HTML into this post ;-p) and I need some help. I would like to retrieve the text that is in quotation marks and put it into a single cell in Google Spreadsheet. Right now I can only retrieve this text into separate cells.
http://imm.io/oLYI

Does string(//tr[@class='darkGreen']/td[2]) give you what you want? Your XML fragment looks incomplete, and I'm not sure whether you only want the contents of the second cell, so it's a wild guess whether this fits your need.
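The string() call above returns the concatenation of all text nodes under the selected cell, which is why it yields a single value. A minimal Python sketch of that behaviour, assuming a made-up table fragment (the original screenshot is not reproduced here) and using only the limited XPath subset of the standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real page structure is an assumption.
SAMPLE = """
<table>
  <tr class="darkGreen">
    <td>label</td>
    <td>"first part"<br/>"second part"</td>
  </tr>
</table>
"""

def cell_text(markup: str) -> str:
    """Return all text under the second td of the darkGreen row as one string."""
    root = ET.fromstring(markup)
    cell = root.find(".//tr[@class='darkGreen']/td[2]")
    if cell is None:
        return ""
    # itertext() walks every text node below the cell, like XPath string().
    return "".join(cell.itertext())

print(cell_text(SAMPLE))
```

The separate text nodes around the br element come back as one string, which is the shape needed for a single spreadsheet cell.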

Related

Why does IMPORTXML with XPATH return unexpected blank row in addition to expected result?

I'm importing into Google Sheets with IMPORTXML with the following XPATH:
=IMPORTXML(A2;"//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li")
A2 containing the URL (https://stt.wiki/wiki/20th_Century_Pistol).
From the website I want to import the list entries in the "Basic" column and "Crafted From" row of the table.
There are only two list entries in this section of the table:
"x1 Basic Security Codes" and
"x4 Basic Casing"
Therefore, I expected to get only those two list entries as rows in my sheet.
Instead, I got an additional blank row above those two entries. When I change "td[1]" to "td[3]" in the XPATH query however, there are no extra blanks.
I don't understand where the additional blank row is coming from and how I can avoid it.
Google Sheet with desired and actual result
Looking at the HTML of the URL, there are 2 li tags in the ul tag, so I think your XPath is correct. I suspect the sup tag might affect the result, but I'm not sure whether it is the direct cause. So I would like to propose adding the style attribute of the li to your XPath as follows.
Modified xpath:
Please modify your XPath as follows.
From:
//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li
To:
//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[@style='white-space:nowrap']
By adding [@style='white-space:nowrap'], only the li elements with style='white-space:nowrap' are retrieved.
Result:
The formula is =IMPORTXML(A1;"//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[@style='white-space:nowrap']"). Please put the URL in cell "A1".
Note:
Also, you can use the XPath //*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[position()>1].
To complete @Tanaike's very neat answer, here is another expression:
=IMPORTXML(A2;"//th[contains(.,'Crafted')]/following::td[1]//li[contains(@style,'white')]")
If a blank line is added, it's because Google Sheets parses an additional blank li element containing a @style attribute.
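The effect of the attribute predicate can be tried outside Google Sheets. The sketch below is only an illustration in Python: the list markup is invented, and it assumes the stray item is an empty li that does not carry the white-space:nowrap style.

```python
import xml.etree.ElementTree as ET

# Hypothetical list; the empty first item stands in for the unexpected blank row.
SAMPLE = """
<ul>
  <li></li>
  <li style="white-space:nowrap">x1 Basic Security Codes</li>
  <li style="white-space:nowrap">x4 Basic Casing</li>
</ul>
"""

def item_texts(root, path):
    """Collect the stripped text of every element matched by path."""
    return ["".join(li.itertext()).strip() for li in root.findall(path)]

root = ET.fromstring(SAMPLE)
all_items = item_texts(root, "./li")
styled_items = item_texts(root, "./li[@style='white-space:nowrap']")

print(all_items)     # includes the blank entry
print(styled_items)  # only the two styled entries
```

Skipping the first match, as li[position()>1] does, corresponds to all_items[1:] here.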

how to select the second <p> element using Xpath

I am trying to scrape full reviews from this webpage (full reviews, i.e. the text shown after clicking the 'Read More' button). I am doing this with RSelenium. I am able to select and extract text from the first <p> element, using the code
reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@id][1]")
which returns the shortened review text.
But I am not able to extract the full review text using the code
reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@id][2]")
or
reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@itemprop = 'reviewBody']")
Both return blank list elements. I don't know what is wrong. Please help me.
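Drop the double slash and use the explicit descendant axis. //p[@id][2] selects every p that is the second p with an id attribute among the children of its own parent, while the expression below selects the second such p in the whole document:
/descendant::p[@id][2]
(see the note from W3C document on XPath I mentioned in this answer)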
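The difference between the two readings can be made concrete with a small sketch. Python's standard library does not implement the descendant axis, so the two functions below emulate each expression by hand; the markup and the function names are invented for illustration and are not taken from the page in question.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each review paragraph sits in its own wrapper.
SAMPLE = """
<body>
  <div class="srm"><p id="r1">short review</p></div>
  <div class="srm"><p id="r2">full review</p></div>
</body>
"""

def per_parent_nth(root, n):
    """Emulates //p[@id][n]: the n-th p-with-id child of every parent."""
    found = []
    for parent in root.iter():
        candidates = [c for c in parent if c.tag == "p" and "id" in c.attrib]
        if len(candidates) >= n:
            found.append(candidates[n - 1])
    return found

def document_nth(root, n):
    """Emulates /descendant::p[@id][n]: the n-th p-with-id in document order."""
    matches = [e for e in root.iter("p") if "id" in e.attrib]
    return matches[n - 1:n]

root = ET.fromstring(SAMPLE)
print([p.get("id") for p in per_parent_nth(root, 2)])  # no parent has two such p
print([p.get("id") for p in document_nth(root, 2)])    # second p in the document
```

With this layout the per-parent reading matches nothing, which would explain an empty result list, while the document-order reading finds the second paragraph.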
As you're dealing with a list, you should first find the list items, e.g. using the CSS selector
div.srm
Based on these elements, you can then search inside the list items, e.g. using the CSS selector
p[itemprop='reviewBody']
Of course you can also do it in a single expression, but that is not quite as neat imho:
div.srm p[itemprop='reviewBody']
Or in XPath (which I wouldn't recommend):
//div[@class='srm']//p[@itemprop='reviewBody']
If neither of these works for you, then the problem must be somewhere else.

CKEDITOR How to find and wrap text in span

I am writing a CKEDITOR plugin that needs to wrap certain pieces of text in a tag. From a webservice, I have an array of items that need to be wrapped. The array is just plain text strings, such as:
["best buy", "horrible migraine", "eat cake"]
I need to find the instances of this text in the editor and wrap them in a span tag.
This is further complicated because the text may be marked up. So the HTML for "best buy" might be
"<strong>best</strong> buy"
but the text returned from the web service is stripped of any markup.
I started trying to use a CKEDITOR.htmlParser() object, and that seems like it is moderately successful. I am able to catch the parser.onText event and check if the text contains anything in my array.
But then I cannot modify that text. Modifications are not persisted back to the source html. So I think using the htmlParser() is a dead-end.
What is the best way to accomplish this task?
Oh, and as a bonus, I also do not want to lose my user's current cursor position when the changes are displayed.
Here is what I wound up doing and it seems to be working so far.
I created a text filter rule that searches through my array of items for any item that is contained (or partially contained) in the text. If so, it wraps the element in my span.
A drawback here is that I wind up with two spans for items with markup. But in my use case, this is tolerable.
Then I set the results using:
editor.document.getBody().setHtml(results);
Because of this, I also have to strip this markup back out when this text gets read. I do this using an elements filter on editor.dataProcessor.htmlFilter.
This seems to be working well for my (so far limited) test cases.
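The matching step of such a text filter rule can be sketched independently of the editor. The Python below is not CKEditor code and does not use its API; it only illustrates wrapping whole-phrase matches inside one plain text chunk, and the function name and CSS class are hypothetical. Phrases split across markup (the "partially contained" case) are not handled.

```python
import re

def wrap_phrases(text, phrases, css_class="flagged"):
    """Wrap every occurrence of any phrase in a span (case-insensitive)."""
    # Longest phrases first, so a longer phrase wins over one of its prefixes.
    ordered = sorted((p for p in phrases if p), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    return pattern.sub(
        lambda m: f'<span class="{css_class}">{m.group(0)}</span>', text
    )

items = ["best buy", "horrible migraine", "eat cake"]
print(wrap_phrases("I went to Best Buy to eat cake.", items))
```

Escaping each phrase with re.escape keeps characters such as "." or "(" in the web service data from being read as regular expression syntax.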

BIRT - expression builder: HTML Table + Dataset field does not evaluate

I am new to BIRT and it's awesome, but I am unable to make a bullet point list where each bullet point is a field from my dataset. Without any HTML the dataset field evaluates, but as soon as I add an HTML tag it simply shows the name of the field.
This
<ul>
<li><value-of> row["SRRI"] </value-of></li>
</ul>
Shows:
row["SRRI"]
But I want it to show the value of row["SRRI"] instead. (Omitting "" does not change the output for me.)
I have been searching for a solution for a few hours now. I guess it's fairly simple, but I cannot find out how to tell BIRT that this is not a string.
It sounds like you have a list, and you want to lead each entry with a bullet point. In your report design, you can put a cell in front of your row["SRRI"] value and put whatever bullet image you want there.

XPath Expression

I am new to XPath. I have the HTML source of the webpage
http://london.craigslist.co.uk/com/1233708939.html
Now I want to extract the following data from the above page:
Full Date
Email - just below the date
I also want to check for the existence of the button "Reply to this post" on the page
http://sfbay.craigslist.org/sfc/w4w/1391399758.html
Can anyone help me write the three XPath expressions for the above three items?
You don't need to write these yourself, or even figure them out yourself. If you use the Firebug plugin, go to the page, right click on the element you want, click 'Inspect element' and Firebug will pop up the HTML in a viewer at the bottom of your browser. Right click on the desired element in the HTML viewer and click on 'Copy XPath'.
That said, the XPath expression you're looking for (for #3) is:
/html/body/div[4]/form/button
...obtained via the method described above.
I noticed that the DTD is HTML 4.01 Transitional and not XHTML for the first link, so there's no guarantee that this is a valid XML document, and it may not be loaded correctly by an XML parser. In fact, I see several tags that aren't properly closed (e.g. <hr>).
I don't know the first one offhand, and the third one was just answered by Alex, but the second one is /html/body/a[1] (XPath positions start at 1, so a[0] would match nothing).
As for your first page, it's just impossible to do because this is not the way XPath works. In order for an XPath expression to select something, that "something" must be a node (such as an element).
The second page is fairly easy, but you need an "id" attribute in order to do that (or anything else that makes sure your button is unique). For example, if you are sure the text "Reply to this post" correctly identifies the button, just test for it in the predicate. A bare string such as ["Reply to this post"] is always true and would match every button, so compare it against the button's content:
//button[contains(., 'Reply to this post')]
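An existence check of this kind can be sketched in Python with the standard library. The markup is invented and assumes the control really is a button element with that exact label. ElementTree does not implement contains(), so the sketch uses the exact-match predicate [.='text'] that it does support (Python 3.7 and later).

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the real page may use a different element.
SAMPLE = (
    "<html><body><form>"
    "<button>Reply to this post</button>"
    "</form></body></html>"
)

def has_button(markup: str, label: str = "Reply to this post") -> bool:
    """Return True if some button's full text equals label.

    The label must not contain a single quote, because it is placed
    inside a quoted XPath string.
    """
    root = ET.fromstring(markup)
    return root.find(f".//button[.='{label}']") is not None

print(has_button(SAMPLE))
print(has_button("<html><body><button>Other</button></body></html>"))
```

Real-world HTML is often not well-formed XML, so a lenient HTML parser would be needed before a check like this can run on a live page.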

Resources