ImportXML and Google Spreadsheet issue - xpath

I'm 'scraping' a few product descriptions from a website and bringing them into a google spreadsheet using importXML.
It has gone fairly smoothly, but there is one major snag that I would love to correct, and I need your help!
The website in question prohibits those posting products from including contact information (usually email addresses) in the product description. Sometimes people ignore the rule and include the contact information anyway. When this occurs, the website automatically hides the contact information in the product description, replacing it with [obscured], as in "...please feel free to contact me at [obscured]" or something close to that. The [obscured] appears in a different colour and is obviously treated differently by the website.
When these product descriptions are imported into my spreadsheet, the [obscured] causes the scraped text to be 'bumped': the description text stops prior to [obscured], the word [obscured] appears in an adjacent cell all by itself, and the description text that follows [obscured] continues in a third cell.
This separation ruins the alignment and logic in my spreadsheet, as product descriptions having an [obscured] word become broken up and misaligned from those that do not.
I would love to have my importXML or XPath accommodate this and essentially 'ignore' the [obscured]. I don't mind it being included in the scraped description, but I want to stop the breaking-up into 3 separate adjacent cells.
The [obscured] is part of a 'span' that appears to occasionally lie within the description class 'desc' I am calling.
Is there a way to do this? Can I instruct importXML to import the 'desc' class but ignore/omit the span that might sometimes appear within it?
I've included the source code (inspect element in Safari) below:
<div class="desc descFull collapsed">
<span class="obscureText">[obscured]</span>
As mentioned, this span only occurs in some of the product descriptions, not all of them.
Does anyone know what kind of language I would use in the importXML to call the 'desc' but ignore the 'span', or prevent the splitting into 3 cells when the [obscured] is encountered?
My current call is
=ImportXML(A1,"//div[@class='desc']")
which works fine, unless the [obscured] span is encountered.
Thank you for any help you can give!

Unless Google Drive is breaking the definition of XPath, XPath can't be used to query CSS classes the way CSS selectors can.
The XPath //div[@class='desc'] will only match a div element whose class attribute is literally "desc". It won't match "desc descFull collapsed", since the string is different.
As for excluding the text of the obscured node, that would require finding the text nodes and excluding one, which would return a node-set, not a string, and you wouldn't be able to concatenate these back together using XPath 1.0. If Google Drive uses XPath 2.0 it might be possible, using the techniques in that linked question.
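To make both points concrete, a hedged sketch (my addition, not from the original answer), with the URL in A1 as in the question. contains() copes with the multi-valued class attribute, and a spreadsheet-side JOIN() glues the split text nodes back into one cell, since XPath 1.0 can't concatenate them:
=IMPORTXML(A1,"//div[contains(@class,'desc')]")
=JOIN(" ",IMPORTXML(A1,"//div[contains(@class,'desc')]//text()"))
Note that contains() is a substring test, so a class like 'descFull' would also match; the strict form is contains(concat(' ',normalize-space(@class),' '),' desc ').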

Related

Highlighting search keyword in Slickgrid

I'm developing a log-viewer web program with Vue.js.
I receive log data with Ajax and display it with SlickGrid.
What I need to do is highlight the keyword after searching.
I found some examples that highlight whole cells or rows, but couldn't find one that highlights a specific keyword within a cell.
For example, when I search for the word 'cat', SlickGrid shows the cells that include 'cat', and I need to highlight the word 'cat' within each cell.
Does anyone know how to do this, or have any examples?
Thank you.
You'll need to write a custom formatter. See here for an example page. Make sure you're using the 6pac repo - it's up to date, the MLeibman repo is unmaintained now.
Re highlighting a word, you'll need to return HTML from the formatter, and just have a special span to highlight the word, e.g.:
we will build a <span class="hilight">wall</span>
It's a tricky business finding a full word, that is, making sure it's not part of another word, if that's what you want, e.g.:
did you buy the <span class="hilight">wall</span>paper yet?
That's a whole 'nother Google search in itself.
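For concreteness, a minimal sketch of such a formatter; the searchTerm variable and the hilight class are my assumptions, not from the original answer. SlickGrid formatters receive (row, cell, value, columnDef, dataContext) and return an HTML string:
function hilightFormatter(row, cell, value, columnDef, dataContext) {
  if (value == null) return '';
  // escape the raw value first, since the return value is injected as HTML
  var text = String(value).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
  if (!searchTerm) return text;
  // escape regex metacharacters in the search term, then wrap each match
  var safe = searchTerm.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  return text.replace(new RegExp('(' + safe + ')', 'gi'), '<span class="hilight">$1</span>');
}
Attach it to a column definition, e.g. { id: 'message', field: 'message', formatter: hilightFormatter }.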

CKEDITOR How to find and wrap text in span

I am writing a CKEDITOR plugin that needs to wrap certain pieces of text in a tag. From a web service, I have an array of items that need to be wrapped. The array is just plain text strings, such as:
["best buy", "horrible migraine", "eat cake"]
I need to find the instances of this text in the editor and wrap them in a span tag.
This is further complicated because the text may be marked up. So the HTML for "best buy" might be
"<strong>best</strong> buy"
but the text returned from the web service is stripped of any markup.
I started trying to use a CKEDITOR.htmlParser() object, and that seems like it is moderately successful. I am able to catch the parser.onText event and check if the text contains anything in my array.
But then I cannot modify that text. Modifications are not persisted back to the source html. So I think using the htmlParser() is a dead-end.
What is the best way to accomplish this task?
Oh, and as a bonus, I also do not want to lose my user's current cursor position when the changes are displayed.
Here is what I wound up doing and it seems to be working so far.
I created a text filter rule that searches through my array of items for any item that is contained (or partially contained) in the text. If so, it wraps the element in my span.
A drawback here is that I wind up with two spans for items with markup, but in my use case this is tolerable.
Then I set the results using:
editor.document.getBody().setHtml(results);
Because of this, I also have to strip this markup back out when this text gets read. I do this using an elements filter on editor.dataProcessor.htmlFilter.
This seems to be working well for my (so far limited) test cases.
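A rough sketch of that filter-rule approach, assuming CKEditor 4's htmlParser API; the phrases array and the marked class are stand-ins, and (as noted above) a phrase split across markup like <strong>best</strong> buy won't match a plain-text rule:
var phrases = ['best buy', 'horrible migraine', 'eat cake'];
var fragment = CKEDITOR.htmlParser.fragment.fromHtml(editor.getData());
var writer = new CKEDITOR.htmlParser.basicWriter();
var filter = new CKEDITOR.htmlParser.filter();
filter.addRules({
  text: function(text) {
    phrases.forEach(function(p) {
      // wrap each occurrence of the phrase in a span
      text = text.replace(new RegExp('(' + p + ')', 'gi'), '<span class="marked">$1</span>');
    });
    return text;
  }
});
fragment.writeHtml(writer, filter);
editor.document.getBody().setHtml(writer.getHtml());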

Google Spreadsheet xpath scraping

So I'm not a professional programmer, but I'm trying to scrape data off the Reuters homepage and import it into google spreadsheets.
I know that there have already been questions answered about scraping from Reuters; however, they didn't help me.
I want data from this page: http://www.reuters.com/finance/stocks/financialHighlights?symbol=9983.T
Specifically, if you scroll down, there's a lot of data on the company's financials, packed into tables. I need specific values out of those tables.
So naturally my question to you is, how can I get specific values out of the tables? For instance, I want the first value out of the line that's labelled "Net Profit Margin (TTM)". The value should be 7.30.
So I got the XPath by using the Google Chrome developer tools: right-click on the element and select "Copy XPath". Since I'm not a programmer, I don't know any other way of arriving at a specific element in the tables.
I tried the following function in google spreadsheets:
=IMPORTXML(URL as written above,"//*[@id='content']/div[2]/div/div[2]/div[1]/div[13]/div[2]/table/tbody/tr[14]/td[2]")
but it returns
"#N/A - Error, imported content is empty"
What can I do to get the value?
The IMPORTXML() function of Google Sheets is known to be incredibly buggy and it is not surprising if people dig up real errors in it. Still, we don't know exactly why your original XPath expression does not work.
I want the first value out of the line that's labelled "Net Profit Margin (TTM)". The value should be 7.30.
The path expression you got from the developer tools heavily relies on positioning, and not at all on actual values.
If you can rely on the text content of the first cell in this row, use
=IMPORTXML("http://www.reuters.com/finance/stocks/financialHighlights?symbol=9983.T","//tr[contains(td[1],'Net Profit Margin (TTM)')]/td[2]")
which means
Select all tr elements where the text content of the first td child element contains "Net Profit Margin (TTM)", and select the second td of that tr.
and the result will be
7.3
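If the cell text carries stray whitespace, a normalize-space() variant of the same expression (my addition, not part of the original answer) is more forgiving:
=IMPORTXML("http://www.reuters.com/finance/stocks/financialHighlights?symbol=9983.T","//tr[contains(normalize-space(td[1]),'Net Profit Margin (TTM)')]/td[2]")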

Ruby on Rails: gem or plugin for definitions on a medical form

I am creating a website with a rather lengthy medical questionnaire. The users/patients need to be able to hover or click on a medical term and see the definition.
What are ways to accomplish this in RoR? There are similar plugins for WordPress, but I haven't found any in Rails.
My idea is have a Term model, that has attributes "word" and a "definition". Then in my layouts, I have to somehow scan the page and output the definition.
There are multiple approaches; I use jquery-tooltip. I'm in the same boat, except in my case it's insurance forms instead of medical forms.
I checked out a few different approaches. I really like the tooltip feature coming soon to jquery-ui 1.9. Until it's officially released, I'm using jquery-tooltip.
They both work the same, give an element a title:
<div id='q12' title='This is number 12'>
What is this?
</div>
Then
$('#q12').tooltip();
If the only reason you ever give your elements a Title is to create a tooltip, then you can just use something like:
$("[title]").tooltip({ position: "center left", predelay:500 });
Then every element with a title defined will show your stylized tooltip when the element is hovered over.
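To tie this back to the Term model in the question, a hedged client-side sketch; the /terms.json endpoint, the JSON shape, and the glossary-term class are all my assumptions, and the naive html() replacement is only safe for plain question text:
// assumes the Rails app exposes the Term table as JSON, e.g. { "angina": "chest pain ..." }
$.getJSON('/terms.json', function(terms) {
  $('.question-text').each(function() {
    var $el = $(this);
    $.each(terms, function(word, definition) {
      // wrap each whole-word occurrence and attach the definition as its title
      var re = new RegExp('\\b(' + word + ')\\b', 'gi');
      $el.html($el.html().replace(re, '<span class="glossary-term" title="' + definition + '">$1</span>'));
    });
  });
  $('.glossary-term').tooltip({ position: 'center left', predelay: 500 });
});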
Why not use Twitter's Bootstrap framework? It offers a few options:
Modals: these can be customized with images and any other content that you need.
Tooltips: mainly for small snippets of text, e.g. a highlighted medical term.
Popovers: can contain more information than tooltips, but not as versatile as a modal.
You can find more information on using it in Rails 3.0/3.2 here.

extract xpath

I want to retrieve the XPath of an attribute (for example, the "brand" of a product on a retailer's website).
One way of doing it is to use Firefox add-ons like XPather or XPath Checker: open the website in Firefox and right-click the attribute I'm interested in. This is OK, but I want to capture this information for many attributes, and right-clicking each and every one is time-consuming. The other problem is that some attributes I'm interested in exist for one product while others only appear on some other product, so I would have to go to that product and repeat the process manually.
Is there an automated or programmatic way of retrieving the XPath of the desired attributes from a website, rather than having to do this manually?
Note that not all websites serve valid XML that you can run XPath against...
That said, you should check out some HTML parsers that will let you use XPath on HTML even when it is not valid XML.
Since you did not specify the technology you are working with, I'll suggest the .NET HTML Agility Pack; if you need others, search for questions dealing with this here on SO.
The solution I use for this kind of thing is to write an xpath something like this:
//*[text()="Brand"]/following-sibling::*
//*[text()="Color"]/following-sibling::*
//*[text()="Size"]/following-sibling::*
//*[text()="Material"]/following-sibling::*
It works by finding all elements (labels) with the text you want and then looking to the next sibling in the HTML. Without a specific URL to see I can't help any further.
This is a generalised version; you can make more specific versions by replacing the asterisks with tag names, and you can navigate differently by replacing the following-sibling axis with something else.
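One programmatic way to run such expressions in bulk is the browser's built-in document.evaluate; a sketch you could paste into the page's console (the label list is illustrative):
var labels = ['Brand', 'Color', 'Size', 'Material'];
labels.forEach(function(label) {
  // find the element whose text is the label, then read its next sibling's text
  var xpath = '//*[text()="' + label + '"]/following-sibling::*[1]';
  var result = document.evaluate(xpath, document, null, XPathResult.FIRST_ORDERED_NODE_TYPE, null);
  var node = result.singleNodeValue;
  console.log(label + ': ' + (node ? node.textContent.trim() : 'not found'));
});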
I use XPaths in import.io to make APIs for this kind of thing all the time. It's just a matter of finding an XPath that's generic enough to find the HTML no matter where it is on the page, while being specific enough to get the right data.
