I have a Yahoo pipe that takes the Atom feed from a Google group, and I want to process each message's full text (running various regular expressions to extract data). I can get a message as plain text from Google using a URL like this:
http://groups.google.com/group/(group_name)/msg/(message_id)?dmode=source&output=gplain
However, I'm having trouble getting it inside Yahoo pipes as a string value. Fetch Page rejects non-HTML pages. YQL using the html table seems to work, and wraps the plain text inside a p element, whose text I can extract like this:
select * from html where url="..." and xpath="//p"
However, if the message text contains html tags, YQL returns an HTML subtree instead of a string. Is there any way of flattening it back into its HTML source?
The trick is to remove the "output=gplain" and grab the content from the pre element.
select content from html
where url="http://groups.google.com/group/haml/msg/0f78eda2f5ef802d?dmode=source"
and xpath='//div[contains(@class,"maincontbox")]/pre'
I have created a pipe with Google Group and Message ID as inputs to demonstrate:
http://pipes.yahoo.com/pipes/pipe.info?_id=3d345e162405e7dbd47d73b95c21f102
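For readers who want to do the same extraction outside YQL, here is a minimal sketch in Python using only the standard library. It mirrors the XPath above: it collects the text of any pre element nested inside a div whose class contains "maincontbox". The sample markup is a made-up stand-in for the Google Groups page, and the names PreText and extract_source are hypothetical, not part of any API.

```python
from html.parser import HTMLParser


class PreText(HTMLParser):
    """Collect text of <pre> elements nested in a div whose class contains a marker."""

    def __init__(self, marker="maincontbox"):
        super().__init__()  # convert_charrefs defaults to True, so &lt; arrives as "<"
        self.marker = marker
        self.div_depth = 0  # > 0 while inside the matching div
        self.in_pre = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            cls = dict(attrs).get("class") or ""
            # Count nested divs once we are inside the matching one.
            if self.div_depth or self.marker in cls:
                self.div_depth += 1
        elif tag == "pre" and self.div_depth:
            self.in_pre = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "pre":
            self.in_pre = False

    def handle_data(self, data):
        if self.in_pre:
            self.chunks.append(data)


def extract_source(page_html):
    """Return the message source as one string (hypothetical helper)."""
    parser = PreText()
    parser.feed(page_html)
    parser.close()
    return "".join(parser.chunks)


# Made-up sample page; the real page layout may differ.
sample = (
    '<html><body><div class="maincontbox other">'
    "<pre>Subject: hi\n\n&lt;b&gt;bold&lt;/b&gt; text</pre></div>"
    '<div class="footer"><pre>ignored</pre></div></body></html>'
)
message = extract_source(sample)
```

Because the message source sits escaped inside the pre element, the collected text comes back as a flat string with the original tags intact, which is what the regular expressions need.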
I am trying to grab content from web pages that are not structured in a uniform fashion. I want to tell XPath to grab any content within HTML tags, in the order it encounters them, and return the results without my having to specify div names and so on, since these differ from page to page.
So I need to know how to say "return any HTML content in the order it's found within tags", regardless of whether the tags are divs with classes, ems, strong tags, etc. My only experience with XPath is specifying actual div names, for example:
//div[@id='tab_info']
This XPath,
string(/)
will return the string value of the entire XML or HTML document. That is, it returns a single string containing all of the text in the document, in document order, as requested.
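As a rough illustration of what "all of the text in document order" means, here is a small Python sketch. Python's standard xml.etree.ElementTree does not evaluate XPath functions such as string(), so the sketch uses itertext() as a stand-in that visits text nodes in the same order. The sample markup is invented and must be well-formed for this parser.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; real pages may need an HTML-tolerant parser.
doc = (
    "<div><p>Hello <em>big</em> world</p>"
    "<p>Second <strong>para</strong>.</p></div>"
)

root = ET.fromstring(doc)

# itertext() walks the tree depth-first and yields each text node in
# document order, ignoring which tag it came from.
text = "".join(root.itertext())
```

Note that no separator is inserted between elements, so text from adjacent paragraphs runs together unless the markup itself contains whitespace.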
I want to extract the plain text from an HTML table (that is, I don't want the information marked by the red arrow).
However, when I get the text with cell.text, it also includes the text of the unwanted hyperlinks:
"\n central tendency1 \n "
I expected to get
"central tendency"
So I tried cell.text.strip.downcase.gsub!(/\d/, ""),
However, gsub also removes the digits in the part marked by the green rectangle.
Is there any way to get the text of the HTML without the text of the hyperlinks?
Here's the HTML link I need to parse
You can remove all the links before converting to text with Nokogiri:
table = doc.css(".page table")[0]
table.css("a").each(&:remove)
Edit: Alternatively, you can have a regexp that only removes numbers at the end of a string and if they're preceded by a letter, which seems like it may work in this specific case but cannot be relied upon to work in similar cases:
cell.text.strip.downcase.gsub(/(?<=\w)\d$/, "")
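To make the trade-off between the two approaches concrete, here is a sketch of both in Python's standard library (not Nokogiri). The cell markup, the class TextWithoutLinks, and the helpers cell_text and strip_trailing_digit are all hypothetical names made up for illustration.

```python
import re
from html.parser import HTMLParser


class TextWithoutLinks(HTMLParser):
    """Collect text content while skipping anything inside <a> elements."""

    def __init__(self):
        super().__init__()
        self.link_depth = 0  # > 0 while inside a link
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1

    def handle_data(self, data):
        if not self.link_depth:
            self.parts.append(data)


def cell_text(cell_html):
    """Approach 1: drop link text structurally, then normalise."""
    parser = TextWithoutLinks()
    parser.feed(cell_html)
    parser.close()
    return "".join(parser.parts).strip().lower()


def strip_trailing_digit(text):
    """Approach 2: the regexp above, removing one trailing digit after a word character."""
    return re.sub(r"(?<=\w)\d$", "", text.strip().lower())


# Invented cell resembling the one described in the question.
cell = '<td>\n Central tendency<a href="#fn1">1</a> \n </td>'

by_structure = cell_text(cell)
by_regexp = strip_trailing_digit("\n central tendency1 \n ")
# A legitimate number is damaged by the regexp, because a digit also
# counts as a word character for the lookbehind.
damaged = strip_trailing_digit("top 10")
```

The structural approach only depends on where the link is, while the regexp depends on what the text happens to look like, which is why the second one can corrupt cells that legitimately end in a number.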
I'm trying to change an indl file. An indl file is created by Adobe InDesign to keep the structure of a document, and is basically XML. I want to use Nokogiri to find selected XML nodes, replace their text with my own, and then save the XML to another file.
The XML is unusual, of course: I found some documentation on retrieving HTML tags and changing their text with Nokogiri, but I don't know how to handle a piece of XML like this:
<cflo>
<txsr prst="o_u5084" crst="o_u5085" trak="D_10">
<pcnt>c_tEST</pcnt>
</txsr>
<txsr prst="o_u5086" crst="o_u5c" trak="D_20">
<pcnt>c_Titolo titolo titolo</pcnt>
</txsr>
</cflo>
Basically, I need to look for a combination of the prst and crst attributes and replace the content inside the pcnt node.
I tried this:
@doc.xpath("//txsr[prst='o_u5086' and crst='o_u5085']")
but I don't know how to change the text inside the pcnt node.
That's not the correct XPath. The correct XPath will look like this:
@doc.xpath("//txsr[@prst='o_u5086'][@crst='o_u5085']")
Then take the first node from the resulting node set, find its pcnt child, and use the inner_html= method to replace the text value.
Full code may be found here: https://gist.github.com/kaineer/7673698
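The same find-and-replace flow can be sketched with Python's standard xml.etree.ElementTree, which supports the chained attribute predicates used in the XPath above. This is an illustration of the technique, not the code from the gist. The helper replace_pcnt and the replacement text are hypothetical, and the attribute values are taken from the second txsr element in the question's sample.

```python
import xml.etree.ElementTree as ET

# The sample from the question, with the root element closed properly.
xml_source = """<cflo>
  <txsr prst="o_u5084" crst="o_u5085" trak="D_10">
    <pcnt>c_tEST</pcnt>
  </txsr>
  <txsr prst="o_u5086" crst="o_u5c" trak="D_20">
    <pcnt>c_Titolo titolo titolo</pcnt>
  </txsr>
</cflo>"""


def replace_pcnt(root, prst, crst, new_text):
    """Set the text of the first pcnt under a txsr matching both attributes.

    Returns True if a node was changed, False if nothing matched.
    """
    path = f".//txsr[@prst='{prst}'][@crst='{crst}']/pcnt"
    node = root.find(path)
    if node is None:
        return False
    node.text = new_text
    return True


root = ET.fromstring(xml_source)
changed = replace_pcnt(root, "o_u5086", "o_u5c", "c_New title")

# Serialise the modified tree; writing this string to a new file
# completes the "save to another file" step.
result = ET.tostring(root, encoding="unicode")
```

Checking for a missing match matters here: the combination o_u5086 with o_u5085 does not occur in the sample, so a lookup with those two values finds nothing.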
I need to get both the plain text and the HTML text from the Ajax Editor. I can get the HTML text, but I cannot retrieve the plain text. I'm not supposed to strip the HTML tags from the editor content to obtain the plain text.
Is there a property that returns plain text from the Ajax Editor?
Sample code from my app:
I'm able to get the rich HTML text like this:
string desc = QuestionAndAnswerEditor.Content;
I want to get the plain text in the same way.
Please help me.
Use HTML.Encode to get the encoded text, and HTML.Decode to decode it.
I am trying to read mails programmatically in VB6, but I am unable to read mails that contain inline images or HTML such as hyperlinks. Can anyone suggest a way to read this type of mail?
EDIT:
I am not getting any error message, but
nsfDocument.GETITEMVALUE("Body")(0) returns only text.
Images are not shown.
You may want to try a third party API to help, such as the Midas Rich Text C++ API from Genii Software. http://www.geniisoft.com/showcase.nsf/MidasCPP
Or try the code examples shown on this site to gain access to the Notes Document in HTML form: http://searchdomino.techtarget.com/tip/0,289483,sid4_gci1284906,00.html
The GetItemValue method of the Document class returns rich-text item values as an array of strings, with all rich text styling removed. The "body" field in a Notes email is generally rich text. So, you should look into using the GetFirstItem method, instead. That will return a NotesRichTextItem object (for the body field). From that object, you can access the styling of the text, hyperlinks and file attachments, etc. (I do not believe that you can access in-line images at all via the "back-end" COM classes - I think for that, you will need to drop down to use the C API classes).
Here's a quick sample of how to get a NotesRichTextItem handle:
Dim doc As NotesDocument
Dim rtitem As Variant
' ... get the document
Set rtitem = doc.GetFirstItem("Body")
If rtitem.Type = RICHTEXT Then
    ' ... work with rtitem
End If
Here is the doc page for the NotesRichTextItem class:
http://publib-b.boulder.ibm.com/lotus/c2359850.nsf/2e73cbb2141acefa85256b8700688cea/dc72d312572a75818525731b004a5294?OpenDocument
And here is a starting point for the C API docs:
http://www14.software.ibm.com/webapp/download/nochargesearch.jsp?k=ALL&S_TACT=104CBW71&status=Active&q=Lotus+%22C+API%22