How to retrieve plain text from a formatted website to use in UIWebView - xcode

Not sure if what I want to do is possible, but what I am hoping to do is somehow gather certain pieces of text from a website, remove the header, footer, background, all formatting, and place it into my application in a scrollview or something similar...
I'll give you an example... Imagine I was making Wikipedia's iPhone app: I want to download the Wikipedia article on dogs, without the header, side bars etc., just the text. How would I go about doing this?
I understand that for this I have not provided any example code or what I've tried or started, but that's just because in this case I'm lost! That doesn't mean I want full chunks of code either. Any help will do. If this doesn't work, I will just have to make a 'mobile optimised' version of the webpages I want to include in my app.
Thanks
(Edit: the term I was trying to use was 'strip the web page of its HTML coding')

You may be going about this the wrong way, or perhaps even asking the wrong question.
Does the target website have an API or datafeed of some kind?
Can you get the information you need in JSON or XML format directly from the site?
I think you've misunderstood the technology. HTML is merely the framework on which the formatting and data are hung.
Parsing the HTML page seems like an awfully big headache, and I doubt you'll ever get it to work reliably, because almost all sites these days are partially or wholly generated on the server side; the page you receive is only the end result.
Some sites keep the information in memory and others load it dynamically through AJAX, for example, which means that simply parsing the HTML may get you no data at all.
Another issue you should be aware of, though, is that simply copying the data from generated websites may open you up to copyright issues.

You have to parse the HTML code, search for the part that you want, and "throw away" the parts that you do not need. This is more or less brute force, and if the code of the website changes you are screwed, so with this method you have to write the parser by hand. But maybe there is an Atom or RSS feed you can parse instead. That is much easier, and you are not depending on the website's layout, because the RSS/Atom feed is just the data. For parsing RSS you could try NSXMLParser.
Then you have to make a valid HTML page out of the data and present it in the UIWebView.
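This is not iOS code, but here is a rough sketch of those two steps in JavaScript, just to show the shape of the approach: parse the feed's XML (on iOS that would be NSXMLParser's job) and build a bare-bones HTML string that the UIWebView could then load with loadHTMLString:baseURL:. The feed field names are just the usual RSS ones.

    // Sketch only: parse an RSS string and turn the items into minimal HTML
    // with none of the site's own chrome (header, sidebars, styling).
    function feedToHtml(rssXmlString) {
        var doc = new DOMParser().parseFromString(rssXmlString, 'application/xml');
        var body = '';

        doc.querySelectorAll('item').forEach(function (item) {
            var title = item.querySelector('title');
            var desc = item.querySelector('description');
            body += '<h2>' + (title ? title.textContent : '') + '</h2>'
                  + '<p>' + (desc ? desc.textContent : '') + '</p>';
        });

        // The "valid HTML page" from the answer: your own markup, nothing else.
        return '<html><body>' + body + '</body></html>';
    }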

Related

capture data of a slickgrid using UiPath

I am trying to scrape data from a SlickGrid, but it isn't working out; it has problems with the selector or metadata. I have tried using wildcards and other alternatives too, but all fail to scrape the data. Can anyone help me out with this?
I think you are going to have to get to know Slickgrid a bit to work out what to do. Slickgrid only displays the currently visible data, so scraping the data from the UI will only get partial results.
The best thing to do would be to find the underlying data object and print it as JSON. However, this would be impossible to do in a general sense (i.e. for any website).
It's relatively easy for a single target website, and may be able to be automated with a few tricks for multiple renderings of the same page of a specific website.
(Edit: whoops, just noticed the UiPath bit. I had a look, but I don't know how it works; I assume it looks at the HTML. Basically, that will only work if you are able to get UiPath to scroll through the entire dataset one page at a time, and you'll have to use logic to eliminate the metadata rows. But if you can drop into JavaScript, it will be much easier.)
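If you can run JavaScript on the page (try it in the browser console first), a few lines are usually enough to dump the full dataset instead of scraping the visible rows. The variable name grid below is hypothetical; it is whatever the page assigned its SlickGrid instance to.

    // Dump a SlickGrid's full dataset as JSON from the browser console.
    var data = grid.getData();

    // If the grid is backed by a Slick.Data.DataView, unwrap it first.
    var items = (data && typeof data.getItems === 'function') ? data.getItems() : data;

    // Serialize so an external tool can pick it up (copy from the console,
    // write it to a hidden DOM node, etc.).
    console.log(JSON.stringify(items, null, 2));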

Laravel Save Markdown to Database - Don't Understand

I am reluctant to post this, but I am having trouble understanding how markdown actually "saves" to a database.
When I'm creating a migration, I will add columns and specify the type of value (i.e. integer, text, string, etc.) and in the course of operation on the website, users will input different information that is then saved in the DB. No problem there.
I just can't seem to wrap my head around the process for markdown. I've read about saving the HTML or saving the markdown file, rendering at runtime, pros and cons all that.
So, say I use an editor like TinyMCE, which attaches itself to a textarea. When I click "Submit" on the form, how does that operate? How does validation work? Feel free to answer my question directly or offer some resource to further my understanding. I have an app built on Laravel, so I'm guessing I'll need to use a package like https://github.com/GrahamCampbell/Laravel-Markdown along with an editor (i.e. TinyMCE).
Thanks!
Let's start with a more basic example: StackOverflow. When you are writing/editing a question or answer, you are typing Markdown text into a textarea field. And below that textarea is a preview, which displays the Markdown text converted to HTML.
The way this works (simplified a little) is that StackOverflow uses a JavaScript library to parse the Markdown into HTML. This parsing happens entirely client side (in the browser) and nothing is sent to the server. With each key press in the textarea the preview is updated quickly because there is no back-and-forth with the server.
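As a rough illustration of that client-side preview (not StackOverflow's actual code), here is what the wiring can look like with markdown-js, assuming the page has a textarea with id "editor", a div with id "preview", and the markdown-js browser build loaded so a global markdown object exists:

    // Re-render the preview on every keystroke, entirely in the browser.
    var editor = document.getElementById('editor');
    var preview = document.getElementById('preview');

    editor.addEventListener('input', function () {
        // This HTML is only for display; on submit you still send
        // editor.value (the raw Markdown) to the server.
        preview.innerHTML = markdown.toHTML(editor.value);
    });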
However, when you submit your question/answer, the HTML in the preview is discarded and the Markdown text from the textarea is forwarded to the StackOverflow server, where it is saved to the database. At some point the server also converts the Markdown to HTML, so that when another user comes along and requests to view that question/answer, the document is sent to the user as HTML by the server. I say "at some point" because this is where you have to decide when the conversion happens. You have two options:
If the server converts the Markdown to HTML when it saves it to the database, then it will save two columns, one for the Markdown and one for the HTML. Later, when a user requests to view the document, the HTML document will be retrieved from the database and returned to the user. However, if a user requests to edit the document, then the Markdown document will be retrieved from the database and returned to the user so that she can edit it.
If the server only stores the Markdown text to the database, then when a user requests to view the document, the Markdown document will be retrieved from the database, converted to HTML and then returned to the user. However, if a user requests to edit the document, then the Markdown document will be retrieved from the database and returned to the user (skipping the conversion step) so that she can edit it.
Note that in either option, the server is doing the conversion to HTML. The only time the conversion happens client-side (in the browser) is for preview. But the "preview" conversion is not used to display the document outside of edit mode or to store the document in the database.
The only difference between something like StackOverflow and TinyMCE is that in TinyMCE the preview is also the editor. Behind the scenes the same process is still happening and when you submit, it is the Markdown which is sent to the server. The HTML used for preview is still discarded.
The primary concern when implementing such a system is that if the Markdown implementation used for the preview differs from the implementation used by the server, the preview may not be very accurate. Therefore, it is generally best to choose two implementations that behave very similarly or, if available, use the same implementation for both.
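To make the second option concrete, here is a small sketch of the flow in plain JavaScript rather than Laravel/PHP (a Map stands in for the database table, and markdown-js does the conversion); the function names are made up:

    var markdown = require('markdown').markdown;   // markdown-js
    var documents = new Map();                     // stand-in for the DB table

    function saveDocument(id, markdownText) {
        // Validate and store the raw Markdown exactly as submitted.
        documents.set(id, markdownText);
    }

    function viewDocument(id) {
        // Convert at read time; the stored value is always plain Markdown.
        return markdown.toHTML(documents.get(id));
    }

    function editDocument(id) {
        // Editing skips the conversion and returns the raw Markdown.
        return documents.get(id);
    }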
It is actually very simple.
Historically, in forums, there used to be BBCodes, which are basically pseudo-tags that allow you to format your text in some way. For example, [b][/b] used to mean "make this text bold". With Markdown, the exact same thing happens, just with other characters, like *text* or **text**.
This is done so that you only allow your users a specific set of formatting; if you let them write pure HTML instead, XSS (cross-site scripting) issues would arise, and that's not really a good idea.
You should then save the HTML in the database. You can use, for example, markdown-js, which is a Markdown parser that converts Markdown to HTML.
From what I have seen, TinyMCE does not make use of Markdown by default, since it's simply a WYSIWYG editor; however, it seems it also supports Markdown-like formatting.
Laravel-Markdown is a server-side Markdown rendering helper that you can use in Laravel Blade views. markdown-js, by contrast, is client-side; it can be used, for example, to show a real-time preview of what you're writing.

Algorithm / API for converting HTML to email friendly HTML (for newsletters)

I'm sure this is a very old question, but I could not find a straight answer.
I'm looking for an algorithm that mostly works, to take regular HTML content and make it email-client friendly.
I can rewrite any nice DIV layout as a table layout myself, that's OK, but is there anything that will do it for me?
Here are my concerns:
Overflow content - Gmail and others ignore any overflow:hidden; the algorithm should address this.
Clipped images - same as above, but here the solution will probably be server-side clipping.
CSS / scripts / non-standard tags - the algorithm should remove these but keep the general look and feel.
DIV layout to table layout - I hear it's a must, but I'm sure it's not an easy task to automate.
There are many HTML to PDF converters, but I could not find a good HTML to "HTEMAIL" converter
Is there any standard or proposed standard for HTML for email clients? or is it an open jungle out there?
There is no way to make a converter that will be cross-email-client compatible. The closest you can get is using templates and adding text to certain sections using PHP or .NET.
I've been creating emails for 6 months, and the amount of time you spend correcting email client differences is normally around 50% of the time you spend making the email.
Here is some reading that may help you:
http://www.sitepoint.com/code-html-email-newsletters/
http://www.campaignmonitor.com/css/
As you can see from that last link there is no way to create an algorithm that can sort out all these issues.
Hope this helps
Another option that I've been using is to build the email in HTML or directly in Mailchimp. Once I'm happy with it, using Mailchimp, I click on preview and I get the email in a popup. The source code from the popup is email-client friendly (in tables). I then copy that code and use it for my emails.
Not ideal and a bit of trouble, but so far the best solution I can find.
And before people ask: I mostly use Mailchimp directly, but there is one situation where I have to kick it old school.

For ajax - Hashes vs HTML 5 History API?

Before I launch my site, I want to get my URL structure set in stone. A large number of my pages have tabs on them, and it's a much better user experience if when changing a tab, I use ajax to get the relevant changes and just update that, rather than updating the whole page.
Should I use the popular method of just updating the hash of the URL for AJAX tab changes, or should I just use the HTML5 History API and let anyone with a browser that doesn't support it reload the full page? I've heard people say that websites that use hashes and hashbangs are "breaking the web". Using hashes, my URLs would look like this: example.com/#popular, and using HTML5 history my URLs would look like this: example.com/?tab=popular.
If you want to serve a different page depending on which tab is selected, then use the HTML 5 history approach. Otherwise just update the hash.
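Either way, a rough browser-side sketch of the History API route with a hash fallback looks something like this; the /tabs/ endpoint, the #tab-content element, and the tab names are made-up placeholders:

    // Load a tab's content via AJAX and keep the URL in sync.
    function loadTab(name) {
        fetch('/tabs/' + encodeURIComponent(name))
            .then(function (res) { return res.text(); })
            .then(function (html) {
                document.querySelector('#tab-content').innerHTML = html;
            });
    }

    function selectTab(name) {
        loadTab(name);
        if (window.history && history.pushState) {
            // example.com/?tab=popular
            history.pushState({ tab: name }, '', '?tab=' + encodeURIComponent(name));
        } else {
            // Fallback for older browsers: example.com/#popular
            location.hash = name;
        }
    }

    // Handle the back/forward buttons.
    window.addEventListener('popstate', function (e) {
        if (e.state && e.state.tab) loadTab(e.state.tab);
    });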
As far as I know, and from my experience, it's really six of one and half a dozen of the other. It comes down to what you prefer, since the end result is the same.

What algorithms could I use to identify content on a web page

I have a web page loaded up in the browser (i.e. its DOM and element positioning are both accessible to me) and I want to find the block element (or a sorted list of these elements), which likely contains the most content (as in a continuous block of text). The goal is to exclude things like menus, headers, footers and such.
This is my personal favorite: VIPS: a Vision-based Page Segmentation Algorithm
First, if you need to parse a web page, I would use HTMLAgilityPack to transform it into XML. It will speed everything up and will enable you, using a simple XPath expression, to go directly to the BODY.
After that, you iterate over all the DIVs (you can get all the DIV elements in a list from the Agility Pack) and take whatever you want.
There's a simple technique to do this, based on analysing how "noisy" the HTML is, i.e., what the ratio of markup to displayed text is across an HTML page. The Easy Way to Extract Useful Text from Arbitrary HTML describes this technique, giving some Python code to illustrate it.
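The article's sample code is Python, but since the question already has the page's DOM available in the browser, the same ratio idea can be sketched directly in JavaScript; the tag list and the scoring below are rough guesses you would want to tune:

    // Score each block element by how much visible text it holds relative
    // to its markup, then pick the densest one as the likely main content.
    function findMainContent(doc) {
        var best = null;
        var bestScore = 0;

        doc.querySelectorAll('div, article, section, td').forEach(function (el) {
            var textLength = el.textContent.trim().length;
            var markupLength = el.innerHTML.length || 1;
            var ratio = textLength / markupLength;   // closer to 1 = less markup noise
            var score = textLength * ratio;          // favour long, low-markup blocks

            if (score > bestScore) {
                bestScore = score;
                best = el;
            }
        });

        return best;
    }

    // Usage on the current page:
    // var main = findMainContent(document);
    // console.log(main && main.textContent.slice(0, 200));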
Cf. also the HTML::ContentExtractor Perl module, which implements this idea. If you wanted to use it, it would make sense to clean the HTML first, for example with BeautifulSoup.
I would recommend Vit Baisa's thesis on web content cleaning; I think he has some code too, but I can't find a link to it. There is also a discussion of this very same problem on the LingPipe natural language processing blog.
