I would like to start working with parsing large numbers of raw HTML pages into semantic data structures.
Just interested in the community opinion on various available tools for such a task, particularly various useful libraries in any language.
So far, planning on using Hadoop to manage a lot of the processing, but curious about alternatives.
First you need to download your page source and then create a DOM tree.
If you are coding in C#, you can use the following tools to create your DOM tree.
1) http://htmlagilitypack.codeplex.com/
2) http://www.majestic12.co.uk/projects/html_parser.php
The first one is easy to use, but the second one is much faster and more memory-friendly, so I suggest the second one if you want to build a robust application.
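To make the "create a DOM tree" step concrete, here is a minimal sketch in Python rather than C#, using only the standard library's html.parser. It is not how either of the libraries above works internally, and Node, TreeBuilder and parse_html are names made up for this illustration. It also assumes reasonably well-nested markup; a real parser handles far more broken HTML than this does.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not become the "current" node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """One element in the tree: tag name, attributes, children and direct text."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return every descendant element with the given tag, in document order."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the start/end tag events that HTMLParser emits."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(source):
    """Parse an HTML string and return the root Node of the resulting tree."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Once the tree exists, something like parse_html(page).find_all("p") gives you the paragraph elements to work with.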
Then you can extract useful content from the web page using:
http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html
You can find many other articles on this by Googling "extract main content from web page".
Hope it helps
I want to build an application that displays the content that the user types at the command prompt on the screen, like a presentation.
I am writing this application in Go. If there are existing libraries that I can use to do this, great; if not, I would need some direction on how to approach the problem.
I did search on the internet for pointers but found none.
Have a look at the present tool, it does a similar thing using flat files and might even be useful for you.
https://godoc.org/golang.org/x/tools/present
I recently automated the creation of Powerpoint Presentations in a site I'm making. I found the Office Interop libraries extremely simple to use.
Office isn't built for this kind of thing in a web server environment, so I'm looking at creating the PowerPoint files using Office Open XML, only it's extremely complex. For example, I downloaded some code to create a blank presentation with some text. This code was around 300 lines! Using the Office Interop libraries I could do the same thing in just a couple of lines of code.
I don't have time, nor do I want to attempt to learn how to interact with the Open XML libraries, so I'm hoping someone has made a wrapper for them. So far all my searching has given me only one result, Aspose Slides for .NET. This looks really hopeful, but it also looks rather expensive.
Has anyone ever used a decent wrapper or alternative before?
If you are looking to automate the creation of PowerPoint presentation files, I'd say continue with Open XML; there's nothing better. Everything else is either paid or doesn't offer the full range of functionality that Open XML provides.
If you find creating a blank file tedious, you could save an empty file somewhere and use that as a template for performing further operations on it.
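As a rough illustration of the template idea, here is a sketch in Python (standard library only, not the Open XML SDK). It relies on the fact that a .pptx file is a zip package with the slides stored as XML parts under ppt/slides/. The function name fill_template and the {{title}} placeholder syntax are made up for this example, and it assumes each placeholder sits inside a single text run; PowerPoint sometimes splits text across runs, in which case a plain string replace like this will miss it.

```python
import zipfile
from xml.sax.saxutils import escape


def fill_template(template_path, output_path, replacements):
    """Copy a .pptx template, substituting placeholder strings in the slide XML."""
    with zipfile.ZipFile(template_path) as src, \
            zipfile.ZipFile(output_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Only touch the slide parts; every other part is copied unchanged.
            if item.filename.startswith("ppt/slides/") and item.filename.endswith(".xml"):
                text = data.decode("utf-8")
                for placeholder, value in replacements.items():
                    # Escape &, < and > so the result is still valid XML.
                    text = text.replace(placeholder, escape(value))
                data = text.encode("utf-8")
            dst.writestr(item, data)
```

Usage would be along the lines of fill_template("template.pptx", "report.pptx", {"{{title}}": "Quarterly results"}), where template.pptx is a presentation you saved by hand with the placeholder text typed into it.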
The only thing close to a wrapper for PowerPoint I've found is the Open XML PowerTools. It includes a PresentationBuilder class which can be used for some specific tasks, like combining slides from multiple PowerPoint documents into a new document. Although it's pretty limited in its functionality, you could extend the class.
However, I've come to the conclusion that there just is not a good wrapper out there, so I've had to do what pretty much everybody recommends: use the Open XML SDK Productivity Tool and its Reflect Code button.
I put together a basic presentation, ran Reflect Code on it, and put the result into a class. Yes, it's a lot of lines of code and it's not the most elegant solution, but it does work. From there I can extend or modify that class to do the specific things I need to do with each slide. The Productivity Tool is a big help for figuring out the code needed to do specific things. I try to keep it simple and just do one or two things at a time, run Reflect Code, then look at the code to see what it does.
You could try SoftArtisans PowerPointWriter. It has a template mode that lets you start with an existing PowerPoint file containing a few placeholders and merge your data into the presentation with as little as 5 lines of code.
Disclaimer: I work for SoftArtisans
I am trying to scrape content from shopping sites and then save it in my database, in a Product table. Scraping such content requires knowing the DOM structure of each site, and not only the DOM structure but also the hierarchy of categories in the menu.
There are many solutions that achieve this by setting up a configuration for each site and then looking for the specific HTML elements that contain the data (e.g. product name, price, model, ...) using regex, XPath or CSS selectors.
Is there any solution that avoids setting up a configuration for each site and scrapes the product properties automatically?
There are similar solutions that deal with news, like Readability, which looks for sequences of <p> tags and images. It is easier for news because news sites are similar to each other and have a simple structure.
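To show the kind of heuristic meant here, below is a very rough sketch in Python (standard library only). It is not Readability's actual algorithm, only the basic idea of looking for where the <p> text clusters: it credits the text of every <p> to the element that directly contains it and returns the container with the most paragraph text. ParagraphScorer and best_container are names invented for this example, and images are ignored.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they are never pushed on the open-element stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ParagraphScorer(HTMLParser):
    """Scores each container by the amount of text in its direct <p> children."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as (tag, element_id)
        self.elements = {}   # element_id -> (tag, attrs dict)
        self.scores = {}     # element_id -> characters of paragraph text
        self.next_id = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.next_id += 1
        self.elements[self.next_id] = (tag, dict(attrs))
        self.stack.append((tag, self.next_id))

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, plus anything left open inside it.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        # Credit text that sits inside a <p> to the element containing that <p>.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == "p":
                parent_id = self.stack[i - 1][1]
                self.scores[parent_id] = self.scores.get(parent_id, 0) + len(data.strip())
                break


def best_container(html):
    """Return (tag, attrs) of the element holding the most <p> text, or None."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.scores:
        return None
    best_id = max(scorer.scores, key=scorer.scores.get)
    return scorer.elements[best_id]
```

On a news page this tends to pick the article body. On a product page there is usually no such cluster of paragraphs, which is part of why the news case is easier.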
There is no magic bullet. However, what you could do is use XSLT as the main "binding" between the site and your scraping program. XSLT support is built in with Html Agility Pack.
At least it will minimize the amount of work required when the site evolves or changes its structure, compared with relying only on pure procedural code. Changing the XSLT text (once you're used to it) does not require recompilation and is closer to "configuring" the system. But still, you'll have to define at least one XSLT file per target website (unless these websites are built on the same software, of course).
You may check this link for an XSLT example: Use HtmlAgilityPack to divy up a document
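The same one-configuration-per-site idea can be sketched without XSLT. The example below is Python and swaps in plain regular expressions as the per-site rules, so it is not the XSLT/Html Agility Pack approach itself, only the "rules as data, one set per website" structure. The host names, the HTML patterns and the names SITE_RULES and extract_product are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site rules: one regular expression per product field.
# Supporting a new site, or a redesigned one, means editing this table only.
SITE_RULES = {
    "shop-a.example": {
        "name": r'<h1 class="product-title">(.*?)</h1>',
        "price": r'<span class="price">\s*\$?([\d.,]+)\s*</span>',
    },
    "shop-b.example": {
        "name": r'<div id="item-name">(.*?)</div>',
        "price": r'data-price="([\d.,]+)"',
    },
}


def extract_product(url, html):
    """Apply the rules configured for the URL's host and return the fields found."""
    host = urlparse(url).hostname
    rules = SITE_RULES.get(host)
    if rules is None:
        raise KeyError(f"no scraping rules configured for {host}")
    product = {}
    for field, pattern in rules.items():
        match = re.search(pattern, html, re.DOTALL)
        # A missing field is recorded as None so a changed layout is easy to spot.
        product[field] = match.group(1).strip() if match else None
    return product
```

When a site changes its markup, only its entry in the table changes; the extraction code stays the same, which is the same benefit the XSLT files give you.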
If the websites you want to scrape share no general pattern in their HTML structure, you must configure your script for every website.
Only if you are lucky will you not have to reconfigure your script.
PS: in general, people who write web scrapers build their code from scratch.
Right now, I’m working on a legacy web application that is made up of multiple screens, each one performing a separate function. I’m in the process of converting several of the screens to EXTJS 4 using the MVC approach. In order to isolate the impact of my changes and because we don’t have time to convert the entire app at once, I’ve converted two of the screens into two separate EXTJS 4 apps. Each screen now has its own folder in which I’ve set up an app using the appropriate file structure and app.js file.
My question is this: as I continue developing, I may want to use models from one app (screen) in another app. How do you share models, views and controllers between applications? What’s the best approach?
FYI, I’m using autoloading to pull everything in.
Thanks
I would not use autoloading in production, because it generates too many HTTP requests to fetch all the files, which slows down page loading. This is well documented in Google's Page Speed and Yahoo's Best Practices for Speeding Up Your Web Site.
The best practice is to preprocess the resources when the application is deployed and generate a single JavaScript file with everything in it, sent in a single (GZIP) compressed response. There are several tools for this job, but the choice depends heavily on your toolchain. You can, for example, have a look at the SO question Best JavaScript compressor to get recommendations for various compressors (I use Jammit).
When you have a flexible, configurable JavaScript compressor in your toolchain, you can set up a shared folder for your common files, such as models, stores and some libs. These are then included in the builds for the different projects.
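As a rough sketch of that build step, here is what it could look like in Python (standard library only). This is not Jammit or the Sencha tooling: it only concatenates and gzips, with no minification, and it orders files alphabetically within each folder, which ignores real dependency order. The function name build_bundle and the folder layout in the usage note are assumptions.

```python
import gzip
from pathlib import Path


def build_bundle(shared_dir, app_dir, output_path):
    """Concatenate shared and app-specific .js files, then write a gzipped copy."""
    # Shared files go first so the app code can rely on them being defined.
    sources = sorted(Path(shared_dir).rglob("*.js")) + sorted(Path(app_dir).rglob("*.js"))
    parts = []
    for path in sources:
        parts.append(f"/* {path.name} */\n" + path.read_text(encoding="utf-8"))
    # The lone ';' between files guards against a file that lacks a trailing semicolon.
    bundle = "\n;\n".join(parts) + "\n"
    output = Path(output_path)
    output.write_text(bundle, encoding="utf-8")
    with gzip.open(str(output) + ".gz", "wt", encoding="utf-8") as gz:
        gz.write(bundle)
    return [path.name for path in sources]
```

You would run it once per screen at deploy time, e.g. build_bundle("shared", "screen1", "build/screen1.js"). Keep the output outside the app folder, otherwise the next build would pick up the previous bundle as a source file.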
If you have a good reason to serve individual JavaScript files, you can use a good version control system such as git and make use of submodules. With this approach you'll have a separate repository for the common files. The downsides are slower page loads and a little overhead in updating the submodules.
As a last option, you can use a symbolic link on the file system to link the common folder into each of the other projects.
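The symbolic link option could be set up with a small script like the following Python sketch. The function name link_common and the default link name "common" are made up, and it assumes a platform where you are allowed to create symlinks (on Windows that may need extra privileges).

```python
import os
from pathlib import Path


def link_common(common_dir, project_dirs, link_name="common"):
    """Create <project>/<link_name> in each project as a symlink to the shared folder."""
    common = Path(common_dir).resolve()
    created = []
    for project in project_dirs:
        link = Path(project) / link_name
        if link.is_symlink():
            link.unlink()  # replace a stale link from an earlier run
        elif link.exists():
            raise FileExistsError(f"{link} exists and is not a symlink")
        os.symlink(common, link, target_is_directory=True)
        created.append(link)
    return created
```

After link_common("shared", ["screen1", "screen2"]), each screen sees the shared models under its own common/ folder, so the autoloader paths stay local to each app.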
Here's what Saki said to me on the Sencha Forums:
The multiple applications on one page, or sub-applications of Ext MVC
are not supported yet, however, developers are working on this
functionality, AFAIK. Such implementation would most likely solve also
the problem of re-using models, views and controllers among (sub)
applications, I hope.
More specifically regarding linking multiple applications:
I would just soft-link files of MVC components in this case. There's
no logical or functional connection among them now, only I wanna reuse
already written file, right?
I want to dynamically load (AJAX) the text from some Microsoft Word files into a webpage. So I might have a link to essays I've written and upon mouseover have it load the first few sentences in a tooltip.
Only if you have a parser. I think the new format is a zip archive of XML files, but the old one is a binary format.
There are some parsers out there.
I know of wvWare but it seems it's outdated. (http://wvware.sourceforge.net/)
This is maybe something worth looking at: http://poi.apache.org/hwpf/index.html
And yeah, forgot to mention how to do this. :-)
First you need to make the JavaScript ask for the data through Ajax. The server side has to take care of the parsing and return the text to the JavaScript. This will be a pain in the ass. I haven't done this myself and have never tried the parsers I linked, so I'm not sure if they suit you. As for images, stylesheets, etc., I'm not sure whether those will be usable.
At least, good luck.
For security reasons, it is not possible to directly load a local file (such as a Word document) into the page using simply Javascript. The user will need to upload the file to the server, which you will want to parse on the server and then you can load whatever result you like into the page using Ajax.
It sounds like you mean to upload your files (e.g. essays) to your server to allow users to download them, and want to create a server-side page that will parse the files and print the first few lines (so it can be called by an AJAX method that displays a preview on hover).
To suggest a tool for this, we'll need to know whether these are "old" Word format (Office 2003 - extension is .doc) or "new" Word format (Office 2007 - extension is .docx).
It would also be good to know what you're using to create your pages server-side, since different document-reading tools support different programming languages. If you're using Java to read .doc files, you can use the tool we use at my place of work, which is POI (http://poi.apache.org/). If you're using something else, try searching Google for {read <format> in <language>}, e.g. {read .docx in ruby}.
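For the .docx case specifically, the server-side parsing can be fairly small, because the file is a zip archive with the body text in an XML part. Here is a minimal sketch in Python (standard library only); it does not handle the old binary .doc format at all. It assumes the main document part is at word/document.xml, which is the usual location, and the function name docx_preview and the naive sentence splitting are my own choices for illustration.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> (paragraph) and <w:t> (text).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_preview(path, max_sentences=3):
    """Return the first few sentences of a .docx file as plain text."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t")).strip()
        if text:
            paragraphs.append(text)
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", " ".join(paragraphs))
    return " ".join(sentences[:max_sentences])
```

The server-side page would call something like docx_preview("essays/my-essay.docx") and return the string, and the Ajax call on hover would put that string in the tooltip.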
If all of this is Greek to you and you have no prior experience with developing custom server-side web code, this is probably going to be unnecessarily painful and you should consider an alternative (like manually creating a 3-line text "preview" page for each regular page, and then just showing that).