Web Scraping for specific content - XPath

I am trying to scrape content from shopping sites and save it in my database in a Product table. Scraping such content requires knowing the DOM structure of each site, and not only the DOM structure but also the hierarchy of categories in the menu.
There are many ways to achieve this by setting up a configuration for each site and then looking for the specific HTML elements that contain the product properties (e.g. product name, price, model, ...) using regex, XPath, or CSS selectors.
Is there any solution that avoids setting up a configuration for each site and scrapes the product properties automatically?
There are similar solutions for news content, like Readability, which looks for sequences of <p> tags and images. News is easier because news sites resemble one another and have a simple structure.

There is no magic bullet. However, what you could do is use XSLT as the main "binding" between the target site and your scraping program. XSLT support is built into the Html Agility Pack.
At the very least this will minimize the amount of work required when a site evolves or changes its structure, compared with relying only on procedural code. Changing an XSLT file (once you're used to the language) does not require recompilation and is closer to "configuring" the system. Still, you'll have to define at least one XSLT file per target website (unless those websites are built on the same software, of course).
You may check this link for an XSLT example: Use HtmlAgilityPack to divy up a document
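
To make this concrete, here is a minimal C# sketch (not the linked answer's code). It assumes a hypothetical per-site stylesheet named product.xslt that maps a page's markup to a simple product document, and it relies on HtmlDocument implementing IXPathNavigable so it can be fed straight into XslCompiledTransform:

```csharp
using System;
using System.Xml;
using System.Xml.Xsl;
using HtmlAgilityPack;

class XsltScraperSketch
{
    static void Main()
    {
        // Download and parse the page with Html Agility Pack.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example-shop.test/product/123"); // placeholder URL

        // Make the document XML-friendly before transforming it.
        doc.OptionOutputAsXml = true;

        // One stylesheet per target site acts as the "configuration".
        var xslt = new XslCompiledTransform();
        xslt.Load("product.xslt"); // hypothetical per-site stylesheet

        using var writer = XmlWriter.Create(Console.Out, new XmlWriterSettings { Indent = true });
        // HtmlDocument implements IXPathNavigable, so it can be transformed directly.
        xslt.Transform(doc.CreateNavigator(), writer);
    }
}
```

When the site changes its markup, only product.xslt needs editing; the C# code stays untouched.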

If the websites you want to scrape share no general pattern in their HTML structure, you must configure your script for every website. Only if you are lucky will you get away without reconfiguring it.
P.S.: generally, people who write web scrapers build their code from scratch.
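
To make "configure your script for every website" concrete, one common pattern is a table of XPath expressions keyed by host, so supporting a new site means adding an entry rather than new code. A rough C# sketch (the hosts and selectors below are made up):

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical per-site configuration: which XPath finds which product property.
class SiteConfig
{
    public string NameXPath = "";
    public string PriceXPath = "";
}

class ConfigDrivenScraper
{
    static readonly Dictionary<string, SiteConfig> Configs = new()
    {
        ["shop-a.test"] = new SiteConfig { NameXPath  = "//h1[@class='product-title']",
                                           PriceXPath = "//span[@id='price']" },
        ["shop-b.test"] = new SiteConfig { NameXPath  = "//div[@class='name']/h2",
                                           PriceXPath = "//div[@class='cost']" },
    };

    static void Main()
    {
        var url = new Uri("https://shop-a.test/item/42"); // made-up URL
        SiteConfig config = Configs[url.Host];

        // Load the page and apply the site-specific selectors.
        HtmlDocument doc = new HtmlWeb().Load(url.AbsoluteUri);
        var name  = doc.DocumentNode.SelectSingleNode(config.NameXPath)?.InnerText.Trim();
        var price = doc.DocumentNode.SelectSingleNode(config.PriceXPath)?.InnerText.Trim();

        Console.WriteLine($"Product: {name}, Price: {price}");
    }
}
```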

Related

How do I link one file of code to another, or do I have to copy and paste the code into that file?

Hi, I am fairly new to coding and I am designing a webpage/app. I am basically trying to figure out how to link one file of code to another. I have one file with the code for my page design, but I designed a hamburger drop-down menu in another file. My question is: can I just link that file to my home page file, or do I have to re-code it on that page and every page after it?
You seem to be writing an HTML file. The traditional way to solve what you are looking for is to bring another language into the mix, such as PHP, Ruby, or Python.
These languages need to run on a server. Instead of writing HTML directly, these languages write HTML on the fly for you.
Because of this feature, you can make re-usable parts of HTML and use them in multiple pages.
The switch to server-side programming languages is definitely a step up in complexity from HTML, but it's worth learning if this is your goal.
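
All of those languages handle this the same way, so as an illustration only, here is a tiny sketch in C# (the file names menu.html and home.html are made up). A shared fragment holds the hamburger menu, and the server stitches it into every page before sending the HTML to the browser:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

// Minimal sketch of "the server writes HTML for you": the menu lives in one
// file (menu.html) and is injected into each page template on every request.
class TinyServer
{
    static void Main()
    {
        var listener = new HttpListener();
        listener.Prefixes.Add("http://localhost:8080/");
        listener.Start();
        Console.WriteLine("Listening on http://localhost:8080/ ...");

        while (true)
        {
            HttpListenerContext ctx = listener.GetContext();

            // Hypothetical files: home.html contains a {{menu}} placeholder.
            string menu = File.ReadAllText("menu.html");
            string page = File.ReadAllText("home.html").Replace("{{menu}}", menu);

            byte[] body = Encoding.UTF8.GetBytes(page);
            ctx.Response.ContentType = "text/html";
            ctx.Response.OutputStream.Write(body, 0, body.Length);
            ctx.Response.Close();
        }
    }
}
```

Every page that contains the {{menu}} placeholder gets the same menu, so you only ever edit it in one place.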

Are websites like Wix or Weebly multisite systems?

I'd like to know how websites like wix.com or weebly.com are hosted.
Is each website built on their system hosted independently, or do they share the same code, like a WordPress multisite setup?
Thanks.
Wix is a bit like WordPress. Your code is generated by referencing templates they have created. When you edit a site, you are pretty much just changing the CSS; even adding new items to the page is just adding another div for that element, which you then style with CSS.
If you look at the source code of the pages, it's quite ugly, because the styles are mostly hard-coded and not fluid. But hats off to them for creating a system that's so easy to use.
A Wix site relies on their systems, so you can never self-host it, unless you want to get hacky with fairly unmanageable code, since you won't have their editor.

What are the best delivery methods for handing off designs to developers?

I am working on a large and complex web application and am curious, from a developer's standpoint, about what the best methods are for handing over design to engineering in this type of environment. I'm curious about methods of delivery and their pros and cons.
Delivery methods I have used in the past (all of the methods below include detailed wireframes):
Layered PSDs with Layer Comps for interactive states.
A GUI Kit - a PSD of design elements that the developers create markup for and pull from to make the design match wireframes.
A CSS Kit - an HTML page that includes CSS styles and layouts that developers can pull from to match the wireframes.
Please feel free to add your own methods to this list, along with your experience with them and which ones you prefer. I'd also like to know which methods work best in different cases (creating a new feature from scratch, updating an existing feature, etc.).
I've been working on a similar project which involved delivering UI widgets to the development team. Some of them had built-in interactivity, written in JavaScript.
We delivered each widget in a separate folder containing the following subfolders:
Dist
  img
  js
  css
Source
  img
  js
  css
Test
The Source folder usually contains the non-minified versions of the JS and CSS sources. If you use SASS or LESS for your CSS (and you should for a project this large), /Source/css is the place to store your uncompiled sources; the Source img folder is also where we usually include the PSD files we used to design the component.
In the Dist(ribution) folder we shipped the compiled and minified CSS, the images used (combined into sprites when necessary), and the minified JS code. Although you can prepare all these minified versions yourself, it's faster to automate the whole process with Grunt: you simply define a task that minifies the JS and CSS and creates the appropriate folder structure, all with a single command.
The Test folder usually includes an HTML page that renders the component and provides instructions on how to test its different states or interactive features. This HTML page should always use the CSS and JS files from the Dist folder. It's also good practice to include a short readme file describing the steps required to include the component in an existing page. Last but not least, if we're talking about responsive layouts, it's also good practice to provide screenshots of the widget/component rendered in mobile browsers at various resolutions (iPad/iPhone/Android phone).

Libraries/Tools for Website Parsing

I would like to start working with parsing large numbers of raw HTML pages into semantic data structures.
I'm just interested in the community's opinion on the various tools available for such a task, particularly useful libraries in any language.
So far I am planning on using Hadoop to manage much of the processing, but I am curious about alternatives.
First you need to download the page source and then create a DOM tree.
If you are coding in C#, you can use the following tools to create your DOM tree:
1) http://htmlagilitypack.codeplex.com/
2) http://www.majestic12.co.uk/projects/html_parser.php
The first one is easy to use, but the second is much faster and more memory-friendly; I suggest the second one if you want to build a robust application.
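For the Html Agility Pack route, downloading a page and querying the resulting DOM tree looks roughly like this (a sketch; the URL and XPath are placeholders):

```csharp
using System;
using HtmlAgilityPack;

class DomTreeSketch
{
    static void Main()
    {
        // Download the page source and build the DOM tree in one step.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.test/article"); // placeholder URL

        // Query the tree with XPath, e.g. all paragraph nodes.
        var paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs != null) // SelectNodes returns null when nothing matches
        {
            foreach (HtmlNode p in paragraphs)
                Console.WriteLine(p.InnerText.Trim());
        }
    }
}
```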
Then you can extract useful content from the web page using:
http://www.chrisspen.com/blog/how-to-extract-a-webpages-main-article-content.html
and many other articles on extracting content from web pages that you can find by Googling "extract main content from web page".
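
Most of those articles boil down to a text-density heuristic: score candidate containers by how much plain text they hold versus link text, and keep the best one. Here is a deliberately simplified C# sketch of that idea (not the linked article's algorithm, which is written in Python):

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class MainContentSketch
{
    // Score = plain text length minus twice the text inside links,
    // so navigation-heavy blocks rank low. The weighting is arbitrary.
    static int Score(HtmlNode node)
    {
        int textLength = node.InnerText.Length;
        int linkLength = node.Descendants("a").Sum(a => a.InnerText.Length);
        return textLength - 2 * linkLength;
    }

    static void Main()
    {
        HtmlDocument doc = new HtmlWeb().Load("https://example.test/article"); // placeholder URL

        // Consider typical content containers and keep the highest scorer.
        HtmlNode best = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "div" || n.Name == "article" || n.Name == "td")
            .OrderByDescending(n => Score(n))
            .FirstOrDefault();

        Console.WriteLine(best?.InnerText.Trim() ?? "no candidate found");
    }
}
```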
Hope it helps

Load MS Word files with AJAX

I want to dynamically load (AJAX) the text from some Microsoft Word files into a webpage. So I might have a link to essays I've written and upon mouseover have it load the first few sentences in a tooltip.
Only if you have a parser. I think the new format (.docx) is a zip archive with an XML schema, but the old one (.doc) is just binary.
There are some parsers out there.
I know of wvWare, but it seems outdated (http://wvware.sourceforge.net/).
This is maybe something worth looking at: http://poi.apache.org/hwpf/index.html
And yeah, forgot to mention how to do this. :-)
First you need to make the JavaScript ask for the data through AJAX. The server side has to take care of the parsing and return the text to the JavaScript. This will be a pain in the ass. I haven't done this myself and have never tried the parsers I linked, so I'm not sure whether they suit you. Images, stylesheets, etc.: I'm not sure whether those will be usable either.
In any case, good luck.
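
Since the newer .docx format really is just a zip archive containing XML, a server-side preview can be sketched without a full parser. Here is a hedged C# example that pulls the text of the first few paragraphs out of word/document.xml (the file name and paragraph count are arbitrary):

```csharp
using System;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

class DocxPreviewSketch
{
    static void Main()
    {
        // The WordprocessingML namespace used inside .docx files.
        XNamespace w = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

        using ZipArchive zip = ZipFile.OpenRead("essay.docx"); // hypothetical file
        ZipArchiveEntry entry = zip.GetEntry("word/document.xml")
            ?? throw new InvalidOperationException("not a .docx file");

        XDocument xml;
        using (var stream = entry.Open())
            xml = XDocument.Load(stream);

        // A paragraph is a <w:p> element; its visible text lives in <w:t> elements.
        var preview = xml.Descendants(w + "p")
            .Select(p => string.Concat(p.Descendants(w + "t").Select(t => t.Value)))
            .Where(text => text.Length > 0)
            .Take(3); // first few paragraphs, e.g. for the tooltip

        foreach (string line in preview)
            Console.WriteLine(line);
    }
}
```

The old binary .doc format has no such shortcut; for that you would still need one of the parsers mentioned in these answers.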
For security reasons, it is not possible to load a local file (such as a Word document) directly into the page using JavaScript alone. The user will need to upload the file to the server, where you can parse it and then load whatever result you like into the page using AJAX.
It sounds like you mean to upload your files (e.g. essays) to your server to allow users to download them, and want to create a server-side page that will parse the files and print the first few lines (so it can be called by an AJAX method that displays a preview on hover).
To suggest a tool for this, we'll need to know whether these are "old" Word format (Office 2003 - extension is .doc) or "new" Word format (Office 2007 - extension is .docx).
It will also be good to know what you're using to create your pages server-side, since different document-reading tools support different programming languages. If you're using Java to read .doc files, you can use the tool we use at my workplace, POI (http://poi.apache.org/). If you're using something else, try searching Google for {read [format] in [language]}, e.g. {read .docx in ruby}.
If all of this is Greek to you and you have no prior experience with developing custom server-side web code, this is probably going to be unnecessarily painful and you should consider an alternative (like manually creating a 3-line text "preview" page for each regular page, and then just showing that).
