How to build a price comparison program that scrapes the prices of a product across several websites - xpath

I am trying to build a price comparison program for personal use (and for practice) that lets me compare prices of the same item across different websites. I have just started using the Scrapy library and have played around with scraping websites. These are my steps whenever I scrape a new website:
1) Find the website's search URL, understand its pattern, and store it. For instance, Target's search URL is a fixed prefix, url="https://www.target.com/s?searchTerm=", plus the URL-encoded search terms. (Sketches of all three steps follow this list.)
2) Once I know the website's search URL, I send a SplashRequest using Splash. I do this because many pages are heavily loaded with JS.
3) Look up the HTML structure of the results page and determine the correct XPath expression to parse the prices. However, many websites format their results pages differently depending on the search terms or product category, which changes the page's HTML. I therefore have to examine all the possible results-page formats and come up with an XPath that accounts for all of them.
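In sketch form, step 1 is just a lookup table plus URL-encoding. A minimal sketch (the Target prefix is the one quoted above; anything else is a placeholder):

from urllib.parse import quote_plus

# Map each site to its fixed search-URL prefix; the query gets
# URL-encoded and appended. The Target entry is the one from step 1.
SEARCH_URLS = {
    "target": "https://www.target.com/s?searchTerm=",
}

def build_search_url(site, query):
    """Return the full search URL for a site and a free-text query."""
    return SEARCH_URLS[site] + quote_plus(query)

For step 2, a Splash request from a Scrapy spider looks roughly like this (assuming scrapy-splash is installed and a Splash instance is running; the spider name, query, and wait time are placeholders):

import scrapy
from scrapy_splash import SplashRequest

class PriceSpider(scrapy.Spider):
    name = "prices"  # placeholder spider name

    def start_requests(self):
        url = build_search_url("target", "coffee maker")
        # Render the page with Splash so JS-loaded prices appear in the HTML.
        yield SplashRequest(url, self.parse, args={"wait": 2.0})

For step 3, one common trick is an XPath union ("|"), which lets a single expression cover several result-page layouts. Continuing the spider above (every selector here is hypothetical; you would substitute the ones found in each layout):

    def parse(self, response):
        # Either branch of the union may match, so one expression
        # covers two different result-page layouts.
        prices = response.xpath(
            '//span[@data-test="current-price"]/text()'
            ' | //div[contains(@class, "product-price")]/span/text()'
        ).getall()
        self.logger.info("found %d prices", len(prices))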
I find this process very inefficient, slow, and inaccurate. For instance, at step 3, even when I have the correct XPath I am still unable to scrape all the prices on the page (sometimes I even get prices for items that are not present in the rendered HTML page), which I don't understand. Also, I don't know whether the websites can tell that my requests come from a bot and are perhaps sending me faulty or incorrect HTML. Moreover, this process cannot be automated: I have to repeat steps 1 and 2 for every new website.

I was therefore wondering if there is a more efficient process, library, or approach I could use to help me finish this program. I have also heard something about using a website's API, although I don't quite understand how that works. This is my first time scraping and I don't know much about web technologies, so any help/advice is highly appreciated!

The most common problem with crawlers is that they determine everything to be scraped syntactically, whereas conceptualizing the entities you are working with helps a lot; I am speaking from my own experience.
In a scraping research project I was involved in, we reached the conclusion that we needed to use a semantic tree. This tree contains nodes representing data that is important for your purpose, and a parent-child relation means that the parent encapsulates the child in the HTML, XML, or other hierarchical structure.
You will therefore need some concept of how you want to represent the semantic tree and how it maps onto site structures. If your search method allows a logical OR, then you will be able to define the same semantic tree for multiple online sources (see the sketch below).
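As a rough sketch of what such a representation could look like (all concept names and selectors below are invented for illustration), each node pairs a concept with a list of alternative selectors that act as a logical OR:

class SemanticNode:
    """A concept (e.g. 'price') mapped to alternative selectors.

    The selectors act as a logical OR: whichever one matches on a given
    site is used, so one tree can serve multiple online sources."""
    def __init__(self, concept, selectors, children=()):
        self.concept = concept
        self.selectors = list(selectors)  # alternative XPaths, OR-ed together
        self.children = list(children)    # child = encapsulated by the parent

# Hypothetical tree: a result item encapsulates a title and a price.
tree = SemanticNode(
    "result-item",
    ['//li[@class="result"]', '//div[@class="item"]'],
    children=[
        SemanticNode("title", ['.//h3/text()']),
        SemanticNode("price", ['.//span[@class="price"]/text()']),
    ],
)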
On the other hand, if the owners of some sites are willing to allow you to scrape their data, then you might ask them to define the semantic tree.
If a given website's structure changes, then with a semantic tree you will more often than not be able to adapt to the change by just changing the selectors of a few elements, as long as the tree's node structure remains the same. And if some owners are partners in allowing scraping, you will be able to simply download their semantic trees.
If a website provides an API, then you can use that; read about REST APIs to do so. Note, however, that these APIs are probably not uniform across sites.
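To give an idea of what consuming such an API looks like, here is a minimal sketch with the requests library; the endpoint, parameters, and response shape are entirely hypothetical, since each real API documents its own:

import requests

# Hypothetical endpoint and response shape; consult each site's API docs.
resp = requests.get(
    "https://api.example.com/v1/products",
    params={"q": "coffee maker"},
    timeout=10,
)
resp.raise_for_status()
for item in resp.json()["results"]:
    print(item["name"], item["price"])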

Related

Vue approach to populating components

I can't seem to find any information on how best to put data inside a component. To define the problem, let's say we have a user table in a database, and this table has an ID and maybe 30 fields with details about the user.
Now say I want to create a Vue component that shows a list of many users' details; let's just call it <user-details>. To show this on a page, would you:
1) Call the database to get all the users you want to show, get their IDs, then do a for loop with <user-details id="xxx"> and have Vue make an ajax call to some API to fetch the details?
2) OR use the inline version <user-details id="xxx" name="user name" ...> with 30+ fields?
3) OR have a specific Vue component for this particular user list (maybe it's users who did not validate their email or something), like <users-not-validated>, and use ajax?
The problem I see is that in case 1, you have already queried the database for the IDs, and then you query it again via ajax with pretty much the same SQL.
In case 2, it's just annoying to fill so many fields out each time you use the component.
In case 3, you will end up with a TON of components...
How do you approach this?
You won't find such information because it's not Vue related. Vue doesn't care what you use it for and how you structure your data. It aims to allow you to do anything you want.
Just as it doesn't care what your folder structure looks like (because, at its core, all it needs in order to render is a single DOM element), it also doesn't care how you organize your API, how you structure your application, your pages or even your components.
Obviously, having this amount of freedom is not always a good thing. If you look around, you'll notice people who use Vue professionally have embraced certain patterns/structures which allow for better code reuse and more flexibility. Nuxt is one such good example.
To anyone just starting with Vue, I recommend trying Nuxt as soon as possible, even if it's overkill for their little project, because they will likely pick up some good patterns.
Getting down to your specific question, in terms of data API architecture, you always have to ask yourself: what's the underlying principle?
The underlying principle is to make your application as fast as possible. To do that, ideally, you want to fetch exactly as much data as you want to display, but no more. Therefore:
- when getting the same data, if you have a choice, always try to lower the number of requests. You don't want each item in the list to initiate a call to the server when it is rendered. Make a single call for the entire list (fetching only what you display in the list view) and call for details only if the user requests them (presses the details button).
- adjust your pagination to cater to how many items you can display on a screen, but also to how long it takes to load a page. If it takes too long, lower the pageSize and give your items more padding. If you think about it, most people prefer a snappy app with fewer, generously padded items per page to one which takes seconds to load each page and displays items so crammed they're hard to click/tap or hard to follow in the list without losing the row.
However, you have to take these guidelines with a grain of salt. In the vast majority of cases fetching the full data in one call makes little to no difference to user experience. Often the delays have to do with server cold starts (the first call to a server takes longer, as it needs to "wake up", but all subsequent calls of the same type are faster), with unoptimized images, or with bad internet connectivity (as in, it works poorly regardless of whether you receive only the names or the full list of details).
Another aspect to keep in mind is that getting all the data at once is a trade-off: you get a slower initial call, but afterwards you can do seamless animations between list view and detail view, since the data is already fetched and no more loading is required. If you handle the loading state gracefully, it's a viable option in many scenarios.
Last, but not least, your 2nd point's drawback does not exist. You can always bind all the details in one go:
<user-details v-bind="user" />
is equivalent to
<user-details :id="user.id" :name="user.name" :age="user.age" ... />
To give you a very basic example, the typical markup for your use case would be:
<div v-if="isLoadingUsers" />
<user-list v-else :users="users">
  <user-list-item v-for="(user, key) in users"
                  :key="key"
                  v-bind="user"
                  @click="selectedUser = user" />
</user-list>
<user-details-modal v-bind="selectedUser" />
It's obviously a simplification; you might opt not to have a user-details modal but instead a cool transform on the list item, making it grow and display more details, etc...
When in doubt, simplify. For example, only showing details for one selected item (and closing it when selecting another) will solve a lot of UI problems right off the bat.
As for the last question (whether or not to have different components for different states), the answer should come from answering a different question: how large should you allow your component to get? The upper limit is generally considered to be around 300 lines, although I know developers who don't go above 200 and others who have no problem with 500+ lines in a component.
When it becomes too large, you should extract part of it (say, the user-not-validated functionality) into a sub-component, and end up with this inside the <user-detail> component:
<user-detail>
  ... common details (title, description, etc...)
  <div v-if="user.isValidated">
    ... normal case
  </div>
  <user-not-validated v-else v-bind="user" />
  ... common functionality (action bar, etc...)
</user-detail>
But these are sub-components of your <user-detail> component, extracted to help you keep the code organized; they shouldn't replace <user-detail> in its entirety. Similarly, you could extract user-detail header or footer components, whatever makes sense. Your goal should be to keep your code neat and organized. Follow whatever principles make more sense to you.
Finally, if I had to single out one helpful guideline for making code architecture decisions, it would definitely be the DRY principle. If you end up not having to write the same code in multiple places in the same application, you're doing it right.
Hope you'll find some of the above useful.

How do I speed up iterations for web crawling ids-nokogiri/ruby

What I want to do is iterate through all possible product pages given a 10-digit numerical ID.
An example of the page I would like to scrape is somewebsite.com/product?productid=10000000000
The scraper would go to the page and check whether a certain tag exists to determine if it is a product page, then log the URL if it is, or move on to the next page if it is not.
Doing iterations one by one (productid = large number++) is too slow, and from looking at some sample product IDs it seems that numbers without obvious patterns (such as 121212121212) are more likely. I wanted to ask what would be a way to iterate through these pages in a more reasonable amount of time. I am doing this in Ruby with Nokogiri right now.
Iterating through that number of product IDs is a horrible way to treat a target site, and odds are good you'd get banned, because it's unlikely their products are sequentially numbered. In other words, you would get a lot of missing-page responses, which will be logged, and if their web-development team is decent they'll get a list of those along with the requesting IP.
Instead, be smart and find a page that lists all their products, parse out that list, then walk it. If there isn't a single page containing them, but many, then start at the first and walk them all until you've reached the last one. Aggregate the product IDs into an array, or process them as you read each page.
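A sketch of that loop, in Python for illustration (the same shape works in Ruby with Nokogiri); the listing URL and both selectors are hypothetical:

import time
import requests
from lxml import html

page_url = "https://somewebsite.com/products?page=1"  # hypothetical listing page
product_ids = []
while page_url:
    doc = html.fromstring(requests.get(page_url, timeout=10).content)
    # Hypothetical selectors: collect product IDs, then follow "next".
    product_ids.extend(doc.xpath('//a[@class="product"]/@data-product-id'))
    nxt = doc.xpath('//a[@rel="next"]/@href')
    page_url = nxt[0] if nxt else None
    time.sleep(2)  # be gentle: pause between page requests (see below)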
Also, be very gentle and kind to their site by sleeping between iterations, as in the sketch above. Failing to do that can also get you banned, because requesting thousands of pages one immediately after another will drive up their host's CPU and network usage, which again will alert them that you are spidering their site and negatively impacting their ability to serve normal customers.
Finally, if you really want to do things the right way, your first connection to the site should request their "robots.txt" file. Process it, and honor its directives in your code. That file is put there to help robots/spiders/scrapers do the right thing and not unfairly antagonize the site or its web admins. Failing to do so is a sure path to being banned. More information is available at "The Web Robots Pages" and the "Robots exclusion standard".
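Checking robots.txt is easy to automate. Python's standard library ships a parser (Ruby has equivalent gems); the URLs and user-agent string below are placeholders:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://somewebsite.com/robots.txt")
rp.read()
# Only fetch a URL if the site's robots.txt permits it for your bot.
if rp.can_fetch("my-bot", "https://somewebsite.com/product?productid=123"):
    pass  # safe to request this page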

"Time to Interact" metric in web performance measurements

Apparently "Time to Interact" is the new metric to use when measuring the perceived speed of a webpage. I'm interested in understanding a bit more about what this actually is.
The term was apparently coined by Radware, and is being pushed as the most meaningful performance measurement (compared to things such as Time to First/Last Byte, Time to Render etc.).
It is described as:
the point at which a page displays its primary interactive (think clickable) content, rather than full page load.
This seems pretty subjective to me; what is the "primary interactive content" of a webpage for example?
There have been reports citing results for the measurement, so somehow it is being measured; furthermore, the measurement must be automated, as the result sets are pretty big (~500 sites were tested).
Other than the above quote, I cannot find any more information on how to measure this.
As Google is placing more emphasis on above-the-fold (or visible) content, I am wondering whether this metric is actually more like "Time to First Meaningful Render", i.e. contextual to the current page's goal. For example, on an eCommerce site's product page, this could be the main image and an add-to-basket link.
I am keen to understand this metric, as to me it does seem like the most useful one. My question is therefore whether anyone is measuring this, and if so how are they doing so?
You kind of answered your own question: it is subjective, and contextual to your current project.
What if I'm testing a site with only HTML without any complex resources? There is no point measuring TTI there. On the other hand, let's see this demo site.
In the loading chart for that site, the blue line marks the DOMContentLoaded event (the main document is loaded and the markup parsed), while the red line indicates the load event, where all page resources have loaded. The TTI line would go in between the two; it is defined differently for each project, based on when the resources essential to interaction have loaded.
For example, let's say that the pictures on the demo site are not essential to the core features of the site. While the main site loaded in 0.8 seconds, the 3 big pictures took an extra 36 seconds to load. In this case, using the overall response time as a KPI would yield a ~36 second response time, while if you define TTI to exclude those big, non-essential resources, you end up with a response time under 1 second.
I am keen to understand this metric, as to me it does seem like the most useful one.
Definitely useful, but as you said in your question, it's specific to the project. You wouldn't measure TTI on a simple, relatively static web app; you would probably measure overall response time. I always define KPIs "tailored" to the current project, instead of taking common metrics and "forcing" them onto a project.
My question is therefore whether anyone is measuring this, and if so how are they doing so?
I have definitely used it before: you should identify the essential resources for your site, and when the last of those resources has loaded, that is your TTI. This could be a JavaScript file, a CSS file, etc...
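One way to automate this is to read the browser's Resource Timing data and take the latest responseEnd among the essential resources. A sketch under assumptions (Selenium driving Chrome; the URL and the "essential" resource names are hypothetical):

from selenium import webdriver

driver = webdriver.Chrome()  # assumes a local Chrome/chromedriver setup
driver.get("https://example.com")  # placeholder URL
# Resource Timing entries for everything the page loaded.
entries = driver.execute_script(
    "return performance.getEntriesByType('resource')")
essential = ("app.js", "main.css")  # hypothetical essential resources
# TTI here = when the last essential resource finished loading.
tti_ms = max(e["responseEnd"] for e in entries
             if e["name"].endswith(essential))
print("TTI ~", tti_ms, "ms")
driver.quit()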
Websites are getting more complex. While they might not always contain more content, they still have more resources to load, as the user interaction/user experience is more complex from a technical point of view. Ajax helps us load different parts separately, so rather than one page load we have the loading of several small things, and for each of these parts we can measure the loading performance. But some parts of the site may be more important than others. The "primary interactive content" is the part of your view that enables the user to do what they intend to do, for example buy a train ticket. If some advertisement or a special animation on the left side of the screen hasn't loaded, this does not prevent the user from starting to buy a ticket. But of course "primary interactive content" as a term is quite vague, and you have to define it for your specific application. It is the point at which an average user can and will start to interact with the website, while some parts are still loading.
This is how I understand the concept, and I see the difference from "Time to First Meaningful Render" here: you might have a basket rendered on your eCommerce page, but the GUI is not yet responsive. You see something meaningful, but the interactivity is not there yet. Therefore TTI >= TtFMR.
Measuring TTI requires you to define which elements are required for interactivity, which depends not only on what the site does but also on HOW it does it. So it depends highly on your implementation/technology.

Most performant live search technique for mobile safari

I am building a mobile web application that targets webkit. I have a requirement to perform a live search (on keypress) against a database of ~5000 users.
I've tried a number of different techniques:
On page load, making an AJAX call which loads an in-memory representation of all 5000 users and querying them on the client. I tried sending JSON, which proved to be too large, and also a custom delimited string, which was then parsed using split(). This was better, but ultimately searches against this array of users were slow.
I tried using a conventional AJAX call which returns users based on a query, also using the custom delimited-string technique. This was better, but I had to tune it so that searches were only performed with a minimum of 3 characters. This is not optimal, as I would like to start filtering after 1 character. I could also throttle the calls so that not every keystroke within a certain threshold triggers a request; that could help with performance, but I'd rather not have to fiddle with that sort of thing.
Facebook mobile does this very well if you try their friend search. Searches happen instantaneously, and are triggered after 1 character.
My question is, does anyone have any suggestions for faster live searches for a mobile app? Should I be looking at localStorage? Is this reliable, feasible?
Is there any reason you can't use a binary search? With the users sorted by name, all the names matching what has been typed so far sit in one contiguous block (see the sketch after the links below). If you want both first-name and last-name search, you could create a second copy of the data sorted by last name and look in both sets.
Some helpful but more complicated data structures that address this type of problem include:
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
http://en.wikipedia.org/wiki/Trie
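To illustrate the binary-search idea from above (sketched in Python for brevity; in the app itself this would run client-side in JavaScript): once the names are sorted, all matches for a prefix form a contiguous block whose bounds two binary searches find in O(log n).

import bisect

names = sorted(["alice", "albert", "bob", "carol"])  # toy data, sorted once

def prefix_matches(names, prefix):
    """Return the contiguous block of sorted names starting with prefix."""
    lo = bisect.bisect_left(names, prefix)
    hi = bisect.bisect_left(names, prefix + "\uffff")  # just past the prefix range
    return names[lo:hi]

print(prefix_matches(names, "al"))  # ['albert', 'alice']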

Google crawling and indexing algorithms

I am looking for documents on how Google crawls and indexes content. I have read many "light" papers and articles on what you need to do to improve your ranking and make sure your content is properly indexed, but I am looking for more advanced technical documents on how Google crawls and indexes content.
The things I would like to know more about:
What elements Google looks for when it crawls: page content, URL format, keywords, description, etc.
How is the index updated?
Basically, I am trying to understand why some pages are indexed but not others, even though their formats are similar, and why only 10% of my site's pages appear when I search the entire domain, even though my server logs show that Google crawled every single link.
The answers to both things are closely-guarded trade secrets, ostensibly to prevent gaming the system.
Also keep in mind that Google makes over 400 algorithmic changes per year, making it close to impossible for an outsider to be accurate and up-to-date. Short of working for Google, you're likely not going to find an in-depth and accurate answer.
However, Matt Cutts, head of the web spam team, frequently provides the most accurate insights in how Google handles content, both on his blog and on the GoogleWebmasterHelp YouTube channel. It's worth going through his content to get a much better understanding of Google's methodology.
To give you a technical view of how a web crawler works, I suggest taking a deep look at the nutch.apache.org solution (Apache Nutch).
A typical web crawler has the following parts: a fetcher, a parser, an indexer, and a searcher. To put it briefly, a web crawler fetches all the URLs available on a website and creates segments where it stores up to 101 KB per page. Those pages are parsed; stop words such as "and", "or", and "the" are not stored, while the other words are analyzed using Bayesian calculations in order to produce a rank.
Search-engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. This is mainly done by storing a list of occurrences of each search criterion in an inverted index, typically implemented as a hash table or binary tree.
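As a toy illustration of an inverted index (a sketch of the general idea, not Google's actual implementation): each term maps to the set of documents containing it, so a query becomes a set intersection rather than a scan of every page.

from collections import defaultdict

docs = {1: "cheap coffee maker", 2: "coffee grinder", 3: "tea maker"}

# Build the inverted index: term -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# A query is then an intersection of posting lists.
print(index["coffee"] & index["maker"])  # {1}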
As Mark stated, Google's calculations are mainly trade secrets, but the patents issued by Google could be a good start. PageRank (http://en.wikipedia.org/wiki/PageRank) mainly analyses backlinks and the importance that the websites pointing to yours have in people's preferences. In my experience it is important to offer an XML sitemap listing all the pages on your site; in that sitemap you can define the crawl frequency for each page. gsitecrawler.com/ is an interesting possibility.
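Since PageRank itself is published, a toy power-iteration sketch can illustrate the idea (nothing like Google's production system; the link graph below is made up):

# Toy PageRank: each page's rank is redistributed along its outlinks.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # hypothetical link graph
n, d = len(links), 0.85                            # d = damping factor
pr = {p: 1.0 / n for p in links}                   # start with a uniform rank
for _ in range(50):                                # iterate until it settles
    pr = {p: (1 - d) / n + d * sum(pr[q] / len(links[q])
                                   for q in links if p in links[q])
          for p in links}
print(pr)  # pages with more/better backlinks end up with higher rank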
Google Website Optimizer will give you the chance to see what Google is finding on your site. Logs are OK, but the robot probably runs into problems, and the best way to know about them is with Google's tool, which displays the errors.
Finally, most of your concerns are things that SEO specialists live for. I suggest you check sites like seomoz.com and their tools... You will learn how to position your website better in organic search-engine results.
Hope it helps! Sebastian.
"Yes" Google like fresh & unique content.
Use Google webmaster guideline "try this instead" H1 or H2 meta tag on your HTML programming under the head tag ....your keyword. Anchor have to must use your business related keywords in H1, H2, it can help your site search engine.
Also use for Rich snippets in this tag..!
It scans your web page very precisely and sensitively. Factors such as whether your JavaScript is embedded or in a separate file matter, and whether you use frames in your design or heavy graphics can reduce the ranking of your page. Keywords are obviously rank-affecting entities, and broken links also bring your website's ranking down.
Basically, you can refer to http://www.tutorialspoint.com/seo/ to go through all the important points about Google's crawler. This will take a maximum of 40 minutes.
See also Google's paper "MapReduce: Simplified Data Processing on Large Clusters".
I analysed the latest algorithm and found that Google now gives more importance to CONTENT rather than LINKS. So if your content is good enough, with the proper tags in place, Google will index it for you automatically. I would suggest using H1 - H6, all in a good manner.
