Scrapy doesn't seem to optimize crawling based on depth_limit

I am new to scrapy and it seems this problem has not been asked.
The question is: I just want a shallow crawl of a large site (around 500 links), so I set depth_limit=1 (maybe later I will extend it to 2 or 3), and I also require the spider to filter all duplicate responses.
However, after reading the log, I find that even with depth_limit=1 the crawler still checks a huge number of outlinks (around 100,000) whose returned depth is greater than 1, which is a waste of time: all links on the front page are at depth 1, so any links extracted from those pages will necessarily be at depth 2 or higher. There is no reason to extract the outlinks of depth-1 pages when I only want pages up to depth 1.
So how do I write my settings to implement my own logic or optimize the spider?
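For reference, a minimal sketch of the kind of settings implied above (the values are assumptions based on the question, not a verified configuration):

# settings.py -- illustrative only
DEPTH_LIMIT = 1   # don't schedule requests deeper than depth 1
# Duplicate requests are already dropped by Scrapy's default RFPDupeFilter,
# so no extra setting is needed for duplicate filtering unless a custom filter is wanted.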

I think you are right: Scrapy does extra work here.
Depth limit (or depth filtering) is implemented as a spider middleware at the end of the chain. So only after the whole cycle (scrape the page, generate the items, generate the requests) does it filter out the requests.
I can outline a solution as follows:
If you are using BaseSpider, you can use request.meta to store the depth and increment it for each request you generate. Since you are the one generating requests in the parse callback, you simply don't generate a request once the depth limit is reached.
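A minimal sketch of that idea (in recent Scrapy versions BaseSpider is simply scrapy.Spider; the URL and selector below are illustrative only):

import scrapy

class ShallowSpider(scrapy.Spider):
    name = "shallow"
    start_urls = ["https://example.com/"]  # hypothetical start page
    max_depth = 1

    def parse(self, response):
        depth = response.meta.get("my_depth", 0)
        yield {"url": response.url, "depth": depth}
        # Only extract and schedule outlinks while we are below the cutoff;
        # links found on a page already at max_depth would exceed the limit anyway.
        if depth < self.max_depth:
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse,
                                      meta={"my_depth": depth + 1})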
If you are using CrawlSpider, then you have to override _requests_to_follow of the base CrawlSpider. You will still be propagating the depth. Everything stays the same in _requests_to_follow except that, when the depth limit is reached, you do not extract links and generate requests.
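A hedged sketch of that override: rather than copying the body of _requests_to_follow, it short-circuits before link extraction once the response is already at the cutoff depth. It relies on the default DepthMiddleware having recorded the depth in response.meta; you could equally propagate your own counter as in the sketch above. Class names and URLs are illustrative.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ShallowCrawlSpider(CrawlSpider):
    name = "shallow_crawl"
    start_urls = ["https://example.com/"]  # hypothetical
    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)
    max_depth = 1

    def _requests_to_follow(self, response):
        # Don't even extract links from pages that are already at the limit.
        if response.meta.get("depth", 0) >= self.max_depth:
            return
        for request in super()._requests_to_follow(response):
            yield request

    def parse_item(self, response):
        yield {"url": response.url}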
Hope it helps.

Related

Is Elasticsearch Scroll API not recommended for real-time pagination?

I understand that the Elasticsearch Scroll API is not intended for real-time user requests. But would it be bad if it were used for that? I have a requirement to implement paginated results (to be displayed on a web frontend), and the from/size approach is returning duplicates across pages, presumably because I have a sharded setup (with no replicas at all). I've tried setting preference but it did not help.
Scroll API does not seem to have this issue, I'm wondering if it's really bad to use it for my use case?
Thanks
Results from a scrolling search reflect the state of the index at the time of the initial search request; subsequent indexing or document changes only affect later search and scroll requests. This means your pagination is pinned to the moment you issued the search, so you will not see newly indexed documents and may still see deleted ones in your results. Also, the Scroll API is no longer recommended by Elasticsearch for deep pagination (as of ES 7.x). You can find more info on the Elasticsearch documentation page: https://www.elastic.co/guide/en/elasticsearch/reference/7.x/scroll-api.html
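For illustration, a minimal scroll loop with the Python client (7.x-style calls; the index name and page size are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

page = es.search(index="my-index", scroll="2m", size=100,
                 body={"query": {"match_all": {}}})
scroll_id = page["_scroll_id"]
while page["hits"]["hits"]:
    # ... process page["hits"]["hits"] ...
    page = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = page["_scroll_id"]
es.clear_scroll(scroll_id=scroll_id)  # free the server-side search context when done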
On the question of why you get duplicate results: I think this is caused by intermediate indexing. When doing independent search calls with pagination, each call runs independently (still using some caching). So if you ask for the first 100, you get the first 100 as of that moment. When you then ask x seconds later for the 'next' 100, you get results 100-199 as of x seconds later. If a new document was indexed in the meantime that logically belongs in the first 100, it pushes the rest further down. This way, result #100 (the first one in the second response) might have been #99 in the first call. When you then glue the pages together in the UI, you see the same result twice.
Both scroll and search_after are designed to refer Elasticsearch back to the original call, indicating that you want to continue from that point onwards.
I have not found a good explanation, though, of why search_after is better than scroll.
I assume that scroll is optimized for the use case where you will go through the entire set anyway (so the pagination only exists to avoid overloading the client, and the pipe between ES and the client, with chunks that are too big at once), while search_after is optimized for the use case where you will likely only go a few pages deep (human users tend to stay on the first page, with quickly decreasing frequency of going much further, because going deep means forcing your eyes through overwhelming amounts of information). Implementing good filters in the user interface is the much better approach.
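For comparison, a hedged sketch of search_after pagination with the Python client (index name, sort field and sizes are assumptions; the key points are the unique tiebreaker in the sort and feeding the last hit's sort values back into the next call):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "size": 100,
    "sort": [{"timestamp": "desc"}, {"id": "asc"}],  # "id" is an assumed unique tiebreaker field
    "query": {"match_all": {}},
}
page = es.search(index="my-index", body=body)
hits = page["hits"]["hits"]
while hits:
    # ... process hits ...
    body["search_after"] = hits[-1]["sort"]  # continue after the last hit seen
    page = es.search(index="my-index", body=body)
    hits = page["hits"]["hits"]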

What's the expected behavior of the Bing Search API v5 when deeply paginating?

I perform a bing API search for webpages and the query cameras.
The first "page" of results (offset=0, count=50) returns 49 actual results. It also returns a totalEstimatedMatches of 114000000 -- 114 million. Neat, that's a lot of results.
The second "page" of results (offset=49, count=50) performs similarly...
...until I reach page 7 (offset=314, count=50). Suddenly totalEstimatedMatches is 544.
And the actual count of results returned per-page trails off precipitously from there. In fact, over 43 "pages" of results, I get 413 actual results, of which only 311 have unique URLs.
This appears to happen for any query after a small number of pages.
Is this expected behavior? There's no hint from the API documentation that exhaustive pagination should lead to this behavior... but there you have it.
Each time the API is called, the search API obtains a group of possible matches starting at the offset in the result set, and then filters out results based on different parameters (e.g. spam, duplicates, safesearch setting, etc.), leaving a final result set. If the final result set after filtering and optimization contains more results than the count parameter, then exactly count results are returned. If count is larger than the final result set, then the whole final result set is returned, which will be fewer results than count. If the search API is called again, passing in the offset parameter to get the next set of results, the filtering process happens again on that next set, which means it may also return fewer than count results.
 
You should not expect the full count parameter number of results to always be returned for each API call.  If further search results beyond the number returned are required then the query should be called again, passing in the offset parameter with a value equal to the number of results returned in the previous API call.  This also means that when making subsequent API calls, the offset parameter should never be a hard coded value and should always be calculated based on the results of previous queries. 
 
totalEstimatedMatches can also add to confusion around the Bing Search API results.  The word ‘estimated’ is important because the number is an estimation based on an initial quick result set, prior to the filtering described above.  Additionally, the totalEstimatedMatches value can change as you iterate through the result set by making subsequent API calls with increasing offset values.  The totalEstimatedMatches should only be used as a rough guide indicating the magnitude of the possible result set, and it should not be used to determine the number of results that will ultimately be returned.  To query all of the possible results you should continue making API calls, passing in offset with a value of the sum of the results returned in previous calls, until that sum is greater than totalEstimatedMatches of the most recent API call.
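As an illustration only, here is that loop sketched in Python (the endpoint, header name and key are assumptions about a v5 API setup, not values from the question):

import requests

ENDPOINT = "https://api.cognitive.microsoft.com/bing/v5.0/search"  # assumed v5 endpoint
HEADERS = {"Ocp-Apim-Subscription-Key": "YOUR_KEY"}  # hypothetical subscription key

def fetch_all(query, count=50):
    results, offset = [], 0
    while True:
        resp = requests.get(ENDPOINT, headers=HEADERS,
                            params={"q": query, "count": count, "offset": offset}).json()
        web = resp.get("webPages", {})
        page = web.get("value", [])
        if not page:
            break
        results.extend(page)
        offset = len(results)  # offset = sum of results returned so far, never hard-coded
        if offset >= web.get("totalEstimatedMatches", 0):
            break
    return results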
 
Note that you can see this same behavior by going to bing.com directly and using a query such as https://www.bing.com/search?q=bill+gates&count=50. Notice that you will get around 34 results with a totalEstimatedMatches of ~567,000 (valid as of June 2017; future searches may change), and if you click the 'next page' arrow you will see that the next query executed starts at the offset of the 34 results returned by the first query (i.e. https://www.bing.com/search?q=bill+gates&count=50&first=34). If you click 'next' several more times you may also see the totalEstimatedMatches change from page to page.
This seems to be expected behavior. The Web Search API is not a crawler API, thus it only delivers results, that the algorithms deem relevant for a human. Simply put, most humans won't skim through more than a few pages of results, furthermore they expect to find relevant results on the first page.
If you could retrieve the results in the millions, you could simply copy their search index and Bing would be out of business.
Search indices seem to be instruments of political and economic power; as far as I know there are only four relevant search indices worldwide: one from Google, one from Microsoft (Bing), one from Russia, and one from China.
Those who control the search, control the Spice... ;-)

Sencha Touch 2 - How to Abort store load with ajax/json

How can I abort a store load while the ajax call is still executing? I have a simple store with proxy type of 'ajax' and 'json' reader.
The documentation does not indicate any way to abort this. I have noticed that jsonp does allow aborting a load in progress. Do I have to switch to jsonp?
The motivation here is that I have a search bar and list object that gets populated with results. The actual search on the backend can take 5-10 seconds. So if a user starts a search then quickly wants to do another search (in case, for example, the first search was a typo), then the new search needs to abort the first search ajax call. Otherwise, I am seeing mixed results showing up in my search results.
As usual, any help is greatly appreciated!
Mohammad
The solution I have used in the past to solve this exact problem is to track each request with an incrementing counter; as requests complete, I check the counter, and if a request with a higher counter has since been made, I disregard the result.
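A framework-agnostic sketch of that counter idea, written in Python for illustration rather than against Sencha's store/proxy API (all names are made up):

request_counter = 0

def render(results):
    print("showing", len(results), "results")

def on_search(term, do_ajax):
    """do_ajax(term, callback) stands in for the asynchronous store/proxy load."""
    global request_counter
    request_counter += 1
    request_id = request_counter

    def callback(results):
        # A newer search was started while this one was in flight: drop the stale result.
        if request_id != request_counter:
            return
        render(results)

    do_ajax(term, callback)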

How does pagination on Reddit's home page work?

Reddit uses a time decay algorithm. That would mean the sort order is subject to change. When a user goes to page 2, is there a mechanism to prevent them from seeing a post that was on page 1 but was bumped down to page 2 before they paged over? Is it just an acceptable flaw of the sort method? Or are the first couple of pages cached for the user so this doesn't happen?
Side note: It's my understanding that Digg cannot suffer from this issue, but that HackerNews and Reddit can.
From the 'next' URL you can see: http://www.reddit.com/?count=25&after=t3_dj7xt
So clearly the next page starts at the post after t3_dj7xt, whatever that translates to. This could be accomplished using IDs: you'd pass after=188 and the next page starts at 189, ensuring you don't see the same post again even if a time delay occurred.
It is probably keying off the last ID seen rather than using an offset. Compare these two SQL approaches:
SELECT * FROM Stories WHERE StoryID > $LastStoryID ORDER BY StoryID LIMIT 25;
rather than:
SELECT * FROM Stories ORDER BY StoryID LIMIT 20, 10;

GWT - Populate Grid asynchronously

We've got a GWT application with a simple search mask that displays the results in a grid.
Server side processing time is ok as well as network latency.
Client rendering time is ok even on low spec hardware with internet explorer 6 as long as the number of results is not too high (max 100 rows in the grid).
We have implemented a navigation scheme allowing the user to scroll up/down the grid. That's fast enough also.
Does anybody have an idea whether it is possible to display the first 100 results immediately and pull the rest in the background? The GWT architecture allows this. However, I'm interested in possible pitfalls, e.g. what happens if the user starts another query while the browser is still fetching the previous results.
Thanks!
Holger
LazyPanel and this blog post might be a good starting point for you :)
The GWT Incubator has also many interesting (albeit not always complete/perfect/stable) tables and other pagination solutions - like PagingScrollTable.
Assuming your plan is to send the first 100 and then bring the rest, you can fetch the remaining results in bulks. Then, if a user initiates another search, you just wait for the end of the current bulk (i.e. check between bulk retrievals whether you have a pending query).
Another way you can go is to assign identifiers to the user's searches. This makes the problem of mixed results non-existent, and it also helps you keep a results history across multiple searches.
We found that users love the live-grid look and feel, which solves most of those problems, but that might not always be an option.
