totalEstimatedMatches behavior with Microsoft (Bing) Cognitive search API (v5) - bing-api

I recently converted some Bing Search API v2 code to v5 and it works, but I am curious about the behavior of "totalEstimatedMatches". Here's an example to illustrate my question:
A user on our site searches for a particular word. The API query returns 10 results (our page size setting) and totalEstimatedMatches set to 21. We therefore indicate 3 pages of results and let the user page through.
When they get to page 3, totalEstimatedMatches returns 22 rather than 21. It seems odd that with such a small result set it wouldn't already know it's 22, but okay, I can live with that. All results are displayed correctly.
Now if the user pages back again from page 3 to page 2, the value of totalEstimatedMatches is 21 again. This strikes me as a little surprising because once the result set has been paged through, the API probably ought to know that there are 22 and not 21 results.
I've been a professional software developer since the 80s, so I get that this is one of those devil-in-the-details issues related to the API design. Apparently it is not caching the exact number of results, or whatever. I just don't remember that kind of behavior in the V2 search API (which I realize was 3rd party code). It was pretty reliable on number of results.
Does this strike anyone besides me as a little bit unexpected?

Turns out this is the reason why the response JSON field totalEstimatedMatches includes the word ...Estimated... and isn't just called totalMatches:
"...search engine index does not support an accurate estimation of total match."
Taken from: News Search API V5 paging results with offset and count
As one might expect, the fewer results you get back, the larger % error you're likely to see in the totalEstimatedMatches value. Similarly, the more complex your query is (for example running a compound query such as ../search?q=(foo OR bar OR foobar)&... which is actually 3 searches packed into 1), the more variation this value seems to exhibit.
That said, I've managed to (at least preliminarily) compensate for this by setting the offset == totalEstimatedMatches and creating a simple equivalency-checking function.
Here's a trivial example in python:
while True:
    if original_totalEstimatedMatches < new_totalEstimatedMatches:
        original_totalEstimatedMatches = new_totalEstimatedMatches
        # set_new_offset_and_call_api() is a func that does what it says.
        new_totalEstimatedMatches = set_new_offset_and_call_api()
    else:
        break
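In case it helps, here is a minimal sketch of what a set_new_offset_and_call_api()-style helper might look like against the v5 Web Search endpoint. The endpoint, header, and parameter names come from the v5 documentation; the key placeholder, the explicit arguments, and the use of requests are my own assumptions for illustration:

import requests

SUBSCRIPTION_KEY = '<your-key>'  # placeholder
ENDPOINT = 'https://api.cognitive.microsoft.com/bing/v5.0/search'

def call_api_for_offset(query, offset, count=10):
    # Fetch one page of results and return its totalEstimatedMatches value.
    resp = requests.get(
        ENDPOINT,
        headers={'Ocp-Apim-Subscription-Key': SUBSCRIPTION_KEY},
        params={'q': query, 'offset': offset, 'count': count})
    resp.raise_for_status()
    return resp.json()['webPages']['totalEstimatedMatches']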

Revisiting the API, I've come up with a way to paginate efficiently without having to use the "totalEstimatedMatches" return value:
class ApiWorker(object):
    def __init__(self, q):
        self.q = q
        self.offset = 0
        self.result_hashes = set()
        self.finished = False

    def calc_next_offset(self, resp_urls):
        before_adding = len(self.result_hashes)
        self.result_hashes.update(hash(i) for i in resp_urls)  # <== abuse of set operations.
        after_adding = len(self.result_hashes)
        if after_adding == before_adding:  # <== then we either got a bunch of duplicates or we're getting very few results back.
            self.finished = True
        else:
            self.offset += len(resp_urls)

    def page_through_results(self, *args, **kwargs):
        while not self.finished:
            new_resp_urls = ...<call_logic>...
            self.calc_next_offset(new_resp_urls)
            ...<save logic>...
        print(f'All unique results for q={self.q} have been obtained.')
This will stop paginating as soon as a full response of duplicates has been obtained.
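For completeness, a usage sketch showing how the placeholders above could be driven. fetch_page() is a hypothetical stand-in for the <call_logic> placeholder and must return the list of result URLs for a given offset:

def fetch_page(q, offset, count=10):
    # Hypothetical: call the search API at this offset and return the list of
    # result URLs on that page (e.g. via the requests-based sketch earlier).
    ...

worker = ApiWorker(q='foo')
while not worker.finished:
    urls = fetch_page(worker.q, worker.offset)
    worker.calc_next_offset(urls)
    # <save logic>: persist urls here before requesting the next page.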

Related

How to avoid getting blocked by websites when using Ruby Mechanize for web crawling

I am successfully scraping building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.
The code for a single run is as follows:
require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @@key = key
    @@link_key = link_key
    @result_hash = {}
  end

  def crawl_propshark_single
    agent = Mechanize.new{ |agent|
      agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = "#{@@key}"
    page = agent.submit(form)
    page = form.submit

    page.links.each do |link|
      if link.text.include?("#{@@link_key}")
        if link.text.include?("PropertyShark")
          property_page = link.click
        else
          next
        end
        if property_page
          data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
          data_name = property_page.css("div.cols").css("th")[4].text
          @result_hash["#{data_name}"] = data_value
        else
          next
        end
      end
    end

    return @result_hash
  end
end #endof: class PropShark

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key, key_link)
puts spider.crawl_propshark_single
I get the following error, but in an hour or two it disappears:
undefined method `text' for nil:NilClass (NoMethodError)
When I run the above code in a loop, I delay the process with a sleep 80 between addresses.
The first thing you should do, before you do anything else, is to contact the website owner(s). Right now, your actions could be interpreted as anything from overly aggressive to illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for this particular thing. Either way, if you are going to be depending on this website for your product, you may want to consider playing nice with them.
With that being said, you are moving through their website with all of the grace of an elephant in a china store. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural human-emulation delay. Also, you should either disguise your useragent, or make it super obvious (Josh's Big Bad Scraper). You may even consider using something like Selenium, which uses a real browser, instead of Mechanize, to give away fewer hints.
You may also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page, but some random error page. A simple retry may be all you need to get that data in question. When scraping, a poorly-functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many online web-scraping/API-creation/data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.
It is very likely nothing is "blocking" you. As you pointed out
property_page.css("div.cols").css("td.r_align")[4].text
is the problem. So let's focus on that line of code for a second.
Say the first time around your columns are columns = [1,2,3,4,5]; then columns[4] will return 5 (the element at index 4).
Now for fun let's assume the next time around your columns are columns = ['a','b','c','d']; then columns[4] will return nil because there is nothing at index 4.
This appears to be your case: sometimes there are 5 columns and sometimes there are not, which leads to nil.text and the error you are receiving.

google search appliance accurate result count parameter not making a difference

We are having a result count issue where the pages have 10 results per page. When paginating, we get a result count of 64 on page 1 (i.e. start=0), 25 for page 2, and 21 for page 3.
I understand from the documentation on estimated vs. actual results that accuracy is not guaranteed, but the above counts occur even when I set filter=0 and rc=1. The rc=1 does not appear to make a difference whether it is included or not. We are on version 7.2.0.G.252.
filter=0&rc=1 should work for you and you should see the same count even after paginating.
What you need to check is that when you click a pagination link, filter=0&rc=1 are carried over, i.e., after pagination, see whether you still have the filter and rc parameters intact.
Also check using the default_frontend, as your custom frontend may not be handling these parameters.
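As an illustration (the hostname and collection name below are placeholders), each paginated request should keep both parameters, with only the start value changing:

http://gsa.example.com/search?q=term&site=my_collection&client=default_frontend&num=10&start=0&filter=0&rc=1
http://gsa.example.com/search?q=term&site=my_collection&client=default_frontend&num=10&start=10&filter=0&rc=1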
The problem was related to the collection, not the query. The content match pattern did not include a "/" at the end; once that was fixed, the count was accurate. Thanks for the assistance.

Deceptively puzzling: Pseudo-code (or C#, LINQ) for paging algorithm like Google

I've racked my brain over this one and it's harder than it looks.
Please could some hardcore hacker out there show me a nice way to implement the following:
Given an indexed list of unknown size
And a known max range size [say 10] (page size, i.e. how many results will be returned)
When I give this function an index (within the range of the indexed list)
Then it will return me a new range
And the returned range should be of size 10, if possible
And the returned range should always try to include 5 indexes before the input index
And the returned range should try to include 4 indexes after the input index
To see this working, go to Google and search for something. You get a set of results with some links (1 - 10).
When you click any link after page 6, the results will always have five links before and four links after the current page.
I just want to see how this is done, logically.
If anybody has a cool linq suggestion then I'd be really grateful.
I've already made this code work, but it's verbose and with lots of 'ifs' and 'elses' - I just know there's an elegant way to do it.
The problems I found were:
(1) Having a range that's less than the offset (i.e. only three results).
(2) Entering an index that's very close to the start or end of the input range.
I've searched the net over and over but can't find a simple (language agnostic) way to express this logic.
Thanks,
You can use a max function (language agnostic) to achieve this.
start_index = max(1, index - offset)
end_index = index + offset
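A slightly fuller sketch of the same idea (Python; the function name and signature are my own) that also clamps against the end of the list and keeps the window at the full page size where possible:

def paging_window(index, total, size=10, before=5):
    # Return (start, end) page numbers, 1-based and inclusive, for a window of
    # up to _size_ pages with up to _before_ pages shown before _index_.
    start = max(1, index - before)
    end = min(total, start + size - 1)
    # If we ran into the end of the list, pull the start back so the
    # window stays at the full size whenever enough pages exist.
    start = max(1, end - size + 1)
    return start, end

For example, paging_window(8, 20) returns (3, 12): five pages before the current page and four after; paging_window(19, 20) returns (11, 20); and paging_window(2, 3) returns (1, 3) for the tiny-result-set case.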

Trouble with facet counts

I'm attempting to use ElasticSearch for analytics -- specifically to track "top content" for a hand-rolled Rails CMS. The requirement is quite a bit more complicated than keeping a counter for each piece of content. I won't get into the depth of the problem right now, as I can't seem to get even the basics working.
My problem is this: I'm using facets and the counts aren't what I expect them to be. For example:
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":1,"all_terms":false,"order":"count"}}}}
Result:
{"el_ids":{"_type":"terms","missing":0,"total":16672,"other":16657,"terms":[{"term":"quis","count":15}]}}
Ok, great, the piece of content with id "quis" had 15 hits and since the order is count, it should be my top piece of content. Now let's get the top 5 pieces of content.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":5,"all_terms":false,"order":"count"}}}}
Result (just the facet):
[
{"term":"qgz9","count":26},
{"term":"quis","count":15},
{"term":"hnqn","count":15},
{"term":"higp","count":15},
{"term":"csns","count":15}
]
Huh? So the piece of content w/ id "qgz9" had more hits with 26? Why wasn't it the top result in the first query?
Ok, let's get the top 100 now.
Query:
{"facets":{"el_ids":{"terms":{"field":"el_id","size":100,"all_terms":false,"order":"count"}}}}
Results (just the facet):
[
{"term":"qgz9","count":43},
{"term":"difc","count":37},
{"term":"zryp","count":31},
{"term":"u65r","count":31},
{"term":"sxsi","count":31},
...
]
So now "qgz9" has 43 hits instead of 26? How can that be? I can assure you there's nothing happening in the background modifying the index. If I repeat these queries, I get the same results.
As I repeat this process of increasing the result size, counts continue to change and new content ids emerge at the top. Can someone explain to me what I'm doing wrong or where my understanding of how this works is flawed?
It turns out that this is a known issue:
...the way top N facets work now is by getting the top N from each shard, and merging the results. This can give inaccurate results.
By default, my index was being created with 5 shards. By changing this so the index only has a single shard, the counts behave in line with my expectations. Another workaround would be to always set size to a value greater than the number of expected facets and peel off the top N results.
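For reference, a minimal sketch of recreating the index with a single shard via the standard number_of_shards setting at index-creation time; the index name and host are placeholders, and the call is shown through Python's requests:

import requests

# Recreate the index with one shard so the per-shard top-N merge can't skew counts.
# 'analytics' and localhost:9200 are placeholders for your own index and cluster.
requests.put(
    'http://localhost:9200/analytics',
    json={'settings': {'number_of_shards': 1}})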

Ellipsizing a set of names

OK, I'm sure somebody, somewhere must have come up with an algorithm for this already, so I figured I'd ask before I go off to (re)invent it myself.
I have a list of arbitrary (user-entered) non-empty text strings. Each string can be any length (except 0), and they're all unique. I want to display them to the user, but I want to trim them to some fixed length that I decide, and replace part of them with an ellipsis (...). The catch is that I want all of the output strings to be unique.
For example, if I have the strings:
Microsoft Internet Explorer 6
Microsoft Internet Explorer 7
Microsoft Internet Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
Google Chrome 14
then I wouldn't want to trim the ends of the strings, because that's the unique part (don't want to display "Microsoft Internet ..." 3 times), but it's OK to cut out the middle part:
Microsoft...rer 6
Microsoft...rer 7
Microsoft...rer 8
Mozilla Firefox 3
Mozilla Firefox 4
Google Chrome 14
Other times, the middle part might be unique, and I'd want to trim the end:
Minutes of Company Meeting, 5/25/2010 -- Internal use only
Minutes of Company Meeting, 6/24/2010 -- Internal use only
Minutes of Company Meeting, 7/23/2010 -- Internal use only
could become:
Minutes of Company Meeting, 5/25/2010...
Minutes of Company Meeting, 6/24/2010...
Minutes of Company Meeting, 7/23/2010...
I guess it should probably never ellipsize the very beginning of the strings, even if that would otherwise be allowed, since that would look weird. And I guess it could ellipsize more than one place in the string, but within reason -- maybe 2 times would be OK, but 3 or more seems excessive. Or maybe the number of times isn't as important as the size of the chunks that remain: less than about 5 characters between ellipses would be rather pointless.
The inputs (both number and size) won't be terribly large, so performance is not a major concern (well, as long as the algorithm doesn't try something silly like enumerating all possible strings until it finds a set that works!).
I guess these requirements seem pretty specific, but I'm actually fairly lenient -- I'm just trying to describe what I have in mind.
Has something like this been done before? Is there some existing algorithm or library that does this? I've googled some but found nothing quite like this so far (but maybe I'm just bad at googling). I have to believe somebody somewhere has wanted to solve this problem already!
It sounds like an application of the longest common substring problem.
Replace the longest substring common to all strings with an ellipsis. If the string is still too long and you are allowed to have another ellipsis, repeat.
You have to realize that you might not be able to "ellipsize" a given set of strings enough to meet length requirements.
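A rough sketch of that idea in Python (names are my own; it ignores the "never ellipsize the very beginning" refinement, does not re-check uniqueness, and the brute-force substring search is only sensible for small inputs like these):

def longest_common_substring(strings):
    # Longest substring shared by every string, found by brute force over
    # substrings of the shortest string (fine for small, short inputs).
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ''

def ellipsize_common(strings, max_len):
    # Replace the longest chunk common to all strings with '...', repeating
    # while any string is still too long and a worthwhile common chunk remains.
    while max(len(s) for s in strings) > max_len:
        common = longest_common_substring(strings)
        if len(common) <= len('...'):
            break
        strings = [s.replace(common, '...', 1) for s in strings]
    return strings

Applied to the three meeting-minutes strings with max_len=40, this collapses the shared "Minutes of Company Meeting, " prefix into an ellipsis; it does far less well when the inputs share only short fragments, as in the browser-name example.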
Sort the strings. Keep the first X characters of each string. If this prefix is not unique to the string before and after, then advance until unique characters (compared to the string before and after) are found. (If no unique characters are found, the string has no unique part, see bottom of post) Add ellipses before and after those unique characters.
Note that this still might look funny:
Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook
I don't know what language you're looking to do this in, but here's a Python implementation.
def unique_index(before, current, after, size):
    '''Returns the index of the first part of _current_ of length _size_ that is
    unique to it, _before_, and _after_. If _current_ has no part unique to it,
    _before_, and _after_, it returns the _size_ letters at the end of _current_'''
    before_unique = False
    after_unique = False
    for i in range(len(current)-size):
        #this will be incorrect in the case mentioned below
        if i > len(before)-1 or before[i] != current[i]:
            before_unique = True
        if i > len(after)-1 or after[i] != current[i]:
            after_unique = True
        if before_unique and after_unique:
            return i
    return len(current)-size

def ellipsize(entries, prefix_size, max_string_length):
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.
    #If you want to preserve order then make a copy and make a mapping from the copy to the original
    entries.sort()
    ellipsized = []
    # you could probably remove all this indexing with something out of itertools
    for i in range(len(entries)):
        current = entries[i]
        #entry is already short enough, don't need to truncate
        if len(current) <= max_string_length:
            ellipsized.append(current)
            continue
        #grab empty strings if there's no string before/after
        if i == 0:
            before = ''
        else:
            before = entries[i-1]
        if i == len(entries)-1:
            after = ''
        else:
            after = entries[i+1]
        #Is the prefix unique? If so, we're done.
        current_prefix = entries[i][:prefix_size]
        if not before.startswith(current_prefix) and not after.startswith(current_prefix):
            ellipsized.append(current[:max_string_length] + '...') #again, possibly -3
        #Otherwise find the unique part after the prefix if it exists.
        else:
            index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
            if index == prefix_size:
                header = ''
            else:
                header = '...'
            if index + non_prefix_size == len(current):
                trailer = ''
            else:
                trailer = '...'
            ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
    return ellipsized
Also, you mention the strings themselves are unique, but do they all have unique parts? For example, "Microsoft" and "Microsoft Internet Explorer 7" are two different strings, but the first has no part that is unique from the second. If this is the case, then you'll have to add something to your spec as to what to do to make this case unambiguous. (If you add "Xicrosoft", "MXcrosoft", "MiXrosoft", etc. to the mix with these two strings, there is no unique string shorter than the original string to represent "Microsoft".) (Another way to think about it: if you have all possible X-letter strings, you can't compress them all to strings of X-1 or fewer letters. Just like no compression method can compress all inputs, as this is essentially a compression method.)
Results from original post:
>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
print entry
Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
print entry
Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...
