Get number of results for Google Search Appliance query - google-search-appliance

We are trying to implement a feature on our search results page similar to dynamic navigation, but different enough that the included dynamic navigation feature cannot be used to meet our business requirements. We've got everything working the way we need, except for one thing.
Using "RES/M" will display the number of results for the current search, but we need the number for a different search, so that we could display that number before the user was to try that search, very similar to dynamic navigation, and if the number is 0, to not enable that other search.
For Example:
Current Search (234)
Search 2 (2)
Search 3 (67)
Search 4 (0)
Is there any way to get the number of results for a search other than the search that's being run at the time?

As far as I know, the only way to do this is to add one AJAX call in your frontend for each additional search you need.
This is not a clean solution, but it is the only one I know of: for one of our customers we implemented a similar solution using jQuery, and it works very well.
Be careful about how many parallel AJAX requests you make, because the GSA can only handle a limited number of concurrent requests, depending on the version.
Finally, if you want an accurate count for each additional search, you can use the parameter rc=1; see Appendix A: Estimated vs. Actual Number of Results in the XML reference: http://static.googleusercontent.com/media/www.google.com/it//support/enterprise/static/gsa/docs/admin/70/gsa_doc_set/xml_reference/xml_reference.pdf. But be careful, because this parameter can add significant latency to the search.
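To illustrate the one-call-per-search idea, here is a minimal sketch in Python rather than jQuery (just for brevity); the appliance host, client (frontend) and site (collection) values are placeholders for your own deployment. Each extra search is one request to the XML interface, reading RES/M from the response:

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

GSA_HOST = "http://gsa.example.com"      # placeholder: your appliance
FRONTEND = "default_frontend"            # placeholder: your frontend
COLLECTION = "default_collection"        # placeholder: your collection

def result_count(query):
    params = urllib.parse.urlencode({
        "q": query,
        "output": "xml_no_dtd",
        "client": FRONTEND,
        "site": COLLECTION,
        "num": 1,   # we only need the count, not the results
        "rc": 1,    # accurate rather than estimated count; adds latency
    })
    with urllib.request.urlopen(GSA_HOST + "/search?" + params) as resp:
        root = ET.fromstring(resp.read())
    m = root.find("RES/M")                        # <RES><M> holds the result count
    return int(m.text) if m is not None else 0    # no <RES> element means 0 results

for label, query in [("Search 2", "query2"), ("Search 3", "query3")]:
    count = result_count(query)
    print(label, count, "(disabled)" if count == 0 else "")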
Regards

Related

How to build a price comparison program that scrapes the prices of a product across several websites

I am trying to build a price comparison program for personal use (and for practice) that allows me to compare prices of the same item across different websites. I have just started using the Scrapy library and played around by scraping websites. These are my steps whenever I scrape a new website:
1) Find the website's search URL, understand its pattern, and store it. For instance, Target's search URL is composed of a fixed part, url="https://www.target.com/s?searchTerm=", plus the URL-encoded search terms.
2) Once I know the website's search URL, I send a SplashRequest using the Splash library. I do this because many pages are heavily loaded with JS.
3) Look up the HTML structure of the results page and determine the correct XPath expression to parse the prices. However, many websites render their results pages in different formats depending on the search terms or product category, which changes the page's HTML. Therefore, I have to examine all the possible results-page formats and come up with an XPath that accounts for all of them.
I find this process very inefficient, slow, and inaccurate. For instance, at step 3, even though I have the correct XPath, I am still unable to scrape all the prices on the page (sometimes I also get prices of items that are not present in the rendered HTML page), which I don't understand. Also, I don't know whether the websites can tell that my requests come from a bot and are perhaps sending me faulty or incorrect HTML. Moreover, this process cannot be automated: for example, I have to repeat steps 1 and 2 for every new website. Therefore, I was wondering if there is a more efficient process, library, or approach that I could use to finish this program. I have also heard something about using a website's API, although I don't quite understand how that works. This is my first time scraping and I don't know much about web technologies, so any help/advice is highly appreciated!
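For reference, a stripped-down sketch of those three steps with Scrapy and scrapy-splash; the search-URL pattern and the XPath are illustrative placeholders (and they are exactly the parts I have to redo per site):

import scrapy
from scrapy_splash import SplashRequest

class PriceSpider(scrapy.Spider):
    name = "prices"
    search_url = "https://www.target.com/s?searchTerm={}"   # step 1: stored URL pattern

    def __init__(self, query="laptop", *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.query = query

    def start_requests(self):
        # step 2: render the JS-heavy results page through Splash
        yield SplashRequest(
            self.search_url.format(self.query),
            callback=self.parse,
            args={"wait": 2},   # give the page time to render
        )

    def parse(self, response):
        # step 3: site-specific XPath for the price nodes (placeholder selector)
        for price in response.xpath('//span[@data-test="current-price"]/text()').getall():
            yield {"site": "target", "query": self.query, "price": price.strip()}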
The most common problem with crawling is that, in general, everything to be scraped is determined purely syntactically, whereas conceptualizing the entities you are working with helps a lot; I am speaking from my own experience.
In a research project about scraping that I was involved in, we reached the conclusion that we needed a semantic tree. This tree contains nodes that represent the data that matters for your purpose, and a parent-child relation means that the parent encapsulates the child in the HTML, XML, or other hierarchical structure.
You will therefore need some concept of how you want to represent the semantic tree and how it will be mapped onto site structures. If your search method allows a logical OR, then you will be able to define the same semantic tree for multiple online sources.
On the other hand, if the owners of some sites are willing to let you scrape their data, you might ask them to define the semantic tree themselves.
If a given website's structure changes, then with a semantic tree you will more often than not be able to adapt by changing only the selectors of a few elements, as long as the tree's node structure stays the same. And if some site owners are partners who allow scraping, you can simply download their semantic trees.
If a website provides an API, then you can use that instead; read about REST APIs to do so. However, these APIs are probably not uniform across sites.
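To make the idea concrete, here is a hedged sketch in Python of what such a tree could look like; the site names and selectors below are hypothetical, and the point is that when a site's markup changes only the selector strings change, while the node structure stays the same:

# One conceptual structure (product -> name, price), mapped per site
# onto concrete, site-specific selectors. All selectors are hypothetical.
SEMANTIC_TREE = {
    "product": {                     # parent node: one result item
        "target.com": '//li[@data-test="product-card"]',
        "examplestore.com": '//div[contains(@class, "product-tile")]',
        "children": {
            "name": {                # child node: encapsulated by "product"
                "target.com": './/a[@data-test="product-title"]/text()',
                "examplestore.com": './/h3/text()',
            },
            "price": {
                "target.com": './/span[@data-test="current-price"]/text()',
                "examplestore.com": './/span[@class="price"]/text()',
            },
        },
    },
}

def extract_products(response, site):
    """Walk the tree: select each product node, then its children within it."""
    spec = SEMANTIC_TREE["product"]
    for node in response.xpath(spec[site]):
        yield {
            field: node.xpath(paths[site]).get()
            for field, paths in spec["children"].items()
        }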

Google Places API: Less results if filtering by more types (and vice versa)

My applications shows nearby places by using the Google Places Web API.
The user has control over the place types searched in the requests.
Multiple types are concatenated with the pipe symbol |. I use &rankby=distance, because prominence does not matter for the app.
I have noticed that requesting nearby places with "a lot" of types returns fewer results than filtering by a single type.
Example
The following request returns 10 results near where I live:
&types=airport|bank|bar|bicycle_store|book_store|bus_station|casino|cafe|city_hall|clothing_store|food|furniture_store|grocery_or_supermarket|gym|hardware_store|library|liquor_store|movie_theater|museum|night_club|park|place_of_worship|police|post_office|restaurant|school|shoe_store|shopping_mall|spa|stadium|store|subway_station|train_station|university|zoo
This one returns 20 results and a next_page_token (so at least 20 results):
&types=store
I happen to live across from a shopping mall, so I know for sure that there are more than 20 stores nearby. The first query contains store as a filter, too.
Questions
I would like to always show as many results as possible. Has anybody experienced the same issue? Is there any document that I did not see, anything on this topic?
I'm a bit lost, since I don't know where to start looking or how to approach this problem.
Be aware that searching for more than one type at a time is deprecated. See http://googlegeodevelopers.blogspot.com.au/2016/02/changes-and-quality-improvements-in_16.html:
Beginning Feb 16, 2016, we are replacing the types restriction parameter with a new type search parameter. If you have been using the types parameter for Nearby Search, Text Search or Radar Search you will be affected.
Type search works similarly to types restriction, but it only supports one type per request.
Requests using the types parameter and those specifying multiple types (for example, types=hospital|pharmacy|doctor) will continue to return results until Feb 16, 2017, but we do not recommend using multiple types in a search request. After that date, requests with multiple types will no longer be supported. To ensure the best possible search results for your users, we recommend using a single type in search requests.
Thanks to @AndrewR's comment, I stumbled upon a comment on an issue that states the following:
[Google Places API does not] support specifying more than 20 types at a time
– Comment on bug report from Sep 1, 2014
which solves my problem in a few words. I wish the docs had stated this.
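Given that, the practical workaround is to issue one Nearby Search request per type and merge the results yourself. Here is a rough sketch (Python just for illustration; the API key, location, and type list are placeholders):

import requests

API_KEY = "YOUR_KEY"                 # placeholder
LOCATION = "48.2082,16.3738"         # placeholder lat,lng
TYPES = ["store", "restaurant", "cafe", "bank"]   # one request per type

def nearby(place_type):
    resp = requests.get(
        "https://maps.googleapis.com/maps/api/place/nearbysearch/json",
        params={
            "location": LOCATION,
            "rankby": "distance",
            "type": place_type,      # a single type per request
            "key": API_KEY,
        },
    )
    return resp.json().get("results", [])

seen = {}
for t in TYPES:
    for place in nearby(t):
        seen.setdefault(place["place_id"], place)   # de-duplicate across types

print(len(seen), "distinct places")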

Most performant live search technique for mobile safari

I am building a mobile web application that targets webkit. I have a requirement to perform a live search (on keypress) against a database of ~5000 users.
I've tried a number of different techniques:
On page load, making an AJAX call which loads an in-memory representation of all 5000 users, and querying them on the client. I tried sending JSON, which proved to be too large, and also a custom delimited string, which was then parsed using split(). This was better, but ultimately searches against this array of users were slow.
I tried using a conventional AJAX call, which would return users based on a query, also using the custom delimited string technique. This was better, but I was forced to tune it so that searches were only performed with a minimum of 3 characters. This is not optimal, as I would like to be able to start filtering after 1 character. I could also throttle the calls so that not every keystroke within a certain threshold triggered a request. This could help with performance, but I'd rather not have to fiddle with that sort of thing.
Facebook mobile does this very well if you try their friend search. Searches happen instantaneously, and are triggered after 1 character.
My question is, does anyone have any suggestions for faster live searches for a mobile app? Should I be looking at localStorage? Is this reliable, feasible?
Is there any reason you can't use a binary search? The names you're looking for will sit in one contiguous block of the sorted array. If you want both first- and last-name search, you could create a second copy of the data sorted by last name and look in both sets.
Some helpful but more complicated data structures that address this type of problem include:
http://en.wikipedia.org/wiki/Directed_acyclic_word_graph
http://en.wikipedia.org/wiki/Trie
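A quick sketch of the binary-search idea (shown in Python for brevity; the same approach works on a sorted JavaScript array on the client):

import bisect

names = sorted(["alan", "albert", "alfred", "bob", "miathe"])  # ~5000 entries in practice

def prefix_matches(prefix):
    lo = bisect.bisect_left(names, prefix)
    # Everything matching the prefix sits in a contiguous block starting at lo.
    hi = bisect.bisect_right(names, prefix + "\uffff")
    return names[lo:hi]

print(prefix_matches("al"))   # ['alan', 'albert', 'alfred']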

efficient serverside autocomplete

First of all, I know:
Premature optimization is the root of all evil
But I think a badly implemented autocomplete can really blow up your site.
I would like to know if there are any libraries out there which can do autocomplete efficiently (server-side), preferably fitting into RAM for best performance. So no browser-side JavaScript autocomplete (YUI/jQuery/Dojo); I think there are enough topics about that on Stack Overflow. But I could not find a good thread about the server-side part (maybe I did not look hard enough).
For example, autocompleting names:
names:[alfred, miathe, .., ..]
What I can think of:
A simple SQL LIKE, for example: SELECT name FROM users WHERE name LIKE 'al%'.
I think this implementation will blow up with a lot of simultaneous users or a large data set, but maybe I am wrong, so numbers (of what it could handle) would be cool.
Using something like Solr's terms component, for example: http://localhost:8983/solr/terms?terms.fl=name&terms.sort=index&terms.prefix=al&wt=json&omitHeader=true.
I don't know the performance of this, so users with big sites, please tell me.
Maybe something like an in-memory Redis trie, which I also haven't tested the performance of (a rough sketch of the trie idea is below).
I also read in this thread about how to implement this in Java (Lucene and a library created by shilad).
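Here is the rough sketch of the trie idea I mean (plain Python, not the Redis-backed variant, and completely untuned):

class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None              # set when a complete name ends here

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def complete(self, prefix, limit=10):
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        # depth-first walk below the prefix node, stopping at the limit
        out, stack = [], [node]
        while stack and len(out) < limit:
            n = stack.pop()
            if n.word:
                out.append(n.word)
            stack.extend(n.children.values())
        return out

trie = Trie(["alfred", "alan", "albert", "miathe"])
print(trie.complete("al"))    # ['alan', 'albert', 'alfred'] in some order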
What I would like to hear are implementations used by real sites and numbers for how well they handle load, preferably with:
A link to the implementation or code.
Numbers on how far you know it can scale.
It would be nice if it could be accessed over HTTP or sockets.
Many thanks,
Alfred
Optimising for Auto-complete
Unfortunately, the resolution of this issue will depend heavily on the data you are hoping to query.
LIKE queries will not put too much strain on your database, as long as you spend time using 'EXPLAIN' or the profiler to show you how the query optimiser plans to perform your query.
Some basics to keep in mind:
Indexes: Ensure that you have indexes set up. (Yes, in many cases LIKE does use indexes. There is an excellent article on the topic at myitforum: SQL Performance - Indexes and the LIKE clause.)
Joins: Ensure your JOINs are in place and optimized by the query planner. SQL Server Profiler can help with this. Look out for full index or full table scans.
Auto-complete sub-sets
Auto-complete queries are a special case, in that they usually work as ever-decreasing sub-sets.
'name' LIKE 'a%' (may return 10000 records)
'name' LIKE 'al%' (may return 500 records)
'name' LIKE 'ala%' (may return 75 records)
'name' LIKE 'alan%' (may return 20 records)
If you return the entire result set for query 1, then there is no need to hit the database again for the following result sets, as they are sub-sets of your original query.
Depending on your data, this may open a further opportunity for optimisation.
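In code, the sub-set idea looks roughly like this (a sketch only; the in-memory SQLite table stands in for your real users table, and it assumes the first result set fits in memory):

import sqlite3

conn = sqlite3.connect(":memory:")               # stand-in for the real database
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("alfred",), ("alan",), ("albert",), ("miathe",)])

class PrefixCache:
    def __init__(self):
        self.prefix = None     # prefix the cached rows were fetched for
        self.rows = []

    def suggest(self, term):
        if not term:
            return []
        if self.prefix is None or not term.startswith(self.prefix):
            # New first character: one LIKE query against the database.
            self.prefix = term[0]
            self.rows = [r[0] for r in conn.execute(
                "SELECT name FROM users WHERE name LIKE ?", (self.prefix + "%",))]
        # Longer terms are a sub-set of what we already fetched: filter in memory.
        return [name for name in self.rows if name.startswith(term)]

cache = PrefixCache()
print(cache.suggest("a"))     # one SQL query
print(cache.suggest("al"))    # no query, filtered from the cached rows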
I can't fully comply with your requirements, and obviously the scale numbers will depend on hardware, the size of the DB, the architecture of the app, and several other factors. You must test it yourself.
But I will tell you the method I've used with success:
Use a simple SQL LIKE, for example: SELECT name FROM users WHERE name LIKE 'al%', but use TOP 100 to limit the number of results.
Cache the results and maintain a list of the terms that are cached.
When a new request comes in, first check the list to see whether you have the term (or part of the term) cached.
Keep in mind that your cached results are limited, so you may still need to run a SQL query if the cached result set was truncated (that is, if the last cached result still matches the term, there may be more matches that the TOP 100 cut off).
Hope it helps.
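A rough sketch of that method (SQLite's LIMIT standing in for TOP 100; the table and data are placeholders):

import sqlite3

conn = sqlite3.connect(":memory:")               # stand-in for the real database
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)",
                 [("alfred",), ("alan",), ("albert",), ("miathe",)])

LIMIT = 100
cache = {}                     # term -> (rows, truncated?)

def query_db(term):
    rows = [r[0] for r in conn.execute(
        "SELECT name FROM users WHERE name LIKE ? ORDER BY name LIMIT ?",
        (term + "%", LIMIT))]
    return rows, len(rows) == LIMIT      # truncated if we hit the limit

def suggest(term):
    # Reuse the cache of the longest shorter prefix, unless that result was truncated.
    for n in range(len(term), 0, -1):
        prefix = term[:n]
        if prefix in cache:
            rows, truncated = cache[prefix]
            if not truncated:
                return [r for r in rows if r.startswith(term)]
            break              # truncated: the cache may be missing matches, re-query
    cache[term] = query_db(term)
    return cache[term][0]

print(suggest("al"))    # goes to the database, result cached
print(suggest("alf"))   # served from the cached "al" rows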
Using SQL versus Solr's terms component is really not a comparison. At their core they solve the problem the same way by making an index and then making simple calls to it.
What I would want to know is: what are you trying to auto-complete?
Ultimately, the easiest and most surefire way to scale a system is to make a simple solution and then just scale it by replicating data. Trying to cache calls or predict results just makes things complicated, and doesn't get to the root of the problem (i.e. you can only take those so far, for example when each request misses the cache).
Perhaps a little more info about how your data is structured and how you want to see it extracted would be helpful.

Fast Text Search Over Logs

Here's the problem I'm having, I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow up to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.
The problem is when I want to search these files for a certain string. Right now, a Boyer-Moore search is unfeasibly slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not really sure how to implement that without taking up twice the space a log already takes up.
Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.
Edit:
Grep won't work as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including any external program into it.
The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.
Edit 2:
Some of these suggestions are great, but I have to reiterate that I can't integrate another application; it's part of the contract. But to answer some questions: the data in the logs varies from received messages in a health-care-specific format to messages relating to these. I'm looking to rely on an index because, while it may take up to a minute to rebuild, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before it is even recorded: unless some debug logging options are turned on, more than half of the log messages are ignored.
The search basically goes like this: a user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for AJAX). Usually they'll want to search for messages with some information in them, maybe a patient ID or some string they've sent, so they enter that string into the search. The search is sent asynchronously, and the custom web server linearly searches through the logs 1 MB at a time for results. This process can take a very long time when the logs get big, and it's what I'm trying to optimize.
grep usually works pretty well for me with big logs (sometimes 12G+). You can find a version for windows here as well.
Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.
If you can identify the "words" in the text you want to index, just build a large hash table of the words which maps a hash of the word to its occurrences in each file. If users repeat the same search frequently, cache the search results. When a search is done, you can then check each location to confirm the search term falls there, rather than just a word with a matching hash.
Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?
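A bare-bones sketch of that hash-table index (Python for brevity, even though the real backend is C++; the sample log content below is made up, and the tokenisation is deliberately naive):

import re
from collections import defaultdict

index = defaultdict(list)          # hash(word) -> [(path, line_start_offset), ...]

def index_file(path):
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            for word in set(re.findall(rb"\w+", line.lower())):
                index[hash(word)].append((path, offset))
            offset += len(line)

def search(term):
    term = term.lower().encode()
    hits = []
    for path, offset in index.get(hash(term), []):
        with open(path, "rb") as f:
            f.seek(offset)
            line = f.readline()
        # Confirm the term really occurs here; a matching hash alone is not enough.
        if term in line.lower():
            hits.append((path, offset, line))
    return hits

with open("sample.log", "wb") as f:          # tiny made-up stand-in log
    f.write(b"2024-01-01 received message patient123 OK\n"
            b"2024-01-01 heartbeat\n")
index_file("sample.log")
print(search("patient123"))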
You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there, Lucene seems to be very popular. Check these two questions for some more suggestions:
Best text search engine for integrating with custom web app?
How do I implement Search Functionality in a website?
More details on the kind of search you're performing could definitely help. Why, in particular do you want to rely on an index, since you'll have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever even recorded?
How long are these searches taking now?
You may want to check out the source for BSD grep. You may not be able to rely on grep being there for you, but nothing says you can't recreate similar functionality, right?
Splunk is great for searching through lots of logs. May be overkill for your purpose. You pay according to the amount of data (size of the logs) you want to process. I'm pretty sure they have an API so you don't have to use their front-end if you don't want to.
