GSA limit reached with "noindex" - google-search-appliance

Recently the GSA I manage reached its limit of indexed URLs, and from what I can see the number of URLs with actual content is very low compared to the number of page listings (mostly listings by date, which are not content themselves and only show results for users to navigate).
I have already added the robots meta tag with the "noindex" attribute, and many of the URLs now show as "Excluded".
So I assume those documents are not being counted towards the licensed total, but without them my crawled URLs cannot possibly reach the 500K limit.
My other guess is that having multiple collections makes documents count towards the total more than once, since some documents are duplicated across a couple of collections.
Has somebody else faced a similar problem?

Are you receiving a warning that you have exceeded your index? There is a limit to how many URLs beyond your license the GSA will crawl, but you should be able to have about 1M docs under your license (between CRAWLED/ERRORS/EXCLUDED). Only 500K can be in the "Crawled URLs".

Related

Elastic Search Version 7.17 Java Rest API returns incorrect totalElements and total pages using queryBuilder

We are currently upgrading our system from ElasticSearch 6.8.8 to ElasticSearch 7.17. When we run pageable queries using the Java Rest API, the results are incorrect.
For example, in version 6.8.8, if we query for data and request page 2 with a page size of 10, the query returns the 10 items on page 2 and gives us a totalElements of 10,000 records, which is correct. When we run the same exact query on version 7.17, it returns 10 items on page 2 but only gives us a totalElements of 10 instead of the correct number. We need the correct number so that our gridview handles paging correctly. Is there a setting I am missing in Elasticsearch version 7.17?
Elasticsearch introduced the track_total_hits option for all searches in ES 7.x.
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10,000 hits", the default is set to 10,000. This means that requests will count the total hits accurately up to 10,000. It is a good trade-off to speed up searches if you don’t need the accurate number of hits past a certain threshold.
So to force ES to count all matching documents, you should set track_total_hits to true. For more information, you can check the official ES documentation page here.
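For the Java high-level REST client mentioned in the question, the flag can be set on the SearchSourceBuilder. Here is a minimal sketch, assuming an already-built RestHighLevelClient and a hypothetical index named "products":

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import java.io.IOException;

    public class ExactTotalExample {
        // Returns the exact total hit count for page 2 (size 10) of a hypothetical "products" index.
        static long exactTotal(RestHighLevelClient client) throws IOException {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .query(QueryBuilders.matchAllQuery()) // substitute the real query here
                    .from(10)                             // offset for page 2 with a page size of 10
                    .size(10)
                    .trackTotalHits(true);                // count all hits instead of capping at 10,000

            SearchRequest request = new SearchRequest("products").source(source);
            SearchResponse response = client.search(request, RequestOptions.DEFAULT);

            // With trackTotalHits(true) this is the exact count, not a lower bound.
            return response.getHits().getTotalHits().value;
        }
    }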

For Google Places API, does an Empty result count against your quota?

I read through the pricing pages on Google but cannot find whether empty results count against your quota. I have a large dataset (100k records) but know some of the results will be empty due to businesses closing. I want to know if that will affect my total expected quota and therefore the price.
Thanks for the comments. I have contacted Google and confirmed that empty results do count against the quota. I have suggested that they add this to their online documentation, but I am adding an answer here for anyone else looking for the correct answer until it is available.

Stormcrawler / Elasticsearch and keeping track of inbound links to a page

When we search the results of the StormCrawler crawl in the Elasticsearch index, people inevitably compare the results to Google, and our results compare unfavorably to a Google search on the same topic. One of the ways Google determines the rank of various pages is by tracking the inbound links to any given page.
In thinking about the search results on our page, and looking through the status index, I came across the field url.path. url.path appears to contain the entire path that led to the current page.
Would it be possible to create a multivalued field in the index that gets populated with just the last URL from whatever bolt/function generates url.path? That way the field would end up being an array of all pages that link directly to the current document.
With that info, you could potentially count the values and get an idea of the relative popularity of the current doc by all of the pages linking to it.
Is something like that possible with Stormcrawler?
This would be possible with some modifications of the code. By default, we keep the info about a discovered URL, including the path that led to it, only for the first instance of that URL being discovered. There could be various ways of implementing this, for instance with a custom bolt accumulating the inlinks into Redis or a Graph DB.
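To make the shape of that idea concrete, here is a rough sketch of such a bolt using Apache Storm and the Jedis Redis client. The tuple field names ("url", "source.url") and the Redis key scheme are assumptions for illustration only, not StormCrawler's actual stream layout:

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;
    import redis.clients.jedis.Jedis;
    import java.util.Map;

    // Hypothetical bolt that records, for each discovered URL, the page it was
    // discovered from, by adding the source URL to a Redis set keyed on the target URL.
    public class InlinkRecorderBolt extends BaseRichBolt {
        private OutputCollector collector;
        private transient Jedis jedis;

        @Override
        public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.jedis = new Jedis("localhost", 6379); // assumed Redis location
        }

        @Override
        public void execute(Tuple input) {
            String targetUrl = input.getStringByField("url");
            String sourceUrl = input.getStringByField("source.url"); // hypothetical field name
            if (sourceUrl != null) {
                // One set per target URL; its cardinality approximates the inlink count.
                jedis.sadd("inlinks:" + targetUrl, sourceUrl);
            }
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt in this sketch: nothing is emitted downstream.
        }
    }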
Your underlying question is really about relevance tuning with Elasticsearch. This depends, of course, on what fields the crawler sends, but not only on that. I know of some StormCrawler users who used it with ES as a replacement for the Google Search Appliance with great success. Info about inlinks could help, but you should be able to get decent results without it.

How to limit results in elasticsearch by term

I am querying Elasticsearch to get a list of 50 matching websites. Ideally, I'd like to get 50 relevant matches from 50 different websites.
My problem is that sometimes the search results contain mostly matches from one website.
Is it possible to create a query that returns 50 results that are all unique with regard to the field that stores the website's name?
The simplest and best way to do this going forward is to use the top hits aggregator:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-metrics-top-hits-aggregation.html
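As a rough sketch of what that looks like with a recent Java client, assuming the website name is stored in a keyword field called "site", the free-text query goes against a "content" field, and the index is named "pages": bucket the results per site and keep the single best hit from each bucket.

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.client.RequestOptions;
    import org.elasticsearch.client.RestHighLevelClient;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;
    import java.io.IOException;

    public class OnePerSiteExample {
        // One relevant hit from each of up to 50 different sites.
        static SearchResponse onePerSite(RestHighLevelClient client, String text) throws IOException {
            SearchSourceBuilder source = new SearchSourceBuilder()
                    .size(0) // only the aggregation buckets are needed, not the flat hit list
                    .query(QueryBuilders.matchQuery("content", text))
                    .aggregation(
                            AggregationBuilders.terms("per_site")
                                    .field("site")   // assumed keyword field holding the website name
                                    .size(50)        // up to 50 distinct sites
                                    .subAggregation(
                                            AggregationBuilders.topHits("best_hit").size(1)));

            return client.search(new SearchRequest("pages").source(source), RequestOptions.DEFAULT);
        }
    }

Note that by default the per-site buckets are ordered by document count; if you want the most relevant sites first, add an ordering on a max-score sub-aggregation.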
The approach I used before the top hits aggregator was definitely a hack, but it worked surprisingly well: query for 10x the number of docs you need and collapse them down client-side, preferably using the node client to reduce the network overhead of the likely larger payload.

SolrCloud: workaround for classic pagination with "start,rows" parameters

I have SolrCloud with 3 shards.
My purpose: select and process all products from a category.
Current implementation: selecting one portion at a time in a loop.
1st iteration: q=cat:1&start=0&rows=100
2nd iteration: q=cat:1&start=100&rows=100
3rd iteration: q=cat:1&start=200&rows=100
...
But as "start" grows, performance goes down. The explanation is here: https://wiki.apache.org/solr/DistributedSearch
Makes it more inefficient to use a high "start" parameter. For example, if you request start=500000&rows=25 on an index with 500,000+ docs per shard, this will currently result in 500,000 records getting sent over the network from the shard to the coordinating Solr instance. If you had a single-shard index, in contrast, only 25 records would ever get sent over the network. (Granted, setting start this high is not something many people need to do.)
Any ideas on how I can walk over all the records in a category?
There is another, more efficient way to do pagination in Solr - cursors - which track the current place in the sort order instead. This is particularly useful for deep pagination.
See the section about cursors on the Pagination of Results wiki page. This should speed up delivery, as the server can sort its local documents, decide where it is in that sequence, and return the 25 documents after that point.
UPDATE: Another useful link: coming-soon-to-solr-efficient-cursor-based-iteration-of-large-result-sets
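Here is a sketch of cursor-based iteration with SolrJ, assuming a collection named "products" whose uniqueKey field is "id" (the sort must include the uniqueKey for cursors to work) and the category query from the question:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CategoryWalker {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/products").build();

            SolrQuery q = new SolrQuery("cat:1");
            q.setRows(100);
            q.setSort(SolrQuery.SortClause.asc("id")); // cursors require a sort on the uniqueKey

            String cursorMark = CursorMarkParams.CURSOR_MARK_START; // "*"
            boolean done = false;
            while (!done) {
                q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                QueryResponse rsp = solr.query(q);
                for (SolrDocument doc : rsp.getResults()) {
                    // process each product here
                }
                String nextCursorMark = rsp.getNextCursorMark();
                // When the cursor stops moving, every matching document has been seen.
                done = cursorMark.equals(nextCursorMark);
                cursorMark = nextCursorMark;
            }
            solr.close();
        }
    }

Unlike a growing "start" offset, each request only fetches the next 100 documents after the cursor position, so the cost per page stays roughly constant.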
I think the short answer is "no" - it's a limitation of how Solr does sharding. Instead, can you amass a list of document unique keys outside of Solr - presumably from a backing database - and then retrieve from the index using sets of those keys?
e.g. ID:(1 OR 2 OR 3 OR ...very long list...)
Or, if the unique keys are numeric you could use a moving range instead:
ID:[1 TO 1000] then ID:[1001 TO 2000] and so forth.
In both options above you would also restrict by category. Both should avoid the slowdown associated with deep windowing, however.
