GSA: How to prioritize HTML content with specific metadata - google-search-appliance

I want the GSA to prioritize content that has some specific metadata in the results. Is that possible?
Thanks!

Yes, it's possible. Metadata is one of the options for biasing results.
The admin console help tells you how to do it:
7.4 Admin Console Biasing Help
It's worth pointing out that biasing is more art than science. You may need to experiment a bit.
You should also make sure that you enable Advanced Search Reporting (ASR) so that the self-learning scorer is enabled. This is where the GSA adapts to user behavior, learning to deliver better results over time: links that get more clicks automatically move up in the search ranking.
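Biasing itself is configured entirely in the Admin Console rather than in code, but as a related illustration of how metadata can shape results at query time, here is a minimal sketch using the GSA search protocol's requiredfields parameter, which filters (rather than re-ranks) on indexed metadata. The host, collection, and frontend names are hypothetical.

```python
# Minimal sketch: querying a GSA over its search protocol, restricted to
# documents carrying a specific meta tag. Note this filters rather than
# biases; biasing itself is configured in the Admin Console.
import urllib.parse
import urllib.request

GSA_HOST = "http://gsa.example.com"  # hypothetical appliance hostname

params = {
    "q": "quarterly report",
    "site": "default_collection",   # hypothetical collection name
    "client": "default_frontend",   # hypothetical frontend name
    "output": "xml_no_dtd",
    # Only return documents whose indexed meta tag department=finance.
    "requiredfields": "department:finance",
}
url = GSA_HOST + "/search?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as resp:
    print(resp.read()[:500])  # first bytes of the XML results
```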

Related

Saved Searches and Alerting in Coveo

I'm looking at Coveo's capability for a project, and the documentation isn't clear in the area of alerting (the high-level docs are marketing-led, and the low-level docs are mainly API-based without a lot of explanation). I see you can set up a 'subscription' to alert on, but it's not clear exactly what that subscription can be against.
Essentially I'm trying to find out if you can save a search query/criteria and then alert if a document matching that query/criteria is indexed; in effect, what you can do with Percolators in Elastic (if you are familiar with that).
Is this possible in Coveo?
Any pointers to precise and clear documentation that I'm missing would be appreciated.
I don't think this is a feature that Coveo provides, especially not for end users.
The only way to build something like you're asking would be to write your own code that queries Coveo on a schedule, say every day, with a specific query that normally returns no results, and checks whether any results come back. When results are returned, the system can then alert you.
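As a concrete illustration of that polling approach, here is a minimal sketch against Coveo's public Search API (the /rest/search/v2 endpoint); the API key, the query, and the alert action are placeholders you would replace.

```python
# Hedged sketch of the polling approach: run the saved query on a schedule
# and alert once it starts returning results.
import time
import requests

SEARCH_URL = "https://platform.cloud.coveo.com/rest/search/v2"
API_KEY = "xx-search-api-key"  # hypothetical search API key
SAVED_QUERY = 'outage @source=="Intranet"'  # hypothetical saved criteria

def matching_count() -> int:
    resp = requests.post(
        SEARCH_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"q": SAVED_QUERY, "numberOfResults": 1},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("totalCount", 0)

while True:
    if matching_count() > 0:
        print("Alert: a document now matches the saved query")  # e.g. send an email
        break
    time.sleep(24 * 60 * 60)  # poll once a day, as suggested above
```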
If you think Coveo should create that feature, please feel welcome to post it on the Coveo Ideas portal: https://connect.coveo.com/s/ideas

How to build a price comparison program that scrapes the prices of a product across several websites

I am trying to build a price comparison program for personal use (and for practice) that allows me to compare prices of the same item across different websites. I have just started using the Scrapy library and played around by scraping websites. These are my steps whenever I scrape a new website:
1) Find the website's search URL, understand its pattern, and store it. For instance, Target's search URL is composed of a fixed prefix, url="https://www.target.com/s?searchTerm=", plus the URL-encoded search terms.
2) Once I know the website's search URL, I send a SplashRequest using the Splash library. I do this because many pages are heavily loaded with JS.
3) Look up the HTML structure of the results page and determine the correct XPath expression to parse the prices. However, many websites present their results pages in different formats depending on the search terms or product category, which changes the page's HTML. I therefore have to examine all the possible result-page formats and come up with an XPath that accounts for all of them. (A minimal sketch of these three steps appears after this list.)
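For reference, here is roughly what steps 1-3 look like as a Scrapy spider with scrapy-splash. The Target search URL is the one from step 1; the price XPath is a placeholder that would need adjusting per site, and a local Splash instance with the scrapy-splash middleware configured in settings.py is assumed.

```python
# Hedged sketch of steps 1-3: build the search URL, render it through
# Splash (for JS-heavy pages), then extract prices with a per-site XPath.
import urllib.parse

import scrapy
from scrapy_splash import SplashRequest

class PriceSpider(scrapy.Spider):
    name = "prices"

    def start_requests(self):
        # Step 1: fixed search URL plus the URL-encoded search term.
        term = urllib.parse.quote_plus("coffee maker")
        url = "https://www.target.com/s?searchTerm=" + term
        # Step 2: render the JS-heavy page through Splash before parsing.
        yield SplashRequest(url, self.parse, args={"wait": 2})

    def parse(self, response):
        # Step 3: site-specific XPath; this selector is a placeholder.
        for price in response.xpath(
            '//span[@data-test="current-price"]/text()'
        ).getall():
            yield {"price": price.strip()}
```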
I find this process very inefficient, slow, and inaccurate. For instance, at step 3, even with the correct XPath, I am still unable to scrape all the prices on the page (sometimes I also get prices of items that are not present in the rendered HTML page), which I don't understand. I also don't know whether the websites detect that my requests come from a bot and perhaps send me faulty or incorrect HTML. Moreover, this process cannot be automated: I have to repeat steps 1 and 2 for every new website. I was therefore wondering if there is a more efficient process, library, or approach that could help me finish this program. I have also heard about using a website's API, although I don't quite understand how that works. This is my first time scraping and I don't know much about web technologies, so any help/advice is highly appreciated!
The most common problem with scrapers is that they define everything to be scraped purely syntactically, whereas conceptualizing the entities you are working with helps a lot; I am speaking from my own experience.
In a research project about scraping I was involved in, we reached the conclusion that we needed a semantic tree. The tree contains nodes that represent data important for your purpose, and a parent-child relation means that the parent encapsulates the child in the HTML, XML, or other hierarchical structure.
You will therefore need some concept of how you want to represent the semantic tree and how it maps onto site structures. If your search method allows a logical OR, you will be able to define the same semantic tree for multiple online sources.
On the other hand, if the owners of some sites are willing to let you scrape their data, you might ask them to define the semantic tree themselves.
If a given website's structure changes, then with a semantic tree you can usually accommodate the change by just updating the selectors of a few elements, as long as the tree's node structure stays the same. And if some owners are partners in allowing scraping, you can simply download their semantic trees.
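To make the semantic-tree idea concrete, here is one minimal way it could be represented; the node names and all selectors are hypothetical placeholders.

```python
# Hedged sketch of a semantic tree: one conceptual node structure, with a
# per-site selector mapping, so a site redesign usually means editing one
# selector string while the tree itself stays the same.
from dataclasses import dataclass, field

@dataclass
class SemanticNode:
    name: str                  # the concept, e.g. "product"
    selectors: dict[str, str]  # site -> CSS/XPath selector (placeholders)
    children: list["SemanticNode"] = field(default_factory=list)

price = SemanticNode("price", {
    "target.com": 'span[data-test="current-price"]',
    "walmart.com": "span.price-characteristic",
})
title = SemanticNode("title", {
    "target.com": 'a[data-test="product-title"]',
    "walmart.com": "a.product-title-link",
})
# Parent-child relation: "product" encapsulates "title" and "price" in the HTML.
product = SemanticNode("product", {
    "target.com": 'div[data-test="product-card"]',
    "walmart.com": "div.search-result-gridview-item",
}, children=[title, price])
```

If target.com changes its markup, you update a couple of selector strings; the product -> title/price structure, and everything built on it, is untouched.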
If a website provides an API, use that instead; read about REST APIs to learn how. Be aware, however, that these APIs will probably not be uniform across sites.
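Here is roughly what the API route looks like; the endpoint, parameters, and response shape are invented for illustration, since every retailer's API differs.

```python
# Hedged sketch: fetching prices from a hypothetical retailer REST API
# instead of scraping rendered HTML.
import requests

resp = requests.get(
    "https://api.example-retailer.com/v1/products",  # hypothetical endpoint
    params={"q": "coffee maker"},
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("items", []):  # hypothetical response shape
    print(item.get("name"), item.get("price"))
```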

How to correlate check box values in the scenario below?

In my script I have a scenario where a page contains multiple check boxes, for example 10. Users select check boxes as they need: for example, one user selects 4 check boxes and another selects 5, so the number varies per user.
How do I correlate those values?
Thank you.
From the website: "Please don’t share your solutions, ask for help, or help others. This is meant to be a challenge."
So you appear to be violating one of the primary rules of this website. I have looked at this challenge, and it's really good for gauging someone's knowledge.
However, to address the technology generally: reading your question, I get the sense you may be missing certain fundamental knowledge for this kind of work. Hopefully the fundamentals below will increase your general knowledge, and you can use that to address this specific question.
Definitions:
Correlation - you're taking data the SERVER sends to the browser, capturing it, and sending it back. Information present on web pages fits into this category.
Parameterization - you've got a set of values you'd like to put into web forms, usually things like names, addresses, etc. (A tool-agnostic sketch of both follows.)
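Performance tools each have their own syntax for these two ideas, but as a tool-agnostic illustration, here is the difference in plain Python with requests; the endpoints and field names are hypothetical.

```python
# Hedged, tool-agnostic illustration of correlation vs. parameterization.
import re
import requests

BASE = "http://test-app.example.com"  # hypothetical system under test

# Correlation: capture a value the SERVER generated (here, a token embedded
# in the page) and send it back on the next request.
page = requests.get(BASE + "/form").text
token = re.search(r'name="csrf_token" value="([^"]+)"', page).group(1)

# Parameterization: values WE supply that vary per virtual user/iteration,
# e.g. which check boxes this particular user ticks.
checked_boxes = ["opt_1", "opt_4", "opt_7"]  # could come from a data file

requests.post(BASE + "/submit", data={
    "csrf_token": token,        # correlated value
    "choices": checked_boxes,   # parameterized values
})
```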
Also understand exactly what happens when you perform certain actions in your browser. When you "click" a checkbox, does that actually send a message to a server? Usually it doesn't (though not always). So when you use phrases like 'click a checkbox', it tells me you may not appreciate that performance testing is server-focused, not browser-focused.
Performance testing isn't intuitive so you need to understand these concepts. If you dedicate time to understanding the concepts I've outlined above you'll have the knowledge to complete the challenge.
Good luck.
What is driving the variation on check boxes being checked? Is it the result of something that comes back from the server, from a previous request? Or is it somewhat random based on whatever the user wants to do at runtime?

Google crawling and indexing algorithms

I am looking for documents on how Google crawls and indexes content. I have read many "light" papers and articles on what you need to do to improve your ranking and make sure your content is properly indexed, but I am looking for more advanced technical documents on how Google crawls and indexes content.
The things I would like to know more about:
What elements Google looks for when it crawls: page content, URL format, keywords, description, etc.
How is the index updated?
Basically, I am trying to understand why some pages are indexed but not others, even when their formats are similar, and why only 10% of my site's pages appear when I search across the entire domain, even though my server logs show that Google crawled every single link.
The answers to both things are closely-guarded trade secrets, ostensibly to prevent gaming the system.
Also keep in mind that Google makes over 400 algorithmic changes per year, making it close to impossible for an outsider to be accurate and up-to-date. Short of working for Google, you're likely not going to find an in-depth and accurate answer.
However, Matt Cutts, head of the web spam team, frequently provides the most accurate insights into how Google handles content, both on his blog and on the GoogleWebmasterHelp YouTube channel. It's worth going through his content to get a much better understanding of Google's methodology.
For a technical view of how a web crawler works, I suggest taking a deep look at Apache Nutch (nutch.apache.org).
A typical web crawler has the following components: a fetcher, a parser, an indexer, and a searcher. To put it briefly, a crawler fetches all the URLs available on a website and creates segments where it stores up to 101 KB per page. The pages are parsed, but common words such as "and", "or", and "the" are not stored; the remaining words are analyzed, for example with Bayesian calculations, to produce a ranking.
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. This is mainly done by storing a list of occurrences of each search criterion in an inverted index, typically implemented with a hash table or binary tree.
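As a toy illustration of that inverted-index idea (the two documents are invented examples):

```python
# Minimal sketch of an inverted index: map each term to the documents (and
# positions) where it occurs, so queries become lookups instead of scans.
from collections import defaultdict

docs = {
    1: "google crawls and indexes web content",
    2: "an inverted index maps terms to documents",
}

index: dict[str, list[tuple[int, int]]] = defaultdict(list)
for doc_id, text in docs.items():
    for position, term in enumerate(text.split()):
        index[term].append((doc_id, position))

# Which documents mention "index"? A single dictionary lookup answers it.
print(index["index"])  # [(2, 2)] -> document 2, word position 2
```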
As Mark stated, Google's calculations are mainly trade secrets, but patents issued by Google could be a good start. PageRank (http://en.wikipedia.org/wiki/PageRank) mainly analyses backlinks and the importance that the websites pointing to your site have in people's preferences. In my experience it is important to offer an XML sitemap listing all the pages on your site; in that sitemap you can define a crawl frequency for each page. gsitecrawler.com/ is an interesting possibility.
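For intuition, here is the core PageRank iteration on a tiny invented link graph; the 0.85 damping factor is the value from the original paper.

```python
# Hedged sketch of PageRank: each page spreads its score across its
# outlinks, and the scores are iterated until they stabilize.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):  # power iteration
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share
    rank = new_rank

print(rank)  # "C" ends up with the most rank: every other page links to it
```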
Google Website Optimizer will let you see what Google is finding on your site. Logs are OK, but the robot probably runs into problems, and the best way to discover them is with Google's website optimizer, which displays the errors.
Finally, most of your concerns are things that SEO specialists live for. I suggest you check sites like seomoz.com and their tools; you will learn how to position your website better in organic search results.
Hope it helps! Sebastian.
"Yes" Google like fresh & unique content.
Use Google webmaster guideline "try this instead" H1 or H2 meta tag on your HTML programming under the head tag ....your keyword. Anchor have to must use your business related keywords in H1, H2, it can help your site search engine.
Also use for Rich snippets in this tag..!
It scans your web pages very precisely and sensitively. Factors such as whether your JavaScript is embedded or in a separate file, whether you use frames in your design, and whether you use heavy graphics can reduce your page's ranking. Keywords are obviously rank-affecting entities, and broken links also bring your website's ranking down.
Basically, you can refer to http://www.tutorialspoint.com/seo/ to go through all the important points about Google's crawler. This will take at most about 40 minutes.
For the data-processing side, see Google's paper "MapReduce: Simplified Data Processing on Large Clusters".
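As a toy illustration of the model that paper describes, here is a word count reduced to a single process; in the real system the map, shuffle, and reduce phases run distributed across a cluster.

```python
# Hedged sketch of MapReduce: map emits (word, 1) pairs, shuffle groups
# them by key, reduce sums each group. The documents are invented.
from collections import defaultdict

documents = ["google crawls the web", "the web is large"]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each key's values.
counts = {word: sum(values) for word, values in grouped.items()}
print(counts)  # {'google': 1, 'crawls': 1, 'the': 2, 'web': 2, 'is': 1, 'large': 1}
```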
I analysed the latest algorithm and found that Google now gives more importance to CONTENT rather than LINKS. So if your content is good enough, with the proper tags available, Google will automatically index it for you. I would suggest using all of H1 through H6, in a sensible manner.

What are good and bad ways of providing help for an application?

I'm in the process of developing various applications whose end users are both engineers and salespeople. Some of the operations and options may not be immediately obvious to all users. All applications are delivered with a PDF and a paper manual, but of course nobody reads them!
I would like to improve the usability of the applications by including dynamic, context-sensitive help. One option, à la MSDN, would be to have F1 call up a web page; however, internet access will not always be available, and even this will be too much effort for some users.
Another idea is to have descriptions pop up when an option is hovered over - like a tooltip.
I'm interested in other people's views on this and the best practices for this situation. Along a similar theme to the post What are common UI misconceptions and annoyances?, I'd like to start a discussion around these two points:
What would be the best way to go about it?
What help features in existing applications either delight or annoy you?
In my experience, nobody but programmers reads the help. So when you have both a technical and a non-technical target audience, you end up providing two ways of doing everything:
A Wizard with a few options.
A property editor with lots of options.
In either case, pictures are usually better than words for documentation. A screenshot or three with big green arrows and circles calling out what does what will go a lot further than an indexed, exhaustive help file.
In my experience it is very helpful to have a tooltip on each option that provides a little more definition and clarity. Additionally, you can improve usability by having the default screen contain a few common, simple options, with an advanced section that provides more control.
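To make the tooltip-per-option idea concrete, here is a minimal sketch using tkinter (chosen only to keep it self-contained; any GUI toolkit has an equivalent). The option and tip text are invented.

```python
# Hedged sketch: show a small descriptive tooltip when an option is hovered.
import tkinter as tk

class ToolTip:
    def __init__(self, widget: tk.Widget, text: str):
        self.widget, self.text, self.tip = widget, text, None
        widget.bind("<Enter>", self.show)
        widget.bind("<Leave>", self.hide)

    def show(self, _event):
        x = self.widget.winfo_rootx() + 20
        y = self.widget.winfo_rooty() + 20
        self.tip = tk.Toplevel(self.widget)
        self.tip.wm_overrideredirect(True)  # borderless popup
        self.tip.wm_geometry(f"+{x}+{y}")
        tk.Label(self.tip, text=self.text, background="lightyellow",
                 relief="solid", borderwidth=1).pack()

    def hide(self, _event):
        if self.tip:
            self.tip.destroy()
            self.tip = None

root = tk.Tk()
option = tk.Checkbutton(root, text="Enable turbo mode")  # invented option
option.pack(padx=40, pady=40)
ToolTip(option, "Processes jobs in parallel; uses more memory.")
root.mainloop()
```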
I'm currently working on a similar side-project. We have an existing product that's used by people as part of their day job. There is an inherent learning curve on the product, so users receive some degree of training and have people they can turn to for assistance. Even so, we know it needs more help and user documentation in general.
We are starting this help-enhancement project by running a quick survey of the end users (offering a prize draw as an incentive). We will also speak to the support staff who deal with the help requests. This will uncover the pain points and give us a clear idea of how to focus our time and resources.
Guidelines on when to use inline tips vs tool tips etc can be found in various style guides, e.g. here:
http://developers.sun.com/docs/web-app-guidelines/uispec4_0/11-help.htm
Bear in mind that it's probably a bad idea to just copy and paste the text from your existing manuals into contextual help tips; you're going to need to write completely new content. See if you can get some time from a technical writer or copywriter.
