Best way to get definitions out of Google? - google-api

I'm trying to build a simple feature where a user specifies a term and the program fetches a definition for it and returns it. The best definition system I know of is Google's "define" keyword in search queries: if you start the query with "define " or "define:" etc., it returns very accurate and sufficient definitions. However, I have no idea how to access this information programmatically.
Google's new Custom Search Engine API doesn't show definitions. The old one gives slightly better results, but it is deprecated and still doesn't show the same definitions I see when I Google the term in the browser.
Failing Google, I turned to Wikipedia, which has a huge API, but I still couldn't find a way to extract summaries comparable to Google's definitions.
So my question is, does anybody know how I can get this information out of Google via the API or any other means?
This is an older question asking the same thing, except the answers given there are no longer applicable, as Google Dictionary no longer exists.
Update: So I'm now going down the route of trying to scrape the definitions straight out of the page itself. The problem is that when I visit the page in the browser (Firefox), the definitions show up, but when I scrape the page using cheerio, they don't appear anywhere in it. I should mention that I'm scraping through nitrous.io, so the page is fetched from a different region and operating system than the one I view it with in the browser, so maybe it's region-related. Will look into it further.
Update 2.0: I think maybe the definitions are loaded asynchronously, and so I have no idea how to scrape them, because I've never really done scraping before and I'm just a newbie :(
Update 3.0: OK, so now I think it's not to do with the asynchronous loading but with the renderer of the page. When I load this in Firefox, the page looks like this:
However, when I load it in IE (8) it looks like this:
Anybody got some insight on this?

Finally got to the answer: I had to set the user agent when screen scraping. My resulting code for getting definitions via scraping:
var request = require('request'),
    cheerio = require('cheerio');

var searchTerm = 'test';

// Google only serves the full definition markup to recognised browsers,
// so a desktop browser User-Agent header is set on the request.
request({
  url: 'https://www.google.co.uk/search?q=define+' + searchTerm,
  headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
}, function (err, resp, body) {
  var $ = cheerio.load(body);
  var defineBlocks = $('.lr_dct_sf_sen');
  var numOfBlocks = (defineBlocks.length < 3) ? defineBlocks.length : 3;

  // Walk one sense block, printing the definition text and any usage example.
  function process(block) {
    for (var i = 0; i < block.children.length; i++) {
      var line = block.children[i];
      if ('style' in line.attribs) { // main definition text
        var exampleStr = '';
        for (var k = 0; k < line.children.length; k++) {
          exampleStr += line.children[k].children[0].data;
        }
        console.log(exampleStr);
      } else if ('class' in line.attribs) { // usage example
        console.log('"' + line.children[1].children[0].data + '"');
      }
      // anything else is nothing I want
    }
  }

  for (var i = 0; i < numOfBlocks; i++) {
    var block = defineBlocks[i].children[1].children[0]; // font-size:small level
    process(block);
  }
});
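For what it's worth, the earlier symptoms now make sense: without a recognised browser User-Agent, Google presumably serves the stripped-down page I was seeing in IE8, which contains none of the .lr_dct_sf_sen definition blocks for cheerio to find.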

Related

Google sheets IMPORTXML fails for ASX data

I am trying to extract the "Forward Dividend & Yield" value from https://finance.yahoo.com/ for multiple companies in different markets, into Google Sheets.
This is successful:
=IMPORTXML("https://finance.yahoo.com/quote/WBS", "//*[#id='quote-summary']/div[2]/table/tbody/tr[6]/td[2]")
But this fails with #N/A:
=IMPORTXML("https://finance.yahoo.com/quote/CBA.AX", "//*[#id='quote-summary']/div[2]/table/tbody/tr[6]/td[2]")
I cannot work out what needs to be different for ASX ticker codes. Why does CBA.AX cause a problem?
Huge thanks for any help.
When I tested the formula =IMPORTXML("https://finance.yahoo.com/quote/CBA.AX", "//*"), an error of "Resource at url not found" occurred. I thought that this might be the reason for your issue.
But, fortunately, when I tried to retrieve the HTML from the same URL using Google Apps Script, the HTML could be retrieved. So, in this answer, I would like to propose retrieving the value using a custom function created with Google Apps Script. The sample script is as follows.
Sample script:
Please copy and paste the following script into the script editor of the Google Spreadsheet and save it. Then, put the formula =SAMPLE("https://finance.yahoo.com/quote/CBA.AX") into a cell. By this, the value is retrieved.
function SAMPLE(url) {
  // Fetch the page HTML and capture the text between the ">" after
  // "DIVIDEND_AND_YIELD-value" and the next "<".
  const res = UrlFetchApp.fetch(url).getContentText().match(/DIVIDEND_AND_YIELD-value.+?>(.+?)</);
  return res && res.length > 1 ? res[1] : "No value";
}
Result:
When the above script is used, the following result is obtained.
Note:
When this script is used, you can also use =SAMPLE("https://finance.yahoo.com/quote/WBS").
In this case, if the HTML structure of the page is changed, this script might no longer work. I think that this situation is the same as with IMPORTXML and the XPath, so please be careful about this.
References:
Custom Functions in Google Sheets
Class UrlFetchApp
Another solution is to decode the JSON contained in the source of the web page. Of course you can't use IMPORTXML, since the page is built on the client side by JavaScript and not on the server side. This way you can access the data and get a lot of information:
var source = UrlFetchApp.fetch(url).getContentText()
var jsonString = source.match(/(?<=root.App.main = ).*(?=}}}})/g) + '}}}}'
i.e., for what you are looking for, you can use:
function trailingAnnualDividendRate() {
  var url = 'https://finance.yahoo.com/quote/CBA.AX';
  var source = UrlFetchApp.fetch(url).getContentText();
  // Grab the embedded root.App.main JSON and restore the tail trimmed by the lookahead.
  var jsonString = source.match(/(?<=root.App.main = ).*(?=}}}})/g) + '}}}}';
  var data = JSON.parse(jsonString);
  var dividendRate = data.context.dispatcher.stores.QuoteSummaryStore.summaryDetail.trailingAnnualDividendRate.raw;
  Logger.log(dividendRate);
}
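Note that the (?<=root.App.main = ) lookbehind in this regex is only supported on the V8 Apps Script runtime; the legacy Rhino runtime will reject the pattern, so make sure the project runs on V8.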

Calling checkCurrentDictionary() from addon crashes FF - why?

I'm trying to call the method checkCurrentDictionary() of nsIEditorSpellCheck from within an add-on. The relevant code I use is:
var editorSpellCheck = Cc["#mozilla.org/editor/editorspellchecker;1"].createInstance(Ci.nsIEditorSpellCheck);
editorSpellCheck.checkCurrentDictionary();
This immediately crashes Firefox. What is going wrong here?
So this probably has something to do with the fact that nsIEditorSpellCheck is not a scriptable interface.
Basically, a scriptable interface is one that can be used from JavaScript.
If you want to access the spell check service you can do something like:
let editor = editableElement.editor;
if (!editor) {
  let win = editableElement.ownerDocument.defaultView;
  editor = win.QueryInterface(Ci.nsIInterfaceRequestor)
              .getInterface(Ci.nsIWebNavigation)
              .QueryInterface(Ci.nsIInterfaceRequestor)
              .getInterface(Ci.nsIEditingSession)
              .getEditorForWindow(win);
}
if (!editor)
  throw new Error("Unable to find editor for element " + editableElement);
(The above is from http://dxr.mozilla.org/mozilla-central/source/editor/AsyncSpellCheckTestHelper.jsm, which is MPL-licensed.)
Then you can use InlineSpellChecker.jsm to do some crazy stuff.
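For example, a minimal sketch (the module path and members are assumptions based on the Firefox source of that era, not verified against your setup):

// Load the inline spell checker helper and attach it to the editor found above.
Components.utils.import("resource://gre/modules/InlineSpellChecker.jsm");
let isc = new InlineSpellChecker(editor);
if (isc.canSpellCheck)
  isc.enabled = true; // turn on inline spell checking for that editor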
I'm not sure what you want to do though, so perhaps you should ask that more specific question as a new question.

Windows Phone WebClient caching "issue"?

I am trying to call the same link but with different values. The issue is that the URL is correct and contains the new values, but when I download it (WebClient.DownloadStringTaskAsync), it gives me the previous call's result.
I have tried adding no-cache headers, attaching a random value to the call, and an IfModifiedSince header; however, it is still not working.
Any help will be much appreciated, because I have tried everything.
uri: + "&junk=" + Guid.NewGuid());
client.Headers["Cache-Control"] = "no-cache";
client.Headers[HttpRequestHeader.IfModifiedSince] = DateTime.UtcNow.ToString();
var accessdes = await client.DownloadStringTaskAsync(uri3);
So here my uri3 contains the latest values, but when I hover over accessdes, it contains the result as if I had made the old uri3 call with the previous set of data.
I saw one friend attaching a random GUID to the URL in order to prevent the OS from caching its content. For example:
If the URL were http://www.ms.com/getdatetime and the OS were caching it,
our solution was adding a GUID to create "sort of" a new URL. As an example, the modified URL would look like: http://www.ms.com/getdatetime?cachebuster=21EC2020-3AEA-4069-A2DD-08002B30309D
(See more about cache busters: http://www.adopsinsider.com/ad-ops-basics/what-is-a-cache-buster-and-how-does-it-work/ )
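A minimal sketch of that idea (the method and parameter names are illustrative, not from the question):

// Append a unique query value so every request is a distinct URL
// that the OS-level cache has never seen before.
async Task<string> DownloadFreshAsync(string baseUrl)
{
    var client = new WebClient();
    client.Headers["Cache-Control"] = "no-cache";
    var separator = baseUrl.Contains("?") ? "&" : "?";
    var uri = new Uri(baseUrl + separator + "cachebuster=" + Guid.NewGuid());
    return await client.DownloadStringTaskAsync(uri);
}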

Coded UI error: The following element is no longer available

I recorded some test cases with CUIT in VS2010. Everything worked fine the day before. But today, when I ran them again, all the tests failed with the warning: The following element is no longer available ... and I got the exception: Cannot perform "Click" on the hidden control. This is not true, because none of the controls are hidden. I tried on another machine, and they failed as well.
Does anyone know why this happens? Is it because of the web application or something else? Please help, thanks.
PS: So I tried to record a new test with the same controls that were said to be "hidden controls", and the new test worked!? I don't understand why.
EDIT
The warning "The following element blah blah ..." appears when I try to capture an element or a control while recording. This is the source code of the button whose control is said to be 'hidden':
public HtmlImage UIAbmeldenImage
{
    get
    {
        if ((this.mUIAbmeldenImage == null))
        {
            this.mUIAbmeldenImage = new HtmlImage(this);
            #region Search Criteria
            this.mUIAbmeldenImage.SearchProperties[HtmlImage.PropertyNames.Id] = null;
            this.mUIAbmeldenImage.SearchProperties[HtmlImage.PropertyNames.Name] = null;
            this.mUIAbmeldenImage.SearchProperties[HtmlImage.PropertyNames.Alt] = "abmelden";
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.AbsolutePath] = "/webakte-vnext/content/apps/Ordner/images/logOut.png";
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.Src] = "http://localhost/webakte-vnext/content/apps/Ordner/images/logOut.png";
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.LinkAbsolutePath] = "/webakte-vnext/e.consult.9999/webakte/logout/index";
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.Href] = "http://localhost/webakte-vnext/e.consult.9999/webakte/logout/index";
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.Class] = null;
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.ControlDefinition] = "alt=\"abmelden\" src=\"http://localhost/web";
            this.mUIAbmeldenImage.FilterProperties[HtmlImage.PropertyNames.TagInstance] = "1";
            this.mUIAbmeldenImage.WindowTitles.Add("Akte - Test Akte Coded UI VS2010");
            #endregion
        }
        return this.mUIAbmeldenImage;
    }
}
Although I am running Visual Studio 2012, I find it odd that we started experiencing the same problem on the same day. I cannot see any difference in the DOM for the Coded UI tests I have for my web page, but for some reason VS is saying the control is hidden, and it specifies the correct ID of the element it is looking for (I verified that the ID is still the same one). I even tried to re-record the action, because I assumed that something must have changed, but I get the same error.
Since this sounds like the same problem occurring at the same time, I am thinking this might be related to some automatic update. That's my best guess at the moment; I am going to look into it, and I will update my post if I figure anything out.
EDIT
I removed update KB2870699, which patches a vulnerability in IE, and this fixed the problems I was having with my tests. The update was installed on the 12th of September, so it fits. Hope this helps you. :)
https://connect.microsoft.com/VisualStudio/feedback/details/800953/security-update-kb2870699-for-ie-breaks-existing-coded-ui-tests#tabs
Official link to get around the problem:
http://blogs.msdn.com/b/visualstudioalm/archive/2013/09/17/coded-ui-mtm-issues-on-internet-explorer-with-kb2870699.aspx
The problem is more serious than that! In my case I can't even record new Coded UI tests. After I click any hyperlink on any web page of my application, the Coded UI Test Builder cannot record that click: "The following element is no longer available...".
Apparently removing the updates, as AdrianHHH said, does the trick!
Shut down VS2010, then launch it again using "Run as administrator".
There may be a field in the SearchProperties (or possibly the FilterProperties) that has a value set by the web site, or that represents some kind of window ID on your desktop. Another possibility is that the web page title changes from day to day or visit to visit. Different executions of the browser or different visits to the web page(s) create different values. Removing these values from the SearchProperties (or FilterProperties), or changing the check for the title from an equals to a contains for a constant part of the title, should fix the problem. Coded UI often searches for more values than the minimum set needed.
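For example, a hedged sketch of the contains idea (the "Akte" fragment is taken from the recorded title; adjust to whatever part is constant for your page):

// Search for the browser window by a stable fragment of its title instead of
// the exact, visit-specific string the recorder captured.
var window = new BrowserWindow();
window.SearchProperties.Add(UITestControl.PropertyNames.Name, "Akte",
    PropertyExpressionOperator.Contains);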
Compare the search properties etc for the same control in the two recorded tests.
Update based extra detail given in the comments:
I solved a similar problem as follows. I copied property code similar to that shown in your question into a method that called FindMatchingControls. I checked how many controls were returned, in my case up to 3. I examined various properties of the controls found by writing lots of text to a debug file. In my case I found that the Left and Top properties were negative for the unwanted, i.e. hidden, controls.
For your code, rather than just using the UIAbmeldenImage property, you might call the method below. Change an expression such as
HtmlImage im = UIMap.abc.def.UIAbmeldenImage;
to be
HtmlImage im = FindHtmlHyperLink(UIMap.abc.def);
Where the method is:
public HtmlImage FindHtmlHyperLink(HtmlDocument doc)
{
    HtmlImage myImage = new HtmlImage(doc);
    myImage.SearchProperties[HtmlImage.PropertyNames.Id] = null;
    myImage.SearchProperties[HtmlImage.PropertyNames.Name] = null;
    myImage.SearchProperties[HtmlImage.PropertyNames.Alt] = "abmelden";
    myImage.FilterProperties[HtmlImage.PropertyNames.AbsolutePath] = "/webakte-vnext/content/apps/Ordner/images/logOut.png";
    myImage.FilterProperties[HtmlImage.PropertyNames.Src] = "http://localhost/webakte-vnext/content/apps/Ordner/images/logOut.png";
    myImage.FilterProperties[HtmlImage.PropertyNames.LinkAbsolutePath] = "/webakte-vnext/e.consult.9999/webakte/logout/index";
    myImage.FilterProperties[HtmlImage.PropertyNames.Href] = "http://localhost/webakte-vnext/e.consult.9999/webakte/logout/index";
    myImage.FilterProperties[HtmlImage.PropertyNames.Class] = null;
    myImage.FilterProperties[HtmlImage.PropertyNames.ControlDefinition] = "alt=\"abmelden\" src=\"http://localhost/web";
    myImage.FilterProperties[HtmlImage.PropertyNames.TagInstance] = "1";
    myImage.WindowTitles.Add("Akte - Test Akte Coded UI VS2010");
    UITestControlCollection controls = myImage.FindMatchingControls();
    if (controls.Count > 1)
    {
        foreach (UITestControl con in controls)
        {
            if (con.Left < 0 || con.Top < 0)
            {
                // Not on display, ignore it.
            }
            else
            {
                // Select this one and break out of the loop.
                myImage = con as HtmlImage;
                break;
            }
        }
    }
    return myImage;
}
Note that the above code has not been compiled or tested; it should be taken as ideas, not as the final code.
I had the same problem on VS 2012. As a workaround, you can remove that step and re-record it. That usually works.
One of the biggest problems when analyzing Coded UI test failures is that the error stack trace indicates a line of code which might be completely unrelated to the actual cause of the failure.
I would suggest you enable HTML logging in your tests. This displays step-by-step details of how Coded UI tried to execute the tests, with screenshots of your application. It also highlights in red the control Coded UI is trying to search for or operate upon. This is very beneficial in troubleshooting the actual cause of test failures.
To enable tracing, you can just add the below configuration to your app.config file:
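A commonly used switch for this, as a sketch assuming VS 2012 era tooling (exact keys vary between versions):

<configuration>
  <system.diagnostics>
    <switches>
      <!-- EqtTraceLevel 4 = verbose; Coded UI then writes an HTML action log per test -->
      <add name="EqtTraceLevel" value="4" />
    </switches>
  </system.diagnostics>
</configuration>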

Google Spreadsheet API - returns remote 500 error

Has anyone battled 500 errors with the Google Spreadsheet API for Google domains?
I have copied the code in this post (2-legged OAuth): http://code.google.com/p/google-gdata/source/browse/trunk/clients/cs/samples/OAuth/Program.cs, substituted in my domain's API id and secret and my own credentials, and it works.
So it appears my domain setup is fine (at least for the contacts/calendar APIs).
However, swapping the code out for a new Spreadsheet service/query instead, it reverts to type: the remote server returned an internal server error (500).
var ssq = new SpreadsheetQuery();
ssq.Uri = new OAuthUri("https://spreadsheets.google.com/feeds/spreadsheets/private/full", "me", "mydomain.com");
ssq.OAuthRequestorId = "me@mydomain.com"; // can do this instead of using OAuthUri for queries
var feed = ssservice.Query(ssq); // boom 500
Console.WriteLine("ss:" + feed.Entries.Count);
I am befuddled.
I had to make sure to use the "correct" class:
not
//using SpreadsheetQuery = Google.GData.Spreadsheets.SpreadsheetQuery;
but
using SpreadsheetQuery = Google.GData.Documents.SpreadsheetQuery;
stinky-malinky
It seems you need the Google Docs API to query for spreadsheets, but the Spreadsheets API to query inside a spreadsheet; yet nowhere on the internet until now will you find this undeniably important titbit. Google sucks hard on that one.
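In code, the split described above looks roughly like this (a sketch; the alias names are illustrative):

// The Documents-list API is used to enumerate spreadsheets...
using SpreadsheetQuery = Google.GData.Documents.SpreadsheetQuery;
// ...while the Spreadsheets API reads inside one, e.g. with
// Google.GData.Spreadsheets.WorksheetQuery or CellQuery.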
