Building a reverse language dictionary - algorithm

I was wondering what it takes to build a reverse language dictionary.
The user enters something along the lines of: "red edible fruit" and the application would return: "tomatoes, strawberries, ..."
I assume these results should be based on some form of keyword matching, such as synonyms, or some form of string search.
This is an online implementation of this concept.
What's going on there and what is involved?
EDIT 1:
The question is more about the "how" than the "which tool"; however, feel free to suggest any tools you think can do the job.

OpenCyc is a computer-usable database of real-world concepts and meanings. From their web site:
OpenCyc is the open source version of the Cyc technology, the world's largest and most complete general knowledge base and commonsense reasoning engine. OpenCyc can be used as the basis of a wide variety of intelligent applications
Beware though, that it's an enormously complex reasoning engine -- real-world facts never were simple. Documentation is quite sparse and the learning curve is steep.

Any approach would basically involve having a normalized database. Here is a basic example of what your database structure might look like:
// terms
+----+--------------+
| id | name         |
+----+--------------+
| 1  | tomatoes     |
| 2  | strawberries |
| 3  | peaches      |
| 4  | plums        |
+----+--------------+
// descriptions
+----+--------+
| id | name   |
+----+--------+
| 1  | red    |
| 2  | edible |
| 3  | fruit  |
| 4  | purple |
| 5  | orange |
+----+--------+
// connections
+----------+-------------+
| terms_id | descript_id |
+----------+-------------+
| 1        | 1           |
| 1        | 2           |
| 1        | 3           |
| 2        | 1           |
| 2        | 2           |
| 2        | 3           |
| 3        | 1           |
| 3        | 2           |
| 3        | 5           |
| 4        | 1           |
| 4        | 2           |
| 4        | 4           |
+----------+-------------+
This is a fairly basic setup, but it should give you an idea of how many-to-many relationships work in a database via a look-up (junction) table.
Your application would have to break the input string apart and normalize it, for example by stripping suffixes (stemming). The script would then query the connections table and return the matching terms.
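The look-up itself can be sketched with an in-memory SQLite database mirroring the three tables above. This is a minimal illustration, not a production schema; the table names and data are taken straight from the example:

```python
import sqlite3

# Build the terms / descriptions / connections schema from the example above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE terms        (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE descriptions (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE connections  (terms_id INTEGER, descript_id INTEGER);
""")
conn.executemany("INSERT INTO terms VALUES (?, ?)",
                 [(1, "tomatoes"), (2, "strawberries"), (3, "peaches"), (4, "plums")])
conn.executemany("INSERT INTO descriptions VALUES (?, ?)",
                 [(1, "red"), (2, "edible"), (3, "fruit"), (4, "purple"), (5, "orange")])
conn.executemany("INSERT INTO connections VALUES (?, ?)",
                 [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3),
                  (3, 1), (3, 2), (3, 5), (4, 1), (4, 2), (4, 4)])

def lookup(words):
    """Return the terms linked to every one of the given description words."""
    placeholders = ",".join("?" * len(words))
    rows = conn.execute(f"""
        SELECT t.name FROM terms t
        JOIN connections  c ON c.terms_id = t.id
        JOIN descriptions d ON d.id = c.descript_id
        WHERE d.name IN ({placeholders})
        GROUP BY t.id
        HAVING COUNT(DISTINCT d.id) = ?
    """, (*words, len(words))).fetchall()
    return [r[0] for r in rows]

print(lookup(["red", "edible", "fruit"]))  # tomatoes and strawberries match all three
```

The `HAVING COUNT(...)` clause is what enforces "matches every query word" rather than "matches any query word"; drop it and you get a looser, OR-style search.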

To answer the "how" part of your question, you could utilize human computation: There are hordes of bored teenagers with iPhones around the globe, so create a silly game whose byproduct is filling your database with facts -- to harness their brainpower for your purposes.
Sounds like an awkward concept at first, but look at this lecture on Human Computation for an example.

First, there must be some way of associating concepts (like 'snow') with particular words.
So rather than simply storing a wordlist, you would also need to store concepts or properties like "red", "fruit", and "edible" as well as the keywords themselves, and model relationships between them.
At a simple level, you could have two tables (they don't have to be database tables): a list of keywords and a list of concepts/properties/adjectives. You then model the relationship by storing a third table that represents the mapping from keyword to adjective.
So if you have:
keywords:
0001 aardvark
....
0050 strawberry
....
0072 tomato
....
0120 zoo
and concepts:
0001 big
0002 small
0003 fruit
0004 vegetable
0005 mineral
0006 metal
....
0250 black
0251 blue
0252 red
....
0570 edible
you would need a mapping containing:
0050 -> 0003
0050 -> 0252
0050 -> 0570
0072 -> 0003
0072 -> 0252
0072 -> 0570
You may like to think of this as modelling an "is" relationship: 0050 (a strawberry) "is" 0003 (fruit), and "is" 0252 (red), and "is" 0570 (edible).
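The query over this "is" mapping is just a set intersection. A minimal sketch in Python, with words standing in for the numeric IDs and the mapping inverted so each concept points at the keywords it applies to (the data is the toy example above):

```python
# Inverted form of the keyword -> concept mapping: each concept maps to the
# set of keywords that "is" that concept. IDs replaced by words for clarity.
concept_to_keywords = {
    "fruit":  {"strawberry", "tomato"},
    "red":    {"strawberry", "tomato"},
    "edible": {"strawberry", "tomato", "aardvark"},  # aardvarks are, arguably, edible
}

def reverse_lookup(concepts):
    """Keywords that have an "is" relationship with every requested concept."""
    sets = [concept_to_keywords.get(c, set()) for c in concepts]
    return set.intersection(*sets) if sets else set()

print(reverse_lookup(["red", "edible", "fruit"]))  # {'strawberry', 'tomato'}
```

Intersecting per-concept sets scales reasonably well, which is why real systems store this as an inverted index.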

How will your engine know that
"An incredibly versatile ingredient, essential for any fridge chiller drawer. Whether used for salads, soups, sauces or just raw in sandwiches, make sure they are firm and a rich red colour when purchased",
"mildly acid red or yellow pulpy fruit eaten as a vegetable", and
"an American musician who is known for being the lead singer/drummer for the alternative rock band Sound of Urchin"
all map to the same original word? Natural language definitions are unstructured; you can't store them in a normalized database. You can attempt to impose structure by reducing them to an ontology, like Princeton's WordNet, but creating and using ontologies is an extremely difficult problem -- the topic of PhD theses and well-funded advanced research.

It should be fairly straightforward. You can use straight synonyms in addition to a series of words to define each word. The word order in the definition is sometimes important. Each word can have multiple definitions, of course.
You can develop a rating system to see which definitions are the closest match to the input, then display the top 3 or 4 words.

What about using a dictionary and performing a full-text search over the definitions (after removing stop words like 'and', 'or', ...), then returning the word with the best score (the highest number of matching words, or perhaps a more sophisticated scoring method)?
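A bare-bones sketch of that idea: strip stop words from the query, score each definition by how many query words it contains, and return the best-scoring headwords. The mini-dictionary here is made up for illustration:

```python
# Toy full-text reverse lookup: count overlapping content words.
STOP_WORDS = {"and", "or", "a", "an", "the", "of", "as"}

definitions = {
    "tomato":     "mildly acid red or yellow pulpy fruit eaten as a vegetable",
    "strawberry": "sweet soft red fruit with a seed-studded surface",
    "zoo":        "an establishment which keeps wild animals for display",
}

def score(query, definition):
    q = {w for w in query.lower().split() if w not in STOP_WORDS}
    d = set(definition.lower().split())
    return len(q & d)  # number of matching content words

def best_matches(query, top_n=3):
    scored = [(score(query, defn), word) for word, defn in definitions.items()]
    return [word for s, word in sorted(scored, reverse=True) if s > 0][:top_n]

print(best_matches("red edible fruit"))  # ['tomato', 'strawberry']
```

A real implementation would also stem words and weight rare terms more heavily (e.g. TF-IDF), but the ranking skeleton is the same.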

There are several ways you can go about this depending on how much work you want to put into it. One way you can build a reverse dictionary is to use the definitions to help calculate which words are closely related. This way can be the most difficult because you need to have a pretty extensive algorithm that can associate phrases.
Finding Similar Definitions
One way you could do this is by matching the definition string against the others and seeing which ones match most closely. In PHP you can use the similar_text function. The problem with this method is that if your database contains a great many words and definitions, you will put a lot of overhead on your SQL database.
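For illustration outside PHP, Python's difflib.SequenceMatcher computes a comparable similarity ratio between two strings. A small sketch with a made-up two-entry dictionary:

```python
from difflib import SequenceMatcher

# Illustrative analog of PHP's similar_text: rank stored definitions by
# character-level similarity to the user's query. Entries are made up.
definitions = {
    "tomato": "mildly acid red or yellow pulpy fruit eaten as a vegetable",
    "zoo":    "an establishment which keeps wild animals for display",
}

def most_similar(query):
    """Return the headword whose definition is most similar to the query."""
    return max(definitions,
               key=lambda w: SequenceMatcher(None, query, definitions[w]).ratio())

print(most_similar("red edible fruit"))  # 'tomato'
```

Note that this is quadratic in string length per comparison, which is exactly the overhead concern raised above: comparing a query against every stored definition does not scale without an index.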
Use An API
There are several resources out there you can use to help you get a reverse dictionary by using an API. Here are some of them.
https://www.wordgamedictionary.com/api/ Has an API and includes a working reverse dictionary
http://developer.wordnik.com/docs.html#!/words/reverseDictionary_get_2 Just the API
http://www.onelook.com/reverse-dictionary.shtml Just has a working reverse dictionary

This sounds like a job for Prolog.

Related

How does cb_adf algorithm know a new action is available in the data if no feature is associated with arms?

From the documentation I have read, the cb_adf multiline format is suitable for scenarios where the number of actions changes over time. My question is: how does the algorithm know when a new action is available? Is formatting the logged bandit data like this correct?
two_actions = """
shared | a:0.5 b:1 c:2
0:-0.1:0.75 |
|
"""
and
three_actions_now = """
shared | a:0.5 b:1 c:2
|
0:-0.3:0.55 |
|
"""
And what about if one action is no longer available?
In this case you should use some identity feature for arms that have no other features, because for cb_adf the actions themselves are essentially defined by the set of their features.
shared | a:0.5 b:1 c:2
| action_1
0:-0.3:0.55 | action_2
| action_3
If the action is no longer available, you omit the line that corresponded to that action. So, if we wished to remove action_2 from the pool of actions to be chosen from, it might look like:
shared | a:0.5 b:1 c:2
| action_1
| action_3
cb_adf works best when there is more than just a single feature per action. For example, having features shared across actions allows the learner to learn the value of other features from rewards on other actions.

What does the `relevanceLanguage` parameter really do?

The YouTube Data API documentation mentions a relevanceLanguage parameter, which is defined as follows (emphasis mine).
The relevanceLanguage parameter instructs the API to return search
results that are most relevant to the specified language. The
parameter value is typically an ISO 639-1 two-letter language code.
[...] Please note that results in other
languages will still be returned if they are highly relevant to the
search query term.
I understand the part in bold, but I had a very hard time getting the API to take my relevance language into consideration. In most requests, the relevance language is completely ignored (examples below).
Query           | Relevant Language | Results language
----------------|-------------------|-----------------
Donald Trump    | None              | `en`
                | `fr`              | `en`
                | `de`              | `en`
----------------|-------------------|-----------------
Nicolas Sarkozy | None              | `fr`
                | `en`              | `fr`
                | `de`              | `fr`
Hence my question: what does this parameter actually do? Ideally, I would like to completely filter out results that are not in my relevance language, but AFAIK that's not possible; the alternative is this relevanceLanguage parameter.

Can I export descriptive test names in Selenium / Gherkin / Cucumber?

I have a few tests in feature files that use the Scenario Template method to plug in multiple parameters. For example:
#MaskSelection
Scenario Template: The Mask Guide Is Available
    Given the patient is using "<browser>"
    And A patient registered and logged in
    And A patient selected the mask
    When the patient clicks on the "<guide>"
    Then the patient should see the "<guide>" guide for the mask with "<guideLength>" slides

    Examples:
    | browser          | guide       | guideName              | guideLength |
    | chrome           | mask        | Mask                   | 5           |
    | firefox          | replacement | Mask Replacement Guide | 6           |
    | internetexplorer | replacement | Mask Replacement Guide | 6           |
Currently, this is exporting test results with names like "TheMaskGuideIsAvailableVariant3". Is there any way to have it instead export something like "TheMaskGuideIsAvailable("chrome", "mask", "Mask", "5")"? I have a few tests which export 50+ results, and it's a pain to count the list to figure out exactly which set of parameters failed. I could have sworn the export used to work like this at one time, but I can't seem to replicate that behavior.
Possibly tied to it, recently, I've lost the ability to double-click on the test instance in Test Explorer in Visual Studio and go to the test outline in its file. Instead, nothing happens and I have to manually go to that file.
The answer to the Variant situation is that the part appended to the test name is the first column of the table. If the first column contains non-unique values, the results are exported as numbered "Variants".
The answer I found for exporting the list is to use vstest.console with the "/ListTests" option. Per the previous paragraph, since the first column is the one used for naming, you can add a column containing a concatenated list of the parameters.

Using variable with Background in Cucumber

I'm trying to run a feature file like this:
Feature: my feature

  Background:
    When I do something
    And I choose from a <list>

  Scenario Outline: choice A
    And I click on <something> after the choice A is clicked

    Examples:
      | list | something |
      | a    | 1         |
      | b    | 2         |
      | c    | 3         |
But what happens is that when the second Background step runs, list in the step definition is the literal string <list>, while something in the first Scenario line is 1. Can Background not use the variables from Examples? Putting a copy of Examples before the Scenario Outline does not work either.
The answer to your question is: no. Background is not a scenario outline. It does not take values from Examples, which belong exclusively to the Scenario Outline that contains them.
Suppose you have several Scenario Outlines. Each of them has its own Examples section, which is not shared between them. Consequently, it is not shared with Background either.
That is why it does not work when you move Examples before the Scenario Outline, as you mentioned in your question.

Cucumber load data tables dynamically

I am currently trying to use cucumber together with capybara for some integration tests of a web-app.
There is one test where I just want to click through all (or most of) the pages of the web app and see if no error is returned. I want to be able to see afterwards which pages are not working.
I think that Scenario outlines would be the best approach so I started in that way:
Scenario Outline: Checking all pages
  When I go on the page <page>
  Then the page has no HTTP error response

  Examples:
    | page          |
    | "/resource1"  |
    | "/resource2"  |
    ...
I currently have 82 pages and that works fine.
However, I find this approach is not maintainable, as new resources may be added and existing ones deleted.
A better approach would be to load the table data from somewhere (parsing the HTML of an index page, the database, etc.).
But I did not figure out how to do that.
I came across an article about table transformations, but I could not figure out how to use such a transformation in a scenario outline.
Are there any suggestions?
OK, since there is some confusion: if you have a look at the example above, all I want to do is change it so that the table is almost empty:
Scenario Outline: Checking all pages
  When I go on the page <page>
  Then the page has no HTTP error response

  Examples:
    | page                |
    | "will be generated" |
Then I want to add a transformation that looks something like this:
Transform /^table:page$/ do
  all_my_pages.each do |page|
    table.hashes << { :page => page }
  end
  table.hashes
end
I specified the transformation in the same file, but it is not executed, so I assume that transformations don't work with scenario outlines.
Cucumber is really the wrong tool for that task; you should describe functionality in terms of features. If you want to describe behavior programmatically, you should use something like RSpec or test-unit.
Also, your scenario steps should be descriptive and specific, like written text, not abstract phrases as used in a programming language. They should not include "incidental details" like the exact URL of a resource or its id.
Please read http://blog.carbonfive.com/2011/11/07/modern-cucumber-and-rails-no-more-training-wheels/ and watch http://skillsmatter.com/podcast/home/refuctoring-your-cukes
Concerning your question about "inserting into tables": yes, it is possible if you mean adding additional rows; in fact, you could do anything you like with the table. The result of the Transform block completely replaces the original table.
Transform /^table:Name,Posts$/ do
  # transform the table into a list of hashes
  results = table.hashes.map do |row|
    user = User.create! :name => row["Name"]
    # row values are strings, so convert "Posts" to an integer for the range
    posts = (1..row["Posts"].to_i).map { |i| Post.create! :title => "Nr #{i}" }
    { :user => user, :posts => posts }
  end
  # append another hash to the results (e.g. a User "Tim" with 2 Posts)
  tim = User.create! :name => "Tim"
  tims_posts = [Post.create!(:title => "First"), Post.create!(:title => "Second")]
  results << { :user => tim, :posts => tims_posts }
  results
end

Given /^I have Posts of the following Users:$/ do |transformation_results|
  transformation_results.each do |row|
    # assign Posts to the corresponding User
    row[:user].posts = row[:posts]
  end
end
You could combine this with Scenario Outlines like this:
Scenario Outline: Paginate the post list of an user at 10
  Given I have Posts of the following Users:
    | Name | Posts |
    | Max  | 7     |
    | Tom  | 11    |
  When I visit the post list of <name>
  Then I should see <count> posts

  Examples:
    | name | count |
    | Max  | 7     |
    | Tom  | 10    |
    | Tim  | 2     |
This should demonstrate why "adding" rows to a table might not be best practice.
Please note that it is impossible to expand example tags inside of a table:
Scenario Outline: Paginate the post list of an user at 10
  Given I have Posts of the following Users:
    | Name   | Posts      |
    | <name> | <existing> | # won't work
  When I visit the post list of <name>
  Then I should see <displayed> posts

  Examples:
    | name | existing | displayed |
    | Max  | 7        | 7         |
    | Tom  | 11       | 10        |
    | Tim  | 2        | 2         |
For the specific case of loading data dynamically, here's a suggestion:
A class, let's say PageSets, with methods such as all_pages_in_the_sitemap_errorcount and developing_countries_errorcount, and
a step that reads something like:
Given I am on the "Check Stuff" page
Then there are 0 errors in the "developing countries" pages
or
Then there are 0 errors in "all pages in the sitemap"
The Then step converts the string "developing countries" into the method name developing_countries_errorcount and attempts to call it on the class PageSets. The step expects all _errorcount methods to return an integer in this case. Returning data structures such as maps gives you many possibilities for writing succinct, dynamic steps.
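The string-to-method dispatch described above can be sketched in a few lines. This is shown in Python for brevity; the class and method names (PageSets, *_errorcount) are the hypothetical ones from the text, not a real API:

```python
# Sketch of converting a quoted step phrase into a method name and calling
# it dynamically. PageSets and its methods are hypothetical placeholders;
# a real implementation would crawl pages and count errors.
class PageSets:
    def developing_countries_errorcount(self):
        return 0  # would really check the "developing countries" pages

    def all_pages_in_the_sitemap_errorcount(self):
        return 0  # would really walk the sitemap

def errorcount_for(phrase):
    """Mangle 'developing countries' -> 'developing_countries_errorcount'."""
    method_name = phrase.lower().replace(" ", "_") + "_errorcount"
    return getattr(PageSets(), method_name)()

print(errorcount_for("developing countries"))  # 0
```

In Ruby step definitions the equivalent dynamic call would be `PageSets.new.send(method_name)`; the naming convention is what lets one step definition serve many page sets.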
For more static data we have found YAML very useful for making our tests self-documenting and self-validating, and for helping us remove hard-to-maintain literals like "5382739" that we've all forgotten the meaning of three weeks later.
The YAML format is easy to read and can be commented if necessary (it usually isn't.)
Rather than write:
Given I am logged in as "jackrobinson@gmail.com"
And I select the "History" tab
Then I can see 5 or more "rows of history"
We can write instead:
Given I am logged in as "a user with at least 5 items of history"
When I select the "History" tab
Then I can see 5 or more "rows of history"
In file logins.yaml...

a user with at least 5 items of history:
  username: jackrobinson@gmail.com
  password: WalRus
We use YAML to hold sets of data relating to all sorts of entities like members, providers, policies, ... the list is growing all the time:
In file test_data.yaml...

a member who has direct debit set up:
  username: jackrobinson@gmail.com
  password: WalRus
  policyId: 5382739
  first name: Jack
  last name: Robinson
  partner's first name: Sally
  partner's last name: Fredericks
It's also worth looking at YAML's multi-line text facilities if you need to verify text. Although that's not usual for automation tests, it can sometimes be useful.
I think the better approach would be to use a different tool, one that just crawls your site and checks that no error is returned. Assuming you're using Rails, the tool you might consider is Tarantula:
https://github.com/relevance/tarantula
I hope that helps :)
A quick hack is to change the Examples collector code and use Ruby's eval to run your own function, overwriting the default collected examples data; see:
generate-dynamic-examples-for-cucumber
Drawback: you need to change the scenario_outline.rb file.
