Sorry for the blunt title, but I couldn't really generalize the question. Hope someone with Prolog experience can help me out here. So I've got a database which basically lists universities and their rank, e.g. oxford(1), warwick(2), etc. The question requires me to write a rule that returns all the names of the universities that have the same rank. Thanks in advance.
I believe this is going to require a bit of meta-programming, but only a little bit. You are probably going to have to provide some feedback about my assumptions in this answer in order to get a robust solution. But I think jumping in will get you there faster (with both of us learning something along the way) than asking a sequence of clarifying comments.
Our immediate goal will be to find these "university" facts through what SWI-Prolog calls "Examining the program" (links below, but you could search for it as a section of the Manual). If we can do this, we can query those facts to get a particular rank, thus obtaining all universities of the same rank.
From what you've said, there are a number of "facts" of the form "UNIVERSITY(RANK)." Typically if you consult a file containing these from SWI-Prolog, they will be dynamic predicates and (unless you've done something explicit to avoid it) added to the [user] module. Such a database is often called a "factbase". Facts here mean clauses with only a head (no body); dynamic predicates can in general have clauses with or without bodies.
SWI-Prolog has three different database mechanisms. The one we are discussing is the clause database that is manipulated through not only consulting but also by the assert/retract meta-predicates. We will refer to these as the "dynamic" predicates.
Here's a modification of a snippet of code that Jan Wielemaker provides for generating (through backtracking) all the built-in predicates, now repurposed to generate the dynamic predicates:
generate_dynamic(Name/Arity) :-
predicate_property(user:Head, dynamic),
functor(Head, Name, Arity). % get arity: number of args
In your case you are only interested in certain dynamic predicates, so this may return too much in the way of results. One way to narrow things down is by setting Arity = 1, since your university facts only consist of predicates with a single argument.
Another way to narrow things down is by the absence of a body. If this check is needed, we can incorporate a call to clause/2 (documented on the same page linked above). If we have a "fact" (clause without a body), then the resulting call to clause/2 returns the second argument (Body) set to the atom true.
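For example, here is a small sketch combining both filters; university_fact/1 is a name I'm introducing just for illustration, and it assumes the facts live in the user module as discussed above:

% Enumerate the names of dynamic, arity-1 predicates that have at least one
% clause without a body (this may succeed once per such clause; wrap the
% clause/2 call in once/1 if duplicates matter).
university_fact(Name) :-
    predicate_property(user:Head, dynamic),
    functor(Head, Name, 1),
    clause(user:Head, true).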
As a final note, Jan's website uses SWI-Prolog to deliver its pages, but the resulting links don't always cut-and-paste well. If the link I gave above doesn't work for you, then you can either navigate to Sec. 4.14 of the Manual yourself, or try this link to a mirrored copy of the documentation that appears not-quite current (cf. difference in section numbering and absence of Jan's code snippet).
And feel free to ask questions if I've said something that needs clarification or assumed something that doesn't apply to your setup.
Added: Let's finish the answer by showing how to query a list of universities, whether given as such or derived from the "factbase" as outlined above. Then we have a few comments about design and learning at the end.
Suppose LU = [oxford,warwick,...] is in hand, a list of all possible universities. Apart from efficiency, we may not even care if a few things that are not universities or are not ranked are on the list, depending on the nature of the query you want to do.
listUniversityRank(LU,V,R) :- % LU: list of universities
member(V,LU),
call(V, R).   % call the goal V(R), i.e. UNIVERSITY(RANK), in a portable way
The above snippet defines a predicate listUniversityRank/3 to which we provide a list of universities, and which in turn calls a dynamically constructed goal on each member of the list to find its rank. Such a predicate can be used in several ways to accomplish your objective of finding "all the names of the universities that have the same rank."
For instance, we might want to ask for a specific rank R=1 what universities share that rank. Calling listUniversityRank(LU,V,R) with R bound to 1 would accomplish that, at least in the sense that it would backtrack through all such university names. If you wanted to gather these names into a list, then you could use findall/3.
For that matter you might want to begin listing "all the names of the universities that have the same rank" by making a list of all possible ranks, using setof/3 to collect the solutions for R in listUniversityRank(LU,_,R). setof is similar to findall but sorts the results and eliminates duplicates.
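For example, with a toy list in hand (findall/3 gathers the names for a fixed rank; the setof/3 query uses ^ to existentially quantify the university variable while collecting the distinct ranks):

?- LU = [oxford, warwick], findall(V, listUniversityRank(LU, V, 1), Names).
?- LU = [oxford, warwick], setof(R, V^listUniversityRank(LU, V, R), Ranks).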
Now let's look back and think about how hard we are working to accomplish the stated aim, and what might be a design that makes life easier for that purpose. We want a list of university names with a certain property (all have the same rank). It would have been easier if we had the list of university names to start with. As Little Bobby Tables points out in one of the Comments on the Question, we have a tough time telling what is and isn't a university if there are facts like foo(3) in our program.
But something more is going on here. Using the university names to create the "facts", a different predicate for each different university, obscures the relationship university vs. rank that we would like to query. If we only had it to do over again, surely we'd rather represent this relationship with a single two-argument predicate, say universityRank/2 that directly connects each university name and the corresponding rank. Fewer predicates, better design (because more easily queried, without fancy meta-programming).
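To make that concrete, here is a sketch of what such a factbase and its query could look like; universityRank/2 and same_rank/2 are hypothetical names, not part of your current program:

universityRank(oxford, 1).
universityRank(warwick, 2).
universityRank(cambridge, 1).

% All universities sharing a given rank, collected into a list.
same_rank(Rank, Names) :-
    findall(U, universityRank(U, Rank), Names).

% ?- same_rank(1, Names).
% Names = [oxford, cambridge].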
How are keyword clouds constructed?
I know there are a lot of NLP methods, but I'm not sure how they solve the following problem:
You can have several items that each have a list of keywords relating to them.
(In my own program, these items are articles where I can use NLP methods to detect proper nouns, people, places, and possibly subjects. This will be a very large list given a sufficiently sized article, but I will assume that I can winnow the list down using some method by comparing articles. How to do this properly is what I am confused about.)
Each item can have a list of keywords, but how are the keywords picked so that they aren't overly specific or overly general across items?
For example, trivially, "the" could be a keyword that appears in a lot of items, while "supercalifragilistic" might only appear in one.
I suppose I could create a heuristic where a word counts as a keyword if it appears in n% of the items, with n small enough to still return a nice sublist (say 5% of 1000 articles is 50, which seems reasonable), and then just use that. However, the issue I take with this approach is that, given two entirely different sets of items, there is most likely some difference in interrelatedness between the items, and I'm throwing away that information.
This is very unsatisfying.
I feel that given the popularity of keyword clouds there must have been a solution created already. I don't want to use a library however as I want to understand and manipulate the assumptions in the math.
If anyone has any ideas please let me know.
Thanks!
EDIT:
freenode/programming/guardianx has suggested https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is OK btw, but the issue is that the weighting needs to be determined a priori. Given that two distinct collections of documents will have a different inherent similarity between documents, assuming an a priori weighting does not feel correct.
freenode/programming/anon suggested https://en.wikipedia.org/wiki/Word2vec
I'm not sure I want something that uses a neural net (a little complicated for this problem?), but still considering.
Tf-idf is still a pretty standard method for extracting keywords. You can try a demo of a tf-idf-based keyword extractor (which has the idf vector, as you say determined a priori, estimated from Wikipedia). A popular alternative is the TextRank algorithm based on PageRank, which has an off-the-shelf implementation in Gensim.
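For reference, the usual tf-idf weighting is (one common variant; exact formulations differ between implementations):

tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the number of documents in the collection, and df(t) is the number of documents containing t. The idf factor log(N / df(t)) is the part that has to be estimated ahead of time, e.g. from Wikipedia as in the demo above.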
If you decide for your own implementation, note that all algorithms typically need plenty of tuning and text preprocessing to work correctly.
The minimum you need to do is remove stopwords that you know can never be keywords (prepositions, articles, pronouns, etc.). If you want something fancier, you can use, for instance, spaCy to keep only the desired parts of speech (nouns, verbs, adjectives). You can also include frequent multiword expressions (Gensim has a good function for automatic collocation detection) and named entities (spaCy can do that too). You can get better results if you run coreference resolution and substitute pronouns with what they refer to... There are endless options for improvement.
I'm trying to solve the Sokoban puzzle in Prolog using a depth-first-search algorithm, but I cannot manage to search the solution tree in depth. I'm able to explore only the first level.
All the sources are at Github (links to revision when the question was asked) so feel free to explore and test them. I divided the rules into several files:
board.pl: contains rules related to the board: directions, neighbourhoods,...
game.pl: this file states the rules about movements, valid positions,...
level1.pl: defines the board, position of the boxes and solution squares for a sample game.
sokoban.pl: tries to implement dfs :(
I know I need to go deeper when a new state is created instead of checking if it is the final state and backtracking... I need to continue moving, it is impossible to reach the final state with only one movement.
Any help/advice will be highly appreciated, I've been playing around without improvements.
Thanks!
PS. Ah! I'm working with SWI-Prolog, just in case it makes some difference.
PS. I'm a real newbie to Prolog, and maybe I'm making an obvious mistake, but that is the reason I'm asking here.
This is easy to fix: In sokoban.pl, predicate solve_problem/2, you are limiting the solution to lists of a single element in the goal:
solve_dfs(Problem, Initial, [Initial], [Solution])
Instead, you probably mean:
solve_dfs(Problem, Initial, [Initial], Solution)
because a solution can consist of many moves.
In fact, an even better search strategy is often iterative deepening, which you get with:
length(Solution, _),
solve_dfs(Problem, Initial, [Initial], Solution)
Iterative deepening is a complete search strategy and an optimal strategy under quite general assumptions.
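A minimal sketch of how that might fit into solve_problem/2 (initial_state/2 is a placeholder for however your code currently obtains the starting state; adapt the argument names to your own predicates):

% Iterative deepening: length/2 enumerates lists of length 0, 1, 2, ...
% so solve_dfs/4 is retried with ever longer candidate move lists.
solve_problem(Problem, Solution) :-
    initial_state(Problem, Initial),   % placeholder for your setup code
    length(Solution, _),
    solve_dfs(Problem, Initial, [Initial], Solution).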
Other than that, I recommend you cut down the significant number of impure I/O calls in your program. There are just too many predicates where you write something on the screen.
Instead, focus on a clear declarative description, and cleanly separate the output from a description of what a solution looks like. In fact, let the toplevel do the printing for you: Describe what a solution looks like (you are already doing this), and let the toplevel display the solution as variable bindings. Also, think declaratively, and use better names like dfs_moves/4, problem_solution/2 instead of solve_dfs/4, solve_problem/2 etc.
DCGs may also help you in some places of your code to more conveniently describe lists.
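As a small illustration of that idea, a sequence of moves between states can be described with a DCG along these lines (move/3 here stands in for whatever state-transition relation your game.pl defines; it is not a predicate taken from your code):

% path//2 describes the list of moves leading from State0 to State.
path(State, State)  --> [].
path(State0, State) -->
    [Move],
    { move(State0, Move, State1) },
    path(State1, State).

% Combined with length/2, this again gives iterative deepening:
% ?- length(Moves, _), phrase(path(Start, Goal), Moves).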
+1 for tackling a nice and challenging search problem with Prolog!
I have a few algorithms that extract and rank keywords [both terms and bigrams] from a paragraph [most are based on the tf-idf model].
I am looking for an experiment to evaluate these algorithms. This experiment should give a grade to each algorithm, indicating "how good was it" [on the evaluation set, of course].
I am looking for an automatic / semi-automatic method to evaluate each algorithm's results, and an automatic / semi-automatic method to create the evaluation set.
Note: These experiments will be run off-line, so efficiency is not an issue.
The classic way to do this would be to define a set of key words you want the algorithms to find per paragraph, then check how well the algorithms do with respect to this set, e.g. (generated_correct - generated_not_correct)/total_generated (see update, this is nonsense). This is automatic once you have defined this ground truth. I guess constructing that is what you want to automate as well when you talk about constructing the evaluation set? That's a bit more tricky.
Generally, if there were a way to generate keywords automatically, that would be a good thing to use as ground truth - but then you should just use it as your algorithm ;). Sounds cheeky, but it's a common problem. When you evaluate one algorithm using the output of another algorithm, something is probably going wrong (unless you specifically want to benchmark against that algorithm).
So you might start harvesting key words from common sources. For example:
Download scientific papers that have a keyword section. Check if those keywords actually appear in the text, if they do, take the section of text including the keywords, use the keyword section as ground truth.
Get blog posts, check if the terms in the heading appear in the text, then use the words in the title (always minus stop words of course) as ground truth
...
You get the idea. Unless you want to employ people to manually generate keywords, I guess you'll have to make do with something like the above.
Update
The evaluation function mentioned above is stupid. It does not incorporate how many of the available key words have been found. Instead, the way to judge a ranked list of relevant and irrelevant results is to use precision and recall (formulas below). Precision rewards the absence of irrelevant results; recall rewards the presence of relevant results. That again gives you two measures. To combine them into a single measure, either use the F-measure, which merges the two with an optional weighting, or use Precision#X, where X is the number of results you want to consider. Precision#X, interestingly, is equivalent to Recall#X here. However, you need a sensible X, i.e. if you have fewer than X keywords in some cases, those results will be punished for never providing an Xth keyword. In the literature on tag recommendation, for example, which is very similar to your case, F-measure and P#5 are often used.
http://en.wikipedia.org/wiki/F1_score
http://en.wikipedia.org/wiki/Precision_and_recall
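For reference, with TP the relevant keywords that were returned, FP the irrelevant keywords that were returned, and FN the relevant keywords that were missed:

Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 * Precision * Recall / (Precision + Recall)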
I would like to represent a mutable graph in Prolog in an efficient manner. I will be searching for subsets in the graph and replacing them with other subsets.
I've managed to get something working using the database as my 'graph storage'. For instance, I have:
:- dynamic step/2.
% step(Type, Name).
:- dynamic sequence/2.
% sequence(Step, NextStep).
I then use a few rules to retract subsets I've matched and replace them with new steps using assert. I'm really liking this method... it's easy to read and deal with, and I let Prolog do a lot of the heavy pattern-matching work.
The other way I know to represent graphs is using lists of nodes and adjacency connections. I've seen plenty of websites using this method, but I'm a bit hesitant because it's more overhead.
Execution time is important to me, as is ease-of-development for myself.
What are the pros/cons for either approach?
As usual: using the dynamic database gives you indexing, which may speed things up (on look-up) and slow you down (on asserting). In general, the dynamic database is not so good when you assert more often than you look up. The main drawback, though, is that it also significantly complicates testing and debugging, because you cannot test your predicates in isolation and need to keep the current implicit state of the database in mind.
Lists of nodes and adjacency connections are a good representation in many cases. A different representation I like a lot, especially if you need to store further attributes for nodes and edges, is to use one variable for each node, and use attributed variables (get_attr/3 and put_attr/3 in SWI-Prolog) to store edges on them, for example [edge_to(E1,N_1),edge_to(E2,N_2),...], where the N_i are the variables representing other nodes (with their own attributes), and the E_j are also variables onto which you can attach further attributes to store additional information (weight, capacity etc.) about each edge if needed.
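A minimal sketch of that last representation (the attribute module name graph and the predicate names are only illustrative; a real program would also define attr_unify_hook/2 in that module so the attributes behave sensibly when node variables are unified):

:- module(graph, []).
:- use_module(library(lists)).

% Add an edge Node1 -> Node2 carrying weight W, stored on the variable Node1.
add_edge(Node1, Node2, W) :-
    (   get_attr(Node1, graph, Edges0) -> true
    ;   Edges0 = []
    ),
    put_attr(Node1, graph, [edge_to(W, Node2)|Edges0]).

% Enumerate the outgoing edges of a node variable on backtracking.
edge(Node, W, Node2) :-
    get_attr(Node, graph, Edges),
    member(edge_to(W, Node2), Edges).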
Have you considered using SWI-Prolog's RDF database? http://www.swi-prolog.org/pldoc/package/semweb.html
As mat said, dynamic predicates have an extra cost.
If, however, you can construct the graph once and don't need to change it afterwards, you can compile the predicate and it will be as fast as a normal static predicate.
Usually in SWI-Prolog, clause lookup is done using hash tables on the first argument (they are resized in the case of dynamic predicates).
Another solution is association lists, where the cost of lookup etc. is O(log n).
Once you understand how they work, you could easily write an interface if needed.
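For example, a small sketch of an adjacency map using SWI-Prolog's library(assoc); the predicate names here are mine, just for illustration:

:- use_module(library(assoc)).

% Build an adjacency map (node -> list of neighbours) from From-To edge pairs.
edges_graph(Edges, Graph) :-
    empty_assoc(Empty),
    foldl(insert_edge, Edges, Empty, Graph).

insert_edge(From-To, Graph0, Graph) :-
    (   get_assoc(From, Graph0, Neighbours) -> true
    ;   Neighbours = []
    ),
    put_assoc(From, Graph0, [To|Neighbours], Graph).

% Look up the neighbours of a node in O(log n).
neighbours(Graph, Node, Neighbours) :-
    get_assoc(Node, Graph, Neighbours).

% Example query:
% ?- edges_graph([a-b, a-c, b-c], G), neighbours(G, a, Ns).
% Ns = [c, b].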
In the end, you can always use an SQL database and the ODBC interface to submit queries (although that sounds like overkill for the application you mentioned).
I'm trying to create a simple scheduler in Prolog that takes a bunch of courses along with the semesters they're offered and a user's ranking of the courses. These inputs get turned into facts like
course('CS 4812','Quantum Information Processing',1.0882353,s2012).
course('Math 6110','Real Analysis I',0.5441176,f2011).
where the third entry is a score. Currently, my database is around 60 classes, but I'd like the program to eventually be able to handle more. I'm having trouble getting my DP implementation to work on a nontrivial input. The answers are correct, but the time spent is on the same order as the brute force algorithm. I handle memoization with a dynamic predicate:
:- dynamic(stored/6).
memo(Result,Schedule,F11,S12,F12,S13) :-
stored(Result,Schedule,F11,S12,F12,S13) -> true;
dpScheduler(Result,Schedule,F11,S12,F12,S13),
assertz(stored(Result,Schedule,F11,S12,F12,S13)).
The arguments to dpScheduler are the optimal schedule (a tuple of a list of classes and its score), the classes chosen so far, and how many classes remain to be chosen for the Fall 2011, Spring 2012, Fall 2012, and Spring 2013 semesters. Once the scheduler has a complete schedule, it gets the score with evalSchedule, which just sums up the scores of the classes.
dpScheduler((Acc,X),Acc,0,0,0,0) :-
!, evalSchedule(X,Acc).
I broke dpScheduler up into a clause for each semester, but they all look pretty much the same. Here is the clause for Fall 2011, the first semester chosen.
dpScheduler(Answer,Acc,N,B,C,D) :-
!, M is N - 1,
getCourses(Courses,f2011,Acc),
lemma([Head|Tail],Courses,Acc,M,B,C,D),
findBest(Answer,Tail,Head).
The lemma predicate computes all the subgoals.
lemma(Results,Courses,Acc,F11,S12,F12,S13) :-
findall(Result,
(member(Course,Courses), memo(Result,[Course|Acc],F11,S12,F12,S13)),
Results).
My performance has been horrendous, and I'd be grateful for any pointers on how to improve it. Also, I'm a new Prolog programmer, and I haven't spent much time reading others' Prolog code, so my program is probably unidiomatic. Any advice on that would be much appreciated as well.
There are a couple of reasons for bad performance:
First of all, assertz/1 is not very fast, so you spend a lot of time there if there are a lot of asserts.
Then, Prolog uses a hash table based on the first argument to match clauses. In your case, the first argument is Result, which is uninstantiated when the predicate is called, so I think you pay a performance penalty because of that. You could solve this by reordering the arguments. I thought you could change the argument on which the hash table is based, but I don't see how in the SWI-Prolog manual :/
Also, Prolog isn't really renowned for great performance xd
I suggest using XSB (if possible), which offers automatic memoization (tabling); you simply write
:- table(my_predicate/42) and it takes care of everything. I think it's a bit faster than swipl too.
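For the predicate in the question that would be something like the sketch below (recent SWI-Prolog versions also provide tabling via the same :- table directive, so switching systems may not even be necessary; note that tabled predicates should avoid cuts that cut across the tabled call):

:- table dpScheduler/6.

% With this declaration, repeated calls to dpScheduler/6 with the same
% arguments are answered from the table, so the hand-written memo/6 and
% stored/6 machinery can be dropped.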
Other than that, you could try to use a list with all the calculated values and pass it around; maybe an association list.
Edit: I don't really see where you call the memoization predicate.