In layman's terms, what's an RDF triple?
I think the question needs to be split into two parts - what is a triple and what makes an "RDF triple" so special?
Firstly, a triple is, as most of the other commenters here have already pointed out, a statement in "subject/predicate/object" form - i.e. a statement linking one object (the subject) to another object (the object) or to a literal, via a predicate. We are all familiar with triples: a triple is the smallest irreducible representation of a binary relationship. In plain English, a spreadsheet is a collection of triples: for example, if a column in your spreadsheet has the heading "Paul", a row has the heading "has Sister", and the value in the cell is "Lisa", you have a triple: Paul (subject) has Sister (predicate) Lisa (literal/object).
What makes RDF triples special is that EVERY PART of the triple has a URI associated with it, so the everyday statement "Mike Smith knows John Doe" might be represented in RDF as:
uri://people#MikeSmith12 http://xmlns.com/foaf/0.1/knows uri://people#JohnDoe45
The analogy to the spreadsheet is that by giving every part of the triple its own unique address (a URI), you give each cell of the spreadsheet a global address, so you could in principle put every cell (if expressed as RDF triples) of the spreadsheet into a different document on a different server and reconstitute the spreadsheet through a single query.
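To make that concrete, here's a minimal sketch using Python's rdflib (the people URIs are illustrative placeholders; foaf:knows is a real, widely reused predicate):

# Minimal sketch: an RDF triple where every part has a URI.
from rdflib import Graph, URIRef
from rdflib.namespace import FOAF

g = Graph()
g.add((
    URIRef("http://example.org/people#MikeSmith12"),  # subject
    FOAF.knows,                                       # predicate
    URIRef("http://example.org/people#JohnDoe45"),    # object
))
print(g.serialize(format="turtle"))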
Edit:
This section of the official documentation addresses the original question.
An RDF Triple is a statement which relates one object to another. For Example:
"gcc" "Compiles" "c" .
"gcc" "compiles" "Java" .
"gcc" "compiles" "fortran" .
"gcc" "has a website at" <http://gcc.gnu.org/> .
"gcc" "has a mailing list at" <mailto:gcc-help#gcc.gnu.org> .
"c" "is a" "programming language" .
"c" "is documented in" <http://www.amazon.com/Programming-Language-Prentice-Hall-Software/dp/0131103628/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1226085111&sr=8-1> .
An RDF file should parse down to a list of triples.
A triple consists of a subject, a predicate, and an object. But what do these actually mean?
The subject is, well, the subject. It identifies what object the triple is describing.
The predicate defines the piece of data in the object we are giving a value to.
The object is the actual value.
From: http://www.robertprice.co.uk/robblog/archive/2004/10/What_Is_An_RDF_Triple_.shtml
Regarding the answer by Adam N.: I believe the O.P. asked a previous question regarding data for a social network, so although that answer is excellent, I will just clarify in relation to the "original original" question (as I feel responsible).
John | Is a friend of | James
James | Is a friend of | Jill
Jill | Likes | Snowboarding
Snowboarding | Is a | Sport
Using triples like this you can have a really flexible data structure.
Perhaps look at Friend of a Friend (FOAF) for a better example.
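To make the flexibility concrete, here's a hedged sketch using Python's rdflib and a SPARQL query (the example.org names are made-up placeholders):

from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.John, EX.isFriendOf, EX.James))
g.add((EX.James, EX.isFriendOf, EX.Jill))
g.add((EX.Jill, EX.likes, EX.Snowboarding))

# What do John's friends-of-friends like? Follow two isFriendOf edges.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?hobby WHERE {
        ex:John ex:isFriendOf ?friend .
        ?friend ex:isFriendOf ?person .
        ?person ex:likes ?hobby .
    }
""")
for row in results:
    print(row.hobby)  # http://example.org/Snowboarding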
RDF is a Language, i.e., a system of signs, syntax, and semantics for encoding and decoding information (data in some context).
In RDF, a unit of observation (Data) is represented by a sentence that consists of three parts: subject, predicate, object. Basically, this is the fundamental structure of natural language speech.
The sign used to denote entities (things) participating in entity relationships represented by RDF is an IRI (which includes HTTP URIs). Each subject and predicate (and optionally, object) component of an RDF sentence is denoted by an IRI.
The syntax (grammar) is abstract (meaning it can be represented using a variety of notations) in the form of subject, predicate, and object arrangement order.
The semantics (the part overlooked most often) is all about the meaning of the subject, predicate, and object roles in an RDF statement.
When you use HTTP URIs to denote RDF statement subjects, predicates, and (optionally) objects, you end up with structured data (collections of entity relationship types) that form a web -- just as you have today on the World Wide Web.
When the semantics of a predicate (in particular) in an RDF statement are both machine and human comprehensible, you have a web of entity relationship types that provides a powerful encoding of information, which is a foundation for knowledge (inference and reasoning).
Here are examples of simple RDF statements:
{
<#this> a schema:WebPage .
<#this> schema:about dbpedia:Resource_Description_Framework .
<#this> skos:related <https://stackoverflow.com/questions/30742747/convert-a-statement-with-adjective-in-rdf-triple/30836089#30836089> .
}
I've used braces to enclose the examples so that this post turns into a live RDF-based Linked Data demonstration, courtesy of relative HTTP URIs and the # based fragment identifier (indexical).
Results of the RDF statements embedded in this post, courtesy of nanotation (embedding RDF statements wherever text is accepted):
Basic Entity Description Page -- Each Statement is identified by a hyperlink that resolves to its description (subject, predicate, object parts)
Deeper Faceted Browsing Page -- Alternative view that lends itself to deeper exploration and discovery by following-your-nose through the hyperlinks that constitute the data web or web of linked data.
Description of an embedded statement -- About a specific RDF statement.
Here's the visualization generated from the triples embedded in this post (using our Structured Data Sniffer browser extension and RDF-Turtle notation).
Note that it can get a bit more complicated: RDF triples can themselves be treated as subjects or objects, so you can have something like:
Bart -> said -> ( triples -> can be -> objects)
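For illustration, here's how that nesting can be expressed with classic RDF reification in Python's rdflib (the example.org vocabulary is a placeholder; RDF-star offers a newer, more compact alternative):

from rdflib import BNode, Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# Reify the inner triple (triples, canBe, objects) as a statement node...
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.triples))
g.add((stmt, RDF.predicate, EX.canBe))
g.add((stmt, RDF.object, EX.objects))

# ...so the statement itself can be the object of another triple.
g.add((EX.Bart, EX.said, stmt))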
I'm going to have to agree with A Pa in part, even though he was down-voted.
Background: I'm a linguist, with a PhD in that subject, and I work in computational linguistics.
The statement that "...a sentence that consists of three parts: subject, predicate, object. Basically, this is the fundamental structure of natural language speech" (which A Pa quotes from Kingsley Uyi Idehen's answer) is simply wrong. And it's not just that Kingsley says this, I've heard it from many advocates of RDF triples.
It's wrong for many reasons, for example: Predicates (in English, arguably, and in many other natural languages) consist of a verb (or a verb-like thing) + the object (and perhaps other complements). It is definitely NOT the case that the syntactic structure of English is Subj-Pred-Obj.
Furthermore, not all natural language sentences in English have an object; intransitive verbs, in particular, by definition do not take objects. And weather verbs (among other things) don't even take a "real" subject (the "it" of "it rains" has no reference). And on the other hand, ditransitive verbs like "give" take both a direct and an indirect object. Then there are verbs like "put" that take a locative in addition to the direct object, or "tell" that take an object and a clause. Not to mention adjuncts, like time and manner adverbials.
Yes, of course you can represent embedded clauses as embedded triples (to the extent that you can represent any statement as triples, which, as I hope I've made clear, you can't), but what I don't think you can do in RDF (at least I've never seen it done, and it seems like it would take a quadruple) is to have both an object and an embedded clause. Likewise both a direct and an indirect object, or adjuncts.
So whatever the motivation for RDF triples, I wish the advocates would stop pretending that there's a linguistic motivation, or that the triples in any way resemble natural language syntax. Because they don't.
It has been a while since I worked with RDF, but here it goes :D
A triple is a subject, predicate and object.
The subject is a URI which uniquely identifies something. For example, your openid uniquely identifies you.
The predicate defines how the subject and object are related; it names some attribute of the subject, for example a name.
The object is the value of that attribute.
Given that, the triples form a graph S -P-> O. Given more triples, the graph grows. For example, you can have the same person identified as the subject of a bunch of triples; you can then connect all of the predicates through that unique subject.
An RDF triple is an expression that defines a way to represent a relationship between objects. There are three parts to a triple: subject, predicate, and object (typically written in that order). The predicate relates the subject to the object.
Subject ----Predicate---> Object
More useful information can be found at:
http://www.w3.org/TR/rdf-concepts/
A simple answer is that an RDF triple is a representation of some knowledge using the RDF data model. This model is based upon the idea of making statements about resources (in particular web resources, identified by URIs) in the form of subject–predicate–object expressions. RDF is also a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed. I recommend this article to see how: https://www.w3.org/DesignIssues/RDF-XML.html
One can think of a triple as a type of sentence that states a single "fact" about a resource. First of all, to understand an RDF triple you should know that everything in RDF is defined in terms of a URI reference (http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-URI-reference) or a blank node (http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-blank-node).
An RDF triple consists of three components:
1) Subject
2) Predicate
3) Object
For example: Pranay hasCar Ferrari
Here the subject is Pranay, hasCar is the predicate, and Ferrari is the object. Each of these is defined with an RDF URI. For more information you can visit http://www.w3.org/TR/owl-ref/
Triple explained by example
Suppose there is a table that relates users and questions.
TABLE dc:creator
-------------------------
| Question | User |
-------------------------
| 45 | 485527 |
| 44 | 485527 |
| 40 | 485528 |
This could conceptually be expressed in three RDF triples like...
<question:45> <dc:creator> <user:485527>
<question:44> <dc:creator> <user:485527>
<question:40> <dc:creator> <user:485528>
...so that each row is converted to one triple that relates a user to a question. The general form of each triple can be described as:
<Subject> <Predicate> <Object>
One special thing about RDF is that you can (or must) use URIs/IRIs to identify entities as well as relations. This makes it possible for everyone to reuse already existing relations (predicates) and to publish statements about arbitrary entities on the web.
Example relating a SO answer to its creator:
<https://stackoverflow.com/a/49066324/1485527>
<http://purl.org/dc/terms/creator>
<https://stackoverflow.com/users/1485527>
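A hedged sketch of that row-to-triple conversion in Python with rdflib (the example.org question/user namespaces stand in for real site URIs; dcterms:creator is the real Dublin Core predicate):

from rdflib import Graph, Namespace
from rdflib.namespace import DCTERMS

QUESTION = Namespace("http://example.org/question/")
USER = Namespace("http://example.org/user/")

rows = [(45, 485527), (44, 485527), (40, 485528)]  # the table above

g = Graph()
for question, user in rows:
    # One triple per row: <question> dcterms:creator <user>
    g.add((QUESTION[str(question)], DCTERMS.creator, USER[str(user)]))

print(g.serialize(format="nt"))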
As a developer, I struggled for a while until I finally understood what RDF and its triples were about, mostly because I have always seen the world through code and not through data.
Given this is posted on StackOverflow, here is the Java analogy that finally made it click for me: an RDF triple is to data what a class's method/parameter is to code.
So:
A class with its package name is the Subject
A method on this class is the Predicate
Parameter(s) on the method are the Object(s), which are themselves represented by classes
Contexts are import statements to avoid writing the full canonical name of classes
The only point where this analogy breaks down a bit is that Predicates also have namespaces, while methods do not. But the overall relationships created between class instances as Subject and Object when a Predicate is used reflects on the idea of calling a method to do something.
Basically, RDF is to data what OO is to code.
See:
http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/#dfn-rdf-triple
An RDF triple contains three components:
the subject, which is an RDF URI reference or a blank node
the predicate, which is an RDF URI reference
the object, which is an RDF URI reference, a literal or a blank node
where literals are essentially strings with optional language tags, and blank nodes are also strings. URIs, literals and blank nodes must be from pair-wise disjoint sets.
Related
Warning: This question is more philosophical than practical, but I find it fit to be asked and answered in practical contexts (forums like StackOverflow here, instead of the SoftwareEngineering Stack Exchange site), due to the way YAML has developed in actual de-facto use and the way its specification has evolved and gained features over time. Let's ask:
As opposed to formats/languages/protocols such as JSON, the YAML format allows you (according to this link, which seems a pretty official, or at least accurate and reliable, source for understanding the YAML specification) to embed multiple 'Documents' within one file/stream, using the three-dashes marking ("---").
If so, it's hard to ignore the fact that the concept/model/idea of a 'Document' in YAML is no longer an external definition or "meta" directive that helps the human/parser organize multiple distinct documents alongside each other (similar to the way file systems define the concept of a "file" to organize different files, while each file in itself does not necessarily recognize that it is a file, or that it is part of a file system that wraps it, by definition, AFAIK).
However, when YAML allows for multi-Document YAML files that gather collections of Documents in a single YAML file (perhaps in a way similar/analogous to the HTTP pipelining approach of the HTTP protocol), the concept/model/idea/goal of a Document receives a new, wider definition/character de facto, as a part of the YAML grammar and its products, and not just as an assistive concept or format description that helps to describe the specification.
If so, with a Document being part of the language itself, what is the added value of this data structure compared to the existing, familiar, and well-used good old data structure of a Collection (an array of items)?
I'm asking because I've seen at this link (here) a snippet (in the second example) that describes a YAML sequence which is actually a collection of logs. For some reason, the author of the example chose to present each log as a separate "Document" (separated with three dashes), gathered together in the same YAML sequence/file, instead of writing a file that has a "Collection" of logs represented with the data type of an array. Why did he choose to do this? Is his choice fitting, correct, ideal?
I can speculate that the added value of the distinction between a Document and a Collection becomes relevant when using more advanced features of the YAML grammar, such as Anchors, Tags, and References. I guess every Document guarantees that all these identifiers form a unique set, with no collisions or duplicates among them. Am I right? And if so, is this the only advantage, or are there more justifications for the existence of these two pretty similar data structures?
My best guess for now is to see a Document as a "meta"-Collection that is stricter and lacks high-level logic, or as two different layers of collection schemes. Is that a correct, accurate way to view it?
And even if I am right, why, in the above example (of the logs document from the link), where there is no actual or expected use of duplications, collisions, identifiers/anchors, or compound structures at all, does the author still choose to represent the collection's items as separate documents? Is this just a not-so-successful choice of example? Or maybe I'm missing something, and this is a redundancy in the specification, or an evolving syntactic sugar born of practical needs?
Because the example was written on a website that looks serious, with official information written by professionals who dealt with the essence of the language and its definition, theory, and philosophy (as opposed to practical uses in the wild), and also in light of the other examples I have seen there and the added value of their meticulousness, I prefer not to assume that the example is simply imperfect or ill-fitting, and I suspect there may be a good reason to write it this way rather than another in the specific case exemplified.
First, let's look at the technical difference between the list of documents in a YAML stream and a YAML sequence (which is a collection of ordered items). For this, I'll discuss YAML tags, which are an advanced feature so I'll provide a quick overview:
YAML nodes can have tags, such as !!str (the official tag for string values) or !dice (a local tag that can be interpreted by your application but is unknown to others). This applies to all nodes: Scalars, mappings and sequences. Nodes that do not have such a tag set in the source will be assigned the non-specific tag ?, except for quoted scalars which get ! instead. These non-specific tags are later resolved to specific tags, thereby defining to which kind of data structure the node will be deserialized into.
YAML implementations in scripting languages, such as PyYAML, usually only implement resolution by looking at the node's value. For example, a scalar node containing true will become a boolean value, 42 will become an integer, and droggeljug will become a string.
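You can verify that value-based resolution with a few lines of PyYAML:

import yaml

print(yaml.safe_load("true"))        # True (a boolean)
print(yaml.safe_load("42"))          # 42 (an integer)
print(yaml.safe_load("droggeljug"))  # 'droggeljug' (a string)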
YAML implementations for languages with static types, however, do this differently. For example, assume you deserialize your YAML into a Java class
public class Config {
String name;
int count;
}
Assume the YAML is
name: 42
count: five
The 42 will become a String despite the fact that it looks like a number. Likewise, five will generate an error because it is not a number; it won't be deserialized into a string. This means that it is not the content of the node that defines how it will be deserialized, but the path to the node.
What does this have to do with documents? Well, the YAML spec says:
Resolving the tag of a node must only depend on the following three parameters: (1) the non-specific tag of the node, (2) the path leading from the root to the node and (3) the content (and hence the kind) of the node.
So, the technical difference is: If you put your data into a single document with a collection at the top, the YAML processor is allowed to take into account the position of the data in the top-level collection when resolving a tag. However, when you put your data in different documents, the YAML processor must not depend on the position of the document in the YAML stream for resolving the tag.
What does this mean in practice? It means that YAML documents are structurally disjoint from one another. Whether a YAML document is valid or not must not depend on any preceding or succeeding documents. Consequently, even when deserialization runs into a semantic problem (such as with the five above) in one document, a following document may still be deserialized successfully.
The goal of this design is to be able to concatenate arbitrary YAML documents together without altering their semantics: A middleware component may, without understanding the semantics of the YAML documents, collect multiple streams together or split up a single stream. As long as they are syntactically correct, stream splitting and merging are sound operations that do not invalidate a YAML document even if another document is structurally invalid.
This design focuses primarily on sending and receiving data over networks. Of course, nowadays YAML is mostly used as a configuration language, which is why this feature is seldom used and of rather little importance.
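For completeness, here's a small PyYAML sketch of consuming a multi-document stream; each document deserializes independently:

import yaml

stream = """\
---
event: start
---
event: crash
...
"""

# safe_load_all yields one Python object per document in the stream.
for document in yaml.safe_load_all(stream):
    print(document)
# {'event': 'start'}
# {'event': 'crash'}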
Edit: (Reply to comment)
What about edge cases, like a string-tagged Document that starts with a folded string, making even the following "---" and "..." just characters of the global string?
That is not the case, see rules l-bare-document and c-forbidden. A line containing un-indented ... not followed by non-whitespace will always end a document if one is open.
Moreover, ... doesn't do anything if no document is open. This ensures that a stream merger can always append ... to a document to ensure that the current document is closed, but no additional one is created.
--- has widely been adopted as a separator between YAML documents (and, perhaps more prominently, between YAML front matter and content in tools like Jekyll) where ... would have been more appropriate, particularly in Jekyll. This gives the false impression that --- should be used by tooling to separate documents, when in reality ... is the syntactic element designed for that use case.
I'm trying to use the ABPerson method searchElementForProperty:... to create a moderately complex search. In particular, I want to find the set of people who have an email address that ends with "foo.com", and are NOT part of the pre-populated group "My workunit".
Matching against just the email address seems to be trivial. Creating a conjunction against the (inverse of the) group membership seems impossible.
Yes, I can do this by doing the obvious explicit cross-checking myself, but if the point of having search functionality directly in Address Book is to optimize performance, wouldn't it make sense for the search facility to be sufficiently complete to be able to do this?
Thanks in advance,
Tony
You could potentially copy all the data from the address book into a Core Data store and use predicates to work with that data. Predicates tend to be very useful when building complex queries.
Predicate Programming Guide
In this case you would have to get all contacts ([[ABAddressBook sharedAddressBook] people]) and also have a Core Data entity called Contact (or something similar) that would save names, emails, addresses and other properties from the ABPerson objects.
Having this you can probably create an NSPredicate to filter with the conditions you want.
Groups reference their members by recordId. The only way I have found to perform such an operation is described here: how to find parent groups of a person. It is not as simple as we would like. It seems that Apple is not concerned with group searching, which would be extremely useful.
I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks if you want to include X in a mail when you type Y's name and you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags' "related-ness" in an object or database... thoughts?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then - when you press the "publish" button - analysing the new post and proposing all tags matched with your post's characteristics.
This can be done in several ways, from a "totally hard-coded" association to some sort of "learning AI"... and everything in between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your post are relevant for tagging (e.g. its length, if you tag posts "short" or "long"; the presence of photos or videos, if you tag them "multimedia-content"; etc.). The most obvious, however, is to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {'code-development': ['programming', 'language', 'python',
                                       'function', 'object', 'method'],
                  'family': ['Theresa', 'kids', 'uncle Ben', 'holidays']}
Then you would check your post for the presence of the words in the lists (the part between [ and ]) and propose the tag (the word before :) as a possible candidate.
A common approach is to assign "scores", in other words a number that indicates the probability that a given tag is the right one. For example: if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming", the program should indicate family as the most likely tag to use, as there are many more words hinting at it.
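A minimal sketch of that scoring idea in Python (the scoring scheme and threshold are arbitrary choices, not a recommendation):

def suggest_tags(post_text, tag_hint_words, min_score=1):
    text = post_text.lower()
    # Score one point per hint word found in the post.
    scores = {tag: sum(1 for hint in hints if hint.lower() in text)
              for tag, hints in tag_hint_words.items()}
    # Propose tags ordered by score, best first.
    return [tag for tag, score in sorted(scores.items(), key=lambda kv: -kv[1])
            if score >= min_score]

post = ("After months of programming, we finally left for the summer "
        "holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!")
print(suggest_tags(post, tag_hint_words))  # ['family', 'code-development']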
Learning AI's
One of the obvious limitations of the above method is that - say one day you pick up Java besides Python - you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they indeed... learn!). Some algorithms require initial training (many spam filters and voice recognition programs need this initial "primer"); some don't.
I am absolutely no expert on the subject, but two common AIs are the Naive Bayes Classifier and some flavour of neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
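As a taste of how little code a trained classifier takes, here's an illustrative Naive Bayes sketch with scikit-learn (the training posts and tags are made-up placeholders):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_posts = [
    "refactored the python function into smaller methods",
    "object oriented programming in a new language",
    "holidays with Theresa and the kids at uncle Ben's cottage",
    "the kids loved the family trip over the holidays",
]
train_tags = ["code-development", "code-development", "family", "family"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_posts, train_tags)

print(model.predict(["spent the week programming a new method"]))
# ['code-development']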
HTH!
You should have a look at this post :
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using Collaborative filtering. You might want to give that a look see.
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.
I'm trying to make an algorithm that easily simplifies and groups synonyms (with mismatches, capitals, acronyms, etc.) into only one. I suppose there should exist a standard way to build such a structure so that, when looking up a string with possible mismatches, if the string exists in the structure, it returns a normalized string key. In short, sometimes the same concept can be written in several ways, but I only want to keep the concept.
For instance, suppose I want to normalize or simplify the appearances of
"General Director", "General Manager", "G, Dtor", "Gen Dir", ...
into
"GEN_DIR"
and keep only this result for further reference.
By the way, I suppose that building a Hash with key/value pairs like
hash["General Director"]="GEN_DIR"
hash["General Manager"]="GEN_DIR"
hash["G, Dtor"]="GEN_DIR"
hash["G, Dir"]="GEN_DIR"
could be a solution, but I suspect that there are more elegant or adequate solutions to that.
I would also need a way to persist this associative structure easily, without any database, because it should grow as I find more mismatches of the same word or sentence. A possible approach, I think, is to define this structure by means of a DSL, but I'm open to suggestions.
Well, there is no rule, at least not a clear one.
My aim is to scrape from the web some "structured" data that is sometimes incorrectly or incompletely typed. Some fields are descriptions and can be left as is. But some fields are supposed to be "sets" yet aren't correctly typed (as in my example). A human reading them immediately knows what they mean and can associate them with their meaning.
But I would like to automate as much as possible the process of reducing those possible mismatches to only one "string" (or symbol) before, for instance, saving it into a database. So what I would need is a kind of hash or dictionary, as sawa correctly stated, that I can use to look up any of these dirty strings and get the normalized string or symbol.
Also, of course, it would be desirable to have a way to make this hash (or whatever else it could be) learn from new mismatches in some way and add a new association automatically (possibly based on a distance measure between the mismatched string and the normalized string: if it is lower than X, a new association is built). The whole association (i.e., the hash) should grow as new mismatches and concepts arise, and it should be persisted somewhere (possibly in an XML file, or something like what Mori answered below) for future use.
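For what it's worth, here's a hedged sketch of that idea using only Python's standard library (difflib for the distance measure, json for database-free persistence; the cutoff is an arbitrary choice):

import difflib
import json

normalized = {
    "General Director": "GEN_DIR",
    "General Manager": "GEN_DIR",
    "G, Dtor": "GEN_DIR",
    "Gen Dir": "GEN_DIR",
}

def normalize(raw, cutoff=0.6):
    # Exact hit first, then the closest known variant above the cutoff.
    if raw in normalized:
        return normalized[raw]
    close = difflib.get_close_matches(raw, list(normalized), n=1, cutoff=cutoff)
    if close:
        # Learn the new variant so future lookups are exact.
        normalized[raw] = normalized[close[0]]
        return normalized[raw]
    return None

print(normalize("Gral Director"))  # likely "GEN_DIR"

# Persist the grown mapping without a database.
with open("normalized.json", "w") as f:
    json.dump(normalized, f, indent=2)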
Any new ideas?
Say you were to create a search engine that accepts a query statement in the form of a String. The statement can be used to retrieve different types of objects with a given set of characteristics, possibly linked to other objects. In plain English or pseudo-code, using an OOP approach, how would you go about parsing and processing statements such as the following to get the desired series of objects?
get fruit with colour green
get variety of apples, pears from Andy
get strawberry with colour "deep red" and origin not Spain
get total of sales of melons between 2010-10-10 and 2010-12-30
get last deliverydate of bananas from "Pete" and state not sold
Hope the question is clear. If not I'll be more than happy to reformulate.
P.S: This isn't homework ;)
Your problem is well suited to a document-oriented store such as Lucene. For example, you can design a schema such as:
Type
Variety
Color
Origin
DateSold
etc.
Then you can write a Lucene query such as Type:Fruit AND Color:Green. You can also build nested queries such as (Type:Strawberry AND Color:"Deep Red") AND NOT Origin:Spain.
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built using the Lucene library and is easily integrable into your platform of choice because it has a RESTful API.
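If you want to prototype this in Python without a server, here's a hedged sketch with Whoosh, a pure-Python library that supports Lucene-style queries (the schema and index directory are illustrative):

import os
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in
from whoosh.qparser import QueryParser

schema = Schema(type=TEXT(stored=True), color=TEXT(stored=True),
                origin=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(type="strawberry", color="deep red", origin="Spain")
writer.add_document(type="strawberry", color="deep red", origin="France")
writer.commit()

with ix.searcher() as searcher:
    # Deep red strawberries that are not from Spain.
    query = QueryParser("type", ix.schema).parse(
        'strawberry AND color:"deep red" AND NOT origin:Spain')
    for hit in searcher.search(query):
        print(hit.fields())  # e.g. {'color': 'deep red', 'origin': 'France', ...}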
BTW, Solr has something called faceting, which lets the user filter results using each of the criteria above. So the user types fruit into the search box and then gets results back like:
Type:
- Fruit (109)
- Nut (99)
Origin:
- Spain(32)
- France(39)
Color:
- Red (22)
- Deep Red(45)
Clicking on each of the facets filters the results by intersection. So if you want a more user-friendly interaction model, faceting/filtering is much easier than getting users to type extensive Lucene queries.
Update: You might still need to do some lexical parsing if you wish to let users type natural-language queries and break them down, but given how tremendously difficult that challenge is, my suggestion would be to use the simple & powerful faceting approach.
Hope that helps.
It sounds like you're developing a mini language, since you're concerned with syntax and parsing. So, check out the many tools used to generate lexers and parsers. You can start here: http://en.wikipedia.org/wiki/Lexical_analysis
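As a hedged illustration of that lexing/parsing step, here's a small grammar fragment with Python's pyparsing that tokenizes queries shaped like the ones in the question (the grammar is a made-up sketch, not a complete language):

from pyparsing import (CaselessKeyword, Group, Optional, QuotedString,
                       Word, ZeroOrMore, alphanums, alphas)

GET, WITH, AND, NOT = map(CaselessKeyword, ["get", "with", "and", "not"])

value = QuotedString('"') | Word(alphanums + "-")
attribute = Word(alphas)
condition = Group(attribute + Optional(NOT) + value)
query = (GET + Word(alphas)("type")
         + Optional(WITH + condition + ZeroOrMore(AND + condition))("conditions"))

result = query.parseString('get strawberry with colour "deep red" and origin not Spain')
print(result)
# e.g. ['get', 'strawberry', 'with', ['colour', 'deep red'], 'and', ['origin', 'not', 'Spain']]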
I agree with John.
a) Start with lexical analysis
b) Take statistics of searches and use them to index
c) Find relationships by analysing possibly related searches
This is just a wild guess though, never tried it before.