What is the correct way of validating RDF with ShEx when part of the IRIs are already in the triple store?

Say that I want to validate the insertion of a company promotion into a triple store using ShEx. A possible approach would be to write the ShEx as:
:Promotion {
my-onto:has_person @:Person ;
my-onto:grants_role @:Role ;
}
:Person {
a [ foaf:Person ] ;
}
:Role {
a [ my-onto:CompanyRole ] ;
}
This is a simplification. The problem is that when inserting the data, the triples will be something like:
:promotion-123 my-onto:has_person :person-456 ;
my-onto:grants_role :role-CTO .
and this graph won't pass ShEx validation because it lacks all the rdf:type (a) triples.
So the shapes make sense for defining and documenting which IRIs are acceptable in those two relations, but in 90% of real-world scenarios the data will arrive as in the example above, without the type triples, and will therefore fail to validate.
What would be the correct way of documenting complex and nested shapes for validating RDF while at the same time "disabling" some checks at certain points in the graph?
The use case I'm thinking about is when I need to add extra info to already existing "shapes", using IRIs such as owl:NamedIndividual, constants in an ontology, or already existing entities like persons, companies, etc.

You mean that you insert data without rdf:type (a) declarations and the system adds those declarations later by some kind of reasoning process.
ShEx doesn't interfere with reasoning systems and doesn't treat rdf:type declarations in any special way. So there could be several approaches for that use case.
One approach is to add a question mark (optional cardinality) to the rdf:type constraint, as in:
:Promotion {
my-onto:has_person @:Person ;
my-onto:grants_role @:Role ;
}
:Person {
a [ foaf:Person ] ? ;
}
:Role {
a [ my-onto:CompanyRole ] ? ;
}
which says that a node conforming to :Person either has no rdf:type declaration at all or, if it has one, it must be the single value foaf:Person.
Another approach could be to have two shapes: one used before reasoning to check the input data, and another used after inserting the data to check that the insertion process behaved correctly (for example, an input shape for :Person without the rdf:type constraint and a stored shape that requires it).
Notice that it is possible to have different shapes for the same data that act at different points during the data processing pipeline.


How to get documents that contain sub-string in FaunaDB

I'm trying to retrieve all the tasks documents that have the string first in their name.
I currently have the following code, but it only works if I pass the exact name:
res, err := db.client.Query(
f.Map(
f.Paginate(f.MatchTerm(f.Index("tasks_by_name"), "My first task")),
f.Lambda("ref", f.Get(f.Var("ref"))),
),
)
I think I can use ContainsStr() somewhere, but I don't know how to use it in my query.
Also, is there a way to do it without using Filter()? I ask because it seems like it filters after the pagination, and it messes up the pages.
FaunaDB provides a lot of constructs; this makes it powerful, but it also means you have a lot to choose from. With great power comes a small learning curve :).
How to read the code samples
To be clear, I use the JavaScript flavor of FQL here and typically expose the FQL functions from the JavaScript driver as follows:
const faunadb = require('faunadb')
const q = faunadb.query
const {
Not,
Abort,
...
} = q
You do have to be careful when destructuring Map like that, since it will conflict with JavaScript's built-in Map. In that case, you could just use q.Map.
Option 1: using ContainsStr() & Filter
Basic usage according to the docs
ContainsStr('Fauna', 'a')
Of course, this works on a specific value, so to make it work here you need Filter, and Filter only works on paginated sets. That means that we first need to get a paginated set. One way to get a paginated set of documents is:
q.Map(
Paginate(Documents(Collection('tasks'))),
Lambda(['ref'], Get(Var('ref')))
)
But we can do that more efficiently, since one Get === one read and we don't need the documents; we'll be filtering a lot of them out. It's interesting to know that one index page is also just one read, so we can define an index as follows:
{
name: "tasks_name_and_ref",
unique: false,
serialized: true,
source: "tasks",
terms: [],
values: [
{
field: ["data", "name"]
},
{
field: ["ref"]
}
]
}
And since we added name and ref to the values, the index will return pages of name and ref which we can then use to filter. We could, for example, map over that index page, and this will return us an array of booleans.
Map(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(Var('name'), 'first'))
)
Since Filter also works on arrays, we can simply replace Map with Filter. We'll also add LowerCase to ignore casing, and we have what we need:
Filter(
Paginate(Match(Index('tasks_name_and_ref'))),
Lambda(['name', 'ref'], ContainsStr(LowerCase(Var('name')), 'first'))
)
In my case, the result is:
{
"data": [
[
"Firstly, we'll have to go and refactor this!",
Ref(Collection("tasks"), "267120709035098631")
],
[
"go to a big rock-concert abroad, but let's not dive in headfirst",
Ref(Collection("tasks"), "267120846106001926")
],
[
"The first thing to do is dance!",
Ref(Collection("tasks"), "267120677201379847")
]
]
}
Filter and reduced page sizes
As you mentioned, this is not exactly what you want, since it also means that if you request pages of size 500, some elements might be filtered out and you might end up with a page of size 3, then one of size 7. You might think: why can't I just get my filtered elements in pages? Well, there is a good reason for that, and it's performance: Filter basically checks each value. Imagine you have a massive collection and filter out 99.99 percent; you might have to loop over a huge number of elements to get to 500, and all of those loops cost reads. We want pricing to be predictable :).
Option 2: indexes!
Each time you want to do something more efficient, the answer lies in indexes. FaunaDB provides you with the raw power to implement different search strategies but you'll have to be a bit creative and I'm here to help you with that :).
Bindings
With index bindings, you can transform the attributes of your document. In our first attempt we will split the string into words (I'll implement multiple approaches since I'm not entirely sure which kind of matching you want).
We do not have a string split function, but since FQL is easily extended, we can write one ourselves and bind it to a variable in our host language (in this case JavaScript), or use one from this community-driven library: https://github.com/shiftx/faunadb-fql-lib
function StringSplit(string: ExprArg, delimiter = " "){
return If(
Not(IsString(string)),
Abort("SplitString only accept strings"),
q.Map(
FindStrRegex(string, Concat(["[^\\", delimiter, "]+"])),
Lambda("res", LowerCase(Select(["data"], Var("res"))))
)
)
}
And use it in our binding.
CreateIndex({
name: 'tasks_by_words',
source: [
{
collection: Collection('tasks'),
fields: {
words: Query(Lambda('task', StringSplit(Select(['data', 'name'], Var('task')))))
}
}
],
terms: [
{
binding: 'words'
}
]
})
Hint: if you are not sure whether you have got it right, you can always put the binding in values instead of terms, and then you'll see in the Fauna dashboard whether your index actually contains values.
What did we do? We just wrote a binding that will transform the value into an array of values at the time a document is written. When you index an array field of a document in FaunaDB, these values are indexed separately yet all point to the same document, which will be very useful for our search implementation.
We can now find tasks that contain the string 'first' as one of their words by using the following query:
q.Map(
Paginate(Match(Index('tasks_by_words'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Which will give me the document with name:
"The first thing to do is dance!"
The other two documents didn't contain the exact word 'first', so how do we match those as well?
Option 3: indexes and Ngram (exact contains matching)
To make exact contains matching efficient, you need to use a (still undocumented, since we'll make it easier in the future) function called NGram. Dividing a string into ngrams is a search technique that is often used under the hood in other search engines. In FaunaDB we can easily apply it thanks to the power of indexes and bindings. The Fwitter example has an example in its source code that does autocompletion. That example won't work for your use case, but I reference it for other users since it's meant for autocompleting short strings, not for searching a short string within a longer string such as a task.
We'll adapt it for your use case, though. When it comes to searching, it's all a tradeoff between performance and storage, and in FaunaDB users can choose their own tradeoff. Note that in the previous approach we stored each word separately; with ngrams we'll split words up even further to provide some form of fuzzy matching. The downside is that the index size might become very big if you make the wrong choice (this is equally true for search engines, hence why they let you define different algorithms).
What NGram essentially does is get substrings of a string of a certain length.
For example:
NGram('lalala', 3, 3)
will return the trigrams of the string: 'lal', 'ala', 'lal' and 'ala'.
If we know that we won't be searching for strings longer than a certain length, let's say length 10 (it's a tradeoff: increasing the size will increase the storage requirements but allow you to query for longer strings), you can write the following ngram generator.
function GenerateNgrams(Phrase) {
return Distinct(
Union(
Let(
{
// Reduce this array if you want fewer ngrams per word.
indexes: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
indexesFiltered: Filter(
Var('indexes'),
// filter out 0 (an ngram length must be at least 1)
Lambda('l', GT(Var('l'), 0))
),
ngramsArray: q.Map(Var('indexesFiltered'), Lambda('l', NGram(LowerCase(Phrase), Var('l'), Var('l'))))
},
Var('ngramsArray')
)
)
)
}
You can then write your index as follows:
CreateIndex({
name: 'tasks_by_ngrams_exact',
// we actually want to sort to get the shortest word that matches first
source: [
{
// If your collections have the same property that you want to access, you can pass a list of collections here
collection: [Collection('tasks')],
fields: {
wordparts: Query(Lambda('task', GenerateNgrams(Select(['data', 'name'], Var('task')))))
}
}
],
terms: [
{
binding: 'wordparts'
}
]
})
And you have an index-backed search where your pages are the size you requested.
q.Map(
Paginate(Match(Index('tasks_by_ngrams_exact'), 'first')),
Lambda('ref', Get(Var('ref')))
)
Option 4: indexes and Ngrams of size 3 or trigrams (Fuzzy matching)
If you want fuzzy searching, trigrams are often used. In this case our index will be easy, so we're not going to use an external function.
CreateIndex({
name: 'tasks_by_ngrams',
source: {
collection: Collection('tasks'),
fields: {
ngrams: Query(Lambda('task', Distinct(NGram(LowerCase(Select(['data', 'name'], Var('task'))), 3, 3))))
}
},
terms: [
{
binding: 'ngrams'
}
]
})
If we were to place the binding in values again to see what comes out, we would see the trigrams generated for each task name.
In this approach, we use trigrams on both the indexing side and the querying side. On the querying side, that means that the word 'first' which we search for will also be divided into trigrams: 'fir', 'irs' and 'rst'.
For example, we can now do a fuzzy search as follows:
q.Map(
Paginate(Union(q.Map(NGram('first', 3, 3), Lambda('ngram', Match(Index('tasks_by_ngrams'), Var('ngram')))))),
Lambda('ref', Get(Var('ref')))
)
In this case, we actually do three searches: we search for each of the trigrams and union the results, which returns all sentences that contain 'first'.
But if we had misspelled it and written 'frst', we would still match all three, since there is a trigram ('rst') that matches.

How do you allow not foreseen properties in RDF when performing Shex validation?

We are creating our ShEx definition files to check that some IRIs are of a given type. There is no problem with our generated code, but sometimes we get files generated using Protégé where most of the individuals are of type X plus owl:NamedIndividual, making our validation fail because a given resource now has two rdf:type assertions.
Adding owl:NamedIndividual to all shape checks seems like polluting the shape definition, so how would you allow extra property values that do not conflict with your shape definition?
In ShEx, triple constraints are closed by default, which means that a shape like:
:Shape {
rdf:type [ :X ]
}
means that a node that conforms to :Shape must have exactly one rdf:type declaration whose value is :X.
If you want to allow extra values for the rdf:type declaration, you can express it with the keyword EXTRA as:
:Shape EXTRA rdf:type {
rdf:type [ :X ]
}
The meaning now is that a conforming node must have rdf:type :X and can additionally have zero or more other values for rdf:type.
Notice that the previous example could be defined as:
:Shape EXTRA a {
a [ :X ]
}
In the particular case that you only want to allow an extra rdf:type with value owl:NamedIndividual you could also define it as:
:Shape {
a [:X ] ;
a [ owl:NamedIndividual] ;
}
or as:
:Shape {
a [:X owl:NamedIndividual]{2} ;
}

Verifying transformed string was actually changed (according to mapping table)

I have a mapping table, M:
And using this, I've performed a find & replace on string S which gives me the transformed string S':
S: {"z" "y" "g" "k"} -> S':{"z" "y" "h" "k"}
Now I wish to verify whether my mapping transformation was actually applied to S'. The pseudo-code I came up with for doing so is as follows:
I. Call searchCol(x, “h”); // returns true if “h” can be found in column x of M.
II. If searchCol(x, “h”) returns true {
// assume mapping transformation was not applied to S'
// S'' after transforming S': {“z”, “y”, “i”, “j”}
}
III. If searchCol(x, “h”) returns false {
// assume mapping transformation was already applied to S'
// do nothing
}
IV. // log and continue …
However, as you can see, for the case above the algorithm doesn't work. Does anyone know a better way of going about this?
Cheers for your help.
Note: As my codebase is in Java, if you do provide any code examples, I'd prefer it if you posted them in the same language :)
Can you instead keep track of transformations? There are some cases where it's impossible to determine whether a transformation took place; imagine this mapping table:
x -> y
y -> x
Now given the String yxyxyxyx, was it already transformed? And how many times?
But even if your mapping table is free of cycles, the only thing you can say is:
If the string contains a char that appears on the left side of the mapping and not on the right side, then it was not yet transformed.
But if the above condition is not fulfilled, then you cannot be sure of anything.
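As a rough Java sketch of that one-directional check, assuming the mapping is held as a Map<Character, Character> (the full table M from the question is not shown; the single entry g -> h comes from the question's example, and definitelyNotTransformed is an invented helper name):

import java.util.Map;
import java.util.Set;

public class MappingCheck {

    // Returns true only if we can be sure the string was NOT yet transformed,
    // i.e. it still contains a character that appears as a key (left side)
    // of the mapping but never as a value (right side).
    // A false result tells us nothing either way.
    static boolean definitelyNotTransformed(String s, Map<Character, Character> mapping) {
        Set<Character> rightSide = Set.copyOf(mapping.values());
        for (char c : s.toCharArray()) {
            if (mapping.containsKey(c) && !rightSide.contains(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Character, Character> m = Map.of('g', 'h'); // known fragment of M
        System.out.println(definitelyNotTransformed("zygk", m)); // true: 'g' occurs only on the left side
        System.out.println(definitelyNotTransformed("zyhk", m)); // false: nothing can be concluded
    }
}

As the answer says, only the true case is conclusive; whenever this returns false, explicitly keeping track of which strings have already been transformed remains the safer option.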

Description Logics and Ontologies: How to denote role domain-restrictions to blank nodes

Request for assistance denoting a domain-restriction to a blank node.
Figure 1: Modelling a many-to-many relationship with a blank node.
Business Rule: An Enrolment maps one Student to one Section, once.
My attempt:
∃hasStudent.⊤ ≡ ∃hasSection.⊤ ≡ ∃grade_code.⊤
i.e. "the set of individuals that have some value for the role 'hasStudent' is the same set of individuals that have some value for the role 'hasSection' ...e.t.c."
I assume equivalence here instead of inclusion since the inclusions would be in both directions.
Restricting further:
∃hasStudent.⊤ ≡ ∃hasSection.⊤ ≡ ∃grade_code.⊤ ≡ =1hasStudent.⊤ ≡ =1hasSection.⊤ ≡ =1grade_code.⊤
i.e. "the set of individuals that have values for the roles 'hasStudent', 'hasSection' and 'grade_code', have one and only one value for them."
Assistance or comments on correctly denoting the domain-restrictions of the object properties in figure 1 would be appreciated.
Thanks!!
OWL's Open World assumption is going to prevent you from finding "the set of individuals that have values for the roles 'hasStudent', 'hasSection' and 'grade_code', have one and only one value for them."
However, using SPARQL, you could create an ASK query that does just what you are asking for:
ASK {
SELECT (count(?student) AS ?stcount) (count(?section) AS ?secount) (count(?course) AS ?ccount)
WHERE {
?indiv :hasStudent ?student .
?indiv :hasSection ?section .
?indiv :grade_code ?course .
} GROUP BY ?student ?section ?course
HAVING (?stcount = 1 && ?secount = 1 && ?ccount = 1)
}
A bit awkward syntactically, since the aggregates need to be computed by a nested SELECT. The ASK will return true if the 'constraints' (see the HAVING clause) all hold, and false otherwise.
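If the data happens to live in an Apache Jena model, a minimal sketch of running such an ASK query could look like the following (the file name, prefix and exact property IRIs are placeholders rather than anything from the question):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

public class EnrolmentConstraintCheck {
    public static void main(String[] args) {
        // Load the enrolment data from a Turtle file (hypothetical file name).
        Model model = RDFDataMgr.loadModel("enrolments.ttl");

        // The ASK query from above, with a placeholder prefix.
        String ask =
            "PREFIX : <http://example.org/school#> " +
            "ASK { SELECT (count(?student) AS ?stcount) (count(?section) AS ?secount) (count(?course) AS ?ccount) " +
            "WHERE { ?indiv :hasStudent ?student . ?indiv :hasSection ?section . ?indiv :grade_code ?course . } " +
            "GROUP BY ?student ?section ?course " +
            "HAVING (?stcount = 1 && ?secount = 1 && ?ccount = 1) }";

        QueryExecution qe = QueryExecutionFactory.create(ask, model);
        try {
            System.out.println("Constraints satisfied: " + qe.execAsk());
        } finally {
            qe.close();
        }
    }
}

execAsk() returns the boolean directly, so the result can be fed straight into whatever validation step comes next.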
For future reference, the SHACL (Shapes Constraint Language) work at W3C is intended to address these kinds of constraint-checking problems, which are not possible to answer with OWL.
If I understand your intent correctly, you want these restrictions to apply to any use of these properties rather than only for a specific class.
Under this assumption, you can achieve this by declaring the properties functional and setting their domain to C. In Functional syntax:
Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
Prefix(rdf:=<http://www.w3.org/1999/02/22-rdf-syntax-ns#>)
Prefix(xml:=<http://www.w3.org/XML/1998/namespace>)
Prefix(xsd:=<http://www.w3.org/2001/XMLSchema#>)
Prefix(rdfs:=<http://www.w3.org/2000/01/rdf-schema#>)
Ontology(
Declaration(Class(<urn:test:C>))
Declaration(ObjectProperty(<urn:test:hasSection>))
Declaration(ObjectProperty(<urn:test:hasStudent>))
Declaration(DataProperty(<urn:test:grade_code>))
FunctionalObjectProperty(<urn:test:hasSection>)
ObjectPropertyDomain(<urn:test:hasSection> <urn:test:C>)
FunctionalObjectProperty(<urn:test:hasStudent>)
ObjectPropertyDomain(<urn:test:hasStudent> <urn:test:C>)
FunctionalDataProperty(<urn:test:grade_code>)
DataPropertyDomain(<urn:test:grade_code> <urn:test:C>)
SubClassOf(<urn:test:C> ObjectIntersectionOf(ObjectSomeValuesFrom(<urn:test:hasSection> owl:Thing) ObjectSomeValuesFrom(<urn:test:hasStudent> owl:Thing) DataSomeValuesFrom(<urn:test:grade_code> rdfs:Literal)))
)

how to identify the minimal set of parameters describing a data set

I have a bunch of regression test data. Each test is just a list of messages (associative arrays), mapping message field names to values. There's a lot of repetition within this data.
For example
test1 = [
{ sender => 'client', msg => '123', arg => '900', foo => 'bar', ... },
{ sender => 'server', msg => '456', arg => '800', foo => 'bar', ... },
{ sender => 'client', msg => '789', arg => '900', foo => 'bar', ... },
]
I would like to represent the field data (as a minimal-depth decision tree?) so that each message can be programmatically regenerated using a minimal number of parameters. For example, in the above
foo is always 'bar', so I don't need to mention it
sender and arg are correlated, so I only need to mention one or the other
and msg is different each time
So I would like to be able to regenerate these messages with a program along the lines of
write_msg( 'client', '123' )
write_msg( 'server', '456' )
write_msg( 'client', '789' )
where the write_msg function would be composed of nested if statements or subfunction calls using the parameters.
Based on my original data, how can I determine the 'most important' set of parameters, i.e. the ones that will let me recreate my data set using the smallest number of arguments?
The following papers describe algorithms for discovering functional dependencies:
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal, 42(2):100–111, 1999. doi:10.1093/comjnl/42.2.100
I. Savnik and P. A. Flach. Bottom-up induction of functional dependencies from relations. In Proc. AAAI-93 Workshop: Knowledge Discovery in Databases, pages 174–185, Washington, DC, USA, 1993.
C. Wyss, C. Giannella, and E. Robertson. FastFDs: A Heuristic-Driven, Depth-First Algorithm for Mining Functional Dependencies from Relation Instances. In Proc. Data Warehousing and Knowledge Discovery, pages 101–110, Munich, Germany, 2001. doi:10.1007/3-540-44801-2
Hong Yao and Howard J. Hamilton. Mining functional dependencies from data. Data Mining and Knowledge Discovery, 2008. doi:10.1007/s10618-007-0083-9
There has also been some work on discovering multivalued dependencies:
I. Savnik and P. A. Flach. Discovery of Multivalued Dependencies from Relations. Intelligent Data Analysis Journal, 4(3):195–211, IOS Press, 2000.
This looks very similar to Database Normalization.
You have a relation (your test data set) and some known functional dependencies ({sender} => arg, {} => foo, and possibly {msg} => sender; if the order of tests is important, then add {testNr} => msg), and you want to eliminate redundancies.
Treat your test set as a database table, apply the normalization rules and create equivalent functions (getArgFromSender(sender) etc.) for each join.
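To make that concrete, here is a hedged Java sketch of what such regenerated messages could look like for the example data (the class and helper names are invented; the lookup values come from the test data in the question):

import java.util.Map;

public class MessageWriter {

    // One small lookup table per discovered dependency, e.g. {sender} => arg.
    private static final Map<String, String> ARG_BY_SENDER =
            Map.of("client", "900", "server", "800");

    // Regenerates a full message from the minimal parameters (sender, msg).
    static Map<String, String> writeMsg(String sender, String msg) {
        return Map.of(
                "sender", sender,
                "msg", msg,
                "arg", ARG_BY_SENDER.get(sender), // {sender} => arg
                "foo", "bar"                      // {} => foo (a constant)
        );
    }

    public static void main(String[] args) {
        System.out.println(writeMsg("client", "123"));
        System.out.println(writeMsg("server", "456"));
        System.out.println(writeMsg("client", "789"));
    }
}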
If the number of fields and records is small:
Brute force it by looping through every combination of fields, and for each combination detect whether there are multiple items in the list that have the same values for those fields but differ elsewhere.
If you can live with a fairly good choice of fields:
Start off assuming you need all fields. Then, select a field at random and see if it can be eliminated; if it can, cross it off the set of fields. Otherwise, choose another field at random and try again. If you find no fields can be eliminated, then you've found a reasonable set of fields. Had you chosen other fields first, you might have found a better solution. You can repeat the whole procedure a few times and pick the best solution if you like. This kind of approach is called hill climbing (see the sketch below).
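As a sketch of that idea, assuming each message is a Map<String, String> like the test data in the question (determinesAll and reduceFields are invented names; determinesAll is also the check the brute-force approach above would run for each combination):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FieldSelector {

    // True if the values of 'fields' uniquely determine the whole message.
    static boolean determinesAll(List<Map<String, String>> messages, Set<String> fields) {
        Map<List<String>, Map<String, String>> seen = new HashMap<>();
        for (Map<String, String> msg : messages) {
            List<String> key = new ArrayList<>();
            for (String f : fields) {
                key.add(msg.get(f));
            }
            Map<String, String> previous = seen.putIfAbsent(key, msg);
            if (previous != null && !previous.equals(msg)) {
                return false; // same key fields, different message: these fields are not enough
            }
        }
        return true;
    }

    // Hill climbing: start with all fields and drop them one by one
    // as long as the remaining fields still determine every message.
    static Set<String> reduceFields(List<Map<String, String>> messages) {
        Set<String> fields = new LinkedHashSet<>(messages.get(0).keySet()); // assumes all messages share the same fields
        List<String> order = new ArrayList<>(fields);
        Collections.shuffle(order); // a different elimination order may give a different (possibly better) result
        for (String candidate : order) {
            fields.remove(candidate);
            if (!determinesAll(messages, fields)) {
                fields.add(candidate); // cannot drop it, put it back
            }
        }
        return fields;
    }
}

Running reduceFields a few times and keeping the smallest result is the "repeat the whole procedure and pick the best solution" step described above.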
(I suspect that this problem is NP-complete, i.e. we probably don't know of an efficient exact solution, so it is not worth losing sleep over trying to dream up a perfect one.)
