SPARQL Query Computational Complexity

I have a list of SPARQL queries with various patterns (e.g., select, union, join). I want to calculate their time complexity in big O notation (e.g., O(n), O(n log n)). Please let me know how to do that. I have more than 3,000,000,000 triples in my RDF graph.
The following are some example queries.
Query 1:
select ?o where { <http://example.com/person_info/242622027> vocab:info_gender ?o }
Query 2:
select ?o ?k where {
  {
    ?s vocab:person_info_pid '242622027'^^xsd:decimal .
    ?s vocab:person_info_homeloc ?o
  }
  UNION
  {
    ?i vocab:activities_pid '242622027'^^xsd:decimal .
    ?i vocab:activities_purpose ?k
  }
}
Query 3:
select (count(*) as ?no) where {
  ?s vocab:outputparttwo_iteration '0'^^xsd:decimal
}

Evaluating SPARQL is PSPACE-complete in general, so for any given query you can probably only come up with a best-case complexity. The real-world complexity will depend on the implementation of the database (its indexes and join algorithms) to some degree.
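That said, for simple queries like the ones above you can estimate per-pattern costs. A rough sketch, assuming a typical triple store with B-tree-style SPO/POS/OSP indexes (an assumption about the engine, not something SPARQL itself guarantees):

# Query 1: subject and predicate are bound -> one range lookup in an
# SPO index: roughly O(log n) to locate the range (n = total triples),
# plus O(k) to stream the k matching objects.
select ?o where { <http://example.com/person_info/242622027> vocab:info_gender ?o }

# Query 2: each UNION branch joins two patterns on one variable. With
# an index nested-loop join that is roughly O(k log n), where k is the
# number of bindings produced by the more selective pattern; the UNION
# simply adds the costs of the two branches.

# Query 3: predicate and object are bound -> one POS index range scan,
# O(log n + k), with COUNT(*) adding O(k) over the k matches.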

Related

How to retrieve a GitLab GraphQL's query complexity?

Is there any way to retrieve what the query complexity is for a GitLab GraphQL query?
As a comparison, GitHub's GraphQL API has a rateLimit object that returns the "cost" of a query (https://docs.github.com/en/graphql/overview/resource-limitations). Does GitLab have anything similar?
If this capability does not exist, how can one compute the complexity of a query?
Per https://docs.gitlab.com/ee/api/graphql/index.html#max-query-complexity, there is no way to discover the complexity of a query except by exceeding the limit; if a query exceeds the complexity limit, an error response is returned.
In general, each field in a query adds 1 to the complexity score, although this can be higher or lower for particular fields. Sometimes the addition of certain arguments may also increase the complexity of a query.
Not sure when this was implemented, but you can now query the complexity and limit directly, as described in the docs and reference: https://docs.gitlab.com/ee/api/graphql/reference/#queryquerycomplexity
Example query:
{
  queryComplexity {
    limit
    score
  }
}
Example response:
{
  "data": {
    "queryComplexity": {
      "limit": 300,
      "score": 3
    }
  }
}

How to add instances from DBpedia to an ontology

I am trying to add all countries from DBpedia to my ontology. However, it says that no countries were added. I am using GraphDB for this. I saw another post here that had the format I should use, but I still couldn't make it work. Can someone help me? Here is my query:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://www.dbpedia.org/ontology>
PREFIX dbr: <http://www.dbpedia.org/resource>
INSERT { ?s ?p ?o }
WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?s rdf:type dbo:Country .
    ?s ?p ?o .
  }
}
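A likely culprit is the prefix declarations: the standard DBpedia namespaces have no www and end with a slash, so dbo:Country as declared above expands to the non-existent IRI <http://www.dbpedia.org/ontologyCountry>, which matches nothing. A sketch of the corrected query, assuming the standard DBpedia namespaces (same structure, only the prefixes changed):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
INSERT { ?s ?p ?o }
WHERE {
  SERVICE <http://dbpedia.org/sparql> {
    ?s rdf:type dbo:Country .
    ?s ?p ?o .
  }
}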

Distinct count on Hive does not match cardinality count on Elasticsearch

I have loaded data into my Elasticsearch cluster from Hive using the elasticsearch-hadoop plugin from Elastic.
I need to fetch a count of unique account numbers. I have the following queries written in both HQL and Query DSL, but they return different counts.
Hive Query:
select count(distinct account) from <tableName> where capacity="550";
// Returns --> 71132
Similarly, in Elasticsearch the query looks like this:
{
  "query": {
    "bool": {
      "must": [
        { "match": { "capacity": "550" } }
      ]
    }
  },
  "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account"
      }
    }
  }
}
// Returns --> 71607
Am I doing something wrong? What can I do to make the two counts match?
Note: there are exactly the same number of records in Hive and Elasticsearch.
"the first approximate aggregation provided by Elasticsearch is
the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an
approximate algorithm. It is based on the HyperLogLog++ (HLL)
algorithm."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
For the OP, on precision_threshold:
"precision_threshold accepts a number from 0–40,000. Larger values are treated as equivalent to 40,000. ... Although not guaranteed by the algorithm, if a cardinality is under the threshold, it is almost always 100% accurate. Cardinalities above this will begin to trade accuracy for memory savings, and a little error will creep into the metric."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
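Applied to the query above, the threshold is set directly on the cardinality aggregation (precision_threshold is a standard parameter of this aggregation; 40,000 is its maximum effective value):

{
  "query": {
    "bool": {
      "must": [
        { "match": { "capacity": "550" } }
      ]
    }
  },
  "aggs": {
    "unique_account": {
      "cardinality": {
        "field": "account",
        "precision_threshold": 40000
      }
    }
  }
}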
You might also want to take a look at "Support for precise cardinality aggregation #15876"
For the OP, regarding "I have tried several numbers...": you have 71,132 distinct values while the precision threshold limit is 40,000. The cardinality is therefore over the threshold, which means accuracy is traded for memory savings. This is how the chosen implementation (based on the HyperLogLog++ algorithm) works.
The cardinality aggregation does not guarantee an accurate count even with a precision_threshold of 40,000. There is another way to get an accurate distinct count of a field: the article "Accurate Distinct Count and Values from Elasticsearch" explains the solution in detail, as well as its accuracy compared to cardinality.
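One exact approach (a sketch, not necessarily the method from the article linked above) is to page through every unique value with a composite aggregation and count the buckets client-side, assuming your Elasticsearch version supports composite aggregations:

{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "capacity": "550" } }
      ]
    }
  },
  "aggs": {
    "accounts": {
      "composite": {
        "size": 10000,
        "sources": [
          { "account": { "terms": { "field": "account" } } }
        ]
      }
    }
  }
}

Each response includes an after_key; feed it back as the after parameter and keep a running total of buckets until none are returned. This is exact, but costs one round trip per 10,000 unique values.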

Search multiple phrases/terms with weights in Elasticsearch

I am new to Elasticsearch. I am familiar with basic searching, but now I want to search multiple terms in a single request, i.e.
I have five search terms 'first', 'second', 'third', 'four', 'five', and each term has a weight assigned to it. Rather than making one call per term, I want a single query that accepts these terms along with their weights and returns results ranked according to the weights.
It should look something like this (this is not Elasticsearch syntax):
search
{
  terms: [(first, 3), (second, 1), (third, 4), (four, 2), (five, 5)],
  fields: [field1, field2, field3...]
}
Thanks in anticipation.
The query string query supports boosting in the following form:
quick^2 fox
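Applied to your example (the field names field1, field2, field3 are taken from your pseudocode), a query_string query with per-term boosts might look like this:

{
  "query": {
    "query_string": {
      "fields": ["field1", "field2", "field3"],
      "query": "first^3 second^1 third^4 four^2 five^5"
    }
  }
}

The default operator is OR, so a document matching any of the terms is returned, and documents matching the higher-boosted terms score higher.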

Elasticsearch: How to search, sort, limit the results then sort again?

This isn't about multi-level sorting.
I need my results first selected by distance, limited to 50, then those 50 sorted by price.
select *
from
(
  select top 50 * from mytable order by distance asc
) as nearest
order by price asc
Essentially, the second sort throws away the ordering of the inner sort, but the inner sort is what narrows the results down to the top 50.
The other answers I've seen for this kind of question address multi-level sorting, which is not what I'm after.
BTW: I've looked at aggregations (top N results), but I'm not sure I can apply a sort to the aggregation result. I've also looked at rescore, but I don't know where to put my 'sorts'.
A top hits aggregation will let you sort on a separate field, in your case price, independently of the main query sort (on distance). See the documentation here for how to specify sorting in the top hits agg.
It'll look a little like this (which assumes distance is a double type; if it's a geo-location type, use the documentation provided by Volodymyr Bilyachat):
{
  "sort": [
    { "distance": "asc" }
  ],
  "query": {
    "match_all": {}
  },
  "size": 50,
  "aggs": {
    "top_price_hits": {
      "top_hits": {
        "sort": [
          { "price": { "order": "asc" } }
        ],
        "size": 50
      }
    }
  }
}
However, if you only want those 50 results from your primary query, why not just sort them client-side in your application? That would be the more robust approach, since using a top hits aggregation for a secondary sort is a slight abuse of its purpose.
+1'ed the accepted answer, but I wanted to make sure you were aware of how search scoring can often deliver a better user experience than traditional sorting.
Based on your current strategy, one could say:
Distance is important, relatively speaking (e.g. top 50 closest) but not in absolute terms (e.g. must be within 50mi).
You only want to show 50 results.
You want those results to be sorted by price (or perhaps alphabetically).
However, if you find yourself trying to generalize about which result a searcher is most likely to choose, you may discover a function of price and distance (or other features) which better models the real-world likelihood of a searcher choosing a particular result.
E.g. say you discover that:
Users will pay more for the convenience of a nearby result.
Users will travel greater distances for greater discounts.
Then you could model a scoring function that generates a result ordering based on this relationship, e.g. 1/price + 1/distance, which produces a higher score as either price or distance decreases. This can be generalized to P * (1/price) + 1/distance, where P is a tuning coefficient expressing the relative importance of price vs. distance.
Armed with this model, you could then write a function score query that returns results ordered by the optimal combination of price and distance for your users.
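A minimal sketch of such a function score query, assuming numeric price and distance fields (the field names, the P value, and the +1 guards against division by zero are illustrative, not prescriptive):

{
  "size": 50,
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "script_score": {
        "script": {
          "source": "params.P * (1.0 / (doc['price'].value + 1)) + (1.0 / (doc['distance'].value + 1))",
          "params": { "P": 2.0 }
        }
      },
      "boost_mode": "replace"
    }
  }
}

With boost_mode set to replace, the script's output alone determines the ranking, so tuning P shifts results between cheap-but-far and close-but-expensive.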
As I see it, it would be better to select the top 50 using the size: 50 property in the query, ordering by distance, and then sort the results in your application by price.
