I am using JMeter to evaluate the performance of several SPARQL queries. I write the SPARQL queries in an "HTTP Request" sampler. However, this does not work, because JMeter does not give me the actual response time of the queries: a more complex query shows the same response time as a simpler one, or even a lower one. Maybe I have to use other JMeter options.
Does anybody have experience with this?
You may be interested in the SPARQL Query Benchmarker tool (disclaimer: I am the primary author of that tool), which is specifically designed for this purpose.
It distinguishes between response time, the time at which the first result is received from the remote system, and runtime, the time taken to receive all results. The Benchmarking Methodology page describes the difference between these metrics in more detail, along with the other metrics the tool calculates.
Many systems execute queries and generate results in a streaming fashion, so for complex queries you will typically see noticeable differences between these two figures. Of course, systems vary, so not all of them will show the same variation between the two metrics.
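The distinction between the two metrics can be sketched in a few lines. This is only an illustration, not how the benchmarker tool is implemented: `measure_streaming` and `fake_streaming_query` are hypothetical names, and the generator merely simulates an endpoint that streams results one at a time.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_streaming(run_query: Callable[[], Iterable]) -> Tuple[float, float, int]:
    """Return (response_time, runtime, result_count) for one query run, in seconds.

    response_time: time until the first result arrives.
    runtime: time until the last result has been consumed.
    """
    start = time.perf_counter()
    first_result_at = None
    count = 0
    for _ in run_query():
        if first_result_at is None:
            first_result_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    # With an empty result set there is no first result, so both metrics coincide.
    response_time = (first_result_at if first_result_at is not None else end) - start
    return response_time, end - start, count


def fake_streaming_query(rows: int = 5, delay: float = 0.01) -> Iterator[int]:
    """Hypothetical stand-in for a streaming SPARQL result set."""
    for i in range(rows):
        time.sleep(delay)  # simulate the server producing one result at a time
        yield i


response_time, runtime, count = measure_streaming(fake_streaming_query)
print(f"first result after {response_time:.3f}s, all {count} results after {runtime:.3f}s")
```

A tool that only records the time to the first byte reports something close to `response_time` here, which may be why a complex query can appear no slower than a simple one.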
I am a student who has recently started learning Spring and JPA. While developing a GET API with conditions, I began to wonder which approach is better in terms of performance.
When data has to be queried based on conditions, JPQL or Querydsl is usually used to generate dynamic queries. Can you tell me why generating a dynamic query like this and looking up only the necessary data is better than using the Java stream filter() function after looking up the entire data?
Also, can you tell me why generating fewer queries is advantageous in terms of performance?
I know that generating fewer queries has a performance advantage, but I don't understand why that is the case.
Can you tell me why generating a dynamic query like this and looking up only the necessary data is better than using the java stream filter() function after looking up the entire data?
In general, accessing the database or any other external storage is much more expensive than most operations on the Java side because of network latency. If you query all the data and then use e.g. list.stream().filter(), a significant amount of data is transferred over the network. If you instead query only the data already filtered on the DB side, the amount transferred is lower.
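A minimal sketch of the difference, written in Python with an in-memory SQLite database rather than JPA. The `orders` table and its contents are made up for illustration, and an in-memory database has no network hop, so this only shows how many rows each approach has to move, not the latency itself.

```python
import sqlite3

# Hypothetical table: 10,000 orders, of which 1 in 100 is still open.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("OPEN" if i % 100 == 0 else "CLOSED",) for i in range(10_000)],
)

# Client-side filtering: every row crosses the connection, then most are discarded.
all_rows = conn.execute("SELECT id, status FROM orders").fetchall()
client_filtered = [row for row in all_rows if row[1] == "OPEN"]

# Database-side filtering: only the matching rows are returned to the client.
db_filtered = conn.execute(
    "SELECT id, status FROM orders WHERE status = ?", ("OPEN",)
).fetchall()

print(f"rows transferred, client-side filter: {len(all_rows)}")
print(f"rows transferred, database-side filter: {len(db_filtered)}")
```

Both approaches end up with the same rows; the difference is how much data had to be read and shipped to get there.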
Note that while this is true in general, there may be cases where filtering on the Java side is more efficient. This depends heavily on several things:
query complexity
amount of data
database structure (schema, indices, column types etc.)
As for the number of queries, the same considerations apply: each query has execution costs and data transfer costs, so the fewer queries you have, the better. Again, this is not an axiom: in some cases, multiple lightweight queries with grouping/filtering on the Java side might be faster than one huge and complicated SQL query.
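To make the query-count point concrete, here is a small Python/SQLite sketch. The `author` and `book` tables and both loader functions are hypothetical; the sketch loads the same data once with one query per parent row and once with a single join, and counts the round trips.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE book (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    """
)
conn.executemany(
    "INSERT INTO author (id, name) VALUES (?, ?)",
    [(i, f"author-{i}") for i in range(1, 51)],
)
conn.executemany(
    "INSERT INTO book (author_id, title) VALUES (?, ?)",
    [(a, f"book-{a}-{n}") for a in range(1, 51) for n in range(3)],
)


def load_with_many_queries(conn):
    """One query for the authors, then one more query per author."""
    queries = 1
    result = {}
    for author_id, name in conn.execute("SELECT id, name FROM author").fetchall():
        rows = conn.execute(
            "SELECT title FROM book WHERE author_id = ? ORDER BY id", (author_id,)
        ).fetchall()
        queries += 1
        result[name] = [title for (title,) in rows]
    return result, queries


def load_with_one_query(conn):
    """A single join; the grouping happens on the client side."""
    result = {}
    rows = conn.execute(
        "SELECT a.name, b.title FROM author a "
        "JOIN book b ON b.author_id = a.id ORDER BY a.id, b.id"
    )
    for name, title in rows:
        result.setdefault(name, []).append(title)
    return result, 1


many_result, many_count = load_with_many_queries(conn)
one_result, one_count = load_with_one_query(conn)
print(f"per-author queries: {many_count}, single join: {one_count}")
```

Against a remote database, each of those extra queries would pay the network round trip again, which is where most of the cost tends to come from.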
How do I analyze JMeter reports? Where is the best tutorial available?
If you have already executed your test and generated the HTML Reporting Dashboard, you should see some numbers in the statistics table.
These are so-called KPIs; see the JMeter Glossary for a description of what the metrics mean.
The next step would be detecting KPI correlations, for example the relationship between an increasing number of users and increasing response time or error rate.
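One simple way to quantify such a relationship is a Pearson correlation coefficient. The sketch below is only an illustration: the `pearson` helper is hand-written, and the user counts and response times are invented numbers, not output from any real test run.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-step averages, e.g. read off the dashboard's over-time graphs.
active_users = [10, 20, 30, 40, 50]
avg_response_ms = [120, 130, 150, 210, 400]

r = pearson(active_users, avg_response_ms)
print(f"correlation between users and response time: {r:.2f}")
```

A value close to 1 suggests response time grows with load. The shape of the curve matters as well, since a sharp bend usually marks the saturation point.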
And the final step would be creating a report which normally consists of two parts:
Manager-friendly summary stating whether the application meets its NFRs, what the saturation point is, what the first bottleneck is, what the possible risks are, etc.
Technical details with in-depth analysis of performance degradations including logs, slow database queries, profiler tool output interpretation, etc.
I would like to retrieve information about what exact terms match the search query.
I found out that this problem was discussed in the following topic: https://github.com/elastic/elasticsearch/issues/17045
but it was not resolved, "since it would be too cumbersome and expensive to keep this information around" (in the Elasticsearch context).
Then I discovered that by using the "explain" option in a search request, I get detailed information about the score calculation, including the matching terms.
I ran a kind of performance test comparing search requests with the explain option set to true against requests without it. The test does not show a significant impact from using the explain option.
So I'm wondering whether this option can be used in a production system. It looks like a workaround, but it seems to work.
Any considerations about this?
First of all, you didn't include the details of your performance test, so it's really difficult to say whether it would have a performance impact or not. It also depends on:
Your cluster configuration: total nodes, size, shards, replicas, JVM, number of documents, size of documents.
Your index configuration, i.e. which index you are using the explain API on: is it a read-heavy or write-heavy index, how many docs does it hold, how does it perform during peak time, etc.
Apart from that, an application will only issue certain types of queries. Although the search term might change, the underlying question of whether a document matches or not can be understood from samples alone.
I've worked with search systems extensively and I use the explain API a lot, but only on samples and not on all queries, and I have not seen this happening anywhere.
EDIT: Please have a look at named queries, which can also be used to check which parts of your query matched the search results. There is more info in this official blog post.
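As a rough sketch of how named queries could be used for this, the Python below builds a search body in which every term gets its own `match` clause tagged with `_name`, and then reads the `matched_queries` list that Elasticsearch returns per hit. The helper names, the `title` field, the search terms, and the sample response are all made up for illustration. No request is sent here.

```python
def build_named_query(field, terms):
    """Search body with one named match clause per term."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: {"query": term, "_name": f"term:{term}"}}}
                    for term in terms
                ]
            }
        }
    }


def matched_terms(response):
    """Map each hit's _id to the names of the clauses that matched it."""
    return {
        hit["_id"]: hit.get("matched_queries", [])
        for hit in response["hits"]["hits"]
    }


body = build_named_query("title", ["quick", "fox"])

# Hand-written sample shaped like a search response (not real output).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 1.2, "matched_queries": ["term:quick", "term:fox"]},
            {"_id": "2", "_score": 0.4, "matched_queries": ["term:fox"]},
        ]
    }
}

for doc_id, names in matched_terms(sample_response).items():
    print(doc_id, names)
```

Note that this reports which clauses matched, not the exact analyzed tokens, so it fits best when the query can be split into one clause per term of interest.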
I'm trying to benchmark massive insertion into Neo4j in a client-server environment.
So far I've found that there are only two ways to do it:
use REST
implement server extension
I can say upfront that our design requires being able to insert from many concurrently running processes/machines, so using batch insert with a direct connection is not an option.
I would also like to avoid implementing a server extension, as we already have a tight schedule.
I benchmarked massive insertion via REST from just a single client, sending 2 kinds of very simple Cypher queries:
create (vertex:V {guid: {guid}, vtype: {vtype}, random1: {random1}, random2: {random2} })
match (a:V {guid: {a} }) match (b:V {guid: {b} }) create (a)-[:label]->(b)
The guid field had an index.
Results so far are very poor: around (10k V + 40k E) in 13 minutes, compared to competing products like Titan or Orient, which provide an efficient server out of the box and a throughput of around (10k V + 40k E) per minute.
I tried longer-lasting transactions and query parameters; neither gives any significant gain. Furthermore, the overhead from REST is very small: I tested dummy queries and they execute much, much faster (and both client and server are on the same machine). I also tried inserting from multiple threads, but performance does not scale up.
I found another StackOverflow question where the advice was to batch inserts into large requests containing thousands of commands and commit periodically.
Unfortunately, due to the nature of how we generate the data, batching the requests is not feasible. Ideally we'd like the inserts to be atomic operations and have the results appear as soon as they are executed (in fact, there is no need for transactions).
Thus my questions are:
are my Cypher queries optimal for the insertion?
are the results so far in line with what can be achieved with REST (or can I squeeze much more from REST) ?
are there any other ways to perform efficient multi-client massive insertion?
I have a number of thoughts/questions that don't fit very well in a comment ;)
What version of Neo4j are you using? 2.3 introduced some things which might help
When you say you have an index, do you mean the new style and not the legacy indexes? The newer indexes are created with CREATE INDEX ON :V(guid) and apply to the combination of a label and a property. You can try your queries in the web console prefixed with PROFILE to see if the query is hitting the index and where it might be slow
If you can have the data in a CSV format you might look into the LOAD CSV clause in Cypher. That's also a batch sort of thing, so it might not be as useful
I don't think it would help performance much, but this is a bit nicer to read:
match (a:V {guid: {a} }), (b:V {guid: {b} }) create (a)-[:label]->(b)
I know it's of no help now, but Neo4j 3.0 is planned to have a new compressed binary socket protocol called Bolt which should be an improvement over REST. It's estimated for Q2
I know a lot of these suggestions probably aren't too helpful, but they're things to think about. There's also a public Slack chat for Neo4j here:
http://neo4j.com/blog/public-neo4j-users-slack-group/
I'll share this question there to see if anybody has any ideas
EDIT:
Max DeMarzi passed on one of his articles on queueing requests, which might be useful:
http://maxdemarzi.com/2014/07/01/scaling-concurrent-writes-in-neo4j/
Looks like you'd need to write a bit of Java, but he lays it out for you
What is actually better? Having classes with complex queries that are responsible for loading, for instance, nested objects? Or classes with simple queries that are responsible for loading simple objects?
With complex queries you go to the database less often, but the class has more responsibility.
With simple queries you need to go to the database more often. In this case, however, each class is responsible for loading only one type of object.
The situation I'm in is that the loaded objects will be sent to a Flex application (as DTOs).
The general rule of thumb here is that server roundtrips are expensive (relative to how long a typical query takes) so the guiding principle is that you want to minimize them. Basically each one-to-many join will potentially multiply your result set so the way I approach this is to keep joining until the result set gets too large or the query execution time gets too long (roughly 1-5 seconds generally).
Depending on your platform you may or may not be able to execute queries in parallel. This is a key determinant in what you should do because if you can only execute one query at a time the barrier to breaking up a query is that much higher.
Sometimes it's worth keeping certain relatively constant data in memory (country information, for example) or loading it in a separate query, but this is, in my experience, reasonably unusual.
Far more common is having to fix up systems with awful performance due in large part to doing separate queries (particularly correlated queries) instead of joins.
I don't think that either option is actually better. It depends on your application's specifics, its architecture, the DBMS used, and other factors.
E.g. we used multiple simple queries in our standalone solution. But when we evolved our product towards a lightweight, internet-accessible solution, we discovered that our framework made a huge number of requests, and that killed performance because of network latency. So we substantially reworked our framework to use aggregated complex queries. Meanwhile, we still maintained our standalone solution and moved from Oracle Light to Apache Derby. And once more we found that some of our new complex queries had to be simplified, as Derby took too long to execute them.
So look at your real problem and solve it appropriately. I think that simple queries are a good starting point if there are no strong objections to them.
From a gut feeling I would say:
Go with the simple way as long as there is no proven reason to optimize for performance. Otherwise I would put the "complex objects and query" approach in the basket of premature optimization.
If you find that there are real performance implications, then the next step should be to optimize the roundtrips between Flex and your backend. But as I said before, this is a gut feeling: you really should start out with a definition of "performant", start simple, and measure the performance.