Couchbase Query Execution time? - performance

How can I measure query time and view the query execution plan in Couchbase? Are there any utilities for Couchbase like Oracle's EXPLAIN PLAN and tkprof?
edit:
I am trying to see which database performs best for my data, so I am experimenting with MySQL, MongoDB, and Couchbase. I have tried three dataset sizes: 10k, 20k, and 40k entries.
With MySQL, I can see the query time using "set profiling = 1". With this setting I ran queries under three scenarios: 1) without indexing the primary key, 2) after indexing the primary key, 3) running the same query a second time (to see the effect of query caching).
Similarly, I ran the same tests with MongoDB and summarized my results in a table. I want to run the same tests with Couchbase to see how well it performs. I searched the net but couldn't find anything I could follow to get similar results.
Below is the table I have (all times are in milliseconds). The second row of each group, in parentheses (), shows the query time for the second run.
Records | MySQL                      | MongoDB                    | Couchbase
count   | Without index | With index | Without index | With index | With index
--------+---------------+------------+---------------+------------+-----------
10K     | 62.27325      | 8.537      | 3311          | 33         |
        | (33.3135)     | (3.27825)  | (7)           | (0)        |
20K     | 108.4075      | 23.238     | 132           | 39         |
        | (80.90525)    | (4.576)    | (17)          | (0)        |
40K     | 155.074       | 26.26725   | 48            | 10         |
        | (110.42)      | (10.037)   | (42)          | (0)        |
For Couchbase, I want to know the performance both when retrieving a document by its key (similar to memcached) and when querying through its views.

You have to understand that Couchbase works differently from RDBMSs such as Oracle. Couchbase offers two ways to retrieve your data:
1) Key lookup: you know the key(s) of the document(s) that you want to retrieve.
2) Views: you define map-reduce jobs called views, which create indexes that let you query your data on attributes other than the key.
Couchbase documents are always consistent, but views are only eventually consistent (although you have the ability to change this).
As the Couchbase documentation states:
Views are updated when the document data is persisted to disk. There is a delay between creating or updating the document, and the document being updated within the view.
So query time really depends on a variety of factors: can the view data be stale? How large is the data emitted from the index? What are the current workload and database size? Couchbase provides the following three values for the stale parameter, which control how up to date a view must be when you query it. With false, the index has to be updated before the result is returned, so it can potentially be slow.
false : Force a view update before returning data
ok : Allow stale views
update_after : Allow stale view, update view after it has been accessed
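As a minimal sketch of how the stale setting is passed when querying a view over REST: the host, bucket, design document, and view names below are hypothetical placeholders, and port 8092 is assumed to be the view API port of your installation.

```python
from urllib.parse import urlencode

# The three values of the "stale" view parameter listed above.
STALE_VALUES = ("false", "ok", "update_after")


def view_url(host, bucket, design_doc, view, stale="update_after", **params):
    """Build a view query URL carrying the chosen stale setting.

    All names passed in are placeholders for illustration; port 8092 is
    an assumption about the cluster's view API port.
    """
    if stale not in STALE_VALUES:
        raise ValueError(f"stale must be one of {STALE_VALUES}")
    query = urlencode({"stale": stale, **params})
    return f"http://{host}:8092/{bucket}/_design/{design_doc}/_view/{view}?{query}"


# stale=false forces an index update first, so expect this to be the slowest option.
url = view_url("localhost", "logs", "dev_by_ip", "by_ip", stale="false", limit=10)
print(url)
```

Timing the same request with each of the three values gives a feel for how much the forced index update costs on your data.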
Please check out the official documentation for more in-depth answers: http://docs.couchbase.com/couchbase-manual-2.2/#views-and-indexes
Also, you can check out this interesting article on caching views: http://blog.couchbase.com/caching-queries-couchbase-high-performance
N1QL, effectively the Couchbase version of SQL, is currently in development at Couchbase. It will have an EXPLAIN statement available, but I believe it won't be released until late 2014.
A blog post introducing N1QL
http://blog.couchbase.com/n1ql-it-makes-cents
A cheat sheet for N1QL
http://www.couchbase.com/communities/sites/default/files/Couchbase-N1QL-CheatSheet.pdf
And where you can download the dev preview if you want to play with N1QL
http://www.couchbase.com/communities/n1ql
Also check out the cbstats tool: http://docs.couchbase.com/couchbase-manual-2.2/#cbstats-tool It gives a high-level overview of persistence rates, updates, key misses, etc.
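For the first-run versus second-run timings in the question's table, a rough client-side measurement is one option. This is a minimal sketch, assuming run_query wraps whatever client call you use; the dictionary lookup is only a stand-in and is not a Couchbase API.

```python
import time


def time_query_ms(run_query, repeats=2):
    """Run the callable `repeats` times and return each wall-clock time in ms."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


# Stand-in data store (hypothetical); replace the lambda with a real key lookup or view query.
store = {f"doc::{i}": {"id": i} for i in range(10_000)}
first_run, second_run = time_query_ms(lambda: store["doc::42"])
print(f"first run: {first_run:.4f} ms, second run: {second_run:.4f} ms")
```

Client-side numbers include network and serialization time, so they are not directly comparable to server-side figures such as MySQL's profiling output.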

Related

Querying from azure db using linq takes a lot of time

I am using Azure DB and my table has over 120000 records.
With paging of 10 records per page, I am fetching data into an IQueryable, but fetching just 10 records takes around 2 minutes. The query has no joins and just 2 filters. Using Azure Search, I can get all the records within 3 seconds.
Please suggest how to reduce my LINQ query time, as Azure Search is costly.
Based on the query in the comment to your question, it looks like you are reading the entire table into your app's memory. The data is potentially transferred from one side of the data centre to the other, which causes the performance issue.
unitofwork.Repository<EntityName>().GetQueriable().ToList().Skip(1).Take(10);
Without seeing the rest of the code, I'm just guessing your LINQ query should be something like this:
unitofwork.Repository<EntityName>().GetQueriable().Skip(1).Take(10).ToList();
Skip and Take should then be executed on the database server, while .ToList() at the end materializes only the resulting entities.
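The difference in how many rows get pulled can be illustrated with a small Python analogy. This is a sketch only: the rows generator is a hypothetical stand-in for the table, and a real LINQ provider translates Skip/Take into SQL rather than streaming rows, but the ordering effect is the same.

```python
from itertools import islice


def rows(total, fetched):
    """Yield rows lazily, recording each row that is actually produced."""
    for i in range(total):
        fetched.append(i)
        yield {"id": i}


# Like ToList().Skip(1).Take(10): materialize everything, then page in memory.
eager_fetched = []
eager_page = list(rows(120_000, eager_fetched))[1:11]

# Like Skip(1).Take(10).ToList(): page first, materialize only the page.
lazy_fetched = []
lazy_page = list(islice(rows(120_000, lazy_fetched), 1, 11))

print(len(eager_fetched), len(lazy_fetched))
```

Both pages contain the same 10 rows, but the first variant produces every row in the table before discarding almost all of them.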

Does the number of columns in a Vertica table impact query performance?

We are working with a Vertica 8.1 table containing 500 columns and 100 000 rows.
The following query will take around 1.5 seconds to execute, even when using the vsql client straight on one of the Vertica cluster nodes (to eliminate any network latency issue) :
SELECT COUNT(*) FROM MY_TABLE WHERE COL_132 IS NOT NULL and COL_26 = 'anotherValue'
But when checking the query_requests table, the request_duration_ms is only 98 ms, and the resource_acquisitions table doesn't indicate any delay in resource acquisition. I can't understand where the rest of the time is spent.
If I then export to a new table only the columns used by the query, and run the query on this new, smaller, table, I get a blazing fast response, even though the query_requests table still tells me the request_duration_ms is around 98 ms.
So it seems that the number of columns in the table impacts the execution time of queries, even if most of these columns are not referenced. Am I wrong? If not, why is that?
Thanks in advance.
It sounds like your query is running against the (default) superprojection, which includes all columns of the table. Even though Vertica is a columnar database (with associated compression and encoding), your query is probably still touching more data than it needs to.
You can create projections to optimize your queries. A projection contains a subset of columns; if one is available that has all the columns your query needs, then the query uses that instead of the superprojection. (It's a little more complicated than that, because physical location is also a factor, but that's the basic idea.) You can use the Database Designer to create some initial projections based on your schema and sample queries, and iteratively improve them over time.
I was running Vertica 8.1.0-1. It seems the issue was a bug in Vertica's query planning phase that caused a performance degradation. It was fixed in versions >= 8.1.1:
[https://my.vertica.com/docs/ReleaseNotes/8.1.x/Vertica_8.1.x_Release_Notes.htm]
VER-53602 - Optimizer - This fix improves complex query performance during the query planning phase.

Cassandra query flexibility

I'm pretty new to the field of big data and currently stuck on a fundamental decision.
For a research project I need to store millions of log entries per minute in my Cassandra-based data center, which works pretty well (single data center, 4 nodes).
Log Entry
------------------------------------------------------------------
| Timestamp | IP1 | IP2 ...
------------------------------------------------------------------
| 2015-01-01 01:05:01 | 10.10.10.1 | 192.10.10.1 ...
------------------------------------------------------------------
Each log entry has a specific timestamp. In the first instance, the log entries should be queried by different time ranges. As recommended, I started to "model my query" with a big-row approach.
Basic C* Schema
------------------------------------------------------------------
| row key | column key a | column key b ...
------------------------------------------------------------------
| 2015-01-01 01:05 | 2015-01-01 01:05:01 | 2015-01-01 01:05:23
------------------------------------------------------------------
Additional detail:
column keys are a composition of timestamp+uuid, to be unique and to avoid overwrites;
log entries for a specific time are stored close together on a node because they share the same partition key;
Thus log entries are stored in short time intervals per row, for example every log entry for 2015-01-01 01:05, with minute precision. Queries are not really performed as range queries with a < operator; rather, entries are selected as blocks for a specified minute.
Range-based queries return in a decent response time, which is fine for me.
Question:
In the next step we want to gain additional information from queries that mainly focus on the IP fields. For example: select all the entries which have IP1=xx.xx.xx.xx and IP2=yy.yy.yy.yy.
Obviously the current model is not really usable for additional IP-focused CQL queries. So the problem is not finding a possible solution, but rather choosing among the various technologies that could provide one:
- Try to solve the problem with a standalone C* solution. (Build a second model and maintain the same data in a different shape.)
- Choose additional technologies like Spark...
- Switch to an HDFS/Hadoop or Cassandra/Hadoop solution...
- and so on
With my lack of knowledge in this field, it is pretty hard to find the best way to go, especially with the feeling that using a cluster computing framework would be excessive.
As I understood your question, your table schema looks like this:
create table logs (
minute timestamp,
id timeuuid,
ips list<text>,
message text,
primary key (minute,id)
);
With this simple schema, you:
- can fetch all logs for a specific minute;
- can fetch short inter-minute ranges of log events;
- want to query the dataset by IP.
From my point of view, there are multiple ways of implementing this idea:
- Create a secondary index on the IP addresses. But in C* you will lose the ability to query by timestamp: C* cannot merge primary and secondary indexes (as MySQL/PostgreSQL can).
- Denormalize the data. Write your log events to two tables at once, the first optimized for timestamp queries (minute+ts as PK), the second for IP-based queries (IP+ts as PK).
- Use Spark for analytical queries. But Spark will need to perform a (full?) table scan (in a nifty distributed map-reduce way, but nevertheless it's a table scan) each time to extract the data you've requested, so all your queries will take a long time to finish. This can cause problems if you plan to have a lot of low-latency queries.
- Use an external index like ElasticSearch for querying, and C* for storing the data.
In my opinion, the C* way of doing such things is to have a set of separate tables for different queries. It will give you the ability to perform blazing-fast queries (but with increased storage cost).
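The denormalization option can be sketched in plain Python, with two dictionaries standing in for the two tables. The names logs_by_minute, logs_by_ip, and write_log are hypothetical, and using the (ip1, ip2) pair as the second partition key is an assumption about the IP-based query from the question.

```python
from collections import defaultdict
from datetime import datetime

# Two "tables" holding the same events under different partition keys.
logs_by_minute = defaultdict(list)  # partition key: timestamp truncated to the minute
logs_by_ip = defaultdict(list)      # partition key: (ip1, ip2) pair (assumed)


def write_log(ts, ip1, ip2, message):
    """Write one log event to both tables, as a denormalized model requires."""
    entry = {"ts": ts, "ip1": ip1, "ip2": ip2, "message": message}
    minute = ts.replace(second=0, microsecond=0)
    logs_by_minute[minute].append(entry)
    logs_by_ip[(ip1, ip2)].append(entry)


write_log(datetime(2015, 1, 1, 1, 5, 1), "10.10.10.1", "192.10.10.1", "first")
write_log(datetime(2015, 1, 1, 1, 5, 23), "10.10.10.2", "192.10.10.1", "second")

# Each query pattern reads exactly one partition of "its" table.
minute_hits = logs_by_minute[datetime(2015, 1, 1, 1, 5)]
ip_hits = logs_by_ip[("10.10.10.1", "192.10.10.1")]
print(len(minute_hits), len(ip_hits))
```

The cost is visible in write_log: every event is stored twice, which is the increased storage cost mentioned above.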

How to increase mdx Query speed in pentaho cde and how to clear Mondrian Schema cache

I have a problem with MDX queries. I developed a dashboard that has 23 MDX queries. When we run this dashboard, it takes 2 minutes to load. How can I solve this problem?
Another issue:
I modified some data in the database. When we run the dashboard, the modified data isn't shown; it shows only the previous data. How can I solve this problem?
1) 23 queries on first load may be a bit too much. Can't you simplify that? Also, are all the queries already as fast as possible and there are just too many of them, or are there slower queries that need to be improved? Check the priority of components as well: you may have components rendering more than once. Example: you have a Country selector and a City selector. If they have the same priority (default=5) and the City selector was added before the Country selector, the City selector runs first, retrieving the full list of cities. Then the Country selector runs and picks the first value as its parameter value. As your City selector most likely listens to the Country parameter, it fires again because a fireChange was triggered on the Country parameter.
2) Cache. You're changing the data but either Mondrian or CDA (or both) are getting data from their cache. Two options here:
- Clear Mondrian cache and clear CDA cache after the data is updated (suitable for large updates that affect most of the database);
- Disable the cache on the query definition and the cube cache on the Mondrian schema.

Is it good design to store 500 mb of data in jvm cache which can be searched like a sql query?

I have a requirement to return the results of a search query on a database table within 1 second. The database table returns results slowly at this point, so the table has to be moved to a cache and searched from there so that search results come back fast. I want to do a Google-type page refresh on my existing search page, which means the page should refresh as the user types.
To achieve this, the search results should return within one second. My database is Teradata, and its queries take 2 to 3 seconds at least. Hence I want to look at other options such as caching.
Columns are
company , Id , Industry, parent ...4 more
It's a search page, so if the user types "ja", all items starting with "ja" should be returned, like:
company      | Id  | Industry | parent
jaico        | 222 | paints   | Jaico asia
Jammy fruits | 232 | food     | jammy International
The table contains 3.2 million rows and 8 columns. The search needs to return all 8 columns. Byte-wise, there are about 150 chars per row, so the total is 3.2 million * 150 chars = 480 megabytes. I need to store this much data in a cache and then run SQL-like search queries (grouping, LIKE, ORDER BY) against it. What would be the best option to use in this case?
- ehcache
- jboss cache
- Infinispan
- Apache Lucene
Please suggest which option is good. Is it better to do caching in memory or to use Lucene?
What needs to be cached? --> A table of 3.2 million rows with 8 columns.
Why does it need to be cached? --> So that search results come back faster than from a SQL query. With a SQL query it takes a very long time, hence I want to move towards caching the data.
Take a look at Apache Solr - you can get that kind of performance with the right deployment. You can shard to distribute queries, for one thing.
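For comparison with the in-memory option raised in the question, the "starts with" search can be sketched as a binary search over a sorted list of company names. This is not Solr and not any of the listed caches; it is a plain Python illustration, and the third row and the prefix_search helper are hypothetical.

```python
from bisect import bisect_left

# (company, id, industry, parent); the first two rows come from the question.
rows = [
    ("jaico", 222, "paints", "Jaico asia"),
    ("Jammy fruits", 232, "food", "jammy International"),
    ("Acme", 101, "tools", "Acme Holdings"),  # hypothetical extra row
]

# Sort once by lower-cased company name; keep the keys in a parallel list.
index = sorted(rows, key=lambda r: r[0].lower())
keys = [r[0].lower() for r in index]


def prefix_search(prefix):
    """Return all rows whose company name starts with prefix (case-insensitive)."""
    p = prefix.lower()
    lo = bisect_left(keys, p)
    # "\uffff" sorts after any ordinary character, so this bounds the prefix range.
    hi = bisect_left(keys, p + "\uffff")
    return index[lo:hi]


matches = prefix_search("ja")
print([row[0] for row in matches])
```

Each lookup is two binary searches, so it stays fast at 3.2 million rows, but this only covers prefix matching on one column; grouping, ordering, and matching on other columns are where a search engine such as Solr earns its keep.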

Resources