Is there an easy way to get information like total database size, cache usage, and so on in RethinkDB?
You should take a look at the system tables (stored in the rethinkdb database). They contain a lot of useful information about the database. In particular, take a look at the status tables (table_status, server_status) and the stats table.
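A minimal sketch of how you might summarize what those tables expose. The real queries run against a live server; here sample_stats stands in for rows of the stats table, and the storage_engine.cache / storage_engine.disk field names are what the table_server rows expose as I recall them, so verify them against your RethinkDB version:

```python
# Against a live server, the queries would be (Python driver):
#   r.db('rethinkdb').table('stats').run(conn)         # throughput, cache use
#   r.db('rethinkdb').table('table_status').run(conn)  # per-table readiness
#   r.db('rethinkdb').table('server_status').run(conn) # process/network info
#
# sample_stats stands in for rows of the stats table below.
sample_stats = [
    {"id": ["table_server", "t1", "s1"],
     "storage_engine": {"cache": {"in_use_bytes": 12_582_912},
                        "disk": {"space_usage": {"data_bytes": 104_857_600}}}},
    {"id": ["table_server", "t2", "s1"],
     "storage_engine": {"cache": {"in_use_bytes": 4_194_304},
                        "disk": {"space_usage": {"data_bytes": 52_428_800}}}},
]

def totals(rows):
    """Sum cache usage and on-disk data size across table/server rows."""
    cache = sum(r["storage_engine"]["cache"]["in_use_bytes"]
                for r in rows if r["id"][0] == "table_server")
    disk = sum(r["storage_engine"]["disk"]["space_usage"]["data_bytes"]
               for r in rows if r["id"][0] == "table_server")
    return cache, disk

cache_bytes, disk_bytes = totals(sample_stats)
print(cache_bytes, disk_bytes)  # 16777216 157286400
```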
As per the Oracle documentation, the automatic job collects statistics for "all objects" in the database. But it does not specify anywhere whether that covers user-specific schemas.
1) What criteria does it follow for automatic collection of statistics on user-specific schemas?
2) Is there any detailed explanation on Metalink of how this is done?
I'd appreciate your response on this.
Thanks,
Mir
The default statistics gathering process works for all schemas, including user schemas. Statistics collection is a complex topic, but it basically boils down to when statistics are gathered and what statistics are gathered:
WHEN AutoTasks gather statistics during specified maintenance windows (usually 10PM every day).
WHAT The STALE_PERCENT preference determines when to gather statistics on a table or index. By default, if 10% of the rows have changed, statistics will be gathered.
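The staleness test is just simple arithmetic over the tracked DML changes. A sketch (whether the comparison is strict or inclusive at exactly 10% is an implementation detail, so treat the threshold as approximate):

```python
def is_stale(num_rows, dml_changes, stale_percent=10.0):
    """Mimics the STALE_PERCENT test: a table becomes a candidate for
    re-gathering once tracked inserts/updates/deletes exceed the threshold."""
    return dml_changes > num_rows * stale_percent / 100.0

# A 1,000,000-row table becomes a candidate for the next maintenance
# window only after more than 100,000 rows have changed.
print(is_stale(1_000_000, 50_000))   # False - not yet stale
print(is_stale(1_000_000, 150_000))  # True - stale
```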
But there are lots of exceptions. Fixed object stats, dictionary object stats, and system statistics (about system performance) are only gathered manually. And tables can have their statistics locked so they are not altered.
You can read more details in the Optimizer Statistics section of the Database Concepts Guide, or the Optimizer Statistics part of the SQL Tuning Guide.
There are several ways to determine when statistics were last gathered. Per object, you can look for the LAST_ANALYZED date column in views like DBA_TABLES and DBA_INDEXES.
To see when the statistics auto tasks should run, there are lots of DBA_AUTOTASK_* views. Those views are difficult to understand, and there are many ways a task can be disabled. (I wish Oracle had just used DBMS_SCHEDULER.) To see when statistics tasks were actually run, see the DBA_OPTSTAT_* views.
It's a huge topic, and this answer is only a high level overview.
We have the following use case example:
We have users, stores, friends (relationships between users), and likes. We store these tables in MySQL, and as key-value structures in Redis, so that reads come from the Redis cache and don't hit the database. Writes go to both data stores.
Our app is therefore VERY fast, and scalable since we rarely hit the database for reads. We are using AWS for scalable Redis.
However, we have a problem when a user is logged in and we have to show a list of stores AND which of his friends like each store. This is a join, and Redis does not support joins directly. We'd like to know the best way to store and show this data. For example: should this be stored in Redis under a key like "store/user_who_likes" and maintained on every write, or maybe built by an hourly cron job? Should we read already-stored data, or construct this join on demand?
We notice that not even Facebook updates this info in realtime, but rather it takes several minutes for a friend to see which of my friends likes a page we have in common.
Thanks in advance for any responses.
Depends how important it is to you. Why not store each person's friends as a set and each store's likes as a set; then when you need the friends who like a given store, you just take the SINTER (set intersection) of the two. It should be fast, and storing friends and store likes as sets gives you a lot of other nice operations as well. Not sure how you're currently using the Redis cache, but you could use these sets as a likely cheaper in-memory replacement for getting a user's friends, a store's likes, etc.
As for cron, not sure how that would help. Redis is more than fast enough to handle the above sorts of writes. Memory will be your bottleneck first.
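To illustrate the set-intersection design, here is a pure-Python stand-in (dicts of sets in place of Redis keys, so it runs without a server; the key names are just an illustrative convention). With redis-py the same operations would be SADD and SINTER:

```python
# Simulated Redis keyspace:
friends = {}      # "friends:<user>"      -> set of user ids (SADD)
store_likes = {}  # "store:likes:<store>" -> set of user ids (SADD)

def add_friend(user, friend):
    friends.setdefault(f"friends:{user}", set()).add(friend)

def add_like(store, user):
    store_likes.setdefault(f"store:likes:{store}", set()).add(user)

def friends_who_like(user, store):
    """Equivalent of: SINTER friends:<user> store:likes:<store>"""
    return friends.get(f"friends:{user}", set()) & \
           store_likes.get(f"store:likes:{store}", set())

add_friend("alice", "bob"); add_friend("alice", "carol")
add_like("coffeeshop", "bob"); add_like("coffeeshop", "dave")
print(friends_who_like("alice", "coffeeshop"))  # {'bob'}
```

Because both sides are maintained on every write, the intersection is always current and no cron job is needed.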
The idea is to redesign the data structure and/or change the DB.
I just started reviewing this project and plan to start optimization with this.
Currently I have CouchDB with about 80 GB of document data, around 30M records.
For most documents, properties like id, group_id, location, and type can be considered generic, but unfortunately they are currently stored under different property names across the set. A lot of deeply nested structures can also be found.
The structure isn't strictly defined, which is why a NoSQL DB was selected well before the full picture was clear.
Data is calculated and populated into the DB by a separate job on a powerful cluster, and this isn't done too often. From that perspective I conclude that general write/update performance isn't very important. A size decrease would also be great, but isn't the most important thing. There are only about 1-10 active customers at a time.
Read performance with various filtering/grouping etc. is actually the most important thing.
No heavy summary calculations need to happen at read time; those are already done during population.
This is a data-analysis tool for displaying comparison and other reports to quality engineers and data analysts, so they can browse the results and group or filter them from the web UI.
Right now, tasks like searching a subset of document properties for a text string aren't possible due to performance.
I've done some initial investigation (e.g. http://www.datastax.com/wp-content/themes/datastax-2014-08/files/NoSQL_Benchmarks_EndPoint.pdf), and Cassandra seems to be a good choice among the NoSQL options.
It would also be quite interesting to try porting this data into a recent PostgreSQL.
Any ideas would be highly appreciated :-)
Hello, please check the following articles:
http://www.enterprisedb.com/nosql-for-enterprise
For me, PostgreSQL's json (and jsonb!) capabilities let you start schema-less and still have transactions, indexes, grouping, and aggregate functions with very good performance, right from the start. And when you're ready (and if needed), you can move to a schema, with an internal data migration.
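A runnable sketch of the schema-less-but-queryable idea. In PostgreSQL you would write body->>'group_id' over a jsonb column; here SQLite's JSON functions (bundled with Python's sqlite3) stand in so the example runs anywhere, and the document fields are invented for illustration:

```python
import json
import sqlite3

# PostgreSQL equivalent:
#   CREATE TABLE docs (id serial, body jsonb);
#   SELECT body->>'group_id', COUNT(*), AVG((body->>'value')::numeric)
#   FROM docs GROUP BY 1;
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
docs = [
    {"group_id": "g1", "type": "measurement", "value": 10},
    {"group_id": "g1", "type": "measurement", "value": 30},
    {"group_id": "g2", "type": "report", "value": 5},
]
conn.executemany("INSERT INTO docs (body) VALUES (?)",
                 [(json.dumps(d),) for d in docs])

# Grouping/aggregation over raw documents, no fixed schema required:
rows = conn.execute("""
    SELECT json_extract(body, '$.group_id') AS grp,
           COUNT(*), AVG(json_extract(body, '$.value'))
    FROM docs GROUP BY grp ORDER BY grp
""").fetchall()
print(rows)  # [('g1', 2, 20.0), ('g2', 1, 5.0)]
```

In PostgreSQL you could also put a GIN index on the jsonb column to speed up containment queries, which addresses the "searching a subset of document properties" requirement.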
Also check:
https://www.compose.io/articles/is-postgresql-your-next-json-database/
Good luck
I have been using a cache for a long time. We store data against some key and fetch it from the cache whenever required. I know that Stack Overflow and many other sites rely heavily on caching. My question is: do they always use a key-value mechanism for caching, or do they form some SQL-like query within the cache? For instance, say I want to view last week's report, whose contents vary each day. Do I need to store a different report against each day (with the day as the key), or can I get this result by forming some query that aggregates results across different keys? Does any caching product (like Redis) provide this functionality?
Thanks In Advance
A cache is always a key-value hash table; this is how it stays so fast. If you're doing querying, then you're not doing caching.
What you may be trying to ask is: you could have in your database a table that contains aggregated report data, and you could query against that pre-calculated table.
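To make the pre-calculated-table idea concrete, here is a sketch with an invented schema, using SQLite so it runs standalone. A batch job refreshes the aggregate table, and application reads only ever touch that small table:

```python
import sqlite3

# Hypothetical schema: raw events plus a pre-aggregated daily report table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (day TEXT, amount INTEGER)")
conn.execute("CREATE TABLE daily_report (day TEXT PRIMARY KEY, total INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("2024-01-01", 5), ("2024-01-01", 7), ("2024-01-02", 3)])

# Batch job (e.g. cron) refreshes the pre-calculated table...
conn.execute("""
    INSERT OR REPLACE INTO daily_report
    SELECT day, SUM(amount) FROM events GROUP BY day
""")

# ...and reads hit only the aggregate table, never re-scanning events:
week = conn.execute(
    "SELECT day, total FROM daily_report ORDER BY day").fetchall()
print(week)  # [('2024-01-01', 12), ('2024-01-02', 3)]
```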
One of the reasons a cache (e.g. memcached) is fast is the simplicity of its data access and querying protocol.
The more functionality you add, the more efficiency you trade away. A full-fledged SQL engine in a "caching" database is not a good design. You can, however, use a data-structure-oriented database like Redis and design your cache data to suit your querying needs, for example one set or one hash for each date.
A step further, you can use databases like MongoDB or MemSQL, which are pretty fast and have rich querying support, so an occasional aggregation report won't be an issue.
However, as a design decision, you will have to accept that their caching throughput will not be as high as memcached's or Redis's.
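A sketch of the "one hash per date" layout, simulated with nested dicts in place of Redis hashes (HSET/HGETALL) so it runs without a server; the key and field names are just an illustrative convention. The weekly report is then a small client-side aggregation over seven per-day hashes:

```python
cache = {}  # key "report:<day>" -> dict of field -> value (a Redis hash)

def hset(key, field, value):
    """HSET <key> <field> <value>"""
    cache.setdefault(key, {})[field] = value

def report_for(day):
    """HGETALL report:<day>"""
    return cache.get(f"report:{day}", {})

def weekly_total(days, field):
    """Aggregate one field across several per-day hashes client-side."""
    return sum(report_for(d).get(field, 0) for d in days)

hset("report:2024-01-01", "page_views", 120)
hset("report:2024-01-02", "page_views", 80)
print(weekly_total(["2024-01-01", "2024-01-02", "2024-01-03"],
                   "page_views"))  # 200
```

Each day's key stays a plain key-value lookup, so the cache keeps its speed; only the small final aggregation happens in application code.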
I need to store a large amount of small data objects (millions of rows per month). Once they're saved they won't change. I need to:
store them securely
use them to analysis (mostly time-oriented)
retrieve some raw data occasionally
It would be nice if it could be used with JasperReports or BIRT
My first shot was Infobright Community Edition - a column-oriented, read-only storage engine for MySQL.
On the other hand, people say the NoSQL approach could be better. Hadoop+Hive looks promising, but the documentation looks poor and the version number is less than 1.0.
I heard about Hypertable, Pentaho, MongoDB ....
Do you have any recommendations ?
(Yes, I found some topics here, but they are a year or two old.)
Edit:
Other solutions : MonetDB, InfiniDB, LucidDB - what do you think?
I'm having the same problem here and have done some research; there are a few types of storage for BI:
Column oriented, free and well known: MonetDB, LucidDB, Infobright, InfiniDB
Distributed: HTable, Cassandra (also theoretically column oriented)
Document oriented: MongoDB, CouchDB
The answer depends on what you really need:
If your millions of rows are loaded all at once (a nightly batch or so), InfiniDB or another column-oriented DB is the best fit; they have great performance and are "BI oriented". http://www.d1solutions.ch/papers/d1_2010_hauenstein_real_life_performance_database.pdf
And they won't require a setup of "nodes", "sharding", and the other things that come with distributed/"NoSQL" DBs.
http://www.mysqlperformanceblog.com/2010/01/07/star-schema-bechmark-infobright-infinidb-and-luciddb/
If the rows are added in real time, then column-oriented DBs are a bad fit. You can either have two separate DBs (that's my choice: one NoSQL DB fed in real time by the front end for real-time stats, and a second, column-oriented DB for BI), or turn towards something that mixes column orientation (for reads) with distribution (for writes), like Cassandra.
Document-oriented DBs are not well suited for BI; they are more useful for CRM/CMS uses where you need frequent access to a particular row.
As for the exact choice inside each category, I'm still undecided. Cassandra among distributed DBs, and MonetDB or InfiniDB among column-oriented DBs, are the leaders. MonetDB is reported to have problems loading very big tables because it keeps its indexes in memory.
You could also consider GridSQL. Even for a single server, you can create multiple logical "nodes" to utilize multiple cores when processing queries.
GridSQL uses PostgreSQL, so you can also take advantage of partitioning tables into subtables to evaluate queries faster. You mentioned the data is time-oriented, so that would be a good candidate for creating subtables.
If you're looking for compatibility with reporting tools, something based on MySQL may be your best choice. As for what will work for you, Infobright may work. There are several other solutions as well; however, you may also want to look at plain old MySQL and the ARCHIVE storage engine. Each record is compressed when stored and, IIRC, it's designed for your type of workload, though I think Infobright is supposed to get better compression. I haven't really used either, so I'm not sure which will work best for you.
As for the key-value stores (e.g. NoSQL), yes, they can work as well, and there are plenty of alternatives out there. I know CouchDB has "views", but I haven't had the opportunity to use them, so I don't know how well they work.
My only concern with your data set is that since you mentioned time, you may want to ensure that whatever solution you use will allow you to archive data past a certain time. It's a common data warehouse practice to only keep N months of data online and archive the rest. This is where partitioning, as implemented in an RDBMS, comes in very useful.
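The time-based partitioning and archiving idea can be sketched as one table per month (what RDBMS partitions or PostgreSQL subtables would give you), with queries touching only the months they need and archiving reduced to dropping the oldest table. SQLite stands in here so the sketch runs standalone, and the table-naming scheme is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def table_for(month):            # month like "2024_01"
    return f"events_{month}"

def insert(month, value):
    t = table_for(month)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {t} (value INTEGER)")
    conn.execute(f"INSERT INTO {t} VALUES (?)", (value,))

def count(month):
    """Query only the partition for the requested month."""
    try:
        return conn.execute(
            f"SELECT COUNT(*) FROM {table_for(month)}").fetchone()[0]
    except sqlite3.OperationalError:   # partition archived / never existed
        return 0

def archive(month):
    """Archiving old data is just dropping that month's partition."""
    conn.execute(f"DROP TABLE IF EXISTS {table_for(month)}")

insert("2024_01", 1); insert("2024_01", 2); insert("2024_02", 3)
archive("2024_01")                     # keep only the newest month online
print(count("2024_01"), count("2024_02"))  # 0 1
```

In a real RDBMS, native partitioning does the routing for you and lets the planner prune partitions automatically; the retention policy stays the same cheap "drop the oldest partition" operation.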