CodeIgniter backup database (how much can it back up?)

When reading the user guide for CodeIgniter, I noticed under the database utilities that the backup feature says "backing up very large databases may not be possible."
What exactly does this mean? How much would it be able to back up?

From CodeIgniter's user guide:
Due to the limited execution time and memory available to PHP, backing up very large databases may not be possible.
The configuration of your server (hosting your website/CodeIgniter) will likely limit the maximum execution time of a PHP script and the total memory available to PHP. What size database you can back up will therefore depend entirely on your specific server configuration. Running this backup utility against your database on your server and benchmarking the results (CodeIgniter's Benchmarking Class may help here) will help you determine what size database you can back up. You can potentially change your server's configuration to allocate more resources to PHP as required.
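For example, a rough sketch of such a benchmark inside a controller method might look like the following (the backup preferences shown are only examples; see the Database Utilities and Benchmarking Class pages of the user guide for the full options):

// Load the database utility class and time the backup call.
$this->load->dbutil();

$this->benchmark->mark('backup_start');

$backup = $this->dbutil->backup(array(
    'format'     => 'gzip',   // 'gzip', 'zip' or 'txt'
    'add_drop'   => TRUE,     // include DROP TABLE statements
    'add_insert' => TRUE      // include INSERT statements
));

$this->benchmark->mark('backup_end');

echo 'Execution time: '
    . $this->benchmark->elapsed_time('backup_start', 'backup_end') . 's<br />';
echo 'Peak memory usage: '
    . round(memory_get_peak_usage(TRUE) / 1048576, 2) . ' MB';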
I decided to benchmark this backup function with a few different databases. This was just out of curiosity, so I wouldn't rely on these results, but they may be of interest.
Database     Size       Tables   Rows     Execution time   Peak memory usage
Database 1   306.4 KB   78       279      0.0603 s         3 MB
Database 2   1 MB       11       165      0.0350 s         3.25 MB
Database 3   16.6 MB    4        403      0.6335 s         30.5 MB
Database 4   6.5 MB     9        93,289   7.1702 s         91.25 MB

Related

How can I improve ClickHouse server's startup time?

I am evaluating ClickHouse's performance for potential use in a project. The write performance has been encouraging up to this point, but as I was running my tests and had to restart the server a few times, I noticed an issue with the potential to be a hard showstopper: the server startup time fluctuates and is, most of the time, extremely high.
My evaluation server contains 26 databases holding about 54 billion records and taking up 697.32 GB on disk.
With this amount of data I have been getting startup times from as low as 7m35s to almost 3h.
Is this normal? Can it be solved with some fancier configuration? Am I doing something really wrong? Because, as it stands, such a long startup time is a showstopper.
The main cause of the slow startup is the huge amount of metadata that has to be loaded, which correlates positively with the number of data files. To speed up startup, you need to either shrink the file count or add more memory so that all dentry and inode caches can be preserved.
My evaluation server contains 26 databases holding about 54 billion records and taking up 697.32 GB on disk.
I'd suggest the following:
Adjust your current data partitioning scheme to be coarser (fewer, larger partitions mean fewer files)
Use OPTIMIZE TABLE <table> FINAL to compact all data files
Upgrade the data disk to an SSD or an efficient RAID array, or use a file system like btrfs to store the metadata separately on fast storage.
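To illustrate the second point, here is a rough sketch that runs OPTIMIZE ... FINAL over every non-system table via ClickHouse's HTTP interface, driven from PHP. It assumes the default interface on localhost:8123 with no authentication, MergeTree-family tables, and that a merge-heavy maintenance window is acceptable; the clickhouse_query() helper is hypothetical.

// Hypothetical helper: POST a query to ClickHouse's HTTP interface (default port 8123).
function clickhouse_query($sql)
{
    $ch = curl_init('http://localhost:8123/');
    curl_setopt_array($ch, array(
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $sql,
        CURLOPT_RETURNTRANSFER => true,
    ));
    $result = curl_exec($ch);
    curl_close($ch);
    return (string) $result;
}

// List every non-system table, then merge each table's parts into as few files as possible.
$tables = clickhouse_query(
    "SELECT database, name FROM system.tables WHERE database != 'system' FORMAT TSV"
);

foreach (array_filter(explode("\n", trim($tables))) as $line) {
    list($db, $table) = explode("\t", $line);
    clickhouse_query("OPTIMIZE TABLE {$db}.{$table} FINAL");
}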

Why is my H2 database 7x larger on disk than it should be?

I have an H2 database that has ballooned to several Gigabytes in size, causing all sorts of operational problems. The database size didn't seem right. So I took one little slice of it, just one table, to try to figure out what's going on.
I brought this table into a test environment:
The columns add up to 80 bytes per row, per my calculations.
The table has 280,000 rows.
For this test, all indexes were removed.
The table should occupy approximately
80 bytes per row * 280,000 rows = 22.4 MB on disk.
However, it is physically taking up 157 MB.
I would expect to see some overhead here and there, but why is this database a full 7x larger than can be reasonably estimated?
UPDATE
Output from CALL DISK_SPACE_USED
There are always indexes, etc., to be taken into account.
Can you try:
CALL DISK_SPACE_USED('my_table');
I would also recommend running SHUTDOWN DEFRAG and calculating the size again.
Setting MV_STORE=FALSE on database creation solves the problem. The whole database (not just the test slice from the example) is now approximately 10x smaller.
Update
I revisited this topic recently and ran a comparison against MySQL. On my test data set, with MV_STORE=FALSE, the H2 database takes up 360 MB of disk space, while the same data on MySQL 5.7 InnoDB with mostly default configuration takes up 432 MB. YMMV.

Correcting improper usage of Cassandra

I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS instance - Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4 GB RAM.
2 nodes of Cassandra DataStax Community Edition (2.1.3).
PHP 5.5.9, with the DataStax php-driver.
I come from a MySQL background, with very basic hands-on NoSQL experience: Elasticsearch (the company is now called Elastic) and MongoDB for document storage.
When I read up on how to use Cassandra, these are the points I took away:
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your queries rather than relying on indexes
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" writes
I have a PHP Silex framework API that receives batched JSON data and inserts it into at least 4 tables, at most 6 (mainly due to the different sort orders I need).
At first I only had two Cassandra nodes. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch sizes/concurrency.
Concurrency   Batch size   Avg. time (ms), 2 nodes   Avg. time (ms), 3 nodes
1             5            288                       180
1             50           421                       302
1             400          1,298                     1,504
25            5            1,993                     2,111
25            50           3,636                     3,466
25            400          32,208                    21,032
100           5            5,115                     5,167
100           50           11,776                    10,675
100           400          61,892                    60,454
The batch size is the number of entries (across the 4-6 tables) made per call.
So a batch of 5 means it is making 5 x (4-6) tables' worth of inserts. At higher batch sizes/concurrency the application times out.
There are 5 columns per table with relatively small data (mostly ints, with text no longer than roughly 10 characters).
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? It seems to be this is relatively small data set of that considering that Cassandra was built on BigDataTable at very high write speed.
Do I add more nodes beyond 3 in order to speed up?
Do I change my replication factor and tune quorum read/write consistency levels, hunting for a sweet spot in the DataStax docs: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch frameworks, for example go with Node.js for higher concurrency?
Do I rework my tables? I have no good example of how to use column families effectively, so I need a hint for this one.
For the table question:
I'm tracking the history of a user. A user has an event, which is associated with a media id, and there is some extra metadata too.
So columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them differently, therefore I made different tables for them (that's how I understood Cassandra data modeling should work; I may be wrong). So I am duplicating the same data across several tables.
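For illustration, one of those per-sort tables could be defined roughly like this with the php-driver (a simplified sketch, not my exact schema; the table name and types are only an example):

// Simplified sketch: one query-specific table holding a user's events, newest first.
$cluster = Cassandra::cluster()->withContactPoints('127.0.0.1')->build();
$session = $cluster->connect('user_data');

$session->execute(new Cassandra\SimpleStatement(
    'CREATE TABLE IF NOT EXISTS events_by_user (
         user_id    int,
         time       timestamp,
         event_type text,
         media_id   int,
         extra_data text,
         PRIMARY KEY (user_id, time)
     ) WITH CLUSTERING ORDER BY (time DESC)'
));

A query such as SELECT * FROM events_by_user WHERE user_id = ? then comes back already ordered by time because of the clustering order.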
Help?
EDIT
The application also has Redis and MySQL attached for other CRUD points of interest, such as retrieving a user's data and caching it for faster pulls.
So far, on average, I get 72 ms once Redis kicks in and 180 ms from MySQL before Redis.
The first problem is you're trying to benchmark the whole system without knowing what any individual component can do. Are you trying to see how fast an individual operation is, or how many operations per second you can do? They're different values.
I typically recommend you start by benchmarking Cassandra. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads / second or writes/second. Use cassandra-stress to make sure cassandra is doing what you expect, THEN try to loop in your application and see if it matches. If you slow way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc).
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.
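To make the "async requests instead of sync" suggestion concrete, here is a rough sketch with the DataStax php-driver (the keyspace, table, and sample rows are assumptions carried over from the illustration in the question, not the real schema):

// Hedged sketch: prepare once, fire the inserts asynchronously, then wait on the futures.
$cluster  = Cassandra::cluster()->withContactPoints('127.0.0.1')->build();
$session  = $cluster->connect('user_data');

$prepared = $session->prepare(
    'INSERT INTO events_by_user (user_id, time, event_type, media_id, extra_data)
     VALUES (?, ?, ?, ?, ?)'
);

// Sample rows standing in for the decoded JSON batch (assumption).
$rows = array(
    array('user_id' => 1, 'time' => time(), 'event_type' => 'play',  'media_id' => 42, 'extra_data' => ''),
    array('user_id' => 1, 'time' => time(), 'event_type' => 'pause', 'media_id' => 42, 'extra_data' => ''),
);

$futures = array();
foreach ($rows as $row) {
    $futures[] = $session->executeAsync($prepared, new Cassandra\ExecutionOptions(array(
        'arguments' => array(
            (int) $row['user_id'],
            new Cassandra\Timestamp((int) $row['time']),   // Unix timestamp in seconds
            $row['event_type'],
            (int) $row['media_id'],
            $row['extra_data'],
        ),
    )));
}

// Resolve every future so any failure surfaces before the HTTP response is returned.
foreach ($futures as $future) {
    $future->get();
}

Overlapping the requests this way keeps the PHP process busy while Cassandra works, and is usually gentler on the cluster than pushing everything through one large batch.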

Spark cache inefficiency

I have quite a powerful cluster: 3 nodes, each with 24 cores and 96 GB RAM (288 GB total). I am trying to load 100 GB of TSV files into the Spark cache and run a series of simple computations over the data, like sum(col20) by col2-col4 combinations. I think it's a clear scenario for cache usage.
But during execution I found that the cache NEVER loads 100% of the data, despite there being plenty of RAM. After 1 hour of execution I have 70% of partitions in the cache and 75 GB of cache used out of 170 GB available. It looks like Spark somehow limits the number of blocks/partitions it adds to the cache, instead of adding everything at the very first action and getting great performance from the very beginning.
I use MEMORY_ONLY_SER with Kryo (cache size approximately 110% of the on-disk data size).
Does anyone have similar experience, or know of Spark configs / environment conditions that could cause this caching behaviour?
So, "problem" was solved with further reducing of split size. With mapreduce.input.fileinputformat.split.maxsize set to 100mb I got 98% cache load after 1st action finished, and 100% at 2nd action.
Other thing that worsened my results was spark.speculation=true - I try to avoid long-running tasks with that, but speculation management creates big performance overhead, and is useless for my case. So, just left default value for spark.speculation ( false )
My performance comparison for 20 queries are as following:
- without cache - 160 minutes ( 20 times x 8 min, reload each time 100gb from disk to memory )
- cache - 33 minutes total - 10m to load cache 100% ( during first 2 queries ) and 18 queries x 1.5 minutes each ( from in-memory Kryo-serialized cache )

SQL Server large data

I have SQL Server 2012 installed on a virtual machine with a quad-core processor (2.0 GHz) and 32 GB RAM.
The data volume is about 400 GB, and there is a table with 824 million records in it.
The records are organized by date.
I have to pull data from this table according to the start date and end date given by the user; this generally takes 24 minutes to process the query.
Is there any way to optimize this? I have even tried SQL Server 2012's columnstore index on it, yet there is no significant improvement in query performance.
Should I ask for a higher machine configuration for such a large amount of data, or is there another way to achieve good performance?
