Can dumping and restoring a database make it slower?

I have an Amazon RDS Postgres database. I created a snapshot of this database (say, database-A) and then restored the snapshot on a new DB instance (say, database-B). database-A was an 8 GiB machine with 2 cores; database-B is a 3.75 GiB machine with 1 core.
I find the following:
The storage occupied by database-B is greater than that of database-A. I measured the occupied storage using pg_database_size.
Queries are slower on database-B than they were on database-A.
Are these two things possible in a normal scenario, or must I have made some mistake during the dump/restore process?
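For reference, a minimal sketch of such a pg_database_size check (the database name is illustrative):
-- human-readable size of one database; run against both instances and compare
SELECT pg_size_pretty(pg_database_size('mydb'));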

Related

How do I decide where I should locate my TimesTen database files?

I am setting up a TimesTen In-Memory database and I am looking for guidance on the storage and location that I should use for the database's persistence files.
A TimesTen database consists of two types of files: checkpoint files (always two) and transaction log files (always at least one, usually many).
There are 3 criteria to consider:
a) Data safety and availability (regular storage versus RAID). The database files are critical to the operation of the database; if they become inaccessible or are lost or damaged, your database will become inoperable and you will likely lose data. One way to protect against this is to use TimesTen's built-in replication to implement high availability, but even if you do that, you may also want to protect your database files using some form of RAID storage. For performance reasons, RAID-1 is preferred over RAID-5 or RAID-6. Use of NFS storage is not recommended for database files.
b) Capacity. Both checkpoint files are located in the same directory (Datastore attribute) and hence in the same filesystem. Each file can grow to a maximum size of PermSize + ~64 MB. Normally the space for these files is pre-allocated when the files are created, so it is less likely you will run out of space for them. By default, the transaction log files are also located in the same directory as the checkpoint files, though you can (and almost always should) place them in a different location by use of the LogDir attribute. The filesystem where the transaction logs are located should have enough space such that you never run out. If the database is unable to write data to the transaction logs it will stop processing transactions and operations will start to receive errors.
c) Performance. If you are using traditional spinning magnetic media, then I/O contention is a significant concern. The checkpoint files and the transaction log files should be stored on separate devices and separate from any other files that are subject to high levels of I/O activity. I/O loading and contention is less of a consideration for SSD storage and usually irrelevant for PCIe/NVMe flash storage.
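To make this concrete, here is a hedged sketch of how these attributes might be set in a TimesTen DSN definition (sys.odbc.ini); the paths and size are illustrative only:
[sampledb]
# checkpoint files go under DataStore's directory (ideally RAID-1 protected)
DataStore=/disk1/ttdata/sampledb
# place transaction logs on a separate device to avoid I/O contention
LogDir=/disk2/ttlogs
# permanent region size in MB; each checkpoint file can grow to PermSize + ~64 MB
PermSize=4096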

How can I make a cache from a finished Spark job still accessible to other jobs?

My project implements an interactive query for users to explore the data: we have a list of columns the user can choose from; the user adds columns to a list and presses "view data". The data is currently stored in Cassandra and we use Spark SQL to query it.
The data flow is that raw logs are processed by Spark and stored into Cassandra. The data is a time series with more than 20 columns and 4 metrics. Because we currently put more than 20 dimensions into the clustering keys, writes to Cassandra are quite slow.
The idea here is to load all the data from Cassandra into Spark and cache it in memory, then provide an API to the client and run queries against the Spark cache.
But I don't know how to make the cached data persist. I am trying to use spark-job-server, which has a feature called shared objects, but I am not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate the data to query is about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data, Spark first reads from Alluxio (which takes more than 1 minute) and then spills to disk for the shuffle (which costs another 2 or 3 minutes). That exceeds our target of under 1 minute. We tested 1 job on 8 CPU cores.
Storing the data in MemSQL, but it is rather costly: one day of data costs 2 GB of RAM, and we are not sure the speed will hold up when we scale.
Querying Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know is whether my direction is right or not, and what I can change to achieve the goal: queries like MySQL, with a lot of GROUP BY, SUM, and ORDER BY, returned to the client through an API.
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long-running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because the data will stay cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context; otherwise, Spark will save a large portion of the data on disk, which would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
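A minimal PySpark sketch of that pattern (this assumes the spark-cassandra-connector is available; keyspace, table, and column names are illustrative):
# load the raw data from Cassandra once, in the long-running context
df = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(keyspace="logs", table="events").load()

# register the DataFrame as a temp table and cache it
df.registerTempTable("events")
sqlContext.cacheTable("events")

# any later job submitted to the same context can query the cached table
result = sqlContext.sql(
    "SELECT dim1, SUM(metric1) AS total FROM events GROUP BY dim1 ORDER BY dim1")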

Extremely poor performance with Tableau + Spark + Cassandra

Currently I am investigating the possibility of using Cassandra in combination with Spark and Tableau for data analysis. However, the performance that I am currently experiencing with this setup is so poor that I cannot imagine using it for production purposes. Given everything I have read about how great the performance of the Cassandra + Spark combination should be, I am obviously doing something wrong, yet I cannot find out what.
My test data:
All data is stored on a single node
Queries are performed on a single table with 50MB (interval data)
Columns used in selection criteria have an index on it
My test setup:
MacBook 2015, 1.1 GHz, 8GB memory, SSD, OS X El Capitan
Virtual Box, 4GB memory, Ubuntu 14.04
Single node with DataStax Enterprise 4.8.4:
Apache Cassandra 2.1.12.1046
Apache Spark 1.4.2.2
Spark Connector 1.4.1
Apache Thrift 0.9.3
Hive Connector 0.2.11
Tableau (Connected through ODBC)
Findings:
When a change in Tableau requires loading data from the database, it takes anywhere between 40 seconds and 1.4 minutes to retrieve the data (which is basically unworkable)
When I use Tableau in combination with Oracle instead of Cassandra + Spark, but on the same virtual box, I get the results almost instantaneously
Here is the table definition used for the queries:
CREATE TABLE key.activity (
interval timestamp,
id bigint,
activity_name text,
begin_ts timestamp,
busy_ms bigint,
container_code text,
duration_ms bigint,
end_location_code text,
end_ts timestamp,
pallet_code text,
src_location_code text,
start_location_code text,
success boolean,
tgt_location_code text,
transporter_name text,
PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"ALL"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activity_activity_name_idx ON key.activity (activity_name);
CREATE INDEX activity_success_idx ON key.activity (success);
CREATE INDEX activity_transporter_name_idx ON key.activity (transporter_name);
Here is an example of a query produced by Tableau:
INFO 2016-02-10 20:22:21 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Running query 'SELECT CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END AS `calculation_185421691185008640`,
AVG(CAST(`activity`.`busy_ms` AS DOUBLE)) AS `avg_busy_ms_ok`,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT) AS `qr_interval_ok`,
`activity`.`transporter_name` AS `transporter_name`,
YEAR(`activity`.`interval`) AS `yr_interval_ok`
FROM `key`.`activity` `activity`
GROUP BY CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT),
`activity`.`transporter_name`,
YEAR(`activity`.`interval`)'
Here is an example of the statistics for a query that took 52 seconds to complete:
[Screenshot: Spark statistics for the 52-second query]
I've tried playing around with the partition keys as mentioned in other posts, but did not see a significant difference. I've also tried to enable row caching (Cassandra config + table property), but this also did not have any effect (although perhaps I have overlooked something there).
I would have expected at least a factor of 10x-20x better performance out of the box, even without fiddling around with all these parameters, and I've run out of ideas about what to do.
What am I doing wrong? What performance should I expect?
Answering your questions will not be easy due to the variables that you do not define in your post. You mention data that is stored on one node, which is fine, but you don't describe how you have structured your tables/column families. You also don't mention the Cassandra cache hit ratios. And you have to consider Cassandra compaction: if compaction is running during heavy read/write operations, it will slow things down.
You also appear to have a single SSD, in which case the data directory, commitlogs, and cache directories are on the same physical drive. Even though it is not a spinning disc, you will see degraded performance unless you split the data dir from the commitlog/cache directories. I saw a 50% increase in performance by splitting the data dir onto its own physical SSD.
Lastly, you're running in a VM on a laptop host, in VirtualBox no less. Your largest bottleneck here is the 1.1 GHz CPU. In my Cassandra environments on VMware, while running medium jobs, I see almost 99% CPU use across 4 x 2 cores with 16 GB of RAM. My data dirs are on SSDs while my commitlogs and cache directories are on a magnetic HDD. I get good performance, but I tuned my environments to get to this point, and I accept the latency my non-production environments provide.
Take a look HERE and try to get a better understanding of how Cassandra should be used and how to achieve better performance out of the box. Distributed systems are just that: distributed, and for a reason. They have shared resources that you don't have available on a single machine.
Hope this explains a little more about where you're headed.
EDIT
Your table definition looks fine. Are you using the Tableau Spark connector? Your performance problem is likely on the Cassandra/Spark side of things.
Take a look at this article, which describes a compaction-related problem with reading from the cache. Basically, on Cassandra releases prior to 2.1.2, you lost your cache after a compaction because Cassandra threw the file (and cache) away once the compaction finished. Once you started reading, you immediately got a cache miss and Cassandra went back to disk. This is fixed in releases from 2.1.2 onward. Everything else looks normal with respect to running Spark/Cassandra.
While the query time does seem a little high, there are a few things I see that could cause issues.
I noticed you're using a MacBook. Beautiful computer, but not ideal for Spark. I believe those use the dual-core Intel M processors. If you go to your Spark Master UI, it'll show you the available cores. It might show 4 (including vCPUs).
The way you are running this query doesn't allow for much parallelism (if any). You basically don't get the advantages of Spark in this case because you're running in an extremely small VM on a single node (with limited CPUs). Visualization tools haven't really caught up with Spark yet.
One other thing to keep in mind is that Spark is not designed as an ad hoc query tool. You can think of SparkSQL as an abstraction over proper Spark batch processing. Comparing it to Oracle at this scale won't yield the results you expect. There's a 'minimum' performance threshold that you'll notice with Spark; once you scale data and nodes far enough, time to completion no longer grows linearly with the size of the data, and as you add more data the time to process remains relatively flat.
I suggest trying that query in the Spark SQL REPL (dse spark-sql) and seeing if you get similar times. If you do, then you know that's the best you'll get with your current setup. If Tableau is MUCH slower than the REPL, I'd guess the problem is on their end at that point.
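For example (using an aggregate trimmed from the Tableau-generated query above; the exact prompt may differ):
$ dse spark-sql
spark-sql> SELECT transporter_name, AVG(CAST(busy_ms AS DOUBLE)) FROM key.activity GROUP BY transporter_name;
If that statement alone takes tens of seconds, Tableau isn't the bottleneck.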

How to store Hadoop data into Oracle

My final table is in Hive (HDFS). I have tried:
1) Sqoop
2) SQL*Loader
3) OraOop
The performance of all of them is very discouraging when putting data into the SQL database: I have to load a 1 TB file (1,297,372,920 rows), and 1 GB takes over 8 minutes on a 5-node cluster with Sqoop, OraOop, and SQL*Loader.
Your Sqoop export-to-Oracle speed will be determined by various factors, including data size/characteristics, network performance, and, perhaps most importantly, the target database server's configuration. Since the current release of Sqoop doesn't allow the use of "direct" mode when exporting data to Oracle, the optimizations available in this use case are limited. I'd strongly encourage you to review the documentation (http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_export_literal) and try to get yourself into a position where you can work with incremental imports/exports if possible, since you're displeased with the latency on your 1 TB dataset. Perhaps go with an initial full load of your entire desired dataset and find a way to only update incrementally from there.
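As a rough sketch, a basic export invocation looks something like this (connection string, credentials, table, and directory are illustrative; tune --num-mappers to what the Oracle server can absorb):
sqoop export \
  --connect jdbc:oracle:thin:@//dbhost:1521/ORCL \
  --username SCOTT -P \
  --table TARGET_TABLE \
  --export-dir /user/hive/warehouse/final_table \
  --num-mappers 8 \
  --batch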

How can I reduce the data fetch time with Mongo on a bigger data size?

We have a collection (name_list) of 30 million 'names'. We are comparing these 30 million records with 4 million 'names' that we fetch from a txt file.
I am using PHP on Linux. I created an index on the 'names' field, and I am using a simple 'find' to compare the txt file's data against MongoDB:
$collection->findOne(array('names' => $name_from_txt))
I am comparing one by one. I know joins are not possible in MongoDB. Is there any better method to compare the data in MongoDB?
The OS and other details are as follows:
OS : Ubuntu
Kernel Version : 3.5.0-23-generic
64 bit
MongoDB shell version: 2.4.5
CPU info - 24
Memory - 64G
Disks - 3, of which Mongo is written to a 320 GB Fusion-io disk
File system on mongo disk - ext4 with noatime as mentioned in mongo doc
ulimit settings for mongo changed to 65000
readahead is 32
numa is disabled with --interleave option
When I use a script to do this comparison, it takes around 5 minutes to complete. What can be done so that it executes faster and finishes in, say, 1-2 minutes? Can anyone help, please?
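For what it's worth, one commonly suggested variation on the loop described above is to batch the lookups with $in rather than issuing one findOne per name; here is a hedged sketch in Python/pymongo, with illustrative names and batch size:
from pymongo import MongoClient

coll = MongoClient()["mydb"]["name_list"]  # "mydb" is an illustrative name
BATCH = 1000                               # illustrative batch size

def read_batches(path, size):
    # yield lists of names read from the txt file
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line.strip())
            if len(batch) == size:
                yield batch
                batch = []
    if batch:
        yield batch

matched = set()
for names in read_batches("names.txt", BATCH):
    # one indexed $in query per batch instead of one findOne call per name
    for doc in coll.find({"names": {"$in": names}}, {"names": 1}):
        matched.add(doc["names"])
Whether this reaches the 1-2 minute target depends on the index and working set fitting in RAM.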
