MonetDB: !FATAL: BBPextend: trying to extend BAT pool beyond the limit (16384000)

Our monetdbd instance throws the error "!FATAL: BBPextend: trying to extend BAT pool beyond the limit (16384000)" after restarting from a normal shutdown (monetdbd start farm works, monetdb start database fails with the given error).
The database contains fewer than 10 tables, and each table has between 3 and 22 fields. The overall database size is about 16 GB, and a table with 5 fields (3 ints, 1 bigint, 1 date) has 450 million rows.
Does anyone have an idea how to solve this problem without losing the data?
monetdbd --version
MonetDB Database Server v1.7 (Jan2014-SP1)
Server details:
Ubuntu 13.10 (GNU/Linux 3.11.0-19-generic x86_64)
12 Core CPU (hexacore + HT): Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz
24 GB Ram
2x 120 GB SSD, Software-Raid 1, LVM
Further details:
# wc BBP.dir: "240 10153 37679 BBP.dir"

It sounds strange. What OS and hardware platform?
Are you accidentally using a 32-bit Windows version?
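If it helps to rule out the 32-bit question, a quick way to check the build is sketched below (the binary name and lookup are assumptions; adjust to your installation):
# Check whether the MonetDB server binary is a 64-bit build
file "$(which mserver5)"
# Compare with the kernel architecture
uname -m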

Related

Performance of Postgres-XL 9.5 cluster vs single PostgreSQL 9.5

I use a VMware environment to compare the performance of Postgres-XL 9.5 and PostgreSQL 9.5.
I built the Postgres-XL cluster following the instructions in Creating a Postgres-XL cluster.
Physical HW:
M/B: Gigabyte H97M-D3H
CPU: Intel i7-4790 @ 3.60GHz
RAM: 32GB DDR3 1600
HD: 2.5" Seagate SSHD ST1000LM014 1TB
Infra:
VMWare ESXi 6.0
VM:
DB00~DB05:
CPU: 1 core, limited to 2000 MHz
RAM: 2 GB, limited to 2 GB
HD: 50GB
Advanced CPU Hyperthread mode: any
OS: Ubuntu 16.04 LTS x64 (all packages upgraded to the current version with apt update; apt upgrade)
PostgreSQL 9.5+173 on DB00
Postgres-XL 9.5r1.2 on DB01~DB05
userver: (for executing pgbench)
CPU: 2 cores,
RAM: 4GB,
HD: 50GB
OS: Ubuntu 14.04 LTS x64
Role:
DB00: Single PostgreSQL
DB01: GTM
DB02: Coordinator Master
DB03~DB05: datanode master dn1~dn3
postgresql.conf in DB01~DB05
shared_buffers = 128MB
dynamic_shared_memory_type = posix
max_connections = 300
max_prepared_transactions = 300
hot_standby = off
# Others are default values
postgresql.conf of DB00 is
max_connections = 300
shared_buffers = 128MB
max_prepared_transactions = 300
dynamic_shared_memory_type = sysv
#Others are default values
On userver:
pgbench -h db00 -U postgres -i -s 10 -F 10 testdb;
pgbench -h db00 -U postgres -c 30 -t 60 -j 10 -r testdb;
pgbench -h db02 -U postgres -i -s 10 -F 10 testdb;
pgbench -h db02 -U postgres -c 30 -t 60 -j 10 -r testdb;
I confirmed that all pgbench_* tables are evenly distributed among dn1~dn3 in Postgres-XL.
pgbench results:
Single PostgreSQL 9.5: (DB00)
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 30
number of threads: 10
number of transactions per client: 60
number of transactions actually processed: 1800/1800
tps = 1263.319245 (including connections establishing)
tps = 1375.811566 (excluding connections establishing)
statement latencies in milliseconds:
0.001084 \set nbranches 1 * :scale
0.000378 \set ntellers 10 * :scale
0.000325 \set naccounts 100000 * :scale
0.000342 \setrandom aid 1 :naccounts
0.000270 \setrandom bid 1 :nbranches
0.000294 \setrandom tid 1 :ntellers
0.000313 \setrandom delta -5000 5000
0.712935 BEGIN;
0.778902 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
3.022301 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
3.244109 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
7.931936 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
1.129092 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
4.159086 END;
Postgres-XL 9.5
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 10
query mode: simple
number of clients: 30
number of threads: 10
number of transactions per client: 60
number of transactions actually processed: 1800/1800
tps = 693.551818 (including connections establishing)
tps = 705.965242 (excluding connections establishing)
statement latencies in milliseconds:
0.003451 \set nbranches 1 * :scale
0.000682 \set ntellers 10 * :scale
0.000656 \set naccounts 100000 * :scale
0.000802 \setrandom aid 1 :naccounts
0.000610 \setrandom bid 1 :nbranches
0.000553 \setrandom tid 1 :ntellers
0.000536 \setrandom delta -5000 5000
0.172587 BEGIN;
3.540136 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.631834 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
6.741206 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
17.539502 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.974308 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
10.475378 END;
My question is: why are Postgres-XL's TPS and other metrics (such as INSERT and UPDATE latencies) so much worse than those of PostgreSQL? I thought Postgres-XL's performance should be better than PostgreSQL's, shouldn't it?
Postgres-XL is designed to run on multiple physical nodes. Running it on VMWare is a good educational exercise but should not be expected to show any performance gain. You are adding virtualization overhead and the overhead of the clustering software. The web page test from joyeu’s answer used 4 physical machines. Assuming that the performance increase quoted over a single node is based on the same machine you would read this as 4 times the hardware for a 2.3x increase in performance.
Maybe you should try a larger "Scale" value.
I got a similar result to yours.
And then I found this webpage from Postgres-XL official site:
http://www.postgres-xl.org/2016/04/postgres-xl-9-5-r1-released/
It says:
Besides proving its mettle on Business Intelligence workloads,
Postgres-XL has performed remarkably well on OLTP workloads when
running pgBench (based on TPC-B) benchmark. In a 4-Node (Scale: 4000)
configuration, compared to PostgreSQL, XL gives up to 230% higher TPS
(-70% latency comparison) for SELECT workloads and up to 130% (-56%
latency comparison) for UPDATE workloads. Yet, it can scale much, much
higher than even the largest single node server.
So I guess Postgres-XL performs well at larger data sizes.
I will run a test to confirm this right now.
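For reference, a larger-scale run against the coordinator could look like the sketch below; the scale factor 4000 is taken from the announcement quoted above, and initializing at that scale needs considerable disk space and time:
pgbench -h db02 -U postgres -i -s 4000 testdb;
pgbench -h db02 -U postgres -c 30 -j 10 -t 60 -r testdb;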
Postgres-XL is a clustered server. Individual transactions will always be slightly slower on it, but because it can scale up to massive clusters, it can process MUCH more data concurrently, which lets it handle large data sets much faster.
Also, performance varies WIDELY depending on what configuration options you use.
From your test specs:
Physical HW:
M/B: Gigabyte H97M-D3H
CPU: Intel i7-4790 @ 3.60GHz
RAM: 32GB DDR3 1600
HD: 2.5" Seagate SSHD ST1000LM014 1TB <-----
Using a single disk will likely introduce a bottleneck and slow down your performance. You are effectively dividing the same read/write bandwidth by four, considering that the GTM, the coordinator, and the data nodes all access and spool data on it.
Whatever people say about performance gaps introduced by the hypervisor, databases are disk-intensive applications, not memory/CPU-intensive ones; this means they are well suited to virtualization, provided you distribute the workload appropriately across disk groups. Obviously, use a preallocated disk or you will really slow down the inserts.
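A preallocated (eager-zeroed) virtual disk can be created from the ESXi shell roughly as in the sketch below; the datastore path, VM name, and size are placeholders, and options may vary by ESXi version:
# Create a 50 GB eager-zeroed thick disk for one of the data-node VMs
vmkfstools -c 50G -d eagerzeroedthick /vmfs/volumes/datastore1/DB03/DB03_data.vmdk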

Extremely poor performance with Tableau + Spark + Cassandra

Currently I am investigating the possibility of using Cassandra in combination with Spark and Tableau for data analysis. However, the performance I am currently experiencing with this setup is so poor that I cannot imagine using it for production purposes. Given everything I read about how great the performance of Cassandra + Spark is supposed to be, I am obviously doing something wrong, yet I cannot figure out what.
My test data:
All data is stored on a single node
Queries are performed on a single table with 50MB (interval data)
Columns used in selection criteria have an index on it
My test setup:
MacBook 2015, 1.1 GHz, 8GB memory, SSD, OS X El Capitan
Virtual Box, 4GB memory, Ubuntu 14.04
Single node with DataStax Enterprise 4.8.4:
Apache Cassandra 2.1.12.1046
Apache Spark 1.4.2.2
Spark Connector 1.4.1
Apache Thrift 0.9.3
Hive Connector 0.2.11
Tableau (Connected through ODBC)
Findings:
When a change in Tableau requires loading data from the database, it takes anywhere between 40s and 1.4 mins. to retrieve the data (which is basically unworkable)
When I use Tableau in combination with Oracle instead of Cassandra + Spark, but on the same virtual box, I get the results almost instantaneously
Here is the table definition used for the queries:
CREATE TABLE key.activity (
interval timestamp,
id bigint,
activity_name text,
begin_ts timestamp,
busy_ms bigint,
container_code text,
duration_ms bigint,
end_location_code text,
end_ts timestamp,
pallet_code text,
src_location_code text,
start_location_code text,
success boolean,
tgt_location_code text,
transporter_name text,
PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"ALL"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activity_activity_name_idx ON key.activity (activity_name);
CREATE INDEX activity_success_idx ON key.activity (success);
CREATE INDEX activity_transporter_name_idx ON key.activity (transporter_name);
Here is an example of a query produced by Tableau:
INFO 2016-02-10 20:22:21 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Running query 'SELECT CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END AS `calculation_185421691185008640`,
AVG(CAST(`activity`.`busy_ms` AS DOUBLE)) AS `avg_busy_ms_ok`,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT) AS `qr_interval_ok`,
`activity`.`transporter_name` AS `transporter_name`,
YEAR(`activity`.`interval`) AS `yr_interval_ok`
FROM `key`.`activity` `activity`
GROUP BY CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT),
`activity`.`transporter_name`,
YEAR(`activity`.`interval`)'
Here is an example of the statistics for a 52-second query:
(screenshot: Spark statistics for a query that took 52 seconds to complete)
I've tried playing around with the partition keys as mentioned in other posts, but did not see a significant difference. I've also tried to enable row caching (Cassandra config + table property), but this also did not have any effect (although perhaps I have overlooked something there).
I would have expected at least 10x-20x better performance out of the box, even without fiddling with all these parameters, and I've run out of ideas about what to try.
What am I doing wrong? What performance should I expect?
Answering your questions will not be easy due to the variables you do not define in your post. You mention that the data is stored on one node, which is fine, but you don't describe how you have structured your tables/column families. You also don't mention the Cassandra cache hit ratios. You also have to consider Cassandra compaction: if compaction is running during heavy read/write operations, it will slow things down.
You also appear to have a single SSD, in which case the data directory, the commitlogs, and the cache directories sit on the same physical drive. Even though it is not a spinning disk, you will see degraded performance unless you split the data dir from the commitlog/cache directories. I saw a 50% increase in performance by splitting the data dir onto its own physical SSD.
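In cassandra.yaml that split looks roughly like the sketch below; the mount points are assumptions and should point at separate physical devices:
# cassandra.yaml -- example layout, paths are placeholders
data_file_directories:
    - /mnt/ssd1/cassandra/data
commitlog_directory: /mnt/ssd2/cassandra/commitlog
saved_caches_directory: /mnt/ssd2/cassandra/saved_caches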
Also, you're running in a VM on a laptop host, in VirtualBox no less. Your largest bottleneck here is the 1.1 GHz CPU. In my Cassandra environments on VMware, while running medium jobs I see almost 99% CPU use across 4 x 2 cores with 16 GB RAM. My data dirs are on SSDs, while my commitlogs and cache directories are on a magnetic HDD. I get good performance, but I tuned my environments to get to this point, and I accept the latency my non-production environments provide.
Take a look HERE and try to get a better understanding of how Cassandra should be used and how to achieve better performance out of the box. Distributed systems are just that: distributed, and for a reason. They give you shared resources that you don't have available on a single machine.
Hope this explains a little more about where you're headed.
EDIT
Your table definition looks fine. Are you using the Tableau Spark connector? Your performance problem is likely on the Cassandra/Spark side of things.
Take a look at this article, which describes a compaction-related problem when reading from the cache. Basically, on Cassandra releases prior to 2.1.2, you lose your cache after a compaction because Cassandra throws the file (and its cache) away once the compaction finishes. As soon as you start reading, you immediately get a cache miss and Cassandra goes back to disk. This is fixed in releases from 2.1.2 onward. Everything else looks normal with respect to running Spark/Cassandra.
While the query time does seem a little high, there are a few things I see that could cause issues.
I noticed you're using a MacBook. Beautiful computer, but not ideal for Spark. I believe those use the dual-core Intel M processors. If you go to your Spark Master UI, it'll show you the available cores. It might show 4 (to include vCPUs).
The nature in which you are running this query doesn't allow for much parallelism (if any). You basically don't get the advantages of Spark in this case because you're running in an extremely small VM on a single node (with limited CPUs). Visualization tools haven't really caught up to Spark yet.
One other thing to keep in mind is that Spark is not designed as an 'ad hoc query' tool. You can think of SparkSQL as an abstraction over proper Spark batch processing. Comparing it to Oracle at this scale won't yield the results you expect. There's a 'minimum' performance threshold that you'll notice with Spark. Once you scale data and nodes far enough, you'll start to see that time to completion is not linear in the size of the data: as you add more data, the time to process remains relatively flat.
I suggest trying that query in the Spark SQL REPL (dse spark-sql) to see if you get similar times. If you do, then you know that's the best you'll get with your current setup. If Tableau is MUCH slower than the REPL, I'd guess it's something on their end at that point.
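For instance, something along these lines from the DSE node's shell (a sketch only: the trimmed-down query is mine, not Tableau's, and CLI behaviour may differ per DSE version):
dse spark-sql
-- then, inside the REPL, a simplified version of the generated query:
SELECT `transporter_name`, AVG(CAST(`busy_ms` AS DOUBLE)) AS `avg_busy_ms`
FROM `key`.`activity`
GROUP BY `transporter_name`;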

YARN is using 100% of its resources when running a Hive job

I'm running a Hive Tez job. The job loads data from a table in text file format into another table in ORC format.
I'm using
INSERT INTO TABLE ORDERREQUEST_ORC
PARTITION(DATE)
SELECT
COLUMN1,
COLUMN2,
COLUMN3,
DATE
FROM ORDERREQUEST_TXT;
While monitoring the job through the Ambari web console, I saw that YARN memory utilization was at 100%.
Can you please advise how to maintain healthy YARN memory?
The load average on all three datanodes:
1. top - 17:37:24 up 50 days, 3:47, 4 users, load average: 15.73, 16.43, 13.52
2. top - 17:38:25 up 50 days, 3:48, 2 users, load average: 16.14, 15.19, 12.50
3. top - 17:39:26 up 50 days, 3:49, 1 user, load average: 11.89, 12.54, 10.49
These are the yarn configurations
yarn.scheduler.minimum-allocation-mb=5120
yarn.scheduler.maximum-allocation-mb=46080
yarn.nodemanager.resource.memory-mb=46080
FYI: my cluster config
Nodes = 4 (1 Master, 3 DN )
memory = 64 GB on each node
Processors = 6 on each node
1 TB on each node (5 Disk * 200 GB)
How can I reduce the YARN memory utilization?
You are seeing this because the cluster hasn't been configured to cap the YARN memory allocated per user.
Set the property below in the YARN configuration to allocate at most 33% of the queue's memory per user; the value can be adjusted to your requirements.
Change from:
yarn.scheduler.capacity.root.default.user-limit-factor=1
To:
yarn.scheduler.capacity.root.default.user-limit-factor=0.33
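If you are not changing this through Ambari, the property typically lives in capacity-scheduler.xml (the exact file location varies by distribution); a sketch:
<property>
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>0.33</value>
</property>
After editing the file, the queues can be reloaded with yarn rmadmin -refreshQueues.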
If you need further information on this, please refer to the following link:
https://analyticsanvil.wordpress.com/2015/08/16/managing-yarn-memory-with-multiple-hive-users/

Can dumping and restoring a database make it slower?

I have an Amazon RDS Postgres database. I created a snapshot of this database (say database-A) and then restored the snapshot onto a new DB instance (say database-B). database-A was an 8 GiB machine with 2 cores; database-B is a 3.75 GiB machine with 1 core.
I find the following:
The storage occupied by database-B is greater than that occupied by database-A. I found the occupied storage using pg_database_size.
Queries are slower on database-B than they were on database-A.
Are these two things possible in a normal scenario, or must I have made some mistake during the dump/restore process?
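For reference, the size check mentioned above can be run on both instances like this (the database name is a placeholder):
-- run on database-A and database-B and compare the results
SELECT pg_size_pretty(pg_database_size('mydb'));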

How can I reduce the data fetch time with Mongo on a bigger data size?

We have a collection (name_list) of 30 million 'names'. We are comparing these 30 million records with 4 million 'names'. We are fetching these 4 million 'names' from a txt file.
I am using PHP on a Linux platform. I have an index on the 'names' field. I am using a simple 'find' to compare the data in MongoDB with the data from the txt file:
$collection->findOne(array('names' => $name_from_txt))
I am comparing them one by one. I know a join is not possible in MongoDB. Is there any better method to compare the data in MongoDB?
The OS and other details are as follows.
OS : Ubuntu
Kernel Version : 3.5.0-23-generic
64 bit
MongoDB shell version: 2.4.5
CPU info - 24
Memory - 64G
Disks 3 - out of which mongo is written to a fusion i/o disk of size 320G
File system on mongo disk - ext4 with noatime as mentioned in mongo doc
ulimit settings for mongo changed to 65000
readahead is 32
numa is disabled with --interleave option
When I use a script to run this comparison, it takes around 5 minutes to complete. What can be done so that it executes faster and finishes in, say, 1-2 minutes? Can anyone help, please?
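One approach worth measuring (not from the original thread, just a sketch using the legacy Mongo PHP driver shown above): batch the lookups with $in instead of issuing one findOne() per name, so each round trip checks many names at once.
<?php
// Sketch: batch lookups with $in instead of one findOne() per name.
// Assumes $collection is the same MongoCollection used above and that
// names.txt (placeholder file name) holds one name per line.
$batchSize = 1000;
$names = file('names.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach (array_chunk($names, $batchSize) as $batch) {
    $cursor = $collection->find(
        array('names' => array('$in' => $batch)),
        array('names' => 1)   // projection: only return the indexed field
    );
    foreach ($cursor as $doc) {
        // $doc['names'] exists in both the collection and the txt file
    }
}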
