Too many simultaneous queries in ClickHouse

Our ClickHouse server threw several exceptions when running small queries under peak load:
DB::Exception: Too much simultaneous queries. Maximum: 100
Is there a setting to increase this limit, and what side effects could raising it have?

<max_concurrent_queries>100</max_concurrent_queries>
Just read config.xml https://github.com/ClickHouse/ClickHouse/blob/master/programs/server/config.xml#L237
You probably want a proxy such as HAProxy in front of ClickHouse.
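As a sketch, a minimal HAProxy front end balancing the HTTP interface across two ClickHouse nodes could look like this (host names and ports are placeholders, not anything from the original question):
defaults
    mode http
    timeout connect 5s
    timeout client 300s
    timeout server 300s
frontend clickhouse_http
    bind *:8123
    default_backend clickhouse_nodes
backend clickhouse_nodes
    balance roundrobin
    server ch1 ch1.example.com:8123 check
    server ch2 ch2.example.com:8123 check
A proxy like this also lets you queue or reject excess connections before they ever count against max_concurrent_queries.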

Edit the main ClickHouse config file located at:
/etc/clickhouse-server/config.xml
Find the entry:
<max_concurrent_queries>100</max_concurrent_queries>
Change to:
<max_concurrent_queries>200</max_concurrent_queries>
Restart the ClickHouse Database to apply the configuration changes:
in Ubuntu:
sudo service clickhouse-server restart
The documentation states: max_concurrent_queries - The maximum number of simultaneously processed queries related to MergeTree table.
It does not go into detail on how high of a number you can use.
NOTE: from the documentation:
If you want to adjust the configuration, it’s not handy to directly edit config.xml file, considering it might get rewritten on future package updates. The recommended way to override the config elements is to create files in config.d directory which serve as “patches” to config.xml.
Create a new config file inside this directory /etc/clickhouse-server/config.d/
Example: touch /etc/clickhouse-server/config.d/my_config.xml
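A minimal override file could look like this (the root element is <clickhouse> on recent releases; older versions use <yandex>):
<!-- /etc/clickhouse-server/config.d/my_config.xml -->
<clickhouse>
    <max_concurrent_queries>200</max_concurrent_queries>
</clickhouse>
Restart clickhouse-server afterwards, as above, for the override to take effect.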

Related

Hive: modifying an external table's location takes too long

Hive has two kinds of tables, managed and external; for the difference, you can check Managed vs. External Tables.
Currently, to move an external database from HDFS to Alluxio, I need to change each external table's location to an alluxio:// URI.
The statement is something like: alter table catalog_page set location "alluxio://node1:19998/user/root/tpcds/1000/catalog_returns"
As I understand it, this should be a simple metastore modification; however, for some tables the change takes dozens of minutes. The database itself contains about 1 TB of data, by the way.
Is there any way to speed up the table alter process? If not, why is it so slow? Any comment is welcome, thanks.
I found the suggested way, which is the metatool utility under $HIVE_HOME/bin:
metatool -updateLocation <new-loc> <old-loc>
Update the FS root location in the metastore to the new location. Both new-loc and old-loc should be valid URIs with valid host names and schemes. When run with the dryRun option, changes are displayed but not persisted. When run with the serdepropKey/tablePropKey option, updateLocation looks for the serde-prop-key/table-prop-key that is specified and updates its value if found.
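For example, a dry run of the relocation from the question might look like this (the hdfs:// URI is a placeholder; substitute your actual metastore FS root):
metatool -updateLocation alluxio://node1:19998 hdfs://namenode:8020 -dryRun
Re-run the same command without -dryRun once the reported changes look right.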
By using this tool, the location modification is very fast (maybe several seconds).
I'll leave this thread here for anyone who runs into the same situation.

Way to export all the data from Cassandra cluster to file(s)

I need to export Cassandra schema and data to a file in order to quickly setup identical cluster when needed.
Identical likely means the same topology, same number of nodes and replication factor.
In the case of NetworkTopologyStrategy, a simple file backup / sstable snapshot is not helpful because peer IPs are recorded along with the other data; after restoring on another node, it tries to reach the source cluster's seeds.
I was surprised there is almost no ready solution for such task.
I suppose I have to use DESC SCHEMA;, then parse the output for all the tables, back them up with COPY keyspace.table TO '/backup/keyspace.table.csv';, and later use sstableloader to restore on the other node.
Any better solutions?
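A minimal sketch of that approach, with placeholder host, keyspace, and table names (note that CSV dumps are reloaded with COPY ... FROM; sstableloader works on sstable files rather than CSV):
cqlsh source_host -e "DESC SCHEMA" > /backup/schema.cql
cqlsh source_host -e "COPY mykeyspace.mytable TO '/backup/mykeyspace.mytable.csv' WITH HEADER = true"
# later, on the new cluster:
cqlsh new_host -f /backup/schema.cql
cqlsh new_host -e "COPY mykeyspace.mytable FROM '/backup/mykeyspace.mytable.csv' WITH HEADER = true"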
You can use the solution you've specified.
Or you can use the snapshots option (it looks easier to me). Here are the docs describing how to copy snapshots between clusters:
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html
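Roughly, the snapshot route looks like this (keyspace, table, tag, paths, and host below are illustrative, not from the question):
# on each source node, snapshot the keyspace
nodetool snapshot -t backup1 mykeyspace
# copy <data_dir>/mykeyspace/<table>/snapshots/backup1/ into the matching
# table directory on the new cluster, then load it there with either
nodetool refresh mykeyspace mytable
# or, to stream the files in from outside the new cluster:
sstableloader -d new_node_ip /path/to/mykeyspace/mytable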

Greenplum: gpfdist file serving

I'm running through Greenplum tutorial.
I'm having trouble understanding how gpfdist works.
What does this mean: gpfdist: Serves data files to or writes data files out from Greenplum Database segments.
What does it mean to "serve a file"? I thought it read external tables. Is gpfdist running on both the client and server? How does it work in parallel? Is it calling gpfdist on several hosts, is that how?
I just need help understanding the big picture. In this tutorial http://greenplum.org/gpdb-sandbox-tutorials/ we call it twice, why? (It's confusing because the server and client are on the same machine.)
gpfdist can run on any host. It is basically a lightweight HTTP server (think lighttpd) that you point at a directory; it sits there and listens for connections on the port you specify.
On the Greenplum server/database side, you create an external table definition whose LOCATION clause points at your gpfdist endpoint.
You can then query this table and gpfdist will "serve the file" to the database engine.
Read: http://gpdb.docs.pivotal.io/4380/utility_guide/admin_utilities/gpfdist.html
and http://gpdb.docs.pivotal.io/4380/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
An external table is made up of a few things and the two most important are the location of where to get (or put) data and the other is how to take that data and parse it into something that can be used as table records. When you create the external table you are just creating the definitions of how it should work.
When you execute a query against an external table, only then do the segments go out and do what has been set up in that definition. It should be noted they aren't creating a persistent connection or caching that data. Each time you execute that query, the cluster looks at the definition, moves that data across the wire, and uses it for the duration of that query.
In the case of gpfdist as an endpoint, it is really just a webserver. People frequently run one on an ETL node. When the location is gpfdist and you create a readable external table, each segment will reach out to gpfdist, ask for a chunk of the file, and process it. This is the parallelism: multiple segments reach out to gpfdist, get chunks, parse them into tuples according to what was specified in the table definition, and assemble it all into a table of data for your query.
gpfdist can also be the endpoint for a writable external table. In this case the segments all push the data they have to that remote location, and gpfdist writes what it receives to disk. The thing to note here is that no sort order is promised; the data is written to disk as it is streamed from multiple segments.
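As an illustration, a readable external table backed by a gpfdist endpoint might be defined like this (host, port, columns, and file pattern are placeholders):
gpfdist -d /data/staging -p 8081 -l /tmp/gpfdist.8081.log
CREATE EXTERNAL TABLE ext_sales (sale_id int, amount numeric)
LOCATION ('gpfdist://etl-host:8081/sales_*.csv')
FORMAT 'CSV' (HEADER);
SELECT count(*) FROM ext_sales;
Querying ext_sales makes every segment fetch its share of the matching files from gpfdist and parse them on the fly.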
Yes, gpfdist is a file distribution service used for external tables.
A Greenplum database can directly query a flat file in a directory (Unix or Windows) as if it were a table.
We can select the flat-file data and do further processing; Unicode and special characters can also be handled with a predefined encoding.
The external table concept is built on top of gpfdist.
Syntax to set it up on Windows:
gpfdist -d ${FLAT_FILES_DIR} -p 8081 -l /tmp/gpfdist.8081.log
Just make sure you have gpfdist.exe on your particular source machine.

Is there a way to set a TTL for certain directories in HDFS?

I have the following requirement: I am adding date-wise data to a specific directory in HDFS, and I need to keep only the last 3 sets as a backup and remove the rest. Is there a way to set a TTL on the directory so that the data expires automatically after a certain number of days?
If not, is there a way to achieve a similar result?
This feature is not yet available on HDFS.
There was a JIRA ticket created to support this feature: https://issues.apache.org/jira/browse/HDFS-6382
But, the fix is not yet available.
You need to handle it using a cron job. You can create a job (this could be a simple Shell, Perl or Python script), which periodically deletes the data older than a certain pre-configured period.
This job could:
Run periodically (e.g. once an hour or once a day)
Take the list of folders or files which need to be checked, along with their TTL as input
Delete any file or folder, which is older than the specified TTL.
This can be achieved easily with scripting.
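As an illustration, assuming the data lands in date-named subdirectories such as /data/incoming/2016-01-31 (the base path and retention count below are placeholders), such a cleanup script could look like this:
#!/bin/bash
# Keep only the newest $RETENTION date-named subdirectories under $BASE_DIR (hypothetical values).
BASE_DIR=/data/incoming
RETENTION=3
# List the subdirectories, sort them by their date-named last component,
# and drop everything except the newest $RETENTION entries
# (head -n -N requires GNU coreutils).
hdfs dfs -ls "$BASE_DIR" | awk '{print $NF}' | grep "^$BASE_DIR/" | sort | head -n -"$RETENTION" |
while read -r dir; do
  echo "Removing $dir"
  hdfs dfs -rm -r -skipTrash "$dir"
done
Schedule it from cron at whatever interval suits your ingestion cadence.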

Multidatacenter Replication with Rethinkdb

I have two servers in two different geographic locations (alfa1 and alfa2).
r.tableCreate('dados', {shards:1, replicas:{alfa1:1, alfa2:1}, primaryReplicaTag:'alfa1'})
I need to be able to write to both servers, but when I shut down alfa1 and write to alfa2, RethinkDB only allows reads: Table test.dados is available for outdated reads, but not up-to-date reads or writes.
I need a way to write to all replicas, not only to the primary.
Is this possible? Does RethinkDB allow multi-datacenter replication?
I think multi-datacenter replication needs to permit writes in both datacenters.
I tried removing "primaryReplicaTag", but the system doesn't accept that!
Any help is welcome!
RethinkDB does support multi-datacenter replication/sharding.
I think the problem here is that you've set up a cluster of two, which means that when one fails you only have 50% of the nodes in the cluster, which is less than the majority required.
From the failover docs - https://rethinkdb.com/docs/failover/
To perform automatic failover for a table, the following requirements
must be met:
The cluster must have three or more servers
The table must be configured to have three or more replicas
A majority (greater than half) of replicas for the table must be available
Try adding just one additional server and your problems should be resolved.
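For example, after adding a third server (tagged, say, alfa3; the tag name is hypothetical), the table can be reconfigured to three replicas so that a majority survives the loss of any single node:
r.table('dados').reconfigure({shards: 1, replicas: {alfa1: 1, alfa2: 1, alfa3: 1}, primaryReplicaTag: 'alfa1'})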
