Can I decrease usage of Non-Table Cluster Data on my CockroachDB cluster?

On my CockroachDB cluster, the time series data grows to about 1 GB.
Is there any way to decrease it? Thank you!

Yes, you can control this. By default, CockroachDB stores timeseries data for the last 30 days for display in the Admin UI, but you can reduce the interval for timeseries storage or disable timeseries storage entirely.
Reduce the interval for timeseries storage
To reduce the interval for storage of timeseries data, change the timeseries.storage.resolution_10s.ttl cluster setting to an INTERVAL value less than 720h0m0s (30 days). For example, to store timeseries data for the last 15 days, run the following SET CLUSTER SETTING command:
SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '360h0m0s';
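If you prefer to apply and verify the setting from an application, here is a minimal sketch using Python with psycopg2 (CockroachDB speaks the PostgreSQL wire protocol); the connection string assumes a local, insecure single-node cluster and is only illustrative:
# Minimal sketch: apply and verify the timeseries TTL from Python via psycopg2.
# The connection string (insecure local node on port 26257) is an assumption;
# adjust it for your cluster.
import psycopg2

conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb?sslmode=disable")
conn.set_session(autocommit=True)  # SET CLUSTER SETTING cannot run inside an explicit transaction
cur = conn.cursor()

# Keep only the last 15 days of timeseries data (same statement as above).
cur.execute("SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '360h0m0s'")

# Verify the current value.
cur.execute("SHOW CLUSTER SETTING timeseries.storage.resolution_10s.ttl")
print(cur.fetchone()[0])

cur.close()
conn.close()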
Disable timeseries storage entirely
Note: Disabling timeseries storage entirely is recommended only if you exclusively use a third-party tool such as Prometheus for timeseries monitoring. Prometheus and other such tools do not rely on CockroachDB-stored timeseries data; instead, they ingest metrics exported by CockroachDB from memory and then store the data themselves.
To disable the storage of timeseries data entirely, run the following command:
SET CLUSTER SETTING timeseries.storage.enabled = false;
If you want all existing timeseries data to be deleted, change the timeseries.storage.resolution_10s.ttl cluster setting as well:
SET CLUSTER SETTING timeseries.storage.resolution_10s.ttl = '0s';
Further reference: https://www.cockroachlabs.com/docs/stable/operational-faqs.html#can-i-reduce-or-disable-the-storage-of-timeseries-data

Related

Is it possible for command time on Databricks to increase to almost double without changing the cluster specifications? What could be causing that?

To store data in an S3 bucket from Databricks, I write the following:
df.write.format("delta").mode("overwrite").save("s3://.....")
The code above used to take 3.27 minutes; it now takes 7.37 minutes, using the same cluster configuration and the same data.

Does Prometheus ensure persistence of infrastructure time series data?

Prometheus uses TSDB, an in-memory database (written in Go) that provides temporary storage for infrastructure time series metrics.
Prometheus is supposed to provide time series metrics, which need permanent storage; for example, CPU metrics of a host for the last month.
How does Prometheus handle persistence (permanent storage) of time series data, given that TSDB is an in-memory database?
How is Cortex storage different from TSDB?
The Prometheus TSDB is part in-memory, part on-disk. Recent data is kept in memory and backed up on disk in WAL (write-ahead log) segments, as it is the most frequently accessed. If the instance is shut down, the in-memory data can be restored from the WAL. After a few hours, the in-memory data is written to disk as persistent blocks. Therefore, all data is persisted until its retention period expires.
Some more good resources can be found in the TSDB README: https://github.com/prometheus/prometheus/blob/main/tsdb/README.md.
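One way to see this persistence in action is to query an old timestamp through the HTTP API and confirm it is still served from the on-disk blocks (subject to the retention period, which is controlled by the --storage.tsdb.retention.time flag). A minimal sketch, assuming a Prometheus server at localhost:9090, the Python requests library, and the built-in up metric:
# Minimal sketch: ask Prometheus for the value of a metric one week in the past.
# Data that old has long left the in-memory head, so a successful answer is
# served from the persistent on-disk blocks.
# Assumes a Prometheus server at localhost:9090; adjust the URL for your setup.
import time
import requests

one_week_ago = time.time() - 7 * 24 * 3600
resp = requests.get(
    "http://localhost:9090/api/v1/query",
    params={"query": "up", "time": one_week_ago},
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])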

Multi data center for ClickHouse

Does ClickHouse support a multi-master or multi-data-center setup?
Are there any other solutions for multi-data-center replication for ClickHouse?
CH is multi-master only.
CH is multi/geo DC out of the box. There are many users with cross-ocean DCs.
The only requirement is proper latency for Replicated* engines.
All ZK nodes should be in the same DC, or in DCs with latency < 50 ms. CH loading nodes (the ones that ingest data) should be as close as possible to ZK (ideally < 100 ms). Non-loading replicas can be far away -- 150-250 ms.
A cross-ocean setup needs proper configuration of load balancing to run queries on local-DC replicas, and tuning of some params (connect_timeout_with_failover_ms -- 50 ms by default).
Yes, ClickHouse can be set up as multi-DC.
Please read about the Distributed engine:
https://clickhouse.yandex/docs/en/table_engines/distributed/
Also look at the load_balancing setting:
https://clickhouse.yandex/docs/en/operations/settings/settings/#settings-load_balancing
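To make the Replicated*/Distributed combination above concrete, here is a minimal sketch issued from Python with the clickhouse-driver package; the cluster name multi_dc, the database/table/column names, and the ZooKeeper path are assumptions for illustration, and the {shard}/{replica} macros are expected to come from each server's configuration:
# Minimal sketch: a replicated local table plus a Distributed table over it,
# created from Python with clickhouse-driver. The cluster name 'multi_dc',
# database 'metrics', and the schema are assumptions for illustration only.
from clickhouse_driver import Client

client = Client("clickhouse-node-1.example.com")  # any node of the cluster

client.execute("CREATE DATABASE IF NOT EXISTS metrics ON CLUSTER multi_dc")

# Local, replicated storage on every replica; {shard} and {replica} are macros
# defined in each server's configuration.
client.execute("""
    CREATE TABLE IF NOT EXISTS metrics.events_local ON CLUSTER multi_dc
    (
        ts        DateTime,
        sensor_id UInt64,
        value     Float64
    )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events', '{replica}')
    PARTITION BY toYYYYMM(ts)
    ORDER BY (sensor_id, ts)
""")

# Cluster-wide table that fans reads/writes out to the local tables.
client.execute("""
    CREATE TABLE IF NOT EXISTS metrics.events ON CLUSTER multi_dc
    AS metrics.events_local
    ENGINE = Distributed(multi_dc, metrics, events_local, rand())
""")

# Prefer replicas close to this node when querying (see the load_balancing setting).
print(client.execute(
    "SELECT count() FROM metrics.events",
    settings={"load_balancing": "nearest_hostname"},
))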

HBase time series data distribution

I have an HBase table with the following fields.
Row Key -
UniqueID#SensorID#Timestamp
Columns -
sensor_id
Metric
value
timestamp
I am pushing data into this HBase table every minute. My HBase cluster includes two region servers for storing data.
Is there any way to distribute the data for a particular time duration to a particular region server, and to add new data nodes to the cluster as the data grows?
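For reference, a minimal sketch of writing rows with that composite key from Python via the happybase Thrift client; the table name, the column family 'd', and the salt-bucket prefix (one common way to spread monotonically increasing time-series keys across region servers, paired with pre-splitting the table on the bucket prefixes) are assumptions for illustration:
# Minimal sketch: write one reading with a salted UniqueID#SensorID#Timestamp row
# key through happybase. Table name, column family 'd', and the number of salt
# buckets are assumptions; salting plus pre-splitting the table on the bucket
# prefixes is one common way to spread sequential time-series keys across
# region servers.
import hashlib
import time

import happybase

NUM_BUCKETS = 2  # e.g. roughly one bucket per region server


def make_row_key(unique_id: str, sensor_id: str, ts_millis: int) -> bytes:
    # Deterministic hash so the same series always lands in the same bucket.
    digest = hashlib.md5(f"{unique_id}#{sensor_id}".encode()).hexdigest()
    bucket = int(digest, 16) % NUM_BUCKETS
    return f"{bucket:02d}#{unique_id}#{sensor_id}#{ts_millis}".encode()


connection = happybase.Connection("hbase-thrift-host")  # assumed Thrift gateway
table = connection.table("sensor_readings")

ts_millis = int(time.time() * 1000)
table.put(
    make_row_key("device-42", "temp-1", ts_millis),
    {
        b"d:sensor_id": b"temp-1",
        b"d:metric": b"temperature",
        b"d:value": b"21.5",
        b"d:timestamp": str(ts_millis).encode(),
    },
)
connection.close()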

How can data cached by a finished Spark job remain accessible to other jobs?

My project implements interactive queries that let users explore the data: the user picks from a list of columns, adds them to a selection, and presses "view data". The data is currently stored in Cassandra, and we use Spark SQL to query it.
The data flow is: raw logs are processed by Spark and stored into Cassandra. The data is time series with more than 20 columns and 4 metrics. In my tests, with more than 20 dimensions used as clustering keys, writes to Cassandra are quite slow.
The idea is to load all the data from Cassandra into Spark, cache it in memory, and expose an API to clients that runs queries against the Spark cache.
But I don't know how to keep that cached data persistent. I tried spark-jobserver, which has a feature called shared objects, but I'm not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate the data to query is about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data, Spark first reads from Alluxio (more than 1 minute) and then writes to disk for the shuffle (another 2-3 minutes). That exceeds our target of under 1 minute. We tested one job on 8 CPU cores.
Storing the data in MemSQL, but it is rather costly: one day of data costs 2 GB of RAM, and we're not sure the speed will hold up as we scale.
Querying Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know: is this direction right or not? What can I change to achieve the goal (MySQL-like queries with lots of GROUP BY, SUM, and ORDER BY) and return the results to the client through an API?
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
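As a rough illustration of that approach, a PySpark sketch in the Spark 1.x style used by the answer; the keyspace/table names and the use of the spark-cassandra-connector data source are assumptions about the setup:
# Minimal sketch: load the Cassandra table once in a long-running context, cache it
# as a temp table, and serve aggregation queries against the cache from later jobs
# submitted to the SAME context. Keyspace/table names are placeholders; assumes the
# spark-cassandra-connector package is on the classpath.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="interactive-query-cache")
sqlContext = SQLContext(sc)

# Job 1 (run once when the context starts): load and cache the data.
df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="raw_events")  # assumed names
      .load())
df.registerTempTable("events")
sqlContext.cacheTable("events")  # kept in memory for the lifetime of the context

# Jobs 2..N (any later job on the same context): query the cached table.
result = sqlContext.sql("""
    SELECT sensor_id, SUM(value) AS total
    FROM events
    GROUP BY sensor_id
    ORDER BY total DESC
""")
result.show()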
