Can we delete older records from memsql table automatically?

We are using MemSQL Streamliner for our real-time data pipeline as part of a POC. Is there any setting in MemSQL through which we can specify a rule to automatically delete data older than one week?

No. This logic must be performed elsewhere, in an application, cron job, etc.
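A minimal sketch of what such an external cron job could run, assuming a hypothetical `events` table with a `created_at` timestamp column (names are placeholders, not from the original question):

```python
from datetime import datetime, timedelta

def build_purge_statement(table, ts_column, retention_days, now=None):
    """Build a DELETE statement that removes rows older than the
    retention window. `table` and `ts_column` are hypothetical names;
    adapt them to your schema before executing via your MySQL client."""
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=retention_days)
    return "DELETE FROM {} WHERE {} < '{}'".format(
        table, ts_column, cutoff.strftime("%Y-%m-%d %H:%M:%S"))

# Example: purge events older than 7 days.
stmt = build_purge_statement("events", "created_at", 7,
                             now=datetime(2024, 1, 8, 12, 0, 0))
print(stmt)  # DELETE FROM events WHERE created_at < '2024-01-01 12:00:00'
```

A cron entry would then execute this statement once a day through any MySQL-compatible client, since MemSQL speaks the MySQL protocol.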

Related

How to delete empty partitions in cratedb?

CrateDB 4.x.x
We have one table that is partitioned by day. We take a snapshot of the table based on that partition, and after taking the backup we delete that day's data.
Because of the many partitions, the shard count is more than 2000 (the configured shards per partition is 6).
I have observed that old partitions contain no data but still exist in the database, so after restarting CrateDB it takes more time for the cluster to become healthy and available for writes.
Is there any way to delete those partitions?
Is there any way to stop replication of data on cluster startup? It takes too much time for the cluster to become healthy, and the table is not writable until that process finishes.
Any solution for this issue would be a great help.
You should be able to delete empty partitions with a DELETE that matches exactly on the partitioned-by column, e.g. DELETE FROM <tbl> WHERE <partitioned_by_column> = <value>
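A sketch of how a cleanup script could generate one such DELETE per stale day-partition, keeping only the newest few. Table name, column name, and partition values are illustrative, not from the original question:

```python
def partition_delete_statements(table, part_col, partition_values, keep):
    """Return one DELETE per day-partition, for all but the newest
    `keep` partitions. An exact match on the partitioned-by column
    lets CrateDB drop the whole partition."""
    old = sorted(partition_values)[:-keep] if keep else sorted(partition_values)
    return ["DELETE FROM {} WHERE {} = '{}'".format(table, part_col, v)
            for v in old]

# Keep the newest partition; generate DELETEs for the two older ones.
stmts = partition_delete_statements(
    "t1", "day", ["2024-01-03", "2024-01-01", "2024-01-02"], keep=1)
for s in stmts:
    print(s)
```

Each generated statement would then be executed against the cluster (e.g. via the crash shell or the HTTP endpoint).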

Contents of elasticsearch snapshot

We are going to be using the snapshot API for blue green deployment of our cluster. We want to snapshot the existing cluster, spin up a new cluster, restore the data from the snapshot. We also need to apply any changes to the existing cluster data to our new cluster (before we switchover and make the new cluster live).
The thinking is that we can index data from our database that changed after the timestamp at which the snapshot was created, to ensure that any writes that happened to the running live cluster get applied to the new cluster (the new cluster only has the data restored from the snapshot). My question is: what timestamp should we use? The snapshot API has start_time and end_time values for a given snapshot, but I am not certain that end_time in this context means "all data modified up to this point". I suspect it is just a marker to tell you how long the operation took, but I may be wrong.
Does anyone know how to find out what a snapshot contains? Can we use end_time as a marker to know that the snapshot contains all data modifications before that date?
Thanks!
According to the documentation:
"Snapshotting process is executed in non-blocking fashion. All indexing and searching operation can continue to be executed against the index that is being snapshotted. However, a snapshot represents the point-in-time view of the index at the moment when snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot."
You will need to use start_time or start_time_in_millis.
Because snapshots are incremental, you can create a first full snapshot and then one more snapshot right after the first one finishes; the second will be almost instant.
One more question: why build functionality that Elasticsearch already provides? If you can run both clusters at the same time, you can merge them, let them sync, switch write queries to the new cluster, and gradually disconnect the old servers from the merged cluster, leaving only the new ones.
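A small sketch of picking the reindex cutoff from a snapshot's info document, following the point made above that the snapshot reflects the moment it was *started*, so start_time_in_millis (not end_time) is the safe boundary. The repository and snapshot names are assumptions:

```python
from datetime import datetime, timezone

def reindex_cutoff(snapshot_info):
    """Derive the 'reindex everything modified after this' timestamp
    from a snapshot info document, as returned by
    GET _snapshot/<repo>/<snapshot>. Uses start_time_in_millis because
    the snapshot is a point-in-time view taken when it started."""
    millis = snapshot_info["start_time_in_millis"]
    return datetime.fromtimestamp(millis / 1000.0, tz=timezone.utc)

info = {"snapshot": "nightly", "start_time_in_millis": 1704067200000}
print(reindex_cutoff(info))  # 2024-01-01 00:00:00+00:00
```

Any database rows modified at or after this cutoff would then be re-indexed into the restored cluster before switchover.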

how to manage modified data in Apache Hive

We are working on Cloudera CDH and trying to perform reporting on data stored in Apache Hadoop. We send daily reports to the client, so we need to import data from the operational store into Hadoop daily.
Hadoop works in append-only mode, so we cannot perform Hive UPDATE/DELETE queries. We can perform INSERT OVERWRITE on dimension tables and add delta values to the fact tables, but introducing thousands of delta rows daily does not seem like a good solution.
Are there any other, better standard ways to update modified data in Hadoop?
Thanks
HDFS might be append only, but Hive does support updates from 0.14 on.
see here:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Update
A design pattern is to take all your previous and current data and insert it into a new table every time.
Depending on your use case, have a look at Apache Impala, HBase, ... or even Drill.
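The rewrite-into-a-new-table pattern mentioned above can be sketched as query generation. This is only an illustration, assuming a hypothetical dimension table with an `id` key and a single `value` column, where delta rows supersede base rows:

```python
def rebuild_query(target, base_table, delta_table, key):
    """Sketch of the 'insert previous plus current data into a new
    table' pattern: full-outer-join the base table with the daily delta
    and let delta rows win on the key. All table and column names here
    are placeholders, not from the original question."""
    return (
        "INSERT OVERWRITE TABLE {t} "
        "SELECT COALESCE(d.{k}, b.{k}) AS {k}, "
        "COALESCE(d.value, b.value) AS value "
        "FROM {b} b FULL OUTER JOIN {d} d ON b.{k} = d.{k}"
    ).format(t=target, b=base_table, d=delta_table, k=key)

print(rebuild_query("dim_new", "dim", "dim_delta", "id"))
```

Running the generated HiveQL once per load replaces the need for row-level updates, at the cost of rewriting the whole table.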

Is there a way to set a TTL for certain directories in HDFS?

I have the following requirements. I am adding date-wise data to a specific directory in HDFS, and I need to keep a backup of the last 3 sets, and remove the rest. Is there a way to set a TTL for the directory so that the data perishes automatically after a certain number of days?
If not, is there a way to achieve similar results?
This feature is not yet available on HDFS.
There was a JIRA ticket created to support this feature: https://issues.apache.org/jira/browse/HDFS-6382
But, the fix is not yet available.
You need to handle it using a cron job. You can create a job (this could be a simple shell, Perl, or Python script) which periodically deletes data older than a certain pre-configured period.
This job could:
Run periodically (e.g. once an hour or once a day)
Take as input the list of folders or files to check, along with their TTLs
Delete any file or folder older than the specified TTL
This can be achieved easily using scripting.
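The steps above can be sketched as a pure selection function. In a real cron job you would obtain the listing with `hdfs dfs -ls` (or a client library) and remove each expired path with `hdfs dfs -rm -r <path>`; the paths and timestamps below are illustrative:

```python
import time

def expired_paths(entries, ttl_days, now=None):
    """Given (path, modification_time_epoch_seconds) pairs, return the
    paths older than the TTL. The caller is responsible for actually
    deleting them, e.g. by shelling out to `hdfs dfs -rm -r`."""
    now = now if now is not None else time.time()
    cutoff = now - ttl_days * 86400
    return [path for path, mtime in entries if mtime < cutoff]

entries = [("/data/2024-01-01", 1704067200),
           ("/data/2024-01-05", 1704412800)]
# With "now" at Jan 6 2024 and a 3-day TTL, only the Jan 1 directory expires.
print(expired_paths(entries, ttl_days=3, now=1704499200))
# ['/data/2024-01-01']
```

Scheduling this once a day via crontab gives the effect of a per-directory TTL.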

Hive Update,Insert,delete

I have been trying to implement the UPDATE, INSERT, and DELETE operations in a Hive table as per the instructions. But whenever I include the properties that enable this, i.e. the configuration values set for INSERT, UPDATE, DELETE:
hive.support.concurrency = true (default is false)
hive.enforce.bucketing = true (default is false)
hive.exec.dynamic.partition.mode = nonstrict (default is strict)
then running show tables on the Hive shell takes 65.15 seconds, whereas it normally runs in 0.18 seconds without the above properties. Apart from show tables, the rest of the commands give no output; they keep running until I kill the process. Could you tell me the reason for this?
Hive is not an RDBMS. A query that ran for 2 minutes may run for 5 minutes under the same configuration; neither Hive nor Hadoop guarantees how long a query will take to execute. Also, please include whether you are running on a single-node or multi-node cluster, and the size of the data you are querying. The information you have provided is insufficient. But don't come to any conclusion based on query execution time alone, because many factors, such as disk, CPU slots, network, etc., are involved in deciding the run time of a query.
