Is it possible to transfer expired data into another table ? I want to implement an efficient mechanism to transfer previous day data into weekly table, previous week data into monthly table etc.
I don't want to use multiple materialized views, to not calculate summary data for each table when new data comes.
Any design suggestion to handle this is also welcome.
Options:
TTL Group by : https://github.com/ClickHouse/ClickHouse/issues/14345
GraphiteMergeTree: https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/graphitemergetree/
clear column: https://github.com/kshvakov/ClickHouse-CPP-Meetup#summingmergetree
Related
I've below requirement to write a spring batch. I would like to know the best approach to achieve it.
Input: A relatively large file with report data (for today)
Processing:
1. Update Daily table and monthly table based on the report data for today
Daily table - Just update the counts based on ID
Monthly table: Add today's count to the existing value
My concerns are:
1. since data is huge I may end up having multiple DB transactions. How can I do this operation in bulk?
2. To add to the existing counts of the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is this a good way to process in this way?
Please suggest the approach I should follow or any example if there is any?
Thanks.
You can design a chunk-oriented step to first insert the daily data from the file to the table. When this step is finished, you can use a step execution listener and in the afterStep method, you will have a handle to the step execution where you can get the write count with StepExecution#getWriteCount. You can write this count to the monthly table.
since data is huge I may end up having multiple DB transactions. How can I do this operation in bulk?
With a chunk oriented step, data is already written in bulk (one transaction per chunk). This model works very well even if your input file is huge.
To add to the existing counts of the monthly table, I must have the existing counts with me. I may have to maintain a map beforehand. But is this a good way to process in this way?
No need to store the info in a map, you can get the write count from the step execution after the step as explained above.
Hope this helps.
I have an oracle database and need to import data to a hive table. The daily import data size would be around 1 GB. What would be the better approach?
If I import each day data as a partition, how can the updated values be handled?
For example, if I imported today's data as a partition and for the next day there are some fields that are updated with the new values.
Using --lastmodified we can get the values but where we need to send the updated values to the new partition or to the old (already existing) partition?
If I send to the new partition, then the data is duplicated.
If I want to send to the already existing partition, how we can it be achieved?
Your only option is to override the entire existing partition with 'INSERT OVERWRITE TABLE...'.
Question is - how far back are you going to be constantly updating the data?
I think of 3 approaches u can consider:
Decide on a threshold for 'fresh' data. for example '14 days backwards' or '1 month backwards'.
Then each day you are running the job, you override partitions (only the ones which have updated values) backwards, until the threshold decided.
With ~1 GB a day it should be feasible.
All the data from before your decided time is not guranteed to be 100% correct.
This scenario could be relevant if you know the fields can only be changed a certain time window after they were initially set.
Make your Hive table compatible with ACID transactions, thus allowing updates on the table.
Split your daily job to 2 tasks: the new data being written for the run day. the updated data that you need to run backwards. the sqoop will be responsible for the new data. take care of the updated data 'manually' (some script that generates the update statements)
Don't use partitions based on time. maybe dynamic partitioning is more suitable for your use case.It depends on the nature of the data being handled.
We have to create a materialised view in our database from a remote database which is currently in production. The view has about 5 crore records and is taking a long time. In between , the connection has snapped once and not even a single record persisted in our database. Since the remote database is production server, we get a very limited window to create the view.
My question is, can we have something like auto commit /auto start from where left at last time while the view is created so that we don't have to do the entire operation in one go ?
We are tying to make the query such that records can be fetched in smaller numbers as a alternate. But the data is read only for us and doesn't really have a where clause at this point which we can use.
Due to sensitive nature of data, I cannot post the view structure or the query.
No, you can not commit during the process of creating the view. But why don't you import the data to a table instead of a view? This would provide the possibility to commit in between. Furthermore this might provide the possibility, to load only the delta of changes maybe on daily basis - this would reduce the required time dramatically.
Wanted some advice on how to deal with table operations (rename column) in Google BigQuery.
Currently, I have a wrapper to do this. My tables are partitioned by date. eg: if I have a table name fact, I will have several tables named:
fact_20160301
fact_20160302
fact_20160303... etc
My rename column wrapper generates aliased queries. ie. if I want to change my table schema from
['address', 'name', 'city'] -> ['location', 'firstname', 'town']
I do batch query operation:
select address as location, name as firstname, city as town
and do a WRITE_TRUNCATE on the parent tables.
My main issues lies with the fact that BigQuery only supports 50 concurrent jobs. This means, that when I submit my batch request, I can only do around 30 partitions at a time, since I'd like to reserve 20 spots for ETL jobs that are runnings.
Also, I haven't found of a way where you can do a poll_job on a batch operation to see whether or not all jobs in a batch have completed.
If anyone has some tips or tricks, I'd love to hear them.
I can propose two options
Using View
Views creation is very simple to script out and execute - it is fast and free to compare with cost of scanning whole table with select into approach.
You can create view using Tables: insert API with properly set type property
Using Jobs: insert EXTRACT and then LOAD
Here you can extract table to GCS and then load it back to GBQ with adjusted schema
Above approach will a) eliminate cost cost of querying (scan) tables and b) can help with limitations. But might not depends on the actual volumke of tables and other requirements you might have
The best way to manipulate a schema is through the Google Big Query API.
Use the tables get api to retrieve the existing schema for your table. https://cloud.google.com/bigquery/docs/reference/v2/tables/get
Manipulate your schema file, renaming columns etc.
Again using the api perform an update on the schema, setting it to your newly modified version. This should all occur in one job https://cloud.google.com/bigquery/docs/reference/v2/tables/update
Basically, the needed job is for large amount of records on a data base, and more records can be inserted all the time:
Select <1000> records with status "NEW" -> process the records -> update the records to status "DONE".
This sounds to me like "Map Reduce".
I think that the job described above can may be done in parallel, even by different machines, but then my concern is:
When I select <1000> records with status "NEW" - how can I know that none of these records are already being processed by some other job ?
The same records should not be selected and processed more than once of course.
Performance is critical.
The naive solution is to do the mentioned basic job in a loop.
It seems related to big data processing / nosql / map reduce etc'.
Thanks
Since considering Performance issue... We can can achieve this.The main goal is to distribute records to clients such way that no to clients get same record.
I irrespective of database...
If you have one more column which is used for locking record. So on fetching those records you can set lock, To prevent from fetching for send time.
But if you don not have such capability then my bets bet would be to create another table or im-memory key-value store, with Record primary key and lock, and on fetching records you need to check of record does not exist in other table....
If you have HBase then it can be achieved easily first approach is achievable with performance.