I have a problem with duplicate data in ClickHouse.
My records arrive in parts, and I have to group all these parts by text_id.
The parts may arrive at different times.
For example:
id,text_id,total_parts,part_number,text
101,11,3,1,How
102,12,2,2,World
103,12,2,1,Hello
104,11,3,3,you
105,11,3,2,are
The result should look like this:
text_id,text
11, How are you
12, Hello World
I created a view to group all the parts, and it works fine.
But when I read from this view, I want to remove the rows I have already read. I tried adding a flag column to the table, updating it to 1 after reading, and changing the view to read only rows with flag = 0.
However, I read in the ClickHouse docs that UPDATE decreases performance, and my table has billions of records.
1. The view will be slow if I can't remove the processed records.
2. Even if the view has no performance issue, I don't want to read the processed data again.
Any suggestions?
The closest result you can get is an array of the text column values:
SELECT groupArray(text) AS msg
FROM
(SELECT * FROM merge_rows ORDER BY text_id, part_number)
GROUP BY text_id
┌─msg─────────────────┐
│ ['Hello','World'] │
│ ['How','are','you'] │
└─────────────────────┘
Since you have billions of rows, moving this aggregation into a materialized view will make it really fast.
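A minimal sketch of that idea, assuming the raw table is called merge_rows with the columns from the example (text_id, part_number, text); the target table and view names here are made up:
-- Pre-aggregate (part_number, text) pairs into states as rows arrive.
CREATE TABLE text_parts_agg
(
    text_id UInt64,
    parts   AggregateFunction(groupArray, Tuple(UInt32, String))
)
ENGINE = AggregatingMergeTree()
ORDER BY text_id;

CREATE MATERIALIZED VIEW text_parts_mv TO text_parts_agg AS
SELECT
    text_id,
    groupArrayState((toUInt32(part_number), text)) AS parts
FROM merge_rows
GROUP BY text_id;

-- Read back: merge the states, sort by part_number, join with spaces.
SELECT
    text_id,
    arrayStringConcat(
        arrayMap(p -> tupleElement(p, 2), arraySort(groupArrayMerge(parts))),
        ' ') AS text
FROM text_parts_agg
GROUP BY text_id;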
I implemented the mapping shown below. Can someone tell me whether this is a good approach or not?
The old records were copied from production, so they have no flag; only new records will come with a flag.
Source data:
col1 col2 col3 DML_FLAG
1 a 123 NULL(old record)
2 b 456 I
3 c 678 U
Mapping:
Source --> SQ --> EXP --> LKP (on the target, to identify new vs. updated rows)
--> EXP --> RTR (for insert and update) --> UPD (for update) --> Target
On the first run I have to load all records, i.e. a full load (old records where DML_FLAG is null plus the new records).
From the second run onwards I have to capture only the changed records from the source; for this I am using mapping variables.
My question: the I and U flags are already available in the source, yet I am still using a lookup. Without the lookup, I could simply use DML_FLAG with two groups (I and U) in the router.
But I need to refresh the data every 30 minutes. Within those 30 minutes a record can be inserted (I) and then updated, so its flag changes to 'U' in the source while the record does not yet exist in the target. In that case, how can I capture that new record with flag 'U' without the lookup?
Can someone suggest how I can do this without a lookup?
From what I understand of your question, you want to make sure the insert is applied to your target before the update to that same record. Is that correct? If so, just use a target load plan: route the inserts to a separate alias of the same target, placed higher in the load order than the updates.
Whether this is the right design choice depends on what the target database is. For a data warehouse fact table you would usually insert all records, whether they are flagged as inserts or updates, because you report on the event rather than the record state. For a dimension table it depends on your slowly changing dimension strategy.
I would like to add a compressed index to the Oracle Applications workflow table hr.pqh_ss_transaction_history in order to access specific types of workflows (process_name) and workflows for specific people (selected_person_id).
There are lots of repeating values in process_name, although the data is skewed. I would, however, mainly want to access the TFG_HR_NEW_HIRE_PLACE_JSP_PRC and TFG_HR_TERMINATION_JSP_PRC process types.
"PROCESS_NAME","CNT"
"HR_GENERIC_APPROVAL_PRC",40347
"HR_PERSONAL_INFO_JSP_PRC",39284
"TFG_HR_NEW_HIRE_PLACE_JSP_PRC",18117
"TFG_HREMPSTS_TERMS_CHG_JSP_PRC",14076
"TFG_HR_TERMINATION_JSP_PRC",8764
"HR_ADV_INDIVIDUAL_COMP_PRC",4907
"TFG_HR_SIT_NOAPP",3979
"TFG_YE_TAX_PROV",2663
"HR_TERMINATION_JSP_PRC",1310
"HR_CHANGE_PAY_JSP_PRC",953
"TFG_HR_SIT_EXIT_JSP_PRC",797
"HR_SIT_JSP_PRC",630
"HR_QUALIFICATION_JSP_PRC",282
"HR_CAED_JSP_PRC",250
"TFG_HR_EMP_TERM_JSP_PRC",211
"PER_DOR_JSP_PRC",174
"HR_AWARD_JSP_PRC",101
"TFG_HR_SIT_REP_MOT",32
"TFG_HR_SIT_NEWPOS_NIB_JSP_PRC",30
"TFG_HR_SIT_NEWPOS_INBU_JSP_PRC",28
"HR_NEW_HIRE_PLACE_JSP_PRC",22
"HR_NEWHIRE_JSP_PRC",6
selected_person_id would obviously be more selective. Unfortunately there are 3774 nulls for this column and the highest count after that is 73 for one person. A lot of people would only have 1 row. The total row count is 136963.
My query would be in this format:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, :p_person_id) = :p_person_id
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
I am on Oracle 12c release 1.
I assume it would be a good idea to put a non-compressed B-tree index on selected_person_id, since the values returned would fall into the "less than 5% of the total rows" scenario. But how do you handle the NULLs in the column, which would not go into the index, when you select using nvl(psth.selected_person_id, :p_person_id) = :p_person_id? Is there a more efficient way to write the SQL, and how should you create this index?
For process_name I would like to use a compressed b-tree index. I am assuming that the statement is
CREATE INDEX idxname ON pqh_ss_transaction_history(process_name) COMPRESS
where there would be an implicit second column for rowid. Is it safe to rely on rowid here, since normally using rowid is not advised? Is the skewed data an issue (most of the time I would be selecting the high-volume values)? I don't understand how compressed indexes can be efficient: for B-tree indexes you would normally want to return around 5% of the data or less, otherwise a full table scan is actually more efficient. How does the compressed index return so many rowids, and then look up the rows in the table by rowid, faster than a full table scan?
Or, since the optimizer will only be able to use one of the two indexes, should I rather create an uncompressed function-based index with selected_person_id and process_name concatenated?
Perhaps you could create this index:
CREATE INDEX idxname ON pqh_ss_transaction_history
(process_name, NVL(selected_person_id,-1)) COMPRESS 1
Then change your query to:
select psth.item_key,
psth.creation_date,
psth.last_update_date
from hr.pqh_ss_transaction_history psth
where nvl(psth.selected_person_id, -1) in (:p_person_id,-1)
and psth.process_name = 'HR_TERMINATION_JSP_PRC'
order by psth.last_update_date
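As a sanity check (just a sketch, reusing the rewritten query above), you can confirm that the optimizer actually uses the new index:
EXPLAIN PLAN FOR
select psth.item_key,
       psth.creation_date,
       psth.last_update_date
  from hr.pqh_ss_transaction_history psth
 where nvl(psth.selected_person_id, -1) in (:p_person_id, -1)
   and psth.process_name = 'HR_TERMINATION_JSP_PRC'
 order by psth.last_update_date;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);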
I have a table that is already sorted in the desired ascending order, with two distinct sections, and I want to add a header for each section, but I'm not quite sure how to do so. I am using the Angular datatables directive and would like the end result to have row titles like the "Status: New" and "Status: Open" in the example image below.
After much research, it seems there is currently no supported way to fix rows in the middle of the specific table I'm working with.
Add something like this after the initialization code for your tables, assuming you're using two tables:
var table1 = $("#your_top_table").DataTable({});
$(table1.table().header()).append('<tr><td>Status: New</td></tr>');
var table2 = $("#your_bottom_table").DataTable({});
$(table2.table().header()).append('<tr><td>Status: Open</td></tr>');
I have two files, A and B. The records in both files share the same format, and the first n characters of a record are its unique identifier. The records are fixed-length and consist of m fields (field1, field2, field3, ... fieldm). File B contains new records as well as records from file A that have changed. How can I use CloverETL to determine which fields have changed in a record that appears in both files?
Also, how can I gather metrics on the frequency of changes for individual fields? For example, I would like to know how many records had changes in fieldm.
This is a typical example of the Slowly Changing Dimension problem. A solution with CloverETL is described on their blog: Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 1 and Building Data Warehouse with CloverETL: Slowly Changing Dimension Type 2.
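If it helps to see the field-by-field comparison logic itself, here is a hedged SQL sketch rather than CloverETL code: it assumes both files have been staged into tables file_a and file_b keyed by id, and only three of the m fields are shown. The same per-field comparison and counting can be done inside a CloverETL transform step.
-- Count, per field, how many records present in both files changed.
SELECT
    COUNT(CASE WHEN a.field1 <> b.field1 THEN 1 END) AS field1_changes,
    COUNT(CASE WHEN a.field2 <> b.field2 THEN 1 END) AS field2_changes,
    COUNT(CASE WHEN a.field3 <> b.field3 THEN 1 END) AS field3_changes
FROM file_a a
JOIN file_b b ON b.id = a.id;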
I have to do a somewhat complicated data import. I need to run a number of UPDATEs, currently updating over 3 million rows in one query. Each query takes about 30-45 seconds (some of them even 4-5 minutes). My question is whether I can speed this up. Where can I read about it, e.g. what kind of indexes, and on which columns, I could create to improve those updates? I don't need an exact answer, so I'm not showing the tables; I am just looking for material to learn from.
Two things:
1) Post an EXPLAIN ANALYZE of your UPDATE query.
2) If your UPDATE does not need to be atomic, consider breaking up the rows affected by your UPDATE into smaller batches. To minimize the number of "lost rows" due to exceeding the Free Space Map, consider the following approach:
1. BEGIN;
2. UPDATE ... with some predicate that limits the number of rows (e.g. WHERE username ILIKE 'a%'); note that PostgreSQL's UPDATE has no LIMIT clause, so the batch has to be bounded by a predicate or a subquery on the key.
3. COMMIT;
4. VACUUM table_being_updated;
Repeat steps 1-4 until all rows are updated.
ANALYZE table_being_updated;
I suspect you're updating every row in your table and don't need all rows to be visible with the new value at the end of a single transaction, so the above approach of breaking the UPDATE up into smaller transactions will work well.
And yes, an INDEX on the relevant columns specified in the UPDATE's predicate will help dramatically. Again, post an EXPLAIN ANALYZE if you need further assistance.
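A minimal sketch of one such batch, with made-up table, column, and key names (big_table, some_col, id); the batch is bounded by a subquery on the key:
BEGIN;
-- Touch only rows that still carry the old value, at most 10000 per batch.
UPDATE big_table
   SET some_col = 'new value'
 WHERE id IN (SELECT id
                FROM big_table
               WHERE some_col IS DISTINCT FROM 'new value'
               LIMIT 10000);
COMMIT;
VACUUM big_table;
-- Repeat until the UPDATE reports 0 rows, then:
ANALYZE big_table;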
If by "a number of UPDATEs" you mean one UPDATE command per updated row, then the problem is that all of the target table's indexes are updated and all constraints are checked for each updated row. If that is the case, try instead to update all rows with a single UPDATE:
update t
set a = t2.b
from t2
where t.id = t2.id
If the imported data is in a text file, insert it into a temp table first and update from there. See my answer here.
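For the text-file route, a hedged sketch reusing the t/t2 names above (the file path and column list are assumptions):
-- Stage the import file, then update from the staging table in one pass.
CREATE TEMP TABLE t2 (id integer PRIMARY KEY, b text);
COPY t2 FROM '/path/to/import_file.csv' (FORMAT csv);  -- or \copy from psql
ANALYZE t2;

UPDATE t
   SET a = t2.b
  FROM t2
 WHERE t.id = t2.id;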