Insert into ElasticSearch using Hive/Qubole - elasticsearch

I am trying to insert data into Elasticsearch from a Hive table.
CREATE EXTERNAL TABLE IF NOT EXISTS es_temp_table (
dt STRING,
text STRING
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='aggr_2014-10-01/metric','es.index.auto.create'='true')
;
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table;
However, the data is off. When I do a count(*) on my other table, I get 6,000 rows. When I search the aggr_2014-10-01 index, I see 10,000 records! Somehow the records are being duplicated (rows are being copied over multiple times). Maybe I can remove duplicate records inside Elasticsearch? I am not sure how I would do that, though.
I believe it might be a result of Hive/Qubole spawning two tasks for every mapping. If one mapper succeeds, it tries to kill the other. However, the other task already did damage (aka inserted into ElasticSearch). This is my best guess, but I would prefer to know the exact reason and if there is a way for me to fix it.
set mapred.map.tasks.speculative.execution=false;
One thing I found was to set speculative execution to false, so that only one task is spawned per mapper (see above setting). However, now I am seeing undercounting. I believe this may be due to records being skipped, but I am unable to diagnose why those records would be skipped in the first place.
In this version, it also means that if even one task/mapper fails, the entire job fails, and then I need to delete the index (partial data was uploaded) and rerun the entire job (which takes ~4hours).
[PROGRESS UPDATE]
I attempted to solve this by putting all of the work in the reducer (it's the only way to only spawn one task to ensure no duplicate record insertions).
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, description
FROM other_table
DISTRIBUTE BY cast(rand()*250 as int);
However, I now see a huge undercount: only 2,000 records. Elasticsearch does estimate some things, but not to this extent. There are simply fewer records in Elasticsearch. This may be due to failed tasks (that are no longer retried). It may also come from Qubole/Hive passing over malformed entries. But I set:
set mapreduce.map.skip.maxrecords=1000;
Here are some other settings for my query:
set es.nodes=node-names;
set es.port=9200;
set es.bulk.size.bytes=1000mb;
set es.http.timeout=20m;
set mapred.tasktracker.expiry.interval=3600000;
set mapred.task.timeout=3600000;

I determined the problem. As I suspected, the insertion was skipping over some records that were considered "bad." I was never able to find exactly which records were being skipped, but I tried replacing all non-alphanumeric characters with a space. This solved the problem! The records are no longer being skipped, and all data is uploaded to Elasticsearch.
INSERT OVERWRITE TABLE es_temp_table
SELECT dt, REGEXP_REPLACE(description, '[^0-9a-zA-Z]+', ' ')
FROM other_table;
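Since the skipped records were never identified, one way to narrow them down would be to look for rows that the REGEXP_REPLACE above actually changes. This is only a sketch: it assumes Hive's RLIKE operator and reuses the other_table and description names from the question. It adds a space to the allowed characters, which is my own assumption, and it does not prove which characters Elasticsearch rejects.

```sql
-- Count rows whose description contains anything other than
-- letters, digits, or spaces (candidates for the "bad" records).
SELECT count(*) AS suspect_rows
FROM other_table
WHERE description RLIKE '[^0-9a-zA-Z ]';

-- Look at a small sample of them to spot the offending characters.
SELECT dt, description
FROM other_table
WHERE description RLIKE '[^0-9a-zA-Z ]'
LIMIT 20;
```

Rows with a NULL description are not matched by RLIKE, so they would need a separate IS NULL check.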

Related

How to check whether a mutation is done? (ReplicatedMergeTree)

My experience with ClickHouse clusters is limited. I have two nodes using ReplicatedMergeTree, with 1 shard and 2 replicas. I ran into a problem while synchronizing data from MySQL.
To update the table, I first delete the data newer than some number of days ago, then count the table records where data_date > days_ago, and then load the data from MySQL. The code looks like this:
delete from ods.my_table where data_date>:days_ago;
# here to check if record count is zero
select count(*) from ods.my_table where data_date>:days_ago;
# if count(*) =0 ,load data ; else wait
insert into ods.my_table select * from mysql('xxx'......) where data_date>:days_ago;
But I get zero records in ods.my_table where data_date > :days_ago.
If I run it again, the table has data; if I run it once more, it is zero again. The pattern is: when the count is zero, a rerun works; when it is not zero, a rerun does not.
I analyzed the log and found that the insert statement was executed before the mutation had finished, so the data went missing.
I tried to check whether the mutation on the table has finished, but I could not find a way to do it. Can anybody help me? Thank you in advance.
Just add a table TTL definition on the ClickHouse side and forget about the manual delete:
https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/#mergetree-table-ttl
You can add a TTL to an existing ClickHouse MergeTree table:
https://clickhouse.tech/docs/en/sql-reference/statements/alter/ttl/
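If the manual delete has to stay, the original question (how to tell when the mutation has finished) could be approached roughly as follows. This is a sketch rather than a tested recipe: it uses the system.mutations table and the mutations_sync setting, both of which depend on the ClickHouse version in use, and the date literal is a made-up placeholder for :days_ago.

```sql
-- List mutations on the table that have not finished yet on this replica.
-- An empty result means there is nothing pending here.
SELECT mutation_id, command, create_time, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE database = 'ods' AND table = 'my_table' AND is_done = 0;

-- Alternatively, make the mutation itself block until it is applied:
-- 1 waits for the current server, 2 waits for all replicas.
SET mutations_sync = 2;
ALTER TABLE ods.my_table DELETE WHERE data_date > '2021-01-01';
```

system.mutations is local to each server, so with 2 replicas the first query would need to be run on both nodes.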

Amazon Redshift, getting slower in every update run

I am just getting started with Amazon Redshift.
I have just loaded a big table, millions of rows and 171 fields. The data quality is poor, there are a lot of characters that must be removed.
I have prepared an update for every column; since Redshift uses columnar storage, I assumed that updating column by column would be faster.
UPDATE MyTable SET Field1 = REPLACE(Field1, '~', '');
UPDATE MyTable SET Field2 = REPLACE(Field2, '~', '');
.
.
.
UPDATE MyTable set FieldN = Replace(FieldN, '~', '');
The first 'update' took 1 min. The second one took 1 min and 40 sec...
Every time I run one of the updates, it takes more time than the last one. I have run 19 and the last one took almost 25 min. The time consumed by each 'update' keeps increasing.
Another thing is that with the first update the CPU utilization was minimal, while with the last update it is at 100%.
I have a 3-nodes cluster of dc1.large instances.
I have rebooted the cluster but the problem continues.
Please, I need orientation to find the cause of this problem.
When you update a column, Redshift actually deletes all those rows and inserts new rows with the new value. So there is a lot of space that needs to be reclaimed, and you need to VACUUM your table after the update.
They also recommend that you run ANALYZE after each update to update statistics for the query planner.
http://docs.aws.amazon.com/redshift/latest/dg/r_UPDATE.html
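A minimal sketch of that advice, using the MyTable and Field names from the question. Because every UPDATE rewrites the affected rows, cleaning several columns in one statement should leave far fewer deleted rows behind than 171 separate updates; how much that helps on this particular cluster is an assumption, not something measured.

```sql
-- Clean several columns in a single pass so each row is rewritten once.
UPDATE MyTable
SET Field1 = REPLACE(Field1, '~', ''),
    Field2 = REPLACE(Field2, '~', '');

-- Reclaim the space left by the deleted row versions and re-sort the table.
VACUUM MyTable;

-- Refresh the statistics used by the query planner.
ANALYZE MyTable;
```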
A more optimal way might be:
Create another identical table.
Read N (say 10000) rows at a time from the first table, process them, and load them into the second table using S3 loading (instead of insert).
Delete the first table and rename the second table.
If you are running into space issues, delete the N migrated rows from the first table after every iteration and run vacuum delete only <name_of_first_table>.
References
s3 loading : http://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html
copy table from 's3://<your-bucket-name>/load/key_prefix' credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>' options;

Low performance after alter system flush shared_pool (Oracle)

We did some refactoring and replaced 2 similar requests with one parameterized request:
a.isGood = :1
After that, the request with parameter 'Y' took longer than usual (it became almost as slow as with parameter 'N'). We ran the alter system flush shared_pool command, and the request with parameter 'Y' then completed quickly (as before the refactoring), while the request with parameter 'N' hangs for a long time.
As you might guess, the table contains many more rows with 'N' than with 'Y'.
Oracle 10g
Why did this happen?
I assume that you have an index on that column, otherwise the performance would be the same regardless of the Y/N combination. I have seen this happen quite a bit on 10g+ due to the Oracle optimizer's bind peeking combined with histograms on columns with skewed data distribution. The histograms get created automatically when one gathers table statistics using the parameter method_opt with 'FOR ALL COLUMNS SIZE AUTO' (among other values). Oracle optimizes the query for the value in the bind variables provided in the very first execution of that query. If you run the query with Y the first time, Oracle might want to use an index instead of a full table scan, since Y will return a small quantity of rows. The next time you run the query with N, Oracle will repeat the first execution plan, which happens to be a poor choice for N, since it will return the vast majority of rows.
The execution plans are cached in the SGA. Once you flush it, you get a brand new execution plan the very first time the query runs again.
My suggestion is:
Obtain the explain plan of both original queries (one with a hardcoded Y and one with a hardcoded N). Investigate whether the two plans use different indexes or one has a much higher Cost than the other. I have the feeling that one uses a full table scan and the other uses an index. The first one should be faster for N and the second should be faster for Y.
Try to remove the statistics on the table and see if it makes a difference on the query that has the bind variable. Later you need to gather statistics again for the table or other queries on that table might suffer.
You can also gather statistics for that one table using method_opt => FOR ALL COLUMNS SIZE 1. That will keep the statistics without the histograms on any columns of that table.
A bitmap index on this column might fix the issue as well. Indexes on a column that has only two possible values (Y and N) are not exactly very efficient.
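The suggestion above to gather statistics without histograms could look roughly like this. It is a sketch only: MY_TABLE is a placeholder for the real table name, and it assumes the table lives in the current user's schema.

```sql
BEGIN
  -- SIZE 1 keeps one bucket per column, which means no histograms.
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname    => USER,          -- assumption: table owned by current user
    tabname    => 'MY_TABLE',    -- placeholder table name
    method_opt => 'FOR ALL COLUMNS SIZE 1',
    cascade    => TRUE);         -- also gather index statistics
END;
/
```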
If column isGood has 99,000 'N' values and 1,000 'Y' values and you run with the condition isGood = 'Y', then it may be appropriate to use an index to find the results: you are returning 1% of the rows. If you run the query with the condition isGood = 'N', a full table scan would be more appropriate since you are returning most of the table anyway. If you were to use an index for the N condition, you would be doing an extra index lookup for every data item lookup.
Although the general rule is that bind parameters are good, it can be problematic in this kind of instance if really two different plans are required for the query. With the bind parameter scenario:
SELECT * FROM x WHERE isGood = :1
The statement will be parsed and a plan computed and saved in the sql cache. The same plan will be used for both query scenarios which is not desirable. But:
SELECT * FROM x WHERE isGood = 'Y'
SELECT * FROM x WHERE isGood = 'N'
will result in two plans being stored in the sql cache, hopefully each with the appropriate plan for the query. Version 11g avoids this problem with adaptive cursor sharing, which can use different plans for different bind variable values.
You need to look at your plans (EXPLAIN PLAN) to see what is happening in your case. Flush the cache, try one method, examine the plan; try the other, examine the plan. It might give you an idea what is happening in your case. There are a bunch of other topics you might follow up on that may help, for example:
using a hint to force the use of an index
cursor_sharing parameter
histograms on statistics
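As a starting point for comparing the plans, something like the following can be run once per literal value. The table x and column isGood are the names used in the example above; DBMS_XPLAN.DISPLAY reads the most recent plan from the plan table.

```sql
-- Plan for the rare value; an index access is the likely choice here.
EXPLAIN PLAN FOR SELECT * FROM x WHERE isGood = 'Y';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

-- Plan for the common value; a full table scan is the likely choice here.
EXPLAIN PLAN FOR SELECT * FROM x WHERE isGood = 'N';
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```

EXPLAIN PLAN does not peek at bind values, so the plan it shows for the :1 version of the query may differ from the one actually cached at run time.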

Oracle: difference between max(id)+1 and sequence.nextval

I am using Oracle
What is the difference between creating an ID using max(id)+1 and using sequence.nextval? Which one should be used, and when?
Like:
insert into student (id,name) values ((select max(id)+1 from student), 'abc');
and
insert into student (id,name) values (SQ_STUDENT.nextval, 'abc');
SQ_STUDENT.nextval sometimes gives a duplicate record error...
Please help me understand this.
With the select max(id) + 1 approach, two sessions inserting simultaneously will see the same current max ID from the table, and both insert the same new ID value. The only way to use this safely is to lock the table before starting the transaction, which is painful and serialises the transactions. (And as Stijn points out, values can be reused if the highest record is deleted). Basically, never use this approach. (There may very occasionally be a compelling reason to do so, but I'm not sure I've ever seen one).
The sequence guarantees that the two sessions will get different values, and no serialisation is needed. It will perform better and be safer, easier to code and easier to maintain.
The only way you can get duplicate errors using the sequence is if records already exist in the table with IDs above the sequence value, or if something is still inserting records without using the sequence. So if you had an existing table with manually entered IDs, say 1 to 10, and you created a sequence with a default start-with value of 1, the first insert using the sequence would try to insert an ID of 1 - which already exists. After trying that 10 times the sequence would give you 11, which would work. If you then used the max-ID approach to do the next insert that would use 12, but the sequence would still be on 11 and would also give you 12 next time you called nextval.
The sequence and table are not related. The sequence is not automatically updated if a manually-generated ID value is inserted into the table, so the two approaches don't mix. (Among other things, the same sequence can be used to generate IDs for multiple tables, as mentioned in the docs).
If you're changing from a manual approach to a sequence approach, you need to make sure the sequence is created with a start-with value that is higher than all existing IDs in the table, and that everything that does an insert uses the sequence only in the future.
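A sketch of that switch-over, assuming the student table and SQ_STUDENT sequence from the question and that the sequence does not exist yet. CREATE SEQUENCE needs a literal start value, so the statement is built dynamically; this should be run while nothing else is inserting into the table.

```sql
DECLARE
  v_start NUMBER;
BEGIN
  -- One above the highest existing ID (1 if the table is empty).
  SELECT COALESCE(MAX(id), 0) + 1 INTO v_start FROM student;

  EXECUTE IMMEDIATE
    'CREATE SEQUENCE sq_student START WITH ' || v_start || ' INCREMENT BY 1';
END;
/

-- From now on every insert takes its ID from the sequence only.
INSERT INTO student (id, name) VALUES (sq_student.NEXTVAL, 'abc');
```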
Using a sequence works if you intend to have multiple users. Using a max does not.
If you do a max(id) + 1 and you allow multiple users, then multiple sessions that are both operating at the same time will regularly see the same max and, thus, will generate the same new key. Assuming you've configured your constraints correctly, that will generate an error that you'll have to handle. You'll handle it by retrying the INSERT which may fail again and again if other sessions block you before your session retries but that's a lot of extra code for every INSERT operation.
It will also serialize your code. If I insert a new row in my session and go off to lunch before I remember to commit (or my client application crashes before I can commit), every other user will be prevented from inserting a new row until I get back and commit or the DBA kills my session, forcing a rollback.
To add to the other answers, a couple of issues.
Your max(id)+1 syntax will also fail if there are no rows in the table already, so use:
Coalesce(Max(id),0) + 1
There's nothing wrong with this technique if you only have a single process that inserts into the table, as might be the case with a data warehouse load, and if max(id) is fast (which it probably is).
It also avoids the need for code to synchronise values between tables and sequences if you are moving or restoring data to a test system, for example.
You can extend this method to multirow insert by using:
Coalesce(max(id),0) + rownum
I expect that might serialise a parallel insert, though.
Some techniques don't work well with these methods. They rely of course on being able to issue the select statement, so SQL*Loader might be ruled out. However SQL*Loader has support for this technique in general through the SEQUENCE parameter of the column specification: http://docs.oracle.com/cd/E11882_01/server.112/e22490/ldr_field_list.htm#i1008234
Assuming MAX(ID) is actually fast enough, wouldn't it be possible to:
First get MAX(ID)+1
Then get NEXTVAL
Compare those two and increase the sequence in case NEXTVAL is smaller than MAX(ID)+1
Use NEXTVAL in INSERT statement
In that case I would have a fully stable procedure and manual inserts would also be allowed without worrying about updating the sequence

ORA-30926: unable to get a stable set of rows in the source tables

I am getting
ORA-30926: unable to get a stable set of rows in the source tables
in the following query:
MERGE INTO table_1 a
USING
(SELECT a.ROWID row_id, 'Y'
FROM table_1 a ,table_2 b ,table_3 c
WHERE a.mbr = c.mbr
AND b.head = c.head
AND b.type_of_action <> '6') src
ON ( a.ROWID = src.row_id )
WHEN MATCHED THEN UPDATE SET in_correct = 'Y';
I have queried table_1 and it has data; I have also run the inner query (src), which returns data as well.
Why would this error come and how can it be resolved?
This is usually caused by duplicates in the query specified in the USING clause. This probably means that table_1 is a parent table and the same ROWID is returned several times.
You could quickly solve the problem by using a DISTINCT in your query (in fact, if 'Y' is a constant value you don't even need to put it in the query).
Assuming your query is correct (don't know your tables) you could do something like this:
MERGE INTO table_1 a
USING
(SELECT distinct a.ROWID row_id
FROM table_1 a ,table_2 b ,table_3 c
WHERE a.mbr = c.mbr
AND b.head = c.head
AND b.type_of_action <> '6') src
ON ( a.ROWID = src.row_id )
WHEN MATCHED THEN UPDATE SET in_correct = 'Y';
You're probably trying to update the same row of the target table multiple times. I just encountered the very same problem in a merge statement I developed. Make sure your update does not touch the same record more than once in the execution of the merge.
A further clarification to the use of DISTINCT to resolve error ORA-30926 in the general case:
You need to ensure that the set of data specified by the USING() clause has no duplicate values of the join columns, i.e. the columns in the ON() clause.
In OP's example where the USING clause only selects a key, it was sufficient to add DISTINCT to the USING clause. However, in the general case the USING clause may select a combination of key columns to match on and attribute columns to be used in the UPDATE ... SET clause. Therefore in the general case, adding DISTINCT to the USING clause will still allow different update rows for the same keys, in which case you will still get the ORA-30926 error.
This is an elaboration of DCookie's answer and point 3.1 in Tagar's answer, which from my experience may not be immediately obvious.
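To check whether the USING query really returns the same key more than once, the source query can be grouped by the column used in the ON clause. This sketch reuses the tables and join conditions from the question; any row it returns is a target row the MERGE would try to update more than once.

```sql
SELECT a.ROWID AS row_id, COUNT(*) AS matches
FROM table_1 a, table_2 b, table_3 c
WHERE a.mbr = c.mbr
AND b.head = c.head
AND b.type_of_action <> '6'
GROUP BY a.ROWID
HAVING COUNT(*) > 1;
```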
How to Troubleshoot ORA-30926 Errors? (Doc ID 471956.1)
1) Identify the failing statement
alter session set events '30926 trace name errorstack level 3';
or
alter system set events '30926 trace name errorstack off';
and watch for .trc files in UDUMP when it occurs.
2) Having found the SQL statement, check if it is correct (perhaps using explain plan or tkprof to check the query execution plan) and analyze or compute statistics on the tables concerned if this has not recently been done. Rebuilding (or dropping/recreating) indexes may help too.
3.1) Is the SQL statement a MERGE?
Evaluate the data returned by the USING clause to ensure that there are no duplicate values in the join. Modify the merge statement to include a deterministic where clause.
3.2) Is this an UPDATE statement via a view?
If so, try populating the view result into a table and try updating the table directly.
3.3) Is there a trigger on the table? Try disabling it to see if it still fails.
3.4) Does the statement contain a non-mergeable view in an 'IN-Subquery'? This can result in duplicate rows being returned if the query has a "FOR UPDATE" clause. See Bug 2681037
3.5) Does the table have unused columns? Dropping these may prevent the error.
4) If modifying the SQL does not cure the error, the issue may be with the table, especially if there are chained rows.
4.1) Run the ‘ANALYZE TABLE VALIDATE STRUCTURE CASCADE’ statement on all tables used in the SQL to see if there are any corruptions in the table or its indexes.
4.2) Check for, and eliminate, any CHAINED or migrated ROWS on the table. There are ways to minimize this, such as the correct setting of PCTFREE.
Use Note 122020.1 - Row Chaining and Migration
4.3) If the table is additionally Index Organized, see:
Note 102932.1 - Monitoring Chained Rows on IOTs
Had the error today on a 12c and none of the existing answers fit (no duplicates, no non-deterministic expressions in the WHERE clause). My case was related to that other possible cause of the error, according to Oracle's message text (emphasis below):
ORA-30926: unable to get a stable set of rows in the source tables
Cause: A stable set of rows could not be got because of large dml activity or a non-deterministic where clause.
The merge was part of a larger batch, and was executed on a live database with many concurrent users. There was no need to change the statement. I just committed the transaction before the merge, then ran the merge separately, and committed again. So the solution was found in the suggested action of the message:
Action: Remove any non-deterministic where clauses and reissue the dml.
SQL Error: ORA-30926: unable to get a stable set of rows in the source tables
30926. 00000 - "unable to get a stable set of rows in the source tables"
*Cause: A stable set of rows could not be got because of large dml
activity or a non-deterministic where clause.
*Action: Remove any non-deterministic where clauses and reissue the dml.
This error occurred for me because of duplicate records (16K).
When I added UNIQUE, it worked.
But when I tried the merge again without UNIQUE, the same problem occurred.
The second time it was due to a missing commit:
if no commit is done after the merge, the same error is shown.
Without UNIQUE, the query works if a commit is issued after each merge operation.
I was not able to resolve this after several hours. Eventually I just did a select with the two tables joined, created an extract and created individual SQL update statements for the 500 rows in the table. Ugly but beats spending hours trying to get a query to work.
As someone explained earlier, probably your MERGE statement tries to update the same row more than once and that does not work (could cause ambiguity).
Here is one simple example. MERGE that tries to mark some products as found when matching the given search patterns:
CREATE TABLE patterns(search_pattern VARCHAR2(20));
INSERT INTO patterns(search_pattern) VALUES('Basic%');
INSERT INTO patterns(search_pattern) VALUES('%thing');
CREATE TABLE products (id NUMBER,name VARCHAR2(20),found NUMBER);
INSERT INTO products(id,name,found) VALUES(1,'Basic instinct',0);
INSERT INTO products(id,name,found) VALUES(2,'Basic thing',0);
INSERT INTO products(id,name,found) VALUES(3,'Super thing',0);
INSERT INTO products(id,name,found) VALUES(4,'Hyper instinct',0);
MERGE INTO products p USING
(
SELECT search_pattern FROM patterns
) o
ON (p.name LIKE o.search_pattern)
WHEN MATCHED THEN UPDATE SET p.found=1;
SELECT * FROM products;
If the patterns table contains the Basic% and Super% patterns, then the MERGE works and the first three products will be updated (found). But if the patterns table contains the Basic% and %thing search patterns, then the MERGE does NOT work, because it will try to update the second product twice, and this causes the problem. MERGE does not work if some records would be updated more than once. You may ask: why not update twice?
Here the first update to 1 and the second update to 1 set the same value, but only by accident. Now look at this scenario:
CREATE TABLE patterns(code CHAR(1),search_pattern VARCHAR2(20));
INSERT INTO patterns(code,search_pattern) VALUES('B','Basic%');
INSERT INTO patterns(code,search_pattern) VALUES('T','%thing');
CREATE TABLE products (id NUMBER,name VARCHAR2(20),found CHAR(1));
INSERT INTO products(id,name,found) VALUES(1,'Basic instinct',NULL);
INSERT INTO products(id,name,found) VALUES(2,'Basic thing',NULL);
INSERT INTO products(id,name,found) VALUES(3,'Super thing',NULL);
INSERT INTO products(id,name,found) VALUES(4,'Hyper instinct',NULL);
MERGE INTO products p USING
(
SELECT code,search_pattern FROM patterns
) s
ON (p.name LIKE s.search_pattern)
WHEN MATCHED THEN UPDATE SET p.found=s.code;
SELECT * FROM products;
Now the first product name matches the Basic% pattern and will be updated with code B, but the second product matches both patterns and cannot be updated with both codes B and T at the same time (ambiguity)!
That's why the DB engine complains. Don't blame it! It knows what it is doing! ;-)
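One way to remove the ambiguity in the second scenario is to decide explicitly which code wins and to reduce the source to one row per product before merging. The sketch below picks the alphabetically smallest code with MIN, which is purely an illustrative rule of my own; any deterministic choice would do.

```sql
MERGE INTO products p USING
(
  -- Exactly one row per matching product: the smallest matching code.
  SELECT pr.id, MIN(pt.code) AS code
  FROM products pr
  JOIN patterns pt ON pr.name LIKE pt.search_pattern
  GROUP BY pr.id
) s
ON (p.id = s.id)
WHEN MATCHED THEN UPDATE SET p.found = s.code;
```

With the sample data above, products 1 and 2 would get B, product 3 would get T, and product 4 would stay NULL.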

Resources