Efficiently query table with conditions including array column in PostgreSQL - performance

Need to come up with a way to efficiently execute a query with an array column and an integer column in the WHERE clause, ordered by a timestamp column. Using PostgreSQL 9.2.
The query we need to execute is:
SELECT id
from table
where integer = <int_value>
and <text_value> = any (array_col)
order by timestamp
limit 1;
int_value is an integer value, and text_value is a 1-3 letter text value.
The table structure is like this:
Column | Type | Modifiers
---------------+-----------------------------+------------------------
id | text | not null
timestamp | timestamp without time zone |
array_col | text[] |
integer | integer |
How should I design indexes / modify the query to make it as efficient as possible?
Thanks so much! Let me know if more information is needed and I'll update ASAP.

PostgreSQL can use indexes on array columns, but you have to use array operators for that: instead of <text_value> = any (array_col), use ARRAY[<text_value>] <@ array_col (https://stackoverflow.com/a/4059785/2115135). You can run SET enable_seqscan=false; to force PostgreSQL to use indexes wherever possible, to check whether the ones you created are usable. Unfortunately a GIN index can't be created on an integer column, so you will have to create two different indexes for those two columns.
See the execution plans here: http://sqlfiddle.com/#!12/66a71/2
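A minimal sketch of that two-index approach (the index definitions and the rewritten predicate are my reading of the advice, not code from the answer; the placeholder names table, integer and timestamp are reserved words and would need quoting in a real schema):
CREATE INDEX table_int_ts_idx ON table (integer, timestamp);
CREATE INDEX table_array_idx ON table USING gin (array_col);

SELECT id
FROM table
WHERE integer = <int_value>
AND array_col @> ARRAY[<text_value>]
ORDER BY timestamp
LIMIT 1;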

Unfortunately a GIN index can't be created on an integer column, so you will have to create two different indexes for those two columns.
That's not entirely true: you can use btree_gin or btree_gist.
-- GiST has no operator class for text[], so use GIN here; the btree_gin
-- extension adds GIN support for the scalar columns
CREATE EXTENSION btree_gin;
CREATE INDEX ON table USING gin (integer, array_col, timestamp);
VACUUM FULL ANALYZE table;
Now the whole condition can be evaluated on the index itself:
SELECT *
FROM table
WHERE integer = ? AND array_col @> ARRAY[?]
ORDER BY timestamp;
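To check that the planner actually picks an index up, the enable_seqscan trick from the first answer can be combined with EXPLAIN; a sketch using the question's placeholders:
SET enable_seqscan = false;
EXPLAIN ANALYZE
SELECT id
FROM table
WHERE integer = <int_value>
AND array_col @> ARRAY[<text_value>]
ORDER BY timestamp
LIMIT 1;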

Related

How can I fetch "IF STATEMENT" condition from the database table (ORACLE)?

E.g., I need to calculate an employee rating using the following logic:
IF POINTS > 150 AND EMP_Experience > 3 THEN 5;
I want the If condition to be configurable in the future and hence want to store it in the database.
TABLE EMP_CALC_MASTER
ID | Rating | Condition
1 | 5 | %POINTS% > 150 AND %EMP_Experience% > 3
In my stored procedure, I want to fetch the above condition and replace %Points% and %Emp_Experience% with data and then assign the rating, which is also stored in the table.
v_condition = condition.replace(CALC_MASTER.Condition, %POINTS%, <points of the employee>)
IF v_condition THEN -- This statement is not understood by Oracle as a valid condition
...
END IF
Oracle doesn't have an IF statement in SQL, but you can do the same thing with CASE.
You haven't said to what degree you want it to be configurable or what your actual table looks like, so it's hard to be explicit here. Oracle can't have dynamic tables or columns.
You could add a column for each condition, like
rating | points | emp_experience
5 | 150 | 3
Then, inner join that to your base table and have a clause in the select like
CASE WHEN emp.points > emp_calc_master.points AND
emp.emp_experience > emp_calc_master.emp_experience
THEN emp_calc_master.rating ELSE NULL END
But I'm guessing you want more than one possible row in emp_calc_master.
Honestly, that's getting complicated enough that I'd probably create a PL/SQL function to do the calculation and call that from the query.
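A hypothetical sketch of such a function (the name get_rating and the looping scheme are illustrative, not from the answer; it evaluates the stored condition via dynamic SQL, so the contents of EMP_CALC_MASTER must be trusted):
CREATE OR REPLACE FUNCTION get_rating(p_points NUMBER, p_experience NUMBER)
RETURN NUMBER
IS
  v_match NUMBER;
BEGIN
  -- walk the configured conditions and return the first rating that matches
  FOR r IN (SELECT rating, condition FROM emp_calc_master ORDER BY id) LOOP
    -- turn '%POINTS% > 150 AND %EMP_Experience% > 3' into a bindable predicate
    EXECUTE IMMEDIATE
      'SELECT CASE WHEN '
      || REPLACE(REPLACE(r.condition, '%POINTS%', ':p1'), '%EMP_Experience%', ':p2')
      || ' THEN 1 ELSE 0 END FROM dual'
      INTO v_match USING p_points, p_experience;
    IF v_match = 1 THEN
      RETURN r.rating;
    END IF;
  END LOOP;
  RETURN NULL;
END;
/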

Cassandra: long latency for query if many rows in result

Example table schema:
CREATE TABLE tbl (
key int,
seq int,
name text,
PRIMARY KEY (key, seq)
);
For each key there are many rows (suppose 1000K).
Suppose I want to query the content for a specific key; my query is:
select * from tbl where key = 'key1'
(actually I use the cpp driver in my program, with the paging interface)
The result contains 1000K rows, and the query takes about 10s.
I thought the data for each key is stored together on disk, so it should be very fast to return.
Why does it take so long?
Is there any way to optimize?
Why does it take so long?
There are almost 1000K = 1,000,000 = 1M rows returned by your query. That's why it takes so long.
Is there any way to optimize?
Yes, there is!
Try using limit and pivoting/pagination in the query.
From the table definition it seems you have a clustering key, seq; you can use its value to optimize the query. Assuming the clustering key seq has the default ascending order, change your query to:
select * from tbl where key = 'key1' and seq > [pivot] limit 100
Replace [pivot] with the last seq value of your previous result set. For the first query, use Integer.MIN_VALUE as [pivot].
For example:
select * from tbl where key = 'key1' and seq > -100 limit 100
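If the last row of that page has, say, seq = 0, the next page (an illustration of the scheme, not from the answer) would be fetched with:
select * from tbl where key = 'key1' and seq > 0 limit 100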

Duplicate removal from a table using Informatica

I have a scenario to implement in Informatica where I need to remove duplicate records from a table based on the PK, but I need to keep the first occurrence of each PK value and remove the others (in case of duplicate PKs).
For example, if my source has 1,1,1,2,3,3,4,5,4, I want my target data to be 1,2,3,4,5. I have to read data from the same table and load it back into the same table; no new table can be introduced. Please help me with your inputs.
Thanks in advance!
I suppose you want the first occurrence because there are other (data) columns in addition to the key you mentioned. So you want
1,b
1,c
1,a
2,d
3,c
3,d
4,e
5,f
4,b
Turned into
1,b
2,d
3,c
4,e
5,f
??
In that case try this mapping layout:
SRC -> SQ -> SRT -> AGG -> TGT
with a Sequence Generator (SEQ) also feeding the sorter.
The sorter is set to sort on KEY, sequence_port (descending), and the aggregator is set to group by KEY; the sequence_port should not be connected onward out of the sorter. (The aggregator keeps the last row it sees per group, which after the descending sort is the first occurrence.)
Hope you can follow me :)
There are multiple ways to do this; the simplest would be to do it in the SQL override.
Assuming the example quoted above, the SQL would be like this. The general idea is to assign a row number within each primary key (so if you have 3 rows with the same PK, they are numbered 1, 2, 3 before the counter resets for the next PK). Note that ordering by the primary key alone leaves which duplicate is numbered first arbitrary; order by a load-sequence column if you need the true first occurrence. (source_table below stands for your actual table.)
SQL:
select * from (
select primary_key, column2,
row_number() over (partition by primary_key order by primary_key) as distinct_key
from source_table
) t
where distinct_key = 1
Before:
1,b
1,c
1,a
2,d
3,c
3,d
Intermediate query:
1,b,1
1,c,2
1,a,3
2,d,1
3,c,1
3,d,2
output:
1,b
2,d
3,c
I was able to achieve this by following the steps below.
1. Pass sorted data (keys are EMP_ID, MOBILE, DEPTID) to an Expression transformation.
2. Create the following variable ports in the expression and get the counts.
V_CURR_EMP_ID = EMP_ID
V_CURR_MOBILE = MOBILE
V_CURR_DEPTID = DEPTID
V_COUNT =
IIF(V_CURR_EMP_ID=V_PREV_EMP_ID AND V_CURR_MOBILE=V_PREV_MOBILE AND V_CURR_DEPTID=V_PREV_DEPTID ,V_COUNT+1,1)
V_PREV_EMP_ID = EMP_ID
V_PREV_MOBILE = MOBILE
V_PREV_DEPTID = DEPTID
O_COUNT =V_COUNT
3. In the next transformation, a Filter, take only the records which have a count greater than 1, and delete them using an Update Strategy (DD_DELETE).
Here is the mapping flow.
SQ->SRTR->EXP->FIL->UPD->TGT
Also, when I tried to delete them using an aggregator, it deleted only the first occurrence of the duplicates, not all of them.
Thanks again for your inputs!
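For completeness, a hypothetical pure-SQL variant of the same dedup (assuming the table lives in Oracle; emp_table is a stand-in name, and MIN(ROWID) keeps one arbitrary row per key, which matches load order only by accident):
DELETE FROM emp_table
WHERE ROWID NOT IN (
SELECT MIN(ROWID)
FROM emp_table
GROUP BY emp_id, mobile, deptid
);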

Hive: SemanticException [Error 10002]: Line 3:21 Invalid column reference 'name'

I am using the following Hive query script with version 0.13.0
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe' with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that Hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a DB expert, but I understand it like this: when you group rows together, every selected column must either be reduced to a scalar value by an aggregate, or have the same value across the whole group, right?
In my previous case I was grouping by list.id alone. That is fine for list.id, list.name and list.genre (they correspond one-to-one), but rating.rating will always be a problem there (I just learnt Pig along with Hive, and grouping works differently there).
So to tackle the problem, I cast the rating to FLOAT, averaged it, and stored it in a FLOAT column. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT))
from movies.list list
LEFT JOIN movies.rating rating ON (list.id = rating.id)
GROUP BY list.id, list.name, list.genre
order by list.id DESC;
Thank you for your explanation. I might save the following question for the next thread, but here is my observation:
the overall job is slower when grouping and joining in a single query than when doing them in two separate queries. For the same job, I changed the code a bit to do the grouping first and then the join, and the overall time dropped by 40 seconds: earlier it took 140 seconds, now it takes 100 seconds. Any reason for that?
Once again thank you for your explanation.
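For reference, a sketch of that aggregate-first variant (my reconstruction from the description above, using this thread's schema; untested):
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, r.avg_rating
FROM movies.list list
LEFT JOIN (
SELECT id, AVG(cast(rating as FLOAT)) AS avg_rating
FROM movies.rating
GROUP BY id
) r ON (list.id = r.id);
Pre-aggregating shrinks the rating side to one row per movie before the join, which is a plausible explanation for the 40-second difference.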
I came across the same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put "charge_province" in the GROUP BY, the issue was gone. I don't know why.

Synchronizing two tables using a stored procedure, only updating and adding rows where values do not match

The scenario
I've got two tables with identical structure.
TABLE [INFORMATION], [SYNC_INFORMATION]
[ITEM] [nvarchar](255) NOT NULL
[DESCRIPTION] [nvarchar](255) NULL
[EXTRA] [nvarchar](255) NULL
[UNIT] [nvarchar](2) NULL
[COST] [float] NULL
[STOCK] [nvarchar](1) NULL
[CURRENCY] [nvarchar](255) NULL
[LASTUPDATE] [nvarchar](50) NULL
[IN] [nvarchar](4) NULL
[CLIENT] [nvarchar](255) NULL
I'm trying to create a synchronize procedure that will be triggered by a scheduled event at a given time every day.
CREATE PROCEDURE [dbo].[usp_SynchronizeInformation]
AS
BEGIN
SET NOCOUNT ON;
--Update all rows
UPDATE TARGET_TABLE
SET TARGET_TABLE.[DESCRIPTION] = SOURCE_TABLE.[DESCRIPTION],
TARGET_TABLE.[EXTRA] = SOURCE_TABLE.[EXTRA],
TARGET_TABLE.[UNIT] = SOURCE_TABLE.[UNIT],
TARGET_TABLE.[COST] = SOURCE_TABLE.[COST],
TARGET_TABLE.[STOCK] = SOURCE_TABLE.[STOCK],
TARGET_TABLE.[CURRENCY] = SOURCE_TABLE.[CURRENCY],
TARGET_TABLE.[LASTUPDATE] = SOURCE_TABLE.[LASTUPDATE],
TARGET_TABLE.[IN] = SOURCE_TABLE.[IN],
TARGET_TABLE.[CLIENT] = SOURCE_TABLE.[CLIENT]
FROM SYNC_INFORMATION TARGET_TABLE
JOIN LSERVER.dbo.INFORMATION SOURCE_TABLE ON TARGET_TABLE.ITEM = SOURCE_TABLE.ITEM
--Add new rows
INSERT INTO SYNC_INFORMATION (ITEM, DESCRIPTION, EXTRA, UNIT, COST, STOCK, CURRENCY, LASTUPDATE, [IN], CLIENT)
SELECT
src.ITEM,
src.DESCRIPTION,
src.EXTRA,
src.UNIT,
src.COST,
src.STOCK,
src.CURRENCY,
src.LASTUPDATE,
src.[IN],
src.CLIENT
FROM LSERVER.dbo.INFORMATION src
LEFT JOIN SYNC_INFORMATION targ ON src.ITEM = targ.ITEM
WHERE
targ.ITEM IS NULL
END
Currently, this procedure (including some others that are also executed at the same time) takes about 15 seconds to execute.
I'm planning on adding a "Synchronize" button in my work interface so that users can manually synchronize when, for instance, a new item is added and needs to be used the same day.
But in order for me to do that, I need to trim those 15 seconds as much as possible.
Instead of updating every single row, as in my procedure, is it possible to update only the rows whose values do not match?
This would greatly increase the execution speed, since it wouldn't have to update all 4000 rows when maybe only 20 actually need it.
Can this be done in a better way, or optimized?
Does it need improvements, if yes, where?
How would you solve this?
Would also appreciate some time differences between the solutions so I can compare them.
UPDATE
Using marc_s's CHECKSUM is really brilliant. The problem is that in some instances different rows produce the same checksum. Due to the classified content I can only show 2 columns, but I can say that all columns held identical information except these 2. To clarify: the rows in question were all the rows that had duplicate CHECKSUMs, and they are also the only rows with a hyphen in the ITEM column, as far as I can see.
The query was simply
SELECT *, CHECKSUM(*) FROM SYNC_INFORMATION
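If those collisions are a problem, a hash-based column is one alternative; a sketch of my own, not from the answers (HASHBYTES with SHA2_256 and CONCAT require SQL Server 2012 or later, the '|' separators guard against adjacent columns blurring together, and note that CONCAT silently turns NULLs into empty strings):
ALTER TABLE dbo.[INFORMATION]
ADD HashColumn AS HASHBYTES('SHA2_256',
CONCAT([DESCRIPTION], '|', [EXTRA], '|', [UNIT], '|', [COST], '|', [STOCK], '|',
[CURRENCY], '|', [LASTUPDATE], '|', [IN], '|', [CLIENT])) PERSISTED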
If you can change the table structure ever so slightly - you could add a computed CHECKSUM column to your two tables, and in the case the ITEM is identical, you could then check that checksum column to see if there are any differences at all in the columns of the table.
If you can do this - try something like this here:
ALTER TABLE dbo.[INFORMATION]
ADD CheckSumColumn AS CHECKSUM([DESCRIPTION], [EXTRA], [UNIT],
[COST], [STOCK], [CURRENCY],
[LASTUPDATE], [IN], [CLIENT]) PERSISTED
Of course: only include those columns that should be considered when deciding whether a source and a target row are identical! (This depends on your needs and requirements.)
This persists a new column to your table, which is calculated as the checksum over the columns specified in the list of arguments to the CHECKSUM function.
This value is persisted, i.e. it could be indexed, too! :-O
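For instance (the index name is illustrative):
CREATE INDEX IX_INFORMATION_CheckSum ON dbo.[INFORMATION] (CheckSumColumn)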
Now, you could simplify your UPDATE to
UPDATE TARGET_TABLE
SET ......
FROM SYNC_INFORMATION TARGET_TABLE
JOIN LSERVER.dbo.INFORMATION SOURCE_TABLE ON TARGET_TABLE.ITEM = SOURCE_TABLE.ITEM
WHERE
TARGET_TABLE.CheckSumColumn <> SOURCE_TABLE.CheckSumColumn
Read more about the CHECKSUM T-SQL function on MSDN!
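On SQL Server 2008 or later, the update and the insert could also be collapsed into a single MERGE statement; a sketch under the assumption that the checksum column exists on both tables (MERGE is my suggestion, not from this answer):
MERGE SYNC_INFORMATION AS targ
USING LSERVER.dbo.INFORMATION AS src
ON targ.ITEM = src.ITEM
WHEN MATCHED AND targ.CheckSumColumn <> src.CheckSumColumn THEN
UPDATE SET [DESCRIPTION] = src.[DESCRIPTION], [EXTRA] = src.[EXTRA],
[UNIT] = src.[UNIT], [COST] = src.[COST], [STOCK] = src.[STOCK],
[CURRENCY] = src.[CURRENCY], [LASTUPDATE] = src.[LASTUPDATE],
[IN] = src.[IN], [CLIENT] = src.[CLIENT]
WHEN NOT MATCHED BY TARGET THEN
INSERT (ITEM, [DESCRIPTION], [EXTRA], [UNIT], [COST], [STOCK],
[CURRENCY], [LASTUPDATE], [IN], [CLIENT])
VALUES (src.ITEM, src.[DESCRIPTION], src.[EXTRA], src.[UNIT], src.[COST],
src.[STOCK], src.[CURRENCY], src.[LASTUPDATE], src.[IN], src.[CLIENT]);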
