Presto Cassandra slow performance

I am using Presto to query Cassandra records, and it takes around 8 minutes to return the result. I need to improve the response time.
Presto configuration below. Coordinator:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=3GB
discovery-server.enabled=true
discovery.uri=http://URL:8080
task.max-worker-threads=10
task.concurrency=32
Workers: 4, each configured as follows:
coordinator=false
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=2GB
discovery.uri=http://URL:8080
task.max-worker-threads=16
task.concurrency=32
Cassandra: 4 nodes
Fragment 2
Cost: CPU 1.98m, Input: 17833912 rows (1.49GB), Output: 13089502 rows (1.31GB)
ScanFilterProject[table = cassandra:cassandra:rasapp:raslog, originalConstraint = (("bucketid" = CAST('2017062113'
Cost: 96.12%, Input: 23169736 rows (22.10MB), Output: 17833912 rows (1.49GB), Filtered: 23.03%
How can I improve the response time in Presto? I am already filtering on the partition key, but the partition holds around 23 million records.
Schema
CREATE TABLE TEST.TEST_LOG (
bucketId varchar,
id timeuuid,
transaction_id varchar,
ras_transaction_id varchar,
msg_seq_id int,
host_name varchar,
matip_channel_id varchar,
hth_id varchar,
mq_id varchar,
log_point varchar,
entry_time timestamp,
exit_time timestamp,
source_carrier varchar,
destination_carrier varchar,
source_dcs varchar,
destination_dcs varchar,
message_type varchar,
message_direction int,
error_code_business varchar,
exception_code varchar,
exception_description varchar,
scenario varchar,
created_date timestamp,
huborcar varchar,
noof_fanout varchar,
flight_date timestamp,
route_origin varchar,
route_destination varchar,
class_service varchar,
no_of_seats varchar,
ras_host varchar,
cp_host varchar,
PRIMARY KEY(bucketid, created_date, msg_seq_id,message_direction,scenario,source_dcs,exception_code,log_point,transaction_id,id)
) WITH default_time_to_live = 2851200 and CLUSTERING ORDER BY (created_date ASC, msg_seq_id ASC,message_direction ASC,scenario ASC,source_dcs ASC,exception_code ASC,log_point ASC,transaction_id ASC,id ASC);
Query
select
transaction_id,
message_direction,
message_type,
max(exception_code) as exception_code,
min(entry_time) as min_entry,
max(entry_time) as max_entry,
min(exit_time) as min_exit,
max(exit_time) as max_exit
from TEST.TEST_LOG
where bucketid='2017062113'
and (
((msg_seq_id<=2 and message_type='PAOREQ' ) or
( msg_seq_id>2 and message_type='PAORES' )))
group by transaction_id,
message_direction,
message_type
Time taken: 8 mins
Thanks,

Two things: The 0.180 release of Presto will include pushdown of inequality predicates on clustering keys, which will help your query. Also, your schema does not work well with the query you are running. In Cassandra, it is best to (a) query particular partitions (which you do) and (b) have predicates on the clustering keys in the order in which they are defined, since that is the sort order Cassandra uses. You will probably see better performance if you have a primary key of (bucketid, message_type, msg_seq_id, ...).
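For illustration, a minimal sketch of such a reworked table (the table name TEST_LOG_BY_TYPE is hypothetical, columns the query does not touch are omitted, and the exact tail of the clustering key is an assumption):
CREATE TABLE TEST.TEST_LOG_BY_TYPE (
bucketid varchar,
message_type varchar,
msg_seq_id int,
message_direction int,
transaction_id varchar,
exception_code varchar,
entry_time timestamp,
exit_time timestamp,
id timeuuid,
-- message_type and msg_seq_id lead the clustering key, so the query's
-- equality and range predicates line up with Cassandra's sort order
PRIMARY KEY (bucketid, message_type, msg_seq_id, id)
);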
Additionally, Presto does not push down aggregations to Cassandra (or any connector), so if there is a large amount of data that you're aggregating, and you don't need Presto for the federated query, it may be faster to just do the query in Cassandra.

Related

Table partition on non-date column

How can I partition a table in Oracle on a non-date column (say, partition on Username)?
So far I have only partitioned tables on date columns. Say:
CREATE TABLE X
(
Username Varchar2(10 Char),
Import_date Date
)
PARTITION BY RANGE ("IMPORT_DATE") INTERVAL (NUMTODSINTERVAL(1,'DAY'))
(PARTITION "CL_REP_DEF" VALUES LESS THAN
(TO_DATE(' 2018-06-29 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))
)
Though I am not sure how to partition by Username here.
Oracle offers three basic types of partitions:
Range
Hash
List
You can use any of them.
Selection of the partitioning type depends on the data stored in the table and the values of the partitioning column (or columns). If the number of distinct values in the column is limited and known, then the LIST type would be a better choice.
As to your case, I think a HASH partition fits best.
Here's an example of how you can partition your X table:
CREATE TABLE X
(
Username Varchar2(10 Char),
Import_date Date
) PARTITION BY HASH(Username) PARTITIONS 16; -- 16 is the number of partitions.
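If you want to verify what Oracle created, the generated hash partitions can be inspected through the standard data dictionary view (assuming the table X from the example above):
-- List the 16 system-generated hash partitions
SELECT partition_name, partition_position
FROM user_tab_partitions
WHERE table_name = 'X';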
You can find more about partitioning in the official Oracle documentation.

Unusable partition Oracle / DataStage

I am facing an issue with my DataStage job. I have to fill a table TTPERIODEAS in Oracle from a .csv file. The SQL query in the Oracle connector is shown in this screenshot:
[screenshot: Oracle connector]
And here is the oracle script
CREATE TABLE TTPERIODEAS
(
CDPARTITION VARCHAR2(5 BYTE) NOT NULL ENABLE,
CDCOMPAGNIE NUMBER(4,0) NOT NULL ENABLE,
CDAPPLI NUMBER(4,0) NOT NULL ENABLE,
NUCONTRA CHAR(15 BYTE) NOT NULL ENABLE,
DTDEBAS NUMBER(8,0) NOT NULL ENABLE,
DTFINAS NUMBER(8,0) NOT NULL ENABLE,
TAUXAS NUMBER(8,5) NOT NULL ENABLE,
CONSTRAINT PK_TTPERIODEAS
PRIMARY KEY (CDPARTITION, CDCOMPAGNIE, CDAPPLI, NUCONTRA, DTDEBAS)
)
PARTITION BY LIST(CDPARTITION)
(PARTITION P_PERIODEAS_13Q VALUES ('13Q'));
When running the job, I get the following error message and the table is not filled:
The index 'USINODSD0.SYS_C00249007' its partition is unusable
Please, I need help. Thanks.
The index is global (i.e. not partitioned) because there is no using index local at the end of the definition. This is also true for the PK index shown above. (I'm assuming they are two different things, because by default the DDL above would create an index named PK_TTPERIODEAS, so I'm not sure what SYS_C00249007 is.) If you can drop and rebuild them as local indexes (i.e. partitioned to match the table) then truncating or dropping a partition will no longer invalidate indexes.
For example, you could rebuild the primary key as:
alter table ttperiodeas
drop primary key;
alter table ttperiodeas
add constraint pk_ttperiodeas primary key (cdpartition,cdcompagnie,cdappli,nucontra,dtdebas)
using index local;
I don't know how SYS_C00249007 is defined, but you could use something similar.
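If it helps, the index can be looked up in the standard data dictionary views first (a sketch; the index name is taken from the error message above):
-- Is SYS_C00249007 partitioned, and what kind of index is it?
SELECT index_name, index_type, uniqueness, partitioned
FROM user_indexes
WHERE index_name = 'SYS_C00249007';
-- Which columns does it cover?
SELECT column_name, column_position
FROM user_ind_columns
WHERE index_name = 'SYS_C00249007';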
The create table command might be something like:
create table ttperiodeas
( cdpartition varchar2(5 byte) not null
, cdcompagnie number(4,0) not null
, cdappli number(4,0) not null
, nucontra varchar2(15 byte) not null
, dtdebas number(8,0) not null
, dtfinas number(8,0) not null
, tauxas number(8,5) not null
, constraint pk_ttperiodeas
primary key (cdpartition,cdcompagnie,cdappli,nucontra,dtdebas)
using index local
)
partition by list(cdpartition)
( partition p_periodeas_13q values ('13Q') );
Alternatively, you could add the update global indexes clause when dropping the partition:
alter table demo_temp drop partition p_periodeas_14q update global indexes;
(By the way, NUCONTRA should probably be a standard VARCHAR2 and not CHAR, which is intended for cross-platform compatibility and ANSI completeness, and in practice just wastes space and creates bugs.)
The message says that the index partition for the given partition is unusable, so you could try to rebuild the corresponding index partition using
alter index [index_name] rebuild partition [partition_name];
(with fitting values for [index_name] and [partition_name]).
Before you do that, you should check the status of the index partitions in user_ind_partitions, since your error message does not look like Oracle error messages usually do.
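For example (a sketch against the standard dictionary view):
-- Show index partitions that need a rebuild
SELECT index_name, partition_name, status
FROM user_ind_partitions
WHERE status = 'UNUSABLE';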
But since the index is global as William Robertson pointed out, this is not applicable for the given situation.

Hive select from table as complex type

Consider a base table employee and a table derived from employee called employee_salary_period, which contains a complex map datatype. How do I select data from employee and insert it into employee_salary_period, where salary_period_map is a key-value pair, i.e. salary: period?
CREATE TABLE employee(
emp_id bigint,
name string,
address string,
salary double,
period string,
position string
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET
CREATE TABLE employee_salary_period(
emp_id bigint,
name string,
salary string,
period string,
salary_period_map Map<String,String>
)
PARTITIONED BY (
dept_id bigint)
STORED AS PARQUET
I'm stuck trying to figure out how to select data as salary_period_map
Consider using the str_to_map function provided by Hive. I assume you have only one key (salary) in your map:
select
emp_id,
name,
salary,
period,
str_to_map(concat(salary,":",period),'&',':') as salary_period_map
from employee
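To actually populate the derived table, the full insert might look like the sketch below (the dynamic-partitioning settings are assumptions about your configuration; the cast makes the double-to-string conversion explicit):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table employee_salary_period partition (dept_id)
select
emp_id,
name,
cast(salary as string) as salary,
period,
str_to_map(concat(salary,":",period),'&',':') as salary_period_map,
dept_id -- dynamic partition column goes last
from employee;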

Sorting in Cassandra on primary key of type timeuuid

I am new to Cassandra. I want to get the result set sorted on the basis of the primary key, i.e. timeuuid. My table structure is:
CREATE TABLE user_session
(
session_id timeuuid,
ip inet,
device_type int,
is_active int,
last_access_time timestamp,
logout_reason text,
logout_type int,
start_time timestamp,
uid int,
PRIMARY KEY(session_id)
);
Can anyone help me out?
You cannot use ORDER BY in your query on a partition key column. It is only supported on clustering columns. You would have to change this table so you can perform such a query. It might look something like this:
CREATE TABLE user_session
(
user_id int,
session_id timeuuid,
ip inet,
device_type int,
is_active int,
last_access_time timestamp,
logout_reason text,
logout_type int,
start_time timestamp,
uid int,
PRIMARY KEY(user_id, session_id)
);
Then, your query would look like:
select * from user_session where user_id=5 order by session_id ASC;
Basically, you need a partition key which will be used for searching data, where only EQ and IN relations are allowed (so you can't have user_id > 5 or something similar), and then you can order your results by the clustering column, which in your case is session_id.
Zoran
You can specify CLUSTERING ORDER BY in the table definition to keep rows in the correct order as they are inserted. The PRIMARY KEY definition has the partition key user_id and the clustering column session_id, which is used to sort. It might look something like this:
CREATE TABLE user_session
(
user_id int,
session_id timeuuid,
ip inet,
device_type int,
is_active int,
last_access_time timestamp,
logout_reason text,
logout_type int,
start_time timestamp,
uid int,
PRIMARY KEY(user_id, session_id)
) WITH CLUSTERING ORDER BY (session_id ASC);
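With this definition, rows within a partition already come back in clustering order, so no explicit ORDER BY is needed:
select * from user_session where user_id = 5;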

Oracle DB daily partitioning

I have the following table
CREATE TABLE "METRIC_VALUE_RAW"
(
"SUBELEMENT_ID" INTEGER NOT NULL ,
"METRIC_METADATA_ID" INTEGER NOT NULL ,
"METRIC_VALUE_INT" INTEGER,
"METRIC_VALUE_FLOAT" FLOAT(126),
"TIME_STAMP" TIMESTAMP NOT NULL
) ;
Every hour, data will be loaded into the table using SQL*Loader.
I want to create partitions so that data for each day goes into its own partition.
I want to store 30 days of data in the table, so when a partition crosses 30 days, the oldest partition should get deleted.
Can you share your ideas on how I can design the partitions?
Here is an example of how to do it on Oracle 11g, and it works very well. I haven't tried it on Oracle 10g; you can try it.
This is the way, how to create a table with daily partitions:
CREATE TABLE XXX (
partition_date DATE,
...,
...,
)
PARTITION BY RANGE (partition_date)
INTERVAL (NUMTODSINTERVAL(1, 'day'))
(
PARTITION part_01 values LESS THAN (TO_DATE('2000-01-01','YYYY-MM-DD'))
)
TABLESPACE MY_TABLESPACE
NOLOGGING;
As you see above, Oracle will automatically create a separate partition for each distinct day of partition_date after 1st January 2000. Records whose partition_date is older than this date will be stored in the partition called 'part_01'.
You can monitor your table partitions using this statement:
SELECT * FROM user_tab_partitions WHERE table_name = 'XXX';
Afterwards, when you would like to delete some partitions, use the following command:
ALTER TABLE XXX DROP PARTITION AAAAAA UPDATE GLOBAL INDEXES
where 'AAAAAA' is the partition name.
I hope it will help you!
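To automate the 30-day retention from the question, a scheduled purge along these lines could work (a sketch only: it assumes METRIC_VALUE_RAW is interval-partitioned on TIME_STAMP as shown above, and it evaluates each partition's upper bound from the data dictionary):
DECLARE
  v_high_value DATE;
BEGIN
  FOR p IN (SELECT partition_name, high_value
            FROM user_tab_partitions
            WHERE table_name = 'METRIC_VALUE_RAW') LOOP
    -- HIGH_VALUE is stored as text (e.g. a TO_DATE expression), so evaluate it
    EXECUTE IMMEDIATE 'SELECT ' || p.high_value || ' FROM dual' INTO v_high_value;
    IF v_high_value < SYSDATE - 30 THEN
      EXECUTE IMMEDIATE 'ALTER TABLE metric_value_raw DROP PARTITION '
                        || p.partition_name || ' UPDATE GLOBAL INDEXES';
    END IF;
  END LOOP;
END;
/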
As I said, there are big differences in partition automation between 10g and 11g.
In 10g you will have to manually manage the partitions during your ETL process (I'm sure every 10g DBA has a utility package they wrote to manage partitions...).
For steps 1 & 2 , you have several options
load data directly into the daily partition.
load data into a new partition and merge it into the daily one.
load data into a new partition every hour, and during a maintenance window merge all hourly partitions into a daily partition.
The right way for you depends on your needs. Is the newly added data queried immediately? In what manner? Would you query for data across several hours (or loads)? Are you showing aggregations? Are you performing DML operations on the data (DDL operations on partitions cause massive locking)?
As for step 3, again: manually drop old partitions.
In 11g, you have the new interval partitioning feature, which automates some of the tasks mentioned above.
Following is a sample CREATE TABLE statement to partition data (note that this example uses MySQL syntax, not Oracle's):
CREATE TABLE quarterly_report_status (
report_id INT NOT NULL,
report_status VARCHAR(20) NOT NULL,
report_updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
)
PARTITION BY RANGE ( UNIX_TIMESTAMP(report_updated) ) (
PARTITION p0 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-01 00:00:00') ),
PARTITION p1 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-02 00:00:00') ),
PARTITION p2 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-03 00:00:00') ),
PARTITION p3 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-04 00:00:00') ),
PARTITION p4 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-05 00:00:00') ),
PARTITION p5 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-06 00:00:00') ),
PARTITION p6 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-07 00:00:00') ),
PARTITION p7 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-08 00:00:00') ),
PARTITION p8 VALUES LESS THAN ( UNIX_TIMESTAMP('2008-01-09 00:00:00') ),
PARTITION p9 VALUES LESS THAN (MAXVALUE)
);
Partitions will be created by the DBA and the rest will be taken care of by the database.
If you want to delete partitions, you will have to write a separate job for it.
