How do you measure query performance in Cassandra?

I am still new to Cassandra, and I am doing some measurements of the memory and processor resources used by each of my queries. Does Cassandra have its own way to show query performance, or should I use some third-party tool?

You can switch tracing on to see the internal steps:
TRACING ON
For the query below:
INSERT INTO cycling.cyclist_name (
id,
lastname,
firstname
)
VALUES (
e7ae5cf3-d358-4d99-b900-85902fda9bb0,
'FRAME',
'Alex'
);
Below is the trace log:
Tracing session: 9b378c70-b114-11e6-89b5-b7fad52e1885
activity | timestamp | source | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+-----------
Execute CQL3 query | 2016-11-22 16:34:34.300000 | 127.0.0.1 | 0 | 127.0.0.1
Parsing INSERT INTO cycling.cyclist_name (id, lastname, firstname) VALUES (e7ae5cf3-d358-4d99-b900-85902fda9bb0, 'FRAME','Alex'); [Native-Transport-Requests-1] | 2016-11-22 16:34:34.305000 | 127.0.0.1 | 5935 | 127.0.0.1
Preparing statement [Native-Transport-Requests-1] | 2016-11-22 16:34:34.308000 | 127.0.0.1 | 9199 | 127.0.0.1
Determining replicas for mutation [Native-Transport-Requests-1] | 2016-11-22 16:34:34.330000 | 127.0.0.1 | 30530 | 127.0.0.1
Appending to commitlog [MutationStage-3] | 2016-11-22 16:34:34.330000 | 127.0.0.1 | 30979 | 127.0.0.1
Adding to cyclist_name memtable [MutationStage-3] | 2016-11-22 16:34:34.330000 | 127.0.0.1 | 31510 | 127.0.0.1
Request complete | 2016-11-22 16:34:34.333633 | 127.0.0.1 | 33633 | 127.0.0.1
Reference link: https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshTracing.html
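To measure a single statement, a minimal cqlsh session sketch looks like the following (the SELECT reuses the id from the INSERT above; the source_elapsed column in the resulting trace is in microseconds):
TRACING ON
SELECT lastname, firstname
FROM cycling.cyclist_name
WHERE id = e7ae5cf3-d358-4d99-b900-85902fda9bb0;
TRACING OFF
Note that tracing measures latency rather than per-query memory or CPU; for node-level latency percentiles per table, nodetool tablehistograms <keyspace> <table> is the usual complement.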

Related

Timestamp is different for the same table in hive-cli & presto-cli

I am getting different timestamps for the same table in hive-cli & presto-cli.
The table structure is:
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE EXTERNAL TABLE `mea_req_201`( |
| `mer_id` bigint, |
| `mer_from_dttm` timestamp, |
| `mer_to_dttm` timestamp, |
| `_c0` bigint, |
| `a_number` string, |
| `b_number` string, |
| `time_stamp` timestamp, |
| `duration` bigint) |
| PARTITIONED BY ( |
| `partition_col` bigint) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' |
| LOCATION |
| 'hdfs://hadoop5:8020/warehouse/tablespace/external/hive/mea_req_201' |
| TBLPROPERTIES ( |
| 'TRANSLATED_TO_EXTERNAL'='TRUE', |
| 'bucketing_version'='2', |
| 'external.table.purge'='TRUE', |
| 'spark.sql.create.version'='2.4.0.7.1.4.0-203', |
| 'spark.sql.sources.schema.numPartCols'='1', |
| 'spark.sql.sources.schema.numParts'='1', |
| 'transient_lastDdlTime'='1625496239') |
+----------------------------------------------------+
Running the same query from hive-cli and from presto-cli (output screenshots omitted) shows a time difference in the mer_from_dttm column, while the dates in the other timestamp columns are exactly the same. This time-difference behaviour is identical through presto-jdbc as well. I believe it has nothing to do with the timezone: if it were a timezone issue, the difference would appear across all timestamp columns, not just one. Please suggest a resolution.
Some Facts:
Presto server version: 0.180
Presto Jdbc version: 0.180
hive.time-zone=Asia/Calcutta
In Presto jvm.config: -Duser.timezone=Asia/Calcutta
Client TimeZone: Asia/Calcutta
Edit 1:
I sorted the query by mer_id to ensure both clients return the same set of rows; however, the erroneous behaviour remains the same in both hive-cli and presto-cli (screenshots omitted).
Presto 0.180 is really old. It was released in 2017, and many bugs have been fixed along the way.
I would suggest you try a recent version. In particular, recent versions of Trino (formerly known as PrestoSQL) have had a lot of work done around the handling of timestamp data.
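Before upgrading, you can also rule out session settings explicitly. A quick check from presto-cli, using the built-in current_timezone() function and a cast that shows the value Presto actually holds, independent of client-side rendering (a sketch against the table from the question):
SELECT current_timezone();
SELECT mer_id, mer_from_dttm, CAST(mer_from_dttm AS varchar) AS raw_value
FROM mea_req_201
LIMIT 5;
If the varchar rendering already differs from what hive-cli prints, the discrepancy is in how Presto reads the ORC timestamp, not in how the client displays it.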
Use ORDER BY so you compare exactly the same rows in each client:
SELECT `mer_id`, `mer_from_dttm`, `mer_to_dttm`, `time_stamp` FROM mea_req_201 ORDER BY `mer_id`;

Does Sqoop's incremental lastmodified option include the date given in --last-value in the result?

My MySQL table is:
mysql> select * from Orders;
+--------+-----------+------------+-------------+-------------+
| ord_no | purch_amt | ord_date | customer_id | salesman_id |
+--------+-----------+------------+-------------+-------------+
| 70001 | 150.50 | 2012-10-05 | 3005 | 5002 |
| 70009 | 270.65 | 2012-09-10 | 3001 | 5005 |
| 70002 | 65.26 | 2012-10-05 | 3002 | 5001 |
| 70004 | 110.50 | 2012-08-17 | 3009 | 5003 |
| 70007 | 948.50 | 2012-09-10 | 3005 | 5002 |
| 70005 | 999.99 | 2012-07-27 | 3007 | 5001 |
| 70008 | 999.99 | 2012-09-10 | 3002 | 5001 |
| 70010 | 999.99 | 2012-10-10 | 3004 | 5006 |
| 70003 | 999.99 | 2012-10-10 | 3009 | 5003 |
| 70012 | 250.45 | 2012-06-27 | 3008 | 5002 |
| 70011 | 75.29 | 2012-08-17 | 3003 | 5007 |
| 70013 | 999.99 | 2012-04-25 | 3002 | 5001 |
+--------+-----------+------------+-------------+-------------+
I ran a Sqoop import:
sqoop import --connect jdbc:mysql://ip-172-31-20-247:3306/sqoopex \
  --username sqoopuser --password <hidden> --table Orders \
  --target-dir SqoopImp2 --split-by ord_no \
  --check-column ord_date --incremental lastmodified \
  --last-value '2012-09-10'
As the Sqoop 1.4.6 manual states:
An alternate table update strategy supported by Sqoop is called lastmodified
mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported
I was not expecting rows with date '2012-10-05' in the output. However, my output, shown below,
[manojpurohit17834325@ip-172-31-38-146 ~]$ hadoop fs -cat SqoopImp2/*
70001,150.50,2012-10-05,3005,5002
70002,65.26,2012-10-05,3002,5001
70003,999.99,2012-10-10,3009,5003
70007,948.50,2012-09-10,3005,5002
70009,270.65,2012-09-10,3001,5005
70008,999.99,2012-09-10,3002,5001
70010,999.99,2012-10-10,3004,5006
contains rows with date 2012-09-10. Note: the output directory was not present earlier; it was created by this Sqoop execution.
From this execution, I see that the date in --last-value is included in the output, which is contrary to what the manual says. Please help me understand this discrepancy and correct me if I am missing something here.
Yes, rows matching --last-value are also included in the results: the lower bound is inclusive.
--incremental import has two modes: i) append, ii) lastmodified.
append: checks --last-value and imports rows from that value onwards. It does not re-import earlier rows, even if they were updated.
lastmodified: like append it imports new rows, but it also re-imports previously imported rows if they have been updated since.
Note: lastmodified works only on date or timestamp columns, and because the lower bound is inclusive, rows dated exactly '2012-09-10' appear in your output.
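In SQL terms, the filter Sqoop effectively applies in lastmodified mode looks roughly like this (a sketch: the inclusive lower bound and an upper bound at the time the import starts explain the observed output; the exact generated SQL may differ by version):
SELECT * FROM Orders
WHERE ord_date >= '2012-09-10'                -- the --last-value, inclusive
  AND ord_date < '<time the import started>'; -- recorded as the next run's --last-value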

Sqoop import with a SQL query, WHERE clause, and parallel processing

I have the following table in MySQL:
Order_Details :
+---------+------------+-------------------+--------------+
| orderid | order_date | order_customer_id | order_status |
+---------+------------+-------------------+--------------+
| A001 | 10/30/2018 | C003 | Completed |
| A002 | 10/30/2018 | C005 | Completed |
| A451 | 11/02/2018 | C376 | Pending |
| P9209 | 10/30/2018 | C234 | Completed |
| P92099 | 10/30/2018 | C244 | Pending |
| P9210 | 10/30/2018 | C035 | Completed |
| P92398 | 10/30/2018 | C346 | Pending |
| P9302 | 10/30/2018 | C034 | Completed |
+---------+------------+-------------------+--------------+
and its description is as below:
mysql> desc Order_Details_Sankha;
+-------------------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+-------------+------+-----+---------+-------+
| orderid | varchar(20) | NO | PRI | | |
| order_date | varchar(20) | YES | | NULL | |
| order_customer_id | varchar(20) | YES | | NULL | |
| order_status | varchar(20) | YES | | NULL | |
+-------------------+-------------+------+-----+---------+-------+
I am using the below Sqoop import with parallel processing:
sqoop import \
  --connect jdbc:mysql://ip-10-0-1-10.ec2.internal/06july2018_new \
  --username labuser \
  --password abc123 \
  --driver com.mysql.jdbc.Driver \
  --query "select * from Order_Details where order_date = '10/30/2018' AND \$CONDITIONS" \
  --target-dir /user/sankha087_gmail_com/outputs/EMP_Sankha_1112201888 \
  --split-by "," \
  --m 3
and I am getting the below error message:
18/12/15 17:15:14 WARN security.UserGroupInformation: PriviledgedActionException as:sankha087_gmail_com (auth:SIMPLE) cause:java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), MAX(,) FROM (select * from Order_Details_Sankha where order_date = '10/30/201' at line 1
18/12/15 17:15:14 ERROR tool.ImportTool: Import failed: java.io.IOException: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), MAX(,) FROM (select * from Order_Details_Sankha where order_date = '10/30/201' at line 1
at org.apache.sqoop.mapreduce.db.DataDrivenDBInputFormat.getSplits(DataDrivenDBInputFormat.java:207)
at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:305)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:200)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1307)
Please advise what needs to be changed in my import statement.
Sqoop parallelizes with a horizontal split (by rows), not a vertical split (by columns).
--split-by should be a column name, ideally a column whose values are evenly distributed.
https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1765770
Read section 7.2.4, "Controlling Parallelism". A corrected invocation is sketched below.
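A corrected invocation might look like the following (a sketch: it keeps the connection details from the question and splits on orderid; because orderid is a varchar, recent Sqoop versions also require the allow_text_splitter property, and a numeric, evenly distributed key would be preferable if one exists):
sqoop import \
  -Dorg.apache.sqoop.splitter.allow_text_splitter=true \
  --connect jdbc:mysql://ip-10-0-1-10.ec2.internal/06july2018_new \
  --username labuser \
  --password abc123 \
  --driver com.mysql.jdbc.Driver \
  --query "select * from Order_Details where order_date = '10/30/2018' AND \$CONDITIONS" \
  --target-dir /user/sankha087_gmail_com/outputs/EMP_Sankha_1112201888 \
  --split-by orderid \
  -m 3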

Hive: Error while fetching data

I queried Hive as below:
select * from some-table where yyyy = 2018 and mm = 01 and dd = 05
The query ran successfully. After adding one more filter on a string-typed column, the following error is generated:
java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.serde2.io.DateWritable cannot be cast to
org.apache.hadoop.io.Text
The error is generated by the serializer/deserializer (SerDe).
Root cause: when you created the table, you probably didn't specify the STORED AS clause. Describe your table using desc <table name> and you may see something like this:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
This is not good practice: the table fell back to the default LazySimpleSerDe even though the input and output formats are ORC, so the SerDe and the file format disagree. Create the table with STORED AS ORC, then describe it again; the result should look like this:
| # Storage Information | NULL | NULL |
| SerDe Library: | org.apache.hadoop.hive.ql.io.orc.OrcSerde | NULL |
| InputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcInputFormat | NULL |
| OutputFormat: | org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat | NULL |
Try this and you may be able to resolve the issue.
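A minimal sketch of the fix (the table and column names here are hypothetical, matching the partition columns used in the failing query):
CREATE TABLE some_table (
  a_string_col STRING
)
PARTITIONED BY (yyyy INT, mm INT, dd INT)
STORED AS ORC;
With STORED AS ORC, Hive picks the matching OrcSerde automatically, so the SerDe and the file format can no longer disagree.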

How to get the top N count for each group in Hive

There is some data in the following format:
url | ip
-----------------------+-----------------
http://aaa.com/ | 1.1.1.1
http://bbb.com/ | 1.2.3.5
http://ccc.com/ | 1.1.1.6
http://ddd.com/ | 1.2.3.4
http://ccc.com/ | 1.1.1.2
http://ccc.com/ | 1.1.1.2
http://ccc.com/ | 1.1.1.2
http://aaa.com/ | 1.1.1.1
http://bbb.com/ | 1.2.3.5
I am now trying to count the ip column to get the top N IPs in each url group, like:
url | ip | ipcount
-----------------------+-----------------+-----------------
http://aaa.com/ | 1.1.1.1 | 2
http://aaa.com/ | 5.6.7.8 | 1
http://bbb.com/ | 1.2.3.5 | 2
http://ccc.com/ | 1.1.1.2 | 3
http://ccc.com/ | 1.1.1.6 | 1
http://ddd.com/ | 1.2.3.4 | 1
Please tell me how I can write the HQL to implement this in Hive.
Update: sorry, I forgot to mention that I should get only the top N records in each group, like ...
Try: SELECT url, ip, COUNT(url) FROM tbl GROUP BY url, ip
SELECT url, ip, count(*) as ipcount
from tbl t
group by url, ip
This should work for the grouped counts; for keeping only the top N rows per group, see the sketch below.
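For the top-N part, a window-function sketch (assuming the table is named tbl, as in the answers above, and N = 2; ROW_NUMBER is available in Hive 0.11+):
SELECT url, ip, ipcount
FROM (
  SELECT url, ip, COUNT(*) AS ipcount,
         ROW_NUMBER() OVER (PARTITION BY url ORDER BY COUNT(*) DESC) AS rn
  FROM tbl
  GROUP BY url, ip
) ranked
WHERE rn <= 2
ORDER BY url, ipcount DESC;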
