MySQL 8 slow query issue - mysql-8.0

My MySQL 8 configuration is as follows:
[mysqld]
slow_query_log = ON
long_query_time = 5
log_slow_admin_statements = 1
log_queries_not_using_indexes = 1
But the slow query log table contains entries whose query time does not match long_query_time (5 s).
Any ideas?

You have just found out why I recommend turning this off:
log_queries_not_using_indexes = 0
When a tiny table is accessed without an index, it will run fast enough that you should not care, and it should not clutter the log.
I also recommend a much lower value for this:
long_query_time = 1
pt-query-digest is a good tool for summarizing the data, especially if the log is written to a FILE.
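To check how many logged entries actually exceed long_query_time, a slow log written to a FILE can be split by its Query_time header. The following is a minimal sketch, assuming the usual "# Query_time: ..." header line of the file-format slow log; the function name and the sample log text are hypothetical and only illustrate the idea:

```python
import re

# Matches the timing header that precedes each entry in a file-based slow log.
QUERY_TIME_RE = re.compile(r"^# Query_time:\s*([0-9.]+)")

def split_by_threshold(log_text, threshold_s):
    """Return (over, under): Query_time values above / not above threshold_s."""
    over, under = [], []
    for line in log_text.splitlines():
        m = QUERY_TIME_RE.match(line)
        if not m:
            continue  # not a timing header, e.g. the SQL text itself
        t = float(m.group(1))
        (over if t > threshold_s else under).append(t)
    return over, under

# Hypothetical sample: two fast unindexed scans and one genuinely slow query.
SAMPLE_LOG = """\
# Query_time: 0.000412  Lock_time: 0.000101 Rows_sent: 3  Rows_examined: 3
SELECT * FROM tiny_lookup;
# Query_time: 6.250000  Lock_time: 0.000090 Rows_sent: 10  Rows_examined: 900000
SELECT * FROM big_table WHERE col = 1;
# Query_time: 0.001300  Lock_time: 0.000050 Rows_sent: 1  Rows_examined: 12
SELECT * FROM tiny_lookup WHERE name = 'x';
"""

over, under = split_by_threshold(SAMPLE_LOG, threshold_s=5.0)
print(len(over), "entries over the threshold;", len(under), "at or under it")
```

Entries that land in the second list were most likely logged because of log_queries_not_using_indexes rather than because they were slow.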


jdbc write to greenplum/postgres issue

Spark JDBC writes are painfully slow for wide data (400 columns and 200 rows), and even with fewer columns and more rows they take a long time (200k records take 30 to 60 minutes). We don't have a primary key for partitioning, so we will use a loosely relevant key. Reading over JDBC and the transformations are not a problem; only writing is.
Spark cluster configuration: 1 master, 2 workers, each with 8 cores and 32 GB.
spark-submit parameters: 'executor_cores': 2,'executor_memory': '2G','num_executors': 2,'driver_memory': '2G'
I tried the approaches below, following other Stack Overflow suggestions.
df.write.format('jdbc').options(url=url,driver=driver,dbtable=table,user=user,password=password,batchsize=20000,rewriteBatchedStatements='true').mode(mode).save()
df.repartition(15).write.format('jdbc').options(url=url,driver=driver,dbtable=table,user=user,password=password,batchsize=20000,rewriteBatchedStatements='true').mode(mode).save()
Writing to MySQL works fine. Writing to Greenplum and Postgres is the issue (verified on both).
I couldn't find many options.
After some trial and error, I found some parameters that helped improve performance.
For Postgres, reWriteBatchedInserts=true should be used instead of rewriteBatchedStatements=true (the latter is for MySQL only). This helps a lot in terms of performance.
Reducing batchsize also helped when writing to the database. A suitable value for a given environment can be found by trial and error.
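The choice between the two driver properties can be made from the JDBC URL. The helper below is a minimal sketch with a hypothetical name (jdbc_write_options); it assumes that Greenplum is reached through the PostgreSQL driver (a jdbc:postgresql: URL) and that Spark forwards options it does not recognise to the JDBC driver as connection properties. If that assumption does not hold in your setup, appending the property to the URL is an alternative.

```python
def jdbc_write_options(url, table, user, password, driver, batchsize=1000):
    """Build an option dict for df.write.format('jdbc').options(**opts)."""
    opts = {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
        # Rows sent per JDBC batch; tune by trial and error.
        "batchsize": str(batchsize),
    }
    # Pick the driver-specific batch-rewrite property from the URL scheme.
    if url.startswith("jdbc:postgresql:"):
        opts["reWriteBatchedInserts"] = "true"     # PostgreSQL JDBC driver
    elif url.startswith("jdbc:mysql:"):
        opts["rewriteBatchedStatements"] = "true"  # MySQL Connector/J
    return opts

# Hypothetical connection details, for illustration only.
pg_opts = jdbc_write_options(
    "jdbc:postgresql://dbhost:5432/mydb", "public.target", "user", "secret",
    "org.postgresql.Driver", batchsize=5000,
)
print(sorted(pg_opts))
```

With a real DataFrame the result would be used as df.write.format('jdbc').options(**pg_opts).mode(mode).save().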

Spark. Data caching?

I'm testing the script below in spark-shell: a single-partition scan of a partitioned table.
val s = System.nanoTime
var q =
s"""
select * from partitioned_table where part_column = 'part_column_value'
"""
spark.sql(q).show
println("Elapsed: " + (System.nanoTime-s) / 1e9 + " seconds")
The first execution takes around 30 seconds, while all subsequent executions take around 2 seconds.
Looking at the runtime statistics, there are two additional jobs before the first execution.
It looks like the job with 1212 stages scans all the partitions in the table (1199 partitions in total, 1384 HDFS files for this table).
I did not find a way to discover exactly what Scala/Java or SQL code is running for Job 0, but I suspect it is for caching.
Each time I exit spark-shell and start it again, I see these two additional jobs before the first execution.
Similar observations hold for other queries.
Questions
Is it possible to prove or disprove the caching hypothesis?
If it is caching, how can I disable the cache and how can I clear it?
Update. Details about job.
The problem occurred specifically in Spark version 2.0.2.
Spark was scanning all the partitions while building the plan, before query execution.
The issue was logged and fixed in Spark 2.1.0:
https://issues.apache.org/jira/browse/SPARK-16980

ExecuteSQL does nothing

I am trying to fetch data from an Oracle database through NiFi. On the canvas, I put a "GenerateFlowFile" processor with a file size of 0 KB, scheduled to run every 5 minutes. This is just to trigger the "ExecuteSQL" processor on success. For "ExecuteSQL", I set the DB Connection Pooling Service to DBCPConnectionPool and entered the SQL query "SELECT * FROM SOMETABLE". My DBCPConnectionPool configuration is as follows:
URL = jdbc:oracle:thin:@hostname:port:sid
Driver = oracle.jdbc.driver.OracleDriver
Jar URL = file:///somelocation/ojdbc6.jar
User = someuser
Password = somepassword
When I try to run it, nothing happens. The red box turns green and the number 1 appears in the top right corner of the "ExecuteSQL" processor, but nothing else happens. When I stop it, Active Threads still shows 1.
Can someone please advise? I am new to this. Thank you.
Since the original post is answered, I'll respond to the question within its comments:
You can set the GenerateFlowFile processor to run every 30 seconds or so, then start and immediately stop it. This will cause ExecuteSQL to run exactly once, fetching all rows.
Alternatively (in NiFi 0.6.0+) you can use the QueryDatabaseTable processor, which will fetch all the rows the first time but then (based on a maximum-value column like an increasing primary key) only return rows as they are added.
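The maximum-value-column idea can be illustrated outside NiFi. The following is a minimal sketch using Python's built-in sqlite3 module, not NiFi's actual implementation; the function name fetch_new_rows and the table contents are hypothetical. It remembers the largest value seen so far and, on the next call, selects only rows above it:

```python
import sqlite3

def fetch_new_rows(conn, table, max_col, last_seen):
    """Return (rows, new_mark): rows with max_col > last_seen, and the updated mark."""
    # table and max_col are trusted identifiers here; only the value is bound.
    if last_seen is None:
        cur = conn.execute(f"SELECT * FROM {table} ORDER BY {max_col}")
    else:
        cur = conn.execute(
            f"SELECT * FROM {table} WHERE {max_col} > ? ORDER BY {max_col}",
            (last_seen,),
        )
    rows = cur.fetchall()
    # Rows are ordered by max_col, so the last row carries the new high-water mark.
    new_mark = rows[-1][max_col] if rows else last_seen
    return rows, new_mark

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
conn.execute("CREATE TABLE sometable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO sometable (payload) VALUES (?)", [("a",), ("b",)])

first, mark = fetch_new_rows(conn, "sometable", "id", None)   # initial full fetch
conn.execute("INSERT INTO sometable (payload) VALUES (?)", ("c",))
second, mark2 = fetch_new_rows(conn, "sometable", "id", mark)  # only the new row
print(len(first), len(second))
```

A plain ExecuteSQL with "SELECT * FROM SOMETABLE" corresponds to calling this with last_seen=None every time, which is why it returns every row on every run.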

Extremely poor performance with Tableau + Spark + Cassandra

Currently I am in the process of investigating the possibility of using Cassandra in combination with Spark and Tableau for data analysis. However, the performance that I am currently experiencing with this setup is so poor that I cannot imagine using it for production purposes. As I am reading about how great the performance of the combination of Cassandra + Spark must be, I am obviously doing something wrong, yet I cannot find out what.
My test data:
All data is stored on a single node
Queries are performed on a single table with 50MB (interval data)
Columns used in selection criteria are indexed
My test setup:
MacBook 2015, 1.1 GHz, 8GB memory, SSD, OS X El Capitan
Virtual Box, 4GB memory, Ubuntu 14.04
Single node with DataStax Enterprise 4.8.4:
Apache Cassandra 2.1.12.1046
Apache Spark 1.4.2.2
Spark Connector 1.4.1
Apache Thrift 0.9.3
Hive Connector 0.2.11
Tableau (Connected through ODBC)
Findings:
When a change in Tableau requires loading data from the database, it takes anywhere between 40 seconds and 1.4 minutes to retrieve the data (which is basically unworkable)
When I use Tableau in combination with Oracle instead of Cassandra + Spark, but on the same virtual box, I get the results almost instantaneously
Here is the table definition used for the queries:
CREATE TABLE key.activity (
interval timestamp,
id bigint,
activity_name text,
begin_ts timestamp,
busy_ms bigint,
container_code text,
duration_ms bigint,
end_location_code text,
end_ts timestamp,
pallet_code text,
src_location_code text,
start_location_code text,
success boolean,
tgt_location_code text,
transporter_name text,
PRIMARY KEY (interval, id)
) WITH CLUSTERING ORDER BY (id ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"ALL"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CREATE INDEX activity_activity_name_idx ON key.activity (activity_name);
CREATE INDEX activity_success_idx ON key.activity (success);
CREATE INDEX activity_transporter_name_idx ON key.activity (transporter_name);
Here is an example of a query produced by Tableau:
INFO 2016-02-10 20:22:21 org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation: Running query 'SELECT CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END AS `calculation_185421691185008640`,
AVG(CAST(`activity`.`busy_ms` AS DOUBLE)) AS `avg_busy_ms_ok`,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT) AS `qr_interval_ok`,
`activity`.`transporter_name` AS `transporter_name`,
YEAR(`activity`.`interval`) AS `yr_interval_ok`
FROM `key`.`activity` `activity`
GROUP BY CASE WHEN 4 >= 0 THEN SUBSTRING(`activity`.`transporter_name`,1,CAST(4 AS INT)) ELSE NULL END,
CAST((MONTH(`activity`.`interval`) - 1) / 3 + 1 AS BIGINT),
`activity`.`transporter_name`,
YEAR(`activity`.`interval`)'
Here are the statistics for an example query that took 52 seconds:
[Figure: Spark statistics for a query that took 52 seconds to complete]
I've tried playing around with the partition keys as mentioned in other posts, but did not see a significant difference. I've also tried to enable row caching (Cassandra config + table property), but this also did not have any effect (although perhaps I have overlooked something there).
I would have expected to get at least a factor 10x-20x better performance out of the box, even without fiddling around with all these parameters and I've run out of ideas what to do.
What am I doing wrong? What performance should I expect?
Answering your questions will not be easy because of the variables you do not define in your post. You mention that the data is stored on one node, which is fine, but you don't describe how you have structured your tables/column families. You also don't mention the Cassandra cache hit ratios. You also have to consider Cassandra compaction: if compaction runs during heavy read/write operations, it will slow things down.
You also appear to have a single SSD, in which case the data directory, commitlogs, and cache directories are all on the same physical drive. Even though it is not a spinning disk, you will see degraded performance unless you split the data directory from the commitlog/cache directories. I saw a 50% increase in performance by moving the data directory onto its own physical SSD.
Lastly, you're running in a VM on a laptop host, in VirtualBox no less. Your largest bottleneck here is the 1.1 GHz CPU. In my Cassandra environments on VMware, while running medium jobs, I see almost 99% CPU use across 4 x 2 cores with 16 GB RAM. My data directories are on SSDs, while my commitlogs and cache directories are on a magnetic HDD. I get good performance, but I tuned my environments to get to this point and I accept the latency my non-production environments provide.
Take a look HERE and try to get a better understanding of how Cassandra should be used and how to achieve better performance out of the box. Distributed systems are just that: distributed, and for a reason. They rely on shared resources that you don't have available on a single machine.
Hope this explains a little more about where you're headed.
EDIT
Your table definition looks fine. Are you using the Tableau Spark connector? Your performance problem is likely on the Cassandra/Spark side of things.
Take a look at this article, which describes a compaction-related problem when reading from cache. Basically, on Cassandra releases prior to 2.1.2, you lose your cache after a compaction, because Cassandra throws the file (and its cache) away once the compaction finishes. Once you start reading, you immediately get a cache miss and Cassandra goes back to disk. This is fixed in releases from 2.1.2 onward. Everything else looks normal with respect to running Spark/Cassandra.
While the query time does seem a little high, there are a few things I see that could cause issues.
I noticed you're using a MacBook. Beautiful computer, but not ideal for Spark. I believe those use the dual-core Intel M processors. If you go to your Spark Master UI, it will show you the available cores. It might show 4 (including vCPUs).
The way you are running this query doesn't allow for much parallelism (if any). You basically don't get the advantages of Spark in this case, because you're running in an extremely small VM on a single node with limited CPUs. Visualization tools haven't really caught up to Spark yet.
One other thing to keep in mind is that Spark is not designed as an ad hoc query tool. You can think of SparkSQL as an abstraction over proper Spark Batch. Comparing it to Oracle at this scale won't yield the results you expect. There's a minimum performance threshold that you'll notice with Spark. Once you scale data and nodes far enough, you'll start to see that time to completion does not grow linearly with data size: as you add more data, the time to process remains relatively flat.
I suggest trying that query in the SparkSQL REPL (dse spark-sql) to see if you get similar times. If you do, then you know that's the best you'll get with your current setup. If Tableau is MUCH slower than the REPL, I'd guess it's something on their end at that point.

What is Hive: Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

I am getting:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
While trying to make a copy of a partitioned table using the commands in the hive console:
CREATE TABLE copy_table_name LIKE table_name;
INSERT OVERWRITE TABLE copy_table_name PARTITION(day) SELECT * FROM table_name;
I initially got some semantic analysis errors and had to set:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
although I'm not sure what these properties do.
Full output from the Hive console:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Starting Job = job_201206191101_4557, Tracking URL = http://jobtracker:50030/jobdetails.jsp?jobid=job_201206191101_4557
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=master:8021 -kill job_201206191101_4557
2012-06-25 09:53:05,826 Stage-1 map = 0%, reduce = 0%
2012-06-25 09:53:53,044 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201206191101_4557 with errors
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
That's not the real error. Here's how to find it:
Go to the Hadoop JobTracker web dashboard, find the Hive MapReduce jobs that failed, and look at the logs of the failed tasks. Those will show you the real error.
The console output errors are useless, largely because the console doesn't have a view of the individual jobs/tasks from which to pull the real errors (there could be errors in multiple tasks).
I know I am 3 years late to this thread, but I am still adding my 2 cents for similar cases in the future.
I recently faced the same error in my cluster.
The job would always get to 80%+ of the reduce phase and then fail with the same error, with nothing to go on in the execution logs either.
After multiple iterations and some research, I found that among the many files being loaded, some did not comply with the structure defined for the base table (the table used to insert data into the partitioned table).
Note that whenever I ran a select query for a particular value of the partitioning column, or created a static partition, it worked fine, because in that case the bad records were skipped.
TL;DR: Check the incoming data/files for structural inconsistencies, as Hive follows a schema-on-read philosophy.
Adding some information here, as it took me a while to find the Hadoop JobTracker web dashboard in HDInsight (Azure's Hadoop), and a colleague finally showed me where it was. There is a shortcut on the head node called "Hadoop Yarn Status", which is just a link to a local HTTP page (http://headnodehost:9014/cluster in my case). When opened, the dashboard looked like this:
In that dashboard you can find your failed application, and after clicking into it you can look at the logs of the individual map and reduce tasks.
In my case it still seemed to be running out of memory in the reducers, even though I had already increased the memory in the configuration. For some reason it was not surfacing the "java outofmemory" errors I got earlier, though.
The top answer is right that the error code doesn't give you much information. One common cause our team saw for this error code was a poorly optimized query. A known case was an inner join where the left-side table was orders of magnitude bigger than the right-side table. Swapping the tables would usually do the trick in such cases.
I removed the _SUCCESS file from the EMR output path in S3 and it worked fine.
I also faced the same error when inserting data into a Hive external table that pointed to an Elasticsearch cluster.
I replaced the older JAR elasticsearch-hadoop-2.0.0.RC1.jar with elasticsearch-hadoop-5.6.0.jar, and everything worked fine.
My suggestion is to use the JAR that matches your Elasticsearch version. Don't use older JARs with a newer version of Elasticsearch.
Thanks to this post: Hive- Elasticsearch Write Operation #409
I received this error when joining two tables, one large and the other small enough to fit in memory. In such a case, use
set hive.auto.convert.join = false;
This might help to get rid of the above error. For more detail on this issue, please refer to the threads below:
Hive Map-Join configuration mystery
Hive.auto.convert.join = true what is the significance of this?
I faced the same issue too. When I checked the dashboard, I found the following error. The data was coming through Flume and had been interrupted partway, so there may have been inconsistencies in a few files.
Caused by: org.apache.hadoop.hive.serde2.SerDeException: org.codehaus.jackson.JsonParseException: Unexpected end-of-input within/between OBJECT entries
Running on fewer files it worked. Format consistency was the reason in my case.
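A truncated record like the one behind this JsonParseException can be located before the Hive job runs. The following is a minimal sketch, assuming the input holds one JSON record per line (an assumption about the file layout, not something stated above); the function name and sample records are hypothetical:

```python
import json

def find_bad_json_lines(lines):
    """Return (line_number, error_message) for every line that fails to parse."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((lineno, exc.msg))
    return bad

# Hypothetical sample: the second record was cut off mid-write.
sample = [
    '{"id": 1, "name": "ok"}',
    '{"id": 2, "name": "trunc',
    '',
    '{"id": 3}',
]

for lineno, msg in find_bad_json_lines(sample):
    print(f"line {lineno}: {msg}")
```

For real files, the same function can be applied to an open file handle, since iterating over a text file yields its lines.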
I faced the same issue because I didn't have permission to query the database I was trying to use.
If you don't have permission to query the table/database, then besides the Return Code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask error, you will see that Cloudera Manager does not even register your query.
In my case, the solution was adding more RAM to the virtual machines. Sometimes code 2 means that the map and reduce nodes do not have enough memory.
Another option could be changing the properties "mapreduce.map.memory.mb" and "mapreduce.reduce.memory.mb" in the mapred-site.xml file.
I got the same error while creating a Hive table in beeline. I then tried to create it through spark-shell, which threw the actual error. In my case the error was the disk space quota for the HDFS directory.
org.apache.hadoop.ipc.RemoteException: The DiskSpace quota of /user/hive/warehouse/XXX_XX.db is exceeded: quota = 6597069766656 B = 6 TB but diskspace consumed = 6597493381629 B = 6.00 TB
