How to speed up a simple count operation in SORM? - performance

I'm trying to perform a simple count operation on a large table with SORM. This works inexplicably slow. Code:
case class Tests (datetime: DateTime, value:Double, second:Double)
object DB extends Instance (
entities = Set() +
Entity[Tests](),
url = "jdbc:h2:~/test",
user = "sa",
password = "",
poolSize = 10
)
object TestSORM {
def main(args: Array[String]) {
println("Start at " + DB.now())
println(" DB.query[Tests].count() = " + DB.query[Tests].count())
println("Finish at " + DB.now())
DB.close()
}
}
Results:
Start at 2013-01-28T22:36:05.281+04:00
DB.query[Tests].count() = 3390132
Finish at = 2013-01-28T22:38:39.655+04:00
It took two and a half minutes!
In H2 console 'select count(*) from tests' works in a tick. What is my mistake? Can I execute a sql query through it?
Thanks in advance!

It's a known optimization issue. Currently SORM does not utilize a COUNT operation. Since finally the issue has come up, I'll address it right away. Expect a release fixing this in the comming days.
The issue fix has been delayed. You can monitor its status on the issue tracker.

Related

[Snowflake-jdbc]It hangs when get info from resetset object of connection.getMetadata().getColumns(...)

I try to test the jdbc connection of snowflake with codes below
Connection conn = .......
.......
ResultSet rs = conn.getMetaData().getColumns(**null**, "PUBLIC", "TAB1", null); // 1. set parameters to get metadata of table TAB1
while (rs.next()) { // 2. It hangs here if the first parameter is null in above liune; otherwise(set the corrent db name), it works fine
System.out.println( "precision:" + rs.getInt(7)
+ ",col type name:" + rs.getString(6)
+ ",col type:" + rs.getInt(5)
+ ",col name:" + rs.getString(4)
+ ",CHAR_OCTET_LENGTH:" + rs.getInt(16)
+ ",buf LENGTH:" + rs.getString(8)
+ ",SCALE:" + rs.getInt(9));
}
.......
I debug the codes above in Intellij IDEA, and find that the debugger can't get the details of the object, it always shows "Evaluating..."
The JDBC driver I used is snowflake-jdbc-3.12.5.jar
Is it a bug?
When the catalog (database) argument is null, the JDBC code effectively runs the following SQL, which you can verify in your Snowflake account's Query History UIs/Views:
show columns in account;
This is an expensive metadata query to run due to no filters and the wide requested breadth (columns across the entire account).
Depending on how many databases exist in your organization's account, it may require several minutes or upto an hour of execution to return back results, which explains the seeming "hang". On a simple test with about 50k+ tables dispersed across 100+ of databases and schemas, this took at least 15 minutes to return back results.
I debug the codes above in Intellij IDEA, and find that the debugger can't get the details of the object, it always shows "Evaluating..."
This may be a weirdness with your IDE, but in a pinch you can use the Dump Threads (Ctrl + Escape, or Ctrl + Break) option in IDEA to provide a single captured thread dump view. This should help show that the JDBC client thread isn't hanging (as in, its not locked or starved), it is only waiting on the server to send back results.
There is no issue with the 3.12.5 jar.I just tested the same version in Eclipse, I can inspect all the objects . Could be an issue with your IDE.
ResultSet columns = metaData.getColumns(null, null, "TESTTABLE123",null);
while (columns.next()){
System.out.print("Column name and size: "+columns.getString("COLUMN_NAME"));
System.out.print("("+columns.getInt("COLUMN_SIZE")+")");
System.out.println(" ");
System.out.println("COLUMN_DEF : "+columns.getString("COLUMN_DEF"));
System.out.println("Ordinal position: "+columns.getInt("ORDINAL_POSITION"));
System.out.println("Catalog: "+columns.getString("TABLE_CAT"));
System.out.println("Data type (integer value): "+columns.getInt("DATA_TYPE"));
System.out.println("Data type name: "+columns.getString("TYPE_NAME"));
System.out.println(" ");
}

Is there any way to view the physical SQLs executed by Calcite JDBC?

Recently I am studying Apache Calcite, by now I can use explain plan for via JDBC to view the logical plan, and I am wondering how can I view the physical sql in the plan execution? Since there may be bugs in the physical sql generation so I need to make sure the correctness.
val connection = DriverManager.getConnection("jdbc:calcite:")
val calciteConnection = connection.asInstanceOf[CalciteConnection]
val rootSchema = calciteConnection.getRootSchema()
val dsInsightUser = JdbcSchema.dataSource("jdbc:mysql://localhost:13306/insight?useSSL=false&serverTimezone=UTC", "com.mysql.jdbc.Driver", "insight_admin","xxxxxx")
val dsPerm = JdbcSchema.dataSource("jdbc:mysql://localhost:13307/permission?useSSL=false&serverTimezone=UTC", "com.mysql.jdbc.Driver", "perm_admin", "xxxxxx")
rootSchema.add("insight_user", JdbcSchema.create(rootSchema, "insight_user", dsInsightUser, null, null))
rootSchema.add("perm", JdbcSchema.create(rootSchema, "perm", dsPerm, null, null))
val stmt = connection.createStatement()
val rs = stmt.executeQuery("""explain plan for select "perm"."user_table".* from "perm"."user_table" join "insight_user"."user_tab" on "perm"."user_table"."id"="insight_user"."user_tab"."id" """)
val metaData = rs.getMetaData()
while(rs.next()) {
for(i <- 1 to metaData.getColumnCount) printf("%s ", rs.getObject(i))
println()
}
result is
EnumerableCalc(expr#0..3=[{inputs}], proj#0..2=[{exprs}])
EnumerableHashJoin(condition=[=($0, $3)], joinType=[inner])
JdbcToEnumerableConverter
JdbcTableScan(table=[[perm, user_table]])
JdbcToEnumerableConverter
JdbcProject(id=[$0])
JdbcTableScan(table=[[insight_user, user_tab]])
There is a Calcite Hook, Hook.QUERY_PLAN that is triggered with the JDBC query strings. From the source:
/** Called with a query that has been generated to send to a back-end system.
* The query might be a SQL string (for the JDBC adapter), a list of Mongo
* pipeline expressions (for the MongoDB adapter), et cetera. */
QUERY_PLAN;
You can register a listener to log any query strings, like this in Java:
Hook.QUERY_PLAN.add((Consumer<String>) s -> LOG.info("Query sent over JDBC:\n" + s));
It is possible to see the generated SQL query by setting calcite.debug=true system property. The exact place where this is happening is in JdbcToEnumerableConverter. As this is happening during the execution of the query you will have to remove the "explain plan for"
from stmt.executeQuery.
Note that by setting debug mode to true you will get a lot of other messages as well as other information regarding generated code.

How to stabilize spark streaming application with a handful of super big sessions?

I am running a Spark Streaming application based on mapWithState DStream function . The application transforms input records into sessions based on a session ID field inside the records.
A session is simply all of the records with the same ID . Then I perform some analytics on a session level to find an anomaly score.
I couldn't stabilize my application because a handful of sessions are getting bigger at each batch time for extended period ( more than 1h) . My understanding is a single session (key - value pair) is always processed by a single core in spark . I want to know if I am mistaken , and if there is a solution to mitigate this issue and make the streaming application stable.
I am using Hadoop 2.7.2 and Spark 1.6.1 on Yarn . Changing batch time, blocking interval , partitions number, executor number and executor resources didn't solve the issue as one single task makes the application always choke. However, filtering those super long sessions solved the issue.
Below is a code updateState function I am using :
val updateState = (batchTime: Time, key: String, value: Option[scala.collection.Map[String,Any]], state: State[Seq[scala.collection.Map[String,Any]]]) => {
val session = Seq(value.getOrElse(scala.collection.Map[String,Any]())) ++ state.getOption.getOrElse(Seq[scala.collection.Map[String,Any]]())
if (state.isTimingOut()) {
Option(null)
} else {
state.update(session)
Some((key,value,session))
}
}
and the mapWithStae call :
def updateStreamingState(inputDstream:DStream[scala.collection.Map[String,Any]]): DStream[(String,Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] ={//MapWithStateDStream[(String,Option[scala.collection.Map[String,Any]], Seq[scala.collection.Map[String,Any]])] = {
val spec = StateSpec.function(updateState)
spec.timeout(Duration(sessionTimeout))
spec.numPartitions(192)
inputDstream.map(ds => (ds(sessionizationFieldName).toString, ds)).mapWithState(spec)
}
Finally I am applying a feature computing session foreach DStream , as defined below :
def computeSessionFeatures(sessionId:String,sessionRecords: Seq[scala.collection.Map[String,Any]]): Session = {
val features = Functions.getSessionFeatures(sessionizationFeatures,recordFeatures,sessionRecords)
val resultSession = new Session(sessionId,sessionizationFieldName,sessionRecords)
resultSession.features = features
return resultSession
}

JIRA Email Notification when log time worked is reaching budget time. OnDemand

I would like to know if is there an option to send a notification when log time worked is reaching estimating time.
example:
Start.
Estimated:
Original Estimate - 5 minutes
5m
Remaining:
Remaining Estimate - 5 minutes
5m
Logged:
Time Spent - Not Specified
Not Specified
When i Log time.
Estimated:
Original Estimate - 5 minutes
5m
Remaining:
Time Spent - 4 minutes Remaining Estimate - 1 minute
1m
Logged:
Time Spent - 4 minutes Remaining Estimate - 1 minute
4m
I like JIRA send the notification before 1 minute ending or what ever i set.
I'm sorry for the bad english.
Thank you
I believe you wanted to create some kind of triggered event when logged work approaches original estimate, but I do not know how you could do that in JIRA. Nevertheless, I know something that still might help you to solve your problem.
Try using following groovy script:
import com.atlassian.jira.ComponentManager
import com.atlassian.jira.component.ComponentAccessor
import com.atlassian.jira.config.properties.APKeys
import com.atlassian.jira.config.properties.ApplicationProperties
import com.atlassian.jira.issue.Issue
import com.atlassian.jira.issue.search.SearchResults
import com.atlassian.jira.issue.worklog.Worklog
import com.atlassian.jira.jql.parser.DefaultJqlQueryParser
import com.atlassian.jira.web.bean.PagerFilter
import com.atlassian.mail.Email
import com.atlassian.mail.MailException
import com.atlassian.mail.MailFactory
import com.atlassian.mail.queue.SingleMailQueueItem
import com.atlassian.query.Query
import groovy.text.GStringTemplateEngine
import org.apache.log4j.Logger
import com.atlassian.core.util.DateUtils
def componentManager = ComponentManager.getInstance()
def worklogManager = componentManager.getWorklogManager()
def userUtil = componentManager.getUserUtil()
def user = userUtil.getUser('admin')
def searchProvider = componentManager.getSearchProvider()
def queryParser = new DefaultJqlQueryParser()
Logger log = Logger.getLogger('worklogNotification')
Query jql = queryParser.parseQuery('project = ABC and updated > startOfDay(-1d)')
SearchResults results = searchProvider.search(jql, user, PagerFilter.getUnlimitedFilter())
List issues = results.getIssues()
String emailFormat = 'HTML'
def mailServerManager = componentManager.getMailServerManager()
def mailServer = mailServerManager.getDefaultSMTPMailServer()
String defaultSubject = 'Logged work on JIRA issue %ISSUE% exceeds original estimate'
String body = ''
Map binding = [:]
String loggedWorkDiff = ''
String template = '''
Dear ${issue.assignee.displayName}, <br /><br />
Logged work on issue ${issue.key} (${issue.summary}) exceeds original estimate ($loggedWorkDiff more than expected).<br />
*** This is an automatically generated email, you do not need to reply ***<br />
'''
GStringTemplateEngine engine = new GStringTemplateEngine()
ApplicationProperties applicationProperties = componentManager.getApplicationProperties()
binding.put("baseUrl", applicationProperties.getString(APKeys.JIRA_BASEURL))
if (mailServer && ! MailFactory.isSendingDisabled()) {
for (Issue issue in issues) {
if(issue.originalEstimate) {
loggedWork = 0
worklogs = worklogManager.getByIssue(issue)
worklogs.each{Worklog worklog -> loggedWork += worklog.getTimeSpent()}
if(loggedWork - issue.originalEstimate) {
loggedWorkDiff = DateUtils.getDurationString(Math.round(loggedWork - issue.originalEstimate))
email = new Email(issue.getAssigneeUser().getEmailAddress())
email.setFrom(mailServer.getDefaultFrom())
email.setSubject(defaultSubject.replace('%ISSUE%', issue.getKey()))
email.setMimeType(emailFormat == "HTML" ? "text/html" : "text/plain")
binding.put("issue", issue)
binding.put('loggedWorkDiff', loggedWorkDiff)
body = engine.createTemplate(template).make(binding).toString()
email.setBody(body)
try {
log.debug ("Sending mail to ${email.getTo()}")
log.debug ("with body ${email.getBody()}")
log.debug ("from template ${template}")
SingleMailQueueItem item = new SingleMailQueueItem(email);
ComponentAccessor.getMailQueue().addItem(item);
}
catch (MailException e) {
log.warn ("Error sending email", e)
}
}
}
}
}
This script takes issues for project ABC, which have been updated during the previous day (JQL: "project = ABC and updated > startOfDay(-1d)"), calculates difference between logged work and estimated work and sends notification to the issue assignee if logged work exceeds original estimate.
You can add this script to the list of JIRA services (JIRA -> Administration -> System -> Advanced -> Services).
Name: [any name]
Class: com.onresolve.jira.groovy.GroovyService
Delay: 1440
Input file: [path to your script on server]
Delay: 1440
Note that you will put 1440 (min) as a service delay, which equals to one day. So, the script will be executed once per day sending notification to issue assignees about exceeded original estimates.
Also note that Groovy Scripting plugin should be installed in order to be able to run your script.

Azure Storage Queue very slow from a worker role in the cloud, but not from my machine

I'm doing a very simple test with queues pointing to the real Azure Storage and, I don't know why, executing the test from my computer is quite faster than deploy the worker role into azure and execute it there. I'm not using Dev Storage when I test locally, my .cscfg is has the connection string to the real storage.
The storage account and the roles are in the same affinity group.
The test is a web role and a worker role. The page tells to the worker what test to do, the the worker do it and returns the time consumed. This specific test meassures how long takes get 1000 messages from an Azure Queue using batches of 32 messages. First, I test running debug with VS, after I deploy the app to Azure and run it from there.
The results are:
From my computer: 34805.6495 ms.
From Azure role: 7956828.2851 ms.
That could mean that is faster to access queues from outside Azure than inside, and that doesn't make sense.
I'm testing like this:
private TestResult InQueueScopeDo(String test, Guid id, Int64 itemCount)
{
CloudStorageAccount account = CloudStorageAccount.Parse(_connectionString);
CloudQueueClient client = account.CreateCloudQueueClient();
CloudQueue queue = client.GetQueueReference(Guid.NewGuid().ToString());
try
{
queue.Create();
PreTestExecute(itemCount, queue);
List<Int64> times = new List<Int64>();
Stopwatch sw = new Stopwatch();
for (Int64 i = 0; i < itemCount; i++)
{
sw.Start();
Boolean valid = ItemTest(i, itemCount, queue);
sw.Stop();
if (valid)
times.Add(sw.ElapsedTicks);
sw.Reset();
}
return new TestResult(id, test + " with " + itemCount.ToString() + " elements", TimeSpan.FromTicks(times.Min()).TotalMilliseconds,
TimeSpan.FromTicks(times.Max()).TotalMilliseconds,
TimeSpan.FromTicks((Int64)Math.Round(times.Average())).TotalMilliseconds);
}
finally
{
queue.Delete();
}
return null;
}
The PreTestExecute puts the 1000 items on the queue with 2048 bytes each.
And this is what happens in the ItemTest method for this test:
Boolean done = false;
public override bool ItemTest(long itemCurrent, long itemCount, CloudQueue queue)
{
if (done)
return false;
CloudQueueMessage[] messages = null;
while ((messages = queue.GetMessages((Int32)itemCount).ToArray()).Any())
{
foreach (var m in messages)
queue.DeleteMessage(m);
}
done = true;
return true;
}
I don't what I'm doing wrong, same code, same connection string and I got these resuts.
Any idea?
UPDATE:
The problem seems to be in the way I calculate it.
I have replaced the times.Add(sw.ElapsedTicks); for times.Add(sw.ElapsedMilliseconds); and this block:
return new TestResult(id, test + " with " + itemCount.ToString() + " elements",
TimeSpan.FromTicks(times.Min()).TotalMilliseconds,
TimeSpan.FromTicks(times.Max()).TotalMilliseconds,
TimeSpan.FromTicks((Int64)Math.Round(times.Average())).TotalMilliseconds);
for this one:
return new TestResult(id, test + " with " + itemCount.ToString() + " elements",
times.Min(),times.Max(),times.Average());
And now the results are similar, so apparently there is a difference in how the precision is handled or something. I will research this later on.
The problem apparently was a issue with different nature of the StopWatch and TimeSpan ticks, as discussed here.
Stopwatch.ElapsedTicks Property
Stopwatch ticks are different from DateTime.Ticks. Each tick in the DateTime.Ticks value represents one 100-nanosecond interval. Each tick in the ElapsedTicks value represents the time interval equal to 1 second divided by the Frequency.
How is your CPU utilization? Is this possible that your code is spiking the CPU and your workstation is much faster than your Azure node?

Resources