I have a table keyed by time, e.g.
time | valA | valB
---- | ---- | ----
09:00| 1.4 | 1.2
09:05| 1.5 | 1.4
09:10| 1.5 | 1.4
I want to store this in a data structure and query values as of arbitrary times. E.g.
asof 09:01, valA = 1.4
asof 09:06, valB = 1.4
asof 09:14, valA = 1.5
What is the best way of structuring this in c++11? Which std::chrono datatype should I use to represent my times. How can I develop a solution that supports time zones? E.g. the times listed in my table may be in US/Central time and I may want to query using Australia/Sydney based times.
To support local times in different time zones with <chrono> I recommend Howard Hinnant's free, open-source time zone library. This library is built on top of <chrono>, and uses the IANA time zone database to manage time zones.
Also, to handle time zones, you will need to store more than just time-of-day. You will need to store the entire date, as a time zone's UTC offset often varies with date. I.e. 09:05 Australia/Sydney doesn't really nail down a moment in time. But 2017-08-16 09:05 Australia/Sydney does.
Here is how you could create such a time stamp with <chrono> and the time zone library:
using namespace date;
using namespace std::chrono;
auto zt = make_zoned("Australia/Sydney", local_days{2017_y/aug/16} + 9h + 5min);
You can print it out like this:
std::cout << zt << '\n';
And the output is:
2017-08-16 09:05:00 AEST
If you want to find out the local time in US/Central that corresponds to this same moment in time:
auto zt2 = make_zoned("US/Central", zt);
std::cout << zt2 << '\n';
And the output is:
2017-08-15 18:05:00 CDT
date::zoned_time<std::chrono::seconds> is the type of zt and zt2 in these examples, and that is what I recommend you store. Under the hood this type is a pairing of {date::time_zone const*, std::chrono::time_point<system_clock, seconds>} (two words of storage).
Source code: https://github.com/HowardHinnant/date
Documentation: http://howardhinnant.github.io/date/tz.html
Video: https://www.youtube.com/watch?v=Vwd3pduVGKY
Related
I've got sports data that I've imported from an online source via a .xlsx file. Each observation is a penalty in an NFL (American football) game. In order to later merge this with another dataset, I need to have certain variables/values that match up between the two files. I'm hitting an issue with one variable, however.
In the main dataset in question (the penalty dataset originally mentioned), my ultimate goal is to create two variables, Minute and Second, that are of type byte and format %8.0g. This would make them perfectly correspond with the respective variables in the destination dataset. I have the required information available, which is the time remaining in the given quarter of the NFL game, but it's stored in a strange way, and I'm having trouble converting things.
The data is stored in a variable called Time. Visibly, the data looks fine as imported from the original .xlsx file. For example, the first observation reads "12:21", indicating that there are 12 minutes and 21 seconds left in the quarter. When importing from the .xlsx sheet, however, Stata assumes that the variable Time is a date/time variable measured in hh:mm, and thus assigns it a type of double and a format of %tchh:MM.
In the end, I don't really care about correctly formatting this Time variable, but I need to somehow make this match the required Minute and Second columns of the destination file. I've tried several different approaches, but so far nothing seems to work.
If Stata is misreading minutes and seconds as hours and minutes, and also (as it does) storing date-times in milliseconds, then it is off by a factor of 60 (minutes/hour) x 1000 (ms/s) = 60000. So, consider
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen double wrong = clock("1jan1960 12:21:00", "DMY hms")
. format wrong %tchh:MM
. clonevar alsowrong = wrong
. format alsowrong %15.0f
. list
+------------------+
| wrong alsowr~g |
|------------------|
1. | 12:21 44460000 |
+------------------+
. gen right = wrong/60000
. gen byte Minute = floor(right/60)
. gen byte Second = mod(right, 60)
. list
+--------------------------------------------+
| wrong alsowr~g right Minute Second |
|--------------------------------------------|
1. | 12:21 44460000 741 12 21 |
+--------------------------------------------+
I can't comment easily on your import, as neither imported file nor exact import code are given as examples.
EDIT Another way to do it:
. gen alsoright = string(wrong, "%tchh:MM")
. gen minute = real(substr(alsoright, 1, strpos(alsoright, ":") - 1))
. gen second = real(substr(alsoright, strpos(alsoright, ":") + 1, .))
. l alsoright minute second
+----------------------------+
| alsori~t minute second |
|----------------------------|
1. | 12:21 12 21 |
+----------------------------+
I want to print the timestamp for each function in x86. I am able to print the function name during early boot but I want to print the timestamp also similar to ftrace:
TASK-PID CPU# TIMESTAMP FUNCTION
| | | | |
bash-16939 [000] 6075.461561: mutex_unlock <-tracing_set_tracer
<ide>-0 [001] 6075.461562: _spin_unlock_irqrestore <-hrtimer_get_next_event
Since ftrace can be used after boot they use some clock source to get timestamp.
But I want to print timestamp during early boot and I have heard we can use early clock registers concept in x86 to achieve this. How can I use this to print timestamp? I have found that there are clocksources like tsc,hpet and acpi_pm. Can I use these during early boot. If so how?
I am wondering what they formats are? Any advice would be much appreciated. This is used in the IBM application called Tealeaf
4682158116698062848 = 12:00:00 AM
4682162239866667008 = 12:01:00 AM
4682166363035271168 = 12:02:00 AM
4682405506814312448 = 01:00:00 AM
If I have to use an application to convert it, then the choice would be PHP
This looks like a Microsoft OLE Automation timestamp. Here is Microsoft's page about it. It represents the number of 24 hour periods since 1 Jan. 1900.
Looks like 64+ bit stamps. The most significant 28+bits are the seconds about 788 days after some epoch (Jan 1, 1970??) which would make it Feb 28, 1972 - or possible some other encoding based on seconds. The least significant 36-bits are all 0. I would expect the values could reach pow(2,72) or 22 decimal digits.
I'm in a situation that involves the manual reconstruction of raw data, to include MFT records and other Windows artifacts. I understand that timestamps in MFT records are 64-bit integers, big endian, and are calculated by the number of 100 nanosecond intervals since 01/01/1601 00:00:00 UTC. I am also familiar with Windows email header timestamps, which consist of two 32-bit values that combine to form a single 64-bit value, also calculated by the number of 100 nanosecond intervals since 01/01/1601 00:00:00 UTC.
But there are other Windows timestamps with different epochs, such as SQL Server timestamps, which I believe use a date in the 1800's. I cannot find much documentation on all of this. What timestamps are used in Windows other than the two listed above? How do you break them down?
Here is some code for decoding Windows timestamps:
static string microsoftDateToISODate(const uint64_t &time) {
/**
* See comment above for more information on
* SECONDS_BETWEEN_WIN32_EPOCH_AND_UNIX_EPOCH
*
* Convert UNIX time_t to ISO8601 format
*/
time_t tmp = (time / ONE_HUNDRED_NANO_SEC_TO_SECONDS)
- SECONDS_BETWEEN_WIN32_EPOCH_AND_UNIX_EPOCH;
struct tm time_tm;
gmtime_r(&tmp, &time_tm);
char buf[256];
strftime(buf, sizeof(buf), "%Y-%m-%dT%H:%M:%S", &time_tm);
return string(buf);
}
And your reference for Mac timestamps is:
Apple Mac and Unix timestamps
http://developer.apple.com/library/mac/#qa/qa1398/_index.html
Overview
I'm attempting to improve the performance of our database queries for SQLAlchemy. We're using psycopg2. In our production system, we're chosing to go with Java because it is simply faster by at least 50%, if not closer to 100%. So I am hoping someone in the Stack Overflow community has a way to improve my performance.
I think my next step is going to be to end up patching the psycopg2 library to behave like the JDBC driver. If that's the case and someone has already done this, that would be fine, but I am hoping I've still got a settings or refactoring tweak I can do from Python.
Details
I have a simple "SELECT * FROM someLargeDataSetTable" query running. The dataset is GBs in size. A quick performance chart is as follows:
Timing Table
Records | JDBC | SQLAlchemy[1] | SQLAlchemy[2] | Psql
--------------------------------------------------------------------
1 (4kB) | 200ms | 300ms | 250ms | 10ms
10 (8kB) | 200ms | 300ms | 250ms | 10ms
100 (88kB) | 200ms | 300ms | 250ms | 10ms
1,000 (600kB) | 300ms | 300ms | 370ms | 100ms
10,000 (6MB) | 800ms | 830ms | 730ms | 850ms
100,000 (50MB) | 4s | 5s | 4.6s | 8s
1,000,000 (510MB) | 30s | 50s | 50s | 1m32s
10,000,000 (5.1GB) | 4m44s | 7m55s | 6m39s | n/a
--------------------------------------------------------------------
5,000,000 (2.6GB) | 2m30s | 4m45s | 3m52s | 14m22s
--------------------------------------------------------------------
[1] - With the processrow function
[2] - Without the processrow function (direct dump)
I could add more (our data can be as much as terabytes), but I think changing slope is evident from the data. JDBC just performs significantly better as the dataset size increases. Some notes...
Timing Table Notes:
The datasizes are approximate, but they should give you an idea of the amount of data.
I'm using the 'time' tool from a Linux bash commandline.
The times are the wall clock times (i.e. real).
I'm using Python 2.6.6 and I'm running with python -u
Fetch Size is 10,000
I'm not really worried about the Psql timing, it's there just as a reference point. I may not have properly set fetchsize for it.
I'm also really not worried about the timing below the fetch size as less than 5 seconds is negligible to my application.
Java and Psql appear to take about 1GB of memory resources; Python is more like 100MB (yay!!).
I'm using the [cdecimals] library.
I noticed a [recent article] discussing something similar to this. It appears that the JDBC driver design is totally different to the psycopg2 design (which I think is rather annoying given the performance difference).
My use-case is basically that I have to run a daily process (with approximately 20,000 different steps... multiple queries) over very large datasets and I have a very specific window of time where I may finish this process. The Java we use is not simply JDBC, it's a "smart" wrapper on top of the JDBC engine... we don't want to use Java and we'd like to stop using the "smart" part of it.
I'm using one of our production system's boxes (database and backend process) to run the query. So this is our best-case timing. We have QA and Dev boxes that run much slower and the extra query time can become significant.
testSqlAlchemy.py
#!/usr/bin/env python
# testSqlAlchemy.py
import sys
try:
import cdecimal
sys.modules["decimal"]=cdecimal
except ImportError,e:
print >> sys.stderr, "Error: cdecimal didn't load properly."
raise SystemExit
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
def processrow (row,delimiter="|",null="\N"):
newrow = []
for x in row:
if x is None:
x = null
newrow.append(str(x))
return delimiter.join(newrow)
fetchsize = 10000
connectionString = "postgresql+psycopg2://usr:pass#server:port/db"
eng = create_engine(connectionString, server_side_cursors=True)
session = sessionmaker(bind=eng)()
with open("test.sql","r") as queryFD:
with open("/dev/null","w") as nullDev:
query = session.execute(queryFD.read())
cur = query.cursor
while cur.statusmessage not in ['FETCH 0','CLOSE CURSOR']:
for row in query.fetchmany(fetchsize):
print >> nullDev, processrow(row)
After timing, I also ran a cProfile and this is the dump of worst offenders:
Timing Profile (with processrow)
Fri Mar 4 13:49:45 2011 sqlAlchemy.prof
415757706 function calls (415756424 primitive calls) in 563.923 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 563.924 563.924 {execfile}
1 25.151 25.151 563.924 563.924 testSqlAlchemy.py:2()
1001 0.050 0.000 329.285 0.329 base.py:2679(fetchmany)
1001 5.503 0.005 314.665 0.314 base.py:2804(_fetchmany_impl)
10000003 4.328 0.000 307.843 0.000 base.py:2795(_fetchone_impl)
10011 0.309 0.000 302.743 0.030 base.py:2790(__buffer_rows)
10011 233.620 0.023 302.425 0.030 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
10000000 145.459 0.000 209.147 0.000 testSqlAlchemy.py:13(processrow)
Timing Profile (without processrow)
Fri Mar 4 14:03:06 2011 sqlAlchemy.prof
305460312 function calls (305459030 primitive calls) in 536.368 CPU seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 536.370 536.370 {execfile}
1 29.503 29.503 536.369 536.369 testSqlAlchemy.py:2()
1001 0.066 0.000 333.806 0.333 base.py:2679(fetchmany)
1001 5.444 0.005 318.462 0.318 base.py:2804(_fetchmany_impl)
10000003 4.389 0.000 311.647 0.000 base.py:2795(_fetchone_impl)
10011 0.339 0.000 306.452 0.031 base.py:2790(__buffer_rows)
10011 235.664 0.024 306.102 0.031 {method 'fetchmany' of 'psycopg2._psycopg.cursor' objects}
10000000 32.904 0.000 172.802 0.000 base.py:2246(__repr__)
Final Comments
Unfortunately, the processrow function needs to stay unless there is a way within SQLAlchemy to specify null = 'userDefinedValueOrString' and delimiter = 'userDefinedValueOrString' of the output. The Java we are using currently already does this, so the comparison (with processrow) needed to be apples to apples. If there is a way to improve the performance of either processrow or SQLAlchemy with pure Python or a settings tweak, I'm very interested.
This is not an answer out of the box, with all client/db stuff you may need to do some work to determine exactly what is amiss
backup postgresql.conf changing
log_min_duration_statement to 0
log_destination = 'csvlog' # Valid values are combinations of
logging_collector = on # Enable capturing of stderr and csvlog
log_directory = 'pg_log' # directory where log files are written,
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log' # log file name pattern,
debug_print_parse = on
debug_print_rewritten = on
debug_print_plan output = on
log_min_messages = info (debug1 for all server versions prior to 8.4)
Stop and restart your database server ( reload may not pick up the changes )
Reproduce your tests ensuring that the server time and client times match and that you record the start times etc.
copy the log file off an import into editor of your choice (excel or another spreadsheet can be useful for getting advance manipulation for sql & plans etc)
now examine the timings from the server side and note:
is the sql reported on the server the same in each case
if the same you should have the same timings
is the client generating a cursor rather than passing sql
is one driver doing a lot of casting/converting between character sets or implicit converting of other types such as dates or timestamps.
and so on
The plan data will be included for completeness, this may inform if there are gross differences in the SQL submitted by the clients.
The stuff below is probably aiming above and beyond what you have in mind or what is deemed acceptable in your environment, but I'll put the option on the table just in case.
Is the destination of every SELECT in your test.sql truly a simple |-separated results file?
Is non-portability (Postgres-specificity) acceptable?
Is your backend Postgres 8.2 or newer?
Will the script run on the same host as the database backend, or would it be acceptable to generate the |-separated results file(s) from within the backend (e.g. to a share?)
If the answer to all of the above questions is yes, then you can transform your SELECT ... statements to COPY ( SELECT ... ) TO E'path-to-results-file' WITH DELIMITER '|' NULL E'\\N'.
An alternative could be to use ODBC. This is assuming that Python ODBC driver performs well.
PostgreSQL has ODBC drivers for both Windows and Linux.
As someone who programmed mostly in assembler, there is one thing that sticks out as obvious. You are losing time in the overhead, and the overhead is what needs to go.
Rather than using python, which wraps itself in something else that integrates with something that is a C wrapper around the DB.... just write the code in C. I mean, how long can it take? Postgres is not hard to interface with (quite the opposite). C is an easy langauge. The operations you are performing seem pretty straightforward. You can also use SQL embedded in C, it's just a matter of a pre-compile. No need to translate what you were thinking - just write it there along with the C and use the supplied ECPG compiler (read postgres manual chapter 29 iirc).
Take out as much of the in-betweeny interfacey stuff as you can, cut out the middle man and get talking to the database natively. It seems to me that in trying to make the system simpler you are actually making it more complicated than it needs to be. When things are getting really messy, I usually ask myself the question "What bit of the code am I most afraid to touch?" - that usually points me to what needs changing.
Sorry for babbling on, but maybe a step backward and some fresh air will help ;)