PySpark to_date() function gives different answers on Windows and WSL Ubuntu

I have a function that converts an int to a date, which is then fed into datediff to find how many days have passed since an event. One of our tests passes on PySpark on Windows and in our Azure DevOps pipeline, but fails when run on PySpark in WSL Ubuntu. We've narrowed it down to the to_date() function producing different results on the two platforms, but we don't understand why.
import datetime

import pyspark.sql.functions as F


def from_int_to_date(int_date: int) -> datetime.datetime:
    """
    Convert an integer in YYYYMMDD format into a datetime object
    """
    return datetime.datetime.strptime(str(int_date), "%Y%m%d")
If I calculate F.to_date(F.lit(from_int_to_date(20190401))) I get Column<b"to_date(TIMESTAMP '2019-04-01 00:00:00')"> on Windows and Column<b"to_date(TIMESTAMP('2019-03-31 23:00:00.0'))"> on the version running under WSL.
I am based in the UK, where the clocks went forward for summer time on 31 March 2019, so I can see why the value goes back an hour: the problem doesn't occur with an input int of 20190331. I'm just trying to understand why the behaviour of to_date() differs between the two systems and what we should do to mitigate this (and any other differences), as ideally our code would be platform agnostic.

Set the session time zone explicitly with the spark.sql.session.timeZone configuration so the result doesn't depend on the system's time zone setting.
spark.conf.set("spark.sql.session.timeZone", "Europe/London")
This option can also be set when the Spark session is created.
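As a rough illustration of why only dates from 1 April onward shift, the sketch below uses Python's standard zoneinfo module (Python 3.9+) rather than Spark. It assumes the naive datetime is read as local midnight in Europe/London and then rendered in UTC, which is one plausible reading of what happens when two components disagree about the time zone. midnight_in_utc is a hypothetical helper, not part of PySpark.

```python
import datetime
from zoneinfo import ZoneInfo  # standard library since Python 3.9

LONDON = ZoneInfo("Europe/London")
UTC = datetime.timezone.utc


def midnight_in_utc(int_date: int, zone: ZoneInfo = LONDON) -> datetime.datetime:
    """Read YYYYMMDD as local midnight in `zone`, then express that instant in UTC."""
    naive = datetime.datetime.strptime(str(int_date), "%Y%m%d")
    return naive.replace(tzinfo=zone).astimezone(UTC)


# Midnight on 31 March is still GMT, so the UTC rendering is unchanged.
print(midnight_in_utc(20190331))  # 2019-03-31 00:00:00+00:00
# Midnight on 1 April is BST (UTC+1), so the UTC rendering falls on the previous day.
print(midnight_in_utc(20190401))  # 2019-03-31 23:00:00+00:00
```

Truncating the second value to a date gives 31 March rather than 1 April, which would be enough to throw a datediff off by one.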

Related

Get system timezone name in tz database format [duplicate]

I want to get a system's timezone information in tz database format, e.g. "America/New_York". Also I want it to be platform independent, e.g. code should work on Windows, Linux and MacOS.
Tried two recipes:
viaLocation := time.Now().Location().String() // Gives "Local"
viaZone, _ := time.Now().Zone() // Gives "EST"
"EST" is somewhat better; is there any way to map it to "America/New_York"?
I don't mind migrating to Go 1.15 and importing time/tzdata.
You can't do this reliably.
On Linux the local time is typically configured via /etc/localtime. The file format doesn't include the IANA name.
But even if it did, that's not the only way to configure the time zone. An obvious alternative is the TZ environment variable. I can set TZ to, say, UTC+4, so my local time zone doesn't have a name at all. This is a trivial example, but the TZ value can be much more complicated too.
The time/tzdata package is only used if the system doesn't provide time zone definitions, so importing that package doesn't help either.
Marc's answer to a similar question shows how you can take a guess on Linux (and possibly MacOS), but it's nothing more than that.
So you see, it can't be done reliably, on Linux at least. I assume MacOS works similarly. I don't know how local time works on Windows, but I'm sure it's possible to configure a fixed UTC offset there too, i.e. a nameless time zone.
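A best-effort guess along the lines described above might look like the following Python sketch (Python is used here only for illustration; the question itself concerns Go). It checks TZ first and then looks at where /etc/localtime points. The helper names and the "zoneinfo/" path marker are assumptions, and the function returns None whenever no name can be recovered, for example when /etc/localtime is a regular file or TZ holds a bare offset such as UTC+4.

```python
import os
from typing import Optional

# Assumption: zone files live under a directory named "zoneinfo".
ZONEINFO_MARKER = "zoneinfo/"


def zone_from_symlink_target(target: str) -> Optional[str]:
    """Extract 'America/New_York' from '/usr/share/zoneinfo/America/New_York'."""
    _, found, name = target.partition(ZONEINFO_MARKER)
    return name if found and name else None


def guess_iana_zone() -> Optional[str]:
    """Best-effort guess at the IANA zone name; None when it cannot be recovered."""
    tz = os.environ.get("TZ", "").lstrip(":")
    if tz:
        # TZ overrides /etc/localtime. Only a relative "Area/Location" value
        # is treated as a name; offsets, bare names and file paths are not.
        return tz if "/" in tz and not tz.startswith("/") else None
    # Resolve the symlink, if any, and look for the zoneinfo directory in it.
    return zone_from_symlink_target(os.path.realpath("/etc/localtime"))


if __name__ == "__main__":
    print(guess_iana_zone())
```

This remains a guess: a None result is expected on many systems, and callers need a fallback.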

Doing Math With Time Bash [duplicate]

I have been developing a script on my Linux box for quite some time, and wanted to run it on my Mac as well.
I thought that the commands on the Mac were the same as the commands on Linux, but today I realized I was wrong. I knew that fewer commands existed on the Mac, but I thought that the ones that did exist had the same implementation.
This problem is specifically about the date command.
When I run the command on my Linux machine with the format specifier for nanoseconds, I get the correct result, but when I run it on my Mac, that option is not available.
Linux-Machine> date +%N
555555555 # Nanoseconds part of the current time
Mac-Machine> date +%N
N
How do I go about getting the current time in nanoseconds as a bash command on the Mac?
Worst case is I create a small piece of code that calls a system function in C or something and then call it within my script.
Any help is much appreciated!
This is because OS X and Linux use two different sets of tools. Linux uses the GNU version of the date command (hence, GNU/Linux), whereas OS X ships a BSD-derived date. The two behave differently.
You can install the GNU date command which is included in the "coreutils" package from MacPorts. It will be installed on your system as gdate. You can either use that, or link the date binary with the new gdate binary; your choice.
man date indicates that it doesn't go beyond one second. I would recommend trying another language (Python 2):
$ python -c 'import time; print repr(time.time())'
1332334298.898616
For Python 3, use:
$ python -c 'import time; print(repr(time.time()))'
There are "Linux specifications" but they do not regulate the behavior of the date command much. What you have is really the opposite -- Linux (or more specifically the GNU user-space tools) has a large number of extensions which are not compatible with Unix by any reasonable definition.
There is a large number of standards which do regulate these things. The one you should be looking at is POSIX which requires
date [-u] [+format]
and nothing more to be supported by adhering implementations. (There are other standards like XPG and SUS which you might want to look at as well, but at the very least, you should require and expect POSIX these days ... finally.)
The POSIX document contains a number of examples but there is nothing for date conversion which is however a practical problem which many scripts turn to date for. Also, for your concrete problem, there is nothing for reporting times with sub-second accuracy in POSIX.
Anyway, griping that *BSD isn't Linux isn't really helpful here; you just have to understand what the differences are, and code defensively. If your requirements are complex or unusual, perhaps turn to a scripting language like Perl or Python, which performs these types of date formatting operations more or less out of the box in a standard installation (though neither Perl nor Python has a quick and elegant way to do date conversion out of the box, either; solutions tend to be somewhat tortured).
In practical terms, you can compare the MacOS date man page and the Linux one and try to reconcile your requirements.
For your practical requirement, MacOS date does not support any format string with nanosecond accuracy, but nor are you likely to receive useful results on that scale when the execution of the command will take a significant number of nanoseconds. I would settle for millisecond-level accuracy (and even that is going to be thrown off by the execution time in the final digits) and multiply to get the number in nanosecond scale.
nanoseconds () {
    python -c 'import time; print(int(time.time()*1000*1000*1000))'
}
(Notice the parentheses around the argument to print() for Python 3.) You will notice that Python prints a value with nanosecond-scale digits (the last digits are often not zeros), though those digits come from float arithmetic rather than from the clock, and by the time time.time() has returned, the value is obviously no longer current.
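On Python 3.7 and later, time.time_ns() returns the epoch time as an integer number of nanoseconds, which sidesteps the rounding that comes from multiplying a float. A minimal sketch comparing the two approaches follows; the function names are illustrative, and math.ulp needs Python 3.9 or later.

```python
import math
import time


def nanoseconds() -> int:
    """Epoch time as an integer count of nanoseconds (Python 3.7+)."""
    return time.time_ns()


def nanoseconds_via_float() -> int:
    """The float-based variant used in the shell function above."""
    return int(time.time() * 1000 * 1000 * 1000)


if __name__ == "__main__":
    print(nanoseconds())
    print(nanoseconds_via_float())
    # Spacing between adjacent floats near the current epoch time, in
    # nanoseconds. time.time() cannot resolve an instant more finely than this.
    print(math.ulp(time.time()) * 1e9)
```

The float spacing printed last is a few hundred nanoseconds for present-day epoch values, so the trailing digits of the float-based result are rounding noise. The clock's real resolution still depends on the platform.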
To get an idea of the error rate,
bash#macos-high-sierra$ python3
Python 3.5.1 (default, Dec 26 2015, 18:08:53)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import time
>>> import timeit
>>> def nanoseconds ():
...     return int(time.time()*1000*1000*1000)
...
>>> timeit.timeit(nanoseconds, number=10000)
0.0066173350023746025
>>> timeit.timeit('int(time.time()*1000*1000*1000)', number=10000)
0.00557799199668807
Starting Python and printing the value will realistically add a few orders of magnitude more overhead, but I haven't attempted to quantify that. (The output from timeit is in seconds.)
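One way to put a rough number on that startup cost is to time the whole child process from the outside. The sketch below is a crude measurement under stated assumptions, not a benchmark: it launches the current interpreter via sys.executable instead of whatever python is on the shell's PATH, and startup_overhead_seconds is a hypothetical helper name.

```python
import subprocess
import sys
import time


def startup_overhead_seconds(runs: int = 5) -> float:
    """Fastest wall-clock time, in seconds, to start Python and print a timestamp."""
    cmd = [sys.executable, "-c", "import time; print(time.time_ns())"]
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, capture_output=True)
        timings.append(time.perf_counter() - start)
    # The minimum is the least disturbed by other activity on the machine.
    return min(timings)


if __name__ == "__main__":
    print(startup_overhead_seconds())
```

Comparing this figure with the per-call cost from timeit gives a sense of how much of the reported timestamp's error comes from process startup alone.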

Windows XP Batch File Date to file name

I am currently using the code below to build a date string on an XP Embedded machine, which runs a fairly basic version of XP. On a Windows 7 machine the code returns the correct format (10-02-2015), but on the XP machine it returns (Tue). How can I modify the code to return the correct format without changing the time format on the XP machine?
Set timestamp=%DATE:/=-%
The date format includes the day-of-week at the beginning in many environments - use:
set DT=%DATE:/=-%
set timestamp=%DT:~4%
to set timestamp the way it is in your Win7 environment; be aware, however, that this approach is not exactly portable.
EDIT
This will reorder the date and time to something that sorts properly ... and it does happen to also be the order used in Europe:
set DT=%DATE:/=-%
set timestamp=%DT:~10,4%-%DT:~4,5%
keeping in mind, this still isn't portable across systems.
EDIT
Whoop, you wanted UK, which isn't the same as other places - that would be:
set timestamp=%DT:~7,3%%DT:~4,3%%DT:~10,4%
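To see what each substring expression selects, the following Python sketch mimics the batch %VAR:~start,length% slicing. The sample value "Tue 02/10/2015" is an assumption about what %DATE% holds on the XP machine (day name followed by a US-ordered date); the actual layout depends on the regional settings, and batch_substr is a hypothetical helper.

```python
from typing import Optional


def batch_substr(value: str, start: int, length: Optional[int] = None) -> str:
    """Mimic %VAR:~start,length% for non-negative start and length."""
    end = None if length is None else start + length
    return value[start:end]


date_value = "Tue 02/10/2015"      # assumed %DATE% on the XP machine
dt = date_value.replace("/", "-")  # set DT=%DATE:/=-%

# %DT:~4% drops the day name and the space after it.
without_day_name = batch_substr(dt, 4)
# %DT:~10,4%-%DT:~4,5% puts the year first.
year_first = batch_substr(dt, 10, 4) + "-" + batch_substr(dt, 4, 5)
# %DT:~7,3%%DT:~4,3%%DT:~10,4% swaps the first two fields.
swapped = batch_substr(dt, 7, 3) + batch_substr(dt, 4, 3) + batch_substr(dt, 10, 4)

print(without_day_name)  # 02-10-2015
print(year_first)        # 2015-02-10
print(swapped)           # 10-02-2015
```

Because every offset is hard-coded, any change in the %DATE% layout shifts the fields, which is why none of these variants is portable across systems.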

Resources