How can I speed up String-to-DateTime conversion in Julia? It takes a long time and allocates a lot of memory in the process.
When you convert a column or vector of strings, define the date format in a separate variable and pass that variable as the second argument to Dates.DateTime.
Assuming your strings are in a DataFrame column df.Date, then
replace:
df.DateTime = Dates.DateTime.(df.Date, "yyyy-mm-dd HH:MM:SS")
with:
myFormat = Dates.DateFormat("yyyy-mm-dd HH:MM:SS")
df.DateTime = Dates.DateTime.(df.Date, myFormat)
This speeds up the conversion noticeably (in my case by a factor of 20 for a 30k-element vector).
Thanks to user BioTurboNick on discourse.julialang for figuring this out. The reason is in the documentation: in the former case, Julia creates a new DateFormat object for each individual conversion, drastically increasing memory allocation.
DateTime(dt::AbstractString, format::AbstractString; locale="english") -> DateTime
Construct a DateTime by parsing the dt date time string following the pattern given in the format string (see DateFormat for syntax).
This method creates a DateFormat object each time it is called. If you are parsing many date time strings of the same format, consider creating a DateFormat object once and using that as the second argument instead.
Related
I have two columns in Excel and want to subtract the times to get the difference in minutes:
21/12/2022, 8:17 pm
21/12/2022, 8:00 pm
How can I convert the text to a number format so that I can subtract those easily?
You can use the VALUE function. The VALUE function converts a text string that represents a number to a number value.
For example, to convert the text "8:17" to a number value, you can use the following formula:
=VALUE("8:17")
This will return a number value equivalent to the time 8:17.
To subtract the time difference between two cells in Excel, you can use the =B2-A2 formula, where A2 and B2 are the cells containing the two times you want to subtract.
You can also use the TEXT function to convert the result of the subtraction back to a text string in a desired format, if needed. For example, to display the result as a time, you can use the following formula:
=TEXT(B2-A2, "h:mm")
This will return the time difference in the format "h:mm".
I am trying to find the duration between two times with the below code:
DateTimeFormatter formatter = DateTimeFormat.forPattern("HH:mm");
System.out.println(airTime1);
System.out.println(startTime1);
Minutes difference = Minutes.minutesBetween(startTime1, airTime1);
String differenceS = String.valueOf(difference);
System.out.println(differenceS);
LocalTime remaining1 = formatter.parseLocalTime(differenceS);
System.out.println(remaining1);
airTime1 & startTime1 are both LocalTime variables. difference should contain the duration between the two times, and differenceS is a String representation of difference, since a Minutes object cannot be used as a String directly.
When I enter times into the variables such as 12:00 & 13:00, the variables are recorded as 12:00:00.000 & 13:00:00.000, but differenceS receives the value PT-60M, which then throws an error when parsed. Does anyone know why the minutes-difference line calculates this value?
Thanks in advance!
The Minutes class of Joda-Time overrides the toString() method so that it returns a String in ISO 8601 duration format, as mentioned in the JavaDoc. That is exactly what your PT-60M represents: a duration of -60 minutes.
If you just want the raw minutes printed your code could look like this:
DateTimeFormatter formatter = DateTimeFormat.forPattern("HH:mm");
System.out.println(airTime1);
System.out.println(startTime1);
Minutes difference = Minutes.minutesBetween(startTime1, airTime1);
// getMinutes() returns the raw minute count; Math.abs() drops the sign
System.out.println(Math.abs(difference.getMinutes()));
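For reference, a self-contained sketch of the whole flow (Joda-Time; the example times are assumptions standing in for your variables):

import org.joda.time.LocalTime;
import org.joda.time.Minutes;

public class MinutesDemo {
    public static void main(String[] args) {
        LocalTime startTime1 = new LocalTime(13, 0); // example value, 13:00
        LocalTime airTime1 = new LocalTime(12, 0);   // example value, 12:00

        Minutes difference = Minutes.minutesBetween(startTime1, airTime1);
        System.out.println(difference);                         // PT-60M: ISO 8601 via toString()
        System.out.println(Math.abs(difference.getMinutes()));  // 60
    }
}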
Since version 0.12 Hive supports the VARCHAR data type.
Will VARCHAR provide better performance than STRING in a typical analytical Hive query?
In Big SQL, Hive's STRING type is by default mapped to the SQL data type VARCHAR(32762). This means:
if a value exceeds 32762 characters, it is truncated
if the data never needs the maximum VARCHAR length for storage (for example, if the column never exceeds 100 characters), unnecessary resources are allocated for handling that column
This default mapping of STRING to VARCHAR(32762) can therefore lead to performance issues.
This explanation is based on IBM Big SQL, which uses Hive implicitly.
IBM BIGINSIGHTS doc reference
The varchar datatype is also stored internally as a String. The only difference I see is that STRING is unbounded, while VARCHAR is bounded with a maximum length of 65,535. I don't think we will see any performance gain, because the internal implementation in both cases is String. I don't know much about Hive internals, but I can see the additional processing Hive does to truncate VARCHAR values. Below is the code (org.apache.hadoop.hive.common.type.HiveVarchar):
public static String enforceMaxLength(String val, int maxLength) {
    String value = val;
    if (maxLength > 0) {
        int valLength = val.codePointCount(0, val.length());
        if (valLength > maxLength) {
            // Truncate the excess chars to fit the character length.
            // Also make sure we take supplementary chars into account.
            value = val.substring(0, val.offsetByCodePoints(0, maxLength));
        }
    }
    return value;
}
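To see that truncation in action, a quick sketch calling the method quoted above (assuming the Hive classes are on your classpath):

import org.apache.hadoop.hive.common.type.HiveVarchar;

public class VarcharTruncationDemo {
    public static void main(String[] args) {
        String tenChars = "abcdefghij"; // 10 code points
        // Values longer than maxLength are cut down to maxLength code points:
        System.out.println(HiveVarchar.enforceMaxLength(tenChars, 5));  // prints "abcde"
        // Shorter values pass through unchanged:
        System.out.println(HiveVarchar.enforceMaxLength(tenChars, 20)); // prints "abcdefghij"
    }
}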
If anyone has done performance analysis/benchmarking, please share.
I have the following problem:
I need to write a begin date and an end date into a matrix, where the columns contain the yearly quarters (1-4) and the rows are the years.
E.g.
Matrix:
      Q1  Q2  Q3  Q4
2010
2011
Now the date 01.01.2010 should be put in the first element and the date 09.20.2011 in the sixth element.
Thanks in advance.
You first have to consider that SAS does not actually have date/time/datetime variables; it just uses numeric variables formatted as dates, times, or datetimes. The actual value is:
days since 1/1/1960 for dates
seconds since 00:00 for times
seconds since 1/1/1960 00:00 for datetimes
SAS does not even distinguish between integer and float numeric types. So a date value can contain a fractional part.
What you do or can do with a SAS numeric variable is completely up to you, and mostly depends on the format you apply. You could mistakenly format a variable containing a date value with a datetime format... or even with a currency format... SAS won't notice or complain.
You also have to consider that SAS does not actually have matrices or arrays either; it just provides a way to simulate them for reading from and writing to dataset variables.
That said, SAS does provide a whole lot of formats and informats that allow you to implement date and time manipulation.
Assuming you are coding within a data step, and assuming the "dates" are in numeric dataset variables, the PUT function, combined with INPUT to convert its character result back to a number, can extract the date parts you need to calculate the row and column of the matrix element to write to, like so:
DATA table;
    ARRAY dm{2,4} dm_r1c1-dm_r1c4 dm_r2c1-dm_r2c4;
    /* PUT returns a character value, so convert back to numeric with INPUT */
    beg_row = INPUT(PUT(beg_date, YEAR4.), 4.) - 2009;
    end_row = INPUT(PUT(end_date, YEAR4.), 4.) - 2009;
    beg_col = INPUT(PUT(beg_date, QTR1.), 1.);
    end_col = INPUT(PUT(end_date, QTR1.), 1.);
    dm{beg_row, beg_col} = beg_date;
    dm{end_row, end_col} = end_date;
RUN;
... or if you are using a one-dimensional array:
DATA table;
    ARRAY da{8} da_1-da_8;
    /* Linear index: 4 quarters per year, years counted from 2010 */
    beg_index = 4 * (INPUT(PUT(beg_date, YEAR4.), 4.) - 2010) + INPUT(PUT(beg_date, QTR1.), 1.);
    end_index = 4 * (INPUT(PUT(end_date, YEAR4.), 4.) - 2010) + INPUT(PUT(end_date, QTR1.), 1.);
    da{beg_index} = beg_date;
    da{end_index} = end_date;
RUN;
For some reason, Hive is not recognizing columns emitted as integers, but does recognize columns emitted as strings.
Is there something about Hive, RCFile, or GZ compression that prevents ints from being rendered properly?
My Hive DDL looks like:
create external table if not exists db.table (intField int, strField string) stored as rcfile location '/path/to/my/data';
And the relevant portion of my Java looks like:
IntWritable intWritable = new IntWritable();
Text textWritable = new Text();
BytesRefArrayWritable dataWrite = new BytesRefArrayWritable(2);
byte[] byteArray;

BytesRefWritable bytesRefWritable = new BytesRefWritable();
intWritable.set(myObj.getIntField());
byteArray = WritableUtils.toByteArray(intWritable);
bytesRefWritable.set(byteArray, 0, byteArray.length);
dataWrite.set(0, bytesRefWritable); // sets int field as column 0

bytesRefWritable = new BytesRefWritable();
textWritable.set(myObj.getStrField());
bytesRefWritable.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(1, bytesRefWritable); // sets str field as column 1
The code runs fine, and through logging I can see the various Writables have bytes within them.
Hive can read the external table as well, but the int field shows up as NULL, indicating some error:
SELECT * from db.table;
OK
NULL my string field
Time taken: 0.647 seconds
Any idea what might be going on here?
So, I'm not sure exactly why this is the case, but I got it working using the following method:
In the code that writes the byte array representing the integer value, instead of using WritableUtils.toByteArray(), I call set(Integer.toString(intVal)) on a Text writable and then use its getBytes().
In other words, I convert the integer to its String representation and use the Text writable object to get the byte array, as if it were a string.
Then, in my Hive DDL, I can call the column an int and it interprets it correctly.
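Concretely, a minimal sketch of that working write path (myObj, textWritable, and dataWrite carry over from the question's snippet; the surrounding RCFile writer setup is omitted):

// Serialize the int as its decimal string and hand Hive the text bytes.
textWritable.set(Integer.toString(myObj.getIntField()));
BytesRefWritable intColumn = new BytesRefWritable();
intColumn.set(textWritable.getBytes(), 0, textWritable.getLength());
dataWrite.set(0, intColumn); // int column 0, now written as text bytes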
I'm not sure what initially caused the problem, be it a bug in WritableUtils, some incompatibility with compressed integer byte arrays, or a faulty understanding on my part of how this works. In any event, the solution described above successfully meets the task's needs.