About quantileTiming function of Clickhouse - clickhouse

As described in ClickHouse docs quantileTiming function accepts an expression returning float type number.
I got an error when i passed a float field to the function:
Query:
SELECT quantileTiming(0.5)(number / 2) FROM numbers(10)
Received exception from server:
StatusCode 500 Response Code: 43, e.displayText() = DB::Exception: Argument for function quantileTiming must be integer, but it has type Float64 (version 19.5.3.8)
Err <nil>`
Can someone tell me what's the problem? And what's the difference between quantile and quantileTiming. What alg does quantileTiming use? Thanks.

Your expression number / 2 is not of type Integer but Float.
Also as said in docs:
The function expects input values in unix timestamp format in
milliseconds, but it doesn't validate format.
From your question it's not clear what you're trying to achieve. You should pass unix timestamp in milliseconds instead of number / 2.
AFAIK quantile is used to calculate standard quantiles as we know it from statistics. And quantileTiming is optimised for computing quantiles of page loading times. As the use case is more narrow, it should be more precise at least. You can find implementations in ClickHouse repo in Quantile*.h files.

Related

If Statement error: Expressions that yield variant data-type cannot be used to define calculated columns

I'm new in Power BI and I'm more used to work with Excel. I try to translate following Excel formula:
=IF(A2="UPL";0;IF(MID(D2;FIND("OTP";D2)+3;1)=" ";"1";(MID(D2;FIND("OTP";D2)+3;1))))
in Power Bi as follows:
Algo =
VAR FindIT = FIND("OTP",Fixed_onTop_Data[Delivery Date],1,0)
RETURN
IF(Fixed_onTop_Data[Delivery Type] = "UPL", 0,
IF(FindIT = BLANK(), 1, MID(Fixed_onTop_Data[Delivery Date],FindIT+3,1))
)
Unfortunately I receive following error message:
Expressions that yield variant data-type cannot be used to define calculated columns.
My values are as follows:
Thank you so much for your help!
You cant mix Two datatypes in your output; In one part of if, you return an INT (literally 0/1), and is second you return a STRING
MID(Fixed_onTop_Data[Delivery Date],FindIT+3,1)
You must unify your output datatype -> everything to string or everything to INT
Your code must be returning BLANK in some cells therefore PowerBI isn't able to choose a data type for the column, wrap your code inside CONVERT(,INTEGER).

gfortran compiler Error: result of exponentiation exceeds the range of INTEGER(4)

I have this line in fortran and I'm getting the compiler error in the title. dFeV is a 1d array of reals.
dFeV(x)=R1*5**(15) * (a**2) * EXP(-(VmigFe)/kbt)
for the record, the variable names are inherited and not my fault. I think this is an issue with not having the memory space to compute the value on the right before I store it on the left as a real (which would have enough room), but I don't know how to allocate more space for that computation.
The problem arises as one part of your computation is done using integer arithmetic of type integer(4).
That type has an upper limit of 2^31-1 = 2147483647 whereas your intermediate result 5^15 = 30517578125 is slightly larger (thanks to #evets comment).
As pointed out in your question: you save the result in a real variable.
Therefor, you could just compute that exponentiation using real data types: 5.0**15.
Your formula will end up like the following
dFeV(x)= R1 * (5.0**15) * (a**2) * exp(-(VmigFe)/kbt)
Note that integer(4) need not be the same implementation for every processor (thanks #IanBush).
Which just means that for some specific machines the upper limit might be different from 2^31-1 = 2147483647.
As indicated in the comment, the value of 5**15 exceeds the range of 4-byte signed integers, which are the typical default integer type. So you need to instruct the compiler to use a larger type for these constants. This program example shows one method. The ISO_FORTRAN_ENV module provides the int64 type. UPDATE: corrected to what I meant, as pointed out in comments.
program test_program
use ISO_FORTRAN_ENV
implicit none
integer (int64) :: i
i = 5_int64 **15_int64
write (*, *) i
end program
Although there does seem to be an additional point here that may be specific to gfortran:
integer(kind = 8) :: result
result = 5**15
print *, result
gives: Error: Result of exponentiation at (1) exceeds the range of INTEGER(4)
while
integer(kind = 8) :: result
result = 5**7 * 5**8
print *, result
gives: 30517578125
i.e. the exponentiation function seems to have an integer(4) limit even if the variable to which the answer is being assigned has a larger capacity.

Does varchar perform better than string in Hive?

Since version 0.12 Hive supports the VARCHAR data type.
Will VARCHAR provide better performance than STRING in a typical analytical Hive query?
In hive by default String is mapped to VARCHAR(32762) so this means
if value exceed 32762 then the value is truncated
if data does not require the maximum VARCHAR length for storage (for example, if the column never exceeds 100 characters), then it allocates unnecessary resources for the handling of that column
The default behavior for the STRING data type is to map the type to SQL data type of VARCHAR(32762), the default behavior can lead to performance issues
This explanation is on the basis of IBM BIG SQL which uses Hive implictly
IBM BIGINSIGHTS doc reference
varchar datatype is also saved internally as a String. The only difference I see is String is unbounded with a max value of 32,767 bytes and Varchar is bounded with a max value of 65,535 bytes. I don't think we will have any performance gain because the internal implementation for both the cases is String. I don't know much about hive internals but I could see the additional processing done by hive for truncating the varchar values. Below is the code (org.apache.hadoop.hive.common.type.HiveVarchar) :-
public static String enforceMaxLength(String val, int maxLength) {
String value = val;
if (maxLength > 0) {
int valLength = val.codePointCount(0, val.length());
if (valLength > maxLength) {
// Truncate the excess chars to fit the character length.
// Also make sure we take supplementary chars into account.
value = val.substring(0, val.offsetByCodePoints(0, maxLength));
}
}
return value;
}
If anyone has done performance analysis/benchmarking please share.

How to format floating point numbers into a string using Go

Using Go I'm trying to find the "best" way to format a floating point number into a string. I've looked for examples however I cannot find anything that specifically answers the questions I have. All I want to do is use the "best" method to format a floating point number into a string. The number of decimal places may vary but will be known (eg. 2 or 4 or zero).
An example of what I want to achieve is below.
Based on the example below should I use fmt.Sprintf() or strconv.FormatFloat() or something else?
And, what is the normal usage of each and differences between each?
I also don't understand the significance of using either 32 or 64 in the following which currently has 32:
strconv.FormatFloat(float64(fResult), 'f', 2, 32)
Example:
package main
import (
"fmt"
"strconv"
)
func main() {
var (
fAmt1 float32 = 999.99
fAmt2 float32 = 222.22
)
var fResult float32 = float32(int32(fAmt1*100) + int32(fAmt2*100)) / 100
var sResult1 string = fmt.Sprintf("%.2f", fResult)
println("Sprintf value = " + sResult1)
var sResult2 string = strconv.FormatFloat(float64(fResult), 'f', 2, 32)
println("FormatFloat value = " + sResult2)
}
Both fmt.Sprintf and strconv.FormatFloat use the same string formatting routine under the covers, so should give the same results.
If the precision that the number should be formatted to is variable, then it is probably easier to use FormatFloat, since it avoids the need to construct a format string as you would with Sprintf. If it never changes, then you could use either.
The last argument to FormatFloat controls how values are rounded. From the documentation:
It rounds the
result assuming that the original was obtained from a floating-point
value of bitSize bits (32 for float32, 64 for float64)
So if you are working with float32 values as in your sample code, then passing 32 is correct.
You will have with Go 1.12 (February 2019) and the project cespare/ryu a faster alternative to strconv:
Ryu is a Go implementation of Ryu, a fast algorithm for converting floating-point numbers to strings.
It is a fairly direct Go translation of Ulf Adams's C library.
The strconv.FormatFloat latency is bimodal because of an infrequently-taken slow path that is orders of magnitude more expensive (issue 15672).
The Ryu algorithm requires several lookup tables.
Ulf Adams's C library implements a size optimization (RYU_OPTIMIZE_SIZE) which greatly reduces the size of the float64 tables in exchange for a little more CPU cost.
For a small fraction of inputs, Ryu gives a different value than strconv does for the last digit.
This is due to a bug in strconv: issue 29491.
Go 1.12 might or might not include that new implementation directly in strconv, but if it does not, you can use this project for faster conversion.

calculate standard deviation of daily data within a year

I have a question,
In Matlab, I have a vector of 20 years of daily data (X) and a vector of the relevant dates (DATES). In order to find the mean value of the daily data per year, I use the following script:
A = fints(DATES,X); %convert to financial time series
B = toannual(A,'CalcMethod', 'SimpAvg'); %calculate average value per year
C = fts2mat(B); %Convert fts object to vector
C is a vector of 20 values. showing the average value of the daily data for each of the 20 years. So far, so good.. Now I am trying to do the same thing but instead of calculating mean values annually, i need to calculate std annually but it seems there is not such an option with function "toannual".
Any ideas on how to do this?
THANK YOU IN ADVANCE
I'm assuming that X is the financial information and it is an even distribution across each year. You'll have to modify this if that isn't the case. Just to clarify, by even distribution, I mean that if there are 20 years and X has 200 values, each year has 10 values to it.
You should be able to do something like this:
num_years = length(C);
span_size = length(X)/num_years;
for n = 0:num_years-1
std_dev(n+1,1) = std(X(1+(n*span_size):(n+1)*span_size));
end
The idea is that you simply pass the date for the given year (the day to day values) into matlab's standard deviation function. That will return the std-dev for that year. std_dev should be a column vector that correlates 1:1 with your C vector of yearly averages.
unique_Dates = unique(DATES) %This should return a vector of 20 elements since you have 20 years.
std_dev = zeros(size(unique_Dates)); %Just pre allocating the standard deviation vector.
for n = 1:length(unique_Dates)
std_dev(n) = std(X(DATES==unique_Dates(n)));
end
Now this is assuming that your DATES matrix is passable to the unique function and that it will return the expected list of dates. If you have the dates in a numeric form I know this will work, I'm just concerned about the dates being in a string form.
In the event they are in a string form you can look at using regexp to parse the information and replace matching dates with a numeric identifier and use the above code. Or you can take the basic theory behind this and adapt it to what works best for you!

Resources