Highcharts: remove redundant data points to improve speed (performance)

I am drawing a simple line chart with Highcharts. One chart can include a great many points, which introduces delay when interacting with the chart.
Since many data points are redundant, I came up with the idea of not adding a new data point if its value is the same as the previous one. This reduces the amount of data but should still produce the same graph.
Please see this example: http://jsfiddle.net/qm94j14t/1/
I would like to have one straight line without the data points from February until November.
Right now the data array looks like this:
data: [7,7,7,7,7,7,7,7,7,7,7,10]
What do I need to change in the code to get a straight line without these redundant 7 values?

Instead of using the [y_1, y_2, ... , y_n] format, use the [[x_1, y_1], [x_2, y_2], ... , [x_n, y_n]] format.
Then remove the redundant points (demo: http://jsfiddle.net/qm94j14t/7/). For the data above that becomes [[0, 7], [10, 7], [11, 10]].

Transformer-XL: Input and labels for Language Modeling

I'm trying to fine-tune the pretrained Transformer-XL model transfo-xl-wt103 for a language modeling task. Therefore, I use the model class TransfoXLLMHeadModel.
To iterate over my dataset I use the LMOrderedIterator from the file tokenization_transfo_xl.py, which yields a tensor with the data and its target for each batch (and the sequence length).
Let's assume the following data with batch_size = 1 and bptt = 8:
data = tensor([[1,2,3,4,5,6,7,8]])
target = tensor([[2,3,4,5,6,7,8,9]])
mems # from the previous output
My question is: I currently pass this data into the model like this:
output = model(input_ids=data, labels=target, mems=mems)
Is this correct?
I am wondering because the documentation says for the labels parameter:
labels (:obj:torch.LongTensor of shape :obj:(batch_size, sequence_length), optional, defaults to :obj:None):
Labels for language modeling.
Note that the labels are shifted inside the model, i.e. you can set lm_labels = input_ids
So what is it about the parameter lm_labels? I only see labels defined in the forward method.
And when the labels "are shifted" inside the model, does this mean I have to pass data twice (i.e. pass data again instead of the targets), because it's shifted inside? But how does the model then know the next token to predict?
I also read through this bug and the fix in this pull request, but I don't quite understand how to treat the model now (before vs. after the fix).
Thanks in advance for some help!
Edit: Link to issue on Github
That does sound like a typo carried over from another model's convention. You do have to pass the data twice, once to input_ids and once to labels (in your case, [1, ... , 8] for both). The model will then attempt to predict [2, ... , 8] from [1, ... , 7]. I am not sure adding something at the beginning of the target tensor would work, as that would probably cause size mismatches further down the line.
Passing the data twice is the default way to do this in transformers; before the aforementioned PR, TransfoXL did not shift labels internally and you had to shift the labels yourself. The PR changed it to be consistent with the rest of the library and with the documentation, where you pass the same data twice.

Mathematica: removing columns gives "Cannot take positions 2 through 3" error

I have a matrix consisting of 3 rows and 4 columns, of which I require the central two columns.
I have attempted extracting the central two columns as follows:
a = a[[2 ;; 3, All]];
According to the Mathematica documentation, the first entry in a[[2 ;; 3, All]] represents the rows and the second the columns; however, whenever I try a[[All, 2 ;; 3]] it removes the top row rather than extracting the two columns. For some reason they seem inverted. I tried to get around this by switching the entries, but when I use a[[2 ;; 3, All]] I get the error: Part: Cannot take positions 2 through 3 in a.
I cannot wrap my head around why this keeps happening. It also refuses to extract single columns from the matrix.
You show that you are assigning a variable to itself and then say that things don't work for you. That makes me think you might have previously made assignments to variables, and the results of those are lurking in the background and might be responsible for what you are seeing.
With a fresh start of Mathematica, before you do anything else, try
mat={{a,b,c,d},
{e,f,g,h},
{i,j,k,l}};
take23[row_]:=Take[row,{2,3}];
newmat = Map[take23, mat]
Map applies the function take23 to every row and returns a list containing all the results, giving
{{b,c},
{f,g},
{j,k}}
If need be you can abbreviate that to
newmat = Map[Take[#,{2,3}]&, mat]
but that requires you to understand # and &, and it gives the same result.
If necessary you can further abbreviate that to
newmat = Take[#,{2,3}]& /@ mat
Map is widely used in Mathematica programming and can do many more things than just extract elements. Learning how to use that will increase your Mathematica skill greatly.
Or if you really need to use ;; then this
newmat = mat[[All, 2;;3]]
I interpret the documentation for that to mean you want to do something with All the rows and then within each row you want to extract from the second to the third item. That seems to work for me and instantly returns the same result.
If you instead wrote
newmat = mat[[1;;2, 2;;3]]
that would tell it that you wanted to work from row 1 down to row 2 and within those you want to work from column 2 to column 3 and that gives
{{b,c},
{f,g}}

(Using Julia) How can I reduce my data matrix by averaging values from the same hour?

I am trying to reduce the size of my data and I cannot make it work. I have data points taken every minute over one month. I want to reduce this data to one sample for every hour. The problem is: some of my rows have "NA" values, so I delete those rows, and there are not exactly 60 points in every hour; it varies.
I have a 'Timestamp' column. I have used this to make a 'datehour' column, which has the same value for rows that share the same date and hour. I want to average all the values with the same 'datehour' value.
How can I do this? I have tried using the if and for loop below, but it takes so long to run.
Thanks for all your help! I am new to Julia and come from a Matlab background.
======= CODE ==========
uniquedatehour=unique(datehour,1)
index=[]
avedata=reshape([],0,length(alldata[1,:]))
for j in uniquedatehour
    for i in 1:length(datehour)
        if datehour[i]==j
            index=vcat(index,i)
        else
            rows=alldata[index,:]
            rows=convert(Array{Float64,2},rows)
            avehour=mean(rows,1)
            avedata=vcat(avedata,avehour)
            index=[]
            continue
        end
    end
end
There are several layers to optimizing this code. I am assuming that your data is sorted on datehour (your code assumes this).
Layer one: general recommendation
Wrap your code in a function. Executing code in global scope in Julia is much slower than running it inside a function. When you wrap it, make sure to either pass the data to your function as arguments or, if the data stays in global scope, qualify it with const.
Layer two: recommendations for your algorithm
A statement like [] creates an array with element type Any, which is slow; use a typed constructor like index=Int[] to make it fast;
Using vcat like index=vcat(index,i) is inefficient; it is better to do push!(index, i) in place;
It is better to preallocate avedata, e.g. with fill(NA, length(uniquedatehour), size(alldata, 2)), and assign values into the existing matrix than to vcat onto it;
Your code will produce incorrect results if I am not mistaken, as it will not catch the last entry of the uniquedatehour vector (assume it has only one element and check what happens: avedata will have zero rows);
The line rows=convert(Array{Float64,2},rows) is probably not needed at all. If alldata is not a Matrix{Float64}, it is better to convert it once at the beginning with Matrix{Float64}(alldata);
You can change the line rows=alldata[index,:] to a view, view(alldata, index, :), to avoid an allocation;
In general you can avoid creating the index vector at all: it is enough to remember the start position s and the end position e of each run of equal values and then use the range s:e to select the rows you want (see the sketch below).
If you correct those things, please post your updated code and maybe I can help further, as there is still room for improvement, but that requires a slightly different algorithmic approach (though maybe you will prefer the option below for simplicity).
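For illustration, here is a minimal sketch of a function along those lines, assuming alldata is a Matrix{Float64} and datehour is a vector sorted together with it; hourly_means is a made-up name, and the code uses the same Julia 0.6-era syntax as the rest of this thread (e.g. mean(A, 1)):
function hourly_means(alldata::Matrix{Float64}, datehour::AbstractVector)
    uniquedatehour = unique(datehour)
    avedata = fill(NaN, length(uniquedatehour), size(alldata, 2))  # preallocated result
    s = 1  # start of the current run of equal datehour values
    for (k, dh) in enumerate(uniquedatehour)
        e = s
        while e < length(datehour) && datehour[e + 1] == dh
            e += 1  # extend the run while datehour stays the same
        end
        avedata[k, :] = vec(mean(view(alldata, s:e, :), 1))  # average this hour's rows
        s = e + 1  # the next run starts right after this one
    end
    return avedata
end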
Layer three: how I would do it
I would use DataFrames package to handle this problem like this:
using DataFrames
df = DataFrame(alldata) # assuming alldata is Matrix{Float64}, otherwise convert it here
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean) # maybe this is all what you need if DataFrame is OK for you
Matrix(agg[2:end]) # here is how you can convert DataFrame back to a matrix
This is not the fastest solution (it converts the data to a DataFrame and back), but it is much simpler.
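For instance, with a small made-up dataset (three minute-level samples spanning two hours) and the same older DataFrames API as above, it looks like this:
using DataFrames

alldata = [1.0 10.0; 2.0 20.0; 6.0 60.0]                        # toy minute-level data
datehour = ["2020-01-01 00", "2020-01-01 00", "2020-01-01 01"]  # toy grouping labels

df = DataFrame(alldata)
df[:grouping] = datehour
agg = aggregate(df, :grouping, mean)
Matrix(agg[2:end])  # one row of hourly means per hour, here [1.5 15.0; 6.0 60.0]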

D3 graph not updating on click

My line chart is not updating with new data once I click the black button and I'm not sure what I could possibly be doing wrong.
Block here:
Let's look at your NaN errors:
<path class="line" d="M0,324.19471776281716L0,NaNL155,NaNL155,270L310,270L310,353.84774728120146L465,353.84774728120146" transform="translate(78.1818,0)"></path>
It seems that we are missing two y values; we can see this if we split the path data into its x,y pairs:
M0,324.19471776281716
L0,NaN
L155,NaN
L155,270
L310,270
L310,353.84774728120146
L465,353.84774728120146
So we need to check two things: one is the y scale, and the other is the data passed to the y scale. The y scale looks okay; if it failed on one number, it would fail on all of them. Let's look at the CSV data:
education,number
Bachelor's degree,2367
Degree in medicine, dentistry, veterinary medicine or optometry,5763
Earned doctorate,3862
Master's degree,1549
Here's our problem: we have a comma-separated file with extra commas in the second data row (not counting the header row). We can see that this is causing issues in the alternate data: the education value is cut down to "Degree in medicine", and the portion of the name after the first comma is dropped. Let's wrap that value in quotation marks so that the commas won't count as delimiters:
education,number
Bachelor's degree,2367
"Degree in medicine, dentistry, veterinary medicine or optometry",5763
Earned doctorate,3862
Master's degree,1549
Your code in the update function is selecting #body, but an element with id="body" doesn't seem to exist. Did you mean to select body instead, i.e. the HTML body element?

Iteratively populate dataframes using a for loop in Julia

I am looking to find a way to iteratively populate a dataframe in Julia.
I have a working function that creates multiple points along a line:
#function to draw QMD lines
using DataFrames
function make_lines(qmd)
    BA=Float64[]
    TPA=Float64[]
    QMD=Int[]
    for i in stk_percent
        tpa= 1*(i*10)/(a[1]+a[2]*(-0.259+0.973*qmd)+a[3]*qmd^2)
        ba=pi*(qmd/24)^2*tpa
        push!(TPA,tpa)
        push!(BA,ba)
        push!(QMD,qmd)
    end
    return DataFrame(TPA=TPA,BA=BA,QMD=QMD)
end
The next step I am trying to accomplish is to run the make_lines function in a loop, using a pre-defined set of inputs, with all the outputs in one single dataframe, but I cannot get it to work.
dia = [7, 8, 10, 12, 14, 16, 18, 20, 22]
# can't get for loop to append all the data frames?
for i in dia
    df=DataFrame(TPA=Float64[],BA=Float64[],QMD=Int[])
    append!(df,make_lines(i))
    return df
end
At first I thought it was how I was using DataFrames. I have never used push! etc. before, but I got this code chunk to work:
#this works to combine dataframe
test=make_lines(22)
test2=make_lines(8)
test[:]
append!(test,test2)
So why, when I run the for loop, do I end up with only the last dataframe it produces?
Am I misinterpreting something? From what I have read, DataFrames in Julia work differently from data frames in R, but I cannot wrap my head around how to get this working.
You are pretty close, but there are a couple of places where you are getting tripped up in your code. You currently have:
dia = [7, 8, 10, 12, 14, 16, 18, 20, 22]
# can't get for loop to append all the data frames?
for i in dia
    df=DataFrame(TPA=Float64[],BA=Float64[],QMD=Int[])
    append!(df,make_lines(i))
    return df
end
This isn't quite what you want for two reasons:
One: This snippet isn't a function. It thus doesn't make sense, and will cause problems, to have return in it.
Two: At each step in your loop, you are re-creating your dataframe df from scratch, erasing everything that you put before it. This is why, as you say, you only end up with the last data frame that it produces. Instead, you would want something like:
dia = [7, 8, 10, 12, 14, 16, 18, 20, 22]
df=DataFrame(TPA=Float64[],BA=Float64[],QMD=Int[])
for i in dia
    append!(df,make_lines(i))
end
Note: I couldn't get a completely working version of your code going - the objects stk_percent and a in your main function never get defined, so I didn't really know what to put in for those. But, I believe that if you fix these issues you'll likely be in a better spot (I made up some values for them and it worked fine).
Performance tip: when you do fix those, my recommendation would be to make them explicit arguments that you pass to your function, as in the sketch below. Although it will still work if they are just variables in global scope, this will lead to suboptimal performance of your code, both now and in the future, and potentially worse things, like confusing variable scope or having values change when you don't want them to. Best to start off from the beginning of your journey with Julia adopting as many best practices in writing your code as is practicable.
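A minimal sketch of that signature change; the values of a and stk_percent here are made-up placeholders, since the real ones were not shown in the question:
using DataFrames

a = [0.1, 0.2, 0.3]      # placeholder coefficients; substitute your real values
stk_percent = 10:10:100  # placeholder stocking percentages; substitute your real values

function make_lines(qmd, stk_percent, a)  # everything the function needs is passed in
    BA = Float64[]
    TPA = Float64[]
    QMD = Int[]
    for i in stk_percent
        tpa = 1*(i*10)/(a[1] + a[2]*(-0.259 + 0.973*qmd) + a[3]*qmd^2)
        ba = pi*(qmd/24)^2*tpa
        push!(TPA, tpa)
        push!(BA, ba)
        push!(QMD, qmd)
    end
    return DataFrame(TPA=TPA, BA=BA, QMD=QMD)
end

df = DataFrame(TPA=Float64[], BA=Float64[], QMD=Int[])
for i in [7, 8, 10, 12, 14, 16, 18, 20, 22]
    append!(df, make_lines(i, stk_percent, a))
end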
I managed to create a blank dataframe by providing the column types and the column names:
df = DataFrame([DateTime; fill(Float64, 2); String; fill(Float64, 2)],
               ["Date", "A", "B", "Letter", "C", "D"])
Then I can append the results to populate the new dataframe by using the rename! and append! functions inside a for loop, as in the sketch below.
This is very useful for large datasets with numerous columns.
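A rough sketch of that pattern, assuming a DataFrames version that accepts the constructor and rename!(df, newnames) calls used here; make_chunk and its contents are made up purely for illustration:
using DataFrames, Dates

# blank dataframe with explicit column types and names, as above
df = DataFrame([DateTime; fill(Float64, 2); String; fill(Float64, 2)],
               ["Date", "A", "B", "Letter", "C", "D"])

# made-up helper standing in for whatever computation produces each block of results
make_chunk(i) = DataFrame(time = [DateTime(2020, 1, i)],
                          x = [1.0*i], y = [2.0*i],
                          tag = ["row $i"],
                          u = [3.0*i], v = [4.0*i])

for i in 1:3
    chunk = make_chunk(i)
    rename!(chunk, names(df))   # align the chunk's column names with the target
    append!(df, chunk)          # add its rows to the accumulated dataframe
end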
