How to deal with large data files in Wolfram Mathematica

Is there a way to work with large files in Mathematica?
Currently I have a file of about 500 MB containing table data, which I load with
Import["data.txt","Table"];
What is an alternative way?

Use OpenRead["file"] which gives you an InputStream object on which you can use Read[stream]. Depending on the formatting of the data file you may need to set custom option values in Read[] for RecordSeparators.
Example:
In[1]:= str = OpenRead["ExampleData/USConstitution.txt"]
Out[1]= InputStream["ExampleData/USConstitution.txt", 24]
In[2]:= Read[str, Word]
Out[2]= "We"
In[3]:= Read[str, Word]
Out[3]= "the"
In[4]:= Read[str, Record]
Out[4]= "People of the United States, in Order to form a more perfect Union,"

You could also load your data into a database (for example MySQL) and access it from Mathematica using DatabaseLink.

The function DumpSave can also be helpful for saving large datasets. It saves data in Mathematica's internal format, so it's more efficient in both time and file size.

Related

Transformer-XL: Input and labels for Language Modeling

I'm trying to fine-tune the pretrained Transformer-XL model transfo-xl-wt103 for a language modeling task. Therefore, I use the model class TransfoXLLMHeadModel.
To iterate over my dataset I use the LMOrderedIterator from the file tokenization_transfo_xl.py which yields a tensor with the data and its target for each batch (and the sequence length).
Let's assume the following data with batch_size = 1 and bptt = 8:
data = tensor([[1,2,3,4,5,6,7,8]])
target = tensor([[2,3,4,5,6,7,8,9]])
mems # from the previous output
My question is: I currently pass this data into the model like this:
output = model(input_ids=data, labels=target, mems=mems)
Is this correct?
I am wondering because the documentation says for the labels parameter:
labels (torch.LongTensor of shape (batch_size, sequence_length), optional, defaults to None):
Labels for language modeling.
Note that the labels are shifted inside the model, i.e. you can set lm_labels = input_ids
So what is it about the parameter lm_labels? I only see labels defined in the forward method.
And when the labels "are shifted" inside the model, does this mean I have to pass data twice (as labels too, in place of target) because it's shifted inside? But how does the model then know the next token to predict?
I also read through this bug and the fix in this pull request, but I don't quite understand how to treat the model now (before vs. after the fix).
Thanks in advance for some help!
Edit: Link to issue on Github
That does sound like a typo carried over from another model's convention. You do have to pass data twice, once to input_ids and once to labels (in your case, [1, ... , 8] for both). The model will then attempt to predict [2, ... , 8] from [1, ... , 7]. I am not sure adding something at the beginning of the target tensor would work, as that would probably cause size mismatches later down the line.
Passing twice is the default way to do this in transformers; before the aforementioned PR, TransfoXL did not shift labels internally and you had to shift the labels yourself. The PR changed it to be consistent with the library and the documentation, where you have to pass the same data twice.
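To make the "shifted inside the model" convention concrete, here is a minimal sketch in plain Python. It only illustrates the index arithmetic described above and is not the library's actual code; shift_for_lm is a hypothetical helper name introduced for this example.

```python
def shift_for_lm(token_ids):
    """Illustrate the internal label shift (hypothetical helper, not a transformers API).

    When the same sequence is passed as both input_ids and labels,
    the prediction made at position i is scored against the token at i + 1.
    """
    inputs = token_ids[:-1]   # positions whose predictions are scored
    targets = token_ids[1:]   # the next token for each of those positions
    return inputs, targets


data = [1, 2, 3, 4, 5, 6, 7, 8]  # passed as both input_ids and labels
inputs, targets = shift_for_lm(data)
print(inputs)   # [1, 2, 3, 4, 5, 6, 7]
print(targets)  # [2, 3, 4, 5, 6, 7, 8]
```

The last token (8) has no successor inside the batch, so no loss term is computed for its prediction under this convention.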

In Wolfram Mathematica, how do I query the result of a Counts operation efficiently and conveniently?

EDIT: At the suggestion of @HighPerformanceMark, I've moved the question to mathematica.stackexchange.com (my question), and I attempted to close the question here. But SO doesn't allow me to do it properly, hence this up-front warning.
Setup
Say, I'm given a dataset, like the one below:
titanic = ExampleData[{"Dataset", "Titanic"}]; titanic
Answering with:
And I want to count the occurrences of every combination of {"1st", "2nd"} and {"female", "male"}, using the Counts operator on the dataset, like:
genderclasscounts = titanic[All, {"class", "sex"}][Counts]
Problem statement
This is not a "flat" dataset and I don't have a clue how to query it in the usual way, like:
genderclasscounts[Select[ ... ], ...]
The resulting dataset doesn't provide "column" names to be used as parameters in the Select, nor can I refer to the number representing the count by a name.
And I have no clue how to express an Association as a value in a Select!?
Furthermore, try genderclasscounts[Print]: this demonstrates that the values presented to the operation over this dataset are just numbers!
An unsatisfactory attempt
Of course, I can "flatten" the Counts result, by doing something horrific and inefficient like:
temp = Dataset[(row \[Function]
AssociationThread[{"class", "sex", "count"} -> row]) /@ (Nest[
Normal, genderclasscounts, 3] /.
Rule[{Rule["class", class_], Rule["sex", sex_]},
count_] -> {class, sex, count})]
In this form it is easy to query a count result:
First@temp[Select[#class == "1st" \[And] #sex == "female" &], "count"]
Question
So, my questions are
How can I query the (immediate) result of the Counts operation in a convenient and efficient fashion, like using a Select operation on the resulting dataset? Or, if that is not possible:
Is there an efficient and convenient transformation of the Counts result dataset that facilitates such a query? By "convenient" I mean, for example, that you just provide the dataset and the transformation handles the rest. So, not something like what I've shown above in my unsatisfactory "solution" ;-)
Thanks for reading this far, and I'm looking forward to answers and inspiration.
/@nanitous

Change lgbm internal parameter (threshold) by hand

I have trained a model with lgbm. I can dump its internal values with
booster.dump_model()
and see all the internal parameters that have been optimized during training (leaf values, thresholds, index of the variable for each split, ...). For testing purposes I would like to change some of them. Is there a way? I guess that changing just the output of dump_model will do nothing.
You can save your model to a human-readable format using booster.save_model('model.txt'), make your modifications in model.txt, and load the modified model back using modified_booster = lightgbm.Booster(model_file='model.txt').
I hope it helps!
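As a rough sketch of the "modify model.txt" step, the snippet below edits one split threshold in the model text using only string operations. It assumes the text format contains blocks that start with a line "Tree=<index>" followed by a space-separated "threshold=..." line; check this against your own model.txt before relying on it. set_threshold and the sample text are hypothetical and written for illustration only; they are not part of LightGBM.

```python
def set_threshold(model_text, tree_index, split_index, new_value):
    """Return model_text with one split threshold of one tree replaced.

    Assumes each tree block starts with 'Tree=<n>' and holds a line
    'threshold=v0 v1 ...' (an assumption about the text model format).
    """
    lines = model_text.splitlines()
    in_tree = False
    for i, line in enumerate(lines):
        if line.startswith("Tree="):
            # Track whether we are inside the requested tree block.
            in_tree = line == f"Tree={tree_index}"
        elif in_tree and line.startswith("threshold="):
            values = line[len("threshold="):].split(" ")
            values[split_index] = repr(float(new_value))
            lines[i] = "threshold=" + " ".join(values)
            return "\n".join(lines) + "\n"
    raise ValueError(f"no threshold line found for tree {tree_index}")


# Made-up miniature stand-in for the contents of model.txt.
sample = (
    "Tree=0\n"
    "split_feature=2 0\n"
    "threshold=0.5 1.25\n"
    "leaf_value=0.1 -0.2 0.3\n"
    "\n"
    "Tree=1\n"
    "split_feature=1\n"
    "threshold=3.0\n"
    "leaf_value=0.05 -0.05\n"
)
edited = set_threshold(sample, tree_index=0, split_index=1, new_value=2.0)
print(edited)
# After writing the edited text back to model.txt, reload it as described
# above with lightgbm.Booster(model_file='model.txt').
```

Whether the loader tolerates lines whose length has changed (for example if the file carries a header with per-tree sizes) is worth checking on your LightGBM version by comparing predictions before and after the edit.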

creating table from import from two files

I just started using Wolfram Mathematica.
I have two files with numbers:
x=Import["c"\.path here..\x.txt","Table"];
y=Import["c"\.path here..\y.txt","Table"];
Now I have two tables, x and y. I want to combine them into one table
{{x1, y1}, {x2, y2}, {x3, y3}, {x4, y4}}
That I can use to build a graphic using ListPlot.
I tried using something like that
num={};
l1=length[x]; l2=length[y];
Do[num=Append[num,Partition[x[[i]],1]],Append[num,Partition[y[[i]],1]],{i,l1}]
So how can I do that?
I found the answer:
t=MapThread[List,{x,y}]
That's it, simple and short.
Shorter and on my machine approx. 2 times faster than your answer:
t = Thread[{x, y}]
There is no speed difference to the answer given in a comment by agentp.

How to extract only the data points from BodePlot plot?

I am trying to fix the Phase plot part of BodePlot, as it does not wrap correctly. And there is no option that I can use to tell it to wrap.
So, instead of doing the full plot myself, (I can do that if I have to) I am thinking of first making the BodePlot, grab the data points, do the wrapping on the data (once I get the x,y data, the rest is easy), then I need to put the new list of points back into the plot, and then use Show to display it.
The part I am stuck at is extracting the points from FullForm. I can't get the correct pattern to do that.
This is what I got so far:
hz=z/(z^2-z+0.3);
tf=TransferFunctionModel[hz,z,SamplingPeriod->2];
phasePlot=BodePlot[tf,{0.001,2 Pi},
ScalingFunctions->{Automatic,{"Linear","Degree"}},PlotLayout->"List"][[2]]
You see how it does not wrap at 180 degrees. In DSP it is more common for the Bode phase plot to wrap. Here is what it 'should' look like:
So, this is what I did:
FullForm[phasePlot]
Graphics[List[
List[List[], List[],
List[Hue[0.67, 0.6, 0.6],
Line[List[List[0.0010000243495554542, -0.2673870119911639],
List[0.0013659538057574799, -0.36521403872250247],
List[0.0017318832619595053, -0.46304207336414027],
....
I see the data there (the x,y). But how do I pull them out? I tried this:
Cases[FullForm[phasePlot], List[x_, y_] -> {x, y}, Infinity];
But the above matches, in addition to the list of points, other stuff that I do not need.
I tried many other things, but can't get only the list of points out.
I was wondering if someone knows how to pull only the (x,y) points from the above plot. Is there a better way to do this other than using FullForm?
Thanks
Update:
I just found a post here which shows how to extract data from a plot. So I used it:
points = Cases[Normal@phasePlot, Line[pts_] -> pts, Infinity]
You could try nesting the replacement rules, for example
phase2 = phasePlot /.
Line[a_] :> (Line[a] /. {x_?NumericQ, y_?NumericQ} :> {x, Mod[y, 360, -180]});
Show[phase2, PlotRange -> {Automatic, {-180, 180}}, FrameTicks -> Automatic]
Output:
The list you are looking for appears to be wrapped by Line[], and it seems to be the only case in your plot. So you could use
Cases[phasePlot, Line[list_] :> list, Infinity]
Edit:
When I posted my response, the page refreshed and I saw that you came across precisely what I had proposed. I'll leave my response posted here anyway.
Edit2:
Szabolcs pointed out that FullForm[] has no effect, so I removed it from my original posting.

Resources