Using scala breeze with complex datatypes - scala-breeze

What is the best way to build a dense matrix within Breeze that is made up of objects with different data types?
For example, how would
case class X(a1: Int, b2: Long, c3: Double, d4: Double)
get mapped into a dense matrix?
Is there an equivalent of Numpy's dtypes within Breeze?
Ultimately I have many millions of records I'm trying to process, and I am looking to perform arithmetic on the b2, c3 and d4 columns based on slices of the a1 column in the X class.

Breeze doesn't support that kind of functionality yet. It's desired, but there's nothing at this time.

Related

Concatenated Doc2Vec - calculate similarities

I have two Doc2Vec models trained on the same corpus but with different parameters. I would like to concatenate the two of them and calculate the similarity of a given input word, using the vectors returned from the concatenated model. I have read a lot of comments saying that this method may not be particularly suited to improving performance, and that it might be necessary to change the source code of the KeyedVectors class in gensim to enable it. So far I have attempted to do this using the Translation Matrix, but it returns 5 features from the second model and I am not sure whether it is performing the translation correctly or not.
Has anybody already encountered this issue? Is there another way to calculate the similarity for an input word in a concatenated doc2vec model?
Up to now I have been able to reproduce this:
import numpy as np

vocab1 = model1.wv
vocab2 = model2.wv
concatenated_vectors = {}
vocab_concatenated = vocab1
for i in range(len(vocab1.vectors)):
    v1 = vocab1.vectors[i]
    v2 = vocab2.vectors[i]
    vocab_concatenated[list(vocab1.vocab.keys())[i]] = np.concatenate((v1, v2))
In order to recalculate the most_similar() top-n features for a passed argument, how should I re-instantiate the newly created object? It seems that
.add_vectors(list(vocab1.vocab.keys()), vocab_concatenated[list(vocab1.vocab.keys())])
is not working, but I am sure I am missing something.
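One thing that may be worth trying, shown as a rough sketch below: under gensim 4.x (an assumption on my part, and untested against your models), you can build a fresh KeyedVectors object, load the concatenated vectors into it with add_vectors(), and call most_similar() on it directly. model1/model2 are your two Doc2Vec models and "example_word" is a placeholder:
import numpy as np
from gensim.models import KeyedVectors

vocab1, vocab2 = model1.wv, model2.wv
# Sanity check: both models must index the same words in the same order.
assert vocab1.index_to_key == vocab2.index_to_key

combined = KeyedVectors(vector_size=vocab1.vector_size + vocab2.vector_size)
combined.add_vectors(
    vocab1.index_to_key,                          # the shared vocabulary
    np.hstack([vocab1.vectors, vocab2.vectors]),  # concatenate feature-wise
)

print(combined.most_similar("example_word", topn=5))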

LightGBM incrementally construct Dataset

I want to construct a LightGBM Dataset object from very large X and y, which cannot be loaded into memory. Is there any method that can construct the Dataset in batches? E.g. something like
import lightgbm as lgb
ds = lgb.Dataset()
for X, y in data_generator():
    ds.add_new_data(data=X, label=y)
Regarding the data, there are a few hacks. For example, if your data is numeric, make sure the precision is not longer than necessary, e.g. two decimal digits will probably be enough (it depends on your data). Or if you have categorical data, make sure you store it as integer codes. But you are probably looking for a better approach.
There is a concept called incremental learning. Basically, you make a model (a tree) in your first iteration using the first batch of data. Then for your next model, you use that tree as a template and only update the values (you can also allow for shrinkage). You can use keep_training_booster for such a scenario (a sketch follows below); please read the documentation on your own to learn the mechanism.
The third technique is to make multiple models: say you divide your data into N pieces and make N models, then combine them with an ensemble approach. This way you have used your entire data across the N models without loading it all at once.
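To illustrate the incremental-learning option, here is a rough sketch (not tested on large data) that chains lgb.train() calls with init_model and keep_training_booster; data_generator() is the batch source from the question and the parameters are placeholders:
import lightgbm as lgb

params = {"objective": "regression", "verbosity": -1}
booster = None
for X, y in data_generator():
    batch = lgb.Dataset(X, label=y)
    booster = lgb.train(
        params,
        batch,
        num_boost_round=10,
        init_model=booster,            # continue from the trees built so far
        keep_training_booster=True,    # keep the booster usable as init_model
    )
# booster now reflects training over all batches (batch order and size matter).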

Algorithm to suggest which chart to build using structure of the data

I am working on a project where we need to dynamically generate visuals based on data. The structure of the data changes based on the query fired against it.
I have seen this feature in many BI tools, where they suggest a suitable visualization type based on the data structure.
I have tried creating my own algorithm based on the rules we generally use to create a chart.
I want to know if there are any such algorithms or rules which can help us build this.
I'm just trying to give you a head start here with what I could think of. Since you have mentioned you have already tried writing your own algorithm with rules, please show your work. As far as I know, the chart type can be determined based on the nature of the (x, y) points you're trying to plot; a rough sketch of these rules follows below.
For a single x, if there are many corresponding y values, go with a scatter plot.
For a single x, if there is only one corresponding y value:
    If x takes integer values, go with a line chart.
    If x takes string values instead, go with a bar chart.
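Here is one way those rules could be encoded (my own hypothetical helper, not a library API), using pandas to inspect the two columns:
import pandas as pd

def suggest_chart(df: pd.DataFrame, x: str, y: str) -> str:
    # Many distinct y values for a single x -> scatter plot.
    if df.groupby(x)[y].nunique().max() > 1:
        return "scatter"
    # One y per x: numeric x -> line chart, otherwise bar chart.
    if pd.api.types.is_numeric_dtype(df[x]):
        return "line"
    return "bar"

df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [10, 12, 9]})
print(suggest_chart(df, "month", "sales"))  # -> "bar"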

Representing 2D data optimized by row vs by column vs flat

In D3 I need to visualize loading lab samples into plastic 2D plates of 8 rows x 12 columns or similar. Sometimes I load a row at a time, sometimes a column at a time, occasionally I load flat 1D 0..95, or other orderings. Should the base D3 data() structure nest rows in columns (or vice versa), or should I keep it one-dimensional?
Representing the data optimized for columns (columns[rows[]]) makes the code complex when loading by rows, and vice versa. Representing it flat (0..95) is universal, but it requires calculating all row and column references for the 2D modes. I'd rather reference all orderings out of a common base, but so far it's a win-lose proposition. I lean toward 1D flat and doing the math. Is there a win-win? Is there a way to parameterize or invert the ordering and have it optimized for both ways?
I believe in your case the best implementation would be an Associative array (specifically, a hash table implementation of it). Keys would be coordinates and values would be your stored data. Depending on your programming language you would need to handle keys in one way or another.
Example:
[0,0,0] -> someData(1,2,3,4,5)
[0,0,1] -> someData(4,2,4,6,2)
[0,0,2] -> someData(2,3,2,1,5)
Using a simple associative array would give you great insertion and read speeds; however, the code would become a mess if some complex selection of data blocks is required. In that case, using a database could be reasonable (though slower than a hashmap implementation of an associative array). It would allow you to query specific data in batches. For example, you could get a whole row (or several rows) of data using one simple query:
SELECT * FROM data WHERE x=1 AND y=2 ORDER BY z ASC
Or, let's say selecting a 2x2x2 cube from the middle of 3d data:
SELECT * FROM data WHERE x>=5 AND x<=6 AND y>=10 AND y<=11 AND z>=3 AND z<=4 ORDER BY x ASC, y ASC, z ASC
EDIT:
On second thought, if the size of the dimensions won't change during runtime, you should go with a 1-dimensional array and do all the math yourself, as it is the fastest solution. If you initialize a 3-dimensional array as an array of arrays of arrays, every read/write of an element requires 2 additional hops in memory to find the required address. However, writing a function like:
int pos(int w, int h, int x, int y, int z) { return z*w*h + y*w + x; } // w,h - dimensions; x,y,z - position
would be inlined by most compilers and is pretty fast.
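For the 8 x 12 plate from the question the same index math applies; here is a minimal sketch (in Python for brevity, but the arithmetic transliterates directly to the D3/JavaScript side):
ROWS, COLS = 8, 12          # the 8 x 12 plate from the question

def flat_index(row, col):
    # Row-major position of (row, col) in the flat 0..95 array.
    return row * COLS + col

def row_col(i):
    # Inverse mapping: flat index back to (row, col).
    return divmod(i, COLS)

wells = [None] * (ROWS * COLS)

# Load a whole row from the flat base...
for c in range(COLS):
    wells[flat_index(2, c)] = ("row-load", 2, c)

# ...or a whole column, from the same flat base.
for r in range(ROWS):
    wells[flat_index(r, 5)] = ("col-load", r, 5)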

What algorithm would allow building optimal "groups" of terms?

I have a table of data and I want to pull specific records. The records are indicated in various, nigh-random ways (how isn't important), but I want to be able to identify them using 11 specific terms. Essentially, I'm being given a lot of queries against non-indexed fields and having to rewrite them using specific indexed fields -- except thanks to an Enterprisey System it's not as simple as that: the data has to be packaged in a certain way that avoids directly touching SQL.
It might be easier to give an example in 2 dimensions, although the problem itself uses 11 dimensions and that number will probably change:
  123
 +---+
A|X O|
B| X |
C|X O|
 +---+
If I wanted to group all the X's in the above grid, I could say: A1 and B2 and C1. Better would be (A,C)1 and B2. Even better would be (A,B,C)(1,2) -- empty spaces can be included or excluded for this problem, they don't matter. What's important is keeping the number of groups down, getting all the Xs and avoiding all the Os.
To give a hint on sizing, the actual problem will generally deal with anywhere between 100 and 5000 "good" records. It is also not necessary to have The Ideal Answer -- a Pretty Good answer would suffice.
This sounds a lot like Karnaugh maps, with X = true, O = false, and blank = "don't care".
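If you want something programmatic rather than a by-hand Karnaugh map, a greedy cover in the same spirit may be good enough for a Pretty Good answer. A rough sketch of my own (using the 2-D example grid above; the greedy strategy is an assumption, not a known-optimal algorithm):
from itertools import product

# 'X' = must cover, 'O' = must avoid, ' ' = don't care (the example grid above).
grid = {
    ('A', 1): 'X', ('A', 2): ' ', ('A', 3): 'O',
    ('B', 1): ' ', ('B', 2): 'X', ('B', 3): ' ',
    ('C', 1): 'X', ('C', 2): ' ', ('C', 3): 'O',
}
rows, cols = ['A', 'B', 'C'], [1, 2, 3]

def valid(row_set, col_set):
    # A group may contain X and blank cells, but never an O.
    return all(grid[(r, c)] != 'O' for r, c in product(row_set, col_set))

uncovered = {cell for cell, v in grid.items() if v == 'X'}
groups = []
while uncovered:
    r0, c0 = next(iter(uncovered))
    row_set, col_set = {r0}, {c0}
    grew = True
    while grew:                       # grow the group while it stays O-free
        grew = False
        for r in rows:
            if r not in row_set and valid(row_set | {r}, col_set):
                row_set.add(r)
                grew = True
        for c in cols:
            if c not in col_set and valid(row_set, col_set | {c}):
                col_set.add(c)
                grew = True
    groups.append((sorted(row_set), sorted(col_set)))
    uncovered -= set(product(row_set, col_set))

print(groups)   # e.g. [(['A', 'B', 'C'], [1, 2])] for the grid above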
