I'm using crossfilter and dc to render charts of subject-related observations.
Each observation gets treated as a dimension. However, not all rows have values for all dimensions, because some observations are repeated over time while others are not. For example, Column A has four values over four rows, but Column B has only one value, so in the other three rows it is 0 / "" / blank.
Now if I filter on Column B for a certain range / value, I automatically lose all the other rows for Column A, and if I then filter on Column A AFTER filtering on Column B, I'm only filtering within the one common row that has values for both.
This may sound like logical behaviour, but it's not true to the data: if I want to filter subjects (i.e. rows) that have a certain range for Column A AND a certain range for Column B, I get a wrong result because of the blank values. Those values aren't actually missing; they're only blank because the data is laid out as a table, where every column is expected to have a value in every row.
Is there a way to filter on Column B without excluding rows merely because their Column B value is blank?
Sorry it took so much text to explain!
UPDATE
An example: observation data is collected for patients, say 'weight' and 'blood pressure'. For one subject there might be two weight readings but four blood pressure readings. When I create the data structure for crossfilter, I create two columns, one for weight and another for blood pressure. I want to display two bar charts showing the distribution of values for each observation across all subjects. The user should be able to filter subjects by a weight range AND a blood pressure range. Because two of the rows for a subject will not have values for weight, filtering on weight will filter out rows that might be in range for the blood pressure filter but had no weight value, so they are wrongly excluded.
I managed to do it with arrays, so now instead of my data being structured as a flat table:
{subjectId: "subject-101", study: "CRC305A", A: "24", B: "79"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "74"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "83"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "74"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "72"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "82"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "74"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "79"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "76"}
{subjectId: "subject-101", study: "CRC305A", A: "", B: "72"}
it's structured as below to allow variability in values from one column to another
{
  subjectId: "subject-101",
  A: ["24"],
  B: ["79", "74", "83", "74", "72", "82", "74", "79", "76", "72", "79", "76", "77", "72", "83", "69", "72"]
}
And the filtering magically works!
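In case it helps anyone doing the same, here is a minimal sketch of that array-based setup. It assumes crossfilter2, which supports "iterable" dimensions via a second boolean argument to dimension(); the accessors and the example range are purely illustrative:
var cf = crossfilter(subjects);   // one record per subject; A and B are arrays of strings
// Passing true as the second argument creates an iterable dimension, so every
// value in the array is indexed against the same underlying record.
var dimA = cf.dimension(function(d) { return d.A.map(Number); }, true);
var dimB = cf.dimension(function(d) { return d.B.map(Number); }, true);
// A range filter on B keeps any subject with at least one B value in range,
// instead of dropping records that are simply blank for B.
dimB.filterRange([70, 80]);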
I still have one problem, related to the behaviour of dimension.top and dimension.bottom with respect to arrays. I'll post that in another question.
If the values are logically there, you should probably propagate them to the next rows before sticking them in crossfilter. Crossfilter doesn't have any concept of row order or defaulted values.
If I understand your question correctly, I'd do something like
// Carry the last seen value forward into rows where the field is blank
var lastA, lastB, lastC;
data.forEach(function(d) {
    if(d.A)
        lastA = d.A;
    else
        d.A = lastA;
    if(d.B)
        lastB = d.B;
    else
        d.B = lastB;
    // ...
});
var cf = crossfilter(data);
Trying to create the sort of "wildcard values" you suggest in your question might be possible, but you'd definitely have to change at least the filter handler for every chart, because the charts expect to be dealing with discrete values.
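For reference, a rough sketch of what that override might look like on a single chart, using dc.js's chart.filterHandler. The blank-as-wildcard check and the range test below are assumptions for illustration, not something dc.js does out of the box:
// Treat blank values as wildcards: a row with no value for this dimension is
// never excluded by this chart's filter.
chart.filterHandler(function(dimension, filters) {
    if (filters.length === 0) {
        dimension.filter(null);                          // no filter active
    } else {
        dimension.filterFunction(function(value) {
            if (value === "" || value == null) {
                return true;                             // blank acts as a wildcard
            }
            return filters.some(function(filter) {
                return Array.isArray(filter)
                    ? value >= filter[0] && value < filter[1]   // ranged (brush) filter
                    : value === filter;                         // exact value filter
            });
        });
    }
    return filters;
});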
Related
I am using this code https://github.com/go-echarts/examples/blob/master/examples/parallel.go to generate a parallel chart with the following fields:
User, Product, Shop
which are all strings.
My data structures are shown below:
parallelAxisList = []opts.ParallelAxis{
    {Dim: 0, Name: "User", Type: "category"},
    {Dim: 1, Name: "Product", Type: "category"},
    {Dim: 2, Name: "Shop", Type: "category"},
}
parallelData = [][]interface{}{
    {"user1", "product1", "shop1"},
    {"user1", "product2", "shop1"},
    {"user1", "product2", "shop2"},
    {"user2", "product1", "shop2"},
    {"user1", "product1", "shop2"},
}
But for some reason the graph renders with only the first axis populated (i.e. user1 and user2 appear, but the other axes are empty). Any idea why?
If I replace the second and third columns of parallelData with numeric values and remove the category type from Product and Shop in parallelAxisList, it renders fine. Any idea how to render string fields?
I scoured the internet as best I could but couldn't find an answer -- I was wondering, is there some way to sort the contents of a row by record? E.g. take the following table:
| Key | Row to sort | Other row |
|-----|-------------|-----------|
| a   | bca         | A         |
|     | cab         | cab       |
|     | abc         | f         |
| b   | zyx         |           |
|     | yxz         | u         |
| c   | def         | h         |
|     | fed         | h         |
and turn it into:
| Key | Row to sort | Other row |
|-----|-------------|-----------|
| a   | abc         | A         |
|     | bca         | cab       |
|     | cab         | cab       |
| b   | yxz         |           |
|     | zyx         | u         |
| c   | def         | h         |
|     | fed         | h         |
The ultimate goal is to sort all of the columns for each record alphabetically, and then blank up so that each record is a single row.
I've tried doing a sort on the column to sort within the record itself, but that orders the records by whichever record has an entry that comes first alphabetically (regardless of whether it's the first entry for the record or not, interestingly).
Here is a solution using sort
Prerequisite: assuming that the values in the "Key" column are unique.
1. Switch to rows mode.
2. Fill down the "Key" column via Key => Edit cells => Fill down.
3. Sort the "Key" column via Key => Sort...
4. Sort the "Row to sort" column via Row to sort => Sort... as an additional sort.
5. Make the sorting permanent by selecting Reorder rows permanently in the Sort menu.
6. Blank down the "Key" and "Row to sort" columns.
Here is a solution using GREL
As deduplicating and sorting records is quite a common task, I have a GREL expression that reduces this task to two steps:
Transform the "Row to sort" column with the following GREL expression:
if(
row.index - row.record.fromRowIndex == 0,
row.record.cells[columnName].value.uniques().sort().join(","),
null
)
Split the multi-valued cells in the "Row to sort" column on the separator ,.
The GREL expression takes all the record's cells in the current column, extracts their values into an array, makes the values in the array unique, sorts the remaining values, and joins them into a string using , as the separator.
The joining into a string is necessary as OpenRefine currently has no support for displaying arrays in the GUI.
I would do it as follows:
1. For all columns except the key column, use the Edit cells > Join multi-valued cells operation, with a separator that is not present in the cell values.
2. Transform all columns except the key column with: value.split(',').sort().join(',')
3. Split your columns back with Edit cells > Split multi-valued cells.
Then you can blank down / fill down as you wish.
Here is the JSON representation of the workflow for your example:
[
{
"op": "core/multivalued-cell-join",
"columnName": "Row to sort",
"keyColumnName": "Key",
"separator": ",",
"description": "Join multi-valued cells in column Row to sort"
},
{
"op": "core/multivalued-cell-join",
"columnName": "Other row",
"keyColumnName": "Key",
"separator": ",",
"description": "Join multi-valued cells in column Other row"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "record-based"
},
"columnName": "Row to sort",
"expression": "grel:value.split(',').sort().join(',')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column Row to sort using expression grel:value.split(',').sort().join(',')"
},
{
"op": "core/text-transform",
"engineConfig": {
"facets": [],
"mode": "record-based"
},
"columnName": "Other row",
"expression": "grel:value.split(',').sort().join(',')",
"onError": "keep-original",
"repeat": false,
"repeatCount": 10,
"description": "Text transform on cells in column Other row using expression grel:value.split(',').sort().join(',')"
},
{
"op": "core/multivalued-cell-split",
"columnName": "Row to sort",
"keyColumnName": "Key",
"mode": "separator",
"separator": ",",
"regex": false,
"description": "Split multi-valued cells in column Row to sort"
},
{
"op": "core/multivalued-cell-split",
"columnName": "Other row",
"keyColumnName": "Key",
"mode": "separator",
"separator": ",",
"regex": false,
"description": "Split multi-valued cells in column Other row"
}
]
My data consists of strings that can be numbers or strings ("12" should be seen as a number while "12REF" should be seen as a string).
I am looking to implement an ORDER BY in my criteriaBuilder query that sorts the numeric strings first (in numeric order) and puts the remaining strings, sorted alphabetically, at the end.
Correctly sorted example:
<[["1", "2", "10", "A", "AB", "B", "DUP", "LNE", "NUL"]]>
Currently my code looks like this (just sorting by asc, using the CriteriaQuery's orderBy).
.orderBy(QueryUtils.toOrders(
Sort.by(Sort.Direction.ASC, selection.getAlias()), root,
criteriaBuilder)
);
which results in:
<[["1", "10", "2", "A", "AB", "B", "DUP", "LNE", "NUL"]]>
How can I implement a custom ordering here?
Edit: the data is stored in a MySQL database
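One possible direction, as a sketch only: order first by whether the value is all digits, then by length for the numeric values (with no leading zeros, sorting by length and then lexicographically equals numeric order), then by the value itself. This assumes MySQL 8+ so that REGEXP_LIKE is available, and uses "code" as a stand-in for the attribute being sorted:
Path<String> code = root.get("code");

// 1 when the value is all digits, 0 otherwise (MySQL's REGEXP_LIKE returns 1/0).
Expression<Integer> isNumeric = criteriaBuilder.function(
        "REGEXP_LIKE", Integer.class, code, criteriaBuilder.literal("^[0-9]+$"));

// For numeric values sort by string length; for everything else use a constant,
// so the non-numeric values fall through to plain alphabetical ordering.
Expression<Integer> numericLength = criteriaBuilder.<Integer>selectCase()
        .when(criteriaBuilder.equal(isNumeric, 1), criteriaBuilder.length(code))
        .otherwise(0);

criteriaQuery.orderBy(
        criteriaBuilder.desc(isNumeric),     // numbers before everything else
        criteriaBuilder.asc(numericLength),  // "2" before "10"
        criteriaBuilder.asc(code));          // alphabetical within each group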
I've an index of products.
Each product has several variants (a few or hundreds; each has a color & size, e.g. Red / S).
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes e.g. AB, XY, CD etc.
If I had my choice, I'd index it as:
stock: {
  Red: {
    S: { AB: 100, XY: 200, CD: 20 },
    M: { AB: 0, XY: 500, CD: 20 },
    2XL: { AB: 5, XY: 0, CD: 9 }
  },
  Blue: {
    ...
  }
}
Here's a kind of customer query I might receive:
Show me all products that have Red.S in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in Elasticsearch.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
The problem is, given 100 warehouses, 100 sizes and 25 colors, this easily needs 100 * 100 * 25 = 250k mappings. Elasticsearch simply can't handle that number of keys.
The easy answer is to use nested documents, but nested documents pose a particular problem: we cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to solutions outside Elasticsearch as well; we're on a Rails/Postgres stack.
You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "XY",
"quantity": 200
}
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "CD",
"quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries, because all you'll need is to filter on the fields (product, color, size, warehouse) and simply aggregate on the quantity field, sums, averages or whatever you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
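For illustration, here is a rough sketch of the "sum of AB and XY must exceed 500" case against such an index. It assumes the product, color, size and warehouse fields are mapped as keywords (or numerics); bucket_selector, a close relative of bucket_script in the pipeline-aggregation family, drops the product buckets whose summed quantity fails the threshold:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "Red" } },
        { "term": { "size": "S" } },
        { "terms": { "warehouse": ["AB", "XY"] } }
      ]
    }
  },
  "aggs": {
    "by_product": {
      "terms": { "field": "product" },
      "aggs": {
        "total_qty": { "sum": { "field": "quantity" } },
        "enough_stock": {
          "bucket_selector": {
            "buckets_path": { "total": "total_qty" },
            "script": "params.total > 500"
          }
        }
      }
    }
  }
}
The product buckets that survive the bucket_selector are the ones satisfying the stock condition.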
It's also much easier to maintain the stock movements by simply indexing the new quantity for any given combination than having to update the master product document every time an item goes out of stock.
No script, no nested documents required.
The best possible solution would be to create separate indexes for the warehouses, with each warehouse index holding one document per product/size/color combination and the related values, like this:
{
"product": 123,
"color": "Red",
"size": "S",
"warehouse": "AB",
"quantity": 100
}
This will reduce your mappings to 100 sizes * 25 colors = 2,500 mappings per index.
For the rest of the operations, I feel @Val has covered them in his answer, which is quite impressive.
Coming to external solutions, I would say you want to carry out the tasks of storing data, searching it and fetching it. Elasticsearch and Apache Solr are the best search engines for these kinds of tasks. I have not tried Apache Solr, but I would highly recommend going with Elasticsearch because of its features, active community support, and fast searching. Searching can also be made faster using analyzers and tokenizers. It also has features like full-text search and term-level search to customize searching according to the situation or problem statement.
I'm having a problem that should be very simple but I'm stumped on this one -- maybe I'm misunderstanding something about compound indexes in MongoDB.
To reproduce this problem, I have created a simple collection with 500000 entries and six fields, each with a random number. In a mongo terminal, I generated the collection like this:
for(i = 0; i < 500000; i++){
db.test.save({a: Math.random(), b: Math.random(), c: Math.random(), d: Math.random(), e: Math.random() })
}
Then, I time a simple query on this collection like this:
t1 = new Date().getTime()
db.test.count({a : {$gt: 0.5}, b : {$gt: 0.5}, c : {$gt: 0.5}, d : {$gt: 0.5}, e : {$gt: 0.5} })
t2 = new Date().getTime()
t2-t1
=> 335ms
The query completed in 335 ms. So now I add a compound index to try to make the query faster:
db.test.ensureIndex({a: 1, b:1 ,c:1 ,d:1, e:1})
The query should be faster now, but running the exact same query takes longer:
t1 = new Date().getTime()
db.test.count({a : {$gt: 0.5}, b : {$gt: 0.5}, c : {$gt: 0.5}, d : {$gt: 0.5}, e : {$gt: 0.5} })
t2 = new Date().getTime()
t2-t1
=> 762ms
The same query takes over twice as long when the index is added! This is repeatable even when I try this multiple times. Removing the index with db.test.dropIndexes() makes the query run faster again, back to ~350ms.
Checking the queries with explain() shows that a BasicCursor is used before the index is added. After the index is added a BtreeCursor is used and has the expected indexBounds.
So my question is: why is this happening? And more importantly, how DO I get this query to run faster? In a SQL benchmark that I did on the same machine, an analogous query with SQL took ~240ms without an index, with an index dropping that down to ~180ms.
My MongoDB version info:
> mongo --version
MongoDB shell version: 2.6.3
The problem with your example here is basically that the data is indeed far "too random" to make effective use of an index in this case. The result is as expected, since there is not much "order" in how an index can traverse this, along with the consideration that, as you are indexing every field in the document, the index size will be somewhat larger than the document itself.
For a better representation of a "real world" situation, you can look at a more 50/50 split of the relevant data to search for. Here is a more optimized form of generator:
var samples = [{ "a": "a", "b": "a" }, { "a": "b", "b": "b" }];
for ( var x = 0; x < 5; x++ ) {
    samples.forEach(function(s) {
        var batch = [];
        for ( i = 0; i < 10000; i++ ) {
            // push a copy so each inserted document gets its own _id
            batch.push({ "a": s.a, "b": s.b });
        }
        db.test.insert(batch);
    });
}
That inserts the data with a fair enough representation that either search would essentially have to scan through every document in the collection to be certain of retrieving them all in the absence of an index.
So if you now look at a query of a form that gets 50% of the data:
db.test.find({ "a": 1, "b": 1 }).explain()
On my hardware where I am sitting, even warmed up that is going to consistently take over 100ms to complete. But when you add an index to both fields:
db.test.ensureIndex({ "a": 1, "b": 1 })
Then the same query consistently completes under 100ms, and mostly around the 90ms mark. This also gets a little more interesting when you add some projection in order to force the stats to "index only":
db.test.find({ "a": 1, "b": 1 },{ "_id", "a": 1, "b": 1 }).explain()
Now while this does not need to go back to the documents in this case and is marked as "indexOnly": true, the working set size is likely small enough to fit in memory, so you see a slight performance degradation due to the extra work of "projecting" the fields. The average with the index is now around 110ms on this hardware.
db.test.dropIndexes()
The performance of the query without the use of an index drops to 170ms. This shows the overhead in projection against the benefits of the index more clearly.
Putting the index back to the form you had originally:
db.test.ensureIndex({ "a": 1, "b": 1, "c": 1, "d": 1, "e": 1 })
Keeping the same projection query you get around 135ms with the index and of course the same 170ms without. Now if you then go back to the original query form:
db.test.find({ "a": 1, "b": 1, "c": 1, "d":1, "e": 1}).explain()
The results with the index are still around the 135ms mark and the non-indexed query comes in around the 185ms mark.
So it does make sense that real-world data distribution is typically not as "random" as the test you designed. Though it is also true that distribution is almost never as clear-cut as 50/50, the general case is that there is not in fact so much scatter, and there tend to be natural clusters of the ranges you are looking for.
This also serves as an example that with "truly random" data with a high level of variation between values, B-tree indexes are not the most optimal way to access the data.
I hope that makes some of the points to consider about this more clear to you.
Here is another sample closer to your original test; the only difference is altering the "precision" so that the data is not so "random", which was one of the main points I was making:
var batch = [];
for ( i = 0; i < 500000; i++ ) {
    batch.push({
        "a": Math.round(Math.random()*100)/100,
        "b": Math.round(Math.random()*100)/100,
        "c": Math.round(Math.random()*100)/100,
        "d": Math.round(Math.random()*100)/100,
        "e": Math.round(Math.random()*100)/100
    });
    if ( batch.length % 10000 == 0 ) {
        db.test.insert( batch );
        batch = [];
    }
}
So there is a "two decimal place precision" in the data being enforced which again represents real world data cases more directly. Also note that the inserts are not being done on every iteration, as the implementation of insert for the shell in MongoDB 2.6 will return the "write concern" response with every update. So much faster to set up.
If you then consider your original test query, the response without an index will take around 590ms to complete as per my hardware. When you add the same index the query completes in 360ms.
If you do that on just "a" and "b" without an index:
db.test.find({ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} }).explain()
The response comes in at around 490ms. Adding an index to just "a" and "b"
db.test.ensureIndex({ "a": 1, "b": 1 })
And the indexed query takes around 300ms, so still considerably faster.
Everything here says essentially:
Natural distribution is supported very well with B-tree indexes, fully random is not.
Index what you need to query on, and those fields only. There is a size cost and there is a memory cost as well (see the snippet below).
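For instance, one rough way to see that cost from the shell (the field names come from db.collection.stats(); the exact numbers will vary by machine and index definition):
// Compare raw data size against the index footprint after building the
// five-field compound index; both need to be paged into memory.
var stats = db.test.stats();
print("data size (bytes):  " + stats.size);
print("index size (bytes): " + stats.totalIndexSize);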
From that second point there is one more thing to demonstrate as most examples here are generally required to look up the document from the collection as well as find it in the index. The obvious cost here is that both the index and the collection need to be paged into memory in order to return the results. This of course takes time.
Consider the following query; without the full compound index in place, the response takes around 485ms:
db.test.find({ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} }).explain()
Adding the compound index on "a" through "e" makes the same query run at around 385ms with the index in place. Still faster, but slower than our full query, and there is a good reason why, considering the index contains all of the fields and the conditions. But if you alter that with a projection for only the required fields:
db.test.find(
{ "a": {"$gt": 0.5}, "b": {"$gt": 0.5} },
{ "_id": 0, "a": 1, "b": 1 }
).explain()
That drops the time somewhat, and now the index is used solely to get the results. Dropping the index and issuing the same query takes around 650ms with the additional overhead of the projection. This shows that an effective index actually does make a lot of difference to the results.