Joins in rethinkdb - rethinkdb

I have the following data structure stored in a RethinkDB table:
{
id: string,
parentId: string,
timestamp: number,
data: Object
}
This data structure forms a tree. It can be depicted using the following diagram (white records represent ordinary data-carrying records; the red ones have their data property equal to null, which represents a delete operation):
Now, for every record in the table, I would like to be able to compute the nextRecord, which is the closest record in time to the current one. The task seems simple when there is only one record pointing back to a parent:
1 => 2
4 => 9
5 => 6
6 => 8
...
But it becomes more difficult to compute such a value when a parent record is referenced by several child records:
2 => 3
3 => 5
7 => 11
There is also the case where there is no child reference, in which case the result should be null (for example, record #8 has no child records, so null should be returned).
I'm not asking anyone to write the query itself (although that would be really great), but at least to point me in a direction where I can find a solution to this problem.
Thank you in advance!

You can do this efficiently with a compound index on parentId and timestamp. You can create the index like this:
r.table('data').indexCreate('parent_timestamp', function(row) {
  return [row('parentId'), row('timestamp')];
})
After you've done that, you can find the earliest item with parent PARENT like so:
r.table('data')
  .between([PARENT, r.minval], [PARENT, r.maxval], {index: 'parent_timestamp'})
  .orderBy({index: 'parent_timestamp'})
  .nth(0).default(null)
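To make the idea behind the compound index concrete, here is a minimal in-memory sketch in Python. It does not use the RethinkDB driver: a sorted list of (parentId, timestamp, id) tuples stands in for the index, and the function names and sample records are hypothetical. It assumes, as the query above does, that nextRecord means the earliest child of a given parent.

```python
from bisect import bisect_left


def build_parent_timestamp_index(records):
    """Sorted (parentId, timestamp, id) tuples, standing in for the compound index."""
    return sorted(
        (r["parentId"], r["timestamp"], r["id"])
        for r in records
        if r["parentId"] is not None  # root records have no parent to index under
    )


def next_record(index, parent_id):
    """Return the id of the earliest child of parent_id, or None if it has none."""
    # (parent_id,) sorts before every (parent_id, timestamp, id) entry,
    # so bisect_left lands on the first child of this parent, if any.
    i = bisect_left(index, (parent_id,))
    if i < len(index) and index[i][0] == parent_id:
        return index[i][2]
    return None


# Hypothetical sample data, not the tree from the question's diagram.
records = [
    {"id": "a", "parentId": None, "timestamp": 1, "data": {}},
    {"id": "b", "parentId": "a", "timestamp": 2, "data": {}},
    {"id": "c", "parentId": "b", "timestamp": 5, "data": None},
    {"id": "d", "parentId": "b", "timestamp": 3, "data": {}},
]
index = build_parent_timestamp_index(records)
print(next_record(index, "b"))
print(next_record(index, "c"))
```

Because the entries are ordered by parent first and timestamp second, the earliest child is found with a single binary search, which is roughly what the between/orderBy/nth(0) chain asks the database index to do.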

Related

How do you create a new column based on Max value of 1 column and Category of another?

For my job I create status reports on the documents we are working through, each of which is assigned to an area. We decided to use Power BI as an interactive way to see where everything stands.
Using Power BI Desktop I've created a new table that excludes documents that are not ready for QC, but we still have several different statuses. Instead of creating a new table for each status type (since some can be grouped together), I would like to create a new column that holds the maximum Status Value within each area. The higher the Status Value, the further the document is from being complete.
EX:
Record   Area   Status Value   Max Status Value
152385   A      1              2
354354   B      2              3
131322   B      3              3
132136   A      2              2
213513   A      1              2
351315   B      2              3
If anyone knows how to get the Max Status Value column that would greatly help. I did find another post (https://community.powerbi.com/t5/Desktop/LOOKUPVALUE-return-min-max-of-values-found/td-p/657534) that was similar but I'm still new to DAX and could not figure out how to apply it to my situation.
This post actually helped me answer the question.
https://community.powerbi.com/t5/Power-Query/Maxifs-Power-Query/m-p/1693606
The only change I made was to drop the true/false portion. My resulting column was:
Max Status Value =
VAR vMaxVal =
    CALCULATE (
        MAX ( 'Table'[Status Value] ),
        ALLEXCEPT (
            'Table',
            'Table'[Area]
        )
    )
RETURN
    vMaxVal
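For readers who are new to DAX, the logic of that column can be sketched in plain Python: compute the maximum Status Value per Area, then attach it to every row of that area. This is only an illustration of the grouping idea, not how Power BI evaluates the expression; the function name is hypothetical and the rows are the ones from the example table.

```python
def add_max_status_value(rows):
    """Return copies of rows with the per-Area maximum Status Value attached."""
    max_by_area = {}
    for row in rows:
        area, value = row["Area"], row["Status Value"]
        # Keep the largest Status Value seen so far for this area.
        max_by_area[area] = max(max_by_area.get(area, value), value)
    return [{**row, "Max Status Value": max_by_area[row["Area"]]} for row in rows]


rows = [
    {"Record": 152385, "Area": "A", "Status Value": 1},
    {"Record": 354354, "Area": "B", "Status Value": 2},
    {"Record": 131322, "Area": "B", "Status Value": 3},
    {"Record": 132136, "Area": "A", "Status Value": 2},
    {"Record": 213513, "Area": "A", "Status Value": 1},
    {"Record": 351315, "Area": "B", "Status Value": 2},
]
for row in add_max_status_value(rows):
    print(row["Record"], row["Area"], row["Status Value"], row["Max Status Value"])
```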

How can I group, sum and count with Sequel ORM and PostgreSQL?

This is too tough for me guys. It's for Jeremy!
I have two tables (although I can also envision needing to join a third table) and I want to sum one field and count rows in the same table while joining with another table, and return the result in JSON format.
First of all, the field that needs to be summed has the data type numeric(10,2), and the data is inserted as params['amount'].to_f.
The tables are expense_projects, which has the name of the project and the company id, and expense_items, which has the company_id, item and amount (to mention just the critical columns). The "company_id" columns are disambiguated.
So, the following code:
expense_items = DB[:expense_projects].left_join(:expense_items, :expense_project_id => :project_id).where(:project_company_id => company_id).to_a.to_json
works fine but when I add
expense_total = expense_items.sum(:amount).to_f.to_json
I get an error message which says
TypeError - no implicit conversion of Symbol into Integer:
So the first question is: why does this happen, and how can it be fixed?
Then I want to join the two tables, get all the project names from the left (first) table, and sum the amount and count the items in the second table. I have tried
DB[:expense_projects].left_join(:expense_items, :expense_items_company_id => expense_projects_company_id).count(:item).sum(:amount).to_json
and variations of this, all of which fail.
I would like a result which gets all the project names (even if there are no expense entries) and returns something like:
project   item_count   item_amount
pr 1      7            34.87
pr 2      0            0
and so on. How can this be achieved with one query returning the result in json format?
Many thanks, guys.
Figured it out, I hope this helps somebody else:
DB[:expense_projects___p].where(:project_company_id=>user_company_id).
  left_join(:expense_items___i, :expense_project_id=>:project_id).
  select_group(:p__project_name).
  select_more{count(:i__item_id)}.
  select_more{sum(:i__amount)}.to_a.to_json
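The shape of the query that chain builds (a LEFT JOIN, grouped by project name, with a count and a sum) can be sketched outside Sequel as well. The Python example below runs comparable SQL against an in-memory SQLite database rather than PostgreSQL, so it is an approximation only. The column names come from the answer above; the sample rows, the function name and the COALESCE that turns an empty sum into 0 are assumptions added for illustration.

```python
import json
import sqlite3


def project_totals(conn, company_id):
    """Every project of the company with its item count and summed amount."""
    sql = """
        SELECT p.project_name,
               COUNT(i.item_id)           AS item_count,
               COALESCE(SUM(i.amount), 0) AS item_amount
        FROM expense_projects AS p
        LEFT JOIN expense_items AS i ON i.expense_project_id = p.project_id
        WHERE p.project_company_id = ?
        GROUP BY p.project_name
        ORDER BY p.project_name
    """
    return conn.execute(sql, (company_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE expense_projects (
        project_id INTEGER PRIMARY KEY, project_name TEXT, project_company_id INTEGER);
    CREATE TABLE expense_items (
        item_id INTEGER PRIMARY KEY, expense_project_id INTEGER, amount NUMERIC);
    -- Hypothetical sample rows.
    INSERT INTO expense_projects VALUES (1, 'pr 1', 7), (2, 'pr 2', 7), (3, 'other', 8);
    INSERT INTO expense_items VALUES (1, 1, 10.5), (2, 1, 24.25);
""")
rows = project_totals(conn, 7)
print(json.dumps(rows))
```

COUNT(i.item_id) counts only joined rows that actually have an item, which is why a project without expenses reports 0 instead of 1.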

Cassandra slow get_indexed_slices speed

We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc and special indexed column 'indexTimestamp'.
This column contains time rounded to hours.
So, when we want to get some records, we use get_indexed_slices() with first IndexExpression by indexTimestamp ( with EQ operator ) and then some other IndexExpressions - by host, errorlevel, etc.
When getting records just by indexTimestamp everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra works for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first gets all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? For a given indexTimestamp there are no more than 250,000 records. Isn't it possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine ( Windows 7 ) with 4 CPUs and 4 GBs memory.
You have to bear in mind that Cassandra is very bad at this kind of query. Indexed-column queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query. It is a key-value storage system. To understand that, please have a quick look at: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucket rows combined with range slice queries.
Let's say you have the object
user : {
  name : "XXXXX"
  country : "UK"
  city : "London"
  postal_code : "N1 2AC"
  age : "24"
}
and of course you want to query by city OR by age (combining conditions with AND/OR needs yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for the age search.
The range query for age EQ 24 would be
get_range_slice(row = "bucket_20_to_25", from = "24_", to = "24`")
As a note, the backtick is the character right after the underscore in ASCII ("`" == "_" + 1), so with a byte-ordered comparator this range effectively gives you all the columns that start with "24_".
This also allows you to query for an age between 21 and 24, for example.
hope it was useful
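The bucket-row pattern can be tried out without a cluster. The Python sketch below is a toy in-memory stand-in, not a Cassandra client: the class, its methods and the sample users are hypothetical, and column names are compared as plain strings. It uses "24_" and "24`" as bounds because the backtick (ASCII 96) directly follows the underscore (ASCII 95), so the range covers exactly the names beginning with "24_".

```python
class BucketStore:
    """Toy stand-in for a column family: row key -> {column name: value}."""

    def __init__(self):
        self.rows = {}

    def write(self, row, column_name, value):
        self.rows.setdefault(row, {})[column_name] = value

    def get_range_slice(self, row, start, end):
        """Return (column name, value) pairs with start <= name <= end, in name order."""
        columns = self.rows.get(row, {})
        return [(name, columns[name]) for name in sorted(columns) if start <= name <= end]


store = BucketStore()
for name, age in [("alice", 24), ("bob", 22), ("carol", 25), ("dave", 24)]:
    # Column names are "<age>_<unique name>", all inside one age-bracket row.
    store.write("bucket_20_to_25", f"{age}_{name}", {"name": name, "age": age})

exactly_24 = store.get_range_slice("bucket_20_to_25", "24_", "24`")
from_21_to_24 = store.get_range_slice("bucket_20_to_25", "21_", "24`")
print([name for name, _ in exactly_24])
print([name for name, _ in from_21_to_24])
```

The point of the layout is that a query touches one row and one contiguous run of columns, instead of filtering every record that matches a secondary index.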

MongoDB performance. Embedded documents search speed

I was wondering what keeps MongoDB faster: having a few parent documents with big arrays of embedded documents inside them, or having a lot of parent documents with few embedded documents inside.
This question only regards querying speed. I'm not concerned with the amount of repeated information, unless you tell me that it influences the search speed. (I don't know if MongoDB automatically indexes Ids.)
Example:
Having the following Entities with only an Id field each one:
Class (8 different classes )
Student ( 100 different students )
To associate students with classes, would I get the most out of MongoDB's speed if I:
Stored all students in arrays inside the classes they attend, or
Kept, inside each student, an array with the classes they attend?
This is just an example. A real situation would involve thousands of documents.
I am going to search for specific students inside a given class.
If so, you should have a Student collection, with a field set to the class (just the class id is maybe better than an embedded and duplicated class document).
Otherwise, you will not be able to query for students properly:
db.students.find({ class: 'Math101', gender: 'f', age: 22 })
will work as expected, whereas storing the students inside the classes they attend
{ _id: 'Math101', student: [
  { name: 'Jim', age: 22 }, { name: 'Mary', age: 23 }
] }
has (in addition to duplication) the problem that the query
db.classes.find({ _id: 'Math101', 'student.gender': 'f', 'student.age': 22 })
will give you the Math class with all students, as long as there is at least one female student and at least one 22-year-old student in it (who could be male).
You can only get a list of the main documents, and it will contain all embedded documents, unfiltered, see also this related question.
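The matching behaviour described above can be illustrated with a small pure-Python simulation. It does not talk to MongoDB; the two helper functions are hypothetical and only mimic the difference between letting each condition match any array element and requiring one element to satisfy all of them (which, in MongoDB itself, is what the $elemMatch operator is for). The gender values in the sample document are assumed for the example.

```python
def matches_any_element(doc, field, conditions):
    """Each condition may be satisfied by a different element of doc[field]."""
    return all(
        any(element.get(key) == value for element in doc[field])
        for key, value in conditions.items()
    )


def matches_same_element(doc, field, conditions):
    """All conditions must hold together on one single element of doc[field]."""
    return any(
        all(element.get(key) == value for key, value in conditions.items())
        for element in doc[field]
    )


math101 = {
    "_id": "Math101",
    "student": [
        {"name": "Jim", "gender": "m", "age": 22},
        {"name": "Mary", "gender": "f", "age": 23},
    ],
}
wanted = {"gender": "f", "age": 22}
print(matches_any_element(math101, "student", wanted))
print(matches_same_element(math101, "student", wanted))
```

Here the class matches under the first rule (Mary is female, Jim is 22) even though no single student is a 22-year-old woman, which is the surprise the answer warns about.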
I don't know if MongoDb automatically indexes Id
The only automatic index is the primary key _id of the "main" document. Any _id field of embedded documents is not automatically indexed, but you can create such an index manually.

LINQ Grouping help

I have a database table that holds parent and child records much like a Categories table. The ParentID field of this table holds the ID of that record's parent record...
My table columns are: SectionID, Title, Number, ParentID, Active
I only plan to allow my parent-to-child relationship to go two levels deep. So I have a section and a sub-section, and that's it.
I need to output this data into my MVC view page in an outline fashion like so...
Section 1
    Sub-Section 1 of 1
    Sub-Section 2 of 1
    Sub-Section 3 of 1
Section 2
    Sub-Section 1 of 2
    Sub-Section 2 of 2
    Sub-Section 3 of 2
Section 3
I am using Entity Framework 4.0 and MVC 2.0 and have never tried something like this with LINQ. I have a FK set up on the section table mapping the ParentID back to the SectionID hoping EF would create a complex "Section" type with the Sub-Sections as a property of type list of Sections but maybe I did not set things up correctly.
So I am guessing I can still get the end result using a LINQ query. Can someone point me to some sample code that could provide a solution or possibly a hint in the right direction?
Update:
I was able to straighten out my EDMX so that I can get the sub-sections for each section as a property of type list, but now I realize I need to sort the related entities.
var sections = from section in dataContext.Sections
               where section.Active == true && section.ParentID == 0
               orderby section.Number
               select new Section
               {
                   SectionID = section.SectionID,
                   Title = section.Title,
                   Number = section.Number,
                   ParentID = section.ParentID,
                   Timestamp = section.Timestamp,
                   Active = section.Active,
                   Children = section.Children.OrderBy(c => c.Number)
               };
produces the following error.
Cannot implicitly convert type 'System.Linq.IOrderedEnumerable' to 'System.Data.Objects.DataClasses.EntityCollection'
Your model has two navigation properties Sections1 and Section1. Rename the first one to Children and the second one to Parent.
Depending on whether you have a root Section, or perhaps have each top-level section parented to itself (or instead make the parent nullable?), your query might look something like:
// assume top sections are ones where parent == self
var topSections = context.Sections.Where(section => section.ParentID == section.SectionID);
// now put them in order (might have multiple orderings depending on input, pick one)
topSections = topSections.OrderBy(section => section.Title);
// now get the children in order using an anonymous type for the projection
var result = topSections.Select(section => new {top = section, children = section.Children.OrderBy(child => child.Title)});
For some LINQ examples:
http://msdn.microsoft.com/en-us/vcsharp/aa336746.aspx
This covers pretty much all of the LINQ operations; have a look in particular at GroupBy. The key is to understand the input and output of each piece so that you can chain several in series. There is no shortcut other than learning what each operation does, so that you know what is at hand. LINQ expressions are just combinations of these operations with some syntactic sugar.
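The two-level outline itself is a filter, a sort and a grouping by parent, which can be sketched independently of LINQ and Entity Framework. The Python example below is such a sketch: the column names come from the question, top-level sections are assumed to have ParentID == 0 as in the question's query, and the function name and sample rows are hypothetical.

```python
def build_outline(sections):
    """Return [(section title, [sub-section titles])], both levels ordered by Number."""
    active = [s for s in sections if s["Active"]]

    # Group the active sub-sections under their parent's SectionID.
    children = {}
    for s in active:
        if s["ParentID"] != 0:
            children.setdefault(s["ParentID"], []).append(s)

    top_level = sorted((s for s in active if s["ParentID"] == 0), key=lambda s: s["Number"])
    return [
        (
            top["Title"],
            [c["Title"] for c in sorted(children.get(top["SectionID"], []), key=lambda c: c["Number"])],
        )
        for top in top_level
    ]


sections = [
    {"SectionID": 1, "Title": "Section 1", "Number": 1, "ParentID": 0, "Active": True},
    {"SectionID": 2, "Title": "Section 2", "Number": 2, "ParentID": 0, "Active": True},
    {"SectionID": 3, "Title": "Sub-Section 2 of 1", "Number": 2, "ParentID": 1, "Active": True},
    {"SectionID": 4, "Title": "Sub-Section 1 of 1", "Number": 1, "ParentID": 1, "Active": True},
    {"SectionID": 5, "Title": "Sub-Section 1 of 2", "Number": 1, "ParentID": 2, "Active": False},
]
for title, subs in build_outline(sections):
    print(title)
    for sub in subs:
        print("    " + sub)
```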

Resources