MongoDB performance: embedded document search speed

I was wondering what keeps MongoDB faster: having a few parent documents with big arrays of embedded documents inside them, or having a lot of parent documents with only a few embedded documents each.
This question only regards querying speed. I'm not concerned with the amount of repeated information, unless you tell me that it influences the search speed. (I don't know if MongoDB automatically indexes IDs.)
Example:
Having the following entities, each with only an id field:
Class (8 different classes)
Student (100 different students)
In order to associate students with classes, would I be taking most advantage of MongoDB's speed if I:
stored all students in arrays inside the classes they attend, or
kept an array inside each student with the classes they attend?
This is just an example; a real situation would involve thousands of documents.

I am going to search for specific students inside a given class.
If so, you should have a Student collection, with a field set to the class (just the class id is maybe better than an embedded and duplicated class document).
Otherwise, you will not be able to query for students properly:
db.students.find({ class: 'Math101', gender: 'f', age: 22 })
will work as expected, whereas storing the students inside the classes they attend
{ _id: 'Math101', student: [
    { name: 'Jim', age: 22 }, { name: 'Mary', age: 23 }
] }
has (in addition to duplication) the problem that the query
db.classes.find({ _id: 'Math101', 'student.gender': 'f', 'student.age': 22 })
will give you the Math class with all students, as long as there is at least one female student and at least one 22-year-old student in it (who could be male).
You can only get a list of the main documents, and each will contain all of its embedded documents, unfiltered.
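If you do keep students embedded, $elemMatch at least makes both conditions apply to the same embedded student, though the query still returns the whole class document with all of its students (a sketch of the matching fix only):

db.classes.find({ _id: 'Math101', student: { $elemMatch: { gender: 'f', age: 22 } } })

To actually return only the matching students, you would need something like $unwind plus $match in the aggregation pipeline.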
I don't know if MongoDB automatically indexes IDs
The only automatic index is the primary key _id of the "main" document. Any _id field of embedded documents is not automatically indexed, but you can create such an index manually.
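For example, assuming the embedded student documents carry an _id field, you could create such an index manually:

db.classes.createIndex({ 'student._id': 1 })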

Search by PK slower compared to other index column

I am working on preparing some benchmarks for a database structure, comparing a UUID primary key vs. a sequential ID. Based on different articles, I was expecting the UUID to be slower for insertion and selection. Most other articles that treated this topic used simple objects, but I have a more complex structure with many one-to-many relations, so I decided to try my luck with my own benchmarks.
I had a structure like this:
class A {
    UUID/Long id;
    String name;
    UUID uuid;     // only when PK is Long
    List<B> b;     // one-to-many, 5 items in the list
    List<C> c;     // one-to-many, 5 items in the list
}

class B {
    UUID/Long id;
    String name;
    List<D> d;     // one-to-many, 5 items
}

// C and D just have an id and a name
As a note, I have different tables and different entities for the UUID and Long PKs. For the Long PK, I also have an additional UUID column on class A that gets populated with a random UUID, plus an index on that UUID column, since I will be measuring searches by this column as well.
I made an app in Spring Boot, with Spring Data for the JPA implementation and MS SQL for the database.
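For illustration, a minimal sketch of what the Long-PK variant of entity A might look like; the annotations are standard JPA, but the exact names and mappings are my assumptions, not the asker's actual code:

import javax.persistence.*;
import java.util.List;
import java.util.UUID;

@Entity
@Table(name = "a_long", indexes = @Index(name = "idx_a_uuid", columnList = "uuid"))
class A {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    Long id;                  // sequential PK

    String name;

    @Column(unique = true)
    UUID uuid;                // random UUID, separately indexed for lookups

    @OneToMany(cascade = CascadeType.ALL)
    List<B> b;                // one-to-many, 5 items per parent (B as above)

    @OneToMany(cascade = CascadeType.ALL)
    List<C> c;                // one-to-many, 5 items per parent (C as above)
}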
I started to populate the database in both cases (with UUID PK and Long PK) with 2000 items and did not see any major differences in timings between the two tests.
Next, I did a search by UUID. In the first scenario, the UUID is also the PK. In the second scenario, the PK is a Long and the UUID is a separate, indexed column. The latter was way faster.
Next, only where the PK is a Long, I did a search by the PK, and here I had a big surprise: the search was almost as slow as with the UUID PK.
Here are some results (timings in ms):

Benchmark                                  UUID PK    Long PK
2000 Products insertion                     368910     354643
800 items search by UUID, 1 iteration         2582        908
800 items search by UUID, 3 iterations        5853       1981
800 items search by ID, 1 iteration              -       1794
800 items search by ID, 3 iterations             -       4421
500 Products insertion                       38940      39852
200 items search by UUID, 1 iteration          492        167
200 items search by UUID, 5 iterations        1840        763
200 items search by UUID, 10 iterations       3450       1472
200 items search by ID, 1 iteration              -        448
200 items search by ID, 5 iterations             -       2254
200 items search by ID, 10 iterations            -       4588
I was expecting everything to be faster with a Long PK, but that is not always the case. I based my original assumptions mostly on these two articles:
https://www.mssqltips.com/sqlservertip/5105/sql-server-performance-comparison-int-versus-guid/
https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
I can accept that, fragmentation aside, timings with a UUID PK can be similar. What baffles me is why searching by the UUID column is faster than searching by the PK when the PK is a Long.
Even with show_sql turned on, I could not see any differences (no SELECT *, which I know can cause slowness). I also tried to eliminate other factors, but the results were consistent.
Am I doing something wrong? Am I not understanding something properly? Does it not really matter that the PK is a UUID, even with a more complex structure and many items?
It just so happens that someone recently blogged an analysis of bigint vs. uuid primary-key performance on PostgreSQL, which might apply to SQL Server as well: https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/

ClickHouse: how to enable performant queries against increasing user-defined attributes

I am designing a system that handles a large number of buried-point (event-tracking) events. An event record contains:
buried_point_id, for example: 1 means app_launch, 2 means user_register.
happened_at: the event timestamp.
user_id: the user identifier.
other attributes, including basic ones (phone_number, city, country) and user-defined ones (click_item_id, which literally can be any context information). PMs will add more and more user-defined attributes to the event record.
The query pattern is like:
SELECT COUNT(DISTINCT user_id) FROM buried_points WHERE buried_point_id = 1 AND city = 'San Francisco' AND click_item_id = 123;
Since my team invests heavily in ClickHouse, I want to leverage ClickHouse for this problem. Is it good practice to use the experimental Map data type to store all attributes in a Map-type column such as {city: 'San Francisco', click_item_id: 123, ...}, or is there any other recommendation? Thanks.
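For illustration, a minimal sketch of what the Map-based layout could look like; the schema, names, and ORDER BY key here are assumptions, and older ClickHouse versions need the allow_experimental_map_type setting for the Map type:

SET allow_experimental_map_type = 1;

CREATE TABLE buried_points
(
    buried_point_id UInt32,
    happened_at     DateTime,
    user_id         UInt64,
    attrs           Map(String, String)  -- basic and user-defined attributes
)
ENGINE = MergeTree
ORDER BY (buried_point_id, happened_at);

SELECT count(DISTINCT user_id)
FROM buried_points
WHERE buried_point_id = 1
  AND attrs['city'] = 'San Francisco'
  AND attrs['click_item_id'] = '123';

One trade-off to keep in mind: a single Map(String, String) forces every value into one type, and reading one key generally means reading the whole map column, so it is less efficient than a dedicated column for hot attributes.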

Performance with pagination

Question
Given the following query:
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
So: "Give me the first 10 tenants that live in City1"
With the sample data below, the database will get hit for every single apartment in City1 and for every tenant that lives in each of these apartments.
If I remove the ORDER BY this doesn't happen.
I am trying to implement pagination, so I need the ORDER BY. How can I improve the performance here?
Sample data
UNWIND range(1, 5) as CityIndex
CREATE (c:City { id: CityIndex, name: 'City' + CityIndex})
WITH c, CityIndex
UNWIND range(1, 5000) as ApartmentIndex
CREATE (a:Apartment { id: CityIndex * 1000 + ApartmentIndex, name: 'Apartment'+CityIndex+'_'+ApartmentIndex})
CREATE (a)-[:is_in]->(c)
WITH c, a, CityIndex, ApartmentIndex
UNWIND range(1, 3) as TenantIndex
CREATE (t:Tenant { id: (CityIndex * 1000 + ApartmentIndex) * 10 + TenantIndex, name: 'Tenant'+CityIndex+'_'+ApartmentIndex+'_'+TenantIndex})
CREATE (t)-[:lives_in]->(a)
Without the ORDER BY, Cypher can lazily evaluate the tenants and stop at 10, rather than matching every tenant in City1. However, because you need to order the tenants, the only way it can do that is to fetch them all and then sort.
If Tenant is the only label that can live in an apartment, then you could possibly save a Filter step by removing the Tenant label from your query, like MATCH (t)-[:lives_in]->(:Apartment)....
You might also want to check the profile of your query and see whether it uses an index-backed ORDER BY.
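For an index-backed ORDER BY to be possible at all, there must be an index on the sort property. A sketch, assuming Neo4j 4.x syntax (on 3.x it would be CREATE INDEX ON :Tenant(id)):

CREATE INDEX tenant_id FOR (t:Tenant) ON (t.id);

PROFILE
MATCH (t:Tenant)-[:lives_in]->(:Apartment)-[:is_in]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10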
What sort of numbers are you expecting back from this query? What's the worst case number of tenants in a given city?
EDIT
I was hoping a USING JOIN on t would use the index to improve the plan, but it does not.
The query performs slightly better if you add a redundant relationship from the tenant to the city:
MATCH (t:Tenant)-[:CITY]->(:City {name: 'City1'})
RETURN t
ORDER BY t.id
LIMIT 10
and similarly by embedding the city name on the tenant; no major gains. I tested with 150,000 tenants in City1; perhaps the gains are more visible as you approach millions, but I'm not sure.

Cassandra slow get_indexed_slices speed

We are using Cassandra for log collection.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc., and a special indexed column 'indexTimestamp', which contains the time rounded to the hour.
So, when we want to get some records, we use get_indexed_slices() with a first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions, by host, errorlevel, etc.
When getting records just by indexTimestamp, everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra runs for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first fetches all records matching the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? There are no more than 250,000 records per indexTimestamp; shouldn't it be possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine (Windows 7) with 4 CPUs and 4 GB of memory.
You have to bear in mind that Cassandra is very bad with this kind of query. Queries on indexed columns are not meant for big tables. If you want to run this type of query over your data, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query freely; it is a key-value storage system. To understand that, have a quick look here: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows and ranged slice queries.
Let's say you have the object
user : {
    name : "XXXXX",
    country : "UK",
    city : "London",
    postal_code : "N1 2AC",
    age : "24"
}
and of course you want to query by city OR by age (supporting AND as well as OR is yet another data model).
Then you would have to save your data like this, assuming the name is a unique id :
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for age search.
the range query for age EQ 24 would be
get_range_slice(row = "bucket_20_to_25", from = "24^", to = "24`")
As a note, in ASCII '^' is '_' - 1 and '`' is '_' + 1, so these bounds sandwich the "24_" prefix, effectively giving you all the columns that start with "24_".
This also allows you to query for ages between 21 and 24, for example.
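Continuing the same pseudocode sketch, that between-query would pick bounds just below "21_" and just above "24_":

get_range_slice(row = "bucket_20_to_25", from = "21^", to = "24`")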
Hope it was useful.

Play Framework: How to render a table structure from plain SQL table

I would be happy to find a good way to render a "table" structure from a plain SQL table.
In my specific case, I need to render JSON structure used by Google Visualization API "datatable" object:
http://code.google.com/apis/chart/interactive/docs/reference.html#DataTable
However, an example in HTML would help as well.
My "source" is a plain SQL table, "DailySales": its columns are "Day" (date), "Product", and "DailySaleTotal" (the daily sales for that product). Please recall that my "model" reflects the 3-column table above.
The table columns should be the "products" (suppose we have a very small number of them). Each row should represent a specific date, and the row data are the actual sales for that day.
Date         Product1   Product2   Product3
01/01/2012       30         50         60
01/02/2012       35          3         15
I was trying to use nested #{list} tags in a template, but unfortunately I failed to find a natural way to provide a template with a "list" to represent the "row data".
Of course, I can build a "helper object" in Java that will build a list of the "sales data" items per date - but this looks very weird to me.
I would be thankful to anyone who can provide an elegant solution.
Max
When you load your model, order it by date and product name. Then, in your controller, build a map with the date as key and, as value, the list of model objects that share that date.
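A minimal sketch of that controller step, assuming a Play 1.x JPA model named DailySale with day, product and dailySaleTotal fields (all of these names are assumptions):

public static void sales() {
    // Load rows ordered by date and product so each row's columns line up.
    List<DailySale> all = DailySale.find("order by day, product").fetch();
    // Group by date: date -> the sales of every product on that date.
    Map<Date, List<DailySale>> modelMap = new TreeMap<Date, List<DailySale>>();
    for (DailySale sale : all) {
        if (!modelMap.containsKey(sale.day)) {
            modelMap.put(sale.day, new ArrayList<DailySale>());
        }
        modelMap.get(sale.day).add(sale);
    }
    render(modelMap);
}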
Then, in your template, you have a first list iteration over the map keys for the rows, and a second list iteration over the list values for the columns.
Something like
[
#{list modelMap.keys, as: 'date'}
[${date},#{list modelMap.get(date), as: 'product'}${product.dailySaleTotal}#{ifnot product_isLast},#{/ifnot}#{/list}]#{ifnot date_isLast},#{/ifnot}
#{/list}
]
You can then adapt your JSON rendering to the exact structure you want to have. Here it is an array of arrays.
Instead of building the JSON yourself, as Seb suggested, you can have it generated for you:
private static Result queryToJsonResult(String sql) {
    // Run the raw SQL and serialize the resulting rows straight to JSON.
    SqlQuery sqlQuery = Ebean.createSqlQuery(sql);
    return ok(Json.toJson(sqlQuery.findList()));
}
