I am working on preparing some benchmarks for a database structure where I compare the usage of a UUID for the primary key vs a sequential ID. Based on different articles, I was expecting the UUID to be slower for insertion and selection. Most other articles that treated this topic had simple objects, but I have a more complex structure with many one-to-many relations, so I decided to try my luck with my own benchmarks.
I had a structure like this:
class A {
    UUID/Long id;
    String name;
    UUID uuid; // only when PK is Long
    List<B> b; // one to many, 5 items in the list
    List<C> c; // one to many, 5 items in the list
}

class B {
    UUID/Long id;
    String name;
    List<D> d; // one to many, 5 items
}

// C and D just have an ID and a name
As a note, I do have different tables and different entities for the UUID and Long PKs. For the Long PK, class A has an additional UUID column that gets populated with a random UUID, and I added an index on that column since I will be measuring search by it as well.
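For illustration, a minimal sketch of what the Long-PK variant of A could look like in JPA; the annotations, table and index names here are my assumptions, not the actual benchmark code:

import java.util.List;
import java.util.UUID;
import javax.persistence.*;

@Entity
@Table(name = "a_long_pk", indexes = @Index(name = "ix_a_uuid", columnList = "uuid"))
public class ALongPk {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id; // sequential PK

    private String name;

    // Secondary random UUID; indexed so searches by UUID can use an index seek.
    @Column(nullable = false, unique = true)
    private UUID uuid = UUID.randomUUID();

    @OneToMany(cascade = CascadeType.ALL)
    private List<BLongPk> b; // 5 items

    @OneToMany(cascade = CascadeType.ALL)
    private List<CLongPk> c; // 5 items
}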
I made an app in Spring Boot, with Spring Data for the JPA implementation and MS SQL for the database.
I started to populate the database in both cases (with UUID PK and Long PK) with 2000 items and did not see any major differences in timings between the two tests.
Next, I did a search by UUID. In the first scenario, the UUID is also the PK. In the second scenario, the PK is a Long and the UUID is a separate, indexed column. The second scenario was way faster.
Next, only for the Long PK case, I did a search by the PK, and here is where I had a big surprise: the search was almost as slow as with the UUID PK.
Here are some results (timings are in ms):
Benchmark                                  UUID PK   Long PK
2000 products insertion                     368910    354643
800 items search by UUID, 1 iteration         2582       908
800 items search by UUID, 3 iterations        5853      1981
800 items search by ID, 1 iteration              -      1794
800 items search by ID, 3 iterations             -      4421
500 products insertion                       38940     39852
200 items search by UUID, 1 iteration          492       167
200 items search by UUID, 5 iterations        1840       763
200 items search by UUID, 10 iterations       3450      1472
200 items search by ID, 1 iteration              -       448
200 items search by ID, 5 iterations             -      2254
200 items search by ID, 10 iterations            -      4588
I was expecting everything to be faster when using a Long PK, but that is not always the case. I based my original assumptions mostly on these two articles:
https://www.mssqltips.com/sqlservertip/5105/sql-server-performance-comparison-int-versus-guid/
https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
I can accept that, DB fragmentation aside, timings with a UUID PK can be similar. What baffles me is why search by the UUID column is faster than search by the PK when the PK is a Long.
Even with show_sql turned on, I could not see any differences in the generated queries (no SELECT *, which I know can cause slowness). I also tried to eliminate other factors, but the results were consistent.
Am I doing something wrong? Am I not understanding something properly? Does it not really matter that the PK is a UUID, even with a more complex structure and many items?
It just so happens that someone recently blogged about their analysis of bigint vs. UUID performance on PostgreSQL, which might apply to SQL Server as well: https://www.cybertec-postgresql.com/en/uuid-serial-or-identity-columns-for-postgresql-auto-generated-primary-keys/
I'm building a table to manage some articles:
Table
| Company | Store | Sku | ..OtherColumns.. |
|       1 |     1 | 123 | ..               |
|       1 |     2 | 345 | ..               |
|       3 |     1 | 123 | ..               |
Scenario
Most of the time, company, store and sku will be used to SELECT rows:
SELECT * FROM stock s WHERE s.company = 1 AND s.store = 1 AND s.sku = 123;
...but sometimes the company will not be available when accessing the table:
SELECT * FROM stock s WHERE s.store = 1 AND s.sku = 123;
...and sometimes all articles will be selected for a store:
SELECT * FROM stock s WHERE s.company = 1 AND s.store = 1;
The Question
How to properly index the table?
I could add three indexes - one for each SELECT - but I think Oracle should be smart enough to re-use the other indexes.
Would an Index "Store, Sku, Company" be used if the WHERE-condition has no company?
Would an Index "Company, Store, Sku" be used if the WHERE-condition has no company?
You can think of the index key as conceptually being the 'concatenation' of all of the columns, and generally you need a leading element of that key in order to get benefit from the index. So for an index on (company, store, sku):
WHERE s.company = 1 AND s.store = 1 AND s.sku = 123;
can potentially benefit from the index
WHERE s.store = 1 AND s.sku = 123;
is unlikely to benefit (but see footnote below)
WHERE s.company = 1 AND s.store = 1;
can potentially benefit from the index.
In all cases I say "potentially" because it is a costing decision by the optimizer. For example, if I only have (say) 2 companies and 2 stores, then a query on company and store, whilst it could use the index, is perhaps better off not doing so, because the volume of information to be queried is still a large percentage of the size of the table.
In your example, it might be the case that an index on (store, sku, company) would be "good enough" to satisfy all three, but that depends on the distribution of the data. You're thinking the right way, though: get as much value from as few indexes as possible.
Footnote: there is a thing called a "skip scan", where we can get value from an index even if you do not specify the leading column(s), but you will typically only see it when the number of distinct values in those leading columns is low.
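One way to see which of these choices the optimizer actually makes is to inspect the execution plan. A quick sketch (the index name is illustrative; the table is the stock table from the question):

CREATE INDEX stock_company_store_sku_ix ON stock (company, store, sku);

-- Ask the optimizer how it would run the query that lacks the leading column...
EXPLAIN PLAN FOR
  SELECT * FROM stock s WHERE s.store = 1 AND s.sku = 123;

-- ...then read the plan: INDEX RANGE SCAN, INDEX SKIP SCAN or
-- TABLE ACCESS FULL tells you whether (and how) the index was used.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);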
First: do you need an index at all? Indexes are not free. If your table is small enough, perhaps you don't need one.
Second: what does the data look like? You have the store column in every scenario, and I can imagine a situation in which filtering on store alone dissects the source data to a degree that is good enough for you.
However, if you want the maximum reasonable performance benefit, you need two indexes (see the DDL sketch after this list):
(store, sku, company)
(store, company)
or
(store, company, sku)
(store, sku)
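In DDL terms, the first pair would look something like this (the table and index names are illustrative):

CREATE INDEX stock_store_sku_company_ix ON stock (store, sku, company);
CREATE INDEX stock_store_company_ix ON stock (store, company);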
Would an Index "Store, Sku, Company" be used if the WHERE-condition has no company?
Yes
Would an Index "Company, Store, Sku" be used if the WHERE-condition has no company?
Probably not, but I can imagine scenarios in which it might happen (though not for an index seek operation, which is really the primary purpose of indexes).
An index dissects the data in the order of its columns: rows are grouped by the first element and ordered by the first column's sort order, then within those groups they are grouped the same way by the second element, and so on. So when you don't use the first element of the index in your filter, the DB would have to access all of the "subgroups" anyway.
I recommend reading about indexes in general. Start with https://en.wikipedia.org/wiki/B-tree and try to draw on paper how it behaves, or write a simple program to manage a simplified version. Then read about indexes in databases - any DB would be good enough.
I would like to store custom purchase tags on each transaction. For example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags the seller is interested in querying back to understand the sales.
My idea is: whenever a new tag comes in, create a new code for it (something like a hash code, but sequential). The codes start with the 26 letters "a" to "z", then continue with "aa", "ab", "ac", ..., "zz", and so on. All the tags given in one transaction are then kept in a single varchar column called tag, separated by "|".
Let us assume the mapping is (at the application level):
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
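As an aside, the code assignment itself is just bijective base-26 numbering; a minimal sketch of such an encoder (a hypothetical application-level helper, not code from the question):

// Converts a 1-based sequential tag number to the "a".."z", "aa".."zz", ... scheme.
static String toCode(int n) {
    StringBuilder sb = new StringBuilder();
    while (n > 0) {
        n--;                                 // shift to 0-based for this digit
        sb.insert(0, (char) ('a' + n % 26)); // prepend the next letter
        n /= 26;
    }
    return sb.toString();                    // toCode(1) = "a", toCode(27) = "aa"
}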
Storing the above purchase transaction, the tag column will look like tag = "|a|z|ay|bc|cq|". The seller can now search the number of SHOES sold by adding the WHERE condition tag LIKE '%|ay|%'. The problem is that I cannot use an index (sort key in Redshift) for a LIKE that starts with %. So how do I solve this, given that I might have 100 million records? I don't want a full table scan. Is there any solution to fix this?
Update_1:
I have not followed the bridge table concept (cross-reference table), since I want to perform a GROUP BY on the results after searching the specified tags. My solution gives only one row when two tags match in a single transaction, but a bridge table would give me two rows, and then my SUM() would be doubled.
I got a suggestion like the one below, added to the WHERE clause once for each tag (note: it assumes tr is an alias of the transaction table in the surrounding query):

EXISTS (SELECT 1 FROM transaction_tag
        WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
I have not followed this, since I have to perform AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS"), or "SHOES" AND ("NIKE" OR "ADIDAS").
Update_2:
I have not followed the bit-field approach, since I don't know whether Redshift supports it. Also, assuming my system will have a minimum of 3500 tags, allocating one bit for each results in about 437 bytes per transaction, even though a transaction can be given at most 5 tags. Any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tags column, and applying an index on those. So something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching SHOES (ay = 51) is:

minTag <= 51 AND maxTag >= 51 AND tag LIKE '%|ay|%'

And the query for searching SHOES (ay = 51) AND SIZE_12 (cq = 95) is:

minTag <= 51 AND maxTag >= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the object using the AWS s3api, for example:

aws s3api put-object-tagging --bucket <bucket> --key <key> --tagging "TagSet=[{Key=Addidas,Value=AY}]"

Capture the tags dynamically by sending the key and value as parameters.
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position, and a 0 otherwise. For every row you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable; you would still need to look into the middle of the string, but you know exactly which position to check, so you don't need LIKE, just SUBSTRING. For further optimization, you could convert this bit mask to integer values (unique for each sequence) and match on those, but AFAIK Redshift doesn't support that out of the box yet, so you would have to define the rules yourself.
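As a minimal sketch of the fixed-position lookup, assuming a tag_mask column in which character n corresponds to tag code n (the names and position are illustrative):

-- Transactions carrying the tag at position 51 ("ay" = SHOES):
-- a fixed-position test instead of a LIKE with a leading wildcard.
SELECT trans_id
FROM transactions
WHERE SUBSTRING(tag_mask, 51, 1) = '1';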
UPD: It looks like the best option here is to keep the tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of (order_id, tag_id), distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Lookups for orders with a particular tag, and further aggregations over those orders, should then be efficient. There is no silver bullet for optimizing this in a flat table, at least none I know of that would not bring a lot of unnecessary complexity compared to the "relational" solution.
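A sketch of what that could look like as Redshift DDL, plus an aggregation that avoids the double-counting concern from Update_1 by filtering with EXISTS instead of joining (all table and column names are assumptions):

CREATE TABLE order_tag (
    order_id BIGINT,
    tag_id   INTEGER
)
DISTKEY (order_id)  -- co-locate each order's tags on one slice
SORTKEY (tag_id);   -- make lookups by tag efficient

-- Orders tagged "SHOES" (51) AND ("NIKE" (26) OR "ADIDAS" (27)),
-- counted once per order even when several tags match:
SELECT o.seller_id, SUM(o.amount)
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_tag t
              WHERE t.order_id = o.order_id AND t.tag_id = 51)
  AND EXISTS (SELECT 1 FROM order_tag t
              WHERE t.order_id = o.order_id AND t.tag_id IN (26, 27))
GROUP BY o.seller_id;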
I have 1 million instances of the following struct:
type person struct {
    age int
    // ... some more attributes like name, surname, etc.
}
My goal at the end is to know, for each person's name, the average of their ages. A person may occur multiple times or just once. I read the structs one by one; they come in random order and I can't sort them.
Example with only the age and name attributes, written as key-value data:
Josh: 34
Abigail: 6
Aaron: 43
Josh: 4
Frederich: 22
...
Aaron: 3
...
So when I access a record, e.g. Aaron, I don't know how many times he occurs in the given stock of data; maybe that's the only time I see him. At the end I need to know the average age of each person, not necessarily in any order.
My idea was the following:
I used key-value data like this: map[name] = (average, howMany). When I accessed a record, I calculated the new average from the new record's age and incremented howMany. Quite straightforward, I'd say.
The problem is that I can't keep 1 million structs like this in RAM.
I'd appreciate any suggestion and any grammar correction.
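For what it's worth, a minimal Go sketch of the map idea described above, storing a running sum and count per name (equivalent to the average + howMany formulation); memory grows with the number of distinct names, not the 1 million records. The streaming input is simulated with a short slice:

package main

import "fmt"

// stats accumulates a running sum and count of ages for one name.
type stats struct {
    sum   int
    count int
}

func main() {
    ages := map[string]*stats{}

    // Simulated input; in reality the records are read one by one.
    input := []struct {
        name string
        age  int
    }{
        {"Josh", 34}, {"Abigail", 6}, {"Aaron", 43}, {"Josh", 4}, {"Aaron", 3},
    }

    for _, p := range input {
        s, ok := ages[p.name]
        if !ok {
            s = &stats{}
            ages[p.name] = s
        }
        s.sum += p.age
        s.count++
    }

    for name, s := range ages {
        fmt.Printf("%s: %.2f\n", name, float64(s.sum)/float64(s.count))
    }
}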
We are using Cassandra for log collecting.
About 150,000 - 250,000 new records per hour.
Our column family has several columns like 'host', 'errorlevel', 'message', etc., and a special indexed column, 'indexTimestamp', which contains the time rounded to hours.
So, when we want to get some records, we use get_indexed_slices() with a first IndexExpression on indexTimestamp (with the EQ operator) and then some other IndexExpressions - by host, errorlevel, etc.
When getting records just by indexTimestamp everything works fine.
But when getting records by indexTimestamp and, for example, host, Cassandra works for a long time (more than 15-20 seconds) and throws a timeout exception.
As I understand it, when getting records by an indexed column and a non-indexed column, Cassandra first fetches all records by the indexed column and then filters them by the non-indexed columns.
So why is Cassandra so slow at this? There are no more than 250,000 records per indexTimestamp. Isn't it possible to filter them within 10 seconds?
Our Cassandra cluster is running on one machine (Windows 7) with 4 CPUs and 4 GB of memory.
You have to bear in mind that Cassandra is very bad with this kind of query. Indexed-column queries are not meant for big tables. If you want to search your data with this type of query, you have to tailor your data model around it.
In fact, Cassandra is not a DB you can query arbitrarily; it is a key-value storage system. To understand that, have a quick look here: http://howfuckedismydatabase.com/
The most basic pattern to help you is bucketed rows plus range-slice queries.
Let's say you have the object
user : {
    name : "XXXXX",
    country : "UK",
    city : "London",
    postal_code : "N1 2AC",
    age : "24"
}
and of course you want to query by city OR by age (AND & OR is yet another data model).
Then you would have to save your data like this, assuming the name is a unique id:
write(row = "UK", column_name = "city_XXXX", value = {...})
AND
write(row = "bucket_20_to_25", column_name = "24_XXXX", value = {...})
Note that I bucketed by country for the city search and by age bracket for age search.
The range query for age EQ 24 would be

get_range_slice(row = "bucket_20_to_25", from = "24_", to = "24`")

As a note, "`" (backtick) is "_" (underscore) + 1 in ASCII, so this range effectively gives you all the columns whose names start with "24_".
This also allows you to query for age between 21 and 24, for example.
Hope it was useful.
I was wondering what keeps MongoDB faster: having a few parent documents with big arrays of embedded documents inside them, or having a lot of parent documents with few embedded documents inside.
This question regards only querying speed. I'm not concerned with the amount of repeated information, unless you tell me that it influences search speed. (I don't know whether MongoDB automatically indexes ids.)
Example:
Having the following entities, each with only an id field:
Class (8 different classes)
Student (100 different students)
In order to associate students with classes, would I be taking most advantage of MongoDB's speed if I:
1. stored all students in arrays inside the classes they attend, or
2. kept, inside each student, an array of the classes they attend?
This is just an example; a real situation would involve thousands of documents.
I am going to search for specific students inside a given class.
If so, you should have a Student collection with a field set to the class (just the class id is probably better than an embedded, duplicated class document).
Otherwise, you will not be able to query for students properly:
db.students.find ({ class: 'Math101', gender: 'f' , age: 22 })
will work as expected, whereas storing the students inside the classes they attend
{ _id: 'Math101', student: [
    { name: 'Jim', age: 22 }, { name: 'Mary', age: 23 }
] }
has (in addition to duplication) the problem that the query
db.classes.find ( { _id: 'Math101', 'student.gender': 'f', 'student.age': 22 })
will give you the Math class with all students, as long as there is at least one female student and at least one 22-year-old student in it (who could be male).
You can only get a list of the main documents, and each will contain all of its embedded documents, unfiltered; see also this related question.
I don't know if MongoDb automatically indexes Id
The only automatic index is on the primary key _id of the "main" document. An _id field of an embedded document is not automatically indexed, but you can create such an index manually.
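For instance, in the mongo shell, using the embedded layout from the example above (collection and field names follow that example):

// The _id of embedded documents is not indexed by default; create it by hand.
db.classes.createIndex({ "student._id": 1 })

// A multikey index on any other embedded field works the same way:
db.classes.createIndex({ "student.age": 1 })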