I have the following tables:
types
 id | name
----+------
  1 | A
  2 | B
  4 | C
  8 | D
 16 | E
 32 | F
and
vendors
 id | name   | type
----+--------+------
  1 | Alex   |    2   // type B only
  2 | Bob    |    5   // A, C
  3 | Cheryl |   32   // F
  4 | David  |   43   // F, D, A, B
  5 | Ed     |   15   // A, B, C, D
  6 | Felix  |    8   // D
  7 | Gopal  |    4   // C
  8 | Herry  |    9   // A, D
  9 | Iris   |    7   // A, B, C
 10 | Jack   |   23   // A, B, C, E
I would like to query now:
select id, name from vendors where type & 16 > 0  // should return Jack, as he has type E
select id, name from vendors where type & 7 = 7   // should return Ed, Iris, Jack (all of A, B and C)
select id, name from vendors where type & 8 > 0   // should return David, Ed, Felix, Herry (type D)
What is the best possible index for the types and vendors tables in Postgres? I may have millions of rows in vendors. Moreover, what are the tradeoffs of this bitwise method compared with a many-to-many relation using a third table? Which is better?
You can use partial indices to work around the fact that "&" isn't an indexable operator (afaik):
CREATE INDEX vendors_typeA ON vendors(id) WHERE (type & 1) > 0;
CREATE INDEX vendors_typeB ON vendors(id) WHERE (type & 2) > 0;
Of course, you'll need to add a new index every time you add a new type, which is one of the reasons for expanding the data into an association table that can then be indexed properly. You can always write triggers to additionally maintain a bitmask table, but use the many-to-many table as the normal way of maintaining the data, as it will be much clearer.
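For illustration, a normalised version could look something like this (only a sketch; the vendor_types table and index names are placeholders, not taken from the question):

-- one row per (vendor, type) pair instead of a bitmask column
CREATE TABLE vendor_types (
    vendor_id integer NOT NULL REFERENCES vendors(id),
    type_id   integer NOT NULL REFERENCES types(id),
    PRIMARY KEY (vendor_id, type_id)
);

-- a plain btree index lets Postgres find all vendors of a given type
CREATE INDEX vendor_types_type_idx ON vendor_types (type_id);

-- equivalent of "type & 8 > 0" (all vendors having type D)
SELECT v.id, v.name
FROM vendors v
JOIN vendor_types vt ON vt.vendor_id = v.id
WHERE vt.type_id = 8;

The "all of A, B and C" case then becomes a join with a count (or three EXISTS clauses) instead of bit arithmetic, and adding a new type needs no new index.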
If your entire evaluation of scaling and performance is to say "I may have millions of rows", you haven't done enough to start going for this sort of optimisation. Create a properly structured, clear model first, and optimise it later on the basis of real statistics about how it performs.
Being pretty new to Power Query, I find myself faced with this problem I wish to solve.
I have a TableA with these columns. Example:
Key | Sprint | Index
----+--------+------
A   | PI1-I1 | 1
A   | PI1-I2 | 2
B   | PI1-I3 | 1
C   | PI1-I1 | 1
I want to end up with a set looking like this:
Key | Sprint | Index | HasSpillOver
----+--------+-------+-------------
A   | PI1-I1 | 1     | Yes
A   | PI2-I2 | 2     | No
B   | PI1-I3 | 1     | No
C   | PI1-I1 | 1     | No
I thought I could maybe nested-join TableA on itself, then compare the indices and strip them away, and then count the rows in the table, as outlined below.
TableA=Key, Sprint, Index
// TableA Nested joined on itself (Key, Sprint, Index, Nested)
TableB=NestedJoin(#"TableA", "Key", #"TableA", "Key", "Nested", JoinKind.Inner)
TableC= Table.TransformColumns(#"TableB", {"Nested", (x)=>Table.SelectRows(x, each [Index] <x[Index])} )
... and then do the count; however, this throws an error:
Can not apply operator < on types List and Number.
Any suggestions how to approach this problem? Possibly (probably) in a different way.
You did not define very well what "spillover" means, but this should get you most of the way.
My version assumes adding another index column (Index.1); you could use the one you already have, if it is relevant.
The code then counts the number of rows where the (second) index is higher and the [Key] field matches. You could add a condition so that the Sprint field has to match as well, if that is relevant.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Added Index" = Table.AddIndexColumn(Source, "Index.1", 0, 1),
    #"Added Custom" = Table.AddColumn(#"Added Index", "Count", (i) => Table.RowCount(Table.SelectRows(#"Added Index", each [Key] = i[Key] and [Index.1] > i[Index.1])))
in
    #"Added Custom"
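If you want the Yes/No column from the question rather than the raw count, one more step (my own naming, just a sketch) can be appended before the in clause:

    #"Added HasSpillOver" = Table.AddColumn(#"Added Custom", "HasSpillOver", each if [Count] > 0 then "Yes" else "No")

and the in clause would then return #"Added HasSpillOver" instead of #"Added Custom".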
Here's my scenario:
I have an Event model and a Stage model; an event can have multiple stages and a stage can be assigned to multiple events, so it's many-to-many. The thing is, a stage has a sort_order, and that sort_order can differ per event. That's why I added the sort_order to the pivot table instead of, for example, the stage table.
table: events_stages
| event_id | stage_id | sort_order |
------------------------------------
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 5 | 3 |
The thing is, when I relate the Stage with the events it's in, I'm doing something like this in the StageController (the request posts events: [1,2,3] and sort_order: [1,1,2]):
$relatedEvents = array();

foreach ($request->events as $key => $event)
{
    $relatedEvents[] = array(
        'event_id'   => $event,
        'sort_order' => $request->sort_order[$key]
    );
}

$stage->events()->sync($relatedEvents);
but relying simply on the order of the POST data doesn't seem like a really good idea.
Does anyone have a nicer solution?
Thanks!
Sometimes it is better to create another model (and use it as a pivot) rather than use the pivot table itself. You have more control. I'm not sure exactly what you want to achieve.
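For illustration, here is a minimal sketch of that idea (the EventStage class name is mine, the ->using() call for custom pivot models exists only in newer Laravel versions, and keying the sync() payload by event id removes the dependency on the order of the POST arrays):

// Hypothetical custom pivot model for the events_stages table
use Illuminate\Database\Eloquent\Relations\Pivot;

class EventStage extends Pivot
{
    protected $table = 'events_stages';
}

// In the Stage model
public function events()
{
    return $this->belongsToMany(Event::class, 'events_stages')
                ->using(EventStage::class)
                ->withPivot('sort_order');
}

// In the StageController: key the payload by event id so the pivot data
// does not depend on the order of the arrays in the request
$relatedEvents = [];
foreach ($request->events as $key => $eventId) {
    $relatedEvents[$eventId] = ['sort_order' => $request->sort_order[$key]];
}
$stage->events()->sync($relatedEvents);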
Sorry for a newbie question.
Currently I have log files which contain fields such as userId, event, and timestamp, but lack a sessionId. My aim is to create a sessionId for each record based on the timestamp and a pre-defined TIMEOUT value.
If the TIMEOUT value is 10, and sample DataFrame is:
scala> eventSequence.show(false)
+----------+------------+----------+
|userId    |event       |timestamp |
+----------+------------+----------+
|U1        |A           |1         |
|U2        |B           |2         |
|U1        |C           |5         |
|U3        |A           |8         |
|U1        |D           |20        |
|U2        |B           |23        |
+----------+------------+----------+
The goal is:
+----------+------------+----------+----------+
|userId    |event       |timestamp |sessionId |
+----------+------------+----------+----------+
|U1        |A           |1         |S1        |
|U2        |B           |2         |S2        |
|U1        |C           |5         |S1        |
|U3        |A           |8         |S3        |
|U1        |D           |20        |S4        |
|U2        |B           |23        |S5        |
+----------+------------+----------+----------+
I found one solution in R (Create a "sessionID" based on "userID" and differences in "timeStamp"), but I am not able to figure it out in Spark.
Thanks for any suggestions on this problem.
Shawn's answer is about "how to create a new column", while my aim is "how to create a sessionId column based on timestamp". After days of struggling, the Window function turned out to be a simple solution for this scenario.
Window functions have been available since Spark 1.4; they are useful when an operation needs to
operate on a group of rows while still returning a single value for every input row.
In order to create a sessionId based on timestamp, I first need to get the difference between a user's two consecutive operations. windowDef defines a Window partitioned by "userId" and ordered by timestamp; diff is then a column that returns, for each row, the timestamp of the row one position after the current row in the partition (group), or null if the current row is the last row of its partition.
def handleDiff(timeOut: Int) = udf { (timeDiff: Int, timestamp: Int) =>
  if (timeDiff > timeOut) timestamp + ";" else timestamp + ""
}

val windowDef = Window.partitionBy("userId").orderBy("timestamp")
val diff: Column = lead(eventSequence("timestamp"), 1).over(windowDef)

val dfTSDiff = eventSequence.
  withColumn("time_diff", diff - eventSequence("timestamp")).
  withColumn("event_seq", handleDiff(TIME_OUT)(col("time_diff"), col("timestamp"))).
  groupBy("userId").agg(GroupConcat(col("event_seq")).alias("event_seqs"))
Updated:
Then exploit the Window function to apply a cumulative-sum-like operation (like cumsum in Pandas):
// Define a Window, partitioned by userId (partitionBy), ordered by timestamp (orderBy),
// whose frame covers all rows from the start of the partition up to and including the current row (rowsBetween)
val windowSpec = Window.partitionBy("userId").orderBy("timestamp").rowsBetween(Long.MinValue, 0)

val sessionDf = dfTSDiff.
  withColumn("ts_diff_flag", genTSFlag(TIME_OUT)(col("time_diff"))).
  select(col("userId"), col("eventSeq"), col("timestamp"), sum("ts_diff_flag").over(windowSpec).alias("sessionInteger")).
  withColumn("sessionId", genSessionId(col("userId"), col("sessionInteger")))
Previously:
The idea was to split by ";" to get each session and create a sessionId, then split by "," and explode to the final result; thus the sessionId was created with the help of string operations.
(This part should be replaced by a cumulative-sum operation instead; however, I did not find a good solution at the time.)
Any idea or thought about this question is welcome.
GroupConcat can be found here: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function
Reference: Databricks introduction
dt.withColumn('sessionId', <expression for the new sessionId column>)
for example:
dt.timestamp + <pre-defined TIMEOUT value>
I have two databases: one is old and deprecated; the other one is new, working. Both of them have a table called brands.
In the deprecated database, the brands table is something like the following:
id | name
1 | Playstation 1
2 | Playstation 2
3 | Playstation 3
4 | Playstation 4
5 | Xbox
6 | Xbox 360
7 | Xbox One
In the new one, this is the brands table:
id | name
1 | Xbox
2 | Xbox 360
3 | Xbox One
4 | Playstation 1
5 | Playstation 2
6 | Playstation 3
7 | Playstation 4
In practice, the scenario is more complex, but the example I gave represents it well. So, there's also a products table:
id | name | brand_id | created_at | updated_at
I want to import products from the old database into the new one, but as you saw, the brands don't match by id. So I want to do something like this:
brand_id 1 on old_database == brand_id 4 on new_database
To be more specific, it's a kind of dictionary, without ifs.
This is what I've done:
if query.brand == 1
  brand_id = 4
elsif query.brand == 2
  brand_id = 5
end
But this isn't what I really want. Yes, it works, but I want to do something simpler. I think hashes are exactly what I'm looking for. Any suggestions?
You could declare a hash like this:
brand_map = {1 => 4, 2 => 5} # add other entries as needed
and then lookup the new id like this:
brand_id = brand_map[1]
=> 4
Yes, it seems that a hash is what you want. For example,
id_map = { 1=>4, 2=>5, ... } # old id => new id
then for a record (id, name), write it to the new database as (id_map[id], name).
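As a rough sketch of the import itself, assuming ActiveRecord-style models called OldProduct (pointing at the deprecated database) and Product (both names are hypothetical):

# old brand id => new brand id, derived from the two brands tables above
brand_map = { 1 => 4, 2 => 5, 3 => 6, 4 => 7, 5 => 1, 6 => 2, 7 => 3 }

OldProduct.find_each do |old|
  Product.create!(
    name:       old.name,
    brand_id:   brand_map[old.brand_id],  # translate the old brand id
    created_at: old.created_at,
    updated_at: old.updated_at
  )
end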
I'm trying to update multiple documents in RethinkDB, based on some precalculated values in a Hash. i.e.
Given a table stats with primary key slug and data like
[{slug: 'foo', stats: {}}, {slug:'bar', stats:{}}]
and given a Hash with values like
updated_stats = {
'foo' => {a: 1, b: 2},
'bar' => {a: 3, b: 4}
}
I can do this
updated_stats.each{ |k, v|
  r.table('stats').get(k).update{ |s|
    { :stats => v }
  }
}
So, why can't I do the following?
r.table('stats').get_all(*updated_stats.keys).update{ |s|
  { :stats => updated_stats[s["slug"]] }
}
The RQL shows nil as the value of updated_stats[s["slug"]]. I would really appreciate any help on this. Thanks.
For anyone looking for how to bulk-update records: it's actually pretty easy, but not at all intuitive.
You actually have to perform an insert while specifying that, if there are any conflicts, those records should be updated. You will obviously need to provide the id of each record to be updated.
Using the following data set:
 id | title
----+-------
  1 | fun
  2 | in
  3 | the
  4 | sun
Here's an example (javascript):
const new_data = [
{id: 1, title: 'dancing'},
{id: 4, title: 'rain'},
];
r.db('your_db').table('your_table').insert(new_data, {conflict: 'update'});
The results would be:
 id | title
----+---------
  1 | dancing
  2 | in
  3 | the
  4 | rain
One caveat you should be aware of, though: if the new_data array contains an item whose id doesn't currently exist in the table, it will be added (upserted).
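If you want to avoid that, one option (a sketch, assuming the JavaScript driver, an open connection conn, and an async context) is to restrict new_data to ids that already exist before inserting:

// Fetch the ids that are already present in the table
const existingIds = await r.db('your_db').table('your_table')
  .getAll(...new_data.map(d => d.id))('id')
  .coerceTo('array')
  .run(conn);

// Keep only the items whose id already exists, then update on conflict
const updatesOnly = new_data.filter(d => existingIds.includes(d.id));

await r.db('your_db').table('your_table')
  .insert(updatesOnly, {conflict: 'update'})
  .run(conn);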
Cheers!
It's a tricky problem.
Here's the solution first.
r.table('stats').get_all(*updated_stats.keys).update{ |s|
  { :stats => r.expr(updated_stats).get_field(s["slug"]) }
}.run()
updated_stats is a Ruby hash, so when you use brackets on it you get the usual Ruby bracket operator; and since s["slug"] is an RQL term rather than an actual string, updated_stats has no such key and the lookup returns nil.
So you have to wrap updated_stats in r.expr().
Brackets in the Ruby driver are then used for nth, get_field, slice, etc., and when given a variable the driver cannot guess which one it should use.
So you have to explicitly say that you want get_field.
We will add a bracket term, which should fix this problem -- see https://github.com/rethinkdb/rethinkdb/issues/1179
Sorry you ran into this!