Sphinx SPH_SORT_EXTENDED - full-text-search

Is it possible to sort by the "weight" and then DESC on an attribute in the same query?
For example, if I search for this text "test is fine" and I have this in the index
+------------+---------+
| Field | Type |
+------------+---------+
| id | integer |
| text | field |
| importance | uint |
+------------+---------+
importance is the attribute here, with these values:
1, "test", 3
2, "test is fine", 1
3, "test", 8
then if I search for "test is fine", I need the result sorted first by keyword relevance (weight) and then by the "importance" attribute, so the id output for the search will be
ID result = 2, 3, 1
I'm using this, but the result is sorted by the attr 'importance' with no regard to the weight:
$cl->SetSortMode( SPH_SORT_ATTR_DESC, 'importance' );

You've sort of answered your own question. Yes, it's SPH_SORT_EXTENDED you want!
$cl->setSortMode(SPH_SORT_EXTENDED, "#relevance DESC, importance DESC");
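Note: the magic relevance name can vary between Sphinx versions; in the documented SPH_SORT_EXTENDED syntax the relevance weight is @weight, so if #relevance isn't recognized by your version, the equivalent call would be:
$cl->setSortMode(SPH_SORT_EXTENDED, "@weight DESC, importance DESC");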

Related

How do I strip rows inside a Column-Table, based on the "outer" table's value in Power Query?

Being pretty new to Power Query, I find myself faced with this problem I wish to solve.
I have a TableA with these columns. Example:
Key | Sprint | Index
-------------------------
A | PI1-I1 | 1
A | PI1-I2 | 2
B | PI1-I3 | 1
C | PI1-I1 | 1
I want to end up with a set looking like this:
Key | Sprint | Index | HasSpillOver
-------------------------
A | PI1-I1 | 1 | Yes
A | PI2-I2 | 2 | No
B | PI1-I3 | 1 | No
C | PI1-I1 | 1 | No
I thought I could maybe NestedJoin TableA on itself, compare indices and strip them away, and then count rows in the table, as outlined below.
TableA=Key, Sprint, Index
// TableA Nested joined on itself (Key, Sprint, Index, Nested)
TableB=NestedJoin(#"TableA", "Key", #"TableA", "Key", "Nested", JoinKind.Inner)
TableC= Table.TransformColumns(#"TableB", {"Nested", (x)=>Table.SelectRows(x, each [Index] <x[Index])} )
... and then do the count; however, this throws an error:
Can not apply operator < on types List and Number.
Any suggestions on how to approach this problem? Possibly (probably) in a different way.
You did not define very well what "spillover" means, but this should get you most of the way.
Mine assumes adding another index; you could use the one you already have if it is relevant.
The code then counts the number of rows where the (2nd) index is higher and the [Key] field matches. You could add a condition so that the Sprint field matches as well, if relevant.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // add a positional index so each row can be compared against the others
    #"Added Index" = Table.AddIndexColumn(Source, "Index.1", 0, 1),
    // for each row, count later rows (higher Index.1) with the same Key
    #"Added Custom" = Table.AddColumn(#"Added Index", "Count", (i) => Table.RowCount(Table.SelectRows(#"Added Index", each [Key] = i[Key] and [Index.1] > i[Index.1])))
in
    #"Added Custom"

Spring Data: exists by all values of some column

I want to know whether a set of entities exists, by the following rule:
I have a table with a composite primary key of two columns:
| id | key |
| 1 | a |
| 2 | b |
| 1 | c |
So, I want to do something like that:
boolean existsByIdAndAllOfKey(
    long id,
    Set<Key> keys
)
This query should return true if the database contains entities with all of the keys present in the input Set.
I'm wondering, is there any keyword for this in Spring Data? Or what is the best way to do that?
I found the following solution:
int countByIdAndKeyIn(
    long id,
    Set<Key> keys
)

boolean isThereEntityWithAllKeys(long id, Set<Key> keys) {
    return countByIdAndKeyIn(id, keys) == keys.size();
}
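For context, a minimal sketch of how this could sit in a Spring Data JPA repository; MyEntity, Key, and the repository name are hypothetical placeholders, and the ID type (shown as Long) would depend on how the composite key is actually mapped. The check relies on (id, key) pairs being unique, which holds here since they form the primary key:
import java.util.Set;
import org.springframework.data.jpa.repository.JpaRepository;

public interface MyEntityRepository extends JpaRepository<MyEntity, Long> {

    // Derived query: counts rows whose id matches and whose key is in the given set
    int countByIdAndKeyIn(long id, Set<Key> keys);

    // True only if every key in the set exists for this id
    default boolean hasAllKeys(long id, Set<Key> keys) {
        return countByIdAndKeyIn(id, keys) == keys.size();
    }
}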

How to sort a comma-separated string with a specific value in first position?

Let's say I have an unsorted string like "Apples,Bananas,Pineapples,Apricots" in my query. I want to sort that list and specifically have "Bananas" first in the list if it occurs, with the rest sorted ascending.
Example:
[BASKET] | [CONTENT] | [SORTED]
John | Apples,Apricots,Bananas | Bananas,Apples,Apricots
Melissa | Pineapples,Bananas | Bananas,Pineapples
Tom | Pineapples,Apricots,Apples | Apples,Apricots,Pineapples
How can I accomplish this with Power Query?
Cheap version: (a) replace Bananas with something that sorts first in a strict alpha sort, (b) sort into a new column, (c) fix Bananas back, (d) remove the extra column.
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    // (a) prefix Bananas so it sorts before everything else alphabetically
    #"Replaced Value" = Table.ReplaceValue(Source, "Bananas", "111Bananas", Replacer.ReplaceText, {"Items"}),
    // split on commas, sort, and re-join
    MySort = (x) => Text.Combine(List.Sort(Text.Split(x, ",")), ","),
    // (b) sorted list in a new column
    Sorted = Table.AddColumn(#"Replaced Value", "Sorted", each MySort([Items])),
    // (c) restore the original Bananas value
    #"Replaced Value1" = Table.ReplaceValue(Sorted, "111Bananas", "Bananas", Replacer.ReplaceText, {"Sorted"}),
    // (d) drop the helper column
    #"Removed Columns" = Table.RemoveColumns(#"Replaced Value1", {"Items"})
in
    #"Removed Columns"

Spark: How to create a sessionId based on userId and timestamp

Sorry for a newbie question.
Currently I have log files which contain fields such as userId, event, and timestamp, but lack a sessionId. My aim is to create a sessionId for each record based on the timestamp and a pre-defined value TIMEOUT.
If the TIMEOUT value is 10, and sample DataFrame is:
scala> eventSequence.show(false)
+----------+------------+----------+
|userId    |event       |timestamp |
+----------+------------+----------+
|U1 |A |1 |
|U2 |B |2 |
|U1 |C |5 |
|U3 |A |8 |
|U1 |D |20 |
|U2 |B |23 |
+----------+------------+----------+
The goal is:
+----------+------------+----------+----------+
|userId    |event       |timestamp |sessionId |
+----------+------------+----------+----------+
|U1 |A |1 |S1 |
|U2 |B |2 |S2 |
|U1 |C |5 |S1 |
|U3 |A |8 |S3 |
|U1 |D |20 |S4 |
|U2 |B |23 |S5 |
+----------+------------+----------+----------+
I find one solution in R (Create a "sessionID" based on "userID" and differences in "timeStamp"), while I am not able to figure it out in Spark.
Thanks for any suggestions on this problem.
Shawn's answer is about "how to create a new column", while my aim is "how to create a sessionId column based on timestamp". After days of struggling, the Window function turned out to be a simple solution in this scenario.
Window functions were introduced in Spark 1.4; they are useful when an operation needs to "operate on a group of rows while still returning a single value for every input row".
In order to create a sessionId based on timestamp, I first need the difference between a user's two consecutive operations. windowDef defines a Window partitioned by "userId" and ordered by timestamp; diff is then a column that, for each row, holds the timestamp of the row 1 position after the current row in the partition (group), or null if the current row is the last row in its partition. For example, for user U1 (timestamps 1, 5, 20), time_diff becomes 4, 15, null; with TIMEOUT = 10, the gap of 15 marks a session boundary.
// UDF: append ";" as a session-boundary marker when the gap to the next event exceeds the timeout
def handleDiff(timeOut: Int) = {
  udf { (timeDiff: Int, timestamp: Int) => if (timeDiff > timeOut) timestamp + ";" else timestamp + "" }
}
val windowDef = Window.partitionBy("userId").orderBy("timestamp")
// timestamp of the user's next event (null for the user's last event)
val diff: Column = lead(eventSequence("timestamp"), 1).over(windowDef)
val dfTSDiff = eventSequence.
  withColumn("time_diff", diff - eventSequence("timestamp")).
  withColumn("event_seq", handleDiff(TIME_OUT)(col("time_diff"), col("timestamp"))).
  groupBy("userId").agg(GroupConcat(col("event_seq")).alias("event_seqs"))
Updated:
Then use the Window function to apply a "cumsum"-like operation (as provided in Pandas):
// Define a Window partitioned by userId (partitionBy), ordered by timestamp (orderBy),
// whose frame covers all rows from the start of the partition up to the current row (rowsBetween)
val windowSpec = Window.partitionBy("userId").orderBy("timestamp").rowsBetween(Long.MinValue, 0)
val sessionDf = dfTSDiff.
  withColumn("ts_diff_flag", genTSFlag(TIME_OUT)(col("time_diff"))).
  select(col("userId"), col("eventSeq"), col("timestamp"), sum("ts_diff_flag").over(windowSpec).alias("sessionInteger")).
  withColumn("sessionId", genSessionId(col("userId"), col("sessionInteger")))
// genTSFlag and genSessionId are helper UDFs not shown in the post
Previously:
Then split by ";" to get each session and create a sessionId; afterwards split by "," and explode to the final result. Thus the sessionId is created with the help of string operations.
(This part should be replaced by the cumulative sum operation instead; however, I did not find a good solution at the time.)
Any idea or thought about this question is welcomed.
GroupConcat could be found here: SPARK SQL replacement for mysql GROUP_CONCAT aggregate function
Reference: databricks introduction
dt.withColumn('sessionId', <expression for the new sessionId column>)
for example:
dt.timestamp + <pre-defined value TIMEOUT>

Bulk Update in RethinkDB

I'm trying to update multiple documents in RethinkDB, based on some precalculated values in a Hash, i.e.
given a table stats with primary key slug, with data like
[{slug: 'foo', stats: {}}, {slug:'bar', stats:{}}]
and given a Hash with values like
updated_stats = {
  'foo' => {a: 1, b: 2},
  'bar' => {a: 3, b: 4}
}
I can do this
updated_stats.each{ |k, v|
  r.table('stats').get(k).update{ |s|
    { :stats => v }
  }
}
So, why can't I do the following?
r.table('stats').get_all(*updated_stats.keys).update{ |s|
  { :stats => updated_stats[s["slug"]] }
}
The RQL output shows nil as the value of updated_stats[s["slug"]]. Would really appreciate any help on this. Thanks.
For anyone looking for how to bulk update records, it's actually pretty easy but not at all intuitive.
You actually have to perform an insert, specifying that if there are any conflicts, those records should be updated. You will obviously need to provide the id of each record to be updated.
Using the following data set:
| id | title |
|----|-------|
| 1  | fun   |
| 2  | in    |
| 3  | the   |
| 4  | sun   |
Here's an example (javascript):
const new_data = [
  {id: 1, title: 'dancing'},
  {id: 4, title: 'rain'},
];
r.db('your_db').table('your_table').insert(new_data, {conflict: 'update'});
The results would be:
| id | title   |
|----|---------|
| 1  | dancing |
| 2  | in      |
| 3  | the     |
| 4  | rain    |
One caveat you should be aware of, though, is that if you represent something in the new_data array that doesn't currently exist in the table, it will be added/upserted.
Cheers!
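Since the question itself is in Ruby, the same insert-with-conflict approach might look like this with the Ruby driver; a sketch, assuming the updated_stats hash from the question and an open connection conn:
# turn the hash into an array of documents keyed by the primary key slug
docs = updated_stats.map{ |slug, stats| { slug: slug, stats: stats } }
# on conflict with an existing slug, update that record instead of erroring
r.table('stats').insert(docs, conflict: 'update').run(conn)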
It's a tricky problem.
Here's the solution first.
r.table('stats').get_all(*updated_stats.keys).update{ |s|
  { :stats => r.expr(updated_stats).get_field(s["slug"]) }
}.run()
Why the original fails: updated_stats is a Ruby hash, so the brackets are the plain Ruby bracket operator, and since updated_stats has no key equal to s["slug"] (a ReQL term, not a string), it returns nil.
So you have to wrap updated_stats in r.expr().
Then, brackets in the Ruby driver are used for nth, get_field, slice, etc., and when given a variable it cannot guess which one it should use.
So you have to explicitly say you want get_field.
We will add a bracket term, which should fix this problem -- see https://github.com/rethinkdb/rethinkdb/issues/1179
Sorry you ran into this!
