I'm using Sphinx to index news that I gather from about 100 sites daily.
Each news document has id, title, body, and date fields.
For the homepage of my project I want to show today's latest news grouped by topic.
For example, site A has a news item with the title:
"Internet of Things Will Burn Privacy for a While, Cerf Warns"
And site B has one with title:
"Cerf Warns : Internet of Things Will Burn Privacy for a While"
I want to show these as one item, along with the sites that covered it, like:
"Internet of Things Will Burn Privacy for a While, Cerf Warns"
Published by : a.com,b.org,...
Is it possible with Sphinx?
Sphinx won't do it on its own. It can't just 'magically' group similar items into clusters of likely duplicates.
(If the titles were identical, character for character, you could just group by title, but that's not the case in your example.)
Once you've got your documents into clusters, i.e. assigned each one a 'cluster-id', Sphinx can help you search or render results using the built-in group by. The two items in your example would have the same cluster-id; a unique article not covered by multiple sources would have its own id.
So first you need to cluster your documents.
There are dedicated tools for this type of thing, for example: https://github.com/open-city/dedupe
But a very basic one could actually be built with Sphinx. It would probably work OK in your example, because the titles contain the same words, just in a different order.
Basically you just need a script that loops through all documents that DON'T have a cluster-id and runs a Sphinx search against the index for each one, looking for duplicates. If one is found, copy its cluster-id; otherwise allocate a fresh unique id.
This script can then be run after inserting news documents, to 'cluster' any new stories.
The exact Sphinx query can be varied. E.g. just including the words in a basic query would require all the same words, regardless of order. But you could also use a quorum search to require only most of the words to match.
You might also want to filter by date to avoid matching stories from wildly different dates.
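A minimal sketch of that clustering loop, with plain Python word-set overlap standing in for the Sphinx query. The function names, the 0.8 threshold (a rough quorum-like rule), and the dict layout are assumptions made for illustration; a real script would run a Sphinx search where find_duplicate is called.

```python
import re

def title_words(title):
    """Lowercased set of words in a title, ignoring punctuation and word order."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))

def find_duplicate(words, clustered, quorum=0.8):
    """Stand-in for the Sphinx search: return the cluster id of the first
    already-clustered title sharing at least `quorum` of the words, else None."""
    for other_words, cluster_id in clustered:
        shared = len(words & other_words)
        if shared and shared >= quorum * max(len(words), len(other_words)):
            return cluster_id
    return None

def assign_clusters(docs, quorum=0.8):
    """docs: list of dicts with a 'title' key. Fills in 'cluster_id' for every
    document that does not have one yet, and returns the same list."""
    clustered = [(title_words(d["title"]), d["cluster_id"])
                 for d in docs if d.get("cluster_id") is not None]
    next_id = max((cid for _, cid in clustered), default=0) + 1
    for doc in docs:
        if doc.get("cluster_id") is not None:
            continue  # already clustered on an earlier run
        words = title_words(doc["title"])
        cluster_id = find_duplicate(words, clustered, quorum)
        if cluster_id is None:
            cluster_id = next_id  # no duplicate found: allocate a fresh id
            next_id += 1
        doc["cluster_id"] = cluster_id
        clustered.append((words, cluster_id))
    return docs
```

With the two example titles from the question, both documents end up with the same cluster_id, and grouping by that id then gives one item per story.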
I have a filter by division (Management, IT, Finance, etc.). I currently have it so that I can do a data quality check for each division. However, I don't want this filter to show up for users of other divisions. Right now, the IT folks can see the division filter, and as they are restricted to seeing only their division's data, "IT" is the only option they see in the filter. I would like them not to see the Division filter at all, as it is not useful for them, but I do need it for data quality purposes.
One solution is to publish a workbook without the filter for the end users and another workbook with the filter just for you in a different folder (which only you can access). This always works, but you then have 2 versions of the same report.
Or you can try it with parameters as described here:
https://www.theinformationlab.co.uk/2017/01/26/hiding-parameters-filters-tableau/
I have a products index that displays filtered results on category pages.
1. For a given category, any number of products may be flagged as featured, meaning they display first.
2. When products are displayed for a category, only one featured product should show at a time (picked at random from the available products flagged as featured).
3. Additionally, a product should not be treated as featured if it has date range fields and the current date is not within the range.
So, my index might look something like: https://gist.github.com/1a0327d8a321dc6627e197b94f4209c9
A solution to 1. has been posted here: https://stackoverflow.com/a/40922535 using optional filters, currently in private beta: https://www.algolia.com/doc/advanced/optional-filters-scoring/. At query time I could do optionalFacetFilters: ["featuredIn.category:category1"] Update: I've since been told by Algolia reps that this feature has a large performance cost, and is therefore only really viable for enterprise customers.
However, I'm at a loss as to how to pull off 2. and 3.
Any guidance is greatly appreciated!
Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.
I don't really see a way I could implement this using slow NOT IN queries with a large number of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing some algorithm that might be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database
Create a linked-list-based index for each category from the IDs of documents belonging to this category. Shuffle it
Create a Bloom Filter containing all of the entries viewed by a particular user
Traverse the index with an iterator, using the Bloom Filter to pick items that have not been viewed.
If you track in a table which entries the user has seen, try this. I'm going to use MySQL because that's the quickest example I can think of, but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ("jj", 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id and v.userid = "jj"
where v.url_id is null
order by rand()
limit 1
This makes the database do a one-to-one join against that user's viewed rows, and you're limiting the query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this one operation but there's no guarantee that the url will be passed successfully to the user.
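A small runnable sketch of the same idea, using Python's built-in sqlite3 in place of MySQL (so ORDER BY RANDOM() instead of rand()). Restricting the join to one user is an assumption about the intent, and the table contents and function names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url_id INTEGER PRIMARY KEY);
    CREATE TABLE viewed (userid TEXT, url_id INTEGER, PRIMARY KEY (userid, url_id));
    INSERT INTO pages (url_id) VALUES (1), (2), (3);
""")

def next_unseen(conn, userid):
    """Return one random url_id this user has not viewed, or None if none are left."""
    row = conn.execute(
        """
        SELECT p.url_id
        FROM pages p
        LEFT JOIN viewed v ON v.url_id = p.url_id AND v.userid = ?
        WHERE v.url_id IS NULL      -- keep only pages with no matching viewed row
        ORDER BY RANDOM()
        LIMIT 1
        """,
        (userid,),
    ).fetchone()
    return row[0] if row else None

def mark_viewed(conn, userid, url_id):
    """Record that the link was 'used' by this user."""
    conn.execute("INSERT INTO viewed (userid, url_id) VALUES (?, ?)", (userid, url_id))
```

Calling next_unseen and then mark_viewed in a loop hands each page to the user exactly once; other users are unaffected because the join only looks at the given userid.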
It depends on how users get their random entries.
Option 1:
A user pages through some entities and stops after a couple of them. For example, the user sees the current random entity, moves on to the next one, reads it, repeats this a couple of times, and that's it.
The next time this user (or another) gets an entity from this category, the list of already viewed entities has been cleared, so you can return an entity that was viewed before.
For that option I would recommend keeping a (hash) set of already viewed entity ids; every time the user asks for a random entity, randomly choose one from the DB and check that it is not already in the set.
Because the set is so small and your data is so big, the chance of drawing an already viewed id is tiny, so this will take O(1) most of the time.
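A rough sketch of Option 1, with an in-memory list of ids standing in for the random pick from the DB. The function name, the max_tries guard, and the assumption that `seen` only ever holds ids from `all_ids` are mine, not part of the answer.

```python
import random

def pick_unseen(all_ids, seen, rng=random, max_tries=1000):
    """Draw random ids until one is not in `seen`, then record and return it.
    Returns None when everything has been seen or max_tries runs out.
    Assumes all_ids has no repeats and seen is a subset of it."""
    if len(seen) >= len(all_ids):
        return None
    for _ in range(max_tries):
        candidate = rng.choice(all_ids)
        if candidate not in seen:   # O(1) membership test on the hash set
            seen.add(candidate)
            return candidate
    return None
```

As long as the seen set is small compared with the category, the first or second draw almost always succeeds; the retry loop only starts to hurt when a user has seen most of the category.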
Option 2:
A user pages through the entities, and the viewed entities are saved across all users and across every visit to your page.
In that case you will probably go through all the entities in each category, so saving all the viewed entities and checking whether an entity has been viewed will take some time.
For that option I would get all the ids for this topic, shuffle them, and store them in a linked list. When you want a random entity that has not been viewed, just take the head of the list and delete it (O(1)).
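A minimal sketch of Option 2. It uses a shuffled Python list and pops from the end rather than a linked list, which gives the same O(1) take-and-delete; the class and method names are hypothetical.

```python
import random

class ShuffledQueue:
    """Holds the ids of one category in random order and hands each out once."""

    def __init__(self, ids, rng=random):
        self._ids = list(ids)
        rng.shuffle(self._ids)      # shuffle once, up front

    def next(self):
        """Remove and return the next id, or None when the category is exhausted."""
        return self._ids.pop() if self._ids else None
```

One such queue would be kept per category (or per user and category, depending on how the viewed state is shared).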
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
I would opt for a pseudorandom approach:
1.) Determine number of elements in category to be viewed (SELECT COUNT(*) WHERE ...)
2.) Pick a random number in range 1 ... count.
3.) Select a single document (SELECT * FROM ... WHERE [same as when counting] ORDER BY [generate stable order]), skipping to the row picked in step 2. Depending on the SQL dialect in use, there are different clauses for retrieving only the part of the result set you want (MySQL LIMIT clause, SQL Server TOP clause, etc.).
If the number of documents is large, the chance of serving the same user the same document twice is negligibly small. Using the scheme described above, you don't have to store any state information at all.
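The three steps can be sketched with Python's built-in sqlite3, which supports LIMIT ... OFFSET. The documents table, its single category column (the question allows up to six categories per document), and the function names are simplifying assumptions for illustration.

```python
import random
import sqlite3

def make_demo_db():
    """Build a tiny in-memory table; the rows are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, category TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO documents (id, category, body) VALUES (?, ?, ?)",
        [(1, "news", "first"), (2, "news", "second"),
         (3, "news", "third"), (4, "sport", "other")],
    )
    return conn

def random_document(conn, category, rng=random):
    """Return one (id, body) row from the category, or None if it is empty."""
    # Step 1: count the candidates.
    (total,) = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE category = ?", (category,)
    ).fetchone()
    if total == 0:
        return None
    # Step 2: pick a random position (0-based here, because OFFSET is 0-based).
    offset = rng.randrange(total)
    # Step 3: same WHERE clause, stable order, fetch just that one row.
    return conn.execute(
        "SELECT id, body FROM documents WHERE category = ? "
        "ORDER BY id LIMIT 1 OFFSET ?",
        (category, offset),
    ).fetchone()
```

Because nothing is stored per user, a repeat is possible; it is only unlikely when the category is large.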
You may want to consider a NoSQL solution like Apache Cassandra. It seems to be well suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
edit: one of many possible solutions below:
Create a CF (column family, i.e. table) for each category (creating these on the fly is quite easy).
Add a row to each category CF for each document belonging to the category.
Whenever a user hits a document, you add a column named after that user to the document's row and set it to true. Obviously this table will be huge, with millions of columns and probably quite sparsely populated, but that's no problem; reading it is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any row where that user's column is null.
You should get constant time writes and reads, amazing scalability, etc if you can accept Cassandra's "eventually consistent" model (ie, it is not mission critical that a user never get a duplicate document)
I've solved similar in the past by indexing the relational database into a document oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
Store a user's past X selections in a cookie or something.
Return the last selections to the server with the user's new criteria.
Randomly choose one of the texts satisfying the criteria, repeating until the choice is not a member of the user's last X selections.
Return this choice of text and update the list of last X selections.
I would experiment to find the best value of X but I have in mind something like an X of say 16?
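A sketch of the server side of this scheme, assuming the client sends back its recent selections as a plain list. It filters the recent ids out up front instead of retrying, which avoids looping forever when every candidate was shown recently; the function name and the fallback behaviour are assumptions.

```python
import random
from collections import deque

def choose_text(candidates, recent, rng=random, max_recent=16):
    """candidates: ids of texts matching the user's criteria.
    recent: the user's last selections, oldest first (e.g. from a cookie).
    Returns (choice, updated_recent); choice is None if there are no candidates."""
    history = deque(recent, maxlen=max_recent)  # keeps only the last X entries
    if not candidates:
        return None, list(history)
    fresh = [c for c in candidates if c not in history]
    pool = fresh or list(candidates)  # fall back to a repeat if nothing is fresh
    choice = rng.choice(pool)
    history.append(choice)            # the oldest entry drops off automatically
    return choice, list(history)
```

The returned list is what would be written back to the cookie for the next request.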
I have a database of about 200k books. I wish to give my users a way to quickly search for a book by title. Now, some titles might have a prefix like A, THE, etc., and can also have numbers in the title, so a search for 12 should match books with "12", "twelve" and "dozen" in the title. This will work via AJAX, so I need to make sure the database query is really fast.
I assume that most of the users will try to search using some words of the title, so I'm thinking to split all the titles into words and create a separate database table which would map words to titles. However, I fear this might not give the best results. For example, the book title could be some 2 or 3 commonly used words, and I might get a list of books with longer titles that contain all 2-3 words and the one I'm looking for lost like a needle in a haystack. Also, searching for a book with many words in the title might slow down the query because of a lot of OR clauses.
Basically, I'm looking for a way to:
find the results quickly
sort them by relevance.
I assume this is not the first time someone needs something like this, and I'd hate to reinvent the wheel.
P.S. I'm currently using MySQL, but I could switch to anything else if needed.
Using SOUNDEX is the best way, I think.
SELECT
id,
title
FROM products AS p
WHERE p.title SOUNDS LIKE 'Shaw';
-- This will match 'Saw' etc.
For the best database performance, precalculate the SOUNDEX value of your titles and store it in a new column. You can calculate the soundex with SOUNDEX('Hello').
Example usage:
UPDATE `books` SET `soundex_title` = SOUNDEX(title);
You might want to have a look at Apache Lucene. This is a high-performance, Java-based information retrieval system.
You would create an IndexWriter and index all your titles, and you can add parameters (have a look at the class) linking to the actual book.
When searching, you would need an IndexReader and an IndexSearcher, and use the search() operation on them.
Have a look at the sample at: src/demo and in: http://lucene.apache.org/java/2_4_0/demo2.html
Using information retrieval techniques makes the indexing take longer, but a search will not need to go through most of the titles, so overall you can expect better search performance.
Also, choosing a good Analyzer lets you ignore words such as "the", "a"...
One solution that would easily accommodate your volume of data and speed requirement is to use the Redis key-value store.
The way I see it, you can go ahead with your solution of mapping titles to keywords and storing them in the form:
keyword : set of book titles
Redis already has a built-in set data-type that you can use.
Next, to get the titles of the books that contain the search keywords, you can use the SINTER command, which will perform set intersection for you.
Everything is done in memory; therefore the response time is very fast.
Also, if you want to save your index, Redis has a number of different persistence/caching mechanisms.
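To show the shape of the index, here is a sketch that uses plain Python sets in place of a Redis server. The comments note the Redis commands (SADD, SINTER) each step would correspond to; the function names and the word-splitting rule are assumptions.

```python
import re
from collections import defaultdict

# keyword -> set of book titles (in Redis: one set per keyword)
index = defaultdict(set)

def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def add_title(title):
    """Index a title under each of its words."""
    for word in tokenize(title):
        index[word].add(title)          # Redis: SADD <word> <title>

def search(query):
    """Return the titles containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets)      # Redis: SINTER <word1> <word2> ...
```

This only covers exact word matches; stop words such as "the" and number synonyms would still need handling when the index is built.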
Apache Lucene with Solr is definitely a very good option for your problem.
You can link Solr/Lucene to index your MySQL database directly. Here is a simple tutorial on how to link your MySQL database with Lucene/Solr: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/
Here are the advantages and pains of using Lucene-Solr instead of MySQL full text search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
Keep it simple. Create an index on the title field and use wildcard pattern matching. You cannot possibly make it any faster, as your bottleneck is not the string matching but the number of strings you want to match against the title.
And I just came up with a different idea. You say that some words can be interpreted differently, like 12, twelve, dozen. Instead of creating a query with different interpretations, why not store the different interpretations of the titles in a separate table with a one-to-many relation to the books? You can then GROUP BY book_id to get unique book titles.
Say the book is "A dime in a dozen". In the books table it will be:
book_id=356
book_title='A dime in a dozen'
In the titles table will be stored:
titles_id=123
titles_book_id=356
titles_title='A dime in a dozen'
--
titles_id=124
titles_book_id=356
titles_title='A dime in a 12'
--
titles_id=125
titles_book_id=356
titles_title='A dime in a twelve'
The query for this:
SELECT b.book_id, b.book_title
FROM books b JOIN titles t ON b.book_id = t.titles_book_id
WHERE t.titles_title LIKE '%twelve%'
GROUP BY b.book_id
Now, insertion becomes a much bigger task, but creating the variants can be done outside the database and inserted in one swoop.
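As an illustration of creating the variants outside the database, here is a small sketch that expands a title using a hand-made synonym table. The SYNONYMS list and the function name are hypothetical; a real table would need many more groups.

```python
from itertools import product

# Hypothetical synonym groups: words in the same set are interchangeable.
SYNONYMS = [{"12", "twelve", "dozen"}]

def title_variants(title):
    """Return every variant of the title obtained by swapping synonym words."""
    options = []
    for word in title.split():
        group = next((g for g in SYNONYMS if word.lower() in g), None)
        # A word with synonyms contributes all of them; any other word just itself.
        options.append(sorted(group) if group else [word])
    return [" ".join(words) for words in product(*options)]
```

For "A dime in a dozen" this yields the three titles shown in the titles table above, which can then be inserted in one batch.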
I've a client testing the full text search (example below) on a new Oracle UCM site.
The random text string they chose to test was 'test only', which failed. From my testing it seems 'only' is a reserved word, as it is never returned from a full text search (it is returned from metadata searches).
I've spent the morning searching oracle.com and found this which seems pretty comprehensive, yet does not have 'only'.
So my question is: is 'only' a reserved word? And where can I find a complete list of reserved words for Oracle full text search (10g)?
Full text search string example;
(<ftx>test only</ftx>)
Update.
I have done some more testing. Seems it ignores words that indicate places or times;
only, some, until, when, while, where, there, here, near, that, who, about, this, them.
Can anyone confirm this? I can't find this anywhere on Oracle's site.
Update 2. Post Answer
I should have been looking for 'stop' words not 'reserved'.
Updated the question title and tags to reflect.
Additional answers:
See default Oracle (11g) stopword lists here: http://download.oracle.com/docs/cd/B28359_01/text.111/b28304/astopsup.htm#i634475
The following query lists stopwords from all stoplists (to be run on the CTXSYS schema):
SELECT *
FROM DR$STOPWORD
LEFT JOIN DR$STOPLIST ON DR$STOPWORD.SPW_SPL_ID = DR$STOPLIST.SPL_ID
In the results, the SPL_* fields come from the DR$STOPLIST system table, and the SPW_* fields from the DR$STOPWORD table
From a user schema, user-defined stoplists and stopwords can be retrieved through:
SELECT * FROM CTX_USER_STOPLISTS;
SELECT * FROM CTX_USER_STOPWORDS;
I bet the system is trying to automatically ignore frequently occurring words. That would explain why you cannot find 'only' but 'onnly' can be found. Can you search for 'a', 'an', ...
The list you gave of words that do not work looks like some very common words that frequently are not the primary words in a sentence. Given this, they are not likely to be words you are searching for on a full text search.
What are the odds that you are looking for an article that includes the word 'that' and the inclusion of that word is the only fact you have on the article?
I think I found your list.... Ironically from the wiki page of the last company I started..: http://www.sugarcrm.com/wiki/index.php?title=Overview_of_Full_Text_Stop_Words#Default_Stop_Words_.28for_English.29
2.10.3 Modifying the Default Stoplist
The default stoplist is always named CTXSYS.DEFAULT_STOPLIST. You can use the following procedures to modify this stoplist:
• CTX_DDL.ADD_STOPWORD
• CTX_DDL.REMOVE_STOPWORD
• CTX_DDL.ADD_STOPTHEME
• CTX_DDL.ADD_STOPCLASS
When you modify CTXSYS.DEFAULT_STOPLIST with the CTX_DDL package, you must re-create your index for the changes to take effect.
Default stopword list:
a he out up
be more their at
had one will from
it than and is
only when corp not
she also in says
was by ms to
about her over
because most there
has or with
its that are
of which could
some an inc
we can mz
after his s
been mr they
have other would
last the as
on who for
such any into
were co no
all if so
but mrs this
Update - A nice whitepaper from Oracle that includes how full text searching works can be downloaded from: http://www.oracle.com/technology/products/text/pdf/text_techwp.pdf. They mention the stopwords and the fact that there is a default list, but don't mention the words themselves.
Keywords reserved:
http://www.toadworld.com/KNOWLEDGE/KnowledgeXpertforOracle/tabid/648/TopicID/SQL15/Default.aspx
click on "Keyword reserved words" on left.
"Only" is in the list.
I am not sure what is going on in your case, but I cannot imagine that Oracle would not support the word 'only' in full text search. In many full text cases, you have to search for one word. Could that be the problem you are encountering?