How do I implement threaded comments? - performance

I am developing a web application that can support threaded comments. I need the ability to rearrange the comments based on the number of votes received (identical to how threaded comments work on reddit).
I would love to hear input from the SO community on how to do it.
How should I design the comments table?
Here is the structure I am using now:
Comment
    id
    parent_post
    parent_comment
    author
    points
What changes should I make to this structure?
How should I get the details from this table to display them in the correct manner?
(Implementation in any language is welcome. I just want to know how to do it in the best possible manner)
What do I need to take care of while implementing this feature so that it puts less load on the CPU/database?
Thanks in advance.

Storing trees in a database is a subject with many different solutions. It depends on whether you want to retrieve a sub-hierarchy as well (i.e. all children of item X), or whether you just want to grab the entire set of hierarchies and build the tree in O(n) in memory using a dictionary.
Your table has the advantage that you can fetch all comments on a post in one go, by filtering on parent_post. As you've defined the comment's parent in the textbook/naive way, you have to build the tree in memory (see below). If you want to obtain the tree from the DB, you need a different way to store it:
See my description of a pre-calc based approach here:
http://www.llblgen.com/tinyforum/GotoMessage.aspx?MessageID=17746&ThreadID=3208
or by using Celko's nested set model,
or yet another approach:
http://www.sqlteam.com/article/more-trees-hierarchies-in-sql
If you fetch everything in a hierarchy into memory and build the tree there, it can be more efficient because the query is simple: SELECT ... FROM Comment WHERE ParentPost = #id ORDER BY ParentComment ASC
After that query, you build the tree in memory with a single dictionary that maps CommentID to Comment. You then walk through the resultset and build the tree on the fly: for every comment you run into, look up its parent comment in the dictionary, attach the comment to it, and then store the comment currently being processed in that dictionary as well.
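A minimal sketch of that in-memory pass in Python, using the question's schema (field and variable names are mine). It assumes parents appear in the resultset before their children, which the ORDER BY above only guarantees when child IDs are always higher than their parents':

import operator

def build_tree(rows):
    """rows: iterable of (id, parent_comment, author, points) tuples,
    with parents appearing before their children."""
    nodes = {}   # CommentID -> node
    roots = []   # top-level comments of the post
    for id_, parent_comment, author, points in rows:
        node = {"id": id_, "author": author, "points": points, "children": []}
        nodes[id_] = node                 # remember it for our own children
        if parent_comment is None:        # top-level comment
            roots.append(node)
        else:                             # attach to the already-seen parent
            nodes[parent_comment]["children"].append(node)
    # reddit-style ordering: most points first among siblings
    for node in nodes.values():
        node["children"].sort(key=operator.itemgetter("points"), reverse=True)
    roots.sort(key=operator.itemgetter("points"), reverse=True)
    return roots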

Couple things to also consider...
1) When you say "sort like reddit" based on rank or date, do you mean the top-level or the whole thing?
2) When you delete a node, what happens to its branches? Do you re-parent them? In my implementation, I'm thinking the editors will decide: either hide the node and display it as "comment hidden" along with the visible children, hide the comment and its children, or nuke the whole tree. Re-parenting should be easy (just set the children's parent to the deleted node's parent; see the sketch below), but anything involving the whole tree seems tricky to implement in the database.
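The re-parenting case really is one statement against the question's schema; here's a sketch via any Python DB-API cursor (cur, new_parent_id, and deleted_id are hypothetical names):

# Re-parent the deleted comment's children onto its own parent.
cur.execute(
    "UPDATE Comment SET parent_comment = %s WHERE parent_comment = %s",
    (new_parent_id, deleted_id),
)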
I've been looking at the ltree module for PostgreSQL. It should make database operations involving parts of the tree a bit faster. It basically lets you set up a field in the table that looks like:
ltreetest=# select path from test where path <@ 'Top.Science';
path
------------------------------------
Top.Science
Top.Science.Astronomy
Top.Science.Astronomy.Astrophysics
Top.Science.Astronomy.Cosmology
However, it doesn't ensure any kind of referential integrity on its own. In other words, you can have a record for "Top.Science.Astronomy" without having a record for "Top.Science" or "Top". But what it does let you do is stuff like:
-- hide Top.Science and its children
UPDATE test SET hide_me=true WHERE path <@ 'Top.Science';
or
-- nuke the cosmology branch
DELETE FROM test WHERE path <@ 'Top.Science.Astronomy.Cosmology';
If combined with the traditional "comment_id"/"parent_id" approach using stored procedures, I'm thinking you can get the best of both worlds. You can quickly traverse the comment tree in the database using your "path" and still ensure referential integrity via "comment_id"/"parent_id". I'm envisioning something like:
CREATE TABLE comments (
    comment_id SERIAL PRIMARY KEY,
    parent_comment_id int REFERENCES comments(comment_id) ON UPDATE CASCADE ON DELETE CASCADE,
    thread_id int NOT NULL REFERENCES threads(thread_id) ON UPDATE CASCADE ON DELETE CASCADE,
    path ltree NOT NULL,
    comment_body text NOT NULL,
    hide boolean NOT NULL DEFAULT false
);
The path string for a comment would look like:
<thread_id>.<parent_id_#1>.<parent_id_#2>.<parent_id_#3>.<my_comment_id>
Thus a root comment of thread "102" with a comment_id of "1" would have a path of:
102.1
And a child whose comment_id is "3" would be:
102.1.3
And children of "3" with ids of "31" and "54" would be:
102.1.3.31
102.1.3.54
To hide node "3" and its kids, you'd issue this:
UPDATE comments SET hide=true WHERE path <@ '102.1.3';
I dunno though--it might add needless overhead. Plus I don't know how well maintained ltree is.
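The stored-procedure half is left open above; here's a client-side sketch of the same bookkeeping with psycopg2, assuming the table above (the placeholder-insert-then-update dance and the function name are mine, not part of the answer):

import psycopg2

def add_comment(conn, thread_id, parent_comment_id, body):
    """Insert a comment and derive its ltree path from its parent's path."""
    with conn, conn.cursor() as cur:
        # Insert first with a placeholder path so we learn our own comment_id.
        cur.execute(
            "INSERT INTO comments (parent_comment_id, thread_id, path, comment_body) "
            "VALUES (%s, %s, %s::ltree, %s) RETURNING comment_id",
            (parent_comment_id, thread_id, str(thread_id), body),
        )
        comment_id = cur.fetchone()[0]
        if parent_comment_id is None:
            path = f"{thread_id}.{comment_id}"            # root: <thread_id>.<id>
        else:
            cur.execute("SELECT path FROM comments WHERE comment_id = %s",
                        (parent_comment_id,))
            path = f"{cur.fetchone()[0]}.{comment_id}"    # <parent path>.<id>
        cur.execute("UPDATE comments SET path = %s::ltree WHERE comment_id = %s",
                    (path, comment_id))
        return comment_id

A database trigger could do the same work server-side and keep the two writes atomic.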

Your current design is basically fine for small hierarchies (fewer than a thousand items).
If you want to fetch at a certain level or depth, add a 'level' column to your structure and compute it as part of the save.
If performance is an issue, use a decent cache.

I'd add the following new fields to the above table:
thread_id: identifier for all comments attached to a specific object
date: the comment date (allows fetching comments in chronological order)
rank: the comment rank (allows fetching comments ordered by ranking)
Using these fields you'll be able to:
fetch all comments in a thread in a single op
order comments in a thread either by date or rank
Unfortunately, if you want to keep your queries close to the SQL standard, you'll have to recreate the tree in memory. Some DBs offer special queries for hierarchical data (e.g. Oracle's CONNECT BY).
./alex

Related

Recursive database viewing

I have this situation: starting from a table, I have to check all the records that match a key. If records are found, I have to check another table using a key from the first table, and so on, more or less over five levels. Is there a way to do this recursively, or do I have to write all the code "by hand"? The language I am using is Visual FoxPro. If this is not possible, is it at least possible to use recursion to populate a treeview?
You can set a relation between tables. For example:
USE table_1.dbf IN 0 SHARED
USE table_2.dbf IN 0 SHARED
SET ORDER TO TAG key_field OF table_2.cdx IN table_2
SET RELATION TO key_field INTO table_2 ADDITIVE IN table_1
First two commands open table_1 and table_2. Then you have to set the order/index of table_2. If you don't have an index for the key field then this will not work. The final command sets the relation between the two tables on the key field.
From here you can browse both tables and table_2's records will be filtered based on table_1's key field. Hope this helps.
If the tables have similar structure or you only need to look at a few fields, you could write a recursive routine that receives the name of the table, the key to check, and perhaps the fields you need to check as parameters. The tricky part, I guess, is knowing what to pass down to the next call.
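Not VFP, but a rough sketch of that recursion in Python/sqlite3 terms, just to show the shape (the CHAIN of (table, key column, child-key column) triples is hypothetical and would mirror your five levels):

import sqlite3

# Hypothetical chain describing each level: which table to search, which
# column to match against, and which column feeds the next level's key.
CHAIN = [
    ("table_1", "key_field", "next_key"),
    ("table_2", "key_field", "next_key"),
    ("table_3", "key_field", "next_key"),
]

def walk(cur, level, key, visit):
    """Recursively follow matching records down the table chain."""
    if level >= len(CHAIN):
        return
    table, key_col, child_col = CHAIN[level]
    cur.execute(f"SELECT {child_col} FROM {table} WHERE {key_col} = ?", (key,))
    for (child_key,) in cur.fetchall():
        visit(level, child_key)           # e.g. add a node to the treeview
        walk(cur, level + 1, child_key, visit)

The tricky part the answer mentions (what to pass down) becomes the child_col value of each matched row.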
I don't think I can offer any more advice without at least seeing some table structures.
Sorry for answering so late, but the problem was of course that recursion wasn't a viable solution, since I had to search inside multiple tables. So I resolved it by doing a simple two-level search in the tables that I needed.
Thank you very much for the help, and sorry again for answering so late.

MarkLogic - get list of all unique document structures in a MarkLogic database

I want to get a list of all distinct document structures, with a count, in a MarkLogic database.
e.g. a database with these 3 documents:
1) <document><name>Robert</name></document>
2) <document><name>Mark</name></document>
3) <document><fname>Robert</fname><lname>Smith</lname></document>
Would return that there are two unique document structures in the database, one used by 2 documents, and the other used by 1 document.
I am using this XQuery and am getting back the list of unique sequences of elements correctly:
for $i in distinct-values(for $document in doc()
return <div>{distinct-values(
for $element in $document//*/*/name() return <div>{$element}</div>)} </div>)
return $i
I appreciate that this code will not handle duplicate element names but that is OK for now.
My questions are:
1) Is there a better/more efficient way to do this? I am assuming yes.
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
If you have a finite list of elements for which you need to do this, consider co-occurrence or other similar solutions: https://docs.marklogic.com/cts:value-co-occurrences
This requires a range index on each element in question.
MarkLogic works best when you use indexes whenever possible. The other solution I can think of is to create a hash/checksum of the target content's structure for each document in question and store it with the document (or in a triple, if you happen to have a licence for semantics). Then you would already have a key for the unique combinations.
1) Is there a better/more efficient way to do this? I am assuming yes.
If it were up to me, I would create the document structured in a consistent fashion (like you're doing), then hash it, and attach the hash to each document as a collection. Then I could count the docs in each collection. I can't see any efficient way (using indexes) to get the counts without first writing to the document content or metadata (collection is a type of metadata) then querying against the indexes.
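The hash idea, sketched outside MarkLogic in Python just to show the shape (sort the document's element paths, hash them, use the digest as a collection name; in ML you'd do the equivalent in XQuery):

import hashlib
import xml.etree.ElementTree as ET

def structure_hash(xml_string):
    """Hash the set of element paths in a document, ignoring its values."""
    root = ET.fromstring(xml_string)
    paths = set()
    def visit(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths.add(path)
        for child in elem:
            visit(child, path)
    visit(root, "")
    canonical = "\n".join(sorted(paths))          # stable across element order
    return hashlib.sha1(canonical.encode()).hexdigest()

# Documents 1 and 2 from the question share a digest; document 3 differs.
print(structure_hash("<document><name>Robert</name></document>"))
print(structure_hash("<document><fname>Robert</fname><lname>Smith</lname></document>"))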
2) Is there a way to get back enough detail so that I could build up the xml tree of each unique structure?
After you get the counts for each collection, you could retrieve one doc from each collection and walk through it to build an empty XML structure. XSLT would probably be a good way to do this if you already know XSLT.
3) What is the best way to return the count of each distinct structure, e.g. 2 and 1 in the above example?
Turn on the collection lexicon on your database. Then do something like the following:
for $collection in cts:collections()
return ($collection, cts:frequency($collection))
Not sure I follow exactly what you are after, but I am wondering if this is more what you are looking for: functx:distinct-element-paths($doc)
http://www.xqueryfunctions.com/xq/functx_distinct-element-paths.html
Here's a quick example:
xquery version "1.0-ml";
import module namespace functx = "http://www.functx.com" at "/MarkLogic/functx/functx-1.0-nodoc-2007-01.xqy";
let $doc := <document><fname>Robert</fname><lname>Smith</lname></document>
return
functx:distinct-element-paths($doc)
Outputs the following strings (which could be parsed, of course):
document
document/fname
document/lname
There are existing 3rd-party tools that may work, depending on the size of the data and the coverage required (is 100% sampling needed?).
Search for "Generate Schema from XML" --
Such tools will look at a sample set and infer a schema (XSD, DTD, RNG, etc.).
They do an accurate job, but not always in the same way a human would.
If they do not have native ML integration then you need to expose a service or export the data for analysis.
Once you HAVE a schema, load it into MarkLogic, and you can query the schema (and elements validated by it) directly and programmatically in ML.
If you find a "generate schema" tool that is implemented in XSLT, XQuery, or JavaScript, you may be able to import and execute it in-server.

How to invalidate parts of a hierarchy (tree) of data in Redis cache

I have some product data that I need to store multiple versions of in a Redis cache. The data is made up of JSON-serialised objects. The process of obtaining the plain (basic) data is expensive, and the process of customising it into different versions is also expensive, so I'd like to cache all versions to optimise wherever possible. The data structure looks something like this:
                              BaseProduct
                             /           \
            CustomisedProductA           CustomisedProductB
              /           \                /           \
CustomisedProductA1  CustomisedProductA2  CustomisedProductB1  CustomisedProductB2
The general idea here is:
There is a base product stored in a database.
One level of customisation can be applied to this product - e.g. information about a specific version of this product for a sales region.
A second level of customisation can be applied within that - e.g. information about this product at a particular store within a region.
The data is stored in this way because each step of the data retrieval/calculation process is expensive. The first time a particular product is retrieved for a region, there will be one set of customisations performed to make it into a region-specific product. The first time a particular product is retrieved for a store, I need to perform customisations based on the regional product to generate the store-specific product.
The problem comes in due to the fact that I may need to invalidate data in a few ways:
If the base product data changes, then the whole tree needs to be invalidated and everything needs to be regenerated. I can achieve this by storing the whole structure in a hash and deleting the hash by its key.
If the first set of customisations for a product change (i.e. the middle level), then I need to invalidate the nodes underneath this level too. For example, if the customisations for CustomisedProductA are affected by a change, I need to expire CustomisedProductA, CustomisedProductA1, and CustomisedProductA2.
If the second set of customisations for a product change (i.e. the bottom level), then that node needs to be invalidated. I can achieve this in a hash by calling HDEL key field (e.g. HDEL product CustomisedProductA:CustomisedProductA1).
My question is therefore: is there a way of representing this type of multi-level data structure, to allow for the performance of storing the data in multiple levels while enabling invalidation of only parts of the tree? Or, am I limited to expiring the entire tree (DEL key) or specific nodes (HDEL key field) but nothing in between?
There are at least three different ways of doing that, each with its own pros and cons.
The first approach is to use non-atomic, ad-hoc scanning of the tree to identify and invalidate (delete) the tree's 2nd level (the 1st set of customizations). To do that, use a hierarchical naming scheme for your Hash's fields and iterate through them using HSCAN. For example, assuming that your Hash's key name is the product's ID (e.g. ProductA), you'd use something like '0001:0001' as the field name for the first customization's first version, '0001:0002' for its second version, and so forth. Similarly, '0002:0001' would be the 2nd customization's 1st version, etc. Then, to find all of customization 42's versions, use HSCAN ProductA 0 MATCH 0042:*, HDEL the fields in the reply, and repeat until the cursor zeros.
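That loop in redis-py terms, as a sketch (key and field names follow the scheme above; note the scan plus the deletes are not atomic as a whole):

import redis

r = redis.Redis()

def invalidate_customisation(r, product_key, customisation):
    """HSCAN for '<customisation>:*' fields and HDEL them as we go."""
    cursor = 0
    while True:
        cursor, fields = r.hscan(product_key, cursor, match=f"{customisation}:*")
        if fields:
            r.hdel(product_key, *fields.keys())
        if cursor == 0:   # HSCAN is done once the cursor returns to zero
            break

invalidate_customisation(r, "ProductA", "0042")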
The opposite approach is to proactively "index" each customization's versions so you can fetch them efficiently instead of performing a full scan of the Hash. The way to go about that is using Redis' Sets: you keep a Set of all the version field names for a given product's customization. Versions can either be sequential (as in my example) or anything else, as long as they are unique. The cost is maintaining these indices: whenever you add or remove a product's customization and/or version, you'll need to keep these Sets consistent. For example, the creation of a version would be something like:
HSET ProductA 0001:0001 "<customization 1 version 1 JSON payload>"
SADD ProductA:0001 0001
Note that these two operations should be in a single transaction (i.e. use a MULTI/EXEC block or EVAL a Lua script), as sketched below. When you have this set up, invalidating a customization is just a matter of calling SMEMBERS on the relevant Set and deleting the versions it lists from the Hash (and deleting the Set itself as well). It is important to note, however, that reading all members from a large Set could be time-consuming - 1K members isn't that bad, but for larger Sets there's SSCAN.
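The HSET + SADD pair wrapped in MULTI/EXEC via a redis-py transactional pipeline (the payload string is a placeholder):

import redis

r = redis.Redis()
payload_json = '{"example": "customization 1, version 1"}'  # placeholder payload

pipe = r.pipeline(transaction=True)   # queued commands run inside MULTI/EXEC
pipe.hset("ProductA", "0001:0001", payload_json)
pipe.sadd("ProductA:0001", "0001")
pipe.execute()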
Lastly, you could consider using a Sorted Set instead of a Hash. While perhaps less intuitive in this use case, the Sorted Set will let you perform all the operations you need. The price for using it, however, is the increased complexity of O(logN) for adding/removing/reading compared to the Hash's O(1), but given the numbers the difference isn't significant.
To unleash the Sorted Set's power, you'll use lexicographical ordering, so all of the Sorted Set's members should have the same score (e.g. 0). Each product will be represented by a Sorted Set, just as with the Hash. The members of the Sorted Set are the equivalents of the Hash's fields, namely customizations' versions. The "trick" is constructing the members in a way that allows you to perform range searches (or level-2 invalidations, if you will). Here's an example of what it should look like (note that here the key ProductA isn't a Hash but a Sorted Set):
ZADD ProductA 0 0001:0001:<JSON>
To read a customization version, use ZRANGEBYLEX ProductA [0001:0001: [0001:0001:\xff and split the JSON out of the reply; to remove an entire customization, use ZREMRANGEBYLEX.
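In redis-py, those two calls might look like this (bytes literals are used for the range ends so that \xff stays a single byte rather than being UTF-8 encoded):

import redis

r = redis.Redis()

# Read one version's member(s): everything prefixed '0001:0001:'.
entries = r.zrangebylex("ProductA", b"[0001:0001:", b"[0001:0001:\xff")
payloads = [e.split(b":", 2)[2] for e in entries]   # strip the 'cccc:vvvv:' prefix

# Level-2 invalidation: drop every member of customization 0001 in one call.
r.zremrangebylex("ProductA", b"[0001:", b"[0001:\xff")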

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on: how many "index" items (or filters, if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format but, to avoid having to go through each item to get basic data, or having to create multiple CFs just to get basic data, I am curious to know whether it's a good idea to include the above "filters" as columns (compound column segments)?
Example:
Row Key          | timeUUID:data | timeUUID:country | timeUUID:source
=====================================================================
timeUUID:section | JSON Object   | USA              | example.com
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma: the columns. A compound column name with a timeUUID lets me sort and do a time-based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
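If you go the counter route, a sketch with the DataStax Python driver (keyspace, table, and column names are all hypothetical, not from the answer; the table would be created with a CQL counter column):

from cassandra.cluster import Cluster

# Hypothetical schema: CREATE TABLE hits (section text, day text,
#   hit_count counter, PRIMARY KEY (section, day));
session = Cluster(["127.0.0.1"]).connect("analytics")

# Bump the counter for this (section, day) as a log line comes in.
session.execute(
    "UPDATE hits SET hit_count = hit_count + 1 WHERE section = %s AND day = %s",
    ("foo", "20240101"),
)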

Random exhaustive (non-repeating) selection from a large pool of entries

Suppose I have a large (300-500k) collection of text documents stored in the relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single entity is never repeated, much like how StumbleUpon works.
I don't really see a way I could implement this using slow NOT IN queries with a large number of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing some algorithm that might be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database
Create a linked-list-based index for each category from the IDs of documents belonging to that category. Shuffle it.
Create a Bloom Filter containing all of the entries viewed by a particular user
Traverse the index using the iterator, using the Bloom filter to skip already-viewed items.
If you track via a table what entries the user has seen... try this. And I'm going to use MySQL because that's the quickest example I can think of, but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ("jj", 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id
where v.url_id is null
order by rand()
limit 1
This causes the database to do a one-for-one join, and you're limiting your query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this one operation, but there's no guarantee that the URL will be passed successfully to the user.
It depends on how users get their random entries.
Option 1:
A user pages through some entities and stops after a couple of them. For example, the user sees the current random entity, moves to the next one, reads it, continues like that a couple of times, and that's it.
The next time this user (or another) gets an entity from this category, the set of already-viewed entities is cleared, so you may return an already-viewed entity.
For that option I would recommend saving a (hash) set of already-viewed entity ids; every time the user asks for a random entity, randomly choose one from the DB and check that it's not already in the set.
Because the set is so small and your data is so big, the chance of picking an already-viewed id is tiny, so this will take O(1) time most of the time.
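A sketch of that retry loop (doc_ids stands in for the random DB pick; viewed is the small per-user set):

import random

def pick_unseen(doc_ids, viewed):
    """Retry random picks until one misses the small 'viewed' set."""
    while True:
        candidate = random.choice(doc_ids)
        if candidate not in viewed:   # almost always true while viewed is small
            viewed.add(candidate)
            return candidate

print(pick_unseen(list(range(100)), {3, 17, 42}))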
Option 2:
Users page through the entities, and viewed entities are saved across all users and across every visit to your page.
In that case you'll probably run through all the entities in each category, and saving all the viewed entities, plus checking whether an entity has been viewed, will take some time.
For that option I would get all the ids for the topic, shuffle them, and store them in a linked list. When you want a random not-yet-viewed entity, just take the head of the list and delete it (O(1)); see the sketch below.
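For example (a deque plays the linked list's role; the id range is a stand-in for the topic's document ids):

import random
from collections import deque

ids = list(range(1, 1001))   # stand-in for the topic's document ids
random.shuffle(ids)
queue = deque(ids)           # the shuffled "linked list"

def next_unviewed():
    return queue.popleft()   # O(1); raises IndexError once the pool is exhausted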
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
I would opt for a pseudorandom approach:
1.) Determine number of elements in category to be viewed (SELECT COUNT(*) WHERE ...)
2.) Pick a random number in range 1 ... count.
3.) Select a single document (SELECT * FROM ... WHERE [same as when counting] ORDER BY [some stable order]). Depending on the SQL dialect in use, there are different clauses that can be used to retrieve only the part of the result set you want (MySQL's LIMIT clause, SQL Server's TOP clause, etc.)
If the number of documents is large, the chance of serving the same user the same document twice is negligibly small. Using the scheme described above, you don't have to store any state information at all.
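A stateless count-then-offset sketch of those three steps (MySQL-style LIMIT/OFFSET; the table and column names are hypothetical, cur is any DB-API cursor):

import random

def random_document(cur, category):
    # 1) count the documents matching the filter
    cur.execute("SELECT COUNT(*) FROM documents WHERE category = %s", (category,))
    (total,) = cur.fetchone()
    if total == 0:
        return None
    # 2) pick a random position in range
    offset = random.randrange(total)
    # 3) fetch exactly that row under a stable order
    cur.execute(
        "SELECT * FROM documents WHERE category = %s "
        "ORDER BY id LIMIT 1 OFFSET %s",
        (category, offset),
    )
    return cur.fetchone()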
You may want to consider a NoSQL solution like Apache Cassandra. These seem to be ideally suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
edit: one of many possible solutions below:
Create a CF (column family, i.e. table) for each category (creating these on the fly is quite easy).
Add a row to each category CF for each document belonging to the category.
Whenever a user hits a document, you add a column named for that user to the row and set it to true. Obviously this table will be huge, with millions of columns, and probably quite sparsely populated, but no problem - reading this is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any row where that user's column is null.
You should get constant-time writes and reads, amazing scalability, etc., if you can accept Cassandra's "eventually consistent" model (i.e., it is not mission-critical that a user never gets a duplicate document).
I've solved similar in the past by indexing the relational database into a document oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
Store a user's past X selections in a cookie or something.
Return the last selections to the server along with the user's new criteria.
Randomly choose one of the texts satisfying the criteria until it is not a member of the user's last X selections.
Return this choice of text and update the list of the last X selections.
I would experiment to find the best value of X, but I have in mind something like, say, 16?
