This is strictly related to the graph algorithm (not SEO or anything). I'm interested in knowing whether there are other algorithms out there that use only the structure of a graph (not content like keywords, etc.) to make inferences.
So, for example, if you're given a large graph full of nodes, how can you make inferences assuming you have no idea what the values within the nodes actually mean (for example, PageRank knows who's linking (edges) to whom and doesn't know anything about the content itself)?
This is not exclusive to web search; I'm interested in anything that uses graph structure to make inferences.
As well as HITS [as suggested by #larsmans], there is also SALSA, which is considered more "stable" than HITS [and thus less vulnerable to spammers].
You are also encouraged to have a look at this survey of ranking algorithms.
The main alternative to PageRank is HITS.
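To make the idea concrete, here is a minimal sketch of HITS as a plain power iteration over an edge list. It uses only link structure, as the question asks. The function name `hits`, the iteration count, and the toy graph are assumptions for illustration, not taken from any particular library.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) scores for a directed graph given as (src, dst) pairs."""
    nodes = {n for edge in edges for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    def normalize(scores):
        # Scale so the squared scores sum to 1; this keeps values from blowing up.
        norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
        return {n: v / norm for n, v in scores.items()}

    for _ in range(iterations):
        # A node's authority is the sum of the hub scores of nodes linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        auth = normalize(new_auth)

        # A node's hub score is the sum of the authority scores of nodes it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        hub = normalize(new_hub)

    return hub, auth


if __name__ == "__main__":
    hub, auth = hits([("a", "c"), ("b", "c"), ("a", "d")])
    print(max(auth, key=auth.get), max(hub, key=hub.get))
```

In the toy graph, `c` is linked to by both `a` and `b`, so it ends up as the strongest authority, while `a` links to both authorities and ends up as the strongest hub.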
Another alternative to PageRank is OPIC.
I'm new to the graph database scene, looking into Neo4j and learning Cypher. We're trying to model a graph database, and it's a fairly simple one: we have users and we have movies. Users can VIEW movies, RATE movies, and create playlists, and playlists can HAVE movies.
The question is regarding the Super Node performance issue. And I will quote something from a very good book I am currently reading - Learning Neo4j by Rik Van Bruggen, so here it is:
A very interesting problem then occurs in datasets where some parts of the graph
are all connected to the same node. This node, also referred to as a dense node or a
supernode, becomes a real problem for graph traversals because the graph database
management system will have to evaluate all of the connected relationships to
that node in order to determine what the next step will be in the graph traversal.
The solution to this problem proposed in the book is to have a Meta node with 100 connections to it, and to link the 101st connection to a new Meta node that is itself linked to the previous Meta node.
I have seen a blog post from the official Neo4j Blog saying that they will fix this problem in the upcoming future (the blog post is from January 2013) - http://neo4j.com/blog/2013-whats-coming-next-in-neo4j/
More exactly they say:
Another project we have planned around “bigger data” is to add some specific optimizations to handle traversals across densely-connected nodes, having very large numbers (millions) of relationships. (This problem is sometimes referred to as the “supernodes” problem.)
What are your opinions on this issue? Should we go with the Meta node fanning-out pattern or go with the basic relationship that every tutorial seems to be using? Any other suggestions?
UPDATE - October 2020. This article is the best source on this topic, covering all aspects of super nodes
(my original answer below)
It's a good question. This isn't really an answer, but why shouldn't we be able to discuss this here? Technically I think I'm supposed to flag your question as "primarily opinion based" since you're explicitly soliciting opinions, but I think it's worth the discussion.
The boring but honest answer is that it always depends on your query patterns. Without knowing what kinds of queries you're going to issue against this data structure, there's really no way to know the "best" approach.
Supernodes are problems in other areas as well. Graph databases are sometimes very difficult to scale, because the data in them is hard to partition. If this were a relational database, we could partition vertically or horizontally. In a graph DB, when you have supernodes, everything is "close" to everything else. (An Alaskan farmer likes Lady Gaga, and so does a New York banker.) More so than just graph traversal speed, supernodes are a big problem for all sorts of scalability.
Rik's suggestion boils down to encouraging you to create "sub-clusters" or "partitions" of the super-node. For certain query patterns, this might be a good idea, and I'm not knocking the idea, but I think hidden in here is the notion of a clustering strategy. How many meta nodes do you assign? How many max links per meta-node? How did you go about assigning this user to this meta node (and not some other)? Depending on your queries, those questions are going to be very hard to answer, hard to implement correctly, or both.
A different (but conceptually very similar) approach is to clone Lady Gaga about a thousand times, and duplicate her data and keep it in sync between nodes, then assert a bunch of "same as" relationships between the clones. This isn't that different than the "meta" approach, but it has the advantage that it copies Lady Gaga's data to the clone, and the "Meta" node isn't just a dumb placeholder for navigation. Most of the same problems apply though.
Here's a different suggestion though: you have a large-scale many-to-many mapping problem here. It's possible that if this is a really huge problem for you, you'd be better off breaking this out into a single relational table with two columns (from_id, to_id), each referencing a neo4j node ID. You then might have a hybrid system that's mostly graph (but with some exceptions). Lots of tradeoffs here; of course you couldn't traverse that rel in cypher at all, but it would scale and partition much better, and querying for a particular rel would probably be much faster.
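As a rough sketch of that hybrid idea, the snippet below keeps the dense relationship in a two-column relational table, using SQLite from the Python standard library. The table name `likes`, the helper functions, and the integer IDs are hypothetical; in a real setup the IDs would be whatever stable identifiers you keep for your Neo4j nodes.

```python
import sqlite3


def make_edge_store():
    """Create an in-memory table holding one dense relationship type."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE likes ("
        " from_id INTEGER NOT NULL,"
        " to_id INTEGER NOT NULL,"
        " PRIMARY KEY (from_id, to_id))"
    )
    # Secondary index so lookups by the supernode side do not scan the table.
    conn.execute("CREATE INDEX likes_by_target ON likes (to_id)")
    return conn


def add_like(conn, from_id, to_id):
    conn.execute("INSERT OR IGNORE INTO likes VALUES (?, ?)", (from_id, to_id))


def has_like(conn, from_id, to_id):
    """Check one specific relationship without touching the others."""
    row = conn.execute(
        "SELECT 1 FROM likes WHERE from_id = ? AND to_id = ?", (from_id, to_id)
    ).fetchone()
    return row is not None


def fans_of(conn, to_id, limit=10):
    """Return a bounded page of nodes pointing at the given target."""
    rows = conn.execute(
        "SELECT from_id FROM likes WHERE to_id = ? ORDER BY from_id LIMIT ?",
        (to_id, limit),
    )
    return [r[0] for r in rows]


if __name__ == "__main__":
    conn = make_edge_store()
    for user_id in range(1, 1001):
        add_like(conn, user_id, 42)  # 42 stands in for the supernode's ID
    print(has_like(conn, 7, 42), fans_of(conn, 42, limit=3))
```

The point of the sketch is only that checking or paging one relationship becomes an indexed lookup; the tradeoff described above (no Cypher traversal over this relationship) still applies.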
One general observation here: whether we're talking about relational, graph, documents, K/V databases, or whatever -- when the databases get really big, and the performance requirements get really intense, it's almost inevitable that people end up with some kind of a hybrid solution with more than one kind of DBMS. This is because of the inescapable reality that all databases are good at some things, and not good at others. So if you need a system that's good at most everything, you're going to have to use more than one kind of database. :)
There is probably quite a bit Neo4j can do to optimize in these cases, but it would seem to me that the system would need some kinds of hints on access patterns in order to do a really good job at that. Of the 2,000,000 relations present, how do the endpoints best cluster? Are older relationships more important than newer, or vice versa?
Re. the Neo4j blog, dense node support should be enhanced in Neo4j 2.1 (and above), see also http://neo4j.com/blog/neo4j-2-1-graph-etl/
(disclaimer: not an answer, but some discussion)
The 2013 Neo4j blog post you mentioned links to this GitHub commit, where the intended problem scope and its solution are discussed. To summarize, it does not address the general supernode issue. Instead, it alleviates the issue when, among the multiple relationship types (and directions) that a supernode has, some of the types (directions) happen to have disproportionately fewer edges than the others. The engine is able to filter based on types and directions.
A more generic solution is the vertex-centric approach from Titan (https://stackoverflow.com/a/21385213/1311956), which sorts the edges by one property or a composite of properties, resulting in O(log(E)) search performance, where E is the number of edges in/out of the supernode.
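The idea behind a vertex-centric index can be sketched in a few lines: keep each node's edges sorted by a property and use binary search to find the relevant slice. This is an illustration of the principle only, not Titan's actual implementation; the class name `SortedEdgeList` and the timestamp-like sort key are assumptions.

```python
import bisect


class SortedEdgeList:
    """Edges of a single (super)node, kept sorted by one property."""

    def __init__(self):
        self._keys = []     # sort-key values, ascending
        self._targets = []  # target node IDs, aligned with _keys

    def add(self, key, target):
        # Binary search finds the position; the list insert itself is O(E)
        # here, whereas a real index would use a tree or similar structure.
        i = bisect.bisect_right(self._keys, key)
        self._keys.insert(i, key)
        self._targets.insert(i, target)

    def range(self, low, high):
        """Targets whose key is in [low, high], located in O(log E)."""
        lo = bisect.bisect_left(self._keys, low)
        hi = bisect.bisect_right(self._keys, high)
        return self._targets[lo:hi]


if __name__ == "__main__":
    edges = SortedEdgeList()
    for key, target in [(30, "c"), (10, "a"), (20, "b"), (40, "d")]:
        edges.add(key, target)
    print(edges.range(15, 30))
```

A traversal that only cares about, say, recent edges can then skip the rest of the supernode's relationships instead of evaluating all of them.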
Neo4j has the concept of an index on relationships. Unlike the vertex-centric approach of Titan, the index is global. However, the relationship index is a legacy feature in Neo4j. This is discussed in another Stack Overflow thread.
Another issue with supernodes is storage, which adds I/O cost.
I have to manually go through a long list of terms (~3500) which have been entered by users through the years. Beside other things, I want to reduce the list by looking for synonyms, typos and alternate spellings.
My work will be much easier if I can group the list into clusters of possible typos before starting. I was imagining to use some metric which can calculate the similarity to a term, e.g. in percent, and then cluster everything which has a similarity higher than some threshold. As I am going through it manually anyway, I don't mind a high failure rate, if it can keep the whole thing simple.
Ideally, there exists some easily available library to do this for me, implemented by people who know what they are doing. If there is no such, then at least one calculating a similarity metric for a pair of strings would be great, I can manage the clustering myself.
If this is not available either, do you know of a good algorithm which is simple to implement? I was first thinking that a Hamming distance divided by word length would be a good metric, but noticed that while it will catch substituted letters, it won't handle deletions and insertions well (ptgs-1 will be caught as very similar to ptgs/1, but hematopoiesis won't be caught as very similar to haematopoiesis).
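For comparison, an edit distance that also counts insertions and deletions (Levenshtein distance) handles both of the examples above. The following is a minimal sketch of the standard dynamic-programming formulation; dividing by the longer length to get a similarity between 0 and 1 is one possible normalization, chosen here as an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as the strings diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("ptgs-1", "ptgs/1"))                 # 1
    print(levenshtein("hematopoiesis", "haematopoiesis"))  # 1
    print(similarity("mouse", "house"))                    # 0.8
```

Both pairs from the question come out at distance 1, so a single threshold on the normalized value would catch the substitution and the insertion alike.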
As for the requirements on the library/algorithm: it has to rely completely on spelling. I know that the usual NLP libraries don't work this way, but:
there is no full text available for it to consider context;
it can't use a dictionary corpus of words, because the terms are far outside of any everyday language, frequently abbreviations of highly specialized terms.
Finally, I am most familiar with C# as a programming language, and I already have a C# pseudoscript which does some preliminary cleanup. If there is no one-step solution (feed list in, get grouped list out), I will prefer a library I can call from within a .NET program.
The whole thing should be relatively quick to learn for somebody with almost no previous knowledge in information retrieval. This will save me maybe 5-6 hours of manual work, and I don't want to spend more time than that in setting up an automated solution. OK, maybe up to 50% longer if I get the chance to learn something awesome :)
The question: What should I use, a library, or an algorithm? Which ones should I consider? If what I need is a library, how do I recognize one which is capable of delivering results based on spelling alone, as opposed to relying on context or dictionary use?
edit To clarify, I am not looking for actual semantic relatedness the way search or recommendation engines need it. I need to catch typos. So, I am looking for a metric by which mouse and rodent have zero similarity, but mouse and house have a very high similarity. And I am afraid that tools like Lucene use a metric which gets these two examples wrong (for my purposes).
Basically you are looking to cluster terms according to Semantic Relatedness.
One (hard) way to do it is to follow the approach of Gabrilovich and Markovitch.
A quicker way consists of the following steps:
Download a Wikipedia dump and an open source Information Retrieval library such as Lucene (or Lucene.NET).
Index the files.
Search each term in the index - and get a vector - denoting how relevant the term (the query) is for each document. Note that this will be a vector of size |D|, where |D| is the total number of documents in the collection.
Cluster your vectors with any clustering algorithm. Each vector represents one term from your initial list.
If you are interested only in "visual" similarity (words that are written similarly to each other), then you can settle for Levenshtein distance, but it won't be able to give you the semantic relatedness of terms. For example, you won't be able to relate "fall" and "autumn".
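If spelling-based similarity is all that is needed, grouping the terms can be sketched as simple threshold clustering: any two terms whose similarity reaches the threshold end up in the same group. As a stand-in for a Levenshtein-based score, this sketch uses `difflib.SequenceMatcher.ratio()` from the Python standard library, which is a different (but also purely spelling-based) measure. The function names and the 0.8 threshold are assumptions to be tuned on real data.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Spelling-only similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def cluster(terms, threshold=0.8, similarity=ratio):
    """Group terms so that any pair at or above the threshold shares a cluster."""
    parent = list(range(len(terms)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once; quadratic, which is fine for a few thousand terms.
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if similarity(terms[i], terms[j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, term in enumerate(terms):
        groups.setdefault(find(i), []).append(term)
    return list(groups.values())


if __name__ == "__main__":
    terms = ["hematopoiesis", "haematopoiesis", "ptgs-1", "ptgs/1", "mouse"]
    for group in cluster(terms):
        print(group)
```

Because clusters are merged transitively, a chain of near-matches can pull loosely related terms together; for a list that is reviewed by hand anyway, that is usually an acceptable failure mode. Any other pairwise score can be passed in through the `similarity` parameter.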
A new project with some interesting requirements has arrived on my desk. I need to develop a searchable directory of businesses, with a focus on delivering relevant results based on arbitrary search queries. The businesses can be of any niche; there's no one area that is more represented than another.
When googling for things like "search algorithm" or "content relevance algorithm," all I get are references to Google's "Mystical Algorithm of the Old Gods" and SEO firms.
Does the relevance value of MySQL's full text Match() function have what it takes for the task? I've never used it, but I'm definitely going to do some testing. Also, since this will largely be a human edited directory, I can assume that we can add weighted factors like tagging and categories. What would be a good way to combine these factors with MySQL's Match() relevancy?
I'm also open to ideas that I've not discussed here.
For an example of information-retrieval-based techniques, look up TF-IDF or BM25.
For machine-learning-based techniques, look up RankNet and its variants from MSR.
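To give a feel for the information-retrieval route, here is a minimal sketch of BM25 scoring over a handful of short business descriptions. It uses one common variant of the BM25 formula with typical default parameters (`k1=1.5`, `b=0.75`); the whitespace tokenizer, the function name, and the sample listings are assumptions for illustration only.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 relevance score per document for the given query."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight than terms found in most documents.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates, and long documents are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    listings = [
        "joe's pizza and pasta",
        "city plumbing services",
        "pizza pizza express",
    ]
    print(bm25_scores("pizza", listings))
```

Hand-edited signals such as tags or categories could be folded in by adding a weighted bonus to each score, though how to weight them is something to tune against real queries.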
If you have hand-edited data, have a look at Oracle Text search. In one of my previous projects we had some good results with it.
I was not directly involved in the database setup, but I know that the results were very welcome. (Before this they had just keyword-based search.)
Use a search engine like Solr to index the data. You can still use MySQL to hold the data, but for searches use a search engine.
I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.
A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).
I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html It looks extremely, extremely complex and I am not sure I am capable of writing it.
I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not very helpful.
I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there documentation of an algorithm already that features these, or an alternative algorithm with good memory use & speed?
I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.
The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.
As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.
My advice is to use a standard Trie, with an array index stored in each node. Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to each END_OF_WORD node in a traditional DAWG representing many, many words.
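A minimal sketch of that suggestion follows: a plain trie in which each end-of-word node stores an index into a separate data array. The class and method names are assumptions; the layout simply follows the description above.

```python
class TrieNode:
    __slots__ = ("children", "index")

    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.index = None   # position in the data array if a word ends here


class IndexedTrie:
    """Incrementally buildable trie that maps each word to associated data."""

    def __init__(self):
        self.root = TrieNode()
        self.data = []  # payloads, addressed by the index stored in each node

    def insert(self, word, payload):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.index is None:
            # First time this word is seen: append its payload.
            node.index = len(self.data)
            self.data.append(payload)
        else:
            # Word already present: overwrite its payload in place.
            self.data[node.index] = payload

    def lookup(self, word):
        """Return the payload stored for word, or None if it is absent."""
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return None if node.index is None else self.data[node.index]


if __name__ == "__main__":
    trie = IndexedTrie()
    trie.insert("car", {"count": 1})
    trie.insert("cart", {"count": 5})
    print(trie.lookup("car"), trie.lookup("ca"))
```

Words can be added at any time, and a lookup returns the associated data rather than just a yes/no answer. The cost compared with a DAWG is memory, since shared suffixes are not merged.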
Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.
I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.
Java
For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.
They have some good examples to get you going quickly and there's usually example code to get you started with most problems.
They have a DAG example with a link at the bottom to the full source code.
C++
If you're using C++, a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.
You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.
I'm suggesting this for a few reasons:
I really don't have a full understanding of your result.
Definitely incremental to build.
Leaf nodes can contain any data you wish.
Subjectively, a simple algorithm.
We discussed Google's PageRank algorithm in my algorithms class. What we discussed was that the algorithm represents webpages as a graph and puts them in an adjacency matrix, then does some matrix tweaking.
The only thing is that in the algorithm we discussed, if I link to a webpage, that webpage is also considered to link back to me. This seems to make the matrix multiplication simpler. Is this still the way that PageRank works? If so, why doesn't everyone just link to slashdot.com, yahoo.com, and microsoft.com just to boost their page rankings?
If you read the PageRank paper, you will see that links are not bi-directional, at least for the purposes of the PageRank algorithm. Indeed, it would make no sense if you could boost your page's PageRank by linking to a highly valued site.
If you link to a web page, that web page gets its PageRank increased according to your site's PageRank.
It doesn't work the other way around. Links are not bidirectional. So if you link to slashdot, you won't get any increase in PageRank; if slashdot links to you, you will.
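The direction of links can be checked with a small power-iteration sketch of PageRank. This is a simplified textbook formulation, not Google's production algorithm; the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, and the three-page toy web are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to (directed)."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Rank flows only along outgoing links, from linker to target.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    with_link = pagerank({"me": ["slashdot"], "other": ["slashdot"], "slashdot": []})
    without_link = pagerank({"me": [], "other": ["slashdot"], "slashdot": []})
    print(with_link["slashdot"] > with_link["me"])
    print(with_link["me"] < without_link["me"])
```

In this toy setup, linking to `slashdot` raises `slashdot`'s score, not the linker's. Under the dangling-page assumption used here, the linker's own score actually ends up lower than if it had not linked out at all.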
It's a mystery beyond what we know about the beginnings of BackRub and the paper that avi linked.
My favorite (personal) theory involves lots and lots of hamsters with wheel revolutions per minute heavily influencing the rank of any particular page. I don't know what they give the hamsters .. probably something much milder than LSD.
See the paper "The $25,000,000,000 Eigenvector: The Linear Algebra Behind Google"
http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf