I prototyped a tiny search engine with PageRank that worked on my computer. I am interested in building a Knowledge Graph on top of it, so that it returns only webpages that fit the context of the query, similar to how Google finds relevant answers to search questions. I have seen a lot of publicity around Knowledge Graphs, but not much literature and almost no pseudocode-like guidelines for building one. Does anyone know good references on how such Knowledge Graphs work internally, so that I do not have to work out my own model of a KG from scratch?
Knowledge graph is a buzzword. It is a sum of models and technologies put together to achieve a result.
The first stop on your journey is natural language processing, ontologies and text mining. It is a wide field of artificial intelligence; go here for a research survey on the field.
Before building your own models, I suggest you try different standard algorithms using dedicated toolboxes such as gensim. You will learn about tf-idf, LDA, document feature vectors, etc.
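To get a feel for what tf-idf computes before reaching for a toolbox, here is a minimal pure-Python sketch. It is not gensim's API: the function name `tf_idf` and the toy corpus are made up for illustration, and it assumes the plain `tf * log(N / df)` weighting on non-empty, pre-tokenized documents (real toolkits usually add smoothing or normalisation).

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            # Term frequency times inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus (hypothetical example data).
corpus = [
    "pagerank ranks web pages by links".split(),
    "knowledge graph links entities and facts".split(),
    "web search returns pages".split(),
]
weights = tf_idf(corpus)
```

A term that occurs in only one document ("pagerank") ends up with a higher weight than one shared by several documents ("links"), which is the property that makes tf-idf useful for ranking.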
I am assuming you want to work with text data. Searching for images using other images is a different problem, and the same goes for audio.
Building models is only the first step. The most difficult part of Google's knowledge graph is actually scaling to billions of requests each day.
A good processing pipeline can be built "easily" on top of Apache Spark, "the current-gen Hadoop". It provides resilient distributed datasets (RDDs), a fault-tolerant distributed data abstraction that you need if you want to scale.
If you want to keep your data as a graph, as in graph theory (like PageRank), for live querying, I suggest you use Bulbs, a framework that is "Like an ORM for graphs, but instead of SQL, you use the graph-traversal language Gremlin to query the database". You can switch the backend from Neo4j to OpenRDF (useful if you work with ontologies), for instance.
For graph analytics you can use Spark's GraphX module or GraphLab.
Hope it helps.
I know I'm really late but first to clarify some terminology: Knowledge Graph and Ontology are similar (I'm talking in the Semantic Web paradigm). In the semantic web stack the foundation is RDF which is a language for defining graphs as triples (Subject, Predicate, Object). RDFS is a layer on top of RDF. It defines a meta-model, e.g., predicates such as rdf:type and nodes such as rdfs:Class. Although RDFS provides a meta-model there is no logical foundation for it so there are no reasoners that can validate the model or do further reasoning on it. The layer on top of RDFS is OWL (Web Ontology Language). That has a formal semantics defined by Description Logic which is a decidable subset of First Order Logic. It has more predefined nodes and links such as owl:Class, owl:ObjectProperty, etc. So when people use the term ontology they typically mean an OWL model. When they use the term Knowledge Graph it may refer to an ontology defined in OWL (because OWL is still ultimately an RDF graph) or it may mean just a graph in RDF/RDFS.
I said that because IMO the best way to build a knowledge graph is to define an ontology and then use various semantic web tools to load data (e.g., from spreadsheets) into the ontology. The best tool to start with IMO is the Protege ontology editor from Stanford. It's free and for a free open source tool very reliable and intuitive. And there is a good tutorial for how to use Protege and learn OWL as well as other Semantic Web tools such as SPARQL and SHACL. That tutorial can be found here: New Protege Pizza Tutorial (disclosure: that links to my site, I wrote the tutorial). If you want to get into the lower levels of the graph you probably want to check out a triplestore. It is a graph database designed for OWL and RDF models. The free version of Franz Inc's AllegroGraph triplestore is easy to use and supports 5M triples. Another good triplestore that is free and open source is part of the Apache Jena framework.
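To make the (Subject, Predicate, Object) idea concrete: at its core, a triplestore answers pattern queries over such statements. The following is only a toy in-memory sketch in plain Python, not the API of Jena, AllegroGraph or Protege. The `ex:` names, the `match` helper and the data are hypothetical, and prefixes are written as plain strings rather than real IRIs.

```python
# Hypothetical example data: each statement is a (subject, predicate, object) triple.
TRIPLES = [
    ("ex:Pizza", "rdf:type", "rdfs:Class"),
    ("ex:Margherita", "rdfs:subClassOf", "ex:Pizza"),
    ("ex:myLunch", "rdf:type", "ex:Margherita"),
    ("ex:myLunch", "ex:hasTopping", "ex:Mozzarella"),
]

def match(triples, s=None, p=None, o=None):
    """Yield triples matching the pattern; None acts as a wildcard."""
    for t in triples:
        if ((s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)):
            yield t

# Everything stated about ex:myLunch.
about_lunch = list(match(TRIPLES, s="ex:myLunch"))
# All (instance, class) pairs asserted with rdf:type.
typed = [(s, o) for s, _, o in match(TRIPLES, p="rdf:type")]
```

Note that this lookup does no inference: it will not conclude that `ex:myLunch` is an `ex:Pizza` from the subclass statement. Drawing that kind of conclusion is the job of a reasoner working over the ontology.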
Can someone help me identify the top-down, bottom-up, and hybrid data warehouse design methodologies as mentioned here in Wikipedia in the following diagram? I am interested in understanding how the diagram differs depending on each design methodology.
The diagram is too generic to enable identification of a methodology. Further, the Wikipedia article is surprisingly out of date.
There are four mainstream DW methodologies in common use today - Dimensional (Kimball), 3NF (Inmon), Data Vault (Linstedt) and Anchor Modelling (Ronnback). All could be represented within that diagram.
The issue of top-down or bottom-up in this article is centred around data marts. There is no requirement that marts are stored in a separate database, or even in a DBMS. In the context of your diagram they might exist in either the data warehouse or the analysis tool. In any case, the diagram does not give any indication of what came first, so you can't infer an approach.
In order to identify the methodology (Kimball, etc.) that was used to design the warehouse you'd need to see its data model. It would be immediately apparent from the model.
To identify the order in which components were delivered you'd need to see some sort of timeline, project plan, etc.
I want to build a web application that lets users upload documents, videos, images, music, and then give them an ability to search them. Think of it as Dropbox + Semantic Search.
When a user uploads a new file, e.g. Document1.docx, how could I automatically generate tags based on the content of the file? In other words, no user input should be needed to determine what the file is about. Suppose Document1.docx is a research paper on data mining. Then when a user searches for data mining, or research paper, or document1, that file should be returned in the search results, since data mining and research paper will most likely be auto-generated tags for that document.
1. Which algorithms would you recommend for this problem?
2. Is there a natural language library that could do this for me?
3. Which machine learning techniques should I look into to improve tagging precision?
4. How could I extend this to video and image automatic tagging?
Thanks in advance!
The most common unsupervised machine learning model for this type of task is Latent Dirichlet Allocation (LDA). This model automatically infers a collection of topics over a corpus of documents based on the words in those documents. After running LDA on your set of documents, each topic is a probability distribution over words and each document is a mixture of topics, so for a search term you can retrieve the documents whose topics give that term the highest probability.
There have been some extensions to images and music as well, see http://cseweb.ucsd.edu/~dhu/docs/research_exam09.pdf.
LDA has several efficient implementations in several languages:
many implementations from the original researchers
http://mallet.cs.umass.edu/, written in Java and recommended by others on SO
PLDA: a fast, parallelized C++ implementation
These guys propose an alternative to LDA.
Automatic Tag Recommendation Algorithms for Social Recommender Systems
http://research.microsoft.com/pubs/79896/tagging.pdf
Haven't read thru the whole paper but they have two algorithms:
Supervised learning version. This isn't that bad. You can use Wikipedia to train the algorithm
"Prototype" version. Haven't had a chance to go thru this but this is what they recommend
UPDATE: I've researched this some more and I've found another approach. Basically, it's a two-stage approach that's very simple to understand and implement. While too slow for 100,000s of documents, it (probably) has good performance for 1000s of docs (so it's perfect for tagging a single user's documents). I'm going to try this approach and will report back on performance/usability.
In the mean time, here's the approach:
1. Use TextRank as per http://qr.ae/36RAP to generate a tag list for a single document, independently of all other documents.
2. Use the algorithm from "Using Machine Learning to Support Continuous Ontology Development" (https://www.researchgate.net/publication/221630712_Using_Machine_Learning_to_Support_Continuous_Ontology_Development) to integrate the tag list (from step 1) into the existing tag list.
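The first stage can be sketched roughly as follows. This is a simplified, TextRank-style keyword ranker written from the general idea (word co-occurrence graph plus PageRank), not the exact procedure from the linked post. The stopword list, the `window` and `damping` defaults and the function name are assumptions, and it skips the part-of-speech filtering that a fuller implementation would use. The second stage (merging tags into an existing tag list) is not covered.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is", "for", "with", "from"}

def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    # Link each word to the next `window` words that follow it.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)
    # Power iteration: each word shares its score equally among its neighbours.
    n = len(neighbors)
    scores = {w: 1.0 / n for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) / n
            + damping * sum(scores[m] / len(neighbors[m]) for m in neighbors[w])
            for w in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

text = ("Data mining extracts patterns from data. "
        "Data mining research applies data mining to documents.")
tags = textrank_keywords(text, top_k=2)
```

On this toy text the two best-connected words, "data" and "mining", come out on top. Each document is processed on its own, which matches the point that stage one is independent of other documents.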
Text documents can be tagged using this keyphrase extraction algorithm/package.
http://www.nzdl.org/Kea/
Currently it supports a limited range of document types (agricultural and medical, I guess), but you can train it according to your requirements.
I'm not sure how the image/video part would work out, unless you're doing very accurate object detection (which has its own shortcomings). How are you planning to do it?
You want Doc-Tags (https://www.Doc-Tags.com), a commercial product that automatically generates contextually accurate document tags without supervision. Its built-in reporting functionality makes the product a lightweight document management system.
For developers who want to customize their own approach, the source code is available (very cheap) and the back-end service xAIgent (https://xAIgent.com) is very inexpensive to use.
I posted a blog article today to answer your question.
http://scottge.net/2015/06/30/automatic-image-and-video-tagging/
There are basically two approaches to automatically extract keywords from images and videos.
Multiple Instance Learning (MIL)
Deep Neural Networks (DNN), Recurrent Neural Networks (RNN), and the variants
In the above blog article, I list the latest research papers to illustrate the solutions. Some of them even include demo site and source code.
Thanks, Scott
I've tried making a hierarchy of all forms of knowledge, including physical objects, numbers, procedures, etc. How could this be improved? How would a sentence such as "Jack is producing music sitting on a tree" fall into this chart? Jack would go into humans and tree into place, but where would music go?
I don't have any experience in this field, but looking at it in a purely linguistic and logical way, you could put "producing" entirely into work. Or maybe branch out your "objects" into different kinds, like "products"->{"tech","healthcare"...} and "art objects"->{"music","paintings"...}.
But yes, you could argue that this is a little too specific, made just to fit the sentence.
I do not have a direct answer to your question, but I think I know where you could get more information on this topic. If I am not mistaken, you are trying to develop an ontology.
In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships among those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.
The most general ontology is developed at Cycorp. Although the whole ontology is not freely available, a (large) subset of the ontology is available as OpenCyc. They have an installation prepared that lets you query the ontology via a web browser. Maybe you should explore this ontology and get some new ideas about where to go next.
I want to run some analysis on networked data that has multiple modes (i.e. multiple types of network nodes) and multiplex relations (i.e. multiple types of network edges).
The analysis is probably about SNA or applying any algorithm from graph theory, e.g. tie strength, centrality, betweenness, node distance, block, cluster, etc.
The source data is rather unstructured, therefore I should at first think about how I represent, store, and retrieve the data.
Following are some ideas. I would appreciate any feedback or further suggestion.:)
I know that there are already some great NoSQL databases for this kind of application, for example Neo4J and InfoGrid. But for extensibility reasons (e.g. licence, web standards...) I would prefer to use RDF to store and represent my data. The tools to use would be SESAME or JENA.
The idea of representing network/graph data with RDF is trivial.
For example:
Network/Graph data
*Alice* ----lend 100USD----> *Bob* ----- likes ----> *Skiing*
represented with RDF
*Alice* --src--> *lend_relation* <---target--- *Bob* ---likes---> *Skiing*
|
has_value
\|/
*100USD*
[Alice src lend_relation]
[Bob target lend_relation]
[lend_relation has_value 100USD]
[Bob likes Skiing]
However, the problem is that RDF, like SPARQL, lacks a graph-model perspective.
It is not efficient to traverse between nodes or to find the (shortest) distance between them with an RDF query.
That must be done with extra analysis tools, for example JUNG or JGraphT,
and I must first construct a subgraph by querying the RDF store and then convert it into the data model used by JUNG or JGraphT. If I want extra visualization (from neither JUNG nor JGraphT), then I must construct yet another data model for the visualization toolkit.
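The conversion step described here can be sketched in a few lines. This is only an illustration in plain Python rather than JUNG or JGraphT: it takes the four triples from the example, builds an undirected adjacency map (an assumption, since predicate labels and edge direction are dropped), and runs a breadth-first search. The helper names are hypothetical.

```python
from collections import defaultdict, deque

# The triples from the example (the lend relation is a node of its own).
TRIPLES = [
    ("Alice", "src", "lend_relation"),
    ("Bob", "target", "lend_relation"),
    ("lend_relation", "has_value", "100USD"),
    ("Bob", "likes", "Skiing"),
]

def to_adjacency(triples):
    """Build an undirected adjacency map, ignoring predicate labels."""
    adj = defaultdict(set)
    for s, _, o in triples:
        adj[s].add(o)
        adj[o].add(s)
    return adj

def shortest_path(adj, start, goal):
    """Breadth-first search; returns the list of nodes, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(adj[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

adj = to_adjacency(TRIPLES)
path = shortest_path(adj, "Alice", "Skiing")
```

One consequence of this representation is visible in the result: because the lend relation is itself a node, Alice and Bob are two hops apart rather than one, which affects any distance or centrality measure computed on the converted graph.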
I don't know if that is a clear or efficient integration.
thanks again for any suggestion!
If you want to do network analysis of your RDF data with SPARQL, have a look at SPARQL 1.1 Property Paths. I believe this is already implemented in Jena/ARQ (ARQ - Property Paths).
Property Paths, from the new SPARQL spec, let you query the RDF data model with graph patterns that are more expressive than the ones you could define in SPARQL 1.0.
With this feature plus some logic at the application level you might be able to implement some interesting network analysis over your data.
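As a rough illustration of what a property path expresses, a one-or-more path such as `:knows+` asks for every node reachable through a chain of `knows` edges. The sketch below emulates that semantics over an in-memory triple list in plain Python. It is not ARQ or any SPARQL engine; the `knows` predicate, the data and the function name are hypothetical.

```python
def one_or_more(triples, start, predicate):
    """Nodes reachable from `start` via one or more `predicate` edges.

    Roughly what this SPARQL 1.1 query would return:
        SELECT ?x WHERE { :Alice :knows+ ?x }
    """
    reached, frontier = set(), {start}
    while frontier:
        # Follow one more `predicate` edge from every node on the frontier.
        step = {o for s, p, o in triples if p == predicate and s in frontier}
        frontier = step - reached
        reached |= frontier
    return reached

# Hypothetical example data with a cycle Alice -> Bob -> Carol -> Alice.
TRIPLES = [
    ("Alice", "knows", "Bob"),
    ("Bob", "knows", "Carol"),
    ("Carol", "knows", "Alice"),
    ("Carol", "likes", "Skiing"),
]
reachable = one_or_more(TRIPLES, "Alice", "knows")
```

Property paths tell you whether a connection exists, not how long it is, so measures such as shortest distance or betweenness still need the application-level logic mentioned above.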
I am a bit new to the semantic web and especially to DBpedia. As much as I have read about it, I could not find any information on whether it is possible to determine the weight of a link between DBpedia objects. For example, is it possible to determine that PHP is more closely related to Symfony than Ruby on Rails is, even though both are related?
OWLIM has an algorithm called RDF Rank, which is mentioned in this presentation, but the presentation doesn't say how it works. I couldn't find any articles or papers with a quick Google search, but it is probably worth searching more thoroughly.
If all you want to know is how closely linked some concepts are, it may be easiest to set up OWLIM yourself, import the DBpedia dataset dump, and run the algorithm so you can retrieve this information.