Social network data sets that contain a community structure - social-networking

I am working on a set of spectral clustering methods that can be used to detect community structure in a social network. I tried this on a very small data set (Zachary's karate club) that was available to me. I am now looking for some large networks on which I can check the performance of the algorithm.
I tried looking at SNAP (snap.stanford.edu) but could not find a data set that contains a well-defined community structure.
Is there any data set I can use for this purpose?
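For context, this is roughly how I evaluate the method on the karate club graph (a minimal sketch assuming networkx and scikit-learn; for a larger network I would swap in a data set that ships with ground-truth community labels):

import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn.metrics import normalized_mutual_info_score

# Load the karate club graph and its adjacency matrix.
G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

# Spectral clustering on the precomputed affinity (adjacency) matrix.
labels_pred = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(A)

# Ground-truth split ("Mr. Hi" vs "Officer") is stored on the nodes.
labels_true = [0 if G.nodes[v]["club"] == "Mr. Hi" else 1 for v in G.nodes]

print("NMI:", normalized_mutual_info_score(labels_true, labels_pred))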

The 'Social Network Analysis Interactive Dataset Library' at http://www.growmeme.com/sna/visual has over 150 datasets and the ability to filter by size and by whether community information is present (among other things).

Recommendation for Hypergraph DBMS with visual editor (or at least viewer)?

Can anyone point me in a promising direction or to a specialized community? I would like to create a graph representing domain knowledge with the following requirements:
allow a true hypergraph: native reification = statements about statements = n-ary relationships = edges spanning more than 3 vertices (subject-predicate-object logic is not sufficient; modelling a hyperedge as a vertex with ordinary binary edges is over-complicated).
the result can be accessed / queried as semantic data (URI or similar; query by SPARQL, Graql, Gremlin or the like)
server-based / multiple users contribute vertices + edges (no desktop system like non-web Protege).
a graphic editor available for creating, developing, and viewing the graph: content is to be generated by non-IT/graph specialists who understand the visual representation of semantic graphs but cannot reliably write code. Ready to install, not an API/library for coding it myself (I am not a web developer and do not have one).
As a fallback to the above, a visual client available for visualizing results and exploring the graph (with simple text commands to add new objects and relations).
If a class structure / schema is required, it should be kept highly generic (the schema is the content to be developed collaboratively; the primary use is to describe a domain, not to store big data).
Should be open source, or at least permanently free for academic use.
I am not an IT specialist and have limited insight into graph DBMSs, but I am willing to dig into it, so here is what I have researched so far:
WebProtégé: no hypergraph, but visual editing and exploration
Neo4j: no hypergraph, but Bloom provides visual exploration (more or less visually supported editing)
Grakn: hypergraph with the visual viewer Workbase, but all navigation and editing via text commands; limited functionality in the community edition; proprietary format (more for internal use than the semantic web)
HypergraphDB: hypergraph, open source, no visual editing - in general, I do not understand how to manipulate the data with a client
Cayley: no info found on hypergraph capability, no graphical client found
TitanDB => JanusGraph: no info found on hypergraph capability, no graphical client found.
MS Graph Engine (formerly Trinity): supposedly hypergraphs are possible, but hardly any information on details / features / visual clients.
Hyper-X: supposedly hypergraphs are possible, but no further detailed info found.
I found the generic viewers Linkurious (commercial), Gephi and Cytoscape (both open source), but could not determine which graph DBMSs they can connect to or whether you can edit with them. Some seem to be local clients (I'd prefer to offer web-based access, but I know I will have to live with compromises).
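To illustrate the reification workaround mentioned in the first requirement, here is a small sketch (plain Python/networkx, all names made up) of why modelling a hyperedge as an extra vertex feels over-complicated compared to a native hyperedge:

import networkx as nx

g = nx.MultiDiGraph()

# The n-ary statement "Alice sold House42 to Bob in 2019" has to be
# reified as an extra "statement" vertex...
g.add_node("stmt1", type="Sale", year=2019)

# ...which is then linked to every participant by role-labelled binary edges.
g.add_edge("stmt1", "Alice", role="seller")
g.add_edge("stmt1", "Bob", role="buyer")
g.add_edge("stmt1", "House42", role="object")

# A native hypergraph would instead store a single hyperedge over
# {"Alice", "Bob", "House42"} with its own attributes.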

Looking for libraries which support entity deduplication

I am going to work on some projects dealing with entity deduplication: one or more datasets that may contain duplicate entities. In practice, an entity may be represented by a name, address, country, email, or social media ID in different forms. My goal is to identify possible duplicates based on different weights for the different entity fields. I am looking for a library that is open source and preferably written in Java.
As I need to process millions of records, I have to pay attention to scaling and performance; in particular, the runtime should not be on the order of n^2. In the findings below, some libraries use index-based search via Lucene and some use data grouping (blocking).
Please suggest which approach is better.
Here are my findings so far:
Duke (Java/Lucene)
Comments: Uses genetic algorithms and is flexible. There have not been any updates since 2016.
YannBrrd/elasticsearch-entity-resolution (extension of Duke)
Comments: There have not been any updates since 2017. Also, I need to check whether it's compatible with the latest Elasticsearch and Lucene.
dedupeio/dedupe (Python)
Comments: Uses a data grouping method, but it's written in Python.
JedAIToolkit (Java)
Comments: Uses a data grouping method.
Zentity (Elasticsearch Plugin)
Comments: It looks like a good one, but I need to check whether it supports deduplication; so far the documentation only talks about entity resolution.
Python Record Linkage Toolkit Documentation
Comments: It is in Python.
bakdata/dedupe (Java)
Comments: No clear documentation on how to use it.
I was wondering if anybody knows of any others. Please also share the pros and cons of the above.
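For reference, here is a rough sketch of what I mean by the data grouping (blocking) approach, to avoid comparing all n^2 pairs (illustrative only; the field names, weights, and threshold are made up):

from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "email": "john@example.com", "country": "US"},
    {"id": 2, "name": "John Smyth", "email": "john@example.com", "country": "US"},
    {"id": 3, "name": "Maria Perez", "email": "maria@example.org", "country": "ES"},
]

WEIGHTS = {"name": 0.4, "email": 0.5, "country": 0.1}
THRESHOLD = 0.6

def blocking_key(rec):
    # Cheap grouping key so that only records in the same block are compared.
    return rec["name"][:3].lower()

def similarity(a, b):
    # Naive weighted field comparison; a real system would use fuzzy
    # matching (e.g. Jaro-Winkler) per field instead of exact equality.
    return sum(w for f, w in WEIGHTS.items() if a[f].lower() == b[f].lower())

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

duplicates = []
for block in blocks.values():
    for a, b in combinations(block, 2):  # quadratic only within a small block
        if similarity(a, b) >= THRESHOLD:
            duplicates.append((a["id"], b["id"]))

print(duplicates)  # [(1, 2)] with this toy data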

go-diameter: support for different AVP dictionaries for different network providers (e.g. Ericsson, Nokia) and different nodes (e.g. GGSN, Tango)

We are working on creating a Diameter adapter for an OCS. Currently our AVP dictionary is the one supplied by go-diameter.
We are trying to provide a configurable dictionary to support the following:
Vendor-specific AVPs for different network providers, like Nokia and Ericsson
Support for different network traffic, like VoLTE, GGSN, and Tango.
Following are the two approaches we are currently considering:
Include a single dictionary with all supported AVPs and have a single release of the Diameter adapter, with the intelligence for identifying which AVPs are required for which node built into the code.
Have a separate release for each dictionary we want to support, and deploy whichever one is required by the service provider.
I have searched the internet to see whether something similar has been done as a proof of concept. I need help identifying which is the better solution to implement.
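To make approach 1 more concrete, here is a rough, language-agnostic sketch of the idea (written in Python only for brevity; the names and AVP subset are hypothetical and this is not go-diameter's API): a single full dictionary plus per-node profiles that select the AVPs each node needs.

# Hypothetical illustration of approach 1: one shared AVP dictionary,
# with per-node profiles selecting the relevant subset.
FULL_DICTIONARY = {
    263: "Session-Id",
    268: "Result-Code",
    443: "Subscription-Id",
    873: "Service-Information",  # 3GPP vendor-specific
}

NODE_PROFILES = {
    "GGSN": {263, 268, 443},
    "VoLTE": {263, 268, 443, 873},
}

def avps_for_node(node_type):
    # Fall back to the full dictionary for unknown node types.
    wanted = NODE_PROFILES.get(node_type, set(FULL_DICTIONARY))
    return {code: FULL_DICTIONARY[code] for code in wanted}

print(avps_for_node("GGSN"))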
I'm not familiar with go-diameter, but my suggestion: use one dictionary.
This dictionary should be used by all vendors and providers.
Reasons:
You don't know how many different releases you will end up with, and you might need to support many dictionaries in the end.
If you use several dictionaries, most of the AVPs will be the same in all of them.
The bigger your single dictionary is, the more AVPs you support everywhere - and you are never 100% sure which AVPs might arrive from different clients.

Reusable Data Mining Code?

I was assigned an extra credit project where I'm given a set of data and told to run data mining algorithms on it (I'm supposed to choose two of Apriori, PART, RIPPER, and J48).
We're told to reuse code for the algorithm implementations in this project, but the only code I can find is example code that can't easily be run on the data set given to us. Does anybody know of premade Java data mining algorithm implementations that I can use on my data?
I would start by looking at Weka (http://www.cs.waikato.ac.nz/ml/weka/). Web searches suggest that it supports the first two options I checked, Apriori and J48. It has quite reasonable Java source code, but also a GUI. You may be able to do what you need just by writing code to produce your data in a format Weka can read and then working within the Weka GUI.
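As a sketch of that last idea, here is a small, hypothetical converter (the relation, attributes, and data are made up) that writes a data set into ARFF, the format the Weka GUI and its Java classes read:

# Write a toy data set out as an ARFF file that Weka can open.
rows = [
    ("sunny", 85, "no"),
    ("rainy", 70, "yes"),
    ("overcast", 65, "yes"),
]

with open("weather.arff", "w") as f:
    f.write("@relation weather\n\n")
    f.write("@attribute outlook {sunny, overcast, rainy}\n")
    f.write("@attribute temperature numeric\n")
    f.write("@attribute play {yes, no}\n\n")
    f.write("@data\n")
    for outlook, temp, play in rows:
        f.write(f"{outlook},{temp},{play}\n")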

Where is Pentaho Kettle's architecture?

Where can I find Pentaho Kettle's architecture documentation? I'm looking for a short wiki, design document, blog post - anything that gives a good overview of how things work. This question is not about specific "how to" starting guides but rather about a good view of the technology and the architecture.
Specific questions I have are:
How does data flow between steps? It would seem everything is in memory - am I right about this?
Is the above true about different transformations as well?
How are the Collect steps implemented?
Any specific performance guidelines for using it?
Is the ftp task reliable and performant?
Any other "Dos and Don'ts" ?
See this PDF.
How does data flow between steps? It would seem everything is in memory - am I right about this?
Data flow is row-based. In a transformation, every step produces a 'tuple', i.e. a row with fields, and every field is a pair of data and metadata. Every step has inputs and outputs: a step takes rows from its input, modifies them, and sends them to its outputs. In most cases all of the information is in memory. But steps read data in a streaming fashion (via JDBC or similar), so typically only part of the stream's data is in memory at a time.
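As a rough analogy (plain Python, not Kettle code), a transformation behaves like chained generators: each step pulls rows from the previous one, so only a handful of rows is in flight at any moment.

# Each "step" is a generator pulling rows from the previous step.
def input_step():
    for i in range(1_000_000):           # e.g. rows streamed via JDBC
        yield {"id": i, "value": i * 2}

def filter_step(rows):
    for row in rows:
        if row["value"] % 3 == 0:        # modify/filter one row at a time
            yield row

def output_step(rows):
    count = 0
    for row in rows:                     # e.g. write rows to a target table
        count += 1
    return count

print(output_step(filter_step(input_step())))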
Is the above true about different transformations as well?
There is a 'job' concept and a 'transformation' concept. Everything written above is mostly true for transformations. 'Mostly' because a transformation can contain very different steps, some of which - like collect steps - may try to collect all the data from a stream. Jobs are a way to perform actions that do not follow the streaming concept, such as sending an email on success, loading files from the net, or executing transformations one by one.
How are the Collect steps implemented?
It depends on the particular step. Typically, as said above, collect steps may try to collect all the data from a stream, which can be a cause of OutOfMemory exceptions. If the data is too big, consider replacing the 'collect' steps with a different approach to processing the data (for example, use steps that do not collect all the data).
Any specific performance guidelines for using it?
There are a lot, and they depend on the steps the transformation consists of and the data sources used. I would rather discuss an exact scenario than general guidelines.
Is the ftp task reliable and performant?
As far as I remember, FTP is backed by an EdtFTP implementation, and there may be some issues with those steps, such as some parameters not being saved or the HTTP-FTP proxy not working. I would say Kettle in general is reliable and performant, but for some less commonly used scenarios it may not be.
Any other "Dos and Don'ts" ?
I would say the main 'Do' is to understand the tool before starting to use it intensively. As mentioned in this discussion, there is some literature on Kettle/Pentaho Data Integration; you can try searching for it on specific sites.
One of the advantages of Pentaho Data Integration/Kettle is its relatively big community, which you can ask about specific aspects:
http://forums.pentaho.com/
https://help.pentaho.com/Documentation
