Please suggest open data for graphs [closed] - algorithm

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 8 years ago.
I need to prepare for an upcoming task that involves a lot of graphs.
I need some data (freely available) to train on.
Bigger is better...
Could you suggest some open data resources?
I'd appreciate it.

You can visit http://snap.stanford.edu/data/. It contains many different kinds of network and graph data.
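If you want to practice on those, here is a minimal sketch of loading one of the SNAP edge-list files into a graph library; the networkx dependency and the ca-GrQc.txt filename are my own assumptions, not part of the dataset page:

    import networkx as nx

    # SNAP edge lists are whitespace-separated node pairs; comment lines start with '#'.
    G = nx.read_edgelist("ca-GrQc.txt", comments="#", nodetype=int)
    print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

    # A quick exercise once the graph is loaded: find the highest-degree node.
    degrees = dict(G.degree())
    hub = max(degrees, key=degrees.get)
    print("highest-degree node:", hub, "with degree", degrees[hub])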

This is an answer to your "could you suggest some open data resource?" rather than to the "which consist of a lot of graphs" part, so please keep that in mind.
Here (data.gov.au) you can find a huge number of datasets (864!) of different types in different formats (txt, csv, xml, ...). You will find Finance, Industry, Geography, etc. datasets.
Alternatively, if you want something more specific (and meaningful, for example global population density), you can look at this (a bit outdated, but useful) source from readwriteweb.com.
And one more source: "Open Governmental Datasets" - it is well worth a look.

Related

How did functional languages deal with immutable data before Phil Bagwell's HAMT idea? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 6 months ago.
As the title says, I think persistent data structures are based on Phil Bagwell's idea of the Hash Array Mapped Trie (HAMT), which was introduced in 2001.
So, before that, how did people make data immutable without using a HAMT?
I think the premise of your question is mistaken. The HAMT was not the first persistent data structure. For example, the landmark paper Making Data Structures Persistent by Driscoll, Sarnak, Sleator, and Tarjan was published in 1986, building on a large number of earlier papers from different contexts. Some of those settings include purely functional programming languages, where all (technically, most) data structures are persistent.
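To make that concrete, here is a minimal sketch (in Python, purely for illustration) of one of the oldest tricks, long predating the HAMT: an immutable cons list with structural sharing, where an "update" copies only the path to the change and shares everything after it:

    from collections import namedtuple

    Cons = namedtuple("Cons", ["head", "tail"])   # tail is another Cons or None

    def prepend(value, lst):
        # O(1): the new node simply points at the existing (shared) list.
        return Cons(value, lst)

    def replace_at(lst, index, value):
        # Path copying: copy only the nodes up to the change (O(index)),
        # share the rest. Both old and new versions remain valid.
        if index == 0:
            return Cons(value, lst.tail)
        return Cons(lst.head, replace_at(lst.tail, index - 1, value))

    old = prepend(1, prepend(2, prepend(3, None)))   # [1, 2, 3]
    new = replace_at(old, 1, 99)                     # [1, 99, 3]
    assert old.tail.head == 2 and new.tail.head == 99
    assert old.tail.tail is new.tail.tail            # the tail [3] is shared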

Split text files into two groups - unsupervised learning [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
Imagine you are a librarian, and over time you have classified a bunch of text files (approx. 100) under one general, ambiguous keyword. Every text file is actually about keyword_meaning1 or about keyword_meaning2.
Which unsupervised learning approach would you use to split the text files into two groups?
What precision (in percent) of correct classification can be achieved, given the number of text files?
Or can it somehow be indicated within a group that a librarian needs to check certain files, because they may be classified incorrectly?
The easiest starting point would be to use a naive Bayes classifier. It's hard to speculate about the expected precision; you have to test it yourself. Just get a program for e-mail spam detection and try it out. For example, SpamBayes (http://spambayes.sourceforge.net/) is quite a good starting point and easily hackable. SpamBayes has a nice feature in that it will label messages as "unsure" when there is no clear separation between the two classes.
Edit: If you really want an unsupervised clustering method, then perhaps something like Carrot2 (http://project.carrot2.org/) is more appropriate.
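For completeness, here is a minimal sketch of a purely unsupervised split, assuming scikit-learn is available; the directory name, the feature settings, and the "unsure" threshold are my own assumptions, not part of the answer above:

    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    texts = [p.read_text(errors="ignore") for p in sorted(Path("files").glob("*.txt"))]

    # Represent each file as a TF-IDF vector and cluster into two groups.
    X = TfidfVectorizer(stop_words="english", max_features=5000).fit_transform(texts)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    labels = km.labels_

    # "Unsure" flag for the librarian: files whose distances to the two
    # centroids are nearly equal sit close to the decision boundary.
    distances = km.transform(X)                  # shape: (n_files, 2)
    margin = abs(distances[:, 0] - distances[:, 1])
    needs_review = margin < 0.05                 # arbitrary threshold
    for i, (lab, flag) in enumerate(zip(labels, needs_review)):
        print(i, "group", lab, "REVIEW" if flag else "")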

from hg18 to GRCh38 reference human genome [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 6 years ago.
Does anyone know if it is possible to convert SNP coordinates from the HapMap database to the new reference genome GRCh38? UCSC doesn't have the liftOver ready yet. Any suggestions?
As was linked in the BioStars answer, NCBI offers a remapping tool that will translate positions from one reference genome to another. UCSC also offers a similar tool, LiftOver, which has a downloadable version as well.
However, as I discovered years ago, these tools do not always succeed in remapping your coordinates, and they sometimes produce incorrect results. You should take all output from these tools with a grain of salt. Bottom line: treat only your original coordinates as the correct ones, and try to work with the corresponding reference genome build.
The UCSC LiftOver tool mentioned above works most of the time: https://genome.ucsc.edu/cgi-bin/hgLiftOver
However, repetitive regions may confuse the liftOver tool.
An alternative is to map the sequences (FASTA/FASTQ) onto the new hg38 genome using bowtie to get the coordinates.
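If you prefer to script the UCSC tool, here is a minimal sketch that shells out to the liftOver binary from Python; it assumes you have downloaded the binary and the hg18ToHg38.over.chain.gz chain file from UCSC, and the SNPs and file names are illustrative:

    import subprocess

    snps = [("chr1", 1234567, "rs0000001"), ("chr2", 7654321, "rs0000002")]

    # liftOver works on BED input: chrom, start (0-based), end, name.
    with open("snps_hg18.bed", "w") as bed:
        for chrom, pos, rsid in snps:
            bed.write(f"{chrom}\t{pos - 1}\t{pos}\t{rsid}\n")

    # Usage: liftOver oldFile map.chain newFile unMapped
    subprocess.run(
        ["./liftOver", "snps_hg18.bed", "hg18ToHg38.over.chain.gz",
         "snps_hg38.bed", "unmapped.bed"],
        check=True,
    )

    # snps_hg38.bed now holds the remapped coordinates; anything liftOver could
    # not place (e.g. in repetitive or deleted regions) ends up in unmapped.bed.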

How to classify a large collection of user entered company names? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
Our site allows users to enter the company they work for as free-form text.
Over time we have gathered a few million unique entries. Since we imposed no constraints, we ended up with a lot of variations and typos (e.g. over 1000 distinct entries just for McDonald's).
We realized we could provide our users with a great feature if only we could tie these variations together. We compiled a clean list of companies as a starting point, using various online sources [Dictionary].
Now we're trying to find the best way to deal with the user data source. We thought about assigning a similarity score:
- comparing each entry with [Dictionary], calculating a lexical distance (possibly in a Hadoop job)
- taking advantage of a search database (e.g. Solr)
and associating the user-entered text this way.
What we're wondering is: has anyone gone through a similar "classification" exercise, and could you share any tips?
Thanks,
Piotr
I'd use simple Levenshtein distance (http://en.wikipedia.org/wiki/Levenshtein_distance).
A few million entries - you should be able to process them easily on one computer (no Hadoop or other heavyweight tools).
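A minimal sketch of that approach, matching raw entries against the clean [Dictionary] list with plain Levenshtein distance; the sample names and the distance cut-off are illustrative assumptions:

    def levenshtein(a: str, b: str) -> int:
        # Classic dynamic-programming edit distance, O(len(a) * len(b)).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    dictionary = ["McDonald's", "Microsoft", "IBM"]

    def closest_company(entry, max_distance=3):
        best = min(dictionary, key=lambda name: levenshtein(entry.lower(), name.lower()))
        return best if levenshtein(entry.lower(), best.lower()) <= max_distance else None

    print(closest_company("mc donalds"))   # -> McDonald's
    print(closest_company("Goggle"))       # -> None (not close enough to anything)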

"Random Article" Feature on wikipedia.com [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I would like to know what algorithm and what programming language Wikipedia uses to randomly choose an article to display.
I would also like to know how it works so fast.
Here's information on that.
Every article is assigned a random number between 0 and 1 when it is created (these are indexed in SQL, which is what makes selection fast). When you click random article it generates a target random number and then returns the article whose recorded random number is closest to this target.
If you are interested you can read the actual code here.
Something along these lines:
"SELECT cur_id,cur_title
FROM cur USE INDEX (cur_random)
WHERE cur_namespace=0 AND cur_is_redirect=0
AND cur_random>RAND()
ORDER BY cur_random
LIMIT 1"
From MediaWiki.org:
MediaWiki is a free software wiki package written in PHP, originally for use on Wikipedia. It is now used by several other projects of the non-profit Wikimedia Foundation and by many other wikis, including this website, the home of MediaWiki.
MediaWiki is open source, so you can download the code and inspect it, to see how they have implemented this feature.
If you look at the source, they use PHP/MySQL and sort and filter pages by pregenerated random values (the page_random column), which have an index on them.
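To illustrate the same technique outside MediaWiki, here is a minimal sketch using sqlite3; the schema mirrors the query above, but the code is my own illustration, not MediaWiki's implementation:

    import random
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE cur (cur_id INTEGER PRIMARY KEY, cur_title TEXT, "
               "cur_namespace INTEGER, cur_is_redirect INTEGER, cur_random REAL)")
    db.execute("CREATE INDEX cur_random ON cur (cur_random)")

    for i, title in enumerate(["Alpha", "Beta", "Gamma", "Delta"]):
        # Each page gets its random number once, at creation time.
        db.execute("INSERT INTO cur VALUES (?, ?, 0, 0, ?)", (i, title, random.random()))

    # Pick the first page whose stored value exceeds a fresh random target;
    # the index on cur_random makes this a cheap lookup.
    row = db.execute(
        "SELECT cur_id, cur_title FROM cur "
        "WHERE cur_namespace = 0 AND cur_is_redirect = 0 AND cur_random > ? "
        "ORDER BY cur_random LIMIT 1",
        (random.random(),),
    ).fetchone()
    print(row)   # e.g. (2, 'Gamma'); None if the target exceeds every stored value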
