Sample data for Hadoop [duplicate] - hadoop

This question already has answers here:
Download large data for Hadoop [closed]
(7 answers)
Closed 9 years ago.
For education purposes I am looking for a large data set. Data from social networks could be interesting but is difficult to obtain. Data from scientific experiments could require writing very complex algorithms before producing interesting results. Does anyone have an idea of how or where I can generate or find a large, interesting data set?

Here are some public data sets I have gathered over time
http://wiki.gephi.org/index.php/Datasets
Download large data for Hadoop
http://datamob.org/datasets
http://konect.uni-koblenz.de/
http://snap.stanford.edu/data/
http://archive.ics.uci.edu/ml/
https://bitly.com/bundles/hmason/1
http://www.inside-r.org/howto/finding-data-internet
http://goo.gl/Jecp6
http://ftp3.ncdc.noaa.gov/pub/data/noaa/1990/
http://data.cityofsantacruz.com/

Amazon also has a list of some huge public datasets you may try out:
http://aws.amazon.com/publicdatasets/

Related

Design and Analysis of Algorithms? [duplicate]

This question already has answers here:
Sort with the limited memory
(6 answers)
Closed 5 years ago.
To sort a limited number of records, we normally use RAM to hold the elements being processed. The problem arises when we are asked to sort millions of random records, where each record contains a set of elements. Such a huge file cannot be sorted with traditional in-memory sorting algorithms. How can I solve this problem?
You need to look for an efficient algorithm for sorting data that is not completely read into memory. A few adaptations to merge sort can achieve this; the technique is usually called external sorting. There are Java implementations of merge sort written specifically to sort very large files; a minimal sketch of the idea follows the links below.
Take a look at these too:
http://en.wikipedia.org/wiki/Merge_sort
http://en.wikipedia.org/wiki/External_sorting
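To make the idea concrete, here is a minimal, self-contained sketch of an external merge sort in Java. The class name, chunk size and line-based record format are assumptions made purely for illustration, not a reference implementation.

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;

// Illustrative external merge sort for a large text file of lines.
// Error handling and cleanup of temporary files are kept minimal.
public class ExternalMergeSort {

    static final int MAX_LINES_IN_MEMORY = 1_000_000; // tune to available RAM

    public static void sort(Path input, Path output) throws IOException {
        List<Path> chunks = splitAndSortChunks(input);
        mergeChunks(chunks, output);
    }

    // Phase 1: read the input in chunks that fit in memory, sort each chunk,
    // and spill it to a temporary file.
    private static List<Path> splitAndSortChunks(Path input) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = reader.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= MAX_LINES_IN_MEMORY) {
                    chunkFiles.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunkFiles.add(writeSortedChunk(buffer));
            }
        }
        return chunkFiles;
    }

    private static Path writeSortedChunk(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path chunk = Files.createTempFile("chunk-", ".txt");
        Files.write(chunk, buffer);
        return chunk;
    }

    // Phase 2: k-way merge of the sorted chunks using a priority queue
    // that always yields the smallest current line across all chunks.
    private static void mergeChunks(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> heap =
                new PriorityQueue<>(Comparator.comparing((ChunkCursor c) -> c.current));
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter writer = Files.newBufferedWriter(output)) {
            for (Path chunk : chunks) {
                BufferedReader r = Files.newBufferedReader(chunk);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new ChunkCursor(first, r));
            }
            while (!heap.isEmpty()) {
                ChunkCursor smallest = heap.poll();
                writer.write(smallest.current);
                writer.newLine();
                String next = smallest.reader.readLine();
                if (next != null) heap.add(new ChunkCursor(next, smallest.reader));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }

    private static class ChunkCursor {
        final String current;
        final BufferedReader reader;
        ChunkCursor(String current, BufferedReader reader) {
            this.current = current;
            this.reader = reader;
        }
    }
}
```

The first phase spills sorted chunks to temporary files; the second phase merges them, so only one line per chunk is ever held in memory at once.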

ElasticSearch for Time Series Data [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am evaluating a number of different NoSQL databases to store time series JSON data. ElasticSearch has been very interesting because of its query engine; I just don't know how well suited it is to storing time series data.
The data is composed of various metrics and stats collected at various intervals from devices. Each piece of data is a JSON object. I expect to collect around 12 GB/day, but I only need to keep the data in ES for 180 days.
Would ElasticSearch be a good fit for this data compared with MongoDB or HBase?
You can read about an ElasticSearch time-series use-case example here.
But I think columnar databases are a better fit for your requirements.
My understanding is that ElasticSearch works best when your queries return a small subset of results, and it caches such query parameters to be used later. If the same parameters are used in queries again, it can combine the cached results, returning results really fast. But with time series data, you generally need to aggregate data, which means you will be traversing a lot of rows and columns together. Such access is quite structured and easy to model, in which case there does not seem to be a reason why ElasticSearch should perform better than columnar databases. On the other hand, it may provide ease of use, less tuning, etc., all of which may make it preferable.
Columnar databases generally provide a more efficient data structure for time series data. If your query structures are known well in advance, then you can use Cassandra. Beware that if your queries do not use the primary key, Cassandra will not perform well. You may need to create different tables containing the same data for different queries, because its read speed depends on the way it writes data to disk. You need to learn its intricacies; a time-series example is here.
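To illustrate that per-query denormalization, here is a minimal sketch using the DataStax Java driver. The keyspace, table and column names are made up for illustration; the two tables hold the same measurements, each partitioned for a different query pattern.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Hypothetical schema: the same measurements written twice, each table
// partitioned for one query pattern.
public class TimeSeriesSchema {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS metrics "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");

            // Query pattern 1: all readings for a given device on a given day.
            session.execute("CREATE TABLE IF NOT EXISTS metrics.by_device ("
                    + "device_id text, day text, ts timestamp, metric text, value double, "
                    + "PRIMARY KEY ((device_id, day), ts))");

            // Query pattern 2: one metric across all devices on a given day.
            session.execute("CREATE TABLE IF NOT EXISTS metrics.by_metric ("
                    + "metric text, day text, ts timestamp, device_id text, value double, "
                    + "PRIMARY KEY ((metric, day), ts, device_id))");
        }
    }
}
```

Writes go to both tables; each read then hits exactly one partition, which is where Cassandra is fast.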
Another columnar option you can try is the columnar extension for PostgreSQL. Considering that your maximum database size will be about 180 * 12 GB = 2.16 TB, this approach should work perfectly and may actually be your best option. You can also expect significant compression, roughly 3x. You can learn more about it here.
Using time-based indices, for instance one index per day, together with the index-template feature and an alias to query all indices at once, ElasticSearch could be a good match (a sketch of this setup follows below). Still, there are many factors that you have to take into account, such as:
- Type of queries
- Structure of the documents and query requirements over this structure
- Amount of reads versus writes
- Availability, backups, monitoring
- etc.
This is not an easy question to answer with a yes or no; I am afraid you will have to do more research yourself before you can really say that it is the best tool for the job.
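To make the time-based-indices suggestion more concrete, here is a minimal sketch that registers a daily index template with an alias over plain HTTP (Java 11 HttpClient). The index pattern, template name and settings are assumptions for illustration, and the exact template JSON differs between Elasticsearch versions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Daily indices named metrics-YYYY.MM.DD, all matched by one index template
// that also attaches a "metrics" alias so queries can hit every daily index.
public class TimeSeriesIndexTemplate {
    public static void main(String[] args) throws Exception {
        String template = "{"
                + "\"index_patterns\": [\"metrics-*\"],"
                + "\"settings\": {\"number_of_shards\": 1},"
                + "\"aliases\": {\"metrics\": {}}"
                + "}";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/_template/metrics-daily"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(template))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());

        // Documents are indexed into the index for their own day,
        // searches go to the "metrics" alias, and old daily indices
        // can simply be dropped once they are older than 180 days.
    }
}
```

Dropping a whole expired index is much cheaper than deleting individual documents, which is why this layout suits a fixed retention window.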

Mahout Clustering with one dim K-means [duplicate]

This question already has answers here:
1D Number Array Clustering
(6 answers)
Closed 8 years ago.
Can I cluster data with one variable instead of many (which I have already tested) using the Mahout K-means algorithm? If yes (I hope so :)), could you give me an example of such clustering? Thanks.
How big is your data? If it is not exabytes, you would be better off without Mahout.
If it is exabytes, use sampling, and then process it on a single machine.
See also:
Cluster one-dimensional data optimally?
1D Number Array Clustering
Which clustering algorithm is suitable for one-dimensional Lists without knowing k?
and many more.
Mahout is not your general go-to place for data analysis. It only shines when you have Google-scale data. Otherwise, the overhead is too large.
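To illustrate how little machinery one-dimensional clustering needs, here is a minimal single-machine k-means (Lloyd's algorithm) sketch in plain Java. The sample data, k and iteration count are made up, and for an exact one-dimensional optimum a dynamic-programming method such as Ckmeans.1d.dp is preferable.

```java
import java.util.Arrays;
import java.util.Random;

// Minimal single-machine 1D k-means to show that one-dimensional data
// does not need a Hadoop/Mahout cluster.
public class OneDimKMeans {

    public static double[] cluster(double[] data, int k, int iterations, long seed) {
        Random rnd = new Random(seed);
        double[] centers = new double[k];
        for (int i = 0; i < k; i++) {
            centers[i] = data[rnd.nextInt(data.length)]; // random initial centers
        }
        int[] assignment = new int[data.length];
        for (int iter = 0; iter < iterations; iter++) {
            // Assignment step: nearest center for every point.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (Math.abs(data[i] - centers[c]) < Math.abs(data[i] - centers[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Update step: each center moves to the mean of its points.
            double[] sum = new double[k];
            int[] count = new int[k];
            for (int i = 0; i < data.length; i++) {
                sum[assignment[i]] += data[i];
                count[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        Arrays.sort(centers);
        return centers;
    }

    public static void main(String[] args) {
        double[] data = {1.1, 0.9, 1.0, 5.2, 4.8, 5.0, 9.9, 10.1};
        // Roughly [1.0, 5.0, 10.0] for this toy input.
        System.out.println(Arrays.toString(cluster(data, 3, 20, 42L)));
    }
}
```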

big data - where does the data come from? [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 9 years ago.
This might seem like an inane question, but with all the buzz about big data I was curious how the datasets typically used in big data are sourced. Twitter keywords seem to be a common source, but what are the origins of the huge Twitter feed files that get analysed? I saw an example that analysed election-related words like Obama and Romney: has someone queried the Twitter API and effectively downloaded several terabytes of tweets? Does Twitter even want people hitting their servers that hard? Or is this data already 'owned' by the companies doing the analytics? It might sound like an odd scenario, but most of the articles I have seen are fuzzy about these basic practical steps. Any links to good articles or tutorials that address these fundamental issues would be most appreciated.
Here are some ideas to get sources of Big Data:
As you pointed out, Twitter is a great place to grab data, and there is a lot of useful analysis to do with it. If you are taking the online Data Science course, one of the assignments is actually about getting live data from Twitter to analyse, so I would recommend you look at that assignment, as it describes the process of getting live Twitter data in detail. You could let the live stream run for days, and it would generate gigabytes' worth of data the longer it runs (a minimal streaming sketch follows this list).
If you have a website, you can collect your web server logs. It might not be much for a small website, but for large websites that see a lot of traffic this is a huge source of data. Think about what you could do if you had the StackOverflow web server logs...
Oceanographic data, which you can find at Marinexplore; they have some huge datasets available that you can download and analyse yourself if you want to work with ocean data.
Web crawl data, for example the kind used by search engines. You can find open web-crawl data at Common Crawl, which is already on Amazon S3, so it is ready for your Hadoop jobs to run on! You could also get data from Wikipedia here.
Genomic data is now available on a very large scale; you can find genome data from the 1000 Genomes Project via FTP.
...
More generally, I would advise you to look at the Amazon AWS datasets, which include a bunch of big datasets on various topics, if you are looking not just at Twitter but at big data in a more general context.
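As a sketch of the Twitter-streaming idea mentioned above, the following assumes the Twitter4J library on the classpath and valid API credentials in a twitter4j.properties file; Twitter's streaming endpoints and access rules have changed over the years, so treat this purely as an outline of the collection pattern, not a drop-in collector.

```java
import twitter4j.Status;
import twitter4j.StatusAdapter;
import twitter4j.TwitterStream;
import twitter4j.TwitterStreamFactory;

// Listens to the public sample stream and prints each tweet. A real collector
// would append the raw JSON to rolling files (e.g. hourly) that a Hadoop job
// later processes.
public class TwitterSampleCollector {
    public static void main(String[] args) {
        TwitterStream stream = new TwitterStreamFactory().getInstance();
        stream.addListener(new StatusAdapter() {
            @Override
            public void onStatus(Status status) {
                System.out.println(status.getUser().getScreenName() + ": " + status.getText());
            }

            @Override
            public void onException(Exception ex) {
                ex.printStackTrace();
            }
        });
        stream.sample(); // start consuming the public sample stream
    }
}
```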
Most businesses get their social data from Twitter Certified data partners such as Gnip.
Note: I work for Gnip.

Advantages of BTree+ over BTree [duplicate]

This question already has answers here:
Closed 13 years ago.
Possible Duplicate:
B- trees, B+ trees difference
What are the advantages/disadvantages of BTree+ over BTree? When should I prefer one over the other? I am also interested in knowing any real-world examples where one has been preferred over the other.
According to the Wikipedia article about the B+ tree, this kind of data structure is frequently used for indexing block-oriented storage. Apparently, in a B+ tree only keys (and not values) are stored in the intermediate nodes. This would mean that you need fewer intermediate node blocks, which increases the likelihood of a cache hit.
Real world examples include various file systems; see the linked article.
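A back-of-the-envelope calculation makes the difference concrete; all sizes below (page, key, pointer and record sizes) are assumptions chosen purely for illustration.

```java
// Fanout comparison under assumed sizes:
// 4 KiB pages, 8-byte keys, 8-byte child pointers, 120-byte records.
public class FanoutEstimate {
    public static void main(String[] args) {
        int page = 4096, key = 8, pointer = 8, record = 120;

        // B+ tree internal node: keys and child pointers only.
        int bplusFanout = page / (key + pointer);          // ~256 children per block
        // B-tree internal node: keys, child pointers and the records themselves.
        int btreeFanout = page / (key + pointer + record); // ~30 children per block

        System.out.printf("B+ tree fanout ~%d, B-tree fanout ~%d%n",
                bplusFanout, btreeFanout);
        // Higher fanout means a shallower tree, fewer block reads per lookup,
        // and internal nodes that are more likely to stay in cache.
    }
}
```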

Resources