Download large data for Hadoop [closed] - hadoop

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
I need a large dataset (more than 10 GB) to run a Hadoop demo. Does anybody know where I can download one? Please let me know.

I would suggest downloading the Million Song Dataset from the following website:
http://labrosa.ee.columbia.edu/millionsong/
The best thing about the Million Song Dataset is that you can download a 1 GB (about 10,000 songs), 10 GB, 50 GB, or roughly 300 GB dataset to your Hadoop cluster and run whatever tests you want. I love using it and have learned a lot from this dataset.
To start with, you can download the subset for any one letter from A to Z; each subset ranges from 1 GB to 20 GB. You can also use the Infochimps site:
http://www.infochimps.com/collections/million-songs
In one of my blog posts I showed how to download the 1 GB dataset and run Pig scripts on it:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/04/12/processing-million-songs-dataset-with-pig-scripts-on-apache-hadoop-on-windows-azure.aspx

Tom White mentions a sample weather dataset in his book (Hadoop: The Definitive Guide).
http://hadoopbook.com/code.html
Data is available for more than 100 years.
I used wget on Linux to pull the data. For the year 2007 alone, the data size is 27 GB.
It is hosted on an FTP server, so you can download it with any FTP utility.
ftp://ftp.ncdc.noaa.gov/pub/data/noaa/
For complete details please check my blog:
http://myjourneythroughhadoop.blogspot.in/2013/07/how-to-download-weather-data-for-your.html
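If you would rather script the download than use wget, here is a minimal Python sketch using the standard ftplib module. The host and base path come from the FTP link above; the per-year directory layout (`<base>/<year>`), the anonymous login, and the function names are my assumptions, so check them against the server before relying on this.

```python
from ftplib import FTP
from pathlib import Path
from typing import List, Optional

HOST = "ftp.ncdc.noaa.gov"    # host taken from the FTP link above
BASE_DIR = "/pub/data/noaa"   # path taken from the FTP link above


def year_directory(year: int) -> str:
    """Remote directory for one year (assumed layout: <base>/<year>)."""
    return f"{BASE_DIR}/{year}"


def download_year(year: int, dest: Path, limit: Optional[int] = None) -> List[Path]:
    """Download the files for one year into dest.

    limit caps the number of files, which is handy for a first test run.
    """
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    with FTP(HOST) as ftp:
        ftp.login()  # anonymous login (assumption)
        ftp.cwd(year_directory(year))
        for name in ftp.nlst()[:limit]:
            # Some servers return full paths; keep only the file name.
            target = dest / name.rsplit("/", 1)[-1]
            with open(target, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
            saved.append(target)
    return saved
```

Something like `download_year(2007, Path("noaa/2007"), limit=5)` would fetch a handful of files first, so you can confirm the layout before pulling the full 27 GB.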

There are public datasets available on Amazon:
http://aws.amazon.com/publicdatasets/
I would suggest running your demo cluster there, which saves you the download.
There is also a good dataset of the crawled web from Common Crawl, which is available on Amazon S3 as well: http://commoncrawl.org/

An article that might be of interest to you, "Using Hadoop to analyze the full Wikipedia dump files using WikiHadoop".
If you are after Wikipedia page view statistics, then this might help. You can download pagecount files from 2007 up to the current date. To give an idea of the size of the files: a single day (I chose 2012-05-01) comes to 1.9 GB spread across 24 files.
Currently, 31 countries have sites which make available public data in various formats, http://www.data.gov/opendatasites. In addition, the World Bank makes available data at http://data.worldbank.org/data-catalog

What about "Internet Census 2012", data gathered by a distributed scan over the whole Internet:
Announcement: http://seclists.org/fulldisclosure/2013/Mar/166
Data: http://internetcensus2012.bitbucket.org/
The whole dataset is 7 TB and is (obviously) only available by torrent.

If you are interested in country indicators, the best source I found was worldbank.org. The data they offer can be exported as CSV, which makes it very easy to work with in Hadoop. If you are using .NET, I wrote a blog post, http://ryanlovessoftware.blogspot.ro/2014/02/creating-hadoop-framework-for-analysing.html, where you can see what the data looks like. If you download the code from GitHub, https://github.com/ryan-popa/Hadoop-Analysis, you already have the string parsing methods.
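To show why CSV is convenient here, below is a rough sketch of a mapper in the Hadoop Streaming style (read lines from stdin, write tab-separated key/value pairs to stdout). It is in Python rather than .NET, and the column layout (country, indicator, year, value) is a hypothetical one I made up for illustration, not the actual World Bank export format.

```python
import csv
import sys
from typing import Iterable, Iterator, Tuple


def map_rows(lines: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Yield (country, value) pairs from CSV lines.

    Assumed (hypothetical) column order: country, indicator, year, value.
    Rows with a missing or non-numeric value, including a header row,
    are skipped.
    """
    for row in csv.reader(lines):
        if len(row) < 4:
            continue
        country, _indicator, _year, value = row[:4]
        try:
            yield country, float(value)
        except ValueError:
            continue


def main() -> None:
    # Streaming mappers emit "key<TAB>value" lines on stdout.
    for country, value in map_rows(sys.stdin):
        print(f"{country}\t{value}")
```

To use it as an actual mapper script you would call `main()` at the bottom of the file; a reducer could then sum or average the values per country.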

It might be faster to generate the data than it is to download it and put it up. This has the advantage of giving you control of the problem domain and letting your demo mean something to the people who are watching.
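For example, a few lines of Python can write a CSV file of whatever size you need. This is only a sketch: the record layout (station, year, temperature) and the value ranges are made up, so swap in fields that mean something to your audience.

```python
import random
from pathlib import Path


def generate_csv(path: Path, target_bytes: int, seed: int = 0) -> int:
    """Write random records to path until it holds at least target_bytes.

    Returns the number of data rows written. The columns are hypothetical
    demo fields, not any real dataset's schema.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    stations = [f"station-{i:03d}" for i in range(100)]
    rows = 0
    with open(path, "wb") as fh:
        header = b"station,year,temperature_c\n"
        fh.write(header)
        written = len(header)
        while written < target_bytes:
            line = (
                f"{rng.choice(stations)},"
                f"{rng.randint(1901, 2012)},"
                f"{rng.uniform(-40, 45):.1f}\n"
            ).encode("ascii")
            fh.write(line)
            written += len(line)
            rows += 1
    return rows
```

Calling `generate_csv(Path("demo.csv"), 10 * 1024**3)` would aim for roughly 10 GB. For sizes like that it may be quicker to run several copies in parallel with different seeds and output files.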


Best application monitoring system with dashboard [closed]

Closed 8 years ago.
I have been playing with Graphite as an application monitoring system, but I'm wondering if there's something better out there for what I want.
Here are a few requirements I have in mind:
Dashboards (easy to create/change)
the items on the dashboard should be mostly charts but also colored "number boxes" (a la http://shopify.github.io/dashing/)
when a metric goes below/above a certain value, show a warning on the screen (different frame/background) and potentially send an email
setting up a rule-to-warn (see above) should be simple, with many ways to specify a threshold (absolute value, +/- the min/max/avg over the past 30 days, percentile, etc.)
Clicking on one of the charts/boxes would redirect to a larger/more detailed chart or a "sub-dashboard"
I would prefer open-source but I'm open to commercial products especially cloud-based solutions.
Any suggestions?
Many thanks in advance.
I personally use the following combination:
Dashboard: Grafana. It looks really good and makes it easy to create and edit dashboards. Unfortunately it doesn't have colored "number boxes", but for that you can look into a more general-purpose dashboard like Geckoboard or Ducksboard.
Alerting system: Seyren. It lets you define alerts for when any metric crosses a certain threshold, and it notifies via mail and a dozen other channels that are readily integrated. However, it doesn't help in any way with historical values, percentiles, etc.; you have to do this manually via Graphite functions. Another popular option is Cabot. I use Seyren instead because it looks more active and is lighter to deploy.
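To make the threshold styles from the question concrete (absolute value, offset from the historical average, percentile), here is a small standalone Python sketch. It is not Seyren's or Graphite's API; the function names and the nearest-rank percentile are my own choices for illustration.

```python
import math
from statistics import mean
from typing import List, Optional, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of values (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def breaches(
    current: float,
    history: Sequence[float],
    absolute: Optional[float] = None,
    avg_margin: Optional[float] = None,
    pct: Optional[float] = None,
) -> List[str]:
    """Return the names of the rules that the current value exceeds.

    absolute:   fixed upper limit
    avg_margin: allowed distance above the historical average
    pct:        historical percentile used as the upper limit
    """
    reasons = []
    if absolute is not None and current > absolute:
        reasons.append("absolute")
    if avg_margin is not None and current > mean(history) + avg_margin:
        reasons.append("average")
    if pct is not None and current > percentile(history, pct):
        reasons.append("percentile")
    return reasons
```

Here `history` would be something like the last 30 days of values pulled from your metrics store; a non-empty result is what would trigger the on-screen warning or the email.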
Unfortunately there is no final "answer" to your question, only suggestions. You might find more appropriate forums for your question than Stack Overflow, like mailing lists or Reddit.
Hope it helps anyhow! :)
Your question states that you prefer open source, but if you're really open to a commercial option, I think the ZingChart JavaScript charting library meets your requirements.
1. Dashboards (easy to create/change): ZingChart uses a CSS-like syntax that is pretty easy to use and edit.
2. Dashboard items should be charts and number boxes: The library allows you to create widgets to display items like you've described. Here is a demo with number boxes -- http://www.zingchart.com/playground/presentation/51b21c1a3c8ae
3. Warnings on screen: As you can see in the demo above, rules have been set for the number boxes to display in green for increases and red for decreases. Similar rules can be set for a range of values (which addresses number 4).
You could even use multiple rule sets for values that are close to reaching the threshold. http://zingchart.com/playground/run/5460f51991002 This example shows rules set to place a red marker on data points below the value of 200.
4. Rule-to-warn: There are a variety of ways to use rules to replicate your desired warning. You can also combine rules with our API to create warnings and to fire an event that can be used to trigger an email.
It's not mentioned in your question, but if real-time data is a consideration, ZingChart can also transfer data via the HTTP or WebSocket protocols. I'm on the team at ZingChart, so if you have any questions about the demo or the features described, please feel free to reach out.
If you have the budget, the best tool is Splunk. It is not cloud-based, but it is the best when it comes to analyzing data and creating graphs and dashboards from data generated by scripts and log files.
Splunk comes with a very flexible query language and the ability to create scheduled searches, which can be used as a very robust monitoring solution. I have not found a better product yet, but the downside is the high price.

What is the best way to store MP3 files on a server? Is storing them in a database (BLOB) right? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I have an audio site where users can upload their music files, but the problem is that I can't afford expensive hosting, since I am not monetizing this service. I am looking for ways to store the MP3 files that cut hosting costs.
What would be the best approach technically? Any hosting suggestions would also be helpful.
I need to save as much server space as possible.
In most cases, the size of your database will also count against your overall hosting space. Furthermore, inserting huge BLOBs into your database isn't going to help its performance.
The typical pattern to follow when doing something like this is to save the MP3 (or any binary file) on the server in a particular directory, and save the path to the file in the database.
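A minimal sketch of that pattern, using Python and SQLite purely for illustration (the table name, columns, and function names are made up, and your site may well use a different language and database):

```python
import shutil
import sqlite3
import uuid
from pathlib import Path


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) tracks table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tracks ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, path TEXT NOT NULL)"
    )


def store_upload(conn: sqlite3.Connection, source: Path,
                 media_dir: Path, title: str) -> Path:
    """Copy the uploaded file into media_dir and record only its path."""
    media_dir.mkdir(parents=True, exist_ok=True)
    # A random file name avoids collisions between uploads with the same name.
    target = media_dir / f"{uuid.uuid4().hex}{source.suffix}"
    shutil.copyfile(source, target)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO tracks (title, path) VALUES (?, ?)",
            (title, str(target)),
        )
    return target
```

The database row stays a few dozen bytes no matter how large the MP3 is, and the web server can serve the file straight from disk.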
The least expensive way, outside of using the original hosting environment, would probably be Amazon AWS S3 reduced redundancy storage, which starts at $0.093 per GB per month. Pretty darn cheap.
But in answer to your original question, inserting stuff in the database probably won't save server space, and if your host is worth its salt, they will pick up on a huge database that keeps growing and growing, even if they claim "unlimited databases" or similar.
Keep in mind that storing files in a database (BLOB) is usually bad practice, because it slows down queries, bloats the database, and degrades its performance. A database is meant to store "searchable" information, not to act as a file store. Although a database can do it, it isn't designed for that.
Take a look at a cloud storage provider instead, such as ADrive, whose personal plan ( http://www.adrive.com/personal_basic ) lets you store 50 GB for free (I'm not sure if it's a trial) and also has a Remote File Transfer feature that lets you transfer files from external websites.
I have never tried this service, but give it a try; it's free and may solve your problem.

Data sets for realistic random/test data generation [closed]

Closed 7 years ago.
Where can I get data sets for random or test data generation, such as names/surnames with their distribution, address data, university/school names, company names, etc.?
I've found a list of English names and surnames with their counts (unfortunately I didn't note where I got it from). I also have an address database for Poland. However, such data sets from other countries would also be very useful to me, as would university and school names.
What data do you use as a source for such information? Could you provide links to such data? (Of course, only data that is free and publicly available.)
I think you will find answers to your question in the following topics:
Sample database for exercise
https://stackoverflow.com/questions/202092/where-can-i-find-free-and-open-data
There are many open-source and commercial test data generators on the internet. The two below are good ones:
http://www.sqledit.com/dg/
http://www.generatedata.com/#about
for random numbers/strings: http://www.random.org/
Amazon has made several public data sets available for free download:
http://aws.amazon.com/publicdatasets/
Try http://www.mockaroo.com
You can generate up to 100,000 rows of data in CSV, tab-delimited and SQL formats, save & reuse schemas, and automate test data generation using curl.
There is a free API at http://randomprofile.com/api-for-developers/ for generating test user profiles that include name, surname, address, bank info, CC number, blood type, etc. Not sure about the schools, but it is useful if you're looking for data on Asian users.
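If you already have a list of names with counts, like the one mentioned in the question, you can draw realistic-looking samples from it directly. A small Python sketch, where the function name and the counts in the usage line are invented for illustration:

```python
import random
from typing import Dict, List, Optional


def sample_names(counts: Dict[str, int], n: int,
                 seed: Optional[int] = None) -> List[str]:
    """Draw n names at random, weighted by how often each name occurs."""
    rng = random.Random(seed)  # pass a seed for repeatable test data
    names = list(counts)
    weights = [counts[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

For instance, `sample_names({"Smith": 3000, "Jones": 1000}, 10)` would return ten surnames with "Smith" about three times as likely as "Jones". The same approach works for cities, company names, or any other list that comes with frequencies.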

Any good tool or library to recursively convert ANY file to TIFF / images? [closed]

Closed 6 years ago.
We have to convert EVERYTHING to images for archiving purposes: DOC, HTML, email, ZIP, PDF, TXT, and any other document you can read/view on a computer. In addition, the conversion must recurse into all embedded attachments and the files inside ZIP archives.
ImgMaker is the only tool I know. Is it the best, or is there something better?
My boss asked me to find out whether there are any alternatives to ImgMaker.
Any open-source or commercial suggestions are welcome.
There is a whole industry built around this type of function and numerous service providers that charge a fee per document to do this type of conversion. You are better off buying than building it on your own.
The idea of converting Everything is fundamentally a fool's errand, as you would need a single program that could render every file type ever created (in essence recreating every piece of software that ever wrote a data file AND recreating every version of each). Also, not every file format has a direct rendered form. For example, what do you do with a database file, a DLL, an XML file, a WAV file?
If you are looking for something that does a reasonable job for a large number of formats, there are two main players with OEM toolkits, but both are extremely expensive and neither supports the Java platform directly. I use the former if you have any additional questions.
Stellent (now Oracle) OutsideIn: http://www.oracle.com/technologies/embedded/outside-in.html
Autonomy KeyView: http://www.autonomy.com/content/Products/idol-modules-keyview-viewing/index.en.html
Another possible option is an image print driver like Black Ice, but it has several issues including the need for a copy of every software application on the machine the code is running on, and an operator to dismiss all the inevitable dialogs that will come up when you open the files in the native application. Also, for things like Excel spreadsheets, you usually need some manual tweaking of the spreadsheet to make the printout look right (else you get 900 pages added to your tiff with that one extra column that wouldn't fit)
I don't know if this will help, because it sounds like you want something totally automated, but there are many pseudo-printer drivers that can create TIFF images as output. For example:
http://sourceforge.net/projects/pdfcreator/
Uh? How do you expect to convert a zip archive to an image? What should the pixels show? Should it be lossless, so you can convert back? If it's for archiving, I would guess that is a requirement, but it sounds weird.
What's going to happen to the tiff images afterwards? Assuming you want to manage them in some way, it seems to me you'd be better off looking for some complete documentation management product that can take these doc types as input and manage/archive the (presumably) large number of images that you'll have.
Otherwise you would seem to be re-inventing the wheel.
If you want open-source, something like Alfresco.
Note the server-based transformation feature below:
Alfresco offers one integrated repository to manage all formats of content across image management, document management, web content management and email repositories. The repository is a modern platform with:
One Repository for any Digital Asset
The industry’s most scalable, standards-based, JSR-170 content repository
Standards support for JSR-170, Web Services and REST
High-Availability, Fault Tolerance and Scalability – Auto failover and clustering
Secure Distributed Capture over Web Services, HTTP and HTTPS
Reuse of Alfresco Business Policy Rules
Server-based transformation between many formats including TIFF, JPEG, GIF, PNG, MS-Office, PDF and FLASH
Metadata Extraction and Management
Automatic Classification Framework
Using find for the recursion, in combination with convert from the ImageMagick toolkit, would get you pretty far. To support everything you want, I guess you'll need to write a script that calls the right program for each file type.
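A rough Python sketch of such a dispatch script is below. It only plans the commands rather than running them. The extension list is my assumption about what ImageMagick's convert can handle on a typical install (PDF needs Ghostscript, and newer ImageMagick versions use `magick` instead of `convert`); everything else is reported as unsupported so you can plug in another tool.

```python
import zipfile
from pathlib import Path
from typing import List, Tuple

# Assumption: formats handed to ImageMagick's convert; extend as needed.
IMAGE_LIKE = {".pdf", ".png", ".jpg", ".jpeg", ".gif", ".bmp"}


def plan_conversions(root: Path, out_dir: Path,
                     scratch: Path) -> Tuple[List[List[str]], List[Path]]:
    """Walk root recursively and return (commands, unsupported files).

    ZIP archives are unpacked into scratch and walked as well. scratch
    must be outside root, or extracted files would be visited twice.
    Output names are derived from the file stem, so files that share a
    stem would collide; a real script needs a naming scheme.
    """
    commands: List[List[str]] = []
    unsupported: List[Path] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        suffix = path.suffix.lower()
        if suffix == ".zip":
            extract_to = scratch / path.stem
            with zipfile.ZipFile(path) as archive:
                archive.extractall(extract_to)
            inner_cmds, inner_skipped = plan_conversions(extract_to, out_dir, scratch)
            commands.extend(inner_cmds)
            unsupported.extend(inner_skipped)
        elif suffix in IMAGE_LIKE:
            target = out_dir / (path.stem + ".tiff")
            commands.append(["convert", str(path), str(target)])
        else:
            unsupported.append(path)
    return commands, unsupported
```

Each planned command can then be run with `subprocess.run(cmd, check=True)`. The unsupported list is where DOC, HTML, email, and the like end up, which is exactly the part the other answers say needs a commercial toolkit or a print driver.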
The question as asked cannot be answered sensibly. One obvious solution is to simply rename each file by attaching .tiff. E.g. you could get ringtone.mp3.tiff. Insane as it is, there are not many better ways to convert an .mp3 to a .tiff.
Note that this is not an IT problem. The business is assuming everything is an image, and music is the trivial example of something that isn't.
( To clarify - this was assuming an automated setting, e.g. to archive incoming email for legal reasons. If that's required, you MUST archive incoming MP3's too. If you've got humans in the loop, this question would not belong on a programming forum. )

project-tracking tools for navigating with topic maps? [closed]

Closed 5 years ago.
I'm having trouble with project management & am looking for a good tool that will be a good match for the way my brain works (very associatively). I'd like a bug-tracker but one that I can group tasks into topics and associate the topics to each other in a graph (see the Wikipedia entry on Topic Maps ) so that I can find & visualize easily the "big picture". I've tried using AbstractSpoon's ToDoList and it works well but it's hierarchical and after about 30 or 40 entries I get lost in a maze of things to do.
Any suggestions?
edit: I've now tried Freemind, Conzilla, XMind, and VUE. Freemind and Conzilla were a little flaky. XMind seems to be the most polished of the four; they have a "pro" version which is non-free (pay by the month >:( which is weird) but an open-source base version which is free. You can't export the data directly from the program with the free version, but the storage format is just a .jar-style (ZIP file w/ extension .xmind) file that contains a "contents.xml" that is easily parsed if I needed to.
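For what it's worth, pulling the topic titles out of such a file takes only a few lines of Python. This is a sketch under assumptions: the entry name "contents.xml" is taken from my description above (adjust it if your file uses a different name), and I'm assuming that topic names live in elements called "title", which I have not checked against any XMind schema.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import List


def topic_titles(xmind_path, entry: str = "contents.xml") -> List[str]:
    """Return the text of every <title> element in the archive's XML entry.

    Namespaces are ignored, so this works whether or not the document
    declares one.
    """
    with zipfile.ZipFile(xmind_path) as archive:
        with archive.open(entry) as fh:
            root = ET.parse(fh).getroot()
    titles = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # strip "{namespace}"
        if local_name == "title" and element.text:
            titles.append(element.text.strip())
    return titles
```

Titles come back in document order, so parents appear before their children.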
@codeslave:
"but how important is the visual representation any way"
Visualization is everything! I've got information overload and I need to be able to navigate a mess of information. I don't want it to be super-Powerpoint-polished, but I need to be able to use the associations that I create to remind myself how to find what I'm looking for. In an ideal world you could just full-text search everything, but that only works if you can remember the search phrase. Often I'll file something under "algorithm" and when I go to look for it I look under "programming" instead, or vice-versa. Associativity solves that problem by allowing me to visually browse my "mental model" of the information I've stored.
You can always get a CSV export from your "favourite" tool and create Topic Maps you can view with the Omnigator or the xSiteable tool. I used to have a few XSLT files to do such a job with JIRA data. If the interest is high enough, maybe a resurrection is needed?
I've developed a small utility that will import MindMaps into Project plans. Let me know if something like this is helpful and I will develop it further.
Right now I just use it one-way from MindMap -> Project file. I generally use this for brainstorming and scope management, then export to Project when we like the scope of what we are working with for more formal project management.
What about using good old FogBugz? You can associate cases with each other pretty easily. You don't get the pretty graph of the topic space/mind map (feature idea, Joel), but how important is the visual representation anyway?
