Visualizing large data sets with Hadoop [closed] - hadoop

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm looking for a framework, a combination of frameworks, best-practices, or a tutorial about visualizing large data sets with Hadoop.
I am not looking for a framework to visualize the mechanics of running Hadoop jobs or managing disk space on Hadoop. I am looking for an approach or a guideline for visualizing the data contained within HDFS using graphs and charts, etc.
For example, let's say I have a set of data points stored in multiple files in HDFS, and I would like to show a histogram of the data. Is my only option to write a custom map/reduce job that would try and figure out which points fall into which bucket, write the totals to a file, and then use a plotting library to visualize that?
Do I need to roll my own solution, or is anyone else out there doing this sort of thing? I've tried looking online, but I haven't been able to find anything that directly relates to this.
Thank you for your help.
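To make the idea concrete, here is a minimal sketch of that custom map/reduce job as a Hadoop Streaming pair in Python; the one-value-per-line input format and the bucket width of 10 are assumptions:

```python
#!/usr/bin/env python
# histogram_mapper.py -- assumes one numeric value per input line
# and a hypothetical bucket width of 10.
import sys

BUCKET_WIDTH = 10

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    value = float(line)
    bucket = int(value // BUCKET_WIDTH)  # which histogram bucket this point falls into
    print("%d\t1" % bucket)
```

```python
#!/usr/bin/env python
# histogram_reducer.py -- sums the counts per bucket;
# Hadoop delivers the mapper output sorted by key.
import sys

current_bucket, count = None, 0
for line in sys.stdin:
    bucket, n = line.strip().split("\t")
    if bucket != current_bucket:
        if current_bucket is not None:
            print("%s\t%d" % (current_bucket, count))
        current_bucket, count = bucket, 0
    count += int(n)
if current_bucket is not None:
    print("%s\t%d" % (current_bucket, count))
```

You would run the pair with the Hadoop Streaming jar (the -input, -output, -mapper, -reducer, and -file flags) against the files in HDFS, then feed the resulting bucket/count pairs to a plotting library such as matplotlib.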

We do something like this at Datameer. The files would take a few more processing steps to get to our visualizations, but we run natively on Hadoop so the files would not be far away.

Related

Text-Based Databases for Log Search? [closed]

I am working with large amounts of multidimensional log data at my company. I need to save and retrieve data from my text database very quickly, because there is a lot of data, and even queries that are not so simple (e.g., between two dates) take a significant amount of time.
Here are my points:
We use Lucene, but it doesn't fit our requirements.
We don't use SQL-based databases, because they are overkill for storing and querying this amount of log data.
We don't want to use NoSQL databases for log search because of our needs; we need a text-based database.
We are considering PyTables; however, my question is whether there are any other systems for storing logs and searching them quickly.
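Since PyTables is under consideration, here is a minimal sketch of the kind of indexed date-range query it supports; the file name, schema, and field sizes are assumptions:

```python
import tables  # pip install tables (PyTables)

class LogRecord(tables.IsDescription):
    ts = tables.Int64Col()        # epoch seconds
    level = tables.StringCol(8)
    message = tables.StringCol(256)

with tables.open_file("logs.h5", mode="w") as h5:
    table = h5.create_table("/", "events", LogRecord, "log events")
    row = table.row
    for i in range(1000):          # toy data for illustration
        row["ts"] = 1_300_000_000 + i
        row["level"] = b"INFO"
        row["message"] = b"sample log line"
        row.append()
    table.flush()

    # Index the timestamp column so range queries don't scan everything.
    table.cols.ts.create_index()

    # PyTables picks up t0/t1 from the local scope of the caller.
    t0, t1 = 1_300_000_100, 1_300_000_200
    hits = [r["message"] for r in table.where("(ts >= t0) & (ts < t1)")]
```

An indexed column like ts is what makes the between-dates queries mentioned above fast, at the cost of some extra disk space.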

How to write a desktop app that filters test questions according to topic [closed]

What programming language/method would be best suited to writing a desktop app that filters question types and displays a listing of those questions to view?
For example, if I have a mix of algebra, geometry, and calculus questions stored in the app, I should be able to select just the algebra questions to view and print.
I have a little experience with python/django but I've never made a desktop app before.
You have lots of options. You will need to make several design decisions before you move forward. Things to consider are:
Which technologies do you feel comfortable with?
How much time/effort do you want to put into the project?
Are you willing to spend money on tools?
Etc.
That being said, the rest of this answer is to give you some options to consider:
You'll need a data structure that can filter the problems for you. From your description, the first thing I thought of was using a database; however, I'm not sure whether you are familiar with databases, in which case you'd have to create some classes/structs that let you do the filtering yourself. Some options for databases are SQL Express, Oracle, MySQL, DB2, and many more.
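To illustrate the database route, here is a minimal sketch using Python's built-in sqlite3 module; the table layout and sample questions are made up:

```python
import sqlite3

# In-memory database for illustration; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions (id INTEGER PRIMARY KEY, topic TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO questions (topic, body) VALUES (?, ?)",
    [
        ("algebra", "Solve 2x + 3 = 11."),
        ("geometry", "Find the area of a 3-4-5 triangle."),
        ("calculus", "Differentiate x**2 * sin(x)."),
    ],
)

# Filter to just the algebra questions for viewing/printing.
for qid, body in conn.execute(
    "SELECT id, body FROM questions WHERE topic = ?", ("algebra",)
):
    print(qid, body)
```

SQLite ships with Python, so it avoids the server setup the heavier databases above would require.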
Another thing to consider: you mentioned several different types of math problems, so you'd want to think about how you would display them. Mathematica formats math problems nicely, but if you wanted to go down that road, you'd either have to find a tool that lets you display the math problems in a Mathematica-like syntax, or export screenshots of the problems and ship those as part of your program.
Another option would be to find a language that has some sort of plugin for TeX or LaTeX (for example, you can see how Wikipedia renders math formatting here: http://en.wikipedia.org/wiki/Help:Displaying_a_formula).
This sounds like a good pet project to play with to learn different technologies. If that is the intent, great. If not, then you might want to do some googling to see if someone else has already created what you are looking for.

How to design a Cassandra (or another NoSQL) schema? [closed]

We are about to move a project on Apache Cassandra from test to pilot, and as an RDBMS team we were probably missing something.
Basic rules (or lessons learned):
be sure you have either big data or almost no data (nothing in between)
do not believe in extremely cheap storage (cheap, or at least not expensive, might be closer to reality)
think of your primary key as if it were a reverse index
think of time (or another data-creation order) as if it were a row/clustering key (see the sketch after this list)
forget about 100% foreign-key integrity whenever you can
sample if you can
do not worry about duplicates
JSON and asynchronous time aggregation on the client can make server CPUs more relaxed
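A minimal sketch of the primary-key and clustering-key rules above, using CQL via the DataStax Python driver; the contact point, keyspace, table, and column names are all assumptions:

```python
from datetime import datetime
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])  # assumed local node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# The partition key (sensor_id) plays the role of a reverse index:
# you look rows up by it. The event time is the clustering key, so
# rows inside a partition are stored in time order.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# Time-range scans within a single partition are cheap.
rows = session.execute(
    "SELECT ts, value FROM demo.readings "
    "WHERE sensor_id = %s AND ts >= %s",
    ("sensor-42", datetime(2013, 1, 1)),
)
for row in rows:
    print(row.ts, row.value)
```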
ETL:
sample history if you can (or sample it just for reporting use on a separate reporting cluster)
single-threaded data streams spread over a couple of servers will come in handy
if you can afford asynchronous processing, you can profit from knowledge of your data patterns
throw scrap data away (horizontally and vertically), or it will mislead BI people, or even board members in the worst case
do not worry about duplicates
The question is: am I still missing something?
Are there other ways to achieve even better performance?

Which data structure should be used to store a large amount of data without an RDBMS? [closed]

This question was asked in an interview. First, I came up with a B-tree. The interviewer asked me to be more specific and to describe how I would store the data so that it would be easier to retrieve.
Can you please shed some light on this? Thanks in advance.
Your question isn't really clear.
"Good" ways to store the data depend on what you want to do with it.
If you want to access parts of your data, a list of offsets suffices. If you want to search in text, an additional inverted index in combination with a docIds->offsets mapping is great. If you have frequent updates to your data and reads are rare, none of those make sense. So it really depends.
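To make the inverted-index idea concrete, here is a minimal Python sketch mapping each term to (docId, offset) postings; the two sample documents are made up:

```python
from collections import defaultdict

docs = {
    1: "error connecting to database",
    2: "database connection restored",
}

# Inverted index: term -> list of (doc_id, token_offset) postings.
index = defaultdict(list)
for doc_id, text in docs.items():
    for offset, term in enumerate(text.split()):
        index[term].append((doc_id, offset))

print(index["database"])  # [(1, 3), (2, 0)]
```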
Sounds like an open question, so you can demonstrate your vast experience of ... well, http://en.wikipedia.org/wiki/NoSQL would be my guess, but you could argue that http://en.wikipedia.org/wiki/Dbm answers the question.

What is the best architecture for storing huge videos and many log files with processed data? [closed]

This is a research project. We will record many videos every day, and the metadata and log data would be saved in a semantic model, RDF or OWL. The videos would be available for download or processed on the server. Every day we would add a lot of data. What is the best storage solution?
Some options:
Use HBase: the binary files would be stored in HDFS, and HBase supports the semantic data well.
Use HDFS + Neo4j: Neo4j would hold just the semantic model, so we would only record each binary file's path in the RDF or OWL data, and Java would handle the processing logic.
Use Lustre + HBase or Neo4j: Lustre would store the large video files, and HBase/Neo4j would hold the semantic data.
Which is the best one, or is there a better solution?
Thanks.
For storing RDF triples, one of the triple stores listed below could be a solution:
AllegroGraph, http://www.franz.com/agraph/allegrograph/
OWLIM, http://www.ontotext.com/owlim
BigData, http://www.systap.com/bigdata.htm
For an extensive list please see http://www.w3.org/wiki/LargeTripleStores
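To show the shape of the data such triple stores hold, here is a minimal sketch using rdflib, a Python library rather than one of the dedicated stores above; the namespace, properties, and file path are hypothetical:

```python
from rdflib import Graph, Literal, Namespace  # pip install rdflib

EX = Namespace("http://example.org/")  # hypothetical namespace
g = Graph()

# One triple per fact about a recorded video; the large binary
# stays in HDFS, and only its path goes into the semantic model.
video = EX["video/001"]
g.add((video, EX.recordedOn, Literal("2012-06-01")))
g.add((video, EX.storedAt, Literal("hdfs:///videos/001.mp4")))

# SPARQL query: find where each video is stored.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?video ?path WHERE { ?video ex:storedAt ?path }
""")
for row in results:
    print(row.video, row.path)
```

The dedicated stores listed above expose the same triple/SPARQL model but scale far beyond what an in-memory rdflib graph can hold.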
