Text-Based Databases for Log Search? [closed] - full-text-search

I am working with large amounts of multidimensional log data at my company. I have to save and retrieve data from my text database really fast, because there is a lot of data and the search queries (which are not so simple, e.g. between two dates) take a long time.
Here are my points:
We use Lucene, but it doesn't fit our requirements.
We don't use SQL-based databases, because they are overkill for storing and querying this amount of log data.
We don't want to use NoSQL databases for log search; we need a text-based database.
We want to use PyTables, but I would like to know whether there are any other systems for storing logs and searching them fast.
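For reference, a minimal PyTables sketch of the kind of date-range query described above might look like this; the file name, table layout and column names are illustrative assumptions, not part of the question:

    import tables

    # illustrative log record layout (assumption)
    class LogRecord(tables.IsDescription):
        ts = tables.Int64Col()          # event time as epoch seconds
        level = tables.StringCol(8)
        message = tables.StringCol(256)

    with tables.open_file("logs.h5", mode="w") as h5:
        table = h5.create_table("/", "logs", LogRecord, "application logs")
        row = table.row
        for ts, level, msg in [(1000, "INFO", "started"), (2000, "WARN", "slow query")]:
            row["ts"] = ts
            row["level"] = level.encode()
            row["message"] = msg.encode()
            row.append()
        table.flush()

        # in-kernel query: only rows whose timestamp falls between t1 and t2
        t1, t2 = 500, 1500
        hits = [r["message"] for r in table.where("(ts >= t1) & (ts <= t2)")]

Adding an index on the timestamp column (table.cols.ts.create_index()) speeds such range queries up further.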

Related

Visualizing large data sets with Hadoop [closed]

I'm looking for a framework, a combination of frameworks, best-practices, or a tutorial about visualizing large data sets with Hadoop.
I am not looking for a framework to visualize the mechanics of running Hadoop jobs or managing disk space on Hadoop. I am looking for an approach or a guideline for visualizing the data contained within HDFS using graphs and charts, etc.
For example, let's say I have a set of data points stored in multiple files in HDFS, and I would like to show a histogram of the data. Is my only option to write a custom map/reduce job that would try and figure out which points fall into which bucket, write the totals to a file, and then use a plotting library to visualize that?
Do I need to roll out a custom solution, or is anyone else out there doing this sort of thing? I've tried looking online, but I haven't been able to find anything that directly relates to this.
Thank you for your help
We do something like this at Datameer. The files would take a few more processing steps to get to our visualizations, but we run natively on Hadoop so the files would not be far away.
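For the roll-your-own route described in the question, a minimal Hadoop Streaming sketch (two separate scripts; the one-value-per-line input format and the fixed bucket width are assumptions) could look like this:

    #!/usr/bin/env python
    # mapper.py -- emit (bucket, 1) for each numeric value, one value per input line
    import sys

    BUCKET_WIDTH = 10.0  # assumed fixed-width histogram buckets

    for line in sys.stdin:
        try:
            value = float(line.strip())
        except ValueError:
            continue
        print("%d\t1" % int(value // BUCKET_WIDTH))

    #!/usr/bin/env python
    # reducer.py -- sum the counts per bucket (Hadoop sorts mapper output by key)
    import sys

    current, count = None, 0
    for line in sys.stdin:
        bucket, n = line.rstrip("\n").split("\t")
        if bucket != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = bucket, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

The resulting bucket/count file is small, so it can be copied out of HDFS and plotted locally with any charting library.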

Best way to prepare for Design and Architecture questions related to big data [closed]

Recently, I attended an onsite interview at a company and was asked design questions related to big data, e.g. get me the list of users who accessed a website (say Google) between times t1 and t2. What data structures to use, how to handle concurrency and stale data, how many servers are needed to store the data, the (software and hardware) requirements of each server, etc.
Please point me to some books/web references to increase my knowledge in this area, and give me some insight into how to answer this type of design question.
This book (free download; on Amazon: Mining of Massive Datasets) was just posted to HN (that thread also has some useful comments). From a first skim it looks really good; you could read that.
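As a concrete illustration of the example interview question above (a hypothetical sketch, not taken from the interview or the book): bucketing access events by time means a "users between t1 and t2" query only scans the buckets that overlap the range.

    import bisect
    from collections import defaultdict

    BUCKET = 3600  # one bucket per hour of access events (assumption)

    events = defaultdict(list)   # bucket start -> list of (timestamp, user id)
    bucket_keys = []             # sorted bucket starts, for range scans

    def record(ts, user):
        key = ts - ts % BUCKET
        if key not in events:
            bisect.insort(bucket_keys, key)
        events[key].append((ts, user))

    def users_between(t1, t2):
        # scan only the buckets that can overlap [t1, t2], then filter exactly
        lo = bisect.bisect_left(bucket_keys, t1 - t1 % BUCKET)
        hi = bisect.bisect_right(bucket_keys, t2)
        return {user for key in bucket_keys[lo:hi]
                     for ts, user in events[key] if t1 <= ts <= t2}

In a real system each bucket would live on disk and be sharded across servers, but the time-partitioning idea stays the same.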

How to design a Cassandra (or another NoSQL) schema? [closed]

We are about to move a project on Apache Cassandra from test to pilot, and as an RDBMS team we are probably missing something.
Basic rules (or lessons learned):
be sure you have big data or almost no data (nothing in between)
do not believe in extremely cheap storage (cheap, or at least not expensive, might be better)
think of your primary key as if it were a reverse index
think of time (or another data-creation order) as if it were a row/clustering key (see the sketch at the end of this question)
forget about 100% foreign keys whenever you can
sample if you can
do not care about dups
JSON and asynchronous time aggregation on the client can make CPUs more relaxed
ETL:
sample history if you can (or sample it just for reporting usage on a separate reporting cluster)
single-threaded data streams spread over a couple of servers will come in handy
if you can afford asynchronous processing, you can profit from knowledge of data patterns
throw scrap data away (horizontally and vertically), or it will mislead BI people, or even board members in the worst case
do not care about dups
The question is: am I still missing something?
Are there other ways to achieve even better performance?
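A minimal sketch of the clustering-key rule from the list above, using the Python cassandra-driver; the keyspace, table and column names are illustrative assumptions:

    from datetime import datetime
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("logs")   # assumed keyspace

    # the partition key is what you look rows up by (the "reverse index" rule),
    # and the event time is the clustering key, so rows are stored in time order
    session.execute("""
        CREATE TABLE IF NOT EXISTS events_by_user (
            user_id    text,
            event_time timestamp,
            payload    text,
            PRIMARY KEY (user_id, event_time)
        ) WITH CLUSTERING ORDER BY (event_time DESC)
    """)

    # a time-range read stays within one partition and scans a contiguous slice
    rows = session.execute(
        "SELECT payload FROM events_by_user WHERE user_id = %s AND event_time >= %s",
        ("alice", datetime(2014, 1, 1)),
    )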

Which data structure should be used for storing a large amount of data, but not any RDBMS? [closed]

This question was asked in an interview. First, I came up with B-tree. He asked me to be more specific and asked me to describe how I would store the data so that it would be easier to retrieve.
Can you please throw some light on this? Thanks in advance.
Your question isn't really clear.
"Good" ways to store the data depend on what you want to do with it.
If you want to access parts of your data, a list of offsets suffices. If you want to search in text, an additional inverted index in combination with a docId->offset mapping works well. If you have frequent updates to your data and reading is rare, none of those make sense. So it really depends.
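A minimal sketch of that inverted-index-plus-offsets idea (the sample documents and names are illustrative):

    from collections import defaultdict

    docs = ["error connecting to db", "request served", "db timeout on request"]

    offsets = {}                 # doc id -> byte offset in the (hypothetical) data file
    inverted = defaultdict(set)  # term   -> set of doc ids containing it

    pos = 0
    for doc_id, text in enumerate(docs):
        offsets[doc_id] = pos
        pos += len(text) + 1     # +1 for a newline separator
        for term in text.split():
            inverted[term].add(doc_id)

    # look up every document containing "db", then resolve its offset for retrieval
    hits = sorted(inverted["db"])
    print([(d, offsets[d]) for d in hits])   # -> [(0, 0), (2, 38)]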
Sounds like an open question, so you can demonstrate your vast experience of ... well, http://en.wikipedia.org/wiki/NoSQL would be my guess, but you could argue that http://en.wikipedia.org/wiki/Dbm answers the question.

What is the best architecture to store huge videos and lots of log files with processed data? [closed]

It is a research project. We will record a lot of video every day, and the metadata and log data would be saved in a semantic model, RDF or OWL. The video would be downloaded or processed on the server. Every day we would add a lot of data. What is the best storage solution?
Some options:
Use HBase. The binary files would be stored in HDFS. HBase supports semantic data well.
Use HDFS + Neo4j. Neo4j is just for the semantic model, so we would only record the binary file's path in the RDF or OWL file; Java would then handle the processing logic.
Use Lustre + HBase or Neo4j. Lustre would store the large video files, and HBase/Neo4j would be used for the semantic data.
Which is the best one, or is there a better solution?
Thanks
For storing RDF triples one of the triple stores listed below can be a solution:
AllegroGraph, http://www.franz.com/agraph/allegrograph/
OWLIM, http://www.ontotext.com/owlim
BigData, http://www.systap.com/bigdata.htm
For an extensive list please see http://www.w3.org/wiki/LargeTripleStores
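As a small illustration of how the semantic side could look regardless of which store is chosen (a hypothetical rdflib sketch; the namespace, predicates and HDFS path are assumptions):

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/research/")
    g = Graph()

    # the triples only point at the video's HDFS path; the binary stays in HDFS/Lustre
    video = EX["video/2014-05-01-cam1"]
    g.add((video, EX.hdfsPath, Literal("hdfs://cluster/videos/2014-05-01-cam1.mp4")))
    g.add((video, EX.recordedOn, Literal("2014-05-01")))

    # find the HDFS paths of everything recorded on a given day
    results = g.query("""
        PREFIX ex: <http://example.org/research/>
        SELECT ?path WHERE {
            ?v ex:recordedOn "2014-05-01" .
            ?v ex:hdfsPath ?path .
        }
    """)
    for row in results:
        print(row.path)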
