What are the performance-improving techniques in HBase? [closed] - hadoop

The techniques can apply while creating a table or while running other operations such as inserts, updates, and deletes on a table.
I understand that options like a BloomFilter or the BlockCache can have an impact, but I would like to know the other techniques that improve overall throughput. Also, can anyone show how to add a BloomFilter to an HBase table? I'd like to try it for practice.
Any help is appreciated.

Your question is too general. To design your data store in HBase properly, you should understand its internal storage logic and how data is distributed across regions; that is probably the best place to start. I would recommend getting acquainted with the LSM-tree and how HBase implements it in this article. After that, I would advise you to read about proper data schema design here, as it plays the main role in your performance. A correct schema with a good row key keeps your data well distributed across the nodes and helps you avoid hotspotting. Then you can start looking at optimization techniques such as Bloom filters, the BlockCache, custom secondary indexes, and so on.
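Since the question also asks for a concrete example, here is a minimal sketch of creating a table with a row-level Bloom filter and the block cache enabled, using the HBase 2.x Java client. The table name and column family are made up for the example; on an existing table the same settings can be changed from the HBase shell with the alter command.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
    import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
    import org.apache.hadoop.hbase.regionserver.BloomType;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CreateTableWithBloomFilter {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Admin admin = connection.getAdmin()) {

                // Hypothetical column family "d" with a ROW Bloom filter:
                // reads of a missing row key can skip HFiles that provably
                // do not contain it.
                ColumnFamilyDescriptor family = ColumnFamilyDescriptorBuilder
                        .newBuilder(Bytes.toBytes("d"))
                        .setBloomFilterType(BloomType.ROW)  // or ROWCOL for row+column checks
                        .setBlockCacheEnabled(true)         // cache data blocks on reads
                        .build();

                // Hypothetical table name for the example.
                admin.createTable(TableDescriptorBuilder
                        .newBuilder(TableName.valueOf("demo_table"))
                        .setColumnFamily(family)
                        .build());
            }
        }
    }

BloomType.ROW is the usual default; ROWCOL builds larger filters but can also short-circuit per-column lookups.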

Related

Which way is more efficient to learn data structures? [closed]

My programming knowledge goes up to OOP, since that was the last thing we covered at university. I am taking 2 courses this summer and am constantly under pressure, but I am planning to learn data structures along the way too, to be prepared for them next semester.
I have two plans for learning them, but I am not sure which one will be more efficient:
-The first is to skim through and learn about all the types of data structures and how they are implemented.
-The second is, instead of just reading about a data structure, to go and try to implement it. However, the drawback is that this is slow and time-consuming, so I might not be able to cover all of the data structures in time.
1. Practice using the data structures in your code.
2. Code those data structures from scratch.
3. Repeat steps 1 and 2.
There is really no shortcut for that.
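As an illustration of step 2, here is a minimal from-scratch sketch of a singly linked list in Java (the class and method names are our own, not from the original post); even writing this much by hand exercises the pointer manipulation that reading alone does not.

    /** A minimal singly linked list, written from scratch as an exercise. */
    public class SinglyLinkedList<T> {
        private static class Node<T> {
            T value;
            Node<T> next;
            Node(T value) { this.value = value; }
        }

        private Node<T> head;
        private int size;

        /** Prepend in O(1): the new node becomes the head. */
        public void addFirst(T value) {
            Node<T> node = new Node<>(value);
            node.next = head;
            head = node;
            size++;
        }

        /** Linear scan in O(n): follow next pointers from the head. */
        public boolean contains(T value) {
            for (Node<T> n = head; n != null; n = n.next) {
                if (n.value.equals(value)) return true;
            }
            return false;
        }

        public int size() { return size; }

        public static void main(String[] args) {
            SinglyLinkedList<String> list = new SinglyLinkedList<>();
            list.addFirst("b");
            list.addFirst("a");
            System.out.println(list.contains("a") + " size=" + list.size()); // true size=2
        }
    }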

Data structure for dealing with millions of records [closed]

Which data structure is appropriate for operating over millions of records when I later need to iterate over them?
While a simple linked list might be sufficient for your needs, if you also need to maintain the records in sorted order and efficiently access them or begin iteration at an arbitrary point, I would recommend looking into a B-tree.
If you want to persist the data to disk, you should use a key-value store; these often use B-trees (or LSM trees) under the hood and also provide ACID guarantees. Examples include LMDB, BerkeleyDB, and LevelDB.
In short, use a database.
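As an in-memory sketch of the sorted-order operations described above, Java's built-in TreeMap (a red-black tree rather than a B-tree, but with the same sorted-map interface) shows what point lookups and iteration from an arbitrary key look like; the record type here is a made-up example.

    import java.util.Map;
    import java.util.TreeMap;

    public class SortedRecords {
        public static void main(String[] args) {
            // Sorted map keyed by record id; a disk-backed B-tree store
            // (LMDB, BerkeleyDB) exposes essentially the same operations.
            TreeMap<Long, String> records = new TreeMap<>();
            for (long id = 0; id < 1_000_000; id++) {
                records.put(id, "record-" + id);
            }

            // O(log n) point lookup.
            String r = records.get(42L);
            System.out.println(r);

            // Begin iteration at an arbitrary key, in sorted order.
            for (Map.Entry<Long, String> e : records.tailMap(999_997L).entrySet()) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }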

SQL to MapReduce - How to? [closed]

I have a complex query used in an ETL process (SQL-based). It is too big to fit here, but in general it has a few inner joins between several tables and some business logic using window functions and other 'goodies'.
I need to port it to Hadoop MapReduce. The plan: just dump all the tables in the FROM clauses to CSV format and bring the files to HDFS.
Then write MapReduce jobs that replicate the logic implemented in SQL.
I wonder: are there any best-practices/recommendations/pitfalls I should be aware of while porting SQL to MapReduce?
Googling in my case was no good as the results were either too specific or some scientific papers with no practical tips.
You can look at Sqoop as one option for transferring data between Hadoop and structured datastores.
Also, this link could be helpful- http://www.ibm.com/developerworks/library/bd-sqltohadoop1/
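As a sketch of the SQL-to-MapReduce mapping itself: a SELECT key, COUNT(*) ... GROUP BY key becomes a mapper that emits the grouping column as the shuffle key and a reducer that aggregates. The CSV layout and column index below are assumptions for illustration.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** SELECT col0, COUNT(*) FROM csv_table GROUP BY col0, as a MapReduce job. */
    public class GroupByCount {

        public static class GroupMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text groupKey = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assumption: the first CSV column is the GROUP BY key.
                String[] cols = line.toString().split(",");
                groupKey.set(cols[0]);
                ctx.write(groupKey, ONE);   // the shuffle does the "GROUP BY"
            }
        }

        public static class CountReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> ones, Context ctx)
                    throws IOException, InterruptedException {
                int count = 0;
                for (IntWritable one : ones) count += one.get();
                ctx.write(key, new IntWritable(count)); // COUNT(*)
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "group-by-count");
            job.setJarByClass(GroupByCount.class);
            job.setMapperClass(GroupMapper.class);
            job.setCombinerClass(CountReducer.class); // safe: counting is associative
            job.setReducerClass(CountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Inner joins follow the same pattern (a reduce-side join tags and keys both tables on the join column); the window functions are the hard part to hand-port, which is one reason to consider Hive instead of hand-written jobs.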

Would Hadoop help my situation? [closed]

I am in the process of creating a survey engine that will store millions of responses to various large surveys.
There are multiple agencies, each with 10-100 users, and each user will be able to administer a 3000+ question survey.
If each agency were to have hundreds of thousands of sessions, each with 3000+ responses, I'm thinking that Hadoop would be a good candidate for pulling up the sessions and their response data to run various analyses on (aggregations etc.).
The sessions, survey questions, and responses are all currently held in a SQL database. I was thinking that I would keep that and store the data in Hadoop in parallel, so when a new session is taken under an agency, it is also added to the Hadoop 'file', such that when the entire dataset is called up it would be included.
Would this implementation work well with hadoop or am I still well within the limits of a relational database?
I don't think anyone is going to be able to tell you definitively yes or no here. I also don't fully grasp what your program will be doing from the wording of the question; however, in general, Hadoop MapReduce excels at batch processing huge volumes of data. It is not meant to be an interactive (aka real-time) tool. So if your system:
1) Will be running scheduled jobs to analyze survey results, generate trends, summarize data, etc., then yes, MapReduce would be a good fit for this.
2) Will allow users to search through surveys by specifying what they are interested in and get reports in real time based on their input, then no, MapReduce would probably not be the best tool for this. You might want to take a look at HBase. Hive is a query-based tool, but I haven't used it yet and I am not sure how "real-time" it can get. Also, Drill is an up-and-coming project that looks promising for interactively querying big data.
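To make the HBase suggestion concrete, here is a minimal sketch of the kind of real-time point lookup HBase is good at, using the HBase Java client; the table name, row-key scheme, and column names are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SessionLookup {
        public static void main(String[] args) throws IOException {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table sessions = conn.getTable(TableName.valueOf("sessions"))) {

                // Hypothetical row key: agencyId + sessionId, so fetching a
                // session is a single keyed read rather than a batch job.
                Get get = new Get(Bytes.toBytes("agency42#session0001"));
                Result row = sessions.get(get);

                byte[] answer = row.getValue(Bytes.toBytes("r"),    // column family
                                             Bytes.toBytes("q17")); // question id
                System.out.println(answer == null ? "no answer" : Bytes.toString(answer));
            }
        }
    }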

Use cases of hadoop [closed]

Recently I started learning Hadoop, and all I found were examples that read text data and compute a word count. More or less all the examples were of the same task. Please help me understand: is that the only use case of Hadoop? Please provide some references for more real use cases, or resources where I can learn where Hadoop can be used.
Thanks
I can try to outline a few directions, restricting myself to MapReduce:
a) ETL - data transformations. Here Hadoop shines, since latency is not important but scalability is.
b) Hive / Pig. There are cases when we actually need SQL or SQL-like functionality over big data sets but cannot afford a commercial MPP database.
c) Log processing of different kinds.
d) Deep analytics - when we simply want to run Java code over massive data volumes. Mahout is used in many cases as a machine learning library.
