SQL to MapReduce - How to? [closed] - hadoop

Closed 8 years ago as opinion-based; it is not accepting answers.
I have a complex query used in an ETL process (SQL-based). It is too big to fit here, but in general it consists of a few inner joins between several tables and some business logic using window functions and other 'goodies'.
I need to port it to Hadoop MapReduce. The plan is to dump all the tables referenced in the FROM clauses to CSV format, bring the files to HDFS, and then write MapReduce jobs that reproduce the logic implemented in SQL.
I wonder: are there any best practices/recommendations/pitfalls I should be aware of while porting SQL to MapReduce?
Googling in my case was no good, as the results were either too specific or scientific papers with no practical tips.

You can look at Sqoop as one option for transferring data between Hadoop and structured datastores.
Also, this link could be helpful: http://www.ibm.com/developerworks/library/bd-sqltohadoop1/
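
To give a feel for how one of those inner joins might translate, below is a minimal reduce-side join sketch. The input layout (two CSV files with the join key in the first column) and the class names are assumptions made for illustration, not something taken from the original query.

    // Reduce-side inner join of two CSV inputs on the first column.
    // The "orders"/"customers" naming and column layout are assumptions.
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class ReduceSideJoin {

      // Each mapper tags its records so the reducer can tell the two sides apart.
      public static class OrderMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split(",", 2);
          ctx.write(new Text(parts[0]), new Text("O\t" + parts[1]));
        }
      }

      public static class CustomerMapper extends Mapper<Object, Text, Text, Text> {
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split(",", 2);
          ctx.write(new Text(parts[0]), new Text("C\t" + parts[1]));
        }
      }

      // Buffers both sides per key and emits their cross product: an inner join.
      public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
          List<String> orders = new ArrayList<>();
          List<String> customers = new ArrayList<>();
          for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("O\t")) orders.add(s.substring(2));
            else customers.add(s.substring(2));
          }
          for (String c : customers)
            for (String o : orders)
              ctx.write(key, new Text(c + "," + o));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reduce-side join");
        job.setJarByClass(ReduceSideJoin.class);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, OrderMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, CustomerMapper.class);
        job.setReducerClass(JoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

In a real port, the smaller dimension tables are usually better handled with a map-side join via the distributed cache, and window functions typically turn into a secondary sort plus per-key processing in the reducer.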

Related

Is it necessary to memorize the codes of data structures? [closed]

Closed 2 years ago as opinion-based; it is not accepting answers.
Is it necessary to memorize the code of data structures like linked lists, dynamic arrays, circular linked lists, queues, stacks, graphs, etc., or is a basic knowledge of the code enough? What kinds of questions can be asked in a job interview regarding data structures?
I don't know what your (future) employer may ask, but generally I'd say no. You have to know how they work and what they're used for, especially which data structure serves which purpose and what its advantages and disadvantages are. If you know that, you'll be able to write the code for such a structure without having it memorized, because you know how it will work.
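
To illustrate that last point: once you know how a singly linked list works, the code is short enough to rewrite from scratch rather than memorize. A minimal Java sketch:

    // A minimal singly linked list: once the idea of nodes pointing to nodes
    // is clear, this can be rewritten from scratch at any time.
    public class SinglyLinkedList<T> {
      private static class Node<T> {
        T value;
        Node<T> next;
        Node(T value) { this.value = value; }
      }

      private Node<T> head;

      // Insert at the front: O(1), the new node simply points at the old head.
      public void push(T value) {
        Node<T> node = new Node<>(value);
        node.next = head;
        head = node;
      }

      // Walk the chain until the value is found or the list ends: O(n).
      public boolean contains(T value) {
        for (Node<T> n = head; n != null; n = n.next) {
          if (n.value.equals(value)) return true;
        }
        return false;
      }
    }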

What are the performance improving techniques in HBASE? [closed]

Closed 5 years ago because it needs to be more focused; it is not accepting answers.
It can be while creating a table or while running other operations like inserts, updates, and deletes on a table.
I understand that options like Bloom filters and BlockCache can have an impact, but I would like to know the other techniques that improve overall throughput. Also, can anyone show how to add a Bloom filter to an HBase table? I'd like to try it for practice.
Any help is appreciated.
Your question is too general. In order to know how to properly build your data store in HBase, you should understand its internal storage logic and how data is distributed across the regions; that is probably the place to start. I would recommend getting acquainted with the LSM-tree and how HBase implements it in this article. After that, I would advise you to read about proper design of the data schema here, as it plays the main role in your performance: a correct schema with a good row key distributes your data evenly across the nodes and keeps you from hotspotting. Then you can start looking at optimization techniques like Bloom filters, BlockCache, custom secondary indexes and other things.
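
Since the question specifically asks how to add a Bloom filter, here is a hedged sketch using the HBase 1.x Java admin API; the table name "metrics" and column family "d" are made up for the example. The same setting can also be applied from the HBase shell via the BLOOMFILTER attribute of the column family.

    // Creating a table with a ROW-level Bloom filter on one column family
    // (classic HBase 1.x API; the names "metrics" and "d" are hypothetical).
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.regionserver.BloomType;

    public class CreateTableWithBloom {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
          HTableDescriptor table = new HTableDescriptor(TableName.valueOf("metrics"));
          HColumnDescriptor family = new HColumnDescriptor("d");
          family.setBloomFilterType(BloomType.ROW);   // ROW or ROWCOL
          family.setBlockCacheEnabled(true);          // BlockCache is on by default anyway
          table.addFamily(family);
          admin.createTable(table);
        }
      }
    }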

Loading huge data from Oracle to Teradata [closed]

Closed 6 years ago as opinion-based; it is not accepting answers.
What is the best way to export huge volumes of data from an Oracle DB and load them into a Teradata DB?
I would suggest using Teradata Parallel Transporter (TPT). It offers different operators to move data depending on your needs and the volume of data. It may be possible to accomplish this via named pipes, so that you don't have to physically land the data on disk before loading it into Teradata.
I would recommend reading the information available about TPT and rephrasing your question accordingly once you have a general direction in which you would like to proceed. This article on Teradata's Developer Exchange is a few years old but gives you a foundation from which you can move forward.

Does Hadoop use HBase as an "auxiliary" between the map and the reduce step? [closed]

Closed 7 years ago because it needs details or clarity; it is not accepting answers.
Or does HBase not have anything to do with this process?
I have read that HBase works on top of Hadoop, and I have seen some diagrams that show HBase as part of the MapReduce layer of Hadoop, but I have not found anything concrete about my question.
The Map/Reduce framework itself doesn't rely on HBase. It would be interesting to see pointers to the diagrams you mention.
You can communicate with HBase in your map/reduce code, if you like (e.g. look up values by key).
HBase does work "on top of Hadoop": it stores its data in HDFS, relies on ZooKeeper, and its servers can run on the same cluster.
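
To make the second point concrete, here is a hedged sketch of a mapper that looks up values in an HBase table while processing its input; the table name "users", column family "p" and qualifier "name" are invented for the example.

    // A mapper that looks up a value in HBase for each input record.
    // MapReduce itself does not need HBase; this is purely application code.
    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class EnrichingMapper extends Mapper<LongWritable, Text, Text, Text> {
      private Connection connection;
      private Table users;

      @Override
      protected void setup(Context ctx) throws IOException {
        connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        users = connection.getTable(TableName.valueOf("users"));
      }

      @Override
      protected void map(LongWritable key, Text value, Context ctx)
          throws IOException, InterruptedException {
        String userId = value.toString().split(",")[0];
        Result row = users.get(new Get(Bytes.toBytes(userId)));
        byte[] name = row.getValue(Bytes.toBytes("p"), Bytes.toBytes("name"));
        ctx.write(new Text(userId), new Text(name == null ? "unknown" : Bytes.toString(name)));
      }

      @Override
      protected void cleanup(Context ctx) throws IOException {
        users.close();
        connection.close();
      }
    }

If the HBase table itself is the job input, TableMapper and TableMapReduceUtil are usually a better fit than issuing a Get per record.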

Use cases of hadoop [closed]

Closed 10 years ago as off-topic for Stack Overflow; it is not accepting answers.
Recently I started learning Hadoop, and all I found were examples that read text data and compute a word count; more or less all of the examples were the same task. Please help me understand: is that the only use case of Hadoop? Please point me to references for more real-world use cases, or to places where I can learn where Hadoop can be used and how to write such jobs.
Thanks
I can try to outline a few directions, restricting myself to MapReduce:
a) ETL - data transformations. Here Hadoop shines, since latency is not important but scalability is.
b) Hive / Pig. There are cases when we actually need SQL or SQL-like functionality over big data sets but cannot afford a commercial MPP database.
c) Log processing of different kinds (see the sketch after this list).
d) Deep analytics - when we simply want to run Java code over massive data volumes. Mahout is used in many cases as the machine learning library.
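
As a small illustration of (c), here is a hedged sketch that counts HTTP status codes in access logs; the assumption that the status code is the ninth space-separated field (as in the Apache common log format) is made only for the example.

    // Counts HTTP status codes in access-log lines: a typical small
    // log-processing job, only slightly more interesting than word count.
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class StatusCodeCount {

      public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text line, Context ctx)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split(" ");
          if (fields.length > 8) {              // assumed log layout: status is the 9th field
            ctx.write(new Text(fields[8]), ONE);
          }
        }
      }

      public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text status, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          int total = 0;
          for (IntWritable c : counts) total += c.get();
          ctx.write(status, new IntWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "status code count");
        job.setJarByClass(StatusCodeCount.class);
        job.setMapperClass(LogMapper.class);
        job.setCombinerClass(SumReducer.class);   // same types in and out, so it doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }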
