Does anyone know if there is any literature out there comparing mapjoins and reducejoins in hive? - hadoop

I need to compare the two for a presentation I'm preparing, but I can't use the data I've gathered on my own computer since it would be unreliable. If anyone can point me to any literature on this, it would be very helpful.
Thank you!

You can refer to Chapter 8 of O'Reilly's Hadoop: The Definitive Guide for map-side and reduce-side joins. It has a good comparison.
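To make the two strategies concrete while looking for literature, here is a toy in-memory sketch in Python. It is not Hive's implementation and says nothing about performance; the function names and the sample rows are made up for illustration. The first function mimics a map-side join, where the small table is held in a hash table and the big table streams past it with no shuffle. The second mimics a reduce-side join, where rows from both tables are grouped by key first and joined within each group.

```python
from collections import defaultdict


def map_side_join(big_rows, small_rows):
    """Join by loading the small table into memory and streaming the big one."""
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)
    # Each big-table row is joined on the spot; no grouping step is needed.
    return [
        (key, big_value, small_value)
        for key, big_value in big_rows
        for small_value in lookup.get(key, [])
    ]


def reduce_side_join(left_rows, right_rows):
    """Join by grouping both tables on the key, then pairing rows per group."""
    groups = defaultdict(lambda: ([], []))
    # "Shuffle" phase: route every row to the group for its key,
    # remembering which table it came from.
    for key, value in left_rows:
        groups[key][0].append(value)
    for key, value in right_rows:
        groups[key][1].append(value)
    # "Reduce" phase: emit the cross product inside each group.
    return [
        (key, left_value, right_value)
        for key, (lefts, rights) in groups.items()
        for left_value in lefts
        for right_value in rights
    ]


# Hypothetical sample data: (join key, payload)
ORDERS = [(1, "o1"), (2, "o2"), (1, "o3"), (3, "o4")]
USERS = [(1, "alice"), (2, "bob")]
```

Both functions return the same inner-join rows for the sample data; the difference is where the work happens, which is why the map-side variant only fits when one table is small enough to keep in memory.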

Related

What is the best way to learn about the Hadoop ecosystem

I'm a Data Scientist with a background in pure mathematics, so I have a bit of a learning curve in terms of tools. After working in the industry for about a year, I understand that a Data Scientist should also know some Data Engineering. Can anyone point me to some resources? My current tech stack consists mostly of Python (PySpark), etc.
It depends on what exactly you want to learn about the Hadoop ecosystem.
I would recommend starting with this book:
Hadoop: The Definitive Guide. It can help you understand how Hadoop works under the hood and what the Hadoop ecosystem consists of. You don't need every chapter of this book, but many of them may be really useful.
You should probably also check this book:
Spark: The Definitive Guide, since Spark is commonly used in data science. It is a more practical book than the previous one.

Material and Information to improve algorithmic knowledge

Lately I have been stuck on improving my algorithmic skills, and at this point I find myself out of good material for solving grid problems based on DFS and BFS. I somehow managed to solve http://www.spoj.pl/problems/POUR1/ with brute-force logic, but I recently googled it and found out that the problem can be solved with BFS. I can't figure out exactly how to go about it. Can someone please provide some text to read, or some kind of explanation of the above-mentioned problem, so I can add this to my skill set? It would be extremely kind if you could also help me with these techniques in problems like http://www.codechef.com/problems/MMANT/ . Please help as soon as possible; I am really stuck on these kinds of problems and can't move on. It would also be really kind if you could provide a list of good questions about Binary Indexed Trees and segment trees, and some more examples of their usage.
Thanks for the help!! :)
One resource I've found useful is The Algorithmist:
The Algorithmist is a resource dedicated to anything algorithms - from
the practical realm, to the theoretical realm. There are also links
and explanation to problemsets.
Also, The Algorithm Design Manual by Steven Skiena is extremely useful, especially the second part.
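As for how BFS applies to a water-pouring problem like the one linked in the question, a minimal sketch in Python follows. It assumes the task is: given two jugs with capacities a and b, find the fewest fill, empty, or pour operations that leave exactly c in either jug. Each state is a pair (x, y) of current amounts, each operation is an edge, and BFS visits states in order of increasing step count, so the first state containing c gives the minimum. The function name is made up, and this is not tuned for any judge's time limits.

```python
from collections import deque


def min_pour_steps(a, b, c):
    """Fewest operations to get exactly c in either jug, or -1 if impossible."""
    if c > max(a, b):
        return -1  # neither jug can ever hold c
    start = (0, 0)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        (x, y), steps = queue.popleft()
        if x == c or y == c:
            return steps
        pour_xy = min(x, b - y)  # amount that fits when pouring jug A into B
        pour_yx = min(y, a - x)  # amount that fits when pouring jug B into A
        neighbours = [
            (a, y),  # fill A
            (x, b),  # fill B
            (0, y),  # empty A
            (x, 0),  # empty B
            (x - pour_xy, y + pour_xy),  # pour A into B
            (x + pour_yx, y - pour_yx),  # pour B into A
        ]
        for state in neighbours:
            if state not in seen:
                seen.add(state)
                queue.append((state, steps + 1))
    return -1  # every reachable state was explored without hitting c
```

For example, with jugs of 5 and 2 the call `min_pour_steps(5, 2, 3)` takes two operations: fill the 5-jug, then pour it into the 2-jug. The same pattern (state, moves, visited set, queue) carries over to grid problems, where the state is a cell and the moves are the allowed steps.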

What is the Practical Byzantine Fault Tolerance?

Can somebody please provide a gist of the Byzantine Fault Tolerant algorithm and Liskov's algorithm?
Thanks.
I think the introduction to Chapter 4 of Castro and Liskov's article from 1999 gives a concise and good overview of the inner workings of the algorithm: http://pmg.csail.mit.edu/papers/osdi99.pdf
You can learn many of the details of how PBFT works by reading the paper published at OSDI (1999).
If you want a thorough understanding of the algorithm used in PBFT, I highly recommend the doctoral thesis and the technical paper. Both were written by the original author, Miguel Castro, and together they contain almost everything you would want to know about PBFT. If you want to see its implementation at the code level, you can download and check the software on this page.

Text search question about implementation

Can someone explain to me how text searching algorithms work? I understand it's a huge field, but I am trying to understand it at a high level so that I can look up academic papers on it.
For example, spelling mistakes are one problem that is tough to solve, and of course Google solves it. When I search for a term and misspell it on Google, it automatically suggests the correct spelling. How is indexing done for that? Using MapReduce, I can see they index various entities. What do they, or anyone else, index and store? Maybe I am looking for a practical implementation of MapReduce, if I am thinking in the right direction at all.
Pav
I'm afraid this question really is too big, which probably explains why it has not received an answer yet. As far as Google's spell-checker is concerned, Peter Norvig explains how it is done: How to Write a Spelling Corrector
The exact implementation in production use at Google surely looks quite a bit different and is far more complicated, but this might get you started.
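To give a feel for the idea, here is a reduced sketch in Python in the spirit of that essay, not a copy of it. It assumes the model is just word counts from a corpus, generates every string one edit away from the input (delete, transpose, replace, insert), and picks the candidate seen most often. The tiny corpus and the names `COUNTS` and `correct` are placeholders for illustration; a usable corrector needs a large corpus and also considers candidates two edits away.

```python
import re
from collections import Counter


def words(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings that are exactly one simple edit away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    ]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Return the most frequent known word within one edit, else the input."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    if not candidates:
        return word
    return max(candidates, key=counts.get)


# Placeholder corpus; a real one would be millions of words.
COUNTS = Counter(words("the quick brown fox jumps over the lazy dog the end"))
```

With this toy corpus, `correct("teh", COUNTS)` finds "the" through a transposition. What gets "indexed" here is only a word-frequency table, which is the kind of count that a MapReduce job over a large corpus can produce.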

naive bayesian spam filter question

I am planning to implement a spam filter using the Naive Bayesian classification model.
Online I see a lot of info on Naive Bayesian classification, but the problem is that it's mostly mathematical material rather than a clear statement of how it's done. I am more of a programmer than a mathematician (yes, I learnt probability and Bayes' theorem back in school, but I have been out of touch for a long time, and I don't have the luxury of relearning it now: I have nearly 3 weeks to come up with a working prototype).
So if someone can explain it, or point me to a place where it's explained for programmers rather than mathematicians, it would be a great help.
PS: By the way I have to implement it in C, if you want to know. :(
Regards,
Microkernel
The book Programming Collective Intelligence has a chapter that covers this and other methods. The chapter (#6) can be understood without reference to previous chapters, is written clearly, and discusses only the minimal mathematics necessary to get the job done.
You could try this website. It's got some source code.
I would highly recommend Andrew Moore's tutorials and I think you should start with this one.
You could also take a look at POPFile, an open source spam filter engine.
Have you looked at dspam?
http://dspam.irontec.com/faq.shtml#1.0
http://www.nuclearelephant.com/
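For a programmer's view of the technique itself, here is a minimal sketch, written in Python for brevity even though the question targets C. It assumes whitespace tokenization, two labels ("spam" and "ham"), and add-one (Laplace) smoothing so that unseen words do not zero out a score; the class and method names are made up for illustration. Training only counts words per label. Classification adds the log of the label's prior to the log of each word's smoothed frequency under that label and picks the larger total.

```python
import math
from collections import Counter


class NaiveBayesSpamFilter:
    """Toy word-count Naive Bayes classifier with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record one labelled message."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, tokens, label, vocab_size):
        # log P(label) + sum of log P(token | label), using logs to avoid
        # underflow from multiplying many small probabilities.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for token in tokens:
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        return score

    def classify(self, text):
        """Return the label with the higher score.

        Assumes at least one training message per label.
        """
        tokens = text.lower().split()
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        scores = {
            label: self._log_score(tokens, label, len(vocab))
            for label in self.LABELS
        }
        return max(scores, key=scores.get)
```

The data structures are two word-count tables and two message counters, so a port to C mostly comes down to choosing a hash table for the counts.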

Resources