How does the MapReduce framework implement the sort phase?

I am interested in the implementation of the MapReduce sort phase; it seems to be very efficient. Could someone provide some references about it please? Thanks!

ReduceTask.java is the place where the sort phase is coded; see lines 393-408 in ReduceTask.java. If you need more detail, download the entire source and dig into it.
EDIT:
"Sort" phase falls under ReduceTask as shown in this figure below from hadoop book. (Page no: 163)

How to import jars in Altova MapForce

Hello, I am new to Altova MapForce. I wanted to know the procedure for importing JAR files in MapForce.
Regards,
RBRK
If you mean "how do I add jar files that contain user-defined functions for use in my mappings", that is covered in the documentation.
If you mean "how do I add jar files at runtime when running a map", that is what the Java classpath is for.
If it's the first one, then read the documentation, make your attempts, and post the results here if it doesn't work. As posed, your question doesn't give us enough information about what you are trying to do. You'll get a much faster resolution, and a much more realistic answer, if you show your work when you describe what you are trying to do.
rip

Where could I find an implementation of SVM on Hadoop?

I found an implementation at http://code.google.com/p/cascadesvm/.
However, there is no documentation for it. Has anyone tried it? Or where could I find an alternative implementation of SVM on Hadoop?
Thanks a lot~
Looks like someone did this within the Mahout project, not sure if it's been merged into trunk, but this looks like a good place to start:
https://issues.apache.org/jira/browse/MAHOUT-232
You can check it out at https://code.google.com/p/cascadesvm/
The training part and a demo of the MATLAB version have been released:
https://code.google.com/p/cascadesvm/wiki/CascadeSVMMatlabVersion

Best way to get acclimated with a new Ruby on Rails project

What tools or steps would you recommend to someone who is brand new to a project and trying to get acclimated to a Ruby on Rails codebase that has no tests?
I am considering something like: https://codeclimate.com/ to help run some analysis on the code but I wanted other suggestions.
Thanks!
I use the command line tool wc to find where the code hotspots are. Running wc ./app/models/**/*.rb | sort -nr on my Mac gives me a pretty good idea of where the code is sitting. You can replace models with controllers or any other directory to find the details there.
Once I have a good idea of where things are, it's easier to find the larger and more complex areas of code. A brief description of the project and a run of wc should give you a pretty good idea of which data models and controllers are the most complex, and give you an idea of where to go for further investigation.
If the project is well tested I would definitely take the time to read through the spec headings. I'd take time to read through the implementation details of specs that interest me.
The Ruby Rogues podcast has a pretty good episode about code reading that you may find helpful: http://rubyrogues.com/031-rr-code-reading/
I'm hoping this is a well-tested project... If it is, I always read through the tests first, as they give you an understanding of what the expectations of each piece of code are. This is by far the best method that has worked for me.

How do I see/debug the way Solr finds its results?

Let's say I search for "ABLS" and Solr returns a result that does not make any sense to me.
How can I debug why Solr picked this record to be returned?
debugQuery=true will get you the detailed score calculation and the explanation for each score.
An overview of the scoring is available at link.
For a detailed explanation of the debug information, you can refer to Link.
You could add debugQuery=true&indent=true to the URL and examine the results. You could also use the analysis tool in Solr: go to the admin and click Analysis. You would need to read the wiki to understand either of these in more depth.
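If you are querying from Java instead of building the URL by hand, the same parameter can be set through SolrJ. A minimal sketch, assuming a local core named mycore (the URL and core name are placeholders, not from the question):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DebugQueryDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
            SolrQuery query = new SolrQuery("ABLS");
            query.set("debugQuery", "true"); // same flag as in the URL
            QueryResponse response = client.query(query);
            // Keys are document ids, values are the per-document score breakdown,
            // i.e. the same explain text you would see in the raw response.
            response.getExplainMap().forEach((id, explanation) ->
                System.out.println(id + " => " + explanation));
            client.close();
        }
    }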
debugQuery will give you insight into why your scoring looks the way it does (and how relevant every field is).
If you get results that you don't understand, play with them in Solr's analysis tool.
You should find it under:
/admin/analysis.jsp?highlight=on
Alternatively, turn on highlighting over your results to see what is actually matching.
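Highlighting can also be switched on from SolrJ; a minimal sketch, again assuming a local core named mycore (URL and core name are placeholders):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
            SolrQuery query = new SolrQuery("ABLS");
            query.setHighlight(true);     // equivalent to hl=true
            query.addHighlightField("*"); // equivalent to hl.fl=*
            QueryResponse response = client.query(query);
            // Map of document id -> (field -> matching snippets); the snippets
            // show which terms actually matched in each document.
            System.out.println(response.getHighlighting());
            client.close();
        }
    }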
Solr queries are full of short parameters that are hard to read and modify, especially when there are many of them.
After that it is even harder to debug and understand why one document is more or less relevant than another. The debug explain output is usually a tree too big to fit on one page.
I found this Google Chrome extension useful for viewing the Solr query explain and debug output in a clear manner.
For those who still use the very old Solr 3.x versions, "debugQuery=true" will not output the debug information; you should specify "debugQuery=on" instead.
There are two ways of doing that. The first is at the query level, which means adding debugQuery=on to your query. That will include a few things:
parsed query
debug timing information
detailed scoring information, which helps you analyze why a given document received its score.
In addition to that, you can use the [explain] transformer and add it to your fl parameter. For example ...&fl=*,[explain], which will result in your documents having the scoring information as another field.
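For example, from SolrJ (the core URL and the id field below are assumptions; adjust them to your schema):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class ExplainTransformerDemo {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
            SolrQuery query = new SolrQuery("ABLS");
            query.setFields("*", "[explain]"); // equivalent to fl=*,[explain]
            for (SolrDocument doc : client.query(query).getResults()) {
                // "[explain]" arrives as an ordinary field on each document
                System.out.println(doc.getFieldValue("id") + " => "
                    + doc.getFieldValue("[explain]"));
            }
            client.close();
        }
    }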
The scoring information can be quite extensive and will include calculations done by the similarity algorithm. If you would like to learn more about similarities and the scoring algorithm in Solr, have a look at this talk that my colleague Radu from Sematext and I gave at the Activate conference: https://www.youtube.com/watch?v=kKocQdYGVJM

How to use the MultipleTextOutputFormat class to rename the default output file to some meaningful names?

After the reduce phase in Hadoop, I wanted the output file names to be something meaningful, depending on the input key value. However, I have not been successful in following the example in "Hadoop: The Definitive Guide", which uses MultipleTextOutputFormat to do this. Is the reason that it's based on the old API and doesn't work with the new API?
Can anybody hint at the solution or point me to the relevant documentation?
You are probably right. Things that worked in the old API don't always work in the new one.
There is a "new way" of doing this now, called MultipleOutputs.
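A minimal sketch of MultipleOutputs in the new API; the class is real, but the reducer below and its type choices are illustrative. The third argument to write() becomes the base of the output file name, so here each key gets its own file:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    public class NamedOutputReducer extends Reducer<Text, Text, Text, Text> {
        private MultipleOutputs<Text, Text> out;

        @Override
        protected void setup(Context context) {
            out = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // Output lands in files like <key>-r-00000; the key must be
                // a legal file name for this to work.
                out.write(key, value, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            out.close(); // flush and close the extra writers
        }
    }

In the driver you may also want LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) so that empty default part-r-* files are not created alongside the named ones.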
