Hadoop starter project suggestions

I would love to find a few topics, thanks.

MergeSort is a fantastic/easy one to start with. You could also go with generating word counts for all words in a file. A good source of data is the Project Gutenberg library of public domain books (you could always concatenate a few of them together).
If you want something more advanced but in the same vein as word count, you could write a very simple distributed spell checker. Peter Norvig has an awesome, simple demonstration of a spell checker written in Python. A good exercise would be extending this algorithm to operate on a file in a distributed fashion.
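To make the word count suggestion concrete, here is a minimal sketch of the map, shuffle, and reduce steps simulated locally in plain Python. It does not use Hadoop itself; the function names and the sample lines are made up for illustration, and a real job would read its input from files such as the Project Gutenberg texts.

```python
import re
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for every word on one input line."""
    return [(word, 1) for word in re.findall(r"[a-z']+", line.lower())]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would do
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = chain.from_iterable(map_words(line) for line in lines)
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())


# Hypothetical sample input standing in for a real text file.
sample = ["The quick brown fox", "jumps over the lazy dog", "The dog barks"]
counts = word_count(sample)
print(counts)
```

Keeping the mapper and reducer as separate pure functions is deliberate: the same two functions are the parts you would port to a real Hadoop job, while the local shuffle is only a stand-in for what the framework does.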

You have a few projects here
There are a few nice and interesting examples of small Hadoop projects. Everything is described very well, and you can also find the source code and all the theory you need.

Related

Marble diagram generator java/javascript for documentation using rxjava/rxjs or reactor

I am looking to create documentation for a project built with the Reactor library.
I searched but did not find any useful tool that generates image diagrams after running a piece of Reactor (or Rx in general) code. The only thing I found is a text-based syntax like this, which I guess is the solution to follow if I don't find anything else.
Libraries I found that use this syntax:
https://flames-of-code.netlify.com/blog/rx-marbles/
https://github.com/cescoffier/rx-marble-docker
Ideally I would like to run a piece of code, e.g.
    Flux.from(f1)
        .bufferTimeout(writeDbBuffer, Duration.ofSeconds(10))
        .parallel()
        .runOn(Schedulers.parallel())
        .subscribe(photosBatch -> {
            photoRepository.saveAll(photosBatch);
        });
and generate a marble diagram as an image or even as text.
As a workaround, I could write text generators based on the syntax mentioned above, but this would require a lot of effort and time.
Is there any way to generate images of marble diagrams from pieces of code with RxJava, RxJS, or preferably Reactor? (I am including Rx because it is far more popular than Reactor.)
Is there any library that generates the above text-based syntax from pieces of code?
What other options do I have for documenting code written with these libraries?
There is also a similar question, but it is not exactly what I am looking for.
Something that dynamic is, to my knowledge, not yet available in the Java world. The closest thing I know of is rxfiddle, and to an extent rxmarbles.com (although the latter doesn't allow generation from arbitrary pieces of code).
Generating clean and good looking visualization of arbitrary reactive sequences dynamically is no small task, but that's something the Reactor team would love to see at some point (either done officially or by the community).
The text-based solutions are great for simple marbles and simple operators, because you are in essence drawing the marble yourself, using the syntax of each tool (and thus being limited by it).
Higher-order sequences, parallelization, etc. introduce far greater complexity and start to stretch these tools to their limits.
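To illustrate what "drawing the marble yourself" amounts to for a simple sequence, here is a rough Python sketch that turns recorded (time, label) events into one line of ASCII. The output format is made up for this example and is not the syntax of any of the tools mentioned above; it also assumes single-character labels, at most one event per tick, and events that all occur before end_time.

```python
def render_marble(events, end_time, tick=1):
    """Render (time, label) events as one ASCII marble line.

    '-' is an empty tick and '|' marks completion. This is an
    illustrative format only, not the syntax of an existing tool.
    """
    slots = ["-"] * (end_time // tick)
    for time, label in events:
        slots[time // tick] = str(label)
    return "".join(slots) + "|"


# Hypothetical recording of three emissions followed by completion at t=6.
line = render_marble([(1, "a"), (3, "b"), (4, "c")], end_time=6)
print(line)
```

A flat timeline like this is exactly where such an approach stops being enough: a parallel or higher-order sequence needs several related lines, which is the complexity described above.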

Sentiment Analysis of given text

This topic already has many threads, but I am posting another one. All of those posts may describe a way to do sentiment analysis, but I did not find one that works for me.
I want to implement sentiment analysis myself, so I would ask you to show me a way. During my research, I found that it is used in practice. I guess a Bayesian algorithm is used: it counts positive and negative words and, using a bag of words, calculates the probability of a sentence being positive or negative.
That covers only the words; I guess we have to do language processing too. Is there anyone who has more knowledge? If so, can you point me to some algorithms, with links for reference, so that I can implement them? Anything in particular that may help me in my analysis is welcome.
Also, can you recommend a language that I can work with? Some say Java is comparatively time-consuming, so they don't recommend working with it.
Any type of help is much appreciated.
First of all, sentiment analysis is done on various levels, such as document, sentence, phrase, and feature level. Which one are you working on? There are many different approaches to each of them. You can find a very good intro to this topic here. For machine-learning approaches, the most important element is feature engineering and it's not limited to bag of words. You can find many other useful features in different applications from the tutorial I linked. What language processing you need to do depends on what features you want to use. You may need POS-tagging if POS information is needed for your features for example.
For classifiers, you can try Support Vector Machines, Maximum Entropy, and Naive Bayes (probably as a baseline). These are frequently used in the literature, and the link also has a pretty comprehensive list of that literature. The Mallet toolkit contains ME and NB, and if you use SVMlight, you can easily convert the feature formats to the Mallet format with a function. Of course there are many other implementations of these classifiers.
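As a rough idea of what the Naive Bayes baseline looks like, here is a minimal Python sketch of a multinomial Naive Bayes classifier over bag-of-words features with add-one smoothing. The class name, the tokenizer, and the four training sentences are all made up for illustration; a real experiment would need a labeled corpus and better features.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            self.doc_counts[label] += 1
            counter = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counter[word] += 1
                self.vocab.add(word)

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counter in self.word_counts.items():
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counter.values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counter[word] + 1) / denom)
            scores[label] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this example.
model = NaiveBayesSentiment()
model.train([
    ("I love this wonderful movie", "pos"),
    ("what a great and happy ending", "pos"),
    ("I hate this terrible movie", "neg"),
    ("what a boring and awful plot", "neg"),
])
print(model.classify("a great movie"))
print(model.classify("terrible boring plot"))
```

Working in log space avoids underflow on longer texts, and the smoothing keeps a word seen under only one label from forcing the other label's probability to zero.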
For rule-based methods, Pointwise Mutual Information is frequently used, along with various scoring-based methods.
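For reference, Pointwise Mutual Information is defined as PMI(x, y) = log2(p(x, y) / (p(x) p(y))). A small Python sketch follows; estimating the probabilities from document-level co-occurrence and the four-document corpus are assumptions made for this example, not something prescribed above.

```python
import math


def pmi(docs, x, y):
    """PMI(x, y) = log2(p(x, y) / (p(x) * p(y))).

    Probabilities are estimated as the fraction of documents that
    contain the word(s); docs is a list of sets of words.
    """
    n = len(docs)
    p_x = sum(x in doc for doc in docs) / n
    p_y = sum(y in doc for doc in docs) / n
    p_xy = sum(x in doc and y in doc for doc in docs) / n
    if p_xy == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(p_xy / (p_x * p_y))


# Toy corpus, invented for this example.
docs = [set(text.split()) for text in [
    "excellent friendly service",
    "excellent friendly staff",
    "poor rude service",
    "poor slow staff",
]]
print(pmi(docs, "friendly", "excellent"))
print(pmi(docs, "service", "excellent"))
```

A positive value means the two words appear together more often than independence would predict, zero means no association, and a word that never co-occurs with the reference word gets negative infinity in this unsmoothed version.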
Hope this helps.
For text analysis, there is no language stronger than SNOBOL. In SNOBOL-4, a Fortran interpreter, for example, takes only 60 lines.
NLTK offers really good algorithms for sentiment analysis. It is open source, so you can have a look at the source code and check out the algorithms used. You can even download the NLTK book, which is free and has some good material on sentiment analysis.
Coming to your second point, I don't think Java is that slow. I have been coding in C++ for years, but lately I have also started with Java, since a lot of very popular open source software, such as Lucene, Solr, Hadoop, and Neo4j, is written in Java.

How to implement AIML in Prolog?

AIML files: http://www.alicebot.org/aiml/aaa/
I want to make these AIML files the knowledge base of my Prolog program.
Help me. Thanks in advance.
P.S. Excuse my bad English.
http://pycdep.sourceforge.net contains something AIML-like implemented in Prolog.
Maybe it can serve as a starting point.
You might want to consult (rent it from your local library, don't buy the whole book) the following book:
An Introduction to Language Processing with Perl and Prolog
Pierre M. Nugues (Author)
Text Book
Before delving into chart parsers and the like, the book contains two sections that deal with ELIZA-like template matching. The sections are:
9.2 Word Spotting and Template Matching
9.3 Multiword Detection
It has Prolog code snippets. The code snippets are not optimized for speed or large volumes, but they show the general idea of the algorithms.
The book is good on computational linguistics, but you might want to consult additional sources for Q&A logic etc.
Best Regards
P.S.: Currently working as well on a Prolog port of a Java/Prolog hybrid extraction pipeline
CAT

Automate Finding Pertinent Methods in Large Project

I have tried to be disciplined about decomposing into small reusable methods when possible. As the project grows, I find myself re-implementing the exact same methods.
I would like to know how to deal with this in an automated way. I am not looking for an IDE specific solution. Dependency on method names may not be sufficient. Unix and scripting are solutions that would be extremely beneficial. Answers such as "take care" etc. are not the solutions I am seeking.
I think the cheapest solution to implement might be to use Google Desktop. A more accurate solution would probably be much harder to implement - treat your code base as a collection of documents where the identifiers (or tokens in the identifiers) are words of the document, and then use document clustering techniques to find the closest matching code to a query. I'm aware of some research similar to that, but nothing close to out-of-the-box code that you could use. You might try looking on Google Code Search for something. I don't think they offer a desktop version, but you might be able to find some applicable code you can adapt.
Edit: And here's a list of somebody's favorite code search engines. I don't know whether any are adaptable for local use.
Edit2: Source Code Search Engine is mentioned in the comments following the list of code search engines. It appears to be a commercial product (with a free evaluation version) that is intended to be used for searching local code.
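To give a feel for the "code base as a collection of documents" idea, here is a rough Python sketch that splits identifiers into word tokens and ranks methods by cosine similarity to a query. It is a plain term-frequency ranking, not the document clustering mentioned above, and the method names and bodies in the index are hypothetical.

```python
import math
import re
from collections import Counter


def identifier_tokens(source):
    """Split identifiers into lowercase word tokens,
    e.g. parseHttpHeader or parse_http_header -> parse, http, header."""
    tokens = []
    for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        for part in ident.split("_"):
            words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
            tokens.extend(word.lower() for word in words)
    return tokens


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(methods, query, top=3):
    """Return the names of the methods whose tokens best match the query."""
    q = Counter(identifier_tokens(query))
    scored = [(cosine(q, Counter(identifier_tokens(body))), name)
              for name, body in methods.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0][:top]


# Hypothetical index: method name -> source text of the method.
methods = {
    "readFileLines": "def readFileLines(path): return open(path).read().splitlines()",
    "parse_http_header": "def parse_http_header(raw): name, value = raw.split(':', 1); return name, value",
    "sumList": "def sumList(values):\n    total = 0\n    for v in values: total += v\n    return total",
}
print(search(methods, "read lines from file"))
```

Because it works on tokens inside identifiers instead of whole method names, a query can still find a method whose name you do not remember. Language keywords such as def and return are counted as tokens here, so a real index would probably filter them out or weight terms by rarity.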

Finding patterns in source code

If I wanted to learn about pattern recognition in general what would be a good place to start (recommend a book)?
Also, does anybody have any experience/knowledge on how to go about applying these algorithms to find abstraction patterns in programs? (repeated code, chunks of code that do the same thing, but in slightly different ways, etc.)
Thanks
Edit: I don't mind mathematically intensive books. In fact, that would be a good thing.
If you are reasonably mathematically confident, then either of Chris Bishop's books, "Pattern Recognition and Machine Learning" or "Neural Networks for Pattern Recognition", is very good for learning about pattern recognition.
It helps if you have access to the parse tree generated during compilation. That way you can look for pieces of the tree that are similar while ignoring nodes deeper than the level you are looking at. For example, you can pick out nodes that multiply two sub-expressions together, ignoring the contents of the sub-expressions. You can apply the same logic to a collection of nodes: say you want to find a multiplication of two sub-expressions where both are additions of further sub-expressions. You first look for multiplications, then check whether the two nodes underneath each one are additions, ignoring anything deeper.
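The multiplication-of-two-additions search described above can be sketched with Python's standard ast module, which gives access to the parse tree without writing a compiler. The function names and the sample source are made up for illustration, and ast.unparse needs Python 3.9 or later.

```python
import ast


def is_add(node):
    """True if the node is an addition, whatever its operands contain."""
    return isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add)


def find_mult_of_sums(source):
    """Return the source text of every multiplication whose two operands
    are both additions, ignoring anything deeper in the tree."""
    matches = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
            if is_add(node.left) and is_add(node.right):
                matches.append(ast.unparse(node))
    return matches


# Hypothetical code to search through.
code = """
area = (w + pad) * (h + pad)
scaled = k * (a + b)
nested = (f(x) + g(y) ** 2) * (1 + z)
"""
matches = find_mult_of_sums(code)
for match in matches:
    print(match)
```

The second assignment is skipped because its left operand is a plain name, while the third one matches even though its additions contain calls and a power, since only the top two levels of the subtree are inspected.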
I'd suggest looking at the code of some open source project (e.g. FindBugs or SIM) that does the kind of thing you're talking about.
If you're working in one of the supported languages, IntelliJ IDEA has a really smart structural search and replace that would fit your problem.
Other interesting projects are PMD and Eclipse.
Eclipse uses ASTs (abstract syntax trees) for all source code in any project. Tools can then register for certain types of ASTs (like Java source) and get a preprocessed view where they can add additional information (like links to documentation, error markers, etc.).
Another project you can look into is Duplo - it's an open-source/GPL project, so you can pore over their approach by grabbing the code from SourceForge.
Clone Detective is specific to .NET and Visual Studio, but it finds duplicate code in your project. I've found that it does report some false positives, but it could be a good place to start.
One kind of pattern is code that has been cloned by copy and paste methods. See CloneDR for a tool that automatically finds such code in spite of variations in layout and even changes in the body of the clone, by comparing abstract syntax trees for the language in question.
CloneDR works with a variety of languages: C, C++, C#, Java, JavaScript, PHP, COBOL, Python, ... The website shows clone detection reports for a variety of programming languages.
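To show the general idea of comparing syntax trees instead of text, here is a much-simplified Python sketch using the standard ast module. It is not how CloneDR works internally. It only finds functions whose trees become identical once names and constants are replaced by placeholders, and the sample functions are made up for illustration.

```python
import ast
import copy
from collections import defaultdict


class Normalizer(ast.NodeTransformer):
    """Replace names and constants with placeholders so that renamed
    copies of the same code produce the same tree."""

    def visit_Name(self, node):
        return ast.Name(id="_", ctx=node.ctx)

    def visit_arg(self, node):
        node.arg = "_"
        return node

    def visit_Constant(self, node):
        return ast.Constant(value=0, kind=None)


def fingerprint(func):
    """Dump the normalized tree of one function as a comparable string."""
    node = copy.deepcopy(func)  # keep the original tree untouched
    node.name = "_"
    return ast.dump(Normalizer().visit(node))


def find_clones(source):
    """Group function names whose normalized trees are identical."""
    groups = defaultdict(list)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            groups[fingerprint(node)].append(node.name)
    return [names for names in groups.values() if len(names) > 1]


# Hypothetical code base containing one copy-and-paste clone.
source = '''
def total_price(items):
    result = 0
    for item in items:
        result += item.price * 2
    return result

def sum_cost(products):
    acc = 0
    for p in products:
        acc += p.price * 3
    return acc

def first(items):
    return items[0]
'''
clones = find_clones(source)
print(clones)
```

The normalization is deliberately coarse: every name and literal is treated as interchangeable, so it can report false positives, and it does not tolerate inserted or deleted statements the way a real clone detector does.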
