Interesting projects based on Distributed/Operating Systems - Hadoop

I would like to know some interesting challenges based on distributed systems that could be solved within the time frame of a quarter (my university follows the quarter system!). I am hoping to work on a single project that would satisfy both an Operating Systems course and a Distributed Systems course, so that I would have plenty of time to work on it (as I have taken both courses!). I am looking for a strong programming component. Could anybody point me in the right direction? I know Hadoop/MapReduce is the hot stuff, but I would appreciate it if someone suggested solvable challenges in the field of virtual machines.

MIT has an OCW class, "Distributed Computer Systems Engineering," with a Projects Section which you might find helpful.

Check out this blog entry on MapReduce & Hadoop algorithms in academic papers (5th update, Nov 2011). The problem areas in the blog entry could also be tackled with non-MapReduce distributed algorithms.
I suggest that instead of going with the hype, you pick a topic that interests you and work on it.

Related

Resources to learn how to design algorithms like the ones the Nest Thermostat uses?

I am trying to build some smart home devices myself, and I am very interested in building IoT algorithms like the Nest Thermostat's, which can learn the characteristics of the house and the behavior of the family members.
Though I have some machine learning basics, I know next to nothing about the thermal models that Nest's research and methods are based on.
So if I want to study up and create algorithms similar to Nest's on my own, how should I get started? Any suggested references?
You said it yourself: thermal modelling. So read up on thermodynamics. Until you have read some thermodynamics, you won't even know which part of it you need in order to model heat distribution in a house.
One of the most important things about being a programmer is not programming. Programming is almost the least important thing a programmer does (ranking slightly below debugging). The most important thing about being a programmer is to understand the requirements of the program.
So someone writing an accounting program should know a bit about accounting. He doesn't need to be an expert, but he should at least be able to spot a bug.
Working for big companies, you'll find that you usually have project managers and systems analysts helping you figure out the requirements. But when coding your own project, you have to be your own project manager and architect, so you have to do the reading up yourself.
Now, apart from the general advice above: when writing software to control real-world objects and phenomena, you can't get away from knowing about the PID loop (Proportional, Integral, Derivative). It's how software thermostats control the temperature of industrial ovens, how quadcopters hover without becoming unstable, and how Segways balance themselves.
The theory behind PID control is more than a hundred years old; it was developed to govern steam engines. But it is so useful and important that we generally still depend on it in electronics.
There's a lot of math-heavy theory out there about PIDs. There are also a lot of less complicated rule-of-thumb guides about PIDs aimed at technicians and mechanics. I suggest reading the simpler, less theory-heavy guides first, then working your way up if you need more.
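To make that concrete, here's a toy sketch in Python. The gains and the one-line "room" model are numbers I made up for illustration (deliberately simple, no anti-windup or derivative filtering), not anything a real thermostat ships with:

```python
# Minimal PID controller driving a made-up first-order thermal model.
# kp/ki/kd and the room coefficients are illustrative guesses only.
def make_pid(kp, ki, kd):
    integral = 0.0
    prev_error = None

    def step(setpoint, measured, dt):
        nonlocal integral, prev_error
        error = setpoint - measured
        integral += error * dt                 # I: accumulated error
        derivative = 0.0 if prev_error is None else (error - prev_error) / dt  # D
        prev_error = error
        return kp * error + ki * integral + kd * derivative  # P + I + D

    return step

# Toy room: the heater warms it, it leaks heat to the outside.
temp, outside, dt = 15.0, 5.0, 60.0   # degrees C, seconds per step
pid = make_pid(kp=0.8, ki=0.0005, kd=10.0)

for minute in range(120):
    heat = max(0.0, min(1.0, pid(21.0, temp, dt)))   # clamp actuator to [0, 1]
    temp += dt * (0.005 * heat - 0.0002 * (temp - outside))
    if minute % 20 == 0:
        print(f"t={minute:3d} min  temp={temp:5.2f} C  heater={heat:.2f}")
```

Real controllers add anti-windup, derivative filtering, and proper tuning, which is exactly where those technician-oriented guides come in.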

Where can I find extra resources on how to use Quantopian?

I am sure you all know Quantopian (I love it!!).
Even though I am pretty good with Python, I am still having trouble writing a full algorithm (I am trying to write one using Fundamentals data).
I have successfully used pipeline to get the stocks that I want, but am having trouble with writing the buys and the sells. Specifically, how to tell the program to buy this stock with some logic behind it and how to sell (the same or some other) stock.
I have gone through tons of resources: I checked out other users' algorithms and went through both Quantopian and non-Quantopian material, but I'm still having trouble. Do you have any suggestions for other resources?
P.S. The link above is to this guy 'sentdex'. His tutorials are the best!! Unfortunately, Quantopian has since upgraded their systems, so his tutorials are now out of date.
Background Knowledge:
Building up intuition can be tough. I found the books written by Ernie Chan to be most helpful when looking to complement my cs + math background with domain knowledge. Going through the process of translating portions of his books from MATLAB to Python was especially helpful.
Quantopian Specific Material:
Following strategies written by James Christopher will be worth your while. This algo in particular should serve as a great reference:
https://www.quantopian.com/posts/long-short-pipeline-multi-factor
The lecture series also has a number of gems that will aid in building up the fundamentals necessary to do meaningful work on the platform.
Last But Not Least:
Use the community section as a learning tool, and take advantage of the fact that it is the only place where you can engage with almost 100k quants in one coherent place. The "algotrading" subreddit isn't bad; that said, it can occasionally feel like chewing on sand, as many posts are misinformed.
Disclaimer: I interned at Quantopian in the Fall.
You can try thinking about the overall flow globally:
initialize --> schedule_function --> rebalance --> if/elif logic: order(context.your_ticker_variable, amount)
As for filters, I tried a custom one, but I'm still having trouble with it!
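For what it's worth, here's a bare-bones sketch of that flow, written from memory of Quantopian's then-current API (attach_pipeline/pipeline_output come from quantopian.algorithm; schedule_function, date_rules, time_rules, order_target_percent, and data.can_trade are IDE built-ins). The P/E factor, the ten-stock cutoff, and the equal weights are placeholders of mine, not a recommendation:

```python
# Sketch of initialize -> schedule_function -> rebalance on Quantopian.
from quantopian.algorithm import attach_pipeline, pipeline_output
from quantopian.pipeline import Pipeline
from quantopian.pipeline.data import Fundamentals
from quantopian.pipeline.filters import QTradableStocksUS

def initialize(context):
    # Screen the universe and expose one fundamentals column.
    pipe = Pipeline(
        columns={'pe': Fundamentals.pe_ratio.latest},
        screen=QTradableStocksUS(),
    )
    attach_pipeline(pipe, 'fundamentals')
    # Rebalance once a week, an hour after the open.
    schedule_function(rebalance,
                      date_rules.week_start(),
                      time_rules.market_open(hours=1))

def before_trading_start(context, data):
    context.output = pipeline_output('fundamentals')

def rebalance(context, data):
    # Placeholder logic: long the ten lowest-P/E names, equal weight.
    longs = context.output.nsmallest(10, 'pe').index
    weight = 1.0 / max(len(longs), 1)
    # Sells: exit anything we hold that dropped out of the list.
    for stock in context.portfolio.positions:
        if stock not in longs and data.can_trade(stock):
            order_target_percent(stock, 0.0)
    # Buys: equal-weight the names that passed the filter.
    for stock in longs:
        if data.can_trade(stock):
            order_target_percent(stock, weight)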

Successful FPGA application for HPC, e.g. on a cluster with InfiniBand backbone?

Assume there is a task (e.g. an image processing method with a lot of math) which is reasonable to implement on an FPGA, in the sense of this answer: https://stackoverflow.com/a/8695228/544463
Is there any known (that you can actually name) successful application or practice of combining it with a "dedicated" (custom-designed) supercomputing cluster (HPC), e.g. with an InfiniBand stack? I wonder whether that has already been done and to what extent it was successful.
My main motivation for the question is that http://en.wikipedia.org/wiki/Reconfigurable_computing is a long-term (academic) perspective for the future development of cluster computing, as a distinctive alternative to cloud computing (the latter concentrates more on flexibility at the software (higher) level, but also through possible "reconfiguration"). Is it already practical?
I would also expect somebody is doing research on this... It would be nice to learn about results.
Well, it's not FPGA, but D. E. Shaw's Anton computer for molecular dynamics is famously built from ASICs connected by a custom high-speed network; J. P. Morgan uses clusters of FPGAs in its risk-analysis calculations (recent Forbes article here). Convey Computer has been pushing FPGA + x86 + high-speed networking fairly hard for the past couple of years, so presumably there's some sort of market there...
http://www.maxeler.com/ - they build racks of Intel PCs hosting custom boards stuffed with FPGAs (and - critically - the associated software and FPGA code) to speed up seismic processing, financial analysis and the like.
I think they could be regarded as successful (I gather they turn a profit) and have big customers from finance and oil companies amongst their clientele.
Is there any known (that you can actually name) successful application or practice of combining it with a "dedicated" (custom-designed) supercomputing cluster (HPC), e.g. with an InfiniBand stack? I wonder whether that has already been done and to what extent it was successful.
It's being attempted academically with Novo-G.
You might be interested in Maxwell.
I know that Cray used to have a series of supercomputers some years ago that combined AMD Opterons with Xilinx FPGAs (iirc) through a HyperTransport bus, basically allowing you to create your own specialized processor for custom workloads. According to their website though, they now seem to have dropped FPGAs in favor of GPUs.
For the current research, there's always Google Scholar...
Update: After a bit of searching, it appears to have been the Cray XT5h, which had the possibility of using FPGA coprocessors...
Some have already been mentioned (Convey, Cray), some not (e.g. BEEcube).
But one of the biggest FPGA clusters I have ever heard of is missing:
The Large Hadron Collider at CERN. It produces enormous amounts of data (2.7 Terabit/s), and more than 100 FPGAs are used to filter and reduce that data down to something manageable.
It does not fit your request of being connected to a dedicated HPC cluster, but it is an HPC cluster of its own (at the higher hierarchy levels the FPGAs used are FX parts, which include two PowerPC cores, so they also form a kind of "normal" cluster).
There is quite a lot of published work in reconfigurable computing applications.
Here's a list of links to SRC Computers-centric published papers.
There's the Center for High-Performance Reconfigurable Computing.
Do a Google search for "FPGA" or "reconfigurable" along with these academic institution names and you'll find many published papers, some going back to 2004.
Jackson State University
Clemson University
Catholic University
George Washington University
George Mason University
National Center for Supercomputing Applications (NCSA)
University of Illinois (UIUC)
Naval Postgraduate School (NPS)
Air Force Research Lab (AFRL)
University of Dayton Research Institute (UDRI)
University of Florida
University of Arkansas
There also was a reconfigurable-centric conference hosted by NCSA, the Reconfigurable Systems Summer Institute (RSSI).
This list is certainly not exhaustive, but it will get you started.
Disclosures: I currently work for SRC Computers, LLC, I worked at NCSA/UIUC and I chaired the RSSI conference its first two years.
Yet another great use case is being developed by Adapteva: Parallella (they have a Kickstarter project).
They are developing an Epiphany-series coprocessor controlled by a dual-core ARM processor that shares the board.
I can't wait to get this toy into my hands!
PS: Since it was largely inspired by Arduino (and similar ARM-like) systems, this project is still limited to 1 Gbps networking.

Is it worth purchasing Mahout in Action to get up to speed with Mahout, or are there other better sources?

I'm currently a very casual user of Apache Mahout, and I'm considering purchasing the book Mahout in Action. Unfortunately, I'm having a really hard time getting an idea of how worth it this book is -- and seeing as it's a Manning Early Access Program book (and therefore only currently available as a beta-version e-book), I can't take a look myself in a bookstore.
Can anyone recommend this as a good (or less good) guide to getting up to speed with Mahout, and/or other sources that can supplement the Mahout website?
Speaking as a Mahout committer and co-author of the book, I think it is worth it. ;-)
But seriously, what are you working on? Maybe we can point you to some resources.
Some aspects of Mahout are just plain hard to figure out on your own. We work hard at answering questions on the mailing list, but it can really help to have sample code and a roadmap. Without some of that, it is hard to even ask a good question.
Also a co-author here. Being "from the horse's mouth", it's probably by far the most complete write-up out there for Mahout itself. There are some good blog posts out there, and certainly plenty of good books on machine learning more generally (I like Collective Intelligence in Action as a broad, light intro). A few people on user@mahout.apache.org say they like the book, FWIW, as do the book forums (http://www.manning-sandbox.com/forum.jspa?forumID=623). I think you can return the e-book if it's not quite what you wanted. It definitely has six chapters on clustering.
There are many parts of the book that are out of date, a version or two behind what is current. In addition, there are several mistakes in the text, particularly in the examples, which may make things a bit tricky when trying to replicate the discussed results.
Additionally, you should be aware that the most mature part of Mahout, the recommender system, Taste, isn't distributed. I'm not really sure why it is packaged with the rest of Mahout; this is more a complaint about the software packaging than about Mahout itself.
Currently the best out there, and probably as mature as the product itself. Some aspects are better than others: insight into the underlying implementation is good; practical methods for beginners to get up and running on Linux, Mac OS X, etc., not so much. Its advice on defining a clear strategy for keeping a recommender updated is iffy, and production examples are pretty thin. Good as a starting point, but you'll need a lot more. The authors make their best attempt to help, but it is a pretty new product. All in all: yes, buy it.
I got the book a few weeks ago. Highly recommended. The authors are very active on the mailing list, too, and there is a lot of cool energy in this project.
You might also consider reading through Paco Nathan's Enterprise Data Workflows with Cascading. You can run PMML exported from R or SAS on your cluster. That is not to say anything bad about Mahout in Action; the authors did a great job and clearly put good time and effort into making it instructive and interesting. This is more of a suggestion to look beyond Mahout, which is not currently getting the kind of traction it would if it were more user friendly.
As it stands, the Mahout user experience is kinda choppy and doesn't really give you a clear idea of how to develop and update intelligent systems over their life cycles, IMO. Mahout is not really acceptable for academics either; they are more likely to use MATLAB or R. In the Mahout docs, the random forest implementation barely works and the docs have erroneous examples, etc. That's frustrating, and the parallelism and scalability of the Mahout routines depend on the algorithm. I don't currently see Mahout going anywhere solid as it stands, again IMO. I hope I'm wrong!
http://shop.oreilly.com/product/0636920028536.do

Help with Event-Based Components

I have started to look at Event-Based Components (EBCs), a programming method currently being explored in particular by Ralf Westphal in Germany. This is a really interesting and promising way to architect a software solution, and it gets close to the age-old idea of being able to stick software components together like Lego :)
A good starting point is the Channel 9 video here, and there is a fair bit of discussion (in German) at the Google Group on EBCs. I am, however, looking for more concrete examples: while the ideas look great, I am finding it hard to translate them into real code for anything more than a trivial project.
Does anyone know of any good code examples (in C# preferably), or any more good sites where EBCs are discussed?
I find EBCs conceptually similar to event-based programming. You'll find a comprehensive treatment of the subject in Ted Faison's Event-Based Programming, which also includes complete source code for three systems of varying complexity (a file browser, an HTTP service, and a distributed workflow system), all written in C#.
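For what it's worth, the core wiring idea, as I understand it, fits in a short sketch: components expose input and output "pins", and a separate "board" does all the connecting, so components never reference each other directly. This is my own toy illustration in Python (not real EBC code; Westphal's examples are in C#, and all the names here are mine):

```python
# Toy illustration of EBC-style wiring: output "pins" are events,
# input "pins" are plain methods, and a "board" connects them.
class OutputPin:
    def __init__(self):
        self._subscribers = []
    def connect(self, handler):
        self._subscribers.append(handler)
    def fire(self, message):
        for handler in self._subscribers:
            handler(message)

class Tokenizer:                 # component: text in, tokens out
    def __init__(self):
        self.result = OutputPin()
    def process(self, text):     # input pin
        self.result.fire(text.split())

class Counter:                   # component: tokens in, count out
    def __init__(self):
        self.result = OutputPin()
    def process(self, tokens):
        self.result.fire(len(tokens))

# The "board": all topology lives here, not inside the components.
tokenizer, counter = Tokenizer(), Counter()
tokenizer.result.connect(counter.process)
counter.result.connect(lambda n: print(f"{n} words"))
tokenizer.process("event based components wired like lego")  # prints "6 words"
```

The point is that Tokenizer and Counter compile and run without knowing about each other; swapping a component only means rewiring the board, which is what makes the Lego analogy work.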
