Why should Ruby not be used to create a spider


In Episode 78 of the Joel & Jeff podcast one of the Doctype / Litmus guys states that you would never want to build a spider in ruby. Would anyone like to guess at his reasoning for this?

Just how fast does a crawler need to be, anyhow? It depends upon whether you're crawling the whole web on a tight schedule, or gathering data from a few dozen pages on one web site.
With Ruby and the nokogiri library, I can read this page and parse it in 0.01 seconds. Using xpath to extract data from the parsed page, I can turn all of the data into domain specific objects in 0.16 seconds. All 223 rows.
I am running into fewer and fewer problems where the traditional constraints (cpu/memory/disk) matter. This is an age of plenty. Where resources are not a constraint, don't ask "what's better for the machine." Ask "what's better for the human?"

In my opinion it's just a matter of scale. If you're writing a simple scraper for your own personal use or just something that will run on a single machine a couple of times a day, then you should choose something that involves less code/effort/maintenance pains. Whether that's ruby is a different question (I'd pick Groovy over Ruby for this task => better threading + very convenient XML parsing). If, on the other hand, you're scraping terabytes of data per day, then throughput of your application is probably more important than shorter development time.
BTW, anyone that says that you would never want to use some technology in some context or another is most probably wrong.

You wouldn't get the desired performance out of Ruby. See the referenced link: http://blog.dhananjaynene.com/2008/07/performance-comparison-c-java-python-ruby-jython-jruby-groovy/
While performance tests like these should be taken with a grain of salt, there is a considerable difference between Ruby and the fastest languages.
Edit: Shame on me for answering a loaded question. All in all, choosing a language is a series of trade-offs, spanning everything from performance to personal preference about what you are efficient in. The beauty of programming is that all of these languages are available for you to use, so you can test what works best for the requirements of your project. My recommendation is to experiment and see what works best for you.

What OG said. In simpler terms, Ruby is dog slow and if you're looking to get a lot done per unit time, it's the wrong choice of language.

Related

Where can I find extra resources on how to use Quantopian?

I am sure you all know Quantopian (I love it!!).
Even though I am pretty good with Python, I am still having trouble writing a full algorithm (I am trying to write one using Fundamentals data).
I have successfully used pipeline to get the stocks that I want, but am having trouble with writing the buys and the sells. Specifically, how to tell the program to buy this stock with some logic behind it and how to sell (the same or some other) stock.
I have gone through tons of resources, checked out other users' algorithms, and worked through both Quantopian and non-Quantopian material, but I am still having trouble. Do you have any suggestions for other resources?
P.S. the link above is this guy 'sentdex'. His tutorials are the best!! They would have been fine if Quantopian hadn't upgraded their systems, so unfortunately his tutorials are out of date now.

Background Knowledge:
Building up intuition can be tough. I found the books written by Ernie Chan to be most helpful when looking to complement my cs + math background with domain knowledge. Going through the process of translating portions of his books from MATLAB to Python was especially helpful.
Quantopian Specific Material:
Following strategies written by James Christopher will be worth your while. This algo in particular should serve as a great reference
https://www.quantopian.com/posts/long-short-pipeline-multi-factor
The lecture series also has a number of gems that will aid in building up the fundamentals necessary to do meaningful work on the platform.
Last But Not Least:
Use the community section as a learning tool and take advantage of the fact that it is the only place where you can engage with almost 100k quants in one coherent place. The "algotrading" subreddit isn't bad - that said it can occasionally feel like chewing on sand as many posts are misinformed.
Disclaimer: I interned at Quantopian in the Fall.

You can try thinking about the overall flow:
initialize --> schedule_function --> rebalance --> if-elif logic: order(context.your_ticker_variable, amount)
As for filters, I have tried writing custom ones, but I am still having trouble with them!

Entertaining and Interesting example to teach Ruby

I have to give a short introductory talk on Ruby tomorrow. I want to avoid the boring PowerPoint presentation route and run a hands-on session instead.
The goal would be to introduce Ruby to people, just the basic concepts really.
I'm planning to take an example from Why's Poignant Guide. Do you know of any interesting example that would captivate the audience and make for an interesting talk?
EDIT: I'm done with the talk and it went reasonably well. About 50 people turned up, and about 10 of them picked up Ruby really well. Some complained that I went too fast. All in all I covered the basics of Ruby and didn't touch the OO stuff. As for the examples, I gave one in which we scraped data from our college website using watir-webdriver. Thanks to all for your valuable answers and comments.

OK, so your audience are not programmers, so there's no point in pointing out Ruby's advantages over other languages. Also, there's no place for advanced topics such as metaprogramming or more serious OO or functional programming.
What I would try to show them first is irb, how they could evaluate simple mathematical expressions, and show them the concept of variables.
Strings and string interpolation.
Loops (10.times{ puts 'Hello world!' }) and branches (if-then-else).
If you have time, show them arrays (not sure about hashes).
I would also try to squeeze in a simple program in Sinatra. Students already know the web, and you can show "how the web really works" using a couple of lines of Sinatra code.

Maybe they'd be interested in a bit of web scraping.
My project Easy Roommate parser visits a flat-mate sharing web site, and parses profiles to see which ones are compatible with what you want. My main warning is that my code isn't very good, and if lots of people used it, the web site owners may complain. However, it would be solving a problem that's common to many students.
Another project I did ages ago was Get to philosophy, which tried to work out why, when you click on the first link in each Wikipedia article you come across, you usually end up at Philosophy. Warning: this project is abandoned.

Why's Guide is a little nuts for the average audience, more of a work of art than an educational tool. If you like whimsy, you might want to check out Rails for Zombies.

You could show them how to write programs that can be useful for college students in general. For example, let's say they have to study some chapters of a book for an exam. I have a Python script, 30 lines or so, that takes a number of days and a list of tuples representing page ranges, and prints how many pages you need to study each day and which page you need to reach by the end of each day. For example, if you have to study pages [10,19] and [30,33] over two days, you'd need to get to the end of page 16 on the first day, and to the end of page 33 on the second.
You could show them how to implement this sort of thing with Ruby. It's a bit dry, but very practical.
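For reference, here is a minimal sketch of what such a page planner might look like. It is written in Python, like the script described above, and is not the original script; the function name `study_plan` and the round-up-per-day rule are assumptions made for illustration.

```python
import math

def study_plan(days, page_ranges):
    """Return a list of (pages_to_read, last_page) tuples, one per study day.

    page_ranges is a list of inclusive (first, last) page tuples.
    Hypothetical helper, not the original author's script.
    """
    # Flatten the inclusive ranges into one ordered list of page numbers.
    pages = [p for first, last in page_ranges for p in range(first, last + 1)]
    # Round up so that every page is covered within the given number of days.
    per_day = math.ceil(len(pages) / days)
    plan = []
    for day in range(days):
        chunk = pages[day * per_day:(day + 1) * per_day]
        if chunk:  # later days can be empty if the pages run out early
            plan.append((len(chunk), chunk[-1]))
    return plan

if __name__ == "__main__":
    plan = study_plan(2, [(10, 19), (30, 33)])
    for day, (count, last_page) in enumerate(plan, start=1):
        print(f"Day {day}: read {count} pages, finish page {last_page}")
```

With the ranges from the example above, this splits the 14 pages into two days of 7, ending on page 16 and page 33. The same logic ports to Ruby almost line for line.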

To show OO concepts (if you want to), consider demonstrating browser automation with Watir: real OO with real objects, all in simple Ruby. It makes for interesting and lively demos that motivate young people (in my experience, nearly everybody wants to try it out after the presentation).
http://watir.com/examples/

Sonic Pi is a Ruby-based SDK and DSL. It is typically used for music creation and sound generation (samples and synthesizer sounds). The Sonic Pi website http://sonic-pi.net contains tutorials that are specifically designed to teach programming from the start. There is even a guide for teachers! Sonic Pi is really entertaining and gives you an immediate sense of achievement.

Where can I find sample algorithms for analyzing historical stock prices?

Can anyone point me in the right direction?
Basically, I'm trying to analyze stock prices and see if I can spot any patterns. I'm using PHP and MySQL to do this. Where can I find sample algorithms like the ones used in MetaStock or thinkorswim? I know they are closed source, but are there any tutorials available for beginners?
Thank you,
P.S. I don't even know what to search for in google :(

A basic, educational algorithm to start with is a dual-crossover moving average. Simply chart fast (say, 5-day) and slow (say, 10-day) moving averages of a stock's closing price, and you have a weak predictor of when to buy long (fast line goes above slow) and sell short (slow line goes above the fast). After getting this working, you could implement exponential smoothing (see previously linked wiki article).
That would be a decent start. Take a look at other technical analysis techniques, but do keep in mind that this is quite a perilous method of trading.
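As a rough illustration of the crossover idea, here is a minimal sketch in Python (the same logic ports directly to PHP). The function names are hypothetical, the averages are plain simple moving averages, and the 5-day and 10-day windows are just the defaults suggested above. It only flags the days on which the fast line crosses the slow line; it is not a trading system.

```python
def simple_moving_average(prices, window):
    """Moving average aligned with prices; None until a full window exists."""
    averages = []
    for i in range(len(prices)):
        if i + 1 < window:
            averages.append(None)
        else:
            averages.append(sum(prices[i + 1 - window:i + 1]) / window)
    return averages

def crossover_signals(prices, fast=5, slow=10):
    """Return (index, 'buy' or 'sell') for each day the fast SMA crosses the slow SMA."""
    if fast >= slow:
        raise ValueError("fast window must be shorter than slow window")
    fast_ma = simple_moving_average(prices, fast)
    slow_ma = simple_moving_average(prices, slow)
    signals = []
    for i in range(1, len(prices)):
        if slow_ma[i - 1] is None:
            continue  # both averages must exist on the previous day as well
        prev_diff = fast_ma[i - 1] - slow_ma[i - 1]
        diff = fast_ma[i] - slow_ma[i]
        if prev_diff <= 0 < diff:
            signals.append((i, "buy"))   # fast line moved above the slow line
        elif prev_diff >= 0 > diff:
            signals.append((i, "sell"))  # fast line moved below the slow line
    return signals
```

Feeding it a list of daily closing prices (for example, one column of a downloaded CSV file) yields the indices of the days where a signal fires, which can then be marked on a chart.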
Update: As for actually implementing this? You're a PHP programmer, so here is a charting library for PHP. This is the one I used a few years ago for this very project, and it worked out swimmingly. Maybe someone else can recommend a better one. If you need a free source of data, take a look at Yahoo! Finance's historical data. They dispense CSV files containing daily opening prices, closing prices, trading volume, etc. of virtually every indexed corporation.

Check out the algorithms at Investopedia. FM Labs also has formulas for a lot of technical analysis indicators.

First you will need a solid math background: statistics in general, correlation analysis, linear algebra... If you really want to push it, check out dimensional transposition. Then you will need a solid basis in data mining. Associations can be useful if you want to link strict numerical data with news headlines and other events.
One thing is for sure: you will most likely not find pre-digested algorithms out there that will make you rich...
I know someone who is trying just that... He is somewhat successful (meaning he is not losing money and is making a bit) and is making his own algorithms... I should mention he has a doctorate in actuarial science.
Here are a few more links... hope they help out a bit
http://mathworld.wolfram.com/ActuarialScience.html
http://www.actuary.com/actuarial-science/
http://www.actuary.ca/
Best of luck to you

Save yourself time and use programs like NinjaTrader and Wealth-Lab. Both of them are great technical analysis platforms and accept C# as a programming language for defining your trading rules. Every possible technical indicator you can imagine is already included and if you need something more advanced you can always write your own indicator. You would also need a lot of data in order for your analysis to be statistically significant. For US stocks and ETFs, visit www.Kibot.com. We have good experience using their data.

Here's a pattern for ya
http://ddshankar.files.wordpress.com/2008/02/image001.jpg

I'd start with a good introduction to time series analysis and go from there. If you're interested in finding patterns then the interesting term is "1D-Pattern Matching". But for that you need good features, so google for "Feature extraction in time series". Remember GIGO (garbage in, garbage out): make sure you have error-free stock price data for a sufficiently long time period before you start.

May I suggest that you do a little reading with respect to the Kalman filter? Wikipedia is a pretty good place to start:
http://en.wikipedia.org/wiki/Kalman_filter/
This should give you a little background on the problem of estimating and predicting the variables of some system (the stock market in this case).
But the stock market is not very well behaved so you may want to familiarize yourself with non linear extensions to the KF. Yes, the wikipedia entry has sections on the extended KF and the unscented KF, but here is an introduction that is just a little more in-depth:
http://cslu.cse.ogi.edu/nsel/ukf/
I suppose if anyone had ever tried this before then it would have been all over the news and very well known. So you may very well be on to something.

Use TradeStation
It is a platform that lets you write software to analyze historical stock data. You can even write programs that trade the stock, and you can backtest your program on historical data or run it in real time throughout the day.

Minimum CompSci Knowledge Needed for Writing Desktop Apps

Having been a hobbyist programmer for 3 years (mainly Python and C) and never having written an application longer than 500 lines of code, I find myself faced with two choices :
(1) Learn the essentials of data structures and algorithm design so I can become a l33t computer scientist.
(2) Learn Qt, which would help me build projects I have been itching to build for a long time.
For learning (1), everyone seems to recommend reading CLRS. Unfortunately, reading CLRS would take me at least a year of study (or more, I'm not Peter Krumins). I also understand that to accomplish any moderately complex task using (2), I will need to understand at least the fundamentals of (1), which brings me to my question: assuming I use C++ as the programming language of choice, which parts of CLRS would give me sufficient knowledge of algorithms and data structures to work on large projects using (2)?
In other words, I need a list of theoretical CompSci topics absolutely essential for everyday application programming tasks. Also, I want to use CLRS as a handy reference, so I don't want to skip any material critical to understanding the later sections of the book.
Don't get me wrong here. Discrete math and the theoretical underpinnings of CompSci have been on my "TODO: URGENT" list for about 6 months now, but I just don't have enough time owing to college work. After a long time, I have 15 days off to do whatever the hell I like, and I want to spend these 15 days building applications I really want to build rather than sitting at my desk, pen and paper in hand, trying to write down the solution to a textbook problem.
(BTW, a less-math-more-code resource on algorithms will be highly appreciated. I'm just out of high school and my math is not at the level it should be.)
Thanks :)

This could be considered heresy, but the vast majority of application code does not require much understanding of algorithms and data structures. Most languages provide libraries which contain collection classes, searching and sorting algorithms, etc. You generally don't need to understand the theory behind how these work, just use them!
However, if you've never written anything longer than 500 lines, then there are a lot of things you DO need to learn, such as how to write your application's code so that it's flexible, maintainable, etc.

For a less-math, more code resource on algorithms than CLRS, check out Algorithms in a Nutshell. If you're going to be writing desktop applications, I don't consider CLRS to be required reading. If you're using C++ I think Sedgewick is a more appropriate choice.

Try some online comp sci courses. Berkeley has some, as does MIT. Software engineering radio is a great podcast also.
See these questions as well:
What are some good computer science resources for a blind programmer?
https://stackoverflow.com/questions/360542/plumber-programmers-vs-computer-scientists#360554

Heed the wisdom of Don and just do it. Can you define the features that you want your application to have? Can you break those features down into smaller tasks? Can you organize the code produced by those tasks into a coherent structure?
Of course you can. Identify any 'risky' areas (areas that you do not understand, e.g. something that requires more math than you know, or special algorithms you would have to research) and either find another solution, prototype a solution, or come back to SO and ask specific questions.

Moving from 500 LOC to a real (even if small) application is not that easy.
As Don pointed out, you'll need to learn a lot about code (flexibility, reuse, etc.), and you'll need to learn the basics of configuration management as well (Visual SourceSafe, svn?).
But the main issue is that you need a way to avoid being overwhelmed by the combination of your features and your code. That is not easy. What I can suggest is putting something in place to 'automatically' test your code (even in a very basic way) via some regression tests. Otherwise it's going to be hard.
As you can see, I don't think it's related to data structures, algorithms or the like at all.
Good luck and let us know

I must say that sitting down with a dry old textbook and reading it through is not the way to learn how to do anything effectively, even if you are making notes. Doing it is the best way to learn, using the textbooks as a reference. Indeed, using sites like this as a reference.
As for data structures - learn which one is good for whatever situation you envision: Sets (sorted and unsorted), Lists (ArrayList, LinkedList), Maps (HashMap, TreeMap). Complexity of doing basic operations - adding, removing, searching, sorting, etc. That will help you to select an appropriate library data structure to use in your application.
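To make the point about operation costs concrete, here is a small sketch in Python rather than Java, assuming the rough correspondence `list` for ArrayList and `set` for a hash-based Set. It times a worst-case membership test on both structures. The absolute numbers depend on the machine; only the gap between them matters.

```python
import timeit

N = 100_000
as_list = list(range(N))  # membership test scans elements one by one: O(n)
as_set = set(as_list)     # membership test hashes the key: O(1) on average
missing = -1              # absent value, so the list scan visits every element

# Run each lookup 100 times and compare the total elapsed time.
list_time = timeit.timeit(lambda: missing in as_list, number=100)
set_time = timeit.timeit(lambda: missing in as_set, number=100)

print(f"list lookup: {list_time:.4f}s, set lookup: {set_time:.6f}s")
```

Knowing that one line of difference in the choice of container changes a search from linear to constant time is most of what you need from the theory when picking a library data structure.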
Also make sure you're reasonably comfortable with MVC, i.e., ensure your model is separate from your view (the Qt front-end) as far as possible. Ideally the model and algorithms work on their own, and then you put the GUI on top. Or a unit test on top. Etc...
Good luck!

It's like saying you want to move to France, so should you learn French from a book, and what are the essential words, or should you just go to France and find out which words you need to know from experience and from copying the locals?
Writing code is part of learning computer science. I was writing code long before I'd even heard of the term, and lots of people were writing code before the term was invented.
Besides, you say you're itching to write certain applications. That can't be taught, so just go ahead and do it. Some things you only learn by doing.
(The theoretical foundations will just give you a deeper understanding of what you wind up doing anyway, which will mainly be copying other people's approaches. The only caveat is that in some cases the theoretical stuff will tell you what's futile to attempt - e.g. if one of your itches is to solve an NP complete problem, you probably won't succeed :-)

I would say the practical aspects of coding are more important. In particular, source control is vital if you don't use that already. I like bzr as an easy to set up and use system, though GUI support isn't as mature as it could be.
I'd then move on to one or both of the classics about the craft of coding, namely
The Pragmatic Programmer
Code Complete 2
You could also check out the list of recommended books on Stack Overflow.

Exercises to enforce good practices such as TDD and Mocking

I'm looking for resources that provide an actual lesson plan or path to encourage and reinforce programming practices such as TDD and mocking. There are plenty of resources that show examples, but I'm looking for something that actually provides a progression that allows the concepts to be learned instead of forcing emulation.
My primary goal is speeding up the process for someone to understand the concepts behind TDD and actually be effective at implementing them. Are there any free resources like this?

It's a difficult thing to encourage because it can be perceived (quite fairly) as a sea-change; not so much a progression to a goal but an entirely different approach to things.
The short-list of advice is:
You need to be the leader: you need to become proficient before you can convince others to, and you need to be able to show others the path and settle their uncertainties.
First become proficient in writing unit tests yourself
Practice writing tests for existing methods. You'll probably beat your head on the desk trying to test lots of your code--it's not because testing is hard or you can't understand testing; it's more likely because your existing code and coding style isn't very testable.
If you have a hard time getting started then find the simplest methods you can and use them as a starting point.
Then focus on improving the testability of the code you produce
The single biggest tip: make things smaller and more to the point. This one is the big change--this is the hardest part to get yourself to do, and even harder to convince others of.
Personally I had my "moment of clarity" while reading Bob Martin's "Clean Code" book; an early chapter talks about what a clean method will look like and as an example he takes a ~40 line method that visually resembled something I'd produce and refactors it out into a class which is barely larger line-count wise but consists of nothing but bite-sized methods that are perhaps 3-7 lines each.
Looking at these itty-bitty methods it suddenly clicked that the unit-testing cornerstone "each test only tests one thing" is easiest to achieve when your methods only do one thing (and do that one thing without having 30 internal mechanisms at play).
The good thing is that you can begin to apply your findings immediately; practice writing small methods and small classes and testing along the way. You'll probably start out slow, and hit a few snags fairly quickly, but the first couple months will help get you pointed in the right direction.
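As a small sketch of the "small methods, one thing per test" idea, here are two deliberately tiny functions and one test per behaviour, using Python's standard `unittest` module. The functions `parse_line` and `passing` are hypothetical examples and are not taken from "Clean Code".

```python
import unittest

def parse_line(line):
    """Split one 'name,score' line into a (name, score) tuple."""
    name, score = line.split(",")
    return name.strip(), int(score)

def passing(records, threshold=50):
    """Return the names whose score is at or above the threshold."""
    return [name for name, score in records if score >= threshold]

class ParseLineTest(unittest.TestCase):
    # Only parsing is exercised here; filtering has its own test case.
    def test_strips_whitespace_and_converts_score(self):
        self.assertEqual(parse_line(" ada , 91 "), ("ada", 91))

class PassingTest(unittest.TestCase):
    # Input is built by hand, so a parsing bug cannot break this test.
    def test_keeps_only_scores_at_or_above_threshold(self):
        self.assertEqual(passing([("ada", 91), ("bob", 40)]), ["ada"])
```

Saved to a file, the tests can be run with `python -m unittest`. Because each function does one thing, each test needs only a line of setup and fails for exactly one reason.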

You could try attending a coding dojo (or hosting one if there is none near you!).
I attended one such exercise and it was a fun way to learn TDD.

Books are always a good resource. Even though they are not free, the money they cost may be worth the time you would otherwise spend searching for good free resources.
"Test driven development by example" by Kent Beck.
"Test Driven Development in Microsoft .NET" by James W. Newkirk and Alexei A. Vorontsov
please feel free to add to this list

One thing I worked through that helped me appreciate TDD more was NHibernate and the Unit of Work Pattern. Although it's specific to NHibernate and .NET, I liked the way that it was arranged. Using TDD, you develop something (a UnitofWork) that's actually useful rather than some simple "this is what a mock looks like" example.
How I learn a concept best is by putting it to use towards an actual need. I suggest you take a look at the structure of the article and see if it's along the lines of what you're looking for.

Geeks are excellent at working to metrics, whether they are good for them or not!
You can use this to your advantage. Set up a CI server and fail the build whenever code coverage drops below 50 percent. Let them know that the threshold will rise 10 percent every month until it's 90. You could perhaps use some commit hooks to stop them from checking code in to begin with, but I've never tried this myself.
Let them know the coverage of the team will be taken into account in any performance reviews, etc. By emphasising that it is the coverage of the team, you should get peer pressure helping you ensure good coverage.
This will only ensure they are testing their code, not how well they are testing their code, nor whether they are writing the tests first. However, it is strongly encouraging (or forcing) them to incorporate testing into their daily development process.
Generally, once people have something in their process they'll want to do it as easily and efficiently as possible. TDD is the easiest way to write code with high coverage, as you don't write a line of code without it being covered.
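The rising threshold described above could be sketched like this, as a check a CI job might call after the coverage tool has run. The function names are hypothetical, and how the measured coverage figure is obtained depends on your coverage tool and CI server, so it is simply passed in as a number here.

```python
def coverage_threshold(months_elapsed, start=50, step=10, cap=90):
    """Required coverage percentage: starts at 50, rises 10 per month, stops at 90."""
    return min(cap, start + step * months_elapsed)

def build_passes(measured_coverage, months_elapsed):
    """True when the measured coverage meets the threshold for the current month."""
    return measured_coverage >= coverage_threshold(months_elapsed)

if __name__ == "__main__":
    for month in range(6):
        print(f"month {month}: need at least {coverage_threshold(month)}% coverage")
```

A CI script would exit with a non-zero status when `build_passes` returns False, which is what actually fails the build.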

Find someone with experience and talk to them. If there isn't a local developer group, then start one.
You should also try pushing things too far to start with, and then learn when to back off. For example, the whole mock thing started when someone asked "What if we program with no getters".
Finally, learn to "listen to the tests". When the tests look dreadful, consider whether it's the code that's at fault, not your testing technique.
