We've used SocketIO quite extensively in Etherpad (since very early on) and we're really grateful for all of the efforts of the team for providing such a useful thingy :)
Etherpad is a Node.js project.
My problem with SocketIO is probably due to a misconfiguration or a misunderstanding on my part, but after quite a lot of test-tool generation, tweaking of memory settings, etc., we still get a frustratingly low maximum number of messages per second, topping out around the 10k mark.
Etherpad latest simulated load test
Reading online, it looks like switching to ws would be more performant, but I fail to see how that could be the case in our scenario, where our bottleneck is not connection negotiation (the connections end up as websockets anyway) but the number of messages per second the server can handle.
I'm reluctant to try other packages, so I thought I'd come here and ask for some insight or things to try to see if we can improve performance by, well, a lot. The usual Node tricks (access to more hardware: RAM/CPU) help a bit, but it still feels like we're getting really small gains and not the huge numbers you see in other modules' benchmarks.
A dream outcome of this question would be for someone to look at the Etherpad code and tell me why I'm an idiot, and hopefully we can get Etherpad into the competitive ~100k changes per second. But I may also be misty-eyed about other modules, so if anyone has benchmarks that contradict the likes of ws, I'm all ears.
I feel like I should just add: we tested to see if it was internal Etherpad logic that is the cause, and it's not. It really is the communication layer that ends up bottlenecking the operational-transform algorithm; we're like 99.95% sure...
Throwing more hardware at this problem is not the solution, nor is any kind of reverse-proxying or passing the problem along.
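For anyone who wants to reproduce a transport-only baseline, a bare ws echo server along these lines would isolate the communication layer from everything else (this is an illustrative sketch, not our actual test harness; the port is arbitrary):

```typescript
// Minimal raw-ws echo server: no application logic in the path,
// so a load test against it measures pure messages per second.
import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8080 });

wss.on('connection', (socket) => {
  socket.on('message', (data) => {
    socket.send(data); // echo straight back
  });
});
```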
If you are blind to where the "problem" is, you don't have many options. You could be looking for a "misconfiguration" that does not exist, which could waste a lot of time and money, and in the end you will probably still have to switch.
Maturity, one discovers, has everything to do with the acceptance of "not knowing".
Rewrite the pieces of the code that are relevant for the load test, to see if using e.g. uWebSockets would help push the boundary. There are multiple sources stating that the uWebSockets server is A LOT faster. I bet it will not take that much time, and you will get the important information you need to decide whether it's worth switching. Web technology is moving forward extremely fast, and if you want to be able to make the right choice for the future of the product, you have to be willing to experiment with it. Alex Hultman wrote an article,
how-µwebsockets-achieves-efficient-pub-sub
where he encourages switching and explains why it's worth a try.
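For what it's worth, a minimal pub/sub server with uWebSockets.js looks something like the sketch below (the topic name and port are made up; treat it as a starting point for a load test, not a drop-in replacement for Etherpad's SocketIO layer):

```typescript
// Sketch of uWebSockets.js pub/sub: every client subscribes to one pad
// topic, and each incoming change is fanned out in a single native publish.
import uWS from 'uWebSockets.js';

const app = uWS.App();

app.ws('/*', {
  open: (ws) => {
    ws.subscribe('pad/demo'); // hypothetical topic for one shared pad
  },
  message: (ws, message, isBinary) => {
    app.publish('pad/demo', message, isBinary); // broadcast to all subscribers
  },
});

app.listen(9001, (listenSocket) => {
  if (listenSocket) {
    console.log('Listening on port 9001');
  }
});
```

The point of a sketch like this is that the broadcast happens in one native call rather than a per-socket JavaScript loop, which is where the pub/sub benchmarks claim their wins.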
I know that you shouldn't optimize too early, and that you should instead aim for maintainability. My question is: at what point is it too late?
I'm working on a website similar to Yahoo Answers, and my database structure is exactly what I feel it should be: tables for users, questions, answers, question_comments, answer_comments, etc.
My question is: IF the site were to grow, how would this architecture scale? I'm thinking of putting both questions and answers in a single table (posts), separating them by type, and then putting both question_comments and answer_comments in the same table (comments). I believe this is similar to Stack Overflow's DB schema.
I know what you guys are gonna say: "Don't worry about it until it becomes an actual problem." But wouldn't it be a little too late to worry about it then?
Thanks
The reason why it's bad practice to optimize early is that you don't know where your bottlenecks will be until your website sees a significant amount of traffic. How your users access and interact with your site is an unknown at this point.
It's almost always best to start with a 'good' architecture (normalized database, MVC architecture, DRY, well-written frontend code, etc) and go from there. It will be much easier to scale a clean, organized architecture than one that was prematurely optimized.
At best right now you can do some load testing via ab or another load testing tool to see where your current bottlenecks are. It certainly won't find all of them, but it will find some.
If you're really worried about this (and you shouldn't be yet), install Nagios or Munin on your server to monitor performance. Use a third party tool to measure page load time daily. Once you start seeing issues then you can profile and tune.
You absolutely should optimize if a fast service is a fundamental requirement of the application.
If sub-second responses are not a requirement, then you can write clean code and optimize later.
A good example of this was JavaScript before the latest generation of browsers: people who wrote nice, clean, extensible JS for their pages had terrible performance and had to start from scratch.
One huge table is generally harder to maintain. People usually cut their tables into partitions and even their databases into shards.
I don't see how putting all comments into the same table would save you a join. Really, putting questions and answers into the same table won't save you a join either; you'll just be joining the same table to itself.
If you want to save on joins, I'd expect you to use a document-oriented NoSQL database, such as MongoDB. There, you can store a question with all related answers and comments in a single 'record', fetchable in one operation.
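As an illustration (the field names here are invented, not taken from the question's schema), the single-record shape might look like this:

```typescript
// Sketch of one question stored as a single document, with answers and
// comments embedded, so one query returns the whole thread.
interface Comment {
  author: string;
  text: string;
}

interface Answer {
  author: string;
  body: string;
  comments: Comment[];
}

interface QuestionDoc {
  _id: string;
  title: string;
  body: string;
  comments: Comment[];
  answers: Answer[];
}

// One document carries the whole thread; a single findOne() on a
// 'questions' collection fetches question, answers, and comments together.
const example: QuestionDoc = {
  _id: 'q1',
  title: 'How do I scale this?',
  body: '...',
  comments: [{ author: 'alice', text: 'Good question.' }],
  answers: [
    {
      author: 'bob',
      body: 'Denormalize it.',
      comments: [{ author: 'carol', text: 'Agreed.' }],
    },
  ],
};
```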
Databases need to be designed with performance in mind, not patched after you have a problem later. Premature optimization doesn't mean skipping performance in the design; it means not getting ridiculously excessive about it. However, there are known performance killers for every database backend, and it is foolish to design around one of them when a different technique will be faster and take the same amount of time to code if you are familiar with it. So before designing any database, read up on performance tuning and you will never write database code the same way again.
As I have been testing sites, I have found reCAPTCHAs getting more and more difficult to read. Is it just me or are others having this problem too?
Along with this, I had a user this morning complain about receiving a British pound character in their reCAPTCHA. Of course the user didn't know what to do, even though I have a message stating they can click the reload/refresh icon to get a new CAPTCHA.
Unfortunately, this implementation is on a site often used by people over 60 years of age, so more complicated or confusing CAPTCHAs are a problem, but the site still receives a lot of spam attempts.
Despite the opinions presented so far, I actually like the reCAPTCHA system. I like it mostly because it manages to solve two problems at once: verifying human identity and helping to digitize writing. (For those of you who don't know, here is why it uses two words and not one: reCAPTCHA philosophy.)
So I encourage all of you to try passing the reCAPTCHA tests as often as you can because you are really helping a good cause.
The worst are the ones that are case-sensitive. L, l, I, o, O, 0?
I have a hard time reading most CAPTCHAs, but I agree that reCAPTCHAs are a special nuisance.
Yes, CAPTCHAs are getting more difficult to read.
Image of CAPTCHA http://img165.imageshack.us/img165/1253/picture3rs8.png
I can't find the link right now, but I believe the Microsoft Passport ones (MSN and Hotmail) are the hardest to break.
The problem is that whenever software gets better at detecting the text, the text has to become more difficult to read.
The irony, I guess, is that CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart", but it won't be long before computers catch up and the tests become too hard for the majority of humans to read. At that point they'll go away and some other kind of CAPTCHA will be used.
Perhaps photo-based CAPTCHAs using Google's image labelling system?
Ironic, because although computers are certainly getting smarter, people are probably getting dumber, too.
I think they are getting harder; I know I tend to fail every CAPTCHA I try at least once, sometimes twice. There are good alternatives emerging, though. For example, Geoff Appleby shows nine photos and gives a text description for you to select three of them (scroll down to the comments form).
Such a system would be very accessible to the profiles you outlined (the photos could be quite big). It's also a lot easier to implement.
Definitely getting harder now. My most recent one had something completely indistinguishable, next to 'are' written upside down.
I find reCAPTCHA's to be the absolute worst for usability. I often avoid sites that use them.
I don't mind that sites need to do these tests, but they don't need to be so near-impossible to figure out.
Perhaps reCAPTCHA, as it starts to run lower on words that people get correct, starts pairing harder and harder 'unknown' words as people filter out all the easy ones?
I think eventually CAPTCHA is going to stop being feasible, and there's going to have to be some kind of universally recognized "passport" system for websites: some kind of account that you pay a couple of bucks for, and that identifies you as a human when you sign up for a website.
Then, if you start using that account for your spam robots, you can get banned universally. Sites could even retroactively clean up posts based on those bans. *shrug* Just a thought.
I've been identified as not-human several times by the Stack Overflow blog comment captcha. Now I just keep requesting new captchas until I get one I can read. Usually only takes ~3 tries.
Update: According to Ben Maurer, the Chief Engineer at reCAPTCHA, who commented on my blog about this, over 96% of reCAPTCHAs are solved correctly. So maybe we as a group are just getting dumber?
reCAPTCHA will always get harder.
As people build tools to break reCAPTCHA, the same technology will be used to help digitize text; therefore only the words that the latest technology cannot read will be used as CAPTCHAs.
It's spy vs. spy, except it's a win-win for reCAPTCHA and human knowledge.
The only problem they face is that if they ever build a reader so good it never fails, reCAPTCHA will no longer work as a CAPTCHA, but that would be a good problem to have for the digitization of human knowledge.
Quite a few downloading sites have just stopped using captchas. All you really need to do is log the IP address of the client and stop giving them access for x minutes.
The same thing can be used for passwords. Did the user mistype their password three times? Make them wait five minutes before trying again, and give them the option to reset it via e-mail.
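A minimal sketch of that lockout idea, assuming a single Node process with in-memory state (a real site would persist attempts somewhere like Redis so they survive restarts; the thresholds are illustrative):

```typescript
// Track failed attempts per IP and lock the IP out for five minutes
// after three failures.
const MAX_FAILURES = 3;
const LOCKOUT_MS = 5 * 60 * 1000;

const failures = new Map<string, { count: number; lockedUntil: number }>();

function isLockedOut(ip: string): boolean {
  const entry = failures.get(ip);
  return !!entry && Date.now() < entry.lockedUntil;
}

function recordFailure(ip: string): void {
  const entry = failures.get(ip) ?? { count: 0, lockedUntil: 0 };
  entry.count += 1;
  if (entry.count >= MAX_FAILURES) {
    entry.lockedUntil = Date.now() + LOCKOUT_MS; // start the 5-minute lockout
    entry.count = 0;
  }
  failures.set(ip, entry);
}

function recordSuccess(ip: string): void {
  failures.delete(ip); // a correct login clears the slate
}
```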
It's about time we got rid of those CAPTCHAs. Computers and algorithms have become good enough to crack even the hardest ones, while they only make things frustrating for people.
Yes. It is getting harder.
If everyone realized how reCAPTCHA works, everyone could pass even with an unreadable word. reCAPTCHA always shows two words: one whose text reCAPTCHA already knows through OCR, and another that you can get wrong, because reCAPTCHA doesn't know the correct answer for it. When I find a reCAPTCHA that's too difficult, I simply type "verydifficultword" along with the readable word.
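A toy sketch of that grading scheme (my simplification for illustration; reCAPTCHA's actual internals aren't public):

```typescript
// Only the control word (known via OCR) is actually graded; the unknown
// word is just collected as a digitization vote when the control word passes.
function grade(
  knownAnswer: string,  // what OCR says the control word is
  typedKnown: string,   // what the user typed for the control word
  typedUnknown: string  // what the user typed for the scanned, unknown word
) {
  const isHuman = typedKnown.trim().toLowerCase() === knownAnswer.toLowerCase();
  return {
    isHuman,
    digitizationVote: isHuman ? typedUnknown.trim() : null,
  };
}
```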
Yes, it is getting harder. Whatever good things it does, it should still be usable. I tried 3 or 4 times on their audio CAPTCHA and failed each time. Though CAPTCHAs try to solve a real issue, they are a big problem for those who cannot see the CAPTCHA image and have to rely on audio, and not all the sites that use CAPTCHAs provide audio options. In any case, I think we'll have to keep proving to these machines that we are indeed humans for a long time to come.
The thing to keep in mind about reCAPTCHA is that the images are actually scanned from real books and articles. As such, you have to be aware that funky punctuation and other oddities can make it in; it's not just words. For example, I've seen partial words that end in a hyphen (that obviously occurred at the end of a line), as well as dollar signs, numbers (like "1. Something"), etc.
I find if you bear in mind the origin it makes a heck of a lot more sense and is easier to solve.
Also, interestingly, you only need to get one of the reCAPTCHA words right, because the other is used to aid the digitization. However, you won't know which is which. :)
When do you start to consider a code base to be getting too large and unwieldy?
- when a significant amount of your coding time is devoted to "where do I put this code?"
- when reasoning about side effects starts to become really hard
- when there's a significant amount of code that's just "in there", and nobody knows what it does or whether it's still running, but it's too scary to remove
- when lots of team members spend significant chunks of their time chasing down intermittent bugs caused by some edge case, like an empty string somewhere in the data where it wasn't expected, that you'd think a well-written application would usually catch
- when, in considering how to implement a new feature, "complete rewrite" starts to seem like a good answer
- when you dread looking at the mess of code you need to maintain and wish you could find work building something clean and logical, instead of dumpster diving through the detritus of someone else's poorly organized thinking
When it's over 100 lines. Joke. This is probably the hardest question to answer, because it's very individual.
But if you structure the application well and use different layers for, e.g., interfaces, data, services, and front-end, you will automatically get a nice "base" structure. Then you can divide each layer into different classes, and inside the classes point out the appropriate methods for each class.
However, there's no "x lines per method is bad" rule. Think of it more like this: if there is a possibility of replication, split it out from the current piece and make it reusable.
Reusing code is the basis of all good structure.
And splitting things up into different layers will help the base become more and more flexible and modular.
There exist some calculable metrics if that's what you're searching for. Static code analysis tools can help with that:
Here's one list: http://checkstyle.sourceforge.net/config_metrics.html
Other factors can be the time it takes to change/add something.
Other non-calculable factors can be:
- the risk associated with changes
- the level of intermingling between features
- whether the documentation can keep up with the features/code
- whether the documentation represents the application
- the level of training needed
- the quantity of repetition instead of reuse
Ah, the god-program anti-pattern.
- When you can't remember at least the outline of sections of it.
- When you have to think about how changes will affect the program itself or its dependencies.
- When you can't remember all the things it depends on or that depend on it.
- When it takes more than a few minutes(?) to download the source or compile.
- When you have to worry about how to deploy new versions.
- When you encounter classes which are functionally identical to other classes elsewhere in the app.
So many possible signs.
I think there are many signs that a code base is too large.
It is hard to keep to a consistent naming convention. If classes/methods/attributes can't be named consistently, or can't be found consistently, then it's time to reorganize.
When your programmers are surfing the web and going to lunch just to wait out a compile. Keeping compile/link time to a minimum is important for management; the last thing you want is a programmer getting distracted by twiddling their thumbs for too long.
When small changes start to affect many, MANY other places in the code. There is a benefit to consolidation of code, but there is also a cost. If a small change to fix one bug causes a dozen more, and this commonly happens, then your code base needs to be spread out (versioned libraries) or possibly unconsolidated (yes, duplicated code).
If the learning curve for new programmers on the project is obviously longer than acceptable (usually 90 days), then your code base/training isn't set up right.
...There are many, many more, I'm sure. If you think about it from these three perspectives:
Is it hard to support?
Is it hard to change?
Is it hard to learn?
...then you will have an idea of whether your code fits the "large and unwieldy" category.
For me, code becomes unwieldy when there's been a lot of changes made to the codebase that weren't planned for when the program was initially written or last refactored significantly. At this point, stuff starts to get fitted into the existing codebase in odd places for expediency and you start to get a lot of design artifacts that only make sense if you know the history of the implementation.
Short answer: it depends on the project.
Long answer:
A codebase doesn't have to be large to be unwieldy; spaghetti code can be written from line 1. So there's not really a magic tipping point from good to bad; it's more of a spectrum of great <---> awful, and it takes daily effort to keep your codebase from heading in the wrong direction. What you generally need is a lead developer who has the ability to review others' code objectively and keep an eye on the architecture and design of the code as a whole; no single line developer can do that.
When I can't remember what a class does or what other classes it uses off the top of my head. It's really more a function of my cognitive capacity coupled with the code complexity.
I was trying to think of a way of deciding based on how your colleagues perceive it.
During my first week at a gig a few years ago, I said during a stand-up that I had been tracking a white rabbit around the ContainerManagerBean, the ContainerManagementBean, and the ContextManagerBean (it makes me shudder just recalling these names!). At least two of the developers looked at their shoes, and I could see them holding in a snigger.
Right then and there, I knew that this was not a problem with my lack of familiarity with the codebase - all the developers perceived a problem with it.
If, over years of development, different people code change requests and bug fixes, you will sooner or later get parts of the code with duplicated functionality, very similar classes, some spaghetti, etc.
This is mostly due to the fact that a fix is needed fast and the "new guy" doesn't know the code base, so he happily codes away at something which is already there.
But if you have automatic checks in place for style, unit-test code coverage, and the like, you can avoid some of it.
A lot of the things that people have identified as indicating problems don't really have to do with the raw size of the codebase, but rather its comprehensibility. How does size relate to comprehensibility? If at all...
I've seen very short programs that are just a mess -- easier to throw away and redo from scratch. I've also seen very large programs whose structure is transparent enough that it is comprehensible even at progressively more detailed views of it. And everything in between...
I think looking at this question from the standpoint of an entire codebase is a good idea, but it probably pays to work up from the bottom and look first at the comprehensibility of individual classes, then multi-class components, then subsystems, and finally the entire system. I would expect the answers at each level of detail to build on each other.
For my money, the simplest benchmark is this: Can you explain the essence of what X does in one sentence? Where X is some granularity of component, and you can assume an understanding of the levels immediately above and below the component.
When you come to need a utility method or class, and have no idea whether someone else has already implemented it or have any idea where to look for one.
Related: when several slightly different implementations of the same functionality exist, because each author was unaware of other authors' work.
How would you begin improving on a really bad system?
Let me explain what I mean before you recommend creating unit tests and refactoring. I could use those techniques but that would be pointless in this case.
Actually the system is so broken it doesn't do what it needs to do.
For example, the system should count how many messages it sends. It mostly works, but in some cases it "forgets" to increase the value of the message counter. The problem is that so many other modules build their own workarounds on top of this counter that if I corrected it, the system as a whole would become worse than it is currently. The solution could be to modify all the modules and remove their own corrections, but with 150+ modules that would require more coordination than I can afford.
Even worse, there are some problems whose workarounds live not in the system itself, but in people's heads. For example, the system cannot represent more than four related messages in one message group. Some services would require five messages grouped together. The accounting department knows about this limitation, and every time they count the messages for these services, they count the message groups and multiply by 5/4 to get the correct number of messages. There is absolutely no documentation of these deviations, and nobody knows how many such things are present in the system now.
So how would you begin working on improving this system? What strategy would you follow?
A few additional things: I'm a one-man army working on this, so "hire enough people and redesign/refactor the system" is not an acceptable answer. And in a few weeks or months I really should show some visible progress, so doing the refactoring myself over a couple of years is not an option either.
Some technical details: the system is written in Java and PHP, but I don't think that really matters. There are two databases behind it, an Oracle one and a PostgreSQL one. Besides the flaws mentioned above, the code itself smells too; it is really badly written and poorly documented.
Additional info:
The counter issue is not a synchronization problem. The counter++ statements were added to some modules and not to others. A quick and dirty fix is to add them where they are missing. The long-term solution is to make it a kind of aspect for the modules that need it, making it impossible to forget later. I have no problem fixing things like this, but if I made this change I would break over 10 other modules.
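Something like this wrapper is the kind of "aspect" I mean (a sketch with made-up names, not the real module code):

```typescript
// Put the increment inside the send path itself, so no module can forget it.
let messagesSent = 0;

function withCounting<T extends unknown[], R>(
  send: (...args: T) => R
): (...args: T) => R {
  return (...args: T): R => {
    messagesSent += 1; // counted in exactly one place
    return send(...args);
  };
}

// Modules get the wrapped function instead of sprinkling counter++ around:
const sendMessage = withCounting((payload: string) => {
  // ...actual transport code would go here...
  return payload.length;
});
```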
Update:
I accepted Greg D's answer. Even though I like Adam Bellaire's more, knowing what it would be ideal to know doesn't actually help me. Thanks all for the answers.
Put out the fires. If there are any issues of critical priority, whatever they are, you've got to handle them first. Hack it in if you must; with a smelly codebase that's OK. You know you'll improve it going forward. This is your sales technique, targeted at whomever you're reporting to.
Pick some low-hanging fruit. I assume you're relatively new to this particular software and that you were re-tasked to deal with it. Find some apparently easy problems in a related subsystem of the code that shouldn't take more than a day or two to resolve apiece, and fix them. This may involve refactoring, or it may not. The goal is to familiarize yourself with the system and with the style of the original author. You may not get really lucky (One of the two incompetents who worked on my system before me always post-fixed his comments with four punctuation marks instead of one, which made it very easy to distinguish who wrote the particular segment of code.), but you'll develop insight into the author's weaknesses so you know what to look out for. Extensive, tight coupling with global state vs poor understanding of language tools, for example.
Set a big goal. If your experience parallels mine, you'll find yourself in a particular bit of spaghetti code more and more often as you perform the prior step. This is the first knot you need to untangle. With the experience you've gained understanding the component and knowledge about what the original author likely did wrong (and thus, what you need to watch out for), you can start envisioning a better model for this subset of the system. Don't worry if you still have to maintain some messy interfaces to maintain functionality, just take it one step at a time.
Lather, rinse, repeat! :)
Given time, consider adding unit tests for your new model one level underneath your interfaces with the rest of the system. Don't engrave the bad interfaces in code via tests that use them, you'll be changing them in a future iteration.
Addressing the particular issues you mention:
When you run into a situation that users are working around manually, talk with the users about changing it. Verify that they'll accept the change if you provide it before sinking the time into it. If they don't want the change, your job is to maintain the broken behavior.
When you run into a buggy component that multiple other components have worked around, I espouse a parallel-component technique. Create a counter that works the way the existing one should. Provide a similar (or, if practical, identical) interface and slide the new component into the codebase. When you touch external components that work around the broken one, try to replace the old component with the new one. Similar interfaces ease porting of the code, and the old component is still around if the new one fails. Don't remove the old component until you can.
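Sketched out, the parallel-component idea might look like this (the interface and class names are illustrative, not from the asker's system):

```typescript
// Both counters expose the same interface, so callers can be migrated
// one at a time, and the old component stays available as a fallback.
interface MessageCounter {
  increment(): void;
  value(): number;
}

class LegacyCounter implements MessageCounter {
  increment(): void { /* existing code that sometimes "forgets" */ }
  value(): number { return 0; /* stands in for the old, drifting count */ }
}

class FixedCounter implements MessageCounter {
  private count = 0;
  increment(): void { this.count += 1; }
  value(): number { return this.count; }
}

// External components depend only on the interface:
function report(counter: MessageCounter): void {
  console.log(`messages sent: ${counter.value()}`);
}
```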
What is being asked of you right now? Are you being asked to implement functionality, or fix bugs? Do they even know what they want you to do?
If you don't have the manpower, time, or resources to "fix" the system as a whole, then all you can do is bail water. You say you should be able to make some "visible progress" in a few months' time. Well, with the system being as bad as you describe, you may actually make it worse. Under pressure to do something noticeable, you'll simply add code and make the system even more convoluted.
You need to refactor, eventually. There is no way around it. If you can find a way to refactor that is visible to your end users, that would be ideal, even if it takes 6-9 months or a year instead of "a few months." But if you can't, then you have a choice to make:
Refactor, and risk being viewed as "not accomplishing anything" despite your efforts
Don't refactor, accomplish "visible" goals, and make the system more convoluted and more difficult to refactor one day. (Maybe after you find a better job, and hope the next developer to come along can never find out where you live.)
Which one is most beneficial to you personally depends on your company's culture. Will they one day decide to hire more developers, or replace this system completely with some other product?
Conversely, if your efforts to "fix things" actually break other things, will they be understanding about the monstrosity you're being asked to tackle single-handedly?
No easy answers here, sorry. You have to evaluate based on your unique, individual situation.
This is a whole book that will basically say "unit test and refactor," but with more practical advice on how to do it: Working Effectively with Legacy Code.
http://ecx.images-amazon.com/images/I/51RCXGPXQ8L._SL500_AA240_.jpg
http://www.amazon.com/Working-Effectively-Legacy-Robert-Martin/dp/0131177052
You open the directory that contains this system with Windows Explorer. Then, press Ctrl-A, and then Shift-Delete. That sounds like an improvement in your case.
Seriously though: that counter sounds like it's got thread-safety issues. I'd put a lock around the increasing functions.
And regarding the rest of the system, you can't do the impossible so try to do the possible. You need to attack your system from two fronts. Take care of the more visibly problematic issues first, so you can show progress. At the same time, you should deal with the more infrastructural problems, so that you have a chance at actually fixing this thing some day.
Good luck, and may the source be with you.
Pick one area that would be of medium difficulty to refactor. Create a skeleton of the original code with only the method signatures of the existing one; maybe even use an interface. Then start hacking away. You can even point the "new" methods at the old ones until you get to them.
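For instance, the skeleton could start life as a thin facade that delegates everything to the legacy code (all names here are hypothetical):

```typescript
// New interface with the target shape; every method initially just
// delegates to the old implementation until it gets rewritten.
interface MessageStats {
  countFor(serviceId: string): number;
}

// Stand-in for the existing tangled function.
function legacyCountMessages(serviceId: string): number {
  return 0; // ...the real legacy logic lives here...
}

class MessageStatsFacade implements MessageStats {
  countFor(serviceId: string): number {
    return legacyCountMessages(serviceId); // swap out piece by piece later
  }
}
```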
Then, testing, testing, testing. Since there aren't any unit tests, maybe just use good old-fashioned voice-activated unit tests (people)? Or write your own tests as you go.
Document your progress as you go in some kind of repository, including frustrations and questions, so that the next poor schmuck who gets this project won't be where you are :).
Once you get the first part done, move on to the next. The key is to build on top of incremental progress, that's why you shouldn't start with the hardest part first; it'll be too easy to get demoralized.
Joel has a couple of articles on rewriting/refactoring:
http://www.joelonsoftware.com/articles/fog0000000069.html
http://www.joelonsoftware.com/articles/fog0000000348.html
I've been working with a legacy system with the same characteristics for almost three years now, and there are no shortcuts that I'm aware of.
What bothers me most with our legacy system is that I'm not allowed to fix some bugs, since many other functions could break if I fixed them. This calls for ugly workarounds, or for creating new versions of old functions. Calls to the old functions can then be replaced with the new ones one at a time (with testing along the way).
I'm not sure what the goal of your task is, but I strongly advise you to touch as little of the code as possible. Only do what you need to do.
You may want to get as much as possible documented by interviewing people. This is a huge task, since you don't know which questions to ask, and people will have forgotten a lot of details.
Other than that: make sure you're getting paid and enough moral support. There will be weeping and gnashing of teeth...
Well, you need to start somewhere, and it sounds like there are bugs that need fixing. I would work through those bugs, making quick-win refactorings and writing whatever unit tests are possible along the way. I would also use a tool like SourceMonitor to identify some of the most 'complex' parts of the code in the system and see if I could simplify their design in any way. Ultimately, you just have to accept that it will be a slow process and make small steps towards a better system.
I would try to pick a part of the system that could be extracted and rewritten in isolation fairly quickly. Even if it doesn't do much, you could show progress pretty quickly, and you don't have the problem of interfacing with the legacy code directly.
Hopefully, if you could pick off a few such tasks, they will see you making visible progress, and you could put forward an argument for hiring more people to rewrite the bigger modules. When parts of the system rely on broken behaviour, you don't have much choice but to separate before you fix anything.
Hopefully, you could gradually build a team capable of rewriting the whole lot.
All of this would have to go hand in hand with some decent training, otherwise people's old habits will stick, and your work will get the blame when things don't work as expected.
Good luck!
Deprecate everything that currently exists and has problems, and write new versions that work correctly. Document as much as you can about what will change, and put big red flashing signs all over the place pointing to this documentation.
By doing it that way, you can keep your existing bugs (the ones that are being compensated for somewhere else) around without slowing down your progress towards getting an actual working system.