What environment do I need for Testing Big Data Frameworks? [closed] - hadoop

Closed. This question is off-topic. It is not currently accepting answers. Closed 10 years ago.
As part of my thesis I have to evaluate and test some Big Data frameworks like Hadoop or Storm. What minimal setup would you recommend to get relevant information about performance and scalability? Which cloud platforms would be best suited for this? Since I'm evaluating more than one framework, an out-of-the-box PaaS solution wouldn't be the best choice, right? What's the minimal number of nodes/servers needed to get relevant information? The cheaper the better, since the company I'm doing it for probably won't grant me a 20-machine cluster ;)
Thanks a lot,
kroax

Well, you're definitely going to want at least two physical machines. Anything like putting multiple VMs on one physical machine is out of the question, as then you don't get the network overhead that's typical of distributed systems.
Three is probably the absolute minimum you could get away with for a realistic scenario. And even then, a lot of the time, the overhead of Hadoop is just barely outweighed by the gains.
I would say five is the most realistic minimum, and a pretty typical small cluster size. Five to eight is a good, small range.
As far as platforms go, I would say Amazon EC2/EMR should always be a good first option to consider. It's a well-established, great service, and many real-world clusters are running on it. The upsides are that it's easy to use, relatively inexpensive, and representative of real-world scenarios. The only downside is that the virtualization could cause it to scale slightly differently than individual physical machines, but that may or may not be an issue for you. If you use larger instance types, I believe they are less virtualized.
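To make the EC2/EMR option concrete, here is a minimal sketch of a benchmark-cluster request, assuming boto3 is installed and AWS credentials are configured. The instance types, counts, release label, and role names are illustrative placeholders, not recommendations.

```python
# Sketch: a small EMR cluster request sized to the 5-node range above.
# All concrete values are illustrative assumptions, not recommendations.

cluster_request = {
    "Name": "framework-benchmark",
    "ReleaseLabel": "emr-4.0.0",          # pick a release that bundles Hadoop
    "Instances": {
        "MasterInstanceType": "m1.large",
        "SlaveInstanceType": "m1.large",
        "InstanceCount": 5,                # master + 4 workers
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",  # default EMR roles
    "ServiceRole": "EMR_DefaultRole",
}

# With boto3 this request would be submitted as:
#   import boto3
#   emr = boto3.client("emr")
#   response = emr.run_job_flow(**cluster_request)
```

Keeping the request as a plain dict like this also makes it easy to vary the instance count per benchmark run.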
Hope this helps.

Related

What is the most limited and expensive resource in a computer? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. Closed 2 years ago.
What is the most expensive and limited resource in a computer today?
Is it the CPU? Maybe the memory, or, as I was told, the bandwidth (or something entirely different)?
Does that mean a computer should do everything it can to use that resource more efficiently, even at the cost of putting more load on other resources?
For example, by compressing files we put more load on the CPU so the file can be transmitted over the network faster.
I think I know the answer to that, but I would like to hear it from someone else, please provide an explanation.
There is a more costly resource that you left out -- Design and Programming.
I answer a lot of questions here. Rarely do I say "beef up the hardware". I usually say "redesign or rewrite".
Most hardware improvements are measured in percentages. Clever redesigns are measured in multiples.
A complex algorithm can be replaced by a big table lookup. -- "Speed" vs "space".
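The speed-vs-space trade can be made concrete with a classic, hypothetical example: counting set bits with a loop versus a precomputed 256-entry byte table.

```python
# Speed vs. space: trade per-call loop work for a one-time 256-entry table.

# Naive: count set bits by shifting -- work proportional to bit length
# on every call.
def popcount_loop(n):
    count = 0
    while n:
        count += n & 1
        n >>= 1
    return count

# Table lookup: precompute the answer for every byte value once;
# each call is then just indexing and addition, one byte at a time.
TABLE = [popcount_loop(i) for i in range(256)]

def popcount_table(n):
    count = 0
    while n:
        count += TABLE[n & 0xFF]
        n >>= 8
    return count
```

The table costs 256 entries of memory forever; in exchange, every subsequent call touches an eighth as many loop iterations.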
"Your search returned 8,123,456 results, here are the first 10" -- You used to see things like that from search engines. Now it says "About 8,000,000 results" or does not even say anything. -- "Alter the user expectations" or "Get rid of the bottleneck!".
One time I was looking at why a program was so slow. I found that 2 lines of code were responsible for 50% of the CPU consumed. I rewrote those 2 lines into about 20, nearly doubling the speed. This is an example of how to focus the effort to efficiently use the programmer.
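The anecdote's actual code isn't shown, but the flavor of such a hot-spot rewrite can be sketched with a textbook example: building a large string by repeated concatenation (which can be quadratic in CPython) versus a single join.

```python
# A hypothetical example of the kind of 2-line hot spot worth rewriting
# (not the program from the anecdote above).

def build_slow(parts):
    s = ""
    for p in parts:
        s += p              # may re-copy the growing string each iteration
    return s

def build_fast(parts):
    return "".join(parts)   # one pass, one final allocation

parts = [str(i) for i in range(10_000)]
```

The two functions produce identical output; only the CPU profile differs, which is exactly why profiling before rewriting pays off.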
Before SSDs, large databases were severely dominated by disk speed. SSDs shrank that by a factor of 10, but disk access is still a big problem.
Many metrics in computing have followed Moore's law. But one hit a brick wall -- CPU speed. That has only doubled in the past 20 years. To make up for it, there are multiple CPUs/cores/threads. But that requires much more complex code. Most products punt -- and simply use a single 'cpu'.
"Latency" vs "throughput" -- These two are mostly orthogonal. The former measures elapsed time, which is limited by the speed of light, etc. The latter measures how much data -- fiber optics is much "fatter" than a phone wire.

Which navigation methods would be the most performant and flexible for a game with a very large number of AI on a dynamic playfield? [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 2 years ago.
I'm not 100% sure what factors are important when deciding whether to use Unity's NavMesh vs an advanced pathing algorithm such as HPA* or similar. When considering the mechanics below, what are the implications of using Unity's NavMesh vs rolling my own algorithms:
Grid based real-time building.
Large number of AI, friendly, hostile, neutral. Into the hundreds. Not all visible on screen at once but the playfield would be very large.
AI adheres to a hierarchy: entities issue commands, receive commands, and execute them in tandem with one another. This could allow advanced pathing to be done on a single unit that relays rough directions to others, which then run lower-level pathing to save on performance.
World has a strong chance of being procedural. I wanted to go infinite proc-gen but I think that's out of scope. I don't intend the ground plane to be very diverse in terms of actual height, just the objects placed on it.
Additions and removals within the environment will be dynamic at run-time by both the player and AI entities.
I've read some posts talking about how NavMesh can't handle runtime changes very well but have seen tutorials/store assets that are contrary to that. Maybe I could combine methods too? The pathing is going to be a heavy investment of time so any advice here would be greatly appreciated.
There are lots of solutions. It's way too much for a single answer, but here are some keywords to look into:
Swarm pathfinding
Potential fields
Flocking algorithms
Boids
Collision avoidance
Which one you use depends on how many units will be pathing at a time, whether they're pathing to the same place or different places, and how you want them to behave if multiple are going to the same place (e.g. should they intentionally avoid collisions with each other? Take alternate routes when one is gridlocked? Or all just stupidly cram into the same hallway?)
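For context on the techniques above, plain A* is the primitive that HPA* and most unit pathing build on (hierarchical variants just run the same search on a coarse graph of cluster entrances). A minimal grid sketch in Python, purely for illustration since Unity's NavMesh has its own internal search:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 4-connected grid; grid[y][x] == 1 means blocked.
    Returns the path as a list of (x, y) tuples, or None if unreachable."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])  # Manhattan heuristic
    open_heap = [(h(start), start)]
    came_from = {start: None}
    g = {start: 0}
    while open_heap:
        _, cur = heapq.heappop(open_heap)
        if cur == goal:                      # reconstruct path backwards
            path = []
            while cur is not None:
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        x, y = cur
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= ny < len(grid) and 0 <= nx < len(grid[0]) and not grid[ny][nx]:
                ng = g[cur] + 1
                if ng < g.get((nx, ny), float("inf")):
                    g[(nx, ny)] = ng
                    came_from[(nx, ny)] = cur
                    heapq.heappush(open_heap, (ng + h((nx, ny)), (nx, ny)))
    return None
```

With hundreds of units, the usual trick is to run a search like this rarely (per squad leader, or per coarse cluster) and let the flocking/avoidance techniques above handle local movement.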

refactor old webapp to gain speed [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 2 years ago.
Four years ago I built a webapp which is still used by some friends. The problem with that app is that it now has a huge database and it loads very slowly. I know that is my fault: MySQL queries are mixed in all over the place (even at layout generation time).
At the moment I know something about OO. I'd like to use this knowledge in my old app, but I don't know how to do it without rewriting everything from the beginning. Switching my app to MVC is very difficult right now.
If you were in my place, or if you had the task of improving the speed of my old app, how would you do it? Do you have any tips for me? Any working scenarios?
It all depends on context. The best would be to change the entire application, introducing best practices and standards at once. But perhaps an evolutionary approach would be better:
1. Identify the major bottlenecks in the application using a profiling tool or load test.
2. Estimate the effort required to refactor each item.
3. Identify the pages whose performance matters most to the end user.
4. Based on that information, create a task list and set the priority of each item.
Attack one problem at a time, making small increments, and try to spend 80% of your time solving the 20% most critical problems.
Hard to give specific advice without a specific question, but here are some general optimization/organization techniques:
Profile to find hot spots in your code
You mention MySQL queries being slow to load; try to optimize them
Possibly move database access to stored procedures to help modularize your code
Look for repeated code and try to move it into objects, one piece at a time
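The last two points can be sketched together: gather the queries that were scattered through the page templates into one data-access class. This is a minimal illustration, with sqlite3 standing in for MySQL and hypothetical table and method names, not code from the original app.

```python
import sqlite3

class UserRepository:
    """One home for the user queries that used to live in the templates."""

    def __init__(self, conn):
        self.conn = conn

    def find_by_name(self, name):
        # One parameterized query instead of ad-hoc SQL in the layout code.
        cur = self.conn.execute(
            "SELECT id, name FROM users WHERE name = ?", (name,))
        return cur.fetchone()

    def add(self, name):
        self.conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        self.conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
repo = UserRepository(conn)
repo.add("alice")
```

Migrating one page at a time onto a class like this avoids the full rewrite while still concentrating the slow queries in one place where they can be profiled and tuned.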

How many of you record your historical project data for future estimates, and how do you do it? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers. Closed 5 years ago.
When working on a project - I always estimate my tasks and calculate how long it will take me to finish. So in the end I get a time-span in which the project should be finished (it rarely is).
My question is: do you record the data and assumptions used in your estimates during a project and use them for later projects, or for refined estimates on the same project?
If so - how do you record such data and how do you store them?
I use an Excel sheet, but somehow (can't imagine how that happened ;)) I tend to forget to fill in new assumptions or newly gained information. On the other hand, it is not really readable or useful for evaluating my predictions after finishing the project, so that I can learn from it for the next one.
Sounds like what Joel wrote FogBugz for.
I had a discussion with a friend recently about a pragmatic variation of this, more specifically, the feasibility of using the coarse-level evidence of when code is checked in.
Provided you work in a reasonably cohesive manner, your checkins can be related, at least through the files involved, to some work units, and the elapsed time between them can be used to determine an average productivity.
This fits well with the Evidence-based Scheduling approach included in FogBugz. If you happen to have been spending time on other things to an unusual degree, then in future you will be more productive than the checkin rate suggests. Any error is on the safe side of over-allocating time.
The main flaw, for me, in an approach like this is that I typically interweave at least two projects, often more, in different repositories and languages. I would need to pull the details together and make a rough allocation of relative time between them to achieve the same thing. In a more focused team, I think repository date stamps may be good enough.
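The checkin-based estimate can be sketched very simply: given a series of commit timestamps, the average gap between consecutive checkins is a crude productivity rate. The timestamps below are invented sample data; as noted above, the average will silently absorb nights, weekends, and time spent on other projects, which is the approach's main flaw.

```python
from datetime import datetime

def hours_per_checkin(timestamps):
    """Average elapsed hours between consecutive checkins.
    Timestamps are ISO-8601 strings; returns None with fewer than two."""
    ts = sorted(datetime.fromisoformat(t) for t in timestamps)
    if len(ts) < 2:
        return None
    total_hours = (ts[-1] - ts[0]).total_seconds() / 3600.0
    return total_hours / (len(ts) - 1)

checkins = ["2009-06-01T09:00", "2009-06-01T13:00", "2009-06-02T09:00"]
```

For the sample data this yields 12 hours per checkin; splitting the calculation per repository would be the rough allocation between projects mentioned above.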
Isn't that what project managers are for? ;)

What is the best practice for estimating required time for development of the SDLC phases? [closed]

Closed. This question is opinion-based. It is not currently accepting answers. Closed 4 years ago.
As a project manager, you are required to organize time so that the project meets a deadline.
Is there some sort of equations to use for estimating how long the development will take?
Let's say, for the database:
time = SQL stored procedures × tables manipulated, or something similar.
Or are you just stuck having to get the experience to get adequate estimations?
As project manager you have to remember that the best you will ever be able to do on your own is give your best guess as to how long a given project will take. How accurate you are depends on your experience and the scope of the project.
The only way I know of to get a reasonably accurate estimate is to break the project into individual tasks and get the developer who will be doing the actual work to put an estimate on each task. You can then use an evidence-based algorithm that takes the estimation accuracy of each developer into account to give you the probability of hitting a given deadline.
If the probability is too low, you have two choices: remove features or move the deadline.
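The evidence-based step can be sketched as a small Monte Carlo simulation: each developer's past ratios of actual-to-estimated hours are resampled to turn the remaining task estimates into a distribution of possible totals. All numbers below are invented sample data.

```python
import random

def simulate_total_hours(estimates, history, trials=10_000, rng=None):
    """Resample past actual/estimate ratios to project total hours.
    Returns the sorted list of simulated totals."""
    rng = rng or random.Random(0)      # fixed seed so runs are repeatable
    totals = []
    for _ in range(trials):
        # Each task's estimate is scaled by a randomly drawn past ratio.
        totals.append(sum(est * rng.choice(history) for est in estimates))
    totals.sort()
    return totals

estimates = [8, 16, 4, 12]             # remaining task estimates, in hours
history = [1.0, 1.1, 1.5, 0.9, 2.0]    # this developer's past actual/estimate ratios
totals = simulate_total_hours(estimates, history)
p50 = totals[len(totals) // 2]         # median projected total
p90 = totals[int(len(totals) * 0.9)]   # total you can commit to with ~90% confidence
```

Reading off a high percentile instead of the median is what turns "best guess" into "probability of hitting a given deadline".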
Further reading:
http://www.joelonsoftware.com/items/2007/10/26.html
http://www.wordyard.com/2007/10/11/evidence-based-scheduling/
http://en.wikipedia.org/wiki/Monte_Carlo_method
There's no set formula out there that I've seen that would really work. FogBugz has its Monte Carlo simulator, which is somewhat of a concept for this, but really, experience is going to be your best point of reference. Every developer and every project will be different!
There will be such a formula as soon as computers can start generating all code themselves. Until then you are stuck with human developers who all have different levels of skill and development speed.
