Fast linear system solver for D? [closed]

Closed 8 years ago. Not accepting answers: questions seeking recommendations for books, tools, or software libraries are off-topic on Stack Overflow.
Where can I get a fast linear system solver written in D? It should be able to take a square matrix A and a vector b and solve the equation Ax = b for x and, ideally, also perform explicit inversion of A. I have one I wrote myself, but it's pretty slow, probably because it's completely cache-naive. However, for my use case, I need something with the following absolute, non-negotiable requirements, i.e. if it doesn't meet these, then I don't care how good it otherwise is:
Must be public domain, Boost-licensed, or under some similarly permissive license. Ideally it should not require attribution in binaries (i.e. not BSD), though this point is somewhat negotiable.
Must be written in pure D or be easily translatable to pure D. Inscrutable Fortran code (e.g. LAPACK) is not a good answer, no matter how fast it is.
Must be optimized for large (i.e. n > 1000) systems. I don't want something that's designed for game programmers to solve 4x4 matrices really, really fast.
Must not be inextricably linked to a huge library of stuff I don't need.
Edit: The reason for these seemingly insane requirements is that I need this code for a permissively licensed open source library that I don't want to have any third-party dependencies.

If you don't like Fortran code, one reasonably fast C++ dense matrix library with modest multi-core support, well-written code and a good user-interface is Eigen. It should be straightforward to translate its code to D (or to take some algorithms from it).
And now my "think about your requirements": there is a reason why "everyone" (Mathematica, Matlab, Maple, SciPy, GSL, R, ...) uses ATLAS / LAPACK, UMFPACK, PARDISO, CHOLMOD etc. It is hard work to write fast, multi-threaded, memory-efficient, portable and numerically stable matrix solvers (trust me, I have tried). A lot of this hard work has gone into ATLAS and the rest.
So my approach would be to write bindings for the relevant library depending on your matrix type, and link from D against the C interfaces. Maybe the bindings in multiarray are enough (I haven't tried). Otherwise, I'd suggest looking at another C++ library, namely uBlas and the respective bindings for ideas.

Related

Which FPGA should I choose? (or should I choose another hardware) [closed]

Closed 5 years ago. Not accepting answers: questions seeking recommendations for books, tools, or software libraries are off-topic on Stack Overflow.
You see, guys, I've always been interested in buying one of these development boards, but they were too expensive for me as a student since I had to spend on other projects. However, I sold some things I don't use and finally made the money to buy one.
So my problem is this: I am currently studying electronic engineering, but I've been dedicating a lot of time to programming, reverse engineering, and understanding some moderately complex cryptographic math algorithms (mainly the ones used for hashing), prime number testing, NP-hard kinds of algorithms, and some graph path search algorithms. So I wanted to buy an FPGA under $200 that could do the job if I wanted to compute these kinds of tasks on it; right now I use my computer for some of them.
Let's say, as an example, that I wanted to build an architecture for WPA or MD5 brute-forcing. We all know the numbers go nuts if the password is longer than 8 characters, and even though I'm more interested in deeply understanding how the protocols work and how to implement these ideas, it would just be nice to see it working.
Right now the options I've looked at so far are:
-Cyclone V GX Starter Kit ($179)
which has: Cyclone V GX 5CGXFC5C6F27C7N Device
https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=830
-DE10-Nano Kit ($130)
which has: Cyclone V 5CSEBA6U2317N Device
https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=1046
But I'm kind of new to FPGAs; I mean, I've worked with them, but only for university projects using their FPGAs, so I didn't get to know them very well.
So my final question is: does FPGA speed depend only on the number of logic elements it has? Or should I care more about that than about the other "add-ons" the boards have? Even though the second one is cheaper, it has like 30% more logic elements than the first one, but I don't know whether that would actually mean better performance.
Also, here's the datasheet for the Cyclone V devices:
https://www.altera.com/en_US/pdfs/literature/hb/cyclone-v/cv_51001.pdf
Also, thank you for taking the time to read this, guys; I know it's usually more interesting to solve programming issues and that kind of stuff, haha.
EDIT: Forgot the "1" on the "$179"
The boards you've listed have similar speed grades, so there won't be any difference in raw speed.
The GX series includes 3 Gbps transceivers, and that particular Starter Kit has 2.5 V levels on the HSMC connector. Unless you will be using that connector with some really fast hardware (like an 80 Msps ADC/DAC, etc.), it's unlikely you will benefit from the GX, except perhaps for the slightly larger number of hardware multipliers available, but that depends on your exact project and needs.
Lots of GPIO lines will be lost to the HSMC connector. There are boards to fan out the HSMC connector into convenient 40-pin GPIO connectors, but that will cost another $56. And you might still have difficulties with the external hardware you will be playing with, since the I/O banks on those lines will be using 2.5 V levels, while most likely you will have lots of 3.3 V devices. They're compatible to a degree and under some conditions, but it's safer to assume there will be issues.
If you will eventually be playing with DIY stuff, then you will need more I/O lines at the more convenient voltage of 3.3 V. The DE10-Nano kit looks more promising to me in the general case. There are two ARM cores that you can use to run higher-level logic in Linux. It has Arduino-compatible connectors, so you can play with existing shields. It's also larger than the Starter Kit in terms of ALMs and on-chip memory; you will need those to instantiate lots of parallel blocks to crunch your numbers.
Sure, if you already have some daughter boards in HSMC format, or are planning to get one, then you will need a kit with HSMC support.

Prolog - high-level purpose of WAM [closed]

Closed 8 years ago. Not accepting answers: the question needs to be more focused.
I am trying to understand the purpose of the WAM at a conceptual, high level, but all the sources I have consulted so far assume that I know more than I currently do, and they approach the issue from the bottom up (details first). They start by throwing trees at me, whereas right now I am concerned with seeing the whole forest.
The answers to the following questions would help me in this endeavor:
Pick any group of accomplished, professional Prolog implementers - the SICStus people, the YAP people, the ECLiPSe people - whoever. Now, give them the goal of implementing a professional, performant, WAM-based Prolog on an existing virtual machine - say the Erlang VM or the Java VM. To eliminate answers such as "it depends on what your other goals are," let's say that any other goals they have besides the one I just gave are the ones they had when they developed their previous implementations.
Would they (typically) implement a virtual machine (the WAM) inside of a VM (Erlang/JVM), meaning would you have a virtual machine running on top of, or being simulated by, another virtual machine?
If the answer to 1 is no, does that mean that they would try to somehow map the WAM and its associated instructions and execution straight onto the underlying Erlang/Java VM, in order to make the WAM 'disappear' so to speak and only have one VM running (Erlang/JVM)? If so, does this imply that any WAM heaps, stacks, memory operations, register allocations, instructions, etc. would actually be Erlang/Java ones (with some tweaking or massaging)?
If the answer to 1 is yes, does that mean that any WAM heaps, stacks, memory ops, etc. would simply be normal arrays or linked lists in whatever language (Erlang or Java, or even Clojure running on the JVM for that matter) the developers were using?
What I'm trying to get at is this. Is the WAM merely some abstraction or tool to help the programmer organize code, understand what is going on, map Prolog to the underlying machine, perhaps provide portability, etc., or is it seen as an (almost) necessary, or at least quite useful, "end in itself" for implementing a Prolog?
Thanks.
I'm excited to see what those more knowledgeable than I are able to say in response to this interesting question, but in the unlikely event that I actually know more than you do, let me outline my understanding. We'll both benefit when the real experts show up and correct me and/or supply truer answers.
The WAM gives you a procedural description of a way of implementing Prolog. Prolog as specified does not say how exactly it must be implemented, it just talks about what behavior should be seen. So WAM is an implementation approach. I don't think any of the popular systems follow it purely, they each have their own version of it. It's more like an architectural pattern and algorithm sketch than a specification like the Java virtual machine. If it were firmer, the book Warren's Abstract Machine: A Tutorial Reconstruction probably wouldn't need to exist. My (extremely sparse) understanding is that the principal trick is the employment of two stacks: one being the conventional call/return stack of every programming language since Algol, and the other being a special "trail" used for choice points and backtracking. (edit: #false has now arrived and stated that WAM registers are the principal trick, which I have never heard of before, demonstrating my ignorance.) In any case, to implement Prolog you need a correct way of handling the search. Before WAM, people mostly used ad-hoc methods. I wouldn't be surprised to learn that there are newer and/or more sophisticated tricks, but it's a sound architecture that is widely used and understood.
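To make the stack-plus-trail idea concrete, here is a tiny sketch (in Haskell, purely for illustration; the names are mine and this is nowhere near the real WAM). The point is just that a trail records which variables have been bound since a choice point, so backtracking can unbind exactly those:

```
import Data.IORef

-- A toy term representation: unbound logic variables hold Nothing.
data Term = Atom String | Struct String [Term] | Var (IORef (Maybe Term))

-- The trail: every variable bound so far, newest first.
type Trail = IORef [IORef (Maybe Term)]

-- Bind a variable and remember the binding on the trail.
bind :: Trail -> IORef (Maybe Term) -> Term -> IO ()
bind trail v t = do
  writeIORef v (Just t)
  modifyIORef trail (v :)

-- A choice point just remembers how long the trail was when it was made.
newtype ChoicePoint = ChoicePoint Int

mkChoicePoint :: Trail -> IO ChoicePoint
mkChoicePoint trail = ChoicePoint . length <$> readIORef trail

-- Backtracking: unbind everything recorded after the choice point.
backtrackTo :: Trail -> ChoicePoint -> IO ()
backtrackTo trail (ChoicePoint n) = do
  entries <- readIORef trail
  let (newer, older) = splitAt (length entries - n) entries
  mapM_ (\v -> writeIORef v Nothing) newer
  writeIORef trail older
```

A real implementation is far more careful (registers, environments, trailing only bindings older than the choice point), but this is the general shape of the choice-point and backtracking machinery.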
So the answer to your three-part question is, I think, both. There will be a VM within the VM. The VM within the VM will, of course, be implemented in the appropriate language and will therefore use that language's primitives for handling the invisible parts of the VM (the stack and the trail). Clojure might provide insight into the ways a language can share things with its own implementation language. You would be free to intermix as desired.
The answer to your final question, what you're trying to get at, is that the WAM is merely an abstraction for the purposes you describe and not an end in itself. There is not, for instance, such a thing as "portable WAM bytecode" the way compiled Java becomes portable JVM bytecode, which might justify it absent the other benefits. If you have a novel way of implementing Prolog, by all means try it and forget all about the WAM.

Sorting algorithm for unpredicted data [closed]

Closed 9 years ago. Not accepting answers: the question is opinion-based.
Considering that many libraries already have optimized sort routines, why do many companies still ask about Big O and specific sorting algorithms, when in everyday computing this kind of implementation is rarely needed any more?
For example, a self-balancing binary tree is the kind of problem that some big companies in the embedded industry still ask programmers to code as part of their testing and candidate selection process.
Even for embedded coding, are there any circumstances in which such an implementation is going to take place, given that Boost, SQLite, and other libraries are available for use? In other words, is it still worth thinking about ways to optimize such algorithms?
As an embedded programmer, I would say it all comes down to the problem and the system constraints. Especially on a microprocessor, you may not want or need to pull in Boost, and SQLite may still be too heavy for a given problem. How you chop up problems looks different if you have, say, 2K of RAM - but this is definitely the extreme.
So, for example, you probably don't want to rewrite the code for a red-black tree yourself because, as you pointed out, you will likely end up with highly non-optimized code. But in the pursuit of reusability, abstraction often adds layers of indirection to the code. So you may end up reimplementing at least the simpler "solved" problems where you can do better than the built-in library for certain niche cases. Recently, for example, I wrote a specialized version of linked lists using shared memory pools for allocation across lists. I benchmarked it against STL's list, and the STL version just wasn't cutting it because of the added layers of abstraction. Is my code as good? No, probably not. But I was more easily able to specialize for the specific use cases, so it came out better.
So I guess I'd like to address a few things from your post:
-Why do companies still ask about big-O runtime? I have seen even pretty good engineers make bad choices with regards to data structures because they didn't reason carefully about the O() runtime. Quadratic versus linear or linear versus constant time operation is a painful lesson when you get the assumption wrong. (ah, the voice of experience)
-Why do companies still ask about implementing classic structures/algorithms? Arguably you won't reimplement quick sort, but as stated, you may very well end up implementing slightly less complicated structures on a more regular basis. Truthfully, if I'm looking to hire you, I'm going to want to make sure that you understand the theory inside and out for existing solutions so if I need you to come up with a new solution you can take an educated crack at it. And if the other applicant has that and you don't, I'd probably say they have an advantage.
Also, here's another way to think about it. In software development, often the platform is quite powerful and the consumer already owns the hardware platform. In embedded software development, the consumer is probably purchasing the hardware platform - hopefully from your company. This means that the software is often selling the hardware. So often it makes more cents to use less powerful, cheaper hardware to solve a problem and take a bit more time to develop the firmware. The firmware is a one-time cost upfront, whereas hardware is per-unit. So even from the business side there are pressures for constrained hardware which in turn leads to specialized structure implementation on the software side.
If you suggest using SQLite on a 2 kB Arduino, you might hear a lot of laughter.
Embedded systems are a bit special. They often have extraordinarily tight memory requirements, so space complexity might be far more important than time complexity. I've rarely worked in that area myself, so I don't know what embedded systems companies are interested in, but if they're asking such questions, then it's probably because you'll need to be more acquainted with such issues there than in other areas of IT.
Nothing is optimized enough.
Besides, the questions are meant to test your understanding of the solution (and each part of the solution) and not how great you are at memorizing stuff. Hence it makes perfect sense to ask such questions.

How do you write data structures that are as efficient as possible in GHC? [closed]

Closed 4 years ago. Not accepting answers: the question needs to be more focused.
So sometimes I need to write a data structure I can't find on Hackage, or what I find isn't tested or high-quality enough for me to trust, or it's just something I don't want as a dependency. I am reading Okasaki's book right now, and it's quite good at explaining how to design asymptotically fast data structures.
However, I am working specifically with GHC. Constant factors are a big deal for my applications. Memory usage is also a big deal for me. So I have questions specifically about GHC.
In particular
How to maximize sharing of nodes
How to reduce memory footprint
How to avoid space leaks due to improper strictness/laziness
How to get GHC to produce tight inner loops for important sections of code
I've looked around various places on the web, and I have a vague idea of how to work with GHC, for example, looking at core output, using UNPACK pragmas, and the like. But I'm not sure I get it.
So I popped open my favorite data structures library, containers, and looked at the Data.Sequence module. I can't say I understand a lot of what they're doing to make Seq fast.
The first thing that catches my eye is the definition of FingerTree a. I assume that's just me being unfamiliar with finger trees though. The second thing that catches my eye is all the SPECIALIZE pragmas. I have no idea what's going on here, and I'm very curious, as these are littered all over the code.
Many functions also have an INLINE pragma associated with them. I can guess what that means, but how do I make a judgement call on when to INLINE functions?
Things get really interesting around line ~475, in a section headed 'Applicative Construction'. They define a newtype wrapper to represent the Identity monad, they write their own copy of the strict state monad, and they define a function called applicativeTree which is apparently specialized to the Identity monad, and this specialization increases sharing of the function's output. I have no idea what's going on here. What sorcery is being used to increase sharing?
Anyway, I'm not sure there's much to learn from Data.Sequence. Are there other 'model programs' I can read to gain wisdom? I'd really like to know how to soup up my data structures when I really need them to go faster. One thing in particular is writing data structures that make fusion easy, and how to go about writing good fusion rules.
That's a big topic! Most has been explained elsewhere, so I won't try to write a book chapter right here. Instead:
Real World Haskell, ch 25, "Performance" - discusses profiling, simple specialization and unpacking, reading Core, and some optimizations.
Johan Tibell is writing a lot on this topic:
Computing the size of a data structure
Memory footprints of common data types
Faster persistent structures through hashing
Reasoning about laziness
And some things from here:
Reading GHC Core
How GHC does optimization
Profiling for performance
Tweaking GC settings
General improvements
More on unpacking
Unboxing and strictness
And some other things:
Intro to specialization of code and data
Code improvement flags
applicativeTree is quite fancy, but mainly in a way which has to do with FingerTrees in particular, which are quite a fancy data structure themselves. We had some discussion of the intricacies over at cstheory. Note that applicativeTree is written to work over any Applicative. It just so happens that when it is specialized to Id then it can share nodes in a manner that it otherwise couldn't. You can work through the specialization yourself by inlining the Id methods and seeing what happens. Note that this specialization is used in only one place -- the O(log n) replicate function. The fact that the more general function specializes neatly to the constant case is a very clever bit of code reuse, but that's really all.
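In miniature, the effect of that specialization looks something like the following sketch (a toy perfect binary tree, not the real FingerTree code). The general Applicative version has to run the effect once per leaf, but once the "effect" is Id there is nothing left to run, so a single subtree value can be reused for both children:

```
data Tree a = Leaf a | Node (Tree a) (Tree a)

-- General version: each leaf's value comes from an effect, so the two
-- subtrees must, in general, be built (and the effect run) separately.
replicateTreeA :: Applicative f => Int -> f a -> f (Tree a)
replicateTreeA 0 fx = Leaf <$> fx
replicateTreeA d fx =
  Node <$> replicateTreeA (d - 1) fx <*> replicateTreeA (d - 1) fx

-- "Specialized to Id": no effect left to repeat, so the one subtree is
-- shared by both children. A tree with 2^d leaves costs only O(d) work
-- and O(d) fresh heap nodes.
replicateTree :: Int -> a -> Tree a
replicateTree 0 x = Leaf x
replicateTree d x = let t = replicateTree (d - 1) x in Node t t
```

applicativeTree plays the same game over FingerTree digits, which is where the O(log n) replicate comes from.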
In general, Sequence teaches more about designing persistent data structures than about all the tricks for eking out performance, I think. Dons' suggestions are of course excellent. I'd also just browse through the source of the really canonical and tuned libs -- Map, IntMap, Set, and IntSet in particular. Along with those, it's worth taking a look at Milan's paper on his improvements to containers.
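For a concrete (if contrived) picture of the pragmas the question asks about, here is a minimal sketch; all the names are made up, and whether INLINE or SPECIALIZE actually pays off is something you verify in the Core output and benchmarks rather than assume:

```
{-# LANGUAGE BangPatterns #-}

-- Strict, unpacked fields: both Doubles are stored inline in the
-- constructor instead of as pointers to separately boxed values.
data Point = Point {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- INLINE asks GHC to copy this small function into its call sites,
-- where the intermediate Point can often be optimized away entirely.
{-# INLINE addPoint #-}
addPoint :: Point -> Point -> Point
addPoint (Point x1 y1) (Point x2 y2) = Point (x1 + x2) (y1 + y2)

-- SPECIALIZE generates a dedicated copy of an overloaded function at a
-- concrete type, removing dictionary passing on that code path; this is
-- essentially what the pragmas scattered through Data.Sequence request.
{-# SPECIALIZE dot :: [Int] -> [Int] -> Int #-}
dot :: Num a => [a] -> [a] -> a
dot = go 0
  where
    go !acc (a:as) (b:bs) = go (acc + a * b) as bs
    go !acc _      _      = acc
```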

Behavior Tree Implementations [closed]

Closed 7 years ago. Not accepting answers: questions seeking recommendations for books, tools, or software libraries are off-topic on Stack Overflow.
I am looking for behavior tree implementations in any language. I would like to learn more about how they are implemented and used so I can roll my own, but I could only find one, Owyl; unfortunately, it does not contain examples of how it is used.
Does anyone know of any other open source ones whose code I can browse through to see some examples of how they are used, etc.?
EDIT: Behavior tree is the name of the data structure.
Here are a few I've found:
C# - https://github.com/netgnome/BehaviorLibrary (free)
C++ - http://aigamedev.com/insider/tutorial/second-generation-bt/ ($10)
C# - http://code.google.com/p/treesharp/ (free)
C# - https://github.com/ArtemKoval/Simple-Behavior-Tree-Library
Java - http://code.google.com/p/daggame/ DAG AI Code
C# - http://www.sgtconker.com/affiliated-projects/brains/
This Q on GameDev could be helpful, too.
Take a look at https://skill.codeplex.com/. This is a behavior tree code generator for Unity. You can download the source code and see if it is useful.
I did my own behavior tree implementation in C++ and used some modified code from the Protothreads library. Coroutines in C is also a good read. Using this, one can implement a coroutine system that allows running multiple behaviors concurrently without using multiple threads. Basically, each tree node would have its own coroutine.
I don't know that I understand you right, but I think that to implement a tree your best choice is to use a functional language such as F# or Haskell. With Haskell you can use flexible and fast tree structures, and with F# you have a multi-paradigm language to parse and handle tree structures in OO code.
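For what it's worth, a minimal behavior-tree skeleton in Haskell might look like this (just a sketch to show the shape of the data structure; the names are made up, not taken from any library):

```
-- Status returned by ticking a node: the usual behavior-tree trio.
data Status = Success | Failure | Running
  deriving (Eq, Show)

-- A behavior tree parameterized over the "world" the leaves inspect.
data Behavior world
  = Action    (world -> Status)   -- leaf: do something, report a status
  | Condition (world -> Bool)     -- leaf: test the world
  | Sequence  [Behavior world]    -- composite: stop at first non-Success
  | Selector  [Behavior world]    -- composite: stop at first non-Failure

-- Tick the tree once against the current world state.
tick :: world -> Behavior world -> Status
tick w (Action f)    = f w
tick w (Condition p) = if p w then Success else Failure
tick w (Sequence cs) = go cs
  where
    go []       = Success
    go (c:rest) = case tick w c of
      Success -> go rest
      other   -> other
tick w (Selector cs) = go cs
  where
    go []       = Failure
    go (c:rest) = case tick w c of
      Failure -> go rest
      other   -> other
```

Running is what lets a node keep working across ticks; a fuller version would also remember which child was Running so the next tick resumes there rather than starting over, which is roughly what the coroutine-per-node approach mentioned above buys you.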
I hope that helps you.
You can find behavior trees implemented in .NET in the YVision framework. We found them to be particularly suited for the development of Natural User Interface (NUI) applications.
It's not open-source but it's free to use and you can find information on how we implemented them in the tutorials: http://www.yvision.com/support/tutorials/
EDIT: Let me add that we use behavior trees for a lot more than just AI. Even the synchronization of the subsystems in the game loop is defined by them.
Check the cases page to see the range of applications we are using them for: robotics, camera-based interaction, augmented reality, etc.
Download the framework, try the samples and please give us feedback on our implementation.
https://github.com/TencentOpen/behaviac is a really excellent one.
Behaviac supports behavior trees, finite state machines, and hierarchical task networks.
Behaviors can be designed and debugged in the designer, then exported and executed by the game.
The C++ version is suitable for both the client and server side.
And it is open source!
