Identifying unusual memory allocation in julia

Identifying unusual memory allocation in julia - memory-management

I am a new julia user about to write my first script. I use R, Matlab and python for data analysis and I have yet to have to worry about concerns people in c or java would have with memory allocation and more advacned programming I geuss. I want to use julia to simulate some biological data and in python these types of programs can get really slow. So as I was reading the performance tips in the part on memory allocation caught my eye. As some one who kind of lacks a background in more advanced languages like C what should I be looking for? I read this stack post here but he had python code he had compiled and then run in Julia so I don't feel it is pertinent.
How will I know when I have unusual memory allocation, and conversely what does "good" memory allocation in Julia look like? Is there a way to identify what an unacceptable level of memory demand is based on what I am running?
In addition to the #time macro what other proofing tips do you have?

To understand Julia performance issues very carefully read https://docs.julialang.org/en/v1/manual/performance-tips/ Running yourself all examples in that page is a must.
Usually, memory allocation is not the major bottleneck. From my experience when people are starting to use Julia the type stability and using global variables are a major source of all performance problems.
The above document has also a section on memory allocation:
"Unexpected memory allocation is almost always a sign of some problem with your code, usually a problem with type-stability or creating many small temporary arrays."
You will find there detailed discussion about addressing such issues.
Regarding code profiling you have the following tools:
#time macro (note that the first time you run it for a given methods you are actually mostly measuring compile time rather than run time. Hence usually BenchmarkTools.jl is recommended
BenchmarkTools.jl with its #btime macro (among others - the first choice to understand what is going on
inbuilt profiling (https://docs.julialang.org/en/v1/manual/profile/) together with ProfileView.jl and ProfileSVG.jl packages to visualize profiling data in standalone and Jupyter environments respectively
in most complicated cases you can view the code at various compiling stages with #code_lowered, #code_typed and #code_native
These are the basic and most useful tools for a great start to performant Julia code.

Related

(Un)Limit CPU usage - Windows - Mathematica

I'm programming in Mathematica 8.
When I run my programme, I check with Win8 task manager that the CPU usage is at 35% as soon as it starts to run, and my memory usage also increases to 44%. Does Win8 limit the amount of CPU usage that a certain programme may have? I need to make my computer to use more of its resources to run the programme faster.
Any help would be appreciated.

What's happening here is a common misconception about how processors approach problems involving heavy computation.
Although you may indeed have a powerful 4-core processing machine, and you have a program which is capable of using all 4 processing cores (which mathematica definitely is!), unless the code is written in a parallel fashion, you will only be able to use 1 core at a time to do the calculations. As Mysticial mentioned in the comment, not all code is parallelizable, in fact, I'd say a great many problems are not inherently able to be parallelized.
Check here for some good examples of problems that can be split up in a parallel fashion well. Now, your memory usage will simply increase with the size of the data that's being stored temporarily. (ex: storing a 69X69 matrix takes up less memory (RAM) than a 4000X4000, being parallel has little to do with this, and more with the problem itself).
Anyway, tl;dr, your computer is acting normally. To use all 100% of that 4-core machine you're using, check out This mathematica reference guide to parallel operations.

ARM NEON: Tools to predict performance issues due to memory access limited bandwidth?

I am trying to optimize critical parts of a C code for image processing in ARM devices and recently discovered NEON.
Having read tips here and there, I am getting pretty nice results, but there is something that escapes me. I see that overall performance is very much dependant on memory accesses and how they are done.
Which is the simplest way (by simple I mean, if possible, not having to run the whole compiled code in an emulator or simulator, but something that can be feed of small pieces of assembly and analyze them), in order to get an idea of how memory accesses are "bottlenecking" the subroutine?
I know this can not be done exactly without running it in a specific hardware and specific conditions, but the purpose is to have a "comparison" trial-and error tool to experiment with, even if the results are only approximations.
(something similar to this great tool for cycle counting)

I think you've probably answered your own question. Memory is a system level effect and many ARM implementers (Apple, Samsung, Qualcomm, etc) implement the system differently with different results.
However, of course you can optimize things for a certain system and it will probably work well on others, so really it comes down to figuring out a way that you can quickly iterate and test/simulate system level effects. This does get complicated so you might pay some money for system level simulators such as is included in ARM's RealView. Or I might recommend getting some open source hardware like a Panda Board and using valgrind's cache-grind. With linux on the panda board you can write some scripts to automate your testing.
It can be a hassle to get this going but if optimizing for ARM will be part of your professional life, then it's worth the (relatively low compared to your salary) software/hardware investment and time.
Note 1: I recommend against using PLD. This is very system tuning dependent, and if you get it working well on one ARM implementation it may hurt you for the next generation of chip or a different implementation. This may be a hint that trying to optimize at the system level, other than some basic data localization and ordering stuff may not be worth your efforts? (See Stephen's comment below).

Memory access is one thing that simply cannot be modeled from "small pieces of assembly” to generate meaningful guidance. Cache hierarchies, store buffers, load miss queues, cache policy, etc … even relatively simple processors have an enormous amount of “state” hiding underneath the LSU, and any small-scale analysis cannot accurately capture that state. That said, there are a few basic guidelines for getting the best performance:
maximize the ratio of "useful computation” instructions to LSU operations.
align your memory accesses (ideally to 16B).
if you need to pick between aligning loads or aligning stores, align your stores.
try to write out complete cachelines when possible.
PLD is mainly useful for non-uniform-but-somehow-still-predictable memory access patterns (these are rare).
For NEON specifically, you should prefer to use the vld1 and vst1 instructions (with an alignment hint). On most micro-architectures, in most cases, they are the fastest way to move between NEON and memory. Eschew v[ld|st][3|4] in particular; these are an attractive nuisance, slower than doing separate permutes on most micro-architectures in most cases.

"Well-parallelized" algorithm not sped up by multiple threads

I'm sorry to ask a question one a topic that I know so little about, but this idea has really been bugging me and I haven't been able to find any answers on the internet.
Background:
I was talking to one of my friends who is in computer science research. I'm in mostly ad-hoc development, so my understanding of a majority of CS concepts is at a functional level (I know how to use them rather than how they work). He was saying that converting a "well-parallelized" algorithm that had been running on a single thread into one that ran on multiple threads didn't result in the processing speed increase that he was expecting.
Reasoning:
I asked him what the architecture of the computer he was running this algorithm on was, and he said 16-core (non-virtualized). According to what I know about multi-core processors, the processing speed increase of an algorithm running on multiple cores should be roughly proportional to how well it is parallelized.
Question:
How can an algorithm that is "well-parallelized" and programmed correctly to run on a true multi-core processor not run several times more quickly? Is there some information that I'm missing here, or is it more likely a problem with the implementation?
Other stuff: I asked if the threads were possibly taking up more power than any individual core had available and apparently each core runs at 3.4 GHz. This is much more than the algorithm should need, and when diagnostics are run the cores aren't maxed out during runtime.

It is likely sharing something. What is being shared may not be obvious.
One of the most common non-obvious shared resources is CPU cache. If the threads are updating the same cache line that cache line has to bounce between CPUs, slowing everything down.
That can happen because of accessing (even read-only) variables which are near to each other in memory. If all accesses are read-only it is OK, but if even one CPU is writing to that cache line it will force a bounce.
A brute-force method of fixing this is to put shared variables into structures that look like:
struct var_struct {
int value;
char padding[128];
};
Instead of hard-coding 128 you could research what system parameter or preprocessor macros define the cache-line size for your system type.
Another place that sharing can take place is inside system calls. Even seemingly innocent functions might be taking global locks. I seem to recall reading about Linux fixing an issue like this a while back with locks on the functions that return process and thread identifiers and parent identifiers.

Performance versus number of cores is often a S-like curve - first it obviously increases but as locking, shared cache and the like take they debt the further cores do not add so much and even may degrade. Hence nothing mysterious. If we would know more details about the algorithm it may be possible to find an idea to speed it up.

Debugging a micro-processor

One of our co-processors is an 8-bit microprocessor. It's main role is to control the hardware that handles flash memory. We suspect that the code it's running is highly inefficient since we measured low speeds when reading/writing to flash memory. The problem is, we have only one J-TAG port that's connected to the main CPU so debugging it is not an option. What we do have, is a register that's available from CPU that contains the micro-processor's program counter. The bad news, is that the micro-processor works at a different frequency than the CPU so monitoring it's program counter outside is also hard. Measuring time inside the micro-processor is also very difficult since it's registers are only 8-bit long. Needless to say, the code is in assembly and very complex. How would you go about approaching this problem?

Needless to say, the code is in assembly and very complex. How would you go about approaching this problem?
I would advise that you start from (or generate) the requirements specification for this part and reimplement the code in C (or even careful use of a C++ subset). If the "complexity" you perceive is merely down the the code rather than the requirements it would be a good idea to design that out - it will only make maintenance in the future more complex, error prone and expensive.
One of the common arguments for using assembler are size and performance, but more frequently a large body of assembler code is far from optimal; in order to retain a level of productivity and maintainability often "boiler-plate" code is used and reused that is not tailored to the specific situation, whereas a compiler will analyse code changes and perform the kind of "micro-optimisation" that system designers really shouldn't have to sweat about. Make your algorithms and data structures efficient and leave the target instruction set details to the compiler.
Even without the ability to directly debug on the target, the use of a high-level language will allow prototyping and simulation on a PC for example.
Even if you retain the assembler code, if your development tools include an instruction set simulator, that may be a good alternative to hardware debugging; especially if it supports debugger scripts that can be used to simulate the behaviour of hardware devices.
All that said, looking at this as a "black-box" and concluding that the code is inefficient is a bit of a leap. What kind of flash memory is appearing to be slow for example? How is it interfaced to the microcontroller? And how have you measured this performance? Flash memory is intrinsically slow - especially writing and page erase; check the performance specification of the Flash before drawing any conclusion on the software performance.

When might you not want to use garbage collection? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Garbage collection has been around since the early days of LISP, and now - several decades on - most modern programming languages utilize it.
Assuming that you're using one of these languages, what reasons would you have to not use garbage collection, and instead manually manage the memory allocations in some way?
Have you ever had to do this?
Please give solid examples if possible.

I can think of a few:
Deterministic deallocation/cleanup
Real time systems
Not giving up half the memory or processor time - depending on the algorithm
Faster memory alloc/dealloc and application-specific allocation, deallocation and management of memory. Basically writing your own memory stuff - typically for performance sensitive apps. This can be done where the behavior of the application is fairly well understood. For general purpose GC (like for Java and C#) this is not possible.
EDIT
That said, GC has certainly been good for much of the community. It allows us to focus more on the problem domain rather than nifty programming tricks or patterns. I'm still an "unmanaged" C++ developer though. Good practices and tools help in that case.

Memory allocations? No, I think the GC is better at it than I am.
But scarce resource allocations, like file handles, database connections, etc.? I write the code to close those when I'm done. GC won't do that for you.

I do a lot of embedded development, where the question is more likely to be whether to use malloc or static allocation and garbage collection is not an option.
I also write a lot of PC-based support tools and will happily use GC where it is available & fast enough and it means that I don't have to use pedant::std::string.
I write a lot of compression & encryption code and GC performance is usually not good enough unless I really bend the implementation. GC also requires you to be very careful with address aliasing tricks. I normally write performance sensitive code in C and call it from Python / C# front ends.
So my answer is that there are reasons to avoid GC, but the reason is almost always performance and it's then best to code the stuff that needs it in another language rather than trying to trick the GC.
If I develop something in MSVC++, I never use garbage collection. Partly because it is non-standard, but also because I've grown up without GC in C++ and automatically design in safe memory reclamation. Having said this, I think that C++ is an abomination which fails to offer the translation transparency and predictability of C or the scoped memory safety (amongst other things) of later OO languages.

Real time applications are probably difficult to write with a garbage collector. Maybe with an incremental GC that works in another thread, but this is an additional overhead.

One case I can think of is when you are dealing with large data sets amounting to hundreads of megabytes or more. Depending on the situation you might want to free this memory as soon as you are done with it, so that other applications can use it.
Also, when dealing with some unmanaged code there might be a situation where you might want to prevent the GC from collecting some data because it's still being used by the unmanaged part. Though I still have to think of a good reason why simply keeping a reference to it might not be good enough. :P

One situation I've dealt with is image processing. While working on an algorithm for cropping images, I've found that managed libraries just aren't fast enough to cut it on large images or on multiple images at a time.
The only way to do processing on an image at a reasonable speed was to use non-managed code in my situation. This was while working on a small personal side-project in C# .NET where I didn't want to learn a third-party library because of the size of the project and because I wanted to learn it to better myself. There may have been an existing third-party library (perhaps Paint.NET) that could do it, but it still would require unmanaged code.

Two words: Space Hardening
I know its an extreme case, but still applicable. One of the coding standards that applied to the core of the Mars rovers actually forbid dynamic memory allocation. While this is indeed extreme, it illustrates a "deploy and forget about it with no worries" ideal.
In short, have some sense as to what your code is actually doing to someone's computer. If you do, and you are conservative .. then let the memory fairy take care of the rest. While you develop on a quad core, your user might be on something much older, with much less memory to spare.
Use garbage collection as a safety net, be aware of what you allocate.

There are two major types of real time systems, hard and soft. The main distinction is that hard real time systems require that an algorithm always finish in a particular time budget where as a soft system would like it to normally happen. Soft systems can potentially use well designed garbage collectors although a normal one would not be acceptable. However if a hard real time system algorithm did not complete in time then lives could be in danger. You will find such sorts of systems in nuclear reactors, aeroplanes and space shuttles and even then only in the specialist software that the operating systems and drivers are made of. Suffice to say this is not your common programming job.
People who write these systems don't tend to use general purpose programming languages. Ada was designed for the purpose of writing these sorts of real time systems. Despite being a special language for such systems in some systems the language is cut down further to a subset known as Spark. Spark is a special safety critical subset of the Ada language and one of the features it does not allow is the creation of a new object. The new keyword for objects is totally banned for its potential to run out of memory and its variable execution time. Indeed all memory access in Spark is done with absolute memory locations or stack variables and no new allocations on the heap is made. A garbage collector is not only totally useless but harmful to the guaranteed execution time.
These sorts of systems are not exactly common, but where they exist some very special programming techniques are required and guaranteed execution times are critical.

Just about all of these answers come down to performance and control. One angle I haven't seen in earlier posts is that skipping GC gives your application more predictable cache behavior in two ways.
In certain cache sensitive applications, having the language automatically trash your cache every once in a while (although this depends on the implementation) can be a problem.
Although GC is orthogonal to allocation, most implementations give you less control over the specifics. A lot of high performance code has data structures tuned for caches, and implementing stuff like cache-oblivious algorithms requires more fine grained control over memory layout. Although conceptually there's no reason GC would be incompatible with manually specifying memory layout, I can't think of a popular implementation that lets you do so.

Assuming that you're using one of these languages, what reasons would you have to not use garbage collection, and instead manually manage the memory allocations in some way?
Potentially, several possible reasons:
Program latency due to the garbage collector is unacceptably high.
Delay before recycling is unacceptably long, e.g. allocating a big array on .NET puts it in the Large Object Heap (LOH) which is infrequently collected so it will hang around for a while after it has become unreachable.
Other overheads related to garbage collection are unacceptably high, e.g. the write barrier.
The characteristics of the garbage collector are unnacceptable, e.g. redoubling arrays on .NET fragments the Large Object Heap (LOH) causing out of memory when 32-bit address space is exhausted even though there is theoretically plenty of free space. In OCaml (and probably most GC'd languages), functions with deep thread stacks run asymptotically slower. Also in OCaml, threads are prevented from running in parallel by a global lock on the GC so (in theory) parallelism can be achieved by dropping to C and using manual memory management.
Have you ever had to do this?
No, I have never had to do that. I have done it for fun. For example, I wrote a garbage collector in F# (a .NET language) and, in order to make my timings representative, I adopted an allocationless style in order to avoid GC latency. In production code, I have had to optimize my programs using knowledge of how the garbage collector works but I have never even had to circumvent it from within .NET, much less drop .NET entirely because it imposes a GC.
The nearest I have come to dropping garbage collection was dropping the OCaml language itself because its GC impedes parallelism. However, I ended up migrating to F# which is a .NET language and, consequently, inherits the CLR's excellent multicore-capable GC.

I don't quite understand the question. Since you ask about a language that uses GC, I assume you are asking for examples like
Deliberately hang on to a reference even when I know it's dead, maybe to reuse the object to satisfy a future allocation request.
Keep track of some objects and close them explicitly, because they hold resources that can't easily be managed with the garbage collector (open file descriptors, windows on the screen, that sort of thing).
I've never found a reason to do #1, but #2 is one that comes along occasionally. Many garbage collectors offer mechanisms for finalization, which is an action that you bind to an object and the system runs that action before the object is reclaimed. But oftentimes the system provides no guarantees about whether or if finalizers actually run, so finalization can be of limited utility.
The main thing I do in a garbage-collected language is to keep a tight watch on the number of allocations per unit of other work I do. Allocation is usually the performance bottleneck, especially in Java or .NET systems. It is less of an issue in languages like ML, Haskell, or LISP, which are typically designed with the idea that the program is going to allocate like crazy.
EDIT: longer response to comment.
Not everyone understands that when it comes to performance, the allocator and the GC must be considered as a team. In a state-of-the-art system, allocation is done from contiguous free space (the 'nursery') and is as quick as test and increment. But unless the object allocated is incredibly short-lived, the object incurs a debt down the line: it has to be copied out of the nursery, and if it lives a while, it may be copied through several generatations. The best systems use contiguous free space for allocation and at some point switch from copying to mark/sweep or mark/scan/compact for older objects. So if you're very picky, you can get away with ignoring allocations if
You know you are dealing with a state-of-the art system that allocates from continuous free space (a nursery).
The objects you allocate are very short-lived (less than one allocation cycle in the nursery).
Otherwise, allocated objects may be cheap initially, but they represent work that has to be done later. Even if the cost of the allocation itself is a test and increment, reducing allocations is still the best way to improve performance. I have tuned dozens of ML programs using state-of-the-art allocators and collectors and this is still true; even with the very best technology, memory management is a common performance bottleneck.
And you'd be surprised how many allocators don't deal well even with very short-lived objects. I just got a big speedup from Lua 5.1.4 (probably the fastest of the scripting language, with a generational GC) by replacing a sequence of 30 substitutions, each of which allocated a fresh copy of a large expression, with a simultaneous substitution of 30 names, which allocated one copy of the large expression instead of 30. Performance problem disappeared.

In video games, you don't want to run the garbage collector in between a game frame.
For example, the Big Bad is in front
of you and you are down to 10 life.
You decided to run towards the Quad
Damage powerup. As soon as you pick up
the powerup, you prepare yourself to
turn towards your enemy to fire with
your strongest weapon.
When the powerup disappeared, it would
be a bad idea to run the garbage
collector just because the game world
has to delete the data for the
powerup.
Video games usually manages their objects by figuring out what is needed in a certain map (this is why it takes a while to load maps with a lot of objects). Some game engines would call the garbage collector after certain events (after saving, when the engine detects there's no threat in the vicinity, etc).
Other than video games, I don't find any good reasons to turn off garbage collecting.
Edit: After reading the other comments, I realized that embedded systems and Space Hardening (Bill's and tinkertim's comments, respectively) are also good reasons to turn off the garbage collector

The more critical the execution, the more you want to postpone garbage collection, but the longer you postpone garbage collection, the more of a problem it will eventually be.
Use the context to determine the need:
1.
Garbage collection is supposed to protect against memory leaks
Do you need more state than you can manage in your head?
2.
Returning memory by destroying objects with no references can be unpredictable
Do you need more pointers than you can manage in your head?
3.
Resource starvation can be caused by garbage collection
Do you have more CPU and memory than you can manage in your head?
4.
Garbage collection cannot address files and sockets
Do you have I/O as your primary concern?
In systems that use garbage collection, weak pointers are sometimes used to implement a simple caching mechanism because objects with no strong references are deallocated only when memory pressure triggers garbage collection. However, with ARC, values are deallocated as soon as their last strong reference is removed, making weak references unsuitable for such a purpose.
References
GC FAQ
Smart Pointer Guidelines
Transitioning to ARC Release Notes
Accurate Garbage Collection with LLVM
Memory management in various languages
jwz on Garbage Collection
Apple Could Power the Web
How Do The Script Garbage Collectors Work?
Minimize Garbage Generation: GC is your Friend, not your Servant
Garbage Collection in IE6
Slow web browser performance when you view a Web page that uses JScript in Internet Explorer 6
Transitioning to ARC Release Notes: Which classes don’t support weak references?
Automatic Reference Counting: Weak References

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio