Zstandard levels in Hadoop

The compression level in org.apache.hadoop.io.compress.zstd.ZStandardCompressor doesn't seem to work. I can see the reset function being called in the ZStandardCompressor constructor, which in turn calls init(level, stream) to invoke the native function; I believe that is the only place where the zstd parameter gets set.
In my test I made sure this is being called, but calling it with different levels such as 1, 5, 10, 20, etc. made no difference: the output size is exactly the same.
Hadoop doesn't seem to use zstd-jni; it uses its own native bindings for zstd. I am sure people are using different levels in Hadoop. Could someone point me to where I should look next?
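One way to narrow this down is to first confirm that the test input can show a level effect at all, since very small or already-compressed inputs can come out at nearly the same size for every level. The sketch below is in Python and uses the standard-library zlib module as a stand-in for zstd (an assumption made only to keep it dependency-free; it does not exercise Hadoop's ZStandardCompressor). The names size_by_level and make_sample and the sample data are hypothetical.

```python
import zlib


def size_by_level(data: bytes, levels=(1, 6, 9)) -> dict:
    """Compress the same payload at each level and return the output sizes."""
    return {level: len(zlib.compress(data, level)) for level in levels}


def make_sample(n_lines: int = 20000) -> bytes:
    """Build a compressible but not trivially repetitive payload."""
    lines = (f"record {i} value {(i * i) % 977}\n" for i in range(n_lines))
    return "".join(lines).encode("utf-8")


if __name__ == "__main__":
    for level, size in size_by_level(make_sample()).items():
        print(f"level={level} compressed_bytes={size}")
```

If a comparable input still produces identical sizes when run through the Hadoop codec, that would suggest the level is not reaching the native library, rather than the data being the cause.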

Since people are finding this question without an answer, I am adding the solution I used. InternalParquetRecordWriter takes a compressor as an argument, so I integrated the zstd-jni library there by creating a compressor that extends BytesInputCompressor.

Related

RETURNN Switchboard data processing

Could anybody give me pointers on how to process the Switchboard dataset for training with RETURNN? I did see the BlissDataset class, which seems to be designed for Switchboard, but it's not clear to me what I should include in the paths given in the example:
Example:
./tools/dump-dataset.py "
{'class':'BlissDataset',
'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
The Switchboard dataset has several folders with audio, i.e. swb1_d2/data/*.sph, and transcripts, i.e. swb1_LDC97S62/swb_ms98_transcriptions/**/*
I'm not quite sure how to proceed with this to get a dataset that can be used to train RETURNN.
At our group (RWTH Aachen University), we use the config as it was published on GitHub. As you see, this one uses ExternSprintDataset.
The implementation uses Sprint (publicly called RWTH ASR (RASR), see here) as an external tool (run in a subprocess) to handle the data (feature extraction, etc.). Sprint gets a Bliss XML file which describes all the segments with the path to the audio, the audio offsets and the transcriptions, and it also gets further configs for the feature extraction and possibly other things. There is an open source version of RASR which should work, but it might be a bit involved to get it working.
The BlissDataset was planned to be a simpler replacement for that. However, the implementation is incomplete. Also, you would still need to generate the Bliss XML yourself in some way (we have used our own internal scripts to prepare that based on the official LDC data).
So, unfortunately, there is no simple way yet. Actually, I think the easiest way would be to come up with yet another custom format, which might be similar to the LibriSpeechDataset implementation, or maybe just the same, and then you could reuse LibriSpeechDataset, or at least parts of it. That dataset implementation takes the data in a zip format which contains the transcripts in txt files and the audio in ogg or wav files. It uses librosa to do MFCC feature extraction (or other feature types). I planned to implement that for Switchboard and then reproduce the results, but I have not had time yet and I am not sure when I will get to it. If you want to try it on your own, I will be happy to help however I can. The starting point would be to look at LibriSpeechDataset and understand what that format looks like.
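Whichever target format is chosen (Bliss XML or a LibriSpeech-like zip), a first step is to turn the transcript files into a list of segments with start and end times. The sketch below assumes that each line of the swb_ms98 transcript files has the form "<segment-id> <start-sec> <end-sec> <words...>" and that non-speech is marked with bracketed tokens such as [silence]; both are assumptions to check against the actual LDC files. Segment, parse_trans_line and is_speech are hypothetical names and not part of RETURNN.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    seg_id: str   # e.g. "sw2001A-ms98-a-0002" (assumed ID format)
    start: float  # segment start in seconds
    end: float    # segment end in seconds
    text: str     # raw transcript text, markup left untouched


def parse_trans_line(line: str) -> Optional[Segment]:
    """Parse one '<id> <start> <end> <words...>' line; return None if malformed."""
    parts = line.strip().split(maxsplit=3)
    if len(parts) < 3:
        return None
    try:
        start, end = float(parts[1]), float(parts[2])
    except ValueError:
        return None
    text = parts[3] if len(parts) == 4 else ""
    return Segment(parts[0], start, end, text)


def is_speech(seg: Segment) -> bool:
    """True if at least one token is not a bracketed marker such as [silence]."""
    tokens = seg.text.split()
    return any(not (tok.startswith("[") and tok.endswith("]")) for tok in tokens)
```

The resulting segments carry what either format needs per utterance: an ID, the audio offsets and the text. Mapping the segment ID to the matching .sph file is left out here because it depends on the corpus layout.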

Does using global variables impact performance in MATLAB?

As I understand, MATLAB cannot use pass by reference when sending arguments to other functions. I am doing audio processing, and I frequently have to pass waveforms as arguments into functions, and because MATLAB uses pass by value for these arguments, it really eats up a lot of RAM when I do this.
I was considering using global variables as a method to pass my waveforms into functions, but everywhere I read there seems to be a general opinion that this is a bad idea, for organization of code, and potentially performance issues... but I haven't really read any detailed answers on how this might impact performance...
My question: What are the negative impacts of using global variables (with sizes > 100MB) to pass arguments to other functions in MATLAB, both in terms of 1) performance and 2) general code organization and good practice.
EDIT: From @Justin's answer below, it turns out MATLAB does on occasion use pass by reference when you do not modify the argument within the function! From this, I have a second related question about global variable performance:
Will using global variables be any slower than using pass by reference arguments to functions?
MATLAB does use pass by reference, but also uses copy-on-write. That is to say, your variable will be passed by reference into the function (and so won't double up on RAM), but if you change the variable within the function, then MATLAB will create a copy and change the copy (leaving the original unaffected).
This fact doesn't seem to be too well known, but there's a good post on Loren's blog discussing it.
Bottom line: it sounds like you don't need to use global variables at all (which are a bad idea, as @Adriaan says).
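To make the copy-on-write idea concrete, here is a toy model of the behaviour described above. It is written in Python purely as a conceptual illustration and is not how MATLAB implements it internally; CowBuffer, peak and mute_first are hypothetical names.

```python
class CowBuffer:
    """Toy copy-on-write wrapper around a list of samples."""

    def __init__(self, samples, shared=False):
        self._samples = samples
        self._shared = shared  # True while another CowBuffer may use the same list

    def share(self):
        """Hand out a second buffer without copying the underlying list."""
        self._shared = True
        return CowBuffer(self._samples, shared=True)

    def __getitem__(self, index):
        return self._samples[index]  # reads never copy

    def __setitem__(self, index, value):
        if self._shared:
            # First write: take a private copy so the other holder is unaffected.
            # Conservative: this may copy once more than strictly necessary.
            self._samples = list(self._samples)
            self._shared = False
        self._samples[index] = value

    def same_storage(self, other):
        return self._samples is other._samples


def peak(buf, n):
    """Read-only use: no copy is triggered."""
    return max(abs(buf[i]) for i in range(n))


def mute_first(buf):
    """Modifying use: triggers the copy inside the callee's buffer."""
    buf[0] = 0.0
    return buf
```

Passing wave.share() to peak leaves both buffers on the same storage, while passing it to mute_first makes the callee's buffer take its own copy and leaves the caller's data unchanged.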
While relying on copy-on-write as Justin suggested is typically the best choice, you can easily implement pass by reference. With MATLAB OOP being nearly as fast as traditional functions in MATLAB 2015b or newer, using a handle class is a reasonable option.
I encountered an interesting use case of a global variable yesterday. I tried to parallelise a piece of code (1200 lines, multiple functions inside the main function, not written by me) using parfor.
Some weird errors came out, and it turned out that this piece of code wrote to a log file from multiple functions. Rather than opening and closing the relevant log file every time a function wanted to write to it, which is very slow, the file ID was made global so that all write functions could access it.
For the serial case this made perfect sense, but when trying to parallelise this, using global apparently breaks the scope of a worker instance as well. So suddenly we had 4 workers all trying to write into the same log file, which resulted in some weird errors.
So all in all, I maintain my position that using global variables is generally a bad idea, although I can see its use in specific cases, provided you know what you're doing.
Using global variables in MATLAB may increase performance a lot. This is because you can avoid copying data in some cases.
Before attempting such performance tweaks, think carefully about the cost to your project in terms of the many drawbacks that global variables come with. There are also pitfalls to using globals that have bad consequences for performance, and those may be difficult (although possible) to avoid. Any code that is littered with globals tends to be difficult to comprehend.
If you want to see globals in use for performance, you can look at this real-time toolbox for optical flow that I made. This is the only project in native MATLAB that is capable of real-time optical flow that I know of. Using globals was one of the reasons this was doable. It is also a reason why the code is quite difficult to grasp: globals are evil.
That globals can be used this way is not an argument for their use; rather, it should be a hint that something needs to be updated in MATLAB's inflexible notions of workspace and its inefficient alternatives to globals such as guidata/getappdata/setappdata.

How to find stuff in the kernel

I'm doing various tasks on the Linux kernel, and I end up reading source code from time to time. I haven't really needed to change the kernel yet (I'm good with so-called "Loadable Kernel Modules"), so I didn't download the kernel source and just use http://lxr.free-electrons.com/ . Quite often I come across a function that has many implementations and start guessing which one is the one I need.
For example, in the file Linux/virt/kvm/kvm_main.c, line 496 is a call to list_add. A click on it gives me two options: drivers/gpu/drm/radeon/mkregtable.c, line 84 and include/linux/list.h, line 60. It's quite clear that kvm will not send me to something under "gpu", but this is not always the case. I have looked at the includes of the file, which was not much help.
So my question: given a file from the kernel and a function call at line ###, what is the nicest way to find where that function call actually continues?
(I'll also be happy to hear about ways that don't involve the website and/or require me to download the source code)
There are many things in the kernel that are #define'd or typedef'd, or functions mapped inside structs (the fop struct in the drivers). So there's no easy way to browse the kernel source. The lxr site helps you, but it can't go any further when you encounter any of the above constructs. The same goes for cscope/ctags. The best way though, despite you explicitly arguing against it, is to download the source and browse through it.
Another method would be to use kgdb and inspect the code function by function, but to save time that requires some knowledge of which functions you want to step into. And last but not least, increase the kernel log level and print the logs that are accessible through dmesg. But all of these require you to have the kernel source.
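With a local copy of the source, even a crude search can separate real definitions from call sites. The sketch below is a rough heuristic in Python and not a replacement for cscope/ctags: it reports lines that look like a "#define name" or like a function header for the given name, and skips lines ending in ";" such as calls and prototypes. find_definitions is a hypothetical helper, and it will miss definitions whose header spans several lines.

```python
import os
import re


def find_definitions(root: str, name: str):
    """Return (relative_path, line_number) pairs that look like definitions of `name`.

    Heuristic only: matches '#define name' and lines of the form
    '<type words> name(...' that contain no ';' after the '('.
    """
    escaped = re.escape(name)
    macro = re.compile(rf"^\s*#\s*define\s+{escaped}\b")
    func = re.compile(rf"^\s*[A-Za-z_][\w\s\*]*?[\s\*]{escaped}\s*\([^;]*$")
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if not fname.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for lineno, line in enumerate(handle, start=1):
                    if macro.match(line) or func.match(line):
                        hits.append((os.path.relpath(path, root), lineno))
    return sorted(hits)
```

Calling find_definitions on a kernel tree with "list_add" lists the candidate definitions. Choosing between them still means checking which headers the calling file pulls in, directly or indirectly.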

What are your top 3 XPages performance tips for new XPages developers?

What 3 things would you tell developers new to XPages to do to help maximize the performance of their XPages apps?
Tim Tripcony gave a number of suggestions here:
http://www-10.lotus.com/ldd/xpagesforum.nsf/topicThread.xsp?action=openDocument&documentId=365493C31B0352E3852578FD001435D2#AEFBCF8B111E149B852578FD001E617B
Not sure if this tip is for beginners, but use any of the LifeCyclePhaseListeners from the OpenNTF Snippets to see what is going on in your datasources during a complete or partial refresh (http://openntf.org/XSnippets.nsf/snippet.xsp?id=a-simple-lifecyclelistener-)
Use the Extension Library. Report bugs (or what you consider a bug) at OpenNTF.
Use the SampleDb from the ExtLib. You can easily modify the samples to your own needs. It is even good for testing whether the issue you encounter is reproducible in this DB.
Use Firebug (or a similar tool that comes with the browser of your choice). If you see an error in the error tab, go and fix it.
Since you're asking for only 3, here are the tips I feel make the biggest difference:
Determine what your users / customers mean by "performance", and set the page persistence option accordingly. If they mean scalability (max concurrent users), keep pages on disk. If they mean speed, keep pages in memory. If they want an ideal mixture of speed and scalability, keep the current page in memory. This latter option really should be the server default (set in the server's xsp.properties file), overridden only as needed per application.
Set value bindings to compute on page load (denoted by a $ in the source XML) wherever possible instead of compute dynamically (denoted by a #). $ bindings are only evaluated once, whereas # bindings are recalculated over and over again, so changing computations that only need to be loaded once per page to $ bindings speeds up both the initial page load and any events fired against the page once loaded.
Minimize the use of SSJS. Wherever possible, use standard EL instead (e.g. ${database.title} instead of ${javascript:return database.getTitle();}). Every SSJS expression must be parsed into an abstract syntax tree to be evaluated, which is incrementally slower than the standard EL resolver.
There are many other ways to maximize performance, of course, but in my opinion these are the easiest ways to gain noticeable improvement.
1. Use a Script Library instead of writing a bulk of code into the XPage.
2. Use a theme or a separate CSS class for each element [Relational]
3. Keep your SSJS code under control, because server-side requests reduce system performance.
4. Finally (consider this a sub-point of 3), try to use the direct functions available in SSJS. Don't use while loops and for loops for things like document collections, counts and so on.
The basics, like:
Use the immediate flag (or one of the other flags) on server-side events if possible.
Check the flag which (forgot its name..) generates the CSS and JS as one big file at runtime, thereby minimizing the number of requests.
Choose your scope wisely. Don't put everything in your sessionScope, but define when, where and how you are using the data and, based on that, use the correct scope. This can lead to better memory usage.
And of course the most important one: read the Mastering XPages book.
Other tips I would add:
When retrieving data, use ViewEntryCollections or the ViewNavigator.
Upgrade to 8.5.3.
Use default HTML tags if possible. If you don't need the functionality of an xp:div or xp:panel, use a <div> instead so you don't generate an extra UIComponent in the tree.
Define what page persistence mode you need.
Depends a lot what you mean by performance. For performance of the app:
Use compute on page load wherever feasible. It significantly improves performance.
In larger XPages particularly, combine code into single controls where possible. E.g. Use a single Computed Field control combining literal strings, EL and SSJS rather than one control for each language. On that point, EL performs better than SSJS, and SSJS on the XPage performs better than SSJS in a Script Library.
Use dataContexts for properties that are calculated more than once on an XPage.
Partial Execution mode is a very strong recommendation, but probably beyond new XPages developers at this point. Java will also perform better than SSJS in a Script Library, but again beyond new developers. XPages controls you've created with the Extensibility Framework should perform better, because they should run fewer lines of Java than multiple controls, but I haven't tested that.
If you mean performance of the developer:
Get the Extension Library.
Use themes to set default properties, e.g. A standard style for all your pagers.
Use Firebug. If you're developing for Notes Client or IE, still use Firebug. You'll spend longer suffering through Client/IE than you will fixing the few quirks that will remain.

Why does loading cached objects increase the memory consumption drastically when computing them will not?

Relevant background info
I've built a little software that can be customized via a config file. The config file is parsed and translated into a nested environment structure (e.g. .HIVE$db = an environment, .HIVE$db$user = "Horst", .HIVE$db$pw = "my password", .HIVE$regex$date = some regex for dates etc.)
I've built routines that can handle those nested environments (e.g. look up the value "db/user" or "regex/date", change it, etc.). The thing is that the initial parsing of the config files takes a long time and results in quite a big object (actually three to four, between 4 and 16 MB). So I thought "No problem, let's just cache them by saving the object(s) to .Rdata files". This works, but "loading" the cached objects makes my Rterm process go through the roof with respect to RAM consumption (over 1 GB!!) and I still don't really understand why (this doesn't happen when I "compute" the objects anew, but that's exactly what I'm trying to avoid since it takes too long).
I already thought about maybe serializing it, but I haven't tested it as I would need to refactor my code a bit. Plus I'm not sure if it would affect the "loading back into R" part in just the same way as loading .Rdata files.
Question
Can anyone tell me why loading a previously computed object has such effects on memory consumption of my Rterm process (compared to computing it in every new process I start) and how best to avoid this?
If desired, I will also try to come up with an example, but it's a bit tricky to reproduce my exact scenario. Yet I'll try.
It's likely because the environments you are creating are carrying around their ancestors. If you don't need the ancestor information, then set the parents of such environments to emptyenv() (or just don't use environments if you don't need them).
Also note that formulas (and, of course, functions) have environments so watch out for those too.
If it's not reproducible by others, it will be hard to answer. However, I do something quite similar to what you're doing, yet I use JSON files to store all of my values. Rather than parse the text, I use RJSONIO to convert everything to a list, and getting stuff from a list is very easy. (You could, if you want, convert to a hash, but it's nice to have layers of nested parameters.)
See this answer for an example of how I've done this kind of thing. If that works out for you, then you can forego the expensive translation step and the memory ballooning.
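The JSON approach can be sketched roughly as follows. This is written in Python rather than R to show the idea of a nested structure with path lookups such as "db/user"; it is not the RJSONIO code from the linked answer. The lookup helper and the sample config (including the date pattern) are hypothetical and only mirror the keys mentioned in the question.

```python
import json


def lookup(config: dict, path: str, sep: str = "/"):
    """Walk a nested dict using a path such as 'db/user'; raise KeyError if missing."""
    node = config
    for key in path.split(sep):
        if not isinstance(node, dict) or key not in node:
            raise KeyError(path)
        node = node[key]
    return node


# Hypothetical config mirroring the keys mentioned in the question.
CONFIG_TEXT = """
{
  "db": {"user": "Horst", "pw": "my password"},
  "regex": {"date": "[0-9]{4}-[0-9]{2}-[0-9]{2}"}
}
"""

config = json.loads(CONFIG_TEXT)
```

Because the parsed result is plain nested data with no attached environments, reading it back is the same cheap parse every time. In R the analogous structure would be a nested list.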
(Taking a stab at the original question...) I wonder if your issue is that you are using an environment rather than a list. Saving environments might be tricky in some contexts. Saving lists is no problem. Try using a list or try converting to/from an environment. You can use the as.list() and as.environment() functions for this.

Resources