How to implement SUM aggregation for strings - MonetDB

We're using MonetDB to implement an idea. For our purposes we need to implement the SUM aggregation for strings.
My first step in trying to realize it was to just add a sum pattern to the MAL files that define these patterns for the other data types.
Then I tried to pinpoint the exact place where the list of possible implementations is matched against the input data type, but I'm making slow progress.
Where/how are MAL files evaluated?
How do I go about adding my own sum command/pattern so that string data reaches the sum function without MonetDB trying to convert it to bte?

As of the Oct2020 release, MAL files are integrated into the C source code, so MAL files are no longer used. They remain in the source code repository for documentation purposes only, and once that information has been moved to the proper places, they will be gradually removed.
If you want to implement your own function, please have a look at the examples in this repository: https://dev.monetdb.org/hg/MonetDB-extend
You can clone it using hg clone https://dev.monetdb.org/hg/MonetDB-extend
I'm not sure how up to date this repository is now; that depends on which MonetDB version you're using. If you have problems with those examples, please open new tickets at https://github.com/monetdb/monetdb/issues
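If building against MonetDB's C internals is more than you need, one possible workaround is a user-defined aggregate that concatenates strings per group. This is a minimal sketch, assuming your server was built with embedded Python (MonetDB/Python) enabled; the aggregate name, table, and column names are made up. aggr_group is the group-id array MonetDB passes to Python aggregates:

    -- sketch: requires a MonetDB build with MonetDB/Python enabled
    CREATE AGGREGATE str_sum(val STRING) RETURNS STRING LANGUAGE PYTHON {
        import numpy
        # aggr_group holds the group id of each input row
        groups = numpy.unique(aggr_group)
        # concatenate the strings belonging to each group
        return [''.join(val[aggr_group == g]) for g in groups]
    };

    -- hypothetical usage
    SELECT grp, str_sum(s) FROM t GROUP BY grp;

Because the aggregate is declared over STRING, the optimizer never tries to find a numeric sum for the column, which sidesteps the bte conversion entirely.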

Related

Simple arithmetic functions in Elasticsearch

I am starting to get acquainted with the use of ELK for work purposes, but I'm struggling to find a way to run simple arithmetic queries against my database.
My DB contains 16 available fields, but I would like to create others without doing it in Excel and converting my file to CSV again.
For example, I would like to create a variable #Bugs/Release. I've heard that this is quite easy to do with no need for scripting, but I can't find the way to do it. Does anybody have a solution to this problem?
Huge thanks.

Algorithm to recursively search Git repo for a string

I am working on a project to automate the code review process for a team of engineers. Every time an engineer changes a file, before that change is pushed to GitHub, they need to figure out which other files are impacted by the change and add the engineers in charge of those files to view and approve it. Right now, the person who made the change manually does the following:
check which function the change occurred in
use the text search feature of an IDE (such as VS Code) to see where that function is used in the entire repo
go through all those search results and check which functions in other files are calling the original function
do a search for those functions in turn
They recursively search for functions until one of a group of designated files called "base files" appears in the search results. Separate engineers are in charge of separate base files, so once a base file appears in the search process, the person who made the change needs to add the engineer in charge of that base file to approve the change, because the functionality of that file is potentially impacted. We are trying to find a way to automate these manual steps.
I was wondering if there are any known algorithms that can be used to accomplish something like this. I am thinking of using graphs or trees, but I am not sure which specific graph or tree algorithms I should use.
Hmm, searching for strings is not good enough. Instead:
mark all base files
build a call graph, a directed graph (it might not be acyclic)
do a BFS from the changed file and log all base files reached
Doxygen can generate some call graphs, or there may already be a Clang/LLVM call-graph builder you can reuse.
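A minimal sketch of that BFS in Python, assuming the call graph is already available as an adjacency mapping from each file to the files that call into it (all names here are made up):

    from collections import deque

    def impacted_base_files(call_graph, changed, base_files):
        # BFS over the (possibly cyclic) call graph from the changed file,
        # collecting every base file that is reachable.
        seen = {changed}
        queue = deque([changed])
        found = set()
        while queue:
            node = queue.popleft()
            if node in base_files:
                found.add(node)  # its owner must review the change
            for caller in call_graph.get(node, ()):
                if caller not in seen:
                    seen.add(caller)
                    queue.append(caller)
        return found

    # hypothetical usage: edges point from a file to the files that call it
    graph = {"utils.py": ["service.py"], "service.py": ["base_api.py"]}
    print(impacted_base_files(graph, "utils.py", {"base_api.py"}))

The seen set makes the traversal terminate even when the graph contains cycles.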

How many lines of code are in my Stata do-file, excluding comments?

Is there a fast way to see how many lines of code are in my Stata do-file (.do), not counting the comments? I could make a new version and delete all the comments by hand, but that's too tedious for what I need.
My intent is to compare the lengths of an old version vs. a new version of the do file. I want to see whether I have made the code more efficient. However, I have some large commented sections of non-vital code in the files that I don't need to count.
A closely related question: is there a way to quickly see a total of all lines of code in a project (rather than just the do-file) - either including or excluding comments? Thank you.
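There may be purpose-built counters (cloc, for instance, knows about Stata), but a rough count is also easy to script. A minimal Python sketch, assuming the usual Stata comment forms (* at the start of a command, // to end of line, /* */ blocks) and ignoring comment markers that appear inside strings:

    def count_code_lines(path):
        # Rough count of non-blank, non-comment lines in a Stata do-file.
        in_block = False
        count = 0
        for raw in open(path, encoding="utf-8"):
            line = raw.strip()
            if in_block:
                if "*/" not in line:
                    continue
                in_block = False
                line = line.split("*/", 1)[1].strip()
            if line.startswith("/*"):
                if "*/" not in line:
                    in_block = True
                    continue
                line = line.split("*/", 1)[1].strip()
            line = line.split("//", 1)[0].strip()
            if line and not line.startswith("*"):
                count += 1
        return count

    # compare the two versions of the do-file (file names made up)
    print(count_code_lines("analysis_old.do"), count_code_lines("analysis_new.do"))

Running it on the old and new versions gives the comparison without editing either file.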

Parsing STDF Files to Compare results

I am new to this site and I would like to get some inputs regarding parsing STDF files. Generally speaking, I am trying to parse a STDF file to gather only the results (numbers) and not the rest of the line. If I am able to achieve this, I would then like to compare all the numbers together through a bubble sort or insertion sort and see if any numbers are equal to each other. I am capable of doing this in C/C++ and Java but I have no experience parsing documents using Scripts.
Could anyone push me in the right direction? What should I be reading to learn my way around this?
Are you already using an STDF library?
You did not mention one, so I assume not.
You should find a library you are comfortable with (the list changes over time, but you can find some by Googling or looking at the STDF page on Wikipedia) rather than attempting to parse STDF yourself, unless you have a good reason to reinvent the STDF-parsing wheel.
An STDF file contains many tests. It generally does not make sense to compare the results for different tests, so I assume you are looking for matching values within the set of results for each test.
I would use your chosen STDF parser to read the value of each test for each part. Keep a set of the results for each test. As you read each new result, check the set to see if the value already exists. If it does, you have found the case you were looking for; otherwise, add the result to the set.
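A minimal sketch of that set-based check in Python; the (test_id, value) pairs stand in for whatever your chosen STDF parser produces (hypothetical input, not a real parser API):

    from collections import defaultdict

    def find_duplicate_results(results):
        # results: iterable of (test_id, value) pairs.
        # Returns {test_id: values seen more than once for that test}.
        seen = defaultdict(set)
        dupes = defaultdict(set)
        for test_id, value in results:
            if value in seen[test_id]:
                dupes[test_id].add(value)  # repeated result found
            else:
                seen[test_id].add(value)
        return dict(dupes)

    # hypothetical usage with already-parsed records
    records = [("t1", 1.25), ("t1", 1.30), ("t1", 1.25), ("t2", 0.5)]
    print(find_duplicate_results(records))  # {'t1': {1.25}}

The set lookup makes this a single O(n) pass, so the bubble or insertion sort you mention isn't needed.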

Eliminating code duplication in a single file

Sadly, a project that I have been working on lately has a large amount of copy-and-paste code, even within single files. Are there any tools or techniques that can detect duplication or near-duplication within a single file? I have Beyond Compare 3 and it works well for comparing separate files, but I am at a loss for comparing single files.
Thanks in advance.
Edit:
Thanks for all the great tools! I'll definitely check them out.
This project is an ASP.NET/C# project, but I work with a variety of languages including Java; I'm interested in what tools are best (for any language) to remove duplication.
Check out Atomiq. It finds duplicate code that is ripe for extracting to one location.
http://www.getatomiq.com/
If you're using Eclipse, you can use the copy paste detector (CPD) https://olex.openlogic.com/packages/cpd.
You don't say what language you are using, which is going to affect what tools you can use.
For Python there is CloneDigger. It also supports Java, but I have not tried that. It can find code duplication both within a single file and between files, and gives you the result as a diff-like report in HTML.
See SD CloneDR, a tool for detecting copy-paste-edit code within and across multiple files. It detects exact copies, copies that have been reformatted, and near-miss copies with different identifiers, literals, and even different sequences of statements.
The CloneDR handles many languages, including Java (1.4, 1.5, 1.6) and C# up to C# 4.0. You can see sample clone detection reports at the website, including one for C#.
ReSharper does this automagically: it suggests when it thinks code should be extracted into a method, and will do the extraction for you.
Check out PMD; once you have configured it (which is fairly simple), you can run its copy-paste detector (CPD) to find duplicate code.
Anyone with some Office skills can do the following sequence in a minute:
use an ordinary formatter to unify the code style, preferably without line wrapping
feed the code text into Microsoft Excel as a single column
search and replace all double spaces with single ones and do other normalizing replacements
sort the column
At this point the duplicated lines will already stand out. To go further:
add a comparator formula to the 2nd column and a counter to the 3rd
copy and paste the values again, sort, and see the most repeated lines
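The same normalize-sort-count idea is easy to script. A minimal Python sketch (the file name is made up) that reports the most repeated normalized lines:

    import re
    from collections import Counter

    def most_repeated_lines(path, top=20):
        # Normalize whitespace, then count identical lines, mirroring
        # the Excel sort-and-compare trick above.
        counts = Counter()
        for line in open(path, encoding="utf-8"):
            normalized = re.sub(r"\s+", " ", line).strip()
            if normalized:
                counts[normalized] += 1
        # keep only lines that occur more than once
        return [(n, text) for text, n in counts.most_common(top) if n > 1]

    for n, text in most_repeated_lines("MyPage.aspx.cs"):
        print(n, text)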
There is an analysis tool, called Simian, which I haven't yet tried. Supposedly it can be run on any kind of text and point out duplicated items. It can be used via a command line interface.
Another option similar to those above, but with a different tool chain: https://www.npmjs.com/package/jscpd
