How should I interpret the MT940 specification?

I'm building my own MT940 parser and I'm running into something that seems to be an unspecified issue.
The specification of the :61: tag states that it ends with a variable number of characters (34x). From an example file I see that these can continue on the next line.
For example:
:61:1510151015C54,01NTRFNONREF//15288910043499
/TRCD/00100/
How do I determine whether the next line is a new tag or a continuation of the content of the preceding tag? It seems that looking for an :xx: pattern at the beginning of the line is naive, as it could cause a bug in the exceptional situation where the content itself contains that pattern.

Every line that starts with a tag such as :61: is a new piece of information in the format. If it doesn't start with such a tag, then it's a continuation.
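A minimal sketch of that rule in Ruby; the tag pattern here (a colon, two digits, an optional letter, a colon) is my assumption of what counts as a tag, so check it against the tag list in your bank's spec:

# Sketch: group raw MT940 lines into per-tag records.
TAG_RE = /\A:\d{2}[A-Z]?:/   # assumed tag shape, e.g. :61: or :28C:

def group_records(lines)
  records = []
  lines.each do |line|
    if line.match?(TAG_RE)
      records << line.chomp              # a tag starts a new record
    elsif records.any?
      records.last << "\n" << line.chomp # anything else continues the last one
    end
  end
  records
end

On your example, this puts the :61: line and the /TRCD/00100/ line into a single record.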
Small word of warning though. MT940 is a standard, but there are subtle differences per bank. So it might be that this works for one bank but doesn't work for another. For instance, some specifications have a header that defines the start of a transaction, but others don't.

What is spifno1stsp really doing as a rsyslog property?

I was reading the template documentation of rsyslog to find better properties and I stumbled upon this one:
spifno1stsp - expert options for RFC3164 template processing
However, as you can see, the documentation is quite vague. Moreover, I have not been able to find a longer explanation anywhere. The only mentions found with Google are always about the same snippet or the same very short description.
Indeed, there is no explanation of this property:
on the entire rsyslog.com website,
or in the RFC3164,
or anywhere else actually.
It is as if everybody copies and pastes the same snippet here and there, but it is very difficult to understand what it is actually doing.
Any idea?
Think of it as somewhat like an if statement. If a space is present, don't do anything. Otherwise, if a space is not present, add a space.
It is useful for ensuring that just one space is added to the output, often between two strings.
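In Ruby terms, the logic is roughly this (an analogy only, not actual rsyslog code):

# Analogy: emit a single space unless the field already starts
# with one, so exactly one space separates the tag and the message.
def sp_if_no_1st_sp(field)
  field.start_with?(" ") ? "" : " "
end

tag = "myapp[123]:"
puts tag + sp_if_no_1st_sp("started") + "started"   # myapp[123]: started
puts tag + sp_if_no_1st_sp(" started") + " started" # myapp[123]: started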
For any cases like this where you find the docs can be improved, please feel free to open an issue requesting clarification in the official rsyslog documentation project on GitHub. The documentation team is understaffed, but team members will assist where they can.
If you're looking for general help, the rsyslog-users mailing list is also a good resource. I've learned a lot over the years by going over the archives and reading prior threads.
Back to your question about the spifno1stsp option:
While you will get a few hits on that option, you'll probably find more results by searching for the older string template option, sp-if-no-1st-sp. Here is an example of its use from the documentation page you linked to:
template(name="forwardFormat" type="string"
string="<%PRI%>%TIMESTAMP:::date-rfc3339% %HOSTNAME% %syslogtag:1:32%%msg:::sp-if-no-1st-sp%%msg%"
)
Here is the specific portion that is relevant here:
`%msg:::sp-if-no-1st-sp%%msg%`
From the Property Replacer documentation:
sp-if-no-1st-sp
This option looks scary and should probably not be used by a user. For
any field given, it returns either a single space character or no
character at all. Field content is never returned. A space is returned
if (and only if) the first character of the field’s content is NOT a
space. This option is kind of a hack to solve a problem rooted in RFC
3164: 3164 specifies no delimiter between the syslog tag sequence and
the actual message text. Almost all implementation in fact delimit the
two by a space. As of RFC 3164, this space is part of the message text
itself. This leads to a problem when building the message (e.g. when
writing to disk or forwarding). Should a delimiting space be included
if the message does not start with one? If not, the tag is immediately
followed by another non-space character, which can lead some log
parsers to misinterpret what is the tag and what the message. The
problem finally surfaced when the klog module was restructured and the
tag correctly written. It exists with other message sources, too. The
solution was the introduction of this special property replacer
option. Now, the default template can contain a conditional space,
which exists only if the message does not start with one. While this
does not solve all issues, it should work good enough in the far
majority of all cases. If you read this text and have no idea of what
it is talking about - relax: this is a good indication you will never
need this option. Simply forget about it ;)
In short, sp-if-no-1st-sp (string template option) is analogous to spifno1stsp (standard template option).
Hope that helps.

Ruby - stop program from executing in a certain way

I wrote a parser which recognizes elements of text based on certain patterns.
My program is able to recognize paragraphs, chapters, etc. The problem is that it shouldn't recognize elements when they appear inside quotes. For example:
Paragraph 1
Something here...
would be processed as a Paragraph.
And:
Paragraph 1
"Paragraph 2"
shouldn't. But as my program is based on regexp patterns, it looks for the word "Paragraph". I'm going line by line and recognizing patterns for each line. I don't know how to tell my program: if you see a quotation mark, leave the text alone without doing anything. My mentor told me to use raise, but I'm not sure how to do it.
OK, so I'm still a bit of a beginner and I don't know if there is a way to direct the regex to ignore things inside quotes, but if I wanted to solve this problem, I would first make a copy of the text to be parsed, run a regex over that copy and delete everything inside quotes, then run the parser over the remaining text.
A bit kludgy and inelegant I admit, and may have performance issues over a large enough text, but it would get the job done.
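For example, something like this sketch, which assumes plain straight double quotes (curly quotes would need the \p{Pi}/\p{Pf} classes mentioned below):

# Sketch: blank out quoted spans, then run the pattern over the rest.
text = "Paragraph 1\nSomething here...\n\"Paragraph 2\"\n"
stripped = text.gsub(/"[^"]*"/, "")

stripped.each_line do |line|
  puts "recognized: #{line}" if line =~ /\AParagraph \d+/
end
# Only "Paragraph 1" is recognized; the quoted one was removed.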
See the documentation for Ruby regular expressions; about a third of the way down it discusses quotes:
/\p{Pi}/ - 'Punctuation: Initial Quote'
/\p{Pf}/ - 'Punctuation: Final Quote'
You may be able to bake that into the regex with the ^ to direct it to ignore items in quotes.

How to find foreign language used in "C comments"

I have a large codebase where most of the documentation and source code comments are in English. But one of the minor contributors wrote comments in a different language, spread across various places.
Is there a simple trick that will let me find them? I imagine first a way to extract all comments from the code and generate a single text file (with possible source file / line number info), then pipe this through some language-detection app.
If that matters, I'm on Linux and the current compiler on this project is Clang.
The only thing that comes to mind is to go through all of the code manually and check it yourself. If it's a similar language that doesn't contain foreign letters, consider using something with a spellchecker. This way, the text that isn't recognized will get underlined and be easy to spot.
Other than that, I don't see an easy way to go through with this.
You could make a program that reads the files and prints only the comments to another output file, then spell-check that file; but this may be a waste of time, as you would easily be able to spot the comments yourself.
If you do make a program for that, however, keep in mind that there are three things to check for (a sketch in Ruby follows the list):
If a comment starts with /*, make sure it stops reading when encountering */
If a comment starts with //, only read one line - unless:
If a line starting with // ends with \, read the next line as well
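Here is a rough Ruby sketch of those three rules. It is knowingly naive: it does not skip string literals, so comment markers inside a string would be extracted too.

# Sketch: extract C comments following the three rules above.
COMMENT_RE = %r{
  /\* .*? \*/                   # rule 1: block comment
  |
  // (?: [^\n]*\\\n )* [^\n]*   # rules 2 and 3: line comment with backslash continuation
}xm

def extract_comments(source)
  source.scan(COMMENT_RE)
end

src = "int a; /* first */\n// second \\\ncontinued\nint b;\n"
puts extract_comments(src)  # prints the block comment and the two-line // comment

You could then run the extracted comments through a spellchecker or a language-detection tool, as suggested above.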
While it is possible to detect a language from a string automatically, you need way more words than fit in a usual comment to do so.
Solution: Use your own eyes and your own brain...

Does a syntax highlighter in an IDE scan the whole file every time a letter is typed?

Assuming a syntax highlighter uses a lexer to do the background work: when typing in an IDE with live syntax highlighting, does the lexer have to re-tokenize the entire file (in whatever language, e.g. Java, C++, Python, etc.), does it only have to re-read and tokenize the current line, or does it keep itself occupied with a single character/word at a time?
I'm asking because in a lot of editors/IDEs, most code remains the same as the programmer is typing; however, in some cases, starting a string literal re-highlights the rest of the line, and starting a multi-line comment re-highlights the whole file from the point where the comment starts to the end of the file.
If the lexical analysis had to be done for the entire file on every single letter typed, wouldn't that make it slow, especially for larger (100,000+ lines) files?
There is syntax highlighting and there is semantic highlighting.
Syntax highlighting is when the editor decorates purely based on language syntax: e.g. identifiers are black, keywords are pink, and comments are green. Syntax highlighting does not necessarily re-parse (or rather, re-tokenize) the whole file; it can tokenize only the "damaged region" (e.g. the tokens around the edit location). Of course, an editor developer may opt to tokenize the whole input, as it is really fast, error-proof, and easier to implement.
Semantic highlighting (the kind that, for instance, can highlight global and local identifiers differently) usually requires a complete reparse: e.g. in Java, adding "static" to a function declaration would require you to invalidate function references both above and below the cursor. In some cases caching may be implemented (e.g. caching the parse results of include files, as user edits don't change them much). Semantic highlighting is slow, so it is usually combined with syntax highlighting (you may see in Eclipse that keywords are highlighted instantly, while a member variable changes color from black after a small delay).
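To make the "damaged region" idea concrete, here is a toy Ruby sketch. It records, per line, whether the line ends inside a /* ... */ comment; after an edit, the editor re-lexes from the edited line and can stop as soon as a line's newly computed end state matches the cached one:

# Toy model: each line's end state is :code or :comment.
def end_state(line, state)
  i = 0
  while i < line.length
    if state == :code && line[i, 2] == "/*"
      state, i = :comment, i + 2
    elsif state == :comment && line[i, 2] == "*/"
      state, i = :code, i + 2
    else
      i += 1
    end
  end
  state
end

lines = ["int x; /* begin", "still inside", "end */ int y;"]
state = :code
p lines.map { |l| state = end_state(l, state) }  # => [:comment, :comment, :code]

Typing "/*" on the first line forces re-lexing only until a line's end state stops changing, which is why opening a comment can re-color everything below it while most edits stay local.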
I didn't look this up, but I am pretty sure that it depends on what is being highlighted. That is, there is a difference between re-checking the local area you are typing in against basic syntax, versus, say, an open comment that, until closed, re-highlights everything from that point to the end of the file.

Why should I have to bother putting a linefeed at the end of every file?

I've occasionally encountered software, including compilers, that refuses to accept or properly handle text files that aren't terminated with a newline. I've even encountered explicit errors of the form,
no newline at the end of the file
...which would seem to indicate that they're explicitly checking for this case and then rejecting it just to be stubborn.
Am I missing something here? Why would - or should - anything care whether or not a file ends with a seemingly-superfluous bit of whitespace?
Historically, at least in the Unix world, "newline" or rather U+000A Line Feed was a line terminator. This stands in stark contrast to the practice in Windows for example, where CR+LF is a line separator.
A naïve solution for reading every line in a file would be to append characters to a buffer until an LF is encountered. If done really stupidly, this would ignore the last line in a file if it wasn't terminated by an LF.
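In Ruby, that naive loop would look something like this; note how the unterminated last line silently vanishes:

# Naive reader: a line is only emitted when its LF arrives, so an
# unterminated final line is silently dropped.
def naive_lines(text)
  lines = []
  buffer = ""
  text.each_char do |ch|
    if ch == "\n"
      lines << buffer
      buffer = ""
    else
      buffer << ch
    end
  end
  lines  # `buffer` may still hold an unterminated last line here
end

p naive_lines("one\ntwo\nthree")  # => ["one", "two"]  ("three" is lost)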
Another thing to consider is macro systems that allow including files. A line such as
%include "foo.inc"
might be replaced by the contents of the mentioned file where, if the last line wasn't ended with an LF, it would get merged with the next line. And yes, I've seen this behavior with a particular macro assembler for an embedded platform.
Nowadays I firmly believe that (a) it's a relic of ancient times and (b) I haven't seen modern software that can't handle it, yet we still carry around numerous editors on Unix-like systems that helpfully put one byte more than needed at the end of a file.
Generally I would say that a lack of a newline at the end of a source file means that something went wrong in the editor or source control client and that not all of the code in the buffer got flushed. While this would likely result in other errors, knowing that something probably went wrong in the editor/SCM and that code may be missing is a pretty useful bit of knowledge. Certainly something I would want to check.
