Pick/UniBasic Field function that operates with a delimiter of more than one character? - multivalue

Has there ever been an implementation of the field function (page 311) in the various flavors of Pick/UniBasic etc. that would operate on a delimiter of more than one character?
The documented implementations I can find stipulate one character as the delimiter argument and if the delimiter is presented with more than one character, the first character of the delimiter string is used instead of the entire string as a delimiter.
I am asking this because there are many instances in the commercial and custom software I maintain where I see attempts to use a multi-character delimiter with the field statement. It seems programmers were using it expecting a different result than is currently happening.

jBASE does allow for this. From the FIELD docs:
This function returns a multi-character delimited field from within a string. It takes the general form:
FIELD(string, delimiter, occurrence{, extractCount})
where:
string specifies the string, from which the field(s) is to be extracted.
delimiter specifies the character or characters that delimit the fields within the dynamic array.
occurrence should evaluate to an integer of value 1 or higher. It specifies the delimiter used as the starting point for the extraction.
extractCount is an integer that specifies the number of fields to extract. If omitted, assumes one.
Additionally, an example from the docs:
in_Value = "AAAA : BBjBASEBB : CCCCC"
CRT FIELD(in_Value , "jBASE", 1)
Producing output:
AAAA : BB
Update 2020-08-13 (adding context for OpenQM):
As an official comment since we maintain both jBASE and OpenQM, I felt it worth calling out that OpenQM does not allow multi-character delimiters for FIELD().

Related

How to have a Multi Character Delimiter in Informatica Cloud?

I have a problem I need help solving. The business I am working for is using Informatica cloud to do alot of their ETL into AWS and Other Services.
We have been given a flat file by the business where the field delimiter is "~|" Currently to the best of my knowledge informatica only accepts a single character delimiter.
Does any one know how to overcome this?
Informatica cannot read composite delimiters.
First you could feed each line as one single long string into an
Expression transformation. In this case the delimiter character should
be set to \037 , I haven't seen this character (ASCII Unit Separator)
in use at least since 1982. Then use repetitive invocations of InStr()
within the EXP to identify the positions of those double pipe
characters and split up each line into fields using SubStr().
Second
(easier in the mapping, more work with the session) you could feed the
file into some utility which replaces those double pipe characters by
the character ASCII 31 (the Unit Separator mentioned above); the
session has to be set up such that it reads the output from this
utility (input file type = Command instead of File). Then the source
definition should contain the \037 as the field delimiter instead of
any pipe character or so.

Finding number range with grep

I have a database in this format:
username:something:UID:something:name:home_folder
Now I want to see which users have a UID ranging from 1000-5000. This is what what I tried to do:
ypcat passwd | grep '^.*:.*:[1-5][0-9]\{2\}:'
My thinking is this: I go to the third column and find numbers that start with a number from 1-5, the next number can be any number - range [0-9] and that range repeats itself 2 more times making it a 4 digit number. In other words it would be something like [1-5][0-9][0-9][0-9].
My output, however, lists even UID's that are greater than 5000. What am I doing wrong?
Also, I realize the code I wrote could potentially lists numbers up to 5999. How can I make the numbers 1000-5000?
EDIT: I'm intentionally not using awk since I want to understand what I'm doing wrong with grep.
There are several problems with your regex:
As Sundeep pointed out in a comment, ^.*:.*: will match two or more columns, because the .* parts can match field delimiters (":") as well as field contents. To fix this, use ^[^:]*:[^:]*: (or, equivalently, ^\([^:]:\)\{2\}); see the notes on bracket expressions and basic vs extended RE syntax below)
[0-9]\{2\} will match exactly two digits, not three
As you realized, it matches numbers starting with "5" followed by digits other than "0"
As a result of these problems, the pattern ^.*:.*:[1-5][0-9]\{2\}: will match any record with a UID or GID in the range 100-599.
To do it correctly with grep, use grep -E '^([^:]*:){2}([1-4][0-9]{3}|5000):' (again, see Sundeep's comments).
[Added in edit:]
Concerning bracket expressions and what ^ means in them, here's the relevant section of the re_format man page:
A bracket expression is a list of characters enclosed in '[]'. It
normally matches any single character from the list (but see below).
If the list begins with '^', it matches any single character (but see
below) not from the rest of the list. If two characters in the list
are separated by '-', this is shorthand for the full range of
characters between those two (inclusive) in the collating sequence,
e.g. '[0-9]' in ASCII matches any decimal digit.
(bracket expressions can also contain other things, like character classes and equivalence classes, and there are all sorts of special rules about things like how to include characters like "^", "-", "[", or "]" as part of a character list, rather than negating, indicating a range, class, or end of the expression, etc. It's all rather messy, actually.)
Concerning basic vs. extended RE syntax: grep -E uses the "extended" syntax, which is just different enough to mess you up. The relevant differences here are that in a basic RE, the characters "(){}" are treated as literal characters unless escaped (if escaped, they're treated as RE syntax indicating grouping and repetition); in an extended RE, this is reversed: they're treated as RE syntax unless escaped (if escaped, they're treated as literal characters).
That's why I suggest ^\([^:]:\)\{2\} in the first bullet point, but then actually use ^([^:]*:){2} in the proposed solution -- the first is basic syntax, the second is extended.
The other relevant difference -- and the reason I switched to extended for the actual solution -- is that only extended RE allows | to indicate alternatives, as in this|that|theother (which matches "this" or "that" or "theother"). I need this capability to match a 4-digit number starting with 1-4 or the specific number 5000 ([1-4][0-9]{3}|5000). There's simply no way to do this in a basic RE, so grep -E and the extended syntax are required here.
(There are also many other RE variants, such as Perl-compatible RE (PCRE). When using regular expressions, always be sure to know which variant your regex tool uses, so you don't use syntax it doesn't understand.)
ypcat passwd |awk -F: '$3>1000 && $3 <5000{print $1}'
awk here can go the task in a simple manner. Here we made ":" as the delimiter between the fields and put the condition that third field should be greater than 1000 and less then 5000. If this condition meets print first field.

Replace non-word characters, unless given sequence matches

I have a string like this:
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
I want to replace all non-word characters (symbols and whitespace), except the ### delimiters.
I'm currently using:
str.gsub(/[^\w#]+/, 'X')
which yields:
"JimXBobXsXemailX###hl###address###endhl###XisXjb#exampleXcom"
In practice, this is good enough, but it offends me for two reasons:
The # in the email address is not replaced.
The use of [^\w] instead of \W feels sloppy.
How do I replace all non-word characters, unless those characters make up the ###hl### or ###endhl### delimiter strings?
str.gsub(/(###.*?###|\w+)|./) { $1 || "X" }
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"
This approach uses the fact that alternations work like case structure: the first matching one consumes the corresponding string, then no further matching is done on it. Thus, ###.*?### will consume a marker (like ###hl###; nothing else will be matched inside it. We also match any sequence of word characters. If any of those are captured, we can just return them as-is ($1). If not, then we match any other character (i.e. not inside a marker, and not a word character) and replace it with "X".
Regarding your second point, I think you are asking too much; there is no simple way to avoid that.
Regarding the first point, a simple way is to temporarily replace "###" with a character that you will never use (let's say you are using a system without "\r", so that that character is not used; we can use that as a temporal replacement).
"Jim-Bob's email ###hl###address###endhl### is: jb#example.com"
.gsub("###", "\r").gsub(/[^\w\r]/, "X").gsub("\r", "###")
# => "JimXBobXsXemailX###hl###address###endhl###XisXXjbXexampleXcom"

What does <\1> mean in String#sub?

I was reading this and I did not understand it. I have two questions.
What is the difference ([aeiou]) and [aeiou]?
What does <\1> mean?
"hello".sub(/([aeiou])/, '<\1>') #=> "h<e>llo"
Here it documented:
If replacement is a String it will be substituted for the matched text. It may contain back-references to the pattern’s capture groups of the form "\d", where d is a group number, or "\k<n>", where n is a group name. If it is a double-quoted string, both back-references must be preceded by an additional backslash. However, within replacement the special match variables, such as &$, will not refer to the current match.
Character Classes
A character class is delimited with square brackets ([, ]) and lists characters that may appear at that point in the match. /[ab]/ means a or b, as opposed to /ab/ which means a followed by b.
Hope above definition made clear what [aeiou] is.
Capturing
Parentheses can be used for capturing. The text enclosed by the nth group of parentheses can be subsequently referred to with n. Within a pattern use the backreference \n; outside of the pattern use MatchData[n].
Hope above definition made clear what ([aeiou]) is.
([aeiou]) - any characters inside the character class [..],which will be found first from the string "hello",is the value of \1(i.e.the first capture group). In this example value of \1 is e,which will be replaced by <e> (as you defined <\1>). That's how "h<e>llo" has been generated from the string hello using String#sub method.
The doc you post says
It may contain back-references to the pattern’s capture groups of the
form "\d", where d is a group number, or "\k", where n is a group
name.
So \1 matches whatever was captured in the first () group, i.e. one of [aeiou] and then uses it in the replacement <\1>

Allowed characters in map key identifier in YAML?

Which characters are and are not allowed in a key (i.e. example in example: "Value") in YAML?
According to the YAML 1.2 specification simply advises using printable characters with explicit control characters being excluded (see here):
In constructing key names, characters the YAML spec. uses to denote syntax or special meaning need to be avoided (e.g. # denotes comment, > denotes folding, - denotes list, etc.).
Essentially, you are left to the relative coding conventions (restrictions) by whatever code (parser/tool implementation) that needs to consume your YAML document. The more you stick with alphanumerics the better; it has simply been our experience that the underscore has worked with most tooling we have encountered.
It has been a shared practice with others we work with to convert the period character . to an underscore character _ when mapping namespace syntax that uses periods to YAML. Some people have similarly used hyphens successfully, but we have seen it misconstrued in some implementations.
Any character (if properly quoted by either single quotes 'example' or double quotes "example"). Please be aware that the key does not have to be a scalar ('example'). It can be a list or a map.

Resources