Hadoop/Pig regular expression matching - hadoop

This is kind of an odd situation, but I'm looking for a way to filter using something like MATCHES but on a list of unknown patterns (of unknown length).
That is, if the given input is two files, one with numbers A:
xxxx
yyyy
zzzz
zzyy
...etc...
And the other with patterns B:
xx.*
yyy.*
...etc...
How can I filter the first input, by all of the patterns in the second?
If I knew all the patterns beforehand, I could
A = FILTER A BY (num MATCHES 'somepattern.*' OR num MATCHES 'someotherpattern'....);
The problem is that I don't know them beforehand, and since they're patterns and not simple strings, I cannot just use joins/groups (at least as far as I can tell).
Maybe a strange nested FOREACH...thing?
Any ideas at all?

If you use |, which operates as an OR, you can construct a single pattern out of the individual patterns:
(xx.*|yyy.*|zzzz.*)
This will do a check to see if it matches any of the patterns.
Edit:
To create the combined regex pattern:
* Create a string starting with (
* Read in each line (assuming each line is a pattern) and append it to a string followed by a |
* When done reading lines, remove the last character (which will be an unneeded |)
* Append a )
This will create a regex pattern to check all the patterns in the input file. (Note: It's assumed the file contains valid patterns)
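The steps above can be sketched in Python as a rough illustration (not Pig code). The function names combine_patterns and filter_by_patterns are hypothetical, and the sketch assumes that a value must match a pattern in full, which is how I understand MATCHES to behave. Joining with | avoids having to strip a trailing | afterwards.

```python
import re

def combine_patterns(pattern_lines):
    """Build one alternation pattern, e.g. (xx.*|yyy.*), from one pattern per line."""
    patterns = [line.strip() for line in pattern_lines if line.strip()]
    if not patterns:
        return None  # no patterns: nothing to combine
    return "(" + "|".join(patterns) + ")"

def filter_by_patterns(values, pattern_lines):
    """Keep only the values that fully match at least one of the patterns."""
    combined = combine_patterns(pattern_lines)
    if combined is None:
        return []
    regex = re.compile(combined)
    # fullmatch is an assumption: the whole value must match, not a substring
    return [value for value in values if regex.fullmatch(value)]

if __name__ == "__main__":
    numbers = ["xxxx", "yyyy", "zzzz", "zzyy"]
    patterns = ["xx.*\n", "yyy.*\n"]
    print(combine_patterns(patterns))
    print(filter_by_patterns(numbers, patterns))
```

As in the answer, this assumes every line of the pattern file is a valid regular expression; an invalid one would make re.compile raise an error.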

Related

Glob string pattern for one or more files

I need a pattern for one or more files, name of each will be known before the matching occurs (but I do not know what they are right now).
For example, one occurrence could be two files, A.lkml and B.lkml, and another could be three: CDFDFDSADF.lkml, SD.lkml and R4545452.lkml. The filenames will be passed as a single argument with a single space as separator (so in example 1, the argument will be A.lkml B.lkml).
What I can be sure of:
all files end with .lkml
for each matching, I need to add a manifest.lkml into the list. For example, in example 1, the list should contain 3 instead of 2 filenames, A.lkml, B.lkml and manifest.lkml
What puzzles me is that glob pattern matching doesn't seem to be able to do logic "OR". I have tried to use ",", "|" to no avail. In my experiments I fixed the filenames but in reality they change each time.
Update: I think brace expression such as {a.lkml,manifest.lkml} should work. Somehow it doesn't pass.
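A minimal sketch of how the brace expression mentioned in the update could be built from the space-separated argument. The function name brace_pattern is hypothetical, and whether the consuming tool honors braces is an assumption: brace alternation is an extension in some glob implementations, and Python's own glob and fnmatch modules, for instance, do not support it.

```python
def brace_pattern(argument, extra="manifest.lkml"):
    """Turn 'A.lkml B.lkml' into '{A.lkml,B.lkml,manifest.lkml}'."""
    names = argument.split()      # filenames are separated by single spaces
    if extra not in names:
        names.append(extra)       # the manifest must always be in the list
    if len(names) == 1:
        return names[0]           # a one-item brace list is usually not expanded
    return "{" + ",".join(names) + "}"

if __name__ == "__main__":
    print(brace_pattern("A.lkml B.lkml"))
```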

Apache NiFi: Extracting nth column from a csv [duplicate]

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail@mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 7th entry (alpha-beta).
My first thought would be not to use a regular expression at all, but to split the string into an array on the comma. But since you asked for a regex:
Most regex engines let you specify an exact repetition count, so something like this would probably work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of characters that are not a comma, followed by a comma, exactly five times: (?:[^,]*,){5} (the ?: says not to capture). It then matches and captures any number of characters that are not a comma: ([^,]*). With {5}, the captured field is the sixth one, which is alpha-beta in the example; use a count of n-1 to reach the nth field. You want to use the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. This doesn't handle things like lists of size n, but that's just a matter of making the appropriate modifications for your situation.
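A small Python sketch of both approaches, for comparison. The function names are hypothetical; n is 1-based, so the regex skips n-1 fields before capturing. Unlike the pattern above, this version does not require a trailing comma, so it also handles a list of exactly n entries. It does not handle quoted fields that contain commas.

```python
import re

def nth_field_regex(line, n):
    """Return the nth (1-based) comma-separated field using a regex, or None."""
    # Skip n-1 "field plus comma" groups, then capture the next field.
    pattern = r"^(?:[^,]*,){%d}([^,]*)" % (n - 1)
    match = re.match(pattern, line)
    return match.group(1) if match else None

def nth_field_split(line, n):
    """Return the nth (1-based) comma-separated field by splitting, or None."""
    fields = line.split(",")
    return fields[n - 1] if 1 <= n <= len(fields) else None

if __name__ == "__main__":
    entry = "abc,def,4322,mail@mailinator.com,3321,alpha-beta,43"
    print(nth_field_regex(entry, 6))
    print(nth_field_split(entry, 6))
```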
@list = split /,/ => $string;
$it = $list[6];
or just
$it = (split /,/ => $string)[6];
Beats writing a pattern with a {6} in it every time.

Finding number range with grep

I have a database in this format:
username:something:UID:something:name:home_folder
Now I want to see which users have a UID in the range 1000-5000. This is what I tried to do:
ypcat passwd | grep '^.*:.*:[1-5][0-9]\{2\}:'
My thinking is this: I go to the third column and find numbers that start with a number from 1-5, the next number can be any number - range [0-9] and that range repeats itself 2 more times making it a 4 digit number. In other words it would be something like [1-5][0-9][0-9][0-9].
My output, however, lists even UID's that are greater than 5000. What am I doing wrong?
Also, I realize the code I wrote could potentially list numbers up to 5999. How can I restrict the numbers to 1000-5000?
EDIT: I'm intentionally not using awk since I want to understand what I'm doing wrong with grep.
There are several problems with your regex:
* As Sundeep pointed out in a comment, ^.*:.*: will match two or more columns, because the .* parts can match field delimiters (":") as well as field contents. To fix this, use ^[^:]*:[^:]*: (or, equivalently, ^\([^:]*:\)\{2\}; see the notes on bracket expressions and basic vs extended RE syntax below).
* [0-9]\{2\} will match exactly two digits, not three.
* As you realized, it matches numbers starting with "5" followed by digits other than "0".
As a result of these problems, the pattern ^.*:.*:[1-5][0-9]\{2\}: will match any record with a UID or GID in the range 100-599.
To do it correctly with grep, use grep -E '^([^:]*:){2}([1-4][0-9]{3}|5000):' (again, see Sundeep's comments).
[Added in edit:]
Concerning bracket expressions and what ^ means in them, here's the relevant section of the re_format man page:
A bracket expression is a list of characters enclosed in '[]'. It
normally matches any single character from the list (but see below).
If the list begins with '^', it matches any single character (but see
below) not from the rest of the list. If two characters in the list
are separated by '-', this is shorthand for the full range of
characters between those two (inclusive) in the collating sequence,
e.g. '[0-9]' in ASCII matches any decimal digit.
(bracket expressions can also contain other things, like character classes and equivalence classes, and there are all sorts of special rules about things like how to include characters like "^", "-", "[", or "]" as part of a character list, rather than negating, indicating a range, class, or end of the expression, etc. It's all rather messy, actually.)
Concerning basic vs. extended RE syntax: grep -E uses the "extended" syntax, which is just different enough to mess you up. The relevant differences here are that in a basic RE, the characters "(){}" are treated as literal characters unless escaped (if escaped, they're treated as RE syntax indicating grouping and repetition); in an extended RE, this is reversed: they're treated as RE syntax unless escaped (if escaped, they're treated as literal characters).
That's why I suggest ^\([^:]*:\)\{2\} in the first bullet point, but then actually use ^([^:]*:){2} in the proposed solution -- the first is basic syntax, the second is extended.
The other relevant difference -- and the reason I switched to extended for the actual solution -- is that only extended RE allows | to indicate alternatives, as in this|that|theother (which matches "this" or "that" or "theother"). I need this capability to match a 4-digit number starting with 1-4 or the specific number 5000 ([1-4][0-9]{3}|5000). POSIX basic RE has no alternation operator (GNU grep accepts \| as an extension, but that is not portable), so grep -E and the extended syntax are the reliable choice here.
(There are also many other RE variants, such as Perl-compatible RE (PCRE). When using regular expressions, always be sure to know which variant your regex tool uses, so you don't use syntax it doesn't understand.)
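To check the corrected pattern against a few records, here is a minimal sketch using Python's re module rather than grep. The sample records and the function name uid_in_range are made up for illustration; the pattern is the extended-syntax one from the answer, with non-capturing groups added.

```python
import re

# Skip exactly two colon-terminated fields, then require a UID of
# 1000-4999 or exactly 5000, followed by the next field delimiter.
UID_IN_RANGE = re.compile(r"^(?:[^:]*:){2}(?:[1-4][0-9]{3}|5000):")

def uid_in_range(record):
    """Return True if the third field of a passwd-style record is in 1000-5000."""
    return UID_IN_RANGE.search(record) is not None

if __name__ == "__main__":
    sample = [
        "alice:x:1000:100:Alice:/home/alice",   # lower bound, should match
        "bob:x:5000:100:Bob:/home/bob",         # upper bound, should match
        "carol:x:5001:100:Carol:/home/carol",   # just above the range
        "dave:x:999:1000:Dave:/home/dave",      # GID 1000 must not count
        "erin:x:12345:100:Erin:/home/erin",     # five digits
    ]
    for record in sample:
        print(record.split(":")[0], uid_in_range(record))
```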
ypcat passwd | awk -F: '$3 >= 1000 && $3 <= 5000 {print $1}'
awk can do the task in a simple manner. Here we set ":" as the delimiter between the fields and require that the third field be at least 1000 and at most 5000. If the condition is met, awk prints the first field.

How can I split a string into an array in one operation, but only when the line contains a given pattern?

I have to match a line in a file and capture the line's contents.
The line is as follows:
key:value key:value abc:123
I have a block of code processing different lines in the file based on the line content.
The above line can be identified by the key "abc" being present in the line.
I need one regex which does the following
Check if "abc" is present in the line
if "abc" is present get the contents in the form of an array
I am able to do these separately
#gives me an array of the key,value pairs
array = line.scan(/\w+:\d+/)
#matches "abc:value" but does not give me the other keys
/.*(abc:\d+)/.match(line)
I'm looking for a way to do this in one operation.
Don't Complicate Things
A regular expression, especially a single monolithic one, isn't the solution for everything. Even when it's possible, overly complex expressions don't make your code more readable or more maintainable. Unless your employer is charging you for each line of code, don't be afraid to use multiple lines of code to express a concept.
Use a Conditional Expression
You can use a conditional expression in your statement to match within a single line. For example:
line = 'key:value key:value abc:123'
line.scan /(\S+:\S+)/ if line =~ /abc:/
# => [["key:value"], ["key:value"], ["abc:123"]]
This will only split the line into an array of matches if it first matches the condition in the if statement. However, note that you're still fundamentally doing two regular expression matches.
If you're trying to avoid performing two regular expression matches, perhaps for performance reasons inside a tight loop, you can do something similar with a string pattern match as your condition. For example:
line = 'key:value key:value abc:123'
line.scan /(\S+:\S+)/ if line.include? 'abc:'
# => [["key:value"], ["key:value"], ["abc:123"]]
The results are the same, but String#scan uses a regular expression match while the conditional uses String#include?. The latter may be faster.
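For comparison, the same guard-then-scan idea can be sketched in Python. This is a translation for illustration, not the Ruby code above; the function name pairs_if_key is hypothetical, and the pattern has no capture group, so the result is a flat list of strings rather than nested arrays.

```python
import re

def pairs_if_key(line, key="abc"):
    """Return all key:value tokens, but only if the line contains 'key:'."""
    if key + ":" not in line:   # cheap substring test, no regex involved
        return None
    return re.findall(r"\S+:\S+", line)

if __name__ == "__main__":
    print(pairs_if_key("key:value key:value abc:123"))
    print(pairs_if_key("key:value key:value"))
```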
How about:
array = line.scan(/\w+:\d+/) if line[/abc:\d+/]

Regular expression Unix shell script

I need to filter all lines with words starting with a letter followed by zero or more letters or numbers, but no special characters (basically names which could be used for c++ variable).
egrep '^[a-zA-Z][a-zA-Z0-9]*'
This works fine for words such as "a", "ab10", but it also includes words like "b.b". I understand that * at the end of expression is problem. If I replace * with + (one or more) it skips the words which contain one letter only, so it doesn't help.
EDIT:
I should be more precise. I want to find lines with any number of possible words as described above. Here is an example:
int = 5;
cout << "hello";
//some comments
In that case it should print all of the lines above, as they all include at least one word which fits the described conditions, and a line does not have to begin with a letter.
Your solution will look roughly like this example. In this case, the regex requires that the "word" be preceded by space or start-of-line and then followed by space or end-of-line. You will need to modify the boundary requirements (the parenthesized stuff) as needed.
'(^| )[a-zA-Z][a-zA-Z0-9]*( |$)'
Assuming the line ends after the word:
'^[a-zA-Z][a-zA-Z0-9]*$'
You have to add something to it. It might be that the rest of it can be white spaces or you can just append the end of line.(AFAIR it was $ )
Your problem lies in the ^ and $ anchors, which match the start and end of the line respectively. You want the line to match if it contains a word anywhere; getting rid of the anchors does what you want:
egrep '[a-zA-Z][a-zA-Z0-9]+'
Note that the + only matches words of length 2 or more; a * in that place would match single characters too.
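The difference between the unanchored pattern and the one with explicit word boundaries can be checked with a short Python sketch. It uses Python's re module instead of egrep, and the names BOUNDED, LOOSE and has_identifier_word are hypothetical. The boundary rule (space or start/end of line) is the one from the first answer and may need adjusting.

```python
import re

# A word must be preceded by start-of-line or a space
# and followed by a space or end-of-line.
BOUNDED = re.compile(r"(^| )[a-zA-Z][a-zA-Z0-9]*( |$)")

# No boundaries: any letter anywhere in the line is enough.
LOOSE = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")

def has_identifier_word(line):
    """Return True if the line contains at least one space-delimited identifier."""
    return BOUNDED.search(line) is not None

if __name__ == "__main__":
    for line in ["int = 5;", 'cout << "hello";', "//some comments", "b.b"]:
        print(repr(line), has_identifier_word(line), LOOSE.search(line) is not None)
```

The unanchored pattern accepts "b.b" because it only needs some letter in the line, while the bounded pattern rejects it.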
