Filtering in Pig - hadoop

I am trying to filter a relation in Pig. I need all the records in which the third field occurs within the first field's string.
I tried with:
(Assume my source relation is SRC)
Filtered= FILTER SRC BY $0 matches 'CONCAT(".*",$2,".")';
DUMP Filtered;
There is no syntax error but I am not getting any output for Filtered.

Pig's CONCAT takes only two arguments. See the documentation at http://pig.apache.org/docs/r0.10.0/func.html#concat
I'm not sure why it isn't complaining at runtime, but you're going to want to string together two CONCAT statements, like
CONCAT(".*", CONCAT($2, "."))
to get the string you want.

I don't think the CONCAT is resolving to what you're expecting. More likely, matches is trying to match against the entire unevaluated string CONCAT(".*",$2,"."), which is why you are not getting any results.
Can you break this out into two statements: the first to create a field containing the evaluated content of the CONCAT, and the second to perform the matches operation?
TMP = FOREACH SRC GENERATE $0, CONCAT(".*",$2,".");
Filtered = FILTER TMP BY $0 matches $1;
DUMP Filtered;
Or something like that (completely untested)

I think you just have some syntax errors:
As noted by A. Leistra, CONCAT only takes two arguments.
The "." at the end should be ".*" if you want wildcards on both sides.
The FILTER statement prefers parentheses around the argument.
Pig has a lot of weird edge cases involving double quotes, so just use single quotes when you can.
Filtered= FILTER SRC BY ($0 matches CONCAT('.*', CONCAT($2, '.*')));
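To see what the trailing "." versus ".*" changes, here is a rough Ruby analogy (not Pig code). It assumes whole-string matching, which is how another thread on this page describes the matches operator. The sample rows, the keep? helper, and the use of Regexp.escape are assumptions added for illustration.

```ruby
# Hypothetical sample rows: [first_field, second_field, third_field].
ROWS = [
  ['foobarbaz', 'x', 'bar'],  # third field followed by several characters
  ['foobar!',   'x', 'bar'],  # third field followed by exactly one character
  ['foobar',    'x', 'bar'],  # third field at the very end
  ['hello',     'x', 'bar']   # no occurrence
].freeze

# Build the pattern per row from the third field, then require it to
# match the whole first field (assumed to mirror the matches operator).
def keep?(row, suffix)
  pattern = '.*' + Regexp.escape(row[2]) + suffix
  /\A(?:#{pattern})\z/.match?(row[0])
end

one_char = ROWS.select { |r| keep?(r, '.') }.map(&:first)
any_tail = ROWS.select { |r| keep?(r, '.*') }.map(&:first)

puts one_char.inspect  # ["foobar!"]
puts any_tail.inspect  # ["foobarbaz", "foobar!", "foobar"]
```

With the single "." only rows with exactly one character after the third field survive, which is rarely what a "contains" filter wants.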

Try this,
Filtered= FILTER SRC BY $0 matches '(.*)$2(.*)';
DUMP Filtered;
If the first field contains the third field, that record is kept in the result.
This is done using a regex.

Related

Apache NiFi: Extracting nth column from a csv [duplicate]

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail@mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 7th entry (alpha-beta).
My first thought would be not to use a regular expression but to split the string into an array on the comma. But since you asked for a regex:
most regex engines let you specify an exact, minimum, or maximum repetition count, so something like this would probably work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of characters that are not a comma, followed by a comma, exactly five times: (?:[^,]*,){5} (the ?: says not to capture). It then matches and captures any number of characters that are not a comma: ([^,]*). You want the first capture group, which holds alpha-beta for the list above.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. This doesn't handle things like lists of size n, but that's just a matter of making the appropriate modifications for your situation.
@list = split /,/ => $string;
$it = $list[6];
or just
$it = (split /,/ => $string)[6];
Beats writing a pattern with a {6} in it every time.
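For comparison, the two approaches from these answers (split and index, versus a counted regex) can be sketched in Ruby. The helper names nth_by_split and nth_by_regex are hypothetical, n is 1-based, and the sample list is modeled on the one in the question, where alpha-beta sits at position 6.

```ruby
# Sample list modeled on the one in the question.
LIST = 'abc,def,4322,mail@mailinator.com,3321,alpha-beta,43'

# Split on commas and index the resulting array (n is 1-based).
# The -1 limit keeps trailing empty fields instead of dropping them.
def nth_by_split(csv, n)
  csv.split(',', -1)[n - 1]
end

# Skip n-1 "field plus comma" groups, then capture the next field.
# Returns nil when the list has fewer than n fields.
def nth_by_regex(csv, n)
  csv[/\A(?:[^,]*,){#{n - 1}}([^,]*)/, 1]
end

puts nth_by_split(LIST, 6)  # alpha-beta
puts nth_by_regex(LIST, 6)  # alpha-beta
```

Neither version understands quoted fields that contain commas; for real CSV data a CSV parser is the safer choice.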

Is there a $_ (entire line, complete line) special variable in Hadoop Pig Latin?

I would like to filter out lines in an incoming data set if ANY variable contains, say, a non-printable character.
The problem is that I don't want to have a regexp match for each column, would just like to regexp match the entire line, like $_ in PERL. Something that would go along with $0, $1, etc. Or perhaps a $* to mean any column (unsure what would happen to the numeric column).
Thus, I would like to do something like this:
tuple_stuff = LOAD 'really_big_file' USING PigStorage()
AS (
some_column:LONG,
another_column:CHARARRAY,
yet_another:CHARARRAY
...
ninty_fifth_column:CHARARRAY);
filtered_stuff = FILTER tuple_stuff BY ! $_ MATCHES '[^:print:]+';
(Side note: I can't remember if Pig matches the entire string or just anywhere in the string. Do I need the "+" quantifier?)
Being required to use a JAVA UDF is fine, provided I don't have to install anything goofy.
Note: I realize the obvious limitations of this: once loaded in, the original line would most certainly not be available. Thus, it might have to be some form of modifier in LOAD.

How do I filter file names out of a SQLite dump?

I'm trying to filter out all file names from an SQLite text dump using Ruby. I'm not very familiar with regex and need a way to read the dump and write the image file names it contains to another file. I can filter out everything except stuff like this:
VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);
and this:
src="/images/9/94/folder%2FGraph.JPG"
I can't figure out the easiest way to filter through this. I've tried using split and other functions, but instead of splitting the string into an array by the character specified, it just removed the character.
You should be able to use .gsub('%2', ' ') to replace the %2 with a space; as long as the pattern is quoted, it should be fine.
Split does remove the character that is being split, though. So you may not want to do that, or if you do, you may want to use the Array#join method with the argument of the character you split with to put it back in.
I want to 'extract' the file name from the statements above. Say I have src="/images/9/94/folder%2FGraph.JPG", I want folder%2FGraph.JPG to be extracted out.
If you want to extract what is inside the src parameter:
foo = 'src="/images/9/94/folder%2FGraph.JPG"'
foo[/^src="(.+)"/, 1]
=> "/images/9/94/folder%2FGraph.JPG"
That returns the string without the surrounding double quotes.
Here's how to do the first one:
bar = "VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);"
bar.split(',')[4][1..-2]
=> "/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG"
Not everything in programming is a regex problem. Some things (in my opinion, most things) are not candidates for a pattern. For instance, the first example could be written:
foo.split('=')[1][1..-2]
and the second:
bar[/'(.+?)'/, 1]
The idea is to use whichever is most clean and clear and understandable.
If all you want is the filename, then use a method designed to return only the filename.
Use one of the above and pass its output to File.basename. File.basename returns only the filename and extension.
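Putting the pieces together, a minimal sketch using the two sample strings from the question might look like this. The variable names are only illustrative.

```ruby
src = 'src="/images/9/94/folder%2FGraph.JPG"'
row = "VALUES(3,5,1,43,'/images/e/e5/Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG','1415',NULL);"

# Pull the path out of each statement with the patterns shown above.
src_path = src[/^src="(.+)"/, 1]
row_path = row[/'(.+?)'/, 1]

# File.basename drops the directory part; %2F is not a path separator,
# so the encoded slashes stay inside the file name.
src_name = File.basename(src_path)
row_name = File.basename(row_path)

puts src_name  # folder%2FGraph.JPG
puts row_name  # Folder%2FOrders%2FFinding_Orders%2FView_orders3.JPG
```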

Pig Filter out NOT Matches

I have a bunch of strings that have various prefixes including "unknown:" I'd really like to filter out all the strings starting with "unknown:" in my Pig script, but it doesn't appear to work.
simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown');
I've tried a few other permutations of the regex, but it appears that MATCHES just doesn't work well with NOT. Am I missing something?
Using Pig 0.9.2
It's because the matches operator operates exactly like Java's String#matches, i.e. it tries to match the entire String and not just part of it (the prefix in your case). Just update your regular expression to match the entire string with your specified prefix, like so:
simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown.*');
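The difference is easy to reproduce outside Pig. The following Ruby sketch is only an analogy: the hypothetical full_match? helper emulates whole-string matching by wrapping the pattern in \A and \z, and the sample records are made up.

```ruby
# Emulate whole-string matching (as in Java's String#matches) by
# anchoring the pattern at both ends of the string.
def full_match?(str, pattern)
  /\A(?:#{pattern})\z/.match?(str)
end

records = ['unknown:abc', 'known:def', 'other:ghi']

# '^unknown' covers only the prefix, so the whole-string match fails
# and nothing is rejected.
kept_prefix_only = records.reject { |s| full_match?(s, '^unknown') }

# '^unknown.*' covers the rest of the string, so the record is rejected.
kept_full = records.reject { |s| full_match?(s, '^unknown.*') }

puts kept_prefix_only.inspect  # ["unknown:abc", "known:def", "other:ghi"]
puts kept_full.inspect         # ["known:def", "other:ghi"]
```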

How can I split a string into an array in one operation, but only when the line contains a given pattern?

I have to match a line in a file and capture the line's contents.
The line is as follows:
key:value key:value abc:123
I have a block of code processing different lines in the file based on the line content.
The above line can be identified by the key "abc" being present in the line.
I need one regex which does the following:
Check if "abc" is present in the line.
If "abc" is present, get the contents in the form of an array.
I am able to do these separately:
#gives me an array of the key,value pairs
array = line.scan(/\w+:\d+/)
#matches "abc:value" but does not give me the other keys
/.*(abc:\d+)/.match(line)
I'm looking for a way to do this in one operation.
Don't Complicate Things
A regular expression, especially a single monolithic one, isn't the solution for everything. Even when it's possible, overly complex expressions don't make your code more readable or more maintainable. Unless your employer is charging you for each line of code, don't be afraid to use multiple lines of code to express a concept.
Use a Conditional Expression
You can use a conditional expression in your statement to match within a single line. For example:
line = 'key:value key:value abc:123'
line.scan /(\S+:\S+)/ if line =~ /abc:/
# => [["key:value"], ["key:value"], ["abc:123"]]
This will only split the line into an array of matches if it first matches the condition in the if statement. However, note that you're still fundamentally doing two regular expression matches.
If you're trying to avoid performing two regular expression matches, perhaps for performance reasons inside a tight loop, you can do something similar with a string pattern match as your condition. For example:
line = 'key:value key:value abc:123'
line.scan /(\S+:\S+)/ if line.include? 'abc:'
# => [["key:value"], ["key:value"], ["abc:123"]]
The results are the same, but String#scan uses a regular expression match while the conditional uses String#include?. The latter may be faster.
How about:
array = line.scan(/\w+:\d+/) if line[/abc:\d+/]
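The guard-then-scan idea from these answers can be wrapped in a small method. This is a sketch with a hypothetical name, pairs_if_present. It drops the capture group, because String#scan returns a flat array of strings when the pattern has no groups, and nested arrays when it does.

```ruby
# Return the key:value pairs only when the line contains the marker key.
# Returns nil for lines without the marker.
def pairs_if_present(line, key = 'abc')
  return nil unless line.include?("#{key}:")

  # No capture group here, so scan yields plain strings.
  line.scan(/\S+:\S+/)
end

p pairs_if_present('key:value key:value abc:123')
# ["key:value", "key:value", "abc:123"]
p pairs_if_present('key:value other:1')
# nil
```

Note that the include? check is a plain substring test, so a key such as xabc: would also pass it; use a regex with a word boundary if that matters.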
