Pig Filter out NOT Matches - hadoop

I have a bunch of strings with various prefixes, including "unknown:". I'd really like to filter out all the strings starting with "unknown:" in my Pig script, but it doesn't appear to work.
simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown');
I've tried a few other permutations of the regex, but it appears that MATCHES just doesn't work well with NOT. Am I missing something?
Using Pig 0.9.2

It's because the MATCHES operator behaves exactly like Java's String#matches, i.e. it tries to match the entire string and not just part of it (the prefix in your case). Just update your regular expression to match the entire string with your specified prefix, like so:
simpleFilter = FILTER records BY NOT(mystr MATCHES '^unknown.*');
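The whole-string behaviour can be reproduced outside Pig. The following is a Python analogue rather than Pig code: re.fullmatch stands in for Java's String#matches, and the sample records and the helper name pig_like_matches are invented for illustration.

```python
import re

# Made-up sample data standing in for the "records" relation.
records = ["unknown:foo", "known:bar", "other:unknown:baz"]

def pig_like_matches(pattern, value):
    # fullmatch requires the pattern to cover the entire string,
    # which is how String#matches (and therefore Pig's MATCHES) behaves.
    return re.fullmatch(pattern, value) is not None

# '^unknown' never covers a whole record, so NOT(...) keeps everything.
kept_with_prefix_only = [r for r in records if not pig_like_matches(r"^unknown", r)]

# '^unknown.*' covers the whole string, so the prefixed record is dropped.
kept_with_full_pattern = [r for r in records if not pig_like_matches(r"^unknown.*", r)]
```

With the prefix-only pattern nothing is filtered out, which is the symptom described in the question.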

Related

Apache NiFi: Extracting nth column from a csv [duplicate]

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail@mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 6th entry (alpha-beta).
My first thought would be not to use a regular expression at all, but to split the string into an array on the comma. Since you asked for a regex, though:
Most regex engines let you specify an exact, minimum, or maximum repetition count, so something like this would probably work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of characters that are not a comma, followed by a comma, exactly five times: (?:[^,]*,){5} (the ?: says not to capture). It then matches and captures any number of characters that are not a comma: ([^,]*). That skips the first five entries and captures the sixth, so you want the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
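As a rough check of the pattern above, here is a small Python sketch. The function name nth_field_regex is hypothetical, and the pattern is anchored with ^ here, which the original pattern does not do.

```python
import re

def nth_field_regex(line, n):
    """Return the n-th (1-based) comma-separated field, or None if there is none."""
    # Skip n-1 "field plus comma" groups without capturing them,
    # then capture the next run of non-comma characters.
    pattern = r"^(?:[^,]*,){%d}([^,]*)" % (n - 1)
    m = re.match(pattern, line)
    return m.group(1) if m else None

line = "abc,def,4322,mail@mailinator.com,3321,alpha-beta,43"
sixth = nth_field_regex(line, 6)    # uses {5}, as in the pattern above
seventh = nth_field_regex(line, 7)  # the last entry, no trailing comma needed
```

This sketch does not handle quoted fields that contain commas.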
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of n-1.) The content would be in capture group 2. Because of the trailing comma, this doesn't match when the nth entry is the last one in the list, but that's just a matter of making the appropriate modifications for your situation.
@list = split /,/ => $string;
$it = $list[6];
or just
$it = (split /,/ => $string)[6];
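Beats writing a pattern with a {6} in it every time. Note that the index is zero-based, so [6] picks the seventh field (43 in the sample list); use [5] for alpha-beta.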
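The split approach looks much the same in other languages. A minimal Python sketch, with a hypothetical helper name and a 1-based n so that the off-by-one is handled in one place:

```python
def nth_field_split(line, n):
    """Return the n-th (1-based) comma-separated field, or None if out of range."""
    fields = line.split(",")
    if 1 <= n <= len(fields):
        return fields[n - 1]  # list indexes are zero-based
    return None

line = "abc,def,4322,mail@mailinator.com,3321,alpha-beta,43"
sixth = nth_field_split(line, 6)
seventh = nth_field_split(line, 7)
```

For real CSV data with quoted fields, Python's csv module is likely a better fit than a plain split.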

How to use regular expression in fetching data from graphite?

I want to fetch data for several different counters from Graphite in a single request, like:
summarize(site.testing_server_2.triggers_unknown.count,'1hour','sum')&format=json
summarize(site.testing_server_2.requests_failed.count,'1hour','sum')&format=json
summarize(site.testing_server_2.core_network_bad_soap.count,'1hour','sum')&format=json
and so on, for 20 more.
But I don't want to fetch
summarize(site.testing_server_2.module_xyz_abc.count,'1hour','sum')&format=json
in that request. How can I do that?
This is what I tried:
summarize(site.testing_server_2.*.count,'1hour','sum')&format=json&from=-24hour
It gets JSON data for 'module_xyz_abc' too, which I don't want.
You can't use regular expressions per se, but you can use some similar (in concept and somewhat in format) matching techniques available within the Graphite Render URL API. There are a few ways you can "match" within a target's "bucket" (i.e. between the dots).
Target Matching
Asterisk * match
The asterisk can be used to match ANY -zero or more- character(s). It can be used to replace the entire bucket (site.*.test) or within the bucket (site.w*t.test). Here is an example:
site.testing_server_2.requests_*.count
This would match site.testing_server_2.requests_failed.count, site.testing_server_2.requests_success.count, site.testing_server_2.requests_blah123.count, and so forth.
Character range [a-z0-9] match
The character range match is used to match on a single character (site.w[0-9]t.test) in the target's bucket and is specified as a range or list. For example:
site.testing_server_[0-4].requests_failed.count
This would match on site.testing_server_0.requests_failed.count, site.testing_server_1.requests_failed.count, site.testing_server_2.requests_failed.count, and so forth.
Value list (group capture) {blah, test, ...} match
The value list match can be used to match anything in the list of values, in the specified portion of the target's bucket.
site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count
This would match site.testing_server_2.triggers_unknown.count, site.testing_server_2.requests_failed.count, and site.testing_server_2.core_network_bad_soap.count. But nothing else, so site.testing_server_2.module_xyz_abc.count would not match.
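To experiment with the three matching styles offline, the sketch below translates a target pattern into a Python regex. This is only an approximation written for illustration, not Graphite's own matching code, and the function name graphite_glob_to_regex is made up.

```python
import re

def graphite_glob_to_regex(glob):
    """Approximate a Graphite target pattern as a compiled regex (illustrative only)."""
    out = []
    i = 0
    while i < len(glob):
        c = glob[i]
        if c == "*":
            # Asterisk: zero or more characters, but never across a dot.
            out.append(r"[^.]*")
        elif c == "[":
            # Character range or list: copy through as a regex character class.
            j = glob.index("]", i)
            out.append(glob[i:j + 1])
            i = j
        elif c == "{":
            # Value list: becomes a non-capturing alternation.
            j = glob.index("}", i)
            alternatives = [re.escape(a.strip()) for a in glob[i + 1:j].split(",")]
            out.append("(?:" + "|".join(alternatives) + ")")
            i = j
        else:
            out.append(re.escape(c))
        i += 1
    return re.compile("".join(out))

value_list = graphite_glob_to_regex(
    "site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count"
)
wanted = value_list.fullmatch("site.testing_server_2.requests_failed.count") is not None
unwanted = value_list.fullmatch("site.testing_server_2.module_xyz_abc.count") is not None
```

Here wanted is true and unwanted is false, which mirrors the value list behaviour described above.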
Answer
Without knowing all of your bucket values it is difficult to be surgical with the approach (perhaps with a combination of the matching options), so I'll recommend just going with a value list match. This should allow you to get all of the values in one -somewhat long- request. For example (and keep in mind you'd need to include all of your values):
summarize(site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count,'1hour','sum')&format=json&from=-24hour
For more, see Graphite Paths and Wildcards
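If the list of wanted counters lives in code, the long request can be assembled rather than typed by hand. A minimal Python sketch, assuming the request goes to the Render API's /render endpoint with a target parameter; the host name is hypothetical:

```python
from urllib.parse import urlencode, urlparse, parse_qs

GRAPHITE_RENDER_URL = "http://graphite.example.com/render"  # hypothetical host

# Only the counters you want; module_xyz_abc is simply left out.
metrics = ["triggers_unknown", "requests_failed", "core_network_bad_soap"]

target = "summarize(site.testing_server_2.{%s}.count,'1hour','sum')" % ",".join(metrics)

# urlencode percent-encodes the braces, commas and quotes in the target.
query = urlencode({"target": target, "format": "json", "from": "-24hour"})
url = GRAPHITE_RENDER_URL + "?" + query

# Round-trip check: decoding the query gives back the original target.
decoded = parse_qs(urlparse(url).query)
```

Extending the metrics list to all 20 or so counters keeps the request in one piece.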

Need a ruby regular exp for matching

I'm trying to extract the version from a list of different RPMs. Below is an example:
rpm = "abc-def-ghi-1.1.0-10.el6.x86_64"
This variable can have different string values,
rpm = "a-b-1.1.1-10.x86_64"
My goal is to write a regexp using the "match" method (as below), though this one does not account for the .el6 part.
rpm.match(/^#{rpmname_to_match}-(.*).x86_64$/).nil?
I'm not certain about what you're trying to do with the .el6 part, but if you want a pattern which will only match the numeric part, then try this:
([0-9]+(?:(?:\.|-)(?:[0-9]+))*)
This will only match a string which starts with one or more digits, then can have any number of sequences which are a period or hyphen followed by one or more digits.
So your final statement might be the following:
rpm.match(/^#{rpmname_to_match}-([0-9]+(?:(?:\.|-)(?:[0-9]+))*)(.*)\.x86_64$/).nil?
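The same pattern can be tried against both sample strings. The sketch below is in Python rather than Ruby; the function name rpm_version is hypothetical, and re.escape is an addition so that a package name containing regex metacharacters is treated literally.

```python
import re

def rpm_version(rpm, name):
    """Return the numeric version-release part after `name`, or None if no match."""
    pattern = (
        r"^%s-([0-9]+(?:(?:\.|-)(?:[0-9]+))*)(.*)\.x86_64$" % re.escape(name)
    )
    m = re.match(pattern, rpm)
    # Group 1 is the numeric part; group 2 soaks up extras such as ".el6".
    return m.group(1) if m else None

with_el6 = rpm_version("abc-def-ghi-1.1.0-10.el6.x86_64", "abc-def-ghi")
without_el6 = rpm_version("a-b-1.1.1-10.x86_64", "a-b")
```

The numeric group stops at ".el6" because the character after the period is not a digit.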

How to discover a date or a number near a word - only with regex within regex

I am still learning the intricacies of regex, and am wondering if it is possible with a single regex to find a number that is at a given distance from a word.
Consider the following text
DateClient
15-01-20130060 15-01-20140010 15-01-20150020
I want my regex to match just 15-01-2013.
I know I can get the full DateClient 15-01-2013 with DateClient\W+\d{2}-\d{2}-\d{4} and then apply a second regex afterwards, but I'm trying to build a configurable, agnostic system that gives power to the user, so I would like a single regular expression that matches just 15-01-2013.
Is this even feasible?
Any suggestions?
You can use a capturing group :
DateClient\W+(\d{2}-\d{2}-\d{4})
Example in javascript (you didn't specify a language) :
var str = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020";
var date = str.match(/DateClient\W+(\d{2}-\d{2}-\d{4})/)[1];
EDIT (following the addition of the Ruby tag) :
In Ruby you can use
(?<=DateClient\W)(\d{2}-\d{2}-\d{4})
Check out lookbehind for matching only the date. However, lookbehind support of your environment can be limited.
Or you could just use a capturing group, which you will be able to extract from the match result.
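Both suggestions can be compared side by side. A short Python sketch on the sample text from the question; Python is used only for illustration, and its lookbehind must be fixed-width, which DateClient\W is:

```python
import re

text = "DateClient\n15-01-20130060 15-01-20140010 15-01-20150020"

# Capturing group: the overall match includes "DateClient", group 1 is the date.
m_group = re.search(r"DateClient\W+(\d{2}-\d{2}-\d{4})", text)
date_from_group = m_group.group(1)

# Lookbehind: the overall match itself is only the date.
m_look = re.search(r"(?<=DateClient\W)\d{2}-\d{2}-\d{4}", text)
date_from_lookbehind = m_look.group(0)
```

The lookbehind version allows exactly one non-word character after DateClient, whereas the capturing-group version allows one or more.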

Filtering in Pig

I am trying to filter a relation in Pig. I need all the records in which the third field occurs within the first field's string.
I tried with:
(Assume my source relation is SRC)
Filtered= FILTER SRC BY $0 matches 'CONCAT(".*",$2,".")';
DUMP Filtered;
There is no syntax error but I am not getting any output for Filtered.
Pig's CONCAT takes only two arguments. See the documentation at http://pig.apache.org/docs/r0.10.0/func.html#concat
I'm not sure why it isn't complaining at runtime, but you're going to want to string together two CONCAT statements, like
CONCAT(".*", CONCAT($2, "."))
to get the string you want.
I don't think the CONCAT is resolving to what you're expecting. Rather, matches is probably trying to match against the entire unevaluated string CONCAT(".*",$2,"."), which is why you are not getting any results.
Can you break this out into two statements, the first where you create a field containing the evaluated content of the CONCAT, and a second to perform the matches operation:
TMP = FOREACH SRC GENERATE $0, CONCAT('.*', CONCAT($2, '.*'));
Filtered = FILTER TMP BY $0 matches $1;
DUMP Filtered;
Or something like that (completely untested)
I think you just have some syntax errors:
As noted by A. Leistra, CONCAT only takes two arguments.
The "." at the end should be ".*" if you want wildcards on both sides.
The FILTER statement prefers parentheses around the condition.
Pig has a lot of weird edge cases involving double quotes, so just use single quotes when you can.
Filtered= FILTER SRC BY ($0 matches CONCAT('.*', CONCAT($2, '.*')));
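To see what the corrected filter computes, here is a Python analogue (not Pig code). The rows are invented, and re.fullmatch stands in for matches, which has to cover the whole first field:

```python
import re

# Made-up rows standing in for SRC: ($0, $1, $2).
rows = [
    ("hello world", 1, "world"),
    ("foo", 2, "bar"),
    ("abcabc", 3, "bc"),
]

def first_contains_third(row):
    # Same shape as CONCAT('.*', CONCAT($2, '.*')): wildcards on both sides.
    pattern = ".*" + row[2] + ".*"
    return re.fullmatch(pattern, row[0]) is not None

filtered = [row for row in rows if first_contains_third(row)]
```

Note that the third field is interpreted as a regex here, so metacharacters in it are not treated as literal text.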
Try this,
Filtered= FILTER SRC BY $0 matches '(.*)$2(.*)';
DUMP Filtered;
The intent is that records whose first field contains the third field pass the filter.
This is done with a regex.
