I have a big log file.
After removing the timestamp from each line, I sort it with cat logfile | sort -u > logfile, so that the logs are clean and organized like this:
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
.
. (lines not shown here)
.
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
.
.
. (lines not shown here)
.
.
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
I can get the logged items (e.g. PL.HSPB in the above example) with
grep -oE " [0-9A-Z]*\.[0-9A-Z]*" logfile | sort -u
However, I also want to know the date info, and to make it clearer I want to remove the intermediate lines. For example,
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
.
. (lines not shown here)
.
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
after removal becomes
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
i.e., for an item, only the first and last lines are kept (the digits are the year and the Julian day).
Is there any shell command to do this easily?
Script:
$ cat hhz.py
#!/usr/bin/env python
import sys, re
from collections import OrderedDict
firsts = OrderedDict()
lasts = OrderedDict()
for line in sys.stdin:
    line = line.rstrip("\n")
    x = re.match(r"(.*HHZ\.)([0-9][0-9][0-9][0-9]\.[0-9]+)( .*)", line)
    if x is None:
        continue
    # Key on the line with the year.day part removed.
    undated = x.group(1) + x.group(3)
    if undated not in firsts:
        firsts[undated] = line
    lasts[undated] = line
for undated in firsts:
    first = firsts[undated]
    last = lasts[undated]
    print(first)
    if first != last:
        print(last)
Input:
$ cat hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.349 because of divided by zero
failed to correct PL.ASBF..HHZ.2011.350 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.364 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.129 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
failed to correct PL.HSPB..HHZ.2014.364 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
Output:
$ hhz.py < hhz.dat
failed to correct PL.ASBF..HHZ.2011.348 because of divided by zero
failed to correct PL.ASBF..HHZ.2015.365 because of divided by zero
failed to correct PL.HSPB..HHZ.2011.128 because of Illegal format
failed to correct PL.HSPB..HHZ.2014.365 because of Illegal format
failed to correct PL.HSPB..HHZ.2011.130 because of Something else
Group the lines by regexing out the date part; the undated string is the unique key.
Get the first line of a group by doing an ordered-dict put only if the key is not already set.
Get the last line of a group by doing an ordered-dict put unconditionally.
Use OrderedDict to preserve the input-file ordering (use a plain dict if you don't want that).
Check first != last to avoid printing the same line twice when a group contains only one line.
The transform is getting aborted, but only when I check the "copy empty fields" checkbox, and the rest of the entries of the Import Set get stuck at Pending. I also verified the transform script, but no luck.
Below is the error:
Import set: ISETxxxxxxx transform stopped due to error: java.lang.NumberFormatException
java.lang.NumberFormatException
at java.math.BigDecimal.<init>(BigDecimal.java:596)
at java.math.BigDecimal.<init>(BigDecimal.java:383)
at java.math.BigDecimal.<init>(BigDecimal.java:806)
at com.glide.script.glide_elements.GlideNumber.getSafeBigDecimal(GlideNumber.java:42)
at com.glide.currency.GlideElementCurrency.coerceAmount(GlideElementCurrency.java:406)
at com.glide.currency.GlideElementCurrency.cleanAmount(GlideElementCurrency.java:389)
at com.glide.currency.GlideElementCurrency.setDisplayValue(GlideElementCurrency.java:136)
at com.glide.currency.GlideElementCurrency.setValue(GlideElementCurrency.java:89)
at com.glide.db.impex.transformer.TransformerField.copyEmptyFields(TransformerField.java:202)
at com.glide.db.impex.transformer.TransformerField.setValue(TransformerField.java:130)
at com.glide.db.impex.transformer.TransformerField.transformField(TransformerField.java:84)
at com.glide.db.impex.transformer.TransformRow.transformCurrent(TransformRow.java:100)
at com.glide.db.impex.transformer.TransformRow.transform(TransformRow.java:69)
at com.glide.db.impex.transformer.Transformer.transformBatch(Transformer.java:150)
at com.glide.db.impex.transformer.Transformer.transform(Transformer.java:76)
at com.glide.system_import_set.ImportSetTransformerImpl.transformEach(ImportSetTransformerImpl.java:239)
at com.glide.system_import_set.ImportSetTransformerImpl.transformAllMaps(ImportSetTransformerImpl.java:91)
at com.glide.system_import_set.ImportSetTransformer.transformAllMaps(ImportSetTransformer.java:64)
at com.glide.system_import_set.ImportSetTransformer.transformAllMaps(ImportSetTransformer.java:50)
at com.snc.automation.ScheduledImportSetJob.runImport(ScheduledImportSetJob.java:55)
at com.snc.automation.ScheduledImportJob.execute(ScheduledImportJob.java:45)
at com.glide.schedule.JobExecutor.execute(JobExecutor.java:83)
at com.glide.schedule.GlideScheduleWorker.executeJob(GlideScheduleWorker.java:207)
at com.glide.schedule.GlideScheduleWorker.process(GlideScheduleWorker.java:145)
at com.glide.schedule.GlideScheduleWorker.run(GlideScheduleWorker.java:62)
I'm guessing you have a required field that is a Decimal or similar.
The java.lang.NumberFormatException indicates it is failing to convert an empty string to a number such as 0.0.
Use a field-level source script to convert this, something along the lines of:
answer = (function transformEntry(source) {
    if (source.u_number_field.nil())
        return 0.0;
    return source.u_number_field; // pass non-empty values through unchanged
})(source);
I am new to Pig and trying to learn on my own.
I have written a script that appends the epoch time to each word read from the words.txt file.
Here is the script:
A = LOAD 'words.txt' AS (word:chararray);
B = FOREACH A GENERATE CONCAT(CONCAT(A.word,'_'),(chararray)ToUnixTime(CurrentTime()));
dump B;
But the issue is: if words.txt has only one word, it gives the proper output. If it has multiple words, like
word1
word2
word3
word4
then it gives the following error:
ERROR 1066: Unable to open iterator for alias B
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (word1 ), 2nd :(word2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar")
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (word1 ), 2nd :(word2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar")
    at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:122)
    at o...
Please suggest how I can solve this issue.
Thank you.
Solved it on my own: I just removed the A. from the inner CONCAT, and it worked for me. Referring to the relation as A.word inside the FOREACH makes Pig treat it as a scalar, which fails as soon as the relation has more than one row; referring to the field as plain word avoids that.
Script:
A = LOAD 'words.txt' AS (word:chararray);
B = FOREACH A GENERATE CONCAT(CONCAT(word,'_'),(chararray)ToUnixTime(CurrentTime()));
dump B;
I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work, and gives the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following links but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get past this error.
Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT by ';' and FLATTEN, as in the sketch below.
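A rough sketch of that approach (the field names here are only illustrative):
A = LOAD '/user/input/multidelimiter' USING PigStorage(',') AS (empid:chararray, ename:chararray, line:chararray);
B = FOREACH A GENERATE empid, ename, FLATTEN(STRSPLIT(line, ';', 2)) AS (sal:chararray, deptno:chararray);
DUMP B;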
Finally got a solution.
Here it is:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);
I am getting a MismatchedTokenException when executing the query below:
0: jdbc:hive2://localhost:10000> INSERT INTO TABLE test_data
. . > VALUES ('s92bd2d2u922432c43', 'd93d2e03422f234',
. . > '{"Foo": "ABC","Bar": "20090101100000","Quux": {"QuuxId": 1234,"QuuxName":
. . > "Sam it doen't matter"}}');
Error: Error while compiling statement: FAILED: ParseException line 3:88 mismatched
input 't' expecting ) near ''{"Foo": "ABC","Bar": "20090101100000","Quux": {"QuuxId":
1234,"QuuxName": "Sam it doen'' in statement (state=42000,code=40000)
It seems it is failing due to the extra ' in the sentence "Sam it doen't matter".
But this is valid JSON. How can this be resolved?
It looks like that extra ' is terminating the string from Hive's perspective, so it doesn't matter that it's valid JSON, because Hive never gets a chance to pass it along to whatever will parse the JSON. You can escape the ' from the Hive command parser using a backslash, similar to:
select get_json_object('{"Test":"This isn\'t a test"}','$');
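Applied to your statement, only the quote inside the JSON needs escaping; something like this should get past the parser (an untested sketch, with the values copied from your INSERT):
INSERT INTO TABLE test_data
VALUES ('s92bd2d2u922432c43', 'd93d2e03422f234',
'{"Foo": "ABC","Bar": "20090101100000","Quux": {"QuuxId": 1234,"QuuxName": "Sam it doen\'t matter"}}');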
How can I find the line numbers (in the source file) of instructions from the AST?
Example: for the following code
24> void foo(){
25> System.out.println(" hi ");
26> }
the AST corresponding to the print statement is
METHOD_CALL
.
.
System
out
println
ARGUMENT_LIST
EXPR
" hi "
I want to retrieve the line number of "System" from the generated tree. The answer for "System" should be 25 (its line number in the source code).
If your Tree for the System token is in fact a CommonTree, then you can use the CommonTree.getToken() method to get the Token for that symbol. You can then call Token.getLine() to get the line number.
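A minimal sketch of that lookup, assuming an ANTLR 3 tree built with the default CommonTree adaptor (the class and method names here are only illustrative):
import org.antlr.runtime.Token;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.Tree;

public class NodeLine {
    // Returns the 1-based source line of an AST node, e.g. 25 for the "System" node above.
    static int lineOf(Tree node) {
        Token token = ((CommonTree) node).getToken(); // the token this node was created from
        return token.getLine();
    }
}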