Slice keywords from log text files - bash

I have a big log file with lines like
[2016-06-03T10:03:12] No data: TW.WA2
and
[2016-06-03T11:03:02] wrong overlaps: XW.W12.HHZ.2007.289
and
[2016-06-03T14:05:26] failed to correct YP.CT02.HHZ.2012.334 because No matching response.
Each line consists of a timestamp, a reason for the logging, and a keyword composed of substrings joined by dots (TW.WA2, XW.W12.HHZ.2007.289 and YP.CT02.HHZ.2012.334 in the examples above).
The format of the keywords of a given type is fixed (the substrings are joined by a fixed number of dots).
The substrings are composed of letters and digits (0-5 characters each; not all of them can be empty, generally at most one is, e.g., XW.WTA12..2007.289).
I want to:
- extract the keywords
- save each type of keyword, deduplicated, to a separate file
Currently I have tried grep, but it only does the classification:
grep "wrong overlaps" logfile > wrong_overlaps
grep "failed to correct" logfile > no_resp
grep "No data" logfile > no_data
In no_data, the contents are expected to look like
AW.AA1
TW.WA2
TW.WA3
...
In no_resp, the contents are expected to look like
XP..HHZ.2002.334
YP.CT01.HHZ.2012.330
YP.CT02.HHZ.2012.334
...
However, the simple grep commands above save the full lines. I guess I need regex to extract the keywords?

Assuming a keyword is defined as groups of letters and digits joined by periods, the following regex will match all keywords:
% grep -oE '\w+(\.\w+)+' data
TW.WA2
XW.W12.HHZ.2007.289
YP.CT02.HHZ.2012.334
-o will print only the matches, and -E enables Extended Regular Expressions.
This will however not let you split the keywords into multiple files, e.g. creating a file wrong_overlaps that contains only the keywords from the "wrong overlaps" lines.
You can use -P to enable Perl Compatible Regular Expressions which support lookbehinds:
% grep -oP '(?<=wrong overlaps: )\w+(\.\w+)+' data
XW.W12.HHZ.2007.289
But note that PCRE doesn't support variable-length lookbehinds, so you will need to type out the full fixed prefix. E.g., given:
something test string: ABC.DEF
ABC.DEF can be extracted with:
(?<=test string: )\w+(\.\w+)+
but not with:
(?<=test string)\w+(\.\w+)+
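Putting it together for the original question, the whole workflow could look like the sketch below. It assumes GNU grep (for -P) and the file name logfile from the question, and it loosens the keyword pattern to [\w.]+ so that keywords with an empty substring, such as XP..HHZ.2002.334, are also caught:
# extract each keyword type, deduplicate, and write it to its own file
grep -oP '(?<=No data: )[\w.]+' logfile | sort -u > no_data
grep -oP '(?<=wrong overlaps: )[\w.]+' logfile | sort -u > wrong_overlaps
grep -oP '(?<=failed to correct )[\w.]+' logfile | sort -u > no_resp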

Related

grep listing false duplicates

I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare the list of entries to another subset file called "disps.dat", formatted in the same way, to find duplicates:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" is listing records that exist in the second file "disps.dat", but do not exist in the first file.
(Note, both files are big, so the samples shown above don't have duplicates, but I do expect, and have confirmed, at least 10-12k duplicates to show up in total)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results work fine, so I'm a little bit confused on why it is failing on the larger execution.
I also tried feeding it in as a string with -F, thinking that the "," comma might be the source of the issue. Right now I am feeding the data through a 'for' loop and echoing each line, which is executing very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the exact (whole) line, while -w matches whole words only and blocks matches that run into adjacent word characters, which in my case handles the trailing numbers.
The issue is that a record in the first file such as:
"AnalogPoint,1"
Would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
Thanks to @Barmar for pointing out my issue.
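For reference, the fixed command could look like this (a sketch: -F treats each line of pilot.dat as a fixed string rather than a regex, and -w rejects matches that run into further word characters, so AnalogPoint,1 no longer flags AnalogPoint,10):
grep -Fw -f pilot.dat disps.dat > duplicate.dat
Using -x instead of -w requires the entire line to match, which is stricter still.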

How to get a match of a pattern even if it is split by characters, using a bash command (similar to grep)?

I'm trying to output all the lines of a file which contain a specific word/pattern, even if there are other characters between its letters.
Let's say we have a bunch of domain names and we want to filter all those that contain "paypal"; I would like to have this kind of output:
pay-pal-secure.com
payppal.net
etc...
I was wondering if this is possible with grep, or if something else exists that might do it.
Many thanks!
Replace paypal with regexp p.*a.*y.*p.*a.*l to allow all characters between the letters.
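For example, assuming the domain names are in a file called file (the same name used in the update below):
grep -E 'p.*a.*y.*p.*a.*l' file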
Update:
Use extended regular expression p.{0,2}a.{0,2}y.{0,2}p.{0,2}a.{0,2}l to limit characters between the letters to none to two.
Example: grep -E 'p.{0,2}a.{0,2}y.{0,2}p.{0,2}a.{0,2}l' file
See: The Stack Overflow Regular Expressions FAQ
Alternatively you could use agrep (approximate grep):
$ agrep -By paypal file
agrep: 2 words match within 1 error
pay-pal-secure.com
payppal.net

How do I create pattern of kmer in unix for a given string?

I have a string, mystring=AACTCGCTTT. I want to create patterns from this string allowing 4 mismatches (i.e. k-mer = 6), starting from the first letter and ending at the last letter, so that I can grep for these patterns in a text file. How do I do that in bash? My patterns would look like this:
????CGCTTT
A????GCTTT
AA?T???TTT
There is a tool called agrep for that purpose:
agrep -4 AACTCGCTTT filename
From the man page:
Searches for approximate matches of PATTERN in each FILE or standard input. Example: 'agrep -2 optimize foo.txt' outputs all lines in file 'foo.txt' that match "optimize" within two errors. E.g. lines which contain "optimise", "optmise", and "opitmize" all match.
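If the string is already in a shell variable as in the question, the same call would simply be (assuming the text file to search is called filename, as in the command above):
mystring=AACTCGCTTT
agrep -4 "$mystring" filename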

How do I grep for all lines without a "#" character in the line

I have a text file open in BBEdit/InDesign with email addresses on some lines (about a third of the lines) and name and date stuff on the other lines. I just want to keep the lines that have emails and remove all the others.
A simple pattern I can see to eliminate all the lines apart from those with email addresses on them is to have a negative match for the # character.
I can't use grep -v pattern because the Find and Replace dialog's implementation of grep just has fields for a Find pattern and a Replace pattern; grep -something options don't exist in this context.
Note, I am not trying to construct a valid email address test at all, just using the presence of one (or more) # characters to allow a line to stay; all other lines must be deleted from the list.
The closest I got was a pattern which hits only the email address lines (opposite outcome of my goal):
^((\w+|[ \.])\w+)[?#].*$
I tried various combinations of ^.*[^#].*$ and more sophisticated \w and [\w|\.] in parentheses, escaping the # as [^\#], and negative lookaheads like (?!).
I want to find these non-email-address lines and delete them using either of these apps on OS X (BBEdit/InDesign). I will use the command line if I have to, but I'd expect there must be a way using in-app Find and Replace with grep.
As stated in the comments, grep -v '#' filename lists all lines without a # symbol. You could also use grep '#' filename > new_filename
The file new_filename will consist only of the lines with #. You can use this new file, or delete all lines in the old file and paste the contents of the new file into it.
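If you do want to stay inside the app's Find and Replace dialog instead, a pattern along these lines should work, assuming its grep mode supports character classes (an untested sketch, not verified in BBEdit or InDesign):
^[^#]*$
It matches only lines that contain no # at all, so replacing every match with nothing (and cleaning up the resulting blank lines) deletes everything except the email lines.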

Find variety of characters in text document

I have a CSV document with 47001 lines in it. Yet when I open it in Excel, there are only 31641 lines.
I know that 47001 is the correct number of lines; it's an export of a database table, whose size I know to be 47001. Additionally: wc -l my.csv returns 47001.
So, Excel's parsing fails. I suspect there is some funky control or whitespace character somewhere in this document.
How do I find out the variety of characters used in some document?
For example, consider this input file: ABCAAAaaa\n.
I would expect the alphabet of characters used in the file to be: ABCa\n.
Maybe if we compress it, we can somehow read the Huffman Tree?
I suspect it will be educational to compare the UTF-8 character variety versus the ASCII character variety. For example: Excel may parse multi-byte characters as ASCII, and thus interpret some bytes as control codepoints.
Here is one way if you are on Linux (the logic should be the same elsewhere, but the command is given for Linux):
sed 's/./&\n/g' | sort -u | tr -d '\n'
What happens:
- First, every character is replaced by itself followed by "\n" [a new line]
- Then all the characters are sorted and only unique occurrences are kept
- Finally all the "\n" are removed
So the input file:
ABCAAAaaa
will become:
A
B
C
A
A
A
a
a
a
After sort:
a
a
a
A
A
A
A
B
C
After uniq:
a
A
B
C
Final output:
aABC
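Applied to the file from the question, that would be (assuming GNU sed, which understands \n in the replacement, and the file name my.csv from the question):
sed 's/./&\n/g' my.csv | sort -u | tr -d '\n'
Any unexpected control characters end up in the printed alphabet; piping the result through od -c makes the non-printing ones visible.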
You can cut out of the original file some columns which are not likely to be changed by the cycle of being parsed and written out again, e.g. a pure text column like a name, or a number. Names would be great. Then let this file go through the cycle and compare it to the original:
Here's the code:
cut -d, -f3,6,8 my.csv > columns.csv
This assumes that columns 3, 6, and 8 are the name columns and that a comma is the separator. Adjust these values according to your input file. Using a single column is also okay.
Now call Excel, parse the file columns.csv, write it out again as a csv file columns2.csv (with the same separator of course). Then:
diff columns.csv columns2.csv | less
A tool like meld instead of diff might also be handy to analyse the differences.
This will show you which lines were changed by the parse → dump cycle. Hopefully it will affect only the lines you are looking for.
