Bash script csv manipulation optimization - bash

I have a 2 million line CSV file, and I want to replace the second column of each line with a value unique to that string; the column is filled with usernames. The long process I've got below does work, but it takes a while.
The values don't have to be hashed, but hashing seemed like a sure way to avoid discrepancies when the next file comes along.
I'm by no means a coder, and was wondering if there was any way I could optimize the process, although I understand the best way to do this would be in some sort of scripting language.
#!/bin/bash
#Enter Filename to Read
echo "Enter File Name"
read filename
#Extracts usernames from file
awk -F "\"*,\"*" '{print $2}' $filename > usernames.txt
#Hashes Usernames using SHA256
cat usernames.txt | while read line; do echo -n $line|openssl sha256 |sed 's/^.* //'; done > hashedusernames.txt
#Deletes usernames out of first file
cat hash.csv | cut -d, -f2 --complement > output.txt
#Pastes hashed usernames to end of first file
paste -d , output.txt hashedusernames.txt > output2.txt
#Moves everything back into place
awk -F "\"*,\"*" '{print $1","$4","$2","$3}' output2.txt > final.csv
Example file (there are 7 columns in all, but only 3 are shown):
Time Username Size
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
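For comparison, here is a minimal single-pass sketch in Python 3 (not one of the scripts above): it replaces the second column with its SHA-256 hex digest as it streams through. The file names input.csv/final.csv are placeholders, and it keeps the columns in their original order rather than reordering them as the script above does.
# Sketch only: one pass over the CSV, hashing column 2 in place.
import csv
import hashlib

with open('input.csv', newline='') as src, \
     open('final.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    for row in reader:
        # Note: a header line, if present, gets hashed too; skip it if needed.
        if len(row) > 1:
            row[1] = hashlib.sha256(row[1].encode('utf-8')).hexdigest()
        writer.writerow(row)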

You could do this easily in Perl in a few lines. The following program uses the Crypt::Digest::SHA256 module, which you need to install from CPAN or from your OS repository if they have it.
The program assumes input from the DATA section, which we typically do around here to include example data in an mcve.
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';
while (my $line = <DATA>) {
    # no need to chomp because we don't touch the last field
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print join ',', @fields;
}
__DATA__
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
It prints the following output.
2017-01-01T14:53.45,g8EPHWc3L1ln_lfRhq8elyOUgsiJm6BtTtb_GVt945s,12345
2016-01-01T13:42.56,jwXsws2dJq9h_R08zgSIPhufQHr8Au8_RmniTQbEKY4,54312
2015-01-01T12:34.34,mkrKXbM1ZiPiXSSnWYNo13CUyzMF5cdP2SxHGyO7rgQ,54123
To make it read a file that is supplied as a command line argument and write to a new file with the .new extension, you can use it like this:
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';
open my $fh_in, '<', $ARGV[0] or die $!;
open my $fh_out, '>', "$ARGV[0].new" or die $!;
while (my $line = <$fh_in>) {
    # no need to chomp because we don't touch the last field
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print $fh_out join ',', @fields;
}
Run it as follows:
$ perl foo.pl example.csv
Your new file will be named example.csv.new.

Yet another Python solution, focused on speed but also on maintainability.
#!/usr/bin/python3
import argparse
import hashlib
import re
parser = argparse.ArgumentParser(description='CSV swapper')
parser.add_argument(
    '-f',
    '--file',
    dest='file_path',
    type=str,
    required=True,
    help='The CSV file path.')


def hash_user(users, user):
    try:
        return users[user]
    except KeyError:
        id_ = int(hashlib.md5(user.encode('utf-8')).hexdigest(), 16)
        users[user] = id_
        return id_


def main():
    args = parser.parse_args()
    username_extractor = re.compile(r',([\s\S]*?),')
    users = {}
    counter = 0
    templ = ',{},'
    with open(args.file_path) as file:
        with open('output.csv', 'w') as output:
            line = file.readline()
            while line:
                try:
                    counter += 1
                    if counter == 1:
                        continue
                    username = username_extractor.search(line).groups()[0]
                    hashuser = hash_user(users, username)
                    output.write(username_extractor.sub(
                        templ.format(hashuser), line)
                    )
                except StopIteration:
                    break
                except:
                    print('Malformed line at {}'.format(counter))
                finally:
                    line = file.readline()


if __name__ == '__main__':
    main()
There are still some points that could be optimized, but the central ideas are: try first instead of checking (ask forgiveness, not permission), and cache each user's hash so that repeated users don't have to be re-digested.
Also, will you run this on a multi-core host? If so, this could easily be improved with threads.
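On the multi-core point, here is a rough sketch of one way that could look, using multiprocessing rather than threads (hashing is CPU-bound, so in CPython separate processes sidestep the GIL); the per-user cache is dropped for simplicity, and the file names, digest handling and chunk size are placeholders:
# Sketch only: hash the second CSV column in parallel with a process pool.
import csv
import hashlib
from multiprocessing import Pool

def hash_row(row):
    # Replace the username column with its MD5 hex digest (placeholder choice).
    if len(row) > 1:
        row[1] = hashlib.md5(row[1].encode('utf-8')).hexdigest()
    return row

if __name__ == '__main__':
    with open('input.csv', newline='') as src, \
         open('output.csv', 'w', newline='') as dst, \
         Pool() as pool:
        writer = csv.writer(dst)
        for row in pool.imap(hash_row, csv.reader(src), chunksize=1000):
            writer.writerow(row)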

This Python program might do what you want. You can pass the filenames to convert on the command line:
$ python this_program.py file1.csv file2.csv
import fileinput
import csv
import sys
import hashlib
class stdout:
    def write(self, *args):
        sys.stdout.write(*args)

input = fileinput.input(inplace=True, backup=".bak", mode='rb')
reader = csv.reader(input)
writer = csv.writer(stdout())

for row in reader:
    row[1] = hashlib.sha256(row[1]).hexdigest()
    writer.writerow(row)

Since you used awk in your original attempt, here's a simpler approach in awk:
awk -F"," -v OFS="," 'BEGIN{i=0;}
{
  if (unique_names[$2] == "") {
    unique_names[$2]="Unique"i;
    i++;
  }
  $2=unique_names[$2];
  print $0
}'

Related

lowercase and remove punctuation from a csv

I have a giant file (6gb) which is a csv and the rows look like so:
"87687","institute Polytechnic, Brazil"
"342424","university of India, India"
"24343","univefrsity columbia, Bogata, Colombia"
and I would like to remove all punctuation and lowercase the second column, yielding:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
What would be the most efficient way to do this in the terminal?
Tried:
cat TEXTFILE | tr -d '[:punct:]' > OUTFILE
Problem: the result is not in lowercase, and tr seems to act on both columns, not just the second.
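For comparison with the answers below, here is the same transformation as a minimal Python sketch using the csv module; the file names are placeholders, and QUOTE_ALL keeps every field quoted like the sample output:
# Sketch only: lowercase column 2 and strip punctuation/whitespace from it.
import csv
import string

drop = str.maketrans('', '', string.punctuation + string.whitespace)

with open('infile.csv', newline='') as src, \
     open('outfile.csv', 'w', newline='') as dst:
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL)
    for row in csv.reader(src):
        row[1] = row[1].lower().translate(drop)   # clean the second column only
        writer.writerow(row)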
With a real CSV parser in Perl, this is the robust/reliable way, using just one process.
Since it works line by line, the 6 GB file size should not be an issue.
#!/usr/bin/perl
use strict; use warnings; # harness
use Text::CSV; # load the needed module (install it)
use feature qw/say/; # say = print("...\n")
# create an instance of a new CSV parser
my $csv = Text::CSV->new({ auto_diag => 1 });
# open a File Handle or exit with error
open my $fh, "<:encoding(utf8)", "file.csv" or die "file.csv: $!";
while (my $row = $csv->getline ($fh)) {   # parse line by line
    $_ = $row->[1];                       # parse only column 2
    s/[\s[:punct:]]//g;                   # removes both space(s) and punct(s)
    $_ = lc $_;                           # Lower Case current value $_
    $row->[1] = qq/"$_"/;                 # edit changes and (re)"quote"
    say join ",", @$row;                  # display the whole current row
}
close $fh; # close the File Handle
Output
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
install
cpan Text::CSV
Here's an approach using xsv and process substitution:
paste -d, \
<(xsv select 1 infile.csv) \
<(xsv select 2 infile.csv | sed 's/[[:blank:][:punct:]]*//g;s/.*/\L&/')
The sed command first removes all blanks and punctuation, then lowercases the entire match.
This also works when the first field contains blanks and commas, and retains quoting where required.
Using sed
$ sed -E ':a;s/([^,]*,)([^ ,]*)[ ,]([[:alpha:]]+)/\1\L\2\3/;ta' input_file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia
I suggest using this awk solution, which should work with any version of awk:
awk 'BEGIN{FS=OFS="\",\""} {
gsub(/[^[:alnum:]"]+/, "", $2); $2 = tolower($2)} 1' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Details:
We make "," the input and output field separator in the BEGIN block
gsub(/[^[:alnum:]"]+/, "", $2): Strip all non-alphanumeric characters except "
$2 = tolower($2): Lowercase second column
One GNU awk (for gensub()) idea:
awk '
BEGIN { FS=OFS="\"" }
{ $4=gensub(/[^[:alnum:]]/,"","g",tolower($4)) }
1'
This generates:
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
Another sed approach -
sed -E 's/ +//g; s/([^"]),/\1/g; s/"([^"]*)"/"\L\1"/g' file
I don't like how that leaves no flexibility, and makes you rewrite the logic if you find something else you want to remove, though.
Another in awk -
awk -F'[", ]+' '
{ printf "\"%s\",\"", $2;
for(c=3;c<=NF;c++) printf "%s", tolower($c);
print "\"";
}' file
This approach lets you define and add any additional offending characters into the field delimiters without editing your logic.
$: pat=$"[\"',_;:!#@\$%)(* -]+"
$: echo "$pat"
["',_;:!##$%)(* -]+
$: cat file
"87687","institute 'Polytechnic, Brazil"
"342424","university; of-India, India"
"24343","univefrsity )columbia, Bogata, Colombia"
$: awk -F"$pat" '{printf "\"%s\",\"", $2; for(c=3;c<=NF;c++) printf "%s", tolower($c); print "\"" }' file
"87687","institutepolytechnicbrazil"
"342424","universityofindiaindia"
"24343","univefrsitycolumbiabogatacolombia"
(I hate the way that lone single quote throws the markup color/format parsing off, lol)
Another way, using Ruby. I edited the data to show that only the second field is modified.
% ruby -r 'csv' -e 'f = open("file");
CSV.parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Using FastCSV gives a huge speedup
gem install fastcsv
% ruby -r 'fastcsv' -e 'f = open("file");
FastCSV.raw_parse(f) do |i|
puts "\"" + i[0] + "\",\"" + i[1].downcase.gsub(/[ ,]/,"") + "\"" end'
"8768, 7","institutepolytechnicbrazil"
"342 424","universityofindiaindia"
"243 43","univefrsitycolumbiabogatacolombia"
Data
% cat file
"8768, 7","institute Polytechnic, Brazil"
"342 424","university of India, India"
"243 43","univefrsity columbia, Bogata, Colombia"
With your shown samples and attempts, please try the following GNU awk code, which uses awk's match() function. The regex (^"[^"]*",")([^"]*)(".*)$ creates three capturing groups whose values are stored in arr; they are then used later in the program to rebuild the line and meet the OP's requirement.
awk '
match($0,/(^"[^"]*",")([^"]*)(".*)$/,arr){
  gsub(/[^[:alnum:]]+/,"",arr[2])
  print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
This might work for you (GNU sed):
sed -E s'/("[^"]*",)/\1\n/;h;s/.*\n//;s/[[:punct:] ]//g;s/.*/"\L&"/;H;g;s/\n.*\n//' file
Divide and rule.
Partition the line into two fields, make a copy, process the second field (removing punctuation and spaces, lowercasing and re-quoting it), and then re-assemble the fields.
An alternative, perhaps?
sed -E ':a;s/^("[^"]*",".*)[^[:alpha:]"](.*)/\L\1\2/;ta' file
Here is a way to do so in PHP.
Note: PHP will not output double quotes unless they are needed (as they may be for the first column). The second column will never need double quotes: after the transformation it has no spaces or special characters.
$max_line_length = 100;
if (($fp = fopen("file.csv", "r")) !== FALSE) {
    while (($data = fgetcsv($fp, $max_line_length, ",")) !== FALSE) {
        $data[1] = strtolower(preg_replace('/[\s[:punct:]]/', '', $data[1]));
        fputcsv(STDOUT, $data, ',', '"');
    }
    fclose($fp);
}

use grep and awk to transfer data from .srt to .csv/xls

I got an interesting project to do! I'm thinking about converting an srt file into a csv/xls file.
An srt file would look like this:
1
00:00:00,104 --> 00:00:02,669
Hi, I'm shell-scripting.
2
00:00:02,982 --> 00:00:04,965
I'm not sure if it would work,
but I'll try it!
3
00:00:05,085 --> 00:00:07,321
There must be a way to do it!
I want to output it into a csv file like this:
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
So as you can see, each subtitle takes up two rows. My thinking is to use grep to put the srt data into the xls, and then use awk to format the xls file.
What do you guys think? How am I supposed to write it? I tried
$grep filename.srt > filename.xls
It seems that all the data including the time codes and the subtitle words ended up all in column A of the xls file...but I want the words to be in column B...How would awk be able to help with the formatting?
Thank you in advance! :)
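For comparison with the awk answers that follow, here is a minimal Python sketch of the same idea: read the file in blank-line-separated blocks and emit one CSV row per block, with extra subtitle lines going into the last column only. The file names are placeholders and the quoting is written by hand to match the sample output above.
# Sketch only: convert blank-line-separated SRT blocks to CSV rows.
def q(s):
    return '"%s"' % s.replace('"', '""')        # CSV-quote a single field

with open('filename.srt') as srt, open('filename.csv', 'w') as out:
    for block in srt.read().split('\n\n'):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue                            # skip empty or malformed blocks
        start, end = lines[1].split(' --> ')
        out.write(','.join([q(lines[0]), q(start), q(end), q(lines[2])]) + '\n')
        for extra in lines[3:]:                 # continuation lines of the subtitle
            out.write(',,,' + q(extra) + '\n')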
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS=","; q="\""; s=q OFS q }
{
    split($2,a,/ .* /)
    print q $1 s a[1] s a[2] s $3 q
    for (i=4;i<=NF;i++) {
        print "", "", "", q $i q
    }
}
$ awk -f tst.awk file
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work,"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
I think something like this should do it quite nicely:
awk -v RS= -F'\n' '
{
sub(" --> ","\x7c",$2) # change "-->" to "|"
printf "%s|%s|%s\n",$1,$2,$3 # print scene, time start, time stop, description
for(i=4;i<=NF;i++)printf "|||%s\n",$i # print remaining lines of description
}' file.srt
The -v RS= sets the Record Separator to blank lines. The -F'\n' sets the Field Separator to new lines.
The sub() replaces the "-->" with a pipe symbol (|).
The first three fields are then printed separated by pipes, and then there is a little loop to print the remaining lines of description, inset by three pipe symbols to make them line up.
Output
1|00:00:00,104|00:00:02,669|Hi, I'm shell-scripting.
2|00:00:02,982|00:00:04,965|I'm not sure if it would work,
|||but I'll try it!
3|00:00:05,085|00:00:07,321|There must be a way to do it!
As I felt like having some more fun with Perl and Excel, I took the above output, parsed it in Perl, and wrote a real Excel XLSX file. Of course, there is no real need to use both awk and Perl, so ideally one would re-cast the awk and integrate it into the Perl, since the latter can write Excel files while the former cannot. Anyway, here is the Perl.
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $DEBUG=0;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $row=0;
while(my $line=<>){
    $row++;                                 # move down a line in Excel worksheet
    chomp $line;                            # strip CR
    my @f=split /\|/, $line;                # split fields of line into array @f, on pipe symbols (|)
    for(my $j=0;$j<scalar @f;$j++){         # loop through all fields
        my $cell= chr(65+$j) . $row;        # calculate Excel cell, starting at A1 (65="A")
        $worksheet->write($cell,$f[$j]);    # write to spreadsheet
        printf "%s:%s ",$cell,$f[$j] if $DEBUG;
    }
    printf "\n" if $DEBUG;
}
$workbook->close;
My other answer was half awk and half Perl, but, given that awk can't write Excel spreadsheets whereas Perl can, it seems daft to require you to master both awk and Perl when Perl is perfectly capable of doing it all on its own... so here goes in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $ExcelRow=0;
local $/ = ""; # set paragraph mode, so we read till next blank line as one record
while(my $para=<>){
    $ExcelRow++;                                # move down a line in Excel worksheet
    chomp $para;                                # strip CR
    my @lines=split /\n/, $para;                # split paragraph into lines on linefeed character
    my $scene = $lines[0];                      # pick up scene number from first line of para
    my ($start,$end)=split / --> /,$lines[1];   # pick up start and end time from second line
    my $cell=sprintf("A%d",$ExcelRow);          # work out cell
    $worksheet->write($cell,$scene);            # write scene to spreadsheet column A
    $cell=sprintf("B%d",$ExcelRow);             # work out cell
    $worksheet->write($cell,$start);            # write start time to spreadsheet column B
    $cell=sprintf("C%d",$ExcelRow);             # work out cell
    $worksheet->write($cell,$end);              # write end time to spreadsheet column C
    $cell=sprintf("D%d",$ExcelRow);             # work out cell
    $worksheet->write($cell,$lines[2]);         # write description to spreadsheet column D
    for(my $i=3;$i<scalar @lines;$i++){         # output additional lines of description
        $ExcelRow++;
        $cell=sprintf("D%d",$ExcelRow);         # work out cell
        $worksheet->write($cell,$lines[$i]);
    }
}
$workbook->close;
Save the above on a file called srt2xls and then make it executable with the command:
chmod +x srt2xls
Then you can run it with
./srt2xls < SomeFile.srt
and it will give you a spreadsheet called result.xlsx
Since you want to convert the srt into csv, below is an awk command:
awk '{gsub(" --> ","\x22,\x22");if(NF!=0){if(j<3)k=k"\x22"$0"\x22,";else{k="\x22"$0"\x22 ";l=1}j=j+1}else j=0;if(j==3){print k;k=""}if(l==1){print ",,,"k ;l=0;k=""}}' inputfile > output.csv
Detailed view of the awk:
awk '{
  gsub(" --> ","\x22,\x22");
  if(NF!=0)
  {
    if(j<3)
      k=k"\x22"$0"\x22,";
    else
    {
      k="\x22"$0"\x22 ";
      l=1
    }
    j=j+1
  }
  else
    j=0;
  if(j==3)
  {
    print k;
    k=""
  }
  if(l==1)
  {
    print ",,,"k;
    l=0;
    k=""
  }
}' inputfile > output.csv
Take output.csv to a Windows machine, then open it with Microsoft Excel and save it with the .xls extension.
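If the manual Excel step is unwanted, a small Python sketch using the third-party openpyxl package (pip install openpyxl) could turn output.csv straight into a real .xlsx workbook instead:
# Sketch only: read output.csv and save it as an .xlsx workbook with openpyxl.
import csv
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
with open('output.csv', newline='') as f:
    for row in csv.reader(f):
        ws.append(row)                          # one CSV row per worksheet row
wb.save('output.xlsx')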

Print to out file

I am trying to find intersecting lines between two files. One of the files is 'Sample_hg19_mapped.bed' and the other one 'intersect.RData' has some of the same data as the first one.
Bed file:
chrM 16338 16363 HWI-ST575:220:C2MMMACXX:3:1112:17158:21371 255 -
chrM 16352 16377 HWI-ST575:220:C2MMMACXX:3:1102:7906:41988 255 -
chrM 16352 16377 HWI-ST575:220:C2MMMACXX:3:2113:18341:36393 255 -
chrM 16376 16401 HWI-ST575:220:C2MMMACXX:3:1310:14517:85268 255 -
RData file:
HWI-ST575:220:C2MMMACXX:3:1310:14517:85268
HWI-ST575:220:C2MMMACXX:3:2113:18341:36393
HWI-ST575:220:C2MMMACXX:3:2113:45341:56393
As output, it needs to give the lines of the BED file whose value also appears in the RData file. For example, the first and the second value of RData exist in the BED file, but not the third one, so the output needs to be:
chrM 16376 16401 HWI-ST575:220:C2MMMACXX:3:1310:14517:85268 255 -
chrM 16352 16377 HWI-ST575:220:C2MMMACXX:3:2113:18341:36393 255 -
I managed it with this code:
perl -ane '$f=$F[0].$F[1]; print "$k{$f}$_" if $k{$f}; $k{$f}=$_;' Sample_hg19_mapped.bed intersect.RData
But the matching lines only go to the screen, and I want to keep them in a file, but I cannot make the output file. I tried this, after changing a lot:
####!/bin/bash
perl -ane '$f=$F[0].$F[1]';"Sample_hg19_mapped.bed intersect.RData"
if $k{$f};$k{$f}=$_ {
print "$k{$f}$_";
} else {
print "epic fail";
}
open($f, ">", "output.txt")
or die "cannot open > output.txt: $!";
close $f;
print "done\n";
But I get many errors, like:
/var/spool/slurmd/job2572366/slurm_script: line 3: Sample_hg19_mapped.bed intersect.RData: command not found
/var/spool/slurmd/job2572366/slurm_script: line 6: syntax error near unexpected token `}'
/var/spool/slurmd/job2572366/slurm_script: line 6: `} else {'
Can you maybe help me on this?
Thank you so much
If your command works but outputs to the screen, simply redirect that to a file:
command > output.txt
e.g.
perl -ane '$f=$F[0].$F[1]; print "$k{$f}$_" if $k{$f}; $k{$f}=$_;' Sample_hg19_mapped.bed intersect.RData > output.txt
If you want to remove all the empty lines you can add next if /^\s*$/; to the start:
perl -ane 'next if /^\s*$/; $f=$F[0].$F[1]; print "$k{$f}$_" if $k{$f}; $k{$f}=$_;' Sample_hg19_mapped.bed intersect.RData > output.txt
This will skip any input lines which are only whitespace.
Your code is a bit messy and that is where the errors come from, but if you want to output to a file you can do this:
open (MYFILE, '>>NameOfFile');
print MYFILE $variable
Have a try with this:
This uses your RData values as hash keys, and then looks for them in the bed file, printing any matches to 'output.txt'.
use strict;
use warnings;
use autodie;
open my $bed, '<', 'in.txt';
open my $rdata, '<', 'Rdata.txt';
my (%bed, %rdata);
while(<$rdata>){
    chomp;
    $rdata{$_} = 2;   # Each line is a key in the hash %rdata
}

open my $out_file, '>', 'output.txt';

while(<$bed>){
    chomp;
    next unless /chrM/;
    my @split = split /\t/;
    print $out_file "$_\n" if $rdata{$split[3]};   # will print to output.txt any line where the 4th column matches a key from %rdata
}
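The same approach as a short Python sketch, in case that is easier to adapt; the file names are the ones used in the Perl above:
# Sketch only: keep lines of in.txt whose 4th column appears in Rdata.txt.
with open('Rdata.txt') as f:
    wanted = {line.strip() for line in f if line.strip()}   # IDs, blanks skipped

with open('in.txt') as bed, open('output.txt', 'w') as out:
    for line in bed:
        fields = line.split()                   # split on any whitespace
        if len(fields) > 3 and fields[3] in wanted:
            out.write(line)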
The following perl one-liner should do what you need:
perl -lane'
BEGIN { $x = pop; %h = map { chomp; $_ => 1 } <>; @ARGV = $x }
print if /./ && $h{$F[3]}
' intersect.RData Sample_hg19_mapped.bed
We load intersect.RData into a hash map in the BEGIN block
In the main body we check whether the fourth field ($F[3]) from the Sample_hg19_mapped.bed file is present in our hash map. If it is, we print the line.
If the output looks fine to you then you can redirect to another file.

Parsing file like .ini using bash, sed, awk

I have a file like this:
[User1] <- unique id
name= <- values can be empty
pwd=
...
<- empty line
[User2]
name=
pwd=
..
[User3]
name=
pwd=
..
I need the ability:
to get the field values for User2
to change a field value (e.g. pwd).
PS Using bash, sed or awk is preferable
You can do it with three rules like this (nawk compatible):
awk -F= '
/^\[/ { user=$1; gsub("[][]","",user) }
user == "User2" && $1 == "pwd" { $0=$1"=some_pwd" }
1
'
Output:
[User1]
name=
pwd=
...
[User2]
name=
pwd=some_pwd
..
[User3]
name=
pwd=
..
Here's a simple solution to change the value of pwd. This will add an extra newline to the end of the record if pwd is the last field.
awk '/^\[User2\]/ { sub( "\npwd=[^\n]*(\n|$)",
"\npwd=newvalue\n") } 1' ORS='\n\n' RS= input-file > output-file
mv output-file input-file
This one is a clear win for Python vs. AWK, as Python comes with a built-in module for just this sort of problem.
The module's name changed from Python 2.x to Python 3.x; the try block at the top should allow this to work with either Python 2.x or Python 3.x (and I tested it with both on my computer).
EDIT: I just slightly improved the answer. Now, instead of writing a new file, it writes a temp file, and when it is successfully done it deletes the original file and renames the temp file to the original file name. On non-Windows systems, the step of removing the original file is optional.
import os
import sys

try:
    import ConfigParser as cp
except ImportError:
    import configparser as cp

try:
    _, fname = sys.argv
except Exception:
    print("Usage: configedit <filename>")
    sys.exit(1)

temp_file = fname + ".tempfile"

c = cp.ConfigParser()
c.read(fname)
c.set("User2", "pwd", "XkcdApprovedLongerPassword")

with open(temp_file, "w") as f:
    c.write(f)

os.remove(fname)
os.rename(temp_file, fname)
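The question also asks how to read the field values for a given user; with the same module that is just a get() call. A minimal sketch, with the .ini file name as a placeholder:
# Sketch only: read fields for one section of an .ini-style file.
try:
    import ConfigParser as cp    # Python 2
except ImportError:
    import configparser as cp    # Python 3

c = cp.ConfigParser()
c.read('users.ini')              # placeholder file name
print(c.get('User2', 'name'))    # value of "name" under [User2]
print(c.get('User2', 'pwd'))     # value of "pwd" under [User2]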
As requested via sed:
sed -i "/\[User2\]/,/^$/{s/\(^pwd\)\=.*$/\1\=password/}"
Change "password" in the line above to whatever you want to change the password to within the file.
This script will search between "[User2]" and the next blank line.
It will then find the line starting with "pwd=" and change anything after that.
For those looking closely: I didn't capture the "=" sign, to keep it readable for the requester.

Grep search strings with line breaks

How to use grep to output occurrences of the string 'export to excel' in the input files given below? Specifically, how to handle the line breaks that happen in between the search strings? Is there a switch in grep that can do this, or perhaps some other command?
Input files:
File a.txt:
blah blah ... export to
excel ...
blah blah..
File b.txt:
blah blah ... export to excel ...
blah blah..
Do you just want to find files that contain the pattern, ignoring linebreaks, or do you want to actually see the matching lines?
If the former, you can use tr to convert newlines to spaces:
tr '\n' ' ' | grep 'export to excel'
If the latter you can do the same thing, but you may want to use the -o flag to only print the actual match. You'll then want to adjust your regex to include any extra context you want.
I don't know how to do this in grep. I checked the man page for egrep(1) and it can't match with a newline in the middle either.
I like the solution @Laurence Gonsalves suggested, of using tr(1) to wipe out the newlines. But as he noted, it will be a pain to print the matching lines if you do it that way.
If you want to match despite a newline and then print the matching line(s), I can't think of a way to do it with grep, but it would not be too hard in any of Python, AWK, Perl, or Ruby.
Here's a Python script that solves the problem. I decided that, for lines that only match when joined to the previous line, I would print a --> arrow before the second line of the match. Lines that match outright are always printed without the arrow.
This is written assuming that /usr/bin/python is Python 2.x. You can trivially change the script to work under Python 3.x if desired.
#!/usr/bin/python
import re
import sys
s_pat = "export\s+to\s+excel"
pat = re.compile(s_pat)
def print_ete(fname):
    try:
        f = open(fname, "rt")
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)
    prev_line = ""
    i_last = -10
    for i, line in enumerate(f):
        # is ete within current line?
        if pat.search(line):
            print "%s:%d: %s" % (fname, i+1, line.strip())
            i_last = i
        else:
            # construct extended line that included previous
            # note newline is stripped
            s = prev_line.strip("\n") + " " + line
            # is ete within extended line?
            if pat.search(s):
                # matched ete in extended so want both lines printed
                # did we print prev line?
                if not i_last == (i - 1):
                    # no so print it now
                    print "%s:%d: %s" % (fname, i, prev_line.strip())
                # print cur line with special marker
                print "--> %s:%d: %s" % (fname, i+1, line.strip())
                i_last = i
        # make sure we don't match ete twice
        prev_line = re.sub(pat, "", line)

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError  # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
                     "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
EDIT: added comments.
I went to some trouble to make it print the correct line number on each line, using a format similar to what you would get with grep -Hn.
It could be much shorter and simpler if you don't need line numbers, and you don't mind reading in the whole file at once into memory:
#!/usr/bin/python
import re
import sys
# This pattern not compiled with re.MULTILINE on purpose.
# We *want* the \s pattern to match a newline here so it can
# match across multiple lines.
# Note the match group that gathers text around ete pattern uses a character
# class that matches anything but "\n", to grab text around ete.
s_pat = "([^\n]*export\s+to\s+excel[^\n]*)"
pat = re.compile(s_pat)
def print_ete(fname):
    try:
        text = open(fname, "rt").read()
    except IOError:
        sys.stderr.write('print_ete: unable to open file "%s"\n' % fname)
        sys.exit(2)
    for s_match in re.findall(pat, text):
        print s_match

try:
    if sys.argv[1] in ("-h", "--help"):
        raise IndexError  # print help
except IndexError:
    sys.stderr.write("print_ete <filename>\n")
    sys.stderr.write('grep-like tool to print lines matching "%s"\n' %
                     "export to excel")
    sys.exit(1)

print_ete(sys.argv[1])
grep -A1 "export to" filename | grep -B1 "excel"
I have tested this a little and it seems to work:
sed -n '$b; /export to excel/{p; b}; N; /export to\nexcel/{p; b}; D' filename
You can allow for some extra white space at the end and beginning of the lines like this:
sed -n '$b; /export to excel/{p; b}; N; /export to\s*\n\s*excel/{p; b}; D' filename
Use gawk. Set the record separator to "excel", then check for "export to".
gawk -vRS="excel" '/export.*to/{print "found export to excel at record: "NR}' file
or
gawk '/export.*to.*excel/{print}
/export to/&&!/excel/{
    s=$0
    getline line
    if (line~/excel/){
        printf "%s\n%s\n",s,line
    }
}' file
