Validate header of file in linux - bash

I am using the script below to validate the header of a file. I created one file that contains only the header, and I compare it with another file that contains the header followed by the column data.
awk -F"|" 'FNR==NR{hn=split($0,header); next}
FNR==1 {n=split($0,fh)
for(i=0;i<=hn; i++)
if (fh[i]!=header[i]) {
printf "%s:order of %s is not correct\n",FILENAME, header[i]
next}
if (hn==n)
print FILENAME, "has expected order of fields"
else
print FILENAME, "has extra fields"
next
}' key /Scripts/gst/Kenan_Test_Scenarios1.txt
Sample file header(Key)
SourceIdentifier|SourceFileName|GLAccountCode|Division|SubDivision|ProfitCentre1|ProfitCentre2|PlantCode|ReturnPeriod|SupplierGSTIN|DocumentType|SupplyType|DocumentNumber|DocumentDate|OriginalDocumentNumber|OriginalDocumentDate|CRDRPreGST|LineNumber|CustomerGSTIN|UINorComposition|OriginalCustomerGSTIN|CustomerName|CustomerCode|BillToState|ShipToState|POS|PortCode|ShippingBillNumber|ShippingBillDate|FOB|ExportDuty|HSNorSAC|ProductCode|ProductDescription|CategoryOfProduct|UnitOfMeasurement|Quantity|TaxableValue|IntegratedTaxRate|IntegratedTaxAmount|CentralTaxRate|CentralTaxAmount|StateUTTaxRate|StateUTTaxAmount|CessRateAdvalorem|CessAmountAdvalorem|CessRateSpecific|CessAmountSpecific|InvoiceValue|ReverseChargeFlag|TCSFlag|eComGSTIN|ITCFlag|ReasonForCreditDebitNote|AccountingVoucherNumber|AccountingVoucherDate|Userdefinedfield1|Userdefinedfield2|Userdefinedfield3
File 2 header along with data(Kenan_Test_Scenarios1.txt)
SourceIdentifier|SourceFileName|GLAccountCode|Division|SubDivision|ProfitCentre1|ProfitCentre2|PlantCode|ReturnPeriod|SupplierGSTIN|DocumentType|SupplyType|DocumentNumber|DocumentDate|OriginalDocumentNumber|OriginalDocumentDate|CRDRPreGST|LineNumber|CustomerGSTIN|UINorComposition|OriginalCustomerGSTIN|CustomerName|CustomerCode|BillToState|ShipToState|POS|PortCode|ShippingBillNumber|ShippingBillDate|FOB|ExportDuty|HSNorSAC|ProductCode|ProductDescription|CategoryOfProduct|UnitOfMeasurement|Quantity|TaxableValue|IntegratedTaxRate|IntegratedTaxAmount|CentralTaxRate|CentralTaxAmount|StateUTTaxRate|StateUTTaxAmount|CessRateAdvalorem|CessAmountAdvalorem|CessRateSpecific|CessAmountSpecific|InvoiceValue|ReverseChargeFlag|TCSFlag|eComGSTIN|ITCFlag|ReasonForCreditDebitNote|AccountingVoucherNumber|AccountingVoucherDate|Userdefinedfield1|Userdefinedfield2|Userdefinedfield3
KEN|TEST1|||Tela|Outw|ANP|POST|1017|36AAA|NV|TX|4841446542|2017-12-12||2035-06-11|Y|1|36AAACB89|||||||36||||||94||Telecomm Servi||||1557.20|0.00|10.00|9.00|140.15|9.00|140.15|||||18.50||||||||B2B INV||
I get the output below, which is not correct, since the headers in both files are the same:
is not correctnan_Test_Scenarios1.txt:order of Userdefinedfield3
Could you please help me fix the code? I also need to report every header name that mismatches, not just one.

OK, you've tagged this perl, so here's a perl answer. I think you're focussing on the wrong problem - why not instead read row by row, parse them into a hash, and then output your desired ordering:
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
open ( my $first_file, '<', 'file_name_here' ) or die $!;
chomp ( my @header = split /\|/, <$first_file> );
close ( $first_file );
#debugging
print Dumper \@header;
open ( my $second_file, '<', 'second_file_name_here' ) or die $!;
chomp ( my @second_header = split /\|/, <$second_file> );
print join ( "|", @header ), "\n";
while ( <$second_file> ) {
    chomp; #remove the newline so it doesn't end up in the last field
    my %row;
    #use ordering of column headings to read into named fields;
    #the -1 limit keeps trailing empty fields
    @row{@second_header} = split /\|/, $_, -1;
    #debugging output to show you what's going on.
    print Dumper \%row;
    print join ("|", @row{@header} ), "\n";
}
That way you don't care if the order is wrong, because you forward fix it.
If you really need to compare, then you can iterate over each of the @header arrays and look for differences. But that's more a question of what you're actually trying to get. I would suggest looking at Array::Utils, because that lets you trivially use array_diff, intersect and unique.

This may come in handy:
$ diff -y --suppress-common-lines <(tr '|' '\n' <file1) <(tr '|' '\n' <file2)
I used your first file as is for file1 and used this
$ sed 's/2/8/;s/Export/Import/' file1 > file2
to create the second file. Running the diff command gives
ProfitCentre2 | ProfitCentre8
ExportDuty | ImportDuty
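The comparison the question asks for can also be sketched in awk itself. This is a minimal sketch under two assumptions that the thread does not confirm: that the garbled message in the question comes from a trailing carriage return in one of the files (Windows line endings), and that reporting position, found name and expected name is the wanted format. The file names and sample headers are made up. Note that arrays filled by split() start at index 1, not 0.

```shell
#!/bin/sh
# Sketch: report every header mismatch between a key file and a data file.
# Assumption: both files are pipe-delimited; names and contents are made up.
workdir=$(mktemp -d)
printf 'A|B|C|D\n' > "$workdir/key"
printf 'A|X|C|Y\r\n1|2|3|4\r\n' > "$workdir/data.txt"    # CRLF line endings

report=$(cd "$workdir" && awk -F'|' '
    { sub(/\r$/, "") }                            # drop a trailing CR, if any
    FNR == NR { hn = split($0, header); next }    # first file: expected names
    FNR == 1 {
        n = split($0, fh)
        bad = 0
        for (i = 1; i <= hn; i++)                 # split() arrays start at 1
            if (fh[i] != header[i]) {
                printf "%s: field %d is \"%s\", expected \"%s\"\n", FILENAME, i, fh[i], header[i]
                bad++
            }
        if (n > hn) { printf "%s: %d extra field(s)\n", FILENAME, n - hn; bad++ }
        if (!bad) print FILENAME ": header matches"
        exit                                      # only the header line matters
    }
' key data.txt)
printf '%s\n' "$report"
rm -rf "$workdir"
```

With the sample data this reports fields 2 and 4 and keeps going after the first mismatch instead of stopping.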

Related

Splitting large text file on every blank line

I'm having a bit of trouble splitting a large text file into multiple smaller ones. The syntax of my text file is the following:
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done

...and so on
A typical information table in my file has anywhere between 10-40 rows.
I would like this file to be split into n smaller files, where n is the number of content tables.
That is
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
would be its own separate file, (whateverN.txt)
and
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
again a separate file whateverN+1.txt and so forth.
It seems like awk or Perl are nifty tools for this, but having never used them before the syntax is kinda baffling.
I found these two questions that are almost correspondent to my problem, but failed to modify the syntax to fit my needs:
Split text file into multiple files & How can I split a text file into multiple text files? (on Unix & Linux)
How should one modify the command line inputs, so that it solves my problem?
Setting RS to null tells awk to use one or more blank lines as the record separator. Then you can simply use NR to set the name of the file corresponding to each new record:
awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
RS:
This is awk's input record separator. Its default value is a string containing a single newline character, which means that an input record consists of a single line of text. It can also be the null string, in which case records are separated by runs of blank lines, or a regexp, in which case records are separated by matches of the regexp in the input text.
$ cat file.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion

asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion

asdasd #299 yadayada 60 40
content
content
contend done
$ awk -v RS= '{print > ("whatever-" NR ".txt")}' file.txt
$ ls whatever-*.txt
whatever-1.txt whatever-2.txt whatever-3.txt
$ cat whatever-1.txt
dasdas #42319 blaablaa 50 50
content content
more content
content conclusion
$ cat whatever-2.txt
asdasd #92012 blaablaa 30 70
content again
more of it
content conclusion
$ cat whatever-3.txt
asdasd #299 yadayada 60 40
content
content
contend done
$
You could use the csplit command:
csplit \
--quiet \
--prefix=whatever \
--suffix-format=%02d.txt \
--suppress-matched \
infile.txt /^$/ {*}
POSIX csplit only uses short options and doesn't know --suffix-format and --suppress-matched, so this requires GNU csplit.
This is what the options do:
--quiet – suppress output of file sizes
--prefix=whatever – use whatever instead of the default xx filename prefix
--suffix-format=%02d.txt – append .txt to the default two digit suffix
--suppress-matched – don't include the lines matching the pattern on which the input is split
/^$/ {*} – split on pattern "empty line" (/^$/) as often as possible ({*})
Perl has a useful feature called the input record separator, $/.
This is the 'marker' for separating records when reading a file.
So:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = "\n\n";
my $count = 0;
while ( my $chunk = <> ) {
open ( my $output, '>', "filename_".$count++ ) or die $!;
print {$output} $chunk;
close ( $output );
}
Just like that. The <> is the 'magic' filehandle, in that it reads piped data or from files specified on command line (opens them and reads them). This is similar to how sed or grep work.
This can be reduced to a one liner:
perl -00 -pe 'open ( $out, ">", "filename_".++$n ); select $out;' yourfilename_here
You can use this awk,
awk 'BEGIN{file="content"++i".txt"} !NF{file="content"++i".txt";next} {print > file}' yourfile
(OR)
awk 'BEGIN{i++} !NF{++i;next} {print > "filename"i".txt"}' yourfile
More readable format:
BEGIN {
file="content"++i".txt"
}
!NF {
file="content"++i".txt";
next
}
{
print > file
}
In case you get "too many open files" error as follows...
awk: whatever-18.txt makes too many open files
input record number 18, file file.txt
source line number 1
You may need to close newly created file, before creating a new one, as follows.
awk -v RS= '{close("whatever-" i ".txt"); i++}{print > ("whatever-" i ".txt")}' file.txt
Since it's Friday and I'm feeling a bit helpful... :)
Try this. If the file is as small as you imply it's simplest to just read it all at once and work in memory.
use strict;
use warnings;
# slurp file
local $/ = undef;
open my $fh, '<', 'test.txt' or die $!;
my $text = <$fh>;
close $fh;
# split on double new line
my @chunks = split(/\n\n/, $text);
# make new files from chunks
my $count = 1;
for my $chunk (@chunks) {
open my $ofh, '>', "whatever$count.txt" or die $!;
print $ofh $chunk, "\n";
close $ofh;
$count++;
}
The perl docs can explain any individual commands you don't understand but at this point you should probably look into a tutorial as well.
awk -v RS="\n\n" '{for (i=1;i<=NR;i++); print > i-1}' file.txt
Sets record separator as blank line, prints each record as a separate file numbered 1, 2, 3, etc. Last file (only) ends in blank line.
Try this bash script also
#!/bin/bash
i=1
fileName="OutputFile_$i"
# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line ; do
if [ "$line" == "" ] ; then
((++i))
fileName="OutputFile_$i"
else
printf '%s\n' "$line" >> "$fileName"
fi
done < InputFile.txt
You can also try split -p "^$" (the -p option belongs to the BSD/macOS split; GNU split does not have it).

Extracting the first two characters from a file in perl into another file

I'm having a little bit of trouble with my code below -- I'm trying to figure out how to open up all these text files (.csv files that end in DIS that all have one line in them) and get the first two characters (these are all numbers) from them and print them into another file of the same name, with a ".number" suffix. Some of these .DIS files don't have anything in them, in which case I want to print "0".
Lastly, I would like to go through each original .DIS file and delete the first 3 characters -- I did this through bash.
my @DIS = <*.DIS>;
foreach my $file (@DIS){
my $name = $file;
my $output = "$name.number";
open(INHANDLE, "< $file") || die("Could not open file");
while(<INHANDLE>){
open(OUT_FILE,">$output") || die;
my $line = $_;
chomp ($line);
my $string = $line;
if ($string eq ""){
print "0";
} else {
print substr($string,0,2);
}
}
system("sed -i 's/\(.\{3\}\)//' $file");
}
When I run this code, I get a list of numbers are concatenated together and empty .DIS.number files. I'm rather new to Perl, so any help would be appreciated!
When I run this code, I get a list of numbers are concatenated together and empty .DIS.number files.
This is because of this line.
print substr($string,0,2);
print defaults to printing to STDOUT (ie. the screen). You need to give it the filehandle to print to.
print OUT_FILE substr($string,0,2);
They're being concatenated because print just prints what you tell it to, it won't put newlines in for you (there are some global variables which can change this, don't mess with them). You have to add the newline yourself.
print OUT_FILE substr($string,0,2), "\n";
As a final note, when working with files in Perl I would suggest using lexical filehandles, Path::Tiny, and autodie. They will avoid a great number of classic problems working with files in Perl.
I suggest you do it like this
Each *.dis file is opened and the contents read into $text. Then a regex substitution is used to remove the first three characters from the string and capture the first two in $1
If the substitution succeeded then the contents of $1 are written to the number file, otherwise the original file is empty (or shorter than two characters) and a zero is written instead. The remaining contents of $text are then written back to the *.dis file
use strict;
use warnings;
use v5.10.1;
use autodie;
for my $dis_file ( glob '*.DIS' ) {
my $text = do {
open my $fh, '<', $dis_file;
<$fh>;
};
my $num_file = "$dis_file.number";
open my $dis_fh, '>', $dis_file;
open my $num_fh, '>', $num_file;
if ( defined $text and $text =~ s/^(..).?// ) {
print $num_fh "$1\n";
print $dis_fh $text;
}
else {
print $num_fh "0\n";
print $dis_fh "-\n";
}
}
This awk script extracts the first two characters of each file into its own file. Based on the spec, empty files are expected to contain one empty line.
awk 'FNR==1{pre=substr($0,1,2);pre=length(pre)==2?pre:0; print pre > (FILENAME".number")}' *.DIS
This will remove the first 3 chars
cut -c 4-
A bash for loop is a better way to do both, for which we need to modify the awk script a little:
for f in *.DIS;
do awk 'NR==1{pre=substr($0,1,2);$0=length(pre)==2?pre:0; print}' $f > $f.number;
cut -c 4- $f > $f.cut;
done
Explanation: loop through all files matching *.DIS. For the first line of each file, try to get the first two characters (1,2) of the line ($0) and assign them to pre. If the length of pre is not two (the line is either empty or has only one character), set the line to 0, otherwise use pre. Then print the line; the output file name is the input file name with a .number suffix appended. The $0 assignment is a trick to save a couple of keystrokes, since print without arguments prints $0; otherwise you can provide the argument.
Ideally you should quote "$f" since it may contain space in file name...
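The quoting advice can be sketched as follows. This is a minimal sketch, not the answerer's exact script: the sample files are made up, the END rule is an added assumption so that a truly zero-length file also yields 0, and the trimmed output replaces the original file instead of staying in a separate .cut file.

```shell
#!/bin/sh
# Sketch: for each *.DIS file, write its first two characters (or 0) to
# <file>.number, then strip the first three characters from the file itself.
# The sample files are made up; the temporary directory is left in place.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '42,rest of line\n' > 'has space.DIS'
: > empty.DIS                                  # zero-length file

for f in *.DIS; do
    # First two characters of line 1, or 0 when fewer than two are present.
    awk 'NR == 1 { found = 1; pre = substr($0, 1, 2); print (length(pre) == 2 ? pre : 0) }
         END { if (!found) print 0 }' "$f" > "$f.number"
    # Drop the first three characters of every line, then replace the original.
    cut -c 4- "$f" > "$f.cut" && mv "$f.cut" "$f"
done
```

Because every expansion of $f is quoted, the file name containing a space is handled as a single argument.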

use grep and awk to transfer data from .srt to .csv/xls

I got an interesting project to do! I'm thinking about converting an srt file into a csv/xls file.
An srt file would look like this:
1
00:00:00,104 --> 00:00:02,669
Hi, I'm shell-scripting.

2
00:00:02,982 --> 00:00:04,965
I'm not sure if it would work,
but I'll try it!

3
00:00:05,085 --> 00:00:07,321
There must be a way to do it!
while I want to output it into a csv file like this:
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
So as you can see, each subtitle takes up two rows. My thinking would be using grep to put the srt data into the xls, and then use awk to format the xls file.
What do you guys think? How am I supposed to write it? I tried
$grep filename.srt > filename.xls
It seems that all the data including the time codes and the subtitle words ended up all in column A of the xls file...but I want the words to be in column B...How would awk be able to help with the formatting?
Thank you in advance! :)
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS=","; q="\""; s=q OFS q }
{
split($2,a,/ .* /)
print q $1 s a[1] s a[2] s $3 q
for (i=4;i<=NF;i++) {
print "", "", "", q $i q
}
}
$ awk -f tst.awk file
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work,"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
I think something like this should do it quite nicely:
awk -v RS= -F'\n' '
{
sub(" --> ","\x7c",$2) # change "-->" to "|"
printf "%s|%s|%s\n",$1,$2,$3 # print scene, time start, time stop, description
for(i=4;i<=NF;i++)printf "|||%s\n",$i # print remaining lines of description
}' file.srt
The -v RS= sets the Record Separator to blank lines. The -F'\n' sets the Field Separator to new lines.
The sub() replaces the "-->" with a pipe symbol (|).
The first three fields are then printed separated by pipes, and then there is a little loop to print the remaining lines of description, inset by three pipe symbols to make them line up.
Output
1|00:00:00,104|00:00:02,669|Hi, I'm shell-scripting.
2|00:00:02,982|00:00:04,965|I'm not sure if it would work,
|||but I'll try it!
3|00:00:05,085|00:00:07,321|There must be a way to do it!
As I am feeling like having some more fun with Perl and Excel, I took the above output and parsed it in Perl and wrote a real Excel XLSX file. Of course, there is no real need to use awk and Perl so ideally one would re-cast the awk and integrate it into the Perl since the latter can write Excel files while the former cannot. Anyway here is the Perl.
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $DEBUG=0;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $row=0;
while(my $line=<>){
$row++; # move down a line in Excel worksheet
chomp $line; # strip trailing newline
my @f=split /\|/, $line; # split fields of line into array @f, on pipe symbols (|)
for(my $j=0;$j<scalar @f;$j++){ # loop through all fields
my $cell= chr(65+$j) . $row; # calculate Excel cell, starting at A1 (65="A")
$worksheet->write($cell,$f[$j]); # write to spreadsheet
printf "%s:%s ",$cell,$f[$j] if $DEBUG;
}
printf "\n" if $DEBUG;
}
$workbook->close;
My other answer was half awk and half Perl, but, given that awk can't write Excel spreadsheets whereas Perl can, it seems daft to require you to master both awk and Perl when Perl is perfectly capable of doing it all on its own... so here goes in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $ExcelRow=0;
local $/ = ""; # set paragraph mode, so we read till next blank line as one record
while(my $para=<>){
$ExcelRow++; # move down a line in Excel worksheet
chomp $para; # strip trailing newlines
my @lines=split /\n/, $para; # split paragraph into lines on linefeed character
my $scene = $lines[0]; # pick up scene number from first line of para
my ($start,$end)=split / --> /,$lines[1]; # pick up start and end time from second line
my $cell=sprintf("A%d",$ExcelRow); # work out cell
$worksheet->write($cell,$scene); # write scene to spreadsheet column A
$cell=sprintf("B%d",$ExcelRow); # work out cell
$worksheet->write($cell,$start); # write start time to spreadsheet column B
$cell=sprintf("C%d",$ExcelRow); # work out cell
$worksheet->write($cell,$end); # write end time to spreadsheet column C
$cell=sprintf("D%d",$ExcelRow); # work out cell
$worksheet->write($cell,$lines[2]); # write description to spreadsheet column D
for(my $i=3;$i<scalar @lines;$i++){ # output additional lines of description
$ExcelRow++;
$cell=sprintf("D%d",$ExcelRow); # work out cell
$worksheet->write($cell,$lines[$i]);
}
}
$workbook->close;
Save the above on a file called srt2xls and then make it executable with the command:
chmod +x srt2xls
Then you can run it with
./srt2xls < SomeFile.srt
and it will give you this spreadsheet called result.xlsx
Since you want to convert the srt into csv, below is an awk command:
awk '{gsub(" --> ","\x22,\x22");if(NF!=0){if(j<3)k=k"\x22"$0"\x22,";else{k="\x22"$0"\x22 ";l=1}j=j+1}else j=0;if(j==3){print k;k=""}if(l==1){print ",,,"k ;l=0;k=""}}' inputfile > output.csv
Detailed view of the awk script:
awk '{
gsub(" --> ","\x22,\x22");
if(NF!=0)
{
if(j<3)
k=k"\x22"$0"\x22,";
else
{
k="\x22"$0"\x22 ";
l=1
}
j=j+1
}
else
j=0;
if(j==3)
{
print k;
k=""
}
if(l==1)
{
print ",,,"k;
l=0;
k=""
}
}' inputfile > output.csv
Take output.csv to a Windows machine, open it with Microsoft Excel and save it with the .xls extension.
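One case the awk answers in this thread do not cover is subtitle text that itself contains a double quote. A minimal sketch, assuming the spreadsheet program follows the common CSV convention of doubling embedded quotes: it reuses the paragraph-mode approach (RS set to the empty string, one field per line). The function name csvq and the sample subtitles are made up for illustration.

```shell
#!/bin/sh
# Sketch: srt to csv in paragraph mode, doubling embedded double quotes.
# The sample subtitles are made up.
csv=$(printf '1\n00:00:00,104 --> 00:00:02,669\nHe said "hi".\nSecond line\n\n2\n00:00:03,000 --> 00:00:04,000\nBye\n' |
awk '
    BEGIN { RS = ""; FS = "\n" }      # one record per blank-line-separated block
    # csvq() is a hypothetical helper: wrap a value in quotes, doubling inner quotes.
    function csvq(s) { gsub(/"/, "\"\"", s); return "\"" s "\"" }
    {
        split($2, t, / --> /)         # start and end time
        print csvq($1) "," csvq(t[1]) "," csvq(t[2]) "," csvq($3)
        for (i = 4; i <= NF; i++)     # continuation lines of the subtitle
            print ",,," csvq($i)
    }
')
printf '%s\n' "$csv"
```

The first subtitle line comes out as "He said ""hi"".", which CSV readers that follow that convention turn back into the original text.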

Taking multiple header (rows matching condition) and convert into a column

Hello I have a file that has multiple Headers in it that I need to have turned into column values. The file looks like this:
Day1
1,Smith,London
2,Bruce,Seattle
5,Will,Dallas
Day2
1,Mike,Frisco
4,James,LA
I would like the file to end up looking like this:
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
The file doesn't have sequential numbers before the names and it doesn't have the same quantity of records after the "Day" Header.
Does anyone have any ideas on how to accomplish this using the command-line?
In awk
awk -F, 'NF==1{a=$0;next}{print a","$0}' file
Checks if the number of fields is 1, if it is it sets a variable to that and skips the next block.
For each line that doesn't have 1 field, it prints the saved variable and the line
And in sed
sed -n '/,/!{h};/,/{x;G;s/\n/,/;p;s/,.*//;x}' file
Broken down for MrBones wild ride.
sed -n '
/,/!{h}; // If the line does not contain a comma overwrite buffer with line
/,/{ // If the line contains a comma, do everything inside the brackets
x; // Exchange the line for the held in buffer
G; // Append buffer to line
s/\n/,/; // Replace the newline with a comma
p; // Print the line
s/,.*//; // Remove everything after the first comma
x // exchange line for hold buffer to put title back in buffer for the next line.
}' file // The file you are using
In essence, it saves the lines without a comma, i.e. the headers. Then, if the line is not a header, it switches the current line with the saved header and appends the now-switched line to the end of the header. As it is appended with a newline, the next statement replaces that newline with a comma. Then the line is printed. Next, to recover the header, everything after it is removed and it is swapped back into the buffer, ready for the next line.
sed '/^Day/ {h;d;}
G;s/\(.*\)\n\(.*\)/\2,\1/
' YourFile
This is POSIX compliant.
It prints nothing for a Day header that is not followed by at least one data line.
Blank lines are treated as data.
awk '{if ( $0 ~ /^Day/ ) Head = $0; else print Head "," $0}' YourFile
This treats each Day line as a header and prepends it to the lines that follow.
Perl solution:
#! /usr/bin/perl
use warnings;
use strict;
my $header;
while (<>) { # Read line by line.
if (/,/) { # If the line contains a comma,
print "$header,$_"; # prepend the header.
} else {
chomp; # Remove the newline.
$header = $_; # Remember the header.
}
}
Another sed version
sed -n '/Day[0-9]\+/{h;b end};{G;s/\(.*\)\n\(.*\)/\2,\1/;p;:end}'
Perl
$ perl -F, -wlane ' if(@F == 1){$s=$F[0]; next}print "$s,$_"' file
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
This Perl one-line program will do as you ask. It requires Perl v5.14 or better
perl -ne'tr/,// ? print $c,$_ : ($c = s/\s*\z/,/r)' myfile.txt
for earlier versions of perl, use
perl -ne'tr/,// ? print $c,$_ : ($c = $_) =~ s/\s*\z/,/' myfile.txt
output
Day1,1,Smith,London
Day1,2,Bruce,Seattle
Day1,5,Will,Dallas
Day2,1,Mike,Frisco
Day2,4,James,LA
Another perl example- this time using $/ to separate each record.
use strict;
use warnings;
local $/ = "Day";
while (<>) {
next unless my ($num) = m/^(\d+)/;
for ( split /\n/ ) {
print "Day${num},$_\n" if m/,/;
}
}

Using sed on text files with a csv

I've been trying to do bulk find and replace on two text files using a csv. I've seen the questions that SO suggests, and none seem to answer my question.
I've created two variables for the two text files I want to modify. The csv has two columns and hundreds of rows. The first column contains strings (none have whitespaces) already in the text file that need to be replaced with the corresponding strings in same row in the second column.
As a test, I tried the script
#!/bin/bash
test1='long_file_name.txt'
find='string1'
replace='string2'
sed -e "s/$find/$replace/g" $test1 > $test1.tmp && mv $test1.tmp $test1
This was successful, except that I need to do it once for every row in the csv, using the values given by the csv in each row. My hunch is that my while loop was used wrongly, but I can't find the error. When I execute the script below, I get the command line prompt, which makes me think that something has happened. When I check the text files, nothing's changed.
The two text files, this script, and the csv are all in the same folder (it's also been my working directory when I do this).
#!/bin/bash
textfile1='long_file_name1.txt'
textfile2='long_file_name2.txt'
while IFS=, read f1 f2
do
sed -e "s/$f1/$f2/g" $textfile1 > $textfile1.tmp && \
mv $textfile1.tmp $textfile1
sed -e "s/$f1/$f2/g" $textfile2 > $textfile2.tmp && \
mv $textfile2.tmp $textfile2
done <'findreplace.csv'
It seems to me that this code should do what I want it to do (but doesn't); perhaps I'm misunderstanding something fundamental (I'm new to bash scripting)?
The csv looks like this, but with hundreds of rows. All a_i's should be replaced with their counterpart b_i in the next column over.
a_1 b_1
a_2 b_2
a_3 b_3
Something to note: All the strings actually contain underscores, just in case this affects something. I've tried wrapping the variable name in braces a la ${var}, but it still doesn't work.
I appreciate the solutions, but I'm also curious to know why the above doesn't work. (Also, I would vote everyone up, but I lack the reputation to do so. However, know that I appreciate and am learning a lot from your answers!)
If you are going to process a lot of data and your patterns can contain special characters, I would consider using Perl, especially if you are going to have a lot of pairs in findreplace.csv. You can use the following script as a filter or for in-place modification of many files. As a side effect, it loads the replacements and builds the Aho-Corasick automaton only once per invocation, which makes this solution pretty efficient (O(M+N) instead of O(M*N) in your solution).
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $in_place = ( @ARGV and $ARGV[0] =~ /^-i(.*)/ )
? do {
shift;
my $backup_extension = $1;
my $backup_name = $backup_extension =~ /\*/
? sub { ( my $fn = $backup_extension ) =~ s/\*/$_[0]/; $fn }
: sub { shift . $backup_extension };
my $oldargv = '-';
sub {
if ( $ARGV ne $oldargv ) {
rename( $ARGV, $backup_name->($ARGV) );
open( ARGVOUT, '>', $ARGV );
select(ARGVOUT);
$oldargv = $ARGV;
}
};
}
: sub { };
die "$0: File with replacements required." unless @ARGV;
my ( $re, %replace );
do {
my $filename = shift;
open my $fh, '<', $filename;
%replace = map { chomp; split ',', $_, 2 } <$fh>;
close $fh;
$re = join '|', map quotemeta, keys %replace;
$re = qr/($re)/;
};
while (<>) {
$in_place->();
s/$re/$replace{$1}/g;
}
continue {print}
Usage:
./replace.pl replace.csv <file.in >file.out
as well as
./replace.pl replace.csv file.in >file.out
or in-place
./replace.pl -i replace.csv file1.csv file2.csv file3.csv
or with backup
./replace.pl -i.orig replace.csv file1.csv file2.csv file3.csv
or with backup with a placeholder
./replace.pl -ithere.is.\*.original replace.csv file1.csv file2.csv file3.csv
You should convert your CSV file to a sed.script with the following command:
cat replace.csv | awk -F, '{print "s/" $1 "/" $2 "/g";}' > sed.script
And then you will be able to do a one pass replacement:
sed -i -f sed.script longfilename.txt
This will be a faster implementation of what you wanna do.
BTW, sorry, but I do not understand what is wrong with your script which should work except if your CSV file has more than 2 columns.
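One possible refinement of the generated sed script, offered as a sketch only: if the CSV was saved with Windows line endings, or a string contains a character that sed treats specially (such as . / or &), the plain conversion produces rules that never match or match too much. Whether either applies to the asker's files is not known from the thread. The file names and sample strings are made up.

```shell
#!/bin/sh
# Sketch: build a sed script from a two-column CSV, escaping characters that
# are special to sed and dropping carriage returns. File names are made up;
# the temporary directory is left in place.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a_1,b/1\r\na.2,b&2\r\n' > replace.csv      # CRLF endings, awkward characters
printf 'a_1 ax2 a.2\n' > text.txt

awk -F, '
    { sub(/\r$/, "") }                       # strip a trailing CR, if present
    NF >= 2 {
        find = $1; repl = $2
        gsub(/[][\/.*^$\\]/, "\\\\&", find)  # escape regex metacharacters and the delimiter
        gsub(/[\/&\\]/, "\\\\&", repl)       # escape the delimiter, & and backslash
        print "s/" find "/" repl "/g"
    }
' replace.csv > sed.script

sed -f sed.script text.txt > text.txt.tmp && mv text.txt.tmp text.txt
cat text.txt
```

With the escaping in place, the rule for a.2 leaves ax2 alone, which an unescaped dot would not.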

Resources