How to delete text in text file in Windows using Perl? - windows

I want to port my Perl application to Windows.
Currently it calls out to "grep" to delete text in a text file, like so:
system("grep -v '$mcadd' $ARGV[0] >> $ARGV[0].bak");
system("mv $ARGV[0].bak $ARGV[0]");
This works perfectly well in ubuntu, but I'm not sure (a) how to modify my perl script to achieve the same effect on Windows, and (b) whether there is a way to achieve the effect in a way that will work in both environments.

Other way to delete text in perl?
You can use Perl's in-place editing facility.
~/pperl_programs$ cat data.txt
hello world
goodbye mars
goodbye perl6
back to perl5
Run this:
use strict;
use warnings;
use 5.020;

my $fname = 'data.txt';

{
    local $^I   = ".bak";    # Turn on inplace editing for this block only
    local @ARGV = $fname;    # Set @ARGV for this block only

    while (my $line = <>) {  # "diamond operator" reads from @ARGV
        if ($line !~ /hello/) {
            print $line;     # This does not go to STDOUT--it goes to a new file that perl creates for you.
        }
    }
}  # Return $^I and @ARGV to their previous values
Here is the result:
$ cat data.txt
goodbye mars
goodbye perl6
back to perl5
With inplace editing turned on, perl takes care of creating a new file, sending print() output to the new file, then when you are done, renaming the new file to the original file name, and saving a copy of the original file with a .bak extension.

system("perl -n -e 'if(\$_ !~ /$mcadd/) { print \$_; }' $ARGV[0] >> $ARGV[0].bak");
system("rename $ARGV[0].bak $ARGV[0]");
This should work on Windows. (Note that only the inner one-liner's \$_ stays escaped; the file names must be interpolated by the outer script.)
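Shelling out can be fragile under cmd.exe, which does not treat single quotes as quoting characters. A minimal sketch of a pure-Perl alternative that avoids the shell entirely (the file name, pattern, and sample data here are illustrative, not from the question):

```perl
use strict;
use warnings;

# Create a small sample file to filter (illustrative data).
my $file = 'sample.txt';
open my $fh, '>', $file or die "Can't write $file: $!";
print $fh "keep this line\ndrop this line\nkeep this too\n";
close $fh;

my $pattern = qr/drop/;    # lines matching this are removed

# Filter into a temp file, then rename it over the original.
# Perl's open() and rename() work on both Unix and Windows,
# so no system() call and no shell quoting are involved.
open my $in,  '<', $file       or die "Can't read $file: $!";
open my $out, '>', "$file.tmp" or die "Can't write $file.tmp: $!";
while (my $line = <$in>) {
    print $out $line unless $line =~ $pattern;
}
close $in;
close $out;

rename "$file.tmp", $file or die "Can't rename: $!";
```

This is the same create-filter-rename sequence the grep/mv pipeline performs, just done in-process.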

Related

Splitting very long (4GB) string with new lines

I have a file that is supposed to be JSON objects, one per line. Unfortunately, a miscommunication happened with the creation of the file, and the JSON objects only have a space between them, not a new-line.
I need to fix this by replacing every instance of } { with }\n{.
Should be easy for sed or Perl, right?
sed -e "s/}\s{/}\n{/g" file.in > file.out
perl -pe "s/}\s{/}\n{/g" file.in > file.out
But file.in is actually 4.4 GB which seems to be causing a problem for both of these solutions.
The sed command finishes with a halfway-correct file, but file.out is only 335 MB and is only about the first 1/10th of the input file, cutting off in the middle of a line. It's almost like sed just quit in the middle of the stream. Maybe it's trying to load the entire 4.4 GB file into memory but running out of stack space at around 300MB and silently kills itself.
The Perl command errors with the following message:
[1] 2904 segmentation fault perl -pe "s/}\s{/}\n{/g" file.in > file.out
What else should I try?
Unlike the earlier solutions, this one handles {"x":"} {"}.
use strict;
use warnings;
use feature qw( say );
use JSON::XS qw( );
use constant READ_SIZE => 64*1024*1024;
my $j_in = JSON::XS->new->utf8;
my $j_out = JSON::XS->new;
binmode STDIN;
binmode STDOUT, ':encoding(UTF-8)';
while (1) {
    my $rv = sysread(\*STDIN, my $block, READ_SIZE);
    die($!) if !defined($rv);
    last if !$rv;

    $j_in->incr_parse($block);
    while (my $o = $j_in->incr_parse()) {
        say $j_out->encode($o);
    }
}
die("Bad data") if $j_in->incr_text !~ /^\s*\z/;
The default input record separator in Perl is \n, but you can change it to any character you want. For this problem, you could use { (octal 173).
perl -0173 -pe 's/}\s{/}\n{/g' file.in > file.out
perl -ple 'BEGIN{$/=qq/} {/;$\=qq/}\n{/}undef$\ if eof' <input >output
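The record-separator idea can also be written out as a script, which may be easier to follow than the one-liners. This is a minimal sketch with invented sample data, reading from an in-memory scalar and using a literal "} {" separator for simplicity (the one-liners above handle any whitespace via \s):

```perl
use strict;
use warnings;

# Sample input: JSON objects separated by "} {" on a single line (invented data).
my $input = '{"a":1} {"b":2} {"c":3}';
open my $in, '<', \$input or die $!;   # read from an in-memory scalar for the demo

my $output = '';
{
    local $/ = '} {';                  # each read ends just after a "} {" separator
    while (my $record = <$in>) {
        $record =~ s/\} \{$/}\n{/;     # rewrite the trailing separator as "}\n{"
        $output .= $record;
    }
}
close $in;

print $output;                         # one object per line
```

Because each record is only as long as one JSON object, memory use stays flat no matter how large the input is.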
Assuming your input doesn't contain } { pairs in other contexts that you do not want replaced, all you need is:
awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
e.g.
$ printf '{foo} {bar}' | awk -v RS='} {' '{ORS=(RT ? "}\n{" : "\n")} 1'
{foo}
{bar}
The above uses GNU awk for multi-char RS and RT and will work on any size input file as it does not read the whole file into memory at one time, just each } {-separated "line" one at a time.
You may read input in blocks/chunks and process them one by one.
use strict;
use warnings;
binmode(STDIN);
binmode(STDOUT);
my $CHUNK  = 0x2000;  # 8 kiB
my $buffer = '';

while ( sysread(STDIN, $buffer, $CHUNK, length($buffer)) ) {
    $buffer =~ s/\}\s\{/}\n{/sg;
    if ( length($buffer) > $CHUNK ) {        # More than one chunk buffered
        syswrite(STDOUT, $buffer, $CHUNK);   # write FIRST of buffered chunks
        substr($buffer, 0, $CHUNK, '');      # remove FIRST of buffered chunks from buffer
    }
}
syswrite(STDOUT, $buffer) if length($buffer);

Bash script csv manipulation optimization

I have a 2 million line CSV file where I want to replace the second column of each line with a value unique to that string; the columns are all filled with usernames. The long process I've got below does work, but it takes a while.
It doesn't have to be hashed, but hashing seemed like a sure way to avoid discrepancies when the next file comes along.
I'm by no means a coder, and was wondering if there was any way I could optimize the process, although I understand the best way to do this would be in some sort of scripting language.
#!/bin/bash
#Enter Filename to Read
echo "Enter File Name"
read filename
#Extracts Usernames from file
awk -F "\"*,\"*" '{print $2}' $filename > usernames.txt
#Hashes Usernames using SHA256
cat usernames.txt | while read line; do echo -n $line|openssl sha256 |sed 's/^.* //'; done > hashedusernames.txt
#Deletes usernames out of first file
cat hash.csv | cut -d, -f2 --complement > output.txt
#Pastes hashed usernames to end of first file
paste -d , output.txt hashedusernames.txt > output2.txt
#Moves everything back into place
awk -F "\"*,\"*" '{print $1","$4","$2","$3}' output2.txt > final.csv
Example File, there are 7 columns in all but only 3 are shown
Time Username Size
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
You could do this in Perl easily in a few lines. The following program uses the Crypt::Digest::SHA256, which you need to install from CPAN or from your OS repository if they have it.
The program assumes input from the DATA section, which we typically do around here to include example data in an mcve.
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';
while (my $line = <DATA>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print join ',', @fields;
}
__DATA__
2017-01-01T14:53.45,Poke.callum,12345
2016-01-01T13:42.56,Test.User,54312
2015-01-01T12:34.34,Another.User,54123
It prints the following output.
2017-01-01T14:53.45,g8EPHWc3L1ln_lfRhq8elyOUgsiJm6BtTtb_GVt945s,12345
2016-01-01T13:42.56,jwXsws2dJq9h_R08zgSIPhufQHr8Au8_RmniTQbEKY4,54312
2015-01-01T12:34.34,mkrKXbM1ZiPiXSSnWYNo13CUyzMF5cdP2SxHGyO7rgQ,54123
To make it read a file that is supplied as a command line argument and write to a new file with the .new extension, you can use it like this:
use strict;
use warnings;
use Crypt::Digest::SHA256 'sha256_b64u';
open my $fh_in, '<', $ARGV[0] or die $!;
open my $fh_out, '>', "$ARGV[0].new" or die $!;
while (my $line = <$fh_in>) {
    # no need to chomp because we don't touch the last line
    my @fields = split /,/, $line;
    $fields[1] = sha256_b64u($fields[1]);
    print $fh_out join ',', @fields;
}
Run it as follows:
$ perl foo.pl example.csv
Your new file will be named example.csv.new.
Yet another Python solution, focused on speed but also on maintainability.
#!/usr/bin/python3
import argparse
import hashlib
import re

parser = argparse.ArgumentParser(description='CSV swaper')
parser.add_argument(
    '-f',
    '--file',
    dest='file_path',
    type=str,
    required=True,
    help='The CSV file path.')


def hash_user(users, user):
    try:
        return users[user]
    except KeyError:
        id_ = int(hashlib.md5(user.encode('utf-8')).hexdigest(), 16)
        users[user] = id_
        return id_


def main():
    args = parser.parse_args()
    username_extractor = re.compile(r',([\s\S]*?),')
    users = {}
    counter = 0
    templ = ',{},'
    with open(args.file_path) as file:
        with open('output.csv', 'w') as output:
            line = file.readline()
            while line:
                try:
                    counter += 1
                    if counter == 1:
                        continue
                    username = username_extractor.search(line).groups()[0]
                    hashuser = hash_user(users, username)
                    output.write(username_extractor.sub(
                        templ.format(hashuser), line)
                    )
                except StopIteration:
                    break
                except:
                    print('Malformed line at {}'.format(counter))
                finally:
                    line = file.readline()


if __name__ == '__main__':
    main()
There are still some points that could be optimized, but the central ideas are to "try" rather than check, and to cache each user's hash so that repeated users do not have to be re-digested.
Also, will you run this on a multi-core host? This could easily be improved using threads.
This Python program might do what you want. You can pass the filenames to convert on the command line:
$ python this_program.py file1.csv file2.csv
import fileinput
import csv
import sys
import hashlib
class stdout:
    def write(self, *args):
        sys.stdout.write(*args)

input = fileinput.input(inplace=True, backup=".bak", mode='rb')
reader = csv.reader(input)
writer = csv.writer(stdout())

for row in reader:
    row[1] = hashlib.sha256(row[1]).hexdigest()
    writer.writerow(row)
Since you used awk in your original attempt, here's a simpler approach in awk
awk -F"," 'BEGIN{i=0;}
{
    if (unique_names[$2] == "") {
        unique_names[$2] = "Unique" i;
        i++;
    }
    $2 = unique_names[$2];
    print $0
}'
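The same first-seen mapping can be sketched in pure Perl. This is a minimal illustration with sample rows invented to match the question's example layout; real use would read the CSV file line by line instead of an in-memory list:

```perl
use strict;
use warnings;

# Invented sample rows matching the question's example layout.
my @rows = (
    "2017-01-01T14:53.45,Poke.callum,12345",
    "2016-01-01T13:42.56,Test.User,54312",
    "2015-01-01T12:34.34,Poke.callum,99999",
);

my %unique_names;   # username => "UniqueN"
my $i = 0;
my @out;
for my $row (@rows) {
    my @f = split /,/, $row;
    # Assign "Unique0", "Unique1", ... the first time a username is seen;
    # //= only evaluates (and increments $i) when the key is still undefined.
    $unique_names{ $f[1] } //= "Unique" . $i++;
    $f[1] = $unique_names{ $f[1] };
    push @out, join ',', @f;
}
print "$_\n" for @out;
```

As in the awk version, the same username always maps to the same ID within one run.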

IO::Uncompress::Gunzip stops after first "original" gzipped file inside "concatenated" gzipped file

In bash, you can concatenate gzipped files and the result is a valid gzipped file. As far as I recall, I have always been able to treat these "concatenated" gzipped files as normal gzipped files (my example code from link above):
echo 'Hello world!' > hello.txt
echo 'Howdy world!' > howdy.txt
gzip hello.txt
gzip howdy.txt
cat hello.txt.gz howdy.txt.gz > greetings.txt.gz
gunzip greetings.txt.gz
cat greetings.txt
Which outputs
Hello world!
Howdy world!
However, when trying to read this same file using Perl's core IO::Uncompress::Gunzip module, it doesn't get past the first original file. Here is the result:
./my_zcat greetings.txt.gz
Hello world!
Here is the code for my_zcat:
#!/bin/env perl
use strict;
use warnings;
use v5.10;
use IO::Uncompress::Gunzip qw($GunzipError);
my $file_name = shift;
my $fh = IO::Uncompress::Gunzip->new($file_name) or die $GunzipError;
while (defined(my $line = readline $fh))
{
print $line;
}
If I totally decompress the files before creating a new gzipped file, I don't have this problem:
zcat hello.txt.gz howdy.txt.gz | gzip > greetings_via_zcat.txt.gz
./my_zcat greetings_via_zcat.txt.gz
Hello world!
Howdy world!
So, what is the difference between greetings.txt.gz and greetings_via_zcat.txt.gz, and why doesn't IO::Uncompress::Gunzip work correctly with greetings.txt.gz?
Based on this answer to another question, I'm guessing that IO::Uncompress::Gunzip messes up because of the metadata between the files. But, since greetings.txt.gz is a valid Gzip file, I would expect IO::Uncompress::Gunzip to work.
My workaround for now will be piping from zcat (which of course doesn't help Windows users much):
#!/bin/env perl
use strict;
use warnings;
use v5.10;
my $file_name = shift;
open(my $fh, '-|', "zcat $file_name") or die "Can't run zcat: $!";
while (defined(my $line = readline $fh))
{
print $line;
}
This is covered explicitly in the IO::Compress FAQ section Dealing with concatenated gzip files. Basically you just have to include the MultiStream option when you construct the IO::Uncompress::Gunzip object.
Here is a definition of the MultiStream option:
MultiStream => 0|1
If the input file/buffer contains multiple
compressed data streams, this option will uncompress the whole lot as
a single data stream.
Defaults to 0.
So your code needs this change
my $fh = IO::Uncompress::Gunzip->new($file_name, MultiStream => 1) or die $GunzipError;
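A self-contained sketch of the difference, building the concatenated gzip data in memory with IO::Compress::Gzip (rather than from files, just for the demonstration) and reading it back both ways:

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build the equivalent of "cat hello.txt.gz howdy.txt.gz": two separate
# gzip streams concatenated into one buffer.
my ($gz1, $gz2);
gzip \"Hello world!\n", \$gz1 or die $GzipError;
gzip \"Howdy world!\n", \$gz2 or die $GzipError;
my $concatenated = $gz1 . $gz2;

# Without MultiStream, reading stops after the first stream.
my $fh1 = IO::Uncompress::Gunzip->new(\$concatenated) or die $GunzipError;
my $first = do { local $/; <$fh1> };

# With MultiStream => 1, the whole lot is uncompressed as one data stream.
my $fh2 = IO::Uncompress::Gunzip->new(\$concatenated, MultiStream => 1)
    or die $GunzipError;
my $all = do { local $/; <$fh2> };

print $all;
```

The same MultiStream => 1 argument applied to the file-based constructor in my_zcat makes it behave like gunzip/zcat.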

use grep and awk to transfer data from .srt to .csv/xls

I got an interesting project to do! I'm thinking about converting an srt file into a csv/xls file.
An srt file would look like this:
1
00:00:00,104 --> 00:00:02,669
Hi, I'm shell-scripting.
2
00:00:02,982 --> 00:00:04,965
I'm not sure if it would work,
but I'll try it!
3
00:00:05,085 --> 00:00:07,321
There must be a way to do it!
while I want to output it into a csv file like this:
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
So as you can see, each subtitle takes up two rows. My thinking would be using grep to put the srt data into the xls, and then use awk to format the xls file.
What do you guys think? How am I supposed to write it? I tried
$grep filename.srt > filename.xls
It seems that all the data including the time codes and the subtitle words ended up all in column A of the xls file...but I want the words to be in column B...How would awk be able to help with the formatting?
Thank you in advance! :)
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS=","; q="\""; s=q OFS q }
{
    split($2,a,/ .* /)
    print q $1 s a[1] s a[2] s $3 q
    for (i=4;i<=NF;i++) {
        print "", "", "", q $i q
    }
}
$ awk -f tst.awk file
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work,"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
I think something like this should do it quite nicely:
awk -v RS= -F'\n' '
{
    sub(" --> ","\x7c",$2)                  # change "-->" to "|"
    printf "%s|%s|%s\n",$1,$2,$3            # print scene, time start, time stop, description
    for(i=4;i<=NF;i++)printf "|||%s\n",$i   # print remaining lines of description
}' file.srt
The -v RS= sets the Record Separator to blank lines. The -F'\n' sets the Field Separator to new lines.
The sub() replaces the "-->" with a pipe symbol (|).
The first three fields are then printed separated by pipes, and then there is a little loop to print the remaining lines of description, inset by three pipe symbols to make them line up.
Output
1|00:00:00,104|00:00:02,669|Hi, I'm shell-scripting.
2|00:00:02,982|00:00:04,965|I'm not sure if it would work,
|||but I'll try it!
3|00:00:05,085|00:00:07,321|There must be a way to do it!
As I am feeling like having some more fun with Perl and Excel, I took the above output and parsed it in Perl and wrote a real Excel XLSX file. Of course, there is no real need to use both awk and Perl, so ideally one would re-cast the awk and integrate it into the Perl, since the latter can write Excel files while the former cannot. Anyway, here is the Perl.
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;

my $DEBUG = 0;
my $workbook  = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $row = 0;

while (my $line = <>) {
    $row++;                                  # move down a line in Excel worksheet
    chomp $line;                             # strip CR
    my @f = split /\|/, $line;               # split fields of line into array @f, on pipe symbols (|)
    for (my $j = 0; $j < scalar @f; $j++) {  # loop through all fields
        my $cell = chr(65 + $j) . $row;      # calculate Excel cell, starting at A1 (65="A")
        $worksheet->write($cell, $f[$j]);    # write to spreadsheet
        printf "%s:%s ", $cell, $f[$j] if $DEBUG;
    }
    printf "\n" if $DEBUG;
}
$workbook->close;
My other answer was half awk and half Perl, but, given that awk can't write Excel spreadsheets whereas Perl can, it seems daft to require you to master both awk and Perl when Perl is perfectly capable of doing it all on its own... so here goes in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;

my $workbook  = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $ExcelRow = 0;

local $/ = "";    # set paragraph mode, so we read till next blank line as one record

while (my $para = <>) {
    $ExcelRow++;                                   # move down a line in Excel worksheet
    chomp $para;                                   # strip trailing newlines
    my @lines = split /\n/, $para;                 # split paragraph into lines on linefeed character
    my $scene = $lines[0];                         # pick up scene number from first line of para
    my ($start, $end) = split / --> /, $lines[1];  # pick up start and end time from second line
    my $cell = sprintf("A%d", $ExcelRow);          # work out cell
    $worksheet->write($cell, $scene);              # write scene to spreadsheet column A
    $cell = sprintf("B%d", $ExcelRow);             # work out cell
    $worksheet->write($cell, $start);              # write start time to spreadsheet column B
    $cell = sprintf("C%d", $ExcelRow);             # work out cell
    $worksheet->write($cell, $end);                # write end time to spreadsheet column C
    $cell = sprintf("D%d", $ExcelRow);             # work out cell
    $worksheet->write($cell, $lines[2]);           # write description to spreadsheet column D
    for (my $i = 3; $i < scalar @lines; $i++) {    # output additional lines of description
        $ExcelRow++;
        $cell = sprintf("D%d", $ExcelRow);         # work out cell
        $worksheet->write($cell, $lines[$i]);
    }
}
$workbook->close;
Save the above on a file called srt2xls and then make it executable with the command:
chmod +x srt2xls
Then you can run it with
./srt2xls < SomeFile.srt
and it will give you a spreadsheet called result.xlsx
Since you want to convert the srt into csv, below is an awk command:
awk '{gsub(" --> ","\x22,\x22");if(NF!=0){if(j<3)k=k"\x22"$0"\x22,";else{k="\x22"$0"\x22 ";l=1}j=j+1}else j=0;if(j==3){print k;k=""}if(l==1){print ",,,"k ;l=0;k=""}}' inputfile > output.csv
Detailed view of the awk:
awk '{
    gsub(" --> ","\x22,\x22");
    if (NF != 0) {
        if (j < 3)
            k = k "\x22" $0 "\x22,";
        else {
            k = "\x22" $0 "\x22 ";
            l = 1
        }
        j = j + 1
    }
    else
        j = 0;
    if (j == 3) {
        print k;
        k = ""
    }
    if (l == 1) {
        print ",,," k;
        l = 0;
        k = ""
    }
}' inputfile > output.csv
Take output.csv to the Windows platform, open it with Microsoft Excel, and save it with the .xls extension.

append a text on the top of the file

I want to add text at the top of my data.txt file, but this code adds the text at the end of the file. How can I modify this code to write the text at the top of my data.txt file? Thanks in advance for any assistance.
open (MYFILE, '>>data.txt');
print MYFILE "Title\n";
close (MYFILE)
perl -pi -e 'print "Title\n" if $. == 1' data.txt
Your syntax is deprecated (thanks, Seth):
open(MYFILE, '>>', "data.txt") or die $!;
You will have to make a full pass through the file and write out the desired data before the existing file contents:
open my $in,  '<', $file       or die "Can't read old file: $!";
open my $out, '>', "$file.new" or die "Can't write new file: $!";

print $out "# Add this line to the top\n"; # <--- HERE'S THE MAGIC

while ( <$in> ) {
    print $out $_;
}

close $out;
close $in;

unlink($file);
rename("$file.new", $file);
(gratuitously stolen from the Perl FAQ, then modified)
This will process the file line-by-line so that on large files you don't chew up a ton of memory. But, it's not exactly fast.
Hope that helps.
There is a much simpler one-liner to prepend a block of text to every file. Let's say you have a set of files named body1, body2, body3, etc, to which you want to prepend a block of text contained in a file called header:
cat header | perl -0 -i -pe 'BEGIN {$h = <STDIN>}; print $h' body*
Appending to the top is normally called prepending.
open(M, "<", "data.txt");
my @m = <M>;
close(M);

open(M, ">", "data.txt");
print M "foo\n";
print M @m;
close(M);
Alternately, open data.txt- for writing and then move data.txt- to data.txt after the close, which has the benefit of being atomic, so interruptions cannot leave the data.txt file truncated.
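That atomic variant can be sketched like this (the file name and the "Title"/sample contents are illustrative): write the complete new contents to data.txt-, then rename it into place, so a crash mid-write never leaves a truncated data.txt:

```perl
use strict;
use warnings;

# Set up a sample data.txt to prepend to (illustrative contents).
open my $fh, '>', 'data.txt' or die $!;
print $fh "existing line\n";
close $fh;

# Read the original, write title + original to a temp file, rename atomically.
open my $in,  '<', 'data.txt'  or die "Can't read data.txt: $!";
open my $out, '>', 'data.txt-' or die "Can't write data.txt-: $!";
print $out "Title\n";            # the new first line
print $out $_ while <$in>;       # copy the original contents after it
close $in;
close $out;

# rename() is atomic on POSIX filesystems: readers see either the old
# file or the complete new one, never a half-written file.
rename 'data.txt-', 'data.txt' or die "Can't rename: $!";
```

This still makes a full pass through the file, like the FAQ version above, but the original is never open for writing, so an interruption cannot corrupt it.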
See the Perl FAQ Entry on this topic
perl -ni -e 'print "Title\n" if $. == 1; print' filename — with -n each line must be printed explicitly, and the title is printed only once, at the top.
