Adding a new position at the end of the file in Shell or Perl

My question is: how do I add new fields at the end of each line of a file, in shell or Perl?
I have two files:
File A has 536382 lines; the key is the third column:
abc1111,1070X00Y0,9999,B
abc2222,1070X00Y0,9999,B
abc3333,1070x00Y0,9999,B
File B has 946 lines; the key is the first column:
9999,Position,West
9998,Position,West
9997,Position,South
1111,Position,South
9999,Time,Morning
9997,Time,Afternoon
I want a combination of these two files:
abc1111,1070X00Y0,9999,B,West,Morning
abc2222,1070X00Y0,9999,B,West,Morning
abc3333,1070x00Y0,9999,B,West,Morning
I tried a shell script, but I was getting an out-of-memory error, so I'm open to suggestions.
Thank you.

I was able to get the results you want by making a few changes to your code.
#!/usr/bin/perl
use strict;
use warnings;
open IN2, '<', \<<EOF;
9999,Position,West
9998,Position,West
9997,Position,South
1111,Position,South
9999,Time,Morning
9997,Time,Afternoon
EOF
my %hash;
while ( <IN2> ) {
    chomp;
    my @col2 = split /,/;
    $hash{$col2[0]}{$col2[1]} = $col2[2];
}
open IN1, '<', \<<EOF;
abc1111,1070X00Y0,9999,B
abc2222,1070X00Y0,9999,B
abc3333,1070x00Y0,9999,B
EOF
while ( <IN1> ) {
    chomp;
    my $key = (split /,/)[2];
    if ( exists( $hash{$key} ) ) {
        print join(",", $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n";
    }
}
This produced output of:
abc1111,1070X00Y0,9999,B,West,Morning
abc2222,1070X00Y0,9999,B,West,Morning
abc3333,1070x00Y0,9999,B,West,Morning
The changes to the code were:
$hash{$col2[0]}{$col2[1]} = $col2[2];
This creates a hash of hashes to hold the Position and Time keys. They are then used in a hash slice here:
@{ $hash{$key} }{ qw/Position Time/ }

Convert the small file into a Perl hash.
Process the big file line by line.
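To make that concrete with real files instead of the inline demo data, here is a minimal sketch of the same approach; fileA.csv and fileB.csv are placeholder names for the two inputs:
#!/usr/bin/perl
use strict;
use warnings;

# Small file (946 lines): load it fully into a lookup hash.
my %hash;
open my $small, '<', 'fileB.csv' or die "fileB.csv: $!";
while (<$small>) {
    chomp;
    my @col = split /,/;
    $hash{ $col[0] }{ $col[1] } = $col[2];
}
close $small;

# Big file (536382 lines): stream it line by line so it is
# never held in memory all at once.
open my $big, '<', 'fileA.csv' or die "fileA.csv: $!";
while (<$big>) {
    chomp;
    my $key = (split /,/)[2];
    if ( exists $hash{$key} ) {
        print join(',', $_, @{ $hash{$key} }{ qw/Position Time/ }), "\n";
    }
}
close $big;
Since only the 946-line file is held in memory, this avoids the out-of-memory problem the shell attempt ran into.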

Related

How do I add new lines after deleting a large amount of text in Perl on Windows?

I'm trying to remove a large amount of text from a file before inserting a few new lines. I can delete everything after the word 'CParticleSystemDefinition' with a single line of code like this
perl -0777 -pi -we "s/CParticleSystemDefinition\x22\K.*/\n}/s" "D:\Steam\steamapps\common\dota 2 beta\content\dota_addons\custom\particles\generic_gameplay\winter_effects_creep.vpcf"
But when I try to change the code slightly so that it adds a few new lines like this, it doesn't work
perl -0777 -pi -we "s/CParticleSystemDefinition\x22\K.*/\n m_Children = \n [\n {\n m_ChildRef = resource:\x22particles/generic_gameplay/winter_effects_breath.vpcf\x22\n },\n ]\n}/s" "D:\Steam\steamapps\common\dota 2 beta\content\dota_addons\custom\particles\generic_gameplay\winter_effects_creep.vpcf"
So, basically, what I want to do is make this file
{
    _class = "CParticleSystemDefinition"
    m_bShouldHitboxesFallbackToRenderBounds = false
    m_nMaxParticles = 24
    m_flConstantRadius = 15.000000
    m_flConstantLifespan = 0.500000
    m_ConstantColor =
    [
        212,
        170,
        145,
        255,
    ]
    m_bShouldSort = false
    m_Renderers =
    [
        {
            _class = "C_OP_RenderSprites"
            m_nSequenceCombineMode = "SEQUENCE_COMBINE_MODE_USE_SEQUENCE_0"
            m_bMod2X = true
            m_nOrientationType = 3
            m_hTexture = resource:"materials/particle/footprints/footprints_generic.vtex"
            m_flAnimationRate = 1.000000
        },
    ]
    m_Emitters =
    [
        {
            _class = "C_OP_ContinuousEmitter"
            m_flEmitRate = 10.000000
            m_flStartTime = 0.500000
            m_nScaleControlPoint = 5
        },
    ]
}
look like this
{
    _class = "CParticleSystemDefinition"
    m_Children =
    [
        {
            m_ChildRef = resource:"particles/generic_gameplay/winter_effects_breath.vpcf"
        },
    ]
}
Do it in two steps -- clear the rest of the file after that phrase, then add the desired text
perl -0777 -i.bak -wpe"s{Definition\x22\K.*}{}s; $_ .= qq(\n\tm_Children...)" file
where I've used ellipses to indicate the rest, for clarity. I added .bak to keep a backup file, until this is tested well enough.
Adding a string in the replacement part is fine as well, of course -- I don't readily see what fails (and how) in your code. Breaking it up into two steps simply makes it easier to review and organize, but one can also run code in the replacement part, using the /e modifier:
perl -0777 -i.bak -wpe"
    s{Definition\x22\K.*}{
        # any valid Perl code; what it evaluates to is used as the replacement
        qq(\n\tm_Children...)
    }es;
" file
If you don't want tabs, which may or may not get expanded depending on various settings and on what's done with the file, you can prepare and use a string of spaces instead. Then we might as well build the replacement more systematically:
perl -0777 -i.bak -wpe"
    s{Definition\x22\K.*}{}s;
    $s4 = q( ) x 4;   # four spaces
    $_ .= qq(\n${s4}m_Children =\n$s4) . join qq(\n$s4),
        q([),
        q({),
        qq($s4) . q(m_ChildRef = ...),   # etc
        qq(\n);
" file
Now one can either make this into a better system if there is a lot of text -- adding a suitable programming construct for each new level of indentation, for example, or looping over the lines so indentation is added to all of them in one statement, as sketched below -- or condense it if there are really just a few lines.
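For instance, a sketch of that idea, tracking a nesting level and indenting each line accordingly; the line list and bracket rules here are only illustrative:
my $indent = q( ) x 4;
my $level  = 1;
my @out;
for my $line ('m_Children =', '[', '{', 'm_ChildRef = ...', '},', ']') {
    $level-- if $line =~ /^[\]}]/;     # closing bracket: dedent first
    push @out, ($indent x $level) . $line;
    $level++ if $line =~ /[\[{]$/;     # opening bracket: indent what follows
}
my $replacement = join("\n", @out) . "\n";   # then use this in the substitution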
Again, this can run inside the regex's replacement side, with the additional /e modifier.
This can be done line-by-line in one pass as well, using the read-write (+<) mode for open
perl -MPath::Tiny -wE"
    $f = shift // die qq(Need a filename);
    open $fh, qq(+<), $f or die qq(Can't open $f: $!);
    while (<$fh>) { last if /Definition\x22$/ }   # read past the spot to keep
    truncate $fh, tell($fh);                      # remove the rest
    say qq(File now:\n), path($f)->slurp;         # (just to see it now)
    say $fh $_ for                                # add new content
        qq(new line 1),
        qq(new line 2)
" filename
(Be careful with read-write modes; please read the docs with care first.)

Perl - create and sort a hash to produce sorted output

So I have this script that scrapes data from a website: it downloads a CSV, processes it row by row, and converts it into a TSV; once that is finished, the TSV file is converted into an HTML file. I'm done with the rest of that, but the output I'm getting is somewhat messed up. The script goes to different table pages on the source site and downloads a dynamically generated CSV file; that CSV file is then turned into a TSV file that we then turn into HTML. The CSV file seems to be sorted by the first column of each row, but not by any of the other columns in the same row. Therefore, entries with the same first-column value can be jumbled up from one download to the next of the same file.
A visual representation of sorting only by the first column follows, with numbers representing column data:
1st Download:
1-1
1-2
1-3
2-1
2-2
2-3
3-1
3-2
3-3
2nd Download:
1-1
1-3
1-2
2-2
2-1
2-3
3-3
3-2
3-1
So what I have in mind is a process like this: download the CSV file from the source, then sort the lines in that CSV file to normalize them for comparison before writing the TSV or HTML files. This should allow accurate comparison of updated data files, but I don't know how to do this. My logic is like this:
I will put the function between steps 1 and 2: before the CSV file is processed into the TSV file, I want the contents of the CSV to already be sorted.
My script looks like this:
my $download_dir_link ="C:/Users/jabella/Downloads";
unlink("$download_dir_link/Product Classification List.csv");
#CHECK IF CSV FILE DOWNLOAD IS FINISHED
my $complete_download_flag = 0;
while ($complete_download_flag == 0)
{
    my @download_directory = read_dir($download_dir_link);
    foreach my $downloaded_file (@download_directory)
    {
        if ($downloaded_file =~ /\QProduct Classification List.csv\E/sgi)
        {
            $complete_download_flag = 1;
        }
    }
    sleep(5);
}
#SORTED CONTENTS OF CSV BEFORE CONVERSION
print "sORTING csv content...\n";
#CONVERT CSV TO TSV
print "Converting csv to tsv...\n";
my $csv = Text::CSV->new ({ binary => 1 });
my $tsv = Text::CSV->new ({ binary => 1, sep_char => "\t", eol => "\n"});
open my $infh, "<:encoding(utf8)", "$download_dir_link/Product Classification List.csv";
open my $outfh, ">:encoding(utf8)", "Product Classification List.tsv";
while (my $row = $csv->getline($infh))
{
    $tsv->print($outfh, $row);
}
close($infh);
close($outfh);
my $tsv_content = "";
open(my $fh, '<', "Product Classification List.tsv");
while (<$fh>)
{
    $tsv_content = $tsv_content . $_;
}
close($fh);
print "Conversion complete! cleaning tsv content...\n";
#CLEAN TSV CONTENT
$tsv_content =~ s/(.*?)\t"(.*?)"\t"(.*?)"\t"(.*?)"\t(.*?)\t"(.*?)"\t(.*)/<tr><th>$1<\/th><th>$2<\/th><th>$3<\/th><th>$4<\/th><th>$5<\/th><th>$6<\/th><th>$7<\/th><\/tr>/gi;
$tsv_content =~ s/"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\t"?(.*?)"?\n/<tr><td>$1<\/td><td>$2<\/td><td>$3<\/td><td>$4<\/td><td>$5<\/td><td>$6<\/td><td>$7<\/td><\/tr>\n/gi;
$tsv_content =~ s/\"{2}/\"/sgi;
$tsv_content =~ s/(<\/tr>)\n?"/$1/sgi;
$tsv_content =~ s/\s{2,}/ /sgi;
$tsv_content =~ s/.*?(<tr>)/$1/si;
$tsv_content = "<table>\n$tsv_content</table>";
$classification =~ s/_//sgi;
if (exists $existing_index_hash{$doc_uid . "_pind.html"})
{
    if ($existing_index_hash{$doc_uid . "_pind.html"} ne $tsv_content)
    {
        $changed_flag = "1";
        $updated_files = $updated_files . "-$classification\n";
        print "Updated: $classification\n";
I hope someone here can help me with this. Thank you.
Here is a simple script that loads a CSV file specified as an argument and outputs it sorted by the first two columns.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV_XS;
my $csv = 'Text::CSV_XS'->new({binary => 1, auto_diag => 1});
open my $in, '<', shift or die $!;
my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;
}
# Here the sorting happens. Compare the first column;
# if the values are the same, compare the second column.
@rows = sort { $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @rows;
$csv->say(*STDOUT, $_) for @rows;
You can use the following to sort by all columns (but it compares the values as strings, it doesn't work for numbers):
sub by_all {
    my ($n, $A, $B) = @_;
    $A->[$n] cmp $B->[$n]
        || $n < $#$A && by_all($n + 1, $A, $B)
}
sort { by_all(0, $a, $b) } @rows;
To make it work for numbers, too, you can let Perl guess what is a number:
use Scalar::Util qw{ looks_like_number };
sub by_all {
    my ($n, $A, $B) = @_;
    (looks_like_number($A->[$n])
        ? $A->[$n] <=> $B->[$n]
        : $A->[$n] cmp $B->[$n]
    ) || $n < $#$A && by_all($n + 1, $A, $B)
}
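For completeness, here is a sketch of how by_all could be wired into the question's own CSV-to-TSV step. The paths follow the question, the by_all sub is the one defined above, and this is untested against the real data:
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
my $tsv = Text::CSV_XS->new({ binary => 1, sep_char => "\t", eol => "\n" });
open my $in,  '<:encoding(utf8)', "$download_dir_link/Product Classification List.csv" or die $!;
open my $out, '>:encoding(utf8)', 'Product Classification List.tsv' or die $!;
my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;                        # collect every row first
}
@rows = sort { by_all(0, $a, $b) } @rows;    # normalize the order
$tsv->print($out, $_) for @rows;             # then write the TSV
close $in;
close $out;
Sorting before the TSV and HTML are written is what makes two downloads of the same data compare equal.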

Comparing 2 data sets, possibly with a concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in Perl 5) and would like to use Perl 6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation on a daily basis; it takes about ~10 minutes to read one file and populate the hash, so ~20 minutes to read both files and populate the hashes.
The comparison process takes about ~30-50 minutes, including iterating over the hashes, collecting the desired result(s), and writing to the output file (CSV, PSV).
All in all, it can take anywhere between 30 and 60 minutes on a 32-core dual-Xeon server with 256 GB of RAM, including intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single-threaded approach using Perl 5:
Fetch data from 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. The source of data can be either a flat CSV or PSV file, or a database query array-of-arrays result via a DBI client. The data is always unsorted to start with.
To be specific, I read the file line by line, split the fields, choose the desired indexes for the key/value pair, and insert them into the hash.
After collecting the data and populating the hash with the desired key/value pairs, I start to compare and collect results (mainly comparing what is missing or different in s2 w.r.t. s1, and vice versa).
Finally, I dump the output in an Excel file (very costly if the number of lines is large, like ~1 million or greater) or in a simple CSV (a cheap operation; the preferred method).
I was wondering whether I could somehow do the first step in parallel, i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump the output?
What options can Perl 6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations using Perl 6, but I am not so certain which one can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well, but sadly I don't have much to show for what I have tried so far; the reason is that I am just beginning to tackle this one. I am unable to see past the single-threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed 'too broad' by the community, allow me to attempt to highlight my pain points below:
I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
What types of new techniques are available or applicable with Perl 6 in order to tackle this problem or type of problem?
If I spawn 2 processes to read file(s) and collect data - is it possible to get the result back as an array or hash?
Is it possible to compare the data (stored in hash) in parallel?
My current Perl 5 comparison logic is shown below for your reference. I hope this helps, and keeps the question from being shut down.
package COMP;
use strict;
use Data::Dumper;
sub comp
{
    my ($data, $src, $tgt) = @_;
    my $result = {};
    my $ms   = ($result->{ms}   = {});
    my $mt   = ($result->{mt}   = {});
    my $diff = ($result->{diff} = {});
    foreach my $key (keys %{ $data->{$src} })
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$tgt}{$key}) {
            push(@{ $mt->{$key} }, "$src_val|NULL");
        }
        if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
            push(@{ $diff->{$key} }, "$src_val|$tgt_val");
        }
    }
    foreach my $key (keys %{ $data->{$tgt} })
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$src}{$key}) {
            push(@{ $ms->{$key} }, "NULL|$tgt_val");
        }
    }
    return $result;
}
1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User@Host:]$ perl testCOMP.pl
$VAR1 = {
    'mt' => {
        'Source' => [
            'source|NULL'
        ]
    },
    'ms' => {
        'Target' => [
            'NULL|target'
        ]
    },
    'diff' => {
        'Sunday_isit' => [
            'Yes|No'
        ]
    }
};
Test Script
[User@Host:]$ cat testCOMP.pl
#!/usr/bin/env perl
use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;
my $data2 = {
    f1 => {
        Amitabh => 'Bacchan',
        YellowSun => 'Yes',
        Sunday_isit => 'Yes',
        Source => 'source',
    },
    f2 => {
        Amitabh => 'Bacchan',
        YellowSun => 'Yes',
        Sunday_isit => 'No',
        Target => 'target',
    },
};
my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User@Host:]$
If you have an existing and working toolchain, you don't have to rewrite it all to use Perl 6. Its parallelism mechanisms work fine with external processes too. Consider
allnum.pl6
use v6;
my @processes =
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
    .map( -> $filename {
        [ $filename, run "perl", "num.pl", $filename, :out ];
    })
    .hyper;
say "Lazyness Here!";
my $time = time;
for @processes
{
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;
my $file = shift @ARGV;
my $start = time;
my $result = 0;
open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
    chomp $thing;
    $thing =~ s/ //g;
    $result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
The files were set up like so:
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num5.txt
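On the third pain point above, getting a result back as a hash from a spawned process: that is possible in plain Perl 5 as well. Here is a hedged sketch using fork, a pipe, and Storable; the filenames and field indexes are placeholders, not taken from the question:
use strict;
use warnings;
use Storable qw(store_fd fd_retrieve);

# Fork one child per source; each child builds its hash and
# serializes it back to the parent through a pipe.
sub load_in_child {
    my ($file) = @_;
    pipe my $reader, my $writer or die $!;
    my $pid = fork // die $!;
    if ($pid == 0) {                      # child
        close $reader;
        my %h;
        open my $in, '<', $file or die "$file: $!";
        while (<$in>) {
            chomp;
            my ($k, $v) = (split /\|/)[0, 1];   # pick key/value fields
            $h{$k} = $v;
        }
        store_fd \%h, $writer;            # ship the hash to the parent
        close $writer;
        exit 0;
    }
    close $writer;                        # parent keeps the read end
    return ($reader, $pid);
}

my ($r1, $p1) = load_in_child('s1.psv');
my ($r2, $p2) = load_in_child('s2.psv');
my %data = ( s1 => fd_retrieve($r1),      # blocks until each child finishes
             s2 => fd_retrieve($r2) );
waitpid $_, 0 for $p1, $p2;
# %data now holds both hashrefs, ready for COMP::comp(\%data, 's1', 's2')
Both files are read and hashed concurrently; only the deserialization at the end is serial.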

Parsing CSV file with \n in double quoted fields

I'm parsing a CSV file that has line breaks in double-quoted fields. I'm reading the file line by line with a Groovy script, but I get an ArrayIndexOutOfBoundsException when I try to access the missing tokens.
I was trying to pre-process the file to remove those characters, and I was thinking of doing that with a bash script or with Groovy itself.
Could you please suggest an approach I can use to resolve the problem?
This is what the CSV looks like:
header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"
This is the Groovy script I'm using:
def csv = new File(args[0]).text
def retString = ""
def parsedFile = new File("Parsed_" + args[0])
csv.eachLine { line, lineNumber ->
    def splittedLine = line.split(',')
    retString += new Date(splittedLine[0]) + ",${splittedLine[1]},${splittedLine[2]},${splittedLine[3]}\n"
    if (lineNumber % 1000 == 0) {
        parsedFile.append(retString)
        retString = ""
    }
}
parsedFile.append(retString)
UPDATE:
Finally I did this and it works (I needed to format the first column from a timestamp to a human-readable date):
gawk -F',' '{print strftime("%Y-%m-%d %H:%M:%S", substr( $1, 0, length($1)-3 ) )","($2)","($3)","($4)}' TobeParsed.csv > Parsed.csv
Thank you @karakfa
If you use a proper CSV parser rather than trying to do it with split (which as you can see doesn't work with any form of quoting), then it works fine:
@Grab('com.xlson.groovycsv:groovycsv:1.1')
import static com.xlson.groovycsv.CsvParser.parseCsv
def csv = '''header1,header2,header3,header4
timestamp, "abcdefghi", "abcdefghi","sdsd"
timestamp, "zxcvb
fffffgfg","asdasdasadsd","sdsdsd"'''
def data = parseCsv(csv)
data.eachWithIndex { line, index ->
    println """Line $index:
              | 1:$line.header1
              | 2:$line.header2
              | 3:$line.header3
              | 4:$line.header4""".stripMargin()
}
Which prints:
Line 0:
1:timestamp
2:abcdefghi
3:abcdefghi
4:sdsd
Line 1:
1:timestamp
2:zxcvb
fffffgfg
3:asdasdasadsd
4:sdsdsd
awk to the rescue!
This will merge the newline-split fields back together; your process can take it from there:
$ awk -F'"' '!(NF%2){getline remainder;$0=$0 OFS remainder}1' splitted.csv
header1,header2,header3
xxxxxx, "abcdefghi", "abcdefghi"
yyyyyy, "zxcvb fffffgfg","asdasdasadsd"
This assumes that an odd number of quotes means a split field, and it replaces the newline with OFS. If you want to simply delete the newline (so the split parts join directly), remove OFS.

Parsing string into ARGV equivalent (Windows and Perl)

Edit - Answer posted below
I have a script that usually uses @ARGV arguments, but in some cases it is invoked by another script (which I cannot modify) that instead passes only a config filename, which among other things contains the command-line options that should have been passed directly.
Example:
Args=--test --pdf "C:\testing\my pdf files\test.pdf"
If possible I'd like a way to parse this string into an array that would be identical to @ARGV.
I have a workaround where I set up an external Perl script that just echoes @ARGV, and I invoke that script as below (standard boilerplate removed).
echo-args.pl
print join("\n", @ARGV);
test-echo-args.pl
$my_args = '--test --pdf "C:\testing\my pdf files\test.pdf"';
@args = map { chomp; $_ } `perl echo-args.pl $my_args`;
This seems inelegant, but it works. Is there a better way without invoking a new process? I did try splitting and processing, but there are some oddities on the command line, e.g. -a"b c" becomes '-ab c' and -a"b"" becomes -ab", and I'd rather not worry about edge cases, but I know they'll bite me one day if I do.
Answer - thanks ikegami!
I've posted a working program below that uses Win32::API and CommandLineToArgvW from shell32.dll, based on ikegami's advice. It is intentionally verbose in the hope that it'll be easier to follow for anyone who, like myself, is extremely rusty with C and pointer arithmetic.
Any tips are welcome, apart from the obvious simplifications :)
use strict;
use warnings;
use Encode qw( encode decode );
use Win32::API qw( );
use Data::Dumper;
# create a test argument string, with some variations, and pack it
# apparently an empty string returns $^X which is documented so check before calling
my $arg_string = '--test 33 -3-t" "es 33\t2 ';
my $packed_arg_string = encode('UTF-16le', $arg_string."\0");
# create a packed integer buffer for output
my $packed_argc_buf_ptr = pack('L', 0);
# create then call the function and get the result
my $func = Win32::API->new('shell32.dll', 'CommandLineToArgvW', 'PP', 'N')
or die $^E;
my $ret = $func->Call($packed_arg_string, $packed_argc_buf_ptr);
# unpack to get the number of parsed arguments
my $argc = unpack('L', $packed_argc_buf_ptr);
print "We parsed $argc arguments\n";
# parse the return value to get the actual strings
my #argv = decode_LPWSTR_array($ret, $argc);
print Dumper \#argv;
# try not to leak memory
my $local_free = Win32::API->new('kernel32.dll', 'LocalFree', 'N', '')
or die $^E;
$local_free->Call($ret);
exit;
sub decode_LPWSTR_array {
    my ($ptr, $num) = @_;
    return undef if !$ptr;
    # $ptr is the memory location of the array of strings (i.e. more pointers)
    # $num is how many we need to get
    my @strings = ();
    for (1 .. $num) {
        # convert $ptr to a long; using that location, read 4 bytes - this is the pointer to the next string
        my $string_location = unpack('P4', pack('L', $ptr));
        # make it human readable
        my $readable_string_location = unpack('L', $string_location);
        # decode the string and save it for later
        push(@strings, decode_LPCWSTR($readable_string_location));
        # our pointers are 32-bit
        $ptr += 4;
    }
    return @strings;
}
# Copied from http://stackoverflow.com/questions/5529928/perl-win32api-and-pointers
sub decode_LPCWSTR {
    my ($ptr) = @_;
    return undef if !$ptr;
    my $sW = '';
    for (;;) {
        my $chW = unpack('P2', pack('L', $ptr));
        last if $chW eq "\0\0";
        $sW .= $chW;
        $ptr += 2;
    }
    return decode('UTF-16le', $sW);
}
On Unix systems, it's the shell that parses the command line into strings. But on Windows, it's up to each application. I think this is normally done using the CommandLineToArgvW system call (which you could call with the help of Win32::API), but the parsing rules are documented if you want to reimplement them yourself.
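If an approximation is good enough, the core module Text::ParseWords can split such a string without any system call. Note that it applies Unix-style quoting rules rather than Windows ones, so the -a"b c" edge cases above behave differently, and backslashes act as escapes (hence the doubling in this sketch):
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

my $args = '--test --pdf "C:\testing\my pdf files\test.pdf"';
( my $escaped = $args ) =~ s/\\/\\\\/g;   # protect Windows path separators from escape processing
my @argv = shellwords($escaped);
print "$_\n" for @argv;
# --test
# --pdf
# C:\testing\my pdf files\test.pdf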
