Parsing large XML files? - performance

I have 2 xml files 1 with 115mb size and another with 34mb size.
Wiile reading file A there is 1 field called desc that relations it with file B where I retrieve the field id from file B where desc.file A is iqual to name.file B.
file A is already too big then I have to search inside file B and it takes a very long time to complete.
How could I speed up this proccess or what would be a better approch to do it ?
current code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict XMLin);
my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);
my ($nameid,$rank);
open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
foreach my $npc ( #{ $npcs->{npc_client} } ) {
if (defined $npc->{desc}) {
foreach my $string (#{$strings->{string}}) {
if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
$nameid = $string->{id};
last;
}
}
} else {
$nameid = "";
}
if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
$rank = 'LEGENDARY';
} elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
$rank = 'HERO';
} elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
$rank = 'ELITE';
} elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
$rank = 'NORMAL';
} else {
$rank = $gauge;
}
print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</<npc_templates>";
close($fh);
example of file A.xml:
<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
<npc_client>
<id>200000</id>
<name>SkillZone</name>
<desc>STR_NPC_NO_NAME</desc>
<dir>Monster/Worm</dir>
<mesh>Worm</mesh>
<material>mat_mob_reptile</material>
<show_dmg_decal>0</show_dmg_decal>
<ui_type>general</ui_type>
<cursor_type>none</cursor_type>
<hide_path>0</hide_path>
<erect>1</erect>
<bound_radius>
<front>1.200000</front>
<side>3.456000</side>
<upper>3.000000</upper>
</bound_radius>
<scale>10</scale>
<weapon_scale>100</weapon_scale>
<altitude>0.000000</altitude>
<stare_angle>75.000000</stare_angle>
<stare_distance>20.000000</stare_distance>
<move_speed_normal_walk>0.000000</move_speed_normal_walk>
<art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
<move_speed_normal_run>0.000000</move_speed_normal_run>
<move_speed_combat_run>0.000000</move_speed_combat_run>
<art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
<in_time>0.100000</in_time>
<out_time>0.500000</out_time>
<neck_angle>90.000000</neck_angle>
<spine_angle>10.000000</spine_angle>
<ammo_bone>Bip01 Head</ammo_bone>
<ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
<ammo_speed>50</ammo_speed>
<pushed_range>0.000000</pushed_range>
<hpgauge_level>3</hpgauge_level>
<magical_skill_boost>0</magical_skill_boost>
<attack_delay>2000</attack_delay>
<ai_name>SummonSkillArea</ai_name>
<tribe>General</tribe>
<pet_ai_name>Pet</pet_ai_name>
<sensory_range>15.000000</sensory_range>
</npc_client>
</npc_clients>
example of file B.xml:
<?xml version="1.0" encoding="utf-16"?>
<strings>
<string>
<id>350000</id>
<name>STR_NPC_NO_NAME</name>
<body> </body>
</string>
</strings>

Here is example of XML::Twig usage. The main advantage is that it is not holding whole file in memory, so processing is much faster. The code below is trying to emulate operation of script from question.
use XML::Twig;
my %strings = ();
XML::Twig->new(
twig_handlers => {
'strings/string' => sub {
$strings{ lc $_->first_child('name')->text }
= $_->first_child('id')->text
},
}
)->parsefile('B.xml');
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
XML::Twig->new(
twig_handlers => {
'npc_client' => sub {
my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };
# calculate rank as needed
my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
$rank = $hpgauge_level >= 28 ? 'ERROR'
: $hpgauge_level > 25 ? 'LEGENDARY'
: $hpgauge_level > 21 ? 'HERO'
: $hpgauge_level > 10 ? 'ELITE'
: $hpgauge_level > 0 ? 'NORMAL'
: $hpgauge_level;
my $npc_id = eval { $_->first_child('id')->text };
my $name = eval { $_->first_child('name')->text };
my $tribe = eval { $_->first_child('tribe')->text };
my $scale = eval { $_->first_child('scale')->text };
my $race_type = eval { $_->first_child('race_type')->text };
print
qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
$_->purge;
}
}
)->parsefile('A.xml');
print "</<npc_templates>";

Grab all the interesting 'desc' fields from file A and put them in a hash. You only have to parse it once, but if it still takes too long have a look at XML::Twig.
Parse file B. once and extract the stuff you need. Use the hash.
Looks like you only need parts of the xml files. XML::Twig can parse only the elements you are interested in and throw away the rest using the "twig_roots" parameter. XML::Simple is easier to get started with though..

Although I can't help you with the specifics of your Perl code, there are some general guidelines when dealing with large volumes of XML data. There are, broadly speaking, 2 kinds of XML APIs - DOM based and Stream based. Dom based API's (like XML DOM) will parse an entire XML document in to memory before the user-level API becomes "available", whereas with a stream based API (like SAX) the implementation doesn't need to parse the whole XML document. One benefit of Stream based parsers are that they typically use much less memory, as they don't need to hold the entire XML document in memory at once - this is obviously a good thing when dealing with large XML documents. Looking at the XML::Simple docs here, it's seems there may be SAX support available - have you tried this?

I'm not a perl guy, so take this with a grain of salt, but I see 2 problems:
The fact that you are iterating over all of the values in file B until you find the correct value for each element in file A is inefficient. Instead, you should be using some sort of map/dictionary for the values in file B.
It looks like you are parsing the both files in memory before you even start processing. File A would be better processed as a stream instead of loading the entire document into memory.

Related

TCL script to find a script and add new text before searched string with number

Below is the partial content of my input file:
xyz
abc
MainContent(abc_dt) {
sc: it;
}
MainContent(xyz_cnt) {
sc : it;
}
MainContent(asd_zxc) {
sc : it;
}
Here, I want to search "MainContent" line and want to add new line before it ... this new line should have "Sb (text which is inside bracket in MainContent_1)"... this should also add opening bracket and closing bracket before next Sb occurance:
Expected output from the script:
xyz
abc
Sb(abc_dt_sb1) {
MainContent(abc_dt) {
sc: it;
}
}
Sb(xyz_cnt_sb2) {
MainContent(xyz_cnt) {
sc : it;
}
}
Sb(asd_zxc_sb3) {
MainContent(asd_zxc) {
sc : it;
}
}
Can someone please help to me create a TCL script for this?
This code will take your text to be processed on standard input and produce the results on standard output. Redirect it as required.
set counter 0
set expectBrace false
while {[gets stdin line] >= 0} {
if {!$expectBrace && [regexp {^\s*MainContent\s*\((\w+)\)} $line -> bits]} {
puts [format "Sb(%s_sb%d) \{" $bits [incr counter]]
set expectBrace true
}
puts $line
if {$expectBrace && [regexp {^\s*\}\s*$} $line]} {
puts "\}"
set expectBrace false
}
}
Using regular expressions to do the matching of the triggers for state changes in a little state machine (two states, governed by expectBrace) is pretty conventional parsing. I've used format to do the substitutions into the Sb(…) line; these are simple enough that you could use direct substitutions instead if you prefer.
I've not done anything about adding indentation.

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation on a daily basis and it can take about ~10 minutes to read from a file and populate the hash. ~20 minutes spent to read both files and to populate hash.
comparison process takes about ~30-50 minutes including iterating over hash, collecting desired result(s), and writing to output file (csv,psv).
All in all it can take anywhere between 30 minutes to 60 minutes on a 32 core dual xeon cpu server with 256gb of RAM, including intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single threaded approach using perl5.
fetch data from 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. Source of data could be either a flat csv or psv file OR a database query Array of Array result, via DBI client. Data is always unsorted to start with.
To be specific, I read the file line by line,split fields, and choose desired indexes for key,value pair and insert into hash.
After collecting data and populating hash with desired key/value pairs,I start to compare and collect results (mainy comparing on what is missing or different in s2 w.r.t s1 and vice-versa).
dump output in an excel file (very costly if no. of lines is large like ~1 million or greater) or in a simple CSV (cheap operation. preferred method).
I was wondering whether if I could somehow do the first step in parallel i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump output?
What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations using perl6 but I am not so certain which one can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well but sadly I don't have much to show for what have I tried till now? and reason is that I am just beginning to tackle this one. I am just unable to see past the single threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed by the community as 'too broad' - allow me to attempt to highlight my pain points below:
I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
What type of new techniques are available or applicable with perl6 in order to tackle this problem or type of problem.
If I spawn 2 processes to read file(s) and collect data - is it possible to get the result back as an array or hash?
Is it possible to compare the data (stored in hash) in parallel?
My current p5 comparison logic is shown below for your reference. Hope this helps and not let this question shutdown.
package COMP;
use strict;
use Data::Dumper;
sub comp
{
my ($data,$src,$tgt) = #_;
my $result = {};
my $ms = ($result->{ms} = {});
my $mt = ($result->{mt} = {});
my $diff = ($result->{diff} = {});
foreach my $key (keys %{$data->{$src}})
{
my $src_val = $data->{$src}{$key};
my $tgt_val = $data->{$tgt}{$key};
next if ($src_val eq $tgt_val);
if (!exists $data->{$tgt}{$key}) {
push (#{$mt->{$key}}, "$src_val|NULL");
}
if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
push (#{$diff->{$key}}, "$src_val|$tgt_val")
}
}
foreach my $key (keys %{$data->{$tgt}})
{
my $src_val = $data->{$src}{$key};
my $tgt_val = $data->{$tgt}{$key};
next if ($src_val eq $tgt_val);
if (!exists $data->{$src}{$key}) {
push (#{$ms->{$key}},"NULL|$tgt_val");
}
}
return $result;
}
1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User#Host:]$ perl testCOMP.pl
$VAR1 = {
'mt' => {
'Source' => [
'source|NULL'
]
},
'ms' => {
'Target' => [
'NULL|target'
]
},
'diff' => {
'Sunday_isit' => [
'Yes|No'
]
}
};
Test Script
[User#Host:]$ cat testCOMP.pl
#!/usr/bin/env perl
use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;
my $data2 = {
f1 => {
Amitabh => 'Bacchan',
YellowSun => 'Yes',
Sunday_isit => 'Yes',
Source => 'source',
},
f2 => {
Amitabh => 'Bacchan',
YellowSun => 'Yes',
Sunday_isit => 'No',
Target => 'target',
},
};
my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User#Host:]$
If you have an existing and working toolchain you don't have to rewrite it all to use Perl6. It's parallelism mechanisms work fine with external processess too. Consider
allnum.pl6
use v6;
my #processes =
[ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
.map( -> $filename {
[ $filename, run "perl", "num.pl", $filename, :out ];
})
.hyper;
say "Lazyness Here!";
my $time = time;
for #processes
{
say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;
my $file = shift #ARGV;
my $start = time;
my $result = 0;
open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
chomp $thing;
$thing =~ s/ //g;
$result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
Files were set up like so
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num5.txt

Need simple parallelism example in Perl 6

I am trying to learn Perl 6 and parallelism/concurrency at the same time.
For a simple learning exercise, I have a folder of 550 '.htm' files and I want the total sum of lines of code among all of them. So far, I have this:
use v6;
my $start_time = now;
my $exception;
my $total_lines = 0;
my #files = "c:/testdir".IO.dir(test => / '.' htm $/);
for #files -> $file {
$total_lines += $file.lines.elems;
CATCH {
default { $exception = $_; } #some of the files error out for malformed utf-8
}
}
say $total_lines;
say now - $start_time;
That gives a sum of 577,449 in approximately 3 seconds.
How would I rewrite that to take advantage of Perl 6 parallelism ideas? I realize the time saved won't be much but it will work as proof of concept.
Implementing Christoph's suggestion. The count is slightly higher than my original post because I'm now able to read in the malformed UTF-8 files using encode latin1.
use v6;
my $start_time = now;
my #files = "c:/iforms/live".IO.dir(test => / '.' htm $/);
my $total_lines = [+] #files.race.map(*.lines(:enc<latin1>).elems);
say $total_lines;
say now - $start_time;
use v6;
my $start_time = now;
my $exception;
my #lines;
my $total_lines = 0;
my #files = "c:/testdir".IO.dir(test => / '.' htm $/);
await do for #files -> $file {
start {
#lines.push( $file.lines.elems );
CATCH {
default { $exception = $_; } #some of the files error out for malformed utf-8
}
}
}
$total_lines = [+] #lines;
say $total_lines;
say now - $start_time;

How can I make DOT correctly process UTF-8 to PostScript and have multiple graph/pages?

This dot source
graph A
{
a;
}
graph B
{
"Enûma Eliš";
}
when compiled with dot -Tps generates this error
Warning: UTF-8 input uses non-Latin1 characters which cannot be handled by this PostScript driver
I can fix the UTF-8 problem by passing -Tps:cairo but then only graph A is in the output -- it is truncated to a single page. The same happens with -Tpdf. There are no other postscript driver available on my installation.
I could split the graphs into separate files and concatenate them afterwards, but I'd rather not. Is there a way to have correct UTF-8 handling and multiple page output?
Generating PDF or SVG could bypass the encoding problem too.
dot -Tpdf chs.dot > chs.pdf
// or
dot -Tsvg chs.dot > chs.svg
Apparently the dot PS driver can't handle other encodings than the old ISO8859-1. I think it can't change fonts either.
One thing you can do is to run a filter to change dot's PostScript output. The following Perl program does that, it's an adaptation of some code I had. It changes the encoding from UTF-8 to a modified ISO encoding with extra characters replacing unused ones.
Of course, the output still depends on the font having the characters. Since dot (I think) only uses the default PostScript fonts, anything beyond the "standard latin" is out of the question...
It works with Ghostscript or with any interpreter which defines AdobeGlyphList.
The filter should be used this way:
dot -Tps graph.dot | perl reenc.pl > output.ps
Here it is:
#!/usr/bin/perl
use strict;
use warnings;
use open qw(:std :utf8);
my $ps = do { local $/; <STDIN> };
my %high;
my %in_use;
foreach my $char (split //, $ps) {
my $code = (unpack("C", $char))[0];
if ($code > 127) {
$high{$char} = $code;
if ($code < 256) {
$in_use{$code} = 1;
}
}
}
my %repl;
my $i = 128;
foreach my $char (keys %high) {
if ($in_use{$high{$char}}) {
$ps =~ s/$char/sprintf("\\%03o", $high{$char})/ge;
next;
}
while ($in_use{$i}) { $i++; }
$repl{$i} = $high{$char};
$ps =~ s/$char/sprintf("\\%03o", $i)/ge;
$i++;
}
my $psprocs = <<"EOPS";
/EncReplacements <<
#{[ join(" ", %repl) ]}
>> def
/RevList AdobeGlyphList length dict dup begin
AdobeGlyphList { exch def } forall
end def
% code -- (uniXXXX)
/uniX { 16 6 string cvrs dup length 7 exch sub exch
(uni0000) 7 string copy dup 4 2 roll putinterval } def
% font code -- glyphname
/unitoname { dup RevList exch known
{ RevList exch get }
{ uniX cvn } ifelse
exch /CharStrings get 1 index known not
{ pop /.notdef } if
} def
/chg-enc { dup length array copy EncReplacements
{ currentdict exch unitoname 2 index 3 1 roll put } forall
} def
EOPS
$ps =~ s{/Encoding EncodingVector def}{/Encoding EncodingVector chg-enc def};
$ps =~ s/(%%BeginProlog)/$1\n$psprocs/;
print $ps;

Algorithm to do numeric profile of the string

I have few file similar to below, and I am trying to do numeric profiling as mentioned in the image
>File Sample
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I have to map each substring of size 2 and then map it to 33 value for different ptoperties and then add as per the window size of 5.
my %temp = (
aCount => {
aa =>2
}
cCount => {
aa => 0
}
);
My current implementation include as per below ,
while (<FILE>) {
my $line = $_;
chomp $line;
while ($line=~/(.{2})/og) {
$subStr = $1;
if (exists $temp{aCount}{$subStr}) {
push #{$temp{aCount_array}},$temp{aCount}{$subStr};
if (scalar(#{$temp{aCount_array}}) == $WINDOW_SIZE) {
my $sum = eval (join('+',#{$temp{aCount_array}}));
shift #{$temp{aCount_array}};
#Similar approach has been taken to other 33 rules
}
}
if (exists $temp{cCount}{$subStr}) {
#similar approach
}
$line =~s/.{1}//og;
}
}
is there any other approach to increase the speed of the overall process
Regular expressions are awesome, but they can be overkill when all you need are fixed width substrings. Alternatives are substr
$len = length($line);
for ($i=0; $i<$len; $i+=2) {
$subStr = substr($line,$i,2);
...
}
or unpack
foreach $subStr (unpack "(A2)*", $line) {
...
}
I don't know how much faster either of these will be than regular expressions, but I know how I would find out.

Resources