Logstash count events using ruby filter

I have the following pipeline (an example, not the full pipeline) where I need to count the number of events for EVERY file I am writing as output. For example, this pipeline creates a new file every 30 seconds with the events processed in those 30 seconds. This code is sort of working, except for two issues:
It writes each trap_id as an entry in the output file (I just want ONE entry with the final count). Though I am only counting events here, my end goal is to count events per log_level, for example, how many "Error", "Information", etc. Windows events.
When I reset the trap_id counter, the first entry in the output file starts with ZERO instead of ONE.
Can someone please advise how I can address these issues?
input {
  beats {
    port => 5045
  }
}
filter {
  ruby {
    init => '
      @trap_id = 0
      @lasttimestmp = 0
    '
    code => '
      evnttme = event.get("[@metadata][ms]")
      if @lasttimestmp == evnttme
        @trap_id += 1
        event.set("lsttimestmp", @lasttimestmp)
        event.set("trap_id", @trap_id)
      else
        @trap_id = 0
        @lasttimestmp = evnttme
        event.set("lsttimestmp", evnttme)
        event.set("[@metadata][ms]", evnttme)
        event.set("trap_id", @trap_id)
      end
    '
  }
}
output {
  file {
    path => "output.log"
  }
  file {
    flush_interval => 30
    codec => line { format => "%{[@metadata][ts]}, %{[trap_id]}"}
    path => "C:/lgstshop/local/csv/output%{[@metadata][ms]}.csv"
  }
}

Related

Logstash aggregate filter didn't add new field into index; in fact it didn't create any index after the aggregate filter was added

Dears,
I have created two grok patterns for a single log file. I want to add an existing field to another document if a condition matches. Could you advise me how to add one existing field to another document?
My log input:
INFO [2020-05-21 18:00:17,240][appListenerContainer-3][ID:b5ba824c-9f79-4dd4-9d53-1012250ce72a] - ValidationRuleSegmentStops - CarrierCode = AI ServiceNumber = 0531 DeparturePortCode = DEL ArrivalPortCode = CCJ DepartureDateTime = Thu May 21 XXXXX AST 2020 ArrivalDateTime = Thu May 21 XXX
WARN [2020-05-21 18:00:17,242][appListenerContainer-3][ID:b5ba824c-9f79-4dd4-9d53-1012250ce72a] - ValidationRuleSegmentStops - Multiple segment stops with departure datetime not set - only one permitted. Message sequence number 374991954 discarded
INFO [2020-05-21 18:00:17,242][appListenerContainer-3][ID:b5ba824c-9f79-4dd4-9d53-1012250ce72a] - SensitiveDataFilterHelper - Sensitive Data Filter key is not enabled.
ERROR [2020-05-21 18:00:17,243][appListenerContainer-3][ID:b5ba824c-9f79-4dd4-9d53-1012250ce72a] - AbstractMessageHandler - APP_LOGICAL_VALIDATION_FAILURE: comment1 = Multiple segment stops with departure datetime not set - only one permitted. Message sequence number 374991954 discarded
my filter
filter {
  if [type] == "server" {
    grok {
match => [ "message", "%{LOGLEVEL:log-level}\s+\[%{TIMESTAMP_ISO8601:createddatetime}\]\[%{DATA:lstcontainer}\](?<seg_corid>[^\s]+)\s+\-\s+%{WORD:abstract}\s+\-\s+CarrierCode\s+=\s+(?<carrierCode>[A-Z0-9]{2})\s+ServiceNumber\s+=\s+(?<Number>[0-9]{4})\s+DeparturePortCode\s+=\s+(?<DeparturePort>[A-Z]{3})\s+ArrivalPortCode\s+=\s+(?<ArrivalPort>[A-Z]{3})\s+DepartureDateTime\s+=\s+%{DATESTAMP_OTHER:departure_datetime}\s+ArrivalDateTime\s+=\s+%{DATESTAMP_OTHER:arrival_datetime},%{LOGLEVEL:log-level}\s+\[%{TIMESTAMP_ISO8601:createddatetime}\]\[%{DATA:lstcontainer}\](?<failed_corid>[^\s]+)\s+\-\s+%{WORD:abstract}\s+\-\s+(?<app-logical-error>[A-Z]{3}\_[A-Z]{7}\_[A-Z]{10}\_[A-Z]{7})\:\s+comment1 = Multiple segment stops with\s+%{WORD:direction}\s+datetime not set - only one permitted. Message sequence number\s+%{NUMBER:appmessageid:int}"]
    }
  }
  if [failed_corid] == "%{seg_corid}" {
    aggregate {
      task_id => "%{appmessageid}"
      code => "map['carrierCode'] = [carrierCode]"
      map_action => "create"
      end_of_task => true
      timeout => 120
    }
  }
  mutate { remove_field => ["message"] }
  if "_grokparsefailure" in [tags] { drop {} }
}
output {
  if [type] == "server" {
    elasticsearch {
      hosts => ["X.X.X.X:9200"]
      index => "app-data-%{+YYYY-MM-DD}"
    }
  }
}
My required fields are found in different grok patterns. For example, the carrier code is found in
%{LOGLEVEL:log-level}\s+\[%{TIMESTAMP_ISO8601:createddatetime}\]\[%{DATA:lstcontainer}\](?<seg_corid>[^\s]+)\s+\-\s+%{WORD:abstract}\s+\-\s+CarrierCode\s+=\s+(?<carrierCode>[A-Z0-9]{2})\s+ServiceNumber\s+=\s+(?<Number>[0-9]{4})\s+DeparturePortCode\s+=\s+(?<DeparturePort>[A-Z]{3})\s+ArrivalPortCode\s+=\s+(?<ArrivalPort>[A-Z]{3})\s+DepartureDateTime\s+=\s+%{DATESTAMP_OTHER:departure_datetime}\s+ArrivalDateTime\s+=\s+%{DATESTAMP_OTHER:arrival_datetime},
and failed_corid is found in
%{LOGLEVEL:log-level}\s+\[%{TIMESTAMP_ISO8601:createddatetime}\]\[%{DATA:lstcontainer}\](?<failed_corid>[^\s]+)\s+\-\s+%{WORD:abstract}\s+\-\s+(?<app-logical-error>[A-Z]{3}\_[A-Z]{7}\_[A-Z]{10}\_[A-Z]{7})\:\s+comment1 = Multiple segment stops with\s+%{WORD:direction}\s+datetime not set - only one permitted. Message sequence number\s+%{NUMBER:appmessageid:int}
We want to merge the carrierCode field into the failed_corid documents. Kindly help us how to do this in Logstash.
Expected output fields in the documents:
{
"failed_corid": [id-433erfdtert3er]
"carrier_code": AI
}

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation on a daily basis, and it takes about ~10 minutes to read one file and populate the hash, so ~20 minutes to read both files and populate the hashes.
The comparison process takes about ~30-50 minutes, including iterating over the hashes, collecting the desired result(s), and writing to the output file (CSV, PSV).
All in all, it can take anywhere between 30 and 60 minutes on a 32-core dual Xeon server with 256 GB of RAM, depending on intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single-threaded approach using perl5.
Fetch data from the 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. The source of the data could be either a flat CSV or PSV file OR a database query array-of-arrays result, via a DBI client. The data is always unsorted to start with.
To be specific, I read the file line by line, split the fields, choose the desired indexes for the key/value pair, and insert them into the hash.
After collecting the data and populating the hash with the desired key/value pairs, I start to compare and collect results (mainly comparing what is missing or different in s2 w.r.t. s1 and vice-versa).
Dump the output to an Excel file (very costly if the number of lines is large, like ~1 million or greater) or to a simple CSV (a cheap operation; the preferred method).
I was wondering whether I could somehow do the first step in parallel, i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump the output.
What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations in perl6, but I am not so certain which one can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well, but sadly I don't have much to show for what I have tried till now; the reason is that I am just beginning to tackle this one. I am just unable to see past the single-threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed by the community as 'too broad' - allow me to attempt to highlight my pain points below:
I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
What new techniques are available or applicable with perl6 to tackle this problem or this type of problem?
If I spawn 2 processes to read file(s) and collect data - is it possible to get the result back as an array or hash?
Is it possible to compare the data (stored in hash) in parallel?
My current p5 comparison logic is shown below for your reference. I hope this helps and keeps this question from being shut down.
package COMP;
use strict;
use Data::Dumper;

sub comp
{
    my ($data, $src, $tgt) = @_;
    my $result = {};
    my $ms = ($result->{ms} = {});
    my $mt = ($result->{mt} = {});
    my $diff = ($result->{diff} = {});
    foreach my $key (keys %{$data->{$src}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$tgt}{$key}) {
            push (@{$mt->{$key}}, "$src_val|NULL");
        }
        if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
            push (@{$diff->{$key}}, "$src_val|$tgt_val")
        }
    }
    foreach my $key (keys %{$data->{$tgt}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$src}{$key}) {
            push (@{$ms->{$key}}, "NULL|$tgt_val");
        }
    }
    return $result;
}
1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User@Host:]$ perl testCOMP.pl
$VAR1 = {
          'mt' => {
                    'Source' => [
                                  'source|NULL'
                                ]
                  },
          'ms' => {
                    'Target' => [
                                  'NULL|target'
                                ]
                  },
          'diff' => {
                      'Sunday_isit' => [
                                         'Yes|No'
                                       ]
                    }
        };
Test Script
[User@Host:]$ cat testCOMP.pl
#!/usr/bin/env perl
use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;

my $data2 = {
    f1 => {
        Amitabh => 'Bacchan',
        YellowSun => 'Yes',
        Sunday_isit => 'Yes',
        Source => 'source',
    },
    f2 => {
        Amitabh => 'Bacchan',
        YellowSun => 'Yes',
        Sunday_isit => 'No',
        Target => 'target',
    },
};
my $result = COMP::comp ($data2, 'f1', 'f2');
print Dumper $result;
[User@Host:]$
If you have an existing and working toolchain, you don't have to rewrite it all to use Perl6. Its parallelism mechanisms work fine with external processes too. Consider
allnum.pl6
use v6;
my @processes =
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
    .map( -> $filename {
        [ $filename, run "perl", "num.pl", $filename, :out ];
    })
    .hyper;

say "Lazyness Here!";

my $time = time;
for @processes
{
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;

my $file = shift @ARGV;
my $start = time;
my $result = 0;

open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
    chomp $thing;
    $thing =~ s/ //g;
    $result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
Files were set up like so
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) x 100, int(rand(1000)), chr(32) x 100, qq(\n); }">num5.txt
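If staying in Perl 5 for the data-loading step is an option, the same "read both sources at once" idea can also be sketched with the core threads module. This is only a rough illustration, not part of the original answer: the file names, the pipe-delimited format and the column indexes below are made up.

use strict;
use warnings;
use threads;

# Hypothetical loader: build key => value pairs from one pipe-delimited file.
# Which columns form the key and the value (0 and 3 here) is an assumption.
sub load_source {
    my ($path) = @_;
    my %h;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my @cols = split /\|/, $line;
        $h{ $cols[0] } = $cols[3];
    }
    close $fh;
    return \%h;
}

# One thread per source, so both files are read at the same time.
my $t1 = threads->create(\&load_source, 's1.psv');
my $t2 = threads->create(\&load_source, 's2.psv');

# join() hands each hash back to the main thread; note that the data is
# copied between threads here, which is not free for tens of millions of keys.
my $data = { s1 => $t1->join, s2 => $t2->join };

# The COMP::comp routine from the question could then be reused unchanged:
# my $result = COMP::comp($data, 's1', 's2');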

How to save and display Dashing historical values?

Currently, to set up a graph widget, the job should pass all the values to be displayed:
data = [
  { "x" => 1980, "y" => 1323 },
  { "x" => 1981, "y" => 53234 },
  { "x" => 1982, "y" => 2344 }
]
I would like to read just current (the latest) value from my server, but previous values should be also displayed.
It looks like I could create a job, which will read the current value from the server, but remaining values to be read from the Redis (or sqlite database, but I would prefer Redis). The current value after that should be saved to the database.
I have never worked with Ruby and Dashing before, so the first question I have is: is it possible at all? If I use Redis, the next question is how to store the data, since it is a key-value database. I can keep it as widget-id-1, widget-id-2, widget-id-3 ... widget-id-N etc., but in that case I will also have to store N itself (like widget-id=N). Or is there a better way?
I came up with the following solution:
require 'redis' # https://github.com/redis/redis-rb

redis_uri = URI.parse(ENV["REDISTOGO_URL"])
redis = Redis.new(:host => redis_uri.host, :port => redis_uri.port, :password => redis_uri.password)

if redis.exists('values_x') && redis.exists('values_y')
  values_x = redis.lrange('values_x', 0, 9) # get latest 10 records
  values_y = redis.lrange('values_y', 0, 9) # get latest 10 records
else
  values_x = []
  values_y = []
end

SCHEDULER.every '10s', :first_in => 0 do |job|
  rand_data = (Date.today-rand(10000)).strftime("%d-%b") # replace this line with the code to get your data
  rand_value = rand(50) # replace this line with the code to get your data
  values_x << rand_data
  values_y << rand_value

  redis.multi do # execute as a single transaction
    redis.lpush('values_x', rand_data)
    redis.lpush('values_y', rand_value)
    # feel free to add more datasets values here, if required
  end

  data = [
    {
      label: 'dataset-label',
      fillColor: 'rgba(220,220,220,0.5)',
      strokeColor: 'rgba(220,220,220,0.8)',
      highlightFill: 'rgba(220,220,220,0.75)',
      highlightStroke: 'rgba(220,220,220,1)',
      data: values_y.last(10) # display last 10 values only
    }
  ]
  options = { scaleFontColor: '#fff' }

  send_event('barchart', { labels: values_x.last(10), datasets: data, options: options })
end
Not sure if everything is implemented correctly here, but it works.

Slow performance in spark streaming

I am using Spark Streaming 1.1.0 locally (not in a cluster).
I created a simple app that parses the data (about 10,000 entries), stores it in a stream and then makes some transformations on it. Here is the code:
def main(args : Array[String]){
  val master = "local[8]"
  val conf = new SparkConf().setAppName("Tester").setMaster(master)
  val sc = new StreamingContext(conf, Milliseconds(110000))

  val stream = sc.receiverStream(new MyReceiver("localhost", 9999))
  val parsedStream = parse(stream)

  parsedStream.foreachRDD(rdd =>
    println(rdd.first()+"\nRULE STARTS "+System.currentTimeMillis()))

  val result1 = parsedStream
    .filter(entry => entry.symbol.contains("walking")
      && entry.symbol.contains("true") && entry.symbol.contains("id0"))
    .map(_.time)

  val result2 = parsedStream
    .filter(entry =>
      entry.symbol == "disappear" && entry.symbol.contains("id0"))
    .map(_.time)

  val result3 = result1
    .transformWith(result2, (rdd1, rdd2: RDD[Int]) => rdd1.subtract(rdd2))

  result3.foreachRDD(rdd =>
    println(rdd.first()+"\nRULE ENDS "+System.currentTimeMillis()))

  sc.start()
  sc.awaitTermination()
}

def parse(stream: DStream[String]) = {
  stream.flatMap { line =>
    val entries = line.split("assert").filter(entry => !entry.isEmpty)
    entries.map { tuple =>
      val pattern = """\s*[(](.+)[,]\s*([0-9]+)+\s*[)]\s*[)]\s*[,|\.]\s*""".r
      tuple match {
        case pattern(symbol, time) =>
          new Data(symbol, time.toInt)
      }
    }
  }
}

case class Data (symbol: String, time: Int)
I have a batch duration of 110,000 milliseconds in order to receive all the data in one batch. I believed that, even locally, Spark is very fast. In this case, it takes about 3.5 seconds to execute the rule (between "RULE STARTS" and "RULE ENDS"). Am I doing something wrong, or is this the expected time? Any advice?
So I was using case matching in a lot of my jobs and it killed performance, more than when I introduced a JSON parser. Also try tweaking the batch time on the StreamingContext; it made quite a bit of difference for me. Also, how many local workers do you have?

Parsing large XML files?

I have 2 XML files, one 115 MB in size and another 34 MB.
While reading file A, there is a field called desc that relates it to file B, where I retrieve the field id from file B when desc of file A is equal to name of file B.
File A is already very big, and then I have to search inside file B, so it takes a very long time to complete.
How could I speed up this process, or what would be a better approach to do it?
Current code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict XMLin);

my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);
my ($nameid,$rank);

open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";

foreach my $npc ( @{ $npcs->{npc_client} } ) {
    if (defined $npc->{desc}) {
        foreach my $string (@{$strings->{string}}) {
            if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
                $nameid = $string->{id};
                last;
            }
        }
    } else {
        $nameid = "";
    }

    if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
        $rank = 'LEGENDARY';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
        $rank = 'HERO';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
        $rank = 'ELITE';
    } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
        $rank = 'NORMAL';
    } else {
        $rank = $gauge;
    }

    print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</<npc_templates>";
close($fh);
example of file A.xml:
<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
  <npc_client>
    <id>200000</id>
    <name>SkillZone</name>
    <desc>STR_NPC_NO_NAME</desc>
    <dir>Monster/Worm</dir>
    <mesh>Worm</mesh>
    <material>mat_mob_reptile</material>
    <show_dmg_decal>0</show_dmg_decal>
    <ui_type>general</ui_type>
    <cursor_type>none</cursor_type>
    <hide_path>0</hide_path>
    <erect>1</erect>
    <bound_radius>
      <front>1.200000</front>
      <side>3.456000</side>
      <upper>3.000000</upper>
    </bound_radius>
    <scale>10</scale>
    <weapon_scale>100</weapon_scale>
    <altitude>0.000000</altitude>
    <stare_angle>75.000000</stare_angle>
    <stare_distance>20.000000</stare_distance>
    <move_speed_normal_walk>0.000000</move_speed_normal_walk>
    <art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
    <move_speed_normal_run>0.000000</move_speed_normal_run>
    <move_speed_combat_run>0.000000</move_speed_combat_run>
    <art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
    <in_time>0.100000</in_time>
    <out_time>0.500000</out_time>
    <neck_angle>90.000000</neck_angle>
    <spine_angle>10.000000</spine_angle>
    <ammo_bone>Bip01 Head</ammo_bone>
    <ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
    <ammo_speed>50</ammo_speed>
    <pushed_range>0.000000</pushed_range>
    <hpgauge_level>3</hpgauge_level>
    <magical_skill_boost>0</magical_skill_boost>
    <attack_delay>2000</attack_delay>
    <ai_name>SummonSkillArea</ai_name>
    <tribe>General</tribe>
    <pet_ai_name>Pet</pet_ai_name>
    <sensory_range>15.000000</sensory_range>
  </npc_client>
</npc_clients>
example of file B.xml:
<?xml version="1.0" encoding="utf-16"?>
<strings>
  <string>
    <id>350000</id>
    <name>STR_NPC_NO_NAME</name>
    <body> </body>
  </string>
</strings>
Here is an example of XML::Twig usage. The main advantage is that it does not hold the whole file in memory, so processing is much faster. The code below tries to emulate the operation of the script from the question.
use XML::Twig;

my %strings = ();

XML::Twig->new(
    twig_handlers => {
        'strings/string' => sub {
            $strings{ lc $_->first_child('name')->text }
                = $_->first_child('id')->text
        },
    }
)->parsefile('B.xml');

print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";

XML::Twig->new(
    twig_handlers => {
        'npc_client' => sub {
            my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };
            # calculate rank as needed
            my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
            $rank = $hpgauge_level >= 28 ? 'ERROR'
                : $hpgauge_level > 25 ? 'LEGENDARY'
                : $hpgauge_level > 21 ? 'HERO'
                : $hpgauge_level > 10 ? 'ELITE'
                : $hpgauge_level > 0 ? 'NORMAL'
                : $hpgauge_level;
            my $npc_id = eval { $_->first_child('id')->text };
            my $name = eval { $_->first_child('name')->text };
            my $tribe = eval { $_->first_child('tribe')->text };
            my $scale = eval { $_->first_child('scale')->text };
            my $race_type = eval { $_->first_child('race_type')->text };
            print
                qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
            $_->purge;
        }
    }
)->parsefile('A.xml');
print "</<npc_templates>";
Grab all the interesting 'desc' fields from file A and put them in a hash. You only have to parse it once, but if it still takes too long, have a look at XML::Twig.
Parse file B once and extract the stuff you need. Use the hash.
Looks like you only need parts of the XML files. XML::Twig can parse only the elements you are interested in and throw away the rest using the "twig_roots" parameter. XML::Simple is easier to get started with, though.
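For what it's worth, here is a minimal sketch of that twig_roots idea applied to file B, building the same lowercased name-to-id hash the XML::Twig answer above builds (the file path is taken from the question's code; the hash name is made up):

use strict;
use warnings;
use XML::Twig;

my %desc_to_id;

# With twig_roots, only the <string> elements are built into twigs;
# everything outside them is skipped rather than kept in memory.
XML::Twig->new(
    twig_roots => {
        'strings/string' => sub {
            my ($twig, $string) = @_;
            $desc_to_id{ lc $string->first_child_text('name') }
                = $string->first_child_text('id');
            $twig->purge;    # release the element that was just handled
        },
    },
)->parsefile('Client/client_strings.xml');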
Although I can't help you with the specifics of your Perl code, there are some general guidelines when dealing with large volumes of XML data. There are, broadly speaking, two kinds of XML APIs: DOM-based and stream-based. DOM-based APIs (like XML DOM) will parse an entire XML document into memory before the user-level API becomes "available", whereas with a stream-based API (like SAX) the implementation doesn't need to parse the whole XML document. One benefit of stream-based parsers is that they typically use much less memory, as they don't need to hold the entire XML document in memory at once - this is obviously a good thing when dealing with large XML documents. Looking at the XML::Simple docs, it seems there may be SAX support available - have you tried this?
I'm not a perl guy, so take this with a grain of salt, but I see 2 problems:
The fact that you are iterating over all of the values in file B until you find the correct value, for each element in file A, is inefficient. Instead, you should be using some sort of map/dictionary for the values in file B (a sketch of this follows below).
It looks like you are parsing both files into memory before you even start processing. File A would be better processed as a stream instead of loading the entire document into memory.
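As a rough illustration of that map/dictionary point, and keeping the question's XML::Simple structures, the inner loop over file B can be replaced by building a lookup hash once, assuming an exact case-insensitive match is enough (the same assumption the XML::Twig answer above makes):

use strict;
use warnings;
use XML::Simple qw(:strict XMLin);

my $strings = XMLin('Client/client_strings.xml',
                    KeyAttr => { }, ForceArray => [ 'string' ]);

# Build the lookup once: lowercased name => id.
my %name_to_id;
for my $string (@{ $strings->{string} }) {
    next unless defined $string->{name};
    $name_to_id{ lc $string->{name} } = $string->{id};
}

# Inside the loop over the npc_client entries, the nested foreach then
# becomes a single lookup, e.g.:
# my $nameid = defined $npc->{desc} ? ($name_to_id{ lc $npc->{desc} } // '') : '';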
