How do I merge based on missing elements with jq?

I have two JSON files that I am working with: dest_sr.json and src_templates.json
What I would like to do is create a new JSON object that combines keys from JSON object #1 (specifically the Template_SR key) and JSON object #2 (the Template_Name and Template_ID keys), but only for a pool in object #1 on which the VM-template from object #2 cannot be found, i.e. where the Pool_ID values do not match.
For the two given JSON objects:
dest_sr.json:
{
"Type": "pool",
"Pool_ID": "adb58e84",
"Template_SR": "fc820294"
}
{
"Type": "pool",
"Pool_ID": "d2dea684",
"Template_SR": "313f2a07"
}
src_templates.json:
{
"Type": "VM-template",
"Template_Name": "CentOS 7 CloudInit V1",
"Template_ID": "9b1833a3",
"Pool_ID": "adb58e84"
}
I would like to create a new JSON object that looks like this:
{
"Template_ID":"9b1833a3",
"Template_SR":"313f2a07",
"Template_Name":"CentOS 7 CloudInit V1"
}
Using examples from Stack Overflow I was able to hack together a hashJoin that looks like this. It joins on the .Pool_ID attribute to create a new object:
# hashJoin(a1; a2; field) expects a1 and a2 to be arrays of JSON objects
# and that for each of the objects, the field value is a string.
# A relational join is performed on "field".
def hashJoin(a1; a2; field):
# hash phase:
(reduce a1[] as $o ({}; . + { ($o | field): $o } )) as $h1
| (reduce a2[] as $o ({}; . + { ($o | field): $o } )) as $h2
# join phase:
| reduce ($h1|keys[]) as $key
([]; if $h2|has($key) then . + [ $h1[$key] + $h2[$key] ] else . end) ;
hashJoin( $file1; $file2; .Pool_ID)[]
When I call JQ to perform the hashJoin, I get a new object that looks like this:
/tmp $ jq -nc --slurpfile file1 /tmp/dest_sr.json --slurpfile file2 /tmp/src_templates.json -f hashJoinSimple.jq
{
"Type":"VM-template",
"Pool_ID":"adb58e84",
"Template_SR":"fc820294",
"Template_Name":"CentOS 7 CloudInit V1",
"Template_ID":"9b1833a3"
}
Is there any way I can write my merge so that the object would look like this?
{
"Type":"VM-template",
"Pool_ID":"d2dea684",
"Template_SR":"313f2a07",
"Template_Name":"CentOS 7 CloudInit V1",
"Template_ID":"9b1833a3"
}

Given your particular inputs and desired result, I don't think the hashing is necessary. It seems all you need to do is select the objects from dest_sr.json whose Pool_ID doesn't match the one in src_templates.json, then combine each with the desired values from the template.
$ jq --argfile template src_templates.json '
select(.Pool_ID != $template.Pool_ID) + ($template|{Template_Name,Template_ID})
' dest_sr.json
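For comparison, the same "merge only where the template is missing" idea can be sketched outside jq. The Python below is only an illustration under assumptions: the helper names `parse_concatenated` and `merge_missing` are made up for this sketch, and it treats a template as present on a pool when some template with the same Template_Name carries that pool's Pool_ID.

```python
import json

# Sample data copied from the question (concatenated JSON objects, as jq reads them).
DEST_SR = """
{"Type": "pool", "Pool_ID": "adb58e84", "Template_SR": "fc820294"}
{"Type": "pool", "Pool_ID": "d2dea684", "Template_SR": "313f2a07"}
"""
SRC_TEMPLATES = """
{"Type": "VM-template", "Template_Name": "CentOS 7 CloudInit V1",
 "Template_ID": "9b1833a3", "Pool_ID": "adb58e84"}
"""


def parse_concatenated(text):
    """Parse a stream of whitespace-separated JSON values into a list."""
    decoder = json.JSONDecoder()
    text = text.strip()
    items, pos = [], 0
    while pos < len(text):
        obj, pos = decoder.raw_decode(text, pos)
        items.append(obj)
        # raw_decode does not skip whitespace, so advance past it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return items


def merge_missing(pools, templates):
    """Emit one merged object per (pool, template) where the template is absent."""
    # Assumption: (Pool_ID, Template_Name) identifies "template exists on pool".
    present = {(t["Pool_ID"], t["Template_Name"]) for t in templates}
    merged = []
    for pool in pools:
        for template in templates:
            if (pool["Pool_ID"], template["Template_Name"]) not in present:
                merged.append({
                    "Template_ID": template["Template_ID"],
                    "Template_SR": pool["Template_SR"],
                    "Template_Name": template["Template_Name"],
                })
    return merged


if __name__ == "__main__":
    pools = parse_concatenated(DEST_SR)
    templates = parse_concatenated(SRC_TEMPLATES)
    for obj in merge_missing(pools, templates):
        print(json.dumps(obj))
```

With the sample data only the d2dea684 pool lacks the template, so a single object is produced. If the same template exists on several pools, this sketch emits one object per existing copy for each pool that lacks it, so deduplicate if that matters.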

How to Replace Substring in all Elements of Object without Destroying it

I need to replace substrings of all elements in an object.
E.g. replace all 'X' in val1 and val2:
{
"input": [
{
"val1": "008 X 148",
"val2": "SOME X DATA"
},
{
"val1": "X 005 5PM",
"val2": "SOME X DATA"
},
{
"val1": "MODTOX",
"val2": "X SOME X DATA"
}
]
}
My first intention was to use $map and then $each, like this:
$map(input, function($i)
{ $each($i, function($s)
{ $replace($s, "X", "Y" )
})
})
but, as expected, this destroys the object.
Any suggestions? In the end, 'input' should still have the same structure.

You need to use the transform operator to modify a copy of the input data:
$ ~> | input | $each(function($v, $n){{$n: $replace($v, "X", "Y") }} ) ~> $merge() |
See https://try.jsonata.org/yeKAKg_U_
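The idea behind the transform (rebuild each object key by key so the surrounding structure survives) is not specific to JSONata. As a rough illustration only, here is the same shape in Python; the function name `replace_in_values` is invented for this sketch.

```python
import copy

data = {
    "input": [
        {"val1": "008 X 148", "val2": "SOME X DATA"},
        {"val1": "X 005 5PM", "val2": "SOME X DATA"},
        {"val1": "MODTOX", "val2": "X SOME X DATA"},
    ]
}


def replace_in_values(record, old, new):
    """Return a new dict with `old` replaced by `new` in every string value."""
    return {
        key: value.replace(old, new) if isinstance(value, str) else value
        for key, value in record.items()
    }


# Work on a copy so the original document is left untouched,
# which is what the JSONata transform operator does as well.
result = copy.deepcopy(data)
result["input"] = [replace_in_values(r, "X", "Y") for r in result["input"]]

if __name__ == "__main__":
    print(result)
```

Keys are kept as they are and only the values change, so 'input' is still an array of objects with val1 and val2.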

jq: recursion -> nested arrays

How can I parse this JSON structure with jq? It should loop over the leaves (projects and groups) recursively.
My use case is creating projects and groups in a VCS with a CLI. A group can have multiple projects, a group can be empty, and a project's parent group must be created in advance.
Similar analogy would be:
group = folder
project = file
path = absolute path in format /root-groups/nested-groups-level-1/nested-groups-level-2/nested-groups-level-N
Thanks
{
"structure":[
{
"name":"rootgroup1",
"type":"group",
"nested":[
{
"name":"nestedproject1",
"type":"project"
},
{
"name":"nestedgroup1",
"type":"group",
"nested":[
{
"name":"nestednestedproject2",
"type":"project"
}
]
}
]
},
{
"name":"rootproject1",
"type":"project"
},
{
"name":"rootgroup2",
"type":"group",
"nested": []
}
]
}
Expected output:
"rootgroup1","group",""
"nestedproject1","project","rootgroup1"
"nestedgroup1","group","rootgroup1"
"nestednestedproject2","group","rootgroup1/nestedgroup1"
"rootproject1","project",""
"rootgroup2","group",""
What I tried:
jq -r '.structure[] | .. | "\(.name?) \(.type?)"'
I'm still not sure how to create the parent path.

The following implements a solution to the problem as I understand it:
# $prefix is an array interpreted as the prefix
def details($prefix):
  def out:
    select(has("name") and has("type")) | [.name, .type, "/" + ($prefix|join("/"))];
  out,
  if (.nested | (. and length>0))
  then .name as $n | .nested[] | details($prefix + [$n])
  else empty
  end;
.structure[]
| details([])
| @csv
Given your sample input, the output would be:
"rootgroup1","group","/"
"nestedproject1","project","/rootgroup1"
"nestedgroup1","group","/rootgroup1"
"nestednestedproject2","project","/rootgroup1/nestedgroup1"
"rootproject1","project","/"
"rootgroup2","group","/"
This differs in some respects from the sample output, but hopefully you can take it from here.
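If it helps to see the recursion outside jq, here is a small Python sketch of the same walk. It is an illustration, not a replacement for the jq filter: the generator name `walk` is made up, and it joins the parent names without a leading slash so that the paths follow the format in the question's expected output.

```python
import csv
import sys

structure = [
    {"name": "rootgroup1", "type": "group", "nested": [
        {"name": "nestedproject1", "type": "project"},
        {"name": "nestedgroup1", "type": "group", "nested": [
            {"name": "nestednestedproject2", "type": "project"},
        ]},
    ]},
    {"name": "rootproject1", "type": "project"},
    {"name": "rootgroup2", "type": "group", "nested": []},
]


def walk(nodes, prefix=()):
    """Yield (name, type, parent_path) for every node, parents before children."""
    for node in nodes:
        yield node["name"], node["type"], "/".join(prefix)
        # A missing or empty "nested" key simply ends the recursion.
        yield from walk(node.get("nested") or [], prefix + (node["name"],))


if __name__ == "__main__":
    writer = csv.writer(sys.stdout, quoting=csv.QUOTE_ALL)
    writer.writerows(walk(structure))
```

Because each node is yielded before its children, a group is always listed before the projects it contains, which matches the requirement that parent groups are created in advance.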

comparing 2 data sets possibly with concurrency/asynchronous/parallel approach

I am currently trying to improve upon an existing mechanism (to compare data from 2 sources, implemented in perl5) and would like to use perl6 instead.
My target data volume range is about 20-30 GB in uncompressed flat files.
In terms of lines, a file can contain anywhere from 18 million to 28 million lines.
It has around 40-50 columns per line.
I do this type of data reconciliation on a daily basis and it can take about ~10 minutes to read from a file and populate the hash. ~20 minutes spent to read both files and to populate hash.
comparison process takes about ~30-50 minutes including iterating over hash, collecting desired result(s), and writing to output file (csv,psv).
All in all it can take anywhere between 30 minutes to 60 minutes on a 32 core dual xeon cpu server with 256gb of RAM, including intermittent server load, to perform the process.
Now I am trying to bring down the total processing time even further.
Here is my current single threaded approach using perl5.
Fetch data from 2 sources (let's say s1 and s2) one by one and populate my hash based on key-value pairs. The source of data could be either a flat csv or psv file, or a database query Array of Array result via a DBI client. Data is always unsorted to start with.
To be specific, I read the file line by line, split fields, choose the desired indexes for the key/value pair, and insert into the hash.
After collecting data and populating the hash with the desired key/value pairs, I start to compare and collect results (mainly comparing what is missing or different in s2 w.r.t. s1 and vice versa).
Dump the output to an Excel file (very costly if the number of lines is large, around 1 million or greater) or to a simple CSV (cheap operation, preferred method).
I was wondering whether I could somehow do the first step in parallel, i.e. collect data from both sources at once and populate my global hash, and then proceed to compare and dump output?
What options can perl6 provide to deal with this situation? I have read about concurrency, asynchronous and parallel operations using perl6 but I am not so certain which one can help me here.
I would really appreciate any general guidance on the matter. I hope I explained my problem well, but sadly I don't have much to show for what I have tried so far, because I am just beginning to tackle this one. I am just unable to see past the single-threaded approach and need some help.
Thanks.
EDIT
As my existing problem statement has been deemed by the community as 'too broad' - allow me to attempt to highlight my pain points below:
I would like to do file comparison by utilizing all 32 cores if possible. I am just not able to come up with a strategy or initial idea.
What type of new techniques are available or applicable with perl6 in order to tackle this problem or type of problem.
If I spawn 2 processes to read file(s) and collect data - is it possible to get the result back as an array or hash?
Is it possible to compare the data (stored in hash) in parallel?
My current p5 comparison logic is shown below for your reference. I hope this helps and keeps the question from being closed.
package COMP;
use strict;
use Data::Dumper;
# Compare $data->{$src} with $data->{$tgt}. The returned hashref holds:
#   mt   - keys missing in the target
#   ms   - keys missing in the source
#   diff - keys present in both with different values
sub comp
{
    my ($data,$src,$tgt) = @_;
    my $result = {};
    my $ms = ($result->{ms} = {});
    my $mt = ($result->{mt} = {});
    my $diff = ($result->{diff} = {});
    foreach my $key (keys %{$data->{$src}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$tgt}{$key}) {
            push (@{$mt->{$key}}, "$src_val|NULL");
        }
        if (exists $data->{$tgt}{$key} && $src_val ne $tgt_val) {
            push (@{$diff->{$key}}, "$src_val|$tgt_val")
        }
    }
    foreach my $key (keys %{$data->{$tgt}})
    {
        my $src_val = $data->{$src}{$key};
        my $tgt_val = $data->{$tgt}{$key};
        next if ($src_val eq $tgt_val);
        if (!exists $data->{$src}{$key}) {
            push (@{$ms->{$key}},"NULL|$tgt_val");
        }
    }
    return $result;
}
1;
If someone would like to try it out, here is the sample output and the test script used.
script output
[User@Host:]$ perl testCOMP.pl
$VAR1 = {
'mt' => {
'Source' => [
'source|NULL'
]
},
'ms' => {
'Target' => [
'NULL|target'
]
},
'diff' => {
'Sunday_isit' => [
'Yes|No'
]
}
};
Test Script
[User@Host:]$ cat testCOMP.pl
#!/usr/bin/env perl
use lib $ENV{PWD};
use COMP;
use strict;
use warnings;
use Data::Dumper;
my $data2 = {
f1 => {
Amitabh => 'Bacchan',
YellowSun => 'Yes',
Sunday_isit => 'Yes',
Source => 'source',
},
f2 => {
Amitabh => 'Bacchan',
YellowSun => 'Yes',
Sunday_isit => 'No',
Target => 'target',
},
};
my $result = COMP::comp ($data2,'f1','f2');
print Dumper $result;
[User@Host:]$

If you have an existing and working toolchain, you don't have to rewrite it all to use Perl6. Its parallelism mechanisms work fine with external processes too. Consider
allnum.pl6
use v6;
my @processes =
    [ "num1.txt", "num2.txt", "num3.txt", "num4.txt", "num5.txt" ]
    .map( -> $filename {
        [ $filename, run "perl", "num.pl", $filename, :out ];
    })
    .hyper;
say "Lazyness Here!";
my $time = time;
for @processes
{
    say "<{$_[0]} : {$_[1].out.slurp}>";
}
say time - $time, "s";
num.pl
use warnings;
use strict;
my $file = shift @ARGV;
my $start = time;
my $result = 0;
open my $in, "<", $file or die $!;
while (my $thing = <$in>)
{
chomp $thing;
$thing =~ s/ //g;
$result = ($result + $thing) / 2;
}
print $result, " : ", time - $start, "s";
On my system
C:\Users\holli\tmp>perl6 allnum.pl6
Lazyness Here!
<num1.txt : 7684.16347578616 : 3s>
<num2.txt : 3307.36261498186 : 7s>
<num3.txt : 5834.32817942962 : 10s>
<num4.txt : 6575.55944995197 : 0s>
<num5.txt : 6157.63100049619 : 0s>
10s
Files were set up like so
C:\Users\holli\tmp>perl -e "for($i=0;$i<10000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num1.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<20000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num2.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<30000000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num3.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<400000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num4.txt
C:\Users\holli\tmp>perl -e "for($i=0;$i<5000;$i++) { print chr(32) ** 100, int(rand(1000)), chr(32) ** 100, qq(\n); }">num5.txt
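To make the overall shape concrete (load both sources at the same time, then compare the two hashes), here is a rough sketch in Python rather than Perl 6. Everything in it is an assumption for illustration: the function names `load`, `load_both` and `compare` are invented, the files are taken to be pipe-separated with the key in the first column, and results are stored as plain strings rather than the arrays used in COMP.pm. Threads are used only to keep the sketch short. Parsing is CPU-bound, so on real 20-30 GB inputs a process pool (`concurrent.futures.ProcessPoolExecutor`) is the variant more likely to use several cores.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def load(path, delimiter="|"):
    """Read a delimited file into {key: rest_of_line}; key = first column (assumed)."""
    table = {}
    with open(path, newline="") as fh:
        for row in csv.reader(fh, delimiter=delimiter):
            if row:
                table[row[0]] = delimiter.join(row[1:])
    return table


def load_both(path1, path2):
    """Load the two sources concurrently and return both tables."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(load, path1)
        second = pool.submit(load, path2)
        return first.result(), second.result()


def compare(src, tgt):
    """Mirror the mt / ms / diff buckets of the Perl comp() routine."""
    return {
        # keys only in the source: missing in target
        "mt": {k: f"{src[k]}|NULL" for k in src.keys() - tgt.keys()},
        # keys only in the target: missing in source
        "ms": {k: f"NULL|{tgt[k]}" for k in tgt.keys() - src.keys()},
        # keys in both with different values
        "diff": {k: f"{src[k]}|{tgt[k]}"
                 for k in src.keys() & tgt.keys() if src[k] != tgt[k]},
    }
```

The comparison itself is just set arithmetic on the key views. It could also be split by key range across workers, since each key is judged independently of the others.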

Algorithm to do numeric profile of the string

I have a few files similar to the one below, and I am trying to do numeric profiling as mentioned in the image
>File Sample
attttttttttttttacgatgccgggggatgcggggaaatttccctctctctctcttcttctcgcgcgcg
aaaaaaaaaaaaaaagcgcggcggcgcggasasasasasasaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
I have to take each substring of size 2, map it to a value for each of 33 different properties, and then add the values up over a window of size 5.
my %temp = (
    aCount => {
        aa => 2
    },
    cCount => {
        aa => 0
    }
);
My current implementation is as follows:
while (<FILE>) {
    my $line = $_;
    chomp $line;
    while ($line=~/(.{2})/og) {
        $subStr = $1;
        if (exists $temp{aCount}{$subStr}) {
            push @{$temp{aCount_array}},$temp{aCount}{$subStr};
            if (scalar(@{$temp{aCount_array}}) == $WINDOW_SIZE) {
                my $sum = eval (join('+',@{$temp{aCount_array}}));
                shift @{$temp{aCount_array}};
                #Similar approach has been taken to other 33 rules
            }
        }
        if (exists $temp{cCount}{$subStr}) {
            #similar approach
        }
        $line =~s/.{1}//og;
    }
}
Is there any other approach that would increase the speed of the overall process?

Regular expressions are awesome, but they can be overkill when all you need are fixed width substrings. Alternatives are substr
$len = length($line);
for ($i=0; $i<$len; $i+=2) {
$subStr = substr($line,$i,2);
...
}
or unpack
foreach $subStr (unpack "(A2)*", $line) {
...
}
I don't know how much faster either of these will be than regular expressions, but I know how I would find out.
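Separately from how the substrings are extracted, the window sum itself can be kept as a running total instead of being rebuilt with join and eval each time. The sketch below shows that idea in Python purely as an illustration. The function name `window_sums` is made up, and it assumes overlapping 2-character substrings (advancing one character at a time), which appears to be what the question's code intends.

```python
from collections import deque


def window_sums(line, scores, window=5):
    """Sum the scores of known 2-char substrings over a sliding window.

    A running total is updated as values enter and leave the window,
    so each step costs O(1) instead of re-adding the whole window.
    """
    recent = deque()
    total = 0
    sums = []
    for i in range(len(line) - 1):
        pair = line[i:i + 2]
        if pair not in scores:
            continue  # unknown substrings are skipped, as in the original loop
        recent.append(scores[pair])
        total += scores[pair]
        if len(recent) == window:
            sums.append(total)
            total -= recent.popleft()
    return sums


if __name__ == "__main__":
    a_count = {"aa": 2, "at": 1, "ta": 1}  # hypothetical scores for one property
    print(window_sums("aaaaaaa", a_count))
```

Running all 33 properties through one pass over the substrings, each with its own deque and total, avoids scanning the line 33 times.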

Parsing large XML files?

I have 2 XML files, one 115 MB in size and the other 34 MB.
While reading file A, there is a field called desc that relates it to file B: I retrieve the field id from file B where desc in file A is equal to name in file B.
File A is already big, and then I have to search inside file B, so it takes a very long time to complete.
How could I speed up this process, or what would be a better approach?
Current code I am using:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple qw(:strict XMLin);
my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);
my ($nameid,$rank);
open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
foreach my $npc ( @{ $npcs->{npc_client} } ) {
if (defined $npc->{desc}) {
foreach my $string (@{$strings->{string}}) {
if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
$nameid = $string->{id};
last;
}
}
} else {
$nameid = "";
}
if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
$rank = 'LEGENDARY';
} elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
$rank = 'HERO';
} elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
$rank = 'ELITE';
} elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
$rank = 'NORMAL';
} else {
$rank = $gauge;
}
print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</npc_templates>";
close($fh);
example of file A.xml:
<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
<npc_client>
<id>200000</id>
<name>SkillZone</name>
<desc>STR_NPC_NO_NAME</desc>
<dir>Monster/Worm</dir>
<mesh>Worm</mesh>
<material>mat_mob_reptile</material>
<show_dmg_decal>0</show_dmg_decal>
<ui_type>general</ui_type>
<cursor_type>none</cursor_type>
<hide_path>0</hide_path>
<erect>1</erect>
<bound_radius>
<front>1.200000</front>
<side>3.456000</side>
<upper>3.000000</upper>
</bound_radius>
<scale>10</scale>
<weapon_scale>100</weapon_scale>
<altitude>0.000000</altitude>
<stare_angle>75.000000</stare_angle>
<stare_distance>20.000000</stare_distance>
<move_speed_normal_walk>0.000000</move_speed_normal_walk>
<art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
<move_speed_normal_run>0.000000</move_speed_normal_run>
<move_speed_combat_run>0.000000</move_speed_combat_run>
<art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
<in_time>0.100000</in_time>
<out_time>0.500000</out_time>
<neck_angle>90.000000</neck_angle>
<spine_angle>10.000000</spine_angle>
<ammo_bone>Bip01 Head</ammo_bone>
<ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
<ammo_speed>50</ammo_speed>
<pushed_range>0.000000</pushed_range>
<hpgauge_level>3</hpgauge_level>
<magical_skill_boost>0</magical_skill_boost>
<attack_delay>2000</attack_delay>
<ai_name>SummonSkillArea</ai_name>
<tribe>General</tribe>
<pet_ai_name>Pet</pet_ai_name>
<sensory_range>15.000000</sensory_range>
</npc_client>
</npc_clients>
example of file B.xml:
<?xml version="1.0" encoding="utf-16"?>
<strings>
<string>
<id>350000</id>
<name>STR_NPC_NO_NAME</name>
<body> </body>
</string>
</strings>

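Here is an example of XML::Twig usage. The main advantage is that it does not hold the whole file in memory: each element is handled as it is parsed and then purged. The strings from file B are also stored in a hash, so each name id is found by a direct lookup instead of a scan over every string. The code below tries to emulate the script from the question.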
use XML::Twig;
my %strings = ();
XML::Twig->new(
twig_handlers => {
'strings/string' => sub {
$strings{ lc $_->first_child('name')->text }
= $_->first_child('id')->text
},
}
)->parsefile('B.xml');
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
XML::Twig->new(
twig_handlers => {
'npc_client' => sub {
my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };
# calculate rank as needed
my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
$rank = $hpgauge_level >= 28 ? 'ERROR'
: $hpgauge_level > 25 ? 'LEGENDARY'
: $hpgauge_level > 21 ? 'HERO'
: $hpgauge_level > 10 ? 'ELITE'
: $hpgauge_level > 0 ? 'NORMAL'
: $hpgauge_level;
my $npc_id = eval { $_->first_child('id')->text };
my $name = eval { $_->first_child('name')->text };
my $tribe = eval { $_->first_child('tribe')->text };
my $scale = eval { $_->first_child('scale')->text };
my $race_type = eval { $_->first_child('race_type')->text };
print
qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
$_->purge;
}
}
)->parsefile('A.xml');
print "</npc_templates>";

Grab all the interesting 'desc' fields from file A and put them in a hash. You only have to parse it once, but if it still takes too long, have a look at XML::Twig.
Parse file B once and extract the stuff you need. Use the hash.
It looks like you only need parts of the XML files. XML::Twig can parse only the elements you are interested in and throw away the rest using the "twig_roots" parameter. XML::Simple is easier to get started with, though.

Although I can't help you with the specifics of your Perl code, there are some general guidelines when dealing with large volumes of XML data. There are, broadly speaking, 2 kinds of XML APIs: DOM-based and stream-based. DOM-based APIs (like XML DOM) parse an entire XML document into memory before the user-level API becomes "available", whereas with a stream-based API (like SAX) the implementation doesn't need to parse the whole XML document first. One benefit of stream-based parsers is that they typically use much less memory, as they don't need to hold the entire XML document in memory at once, which is obviously a good thing when dealing with large XML documents. Looking at the XML::Simple docs here, it seems there may be SAX support available. Have you tried this?

I'm not a Perl guy, so take this with a grain of salt, but I see 2 problems:
Iterating over all of the values in file B until you find the correct value for each element in file A is inefficient. Instead, you should use some sort of map/dictionary for the values in file B.
It looks like you are parsing both files into memory before you even start processing. File A would be better processed as a stream instead of loading the entire document into memory.
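Both points (a dictionary for file B, streaming for file A) can be shown in a few lines. The sketch below is in Python only because it is compact there, and it is an illustration rather than a port of the Perl script. The function names are invented, the element names come from the sample files in the question, and, like the XML::Twig answer, it matches desc to name exactly while ignoring case, whereas the original script used a regex match.

```python
import io
import xml.etree.ElementTree as ET


def load_string_ids(source):
    """Stream file B and build {lowercased name: id}."""
    ids = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "string":
            name = elem.findtext("name")
            if name is not None:
                ids[name.lower()] = elem.findtext("id")
            elem.clear()  # drop the element's children once they are read
    return ids


def npc_name_ids(source, string_ids):
    """Stream file A and yield (npc id, name id) using a dictionary lookup."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "npc_client":
            desc = elem.findtext("desc")
            name_id = string_ids.get(desc.lower(), "") if desc else ""
            npc_id = elem.findtext("id")
            elem.clear()
            yield npc_id, name_id


# Tiny in-memory stand-ins for B.xml and A.xml (real code would pass file paths).
B_XML = b"""<strings><string><id>350000</id><name>STR_NPC_NO_NAME</name>
<body> </body></string></strings>"""
A_XML = b"""<npc_clients>
<npc_client><id>200000</id><desc>STR_NPC_NO_NAME</desc></npc_client>
<npc_client><id>200001</id></npc_client>
</npc_clients>"""

if __name__ == "__main__":
    ids = load_string_ids(io.BytesIO(B_XML))
    for npc_id, name_id in npc_name_ids(io.BytesIO(A_XML), ids):
        print(npc_id, name_id)
```

Each NPC then costs one dictionary lookup instead of a loop over every string, which is where most of the time in the original nested loop goes.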
