Counting occurrences of attributes in a sequence in XQuery - xpath

I have a sequence called $answer containing attributes that I extracted from elements in an XML file. $answer holds the three attributes 1, 3, 3, and a second sequence of attributes called $p holds 1, 3.
I tried to get the number of occurrences of each value of $p in $answer like this:
for $x in $p
return count (index-of($x, $answer))
I saw this suggested as a solution in another post, but it gave me errors. What's the correct way to do this?

Do you want to group all your attributes by their values? The group by clause might give you the expected results:
for $a in (attribute a {'A'}, attribute b {'B'}, attribute a {'A'})
group by $v := $a
return concat(count($a), ': ', $v)
Note, however, that your XQuery implementation needs to support XQuery 3.0.

You need to swap the arguments you passed to index-of():
for $x in $p
return count(index-of($answer, $x))
But a simpler way is to test for equality in a predicate:
for $x in $p
return count($answer[. eq $x])
which produces the same result for the given data.

How do I sort a Perl hash by keys numerically?

My first question ... I have found many answers to other questions through search, but I am failing to do so this time :-)
I want to generate a report that is sorted by a number embedded in a string from my input data. The report is generated from elements of a Perl hash where the same number is used as the hash key.
The output that I am currently getting is sorted as strings rather than as numbers.
foreach my $num (sort keys %dir_map) {
$path = $paths{$num};
$name = $names{$num};
printf OUT ("%d %s %s\n",$num,$path,$name);
}
My input data looks like:
dist_14 randomString nameStringIwant RandomInteger AnotherRandomString
Which I am processing like:
while(<IN>) {
chomp;
my @header = split /\s+/;
my $header_length = $#header;
if ( /dist_/ ) {
my $justNumberStr = $header[0];
$justNumberStr =~ s/dist_//;
my $justNumber = sprintf("%d", $justNumberStr);
$names{$justNumber} = $header[2];
}
}
As you've discovered, Perl's sort will, by default, sort using a string comparison. To override that default behaviour, you need to provide a sort block.
foreach my $num (sort { $a <=> $b } keys %dir_map)
The sort block is given two of the elements from your list in the variables $a and $b. Your code should compare these two values and return a negative integer if $a comes before $b, a positive integer if $b comes before $a and zero if they sort in the same place. The "spaceship operator" (<=>) does exactly that for two numbers.
The FAQ entry "How do I sort an array by (anything)?" might also be useful. You don't have an array, but your list of keys can be treated in the same way.

Powershell Strange behaviour with Import-CSV

I have the following PowerShell code:
clear;
$importedIDs = (Import-Csv "testing.csv" -Delimiter ';').id;
Write-Host $importedIDs.Length;
for ($i=0; $i -lt $importedIDs.Length; $i++) {
Write-Host $($importedIDs[$i]);
}
The goal is to read only the id column in the csv file which looks like this:
"created";"id"
"2018-04-04 21:03:01";"123456"
"2018-04-04 21:03:01";"123457"
When there are two or more rows the output is as expected:
2
123456
123457
However when there is only 1 row in the csv file (row with id 123456) the output is:
6
1
2
3
4
5
6
Desired output would be:
1
123456
Can anyone explain why this is happening and how can I fix this?
Any help is appreciated
If there are multiple rows in the CSV, you get an array of strings, one array element per row, so the index applies to the rows. If there is only one row, you don't get an array containing one string, as you would probably expect. You get a single string instead. When you index into a string, PowerShell treats the string as an array of characters and therefore returns only one character.
You can slightly rewrite your script to get around the problem described by J. Bergmann.
Instead of a for loop that indexes each element, where an "element" may be a string in a string array or a character in a string, you can use a foreach loop, as shown here. foreach won't treat a string as a character array; it iterates once over a single string. Note that $importedIDs.Length still reports the string length (6) in the single-row case. Wrapping the expression in the array subexpression operator, @( ... ), forces an array, so that Length and indexing behave the same for one row as for many.
clear;
$importedIDs = (Import-Csv "testing.csv" -Delimiter ';').id;
Write-Host $importedIDs.Length;
foreach ($importedID in $importedIDs) {
Write-Host $($importedID);
}

Pad Independently Missing Columns per Row in CSV with Bash (based off expected values)

I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that "pads" every row in a file, so that every field pair missing from my ideal format is inserted with a blank value column after it? Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>){
chomp;
my @fields=split ',';
my $kingdom='';
my $phylum='';
my $class='';
my $order='';
my $family='';
my $genus='';
for (my $i=2;$i<$#fields;$i+=2){
if ($fields[$i] eq 'kingdom'){$kingdom=$fields[$i+1];}
if ($fields[$i] eq 'phylum'){$phylum=$fields[$i+1];}
if ($fields[$i] eq 'class'){$class=$fields[$i+1];}
if ($fields[$i] eq 'order'){$order=$fields[$i+1];}
if ($fields[$i] eq 'family'){$family=$fields[$i+1];}
if ($fields[$i] eq 'genus'){$genus=$fields[$i+1];}
}
print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -r -a LINE; do
    # we always get the #ID and name, followed by identifier/value pairs
    if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
        echo "Invalid CSV line: ${LINE[*]}" >&2
        continue
    fi
    echo -n "${LINE[0]},${LINE[1]},"
    THIS=()
    # store each identifier/value pair under its identifier
    for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
        THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
    done
    # emit every expected pair in order; missing values expand to nothing
    for KEY in kingdom phylum class order family; do
        echo -n "$KEY,${THIS[$KEY]},"
    done
    echo "genus,${THIS[genus]}"
done <"$1" >"$2"
It also validates each CSV line: it must contain at least 2 columns (ID and name) and an even number of columns overall.
The script can be extended to do more error checking (e.g. whether both arguments were passed, whether the input file exists, etc.), but it should work as expected when invoked the way you posted.
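The extra checks mentioned above could be sketched roughly as follows. This is only an illustration: the function name check_args, the messages and the return codes are assumptions, not part of the original script.

```shell
#!/bin/bash
# Hypothetical pre-flight checks for pad.sh (names and return codes are illustrative).
check_args() {
    # Expect exactly two arguments: the input CSV and the output CSV.
    if (( $# != 2 )); then
        echo "Usage: bash pad.sh INPUT.csv OUTPUT.csv" >&2
        return 2
    fi
    # The input must be an existing, readable regular file.
    if [[ ! -f $1 || ! -r $1 ]]; then
        echo "Cannot read input file: $1" >&2
        return 1
    fi
    return 0
}

# Typical use near the top of the script:
#   check_args "$@" || exit
```

Calling it before the while loop means a missing argument or input file is reported before the output file is created or truncated.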

Change a referenced variable in BASH

I want to change a global variable inside a function in BASH, but I have no idea how to do it. This is my code:
CANDIDATES[5]="1 2 3 4 5 6"
random_mutate()
{
a=$1 #assign name of input variable to "a"
insides=${!a} #See input variable value
RNDM_PARAM=`echo $[ 1 + $[ RANDOM % 5 ]]` #change random position in input variable
NEW_PAR=99 #value to substitute
ARR=($insides) #Convert string to array
ARR[$RNDM_PARAM]=$NEW_PAR #Change the random position
NEW_GUY=$( IFS=$' '; echo "${ARR[*]}" ) #Convert array once more to string
echo "$NEW_GUY"
### NOW, How to assign NEW_GUY TO CANDIDATES[5]?
}
random_mutate CANDIDATES[5]
I would like to be able to assign NEW_GUY to the variable referenced by $1, or to another variable referenced by $2 (not included in the code). I don't want to do the assignment directly in the code, as I intend to use the function for multiple possible inputs. (In fact, the assignment NEW_PAR=99 is considerably more complicated in my original code, since it selects a number depending on the position in a range of random values using an R function, but for the sake of simplicity I included it this way.)
Hopefully this is clear enough. Please let me know if you need further information.
Thank you,
Libertad
You can use eval:
eval "$a=\$NEW_GUY"
Be careful and only use it if the value of $a is safe (imagine what happens if $a is set to rm -rf / ; a).
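If you would rather avoid eval, printf -v can perform the same assignment in bash 4.1 or later, including to an array element such as CANDIDATES[5]. The following is a minimal sketch; the helper name assign_by_name and the validation pattern are assumptions made for illustration, and the pattern deliberately accepts only a plain identifier with an optional numeric index.

```shell
#!/bin/bash
# Hypothetical helper: assign VALUE to the variable (or indexed-array element)
# whose name is passed as the first argument.
assign_by_name() {
    local _target=$1 _value=$2
    # Accept only an identifier, optionally followed by [digits].
    local _re='^[A-Za-z_][A-Za-z0-9_]*(\[[0-9]+\])?$'
    if [[ ! $_target =~ $_re ]]; then
        echo "assign_by_name: unsafe target name: $_target" >&2
        return 1
    fi
    # printf -v stores the formatted string in the named variable.
    printf -v "$_target" '%s' "$_value"
}

CANDIDATES[5]="1 2 3 4 5 6"
assign_by_name 'CANDIDATES[5]' "1 2 99 4 5 6"
echo "${CANDIDATES[5]}"
```

Inside random_mutate, the last step would then be something like assign_by_name "$a" "$NEW_GUY". Rejecting anything that is not a simple name keeps a value such as rm -rf / ; a from ever reaching the assignment.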

XPath 2.0: reference an earlier context in another part of the XPath expression

In an XPath expression I would like to focus on certain elements and analyse them:
...
<field>aaa</field>
...
<field>bbb</field>
...
<field>aaa (1)</field>
...
<field>aaa (2)</field>
...
<field>ccc</field>
...
<field>ddd (7)</field>
I want to find the elements whose text content (apart from a possible enumeration) is unique. In the above example that would be bbb, ccc and ddd.
The following XPath gives me the distinct values:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))
Now I would like to extend that and evaluate another XPath for each of the distinct values: count how many field elements start with that value, and retrieve the values whose count is bigger than 1.
A matching field either has content equal to that particular value, or starts with that value followed by " (". The problem is that in the second part of the XPath I would have to refer to the context of that part itself and to the outer context at the same time.
In the following XPath, instead of using "." as the context, I use c_outer and c_inner:
distinct-values(//field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(., '('))[count(//field[(c_inner = c_outer) or starts-with(c_inner, concat(c_outer, ' ('))]) > 1]
I can't use "." for both, for obvious reasons. But how could I reference the current distinct value from the outer expression within the inner expression?
Would that even be possible?
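XQuery can do it by binding each distinct value to a variable, e.g.
for $s in distinct-values(
  //field[matches(normalize-space(.), ' \([0-9]\)$')]/substring-before(normalize-space(.), ' ('))
where count(//field[(. = $s) or starts-with(., concat($s, ' ('))]) > 1
return $s
Here substring-before is given ' (' (with the leading space) as its delimiter, so that $s holds the bare value, e.g. "aaa" rather than "aaa " with a trailing space; otherwise neither comparison in the where clause would match.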
