Powershell Strange behaviour with Import-CSV - windows

I have the following PowerShell code:
clear;
$importedIDs = (Import-Csv "testing.csv" -Delimiter ';').id;
Write-Host $importedIDs.Length;
for ($i=0; $i -lt $importedIDs.Length; $i++) {
Write-Host $($importedIDs[$i]);
}
The goal is to read only the id column in the csv file which looks like this:
"created";"id"
"2018-04-04 21:03:01";"123456"
"2018-04-04 21:03:01";"123457"
When there are two or more rows the output is as expected:
2
123456
123457
However when there is only 1 row in the csv file (row with id 123456) the output is:
6
1
2
3
4
5
6
Desired output would be:
1
123456
Can anyone explain why this is happening and how can I fix this?
Any help is appreciated

If there are multiple rows in the CSV, you get an array of strings, one element per row, so the index selects a row. If there is only one row, you don't get an array containing one string, as you would probably expect; you get a single string instead. When you index into a string, PowerShell treats it as an array of characters and returns a single char, and Length reports the number of characters (6 for "123456").

You can slightly rewrite your script to get around the problem described by J. Bergmann.
Instead of a for loop that indexes into the value, where an "element" may be a string in a string array or a character in a string, you can use a foreach loop like this. foreach doesn't treat a string as a character array, so a single ID is visited exactly once.
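clear;
# @(...) forces an array even when the CSV has a single row,
# so Length is the number of IDs rather than the number of characters
$importedIDs = @((Import-Csv "testing.csv" -Delimiter ';').id);
Write-Host $importedIDs.Length;
foreach ($importedID in $importedIDs) {
    Write-Host $importedID;
}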

Related

Length of string after delimiter in the data inside a file

I have a ton of data files that have delimiters inside, and I would like to know the max length of the column after each delimiter. Since I can't use a third-party program, I would like to get this done through PowerShell, as that is built into Windows. At the same time, I can't do it manually. So I'm wondering whether this can be achieved with PowerShell at all, or whether there is a simple trick to do so.
Here is my sample data in a file FOO.TXT
Col1|Col2|Col3
12345|This is a String|This is another String
45688|String|This is another String of unknown length
30098|Second Column String|Third Column String
Expected output:
Col1 Max Length - 5
Col2 Max Length - 20
Col3 Max Length - 40
You can do it like this (this example reads a file with a .csv extension, but Import-Csv itself does not require one):
$j = Import-Csv -Delimiter "|" -Path D:\testdir\new.csv  # import the file as an array of objects
$columns = $j | gm -MemberType NoteProperty | Select-Object -ExpandProperty Name  # get the column names
foreach ($column in $columns) {
    # for each column, get the max length of its values
    $l = ($j."$column" | Measure-Object -Maximum -Property Length).Maximum
    Write-Host "$column Max Length - $l"
}
Here is the answer I was looking for, irrespective of the file extension. Thanks to both @Vad and @Theo.
$j = Import-Csv -Delimiter "|" -Path D:\testdir\FOO.TXT  # import the file as an array of objects
$columns = $j[0].PSObject.Properties.Name  # get the column names in file order
foreach ($column in $columns) {
    # for each column, get the max length of its values
    $l = ($j."$column" | Measure-Object -Maximum -Property Length).Maximum
    Write-Host "$column Max Length - $l"
}
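For comparison, where Python happens to be available, the same per-column maximum can be sketched with the standard csv module. This is only an illustration of the technique, not a replacement for the PowerShell answer; the function name max_column_lengths is hypothetical, and the sample text is the FOO.TXT data from the question.

```python
import csv
import io

def max_column_lengths(text, delimiter="|"):
    """Return {column name: longest value length} for delimited text (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    lengths = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            # A short row yields None for missing columns; treat that as empty
            lengths[name] = max(lengths[name], len(row[name] or ""))
    return lengths

sample = """Col1|Col2|Col3
12345|This is a String|This is another String
45688|String|This is another String of unknown length
30098|Second Column String|Third Column String
"""

for name, length in max_column_lengths(sample).items():
    print(f"{name} Max Length - {length}")
```

To read a real file instead of the inline sample, the file's contents can be passed in as text; the file extension plays no role.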

Extract 2 fields from string with search

I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON
More to the point, your input should probably be processed record by record, and my guess is that a two column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since the data looks like mapping objects, and even corresponds to the JSON format, something like this should do, if you don't mind using Python (which comes with JSON support):
import json

def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input as the string s and parse it as JSON into a dictionary d. Then we return a formatted string with the double-quoted id and hwVersion names, each followed by a colon and the double-quoted value of the corresponding key from the previously obtained dict.
We can try this with these test input strings and prints:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
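Iterating over many lines could look like the sketch below. It repeats get_id_hw so that it runs on its own; the helper name iter_id_hw and the decision to skip blank lines are assumptions added for illustration.

```python
import json

def get_id_hw(s):
    """Parse one JSON line and format its id and hwVersion fields."""
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])

def iter_id_hw(lines):
    """Yield one formatted string per non-blank line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if line:  # assumption: blank lines are skipped rather than reported
            yield get_id_hw(line)

# Any iterable of strings works here; an open file object yields its lines
# in the same way, so a list stands in for the file in this sketch.
sample_lines = [
    '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}\n',
    '\n',
    '{"id":"5555","name":"6666","hwVersion":"7777"}\n',
]

for out in iter_id_hw(sample_lines):
    print(out)
```

A line that is not valid JSON, or that lacks either key, raises an exception in this sketch; whether to skip or report such lines depends on the data.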
If you really wanted to use awk, you could (gensub is a GNU awk extension, so this needs gawk), but it's not the most robust or suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention the position is not known, and assuming the fields can appear in any order, we use one regex to extract id and another to get hwVersion, then we print them out in the given format. If the values could be something other than decimal digits as in your example, the [0-9]+ part would need to reflect that.
And for the fun of it (this preserves the order of entries from the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".

Pad Independently Missing Columns per Row in CSV with Bash (based off expected values)

I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that can "pad" every row in a file so that every field pair that may be missing from my ideal format is inserted, with the value column that follows left blank? Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
use strict;
use warnings;

while (<>) {
    chomp;
    my @fields = split ',';
    # Every rank defaults to an empty value
    my $kingdom = '';
    my $phylum  = '';
    my $class   = '';
    my $order   = '';
    my $family  = '';
    my $genus   = '';
    # Identifier/value pairs start at index 2, after the ID and the name
    for (my $i = 2; $i < $#fields; $i += 2) {
        if ($fields[$i] eq 'kingdom') { $kingdom = $fields[$i+1]; }
        if ($fields[$i] eq 'phylum')  { $phylum  = $fields[$i+1]; }
        if ($fields[$i] eq 'class')   { $class   = $fields[$i+1]; }
        if ($fields[$i] eq 'order')   { $order   = $fields[$i+1]; }
        if ($fields[$i] eq 'family')  { $family  = $fields[$i+1]; }
        if ($fields[$i] eq 'genus')   { $genus   = $fields[$i+1]; }
    }
    print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -r -a LINE; do
    # we always get the #ID and name
    if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
        echo Invalid CSV line: "${LINE[@]}" >&2
        continue
    fi
    echo -n "${LINE[0]},${LINE[1]},"
    # reset the identifier -> value map for this row
    THIS=()
    for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
        THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
    done
    for KEY in kingdom phylum class order family; do
        echo -n "$KEY,${THIS[$KEY]},"
    done
    echo "genus,${THIS[genus]}"
done <"$1" >"$2"
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (i.e. if both arguments are passed, if the input exists, etc), but it should work as expected with just the way you posted it.
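The same idea, collecting the identifier/value pairs of a row into a map and then emitting every expected rank in order, can also be sketched in Python. The names RANKS and pad_row are hypothetical, and the sketch assumes, like the scripts above, that no field contains a quoted comma.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]

def pad_row(line):
    """Return the row with every rank present; missing ranks get an empty value."""
    fields = line.rstrip("\n").split(",")
    tax_id, name, rest = fields[0], fields[1], fields[2:]
    # Pair each identifier with the value that follows it
    found = dict(zip(rest[0::2], rest[1::2]))
    out = [tax_id, name]
    for rank in RANKS:
        out += [rank, found.get(rank, "")]
    return ",".join(out)

rows = [
    "152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria",
    "77133,uncultured bacterium",
]
for row in rows:
    print(pad_row(row))
```

Because the lookup is by identifier name rather than by position, it does not matter which pairs are missing from a row.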

Counting occurrences of attributes in a sequence in XQuery

I have a sequence called $answer with the attributes I extracted from elements from an XML file. Inside $answer I have the following 3 attributes: 1, 3, 3 and another sequence of attributes called $p with: 1, 3
I tried to do this to get the number of occurrences by doing
for $x in $p
return count (index-of($x, $answer))
since I saw it as a solution in another posting but it gave me errors. What's the correct way to do this?
Do you want to group your attributes by their values? The group by clause might give you the expected results:
for $a in (attribute a {'A'}, attribute b {'B'}, attribute a {'A'})
group by $v := $a
return concat(count($a), ': ', $v)
Note, however, that your XQuery implementation needs to support XQuery 3.0.
You need to swap the arguments you passed to index-of():
for $x in $p
return count(index-of($answer, $x))
But a simpler way is to test for equality in a predicate:
for $x in $p
return count($answer[. eq $x])
which produces the same result for the given data.

What does the "@" symbol do in PowerShell?

I've seen the @ symbol used in PowerShell to initialise arrays.
What exactly does the @ symbol denote and where can I read more about it?
In PowerShell V2, @ is also the Splat operator.
PS> # First use it to create a hashtable of parameters:
PS> $params = @{path = "c:\temp"; Recurse= $true}
PS> # Then use it to SPLAT the parameters - which is to say to expand a hash table
PS> # into a set of command line parameters.
PS> dir @params
PS> # That was the equivalent of:
PS> dir -Path c:\temp -Recurse:$true
PowerShell will actually treat any comma-separated list as an array:
"server1","server2"
So the @ is optional in those cases. However, for associative arrays, the @ is required:
@{"Key"="Value";"Key2"="Value2"}
Officially, @ is the "array operator." You can read more about it in the documentation that is installed along with PowerShell, or in a book like "Windows PowerShell: TFM," which I co-authored.
While the above responses provide most of the answer it is useful--even this late to the question--to provide the full answer, to wit:
Array sub-expression (see about_arrays)
Forces the value to be an array, even if a singleton or a null, e.g. $a = @(ps | where name -like 'foo')
Hash initializer (see about_hash_tables)
Initializes a hash table with key-value pairs, e.g.
$HashArguments = @{ Path = "test.txt"; Destination = "test2.txt"; WhatIf = $true }
Splatting (see about_splatting)
Lets you invoke a cmdlet with parameters from an array or a hash table rather than the more customary individually enumerated parameters, e.g. using the hash table just above, Copy-Item @HashArguments
Here strings (see about_quoting_rules)
Lets you create strings with easily embedded quotes, typically used for multi-line strings, e.g.:
$data = @"
line one
line two
something "quoted" here
"@
Because this type of question (what does 'x' notation mean in PowerShell?) is so common here on StackOverflow as well as in many reader comments, I put together a lexicon of PowerShell punctuation, just published on Simple-Talk.com. Read all about @ as well as % and # and $_ and ? and more at The Complete Guide to PowerShell Punctuation. Attached to the article is a wallchart that gives you everything on a single sheet.
You can also wrap the output of a cmdlet (or pipeline) in @() to ensure that what you get back is an array rather than a single item.
For instance, dir usually returns a list, but depending on the options, it might return a single object. If you are planning on iterating through the results with a foreach-object, you need to make sure you get a list back. Here's a contrived example:
$results = @( dir c:\autoexec.bat)
One more thing... an empty array (like to initialize a variable) is denoted @().
The Splatting Operator
To create an array, we create a variable and assign the array. Arrays are noted by the "@" symbol. Let's take the discussion above and use an array to connect to multiple remote computers:
$strComputers = @("Server1", "Server2", "Server3")<enter>
They are used for arrays and hashes.
PowerShell Tutorial 7: Accumulate, Recall, and Modify Data
Array Literals In PowerShell
I hope this helps to understand it a bit better.
You can store "values" under a key and look that value up later to do something with it.
In this case I have just provided @{a="";b="";c="";}; if a key is not one of the options (a, b or c), no value is returned.
$array = @{
    a = "test1";
    b = "test2";
    c = "test3"
}
foreach ($elem in $array.GetEnumerator()) {
    if ($elem.key -eq "a") {
        $key = $elem.key
        $value = $elem.value
    }
    elseif ($elem.key -eq "b") {
        $key = $elem.key
        $value = $elem.value
    }
    elseif ($elem.key -eq "c") {
        $key = $elem.key
        $value = $elem.value
    }
    else {
        Write-Host "No other value"
    }
    Write-Host "Key: " $key "Value: " $value
}

Resources