I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't chose a column number. I feel I need to search for "id" and "hwVersion" Any help is GREATLY appreciated.
Totally agree with #KamilCuk. More specifically
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON
More to the point, your input should probably be processed record by record, and my guess is that a two column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since the data looks like a mapping objects and even corresponding to a JSON format, something like this should do, if you don't mind using Python (which comes with JSON) support:
import json
def get_id_hw(s):
d = json.loads(s)
return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input string into s and parse it as JSON into a dictionary d. Then we return a formatted string with double-quoted id and hwVersion strings followed by column and double-quoted value of corresponding key from the previously obtained dict.
We can try this with these test input strings and prints:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
If you really wanted to use awk, you could, but it's not the most robust and suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
h = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n"), i, h}' /your/file
Since you mention position is not known and assuming it can be in any order, we use one regex to extract id and the other to get hwVersion, then we print it out in given format. If the values could be something other then decimal digits as in your example, the [0-9]+ but would need to reflect that.
And for the fun if it (this preserves the order) if entries from the file, in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".
I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that can "pad" every row in a file so that every field pair that may be missing from my ideal format is inserted, and its value column that follows is just blank. Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria,clas,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>){
chomp;
my #fields=split ',';
my $kingdom='';
my $phylum='';
my $class='';
my $order='';
my $family='';
my $genus='';
for (my $i=2;$i<$#fields;$i+=2){
if ($fields[$i] eq 'kingdom'){$kingdom=$fields[$i+1];}
if ($fields[$i] eq 'phylum'){$phylum=$fields[$i+1];}
if ($fields[$i] eq 'class'){$class=$fields[$i+1];}
if ($fields[$i] eq 'order'){$order=$fields[$i+1];}
if ($fields[$i] eq 'family'){$family=$fields[$i+1];}
if ($fields[$i] eq 'genus'){$genus=$fields[$i+1];}
}
print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -a LINE; do
# we always get the #ID and name
if (( ${#LINE[#]} < 2 || ${#LINE[#]} % 2 )); then
echo Invalid CSV line: "${LINE[#]}" >&2
continue
fi
echo -n "${LINE[0]},${LINE[1]},"
THIS=()
for (( INDEX=2; INDEX < ${#LINE[#]}; INDEX+=2 )); do
THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
done
for KEY in kingdom phylum class order family; do
echo -n $KEY,${THIS[$KEY]},
done
echo genus,${THIS[genus]}
done <$1 >$2
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (i.e. if both arguments are passed, if the input exists, etc), but it should work as expected with just the way you posted it.
I've seen the # symbol used in PowerShell to initialise arrays.
What exactly does the # symbol denote and where can I read more about it?
In PowerShell V2, # is also the Splat operator.
PS> # First use it to create a hashtable of parameters:
PS> $params = #{path = "c:\temp"; Recurse= $true}
PS> # Then use it to SPLAT the parameters - which is to say to expand a hash table
PS> # into a set of command line parameters.
PS> dir #params
PS> # That was the equivalent of:
PS> dir -Path c:\temp -Recurse:$true
PowerShell will actually treat any comma-separated list as an array:
"server1","server2"
So the # is optional in those cases. However, for associative arrays, the # is required:
#{"Key"="Value";"Key2"="Value2"}
Officially, # is the "array operator." You can read more about it in the documentation that installed along with PowerShell, or in a book like "Windows PowerShell: TFM," which I co-authored.
While the above responses provide most of the answer it is useful--even this late to the question--to provide the full answer, to wit:
Array sub-expression (see about_arrays)
Forces the value to be an array, even if a singleton or a null, e.g. $a = #(ps | where name -like 'foo')
Hash initializer (see about_hash_tables)
Initializes a hash table with key-value pairs, e.g.
$HashArguments = #{ Path = "test.txt"; Destination = "test2.txt"; WhatIf = $true }
Splatting (see about_splatting)
Let's you invoke a cmdlet with parameters from an array or a hash-table rather than the more customary individually enumerated parameters, e.g. using the hash table just above, Copy-Item #HashArguments
Here strings (see about_quoting_rules)
Let's you create strings with easily embedded quotes, typically used for multi-line strings, e.g.:
$data = #"
line one
line two
something "quoted" here
"#
Because this type of question (what does 'x' notation mean in PowerShell?) is so common here on StackOverflow as well as in many reader comments, I put together a lexicon of PowerShell punctuation, just published on Simple-Talk.com. Read all about # as well as % and # and $_ and ? and more at The Complete Guide to PowerShell Punctuation. Attached to the article is this wallchart that gives you everything on a single sheet:
You can also wrap the output of a cmdlet (or pipeline) in #() to ensure that what you get back is an array rather than a single item.
For instance, dir usually returns a list, but depending on the options, it might return a single object. If you are planning on iterating through the results with a foreach-object, you need to make sure you get a list back. Here's a contrived example:
$results = #( dir c:\autoexec.bat)
One more thing... an empty array (like to initialize a variable) is denoted #().
The Splatting Operator
To create an array, we create a variable and assign the array. Arrays are noted by the "#" symbol. Let's take the discussion above and use an array to connect to multiple remote computers:
$strComputers = #("Server1", "Server2", "Server3")<enter>
They are used for arrays and hashes.
PowerShell Tutorial 7: Accumulate, Recall, and Modify Data
Array Literals In PowerShell
I hope this helps to understand it a bit better.
You can store "values" within a key and return that value to do something.
In this case I have just provided #{a="";b="";c="";} and if not in the options i.e "keys" (a, b or c) then don't return a value
$array = #{
a = "test1";
b = "test2";
c = "test3"
}
foreach($elem in $array.GetEnumerator()){
if ($elem.key -eq "a"){
$key = $elem.key
$value = $elem.value
}
elseif ($elem.key -eq "b"){
$key = $elem.key
$value = $elem.value
}
elseif ($elem.key -eq "c"){
$key = $elem.key
$value = $elem.value
}
else{
Write-Host "No other value"
}
Write-Host "Key: " $key "Value: " $value
}