I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that can "pad" every row in a file so that every field pair that may be missing from my ideal format is inserted, and its value column that follows is just blank. Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria,clas,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>){
chomp;
my #fields=split ',';
my $kingdom='';
my $phylum='';
my $class='';
my $order='';
my $family='';
my $genus='';
for (my $i=2;$i<$#fields;$i+=2){
if ($fields[$i] eq 'kingdom'){$kingdom=$fields[$i+1];}
if ($fields[$i] eq 'phylum'){$phylum=$fields[$i+1];}
if ($fields[$i] eq 'class'){$class=$fields[$i+1];}
if ($fields[$i] eq 'order'){$order=$fields[$i+1];}
if ($fields[$i] eq 'family'){$family=$fields[$i+1];}
if ($fields[$i] eq 'genus'){$genus=$fields[$i+1];}
}
print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -a LINE; do
# we always get the #ID and name
if (( ${#LINE[#]} < 2 || ${#LINE[#]} % 2 )); then
echo Invalid CSV line: "${LINE[#]}" >&2
continue
fi
echo -n "${LINE[0]},${LINE[1]},"
THIS=()
for (( INDEX=2; INDEX < ${#LINE[#]}; INDEX+=2 )); do
THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
done
for KEY in kingdom phylum class order family; do
echo -n $KEY,${THIS[$KEY]},
done
echo genus,${THIS[genus]}
done <$1 >$2
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (i.e. if both arguments are passed, if the input exists, etc), but it should work as expected with just the way you posted it.
An action outputs a fixed-length string via Ruby's pack function
clean = [edc_unico, sequenza_sede, cliente_id.to_s, nome, indirizzo, cap, comune, provincia, persona, note, telefono, email]
string = clean.pack('A15A5A6A40A35A5A30A2A40A40A18A25')
However, the data is in UTF-8 as to allow latin/high-ascii characters. The result of the pack action is logical. high-ascii characters take the space of 2 regular ascii characters. The resulting string is shortened by 1 space character, defeating the original purpose.
What would be a concise ruby command to interpret high-ascii characters and thus add an extra space at the end of each variable for each high-ascii character, so that the length can be brought to its proper target? (note: I am assuming there is no directive that addresses this specifically, and the whole lot of pack directives is mind-muddling)
update an example where the second line shifts positions based on accented characters
CNFrigo 539 Via Privata Da Via Iseo 6C 20098San Giuliano Milanese MI02 98282410 02 98287686 12886480156 12886480156 Bo3 Euro Giuseppe Frigo Transport 349 2803433 M.Gianoli#Delanchy.Fr S.Galliard#Delanchy.Fr
CNIn's M 497 Via Istituto S.Maria della Pietà, 30173Venezia Ve041 8690111 340 6311408 0041 5136113 00115180283 02896940273 B60Fm Euro Per Documentazioni Tecniche Inviare Materiale A : Silvia_Scarpa#Insmercato.It Amministrazione : Michela_Bianco#Insmercato.It Silvia Scarpa Per Liberatorie 041/5136171 Sig.Ra Bianco Per Pagamento Fatture 041/5136111 (Solo Il Giovedi Pomeriggio Dalle 14 All Beniservizi.Insmercato#Pec.Gruppopam.It
It looks like you are trying to use pack to format strings to fixed width columns for display. That’s not what it’s for, it is generally used for packing data into fixed byte structures for things like network protocols.
You probably want to use a format string instead, which is better suited for manipulating data for display.
Have a look at String#% (i.e. the % method on string). Like pack it uses another little language which is defined in Kernel#sprintf.
Taking a simplified example, with the two arrays:
plain = ["Iseo", "Next field"]
accent = ["Pietà", "Next field"]
then using pack like this:
puts plain.pack("A10A10")
puts accent.pack("A10A10")
will produce a result that looks like this, where “Next field” isn’t aligned since pack is dealing with the width in bytes, not the displayed width:
Iseo Next field
Pietà Next field
Using a format string, like this:
puts "%-10s%-10s" % plain
puts "%-10s%-10s" % accent
produces the desired result, since it is dealing with the displayable width:
Iseo Next field
Pietà Next field
using visualworks, in small talk, I'm receiving a string like '31323334' from a network connection.
I need a string that reads '1234' so I need a way of extracting two characters at a time, converting them to what they represent in ascii, and then building a string of them...
Is there a way to do so?
EDIT(7/24): for some reason many of you are assuming I will only be working with numbers and could just truncate 3s or read every other char. This is not the case, examples of strings read could include any keys on the US standard keyboard (a-z, A-Z,0-9,punctuation/annotation such as {}*&^%$...)
Following along the lines of what Max started to suggest:
x := '31323334'.
in := ReadStream on: x.
out := WriteStream on: String new.
[ in atEnd ] whileFalse: [ out nextPut: (in next digitValue * 16 + (in next digitValue)) asCharacter ].
newX := out contents.
newX will have the result '1234'. Or, if you start with:
x := '454647'
You will get a result of 'EFG'.
Note that digitValue might only recognize upper case hex digits, so an asUppercase may be needed on the string before processing.
There is usually a #fold: or #reduce: method that will let you do that. In Pharo there's also a message #allPairsDo: and #groupsOf:atATimeCollect:. Using one of these methods you could do:
| collectionOfBytes |
collectionOfBytes := '9798'
groupsOf: 2
atATimeCollect: [ :group |
(group first digitValue * 10) + (group second digitValue) ].
collectionOfBytes asByteArray asString "--> 'ab'"
The #digitValue message in Pharo simply returns the value of the digit for numerical characters.
If you're receiving the data on a stream you could replace #groupsOf:atATime: with a loop (result may be any collection that you then convert to a string like above):
...
[ stream atEnd ] whileFalse: [
result add: (stream next digitValue * 10) + (stream next digitValue) ]
...
in Smalltalk/X, there is a method called "fromHexBytes:" which the ByteArray class understands. I am not sure, but think that something similar exists in other ST dialects.
If present, you can solve this with:
(ByteArray fromHexString:'68656C6C6F31323334') asString
and the reverse would be:
'hello1234' asByteArray hexPrintString
Another possible solution is to read the string as a hex number,
fetch the digitBytes (which should give you a byte array) and then convert that to a string.
I.e.
(Integer readFrom:'68656C6C6F31323334' radix:16)
digitBytes asString
One problem with that is that I am not sure about which byte-order you will get the digitBytes (LSB or MSB), and if that is defined to be the same across architectures or converted at image loading time to use the native order. So it may be required to reverse the string at the end (to be portable, it may even be required to reverse it conditionally, depending on the endianess of the system.
I cannot test this on VisualWorks, but I assume it should work fine there, too.
I have MATLAB set to record three webcams at the same time. I want to capture and save each feed to a file and automatically increment it the file name, it will be replaced by experiment_0001.avi, followed by experiment_0002.avi, etc.
My code looks like this at the moment
set(vid1,'LoggingMode','disk');
set(vid2,'LoggingMode','disk');
avi1 = VideoWriter('X:\ABC\Data Collection\Presentations\Correct\ExperimentA_002.AVI');
avi2 = VideoWriter('X:\ABC\Data Collection\Presentations\Correct\ExperimentB_002.AVI');
set(vid1,'DiskLogger',avi1);
set(vid2,'DiskLogger',avi2);
and I am incrementing the 002 each time.
Any thoughts on how to implement this efficiently?
Thanks.
dont forget matlab has some roots to C programming language. That means things like sprintf will work
so since you are printing out an integer value zero padded to 3 spaces you would need something like this sprintf('%03d',n) then % means there is a value to print that isn't text. 0 means zero pad on the left, 3 means pad to 3 digits, d means the number itself is an integer
just use sprintf in place of a string. the s means String print formatted. so it will output a string. here is an idea of what you might do
set(vid1,'LoggingMode','disk');
set(vid2,'LoggingMode','disk');
for (n=1:2:max_num_captures)
avi1 = VideoWriter(sprintf('X:\ABC\Data Collection\Presentations\Correct\ExperimentA_%03d.AVI',n));
avi2 = VideoWriter(sprintf('X:\ABC\Data Collection\Presentations\Correct\ExperimentB_002.AVI',n));
set(vid1,'DiskLogger',avi1);
set(vid2,'DiskLogger',avi2);
end