Substituting string labels by integer IDs and back - vowpalwabbit

My data files contain lines with the first entity being a string label followed by features. For example:
MEMO |f write down this note
CALL |f call jim's cell
The problem is that Vowpal Wabbit accepts only integer labels. How can I quickly change from string labels to unique integer IDs and back? That is, quickly transform the data file into:
1 |f write down this note
2 |f call jim's cell
... and back when needed.
For my sample dataset I did it manually for each class using sed, but this seriously breaks my workflow.

cat input.data | perl -nale '$i=$m{$F[0]}; $i or $i=$m{$F[0]}=++$n; $F[0]=$i; print "@F"; END{warn "$_ $m{$_}\n" for sort {$m{$a}<=>$m{$b}} keys %m}' > output.data 2> mapping.txt
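The one-liner above comes without explanation, so here is a minimal Python sketch of the same idea, assuming the label is everything before the first space on each line. The function names encode_labels and decode_labels are made up for this illustration; IDs are handed out from 1 in order of first appearance, and the returned mapping is what you would save in order to convert back later.

```python
def encode_labels(lines):
    """Replace the leading string label of each line with an integer ID.

    IDs are assigned from 1 in order of first appearance.
    Returns the rewritten lines and the label -> ID mapping.
    """
    mapping = {}
    encoded = []
    for line in lines:
        label, sep, rest = line.partition(" ")
        # setdefault only stores the new ID when the label is unseen
        label_id = mapping.setdefault(label, len(mapping) + 1)
        encoded.append(f"{label_id}{sep}{rest}")
    return encoded, mapping


def decode_labels(lines, mapping):
    """Reverse encode_labels using the same mapping."""
    reverse = {str(label_id): label for label, label_id in mapping.items()}
    decoded = []
    for line in lines:
        label_id, sep, rest = line.partition(" ")
        decoded.append(f"{reverse[label_id]}{sep}{rest}")
    return decoded


data = ["MEMO |f write down this note", "CALL |f call jim's cell"]
encoded, mapping = encode_labels(data)
print(encoded)
print(decode_labels(encoded, mapping) == data)
```

Keeping the mapping in a dictionary rather than renumbering by hand means new classes can appear in later files without touching the script.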

Related

Extract 2 fields from string with search

I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON.
More to the point, your input should probably be processed record by record, and my guess is that a two column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since each line looks like a mapping object, and is in fact valid JSON, something like this should do, if you don't mind using Python (which ships with JSON support):
import json
def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input as the string s and parse it as JSON into a dictionary d. Then we return a formatted string in which each double-quoted key (id and hwVersion) is followed by a colon and the double-quoted value of that key from the dictionary.
We can try this with these test input strings and prints:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
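Iterating over lines could look like the sketch below. It assumes one JSON object per line and that both keys are present in every record; the generator name extract_fields and its keys parameter are invented for this example.

```python
import json


def extract_fields(lines, keys=("id", "hwVersion")):
    """Yield one '"key":"value",...' string per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield ",".join(f'"{k}":"{record[k]}"' for k in keys)


sample = [
    '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}',
    '{"id":"5555","name":"6666","hwVersion":"7777"}',
]
for row in extract_fields(sample):
    print(row)
```

Because an open file object is itself an iterable of lines, it can be passed to the function in place of the sample list.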
If you really wanted to use awk, you could, but it's not the most robust and suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention that the position is not known, and assuming the fields can come in any order, we use one regex to extract id and another to get hwVersion, then print them in the given format. If the values could be something other than decimal digits as in your example, the [0-9]+ bit would need to reflect that.
And for the fun of it (this preserves the order of entries from the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".

Pad Independently Missing Columns per Row in CSV with Bash (based off expected values)

I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that can "pad" every row in a file so that every field pair that may be missing from my ideal format is inserted, and its value column that follows is just blank. Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>){
chomp;
my @fields=split ',';
my $kingdom='';
my $phylum='';
my $class='';
my $order='';
my $family='';
my $genus='';
for (my $i=2;$i<$#fields;$i+=2){
if ($fields[$i] eq 'kingdom'){$kingdom=$fields[$i+1];}
if ($fields[$i] eq 'phylum'){$phylum=$fields[$i+1];}
if ($fields[$i] eq 'class'){$class=$fields[$i+1];}
if ($fields[$i] eq 'order'){$order=$fields[$i+1];}
if ($fields[$i] eq 'family'){$family=$fields[$i+1];}
if ($fields[$i] eq 'genus'){$genus=$fields[$i+1];}
}
print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -r -a LINE; do
    # we always get the ID and name, followed by identifier/value pairs
    if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
        echo Invalid CSV line: "${LINE[@]}" >&2
        continue
    fi
    echo -n "${LINE[0]},${LINE[1]},"
    THIS=()
    for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
        THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
    done
    for KEY in kingdom phylum class order family; do
        echo -n "$KEY,${THIS[$KEY]},"
    done
    echo "genus,${THIS[genus]}"
done <"$1" >"$2"
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (i.e. if both arguments are passed, if the input exists, etc), but it should work as expected with just the way you posted it.
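For comparison, the same lookup-table idea can be sketched in a few lines of Python. This assumes that no field contains a comma of its own (the row is split naively) and that identifiers and values always arrive in pairs, as the question states; the names RANKS and pad_row are invented for this example.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus")


def pad_row(row):
    """Return the CSV row with every rank/value pair present, blank if missing."""
    fields = row.rstrip("\n").split(",")
    head, rest = fields[:2], fields[2:]
    # Pair identifiers with their values: rest[0] with rest[1], rest[2] with rest[3], ...
    found = dict(zip(rest[0::2], rest[1::2]))
    padded = list(head)
    for rank in RANKS:
        padded += [rank, found.get(rank, "")]
    return ",".join(padded)


print(pad_row("152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria"))
print(pad_row("77133,uncultured bacterium"))
```

Looking each rank up by name avoids the shifting-column problem described in the question, because the position of a pair in the input no longer matters.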

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, the line contains several pieces of information. The program knows how to differentiate the important information in each line. In particular, it treats the first 4 numbers/letters of the line as the identifier of a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which coincides with a specific instrument. However these lines are not identical. The program I'm using can not do its calculation if there are duplicates of the same station (ie, there are two 1435* stations). I need to find a way to search through my text files and identify if there are any duplicates of the partial strings that represent the stations within the file so that I can delete one or both of the duplicates. If I could have BASH script output the number of the lines containing the duplicates and what the duplicates lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples of this. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name, 'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line and treats its first 4 characters as the device name. It builds a dictionary, device, whose keys are device names and whose values are the comma-separated line numbers on which each name occurs.
The output would be the following:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!
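For readers on Python 3, where has_key and the print statement no longer exist, a rough equivalent that reports only the duplicated stations might look like this. The function name find_duplicate_prefixes and the width parameter are assumptions made for this sketch, not part of the answer above.

```python
from collections import defaultdict


def find_duplicate_prefixes(lines, width=4):
    """Map each prefix seen more than once to a list of (line_number, line)."""
    seen = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        seen[line[:width]].append((number, line))
    return {prefix: hits for prefix, hits in seen.items() if len(hits) > 1}


sample = [
    "1002IPU3...",
    "POIPIPU2...",
    "1435IPU1...",
    "1812IPU3...",
    "BFTOIPD3...",
    "1435IPD2...",
]
for prefix, hits in find_duplicate_prefixes(sample).items():
    for number, line in hits:
        print(f"{prefix}: line {number}: {line}")
```

Returning both the line number and the line text covers what the question asks for: where the duplicates are and what they say.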

Automatically increment filename VideoWriter MATLAB

I have MATLAB set to record three webcams at the same time. I want to capture and save each feed to a file and automatically increment the file name, so that experiment_0001.avi is followed by experiment_0002.avi, etc.
My code looks like this at the moment
set(vid1,'LoggingMode','disk');
set(vid2,'LoggingMode','disk');
avi1 = VideoWriter('X:\ABC\Data Collection\Presentations\Correct\ExperimentA_002.AVI');
avi2 = VideoWriter('X:\ABC\Data Collection\Presentations\Correct\ExperimentB_002.AVI');
set(vid1,'DiskLogger',avi1);
set(vid2,'DiskLogger',avi2);
and I am incrementing the 002 each time.
Any thoughts on how to implement this efficiently?
Thanks.
Don't forget that MATLAB has some roots in the C programming language. That means things like sprintf will work.
Since you are printing an integer zero-padded to 3 digits, you need something like sprintf('%03d',n). The % means there is a value to print that isn't literal text, 0 means zero-pad on the left, 3 means pad to 3 digits, and d means the value is an integer.
Just use sprintf in place of the string literal. The s stands for string (string print formatted), so it outputs a string. Note that sprintf also interprets backslash escapes in the format, so the backslashes in a Windows path have to be doubled. Here is an idea of what you might do:
set(vid1,'LoggingMode','disk');
set(vid2,'LoggingMode','disk');
for n = 1:max_num_captures
avi1 = VideoWriter(sprintf('X:\\ABC\\Data Collection\\Presentations\\Correct\\ExperimentA_%03d.AVI',n));
avi2 = VideoWriter(sprintf('X:\\ABC\\Data Collection\\Presentations\\Correct\\ExperimentB_%03d.AVI',n));
set(vid1,'DiskLogger',avi1);
set(vid2,'DiskLogger',avi2);
end
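The %03d specifier is printf-style rather than MATLAB-specific, so its behaviour can be checked quickly in any language that follows the same conventions. The small Python sketch below does that; the helper name numbered_name and the prefix are made up for illustration only.

```python
def numbered_name(prefix, n, ext=".AVI"):
    """Build a file name with a zero-padded, three-digit counter."""
    return "%s%03d%s" % (prefix, n, ext)


names = [numbered_name("ExperimentA_", n) for n in (1, 2, 10, 123)]
print(names)
```

The 3 is a minimum width, not a limit: a counter of 1234 is printed in full rather than truncated.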

sed: flexible template w/ line number constraint

Problem
I need to insert text of arbitrary length ( # of lines ) into a template while maintaining an exact number of total lines.
Sample source data file:
You have a hold available for pickup as of 2012-01-13:
Title: Really Long Test Title Regarding Random Gibberish. Volume 1, A-B, United States
and affiliated territories, United Nations, countries of the world
Author: Barrel Roll Morton
Title: How to Compromise Free Speech Using Everyday Tools. Volume XXVI
Author: Lamar Smith
#end-of-record
You have a hold available for pickup as of 2012-01-13:
Title: Selling Out Democracy For Fun and Profit. Volume 1, A-B, United States
Author: Lamar Smith
Copy: 12
#end-of-record
Sample Template ( simplified for brevity ):
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY-ZIP%>
<%TITLES GO HERE%>
<%STORE-NAME%>
<%STORE-ADDR%>
<%STORE-CTY-ZIP%>
At this point I use bash's 'mapfile' to load the source file
record by record using the /^#end-of-record/ regex ...so far so good.
Then I pull predictable aspects of each record according to the line
on which they occur, then process the info using a series of sed
search replace statements.
The Hang-Up
So the problem is the unknown number of 'title' records that could occur.
How can I accommodate an unknown number of titles and always have output
of precisely 65 lines?
Given that title records always occur starting on line 8, I can pull the
titles easily with:
sed -n '8,$p' test-match.txt
However, how can I insert this within an allotted space, ex, between <%CUST-CTY-ZIP%> and <%STORE-NAME%> without pushing the store info out of place in the template?
My idea so far:
-first send the customer info through:
Ex.
sed 's/<%CUST-NAME%>/Benedict Arnold/' template.txt
-Append title records
???
-Then the store/location info
sed "s/<%STORE-NAME%>/Smith's House of Greasy Palms/" template.txt
I have code and functions for this stuff if interested but this post is 'windy' as it is.
Just need help with inserting the title records while maintaining the position of the following text and a total line count of 65.
UPDATE
I've decided to change tactics. I'm going to create place holders in the template for all available lines between customer and store info --- then:
Test if line is null in source
if yes -- replace placeholder with null leaving the line ending. Line number maintained.
if not null -- again, replace with text, maintaining line number and line endings in template.
Eventually, I plan to invest some time looking closer at Triplee's suggestion regarding Perl. The Perl way really does look simpler and easier to maintain if I'm going to be stuck with this project long term.
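The fixed-height idea from the update can be sketched in Python as follows, assuming the template contains a single <%TITLES GO HERE%> line that stands for a block of fixed height. The function names fit_lines and fill_template and the slot_height parameter are invented for this illustration; for the real 65-line form, slot_height would be 65 minus the number of other template lines.

```python
def fit_lines(lines, count):
    """Truncate or pad with empty strings so exactly `count` lines remain."""
    return (list(lines) + [""] * count)[:count]


def fill_template(template_lines, titles, slot="<%TITLES GO HERE%>", slot_height=5):
    """Replace the `slot` line with exactly `slot_height` title lines."""
    out = []
    for line in template_lines:
        if line == slot:
            out.extend(fit_lines(titles, slot_height))
        else:
            out.append(line)
    return out


template = [
    "<%CUST-NAME%>",
    "<%CUST-ADDR%>",
    "<%CUST-CTY-ZIP%>",
    "<%TITLES GO HERE%>",
    "<%STORE-NAME%>",
    "<%STORE-ADDR%>",
    "<%STORE-CTY-ZIP%>",
]
short = fill_template(template, ["Title 1", "Title 2"])
long = fill_template(template, ["Title %d" % i for i in range(1, 9)])
print(len(short), len(long))
```

Because the title block is always padded or cut to the same height, the store lines land on the same line numbers whether a record has two title lines or eight.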
This might work for you:
cat <<! >titles.txt
> 1
> 2
> 3
> 4
> 5
> 6
> 7
> Title 1
> Title 2
> Title 3
> Title 4
> Title 5
> Title 6
> !
cat <<! >template.txt
> <%CUST-NAME%>
> <%CUST-ADDR%>
> <%CUST-CTY-ZIP%>
>
> <%TITLES GO HERE%>
>
> <%STORE-NAME%>
> <%STORE-ADDR%>
> <%STORE-CTY-ZIP%>
> !
sed '1,7d;:a;$!{N;ba};:b;G;s/\n[^\n]*//5g;tc;bb;:c;s/\n/\\n/g;s|.*|/<%TITLES GO HERE%>/c\\&|' titles.txt |
sed -f - template.txt
<%CUST-NAME%>
<%CUST-ADDR%>
<%CUST-CTY-ZIP%>
Title 1
Title 2
Title 3
Title 4
Title 5
<%STORE-NAME%>
<%STORE-ADDR%>
<%STORE-CTY-ZIP%>
This pads/squeezes the titles to 5 lines (s/\n[^\n]*//5g). If you want fewer or more, change the 5 to the number desired.
This will give you five lines of output regardless of the number of lines in titles.txt:
sed -n '$s/$/\n\n\n\n\n/;8,$p' test-match.txt | head -n 5
Another version:
sed -n '8,$N; ${s/$/\n\n\n\n\n/;s/\(\([^\n]*\n\)\{4\}\).*/\1/p}' test-match.txt
Use one less than the number of lines you want (4 in this example will cause 5 lines of output).
Here's a quick proof of concept using Perl formats. If you are unfamiliar with Perl, I guess you will need some additional help with how to get the values from two different files, but it's quite doable, of course. Here, the data is simply embedded into the script itself.
I set the $titles format to 5 lines instead of the proper value (58 or something?) in order to make this easier to try out in a terminal window, and to demonstrate that the output is indeed truncated when it is longer than the allocated space.
#!/usr/bin/perl
use strict;
use warnings;
use vars (qw($cust_name $cust_addr $cust_cty_zip $titles
$store_name $store_addr $store_cty_zip));
my $fmtline = '@' . '<' x 78;
my $titlefmtline = '^' . '<' x 78;
my $empty = '';
my $fmt = join ("\n$fmtline\n", 'format STDOUT = ',
'$cust_name', '$cust_addr', '$cust_cty_zip', '$empty') .
("\n$titlefmtline\n" . '$titles') x 5 . #58
join ("\n$fmtline\n", '', '$empty',
'$store_name', '$store_addr', '$store_cty_zip');
#print $fmt;
eval "$fmt\n.\n";
$titles = <<____HERE;
Title: Really Long Test Title Regarding Random Gibberish. Volume 1, A-B, United States
and affiliated territories, United Nations, countries of the world
Author: Barrel Roll Morton
Title: How to Compromise Free Speech Using Everyday Tools. Volume XXVI
Author: Lamar Smith
____HERE
# Preserve line breaks -- ^<< will fill lines, but preserves line breaks on \r
$titles =~ s/\n/\r\n/g;
while (<DATA>) {
chomp;
($cust_name, $cust_addr, $cust_cty_zip, $store_name, $store_addr, $store_cty_zip)
= split (",");
write STDOUT;
}
__END__
Charlie Bravo,23 Alpa St,Delta ND 12345,Spamazon,98 Spamway,Atlanta GA 98765
The use of $empty to get an empty line is pretty ugly, but I wanted to keep the format as regular as possible. I'm sure it could be avoided, but at the cost of additional code complexity IMHO.
If you are unfamiliar with Perl, the use strict is a complication, but a practical necessity; it requires you to declare your variables either with use vars or my. It is a best practice which helps immensely if you try to make changes to the script.
Here documents with <<HERE work like in shell scripts; they allow you to create a multi-line string easily.
The x operator is for repetition; 'string' x 3 is 'stringstringstring' and ("list") x 3 is ("list", "list", "list"). The dot operator is string concatenation; that is, "foo" . "bar" is "foobar".
Finally, the DATA filehandle allows you to put arbitrary data in the script file itself after the __END__ token which signals the end of the program code. For reading from standard input, use <> instead of <DATA>.
