I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are identifiers, literals ("kingdom", ... "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that "pads" every row in a file, so that every field pair missing from my ideal format is inserted with a blank value column after it? Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice that if genus is missing, the padded output should end with a comma to show that the genus value doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>){
chomp;
my @fields=split ',';
my $kingdom='';
my $phylum='';
my $class='';
my $order='';
my $family='';
my $genus='';
for (my $i=2;$i<$#fields;$i+=2){
if ($fields[$i] eq 'kingdom'){$kingdom=$fields[$i+1];}
if ($fields[$i] eq 'phylum'){$phylum=$fields[$i+1];}
if ($fields[$i] eq 'class'){$class=$fields[$i+1];}
if ($fields[$i] eq 'order'){$order=$fields[$i+1];}
if ($fields[$i] eq 'family'){$family=$fields[$i+1];}
if ($fields[$i] eq 'genus'){$genus=$fields[$i+1];}
}
print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -r -a LINE; do
# we always get the #ID and name
if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
echo Invalid CSV line: "${LINE[@]}" >&2
continue
fi
echo -n "${LINE[0]},${LINE[1]},"
THIS=()
for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
done
for KEY in kingdom phylum class order family; do
# quote the expansions so values containing spaces are printed unchanged
echo -n "$KEY,${THIS[$KEY]},"
done
echo "genus,${THIS[genus]}"
done <"$1" >"$2"
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (i.e. if both arguments are passed, if the input exists, etc), but it should work as expected with just the way you posted it.
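For comparison, the same lookup-by-label idea can be sketched in Python 3. This is only a minimal sketch: the names RANKS and pad_row are made up for illustration, and it assumes that no field contains a quoted comma, so a plain split on "," is enough.

```python
# Rank labels in the order the padded row should list them.
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus"]

def pad_row(line):
    """Return the row with every rank/value pair present, blank where missing."""
    fields = line.rstrip("\n").split(",")
    # Same validation as the bash script: ID and name, then complete pairs.
    if len(fields) < 2 or len(fields) % 2:
        raise ValueError("Invalid CSV line: " + line)
    # Map each rank label that is present to the value that follows it.
    present = dict(zip(fields[2::2], fields[3::2]))
    out = fields[:2]
    for rank in RANKS:
        out += [rank, present.get(rank, "")]
    return ",".join(out)
```

Because the values are looked up by label rather than by position, it does not matter which pairs are missing from a given row.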
So I have a text file that contains a large number of lines. Each line is one long string with no spacing, but it contains several pieces of information, and the program knows how to pick out the important parts. The program treats the first 4 characters (numbers/letters) of each line as the identifier of a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, two lines in this text file start with 1435, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if the same station appears more than once (i.e., there are two 1435* stations). I need a way to search through my text files and find any duplicates of the partial strings that represent the stations, so that I can delete one or both of the duplicates. If a BASH script could output the line numbers of the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too, it'll just take a bit more work...
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
a[substr($0,1,4)]++ # put prefixes to array and count them
}
END { # in the end
for (i in a) { # go thru all indexes
if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
}
}
Slightly roundabout but this should work-
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
echo -n "$i "
grep -c ^"$i" file.txt #This tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (written with Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name,'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
The script reads each line and treats its first 4 characters as the device name. It builds a dictionary, device, whose keys are device names and whose values are the line numbers where each name occurs.
The output would be:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!
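Since the question asks for both the line numbers and the text of the duplicate lines, here is a rough Python 3 sketch that keeps both and reports only the stations seen more than once. The function name find_duplicate_stations and the width parameter are assumptions made for illustration.

```python
from collections import defaultdict

def find_duplicate_stations(lines, width=4):
    """Group (line_number, text) pairs by the first `width` characters
    and return only the groups that contain more than one line."""
    by_station = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        by_station[text[:width]].append((number, text))
    # Keep only the stations that occur on more than one line.
    return {station: hits for station, hits in by_station.items() if len(hits) > 1}

if __name__ == "__main__":
    # Sample lines copied from the question, trailing "..." included.
    sample = ["1002IPU3...", "POIPIPU2...", "1435IPU1...",
              "1812IPU3...", "BFTOIPD3...", "1435IPD2..."]
    for station, hits in find_duplicate_stations(sample).items():
        for number, text in hits:
            print(f"{station}: line {number}: {text}")
```

To run it on a real file, pass an open file object instead of the sample list, since iterating over a file yields its lines.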
I'm doing a large text mining project. I have 100,000 text files. I've extracted two- and three-word phrases from sets of 1,000 documents at a time and have created 100 files. Each file has roughly 8 million lines in this format:
total_references num_docs_referencing_phrase phrase
I want to create an aggregate list of total references and number of docs referencing each phrase by processing the 100 intermediate files. To that end I wrote this program.
#!/usr/bin/perl -w
$| = 1 ; # Don't buffer output
use File::Find ;
$dir = "/home/sl/phrase-counts" ;
find(\&processFile, $dir) ;
for $key ( keys %TOTALREFS ) {
print "$TOTALREFS{$key} $NUMDOCS{$key} ${key}\n" ;
}
sub processFile {
my $file = $_ ;
my $fullName = $File::Find::name ;
if ( $fullName =~ /\.txt$/ ) {
$date = `date` ;
chomp $date ;
print "($date) file: $fullName\n" ;
open INFILE, "$fullName" or die "Cannot read ${fullName}";
while ( <INFILE> ) {
my $line = $_ ;
chomp $line ;
( $totalRefs, $numDocs, $phrase ) = split (/\s+/, $line, 3) ;
$TOTALREFS{$phrase} += $totalRefs ;
$NUMDOCS{$phrase} += $numDocs ;
}
close ( INFILE ) ;
}
}
The code produces strange errors after 8 or so files are processed and then it hangs, i.e. it stops listing files it should be processing.
Use of uninitialized value $date in scalar chomp at ./getCounts line 21.
Use of uninitialized value $date in concatenation (.) or string at ./getCounts line 22.
I don't believe the problem is really my date command, especially since it runs fine for a number of early files processed and because the problem does not occur at the same point in the run every time I run it. I assume the problem is that my program is consuming too much system resource and corrupting the state of the running environment. Running top and watching memory use go up to 97% of the machine concerns me although I notice that the errors and hang occur before top shows little memory left. And, there is some swap on the machine.
My question is, how can I rewrite this program to actually complete its execution? With 8 million lines of data for each of 100 files there could be 800 million lines of output although I would guess that the total is more likely in the range of 50-100 million lines. I have done some cleanup of the data and could consider more aggressive sanitizing of phrases to cut down on the numbers but I'd like to understand how I can design this code better.
I've seen articles that tell programmers to put their data into a database. My concern is the time it might take to update a database 100 million times.
Suggestions?
It looks like you're running on a *nix system, so make sort do all the work for you. It knows how to use memory efficiently.
sort -k 3 all_your_input_files*.txt > sorted.txt
Why do this? Because now all lines corresponding to the same phrase appear in a single block within the file, so you can compute totals easily: just write a short Perl script that adds the current line's numbers to the current totals, and writes them out whenever the phrase changes from the previous line (and at the end):
my ($oldPhrase, $totTotalRefs, $totNumDocs) = (undef, 0, 0);
while ( <INFILE> ) {
my $line = $_ ;
chomp $line ;
( $totalRefs, $numDocs, $phrase ) = split (/\s+/, $line, 3) ;
if (defined($oldPhrase) && $phrase ne $oldPhrase) {
print "$totTotalRefs $totNumDocs $oldPhrase\n" ;
$totTotalRefs = $totNumDocs = 0;
}
$totTotalRefs += $totalRefs ;
$totNumDocs += $numDocs ;
$oldPhrase = $phrase;
}
close ( INFILE ) ;
print "$totTotalRefs $totNumDocs $oldPhrase\n" ;
The above code is untested, but should work with appropriate boilerplate added I think.
[EDIT: Fix bug in which $oldPhrase never gets set, as suggested by Sol.]
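The same sum-while-the-phrase-stays-the-same loop can be sketched in Python 3 with itertools.groupby. This is an illustration rather than a drop-in replacement: parse and aggregate_sorted are hypothetical names, and the sketch assumes the input has already been sorted so that equal phrases sit on adjacent lines (running sort with LC_ALL=C is one way to make sure phrases are compared byte for byte).

```python
from itertools import groupby

def parse(line):
    """Split 'total_refs num_docs phrase' into (phrase, total_refs, num_docs)."""
    total_refs, num_docs, phrase = line.rstrip("\n").split(None, 2)
    return phrase, int(total_refs), int(num_docs)

def aggregate_sorted(lines):
    """Yield one summed (total_refs, num_docs, phrase) per run of equal phrases.

    Only the current run is held in memory, so the input can be far
    larger than RAM as long as it is already sorted by phrase.
    """
    records = (parse(line) for line in lines)
    for phrase, group in groupby(records, key=lambda record: record[0]):
        total_refs = num_docs = 0
        for _, refs, docs in group:
            total_refs += refs
            num_docs += docs
        yield total_refs, num_docs, phrase
```

Passing an open file object for sorted.txt as lines streams the file one line at a time.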
You are storing all of the different phrases as keys for both %TOTALREFS and %NUMDOCS, so things are at least twice as bad as they need to be.
I suggest you try the following
Add use strict and use warnings (instead of -w) and declare all of your variables properly
Don't use capitals in your variable names. Capital letters are reserved for global identifiers
Don't start 100 subprocesses just to get the time of day. Just use localtime like this
printf "(%s) file: %s\n", scalar localtime, $full_name;
Use find just to generate an array of the files to be processed, so it would look like this
my @files;
find(sub {
push @files, $File::Find::name if -f and /\.txt$/i;
}, $dir) ;
Then you can process each file with a simple for loop
for my $file (@files) {
...
}
Take two passes through the files: the first generates a hash that maps each phrase to an integer starting at zero, and the second uses those integers to index the arrays @total_refs and @num_docs and increment their elements
You may still run out of memory, but those measures will certainly give you a better chance.
Update
Just to be clear, this is how I imagine it would work. I've done this as a single pass, but it may be better to write it as two passes as I described so that you can check your intermediate data.
Note that this isn't tested apart from making sure that it compiles.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;
STDOUT->autoflush;
use File::Find;
my $dir = '/home/sl/phrase-counts';
my @files;
find(sub {
push @files, $File::Find::name if -f and /\.txt$/i;
}, $dir);
my (%phrases, @total_refs, @num_docs);
my $num_phrases = 0;
for my $file (@files) {
printf "(%s) file: %s\n", scalar localtime, $file;
open my $in_fh, '<', $file;
while (<$in_fh>) {
chomp;
my ($total_refs, $num_docs, $phrase) = split ' ', $_, 3;
my $phrase_num = $phrases{$phrase} //= $num_phrases++;
$total_refs[$phrase_num] += $total_refs;
$num_docs[$phrase_num] += $num_docs;
}
}
for my $phrase (keys %phrases) {
my $phrase_num = $phrases{$phrase};
printf "%s %s %s\n",
$total_refs[$phrase_num],
$num_docs[$phrase_num],
$phrase;
}
Trying to use more resources than are available causes exceptions for being unable to allocate memory, or results in system calls returning error messages. It doesn't corrupt memory.
In this case, the result of backticks is undef, which means the command could not be executed. That could very well be because you have insufficient memory left. Where did you get the idea that being unable to execute a program is the result of corrupted memory?! Furthermore, you have an error you don't understand, yet you didn't check what error was returned? Backticks set $? (and $! when $? is negative) as per system. Assuming it's a bug in Perl is a very bad assumption to make, especially when the system tells you what error occurred.
Use less memory, either through the use of a more appropriate and/or efficient data structure, or by keeping a portion of the data out of memory (e.g. on disk or in a database).
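As one possible illustration of keeping the totals on disk, here is a rough Python 3 sketch using the standard sqlite3 module. The table name counts and the functions open_store and add_lines are made up for this example, and whether it is fast enough for 100 files of this size is something you would have to measure. Each file is folded in inside a single transaction rather than committing after every line.

```python
import sqlite3

def open_store(path):
    """Create (or reopen) the on-disk table that holds the running totals."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts ("
        "phrase TEXT PRIMARY KEY, "
        "total_refs INTEGER NOT NULL, "
        "num_docs INTEGER NOT NULL)"
    )
    return conn

def add_lines(conn, lines):
    """Fold one file's 'total_refs num_docs phrase' lines into the table."""
    with conn:  # one transaction per call, committed when the block exits
        for line in lines:
            total_refs, num_docs, phrase = line.rstrip("\n").split(None, 2)
            # Make sure the phrase has a row, then add this line's counts to it.
            conn.execute("INSERT OR IGNORE INTO counts VALUES (?, 0, 0)", (phrase,))
            conn.execute(
                "UPDATE counts SET total_refs = total_refs + ?, "
                "num_docs = num_docs + ? WHERE phrase = ?",
                (int(total_refs), int(num_docs), phrase),
            )
```

Calling open_store once and then add_lines once per input file leaves the aggregate in the database, where it can be read back with an ordinary SELECT.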