Renaming multiple files at once - terminal

I have a large data set of files that are all named like this:
proportional.rank_9.cycle_10157.ratio_9
proportional.rank_9.cycle_10158.ratio_0
proportional.rank_9.cycle_10159.ratio_130
proportional.rank_9.cycle_10160.ratio_7
But of course in hindsight this is going to make reading in the files difficult since I'll need the file names and each may have a different ratio. Is there a way to rename all the files at once so that the ratio_* is gone?
Ideally I'd like them to be in this form:
proportional.rank_9.cycle_10157
proportional.rank_9.cycle_10158
proportional.rank_9.cycle_10159
proportional.rank_9.cycle_10160

# set X to all files of the form proportional.rank*
X=$(ls proportional.rank*)
# print X to check it is set
echo $X
# loop over all files in X
for i in $X ; do
    before_stem=${i%.ratio_*}             # what I want to keep
    after_stem=${i#*.ratio_}              # what I don't want
    new_end=$(printf ".out" $after_stem)  # give new ending
    mv "$i" "${before_stem}${new_end}"    # concatenate the old beginning and the new ending
done
${var%pattern}: strip the shortest match of pattern from the end, keeping everything before it
${var#pattern}: strip the shortest match of pattern from the beginning, keeping everything after it
printf: works much like C's printf
proportional.rank_9.cycle_10157.ratio_9
proportional.rank_9.cycle_10158.ratio_0
proportional.rank_9.cycle_10159.ratio_130
proportional.rank_9.cycle_10160.ratio_7
becomes
proportional.rank_9.cycle_10157.out
proportional.rank_9.cycle_10158.out
proportional.rank_9.cycle_10159.out
proportional.rank_9.cycle_10160.out
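If the goal is the exact names from the question, with no replacement suffix at all, the same % expansion can do it directly. A minimal sketch, assuming the files are in the current directory:
for f in proportional.rank*.ratio_*; do
    mv "$f" "${f%.ratio_*}"    # drop the trailing .ratio_<number>
done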

Related

How to process lines from external process in elvish?

Original problem: I want to extract information from (string) lines, produced by some process. It can be done by processing the lines one by one. Extracted data should be saved in several collections (lists? maps? whatever...)
The lines to process are specific, and they can be ignored, until some specific line is found. All the following lines should be saved, and then processed.
My problem with elvish is that I don't know how to transform the produced output lines into a list of strings.
I have tried:
var lines = (my_app) | from-lines
but elvish says that there are many values on the right side and only one variable on the left side (it cannot make a list from those values).
My other approach was:
var data = []
my_app | each {|line|
...
set data = conj $data $line
}
but it also didn't work (I don't remember the error message, though). Also, (conj $data $line) doesn't work: elvish treats conj as some external command, which cannot be found.
A quite simple output capture worked for me:
~> cat t.txt
a
b
~> var linelist = [( cat t.txt )]
~> put $linelist[1]
▶ b
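Applied to the original problem, the same bracketed output capture should give a list of the lines produced by the process (my_app is the placeholder command from the question):
var lines = [(my_app)]
put $lines[0]    # first captured line
The list can then be iterated with each, or indexed directly as shown above.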

Pad Independently Missing Columns per Row in CSV with Bash (based off expected values)

I have a CSV file in which the ideal format for a row is this:
taxID#, scientific name, kingdom, k, phylum, p, class, c, order, o, family, f, genus, g
...where kingdom, phylum, etc. are literal identifier strings ("kingdom", ..., "phylum"), and the values that follow the identifiers (k, p, etc.) are the actual values for those kingdoms, phyla, etc.
Example:
240395,Rugosa emeljanovi,kingdom,Metazoa,phylum,Chordata,class,Amphibia,order,Anura,family,Ranidae,genus,Rugosa
However, not all rows possess all levels of taxonomy, i.e. any one row might be missing the columns for an identifier/value pair, say, "class, c," and any 2-column PAIR can be missing independently of the other pairs missing or not. Also, if fields are missing, they will always be missing with their identifier field, so I'd never get "kingdom, phylum" together without the value for "k" between them. Thus much of my file is missing random fields:
...
135487,Nocardia cyriacigeorgica,class,Actinobacteria,order,Corynebacteriales,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,phylum,Actinobacteria,class,Actinobacteria
171953,uncultured Acidobacteria bacterium,phylum,Acidobacteria
77133,uncultured bacterium
...
Question: How can I write a bash shell script that "pads" every row in a file so that every field pair that may be missing from my ideal format is inserted, with the value column that follows left blank? Desired output:
...
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
77133,uncultured bacterium,kingdom,,phylum,,class,,order,,family,,genus,
...
Notes:
Notice if a genus was missing, the padded output should end with a comma to denote the value of genus doesn't exist.
taxID# and scientific name (the first two fields) will ALWAYS be present.
I don't care for time/resource efficiency if your solution is brute-forcey.
What I've tried:
I wrote a simple if/then script that checks sequentially if an expected field is gone. pseudocode:
if "$f3" is not "kingdom", pad
but the problem is that if kingdom was truly missing, it will get padded in output but the remaining field variables will be goofed up and I can't just follow that by saying
if "$f5" is not "phylum", pad
because if kingdom were missing, phylum would probably now be in field 3 ($f3), not $f5, that is, if it too weren't missing. (I did this by concatenating onto a string variable the expected output based on the absence of each field, and simply concatenating the original value if the field wasn't missing, and then echoing the finished, supposedly padded row to output).
I'd like to be able to execute my script like this
bash pad.sh prePadding.csv postPadding.csv
but I would accept answers using Mac Excel 2011 if needed.
Thank you!!
Although it should be possible in bash, I would use Perl for this. I tried to make the code as simple to understand as I could.
#!/usr/bin/perl
while (<>) {
    chomp;
    my @fields = split ',';
    my $kingdom = '';
    my $phylum  = '';
    my $class   = '';
    my $order   = '';
    my $family  = '';
    my $genus   = '';
    for (my $i = 2; $i < $#fields; $i += 2) {
        if ($fields[$i] eq 'kingdom') { $kingdom = $fields[$i+1]; }
        if ($fields[$i] eq 'phylum')  { $phylum  = $fields[$i+1]; }
        if ($fields[$i] eq 'class')   { $class   = $fields[$i+1]; }
        if ($fields[$i] eq 'order')   { $order   = $fields[$i+1]; }
        if ($fields[$i] eq 'family')  { $family  = $fields[$i+1]; }
        if ($fields[$i] eq 'genus')   { $genus   = $fields[$i+1]; }
    }
    print "$fields[0],$fields[1],kingdom,$kingdom,phylum,$phylum,class,$class,order,$order,family,$family,genus,$genus\n";
}
Which gives me:
perl pad.pl input
135487,Nocardia cyriacigeorgica,kingdom,,phylum,,class,Actinobacteria,order,Corynebacteriales,family,,genus,Nocardia
10090,Mus musculus,kingdom,Metazoa,phylum,Chordata,class,Mammalia,order,Rodentia,family,Muridae,genus,Mus
152507,uncultured actinobacterium,kingdom,,phylum,Actinobacteria,class,Actinobacteria,order,,family,,genus,
171953,uncultured Acidobacteria bacterium,kingdom,,phylum,Acidobacteria,class,,order,,family,,genus,
(or for better reading:)
perl pad.pl input | tableize -t | sed 's/^/ /'
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|135487|Nocardia cyriacigeorgica |kingdom| |phylum| |class|Actinobacteria|order|Corynebacteriales|family| |genus|Nocardia|
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|10090 |Mus musculus |kingdom|Metazoa|phylum|Chordata |class|Mammalia |order|Rodentia |family|Muridae|genus|Mus |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|152507|uncultured actinobacterium |kingdom| |phylum|Actinobacteria|class|Actinobacteria|order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
|171953|uncultured Acidobacteria bacterium|kingdom| |phylum|Acidobacteria |class| |order| |family| |genus| |
+------+----------------------------------+-------+-------+------+--------------+-----+--------------+-----+-----------------+------+-------+-----+--------+
This would be the answer in bash using associative arrays:
#!/bin/bash
declare -A THIS
while IFS=, read -a LINE; do
    # we always get the taxID# and name
    if (( ${#LINE[@]} < 2 || ${#LINE[@]} % 2 )); then
        echo Invalid CSV line: "${LINE[@]}" >&2
        continue
    fi
    echo -n "${LINE[0]},${LINE[1]},"
    THIS=()
    for (( INDEX=2; INDEX < ${#LINE[@]}; INDEX+=2 )); do
        THIS[${LINE[INDEX]}]=${LINE[INDEX+1]}
    done
    for KEY in kingdom phylum class order family; do
        echo -n $KEY,${THIS[$KEY]},
    done
    echo genus,${THIS[genus]}
done <$1 >$2
It also validates CSV lines so that they contain at least 2 columns (ID and name) and that they have an even number of columns.
The script can be extended to do more error checking (e.g. whether both arguments are passed, whether the input file exists, etc.), but it should work as expected when invoked just the way you posted it.
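Saved as pad.sh, it runs exactly as the question asked:
bash pad.sh prePadding.csv postPadding.csv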

Find lines that have partial matches

So I have a text file that contains a large number of lines. Each line is one long string with no spacing; however, each line contains several pieces of information. The program knows how to differentiate the important information in each line. The program identifies the first 4 numbers/letters of the line as corresponding to a specific instrument. Here is a small example portion of the text file.
example text file
1002IPU3...
POIPIPU2...
1435IPU1...
1812IPU3...
BFTOIPD3...
1435IPD2...
As you can see, there are two lines that contain 1435 within this text file, which corresponds to a specific instrument. However, these lines are not identical. The program I'm using cannot do its calculation if there are duplicates of the same station (i.e., there are two 1435* stations). I need a way to search through my text files and identify any duplicates of the partial strings that represent the stations, so that I can delete one or both of the duplicates. If a bash script could output the numbers of the lines containing the duplicates and what the duplicate lines say, that would be appreciated. I think there might be an easy way to do this, but I haven't been able to find any examples. Your help is appreciated.
If all you want to do is detect if there are duplicates (not necessarily count or eliminate them), this would be a good starting point:
awk '{ if (++seen[substr($0, 1, 4)] > 1) printf "Duplicates found : %s\n",$0 }' inputfile.txt
For that matter, it's a good starting point for counting or eliminating, too; it'll just take a bit more work...
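For instance, eliminating the duplicates could be as simple as keeping only the first line seen for each 4-character prefix; a sketch (deduped.txt is just a placeholder name):
awk '!seen[substr($0, 1, 4)]++' inputfile.txt > deduped.txt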
If you want the count of duplicates:
awk '{a[substr($0,1,4)]++} END {for (i in a) {if(a[i]>1) print i": "a[i]}}' test.in
1435: 2
or:
{
    a[substr($0,1,4)]++            # put prefixes to array and count them
}
END {                              # in the end
    for (i in a) {                 # go thru all indexes
        if(a[i]>1) print i": "a[i] # and print out the duplicate prefixes and their counts
    }
}
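The expanded form can be saved to a file (dups.awk is a hypothetical name) and run with awk's -f option:
awk -f dups.awk test.in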
Slightly roundabout, but this should work:
cut -c 1-4 file.txt | sort -u > list
for i in `cat list`;
do
    echo -n "$i "
    grep -c ^"$i" file.txt # this tells you how many occurrences of each 'station'
done
Then you can do whatever you want with the ones that occur more than once.
Use the following Python script (Python 2.7 syntax):
#!/usr/bin/python
file_name = "device.txt"
f1 = open(file_name, 'r')
device = {}
line_count = 0
for line in f1:
    line_count += 1
    if device.has_key(line[:4]):
        device[line[:4]] = device[line[:4]] + "," + str(line_count)
    else:
        device[line[:4]] = str(line_count)
f1.close()
print device
Here the script reads each line, treats the first 4 characters of each line as the device name, and builds a dictionary (device) whose keys are device names and whose values are the line numbers where that device name is found.
The output would be the following:
{'POIP': '2', '1435': '3,6', '1002': '1', '1812': '4', 'BFTO': '5'}
this might help you out!!

perl text mining code can't handle massive amounts of data

I'm doing a large text mining project. I have 100,000 text files. I've extracted two- and three-word phrases from sets of 1,000 documents at a time and have created 100 files. Each file has roughly 8 million lines in this format:
total_references num_docs_referencing_phrase phrase
I want to create an aggregate list of total references and number of docs referencing each phrase by processing the 100 intermediate files. To that end I wrote this program.
#!/usr/bin/perl -w
$| = 1 ;    # Don't buffer output
use File::Find ;
$dir = "/home/sl/phrase-counts" ;
find(\&processFile, $dir) ;
for $key ( keys %TOTALREFS ) {
    print "$TOTALREFS{$key} $NUMDOCS{$key} ${key}\n" ;
}
sub processFile {
    my $file = $_ ;
    my $fullName = $File::Find::name ;
    if ( $fullName =~ /\.txt$/ ) {
        $date = `date` ;
        chomp $date ;
        print "($date) file: $fullName\n" ;
        open INFILE, "$fullName" or die "Cannot read ${fullName}";
        while ( <INFILE> ) {
            my $line = $_ ;
            chomp $line ;
            ( $totalRefs, $numDocs, $phrase ) = split (/\s+/, $line, 3) ;
            $TOTALREFS{$phrase} += $totalRefs ;
            $NUMDOCS{$phrase} += $numDocs ;
        }
        close ( INFILE ) ;
    }
}
The code produces strange errors after 8 or so files are processed and then it hangs, i.e. it stops listing files it should be processing.
Use of uninitialized value $date in scalar chomp at ./getCounts line 21.
Use of uninitialized value $date in concatenation (.) or string at ./getCounts line 22.
I don't believe the problem is really my date command, especially since it runs fine for a number of the early files and the problem does not occur at the same point in the run every time. I assume the problem is that my program is consuming too many system resources and corrupting the state of the running environment. Running top and watching memory use climb to 97% of the machine concerns me, although I notice that the errors and the hang occur before top shows little memory left. And there is some swap on the machine.
My question is, how can I rewrite this program to actually complete its execution? With 8 million lines of data for each of 100 files there could be 800 million lines of output although I would guess that the total is more likely in the range of 50-100 million lines. I have done some cleanup of the data and could consider more aggressive sanitizing of phrases to cut down on the numbers but I'd like to understand how I can design this code better.
I've seen articles that tell programmers to put their data into a database. My concern is the time it might take to update a database 100 million times.
Suggestions?
It looks like you're running on a *nix system, so make sort do all the work for you. It knows how to use memory efficiently.
sort -k 3 all_your_input_files*.txt > sorted.txt
Why do this? Because now all lines corresponding to the same phrase appear in a single block within the file, so you can compute totals easily: just write a short Perl script that adds the current line's numbers to the current totals, and writes them out whenever the phrase changes from the previous line (and at the end):
my ($oldPhrase, $totTotalRefs, $totNumDocs) = (undef, 0, 0);
while ( <INFILE> ) {
    my $line = $_ ;
    chomp $line ;
    ( $totalRefs, $numDocs, $phrase ) = split (/\s+/, $line, 3) ;
    if (defined($oldPhrase) && $phrase ne $oldPhrase) {
        print "$totTotalRefs $totNumDocs $oldPhrase\n" ;
        $totTotalRefs = $totNumDocs = 0;
    }
    $totTotalRefs += $totalRefs ;
    $totNumDocs += $numDocs ;
    $oldPhrase = $phrase;
}
close ( INFILE ) ;
print "$totTotalRefs $totNumDocs $oldPhrase\n" ;
The above code is untested, but should work with appropriate boilerplate added I think.
[EDIT: Fix bug in which $oldPhrase never gets set, as suggested by Sol.]
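As one possible shape for that boilerplate (aggregate.pl and totals.txt are just placeholder names, with the loop reading <> instead of INFILE so it consumes the sorted stream on standard input):
sort -k 3 all_your_input_files*.txt | perl aggregate.pl > totals.txt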
You are storing all of the different phrases as keys for both %TOTALREFS and %NUMDOCS, so things are at least twice as bad as they need to be.
I suggest you try the following
Add use strict and use warnings (instead of -w) and declare all of your variables properly
Don't use capitals in your variable names. Capital letters are reserved for global identifiers
Don't start 100 subprocesses just to get the time of day. Just use localtime like this
printf "(%s) file: %s\n", scalar localtime, $full_name;
Use find just to generate an array of the files to be processed, so it would look like this
my @files;
find(sub {
    push @files, $File::Find::name if -f and /\.txt$/i;
}, $dir) ;
Then you can process each file with a simple for loop
for my $file (@files) {
    ...
}
Take two passes through the files, the first generating a hash that relates each phrase to an integer starting at zero, and the second using those integers to index the arrays @total_refs and @num_docs and increment their elements
You may still run out of memory, but those measures will certainly give you a better chance.
Update
Just to be clear, this is how I imagine it would work. I've done this as a single pass, but it may be better to write it as two passes as I described so that you can check your intermediate data.
Note that this isn't tested apart from making sure that it compiles.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;
STDOUT->autoflush;
use File::Find;

my $dir = '/home/sl/phrase-counts';

my @files;
find(sub {
    push @files, $File::Find::name if -f and /\.txt$/i;
}, $dir);

my (%phrases, @total_refs, @num_docs);
my $num_phrases = 0;

for my $file (@files) {
    printf "(%s) file: %s\n", scalar localtime, $file;
    open my $in_fh, '<', $file;
    while (<$in_fh>) {
        chomp;
        my ($total_refs, $num_docs, $phrase) = split ' ', $_, 3;
        my $phrase_num = $phrases{$phrase} //= $num_phrases++;
        $total_refs[$phrase_num] += $total_refs;
        $num_docs[$phrase_num]   += $num_docs;
    }
}

for my $phrase (keys %phrases) {
    my $phrase_num = $phrases{$phrase};
    printf "%s %s %s\n",
        $total_refs[$phrase_num],
        $num_docs[$phrase_num],
        $phrase;
}
Trying to use more resources than are available causes exceptions for being unable to allocate memory, or results in system calls returning error messages. It doesn't corrupt memory.
In this case, the result of backticks is undef, which means the command could not be executed. That could very well be because you have insufficient memory left. Where did you get the idea that being unable to execute a program is the result of corrupted memory?! Furthermore, you have an error you don't understand, yet you didn't check what error was returned? Backticks set $? (and $! when $? is negative) as per system. Assuming it's a bug in Perl is a very bad assumption to make, especially when the system tells you what error occurred.
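A minimal sketch of that check, applied to the backtick call in the original script:
my $date = `date`;
unless ( defined $date && $? == 0 ) {
    die "date failed: exit status $?, error: $!\n";   # report what the system said instead of guessing
}
chomp $date;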
Use less memory, either through the use of a more appropriate and/or efficient data structure, or by keeping a portion of the data out of memory (e.g. on disk or in a database).

Change a referenced variable in BASH

I intend to change a global variable inside a function in BASH; however, I have no clue how to do it. This is my code:
CANDIDATES[5]="1 2 3 4 5 6"
random_mutate()
{
    a=$1                                      # assign name of input variable to "a"
    insides=${!a}                             # see input variable value
    RNDM_PARAM=`echo $[ 1 + $[ RANDOM % 5 ]]` # change random position in input variable
    NEW_PAR=99                                # value to substitute
    ARR=($insides)                            # convert string to array
    ARR[$RNDM_PARAM]=$NEW_PAR                 # change the random position
    NEW_GUY=$( IFS=$' '; echo "${ARR[*]}" )   # convert array once more to string
    echo "$NEW_GUY"
    ### NOW, how to assign NEW_GUY to CANDIDATES[5]?
}
random_mutate CANDIDATES[5]
I would like to be able to assign NEW_GUY to the variable referenced by $1, or to another variable pointed to by $2 (not included in the code). I don't want to do the assignment directly in the code, as I intend to use the function for multiple possible inputs. (In fact, the assignment NEW_PAR=99 is quite a bit more complicated in my original code, since it involves selecting a number based on its position in a range of random values using an R function, but I included it this way for the sake of simplicity.)
Hopefully this is clear enough. Please let me know if you need further information.
Thank you,
Libertad
You can use eval:
eval "$a=\$NEW_GUY"
Be careful and only use it if the value of $a is safe (imagine what happens if $a is set to rm -rf / ; a).
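If avoiding eval is preferred, a hedged alternative (assuming a reasonably recent bash, where printf -v can assign to a variable whose name is held in another variable, including an array element such as CANDIDATES[5]):
printf -v "$a" '%s' "$NEW_GUY"   # writes NEW_GUY into the variable whose name is stored in a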
