Create files of matching pattern from a single data file - shell

I have an ASCII data file (file.dat) with two columns, sorted by the first column. I want to write a script in either shell or awk that creates a new file for each group of records sharing the same value in the first column. Suppose the file contains these four records:
100.00 321342
100.00 434243
100.00 543231
100.50 743893
Two files should therefore be created: one containing the first three records and the other containing the last record, grouped by the value in the first column.
File 1 contains
100.00 321342
100.00 434243
100.00 543231
File 2 contains
100.50 743893

Your file:
100.00 321342
100.00 434243
100.00 543231
100.50 743893
What you need:
perl -a -nE 'qx( echo "$F[0] $F[1]" >> "Timestep_$F[0]" )' file
The output is simply two files, one named Timestep_100.00 and the other Timestep_100.50, so the data is separated by the unique values of the first column. That's it.
$ cat Timestep_100.00
100.00 321342
100.00 434243
100.00 543231
and the other file
$ cat Timestep_100.50
100.50 743893

This script should do the work:
#!/bin/sh
exec 0<file.dat
makeit=yes
while read stp num; do
    if [ -f "Timestep_$stp" ]; then
        echo "File Timestep_$stp exists, exiting."
        makeit=no
        break
    fi
done
if [ "$makeit" = yes ]; then
    exec 0<file.dat
    while read stp num; do
        echo "$stp $num" >> "Timestep_$stp"
    done
    echo "Processing done."
fi
The first loop checks that no file exists, otherwise the result would be wrong.
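Since the question also allows awk, here is a minimal awk sketch of the same grouping (assuming the input file is named file.dat and, as stated, is sorted by the first column):
awk '$1 != prev { close("Timestep_" prev); prev = $1 } { print > ("Timestep_" $1) }' file.dat
The close() call releases each output file once its group is finished, which matters when there are many distinct first-column values. Note that awk's > truncates each Timestep_ file the first time it is opened in the run, so re-running does not duplicate records.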

Related

shell script to create multiple files, incrementing from last file upon next execution

I'm trying to create a shell script that will create a batch of files of a specified amount. When that amount is reached, the script stops. When the script is re-executed, the numbering picks up from the last file created. So if the script creates files 1-10 on the first run, the next execution should create 11-20, and so on.
#!/bin/bash
NAME=XXXX
valid=true
NUMBER=1
while [ $NUMBER -le 5 ]; do
    touch $NAME$NUMBER
    ((NUMBER++))
    echo $NUMBER + "batch created"
    if [ $NUMBER == 5 ]; then
        break
    fi
    touch $NAME$NUMBER
    ((NUMBER+5))
    echo "batch complete"
done
Based on my comment above and your description, you can write a script that will create 10 numbered files (by default) each time it is run, starting with the next available number. As mentioned, rather than just using a raw unpadded number, it's better for general sorting and listing to use zero-padded numbers, e.g. 001, 002, ...
If you just use 1, 2, ... then you end up with odd sorting every time you reach a power of 10. Consider the first 12 files numbered 1...12 without padding; a general listing sort would produce:
file1
file11
file12
file2
file3
file4
...
Here 11 and 12 sort before 2. Adding leading zeros with printf -v avoids the problem.
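For example, here is the padding step on its own (printf -v stores the formatted result in a shell variable instead of printing it):
printf -v num "%03d" 7     ## num now contains "007"
echo "file_$num.txt"       ## prints file_007.txt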
Taking that into account, and allowing the user to change the prefix (first part of the file name) by giving it as an argument, and also change the number of new files to create by passing the count as the 2nd argument, you could do something like:
#!/bin/bash
prefix="${1:-file_}"               ## beginning of filename
number=1                           ## start number to look for
ext="txt"                          ## file extension to add
newcount="${2:-10}"                ## count of new files to create
printf -v num "%03d" "$number"     ## create 3-digit start number
fname="$prefix$num.$ext"           ## form first filename
while [ -e "$fname" ]; do          ## while filename exists
    number=$((number + 1))         ## increment number
    printf -v num "%03d" "$number" ## form 3-digit number
    fname="$prefix$num.$ext"       ## form filename
done
while ((newcount--)); do           ## loop newcount times
    touch "$fname"                 ## create filename
    ((! newcount)) && break        ## newcount 0, break (optional)
    number=$((number + 1))         ## increment number
    printf -v num "%03d" "$number" ## form 3-digit number
    fname="$prefix$num.$ext"       ## form filename
done
Running the script without arguments will create the first 10 files, file_001.txt - file_010.txt. Run a second time, it would create 10 more files file_011.txt to file_020.txt.
To create a new group of 5 files with the prefix of list_, you would do:
bash scriptname list_ 5
Which would result in the 5 files list_001.txt to list_005.txt. Running again with the same options would create list_006.txt to list_010.txt.
Since the scheme above with 3 digits is limited to 1000 files max (if you include 000), there isn't a big need to get the number from the last file written (bash can count to 1000 quite fast). However, if you used 7-digits, for 10 million files, then you would want to parse the last number with ls -1 | tail -n 1 (or version sort and choose the last file). Something like the following would do:
number=$(ls -1 "$prefix"* | tail -n 1 | grep -o '[1-9][0-9]*')
(note: that is ls -(one) not ls -(ell))
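A hedged sketch of that version-sort alternative (GNU ls; -v sorts embedded numbers numerically, so it also works without zero padding, assuming the prefix itself contains no digits):
number=$(ls -1v "$prefix"* 2>/dev/null | tail -n 1 | grep -o '[1-9][0-9]*')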
Let me know if that is what you are looking for.

Loop Script from Input File

I have a reference file with device names in it, for example WABEL8499IPM101. I'm using this script to take the base name (the name without the last 3 digits), look in the reference file, and see which numbers are already used. If 101 is used it will create 102 for me, and 103 as well if I request a total of 2. I'm looking to use an input file so I can run it multiple times, and I'm also trying to figure out how to start at 101 when the base name isn't found in the reference file at all.
I would like to loop this using an input file instead of manually entering bash test.sh WABEL8499IPM 2 each time. I would like to build an input file of all the names that need to be compared and then get the output. It would also be nice that, if there isn't a match, it starts creating names at WABEL8499IPM101 instead of just WABEL8499IPM1.
Input file example:
ColumnA (BASE NAME) ColumnB (QUANTITY)
WABEL8499IPM 2
Script:
SRCFILE="~/Desktop/deviceinfo.csv"
LOGDIR="~/Desktop/"
LOGFILE="$LOGDIR/DeviceNames.csv"
# base name, such as "WABEL8499IPM"
device_name=$1
# quantity, such as "2"
quantityNum=$2
# the largest in sequence, such as "WABEL8499IPM108"
max_sequence_name=$(cat $SRCFILE | grep -o -e "$device_name[0-9]*" | sort --reverse | head -n 1)
# extract the last 3-digit number (such as "108") from max_sequence_name
max_sequence_num=$(echo $max_sequence_name | rev | cut -c 1-3 | rev)
# create new sequence_name
# such as ["WABEL8499IPM109", "WABEL8499IPM110"]
array_new_sequence_name=()
for i in $(seq 1 $quantityNum); do
    cnum=$((max_sequence_num + i))
    array_new_sequence_name+=($(echo $device_name$cnum))
done
# CODE FOR CREATING OUTPUT FILE HERE
# for fn in ${array_new_sequence_name[@]}; do touch $fn; done
# write log
for sqn in ${array_new_sequence_name[@]}; do
    echo $sqn >> $LOGFILE
done
Usage:
bash test.sh WABEL8499IPM 2
Result in the log file:
WABEL8499IPM109
WABEL8499IPM110
Just wrap a loop around your code instead of assuming the args come in on the command line.
SRCFILE="$HOME/Desktop/deviceinfo.csv"   # use $HOME here; a quoted "~" is not expanded
LOGDIR="$HOME/Desktop"
LOGFILE="$LOGDIR/DeviceNames.csv"
while read device_name quantityNum; do
    max_sequence_name=$(grep -o -e "$device_name[0-9]*" "$SRCFILE" |
                        sort --reverse | head -n 1)
    max_sequence_num=${max_sequence_name: -3}
    array_new_sequence_name=()
    for i in $(seq 1 $quantityNum); do
        cnum=$((max_sequence_num + i))
        array_new_sequence_name+=("$device_name$cnum")
    done
    for sqn in "${array_new_sequence_name[@]}"; do
        echo $sqn >> "$LOGFILE"
    done
done < input.file
I'd maybe pass the input file as the parameter now.
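For the second requirement, starting at 101 when the base name is not found in the reference file at all, a minimal sketch is to give the extracted number a default; this would replace the max_sequence_num line inside the loop:
max_sequence_num=${max_sequence_name: -3}    # empty when grep found no match
max_sequence_num=${max_sequence_num:-100}    # default to 100 so the first generated name ends in 101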

How to read a file located in several folders and subfolders in Bash Shell

There are several files named TESTFILE located in directories ~/main1/sub1, ~/main1/sub2, ~/main1/sub3, ..., ~/main2/sub1, ~/main2/sub2, ..., ~/mainX/subY, where mainX is a main folder and subY are the subfolders inside it. The TESTFILE in each main folder/subfolder combination has the same pattern, but the data in each is unique.
Now here's what I want to do:
I want to read a specific number in the TESTFILE for each ~/mainX/subY.
I want to create a text file where every line has the following format [mainX][space][subY][space][value read from TESTFILE]
Some information about TESTFILE and the data I want to get:
It is an OSZICAR file from VASP, a DFT program
The number of lines in OSZICAR varies in different folder-subfolder combination
The information I want to get is always located in the last two lines of the file
The last two lines always look like this:
DAV: 2 -0.942521930239E+01 0.27889E-09 -0.79991E-13 864 0.312E-06
10 F= -.94252193E+01 E0= -.94252193E+01 d E =-.717252E-07
Or in general, the last two lines pattern is:
DAV: a b c d e f
g F= h E0= i d E = j
where the unchanging parts are DAV:, F=, E0= and d E =, and the values I want to get are a, g, h, i and j
Some information about main folder mainX and sub-folder subY:
The folders mainX and subY are all real numbers.
How I want the output to be:
Suppose mainX={0.12, 0.20, 0.34, 0.7} and subY={1.10, 2.30, 4.50, 1.00, 2.78}, and the last two lines of ~/0.12/1.10/OSZICAR are the example above; then my output file should contain:
0.12 1.10 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
0.7 2.30 2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
...
mainX subY a g h i j
How do I do this in the simplest way possible? I'm reading up on grep, awk and sed, and I'm very overwhelmed.
You could do this using some for loops in bash:
for m in ~/main*/; do
    main=$(basename "$m")
    for s in "$m"sub*/; do
        sub=$(basename "$s")
        num=$(tail -n 2 "${s}TESTFILE" | awk -F'[ =]+' 'NR==1{s=$2;next}{print s,$1,$3,$5,$8}')
        echo "$main $sub $num"
    done
done > output_file
I have modified the command to extract the data from your file. It uses tail to read the last two lines of the file. The lines are passed to awk, where they are split into fields using any number of spaces and = signs together as the field separator. The second field from the first of the two lines is saved to the variable s. next skips to the next line, then the columns that you are interested in are printed.
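For example, run against the two sample lines shown in the question (assuming they have no leading whitespace, since a leading blank would shift the awk field numbers by one), the pipeline
tail -n 2 "${s}TESTFILE" | awk -F'[ =]+' 'NR==1{s=$2;next}{print s,$1,$3,$5,$8}'
prints
2 10 -.94252193E+01 -.94252193E+01 -.717252E-07
which is exactly the a g h i j part of the desired output line.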
Your question is not very clear - specifically on how to extract the value from TESTFILE, but this is something like what you want:
#!/bin/bash
for X in {1..100}; do
    for Y in {1..100}; do
        directory="main${X}/sub${Y}"
        echo Checking $directory
        if [ -f "${directory}/TESTFILE" ]; then
            something=$(grep something "${directory}/TESTFILE")
            echo main${X} sub${Y} $something
        fi
    done
done

add a column to multiple text files with bash

I am trying to add a formatted column between columns of multiple text files. By using
awk 'NR==FNR{c[NR]=$3;l=NR;next}{$2=($3+c[1])*c[l]" "$2}7' file file
I can convert a file of the form
1 2 3 4
10 20 30 40
100 200 300 400
to
1 1800 3 4
10 9900 30 40
100 90900 300 400
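To make the arithmetic explicit (this is a reading of the command above, not wording from the question): the value produced for each row is
(that row's $3 + the first row's $3) * the last row's $3
so for example (3 + 3) * 300 = 1800 on the first row, (30 + 3) * 300 = 9900 on the second, and (300 + 3) * 300 = 90900 on the last.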
How can I do the above operation on multiple .dat files?
tmp="/usr/tmp/tmp$$"
for file in *.dat
do
    awk '...' "$file" "$file" > "$tmp" && mv "$tmp" "$file"
done
With regard to your script, though:
awk 'NR==FNR{c[NR]=$3;l=NR;next}{$2=($3+c[1])*c[l]" "$2}7' file file
Never use the letter l (el) as a variable name as it looks far too much like the number 1 (one). I'd actually write it as:
awk 'NR==FNR{c[++n]=$3;next}{$2=($3+c[1])*c[n]" "$2}7' file file
or if memory is a concern for a large file:
awk 'NR==FNR{c[NR==1]=$3;next}{$2=($3+c[1])*c[0]" "$2}7' file file

Paste column to existing file in a loop

I am using the paste command in a bash loop to add new columns to a CSV file. I would like to keep updating the same CSV file. Currently I am using a temporary file to accomplish this:
while [ $i -le $max ]
do
    # create text from grib2
    wgrib2 -d 1.$(($i+1)) -no_header myGribFile.grb2 -text tmptxt.txt
    # paste to temporary file
    paste -d, existingfile.csv tmptxt.txt > tmpcsv.csv
    # overwrite old csv with new csv
    mv tmpcsv.csv existingfile.csv
    ((i++))
done
After adding some columns the copy step gets slow, because the file keeps growing (each tmptxt.txt is about 2 MB, adding up to roughly 100 MB).
Each tmptxt.txt is a plain text file with a single column, one value per row:
1
2
3
.
.
The existingfile.csv would then be
1,1,x
2,2,y
3,3,z
.,.,.
.,.,.
Is there any way to use the paste command to add a column to an existing file? Or is there any other way?
Thanks
Would it be feasible to split the operation in two? One step generates all the intermediate files, and another builds the final output file. The idea is to avoid rereading and rewriting the final file over and over.
The changes to the script would be something like this:
while [ $i -le $max ]
do
    n=$(printf "%05d" $i)  # to preserve lexical order if $max > 9
    # create text from grib2
    wgrib2 -d 1.$(($i+1)) -no_header myGribFile.grb2 -text tmptxt$n.txt
    ((i++))
done
# make final file
paste -d, existingfile.csv tmptxt[0-9]*.txt > tmpcsv.csv
# overwrite old csv with new csv
mv tmpcsv.csv existingfile.csv
This assumes the number of lines output by the program is constant and equal to the number of lines in existingfile.csv (which should be the case, since you are using paste).
Disclaimer: I'm not exactly sure whether this will speed things up (it depends on whether the >> redirection writes to the file exactly once or not). Anyway, give it a try and let me know.
So the basic idea is:
append the output in one go after the loop is done (note the change: wgrib2 now prints to -, which is stdout)
use awk to fold the appended rows back, joining every block of linenum rows (linenum being the number of lines in existingfile.csv) onto the corresponding one of the first linenum rows
save to tempcsv.csv (because I can't find a way to save to the same file)
rename to / overwrite existingfile.csv
while [ $i -le $max ]; do
    # create text from grib2
    wgrib2 -d 1.$(($i+1)) -no_header myGribFile.grb2 -text -
    ((i++))
done >> existingfile.csv
awk -v linenum=4 '
    { i = FNR % linenum                              # which of the first linenum rows this line belongs with
      if (i in array) array[i] = array[i] "," $0     # later blocks: append as a new comma-separated column
      else array[i] = $0 }                           # first block: the original CSV row
    END { for (i = 1; i <= linenum; i++) print array[i % linenum] }  # rows 1..linenum-1, then row 0 (the linenum-th)
' existingfile.csv > tempcsv.csv
mv tempcsv.csv existingfile.csv
If this is how I imagine it would work (internally), you should have 2 writes to existingfile.csv instead of $max number of writes. So hopefully this would speed things up.
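To make the folding concrete, here is a small trace with linenum=4 and a single appended block (the values are illustrative only). If existingfile.csv looks like this after the append loop:
1,1,x
2,2,y
3,3,z
4,4,w
10
20
30
40
then after the awk step existingfile.csv becomes:
1,1,x,10
2,2,y,20
3,3,z,30
4,4,w,40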
