I have two files, smaller and bigger; bigger contains every line of smaller. Matching lines are almost identical: only the last column differs.
file_smaller
A NM 0
B GT 4
file_bigger
A NM 5 <-same as in file_smaller according to my rules
C TY 2
D OP 6
B GT 3 <-same as in file_smaller according to my rules
I would like to print the lines of file_bigger that have no counterpart in file_smaller, that is:
wished_output
C TY 2
D OP 6
Could you please help me to do so? Thanks a lot.
you can do the following:
cat file_bigger file_smaller | sed 's/[^ ]*$//' | sort | uniq -u > temp_pat
grep -f temp_pat file_bigger ; rm temp_pat
which will (in the same order)
merge the files
remove the last column
sort the result
print only unique lines in temp_pat
find the original lines in file_bigger
all in all, the expected result.
awk 'NR==FNR {arr[$1 FS $2]; next}   # first file: remember the key (columns 1 and 2)
!(($1 FS $2) in arr)                 # second file: print lines whose key was not seen
' file_smaller file_bigger
See if that meets your needs.
grep -vf <(cut -d " " -f 1-2 file_smaller| sed 's/^/^/') file_bigger
The process substitution results in this:
^A NM
^B GT
Then, grep -v removes those patterns from "file_bigger"
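One caveat worth checking: a pattern such as ^A NM also matches a line whose second column merely starts with NM. Below is a minimal sketch of a stricter variant, assuming single-space separators and keys without regex metacharacters. The A NMX 7 row and the patterns file name are invented for the demonstration.

```shell
# Recreate the sample files in a scratch directory (A NMX 7 is an invented extra row).
dir=$(mktemp -d)
printf '%s\n' 'A NM 0' 'B GT 4' > "$dir/file_smaller"
printf '%s\n' 'A NM 5' 'A NMX 7' 'C TY 2' 'D OP 6' 'B GT 3' > "$dir/file_bigger"

# Build one anchored pattern per key, e.g. "^A NM " (note the trailing space).
cut -d ' ' -f 1-2 "$dir/file_smaller" | sed 's/^/^/; s/$/ /' > "$dir/patterns"

# Keep only the lines of file_bigger that match none of the patterns.
result=$(grep -v -f "$dir/patterns" "$dir/file_bigger")
printf '%s\n' "$result"
```

With the trailing space in place, A NM 5 and B GT 3 are filtered out while A NMX 7 survives.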
Bash 4 using associative arrays:
#!/usr/bin/env bash
f() {
if (( $# != 2 )); then
echo "usage: ${FUNCNAME} <smaller> <bigger>" >&2
return 1
fi
local -A smaller
local -a x
while read -ra x; do
smaller["${x[@]::2}"]=0
done <"$1"
while read -ra x; do
((${smaller["${x[@]::2}"]:-1})) && echo "${x[*]}"
done <"$2"
}
f /dev/fd/3 /dev/fd/0 <<"SMALLER" 3<&0 <<"BIGGER"
A NM 0
B GT 4
SMALLER
A NM 5
C TY 2
D OP 6
B GT 3
BIGGER
I have a .txt file with numeric indices of certain 'outlier' data points, each on their own line, called by $outlier_file:
1
7
30
43
48
49
56
57
65
Using the following code, I can successfully remove certain files (volumes of neuroimaging data in this case) by using while + read.
while read outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm $DWI_path/$1/vol000*"$outlier".nii.gz;
done < $outlier_file
However, I also need to remove the numbers located at these 'outlier' indices from another text file stored in $bvec_file, which has 69 columns & 3 rows. Within each row, the numbers are space delimited. So e.g., for this example, I need to remove all 3 rows of column 1, 7, 30, etc. and then save this version with the outliers removed into a new *.txt file.
0 0.9988864166 -0.0415925034 -0.06652866169 -0.6187155495 0.2291534462 0.8892356214 0.7797364286 0.1957395685 0.9236669465 -0.5400265342 -0.3845263463 -0.4903989539 0.4863306385 -0.6496130843 0.5571164636 0.8110081715 0.9032142094 -0.3234596075 -0.1551409525 -0.806059879 0.4811597826 -0.7820757748 -0.9528881463 0.1916556621 -0.007136403284 -0.2459431735 -0.7915263574 -0.1938049261 -0.1578786349 0.8688043633 -0.5546072294 -0.4019951732 0.2806154851 0.3478762022 0.9548067252 -0.9696777541 -0.4816255837 -0.7962240023 0.6818610905 0.7097978218 0.6739686799 0.1317547111 -0.7648252249 -0.1456021218 -0.5948047487 0.0934205064 0.5268769564 -0.8618324858 -0.3721029232 -0.1827616535 0.691353613 0.4159071597 0.4605505287 0.1312199424 0.426674893 -0.4068291509 0.7167859082 0.2330824665 0.01909161256 -0.06375254731 -0.5981122948 -0.2672253674 0.6875472994 0.2302943724 0 0 0 0
0 0.04258194557 0.9988207007 0.6287131425 0.7469024143 0.5528476637 0.3024964957 0.1446931241 0.9305823612 0.1675139932 0.8208211337 0.8238722992 0.5983722761 0.4238174961 0.639429196 0.1072148887 0.5551578885 0.003337599176 0.511740508 0.9516619405 0.3851404227 0.8526321065 0.1390947346 0.2030449535 0.7759459569 0.165587903 0.9523372297 0.5801228933 0.3277276562 0.7413928896 0.442482978 0.2320585706 0.1079269171 0.1868672655 0.1606136006 0.2968573235 0.1682337977 0.8745679247 0.5989061899 0.4172933119 0.01746934331 0.5641480832 0.7455469091 0.3471016571 0.8035001467 0.5870623128 0.361107261 0.8192579877 0.4160218909 0.5651330299 0.4070513153 0.7221181184 0.714223583 0.6971767133 0.4937978446 0.4232911691 0.8011701162 0.2870385494 0.9016941521 0.09688949547 0.9086826131 0.2631932421 0.152678096 0.6295753848 0.9712458578 0 0 0 0
0 -0.02031513434 -0.02504539005 -0.7747862425 0.2435730944 0.8011542666 0.343155766 -0.6091592581 -0.3093581909 -0.3446424728 -0.1860752773 -0.4163819443 -0.6336083058 0.7641081337 -0.4112580017 -0.8234841915 0.1845683194 0.4291770641 -0.7959243273 -0.2650864686 0.449371034 -0.203724703 0.6074620459 0.2253373638 -0.6009791836 -0.9861692137 0.1804598471 0.1922068008 -0.9246806119 0.6522353256 -0.2222336438 0.7990992685 -0.9092588527 -0.9414539684 0.9236803664 0.0148272357 -0.1772637652 0.05628269894 -0.08566629406 -0.6007759525 0.7041888058 0.4769729119 0.6532997034 -0.5427364139 -0.5772239915 0.5491494803 0.9278330427 0.2263117816 -0.290121617 0.7363179158 0.8949343019 -0.02399176716 0.5629439653 -0.5493977074 -0.8596191107 -0.7992328333 0.4388809483 0.6354737076 0.3641705918 0.9951120218 0.412591228 -0.75696169 0.9514620339 -0.3618197699 0.06038199928 0 0 0 0
The furthest I've gotten with one approach is using awk to select the right columns (just printing them for now), but I can only get this to work if I hard-code $1 (i.e., the numeric index of the first outlier column):
awk -F ' ' '{print $1}' $bvec_file
If I try to refer to the value in $outlier, it doesn't work. Instead, this prints the entire contents of $bvec_file
while read outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm $DWI_path/$1/vol000*"$outlier".nii.gz;
# Remove outlier #'s from bvec file
awk -F ' ' '{print $1}' $bvec_file
done < $outlier_file
I am completely stuck on how to get this done. Any advice would be greatly appreciated.
To delete the outliers from bvec_file after the loop and only delete the ones where the associated file was successfully removed:
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
while IFS= read -r outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm "$DWI_path/$1"/vol000*"$outlier".nii.gz &&
echo "$outlier"
done < "$outlier_file" |
awk '
NR==FNR { os[$0]; next }
{
for (o in os) {
$o=""
}
$0=$0; $1=$1
}
1' - "$bvec_file" > "$tmp" &&
mv "$tmp" "$bvec_file"
Or to delete the outliers one at a time as the files are removed:
#!/usr/bin/env bash
tmp=$(mktemp) || exit 1
while IFS= read -r outlier; do
# Remove current outlier vol from eddy unwarped DWI data
rm "$DWI_path/$1"/vol000*"$outlier".nii.gz &&
# Remove outlier #'s from bvec file
awk -v o="$outlier" '{$o=""; $0=$0; $1=$1} 1' "$bvec_file" > "$tmp" &&
mv "$tmp" "$bvec_file"
done < <(sort -rnu "$outlier_file")
Always quote your shell variables, see https://mywiki.wooledge.org/Quotes, and the && at the end of each line is to ensure the next command only runs if the previous commands succeeded.
The magical incantation in the awk script does the following. Let's say your input is a b c and the outlier field is field number 2, b:
$ echo 'a b c'
a b c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; print NF ":", $0}'
3: a  c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; $0=$0; print NF ":", $0}'
2: a  c
$
$ echo 'a b c' | awk -v o=2 '{$o=""; $0=$0; $1=$1; print NF ":", $0}'
2: a c
The $o="" sets the field value to null, the $0=$0 forces awk to resplit $0 into fields, so it effectively deletes field 2 (as opposed to the previous step, which set it to null but left it in place as an empty field), and the $1=$1 rebuilds $0 from its fields, replacing every FS (any contiguous run of whitespace characters, including the 2 blanks now between a and c) with OFS (a single blank character).
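For comparison, here is a minimal sketch of a different technique: instead of blanking fields and resplitting, it rebuilds each row from the fields that are kept. drop_columns is a hypothetical helper name and the two-row input is invented. It assumes whitespace-separated input and a space-separated list of column numbers.

```shell
# drop_columns "1 3" < file : print every row without the listed column numbers.
# (drop_columns is a hypothetical helper name, not part of the answer above.)
drop_columns() {
  awk -v cols="$1" '
    BEGIN { n = split(cols, c, " "); for (i = 1; i <= n; i++) drop[c[i]] }
    {
      out = ""; sep = ""
      for (i = 1; i <= NF; i++)
        if (!(i in drop)) { out = out sep $i; sep = OFS }
      print out
    }'
}

printf '%s\n' 'a b c d' 'e f g h' | drop_columns "1 3"
```

Because all columns are dropped in one pass, the order of the outlier indices does not matter, whereas the one-at-a-time loop above has to process them from highest to lowest.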
I am stuck on this. I have a while-read loop in my code that takes a long time, and I would like to run it on many processors. I'd like to split the input file and run 14 loops in parallel (because I have 14 threads), one for each split file. The thing is that I don't know how to tell the while loop which file to read and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file into 14 files and run 14 while loops in parallel, one for each split file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But in neither case was I able to specify which file would be the input to the loop, so that parallel would understand "take each of those xa* files and feed them to individual loops in parallel".
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics on DNA sequence alignments that were already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence accessions, reads it line by line and looks up the corresponding full taxonomy for each accession in a databank (~45000 lines). Then it does a few sums and divisions. The output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves to find how many hits (or alignments) a given sequence had against the databank. It was also made previously, outside this loop. Its number of lines can vary from hundreds to thousands. Only columns 1 and 2 matter here. Column 1 is the name of the sequence and column 2 is the name of each sequence it matched in the databank. It looks like this:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies corresponding to the DNA sequences. Each taxonomy is on one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, take sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output which has the accessions of all the sequences it matched. The loop finds the taxonomies for all those sequences and calculates the "statistics" I mentioned previously. The code does this for all the DNA sequence accessions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing, so it takes about 12h on one processor to do this for all 45000 DNA sequence accessions. So I would like to split input_file and run it on all the processors I have (14), because that would reduce the time spent.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[@]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart --block -10 doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said, I think your real problem is spawning too many programs: if you rewrite doit in Perl or Python, I expect you will see a major speedup.
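To illustrate the point about spawning fewer programs, here is a minimal sketch that uses a single awk process (rather than Perl or Python) to replace one grep per input line with one hash lookup per line. It covers only the taxonomy lookup, not the statistics. lookup_taxonomy is a hypothetical helper name, the taxonomy strings are shortened and invented, and the file is assumed to be tab-separated as the .tsv name suggests.

```shell
dir=$(mktemp -d)
# Shortened, invented taxonomy strings; the real file has more levels per line.
printf 'KP143082.1.1457\tD_0__Bacteria;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens\n' > "$dir/taxonomy.tsv"
printf 'KF625180.1.1799\tD_0__Bacteria;D_5__Bacillus;D_6__Bacillus_atrophaeus\n' >> "$dir/taxonomy.tsv"
printf '%s\n' 'KF625180.1.1799' 'KP143082.1.1457' > "$dir/input_file"

# lookup_taxonomy TAXONOMY INPUT: print each accession with its last taxonomy level.
lookup_taxonomy() {
  awk -F '\t' '
    NR == FNR { tax[$1] = $2; next }          # first file: accession to taxonomy map
    $1 in tax { n = split(tax[$1], lvl, ";")  # second file: one hash lookup per line
                print $1 "\t" lvl[n] }' "$1" "$2"
}

lookup_taxonomy "$dir/taxonomy.tsv" "$dir/input_file"
```

The taxonomy file is read once into memory, so the cost per accession no longer depends on the size of the databank.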
As an alternative I threw together a quick test.
#!/usr/bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b   # gshuf is GNU shuf; on most Linux systems the command is plain "shuf"
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
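function process {
while read -r line; do
echo "$line"
done < "$1"
}
function loop {
file=$1
chunks=$2
dir=$(mktemp -d)
cd "$dir" || return 1
split -n "l/$chunks" "$file"
for i in *; do
process "$i" &
done
wait # let every background job finish before its input is deleted
cd - > /dev/null && rm -rf "$dir"
}
loop /tmp/foo 14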
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for you. I am not familiar with parallel, so instead this uses bash's native way of spawning background processes with &:
function loop () {
while IFS= read -r -d $'\n'
do
: # YOUR BIG STUFF goes here (the ":" keeps the otherwise empty loop body valid)
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[@]}"
do loop "${i}" &
done
wait
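A minimal end-to-end sketch of the same idea, assuming GNU split (for -n l/N) and a per-line job that is independent of the other lines. The work function, the chunk- prefix and the doubling step are invented placeholders for the real loop body.

```shell
dir=$(mktemp -d)
seq 1 20 > "$dir/input_file"

# Split into 4 chunks without breaking lines (GNU split).
split -n l/4 "$dir/input_file" "$dir/chunk-"

# work FILE: hypothetical per-line job; here it just doubles each number.
work() {
  while IFS= read -r line; do
    echo $((line * 2))
  done < "$1" > "$1.out"
}

# One background loop per chunk, then wait for all of them.
for f in "$dir"/chunk-??; do
  work "$f" &
done
wait

# Concatenate the per-chunk outputs in chunk order.
cat "$dir"/chunk-??.out > "$dir/result"
```

Writing each worker's output to its own file and concatenating afterwards keeps the results in input order, which interleaved writes to a shared stdout would not guarantee.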
I have a big txt file with 2 columns and more than 2 million rows. Every value represents an id and there may be duplicates. There are about 100k unique ids.
1342342345345 34523453452343
0209239498238 29349203492342
2349234023443 99203900992344
2349234023443 182834349348
2923000444 9902342349234
I want to identify each id and re-number all of them starting from 1. It should re-number duplicates also using the same new id. If possible, it should be done using bash.
The output could be something like:
123 485934
34 44834
167 34564
167 2345
2 34564
Doing this in pure bash will be really slow. I'd recommend:
tr -s '[:blank:]' '\n' <file |
sort -un |
awk '
NR == FNR {id[$1] = FNR; next}
{for (i=1; i<=NF; i++) {$i = id[$i]}; print}
' - file
Output:
4 8
3 7
5 9
5 2
1 6
With bash and sort:
#!/bin/bash
shopt -s lastpipe
declare -A hash # declare associative array
index=1
# read file and fill associative array
while read -r a b; do
echo "$a"
echo "$b"
done <file | sort -nu | while read -r x; do
hash[$x]="$((index++))"
done
# read file and print values from associative array
while read -r a b; do
echo "${hash[$a]} ${hash[$b]}"
done < file
Output:
4 8
3 7
5 9
5 2
1 6
See: man bash and man sort
Pure Bash, with a single read of the file:
declare -A hash
index=1
while read -r a b; do
[[ ${hash[$a]} ]] || hash[$a]=$((index++)) # assign index only if not set already
[[ ${hash[$b]} ]] || hash[$b]=$((index++)) # assign index only if not set already
printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed
Notes:
the index is assigned in the order read (not based on sorting)
we make a single pass through the file (not two as in other solutions)
Bash's read is slower than awk; however, if the same logic is implemented in Perl or Python, it will be much faster
this solution is more CPU bound because of the hash lookups
Output:
1 2
3 4
5 6
5 7
8 9
Just keep a monotonic counter and a table of seen numbers; when you see a new id, give it the value of the counter and increment:
awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input
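The same counter-and-table idea can be sketched for any number of columns by looping over the fields. renumber is a hypothetical helper name; the input is the sample from the question.

```shell
# renumber [FILE...]: replace every id by the order in which it was first seen.
# (renumber is a hypothetical helper name.)
renumber() {
  awk '{
    for (i = 1; i <= NF; i++) {
      if (!($i in id)) id[$i] = ++n   # new id: hand out the next counter value
      $i = id[$i]
    }
    print
  }' "$@"
}

printf '%s\n' '1342342345345 34523453452343' '0209239498238 29349203492342' \
  '2349234023443 99203900992344' '2349234023443 182834349348' \
  '2923000444 9902342349234' | renumber
```

The "in" test checks for membership without creating an empty array entry, so the table only ever holds ids that have been assigned a number.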
awk 'NR==FNR { ids[$1] = ++c; next }
{ print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in
Join the two columns into one, then sort that into numerical order (-n) and make it unique (-u), before using awk to turn this sequence into an array that maps old ids to new ids.
Then for each line in input, swap ids and print.
I have a text file with repeated data patterns, and grep keeps returning all matches at once instead of one block at a time.
for ((count = 1; count !=17; count++)); do # 17 times
xuz1[count]=`grep -e "1 O1" $out_file | cut -c10-29`
xuz2[count]=`grep -e "2 O2" $out_file | cut -c10-29`
xuz3[count]=`grep -e "3 O3" $out_file | cut -c10-29`
echo ${xuz1[count]}
echo ${xuz2[count]}
echo ${xuz3[count]}
done
data looks like:
some text.....
Text....
.....
1 O1 111111 111111 111111
2 O2 222211 222211 222211
3 O3 643653 652346 757686
some text.....
1 O1 111122 111122 111122
2 O2 222222 222222 222222
3 O3 343653 652346 757683
some text.....
1 O1 111333 111333 111333
2 O2 222333 222333 222333
3 O3 343653 652346 757684
.
.
.
And result I'm getting:
xuz1[1] = 111111 111111 111111
xuz2[1] = 222211 222211 222211
xuz3[1] = 643653 652346 757686
xuz1[2] = 111111 111111 111111
xuz2[2] = 222211 222211 222211
xuz3[2] = 643653 652346 757686
...
looking for result like this:
xuz1[1]=111111 111111 111111
xuz2[1]=222211 222211 222211
xuz3[1]=343653 652346 757683
xuz1[2]=111122 111122 111122
xuz2[2]=222222 222222 222222
xuz3[2]=343653 652346 757684
also tried "grep -m 1 -e"
Which way should I go?
for now I ended up with one line
grep -A4 -e "1 O1" $out_file | cut -c10-29
Some text.... Is a huge text part.
A little bash script with a single grep is enough
grep -E '^ *[0-9]+ +O[0-9]+ ' "$out_file" |
while read -r idx oidx cols; do
# row number 1 starts a new block
if ((idx == 1)); then
let ++block
fi
echo "xuz${idx}[$block]=$cols"
done
You haven't really described what you want, but I guess something like this.
awk '! /^[1-9][0-9]* O[0-9] / { n++; m=0; if (NR>1) print ""; next }
{ print "xuz" ++m "[" n "]=" substr($0, 10) }' "$out_file"
If the regex doesn't match, we assume we are looking at one of the "some text" pieces, and that this starts a new record. Increment n and reset m. Otherwise, print the output for this item within this record.
If some text could be more than one line, you will need a minor change, but I hope this should be enough at least to send you in the right direction.
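One possible form of that change, as a minimal sketch: count a new record whenever a data row numbered 1 appears, instead of on every non-matching line. extract_blocks is a hypothetical helper name, and the sketch assumes each block starts with row 1 and that the label in column 2 is O followed by digits.

```shell
# extract_blocks [FILE...]: print xuzROW[BLOCK]=values for every data row.
# (extract_blocks is a hypothetical helper name.)
extract_blocks() {
  awk '
    $1 ~ /^[0-9]+$/ && $2 ~ /^O[0-9]+$/ {
      if ($1 == 1) n++                    # row 1 opens a new block
      row = $1
      sub(/^[ \t]*[0-9]+[ \t]+O[0-9]+[ \t]+/, "")
      print "xuz" row "[" n "]=" $0
    }' "$@"
}

printf '%s\n' 'some text.....' 'Text....' ' 1 O1 111111 111111 111111' \
  ' 2 O2 222211 222211 222211' 'some text.....' ' 1 O1 111122 111122 111122' |
  extract_blocks
```

Any number of text lines between blocks is simply skipped, since only matching rows are examined.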
You can do this in pure Bash, too, though this is going to be highly inefficient - you would expect a Bash while read loop to be at least a hundred times slower than Awk, and the code is markedly less idiomatic and elegant.
while read -r m x result; do
case $m::$x in
[1-9]::O[1-9])
printf 'xuz%d[%d]=%s\n' "$m" "$n" "$result";;
*)
# If n is unset, don't print an empty line
[ -n "${n+set}" ] && printf '\n'
let n++;;
esac
done <"$out_file"
I would aggressively challenge any requirement to do this in pure Bash. If it's for homework, the requirement is unrealistic, and a core skill for shell script authors is to understand the limits of the shell and the strengths of the common support tools like Awk. The Awk language is virtually guaranteed to be available wherever you have a shell, in particular a heavy shell like Bash. (In a limited e.g. embedded environment, a limited shell like Dash would make more sense. Then e.g. the let keyword won't be available, though it should not be hard to make this script properly portable.)
The case statement accepts glob patterns, not regular expressions, so the pattern here is slightly less general (we accept one positive digit in the first field).
Thank you all for participating in discussion.
*** This is a home project to help my wife extract data from research calculations; the speed-up is around 400 times. ***
The file the data is extracted from contains around 2000 lines. The data blocks I need look like this, and they are repeated 10-20 times in the file:
uiyououy COORDINATES
NR ATOM CCCCC X Y Z
1 O1 8.00 0.000000000 0.882236820 -0.789494235
2 O2 8.00 0.000000000 -1.218250722 -1.644061652
3 O3 8.00 0.000000000 1.218328524 0.400260050
4 O4 8.00 0.000000000 -0.882314622 2.033295837
Text text text text
tons of text
To extract the 4 lines I used the expression below:
# grep 4 lines at once into a txt file
grep -A4 --no-group-separator -e "1 O1" $from_file | cut -c23-64 > xyz_temp.txt
# delete empty lines from the xyz txt file
sed -i '/^[ \t]*$/d' xyz_temp.txt
Next is to convert the strings into numbers (the arithmetic has to go through '| bc -l'):
while IFS= read line
do
IFS=' ' read -r -a arr_line <<< "$line"
# break line of xyz into 3 numbers
s1=$(echo "${arr_line[0]}" \* 0.529177249 | bc -l)
# some math convertion
s2=$(echo "${arr_line[1]}" \* 0.529177249 | bc -l)
s3=$(echo "${arr_line[2]}" \* 0.529177249 | bc -l)
#-------to array non sorted ------------
arr[$n]=${n}";"${from_file}";"${gd_}";"${frt[count_4s]}";"${n4}";"${s1}";"${s2}";"${s3}
echo ${arr[n]}
#--------------------------------------------
done <"$from_file_txt"
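The loop above starts bc three times per line. A minimal sketch of the same scaling done in a single awk process follows, using the factor 0.529177249 from the code above. scale_xyz is a hypothetical helper name, and awk works in double precision, so it prints fewer digits than bc -l.

```shell
# scale_xyz [FILE...]: multiply the three columns of every row by the factor used above.
# (scale_xyz is a hypothetical helper name.)
scale_xyz() {
  awk -v f=0.529177249 '{ printf "%.9f %.9f %.9f\n", $1 * f, $2 * f, $3 * f }' "$@"
}

printf '%s\n' '1 2 0' | scale_xyz
```

Called as scale_xyz xyz_temp.txt, this would replace the three bc pipelines per row with one process for the whole file.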
Sort the array:
IFS=$'\n' sorted=($(sort -t \; -k4 -k5 -g <<<"${arr[*]}"))
# -t separator ';' -k column -g generic * to get new line output
#-k4 -k5 sort by column 4 then5
#printf "%s\n" "${sorted[*]}"
unset IFS
The last part combines the data into the result view:
echo "$n"
n2=1
n42=1
count_4s2=1
i=0
echo "============================== sorted =============================="
################### loop for empty 4s lines
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
printf "%s\n" "${sorted[i]}"
while [ $i -lt $((n-2)) ]
do
i=$((i+1))
if [ "$n42" = "4" ] # 1234
then n42=0
count_4s2=$((count_4s2+1))
printf "%s" ";" ";" ";" ";" ";" "${count_4s2}" ";"
printf "%s\n"
fi
#--------------------------------------------
n2=$((n2+1))
n42=$((n42+1))
printf "%s\n" "${sorted[i]}"
done ############# while
#00000000000000000000000000000000000000
printf "%s\n"
echo ==END===END===END==
Output looks like this
============================== sorted ==============================
;;;;;1;
17;A-13_A1+.out;1.3;0.4;1;0;.221176355474853043;-.523049776514580244
18;A-13_A1+.out;1.3;0.4;2;0;-.550350051428402955;-.734584881824005358
19;A-13_A1+.out;1.3;0.4;3;0;.665269869069959489;.133910683627893251
20;A-13_A1+.out;1.3;0.4;4;0;-.336096173116409577;1.123723974181515102
;;;;;2;
13;A-13_A1+.out;1.3;0.45;1;0;.279265277182782148;-.504490787956469897
14;A-13_A1+.out;1.3;0.45;2;0;-.583907412327951988;-.759310392973448167
15;A-13_A1+.out;1.3;0.45;3;0;.662538493711206290;.146829200993661293
16;A-13_A1+.out;1.3;0.45;4;0;-.357896358566036450;1.116971979936256771
;;;;;3;
9;A-13_A1+.out;1.3;0.5;1;0;.339333719743262501;-.482029749553797105
10;A-13_A1+.out;1.3;0.5;2;0;-.612395507070451545;-.788968880150283253
11;A-13_A1+.out;1.3;0.5;3;0;.658674809217196345;.163289820251690233
12;A-13_A1+.out;1.3;0.5;4;0;-.385613021360830052;1.107708808923212876
==END===END===END==
*Note: some code might not be shown here.
The next step is to paste it into Excel with ; as the separator.
I have two files, numbers.txt (1 \n 2 \n 3 \n 4 \n 5 \n) and alpha.txt (a \n n \n c \n d \n e \n).
Now I want to iterate over both files at the same time, something like:
for num in `cat numbers.txt` && alpha in `cat alpha.txt`
do
echo $num "blah" $alpha
done
Or other idea I was having is
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
break
done
done
but this kind of code always takes the first value of $alpha.
I hope my problem is clear enough.
Thanks in advance.
Here is what I actually intended to do (it's just an example).
I have one more file, template.txt, with this content:
variable1= NUMBER
variable2= ALPHA
I wanted to take the contents of the two files, numbers.txt and alpha.txt (one line from each at a time), and replace NUMBER and ALPHA with the respective content from those two files.
So here is what I did once I learned how to iterate over both files together:
paste number.txt alpha.txt | while read num alpha
do
cp template.txt temp.txt
sed -i "{s/NUMBER/$num/g}" temp.txt
sed -i "{s/ALPHA/$alpha/g}" temp.txt
cat temp.txt >> final.txt
done
Now what I have in final.txt is:
variable1= 1
variable2= a
variable1= 2
variable2= b
variable1= 3
variable2= c
variable1= 4
variable2= d
variable1= 5
variable2= e
variable1= 6
variable2= f
variable1= 7
variable2= g
variable1= 8
variable2= h
variable1= 9
variable2= i
variable1= 10
variable2= j
It's a very simple and naive approach. I wanted to know: is there any other way to do this?
Any suggestion will be appreciated.
No, your question isn't clear enough. Specifically, the way you wish to iterate through your files is unclear, but assuming you want to have an output such as:
1 blah a
2 blah b
3 blah c
4 blah d
5 blah e
you can use the paste utility, like this:
paste numbers.txt alpha.txt | while read num alpha ; do
echo "$num blah $alpha"
done
or even:
paste -d# numbers.txt alpha.txt | sed 's/#/ blah /'
Your first loop is impossible in bash. Your second one, without the break, would combine each line from numbers.txt with each line from alpha.txt, like this:
1 AND a
1 AND n
1 AND c
...
2 AND a
...
3 AND a
...
4 AND a
...
Your break makes it skip all lines from the alpha.txt, except the 1st one (bmk has already explained it in his answer)
It should be possible to organize the correct loop using the while loop construction, but it would be rather ugly.
There are lots of easier alternatives which may be a better choice, depending on the specifics of your task. For example, you could try this:
paste numbers.txt alpha.txt
or, if you really want your "AND"s, then, something like this:
paste numbers.txt alpha.txt | sed 's/\t/ AND /'
And if your numbers are really sequential (and you can live without 'AND'), you can simply do:
cat -n alpha.txt
Here is an alternate solution according to the first model you suggested:
while read -u 5 a && read -u 6 b
do
echo $a $b
done 5<numbers.txt 6<alpha.txt
The notation 5<numbers.txt tells the shell to open numbers.txt using file descriptor 5. read -u 5 a means read a value for a from file descriptor 5, which has been associated with numbers.txt.
The advantage of this approach over paste is that it gives you fine-grain control over how you merge the two files. For example you could read one line from the first file and twice from the second file.
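A minimal sketch of that last case, one line from the first file paired with two lines from the second. It uses read ... <&5, the POSIX spelling of read -u 5. merge_one_two is a hypothetical helper name and the file contents are invented for the demonstration.

```shell
dir=$(mktemp -d)
printf '%s\n' 1 2 > "$dir/numbers.txt"
printf '%s\n' a b c d > "$dir/alpha.txt"

# merge_one_two NUMBERS ALPHA: pair each line of NUMBERS with two lines of ALPHA.
merge_one_two() {
  while IFS= read -r num <&5 && IFS= read -r first <&6 && IFS= read -r second <&6; do
    echo "$num $first $second"
  done 5< "$1" 6< "$2"
}

result=$(merge_one_two "$dir/numbers.txt" "$dir/alpha.txt")
printf '%s\n' "$result"
```

The loop stops as soon as either file runs out of lines, because the reads are chained with &&.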
In your second example the inner loop is executed only once because of the break. It will simply jump out of the loop, i.e. you will always only get the first element of alpha.txt. Therefore I think you should remove it:
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
done
done
If nested loops aren't specifically your requirement and you just want the corresponding lines, then you may try the following code:
for line in `cat numbers.txt`
do
echo $line "and" $(cat alpha.txt| head -n$line | tail -n1 )
done
The head gets you the number of lines equal to the value of line and tail gets you the last element.
@tollboy, I think the answer you are looking for is this:
count=1
for item in $(paste number.txt alpha.txt); do
if [[ "${item}" =~ [a-zA-Z] ]]; then
echo "variable${count}= ${item}" >> final.txt
elif [[ "${item}" =~ [0-9] ]]; then
echo "variable${count}= ${item}" >> final.txt
fi
count=$((count % 2 + 1)) # alternate between variable1 and variable2
done
When you type paste number.txt alpha.txt in your console, you see:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
From bash's point of view, the result of $(paste number.txt alpha.txt) looks like this:
1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 10 j
So for each item in that list, figure out if it is alpha or numeric, and print it to the output file.
Lastly, increment the count.
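Another way to get the same final.txt is to run sed on the template itself for every pair, which avoids the temporary copy. This is a minimal sketch, assuming the values contain no characters that are special to sed (such as / or &). fill_template is a hypothetical helper name and the two-line inputs are invented for the demonstration.

```shell
dir=$(mktemp -d)
printf '%s\n' 1 2 > "$dir/numbers.txt"
printf '%s\n' a b > "$dir/alpha.txt"
printf '%s\n' 'variable1= NUMBER' 'variable2= ALPHA' > "$dir/template.txt"

# fill_template NUMBERS ALPHA TEMPLATE: print one filled-in template per input pair.
fill_template() {
  paste "$1" "$2" | while read -r num alpha; do
    sed -e "s/NUMBER/$num/g" -e "s/ALPHA/$alpha/g" "$3"
  done
}

fill_template "$dir/numbers.txt" "$dir/alpha.txt" "$dir/template.txt" > "$dir/final.txt"
cat "$dir/final.txt"
```

Redirecting once after the function call also avoids reopening final.txt on every iteration.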