So I am trying to collect a series of numbers that follow a string. However, the string's position in the block of text I am searching through can change a lot.
Here is the full code I am working with right now:
echo name HF MP2 | cat > allE
for i in *.out
do grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' >> $i.mp2 &&
grep 'AG\\HF' $i | cut -d "=" -f3 | cut -c 1-13 | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp2 &&
paste $i.mp2 >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
rm *.energies #temp.txt
rm *.name
rm *.mp2
The string I am searching for is AG\HF.
The problem is that the data it is searching through can look like this (note: there are actual newline characters in this data, which I think is causing part of the problem):
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-
1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69
74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.
\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.
,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697
4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1
\H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,-
1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782
2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94
11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev
D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H
[SGH(C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0
.2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822
,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411,
3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0
1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [
SGH(C4H4),X(C8H8)]\\#
And several other possible combinations.
Currently I am using grep followed by cut with = as the delimiter, and that works about 60% of the time.
The grep reads like so:
grep 'AG\\HF' $i | cut -d "=" -f3 | cut -c 1-13 | tr '\n' ' ' >> $i.hf
This pattern will match many times in the same file.
Any suggestions for making the collection of the desired 13 characters consistently would be greatly appreciated.
The end result should look like:
-461.4440942 -461.4441024 -461.4441114 -461.4441212 -461.4441321 -461.4441575 -461.4441725 -461.4441893 -461.444208 -461.4442289 -461.4442522 -461.444278 -461.4443063 -461.4443371 -461.4444054 -461.4444421 -461.4444798 -461.4445175 -461.4445544 -461.4445891
What I actually get is a combination of the desired output and this:
-461.4417716\ -461.4413023\ 1-AG\HF -461.439848\M -461.4387568\ -461.4373225\ -461.4354367\ -461.4329522\ -461.4296709\ -461.4253285\ -461.419576\M -461.4119582\ 1-AG\HF -461.4432257\ -461.4431843\ -461.4431419\ -461.443098\M -461.4430519\ 1-AG\HF -461.4429461\ -461.4428799\ -461.4427974\ -461.4426902\ -461.4425469\ -461.4423525\ -461.4420882\ -461.4417302\ -461.4412489\ 1-AG\HF -461.439758\M -461.4386392\ -461.4371684\ -461.4352341\ -461.4326853\ -461.4293183\ -461.4248614\ -461.4189557\ -461.411132\M 1-AG\HF -461.4432226\ -461.443181\M -461.4431381\ -461.4430938\ -461.4430472\ 1-AG\HF -461.4429401\ -461.4428728\ -461.4427889\ -461.44268\MP -461.4425343\ -461.4423369\ -461.4420684\ -461.4417048\ -461.4412162\ 1-AG\HF -461.4397026\ -461.4385667\ -461.4370734\ -461.4351091\ -461.4325204\ -461.4291001\ -461.424572\M -461.4185707\ -461.4106184\ 1-AG\HF -461.4432215\ -461.4431798\ -461.4431369\ -461.4430924\ -461.4430457\ 1-AG\HF -461.442938\M -461.4428704\ -461.4427861\ -461.4426766\ -461.4425301\ -461.4423316\ -461.4420617\ -461.4416963\ -461.4412051\ 1-AG\HF -461.4396839\ -461.4385423\ -461.4370413\ -461.4350669\ -461.4324646\ -461.4290263\ -461.4244739\ -461.4184402\ -461.4104442\ 1-AG\HF
Using the line
awk -v ORS=' ' -F= '$3 ~ /AG\\HF$/{print substr($4, 1, 12)}' $i >> $i.hf &&
All that was output was
-4 -4 -4
You may use this single awk:
awk -F= '$3 ~ /AG\\HF$/{print substr($4, 1, 12)}' file
-461.3998608
-461.4104442
-461.4440942
-461.4441483
To get output in single line:
awk -v ORS=' ' -F= '$3 ~ /AG\\HF$/{print substr($4, 1, 12)}' file
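If the value itself is wrapped across a line break (as in the first sample, where one line ends in HF=-4 and the next starts with 61.3998608), any line-by-line match only sees a fragment, which would explain output like -4. One possible workaround is to join the lines before matching. This is only a sketch: extract_hf is a made-up helper name, it assumes the wrapped lines carry no extra leading whitespace (as in the samples shown), and grep -o is a GNU/BSD extension rather than POSIX.

```shell
# extract_hf is a hypothetical helper, not part of the original script.
# It joins all wrapped lines, then prints every value that follows "\HF=".
extract_hf() {
    tr -d '\n' |                           # undo the fixed-width line wrapping
    grep -o '\\HF=-\{0,1\}[0-9.]*' |       # one match per line: \HF=-461.3998608
    cut -d= -f2 |                          # keep only the number
    paste -sd' ' -                         # put all numbers on a single line
}

# Two archive tails from the question: the first splits the number itself,
# the second splits the word "State" instead.
hf_values=$(extract_hf <<'EOF'
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
EOF
)
echo "$hf_values"
```

Because the match keys on \HF= rather than on a field position, it should not matter where the line break falls. In the loop from the question this would be called as extract_hf < "$i" >> "$i.hf".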
I have this sample data from uniq -c:
100 c.m milk
99 c.s milk
45 cat food
30 beef
desired output:
beef,30
c.m milk,100
c.s milk,99
cat food,45
What I have tried is:
awk -F " " '{print $2" " $3 " " $4 " " $5 "," $1}' stock.txt |sort>stock2.csv
I got:
beef ,30
cat food
,45
c.m milk
,100
c.s milk
,99
I think it's because some items don't have fields 2, 3, 4 and 5 while I still print the " " separators, and because sort on Unix doesn't order the dot first the way SQL does. However, I'm not sure how to fix it.
To obtain your desired output you could first sort your current input and then swap the columns.
Using awk, give this a try:
$ sort -k2 stock.txt | awk '{t=$1; sub($1 FS,""); print $0"," t}'
It will output:
beef,30
c.m milk,100
c.s milk,99
cat food,45
I think you can solve it in bash with a few simple commands, provided the file has the format you posted.
Here prova.txt is your file.
Then run:
cat prova.txt | cut -d" " -f2,3 > first_col
cat prova.txt | cut -d" " -f1 > second_col
paste -d "," first_col second_col | sort -u > output.csv
rm first_col second_col
output.csv now holds your desired output in CSV format.
EDIT:
After reading and applying PesaThe's comment, the code is much simpler:
paste -d, <(cut -d' ' -f2- prova.txt) <(cut -d' ' -f1 prova.txt) | sort -u > output.csv
Combining additional information from this thread with awk, the following script is a possible solution:
awk ' { printf "%s", $2; if ($3) printf " %s", $3; printf ",%d\n", $1; } ' stock.txt | LC_ALL=C sort > stock2.csv
It works well in my case. Nevertheless, I would prefer nbari's solution because it is shorter.
$ awk '{$0=$0","$1; sub(/^[^[:space:]]+[[:space:]]+/,"")} 1' file | LC_ALL=C sort
beef,30
c.m milk,100
c.s milk,99
cat food,45
You can use sed + sort:
sed -E 's/^([^[:blank:]]+)[[:blank:]]+(.+)/\2,\1/' file | LC_ALL=C sort
beef,30
c.m milk,100
c.s milk,99
cat food,45
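On the ordering issue raised in the question: in many UTF-8 locales sort gives punctuation very little weight, so "cat food" can come before "c.m milk", while LC_ALL=C compares raw bytes and puts "." ahead of letters. Real uniq -c output is also often padded with leading blanks. The sketch below handles both; swap_count is a made-up helper name, and the padded sample is an assumption about what uniq -c printed.

```shell
# swap_count is a hypothetical helper: it moves the leading count to the end
# of each line and sorts bytewise so that "." orders before letters.
swap_count() {
    # Skip optional leading blanks, capture the count, then the item name.
    sed -E 's/^[[:blank:]]*([0-9]+)[[:blank:]]+(.+)$/\2,\1/' | LC_ALL=C sort
}

# Sample input with the padding that uniq -c commonly produces (assumed).
result=$(swap_count <<'EOF'
    100 c.m milk
     99 c.s milk
     45 cat food
     30 beef
EOF
)
echo "$result"
```

With the real data this would be run as swap_count < stock.txt > stock2.csv.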
I have the following file:
FirstName, FamilyName, Address, PhoneNo
The file is sorted by family name. How can I count the number of family names that start with each character?
The output should look like this:
A: 2
B: 1
...
??
With awk:
awk '{print substr($2, 1, 1)}' file|
uniq -c|
awk '{print $2 ": " $1}'
OK, no awk. Here's with sed:
sed 's/[^,]*, \(.\).*/\1/' file|
uniq -c|
sed 's/^ *\([0-9][0-9]*\) \(.\)$/\2: \1/'
OK, no sed. Here's with python:
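import csv
import sys
from collections import Counter

file_name = sys.argv[1]  # path to the input file, passed on the command line
with open(file_name, newline='') as f:
    # skipinitialspace drops the blank after each comma, so row[1][0] is the initial
    r = csv.reader(f, skipinitialspace=True)
    d = Counter(row[1][0] for row in r if len(row) > 1 and row[1])
for k, v in sorted(d.items()):
    print("%s: %s" % (k, v))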
while read -r f l r; do echo "$l"; done < inputfile | cut -c 1 | sort | uniq -c
Just the Shell
#! /bin/bash
##### Count occurrence of familyname initial
#FirstName, FamilyName, Address, PhoneNo
exec <<EOF
Isusara, Ali, Someplace, 022-222
Rat, Fink, Some Hole, 111-5555
Louis, Frayser, whaterver, 123-1144
Janet, Hayes, whoever St, 111-5555
Mary, Holt, Henrico VA, 222-9999
Phillis, Hughs, Some Town, 711-5525
Howard, Kingsley, ahahaha, 222-2222
EOF
while read first family rest
do
    init=${family:0:1}
    # Initial changed: print the finished group and reset the counter
    [ -n "$oinit" ] && [ "$init" != "$oinit" ] && {
        echo $oinit : $count
        count=0
    }
    oinit=$init
    let count++
done
echo $oinit : $count
Running
frayser#gentoo ~/doc/Answers/src/SH/names $ sh names.sh
A : 1
F : 2
H : 3
K : 1
frayser#gentoo ~/doc/Answers/src/SH/names $
To read from a file, remove the here document, and run:
chmod +x names.sh
./names.sh <file
The "hard way": no use of awk or sed, exactly as asked for. If you're not sure what any of these commands mean, you should definitely look at the man page for each one.
INTERMED=`mktemp` # Creates a temporary file
COUNTS_L=`mktemp` # A second...
COUNTS_R=`mktemp` # A third...
cut -d , -f 2 | # Extracts the FamilyName field only
tr -d '\t ' | # Deletes spaces/tabs
cut -c 1 | # Keeps only the first character
# on each line
tr '[:lower:]' '[:upper:]' | # Capitalizes all letters
sort | # Sorts the list
uniq -c > $INTERMED # Counts how many of each letter
# there are
cut -c1-7 $INTERMED | # Cuts out the LHS of the temp file
tr -d ' ' > $COUNTS_R # Must delete the padding spaces though
cut -c9- $INTERMED > $COUNTS_L # Cut out the RHS of the temp file
# Combines the two halves into the final output in reverse order
paste -d ' ' /dev/null $COUNTS_R | paste -d ':' $COUNTS_L -
rm $INTERMED $COUNTS_L $COUNTS_R # Cleans up the temp files
awk one-liner:
awk '
{count[substr($2,1,1)]++}
END {for (init in count) print init ": " count[init]}
' filename
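Note that the awk answer above splits on whitespace, so $2 is the second word of the line, and the order of a for (init in count) loop is unspecified. A possible variant, assuming the fields are separated by a comma plus optional spaces as in the header line, splits on the comma and sorts the result. count_initials is a made-up name used only for this sketch.

```shell
# count_initials is a hypothetical helper: it counts family-name initials in
# "FirstName, FamilyName, Address, PhoneNo" records read from standard input.
count_initials() {
    awk -F', *' '
        # $2 is the family name; skip lines that have no second field
        NF >= 2 && $2 != "" { count[toupper(substr($2, 1, 1))]++ }
        END { for (init in count) print init ": " count[init] }
    ' | LC_ALL=C sort    # awk's array order is unspecified, so sort afterwards
}

initials=$(count_initials <<'EOF'
Isusara, Ali, Someplace, 022-222
Rat, Fink, Some Hole, 111-5555
Louis, Frayser, whaterver, 123-1144
Janet, Hayes, whoever St, 111-5555
EOF
)
echo "$initials"
```

Unlike the uniq -c pipelines, this does not depend on the input already being sorted by family name.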
Prints how many words start with each letter:
for i in {a..z}; do echo -n "$i:"; find path/to/folder -type f -exec sed "s/ /\n/g" {} \; | grep "^$i" | wc -l; done