So I am trying to remove newlines using sed, because it is the only way I can think of to do it. I'm completely self-taught, so there may be a more efficient way that I just don't know.
The string I am searching for is \HF=-[0-9](newline character). The problem is that the data it is searching through can look like this (note: there are actual newline characters in this data, which I think is causing part of the problem):
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-
1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69
74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.
\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.
,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697
4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1
\H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,-
1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782
2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94
11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev
D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H
[SGH(C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0
.2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822
,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411,
3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0
1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [
SGH(C4H4),X(C8H8)]\\#
Basically the number I am looking for can be broken up into two lines at any point based on character count. I need to get rid of the newline breaking up the number so that I can extract the entire value into a separate file. (I have no problems with the extraction to a new file, hence why it isn't included in the code)
Currently I am using this code
sed -i ':a;N;$!ba;s/HF=-*[0-9]*\n/HF=-*[0-9]*/g' $i &&
Which ALMOST works, except it doesn't replace the wildcard values with the same values. It replaces them with the literal text [0-9] instead and doesn't always remove the newline character.
It is important to note that THERE ARE ACTUAL NEWLINE CHARACTERS in the output file, and there is no way to change that without messing up the other 30 lines I am extracting from this output file.
What I want is to just get rid of the newline characters that occur when that string is found, regardless of how many digits there are in between the - sign and the newline character.
So the expected output would be something like
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-461.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
These files are rather large and have over 1500 executions of this line of code, so the more efficient the better.
Everything else in the script this is in is using a combination of grep, awk, sed, and basic UNIX commands.
EDIT
After trying
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
I still had no luck getting rid of those pesky new line characters.
In case it affects the answers, here is the rest of the code that goes with the one line that is causing problems:
echo name HF MP2 mpdiff | cat > allE
for i in *.out
do echo name HF MP2 mpdiff | cat > $i.allE
grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mp &&
grep "EUMP2" $i | cut -d "=" -f2 | cut -c 1-25 | tr '\n' ' ' | tr -s ' ' >> $i.mpdiff &&
sed -i -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/' $i &&
grep '\\HF' $i | awk -F 'HF' '{print substr($2,2,14)}' | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
sed -i 's/ /0 /g' $i.hf &&
sed -i 's/\\/0/g' $i.hf &&
sed -i 's/[A-Z]/0/g' $i.hf &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp &&
paste $i.mp >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mpdiff &&
paste $i.mpdiff >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
#rm *.energies #temp.txt
rm *.name
rm *.mp
rm *.hf
rm *.mpdiff
Here is how you can fix your current attempt.
sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/'
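Add the -i flag if you want to make the changes in the file itself, append && to chain the next command so that it runs only if sed succeeds, etc. The -E flag is needed because the unescaped parentheses that form the capture group (see below) are extended regular expression syntax; in basic syntax you would have to write \( and \) instead.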
I made the following changes: I changed -* to -? as there should be at most one dash (if I understand correctly and that is in fact a minus sign, not a dash). I added the period to the bracket expression, so that the decimal point would be matched too. (Note that in a bracket expression, the dot is a regular character). I wrapped the whole thing except the newline in parentheses - making it into a subexpression, which you can refer to with a backreference - which is what I did in the replacement part.
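As a rough illustration of the grouping point, here is a sketch of the same substitution written both ways. It assumes GNU sed (the one-line `:a;N;$!ba` loop is used exactly as in the command above), and the function names join_hf_ere and join_hf_bre are made up for this example. With -E the group is written ( ); without it the group is written \( \); the \1 backreference works in both.

```shell
# Hypothetical helpers, assuming GNU sed; both read stdin and write stdout.

# Extended syntax (-E): plain parentheses form the capture group.
join_hf_ere() {
    sed -E ':a;N;$!ba;s/(\\HF=-?[.0-9]*)\n/\1/g'
}

# Basic syntax (no -E): the group is \( \) and "at most one" is \{0,1\}.
join_hf_bre() {
    sed ':a;N;$!ba;s/\(\\HF=-\{0,1\}[.0-9]*\)\n/\1/g'
}

# Two-line sample with the number split after its first digit.
printf '%s\n' 'State=1-AG\HF=-4' '61.3998608\MP2=-463.0005321' | join_hf_ere
printf '%s\n' 'State=1-AG\HF=-4' '61.3998608\MP2=-463.0005321' | join_hf_bre
```

Both calls should print the sample joined into a single line, with the HF value intact.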
A few notes, though. This will join the lines even if the entire number is at the end of one line and only the closing \ has wrapped onto the next line. If you want to leave that case alone, you can change the sed command slightly. On the other hand, this does not handle situations where, for example, one line ends in \H and the next line begins with F=304.222\. You only mentioned a "split number" in your problem statement; shouldn't you also handle such cases, where the newline splits the \HF=...\ token, just not in the "number" portion of the token?
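If the split can fall anywhere in the \HF=...\ token, one way around it is to not edit the file at all and instead join the lines only in the stream being searched. The following is a minimal sketch under some assumptions: the function name extract_hf is hypothetical, each wrapped line may carry one leading space that is not part of the data, and grep supports -o (GNU and BSD grep do, but it is not in POSIX).

```shell
# Hypothetical helper: print every HF value found in the file given as $1.
# The file itself is left untouched, so other extractions are unaffected.
extract_hf() {
    # 1. Drop one leading space per line, if present (assumption about the format).
    # 2. Delete all newlines so a token wrapped at any position is contiguous.
    # 3. Isolate each \HF=<number> token.
    # 4. Keep only the part after the equals sign.
    sed 's/^ //' "$1" |
        tr -d '\n' |
        grep -oE '\\HF=-?[0-9.]+' |
        cut -d= -f2
}

# Usage (file name is only an example):
#   extract_hf some_job.out >> some_job.out.hf
```

Because the joining happens in a pipe, it does not matter whether the newline lands inside the number, inside "HF", or right before the closing backslash.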
It looks like your input lines start with a space. I have ignored them in this solution.
sed -rz 's/(AG\\HF=-[0-9]*)\n/\1/g' "$i"
I have a file in Unix, with data sample like the following:
{"ID":"123", "Region":"Asia", "Location":"India"}
{"ID":"234", "Region":"APAC", "Location":"Australia"}
{"ID":"345", "Region":"Americas", "Location":"Mexio"}
{"ID":"456", "Region":"Americas", "Location":"Canada"}
{"ID":"567", "Region":"APAC", "Location":"Japan"}
The desired output is
ID|Region|Location
123|Asia|India
234|APAC|Australia
345|Americas|Mexico
456|Americas|Canada
567|APAC|Japan
I tried with a few sed commands. I could remove the following: '{', '}', ' " ', ':'
There are 2 issues with the output file
All rows from input appear in single line in the output.
Adding the pipe ('|') as delimiter.
Any pointers are highly appreciated.
I recommend the tool jq (http://stedolan.github.io/jq/); jq is a lightweight and flexible command-line JSON processor.
jq -r '"\(.ID)|\(.Region)|\(.Location)"' < infile
123|Asia|India
234|APAC|Australia
345|Americas|Mexio
456|Americas|Canada
567|APAC|Japan
Explanation
-r is --raw-output
Through awk,
awk -F'"' -v OFS="|" 'BEGIN{print "ID|Region|Location"}{print $4,$8,$12}' file
Example:
$ cat file
{"ID":"123", "Region":"Asia", "Location":"India"}
{"ID":"234", "Region":"APAC", "Location":"Australia"}
{"ID":"345", "Region":"Americas", "Location":"Mexio"}
{"ID":"456", "Region":"Americas", "Location":"Canada"}
{"ID":"567", "Region":"APAC", "Location":"Japan"}
$ awk -F'"' -v OFS="|" 'BEGIN{print "ID|Region|Location"}{print $4,$8,$12}' file
ID|Region|Location
123|Asia|India
234|APAC|Australia
345|Americas|Mexio
456|Americas|Canada
567|APAC|Japan
Explanation:
-F'"' sets " as the field separator.
OFS="|" sets | as the output field separator.
First, awk executes the statements inside the BEGIN block, before any input is read. That is what prints the header line.
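To see why the values end up in $4, $8 and $12, it can help to print every field with its index. This is only a small sketch; the function name show_fields is made up for the illustration.

```shell
# Hypothetical helper: split each input line on " and print index=[field].
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i }'
}

# One sample record from the question.
printf '%s\n' '{"ID":"123", "Region":"Asia", "Location":"India"}' | show_fields
```

With " as the separator, the keys land in the fields 2, 6 and 10, the values in 4, 8 and 12, and the punctuation in the odd-numbered fields.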
This sed one-liner does what you want. It's capturing the field values using parenthesized expressions, and then putting them into the output using \1, \2, and \3.
s/^{"ID":"\([^"]*\)", "Region":"\([^"]*\)", "Location":"\([^"]*\)"}$/\1|\2|\3/
Invoke it like:
$ sed -f one-liner.sed input.txt
Or you can invoke it within a Bash script, producing the header:
echo 'ID|Region|Location'
sed -e 's/^{"ID":"\([^"]*\)", "Region":"\([^"]*\)", "Location":"\([^"]*\)"}$/\1|\2|\3/' $input
It is a JSON file, so it is best to use a JSON parser. Here is a Perl implementation.
#!/usr/bin/perl
use strict;
use warnings;
use JSON;
open my $fh, '<', 'path/to/your/file' or die "Cannot open file: $!";
# keys of your structure, in output order
my @key = qw(ID Region Location);
print join ("|", @key), "\n";
# iterate over your file, decode each line and print in order of your key structure
while (my $json = <$fh>) {
my $text = decode_json($json);
print join ("|", map { $$text{$_} } @key ),"\n";
}
Output:
ID|Region|Location
123|Asia|India
234|APAC|Australia
345|Americas|Mexio
456|Americas|Canada
567|APAC|Japan
Using sed as follows
Command line
echo "my_string" |
sed -e 's#[,:"{}]##g' -e 's#ID##g' -e "s#Region##g" -e 's#Location##g' \
-e '1 s#^.*$#ID Region Location\n&#' -e 's# #|#g'
or
sed -e 's#[,:"{}]##g' -e 's#ID##g' -e "s#Region##g" -e 's#Location##g' \
-e '1 s#^.*$#ID Region Location\n&#' -e 's# #|#g' my_file
I tried this in a terminal as follows:
echo '{"ID":"123", "Region":"Asia", "Location":"India"}
{"ID":"234", "Region":"APAC", "Location":"Australia"}
{"ID":"345", "Region":"Americas", "Location":"Mexio"}
{"ID":"456", "Region":"Americas", "Location":"Canada"}
{"ID":"567", "Region":"APAC", "Location":"Japan"}' |
sed -e 's#[,:"{}]##g' -e 's#ID##g' -e "s#Region##g" -e 's#Location##g' \
-e '1 s#^.*$#ID Region Location\n&#' -e 's# #|#g'
Output
ID|Region|Location
123|Asia|India
234|APAC|Australia
345|Americas|Mexio
456|Americas|Canada
567|APAC|Japan
Many thanks for your responses; the pointers and solutions helped a lot.
For some mysterious reason, I couldn't get any of the sed commands to work, so I devised my own solution. Although it's not elegant, it works.
Here is the script I prepared which resolved the issue.
#!/bin/bash
# source file path.
infile=/home/exfile.txt
# remove these temp files if they already exist (-f: no error if they don't).
rm -f ./efile.txt ./xfile.txt ./yfile.txt ./zfile.txt
# removing the curly braces from input file.
cat exfile.txt | cut -d "{" -f2 | cut -d "}" -f1 >> ./efile.txt
# setting input file name to different value.
infile=./efile.txt
# remove double quotes from the file.
while IFS= read -r line
do
echo $line | sed 's/\"//g' >> ./xfile.txt
done < "$infile"
# creating another temp file.
infile2=./xfile.txt
# remove colon from file.
while IFS= read -r line
do
echo $line | sed 's/\:/,/g' >> ./yfile.txt
done < "$infile2"
# set input file path to new temp file.
infile3=yfile.txt
# initialize variables to hold header column values.
t1=0
t3=0
t5=0
# read each of the line to extract header row. Exit loop after reading 1st row.
once=1
while IFS=',' read -r f1 f2 f3 f4 f5 f6
do
echo "$f1 $f2 $f3 $f4 $f5 $f6"
t1=$f1
t3=$f3
t5=$f5
if [ "$once" -eq 1 ]; then
break
fi
done < "$infile3"
# Read each of the line from input file. Write only the value to another output file.
while IFS=',' read -r f1 f2 f3 f4 f5 f6
do
echo "$f2|$f4|$f6" >> ./zfile.txt
done < "$infile3"
# insert the header column row into the file generated in the step above.
frstline="$t1|$t3|$t5"
sed -i '1i ID|Region|Location' ./zfile.txt
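For comparison, the same steps (strip the braces and quotes, turn the colons into commas, keep every second field) can be sketched as one pipeline with no temp files. This is only a sketch under stated assumptions: the function name json_lines_to_psv is made up, the three keys always appear in the order ID, Region, Location, and no value contains a comma, colon, quote or brace.

```shell
# Hypothetical helper: convert the one-object-per-line file given as $1
# into pipe-separated rows, preceded by a fixed header line.
json_lines_to_psv() {
    printf '%s\n' 'ID|Region|Location'
    # Delete { } and ", turn : into , so each line becomes
    #   ID,123, Region,Asia, Location,India
    # then split on "comma plus optional spaces" and keep the values.
    tr -d '{}"' < "$1" |
        tr ':' ',' |
        awk -F', *' -v OFS='|' '{ print $2, $4, $6 }'
}

# Usage (file names are only examples):
#   json_lines_to_psv exfile.txt > zfile.txt
```

A real JSON parser is still the safer choice once values can contain any of those punctuation characters.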