Separating joined columns with awk

I have a data file which looks like the following:
0.00000-130250.92921 28880.20200-159131.13121 301.58706
0.05000-130250.73120 28156.69202-158407.42322 294.03167
0.10000-130250.79137 28237.16138-158487.95275 294.87198
0.15000-130250.81209 28168.63042-158419.44250 294.15634
0.20000-130250.82418 28149.57611-158400.40029 293.95736
0.25000-130250.88438 28069.57135-158320.45573 293.12189
0.30000-130251.06059 28071.30576-158322.36635 293.14000
0.35000-130250.96639 28084.46351-158335.42990 293.27741
As you can see, some of the columns that start with a "-" sign are
joined to the previous one. For instance, 0.35000-130250.96639
should be 0.35000 and -130250.96639. I can separate the
columns with Vim, but I wanted to know whether it is possible to do that
with awk.
Thanks.

You can use sed: replace each - with a space and -:
sed -e 's/-/ -/g' input > output
The /g means globally, i.e. it replaces all occurrences on each line, not just the first one.

Using just awk
awk '{ gsub("-"," -") ; print }'
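
Both commands split at every "-", so a value in exponent notation such as 1.5e-03 would be broken as well. A minimal sketch that avoids this, assuming every joined minus sign directly follows a digit (the function name split_joined is only illustrative):

```shell
# Insert a space before a minus sign only when it directly follows a digit.
# Reads the named files, or standard input when no file is given.
split_joined() {
    sed -E 's/([0-9])-/\1 -/g' "$@"
}

printf '%s\n' '0.35000-130250.96639 28084.46351-158335.42990 293.27741' | split_joined
```

Unlike the unconditional replacement, this leaves a minus sign at the start of a line or after a space untouched.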

Related

Unix sed command - global replacement is not working

I have a scenario where we want to collapse runs of multiple double quotes inside the data into a single double quote. Because the input is comma-delimited and every column is enclosed in double quotes, an empty column also appears as "", which causes the issue explained below.
The sample data looks like this:
"int","","123","abd"""sf123","top"
So, the output would be:
"int","","123","abd"sf123","top"
I tried the approach below, but only the first occurrence is replaced and I am not sure what the issue is:
sed -ie 's/,"",/,"NULL",/g;s/""/"/g;s/,"NULL",/,"",/g' inputfile.txt
Step 1: replace every ,"", with ,"NULL",
Step 2: replace every run of multiple quotes (""", "" or """") with a single "
Step 3: revert step 1 by replacing ,"NULL", with ,"",
But only the first occurrence is changed and the rest stay the same, as shown below.
If input is :
"int","","","123","abd"""sf123","top"
the output is coming as:
"int","","NULL","123","abd"sf123","top"
But, the output should be:
"int","","","123","abd"sf123","top"

You may try this perl with a lookahead:
perl -pe 's/("")+(?=")//g' file
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"
Where input is:
cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
Breakdown:
("")+: matches one or more pairs of double quotes
(?="): a lookahead that requires those pairs to be followed by a single ", which is not consumed and therefore stays in the output
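
As for why the original command changes only the first empty field: matches found with the g flag cannot overlap. The first ,"", consumes the comma that the next empty field would need as its leading comma, so the second of two adjacent empty fields is skipped. A small sketch of that effect, with a loop as one possible workaround (the sample string and the function name mark_empty are made up for illustration):

```shell
line='a,"","",b'

# One pass with /g: the second empty field is missed.
printf '%s\n' "$line" | sed 's/,"",/,"NULL",/g'

# Repeat the substitution until nothing matches any more.
mark_empty() {
    sed -e ':a' -e 's/,"",/,"NULL",/' -e 'ta' "$@"
}
printf '%s\n' "$line" | mark_empty
```

The loop terminates because the replacement text never creates a new ,"", sequence.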

Using sed
$ sed -E 's/(,"",)?"+(",)?/\1"\2/g' input_file
"int","","123","abd"sf123","top"
"int","","NULL","123","abd"sf123","top"
"int","","","123","abd"sf123","top"

With your shown samples, please try the following awk code. It was written and tested in GNU awk and should work in any version of awk.
awk '
BEGIN { FS = OFS = "," }
{
    for (i = 1; i <= NF; i++) {
        if ($i !~ /^""$/) {
            gsub(/"+/, "\"", $i)
        }
    }
}
1
' Input_file
Explanation: set the field separator and the output field separator to , for all lines of Input_file. Then loop over each field of the line; if the field is not an empty pair of quotes (""), globally replace every run of one or more " with a single ". Finally, the 1 prints the line.

With sed you could match one or more repetitions of "" using a group, followed by a single ".
Then use a single " in the replacement:
sed -E 's/("")+"/"/g' file
For this content
$ cat file
"int","","123","abd"""sf123","top"
"int","","","123","abd"""sf123","top"
"123"""""abcs"
The output is
"int","","123","abd"sf123","top"
"int","","","123","abd"sf123","top"
"123"abcs"

sed 's#"""#"#' file
That works for the sample, although it only replaces the first """ on each line. I will demonstrate another method though, which you may also find useful in other situations.
#!/bin/sh -x
# Assumes file holds the single five-field sample record.
# ed script: on line 4 (the fourth field), replace """ with "
cat > ed1 <<'EOF'
4s/"""/"/
wq
EOF
cp file stack
tr ',' '\n' < stack > f2    # one field per line
ed -s f2 < ed1
paste -sd, f2 > stack       # join the fields back with commas
rm -v ./f2 ./ed1
The point of this is that if you have a big CSV record all on one line, and you want to edit a specific field, then if you know the field number, you can convert all the commas to newlines and use the field number as a line number to substitute, append after it, or insert before it with ed, and then convert back to CSV.

How to grep a pattern followed by a number, only if the number is above a certain value

I actually need to grep the entire line. I have a file with a bunch of lines that look like this
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV
and I want to keep all the lines where AF>0.1. So for the lines above I only want to keep the first line.

Using gnu-awk you can do this:
awk 'gensub(/.*;AF=([^;]+).*/, "\\1", "1", $NF)+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
This gensub function parses out AF=<number> from last field of the input and captures number in captured group #1 which is used for comparison with 0.1.
PS: +0 will convert parsed field to a number.

You could use awk with multiple delimiters to extract the value and compare it:
$ awk -F';|=' '$8 > 0.1' file

Assuming that AF is always of the form 0.NN, you can simply match values where the tenths place is 1-9, e.g.:
grep ';AF=0\.[1-9][0-9];' your_file.csv
Note that this also matches AF=0.10, which is not strictly greater than 0.1. With grep -E you could add a + after the second character group to support additional digits (i.e. 0.NNNNN), but if the values could be outside the range [0, 1) you shouldn't try to match the field with regular expressions.

$ awk -F= '$5+0 > 0.1' file
1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV
If that doesn't do what you want when run against your real data then edit your question to provide more truly representative sample input/output.

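I would use awk. Since awk supports string comparisons, you can simply use this:
awk -F';' '$(NF-1) > "AF=0.1"' file.txt
-F';' splits the line into fields at each ;. $(NF-1) addresses the second-to-last field in the line (NF is the number of fields). Because the field is compared with the string "AF=0.1" character by character, this relies on AF always being the second-to-last field and written as a plain decimal; a value such as AF=1e-3 would compare as greater.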
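
The answers above either depend on the position of the AF field or on GNU awk. A minimal sketch that finds AF wherever it appears and compares it as a number, using only match() and substr() (the function name filter_af and its argument order are illustrative assumptions):

```shell
# Keep lines whose AF value is numerically greater than the threshold.
# Usage: filter_af THRESHOLD [FILE...]
filter_af() {
    min=$1
    shift
    awk -v min="$min" '
        match($0, /(^|[; \t])AF=[^; \t]*/) {
            val = substr($0, RSTART, RLENGTH)
            sub(/^.*AF=/, "", val)      # drop everything up to and including "AF="
            if (val + 0 > min + 0) print
        }
    ' "$@"
}

printf '%s\n' \
    '1 123213 A T . stuff=1.232;otherstuf=34;morestuff=121;AF=0.44;laststuff=AV' \
    '4 223152 D L . stuff=1.122;otherstuf=4;morestuff=41;AF=0.02;laststuff=RV' |
    filter_af 0.1
```

Lines without an AF= entry are dropped, which may or may not be what you want for your real data.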

sed multiple replacements with line range

I have a file with below records
user1,fuser1,luser1,user1#test.com,data,user1
user2,fuser2,luser2,user2#test.com,data,user2
user3,fuser3,luser3,user3#test.com,data,user3
I wanted to perform some text replacements from
user1,fuser1,luser1,user1#test.com,data,user1
to
New_user1,New_fuser1,New_luser1,New_user1#test.com,data,New_user1
so I wrote below sed script.
sed -i -e 's/user/New_user/g; s/fuser/New_fuser/g; s/luser/New_luser/g' file
This works perfect. Now I have a requirement that I want to replace in specific line range.
start=2
end=3
sed -i -e ''${start},${end}'s/user/New_user/g; s/fuser/New_fuser/g; s/luser/New_luser/g' file
but this command is replacing pattern in all lines. example output is,
user1,New_fuser1,New_luser1,user1#test.com,data,New_user1
user2,New_fuser2,New_luser2,user2#test.com,data,New_user2
user3,New_fuser3,New_luser3,user3#test.com,data,New_user3
Looks like range is getting applied only to first expression and remaining expressions are getting applied on whole file. How to apply this range to all expressions?

You can use awk variables for this, controlling the row and column numbers used for the replacement:
awk -vFS="," -vOFS="," -v columnStart=2 -v columnEnd=3 -v rowStart=1 -v rowEnd=2 \
'NR>=rowStart&&NR<=rowEnd{for(i=columnStart; i<=columnEnd; i++) \
$i="New_"$i; print }' file
where the awk variables columnStart, columnEnd, rowStart and rowEnd determine which columns and rows to replace, with , as the delimiter. Note that only the rows inside the range are printed.
For your input file:
$ cat input-file
user1,fuser1,luser1,user1#test.com,data,user1
user2,fuser2,luser2,user2#test.com,data,user2
user3,fuser3,luser3,user3#test.com,data,user3
Assuming I want to do the replacement in lines 2 and 3, columns 3-4, I can set up my awk as
awk -vFS="," -vOFS="," -v columnStart=3 -v columnEnd=4 -v rowStart=2 -v rowEnd=3 \
'NR>=rowStart&&NR<=rowEnd{for(i=columnStart; i<=columnEnd; i++) \
$i="New_"$i; print }' file
user2,fuser2,New_luser2,New_user2#test.com,data,user2
user3,fuser3,New_luser3,New_user3#test.com,data,user3
To apply this to, say, the last column only, set columnStart and columnEnd to the same value, e.g. column 6 on the last line only:
awk -vFS="," -vOFS="," -v columnStart=6 -v columnEnd=6 -v rowStart=3 -v rowEnd=3 \
'NR>=rowStart&&NR<=rowEnd{for(i=columnStart; i<=columnEnd; i++) \
$i="New_"$i; print }' file
user3,fuser3,luser3,user3#test.com,data,New_user3
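
The commands above print only the rows inside the range. A minimal variant that prefixes the selected cells but still prints every row might look like this (the function name prefix_cells and its argument order are illustrative, not part of the answer above):

```shell
# Usage: prefix_cells COL_START COL_END ROW_START ROW_END [FILE...]
prefix_cells() {
    c1=$1 c2=$2 r1=$3 r2=$4
    shift 4
    awk -F, -v OFS=, -v c1="$c1" -v c2="$c2" -v r1="$r1" -v r2="$r2" '
        NR >= r1 && NR <= r2 {
            for (i = c1; i <= c2; i++) $i = "New_" $i
        }
        { print }    # print every row, modified or not
    ' "$@"
}

printf '%s\n' \
    'user1,fuser1,luser1,user1#test.com,data,user1' \
    'user2,fuser2,luser2,user2#test.com,data,user2' \
    'user3,fuser3,luser3,user3#test.com,data,user3' |
    prefix_cells 3 4 2 3
```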

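With GNU sed (the default on Ubuntu, Debian, and most other Linux distributions) there is a feature which makes this easy. Command grouping with { } is also part of POSIX sed, although some implementations require a ; or a newline before the closing }.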
https://www.gnu.org/software/sed/manual/sed.html#Common-Commands
A group of commands may be enclosed between { and } characters. This
is particularly useful when you want a group of commands to be
triggered by a single address (or address-range) match.
Example: perform substitution then print the second input line:
$ seq 3 | sed -n '2{s/2/X/ ; p}'
X
Given the original question, this should do the trick:
sed -i -e '2,3 {s/user/New_user/g; s/fuser/New_fuser/g; s/luser/New_luser/g}' file
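
One caveat with running the three substitutions in sequence: s/user/New_user/g already rewrites the "user" inside fuser and luser, so the later s/fuser/ and s/luser/ commands no longer find anything to match. A minimal sketch that keeps the { } grouping, takes the range from shell variables, and folds the three substitutions into one (the function name rename_range is an assumption for illustration):

```shell
# Usage: rename_range START END [FILE...]
# Prefixes user, fuser and luser with New_ on lines START..END only.
rename_range() {
    start=$1 end=$2
    shift 2
    sed -e "${start},${end}{" \
        -e 's/\([fl]\{0,1\}\)user/New_\1user/g' \
        -e '}' "$@"
}

printf '%s\n' \
    'user1,fuser1,luser1,user1#test.com,data,user1' \
    'user2,fuser2,luser2,user2#test.com,data,user2' \
    'user3,fuser3,luser3,user3#test.com,data,user3' |
    rename_range 2 3
```

Passing the braces as separate -e arguments sidesteps the differences between sed implementations in how a closing } may be terminated.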

The following works for me:
START=2
NUM=1
sed -i -e "$START,+${NUM} s/user/New_user/g; $START,+${NUM} s/fuser/New_fuser/g; $START,+${NUM} s/luser/New_luser/g" file
As you can see, there are several changes:
The line range has to be present on each expression.
The range is written here as the start line plus a count, using GNU sed's addr,+N form (the number of affected lines is NUM+1).
Your command contained extra apostrophes.

Using a single s command:
start=1
end=2
sed -e "$start,$end s/\([fl]*\)user/New_\1user/g" file
[fl]*user matches user optionally preceded by f or l (strictly, by any run of those two letters).
output:
New_user1,New_fuser1,New_luser1,New_user1#test.com,data,New_user1
New_user2,New_fuser2,New_luser2,New_user2#test.com,data,New_user2
user3,fuser3,luser3,user3#test.com,data,user3

Convert multiple lines between patterns to a comma-separated string

I need help in processing data from STDIN (data is taken from another file with 'tail -f' plus grepped to filter out garbage). There are several lines between patterns:
<DN> 589</DN>
<DD>03.12.2014</DD>
<ST> </ST>
<STC>0</STC>
<STT>0</STT>
<PU>5</PU>
<OT>01</OT>
<DSN></DSN>
<NRA>40807,40820,426,30231,40818,30230</NRA>
<GR>300 000-00
&#10</GR>
Then the next block with DN/GR starts.
I need to convert the lines between <DN> and </GR> into a single comma-separated line:
<DN> 589</DN>,<DD>03.12.2014</DD>,<ST> </ST>,<STC>0</STC>,<STT>0</STT>,<PU>5</PU>,<OT>01</OT>,<DSN></DSN>,<NRA>40807,40820,426,30231,40818,30230</NRA>,<GR>300 000-00
&#10</GR>
I need a one-liner with awk or sed or perl to do it and put result to STDOUT.
I've tried to do it, but failed due to lack of experience. Also tried to google and didn't find a working solution.

whatever..| awk '{sub(/^\s*/,"");printf "%s%s",$0,(/\/GR>\s*$/?"\n":",")}'
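This line:
removes the leading whitespace from each line
joins all lines with , as the separator until the block end /GR>
If you have x data blocks, it gives you x long lines. Note that \s is a GNU awk extension; other awks may need [[:space:]] or [ \t] instead.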
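
For awks without \s, the same joining logic can be sketched as follows. It assumes, as in the sample, that every block ends with a line whose last token is </GR>; the function name join_blocks is illustrative:

```shell
# Join the lines of each block with commas; a block ends at a line ending in </GR>.
join_blocks() {
    awk '
        { sub(/^[ \t]+/, "") }                         # strip leading blanks
        { line = (line == "" ? $0 : line "," $0) }     # append with a comma
        /<\/GR>[ \t]*$/ { print line; line = "" }      # block ends at </GR>
        END { if (line != "") print line }             # flush an unfinished block
    ' "$@"
}

printf '%s\n' '<DN> 589</DN>' '<DD>03.12.2014</DD>' '<GR>300 000-00' '&#10</GR>' \
    '<DN>900</DN>' '<GR>300 000-00' '&#10</GR>' | join_blocks
```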

sed -nr '/<DN>/,/<GR>/{ H; /<GR>/{ g; s%\n%,%g; s%^,%%; p; s%.*%%; h }; }' <<'EOSEQ'
<DN> 589</DN>
<DD>03.12.2014</DD>
<STC>0</STC>
<GR>300 000-00
&#10</GR>
<DN>900</DN>
<DD>20.11.2014</DD>
<OT>01</OT>
<NRA>40807,40820,426,30231,40818,30230</NRA>
<GR>300 000-00
&#10</GR>
EOSEQ
SED one-liner, as you wish :)

Using awk you could do the following:
awk '{printf ("%s,", $0)}' test.txt ## Joins every line of the file into one; leaves a comma at the end, which may or may not be OK for you.

You can use the following in sed:
sed -r ':loop ;N;s/(.*)\n(.*)/\1,\2/ ; t loop ' filename

Bash command to extract characters in a string

I want to write a small script to generate the location of a file in an NGINX cache directory.
The format of the path is:
/path/to/nginx/cache/d8/40/32/13febd65d65112badd0aa90a15d84032
Note the last 6 characters: d8 40 32, are represented in the path.
As an input I give the md5 hash (13febd65d65112badd0aa90a15d84032) and I want to generate the output: d8/40/32/13febd65d65112badd0aa90a15d84032
I'm sure sed or awk will be handy, but I don't know yet how...

This awk can make it:
awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}'
Explanation
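BEGIN{FS=""; OFS="/"}. FS="" sets the input field separator to the empty string, so that every character becomes a separate field. (This works in GNU awk and mawk, but POSIX leaves the behavior of an empty FS unspecified.) OFS="/" sets the output field separator to /, which print uses between its arguments.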
print ... $(NF-1)$NF, $0 prints the penultimate field and the last one all together; then, the whole string. The comma is "filled" with the OFS, which is /.
Test
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' <<< "13febd65d65112badd0aa90a15d84032"
d8/40/32/13febd65d65112badd0aa90a15d84032
Or with a file:
$ cat a
13febd65d65112badd0aa90a15d84032
13febd65d65112badd0aa90a15f1f2f3
$ awk 'BEGIN{FS=""; OFS="/"}{print $(NF-5)$(NF-4), $(NF-3)$(NF-2), $(NF-1)$NF, $0}' a
d8/40/32/13febd65d65112badd0aa90a15d84032
f1/f2/f3/13febd65d65112badd0aa90a15f1f2f3

With sed:
echo '13febd65d65112badd0aa90a15d84032' | \
sed -n 's/\(.*\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\([0-9a-f]\{2\}\)\)$/\2\/\3\/\4\/\1/p;'
With GNU sed you can simplify the pattern using the -r option, so that you no longer need to escape {} and (). Using ~ as the regex delimiter lets you use the path separator / without escaping it:
sed -nr 's~(.*([0-9a-f]{2})([0-9a-f]{2})([0-9a-f]{2}))$~\2/\3/\4/\1~p;'
Output:
d8/40/32/13febd65d65112badd0aa90a15d84032
Explained simply, the pattern matches:
(all (n-5 - n-4) (n-3 - n-2) (n-1 - n-0))
where group 1 is the whole string and groups 2 to 4 are its last three character pairs, and replaces it with
\2/\3/\4/\1

You can use a regular expression to separate each of the last 3 bytes from the rest of the hash.
hash=13febd65d65112badd0aa90a15d84032
[[ $hash =~ (..)(..)(..)$ ]]
new_path="/path/to/nginx/cache/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}/${BASH_REMATCH[3]}/$hash"
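
The same idea can be wrapped in a small bash function using substring expansion. This is only a sketch: the function name cache_path, the optional second argument and the hex check are assumptions added for illustration, and it expects a 32-character lowercase hash with the 2:2:2 layout from the question.

```shell
#!/usr/bin/env bash
# Usage: cache_path HASH [CACHE_ROOT]
# Prints CACHE_ROOT/<chars 27-28>/<chars 29-30>/<chars 31-32>/HASH.
cache_path() {
    local hash=$1 root=${2:-/path/to/nginx/cache}
    # Reject anything that is not exactly 32 lowercase hex digits.
    [[ $hash =~ ^[0-9a-f]{32}$ ]] || return 1
    printf '%s/%s/%s/%s/%s\n' "$root" \
        "${hash:26:2}" "${hash:28:2}" "${hash:30:2}" "$hash"
}

cache_path 13febd65d65112badd0aa90a15d84032
```

The fixed offsets 26, 28 and 30 are safe here only because the length has been checked first.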

Base="/path/to/nginx/cache/"
echo '13febd65d65112badd0aa90a15d84032' | \
sed "s|\(.*\(..\)\(..\)\(..\)\)|${Base}\2/\3/\4/\1|"
# or
# sed "s|.*\(..\)\(..\)\(..\)$|${Base}\1/\2/\3/&|"
Assuming the input is a valid MD5 string and nothing else.

First of all - thanks to all of the responders - this was extremely quick!
I also did my own scripting in the meantime, and came up with this solution:
Run this script with a parameter of the URL you're looking for (www.example.com/article/76232?q=hello for example)
#!/bin/bash
path=$1
md5=$(echo -n "$path" | md5sum | cut -f1 -d' ')
p3=${md5:0-2:2}
p2=${md5:0-4:2}
p1=${md5:0-6:2}
echo "/path/to/nginx/cache/$p1/$p2/$p3/$md5"
This assumes the NGINX cache has a key structure of 2:2:2.
