sed/awk unix csv file modification - shell

I have a directory that is receiving .csv files like this:
column1,column2,column3,column4
value1,0021,value3,value4,
value1,00211,value3,value4,
I want to remove the header, pad the second column to 6 digits, and add ":" so it is in HH:MM:SS format, e.g.
value1,00:00:21,value3,value4,
value1,00:02:11,value3,value4,
I can pad the field to 6 digits using awk, but I am not sure how to insert the colon every 2 characters in the second field ($2). Alternatively, can this be done entirely in sed? Which would be better for performance?
Thank you

You may do it all with GNU awk:
awk 'BEGIN{FS=OFS=","} NR>1{$2=sprintf("%06d", $2); $2=substr($2,1,2) gensub(/.{2}/,":&","g",substr($2,3)); print}' file
Details
BEGIN{FS=OFS=","} - sets input/output field separator to a comma
$2=sprintf("%06d", $2) - pads Field 2 with zeros
$2=substr($2,1,2) gensub(/.{2}/,":&","g",substr($2,3)) - sets Field 2 to the first two chars of the field (substr($2,1,2)) followed by the field substring starting at the third char, with : inserted before each two-char chunk.
NR>1 { ... print } - runs the block, and the explicit print, only from the second line on, which removes the header.
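gensub() is specific to GNU awk. If the script may also need to run under a non-GNU awk, here is a minimal portable sketch of the same idea, assuming the padded value is always exactly six digits:
awk 'BEGIN{ FS = OFS = "," }
NR > 1 {                          # skip the header line
  t = sprintf("%06d", $2)         # zero-pad the 2nd field to 6 digits
  $2 = substr(t,1,2) ":" substr(t,3,2) ":" substr(t,5,2)
  print
}' file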

With awk formatting + substitution magic:
awk 'BEGIN{ FS = OFS = "," }
NR > 1{ $2=sprintf("%06d", $2); gsub(/[0-9]{2}/, "&:", $2);
$2=substr($2, 1, 8); print }' file
The output:
value1,00:00:21,value3,value4,
value1,00:02:11,value3,value4,

With sed:
$ sed -nE '2,$s/,([0-9]+)/,00000\1/;s/,0+(..)(..)(..),/,\1:\2:\3,/p' file
value1,00:00:21,value3,value4,
value1,00:02:11,value3,value4,
I think it can be simplified a little bit. The first substitution (applied from line 2 on) left-pads the second field with zeros; the second trims the padding back down to six digits while inserting the colons, and since -n is used with the p flag, only lines where that second substitution succeeds are printed, which also drops the header.
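The question mentions a whole directory of incoming .csv files; a small shell loop sketch for applying the command to each of them (the incoming/ path and the .fixed.csv output naming are assumptions):
for f in incoming/*.csv; do       # incoming/ is a placeholder path
  sed -nE '2,$s/,([0-9]+)/,00000\1/;s/,0+(..)(..)(..),/,\1:\2:\3,/p' "$f" > "${f%.csv}.fixed.csv"
done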

Related

awk to get first column if a specific number in the line is greater than a given value

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I expect to get the first column ($1) only if the ETA= number is greater than 15; for the sample above, only the first column of the 2nd and 3rd lines should be printed:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line when ETA is at the end.
Using awk, splitting each line on =, commas, and spaces, then scanning the fields for ETA and numerically comparing the following field (+0 forces the HH:MM value down to its leading number) against 15:
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1)+0 > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capture groups and saves their values into the array arr. If the second element of arr is greater than 15, it prints the first, as required.
awk '
match($0, /(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/, arr) && arr[2]+0 > 15 {
  print arr[1]
}
' Input_file
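A quick standalone check of what ends up in arr (the sample string is made up for illustration):
$ gawk 'BEGIN{ match("345 pro=rbs ETA=23:00", /(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/, arr); print arr[1], arr[2] }'
345 23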
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index to find where ETA= is, and substr to get the 2 characters after it; 4 is added because ETA= is 4 characters long and index gives the start position. I use +0 to convert the result to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a field separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. So you probably want to split the line separately to obtain the first, space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one, if any). Adding +0 forces a numeric conversion, which ignores any non-numeric text after the number at the beginning of the field. So for example, on the first line $2 is literally 12:00, team=xyz,user1=tom,dom=dby.com, but the comparison effectively checks whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
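A quick way to see this +0 coercion in isolation (the sample string is only illustrative):
$ awk 'BEGIN{ s = "12:00, team=xyz"; print s+0 }'
12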
Using awk, you could match ETA= followed by 1 or more digits, then take the matched text without the ETA= part, check whether that number is greater than 15, and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
  if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should also start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
  if (substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
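match() sets RSTART (where the match begins) and RLENGTH (its length), which is what the RSTART+4, RLENGTH-4 arithmetic above relies on; a quick illustration on a made-up string:
$ awk 'BEGIN{ if (match("foo ETA=23:00", /ETA=[0-9]+/)) print RSTART, RLENGTH }'
5 6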

Searching for a string between two characters

I need to find two numbers in lines which look like this:
>Chr14:453901-458800
I have a large quantity of those lines mixed with lines that don't contain ":", so we can search for a colon to find the lines with numbers. Every line has different numbers.
I need to find both numbers after ":", which are separated by "-", then subtract the first number from the second and print the result for each line.
I'd like this to be done using awk
I managed to do something like this:
awk -e '$1 ~ /\:/ {print $0}' file.txt
but it's nowhere near the end result
For the example I showed above, my result would be:
4899
Because it is the result of 458800 - 453901 = 4899
I can't figure it out on my own and would appreciate some help
With GNU awk: separate the row into multiple columns using : and - as separators. In each row containing :, subtract the contents of column 2 from the contents of column 3 and print the result.
awk -F '[:-]' '/:/{print $3-$2}' file
Output:
4899
Using awk
$ awk -F: '/:/ {split($2,a,"-"); print a[2] - a[1]}' input_file
4899
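If you also want to keep the sequence name next to the difference, a small extension of the same idea (the output format here is an assumption):
$ awk -F '[:-]' '/:/ { print substr($1, 2) ": " ($3 - $2) }' file.txt
Chr14: 4899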

How to replace a specific character in a file, only on the lines that contain a specific count of that character?

I would like to double the 4th comma on the lines containing exactly 7 commas, in all the CSVs of a folder.
In this command line, I double the 4th comma:
sed  's/,/,,/4' Person_7.csv > new.csv
In this command line, I can find and count all the commas in a line:
sed 's/[^,]//g' dat | awk '{ print length }'
In this command line, I can count and create a new file with lines containing 7 commas:
awk -F , 'NF == 7' <Person_test.csv >Person_7.csv
But I don't know how to combine these to do the specific work...
You need something to select only the lines that contain exactly 7 commas and then operate on just these lines. You can do that with sed:
sed '/^\([^,]*,\)\{7\}[^,]*$/s/,/&&/4'
where ^\([^,]*,\)\{7\}[^,]*$ defines a line that contains exactly 7 commas.
It's a bit easier with awk, though:
awk -F, -v OFS=, 'NF == 8 { $4 = $4 OFS } 1'
This sets the input and output field separators to a comma, and then for lines with 8 fields (7 commas) appends a , to the end of the 4th field, doubling the comma. The final 1 makes sure every line gets printed.
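To process every CSV in a folder, as the question asks, a minimal shell sketch (folder/ is a placeholder path; the edit is written to a temporary file and then moved back):
for f in folder/*.csv; do
  awk -F, -v OFS=, 'NF == 8 { $4 = $4 OFS } 1' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done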

Decrease string length in pattern

I have a string like so:
text1;text2;text3;text4;text5;text6
and I need to decrease the length of the string after the 3rd occurrence of ";" (here text4) to, for example, 4 characters, giving text.
So far I did:
echo "$(cat pp.txt | awk -F ";" '{print $4}' | sed 's/^\(...\).*/\1/;q' )"
But the output is "text". What I need is:
text1;text2;text3;text;text5;text6
Please help
You could do it all with awk like
awk -F\; 'BEGIN {OFS=FS} {$4=substr($4, 1, 4); print}' pp.txt
You can modify the fields in awk, which we do with $4=substr($4, 1, 4). That takes the substring of $4 from the 1st through the 4th character (awk strings are 1-indexed) and stores it back into the fourth field. Then we just print the line with the updated value.
Also, we set the output field separator to the same one we specify on the command line, so the separator is preserved when printing.
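For a quick check without a file, you can feed the sample line on stdin:
$ echo 'text1;text2;text3;text4;text5;text6' | awk -F\; 'BEGIN{OFS=FS} {$4=substr($4, 1, 4); print}'
text1;text2;text3;text;text5;text6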

How to set FS to eof?

I want to read the whole file, not line by line. How can I change the field separator to an EOF symbol?
I do:
awk "^[0-9]+∆DD$1[PS].*$" $(ls -tr)
$1 - a param (some integer), .* - the message that I want to print. The problem: the message can contain \n, so this code prints only the first line of the file. How can I scan the whole file rather than line by line?
Can this be done using awk, sed, or grep? The script must be at most 60 characters long (including spaces).
Assuming you mean record separator, not field separator, with GNU awk you'd do:
gawk -v RS='^$' '{ print "<" $0 ">" }' file
Replace the print with whatever you really want to do and update your question with some sample input and expected output if you want help with that part too.
The portable way to do this, by the way, is to build up the record line by line and then process it in the END section:
awk '{rec = rec (NR>1?RS:"") $0} END{ print "<" rec ">" }' file
using nf = split(rec,flds) to create fields if necessary.
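For example, to split the slurped text into whitespace-separated fields with the portable approach (treating any run of whitespace as a separator is an assumption):
awk '{ rec = rec (NR>1 ? RS : "") $0 }   # rebuild the whole file in rec
END {
  nf = split(rec, flds, /[[:space:]]+/)  # split the slurped text on runs of whitespace
  print nf " fields; first: " flds[1]
}' file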
