sed/awk - Put all text on the same line as a preceding number - bash

How can I get all text that follows a 'number:number' line onto the same line as the preceding 'number:number'?
10:15
text line one
text line two
text no pattern
11:12
random text
text is random
totally random
could be four lines
could be five
Should then become
10:15 text line one text line two text no pattern
11:12 random text text is random totally random could be four lines could be five

This works for your example-
tr '\n' ' ' < file.txt | sed 's/[0-9]*:[0-9]*/\n&/g'
Explanation-
tr will initially put everything on the same line.
Then that sed one liner will insert new lines before each num:num pattern.
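Note that the pipeline above leaves an empty first line and a trailing space, because tr also turns the final newline into a space and sed puts a newline in front of the very first number pair. If that matters, a single awk pass can do the same job. This is only a sketch: the name join_records is made up here, and it assumes the input starts with a number:number line.

```shell
# join_records: hypothetical helper, not part of the answer above.
# Starts a new output line at every line that is exactly number:number
# and appends every other line to the current output line.
join_records() {
  awk '
    /^[0-9]+:[0-9]+$/ {              # header line: begin a new record
      if (NR > 1) printf "\n"
      printf "%s", $0
      next
    }
    { printf " %s", $0 }             # any other line: append after a space
    END { if (NR > 0) printf "\n" }  # terminate the last record
  ' "$@"
}

printf '10:15\ntext line one\ntext line two\n11:12\nrandom text\n' | join_records
```

Usage would be join_records file.txt, or piping into it as in the last line.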

Given that input file, all you need to do is tell awk to read one blank-line-separated paragraph at a time using RS=<null>, and to rebuild each record using the default OFS value of a single blank:
$ awk -v RS= '{$1=$1}1' file
10:15 text line one text line two text no pattern
11:12 random text text is random totally random could be four lines could be five

Both the sed and the awk solution below join lines until a new record header is detected or the input ends, at which point the joined lines are printed and the buffer is cleared. Use either one.
the sed oneliner
sed -nr '/^[0-9]{2}:[0-9]{2}$/!{H;$!b}; x; s/\n/ /gp'
the awk script
awk '
!/^[0-9]{2}:[0-9]{2}$/ {
lines=lines" "$0
next
}
{if(lines) print lines; lines=$0}
END {print lines}
'

Here is a GNU AWK script:
script.awk
BEGIN { RS = "\n[0-9]+:[0-9]+|\n$" }
{ gsub(/\n/," ",$0)
printf( "%s%s", $0,RT) }
Use it like this: awk -f script.awk file.txt
It uses the GNU AWK specific extensions RT and regex RS:
The record separator is set to a newline followed by a colon-separated number pair.
To handle the final newline at the end of the file, "|\n$" is added so that the last newline in the file also matches.
The leading "\n" makes the separation start at the second pair, so the first pair "10:15" is part of the first $0 and not of RT.

The trick here is that you want to split the file on paragraphs instead of lines. In awk, if you set RS="" it enables paragraph mode. Each iteration of the awk loop will have a paragraph in $0. You can then substitute the newlines and turn them into spaces.
awk <data.txt 'BEGIN { RS = "" ; FS = "\n" } { gsub(/\n/, " ", $0) ; print }'
Output:
10:15 text line one text line two text no pattern
11:12 random text text is random totally random could be four lines could be five
The benefit of this is that awk handles all the special cases for you: files that end in a blank line, end without a blank line, end without a newline, etc.
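As a quick check of that claim, here is a small sketch that feeds the same paragraph-mode command two differently terminated inputs. The wrapper name join_paragraphs and the shortened sample text are assumptions made for illustration.

```shell
# join_paragraphs: hypothetical wrapper around the paragraph-mode command.
join_paragraphs() {
  awk 'BEGIN { RS = "" } { gsub(/\n/, " "); print }' "$@"
}

# Input that ends with extra blank lines
printf '10:15\none\ntwo\n\n11:12\nthree\n\n\n' | join_paragraphs

# Input that ends without a final newline
printf '10:15\none\ntwo\n\n11:12\nthree' | join_paragraphs
```

Both calls should print the same two joined lines, one per paragraph.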

Related

How to replace a specific character in a file, only on the lines by counting this specific character in the line?

I would like to double the 4th comma in the lines counting 7 and only 7 commas in all the csv's of a folder.
In this command line, I double the 4th comma:
sed  's/,/,,/4' Person_7.csv > new.csv
In this command line, I can find and count all the commas in a line:
sed 's/[^,]//g' dat | awk '{ print length }'
In this command line, I can count and create a new file with lines containing 7 commas:
awk -F , 'NF == 7' <Person_test.csv >Person_7.csv
But I don't know how to do the specific work...
You need something to select only the lines that contain exactly 7 commas and then operate on just these lines. You can do that with sed:
sed '/^\([^,]*,\)\{7\}[^,]*$/s/,/&&/4'
where ^\([^,]*,\)\{7\}[^,]*$ defines a line that contains exactly 7 commas.
It's a bit easier with awk, though:
awk -F, -v OFS=, 'NF == 8 { $4 = $4 OFS } 1'
This sets the input and output field separators to a comma, and then for lines with 8 fields (7 commas) appends a comma to the end of the 4th field, doubling the comma. The final 1 makes sure every line gets printed.
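The question asks about all the CSV files in a folder, which neither command covers on its own. One possible way to apply the awk version to a whole folder is sketched below. The function names and the idea of writing the results to a separate output directory are assumptions, chosen so the originals are never overwritten.

```shell
# double_fourth_comma: the awk command from the answer, wrapped for reuse.
double_fourth_comma() {
  awk -F, -v OFS=, 'NF == 8 { $4 = $4 OFS } 1' "$@"
}

# fix_folder: hypothetical helper; usage: fix_folder INPUT_DIR OUTPUT_DIR
# Writes a processed copy of every .csv in INPUT_DIR to OUTPUT_DIR.
fix_folder() {
  mkdir -p "$2" || return 1
  for f in "$1"/*.csv; do
    [ -e "$f" ] || continue          # no .csv files: the glob stays literal
    double_fourth_comma "$f" > "$2/$(basename "$f")"
  done
}
```

For example, fix_folder ./csv ./csv_fixed would leave ./csv untouched and put the rewritten files in ./csv_fixed (both directory names are placeholders).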

Append and replace using awk/sed

I have this file:
2016,05,P,0002 ,CJGLOPSD8
00,BBF,BBDFTP999,051000100,GBP, , -2705248.00
00,BBF,BBDFTP999,059999998,GBP, , -3479679.38
00,BBF,BBDFTP999,061505141,GBP, , -0.40
00,BBF,BBDFTP999,061505142,GBP, , 6207621.00
00,BBF,BBDFTP999,061505405,GBP, , -0.16
00,BBF,BBDFTP999,061552000,GBP, , -0.24
00,BBF,BBDFTP999,061559010,GBP, , -0.44
00,BBF,BBDFTP999,062108021,GBP, , -0.34
00,BBF,BBDFTP999,063502007,GBP, , -0.28
I want to programmatically (in unix, or informatica if possible) grab the first two fields in the top row, concatenate them, append them to the end of each line and remove that first row.
Like so:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
This is my current attempt:
awk -vvar1=`cat OF\ OPSDOWN8.CSV | head -1 | cut -d',' -f1` -vvar2=`cat OF\ OPSDOWN8.CSV | head -1 | cut -d',' -f2` 'BEGIN {FS=OFS=","} {print $0, var 1var2}' OF\ OPSDOWN8.CSV> OF_OPSDOWN8.csv
Any pointers? I've tried looking around the forum but can only find answers to part of my question.
Thanks for your help.
Use this awk:
awk 'BEGIN{FS=OFS=","} NR==1{val=$1$2;next} {gsub(/ */,"");print $0,val}' file
Explanation:
BEGIN{FS=OFS=","} - This block will set FS (Field Separator) and OFS (Output Field Separator) as ,.
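NR==1{val=$1$2;next} - For line number 1, store the concatenation of the fields $1 and $2 in val, then skip to the next line so the header row is not printed.
gsub(/ */,"") - Remove all spaces from the current line.
print $0,val - Print $0 (the whole line) followed by the stored value from val.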
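To see the steps above on a small input, here is a sketch that wraps the same logic in a function and runs it on the header plus the first data row from the question. The function name append_period is made up, and the regex is written as / +/ (one or more spaces) instead of / */, which should give the same result without matching empty strings.

```shell
# append_period: hypothetical wrapper around the awk command above.
append_period() {
  awk 'BEGIN { FS = OFS = "," }
       NR == 1 { val = $1 $2; next }      # header: keep 2016 and 05 as 201605
       { gsub(/ +/, ""); print $0, val }  # strip spaces, append the stored value
  ' "$@"
}

printf '2016,05,P,0002 ,CJGLOPSD8\n00,BBF,BBDFTP999,051000100,GBP, , -2705248.00\n' | append_period
# prints: 00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
```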
I would use the following awk command:
awk 'NR==1{d=$1$2;next}{$(NF+1)=d;gsub(/[[:space:]]/,"")}1' FS=, OFS=, file
Explanation:
NR==1{d=$1$2;next} applies to line 1 and sets a variable d(ate) to the concatenation of the first and second fields. The variable is used when processing the remaining lines. next tells awk to move on to the next line right away without processing further instructions on this line.
{$(NF+1)=d;gsub(/[[:space:]]/,"")}1 appends a new field to the line (NF is the number of fields, so assigning d to $(NF+1) effectively adds a field). gsub() is used to remove whitespace. The 1 at the end always evaluates to true and makes awk print the modified line.
FS=, is a command line argument. It sets the input field delimiter to a comma.
OFS=, is a command line argument. It sets the output field delimiter to a comma.
Output:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
With sed:
sed '1{s/\([^,]*\),\([^,]*\),.*/\1\2/;h;d};/.*/G;s/\n/,/;s/ //g' file
In ERE mode:
sed -r '1{s/([^,]*),([^,]*),.*/\1\2/;h;d};/.*/G;s/\n/,/;s/ //g' file
Output:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
This might work for you (GNU sed):
sed '1s/,//;1s/,.*//;1h;1d;s/ //g;G;s/\n/,/' file
For the first line only: remove the first comma, remove from the next comma to the end of the line, store the amended line in the hold space (HS) and then delete the current line (the d abruptly ends processing). For subsequent lines: remove all spaces, append the HS and replace the newline (from the G command) with a comma.
Or if you prefer:
sed '1{s/,//;s/,.*//;h;d};s/ //g;G;s/\n/,/' file
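For readers who want to follow the hold-space steps one at a time, the same commands can be spread over separate -e expressions. This is a sketch under the assumption that the layout below behaves the same as the one-liner; the function name append_period_sed is made up.

```shell
# append_period_sed: hypothetical wrapper, one sed command per -e.
# Line 1:      drop the first comma, drop everything from the next comma on,
#              copy the result to the hold space (h), delete the line (d).
# Other lines: remove spaces, append the hold space (G), then turn the
#              newline that G inserted into a comma.
append_period_sed() {
  sed -e '1{' \
      -e 's/,//' \
      -e 's/,.*//' \
      -e 'h' \
      -e 'd' \
      -e '}' \
      -e 's/ //g' \
      -e 'G' \
      -e 's/\n/,/' "$@"
}

printf '2016,05,P,0002 ,CJGLOPSD8\n00,BBF,BBDFTP999,061505141,GBP, , -0.40\n' | append_period_sed
# prints: 00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
```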
If you want to use Informatica for this, use two Source Qualifiers. Read the file twice - just one line in one SQ (filter out the rest) and in the second SQ read the whole file except the first line (skip header). Join the two on dummy port and you're done.

Need to convert single text column to a single row and then split the row based on the pattern

I am very new to bash programming and need to convert a single text column into a single row, and then separate the entries in the row based on a pattern.
I have a text document with one column; each line holds one letter followed by six digits:
a111111
b222222
c333333
d444444
e555555
I need to transform the column above into the following row:
'a111111','b222222','c333333','d444444','e555555'
Could someone please advise how this can be achieved?
You can use awk with printf:
awk -v ORS=, 'NR>1{printf "%s", ORS} {printf "\x27%s\x27", $0}' file
\x27 prints a single quote.
From the 2nd record onwards it first prints ORS (which is set to a comma) and then the quoted line.
Output:
'a111111','b222222','c333333','d444444','e555555'
Another approach:
sed -r 's/^|$/\x27/g' file | paste -sd,
sed adds the single quotes at the beginning and end of each line, and paste joins the lines together with commas.
Or, print a comma for each line, and when you're done back up 1 character and overwrite the last comma with a space:
awk '{printf "'\''%s'\'',", $0} END {printf "\b \n"}' file
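If the backspace trick feels fragile (the comma is only overwritten on a terminal; in a file it is still there, followed by a backspace and a space), a variant that prints the separator before every line except the first avoids it. This is a sketch, and the name quote_join is an assumption.

```shell
# quote_join: hypothetical helper; quotes each input line and joins
# the results with commas, ending with exactly one newline.
quote_join() {
  awk -v q="'" '
    { printf "%s%s%s%s", (NR > 1 ? "," : ""), q, $0, q }
    END { if (NR > 0) print "" }   # finish with a single newline
  ' "$@"
}

printf 'a111111\nb222222\nc333333\n' | quote_join
# prints: 'a111111','b222222','c333333'
```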

I have a text file and I need to delete the first blank line and then all the text after the 2nd blank line

I'm using bash and I have a file that consists of 3 parts of text: the first part, then a blank line, then the 2nd part, then another blank line, then the third part. I need to output this to a new file that contains only the first 2 parts, without the blank line in between. I've been playing with sed and awk, but can't quite figure it out.
Most simply with awk:
awk -v RS= 'NR <= 2' filename
With an empty record separator RS, awk splits the file into records at empty lines. With the selection NR <= 2, only the first two are printed (delimited by the default output record separator, which is a newline).
If the file is very large, it might be prudent to amend this to
awk -v RS= '1; NR == 2 { exit }' filename
This stops processing the file after the second record and prints all until then.
Addendum: Obligatory crazy sed solution (not recommended for use, written for fun):
sed -n '/^$/ { x; /./q; H; d; }; p' filename
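A small worked example of the awk approach, using made-up sample text and a hypothetical wrapper name first_two_parts:

```shell
# first_two_parts: hypothetical wrapper around the early-exit awk command.
first_two_parts() {
  awk -v RS= '1; NR == 2 { exit }' "$@"
}

printf 'first part\nstill first\n\nsecond part\n\nthird part\n' | first_two_parts
# prints:
# first part
# still first
# second part
```

Redirecting the output, as in first_two_parts filename > newfile, would give the new file the question asks for (both file names are placeholders).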

Split a big txt file to do grep - unix

I work (Unix, shell scripts) with txt files that contain millions of fields separated by pipes, with no \n or \r between records.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All text is in the same line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to find a field (e.g. field2), a command like grep doesn't help, because everything is on the same line.
I think a good solution would be a script that inserts a "\n" after every 6 fields and then runs grep. Am I right? Thank you very much!
With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf "%s|", $(i+j); printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
You can easily change the number of fields per line by adjusting the two 6s.
Hope this helps !
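To make the number of fields per line a parameter instead of a hard-coded 6, the same nested loop can be wrapped as below. This is a sketch: the name split_every is made up, and it assumes, as in the sample file, that the input line ends with a trailing |.

```shell
# split_every: hypothetical helper; usage: split_every N [file...]
# Prints N pipe-terminated fields per output line.
split_every() {
  n=$1
  shift
  awk -F'|' -v n="$n" '{
    for (i = 1; i < NF; i += n) {          # NF counts the empty field after the last |
      line = ""
      for (j = 0; j < n && i + j < NF; j++)
        line = line $(i + j) "|"
      print line
    }
  }' "$@"
}

printf 'a1|b1|c1|a2|b2|c2|\n' | split_every 3
# prints:
# a1|b1|c1|
# a2|b2|c2|
```

With the file from the question, something like split_every 6 a | grep field2d would then find the record that contains the field.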
you can use sed to split the line in multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated multiple times
| followed by a '|'
the above is obviously one column and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
you can use sed to convert every 6th | to a newline.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
Consider this example, which uses groups of 2 instead of 6 to keep the input short:
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of at least one non-| characters.
[^|]\+| is a sequence of at least one non-| characters followed by a |.
\([^|]\+|\) is a sequence of at least one non-| characters followed by a |, grouped together
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.
Here is how I would do it with awk
awk -v RS="|" '{printf "%s%s", $0, (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just adjust the 7 in NR%7 to the number of fields per line that suits you.
What about printing the lines on blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} sets the input and output field separators to |.
{for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6 and prints six of them each time. Because print ends its output with a newline, each block lands on its own line. The loop condition is i<NF rather than i<=NF because the trailing | leaves an empty last field, which would otherwise produce an extra line of bare separators.
If you want to treat the fields as if they were separate lines, translate every | into a newline. For example, to get the 2nd field, just do:
tr \| \\n < input-file | sed -n 2p
To see which columns match a regex, do:
tr \| \\n < input-file | grep -n regex
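Since the number of fields per record is fixed, the line number that grep -n reports can be mapped back to a record and a field position. The sketch below does that arithmetic in awk; the function name locate_matches and its output format are assumptions made for illustration.

```shell
# locate_matches: hypothetical helper; usage: locate_matches REGEX N < input
# Reads pipe-separated data on stdin, with N fields per record, and reports
# the record and field number of every field that matches REGEX.
locate_matches() {
  tr '|' '\n' | awk -v re="$1" -v n="$2" '
    $0 ~ re {
      printf "record %d field %d: %s\n", int((NR - 1) / n) + 1, (NR - 1) % n + 1, $0
    }'
}

printf 'ann|lee|1|bob|ray|2|\n' | locate_matches '^ray$' 3
# prints: record 2 field 2: ray
```

Note that awk processes backslash escapes in values passed with -v, so a regex containing backslashes may need them doubled.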

Resources