Extract String before bracket and create new line - bash

I have data in below format
ABC-ERW 12344 ZYX 12345
FFANKN 2345 QW [123457, 89053]
FAFDJ-ER 1234 MNO [6532, 789, 234578]
I want to create the data in below format using sed or awk.
ABC-ERW 12344 ZYX 12345
FFANKN 2345 QW 123457
FFANKN 2345 QW 89053
FAFDJ-ER 1234 MNO 6532
FAFDJ-ER 1234 MNO 789
FAFDJ-ER 1234 MNO 234578
I can extract the data before the bracket, but I don't know how to repeatedly concatenate it with each value from inside the brackets.
My effort:
#!/bin/bash
while IFS= read -r line
do
echo "$line"
cnt=`echo $line | grep -o "\[" | wc -l`
if [ $cnt -gt 0 ]
then
startstr=`echo $line | awk -F[ '{print $1}'`
echo $startstr
intrstr=`echo $line | cut -d "[" -f2 | cut -d "]" -f1`
echo $intrstr
else
echo "$line" >> newfile.txt
fi
done < 1.txt
I am able to get the first part and also keep the rows without "[" in a new file, but I don't know how to get the values inside "[" and append each one at the end, as the number of values inside "[" varies from line to line.
Regards

With your shown samples, please try the following awk code.
awk '
match($0,/\[[^]]*\]$/){
num=split(substr($0,RSTART+1,RLENGTH-2),arr,", ")
for(i=1;i<=num;i++){
print substr($0,1,RSTART-1) arr[i]
}
next
}
1
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/\[[^]]*\]$/){ ##Using match function to match from [ till ] at the end of line.
num=split(substr($0,RSTART+1,RLENGTH-2),arr,", ") ##Splitting matched values by regex above and passing into array named arr with delimiters comma and space.
for(i=1;i<=num;i++){ ##Running for loop till value of num.
print substr($0,1,RSTART-1) arr[i] ##printing sub string before matched along with element of arr with index of i.
}
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.

Suggesting simple awk script:
awk 'NR==1{print}{for (i=2;i<NF;i++)print $1, $i}' FS="( \\\[)|(, )|(\\\]$)" input.1.txt
Explanation:
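FS="( \\\[)|(, )|(\\\]$)" Sets the awk field separator to be either " [", ", " or "]" at end of line.
This makes the interesting fields $2 through $(NF-1), each of which is appended to $1 (the last field is the empty string after the closing bracket).
NR==1{print} prints the first line only, as it is. Note that any later line without brackets is not printed.
{for (i=2;i<NF;i++)print $1, $i} from the 2nd line on, prints field $1 followed by the current field.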

This might work for you (GNU sed):
sed -E '/(.*)\[([^,]*), /{s//\1\2\n\1[/;P;D};s/[][]//g' file
Match the string up to the opening square bracket, and also the string after it up to the comma and space.
Replace the entire match with the leading and trailing matched strings, followed by a newline, the leading matched string and an opening square bracket.
Print and delete the first line, then repeat.
The last value of each list will fail to match because there is no trailing comma and space; in that case the remaining opening and closing square brackets are removed.
Alternative:
sed -E ':a;s/([^\n]*)\[([^,]*), /\1\2\n\1[/;ta;s/[][]//g' file
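For comparison, the while-loop from the question can also be completed in plain bash. This is only a minimal sketch and is not taken from any of the answers above: the function name expand_brackets is hypothetical, and it assumes at most one bracketed list per line, placed at the end of the line, with values separated by a comma and a space.

```bash
#!/bin/bash
# expand_brackets (hypothetical name): reads lines on stdin and prints one
# output line per value found inside a trailing [a, b, c] list.
expand_brackets() {
  local line prefix list item
  local -a items
  while IFS= read -r line; do
    if [[ $line == *\[*\] ]]; then
      prefix=${line%%\[*}     # text before the first "["
      list=${line#*\[}        # text after the first "["
      list=${list%\]}         # drop the trailing "]"
      # Split the list on comma and space into the array "items".
      IFS=', ' read -r -a items <<< "$list"
      for item in "${items[@]}"; do
        printf '%s%s\n' "$prefix" "$item"
      done
    else
      # Lines without a bracketed list are passed through unchanged.
      printf '%s\n' "$line"
    fi
  done
}
```

It could be called as expand_brackets < 1.txt > newfile.txt, using the file names from the question.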

Related

Processing text with multiple delims in awk

I have a text which looks like -
Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320
I want the number from the count:N field (1 in the first line, 5 in the second), and I wish to store these numbers in an array.
nums=($(echo -n "$grepResult" | awk -F ':' '{ print $4 }' | awk -F '|' '{ print $1 }'))
This seems very repetitive and not very efficient. Any ideas on how to simplify it?
You can use awk once, setting the field separator to |, then loop over all the fields and split each one on :
If the first part is count, print the second part of the split value.
This way the count: part can occur anywhere in the string, and it is printed each time it occurs.
nums=($(echo -n "$grepResult" | awk -F'|' '
{
for(i=1; i<=NF; i++) {
split($i, a, ":")
if (a[1] == "count") {
print a[2]
}
}
}
'))
for i in "${nums[@]}"
do
echo "$i"
done
Output
1
5
If you want to combine both splits, you can use [|:] as a character class for the field separator and print field number 8 for a precise match, as mentioned in the comments.
Note that this does not check whether the field is preceded by count:
nums=($(echo -n "$grepResult" | awk -F '[|:]' '{print $8}'))
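To see why field number 8 is the right one, it can help to print every field with its index. The following is a small sketch for inspection only; the helper name show_fields is hypothetical.

```bash
#!/bin/bash
# show_fields (hypothetical name): prints "index=value" for every field
# awk sees when the separator is either | or :
show_fields() {
  awk -F '[|:]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }'
}

show_fields <<< 'Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320'
```

The empty string between the two consecutive | characters counts as a field of its own, which is why count ends up as field 7 and its value as field 8.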
With GNU awk you can use a capture group to get a slightly more precise match, where the left and right of count:N can each be either the start/end of the string or a pipe char. The 2nd group matches 1 or more digits:
nums=($(echo -n "$grepResult" | awk 'match($0, /(^|\|)count:([0-9]+)(\||$)/, a) {print a[2]}' ))
Try sed
nums=($(sed 's/.*count://;s/|.*//' <<< "$grepResult"))
Explanation:
There are two sed commands separated with ; symbol.
The first command 's/.*count://' removes all characters up to and including 'count:'.
The second command 's/|.*//' removes all characters from the first '|' onward.
Command order is important here.
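If you would rather avoid external commands altogether, the same match can be sketched with the regular expression operator built into bash. This is an illustration under assumptions, not a drop-in replacement: the helper name extract_counts is hypothetical, it needs bash 4 or later for mapfile, and it only picks up the first count: field on each line.

```bash
#!/bin/bash
# extract_counts (hypothetical name): prints the digits that follow "count:"
# when that field is delimited by | or by the start/end of the line.
extract_counts() {
  local line re='(^|[|])count:([0-9]+)([|]|$)'
  while IFS= read -r line; do
    if [[ $line =~ $re ]]; then
      printf '%s\n' "${BASH_REMATCH[2]}"
    fi
  done
}

grepResult='Application.||dates:[2022-11-12]|models:[MODEL1]|count:1|ids:2320
Application.||dates:[2022-11-12]|models:[MODEL1]|count:5|ids:2320'

# mapfile stores one array element per output line, without word splitting.
mapfile -t nums < <(extract_counts <<< "$grepResult")
printf '%s\n' "${nums[@]}"
```

Keeping the regex in a variable sidesteps the quoting rules that apply to patterns written directly inside [[ ... ]].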

Remove first two lines, last two lines and space from file and add quotes on each line and replace newline with commas in shell script

I have an input.txt file which needs to be formatted by a shell script with the following conditions:
Remove the first two lines and the last two lines.
Remove all spaces in each line (each line has two spaces at the beginning and one space at the end).
Each line should be within single quotes (' ').
Finally, replace each newline ($) with a comma.
(original)
input.txt
sql
--------
Abce
Bca
Efr
-------
Row (3)
Desired output file
output.txt
'Abce','Bca','Efr'
I have tried using following commands
Sed -i 1,2d input.txt > input.txt
Sed "$(( $(wc -l <input.txt) -2+1)), $ d" Input.txt > input.txt
Sed ':a;N;$!ba;s/\n/, /g' input.txt > output.txt
But I get a blank output.txt.
Would you please try the following:
mapfile -t ary < <(tail -n +3 input.txt | head -n -2 | sed -E "s/^[[:blank:]]*/'/; s/[[:blank:]]*$/'/")
(IFS=,; echo "${ary[*]}")
tail -n +3 outputs the lines from the 3rd line onward, inclusive.
head -n -2 outputs all lines except the last 2.
sed -E "s/^[[:blank:]]*/'/" removes leading whitespace and prepends a single quote.
Similarly, the sed command "s/[[:blank:]]*$/'/" removes trailing whitespace and appends a single quote.
The syntax <(command ..) is a process substitution; the output of the commands within the parentheses is fed to mapfile via the redirect.
mapfile -t ary reads lines from the standard input into the array variable named ary.
echo "${ary[*]}" expands to a single string with the contents of the array ary separated by the first character of IFS, which has just been set to a comma.
The assignment of IFS and the array expansion are enclosed in parentheses so that they are executed in a subshell. This prevents IFS from being modified in the current process.
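The joining step can be tried on its own. The sketch below wraps it in a function (the name join_with_comma is hypothetical) and uses local IFS instead of a subshell, which has the same effect of keeping the change to IFS contained.

```bash
#!/bin/bash
# join_with_comma (hypothetical name): joins all arguments with commas.
# "$*" expands to the arguments separated by the first character of IFS.
join_with_comma() {
  local IFS=,
  printf '%s\n' "$*"
}

ary=("'Abce'" "'Bca'" "'Efr'")
join_with_comma "${ary[@]}"
```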
With your shown samples, please try following awk program. Written and tested in GNU awk, should work with any version.
awk -v s1="'" -v lines="$(wc -l < Input_file)" '
BEGIN{ OFS="," }
FNR==(lines-1) {
print val
exit
}
FNR>2{
sub(/^[[:space:]]+/,"")
val=(val?val OFS:"") (s1 $0 s1)
}
' Input_file
Explanation: Adding detailed explanation for above code, this is only for explanation purposes.
awk -v s1="'" -v lines="$(wc -l < Input_file)" ' ##Starting awk program, setting s1 variable to ' and creating lines which has total number of lines in it, using wc -l command on Input_file file.
BEGIN{ OFS="," } ##Setting OFS to comma in BEGIN section of this program.
FNR==(lines-1) { ##Checking condition if its 2nd last line of Input_file.
print val ##Then printing val here.
exit ##exiting from program from here.
}
FNR>2{ ##Checking condition if FNR is greater than 2 then do following.
sub(/^[[:space:]]+/,"") ##Substituting initial spaces with NULL here.
val=(val?val OFS:"") (s1 $0 s1) ##Creating val which has ' current line ' in it and keep adding it in val.
}
' Input_file ##Mentioning Input_file name here.
If you know the input is small enough to fit in memory:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); out=out sep p2; sep="," }
{ p2=p1; p1=$0 }
END { print out }
' input.txt
'Abce','Bca','Efr'
Otherwise:
$ awk '
NR>4 { gsub(/^ *| *$/,"\047",p2); printf "%s%s", sep, p2; sep="," }
{ p2=p1; p1=$0 }
END { print "" }
' input.txt
'Abce','Bca','Efr'
Either script will work using any awk in any shell on every Unix box.
This might work for you (GNU sed):
sed -E '1,2d;$!H;$!d;x;s/^\s*(.*)\s*$/'\''\1'\''/mg;s/\n[^\n]*$//;y/\n/,/' file
Delete the first two lines.
Append each line to the hold space, except for the last (this means the second from last line will still be present - see later).
Delete all lines except for the last.
Swap to the hold space.
Remove all spaces either side of the words on each line and surround those words by single quotes.
Remove the last line and its newline.
Replace all newlines by commas.
The first sed -i overwrites input.txt with an empty file. You can't write output back to the file you are reading, and sed -i does not produce any output anyway.
The minimal fix is to take out the -i and string together the commands into a pipeline; but of course, sed allows you to combine the commands into a single script.
len=$(wc -l <input.txt)
sed -e '1,2d' -e "$((len - 3))"',$d' \
-e ':a' \
-e 's/^ \(.*\) $/'"'\\1'/" \
-e N -e '$!ba' -e 's/\n/, /g' input.txt >output.txt
(Untested; if your sed does not allow multiple -e options, needs refactoring to use a single string with semicolons or newlines between the commands.)
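As a simpler variation on the same idea, here is a sketch that keeps sed for the line selection and quoting and hands the joining to paste. It is not the script above: the function name format_rows is hypothetical, it joins with a bare comma to match the desired output, and it assumes the file has at least four lines.

```bash
#!/bin/bash
# format_rows (hypothetical name): drops the first two and last two lines of
# the given file, wraps each remaining line in single quotes, joins with commas.
format_rows() {
  local file=$1 len
  len=$(wc -l < "$file")
  # 1,2d drops the header lines; (len-1),$d drops the last two lines.
  sed -e '1,2d' -e "$((len - 1)),\$d" \
      -e "s/^[[:space:]]*/'/" -e "s/[[:space:]]*\$/'/" "$file" |
    paste -sd, -
}
```

Writing the result to a different file, for example format_rows input.txt > output.txt, avoids truncating the input before it has been read.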
This is hard to write and debug and brittle because of the ways you have to combine the quoting features of the shell with the requirements of sed and this particular script, but also more inherently because sed is a terse and obscure language.
A much more legible and maintainable solution is to switch to Awk, which allows you to express the logic in more human terms, and avoid having to pull in support from the shell for simple tasks like arithmetic and string formatting.
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("\047%s\047,", $0); }
END { for(j=1; j < i-1; ++j) printf "%s", a[j] }' input.txt >output.txt
This literally replaces all newlines with commas; perhaps you would in fact like to print a newline instead of the comma on the last line?
awk 'FNR > 2 { sub(/^ /, ""); sub(/ $/, "");
a[++i] = sprintf("%s\047%s\047", sep, $0); sep="," }
END { for(j=1; j < i-1; ++j) printf "%s", a[j]; printf "\n" }' input.txt >output.txt
If the input file is really large, you might want to refactor this to not keep all the lines in memory. The array a collects the formatted output and we print all its elements except the last two in the END block.
sed -E '
/^-+$/,/^-+$/!d
//d
s/^[[:space:]]*|[[:space:]]*$/'\''/g
' input.txt |
paste -sd ,
This uses a trick that doesn't work on all sed implementations, to print the lines between two patterns (the dashes in this case), excluding those patterns.
On the plus side, if the ---- pattern is at a different line number, it still works. The downside is that it breaks if that pattern (a line containing only dashes) occurs an odd number of times (i.e. not in pairs that wrap the lines you want).
Then substitute the line start and end (including whitespace) with single quotes.
Finally, pipe to paste to replace the newlines with commas, without a trailing comma.
Using sed
$ sed "1,2d; /-/,$ d; s/\s\+//;s/.*/'&'/" input_file | sed -z 's/\n/,/g;s/,$/\n/'
'Abce','Bca','Efr'
I'll post a sed solution which is rather light.
sed '$d' input.txt | sed "\$d; 1,2d; s/^\s*\|\s*$/'/g" | paste -sd ',' > output.txt
$d Remove last line with first sed
\$d Remove the last line. $ escaped with backslash as we are within double-quotes.
1,2d Remove the first two lines.
s/^\s*\|\s*$/'/g Replace all leading and trailing whitespace with single quotes.
Use paste to concatenate to a single, comma delimited strings.
If we know that the relevant lines always start with two spaces, then it can even be simplified further.
sed -n "s/\s*$/'/; s/^  /'/p" input.txt | paste -sd ',' > output.txt
-n suppresses printing lines unless told to
s/\s*$/'/ replaces trailing whitespace with a single quote
s/^  /'/p replaces two leading spaces with a single quote and prints the lines that match
paste concatenates the lines
Then an awk solution:
awk -v i=1 -v q=\' 'FNR>2 {
gsub(/^[[:space:]]*|[[:space:]]*$/, q)
a[i++]=$0
} END {
for(i=1; i<=length(a)-3; i++)
printf "%s,", a[i]
print a[i++]
}' input.txt > output.txt
-v i=1 create an awk variable starting at one
-v q=\' create an awk variable for the single quote character
FNR>2 { ... tells it to only process line 3+
gsub(/^[[:space:]]*|[[:space:]]*$/, q) substitute leading and trailing whitespace with single quotes
a[i++]=$0 add line to array
END { ... Process the rest after reaching end of file
for(i=1; i<=length(a)-3; i++) take the length of the array but subtract three -- representing the last three lines
printf "%s,", a[i] print all but last three entries comma delimited
print a[i++] print next entry and complete the script (skipping the last two entries)
Not a one liner but works
sed "s/^ */\'/;s/\$/\',/;1,2d;N;\$!P;\$!D;\$d" | sed ' H;1h;$!d;x;s/\n//g;s/,$//'
Explanation:
s/^ */\'/;s/\$/\',/ ---> Adds single quotes and comma
N;$!P;$!D;$d ---> Deletes last two lines
H;1h;$!d;x;s/\n//g;s/,$//' ---> Loads entire file and merge all lines and remove last comma

How to substring a particular length and append that substring text to end of line in unix

Question Edited
Sincere Apologies for editing the question!!
I want to substring using start position and end position for 2 strings "20200224" and "LN". And append that result substring text to the end of line.
For Example,
EDITED INPUT TEXT
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-0.orc
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc
I want to extract the substring "20200224" / "20200225", which has start_position=20 and end_position=27, and append it to the end of each line as below:
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-0.orc|20200224|LN
2020-02-25
07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc|20200224|LN
Like this more lines are there in the file.
I would like to search based on 2 strings, "process_date=" and "source_country_code=", and take the values between = and /, which are 20200224 and LN. These should be appended to the end of the line with the pipe | delimiter.
Use of awk:
line='/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc'
The above can be applied to each line while reading the file line by line.
var=`echo $line| tr '=' ' '| awk '{ print $2 }' | tr '/' ' ' | awk '{ print $1}'`
This takes everything after the first = and then everything before the next /.
echo "$line|$var"
To match the 20th to 27th characters from the line, and paste it at the end of the line:
paste -d'|' input_file <(cut -c20-27 input_file)
/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc|20200224
/acct/process_date=20200225/data_src=ACB/source_country_code=MO/ACB_ACCT_HK_MO_01-part-0.orc|20200225
To match the first number in the line, and paste it at the end of the line:
sed -i 's/\([0-9]\+\)\(.*\)/\1\2|\1/' input_file
/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-1.orc|20200224
/acct/process_date=20200225/data_src=ACB/source_country_code=MO/ACB_ACCT_HK_MO_01-part-0.orc|20200225
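For the edited question, where the values are located by the keys process_date= and source_country_code= rather than by position, a sed sketch with two capture groups could look like the following. The function name append_keys is hypothetical, and the sketch assumes that process_date= appears before source_country_code= on the line and that each value is terminated by a /.

```bash
#!/bin/bash
# append_keys (hypothetical name): appends |<process_date>|<source_country_code>
# to every line that contains both keys; other lines pass through unchanged.
append_keys() {
  # & is the whole matched line; \1 and \2 are the two captured values.
  sed -E 's#^.*process_date=([^/]*)/.*source_country_code=([^/]*)/.*$#&|\1|\2#'
}

append_keys <<< '07:24|/prd/data_fabric/prd_dfab_open/acct/process_date=20200224/data_src=ACB/source_country_code=LN/ACB_ACCT_HK_LN_01-part-0.orc'
```

Using # as the s command delimiter keeps the / characters in the pattern free of backslashes.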

print first 3 characters and / rest of the string with stars

I have input like this:
John:boofoo
I want to keep only the first 3 characters of the string after the colon and print the rest of it as stars.
The output should look like this:
John:boo***
This is my command:
awk -F ":" '{print $1,$2 ":***"}'
I want to use only the print command if possible. Thanks
With GNU sed:
echo 'John:boofoo' | sed -E 's/(:...).*/\1***/'
Output:
John:boo***
With GNU awk for gensub():
$ awk 'BEGIN{FS=OFS=":"} {print $1, substr($2,1,3) gensub(/./,"*","g",substr($2,4))}' file
John:boo***
With any awk:
awk 'BEGIN{FS=OFS=":"} {tl=substr($2,4); gsub(/./,"*",tl); print $1, substr($2,1,3) tl}' file
John:boo***
Could you please try the following. It keeps the first 3 characters of the 2nd field as they are and prints one star for each remaining character of that field.
awk '
BEGIN{
FS=OFS=":"
}
{
stars=""
val=substr($2,1,3)
for(i=4;i<=length($2);i++){
stars=stars"*"
}
$2=val stars
}
1
' Input_file
Output will be as follows.
John:boo***
Explanation: Adding explanation for above code too here.
awk '
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=":" ##Setting FS and OFS value as : here.
} ##Closing block of BEGIN section here.
{ ##Here starts main block of awk program.
stars="" ##Nullifying variable stars here.
val=substr($2,1,3) ##Creating variable val whose value is 1st 3 letters of 2nd field.
for(i=4;i<=length($2);i++){ ##Starting a for loop from 4 (because we need everything from the 4th character to the last in the 2nd field) till length of 2nd field.
stars=stars"*" ##Keep concatenating stars variable to its own value with *.
}
$2=val stars ##Assigning value of variable val and stars to 2nd field here.
}
1 ##Mentioning 1 here to print edited/non-edited lines for Input_file here.
' Input_file ##Mentioning Input_file name here.
Or even with good old sed
$ echo "John:boofoo" | sed 's/...$/***/'
Output:
John:boo***
(note: this just replaces the last 3 characters of any string with "***", so if you need to key off the ':', see the GNU sed answer from Cyrus.)
Another awk variant:
awk -F ":" '{print $1 FS substr($2, 1, 3) "***"}' <<< 'John:boofoo'
John:boo***
Since we have the tags awk, bash and sed: for completeness sake here is a bash only solution:
INPUT="John:boofoo"
printf "%s:%s\n" ${INPUT%%:*} $(TMP1=${INPUT#*:};TMP2=${TMP1:3}; echo "${TMP1:0:3}${TMP2//?/*}")
It uses two arguments to printf after the format string. The first one is INPUT with everything from the first : onward stripped off. Let's break down the second argument $(TMP1=${INPUT#*:};TMP2=${TMP1:3}; echo "${TMP1:0:3}${TMP2//?/*}"):
$(...) the string is interpreted as a bash command, and its output is substituted as the last argument to printf
TMP1=${INPUT#*:}; removes everything up to and including the :, and stores the string in TMP1.
TMP2=${TMP1:3}; gets all characters of TMP1 from offset 3 to the end and stores them in TMP2.
echo "${TMP1:0:3}${TMP2//?/*}" outputs the temporary strings: the first three chars from TMP1 unmodified and all chars from TMP2 as *
the output of this echo is the last argument to printf
Here is the bash -x output:
+ INPUT=John:boofoo
++ TMP1=boofoo
++ TMP2=foo
++ echo 'boo***'
+ printf '%s:%s\n' John 'boo***'
John:boo***
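The same parameter expansions can also be written without the command substitution. The sketch below wraps them in a function with an adjustable number of visible characters; the name mask_field and the second parameter are hypothetical, and the input is assumed to contain a colon.

```bash
#!/bin/bash
# mask_field (hypothetical name): keeps the first N characters after the
# first colon (N defaults to 3) and replaces the rest with stars.
mask_field() {
  local input=$1 keep=${2:-3} name value rest
  name=${input%%:*}      # everything before the first ":"
  value=${input#*:}      # everything after the first ":"
  rest=${value:keep}     # the part that will be masked
  printf '%s:%s%s\n' "$name" "${value:0:keep}" "${rest//?/*}"
}

mask_field 'John:boofoo'
```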
Another sed : replace all chars after the third by *
sed -E ':A;s/([^:]*:...)(.*)[^*]([*]*)/\1\2\3*/;tA'
Some more awk
awk 'BEGIN{FS=OFS=":"}{s=sprintf("%0*d",length(substr($2,4)),0); gsub(/0/,"*",s);print $1,substr($2,1,3) s}' infile
You can use the %* form of printf, which accepts a variable width. If you print the value 0 with zero padding on the left, you get a string of zeros of the requested width, which gsub then turns into stars.
More readable:
awk 'BEGIN{
FS=OFS=":"
}
{
s=sprintf("%0*d",length(substr($2,4)),0);
gsub(/0/,"*",s);
print $1,substr($2,1,3) s
}
' infile
Test Results:
$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.
$ cat f
John:boofoo
$ awk 'BEGIN{FS=OFS=":"}{s=sprintf("%0*d",length(substr($2,4)),0); gsub(/0/,"*",s);print $1,substr($2,1,3) s}' f
John:boo***
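The zero-padding trick can be tried on its own with the printf builtin in bash, which also accepts * as a width. This is a small sketch with a hypothetical function name, repeat_star. It guards against a width of 0, because with width 0 the format still prints a single 0, which would otherwise become one unwanted star.

```bash
#!/bin/bash
# repeat_star (hypothetical name): prints N stars by zero-padding the number 0
# to width N and then replacing every 0 with a star.
repeat_star() {
  local n=$1 s
  if (( n <= 0 )); then
    printf '\n'          # nothing to mask: print an empty line
    return
  fi
  printf -v s '%0*d' "$n" 0
  printf '%s\n' "${s//0/*}"
}

repeat_star 3
```

The same edge case applies to the awk version when the 2nd field has three characters or fewer.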
Another pure Bash, using the builtin regular expression predicate.
input="John:boofoo"
if [[ $input =~ ^([^:]*:...)(.*)$ ]]; then
printf '%s%s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]//?/*}"
else
echo >&2 "String doesn't match pattern"
fi
We split the string in two parts: the first part being everything up to (and including) the three chars found after the first colon (stored in ${BASH_REMATCH[1]}), the second part being the remaining part of string (stored in ${BASH_REMATCH[2]}). If the string doesn't match this pattern, we just insult the user.
We then print the first part unchanged, and the second part with every character replaced with *.

print 1st string of a line if last 5 strings match input

I have a requirement to print the first string of a line if last 5 strings match specific input.
Example: Specified input is 2
India;1;2;3;4;5;6
Japan;1;2;2;2;2;2
China;2;2;2;2
England;2;2;2;2;2
Expected Output:
Japan
England
As you can see, China is excluded as it doesn't meet the requirement (last 5 digits have to be matched with the input).
grep ';2;2;2;2;2$' file | cut -d';' -f1
$ in a regex stands for "end of line", so grep will print all the lines that end in the given string
-d';' tells cut to delimit columns by semicolons
-f1 outputs the first column
You could use awk:
awk -F';' -v v="2" -v count=5 '
{
c=0;
for(i=2;i<=NF;i++){
if($i == v) c++
if(c>=count){print $1;next}
}
}' file
where
v is the value to match
count is the minimum number of fields that must equal v for the first field to be printed
the for loop goes through all fields delimited by ; looking for matches
This script doesn't require the 5 matching values to be consecutive, or to be the last 5 fields.
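If the values really do have to be the last 5 fields, as in the question, the loop can be restricted to those fields. The following is a sketch of that variation rather than the script above; the function name first_if_last_n_match is hypothetical, and the comparison is numeric when both sides look like numbers.

```bash
#!/bin/bash
# first_if_last_n_match (hypothetical name): prints the first field of every
# line whose last COUNT fields all equal VALUE.
# Usage: first_if_last_n_match VALUE COUNT < file
first_if_last_n_match() {
  awk -F';' -v v="$1" -v count="$2" '
    NF > count {                      # need the first field plus count more
      for (i = NF - count + 1; i <= NF; i++)
        if ($i != v) next             # any mismatch: skip this line
      print $1
    }'
}

printf '%s\n' 'India;1;2;3;4;5;6' 'Japan;1;2;2;2;2;2' 'China;2;2;2;2' 'England;2;2;2;2;2' |
  first_if_last_n_match 2 5
```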
With sed:
sed -n 's/^\([^;]*\).*;2;2;2;2;2$/\1/p' file
It captures and outputs the leading non-; characters of lines ending with ;2;2;2;2;2
It can be shortened with GNU sed to:
sed -nE 's/^([^;]*).*(;2){5}$/\1/p' file
awk -F\; '/;2;2;2;2;2$/{print $1}' file
Japan
England
