Filter text file with very long lines and special symbols using bash - sorting

I have a text file with very long lines and special symbols inside. Here is an example:
{"keyword1":["A123","D356"],"keyword2":"ENXXXXXXXXXXXXXX","keyword3":[{"name1":["R3123","L2356"],"keyword4":"text here","keyword5":"4LJ"},{"app":,"keyword6":"XX-XX-XX-XXX-XXX-Axy - Important text here","keyword7":"FBG","{[ ** text here.........}
The text in keyword2 always starts with EN followed by 14 digits.
The text in keyword6 always starts with the alphanumeric format XX-XX-XX-XXX-XXX-Axx, where X is 0 to 9, A is the symbol A, and xx is 0 to 9 but may or may not be present. "Important text here" can contain any symbol, including &, /, \ and *.
Keywords may not be unique, but they can appear in the text only after keyword7.
What I want to achieve is to take the data from keyword2 and keyword6 and make a new file with three columns, separated with semicolons:
ENXXXXXXXXXXXXXX;XX-XX-XX-XXX-XXX-Axy;Important text here
I tried awk and sed, but with questionable success due to the many special symbols involved.

echo '{"keyword1":["A123","D356"],"keyword2":"ENXXXXXXXXXXXXXX","keyword3":[{"name1":["R3123","L2356"],"keyword4":"text here","keyword5":"4LJ"},{"app":,"keyword6":"XX-XX-XX-XXX-XXX-Axy - Important text here","keyword7":"FBG","{[ ** text here.........}' |
{m,g,n}awk NF=NF OFS=' )\n ( ' \
FS='^.+"keyword2":"|","keyword(3".+"keyword6":"|7".+$)| - '
)
( ENXXXXXXXXXXXXXX )
( XX-XX-XX-XXX-XXX-Axy )
( Important text here )
(
It should be trivial from here.
gawk 'gsub("^\n+|\n+$",_, $!(NF = NF))^_' OFS='\n' \
FS='^.+"keyword2":"|","keyword(3".+"keyword6":"|7".+$)| - '
ENXXXXXXXXXXXXXX
XX-XX-XX-XXX-XXX-Axy
Important text here
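If you want to go straight to the single semicolon-separated line the question asks for, here is a minimal sketch using plain POSIX awk with match(); it assumes the keyword names appear literally as shown, that the quoted values contain no escaped quotes, and that the input/output file names are placeholders:

awk 'BEGIN { OFS = ";" }
{
  if (match($0, /"keyword2":"[^"]*"/))   # quoted value after "keyword2":
    k2 = substr($0, RSTART + 12, RLENGTH - 13)
  if (match($0, /"keyword6":"[^"]*"/))   # quoted value after "keyword6":
    k6 = substr($0, RSTART + 12, RLENGTH - 13)
  sep = index(k6, " - ")                 # the first " - " separates the XX-XX-... code from the free text
  print k2, substr(k6, 1, sep - 1), substr(k6, sep + 3)
}' input.txt > output.csv

This prints one ENXXXXXXXXXXXXXX;XX-XX-XX-XXX-XXX-Axy;Important text here line per input line.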

Related

How to replace text in file between known start and stop positions with a command line utility like sed or awk?

I have been tinkering with this for a while but can't quite figure it out. A sample line within the file looks like this:
"...~236 characters of data...Y YYY. Y...many more characters of data"
How would I use sed or awk to replace spaces with a B character only between positions 236 and 246? In that example string it starts at character 29 and ends at character 39 within the string. I would want to preserve all the text preceding and following the target chunk of data within the line.
For clarification based on the comments, it should be applied to all lines in the file and expected output would be:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
With GNU awk:
$ awk -v FIELDWIDTHS='29 10 *' -v OFS= '{gsub(/ /, "B", $2)} 1' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
FIELDWIDTHS='29 10 *' means 29 characters for the first field, the next 10 characters for the second field, and the rest for the third field. OFS is set to empty, otherwise you'll get a space added between the fields.
With perl:
$ perl -pe 's/^.{29}\K.{10}/$&=~tr| |B|r/e' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
^.{29}\K match and ignore first 29 characters
.{10} match 10 characters
e flag to allow Perl code instead of string in replacement section
$&=~tr| |B|r convert space to B for the matched portion
Use this Perl one-liner with substr and tr. Note that this uses the fact that you can assign to substr, which changes the original string:
perl -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file > out_file
To change the file in-place, use:
perl -i.bak -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
I would use GNU AWK in the following way. For simplicity's sake, say we have file.txt with the content
S o m e s t r i n g
and want to change spaces from position 5 (inclusive) to position 10 (inclusive); then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=5;i<=10;i+=1)$i=($i==" "?"B":$i);print}' file.txt
output is
S o mBeBsBt r i n g
Explanation: I set the field pattern (FPAT) to any single character and the output field separator (OFS) to the empty string, so every field holds a single character and no superfluous space is added when print-ing. I use a for loop to access the desired fields, and for each one I check whether it is a space; if it is, I assign B, otherwise I keep the original value. Finally I print the whole changed line.
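For the character positions in the original question (29 to 39), the same idea can take the bounds as variables; a sketch along those lines, still assuming GNU AWK since FPAT is a gawk feature, with file standing in for the real input:

awk -v from=29 -v to=39 'BEGIN{FPAT=".";OFS=""}{for(i=from;i<=to;i+=1)$i=($i==" "?"B":$i);print}' file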
Using GNU awk:
awk -v strt=29 -v end=39 '{ ram=substr($0,strt,(end-strt));gsub(" ","B",ram);print substr($0,1,(strt-1)) ram substr($0,(end)) }' file
Explanation:
awk -v strt=29 -v end=39 '{ # Pass the start and end character positions as strt and end respectively
ram=substr($0,strt,(end-strt)); # Extract the 29th to the 39th characters of the line and read into variable ram
gsub(" ","B",ram); # Replace spaces with B in ram
print substr($0,1,(strt-1)) ram substr($0,(end)) # Rebuild the line incorporating ram and print the result
}' file
This is certainly a suitable task for perl, and it saddens me that my perl has become so rusty that this is the best I can come up with at the moment:
perl -e 'local $/=\1;while(<>) { s/ /B/ if $. >= 236 && $. <= 246; print }' input;
Another awk but using FS="":
$ awk 'BEGIN{FS=OFS=""}{for(i=29;i<=39;i++)sub(/ /,"B",$i)}1' file
Output:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
Explained:
$ awk ' # yes awk yes
BEGIN {
FS=OFS="" # set empty field delimiters
}
{
for(i=29;i<=39;i++) # between desired indexes
sub(/ /,"B",$i) # replace space with B
# if($i==" ") # couldve taken this route, too
# $i="B"
}1' file # implicit output
With sed :
sed '
H
s/\(.\{236\}\)\(.\{11\}\).*/\2/
s/ /B/g
H
g
s/\n//g
s/\(.\{236\}\)\(.\{11\}\)\(.*\)\(.\{11\}\)/\1\4\3/
x
s/.*//
x' infile
When you have an input string without \r, you can use:
sed -r 's/(.{236})(.{10})(.*)/\1\r\2\r\3/;:a;s/(\r.*) (.*\r)/\1B\2/;ta;s/\r//g' input
Explanation:
First put \r around the area that you want to change.
Next introduce a label to jump back to.
Next replace a space between 2 markers.
Repeat until all spaces are replaced.
Remove the markers.
In your case, where the length doesn't change, you can do without the markers.
Replace a space after 236..245 characters and try again when it succeeds.
sed -r ':a; s/^(.{236})([^ ]{0,9}) /\1\2B/;ta' input
This might work for you (GNU sed):
sed -E 's/./&\n/245;s//\n&/236/;h;y/ /B/;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Divide the problem into 2 lines, one with spaces and one with B's where there were spaces.
Then using pattern matching make a composite line from the two lines.
N.B. The newline can be used as a delimiter as it is guaranteed not to be in sed's pattern space.

Allow only specific characters, else null should transfer, in Unix

The allowed characters in the 2nd column are 0 to 9, A to Z, and the symbols "+" and "-". If only allowed characters are found in the 2nd column then the complete record should be transferred; else null should be transferred in the 2nd column.
Input
- 1|89+
- 2|-AB
- 3|XY*
- 4|PR%
Output
- 1|89+
- 2|-AB
- 3|<null>
- 4|<null>
grep -E '^[a-zA-Z0-9\+\-\|]+$' file > file1
but the above code discards the complete record if no match is found. I need all records: if the 2nd column matches, the record should be transferred unchanged, else null should be transferred in the 2nd column.
Use sed to replace everything after a pipe, that begins with zero or more characters in the class of digits, letters, plus or minus, followed by one character not in that class, up to the end of the string, with a pipe only.
sed 's/\|[0-9a-zA-Z+-]*[^0-9a-zA-Z+-].*$/|/' file
Using awk and character classes where supported:
$ awk 'BEGIN{FS=OFS="|"}$2~/[^[:alnum:]+-]/{$2=""}1' file
1|89+
2|-AB
3|
4|
Where not supported (such as mawk) use:
$ awk 'BEGIN{FS=OFS="|"}$2~/[^A-Za-z0-9+-]/{$2=""}1' file

Search for strings in a file which contain only specific symbols and numbers in Unix using the grep command

I want to search for lines in a file which contain only "+", numbers and letters. If any other character is present in the string then the complete string should be discarded.
A1264
13255
1255+*
*6_54
54789+
Output should be
A1264
13255
54789+
Records 3 and 4 should not come through, as they also contain other characters.
You can try something like:
grep -E '^[a-zA-Z0-9\+]+$'
This will accept only a to z chars (lowercase and uppercase), digits and the + sign.
If you have other symbols you can edit the command line:
# grep -E '^[a-fA-F0-9©]+$' a1
A1264
13255
54789©

awk sed backreference csv file

A question to extend a previous one here. (I prefer asking a new question rather than editing the first one. I may be wrong.)
EDIT: OK, I was wrong, I should have edited my first question. My bad (an SO question is an art, difficult to master).
I have a csv file with a semicolon as the field delimiter. Here is an extract of the csv file:
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
Here is the desired output :
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
I am looking for a solution to swap 2 patterns in each field.
pattern 1: any number, with an optional decimal mark (.) and optional decimal digits
e.g.: 1 / 1111.00 / 444444444.3 / 32 / 32.6666666 / 1.0 / ....
pattern 2: any string that begins with a left parenthesis, followed by one or more characters, ending with a right parenthesis
e.g.: (n,a,p) / (:) / (llll) / (d) / (123) / (1;2;3) ...
The solutions provided for the first question are right for a simple file that contains only one column. If I try them within the csv file, I face multiple failures.
So I tried a similar awk solution, which is (I think) more "column-oriented".
I have tried
awk -F";" '{print gensub(/([[:digit:].]*)(\(.*\))/, "\\2 \\1", "g")}' file
I thought that by fixing the field delimiter (;), "my regex swap" would succeed in every field. It was a mistake.
Here is an example of failure:
;(:);7320000(n,d);(:)
desired output --> ;(:);(n,d) 7320000;(:)
My questions (finally): why does awk fail when it succeeds with a one-column file? What is the best tool to face this challenge?
sed with a very long regex?
awk with a very long regex?
a for loop?
other tools?
PS: I know I am not clear. I have 2 problems (the English language and technical limitations). Sorry.
Your "question" is far too long, cluttered, and containing too many separate questions to wade through but here's how to get the output you want from the input you provided with any sed:
$ sed 's/\([0-9][0-9.]*\)\(([^)]*)\)/\2 \1/g' file
...;field;(:);(n,d) 10000;(:);field;....
...;field;(b) 123.12;(a) 123;(:) 123.00;....
Well, when parsing simple delimited files without any quoted values, awk usually comes to the rescue:
awk -vFS=';' -vOFS=';' '{
for (i = 1; i < NF; i++) {
split($i, t, "(")
if (length(t[1]) != 0 && length(t[2]) != 0) {
$i="("t[2]" "t[1]
}
}
print
}' <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
However, this will fail if fields are quoted, i.e. the separator ; appears inside the values...
First we set the input and output separator to ;
We iterate through all the fields in the line: for (i = 1; i < NF; i++)
We split the field on the ( character
If the first part split on ( has nonzero length and the second part also has nonzero length
We swap the two parts for this field and add a space (also restoring the ( that was removed at the beginning).
And then the line gets printed.
A solution using sed and xargs, but you need to know the number of fields in advance:
{
sed 's/;/\n/g' |
sed 's/\([^(]\{1,\}\)\((.*)\)/\2 \1/' |
xargs -d '\n' -n7 -- printf "%s;%s;%s;%s;%s;%s;%s\n"
} <<EOF
...;field;(:);10000(n,d);(:);field;....
...;field;123.12(b);123(a);123.00(:);....
EOF
For each ; I substitute a newline.
For each line I swap the string with at least one character before ( and the string inside ( ).
I then merge 7 lines using ; as separator with xargs and printf.
This might work for you (GNU sed):
sed -r 's/([0-9]+(\.[0-9]+)?)(\([^)]*\))/\3 \1/g' file
Look for a group of numbers (possibly with a decimal point) followed by a pair of parens and rearrange them in the desired fashion, globally throughout each line.

speed up my awk command? Answer must be awk :)

I have some awk code that is running really slow. The format of my file is tab delimited 5 column ASCII. I am operating on column 5 to get a count of appropriate characters to alter the value in column 4.
Example input line:
10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
If I find any "^" in $5 I want to not count it, or the following character.
Then I want to find out how many characters are ">" or "<" or "*" and remove them from the count. I'm guessing using a gsub, and 3 splits is less than ideal, especially since column 5 can occasionally be a very very long string.
awk '{l=$4; if($5~/>/ || $5~/</ || $5~/*/ ) {gsub(/\^./,"");l-=split($5,a,"<")-1;l-=split($5,a,">")-1;l-=split($5,a,"*")-1}
If the code runs successfully on the line above, l will be 27.
I am omitting the surrounding parts of the command to try and focus on the part I have a question about.
So, what is the best step to make this run faster?
Well, as I see it, your gsub pattern will not work, as the / was not closed. Anyway, if I get it correctly and you want the character count of $5 without some characters, I'd go with:
count=length(gensub("[><A-Z^]","","g",$5))
You should list your skippable characters between [ and ], and do not start with ^!
Do you need to use awk, or will this work instead?
cut -f 5 < $file | grep -v '^[A-Z]' | tr -d '<>*\n' | wc -c
Translation:
Extract the 5th field from the tab-delimited $file.
Remove all fields starting with a capital letter.
Remove the characters <, >, *, and newlines.
Count the remaining characters.
Here's a guess:
awk '
BEGIN {FS = OFS = "\t"}
{
str = $5
gsub(/\^.|[><*]/, "", str)
l = length(str)
}
'
This might work for you:
echo "10 5134832 N 28 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a" |
awk '/[><*^]/{t=$5;gsub(/[><*]|[\^]./,"",t);$4=length(t)}1'
10 5134832 N 27 Aaaaa*AAAAaAAAaAAAAaAAAA^]a^]a^Fa^]a
if you want to show the amended fifth field:
awk '/[><*^]/{gsub(/[><*]|[\^]./,"",$5);$4=length($5)}1'
