Sorting lines within paragraphs

I need to change the order of the lines of the paragraphs in a text file where each paragraph has this structure:
<body>blah blah</body>
<date>some date</date>
<user>some name</user>
I need the line with <user>some name</user> to be the first one in each paragraph. I.e.:
<user>some name</user>
<body>blah blah</body>
<date>some date</date>
How do I accomplish this, in awk, sed, etc.?

awk to the rescue!
Assuming paragraphs are separated by one or more blank lines, you can do this:
$ awk 'BEGIN{RS=""; OFS=FS="\n"}
       {for(i=1;i<=NF;i++)
          if($i~/user/) {$1=$i OFS $1; $i=""}}1' text
<user>some name</user>
<body>blah blah</body>
<date>some date</date>
<user>some name</user>
<body>blah blah</body>
<date>some date</date>
<user>some name</user>
<body>blah blah</body>
<date>some date</date>
You can fine-tune the pattern "user" for a more accurate match, but it works for the sample input.
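For example, anchoring on the opening tag makes the match stricter (only the pattern changes; text is the sample file name from above):
$ awk 'BEGIN{RS=""; OFS=FS="\n"}
       {for(i=1;i<=NF;i++)
          if($i~/^<user>/) {$1=$i OFS $1; $i=""}}1' text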

Perl can do custom sorting of the lines in a paragraph:
perl -00 -F'\n' -lane '
    print
        join "\n",
        sort {
            if    ($a =~ /<user>/) { -1 }
            elsif ($b =~ /<user>/) { +1 }
            else                   { $a cmp $b }
        }
        @F
' file
Notes:
-00 reads the file in paragraphs (separated by one or more blank lines)
-F'\n' uses a newline as the field separator, and -a splits the lines of the paragraph into the Perl array @F
the custom sorting block sorts lines containing <user> first, and all other lines lexically
one-liner-ized:
perl -00 -F'\n' -lape'$_=join"\n",sort{$a=~/<user>/?-1:$b=~/<user>/?1:$a cmp $b}@F' file

With sed:
sed '/<body>/{:a;N;/<user>/!ba};s/\(.*\)\n\(<user>.*\)/\2\n\1/' file
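Written out with comments (GNU sed syntax; the logic is identical to the one-liner above):
sed '
  # On a <body> line, keep appending input lines to the pattern
  # space until a <user> line has been read in.
  /<body>/{
    :a
    N
    /<user>/!ba
  }
  # Move the trailing <user>... line to the front of the block.
  s/\(.*\)\n\(<user>.*\)/\2\n\1/' file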

The following assumes that the <user>...</user> fragment appears on a line by itself, and that apart from these <user> lines, the other lines should NOT be re-ordered. Otherwise it is quite robust, efficient, and adaptable.
awk '
  # p(): print the buffered lines in order, then reset the buffer
  function p( i) { for(i=0;i<n;i++) print s[i]; n=0; }
  /<user>/ {print; p(); next;}   # print the <user> line first, then flush the buffer
  NF==0   {p(); print; next;}    # blank line: flush the buffer, keep the paragraph separator
  {s[n++]=$0}                    # buffer all other lines, preserving their order
  END { p() }'

This MIGHT be all you need:
$ awk -F'[<>]' -v OFS='\n' '{a[$2]=$0} !(NR%3){print a["user"], a["body"], a["date"]}' file
<user>some name</user>
<body>blah blah</body>
<date>some date</date>
It just depends what's in the parts of the input file that you haven't shown us.
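(Here a[$2]=$0 indexes each line by the tag name found between < and >, and !(NR%3) prints one reordered block after every third input line, so it assumes the three tag lines always follow each other with nothing in between.)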

Related

How to print all lines from a CSV file that contain the exact match of a string in the nth column?

I want to print all lines from a CSV file that match the character string "ac". So if column 2 equals "ac", then print this line.
Before
"id","make","model","yeear","product"
"11","ac","tot","1999","9iuh8b"
"12","acute","huu","1991","soyo"
"13","ac","auu","1992","shjoyo"
"14","bb","ayu","2222","jkqweh"
"15","bb","ac","8238","smiley"
After
"11","ac","tot","1999","9iuh8b"
"13","ac","auu","1992","shjoyo"
I attempted cat file | grep "ac", but this gives me all lines that have ac:
"11","ac","tot","1999","9iuh8b"
"12","acute","huu","1991","soyo"
"13","ac","auu","1992","shjoyo"
"15","bb","ac","8238","smiley"
Consider the surrounding double quotes:
$ awk -F, '$2=="\"ac\""' input.csv
"11","ac","tot","1999","9iuh8b"
"13","ac","auu","1992","shjoyo"
Or the same with regex pattern matching:
$ awk -F, '$2~/^"ac"$/' input.csv
"11","ac","tot","1999","9iuh8b"
"13","ac","auu","1992","shjoyo"

Remove redundant lines for "almost similar" strings

I have the below file:
ab=5
ac=6
ad=5
ba=5
bc=7
bd=4
ca=5
cb=7
cd=3
...
"ab" and "ba", "ac" and "ca", "bc" and "cb" are redundant.
How do I eliminate these redundant lines in bash?
Expected output:
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
$ awk '{x=substr($0,1,1); y=substr($0,2,1)} !seen[x>y?x y:y x]++' file
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
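(The key x>y?x y:y x puts the two characters into a fixed order, so "ab" and "ba" produce the same key and only the first line seen for a pair is printed.)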
Short awk solution:
awk '{ c1=substr($0,1,1); c2=substr($0,2,1) }!a[c1 c2]++ && !((c2 c1) in a)' file
c1=substr($0,1,1) - assign the extracted 1st character to variable c1
c2=substr($0,2,1) - assign the extracted 2nd character to variable c2
!a[c1 c2]++ && !((c2 c1) in a) - the crucial condition: print the line only if neither this 2-character pair nor its reversed counterpart has been seen before
The output:
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
Here's one with perl, a generic solution that works irrespective of the number of characters before =:
$ cat ip.txt
ab=5
ac=6
abd=51
ba=5
bad=23
bc=7
bd=4
ca=5
cb=7
cd=3
$ perl -F= -lane 'print if !$seen{join "",sort split//,$F[0]}++' ip.txt
ab=5
ac=6
abd=51
bc=7
bd=4
cd=3
like awk, Perl by default treats uninitialized variables as false
-F= uses = as the field separator; the resulting fields are saved in the @F array
$F[0] gives the first field, i.e. the characters before =
split//,$F[0] gives an array of the individual characters
sort does string sorting by default
join "" then forms a single string from the sorted characters, with the empty string as separator
See https://perldoc.perl.org/perlrun.html#Command-Switches for documentation on the -lane and -F options. Use -i for in-place editing.
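The same idea can also be sketched in GNU awk (assuming gawk, since splitting on the empty string and asort() are gawk extensions):
awk -F= '{
  n = split($1, ch, "")                      # gawk: empty separator splits $1 into single characters
  n = asort(ch)                              # gawk built-in: sort the characters
  key = ""
  for (i = 1; i <= n; i++) key = key ch[i]   # rebuild the key from the sorted characters
}
!seen[key]++' ip.txt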
Could you please try the following and let me know if it helps; I have written and tested it with GNU awk.
awk -F'=' '
{ split($1,array,"") }
!((array[1],array[2]) in a){
  a[array[1],array[2]];
  a[array[2],array[1]];
  print;
  next
}
!((array[2],array[1]) in a){
  a[array[1],array[2]];
  a[array[2],array[1]];
  print;
}
' Input_file
Output will be as follows.
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3

AWK: search substring in first file against second

I have the following files:
data.txt
Estring|0006|this_is_some_random_text|more_text
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here
allids.txt (here the columns are separated by semicolon; the real input is tab-delimited)
Estring|0006;MAR0593
Fstring|0002;MAR0592
Fstring|0028;MAR1195
Please note: in data.txt the important part is the first two "columns", i.e. name|number.
Now I want to use awk to search for the first part (name|number) of data.txt in allids.txt and output the corresponding second column (the one starting with MAR).
My expected output would be (again tab-delimited):
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
I do not know how to search for that first conserved part within awk; the rest should then be something like:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$0], [$1] }' data.txt allids.txt
I would use a set of field delimiters, like this:
awk -F'[|\t;]' 'NR==FNR{a[$1"|"$2]=$0; next}
$1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt
In your real data you can remove the ; from the delimiter set; it is only in here to be able to reproduce the example from the question.
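With the real tab-delimited allids.txt it would look like this (an assumption based on the question's description; only the delimiter set changes):
awk -F'[|\t]' 'NR==FNR{a[$1"|"$2]=$0; next}
               $1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt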
Here is another awk that uses a different field separator for each of the two files:
awk -F ';' 'NR==FNR{a[$1]=FS $2; next} {k=$1 FS $2}
k in a{$0=$0 a[k]} 1' allids.txt FS='|' data.txt
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
This command uses ; as FS for allids.txt and uses | as FS for data.txt.
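(A var=value assignment placed between file names on the awk command line is evaluated just before the following file is read, which is what allows a different FS per input file.)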

How to replace the empty field with the next line's content in a shell script

1,n1,abcd,1234
2,n2,abrt,5666
,h2,yyyy,123x
3,h2,yyyy,123y
3,h2,yyyy,1234
,k1,yyyy,5234
4,22,yyyy,5234
The above is my input file abc.txt. I want the missing first column value to be filled in with the first value of the next row.
example:
3,h2,yyyy,123x
3,h2,yyyy,123y
I want output like below:
1,n1,abcd,1234
2,n2,abrt,5666
3,h2,yyyy,123x  // the missing first column value is filled in with the 3 from the next row
3,h2,yyyy,123y
3,h2,yyyy,1234
4,k1,yyyy,5234
4,22,yyyy,5234
How can I implement this with the help of AWK or some other alternative in a shell script? Please help.
Using awk you can do:
awk -F, '$1 ~ /^ *$/ {    # first field empty (or only blanks): buffer the line
p=p RS $0
next
}
p!="" {                   # first complete line after one or more buffered lines
gsub(RS " +", RS $1, p)   # replace newline+blanks with newline+$1, filling the missing values
sub("^" RS, "", p)        # drop the leading record separator
print p                   # flush the buffered, now-filled lines
p=""
} 1' file
1,n1,abcd,1234
2,n2,abrt,5666
3,h2,yyyy,123x
3,h2,yyyy,123y
3,h2,yyyy,1234
4,k1,yyyy,5234
4,22,yyyy,5234
I would reverse the file, and then fill in the missing value from the previous line:
tac filename | awk -F, '$1 ~ /^[[:blank:]]*$/ {$1 = prev} {print; prev=$1}' | tac
This also fills in missing values on multiple consecutive lines.
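Reversing the file turns "take the value from the next line" into "carry forward the value from the previous line", which is why a single prev variable is enough here.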
With GNU sed:
$ sed '/^ ,/{N;s/ \(.*\n\)\([^,]*\)\(.*\)/\2\1\2\3/}' infile
1,n1,abcd,1234
2,n2,abrt,5666
3,h2,yyyy,123x
3,h2,yyyy,123y
3,h2,yyyy,1234
4,k1,yyyy,5234
4,22,yyyy,5234
The sed command does the following:
/^ ,/ { # If the line starts with 'space comma'
N # Append the next line
# Extract the value before the comma, prepend to first line
s/ \(.*\n\)\([^,]*\)\(.*\)/\2\1\2\3/
}
BSD sed would require an extra semicolon before the closing brace.
This only works when the lines with missing values are not contiguous.

awk delete all lines not containing substring using if condition

I want to delete lines where the first column does not contain the substring 'cat'.
So if the string in column 1 is 'caterpillar', I want to keep it.
awk -F"," '{if($1 != cat) ... }' file.csv
How can I go about doing this?
I want to delete lines where the first column does not contain the substring 'cat'
That can be taken care of by this awk:
awk -F, '!index($1, "cat")' file.csv
If that doesn't work, then I would suggest you provide your sample input and expected output in the question.
This awk does the job too:
awk -F, '$1 ~ /cat/{print}' file.csv
Explanation
-F : "Delimiter"
$1 ~ /cat/ : match pattern cat in field 1
{print} : print
A shorter command is:
awk -F, '$1 ~ "cat"' file.csv
-F sets the field delimiter (,)
$1 ~ "cat" is an unanchored regular expression match; it matches cat at any position in the first field.
As no action is given, awk assumes the default action, {print}.
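For contrast, an exact match of the whole field (rather than a substring match) would use a string comparison, which is roughly what the attempt in the question was reaching for; note that the unquoted cat there refers to an empty, uninitialized awk variable:
awk -F, '$1 == "cat"' file.csv
With this, the 'caterpillar' lines would not be kept, so it is not what is asked for here.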
