I have this text file that I need to sort per section.
#cat raw_file.txt
== other info ==
===instructions===
===english words===
this
is
only
test
=== missing words ===
==== include words ====
some
more
words
==== customer name ====
ram
sham
amar
akbar
anthony
==== cities ====
mumbai
delhi
pune
=== prefix ===
the
a
an
If I sort it "as is", then it starts with the 2-equal-sign headings, followed by the 3-equal-sign headings, and then all the words. How do I sort the words in each section separately?
# sort raw_file.txt
== other info ==
=== missing words ===
=== prefix ===
==== cities ====
==== customer name ====
==== include words ====
===english words===
===instructions===
a
akbar
amar
an
anthony
delhi
is
more
mumbai
only
pune
ram
sham
some
test
the
this
words
This is MediaWiki format, if that matters. At the moment I am sorting each and every section separately, and that is taking a lot of time.
#cat expected_output.txt
== other info ==
===instructions===
===english words===
is
only
test
this
=== missing words ===
==== include words ====
more
some
words
==== customer name ====
akbar
amar
anthony
ram
sham
==== cities ====
delhi
mumbai
pune
=== prefix ===
a
an
the
This will also keep the exact number of blank lines; in normal sorting order they would appear at the top, so they are instead handled by re-adding them at the bottom of each section.
$ awk 'BEGIN {s="sort"}
!NF {c++}
/^=/ {close(s);
for(i=1;i<=c;i++) print "";
c=0;
print;
next}
NF {print | s}' file
will generate...
== other info ==
===instructions===
===english words===
is
only
test
this
=== missing words ===
==== include words ====
more
some
words
==== customer name ====
akbar
amar
anthony
ram
sham
==== cities ====
delhi
mumbai
pune
=== prefix ===
a
an
the
If you're not worried about keeping the blank lines you could use:
awk '/=/ {c++} {print c+1, $0}' file.txt | sort -n | cut -d' ' -f2- | sed '/^$/d'
== other info ==
===instructions===
===english words===
is
only
test
this
=== missing words ===
==== include words ====
more
some
words
==== customer name ====
akbar
amar
anthony
ram
sham
==== cities ====
delhi
mumbai
pune
=== prefix ===
a
an
the
This approach works by prepending an index number to every line, incrementing the index by one every time a line contains an '=', then sorting on the index number first and the actual word second, and finally removing the index and the blank lines (which end up at the top of each 'section' after the sort).
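To see what the intermediate stage looks like, here's a minimal sketch on a made-up two-section sample (the file name is hypothetical; LC_ALL=C pins the locale so '=' reliably sorts before letters):

```shell
# Build a tiny two-section sample.
printf '== a ==\nzebra\napple\n== b ==\ncat\nbat\n' > sections.txt

# Step 1: prefix each line with its section index.
awk '/=/ {c++} {print c+1, $0}' sections.txt

# Step 2: sort by index (ties fall back to whole-line comparison,
# which puts each heading before its words), then strip the index.
awk '/=/ {c++} {print c+1, $0}' sections.txt | LC_ALL=C sort -n | cut -d' ' -f2-
# == a ==
# apple
# zebra
# == b ==
# bat
# cat
```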
Edit
I just saw @Bing Wang's comment - this is basically what he suggested you do.
My apologies if this question already exists out there. I have a concatenated text file that looks like this:
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 1 764484 783034 1:764484:783034:clu_2500_NA 0.66666024153854 -0.194766358934969
2 1 764484 787307 1:764484:787307:clu_2500_NA -0.602342191830433 0.24773430748199
3 1 880180 880422 1:880180:880422:clu_2501_NA -0.211378452591182 2.02508282380949
4 1 880180 880437 1:880180:880437:clu_2501_NA 0.231916912049866 -2.20305649485074
5 1 889462 891303 1:889462:891303:clu_2502_NA -2.3215482460681 0.849095194607155
6 1 889903 891303 1:889903:891303:clu_2502_NA 2.13353943689806 -0.920181808417383
7 1 899547 899729 1:899547:899729:clu_2503_NA 0.990822909478346 0.758143648905368
8 1 899560 899729 1:899560:899729:clu_2503_NA -0.938514081703866 -0.543217522714283
9 1 986217 986412 1:986217:986412:clu_2504_NA -0.851041440248378 0.682551011244202
The first line, #Chr start end ID GTEX-Q2AG GTEX-NPJ8, is the header, and because I concatenated several similar files, it occurs multiple times throughout the file. I would like to delete every instance of the header occurring in the text without deleting the first header.
BONUS: I actually need help with this too and would like to avoid posting another Stack Overflow question. The first column of my data was generated by R and represents row numbers. I want them all gone without deleting #Chr. There are too many columns and it's a problem.
This problem is different than ones recommended to me because of the above additional issue and also because you don't necessarily have to use regex to solve this problem.
The following AWK script removes all lines that are exactly the same as the first one.
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile > outputfile
It will print the first line because the initial value of header is an empty string, and then store that first line in header (which is still empty at that point).
After this it will print only lines that are not equal to the first one already stored in header. The second if will always be false once the header has been saved.
Note: If the file starts with empty lines these empty lines will be removed.
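As a quick sanity check, here's the same command run on a made-up four-line sample with the header repeated once:

```shell
# Hypothetical concatenated file: the header appears twice.
printf '#Chr start\n1 100\n#Chr start\n2 200\n' > concat.txt

# Only the first occurrence of the header survives.
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' concat.txt
# #Chr start
# 1 100
# 2 200
```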
To remove the first number column you can use
sed 's/^[0-9][0-9]*[ \t]*//' inputfile > outputfile
You can combine both commands to a pipe
awk '{ if($0 != header) { print; } if(header == "") { header=$0; } }' inputfile | sed 's/^[0-9][0-9]*[ \t]*//' > outputfile
Maybe this is helpful:
delete all headers
delete the first column
re-add the first header
cat foo.txt
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 1 764484 783034 1:764484:783034:clu
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
2 1 764484 783034 1:764484:783034:clu
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
3 1 764484 783034 1:764484:783034:clu
sed '/#Chr start end ID GTEX-Q2AG GTEX-NPJ8/d' foo.txt | awk '{$1 = ""; print $0 }' | sed '1i #Chr start end ID GTEX-Q2AG GTEX-NPJ8'
#Chr start end ID GTEX-Q2AG GTEX-NPJ8
1 764484 783034 1:764484:783034:clu
1 764484 783034 1:764484:783034:clu
1 764484 783034 1:764484:783034:clu
Using sed
sed '2,${/HEADER/d}' input.txt > output.txt
Command explained:
Starting at line 2 through the end of the file: 2,$
Search for any line matching 'HEADER': /HEADER/
Delete it: d
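A minimal sketch of that command on made-up data (GNU sed syntax; BSD sed wants a semicolon before the closing brace, i.e. 2,${/HEADER/d;}):

```shell
# Hypothetical input: the header line repeats later in the file.
printf 'HEADER\ndata1\nHEADER\ndata2\n' > input.txt

# Delete HEADER matches everywhere except line 1.
sed '2,${/HEADER/d;}' input.txt
# HEADER
# data1
# data2
```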
I would do
awk 'NR == 1 {header = $0; print} $0 != header' file
I have 2 text files. File1 has about 1,000 lines and File2 has 20,000 lines. An extract of File1 is as follows:
/BBC Micro/Thrust
/Amiga/Alien Breed Special Edition '92
/Arcade-Vertical/amidar
/MAME (Advance)/mario
/Arcade-Vertical/mspacman
/Sharp X68000/Bubble Bobble (1989)(Dempa)
/BBC Micro/Chuckie Egg
An extract of File2 is as follows:
005;005;Arcade-Vertical;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
Alien 8 (Japan);Alien 8 (Japan);msx;;1987;Nippon Dexter Co., Ltd.;Action;1;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
Bubble Bobble (Japan);Bubble Bobble (Japan);msx2;;;;;;;;;;;;;;
Buffy the Vampire Slayer - Wrath of the Darkhul King (USA, Europe);Buffy the Vampire Slayer - Wrath of the Darkhul King (USA, Europe);Nintendo Game Boy Advance;;2003;THQ;Action;;;;;;;;;;
mario;mario;FBA;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Thunder Blade (1988)(U.S. Gold)[128K];Thunder Blade (1988)(U.S. Gold)[128K];ZX Spectrum;;;;;;;;;;;;;;
Thunder Mario v0.1 (SMB1 Hack);Thunder Mario v0.1 (SMB1 Hack);Nintendo NES Hacks 2;;;;;;;;;;;;;;
Thrust;Thrust;Vectrex;;;;;;;;;;;;;;
In File3 (the output file), using grep, sed, awk or a bash script, I would like to achieve the following output:
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
This is similar to a previous question I asked but not the same. I specifically want to avoid the possibility of Thrust;Thrust;Vectrex;;;;;;;;;;;;;; being recorded in File 3.
Using sudo awk -F\; 'NR==FNR{a[$1]=$0;next}$1 in a{print a[$1]}', I found that Thrust;Thrust;Vectrex;;;;;;;;;;;;;; was recorded in File 3 instead of Thrust;Thrust;BBC Micro;;;;;;;;;;;;;; (the latter being the output I'm seeking).
Equally, mario;mario;FBA;;;;;;;;;;;;;; won't appear in File3 because it does not match /MAME (Advance)/mario as "MAME (Advance)" doesn't match. That is good. The same for Bubble Bobble (Japan);Bubble Bobble (Japan);msx2;;;;;;;;;;;;;; which doesn't match either "Sharp X68000" or "Bubble Bobble (1989)(Dempa)".
Using awk and an associative array, you can do this:
awk '
BEGIN {
if ( ARGC != 3 ) exit(1);
FS="/";
while ( getline < ARGV[2] ) mfggames[$2"/"$3]=1;
FS=";";
ARGC=2;
}
mfggames[$3"/"$1]
' file2 file1
Output:
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Sorted per file1 solution (as per comment request):
awk '
BEGIN {
if ( ARGC != 3 ) exit(1);
FS="/";
while ( getline < ARGV[2] ) mfggames[$2"/"$3]=++order;
FS=";";
ARGC=2;
}
mfggames[$3"/"$1] { print(mfggames[$3"/"$1] FS $0); }
' file2 file1 | sort -n | cut -d ';' -f 2-
Output:
Thrust;Thrust;BBC Micro;;;;;;;;;;;;;;
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992;Team 17;Action / Shooter;;;;;;;;;;
amidar;amidar;Arcade-Vertical;;;;;;;;;;;;;;
mspacman;mspacman;Arcade-Vertical;;;;;;;;;;;;;;
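To try the order-preserving version end to end, here's a self-contained sketch with trimmed-down stand-ins for file1 and file2:

```shell
# Trimmed, hypothetical versions of the two inputs.
cat > file1 <<'EOF'
/BBC Micro/Thrust
/Amiga/Alien Breed Special Edition '92
EOF
cat > file2 <<'EOF'
Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992
Thrust;Thrust;BBC Micro;;
Thrust;Thrust;Vectrex;;
EOF

awk '
BEGIN {
    if ( ARGC != 3 ) exit(1);
    FS="/";
    # Record each file1 entry as "platform/title" with its line order.
    while ( getline < ARGV[2] ) mfggames[$2"/"$3]=++order;
    FS=";";
    ARGC=2;    # stop awk from also reading file1 as a main input
}
mfggames[$3"/"$1] { print(mfggames[$3"/"$1] FS $0); }
' file2 file1 | sort -n | cut -d ';' -f 2-
# Thrust;Thrust;BBC Micro;;
# Alien Breed Special Edition '92;Alien Breed Special Edition '92;Amiga;;1992
```

Note that Thrust;Thrust;Vectrex;; is filtered out, because /BBC Micro/Thrust pins the title to one platform.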
I'm very new to shell scripting and wasn't sure how to go about doing this.
Suppose I have two files:
file1.csv | file2.csv
--------------------
Apples Apples
Dogs Dogs
Cats Cats
Grapes Oranges
Batman Thor
Borgs Daleks
Kites Kites
Blah Blah
xyz xyz
How do I only keep the differences in each file, and 2 lines above the start of the differences, and 2 lines after? For example, the output would be:
file1.csv | file2.csv
-----------------------
Dogs Dogs
Cats Cats
Grapes Oranges
Batman Thor
Borgs Daleks
Kites Kites
Blah Blah
Thank you very much!
This is a job for diff.
diff -u2 file1.csv file2.csv | sed '1,3d;/@@/,+2d' > diff
The diff command will produce a patch-style difference containing meta information about the files in the form:
--- file1.csv 2017-05-12 15:21:47.564801174 -0700
+++ file2.csv 2017-05-12 15:21:52.462801174 -0700
@@ -2,7 +2,7 @@
Any block of differences will have a header like @@ -2,7 +2,7 @@. We want to throw these away using sed.
1,3d - delete the top 3 lines
/@@/,+2d - delete any line containing @@ and the next 2 lines after it. This is not needed for your case but is good to include here in case your input suddenly has multiple blocks of differences.
The result of the above commands will produce this list.
Dogs
Cats
-Grapes
-Batman
-Borgs
+Oranges
+Thor
+Daleks
Kites
Blah
Each line has a one-character prefix: ' ' is common to both, '-' is only in file1.csv, and '+' is only in file2.csv. Now all we need to do is distribute these to the two files.
sed '/^+.*/d;s/^.//' diff > file1.csv
sed '/^-.*/d;s/^.//' diff > file2.csv
The sed commands here will filter the file and write the proper contents to each of the input files.
/^+.*/d - lines starting with '+' will be deleted.
s/^.// - will remove the 1 character prefix which was added by diff.
/^-.*/d - lines starting with '-' will be deleted.
Finally, remove the transient file diff.
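Putting the whole recipe together on a toy pair of files (all file names made up; note that diff emits hunk headers of the form @@ -N,M +N,M @@):

```shell
cat > a.csv <<'EOF'
Dogs
Cats
Grapes
Kites
EOF
cat > b.csv <<'EOF'
Dogs
Cats
Oranges
Kites
EOF

# Unified diff with 2 context lines; drop the file headers and any hunk headers.
diff -u2 a.csv b.csv | sed '1,3d;/@@/,+2d' > diff.tmp

sed '/^+.*/d;s/^.//' diff.tmp    # a.csv side: Dogs Cats Grapes Kites
sed '/^-.*/d;s/^.//' diff.tmp    # b.csv side: Dogs Cats Oranges Kites
rm diff.tmp
```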
Is there a way to use sed to replace starting of the string in the entire file without using a loop?
For example, my source data is the following:
str_address: 123 main street
str_desc: Apt3
str_desc: 2nd floor
str_city: new york
str_desc: mailing address
Now, the file will have thousands of addresses, but I want every "str_desc" that appears after "str_address" and before "str_city" to be replaced with "str_address", while any "str_desc" that appears after "str_city" remains unchanged.
Desired output:
str_address: 123 main street
str_address: Apt3
str_address: 2nd floor
str_city: new york
str_desc: mailing address
I can extract this info with,
cat file | awk '/str_city/{f=0} f; /str_address/{f=1}'
which gives
str_desc: Apt3
str_desc: 2nd floor
But I am having trouble changing the first "str_desc" to "str_address".
You almost have the complete solution in your awk extraction code:
awk '/str_city/{f=0} f; /str_address/{f=1}'
The idea is to:
turn the flag on when you see str_address.
turn the flag off when you see str_city.
replace str_desc with str_address if the flag is on.
That's basically (in readable form, and the order is important):
awk '
$1 == "str_address:" { flag = 1 }
$1 == "str_desc:" && flag == 1 { $1 = "str_address:" }
$1 == "str_city:" { flag = 0 }
{ print }
' < inputFile >outputFile
Here's a transcript showing it in action:
pax$ echo '
str_address: 123 main street
str_desc: Apt3
str_desc: 2nd floor
str_city: new york
str_desc: mailing address
' | awk '
$1 == "str_address:" { flag = 1 }
$1 == "str_desc:" && flag == 1 { $1 = "str_address:" }
$1 == "str_city:" { flag = 0 }
{ print }'
str_address: 123 main street
str_address: Apt3
str_address: 2nd floor
str_city: new york
str_desc: mailing address
And, of course, a minified version:
awk '$1=="str_address:"{f=1}$1=="str_desc:"&&f==1{$1="str_address:"}$1=="str_city:"{f=0}{print}' < inputFile >outputFile
You can use an address range in sed:
$ sed '/str_address/,/str_city/s/str_desc/str_address/' infile
str_address: 123 main street
str_address: Apt3
str_address: 2nd floor
str_city: new york
str_desc: mailing address
This leaves all the str_desc outside of the /str_address/,/str_city/ range untouched, and substitutes the others with str_address (that's the s/str_desc/str_address/ part).
I need to sort data which reside in txt file. The sample data is as follows:
======
Jhon
Doe
score -
------
======
Ann
Smith
score +
------
======
Will
Marrow
score -
------
And I need to extract only sections where score + is defined. So the result should be
======
Ann
Smith
score +
------
I would try this one:
$ grep -B3 -A1 "score +" myfile
It means... grep three lines Before and one line After "score +".
Sed can do it as follows:
sed -n '/^======/{:a;N;/\n------/!ba;/score +/p}' infile
======
Ann
Smith
score +
------
where -n prevents printing, and
/^======/ { # If the pattern space starts with "======"
:a # Label to branch to
N # Append next line to pattern space
/\n------/!ba # If we don't match "------", branch to :a
/score +/p # If we match "score +", print the pattern space
}
Things could be more properly anchored with /\n------$/, but there are spaces at the end of the lines, and I'm not sure whether those are real or copy-paste artefacts; in any case, this works for the example data.
Give this one-liner a try:
awk -v RS="==*" -F'\n' '{p=0;for(i=1;i<=NF;i++)if($i~/score \+/)p=1}p' file
with the given data, it outputs:
Ann
Smith
score +
------
The idea is: take all lines divided by ====... as one multi-line record, check whether the record contains the search pattern, and print the record if it does.
With GNU awk for multi-char RS:
$ awk -v RS='=+\n' '/score \+/' file
Ann
Smith
score +
------
Given:
$ echo "$txt"
======
Jhon
Doe
score -
------
======
Ann
Smith
score +
------
======
Will
Marrow
score -
------
You can create a toggle-type match in awk to print only the section that you want:
$ echo "$txt" | awk '/^=+/{f=1;s=$0;next} /^score \+/{f=2} f {s=s"\n"$0} /^-+$/ {if(f==2) {print s} f=0}'
======
Ann
Smith
score +
------
Use Grep Context Flags
Assuming you have a truly fixed-format file, you can just use fgrep (or GNU or BSD grep with the speedy --fixed-strings flag) along with the --before-context and --after-context flags. For example:
$ fgrep -A1 -B3 'score +' /tmp/foo
======
Ann
Smith
score +
------
The flags will find your match, and include the three lines before and one line after each match. This gives you the output you're after, but with a lot less complexity than a sed or awk script. YMMV.