How to sort lines of text by other lines of text as keys? - bash

Basically, the CLI equivalent of a "sort-by" method.
A sample input file (file1.txt):
one
three
five
eleven
thirteen
sixteen
Another input file (file2.txt, which lists the length of the corresponding line in file1.txt):
3
5
4
6
8
7
Desired output (sort lines in file1.txt by lines in file2.txt, in this case numerically; or in other words, sort lines in file1.txt by the line's length):
one
five
three
eleven
sixteen
thirteen
I've created a simple Perl script to do this. Sample usage:
% sort-by-lines file1.txt file2.txt
% sort-by-lines /etc/passwd <(perl -nE'say length' /etc/passwd)
But I was wondering if a combination of more basic Unix commands (sort, cut, etc.) could do the same in a comparably simple fashion.

Are you just trying to sort a file by the length of each line? With standard tools in any shell on any UNIX box that'd be:
awk -v OFS='\t' '{print length(), NR, $0}' file | sort -k1,2n | cut -f3-
For example:
$ cat file
other stuff
text
foo
stuff
bar
$ awk -v OFS='\t' '{print length(), NR, $0}' file | sort -k1,2n | cut -f3-
foo
bar
text
stuff
other stuff
If that's not it then please edit your question to clarify what it is you're trying to do and what your actual question is.
Update - given the input you added to your question:
$ paste file2.txt file1.txt | sort -k1,2n | cut -f2-
one
five
three
eleven
sixteen
thirteen
Note that this won't necessarily preserve the order of lines of the same length - you'd need to add the GNU -s ("stable") option to sort to do that:
paste file2.txt file1.txt | sort -s -k1,2n | cut -f2-
or do this, which is bash-only:
paste file2.txt <(cat -n file1.txt) | sort -k1,2n | cut -f3-
or this, which is portable to all shells/Unixes:
awk -v OFS='\t' 'NR==FNR{a[NR]=$0;next} {print a[FNR], FNR, $0}' file2.txt file1.txt | sort -k1,2n | cut -f3-
or do something with an explicit temp file or a here document.
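The temp-file variant would be the portable equivalent of the bash-only process substitution above; a minimal sketch, assuming mktemp is available:
tmp=$(mktemp)
cat -n file1.txt > "$tmp"     # number the lines so ties keep their original order
paste file2.txt "$tmp" | sort -k1,2n | cut -f3-
rm -f "$tmp"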

Move lines in file using awk/sed

Hi, my files look like:
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
and I want to move the lines so that line 1 swaps with 3, and line 2 swaps with 4.
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
I have thought about using cut to send the lines into other files and then bringing them all back in the desired order using paste, but is there a solution using awk/sed?
EDIT: The file always has 4 lines (2 FASTA entries), no more.
For such a simple case, as #Ed_Morton mentioned, you can just swap the two even-sized halves with the head and tail commands:
$ tail -2 test.txt; head -2 test.txt
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
Generic solution with GNU tac to reverse contents:
$ tac -bs'>' ip.txt
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
By default tac reverses the input line-wise, but you can customize the separator.
Here, I'm assuming > can be safely used as a unique separator (provided via the -s option). The -b option puts the separator before the content in the output.
Using ed (in-place editing):
# move 3rd to 4th lines to the top
printf '3,4m0\nwq\n' | ed -s ip.txt
# move the last two lines to the top
printf -- '-1,$m0\nwq\n' | ed -s ip.txt
Using sed:
sed '1h;2H;1,2d;4G'
Store the first line in the hold space;
Add the second line to the hold space;
Don't print the first two lines;
Before printing the fourth line, append the hold space to it (i.e. append the 1st and 2nd line).
The GNU AWK manual has an example of swapping two lines using getline. Since, as you say,
the file always has 4 lines (2 FASTA entries), no more,
you only need to care about the case where the number of lines is evenly divisible by 4, and can use getline the following way. Let file.txt content be
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
then
awk '{line1=$0;getline line2;getline line3;getline line4;printf "%s\n%s\n%s\n%s\n",line3,line4,line1,line2}' file.txt
gives output
>ID.2
GGAATACCACATCCCGCAGGGTTAAAAAAGAAAAAATCAGTAACAGTACTGGA
>ID.1
GGAACACGACATCCTGCAGGGTTAAAAAAGAAAAAATCAGTAAAAGTACTGGA
Explanation: store the current line in variable line1, then the next line as line2, the next as line3, and the next as line4; then use printf with 4 placeholders (%s), each followed by a newline (\n), filled in the order your requirement calls for.
(tested in GNU Awk 5.0.1)
GNU sed:
sed -zE 's/^([^\n]*\n[^\n]*\n)([^\n]*\n[^\n]*\n?)/\2\1/' file
(With -z the whole file is a single record and . also matches newlines, so [^\n]* is used to keep each group to exactly two lines.)
A Perl one-liner (unlike the sed -z version, . does not match a newline here, so each .* stays within a single line while \R matches any line ending):
perl -0777 -pe 's/(.*\R.*\R)(.*\R.*\R?)/$2$1/' file
A Ruby:
ruby -ne 'BEGIN{lines=[]}
lines<<$_
END{puts lines[2...4]+lines[0...2] }' file
Paste and awk:
paste -s file | awk -F'\t' '{print $3, $4, $1, $2}' OFS='\n'
A POSIX pipe:
paste -sd'\t\n' file | nl | sort -nr | cut -f 2- | tr '\t' '\n'
This also seems to work (RS= enables awk's paragraph mode, so the whole 4-line file is a single record whose fields, with -F'\n', are the individual lines; note that ORS='\n\n' appends a trailing blank line):
awk -F'\n' '{print $3, $4, $1, $2}' OFS='\n' RS= ORS='\n\n' file.txt

Read two files simultaneously and create one from them

I am new to Bash scripting, but do understand most of the basics. My scenario is as follows:
I have a server from which I get a load of data via cURL. This is parsed properly (XML format) and from these results I then extract the data I want. The cURL statement writes its output to a file called temp-rec-schedule.txt. The below code is what I use to get the values I want to use in further calculation.
MP=`cat temp-rec-schedule.txt | grep "<ns3:mediapackage" | awk -F' ' '{print $3}' | cut -d '=' -f 2 | awk -F\" '{print $(NF-1)}'`
REC_TIME=`cat temp-rec-schedule.txt | grep "<ns3:mediapackage" | awk -F' ' '{print $2}' | cut -d '=' -f 2 | awk -F\" '{print $(NF-1)}'`
So far this all works perfectly. The output of the above code is as follows (if written to two separate files):
MP output:
b1706f0d-2cf1-4fd6-ab60-ae4d08608f1f
fd578fcc-342c-4f6c-986a-794ccb1abd0c
ce9f40e9-8e2c-4654-ba1c-7f79d35a69fd
c31a2354-6f4b-4bfe-b51e-2bac80889897
df342d88-c660-490e-9da6-9c91a7966536
49083f88-4264-4629-80fb-fae480d0bb25
946121c7-4948-4254-9cb5-2457e1b99685
f7bd0cad-e8f5-4e3d-a219-650d07a4bb34
REC_TIME output:
2014-09-15T07:30:00Z
2014-09-19T08:58:00Z
2014-09-22T07:30:00Z
2014-10-13T07:30:00Z
2014-10-17T08:58:00Z
2014-10-20T07:30:00Z
2014-10-22T13:28:00Z
2014-10-27T07:30:00Z
What I want to do now is create a file where line 1 from file1 is joined with line 1 from file2, i.e.:
b1706f0d-2cf1-4fd6-ab60-ae4d08608f1f 2014-09-15T07:30:00Z
fd578fcc-342c-4f6c-986a-794ccb1abd0c 2014-09-19T08:58:00Z
and so on.
I am not really familiar with Perl, but do know a little bit about Bash, so if it is possible, I would like to do this in Bash.
Further, from here, I want to compare two files that contain the same MP variable but two different TIME values: subtract one value from the other and calculate the number of hours that have passed in between. This is all to calculate the number of hours that pass between publishing a video on our system and the start time of the recording. Basically:
File1's output: b1706f0d-2cf1-4fd6-ab60-ae4d08608f1f 2014-09-15T07:30:00Z
File2's output: b1706f0d-2cf1-4fd6-ab60-ae4d08608f1f 2014-09-15T09:30:00Z
The output of my script should yield a value of 2 hours.
How can I do this with Bash?
You're probably better off just using awk for the whole thing. Something like:
awk '/<ns3:mediapackage/{gsub("\"","");
split($3,mp,"=");
split($2,rt,"="); print mp[2],rt[2]}' temp-rec-schedule.txt
The answer to the first question is to write the output to two different files and then use paste.
grep "<ns3:mediapackage" temp-rec-schedule.txt | awk -F' ' '{print $3}' | cut -d '=' -f 2 | awk -F\" '{print $(NF-1)}' > MP_out.txt
grep "<ns3:mediapackage" temp-rec-schedule.txt | awk -F' ' '{print $2}' | cut -d '=' -f 2 | awk -F\" '{print $(NF-1)}' > REC_out.txt
paste MP_out.txt REC_out.txt
That being said (and as #WilliamPursell says in his comment on the OP), there is never a reason to string this series of commands together, since awk can do all the things you are doing there with significantly less overhead and more flexibility.
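For the second part of the question (the number of hours between the two timestamps of the same MP id), which neither answer above addresses, here is a minimal sketch assuming GNU date for ISO-8601 parsing, with hypothetical file names file1_out.txt and file2_out.txt (both "<MP> <TIME>" per line, sorted by MP):
join file1_out.txt file2_out.txt |     # -> "<MP> <TIME1> <TIME2>"
while read -r id t1 t2; do
  s1=$(date -d "$t1" +%s)              # GNU date: ISO timestamp -> epoch seconds
  s2=$(date -d "$t2" +%s)
  echo "$id $(( (s2 - s1) / 3600 )) hours"
done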

How to temporarily sort a file from longest to shortest line, before restoring it back to its original order?

The answers in Sorting lines from longest to shortest provide various ways to sort a file's lines from longest to shortest.
I need to temporarily sort a file from longest to shortest line so that a Bash script can perform some operations to edit various content, and then restore the file to its original order after the script has finished.
How can I first sort a file from longest to shortest, but then be able to restore the order later?
This can be done by enhancing your linked answer with these steps:
Prepend the length and line number to the front of each line, sort by length, then cut the length off (just like in the linked answer):
perl -ne 'print length($_)." $. $_"' file.txt | sort -r -n | cut -d ' ' -f 2- > newfile.txt
Do your Bash edits (ignoring the first number on each line).
If for some reason you can't do your edits with the number prefixes in place, split the numbers into a separate file and merge them back in afterwards; see the sketch after these steps.
Sort by line number and then cut off the line numbers to restore the file to its previous state:
sort -n newfile.txt | cut -d ' ' -f 2- > file.txt
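The split/merge variant mentioned in step 2 could look like this minimal sketch (nums.txt and body.txt are hypothetical names):
cut -d ' ' -f 1 newfile.txt > nums.txt     # the original line numbers
cut -d ' ' -f 2- newfile.txt > body.txt    # the content only
# ... edit body.txt, keeping its line count unchanged ...
paste -d ' ' nums.txt body.txt | sort -n | cut -d ' ' -f 2- > file.txt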
Something like this to store the original line order in a separate file might be what you need:
awk -v OFS='\t' '{print length($0), NR, $0}' infile |
sort -k1rn -k2n |
tee order.txt |
cut -f3- > sorted.txt
Do stuff with sorted.txt, then:
cut -f2 order.txt |
paste - sorted.txt |
sort -n |
cut -f2- > outfile
You don't say what you want done with lines that are the same length as each other, but the above will preserve the order from the original file in that case. If that's not what you want, play with the sort commands, modifying the -k options as necessary.

grep "output of cat command - every line" in a different file

Sorry, the title of this question is a little confusing, but I couldn't think of anything else.
I am trying to do something like this:
cat fileA.txt | grep `awk '{print $1}'` fileB.txt
fileA contains 100 lines while fileB contains 100 million lines.
What I want is to get each id from fileA, grep for that id in a different file (fileB), and print the matching line.
e.g. fileA.txt
1234
1233
e.g. fileB.txt
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
Expected output is
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
Getting rid of cat and awk altogether:
grep -f fileA.txt fileB.txt
awk alone can do that job well:
awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' fileA fileB
see the test:
kent$ head a b
==> a <==
1234
1233
==> b <==
1234|asdf|2012-12-12
5555|asdd|2012-11-12
1233|fvdf|2012-12-11
kent$ awk -F'|' 'NR==FNR{a[$0];next;}$1 in a' a b
1234|asdf|2012-12-12
1233|fvdf|2012-12-11
EDIT
Adding an explanation:
-F'|'                  # use | as the field separator (this matters for fileB, whose first field is the id)
NR==FNR{a[$0];next;}   # save the lines of fileA in array a
$1 in a                # if $1 (the first field) of the current fileB line is in array a, print that line
There are further details I can't explain here, sorry - for example how awk handles two files, or what NR and FNR are. Try this awk line in case the accepted answer doesn't work for you; if you want to dig a little deeper, read some awk tutorials.
If the ids are each on their own line you could use the -f option in grep, anchoring each id to the first field so that it can't match elsewhere in the line and the whole matching line of fileB.txt is printed:
grep -f <(sed 's/^/^/; s/$/|/' fileA.txt) fileB.txt
The sed command turns an id such as 1234 into the anchored pattern ^1234|, ensuring grep only matches it as the first |-delimited field.
From the man page:
-f FILE, --file=FILE
Obtain patterns from FILE, one per line.
The empty file contains zero patterns, and therefore matches nothing.
(-f is specified by POSIX.)
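As a side note (my own observation, not from the answers above): with 100 million lines in fileB.txt, fixed-string matching tends to be much faster than regex matching, so you may want:
grep -Fw -f fileA.txt fileB.txt
-F treats the patterns as fixed strings and -w only accepts whole-word matches, so an id can't match inside a longer token; a purely numeric id could still coincide with a whole date component, though, in which case the awk answer above is the more precise tool.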

Searching for Strings

I would like to have a shell script that searches two files and returns a list of strings:
File A contains just a list of unique alphanumeric strings, one per line, like this:
accc_34343
GH_HF_223232
cwww_34343
jej_222
File B contains a list of SOME of those strings (sometimes more than once), and a second column of information, like this:
accc_34343 dog
accc_34343 cat
jej_222 cat
jej_222 horse
I would like to create a third file that contains a list of the strings from File A that are NOT in File B.
I've tried using some loops with grep -v, but that doesn't work. So, in the above example, the new file would have this as its contents:
GH_HF_223232
cwww_34343
Any help is greatly appreciated!
Here's what you can do:
grep -v -f <(awk '{print $1}' file_b) file_a > file_c
Explanation:
grep -v : Use -v option to grep to invert the matching
-f : Use -f option to grep to specify that the patterns are from file
<(awk '{print $1}' file_b) : Extract the first column values from file_b without using a temp file; the <( ... ) syntax is process substitution
file_a : Tell grep that the file to be searched is file_a
> file_c : Output to be written to file_c
comm is used to find intersections and differences between sorted files; -2 suppresses the lines unique to the second input and -3 the lines common to both, leaving only the lines unique to fileA:
comm -23 <(sort fileA) <(cut -d' ' -f1 fileB | sort -u)
result:
GH_HF_223232
cwww_34343
awk 'FNR==NR{a[$1];next} !($0 in a)' fileB fileA
Note the file order: fileB is read first (FNR==NR) to collect its first fields in array a, and then every line of fileA that is not in a is printed.
