How can I sort alphanumeric strings in Unix?

How can I sort alphanumeric strings in Unix? - sorting

I have a list of table names, which are out of order. How can I get them in the correct logical order?
$ cat list.txt
TAB1
TAB13
TAB11
TAB19
TAB2
TAB3
TAB16
TAB17
TAB18
TAB9
TAB10
TAB8
TAB12
TAB20
$ cat list.txt | sort -n
TAB1
TAB10
TAB11
TAB12
TAB13
TAB16
TAB17
TAB18
TAB19
TAB2
TAB20
TAB3
TAB8
TAB9
Expected order:
TAB1
TAB2
TAB3
TAB8
TAB9
TAB10
TAB11
TAB12
TAB13
TAB16
TAB17
TAB18
TAB19
TAB20
Any vim short-cuts will also do, I do not necessarily need a separate utility for this.

You can always perform sort with argument -V to sort alphanumeric string..
$ sort -V inputfile > outputfile
$ cat inputfile
TAB1
TAB13
TAB11
TAB19
TAB2
TAB3
TAB16
TAB17
TAB18
TAB9
TAB10
TAB8
TAB12
TAB20
$ cat outputfile
TAB1
TAB2
TAB3
TAB8
TAB9
TAB10
TAB11
TAB12
TAB13
TAB16
TAB17
TAB18
TAB19
TAB20

You need to tell it where your sorting key starts:
sort -n -k1.4 list.txt
Otherwise it starts from the beginning, fails to convert a string to a number and falls back to alphabetical comparison.

Since this is tagged as a Vim question, I figured it might be worth mentioning the Vim option (even though I would personally use sort since the data's already in a file). It's simply
:sort n
Since Vim's numeric sort ignores up to the first decimal number, one doesn't need to ignore the "TAB" (:sort can take a pattern to ignore, :sort n /TAB/ would work as well, for example). As usual, :h :sort for more information.

You can do this in Perl or any language where sort lets you specify a comparison operator:
sub numcomp() {
$a =~ /([0-9]*)$/; $aa = $1;
$b =~ /([0-9]*)$/; $bb = $1;
$aa <=> $bb;
}
sort numcomp #mylist...
(Don't bother telling me it's baby Perl. I... um, I wrote it that way on purpose so it would be easy to understand.)
(Don't bother telling me it's wrong. I... um, I wrote it that way on purpose as an exercise for the reader.)

Related

Should I use a for loop to process text files line by line?

So I have two text files
FILE1: 1-40 names
FILE2: 1-40 names
Now what I want the program to do (Terminal) is to go through each name, by incrementing by ONE in each file so that the first name from FILE1 runs the first line from FILE2, and 20th name from FILE1 runs the 20th line from FILE2.
BUT I DON'T WANT IT TO run first name of FILE1, and then run through all of the names listed in FILE2, and repeat that over and over again.
Should I do a for loop?
I was thinking of doing something like:
for f in (cat FILE1); do
flirt -in $f -ref (cat FILE2);
done
I'm doing this using BASH.

Yes, you can do it quite easily, but it will require reading from two-different file descriptors at once. You can simply redirect one of the files into the next available file descriptor and use it to feed your read loop, e.g.
while read f1var && read -u 3 f2var; do
echo "f1var: $f1var -- f2var: $f2var"
done <file1.txt 3<file2.txt
Which will read line-by-line from each file reading a line from file1.txt on the standard file descriptor into f1var and from file2.txt on fd3 into f2var.
A short example might help:
Example Input Files
$ cat f1.txt
a
b
c
$ cat f2.txt
d
e
f
Example Use
$ while read f1var && read -u 3 f2var; do \
echo "f1var: $f1var -- f2var: $f2var"; \
done <f1.txt 3<f2.txt
f1var: a -- f2var: d
f1var: b -- f2var: e
f1var: c -- f2var: f
Using paste as an alternative
The paste utility also provides a simple alternative for combining files line-by-line, e.g.:
$ paste f1.txt f2.txt
a d
b e
c f

In Bash, you might make usage of arrays:
echo "Alice
> Bob
> Claire" > file-1
echo "Anton
Bärbel
Charlie" > file-2
n1=($(cat file-1))
n2=($(cat file-2))
for n in {0..2}; do echo ${n1[$n]} ${n2[$n]} ; done
Alice Anton
Bob Bärbel
Claire Charlie

Getting familiar with join and nl (number lines) can't be wrong, so here is a different approach:
nl -w 1 file-1 > file1
nl -w 1 file-2 > file2
join -1 1 -2 1 file1 file2 | sed -r 's/^[0-9]+ //'
nl with put a big amount of blanks in front of the small line numbers, if we don't tell it to -w 1.
We join the files by matching line number and remove the line number afterwards with sed.
Paste is of course much more elegant. Didn't know about this.

Using nzload to load special characters

I have extended ascii chars in Oracle table data which I am able extract to a file using sqlplus with the \ escape character prefixed. I want to use nzload to load the exact same data into a netezza table.
nzload adds a couple of extra bytes when it encounters this char seq (c2bf)
in the extracted file data:
echo "PROFESSIONAL¿" | od -x
0000000 5052 4f46 4553 5349 4f4e 414c **c2bf** 0a00
after nzload:
echo "PROFESSIONALÂ¿" | od -x
0000000 5052 4f46 4553 5349 4f4e 414c **c382 c2bf**
on the nzload command line, I have these options:
-escapechar \ -ctrlchars
Can anyone provide any help with this?

I'm not very savvy with Unicode conversion issues, but I've done this to myself before, and I'll demonstrate what I think is happening.
I believe what you are seeing here is not an issue with loading special characters with nzload, rather it's an issue with how your display/terminal software is display the data and/or Netezza how is storing the character data. I suspect a double conversion to/from UTF-8 (the Unicode encoding that Netezza supports). Let's see if we can suss out which it is.
Here I am using PuTTY with the default (for me) Remote Character Set as Latin-1.
$ od -xa input.txt
0000000 5250 464f 5345 4953 4e4f 4c41 bfc2 000a
P R O F E S S I O N A L B ? nl
0000017
$ cat input.txt
PROFESSIONALÂ¿
Here we can see from od that the file has only the data we expect, however when we cat the file we see the extra character. If it's not in the file, then the character is likely coming from the display translation.
If I change the PuTTY settings to have UTF-8 be the remote character set, we see it this way:
$ od -xa input.txt
0000000 5250 464f 5345 4953 4e4f 4c41 bfc2 000a
P R O F E S S I O N A L B ? nl
0000017
$ cat input.txt
PROFESSIONAL¿
So, the same source data, but two different on-screen representations, which are, not coincidentally, the same as your two different outputs. The same data can be displayed at least two ways.
Now let's see how it loads into Netezza, once into a VARCHAR column, and again into an NVARCHAR column.
create table test_enc_vchar (col1 varchar(50));
create table test_enc_nvchar (col1 nvarchar(50));
$ nzload -db testdb -df input.txt -t test_enc_vchar -escapechar '\' -ctrlchars
Load session of table 'TEST_ENC_VCHAR' completed successfully
$ nzload -db testdb -df input.txt -t test_enc_nvchar -escapechar '\' -ctrlchars
Load session of table 'TEST_ENC_NVCHAR' completed successfully
The data loaded with no errors. Note while I specify the escapechar option for nzload, none of the characters in this specific sample of input data require escaping, nor are they escaped.
I will now use the rawtohex function from the SQL Extension Toolkit as an in-database tool like we've used od from the command line.
select rawtohex(col1) from test_enc_vchar;
RAWTOHEX
------------------------------
50524F46455353494F4E414CC2BF
(1 row)
select rawtohex(col1) from test_enc_nvchar;
RAWTOHEX
------------------------------
50524F46455353494F4E414CC2BF
(1 row)
At this point both columns seem to have exactly the same data as the input file. So far, so good.
What if we select the column? For the record, I am doing this in a PuTTY session with remote character set of UTF-8.
select col1 from test_enc_vchar;
COL1
----------------
PROFESSIONALÂ¿
(1 row)
select col1 from test_enc_nvchar;
COL1
---------------
PROFESSIONAL¿
(1 row)
Same binary data, but different display. If I then copy the output of each of those selects into echo piped to od,
$ echo PROFESSIONALÂ¿ | od -xa
0000000 5250 464f 5345 4953 4e4f 4c41 82c3 bfc2
P R O F E S S I O N A L C stx B ?
0000020 000a
nl
0000021
$ echo PROFESSIONAL¿ | od -xa
0000000 5250 464f 5345 4953 4e4f 4c41 bfc2 000a
P R O F E S S I O N A L B ? nl
0000017
Based on this output, I'd wager that you are loading your sample data, which I'd also wager is UTF-8, into a VARCHAR column rather than an NVARCHAR column. This is not, in of itself, a problem, but can have display/conversion issues down the line.
Generally speaking, you'd want to load UTF-8 data into NVARCHAR columns.

Unix sort: inconsistent between 2 files

[1.txt]
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample1_1.fq.gz
Sample13_1.fq.gz
[2.txt]
Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz
As you can see, the only difference is the digit after the "_".
Anyway, here are the results of sort:
[sort 1.txt]
Sample10_2.fq.gz
Sample11_2.fq.gz
Sample12_2.fq.gz
Sample1_2.fq.gz
Sample13_2.fq.gz
[sort 2.txt]
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample1_1.fq.gz
Sample12_1.fq.gz
Sample13_1.fq.gz
Discrepancy: "Sample1_" is sorted between "Sample12" and "Sample13" in 1.txt, but it's between "Sample11" and "Sample12" in 2.txt.
Am I doing something wrong to make this inconsistency happen?

Use sort -V
cat 1.txt | sort -V
Sample1_1.fq.gz
Sample10_1.fq.gz
Sample11_1.fq.gz
Sample12_1.fq.gz
Sample13_1.fq.gz

Combine 2 files based on the date using awk or sed

I have 2 files , the format is as follow,
File1 's contents,
02-01-12 28.46
02-02-12 27.15
02-03-12 47.54
02-04-12 27.36
02-05-12 47.57
02-06-12 27.01
02-07-12 27.41
02-08-12 27.27
02-09-12 27.39
File2 's contents,
02-01-12 11.46
02-02-12 12.15
02-03-12 14.54
02-04-12 15.36
02-05-12 17.57
02-06-12 17.01
02-07-12 17.41
02-08-12 21.27
02-09-12 17.39
I want to combine them into 1 file based on the date as below,
02-01-12 28.46 11.46
02-02-12 27.15 12.15
02-03-12 47.54 14.54
....................
....................
....................
Please help !! Thanks in advance..

what you want is join:
from the man page:
join - join lines of two files on a common field
try:
$ join file1 file2

Full real working example using paste :
paste FILE1 <(cut -d " " -f2 FILE2)
See :
man 1 paste

Using just sed:
/bin/sed -n '
p
R f2
' f1 |
/bin/sed 'N;s/\n[^ ]*//;'

Rearrange columns using cut

I am having a file in the following format
Column1 Column2
str1 1
str2 2
str3 3
I want the columns to be rearranged. I tried below command
cut -f2,1 file.txt
The command doesn't reorder the columns. Any idea why its not working?

For the cut(1) man page:
Use one, and only one of -b, -c or -f. Each LIST is made up of
one
range, or many ranges separated by commas. Selected input is written
in the same order that it is read, and is written exactly once.
It reaches field 1 first, so that is printed, followed by field 2.
Use awk instead:
awk '{ print $2 " " $1}' file.txt

You may also combine cut and paste:
paste <(cut -f2 file.txt) <(cut -f1 file.txt)
via comments: It's possible to avoid process substitution and remove one instance of cut by doing:
paste file.txt file.txt | cut -f2,3

Using join:
join -t $'\t' -o 1.2,1.1 file.txt file.txt
Notes:
-t $'\t' In GNU join the more intuitive -t '\t' without the $ fails, (coreutils v8.28 and earlier?); it's probably a bug that a workaround like $ should be necessary. See: unix join separator char.
Even though there's just one file being worked on, join syntax requires two filenames. Repeating the file name allows join to perform the desired action.
For systems with low resources join offers a smaller footprint than some of the tools used in other answers:
wc -c $(realpath `which cut join sed awk perl`) | head -n -1
43224 /usr/bin/cut
47320 /usr/bin/join
109840 /bin/sed
658072 /usr/bin/gawk
2093624 /usr/bin/perl

using just the shell,
while read -r col1 col2
do
echo $col2 $col1
done <"file"

You can use Perl for that:
perl -ane 'print "$F[1] $F[0]\n"' < file.txt
-e option means execute the command after it
-n means read line by line (open the file, in this case STDOUT, and loop over lines)
-a means split such lines to a vector called #F ("F" - like Field). Perl indexes vectors starting from 0 unlike cut which indexes fields starting form 1.
You can add -F pattern (with no space between -F and pattern) to use pattern as a field separator when reading the file instead of the default whitespace
The advantage of running perl is that (if you know Perl) you can do much more computation on F than rearranging columns.

Just been working on something very similar, I am not an expert but I thought I would share the commands I have used. I had a multi column csv which I only required 4 columns out of and then I needed to reorder them.
My file was pipe '|' delimited but that can be swapped out.
LC_ALL=C cut -d$'|' -f1,2,3,8,10 ./file/location.txt | sed -E "s/(.*)\|(.*)\|(.*)\|(.*)\|(.*)/\3\|\5\|\1\|\2\|\4/" > ./newcsv.csv
Admittedly it is really rough and ready but it can be tweaked to suit!

Just as an addition to answers that suggest to duplicate the columns and then to do cut. For duplication, paste etc. will work only for files, but not for streams. In that case, use sed instead.
cat file.txt | sed s/'.*'/'&\t&'/ | cut -f2,3
This works on both files and streams, and this is interesting if instead of just reading from a file with cat, you do something interesting before re-arranging the columns.
By comparison, the following does not work:
cat file.txt | paste - - | cut -f2,3
Here, the double stdin placeholder paste does not duplicate stdin, but reads the next line.

Using sed
Use sed with basic regular expression's nested subexpressions to capture and reorder the column content. This approach is best suited when there are a limited number of cuts to reorder columns, as in this case.
The basic idea is to surround interesting portions of the search pattern with \( and \), which can be played back in the replacement pattern with \# where # represents the sequential position of the subexpression in the search pattern.
For example:
$ echo "foo bar" | sed "s/\(foo\) \(bar\)/\2 \1/"
yields:
bar foo
Text outside a subexpression is scanned but not retained for playback in the replacement string.
Although the question did not discuss fixed width columns, we will discuss here as this is a worthy measure of any solution posed. For simplicity let's assume the file is space delimited although the solution can be extended for other delimiters.
Collapsing Spaces
To illustrate the simplest usage, let's assume that multiple spaces can be collapsed into single spaces, and the the second column values are terminated with EOL (and not space padded).
File:
bash-3.2$ cat f
Column1 Column2
str1 1
str2 2
str3 3
bash-3.2$ od -a f
0000000 C o l u m n 1 sp sp sp sp C o l u m
0000020 n 2 nl s t r 1 sp sp sp sp sp sp sp 1 nl
0000040 s t r 2 sp sp sp sp sp sp sp 2 nl s t r
0000060 3 sp sp sp sp sp sp sp 3 nl
0000072
Transform:
bash-3.2$ sed "s/\([^ ]*\)[ ]*\([^ ]*\)[ ]*/\2 \1/" f
Column2 Column1
1 str1
2 str2
3 str3
bash-3.2$ sed "s/\([^ ]*\)[ ]*\([^ ]*\)[ ]*/\2 \1/" f | od -a
0000000 C o l u m n 2 sp C o l u m n 1 nl
0000020 1 sp s t r 1 nl 2 sp s t r 2 nl 3 sp
0000040 s t r 3 nl
0000045
Preserving Column Widths
Let's now extend the method to a file with constant width columns, while allowing columns to be of differing widths.
File:
bash-3.2$ cat f2
Column1 Column2
str1 1
str2 2
str3 3
bash-3.2$ od -a f2
0000000 C o l u m n 1 sp sp sp sp C o l u m
0000020 n 2 nl s t r 1 sp sp sp sp sp sp sp 1 sp
0000040 sp sp sp sp sp nl s t r 2 sp sp sp sp sp sp
0000060 sp 2 sp sp sp sp sp sp nl s t r 3 sp sp sp
0000100 sp sp sp sp 3 sp sp sp sp sp sp nl
0000114
Transform:
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f2
Column2 Column1
1 str1
2 str2
3 str3
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f2 | od -a
0000000 C o l u m n 2 sp C o l u m n 1 sp
0000020 sp sp nl 1 sp sp sp sp sp sp sp s t r 1 sp
0000040 sp sp sp sp sp nl 2 sp sp sp sp sp sp sp s t
0000060 r 2 sp sp sp sp sp sp nl 3 sp sp sp sp sp sp
0000100 sp s t r 3 sp sp sp sp sp sp nl
0000114
Lastly although the question's example does not have strings of unequal length, this sed expression supports this case.
File:
bash-3.2$ cat f3
Column1 Column2
str1 1
string2 2
str3 3
Transform:
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f3
Column2 Column1
1 str1
2 string2
3 str3
bash-3.2$ sed "s/\([^ ]*\)\([ ]*\) \([^ ]*\)\([ ]*\)/\3\4 \1\2/" f3 | od -a
0000000 C o l u m n 2 sp C o l u m n 1 sp
0000020 sp sp nl 1 sp sp sp sp sp sp sp s t r 1 sp
0000040 sp sp sp sp sp nl 2 sp sp sp sp sp sp sp s t
0000060 r i n g 2 sp sp sp nl 3 sp sp sp sp sp sp
0000100 sp s t r 3 sp sp sp sp sp sp nl
0000114
Comparison to other methods of column reordering under shell
Surprisingly for a file manipulation tool, awk is not well-suited for cutting from a field to end of record. In sed this can be accomplished using regular expressions, e.g. \(xxx.*$\) where xxx is the expression to match the column.
Using paste and cut subshells gets tricky when implementing inside shell scripts. Code that works from the commandline fails to parse when brought inside a shell script. At least this was my experience (which drove me to this approach).

Expanding on the answer from #Met, also using Perl:
If the input and output are TAB-delimited:
perl -F'\t' -lane 'print join "\t", #F[1, 0]' in_file
If the input and output are whitespace-delimited:
perl -lane 'print join " ", #F[1, 0]' in_file
Here,
-e tells Perl to look for the code inline, rather than in a separate script file,
-n reads the input 1 line at a time,
-l removes the input record separator (\n on *NIX) after reading the line (similar to chomp), and add output record separator (\n on *NIX) to each print,
-a splits the input line on whitespace into array #F,
-F'\t' in combination with -a splits the input line on TABs, instead of whitespace into array #F.
#F[1, 0] is the array made up of the 2nd and 1st elements of array #F, in this order. Remember that arrays in Perl are zero-indexed, while fields in cut are 1-indexed. So fields in #F[0, 1] are the same fields as the ones in cut -f1,2.
Note that such notation enables more flexible manipulation of input than in some other answers posted above (which are fine for a simple task). For example:
# reverses the order of fields:
perl -F'\t' -lane 'print join "\t", reverse #F' in_file
# prints last and first fields only:
perl -F'\t' -lane 'print join "\t", #F[-1, 0]' in_file

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

How can I sort alphanumeric strings in Unix? - sorting

You can always perform sort with argument -V to sort alphanumeric string.. $ sort -V inputfile > outputfile $ cat inputfile TAB1 TAB13 TAB11 TAB19 TAB2 TAB3 TAB16 TAB17 TAB18 TAB9 TAB10 TAB8 TAB12 TAB20 $ cat outputfile TAB1 TAB2 TAB3 TAB8 TAB9 TAB10 TAB11 TAB12 TAB13 TAB16 TAB17 TAB18 TAB19 TAB20

You need to tell it where your sorting key starts: sort -n -k1.4 list.txt Otherwise it starts from the beginning, fails to convert a string to a number and falls back to alphabetical comparison.

Related

Should I use a for loop to process text files line by line?

Using nzload to load special characters

Unix sort: inconsistent between 2 files

Combine 2 files based on the date using awk or sed

Rearrange columns using cut

Categories

Resources