Sort files by number at end in bash/perl - bash

I'm trying to sort a huge list of files into the following order:
file-55_357-0.csv
file-55_357-1.csv
file-55_357-2.csv
file-55_357-3.csv
...
Is there a simple way to do this in bash or perl? In other words, is there a way to write the perl script such that it goes through the files in numerical order? For instance, when I create my @files, can I make sure the script goes through them all in this order -- how could I build my @sorted array? I ask because I want to append all these files together vertically, and they need to be in sorted order. Thanks so much!

You can use the sort command, which is neither part of bash nor part of perl.
With input data in input.txt:
file-55_357-123.csv
file-55_357-0.csv
file-55_357-21.csv
file-55_357-3.csv
From my shell (any shell, not just bash), I can do the following:
$ sort -t- -nk3 input.txt
file-55_357-0.csv
file-55_357-3.csv
file-55_357-21.csv
file-55_357-123.csv
The -t option specifies the delimiter, -n says to compare numeric values (so that 21 comes after 3 rather than before), and -k3 says to sort on the third field (as defined by that delimiter).
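Since the end goal is to append the files together in that order, here is a minimal sketch that concatenates them into one file (combined.csv is just a hypothetical output name; this assumes the filenames contain no spaces or newlines, which holds for the pattern shown):
$ printf '%s\n' file-55_357-*.csv | sort -t- -nk3 | xargs cat > combined.csv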

use Sort::Key::Natural qw( natsort );
my @sorted = natsort @file_names;

Related

Get first N chars and sort them

I have a requirement where I need to fetch the first four characters from each line of a file and sort them.
I tried the approach below, but it's not sorting the characters within each line:
cut -c1-4 simple_file.txt | sort -n
Output using the above:
appl
bana
uoia
Expected output:
alpp
aabn
aiou
sort isn't the right tool for the job in this case, as it is used to sort lines of input, not the characters within each line.
I know you didn't tag the question with perl but here's one way you could do it:
perl -F'' -lane 'print(join "", sort @F[0..3])' file
This uses the -a switch to auto-split each line of input on the delimiter specified by -F (in this case, an empty string, so each character is its own element in the array @F). It then sorts the first 4 characters of the array using the standard string comparison order. The result is joined together on an empty string.
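For example, reusing the sample words from the question (hypothetical input, reconstructed from the expected output above):
$ printf '%s\n' apple banana uoia | perl -F'' -lane 'print(join "", sort @F[0..3])'
alpp
aabn
aiou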
Try defining two helper functions:
explodeword () {
test -z "$1" && return
echo ${1:0:1}
explodeword ${1:1}
}
sortword () {
echo $(explodeword $1 | sort) | tr -d ' '
}
Then
cut -c1-4 simple_file.txt | while read -r word; do sortword $word; done
will do what you want.
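For example, calling the helper directly on one of the 4-character strings from the question:
$ sortword bana
aabn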
The sort command is used to sort files line by line; it's not designed to sort the characters within a line. It's not impossible to make sort do what you want, but it would be a bit messy and probably inefficient.
I'd probably do this in Python, but since you might not have Python, here's a short awk command that does what you want.
awk '{split(substr($0,1,4),a,"");n=asort(a);s="";for(i=1;i<=n;i++)s=s a[i];print s}'
Just put the name of the file (or files) that you want to process at the end of the command line. (Note that asort is a gawk extension, so this needs gawk rather than a minimal awk.)
Here's some data I used to test the command:
this
is a
simple
test file
a
of
apple
banana
cat
uoiea
bye
And here's the output
hist
ais
imps
estt
a
fo
alpp
aabn
act
eiou
bey
Here's an ugly Python one-liner; it would look a bit nicer as a proper script rather than as a Bash command line:
python -c "import sys;print('\n'.join([''.join(sorted(s[:4])) for s in open(sys.argv[1]).read().splitlines()]))"
In contrast to the awk version, this command can only process a single file, and it reads the whole file into RAM to process it, rather than processing it line by line.

Bash equivalent of Matlab's `fscanf`

I have lots of text files which contain columns of numeric values (the number of columns differs from file to file). I use MATLAB to store each one's content like this:
id1 = fopen('textfile.txt','r');
A = fscanf(id1,'%f',[1 Inf]);
fclose(id1);
I wanted to know whether there is any simple way to do the same in a bash script.
A simple equivalent of fscanf in Bash is the read builtin:
read -r A
If, on the other hand, we have multiple columns of values, then awk can be used to extract the n-th column (replace n with the actual column number):
awk '{print $n}' < input > output
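If you want something closer to fscanf's [1 Inf] behaviour (every value flattened into one array), here is a minimal bash sketch, assuming textfile.txt holds plain whitespace-separated numbers:
A=( $(< textfile.txt) )   # word-splitting turns the whole file into one array
echo "read ${#A[@]} values, first one is ${A[0]}"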
Not the simplest way imaginable, but you could use Bash arrays (Bash 4 and up).
First, read the file using newline as separator:
IFS_prev="$IFS"; IFS=$'\n';
A=($(cat "textfile.txt"))
IFS="$IFS_prev"
then, to refer to the jth element in the ith row, use this:
row=(${A[i]}) # extract ith row and split on spaces
element=${row[j]} # extract jth element
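A short usage sketch that loops over every row read this way (same hypothetical textfile.txt as above):
for i in "${!A[@]}"; do
    row=(${A[i]})                                  # split the ith row on spaces
    echo "row $i has ${#row[@]} columns, first value ${row[0]}"
done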

How to get frequency counts of unique values in a list using UNIX?

I have a file that has a couple thousand domain names in a list. I easily generated a list of just the unique names using the uniq command. Now, I want to go through and find how many times each of the items in the uniques list appears in the original, non-unique list. I thought this should be pretty easy to do with this loop, but I'm running into trouble:
for name in 'cat uniques.list'; do grep -c $name original.list; done > output.file
For some reason, it's spitting out a result that shows some count of something (honestly not sure what) for the uniques file and the original file.
I feel like I'm overlooking something really simple here. Any help is appreciated.
Thanks!
Simply use uniq -c on your file:
-c, --count
prefix lines by the number of occurrences
The command to get the final output:
sort original.list | uniq -c
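If you also want the most common names first, add a second numeric sort on the counts:
sort original.list | uniq -c | sort -rn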

Sort filenames without leading zeros

I would like to sort stereo image files with the following pattern
img_i_j.ppm,
where i is the image counter and j is the id of the camera [0,1].
Currently, if I sort them using
ls -1 *.ppm | sort -n
the result looks like this:
img_0_0.ppm
img_0_1.ppm
img_10_0.ppm
img_10_1.ppm
img_1_0.ppm
img_11_0.ppm
img_11_1.ppm
img_1_1.ppm
img_12_0.ppm
But I need this output:
img_0_0.ppm
img_0_1.ppm
img_1_0.ppm
img_1_1.ppm
img_2_0.ppm
img_2_1.ppm
...
img_10_0.ppm
img_10_1.ppm
...
Is this achievable without adapting the filename?
As seen in the comments, use
sort -V
I initially posted this as a comment because the -V option is not present in every sort binary; where it is missing you have to fall back to a field-based numeric sort with -t, -k and -n (for example, as sketched below).
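For this particular naming pattern the two variants look like this (the field numbers assume names of the form img_i_j.ppm, split on the underscores):
$ ls -1 *.ppm | sort -V
$ ls -1 *.ppm | sort -t_ -k2,2n -k3,3n   # fallback when sort has no -V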
GNU ls has the -v option, which does what you want. From man ls:
-v natural sort of (version) numbers within text
This is simpler than piping to sort, and follows the advice not to parse ls.
If you actually intend to parse the output, I imagine that you can mess with LC_COLLATE in bash. Alternatively, in zsh, you can just use the glob *(n) instead.

Find same words in two text files

I have two text files and each contains more than 50 000 lines. I need to find the words that appear in both text files. I tried the comm command but I got the answer that "file 2 is not in sorted order". I tried to sort the file with the sort command but it doesn't work. I'm working in Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.
If you want to sort the files you will have to use some sort of external sort (like merge sort) so that you have enough memory. Another way is to go through the first file, store all of its words in a hash table, and then go through the second file checking each word against that table. If the words are actual words and not gibberish, the second method will work and be easier. Since the files are so large you may not want to use a scripting language, but it might work.
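A rough sketch of that hash-table idea in bash itself (assumes bash 4+ for associative arrays; firstFile and secondFile are placeholder names for your two files):
declare -A seen
for w in $(tr -cs "[:alpha:]" " " < firstFile); do seen[$w]=1; done
for w in $(tr -cs "[:alpha:]" " " < secondFile); do
    [[ ${seen[$w]} ]] && printf '%s\n' "$w"
done | sort -u > commonWords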
If the words are not on their own lines, then comm cannot help you directly.
If you have a set of Unix utilities handy, such as Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two commands convert the words in each file so that there is a single word on each line, and they also sort the result.
If you're only interested in distinct words, you can change sort to sort -u to get the unique set.
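If you would rather skip the temporary files, bash process substitution lets you do it in one line:
$ comm -12 <(tr -cs "[:alpha:]" "\n" < firstFile | sort -u) <(tr -cs "[:alpha:]" "\n" < secondFile | sort -u) > commonWords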

Resources