How to sort a file with a tab delimiter - shell

I generated a text file in which the values are separated by tabs (\t):
value1 value2 value3
I want to sort this file by value1:
sort a.txt -o a.txt1
but the result is wrong:
google 1 1
google 1 2
google 1 3
=google 1 4
google 1 3
The =google line was inserted between the google lines. Why does this happen? It seems very strange.
I also tried sort a.txt -t $'\t' -k 1 -o a.txt1, but it has the same issue.

Your locale apparently specifies that = should be ignored when sorting. Try replacing sort with LC_ALL=C sort. This runs sort with the environment variable LC_ALL temporarily set to C, which overrides your locale (in any locale-aware program) with the "traditional", locale-ignorant "C" locale.
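Applied to the file from the question (a minimal sketch, keeping the original file names):
LC_ALL=C sort -t $'\t' -k 1 a.txt -o a.txt1
In the C locale, = is just byte 0x3D, which sorts before the letters, so the =google lines group together before the google lines instead of being interleaved.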

For comparison, numeric sort on the same data happens to give:
sort -n x.txt
google 1 1
google 1 2
google 1 3
google 1 3
=google 1 4

Related

How does the UNIX sort command handle expressions with different character sizes?

I am trying to sort and join two files that contain IP addresses. The first file has only IPs; the second contains IPs and an associated number. But sort acts differently on these files. Here are the commands and outcomes:
cat file | grep '180.76.15.15' | sort
cat file | grep '180.76.15.15' | sort -k 1
cat file | grep '180.76.15.15' | sort -t ' ' -k 1
outcome:
180.76.15.150 987272
180.76.15.152 52219
180.76.15.154 52971
180.76.15.156 65472
180.76.15.158 35475
180.76.15.15 99709
cat file | grep '180.76.15.15' | cut -d ' ' -f 1 | sort
outcome:
180.76.15.15
180.76.15.150
180.76.15.152
180.76.15.154
180.76.15.156
180.76.15.158
As you can see, the first three commands all produce the same outcome, but when the lines contain only an IP address the ordering changes, which breaks my attempt to join the files.
Specifically, the IP 180.76.15.15 appears on the bottom row in the first case (even when I sort explicitly on the first field), but on the top row in the second case, and I can't understand why.
Can anyone explain why this is happening?
P.S. I am connecting via ssh from Windows 10 PowerShell to Ubuntu 20.04 running on VMware.
sort uses your locale settings to determine the collation order of characters. From man sort:
*** WARNING *** The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses native byte values.
That gives you plain ASCII byte order. For example:
> cat file
#a
b#
152
153
15 4
15 1
Here everything is sorted alphabetically with special characters ignored: the space and # do not count, so 15 1 sorts as 151 and #a sorts under a; the numbers come first, then the letters.
thanasis@basis:~/Documents/development/temp> sort file
15 1
152
153
15 4
#a
b#
Here every byte counts: # (0x23) sorts before the digits, the space (0x20) sorts before any digit (so 15 1 and 15 4 come before 152), and the letters come last.
thanasis@basis:~/Documents/development/temp> LC_ALL=C sort file
#a
15 1
15 4
152
153
b#
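The same applies to the IP question above: in the default locale the space in 180.76.15.15 99709 is ignored, so the line compares as if it were 180.76.15.1599709 and sorts after 180.76.15.158. In the C locale the space (0x20) sorts before any digit, so (a sketch on the same data):
grep '180.76.15.15' file | LC_ALL=C sort
180.76.15.15 99709
180.76.15.150 987272
180.76.15.152 52219
180.76.15.154 52971
180.76.15.156 65472
180.76.15.158 35475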

`sort -t` doesn't work properly with string input

I want to sort some space-separated numbers in bash. The following doesn't work, however:
sort -dt' ' <<< "3 5 1 4"
Output is:
3 5 1 4
Expected output is:
1
3
4
5
As I understand it, the -t option should use its argument as a delimiter. Why isn't my code working? I know I can tr the spaces to newlines, but I'm working on a code-golf challenge and want to do it without any other utility.
EDIT: everybody is answering by splitting the input into separate lines. I do not want to do this. I already know how to do this with other utilities. I am specifically asking how to do this with sort, and sort only. If -t doesn't delimit input, what does it do?
Use process substitution with printf to put each input number on a separate line; otherwise sort sees only a single line to sort:
sort <(printf "%s\n" 3 5 1 4)
1
3
4
5
With this approach, the -d and -t ' ' options are not needed.
After searching around, I have discovered what -t is for. It delimits fields within a line when you want to sort by a certain part of each line. For example, if you have
Hello,56
Cat,81
Book,14
Nope,62
and you want to sort by the number, you would use -t',' to split on the comma and then -k to select which field to sort by. It is for field delimiting, not record delimiting as I had thought.
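For instance, assuming those four lines are saved in a file called fruits.csv (a name made up for this sketch), a numeric sort on the second comma-delimited field gives:
sort -t',' -k2,2n fruits.csv
Book,14
Hello,56
Nope,62
Cat,81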
Since sort only separates fields within a line, you have no choice but to put each number on its own line before the data reaches sort, for example by letting the shell word-split the string:
#!/bin/bash
var="8 3 5 1 4 7 2 9 6"
# Word splitting on the unquoted $var passes each number to printf
# as a separate argument, so each one lands on its own line.
printf "%s\n" $var | sort -d
This will give an obvious output of:
1
2
3
4
5
6
7
8
9
If this is not the way you wish to use sort, then you have already answered your own question by digging into the issue; doing that first would have saved time both for you and for the people answering.

Linux command to remove lines containing a duplicated value in a text file?

If I have a text file with the following form
1 1
1 3
3 4
2 2
5 7
...
Is there a Linux command that can give me the following result?
1 3
3 4
5 7
...
So, I want to delete the lines 1 1 and 2 2.
Yes, you can use something like:
awk '$1!=$2{print}' inputfilename
or the slightly less verbose (thanks to ooga):
awk '$1!=$2' inputfilename
which uses the "missing action means print" feature of awk.
Both these awk commands print lines where the columns don't match, and throw away everything else.
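Run on the sample input above, either command prints:
1 3
3 4
5 7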

Frequency count of particular field appended to line without deleting duplicates

Trying to work out how to get a frequency appended or prepended to each line in a file WITHOUT deleting duplicate occurrences (which uniq can do for me).
So, if input file is:
mango
mango
banana
apple
watermelon
banana
I need output:
mango 2
mango 2
banana 2
apple 1
watermelon 1
banana 2
All the solutions I have seen delete the duplicates. In other words, what I DON'T want is:
mango 2
banana 2
apple 1
watermelon 1
Basically, you cannot do it in one pass without keeping every line in memory. If that is acceptable, use Python/Perl/awk/whatever; the algorithm is quite simple, as the sketch below shows.
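For example, a two-pass awk sketch: the file is read twice, so the first pass builds the counts and the second prints each line with its count, keeping only one counter per distinct line in memory rather than the whole file:
awk 'NR==FNR { count[$0]++; next } { print $0, count[$0] }' input input
On the sample input this prints the desired output, duplicates and original order intact.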
Alternatively, let's do it with standard Unix tools. This is a bit cumbersome and can be improved, but it does the job:
$ sort input | uniq -c > input.count
$ nl input | sort -k 2 > input.line
$ join -1 2 -2 2 input.line input.count | sort -k 2,2n | awk '{print $1 " " $3}'
The first step counts the number of occurrences of each word.
As you said, you cannot both keep duplicates and preserve line ordering with sort and uniq alone, so we have to fix that: the second step prepends the line number, which we will use later to restore the original order.
In the last step, we join the two temporary files on the word itself. In the joined output the second column holds the original line number, so we sort numerically on that key and then strip it from the final output.
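On the sample input this reproduces the desired output:
mango 2
mango 2
banana 2
apple 1
watermelon 1
banana 2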

How to join files in Unix without sorting

I am trying to join 2 csv files based on a key in unix.
My files are really huge, 5 GB each, and sorting them takes too long.
I want to repeat this procedure for 50 such joins.
Can someone tell me how to join quickly, without sorting?
Unfortunately, there is no way around the sorting. But take a look at some utility scripts I have written here: https://github.com/stefan-schroedl/tabulator. You can use them if you keep a header of column names as the first line in each file. The script tbljoin takes care of the sorting and column counting for you. For example, say you have
Employee.csv:
employee_id|employee_name|department_id
4|John|10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
Dept.csv:
department_id|department_name
1|HR
2|Manufacturing
3|Engineering
4|Marketing
5|Sales
6|Information technology
7|Security
Then the command tbljoin Employee.csv Dept.csv produces
employee_id|employee_name|department_id|department_name
20|Peter|2|Manufacturing
21|David|3|Engineering
1|Monica|4|Marketing
12|Louis|5|Sales
13|Barbara|6|Information technology
tabulator contains many other useful features, e.g., for simple rearranging of columns.
Here is an example with two files whose data is delimited by pipes.
The employee file contains employee_id, employee_name, and department_id, delimited by pipes.
Employee.csv
4|John | 10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
The department file contains department_id and the department name, delimited by pipes.
Dept.csv
1|HR
2| Manufacturing
3| Engineering
4 |Marketing
5| Sales
6| Information technology
7| Security
command:
join -t '|' -1 3 -2 1 Employee_sort.csv Dept.csv
-t '|' indicates the files are pipe-delimited
-1 3 selects the third column of file 1, i.e. department_id from Employee_sort.csv
-2 1 selects the first column of file 2, i.e. department_id from Dept.csv
(join requires both inputs to be sorted on the join field; Employee_sort.csv is Employee.csv sorted on that field, as sketched after the output below.)
Using the above command, we get the following output:
2|20|Peter| Manufacturing
3|21|David| Engineering
4|1|Monica| Marketing
5|12|Louis| Sales
6|13|Barbara| Information technology
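For reference, Dept.csv above is already sorted on its first column, and Employee_sort.csv can be produced with something like this (a sketch, file names as in the example):
sort -t '|' -k3,3 Employee.csv > Employee_sort.csv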
If you want everything from file 2 plus the corresponding entries from file 1, you can also use the -a and -v options.
Try the following commands:
join -t '|' -1 3 -2 1 -v2 Employee_sort.csv Dept.csv
join -t '|' -1 3 -2 1 -a2 Employee_sort.csv Dept.csv
(-v2 prints only the lines of file 2 that have no match in file 1; -a2 prints every line of file 2, paired with matches from file 1 where they exist.)
I think that you could avoid using join (and thus sorting your files), but this is not a quick solution:
In both files, replace all pipes and all double spaces with single spaces:
sed -i 's/|/ /g;s/  / /g' Employee.csv Dept.csv
Then run these lines as a bash script:
while read -r a b c; do
    while read -r d e; do
        # print the employee fields plus the department name when the ids match
        if [ "$c" -eq "$d" ]; then
            echo -e "$a\t$b\t$c\t$e"
        fi
    done < Dept.csv
done < Employee.csv
Note that this nested loop is quadratic (every employee line scans all of Dept.csv), so it will take a long time on large files.
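If the smaller file fits in memory, you can skip both join and the sorting entirely with a hash join in awk (a sketch, assuming clean pipe-delimited fields without the stray spaces shown above; employees whose department_id has no match get an empty last field):
awk -F'|' '
    NR==FNR { dept[$1] = $2; next }   # first file: map department_id -> name
    { print $0 "|" dept[$3] }         # second file: append the matching name
' Dept.csv Employee.csv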
