unexpected result from gnu sort - sorting

when I try to sort the following text file 'input':
test1 3
test3 2
test 4
with the command
sort input
the output is exactly the input. Here is the output of
od -bc input
:
0000000 164 145 163 164 061 011 063 012 164 145 163 164 063 011 062 012
t e s t 1 \t 3 \n t e s t 3 \t 2 \n
0000020 164 145 163 164 011 064 012
t e s t \t 4 \n
0000027
It's just a tab separated file with two columns. When I do
sort -k 2
The output changes to
test3 2
test1 3
test 4
which is what I would expect. But if I do
sort -k 1
nothing changes with respect to the input, whereas I would expect 'test' to sort before 'test1'. Finally, if I do
cat input | cut -f 1 | sort
I get
test
test1
test3
as expected. Is there a logical explanation for this? What exactly is sort supposed to do by default, something like:
sort -k 1
?
My version of sort:
sort (GNU coreutils) 7.4

From the man pages:
* WARNING * The locale specified by the environment affects
sort
order. Set LC_ALL=C to get the traditional sort order that uses
native
byte values.
So it seems export LC_ALL=C must help

Related

Matching Column numbers from two different txt file

I have two text files which are a different size. The first one below example1.txt has only one column of numbers:
101
102
103
104
111
120
120
125
131
131
131
131
131
131
And the Second text file example2.txt has two columns:
101 3
102 3
103 3
104 4
104 4
111 5
120 1
120 1
125 2
126 2
127 2
128 2
129 2
130 2
130 2
130 2
131 10
131 10
131 10
131 10
131 10
131 10
132 10
The first column in the example1.txt is a subset of column one in example2.txt. The second column numbers in example2.txt are the associated values with the first column.
What I want to do is to get the associated second column of example1.txt following the example2.txt. I have tried but couldn't figure it out yet. Any suggestions or solutions in bash, awk would be appreciated
Therefore the result would be:
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
UPDATE:
I have been trying to do the column matching like :
awk -F'|' 'NR==FNR{c[$1]++;next};c[$1] > 0' example1.txt example2.txt > output.txt
In both files, the first column goes like an ascending order, but the frequency of the same numbers may not be the same. For example, the frequency of 104 is one in the example1.txt, but it appeared twice in the example2.txt The important thing is that the associated second column value would be the same for example1.txt too. Just see the expected output in the end.
$ awk 'NR==FNR{a[$1]++; next} ($1 in a) && b[$1]++ < a[$1]' f1 f2
101 3
102 3
103 3
104 4
111 5
120 1
120 1
125 2
131 10
131 10
131 10
131 10
131 10
131 10
This solution doesn't make use of the fact that the first column is in ascending order. Perhaps some optimization can be done based on that.
($1 in a) && b[$1]++ < a[$1] is the main difference from your solution. This checks if the field exists as well as that the count doesn't exceed that of the first file.
Also, not sure why you set the field separator as | because there is no such character in the sample given.

Difference of column value received from operation done on 2 different files

I have following 2 files file1.txt and file2.txt with the data as given below-
Data in file1.txt
125
125
295
295
355
355
355
Data in file2.txt
125
125
295
355
I did below operation over the files and got following output-
Operation1-
sort file1.txt | uniq -c
2 125
2 295
3 355
Operation2-
sort file2.txt | uniq -c
2 125
1 295
1 355
Now, I want following output using the result of Operation1 and Operation2 -
I want to compare the result of Operation1 and Operation2 and get the output which will show the difference of values from column 1 of both the files, and it will show the column 2 as it is as given below-
0 125
1 295
2 355
redirect output of operation 1 and operation 2 in some files. Let say
file1
and
file2
, then write like this:-
paste file1 file2 | awk '{print $1-$3,$2}'
you will have output
0 125
1 295
2 355

unexpected result: grep from a changing line

I wrote a bash command to test grep from a changing line:
for i in $(seq 0 9); do echo -e -n "\r"$i; sleep 0.1; done | grep 5
The result shows:
9
Update
The real problem is as follows:
mplayer shows and refreshes a single-line playing progress when playing a media file. A sample result is:
A: 17.2 (17.2) of 213.0 (03:33.0) 0.5%
And I'm trying to grep this playing progress and ingore other lines. I used this command:
mplayer xxx.mp3 | grep ^A:
The result does not contain the line expected.
Update 2
mplayer xxx.mp3 | od -xda
shows:
0002140 4a5b 410d 203a 2020 2e31 2033 3028 2e31
[ J \r A : 1 . 3 ( 0 1 .
133 112 015 101 072 040 040 040 061 056 063 040 050 060 061 056
0002160 2932 6f20 2066 3132 2e33 2030 3028 3a33
2 ) o f 2 1 3 . 0 ( 0 3 :
062 051 040 157 146 040 062 061 063 056 060 040 050 060 063 072
0002200 3333 302e 2029 3020 342e 2025 5b1b 0d4a
3 3 . 0 ) 0 . 4 % 033 [ J \r
063 063 056 060 051 040 040 060 056 064 045 040 033 133 112 015
0002220 3a41 2020 3120 352e 2820 3130 342e 2029
A : 1 . 5 ( 0 1 . 4 )
101 072 040 040 040 061 056 065 040 050 060 061 056 064 051 040
0002240 666f 3220 3331 302e 2820 3330 333a 2e33
o f 2 1 3 . 0 ( 0 3 : 3 3 .
157 146 040 062 061 063 056 060 040 050 060 063 072 063 063 056
And
mplayer xxx.mp3 | tr '\r' '\n'
shows
A: 0.2 (00.1) of 213.0 (03:33.0) 0.3%
A: 0.3 (00.3) of 213.0 (03:33.0) 0.3%
A: 0.5 (00.5) of 213.0 (03:33.0) 0.4%
A: 0.6 (00.6) of 213.0 (03:33.0) 0.4%
A: 0.8 (00.8) of 213.0 (03:33.0) 0.4%
A: 1.0 (01.0) of 213.0 (03:33.0) 0.4%
While,
mplayer xxx.mp3 | tr '\r' '\n' | grep ^A
shows empty result.
Any tip will be appreciated.
It's your definition of "line" that's causing the problem here. The -n means that all the numbers are output on a single line, according the the definition used by grep (a series of characters, terminated by the \n character):
\r1\r2\r3\r4\r5\r6\r7\r8\r9
If you pipe the output through something like a hex dump, you can see what's happening:
$ for i in $(seq 0 9); do echo -e -n "\r"$i; sleep 0.1; done | grep 5 | od -xcb
0000000 300d 310d 320d 330d 340d 350d 360d 370d
\r 0 \r 1 \r 2 \r 3 \r 4 \r 5 \r 6 \r 7
015 060 015 061 015 062 015 063 015 064 015 065 015 066 015 067
0000020 380d 390d 000a
\r 8 \r 9 \n
015 070 015 071 012
0000025
That single line containing all the carriage returns (and not newlines) will, when output, appear to be a single line with just the 9 on it. Removing the -n will result instead in:
$ for i in $(seq 0 9); do echo -e "\r"$i; sleep 0.1; done | grep 5 | od -xcb
0000000 350d 000a
\r 5 \n
015 065 012
0000003
which would look like just the 5 was being output.
If you have a process that outputs "lines" separated by carriage returns rather than newlines, there's nothing to stop you changing them on the fly so as to be able to handle them as real lines:
$ echo -e "junk\rA: good 1\rjunk\rA: good 2\rjunk" | tr '\r' '\n' | grep '^A'
A: good 1
A: good 2
Applying that back to your original question, it would be (with the sleep removed since it's irrelevant):
$ for i in $(seq 0 9); do echo -e -n "\r"$i; done | tr '\r' '\n' | grep 5
5
$ for i in $(seq 0 9); do echo -e -n "\r"$i; done | tr '\r' '\n' | grep 5 | od -xcb
0000000 0a35
5 \n
065 012
0000002

The simplest way to delete a section of text, n times

I have a file bigger than 4gb which is bad news for me because I can't open the file in notepad++ and use the macro feature to record and repeat a process to the end of a file.
What I'd like to do is say, leave the first 20 lines of text, then delete the next 80, then repeat that process to the end of a file.
What would be the easiest way to do this?
I'm looking at these files on a linux server so running a script of some kind would be the easiest way, or maybe someone knows a way to do this in vi? (hence the lame taging)
Thanks in advance
awk can do this fairly easily:
awk '(NR-1)%100 < 20' bigfile.txt
I would go with the awk solution, but here's one way you could do the same thing with sed:
seq 20 | sed 's/$/~100p/' | sed -nf - bigfile.txt
Testing:
seq 20 | sed 's/$/~100p/' | sed -nf - <(seq 200)
Output:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120

computer basics problem

hi everybody can anyone tell me answer of this question ?
i created a simple txt file. it contain only two words and the words are hello word according to i studied computer uses ascii code to store the text on disk or memory .In ascii code each letter or symbol is represented by one byte or in simple words one byte is used to store a symbol.
Now the problem is this when ever i saw the size of file it shows 11 byte I understand 9 byte for words one byte for space makes the total of 10 then why it is showing 11 byte size .i tried different things such as changing the name of file saving it with shortest name possible or longest name possible but it did not change the total storage
so can any body explain why it is happening? i tried this thing over window or Linux(Ubuntu.centos) system result is same.
pax> echo hello word >outfile.txt
pax> ls -al outfile.txt
-rw-r--r-- 1 pax pax 11 2010-11-19 15:34 outfile.txt
pax> od -xcb outfile.txt
0000000 6568 6c6c 206f 6f77 6472 000a
h e l l o w o r d \n
150 145 154 154 157 040 167 157 162 144 012
pax> hd outfile.txt
00000000 68 65 6c 6c 6f 20 77 6f 72 64 0a |hello word.|
0000000b
As per above, you're storing "hello word" and the newline character. That's 11 characters in total. If you don't want the newline, you can use something like the -n option of echo (which doesn't add the newline):
pax> echo -n hello word >outfile.txt
pax> ls -al outfile.txt
-rw-r--r-- 1 pax pax 10 2010-11-19 15:36 outfile.txt
pax> od -xcb outfile.txt
0000000 6568 6c6c 206f 6f77 6472
h e l l o w o r d
150 145 154 154 157 040 167 157 162 144
pax> hd outfile.txt
00000000 68 65 6c 6c 6f 20 77 6f 72 64 |hello word|
0000000a
If you want to see the content of the file you can perform an octal dump of it using the "od" command under linux "od ". Most probably what you will see is a CR (carriage return) and a LN (linefeed).
The name of the file has nothing to do with his size.
Luis
Did you a new line in the text file (\n)? Just because this character cannot be seen does not mean it is not there.

Resources