I've been trying to count the number of times a character appears in a file using the following code:
sed 's/./&\n/g' 1.txt | sort | uniq -ic
However, it doesn't distinguish between upper and lower case. Here is an example:
The content of the file 1.txt is this: hola Adios
And this is the output:
1
2 a
1 d
1 h
1 i
1 l
2 o
1 s
As you can see, letters "a" and "o" are each counted twice, but the correct output should be this:
1
1 a
1 A
1 d
1 h
1 i
1 l
2 o
1 s
Just one "a" and one "A". Does anyone know how I can modify the code so that it distinguishes between upper and lower case? Thanks in advance.
Just do
sed 's/./&\n/g' 1.txt | sort | uniq -c
removing the 'i' option, which is what makes uniq ignore the difference between upper and lower case.
Execution:
pi@raspberrypi:/tmp $ cat 1.txt
hola Adios
pi@raspberrypi:/tmp $ sed 's/./&\n/g' 1.txt | sort | uniq -c
1
1
1 a
1 A
1 d
1 h
1 i
1 l
2 o
1 s
pi@raspberrypi:/tmp $
Note that one of the lone '1' counts is for the newline; if I remove it from the input file:
pi@raspberrypi:/tmp $ cat 1.txt
hola Adiospi@raspberrypi:/tmp $ sed 's/./&\n/g' 1.txt | sort | uniq -c
1
1 a
1 A
1 d
1 h
1 i
1 l
2 o
1 s
pi@raspberrypi:/tmp $
If you use the empty field separator in awk, you can process one character at a time. The advantage is that you use only one process and avoid inserting a newline after each character with sed, as your original attempt does.
awk -F '' '{for(i=1;i<=NF;i++)a[$i]++}END{for (i in a){print a[i],i}}' 1.txt
Although an empty field separator is not specified by POSIX, it is a common extension; it worked with gawk, mawk and nawk.
awk -F '' ' #Empty field separator
{for(i=1;i<=NF;i++)a[$i]++} #Each char has an entry in this array and is incremented when found
END{for (i in a){print a[i],i}} #Print number of occurrences and value
' 1.txt
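For the question's sample file, a run looks like this (the line that appears to be a bare "1" is the count for the space between the words, and since for (i in a) has no guaranteed iteration order, your order may differ):
$ awk -F '' '{for(i=1;i<=NF;i++)a[$i]++}END{for (i in a){print a[i],i}}' 1.txt
1  
1 A
1 a
1 d
1 h
1 i
1 l
2 o
1 s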
I've got files which look like this (there can be more columns or rows):
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
dif-2-3-4-5.com 1 1 1
And I want to compare these numbers:
1 1 1
1 1 2
1 2 1
2 1 1
1 1 1
And print only those rows which do not repeat, so I get this:
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Another simple approach is sort with uniq, using a KEYDEF for fields 2-4 with sort and skipping field 1 with uniq, e.g.
$ sort file.txt -k 2,4 | uniq -f1
Example Use/Output
$ sort file.txt -k 2,4 | uniq -f1
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Keep a running record of the triples already seen and only print the first time they appear:
$ awk '!(($2,$3,$4) in seen) {print; seen[$2,$3,$4]}' file
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1
Try the following awk code too:
awk '!a[$2,$3,$4]++' Input_file
Explanation:
Create an array named a whose indexes are $2,$3,$4. The condition !a[$2,$3,$4] is true whenever the current line's $2,$3,$4 have NOT been seen in array a before, and the expression then does 2 things:
It increments that index's value, so that the next time the condition will NOT be true for the same $2,$3,$4 indexes in array a.
It specifies no action; since awk works in condition-action pairs, the default action is to print the current line. This goes on for all the lines in Input_file, and the last line is not printed because its $2,$3,$4 are already present in array a.
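If you want to watch the condition evaluate, you can print its value in front of each line (a throwaway diagnostic, not part of the solution):
$ awk '{print !a[$2,$3,$4]++, $0}' Input_file
1 dif-1-2-3-4.com 1 1 1
1 dif-1-2-3-5.com 1 1 2
1 dif-1-2-4-5.com 1 2 1
1 dif-1-3-4-5.com 2 1 1
0 dif-2-3-4-5.com 1 1 1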
I hope this helps.
This works with POSIX and GNU awk:
$ awk '{s=""
for (i=2;i<=NF; i++)
s=s $i "|"}
s in seen { next }
++seen[s]' file
Which can be shortened to:
$ awk '{s=""; for (i=2;i<=NF; i++) s=s $i "|"} !seen[s]++' file
It also supports a variable number of columns; the "|" appended after each field keeps different field splits (such as 1 12 versus 11 2) from producing the same key.
If you want a sort/uniq solution that also respects file order (i.e. the first of a set of duplicates is printed, not a later one), you need a decorate, sort, undecorate approach.
You can:
use cat -n to decorate the file with line numbers;
sort -k3 -k1n to sort first on all the fields from the 3rd through the end of the line, then numerically on the added line number;
uniq -f2 to skip the added line number and field 1, keeping only the first line in each group of dups;
finally, sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//' to remove the added line numbers:
cat -n file | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
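Example Use/Output (with these rows the sorted order happens to coincide with the input order):
$ cat -n file.txt | sort -k3 -k1n | uniq -f2 | sed -e 's/^[[:space:]]*[0-9]*[[:space:]]*//'
dif-1-2-3-4.com 1 1 1
dif-1-2-3-5.com 1 1 2
dif-1-2-4-5.com 1 2 1
dif-1-3-4-5.com 2 1 1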
Awk is easier and faster in this case.
I can move every second row into the second column of the previous row by:
awk '{printf "%s%s",$0,(NR%2?FS:RS)}' file > newfile
But I can't do it the other way around. What I have is as below:
1 a
2 b
3 c
I need
1
a
2
b
3
c
I have checked several similar column-row shifting questions, but couldn't figure out my case. Thanks!
You can use this awk command with OFS='\n' to make the output field separator a newline, after forcing awk to rewrite each record with the $1=$1 trick:
awk '{$1=$1} 1' OFS='\n' file
1
a
2
b
3
c
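The $1=$1 is what makes OFS take effect: awk only rebuilds $0 with the new separator when a field is assigned. Without it, each record is printed untouched, as a quick check shows (any POSIX awk behaves this way):
$ awk '1' OFS='\n' file
1 a
2 b
3 c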
You can also use grep -o:
grep -Eo '\w+' file
1
a
2
b
3
c
Just use xargs, taking one argument at a time:
xargs -n1 <file
1
a
2
b
3
c
From the xargs man page:
-n max-args, --max-args=max-args
Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the
-x option is given, in which case xargs will exit.
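As an aside, -n also goes the other way: with two arguments per invocation it re-joins the items pairwise (here newfile is assumed to hold the one-item-per-line output from above):
$ xargs -n2 <newfile
1 a
2 b
3 c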
You can use tr:
cat file | tr ' ' '\n'
or sed:
sed -r 's/ /\n/g' file
You get:
1
a
2
b
3
c
I would like to process a multi-line, multi-field input file so that I get a file with all pairs of consecutive lines, ONLY IF they have the same value in field #1.
That is, for each line the output would contain the line itself plus the next line, and would omit combinations of lines with different values in field #1.
It's better explained with an example.
Given this input:
1 this
1 that
1 nye
2 more
2 sit
I want to produce something like:
1 this 1 that
1 that 1 nye
2 more 2 sit
So far I've got this:
awk 'NR % 2 == 1 { i=$0 ; next } { print i,$0 } END { if ( NR % 2 == 1 ) { print i } }' input.txt
My output:
1 this 1 that
1 nye 2 more
2 sit
As you can see, my code is blind to the value of field #1, and (more importantly) it omits "intermediate" results like 1 that 1 nye (once it's done with a line, it jumps to the next pair of lines).
Any ideas? My preferred language is awk/gawk, but if it can be done using unix bash it's ok as well.
Thanks in advance!
You can use this awk, which remembers the last line seen for each first-field value in a[$1] and prints the remembered line next to the current one whenever that key repeats:
awk 'NR>1 && ($1 in a){print a[$1], $0} {a[$1]=$0}' file
1 this 1 that
1 that 1 nye
2 more 2 sit
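The same logic spread out with comments (annotation only, nothing changed):
awk '
NR>1 && ($1 in a) { print a[$1], $0 } # a previous line with this key exists: print the pair
                  { a[$1] = $0 }      # remember the current line, keyed on its first field
' file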
You can do it with simple commands. Assuming your input file is "test.txt" with content:
1 this
1 that
1 nye
2 more
2 sit
the following commands give the requested output:
sort -n test.txt > tmp1
sed 1d tmp1 | paste tmp1 - | egrep '^([0-9]+) *[^ ]* *\1'
Just for fun, pairing each line with its successor and keeping the pairs whose first fields match:
paste -d" " filename <(sed 1d filename) | awk '$1==$3'
Say, I have two files and want to find out how many equal lines they have. For example, file1 is
1
3
2
4
5
0
10
and file2 contains
3
10
5
64
15
In this case the answer should be 3 (common lines are '3', '10' and '5').
This, of course, is done quite simply with Python, for example, but I got curious about doing it from bash (with some standard utils or extra things like awk or whatever). This is what I came up with:
cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
It does seem too complicated for the task, so I'm wondering is there a simpler or more elegant way to achieve the same result.
P.S. Outputting the ratio of the common part to the number of lines in each file would also be nice, though it is not necessary.
UPD: Files do not have duplicate lines
To find the lines common to your 2 files using awk:
awk 'a[$0]++' file1 file2
This will output 3, 10 and 5, each on its own line.
Now just pipe this to wc to get the number of common lines:
awk 'a[$0]++' file1 file2 | wc -l
This will output 3.
Explanation:
Here, a works like a dictionary with a default value of 0. When you write a[$0]++, you add 1 to a[$0], but the expression returns the previous value of a[$0] (see the difference between a++ and ++a). So you get 0 (= false) the first time you encounter a given string and 1 (or more, still = true) the following times.
By default, awk 'condition' file outputs all the lines for which condition is true.
Also be aware that the a[] array expands every time you encounter a new key. At the end of the script, the size of the array is the number of unique values across all input files (in the OP's example, it would be 9).
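If the a++ vs ++a distinction is unfamiliar, this throwaway one-liner (with hypothetical input foo) shows it: the first print returns the old value, the second shows the stored value, and the third increments before returning:
$ echo foo | awk '{ print a[$0]++; print a[$0]; print ++a[$0] }'
0
1
2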
Note: this solution counts duplicates, i.e if you have:
file1 | file2
1 | 3
2 | 3
3 | 3
awk 'a[$0]++' file1 file2 will output 3 3 3 and awk 'a[$0]++' file1 file2 | wc -l will output 3
If this is a behaviour you don't want, you can use the following code instead; it prints a line only at the moment its total count reaches exactly 2, so further repeats are not counted again:
awk '++a[$0] == 2' file1 file2 | wc -l
With your input example this works too, but if the files are huge, I prefer the awk solutions by others:
grep -cFwf file2 file1
Here -f file2 reads the patterns from file2, -F treats them as fixed strings, -w matches whole words only, and -c counts the matching lines in file1. With your input files, the above line outputs
3
Here's one without awk that instead uses comm:
comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l
comm compares two sorted files line by line. The combined option -12 suppresses column 1 (lines unique to the first file) and column 2 (lines unique to the second file), leaving only the lines found in both.
The output is the lines they have in common, on separate lines. wc -l counts the number of lines.
Output without wc -l:
10
3
5
And when counting (obviously):
3
You can also use the comm command. Remember that you first have to sort the files you want to compare:
[gc@slave ~]$ sort a > sorted_1
[gc@slave ~]$ sort b > sorted_2
[gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
10
3
5
From the man page for the comm command:
comm - compare two sorted files line by line
Options:
-1 suppress column 1 (lines unique to FILE1)
-2 suppress column 2 (lines unique to FILE2)
-3 suppress column 3 (lines that appear in both files)
You can do it all with awk:
awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2
To get the percentage, something like this works:
awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2
and outputs
file1 0.428571
file2 0.6
In your solution, you can get rid of the cat:
sort file1 file2| uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l
How about keeping it nice and simple...
This is all that's needed:
cat file1 file2 | sort -n | uniq -d | wc -l
3
man sort:
-n, --numeric-sort -- compare according to string numerical value
man uniq:
-d, --repeated -- only print duplicate lines
man wc:
-l, --lines -- print the newline counts
Hope this helps.
EDIT - one fewer process (credit martin):
sort file1 file2 | uniq -d | wc -l
One way using awk:
awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2
Output:
3
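For anyone new to the NR==FNR idiom: FNR is the per-file line number, so NR==FNR holds only while the first file is being read. The first block therefore loads file1's lines as array keys, and the second block counts file2's lines found among them:
awk '
NR==FNR { a[$0]; next } # first file: record each line as an array key
$0 in a { n++ }         # second file: count lines present in a
END { print n }
' file1 file2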
The first answer by Aserre using awk is good but may have the undesirable effect of counting duplicates - even if the duplicates exist in only ONE of the files, which is not quite what the OP asked for.
I believe this edit will return only the unique lines that exist in BOTH files.
awk 'NR==FNR{a[$0]=1;next}a[$0]==1{a[$0]++;print $0}' file1 file2
If duplicates are desired, but only if they exist in both files, I believe this next version will work. It will only report duplicates in the second file that also exist in the first file. (If the duplicates exist in the first file, only those that also exist in file2 will be reported, so file order matters.)
awk 'NR==FNR{a[$0]=1;next}a[$0]' file1 file2
Btw, I tried using grep, but it was painfully slow on files with a few thousand lines each. Awk is very fast!
UPDATE 1: the new version ensures intra-file duplicates are excluded from the count, so only cross-file duplicates show up in the final stats:
mawk '
BEGIN { _*= FS = "^$"
} FNR == NF { split("",___)
} ___[$_]++<NF { __[$_]++
} END { split("",___)
for (_ in __) {
___[__[_]]++ } printf(RS)
for (_ in ___) {
printf(" %\04715.f %s\n",_,___[_]) }
printf(RS) }' \
<( jot - 1 999 3 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 2 1024 7 | mawk '1;1;1;1;1' | shuf ) \
<( jot - 7 1295 17 | mawk '1;1;1;1;1' | shuf )
3 3
2 67
1 413
===========================================
This is probably way overkill, but I wrote something similar to supplement uniq -c:
measuring the frequency of frequencies
It's like uniq -c | uniq -c without wasting time sorting. The summation and % parts are trivial from here, with 47 overlapping lines in this example. It avoids any per-row processing, since the current setup only shows the summarized stats.
If you need the actual duplicated rows, they're also available right there, serving as the hash keys of the 1st array.
gcat <( jot - 1 999 3 ) <( jot - 2 1024 7 ) |
mawk '
BEGIN { _*= FS = "^$"
} { __[$_]++
} END { printf(RS)
for (_ in __) { ___[__[_]]++ }
for (_ in ___) {
printf(" %\04715.f %s\n",
_,___[_]) } printf(RS) }'
2 47
1 386
Add another file, and the results reflect the change (I added <( jot - 5 1295 5 )):
3 9
2 115
1 482
Given an input file
z
b
a
f
g
a
b
...
I want to output the number of occurrences of each string, for example:
z 1
b 2
a 2
f 1
g 1
How can this be done in a bash script?
You can sort the input and pass it to uniq -c:
$ sort input_file | uniq -c
2 a
2 b
1 f
1 g
1 z
If you want the numbers on the right, use awk to switch them:
$ sort input_file | uniq -c | awk '{print $2, $1}'
a 2
b 2
f 1
g 1
z 1
Alternatively, do the whole thing in awk:
$ awk '
{
++count[$1]
}
END {
for (word in count) {
print word, count[word]
}
}
' input_file
f 1
g 1
z 1
a 2
b 2
cat text | sort | uniq -c
should do the job
Try:
awk '{ freq[$1]++; } END{ for( c in freq ) { print c, freq[c] } }' test.txt
Where test.txt would be your input file.
Here's a bash-only version (requires bash version 4), using an associative array.
#! /bin/bash
declare -A count
while read -r val ; do
count[$val]=$(( ${count[$val]} + 1 ))
done < your_input_file # change this as needed
for key in "${!count[@]}" ; do
echo $key ${count[$key]}
done
This might work for you:
cat -n file |
sort -k2,2 |
uniq -cf1 |
sort -k2,2n |
sed 's/^ *\([^ ]*\).*\t\(.*\)/\2 \1/'
This outputs the number of occurrences of each string in the order in which the strings first appear.
You can use sort filename | uniq -c.
Have a look at the Wikipedia page on uniq.