awk on multiple files and piping the output of each of its runs to the wc command separately - bash

I have a bunch of record-wise formatted (.csv) files. The first field is an integer or may be empty; this is true for all the files. I want to count the number of records whose first field is empty in each file, and then plot a graph of those counts over all the files.
File format of filename.csv:
123456,few,other,fields
,few,other,fields
234567,few,other,fields
I want something like
awk -F, '$1==""' `ls` | (for each file separately wc -l) | gnugraph ( y axis as output of wc -l command and x axis as simply 1 to n where n is number of csv files)
The problem I am facing is that wc -l gets executed only once for all the files together. I want to run wc -l for each file separately, count the number of records having an empty first field, and provide this sequence of counts to the gnugraph command.
Once I get the required count for each file I am almost done, since
seq 10 | gnuplot -p -e "plot '<cat'"
works fine

You could use awk to keep track of the count for each file in an array. Then at the end print the contents of the array:
awk '$1==""{a[FILENAME]+=1} END{for(file in a) { print file, a[file] }}' `ls`
This way you don't have to tangle with wc and can just shoot the contents right over to gnuplot.
Example in use:
$> cat file1
,test
2,test
3,
$> cat file2
,test
2,test
3,
,test
$> awk -F"," '$1==""{a[FILENAME]+=1} END{for(file in a) { print file, a[file] }}' `ls`
file1 1
file2 2
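Since you only need the counts on the y axis, you can also drop the filename from the print and pipe straight into gnuplot, the same way the seq 10 test in the question works. A minimal sketch, assuming all the .csv files are in the current directory (note that awk's for (file in a) loop does not guarantee any particular order, so print the filename too and sort first if the x axis must follow file order):
awk -F, '$1==""{a[FILENAME]++} END{for(f in a) print a[f]}' *.csv | gnuplot -p -e "plot '<cat'"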

With gawk you can use BEGINFILE and ENDFILE:
$ awk -F, '$1==""{++i} BEGINFILE{i=0} ENDFILE{print FILENAME, i}' file1 file2
file1 3
file2 1
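A nice side effect of ENDFILE is that the counts come out in the same order as the files on the command line, so they can be fed to gnuplot directly. A rough sketch, assuming the .csv files expand in the desired x-axis order:
$ awk -F, 'BEGINFILE{i=0} $1==""{++i} ENDFILE{print i}' *.csv | gnuplot -p -e "plot '<cat'"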

If you want to run wc -l separately for each file, you'll have to set up a loop.
Something along the lines of-
for i in `ls`
do
awk -F, '$1==""' "$i" | wc -l
done | gnugraph

For the first field, there is an easier way with grep
$ grep -c '^,' file{1..3}
file1:1
file2:2
file3:4
I copied your file to file1 and doubled it for file2 and again for file3, hence the counts 1, 2 and 4.
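If you go the grep route, the counts can be split off the filename: prefixes with cut before plotting. A minimal sketch, again assuming the shell expands the files in the order you want on the x axis:
$ grep -c '^,' *.csv | cut -d: -f2 | gnuplot -p -e "plot '<cat'"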

Related

How to merge in one file, two files in bash line by line [duplicate]

What's the easiest/quickest way to interleave the lines of two (or more) text files? Example:
File 1:
line1.1
line1.2
line1.3
File 2:
line2.1
line2.2
line2.3
Interleaved:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Sure it's easy to write a little Perl script that opens them both and does the task. But I was wondering if it's possible to get away with less code, maybe a one-liner using Unix tools?
paste -d '\n' file1 file2
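paste accepts any number of input files, so the same idea should extend naturally to the "(or more)" case, e.g. for three inputs:
paste -d '\n' file1 file2 file3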
Here's a solution using awk:
awk '{print; if(getline < "file2") print}' file1
produces this output:
line 1 from file1
line 1 from file2
line 2 from file1
line 2 from file2
...etc
Using awk can be useful if you want to add some extra formatting to the output, for example if you want to label each line based on which file it comes from:
awk '{print "1: "$0; if(getline < "file2") print "2: "$0}' file1
produces this output:
1: line 1 from file1
2: line 1 from file2
1: line 2 from file1
2: line 2 from file2
...etc
Note: this code assumes that file1 is at least as long as file2.
If file1 contains more lines than file2 and you want to output blank lines for file2 after it finishes, add an else clause to the getline test:
awk '{print; if(getline < "file2") print; else print ""}' file1
or
awk '{print "1: "$0; if(getline < "file2") print "2: "$0; else print"2: "}' file1
@Sujoy's answer points in a useful direction. You can add line numbers, sort, and strip the line numbers:
(cat -n file1 ; cat -n file2 ) | sort -n | cut -f2-
Note (of interest to me) this needs a little more work to get the ordering right if instead of static files you use the output of commands that may run slower or faster than one another. In that case you need to add/sort/remove another tag in addition to the line numbers:
(cat -n <(command1...) | sed 's/^/1\t/' ; cat -n <(command2...) | sed 's/^/2\t/' ; cat -n <(command3) | sed 's/^/3\t/' ) \
| sort -n | cut -f2- | sort -n | cut -f2-
With GNU sed:
sed 'R file2' file1
Output:
line1.1
line2.1
line1.2
line2.2
line1.3
line2.3
Here's a GUI way to do it: Paste them into two columns in a spreadsheet, copy all cells out, then use regular expressions to replace tabs with newlines.
cat file1 file2 |sort -t. -k 2.1
Here it is specified that the separator is "." and that we are sorting on the first character of the second field.

How to use grep -c to count occurrences of various strings in a file?

I have a bunch of files with data from a company and I need to count, let's say, how many people from certain cities there are. Initially I was doing it manually with
grep -c 'Chicago' file.csv
But now I have to look for a lot of cities and it would be time-consuming to do this manually every time. So I did some research and found this:
#!/bin/sh
for p in 'Chicago' 'Washington' 'New York'; do
grep -c '$p' 'file.csv'
done
But it doesn't work. It keeps giving me 0s as output and I'm not sure what is wrong. Anyway, basically what I need is an output with every result (just the values) given by grep in a column, so I can copy it directly to a spreadsheet. Ex.:
132
407
523
Thanks in advance.
You should use sort + uniq for that:
$ awk '{print $<N>}' file.csv | sort | uniq -c
where N is the column number of the city field (I assume it's structured, since it's a CSV file).
For example, to see which shell is used how often on my system:
$ awk -F: '{print $7}' /etc/passwd | sort | uniq -c
1 /bin/bash
1 /bin/sync
1 /bin/zsh
1 /sbin/halt
41 /sbin/nologin
1 /sbin/shutdown
$
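Applied to the question's data it would look something like this, assuming (purely as an example) that the city sits in the third comma-separated column:
$ awk -F, '{print $3}' file.csv | sort | uniq -c
That prints a count next to every city found, not just the ones you care about, so pick out the ones you need afterwards (e.g. with grep).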
From the title, it sounds like you want to count the number of occurrences of the string rather than the number of lines on which the string appears, but since you accept the grep -c answer I'll assume you actually only care about the latter. Do not use grep and read the file multiple times. Count everything in one pass:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' input-file
Note that this will print a blank line instead of "0" for any string that does not appear, so you might want to initialize. There are several ways to do that. I like:
awk '/Chicago/ {c++} /Washington/ {w++} /New York/ {n++}
END { print c; print w; print n }' c=0 w=0 n=0 input-file
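As an aside, the loop in the question keeps printing 0 because '$p' is in single quotes, so grep searches for the literal string $p. With double quotes it behaves as intended, though it still reads the file once per city:
for p in 'Chicago' 'Washington' 'New York'; do
grep -c "$p" file.csv
done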

split numbers in and store them in different files using unix shell script

I have a file called "list.txt" which contains the following rows of numbers.
31056780
31909020
31092320
61093190
61094592
45090280
45902902
I now need to take all the rows starting with "31" and store them in another file called file31.txt, take all the rows starting with "61" and store them in file61.txt, and take all rows starting with "45" and store them in file45.txt.
file31.txt will contain.
31056780
31909020
31092320
file61.txt will contain.
61093190
61094592
file45.txt will contain.
45090280
45902902
I tried these commands for all 3 but they do not do what I want them to do.
awk -F\" '/31*/ {print $0}' list.txt > file31
awk -F\" '/61*/ {print $0}' list.txt > file61
awk -F\" '/45*/ {print $0}' list.txt > file45
You can use output redirection inside a single awk script. It can construct the filename by concatenating the first two characters of the line.
awk '{ fn = "list" substr($0, 1, 2) ".txt"; print > fn }' list.txt
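With the sample list.txt from the question this produces list31.txt, list61.txt and list45.txt (note the list prefix; change the string used to build fn if you want file31.txt instead), for example:
$ cat list31.txt
31056780
31909020
31092320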
You could use grep or sed to filter the lines with a matching pattern, for example:
sed '/^31/!d' list.txt > list31.txt
Or in a for loop for every number you want:
for n in "31" "45" "61"; do sed '/^'"$n"'/!d' list.txt > list$n.txt; done
Hope it helps.
You can use:
awk '/^31/{print > "file31"} /^45/{print > "file45"} /^61/{print > "file61"}' file
for i in `cat list.txt | cut -c1-2 | uniq`; do cat list.txt | grep -P ^${i} > file${i}.txt; done
This command works fine and is generic enough to work for all cases.
Now let's understand how it works.
cat list.txt | cut -c1-2 | uniq
31
45
61
Next we loop over these unique identifiers to create the new files using
cat list.txt | grep -P ^${i}
grep -P enables Perl-style regular expressions; the ^ anchors the pattern, so the prefix is only matched at the beginning of the line.
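The same idea also works without the extra cat processes, and sort -u protects you if equal prefixes ever end up on non-adjacent lines (uniq on its own only collapses adjacent duplicates). A minimal sketch:
for i in $(cut -c1-2 list.txt | sort -u); do
grep "^$i" list.txt > "file$i.txt"
done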

how to find matching records from 3 different files in unix

I have 3 different files.
Test1.txt , Test2.txt & Test3.txt
Test1.txt contains
JJTP#yahoo.com
BBMU#ssc.com
HK#glb.com
Test2.txt contains
SFTY#gmail.com
JJTP#yahoo.com
Test3.txt contains
JJTP#yahoo.com
HK#glb.com
I would like to see only matching records in these 3 files.
So the matching record in the above example will be JJTP#yahoo.com.
The output should be
JJTP#yahoo.com
If you don't have duplicate lines in each file then:
$ awk '++a[$1]==3' test[1-3]
JJTP#yahoo.com
Here is an awk solution that mixes jaypal's and sudo_o's solutions.
It will not give false positives, since it tests for uniqueness of the lines within each file.
awk '!a[$1 FS FILENAME]++ && ++b[$1]==3' test*
JJTP#yahoo.com
If you have an unknown number of files, this could be an option:
awk '!a[$1 FS FILENAME]++ && ++b[$1]==ARGC-1' test*
ARGC stores the number of files read by awk, plus 1.
comm lists common lines for two files. Just find the common lines in the first two files, then pipe the output to comm again and find the common lines with the third file.
comm -12 <(sort Test1.txt) <(sort Test2.txt) | comm -12 - <(sort Test3.txt)
Here is how you'd do with awk:
awk '
FILENAME == ARGV[1] { a[$0]++ }                  # remember every line of the 1st file
FILENAME == ARGV[2] && ($0 in a) { b[$0]++ }     # keep those that also appear in the 2nd
FILENAME == ARGV[3] && ($0 in b)' file1 file2 file3
Output:
JJTP#yahoo.com
To find the common lines in two files, you can use:
sort Test1.txt Test2.txt | uniq -d
Or, if you wish to preserve the order found in Test1.txt, you may use:
while read x; do grep -w "$x" Test2.txt; done < Test1.txt
For three files, repeat this:
sort Test1.txt Test2.txt | uniq -d | sort - Test3.txt | uniq -d
Or:
cat Test1.txt |\
while read x; do grep -w "$x" Test2.txt; done |\
while read x; do grep -w "$x" Test3.txt; done
The sort method assumes that the files themselves don't have duplicate lines; if they do, you may need to create temporary files.
If you wish to use sed rather than grep, try sed -n "/^$x$/p".

Shell command to find lines common in two files

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.
The command you are seeking is comm, e.g.:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last line produced no output, the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!
To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2.
Note that the comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
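Using the abc and def files from the earlier answer as a quick check:
$ awk 'NR==FNR{arr[$0];next} $0 in arr' abc def
132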
Maybe you mean comm?
Compare sorted files FILE1 and FILE2 line by line. With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this kind of information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities for you.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences of the two files (what is in 2.txt but not in 1.txt), you could just as easily do a
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should still prefer comm. Regards!
Note: You can use grep -F instead of fgrep.
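Keep in mind that grep -F matches substrings anywhere on a line, so a short line in 1.txt can pull in longer, different lines from 2.txt. If you only want whole-line matches, add -x:
grep -Fxf 1.txt 2.txt > 3.txt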
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
On a limited version of Linux (like a QNAP (NAS) I was working on):
comm did not exist
grep -f file1 file2 can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2 was really slow (more than 5 minutes, never finished, versus 2-3 seconds with the method below on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the original ones, then add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
            echo "Comparing: $file1 $file2 ..." >> "$ans"
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory and do an all-vs-all comparison, leaving the result in the "matching_lines" file.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
Not exactly what you were asking, but something that may still be useful for a slightly different scenario.
If you just want to quickly check whether there is any repeated line between a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc -l
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc -l
then there is some repeated line.
rm -f file3.out
cat file1.out | while read line1
do
    cat file2.out | while read line2
    do
        if [[ "$line1" == "$line2" ]]; then
            echo "$line1" >> file3.out
        fi
    done
done
This should do it.
