awk or shell command to count occurrences of values in the 1st column based on values in the 4th column - bash

I have a large file with records like below :
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
I want to count the people (names in column 1) who have both apple and oranges. The command should use as little memory as possible and be fast. Any help appreciated!
Output :
awk/sed file => 2 (jon and tom)

Using awk is pretty easy:
awk -F, \
'$4 == "apple" { apple[$1]++ }
$4 == "oranges" { orange[$1]++ }
END { for (name in apple) if (orange[name]) print name }' data
It produces the required output on the sample data file:
jon
tom
Yes, you could squish all the code onto a single line, and shorten the names, and otherwise obfuscate the code.
Another way to do this avoids the END block:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && orange[$1]) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && apple[$1]) print $1 }' data
When it encounters an apple entry for the first time for a given name, it checks to see if the name also (already) has an entry for oranges and prints it if it has; likewise and symmetrically, if it encounters an orange entry for the first time for a given name, it checks to see if the name also has an entry for apple and prints it if it has.
As noted by Sundeep in a comment, it could use in:
awk -F, \
'$4 == "apple" { if (apple[$1]++ == 0 && $1 in orange) print $1 }
$4 == "oranges" { if (orange[$1]++ == 0 && $1 in apple) print $1 }' data
The first answer could also use in in the END loop.
Note that all these solutions could be embedded in a script that accepts data from standard input (a pipe or a redirected file); they have no need to read the input file twice. You'd replace data with "$@" to process file names if they're given, or standard input if no file names are specified. This flexibility is worth preserving when possible.
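As a minimal sketch of that idea, the wrapper below puts the single-pass awk program in a shell function that forwards "$@" to awk, so it reads the named files, or standard input when none are given. The function name count_both and the counter n are illustrative assumptions, not part of the answer above; it prints the count the question asked for rather than the names.

```shell
# Hypothetical wrapper: counts names that have both apple and oranges.
# "$@" passes file names through to awk; with no arguments awk reads stdin.
count_both() {
    awk -F, '
        $4 == "apple"   { if (apple[$1]++ == 0 && ($1 in orange)) n++ }
        $4 == "oranges" { if (orange[$1]++ == 0 && ($1 in apple)) n++ }
        END { print n + 0 }   # n + 0 prints 0 when nothing matched
    ' "$@"
}

# Sample records from the question, fed through a pipe.
result=$(printf '%s\n' \
    'jon,1,2,apple' 'jon,1,2,oranges' 'jon,1,2,pineaaple' \
    'fred,1,2,apple' 'tom,1,2,apple' 'tom,1,2,oranges' 'mary,1,2,apple' |
    count_both)
echo "$result"
```

Called as count_both data it reads the file directly; called with no arguments at the end of a pipeline it reads standard input, as shown.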

With awk
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){print $1}' ip.txt ip.txt
jon
tom
This processes the input twice
In first pass, add key to an array if last field is apple (-F, would set , as input field separator)
In second pass, check if last field is oranges and if first field is a key of array a
To print only number of matches:
$ awk -F, 'NR==FNR{if($NF=="apple") a[$1]; next}
$NF=="oranges" && ($1 in a){c++} END{print c}' ip.txt ip.txt
2
Further reading: idiomatic awk for details on two file processing and awk idioms

I did a workaround using only grep, cut, sort, and comm.
grep "apple" file | cut -d"," -f1 | sort > file1
grep "orange" file | cut -d"," -f1 | sort > file2
comm -12 file1 file2 > 'names.having.both.apple&orange'
comm -12 shows only the common names between the 2 files.
Solution from Jonathan also worked.
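One thing to watch with plain grep "apple" is that it also matches any line that merely contains the substring, such as a correctly spelled pineapple. A hedged variation on the same comm idea, assuming the fruit is always field 4, selects names with an exact field match in awk and uses sort -u so each name appears once. The sample data and the temporary file names here are made up for illustration.

```shell
tmp=$(mktemp -d)
# Made-up sample: mary has pineapple and oranges, but no apple.
printf '%s\n' 'jon,1,2,apple' 'jon,1,2,oranges' 'fred,1,2,apple' \
    'mary,1,2,pineapple' 'mary,1,2,oranges' \
    'tom,1,2,apple' 'tom,1,2,oranges' > "$tmp/file"

# Exact match on field 4; comm needs sorted input, and -u drops repeats.
awk -F, '$4 == "apple"   { print $1 }' "$tmp/file" | sort -u > "$tmp/apple.names"
awk -F, '$4 == "oranges" { print $1 }' "$tmp/file" | sort -u > "$tmp/orange.names"

# -12 suppresses the lines unique to either file, leaving the common names.
both=$(comm -12 "$tmp/apple.names" "$tmp/orange.names")
echo "$both"
rm -rf "$tmp"
```

Piping the comm output through wc -l would give the count instead of the names.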

For the input:
jon,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
fred,1,2,apple
tom,1,2,apple
tom,1,2,oranges
mary,1,2,apple
the command:
sed -n "/apple\|oranges/p" inputfile | cut -d"," -f1 | uniq -d
will output a list of people with both apples and oranges:
jon
tom
Edit after comment: For an input file where lines are not ordered by the 1st column and where each person can have repeated fruits, like:
jon,1,2,apple
fred,1,2,apple
fred,1,2,apple
jon,1,2,oranges
jon,1,2,pineaaple
jon,1,2,oranges
tom,1,2,apple
mary,1,2,apple
tom,1,2,oranges
This command will work:
sed -n "/\(apple\|oranges\)$/ s/,.*,/,/p" inputfile | sort -u | cut -d, -f1 | uniq -d

Related

Write specific columns of a file into another file: who can give me a more concise solution?

I have a problem writing specific columns of one file into another. Given file1 below, I need to take the first column, excluding the first row, and write it to file2 as a single line separated by the '|' sign. I have a partial solution using sed and awk, but it is missing the last step, which inserts the line at the top of file2, and I believe awk, sed, etc. are powerful enough to allow something more concise. Can anyone offer a more concise script?
sed '1d;s/ .//' ./file1 | awk '{printf "%s|", $1; }' | awk '{if (NR != 0) {print substr($1, 1, length($1) - 1)}}'
file1:
col_name data_type comment
aaa string null
bbb int null
ccc int null
file2:
xxx ccc(whatever is this)
The result of file2 should be this :
aaa|bbb|ccc
xxx ccc(whatever is this)
Assuming there's no whitespace in the column 1 data, in increasing length:
sed -i "1i$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')" file2
or
ed file2 <<END
1i
$(awk 'NR > 1 {print $1}' file1 | paste -sd '|')
.
wq
END
or
{ awk 'NR > 1 {print $1}' file1 | paste -sd '|'; cat file2; } | sponge file2
or
mapfile -t lines < <(tail -n +2 file1)
col1=( "${lines[@]%%[[:blank:]]*}" )
new=$(IFS='|'; echo "${col1[*]}"; cat file2)
echo "$new" > file2
This might work for you (GNU sed):
sed -z 's/[^\n]*\n//;s/\(\S*\).*/\1/mg;y/\n/|/;s/|$/\n/;r file2' file1
Process file1 "wholemeal" by using the -z command line option.
Remove the first line.
Remove all columns other than the first.
Replace newlines by |'s
Replace the last | by a newline.
Append file2.
Alternative using just command line utils:
tail -n +2 file1 | cut -d' ' -f1 | paste -s -d'|' | cat - file2
Tail file1 from line 2 onwards.
Using the results from the tail command, isolate the first column using a space as the column delimiter.
Using the results from the cut command, serialize the lines into one, delimited by |'s.
Using the results from the paste, append file2 using the cat command.
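The pipeline above only prints the combined result. A minimal sketch of writing it back into file2, assuming a temporary file next to the original is acceptable, is to build the new contents first and then move them into place. The temporary directory and the file2.new name are assumptions made for this example.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'col_name data_type comment' 'aaa string null' \
    'bbb int null' 'ccc int null' > "$tmp/file1"
printf '%s\n' 'xxx ccc(whatever is this)' > "$tmp/file2"

# Build the header line from column 1 of file1 (skipping its first row),
# append the old file2 contents, then swap the result into place.
{
    tail -n +2 "$tmp/file1" | cut -d' ' -f1 | paste -s -d'|' -
    cat "$tmp/file2"
} > "$tmp/file2.new" && mv "$tmp/file2.new" "$tmp/file2"

merged=$(cat "$tmp/file2")
echo "$merged"
rm -rf "$tmp"
```

Redirecting straight to file2 inside the same pipeline would truncate it before cat reads it, which is why the sketch goes through a second file.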
I'm learning awk at the moment.
awk 'BEGIN{a=""} {if(NR>1) a = a $1 "|"} END{a=substr(a, 1, length(a)-1); print a}' file1
Edit: Here's another version that uses an array:
awk 'NR > 1 {a[++n]=$1} END{for(i=1; i<=n; ++i){if(i>1) printf("|"); printf("%s", a[i])} printf("\n")}' file1
Here is a simple Awk script to merge the files as per your spec.
awk '# From the first file, merge all lines except the first
NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
# We are in the second file; add a newline after data from first file
FNR == 1 { printf "\n" }
# Simply print all lines from file2
1' file1 file2
The NR==FNR condition is true when we are reading the first input file: the overall line number NR is equal to the line number within the current file FNR. The final 1 is a common idiom for printing all input lines which make it this far into the script (the next in the first block prevents lines from the first file from reaching this far).
For conciseness, you can remove the comments.
awk 'NR == FNR { if (FNR > 1) { printf "%s%s", sep, $1; sep = "|"; } next }
FNR == 1 { printf "\n" } 1' file1 file2
Generally speaking, Awk can do everything sed can do, so piping sed into Awk (or vice versa) is nearly always a useless use of sed.
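The NR == FNR idiom explained above can be observed directly by printing both counters for two small inputs. This is only an illustrative sketch; the file names first and second and their contents are made up.

```shell
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/first"
printf 'x\ny\nz\n' > "$tmp/second"

# Print NR, FNR, and which branch of the NR == FNR test each record takes.
trace=$(awk '{ print NR, FNR, (NR == FNR ? "first file" : "later file") }' \
    "$tmp/first" "$tmp/second")
echo "$trace"
rm -rf "$tmp"
```

FNR restarts at 1 for the second file while NR keeps counting, so the test is true only during the first file. One caveat: if the first file is empty, the test is also true for the whole second file.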

Multiple Big file sort

I have two files whose lines are each ordered by timestamp but have different structures. I want to merge these files into one single file ordered by timestamp. They look like:
file A(less than 2G)
1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858
file B(less than 15G)
1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858
How can I accomplish this? Is there any way to make it as fast as possible?
$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
I originally posted the above as just a comment under @VM17's answer but (s)he suggested I make it a new answer.
The above would be more robust and efficient for four reasons: it uses the default separator for sort+cut (tab); it truly sorts only on the first key (his would use the whole line despite the -k1, since sort's field separator, tab, isn't present in the line); it uses a stable sort algorithm (sort -s) to preserve input order; and it uses cut to strip off the added key field, which is more efficient than invoking awk again, since awk does field splitting etc. on each record, which isn't needed just to remove the leading field(s).
Alternatively you might find something like this more efficient:
$ cat tst.awk
{ currRec = $0; currKey = $NF }
NR>1 {
    print prevRec
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {
        if ($NF < currKey) {
            print
        }
        else {
            saved = $0 ORS
            break
        }
    }
}
{ prevRec = currRec; prevKey = currKey }
END {
    print prevRec
    printf "%s", saved
    while ( (getline < "fileB") > 0 ) {
        print
    }
}
$ awk -F, -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the 2 files and doesn't require a subsequent pipe to sort and cut.
Just check the logic, as I didn't think about it very much, and be aware that this is a rare situation where getline might be appropriate for efficiency. Make sure to read http://awk.freeshell.org/AllAboutGetline to understand all its caveats if you're ever considering using it again.
Try this-
awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'
Output-
1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
This concatenates the two files and puts the timestamp at the start of each line. It then sorts by the timestamp and removes that dummy column.
This will be slow for big files though.
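Since both inputs are already ordered by timestamp, one option worth testing is sort -m, which merges pre-sorted inputs rather than sorting everything from scratch. The sketch below is an assumption-laden illustration, not a benchmarked solution: the helper name keyed, the temporary files, and the shortened sample data are all made up, and whether it beats the other approaches on 15G of data would need measuring.

```shell
tmp=$(mktemp -d)
printf '%s\n' '1,1,1487779199850' '3,3,1487779199854' > "$tmp/fileA"
printf '%s\n' '2,2,20,200,1487779199852' '4,4,40,400,1487779199856' > "$tmp/fileB"

# Hypothetical helper: prefix each record with its timestamp (last field) and a tab.
keyed() { awk -F, -v OFS='\t' '{ print $NF, $0 }' "$1"; }
keyed "$tmp/fileA" > "$tmp/A.keyed"
keyed "$tmp/fileB" > "$tmp/B.keyed"

# -m merges inputs that are already sorted; cut then drops the added key.
merged=$(sort -m -s -n -k1,1 "$tmp/A.keyed" "$tmp/B.keyed" | cut -f2-)
echo "$merged"
rm -rf "$tmp"
```

In bash the two keyed copies could be supplied through process substitution, <(keyed fileA) and <(keyed fileB), which avoids writing the decorated files to disk.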

Get the contents of one column given another column

I have a tab separated file with 3 columns. I'd like to get the contents of the first column, but only for the rows where the 3rd column is equal to 8. How do I extract these values? If I just wanted to extract the values in the first column, I would do the following:
cat file1 | tr "\t" "~" | cut -d"~" -f1 >> file_with_column_3
I'm thinking something like:
cat file1 | tr "\t" "~" | if cut -d"~" -f3==8; then cut -d"~" -f1 ; fi>> file_with_column_3
But that doesn't quite seem to work.
Given that your file is tab delimited, it seems like this problem would be well suited for awk.
Something simple like below should work for you, though without any sample data I can't say for sure (try to always include this on questions on SO)
awk -F'\t' '$3==8 {print $1}' inputfile > outputfile
The -F'\t' sets the input delimiter as tab.
$3==8 compares if the 3rd column based on that delimiter is 8.
If so, the {print $1} is executed, which prints the first column.
Otherwise, nothing is done and awk proceeds to the next line.
If your file had a header you wanted to preserve, you could just modify this like the following, which tells awk to print if the current record number is 1.
awk -F'\t' 'NR==1 {print;} $3==8 {print $1}' inputfile > outputfile
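Because the question included no sample data, here is a small made-up tab-separated file to show the filter in action. The file contents and the variable name picked are assumptions for illustration only.

```shell
tmp=$(mktemp -d)
# Made-up tab-separated sample: name, colour, number.
printf 'alpha\tred\t8\nbeta\tblue\t3\ngamma\tgreen\t8\ndelta\tpink\t18\n' > "$tmp/file1"

# Keep column 1 of the rows whose third column is 8.
picked=$(awk -F'\t' '$3 == 8 { print $1 }' "$tmp/file1")
echo "$picked"
rm -rf "$tmp"
```

The comparison $3 == 8 is numeric when the field looks like a number, so a value such as 8.0 would also match; writing $3 == "8" forces a string comparison if only the exact text 8 should be accepted.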
awk can handle this better:
awk -F '\t' '$3 == 8 { print $1 }' file1
You can do it with bash only too:
cat x | while read -r y; do split=(${y}); [ "${split[2]}" == '8' ] && echo "${split[0]}"; done
The input is read into variable y, then split into an array. IFS (the internal field separator) defaults to <space><tab><newline>, so it splits on tabs too. The third element of the array is then compared to '8'. If it is equal, the first element of the array is printed. Remember that array indices start counting at zero.
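A variation on the same shell-only idea, sketched here under the assumption that no field is empty, lets read do the splitting by setting IFS to a tab for that command only. The variable names first, second, and third and the sample file are made up for this example.

```shell
tmp=$(mktemp -d)
printf 'alpha\tred\t8\nbeta\tblue\t3\ngamma\tgreen\t8\n' > "$tmp/x"

tab=$(printf '\t')
# read splits each line on tabs into three variables; -r keeps backslashes literal.
firsts=$(while IFS="$tab" read -r first second third; do
    [ "$third" = 8 ] && printf '%s\n' "$first"
done < "$tmp/x")
echo "$firsts"
rm -rf "$tmp"
```

Tab counts as IFS whitespace, so consecutive tabs collapse and empty fields are lost; awk -F'\t' does not have that limitation.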

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2 it should return 1 or else 0, in a tab separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1 `
do
cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }'>>binary
done
the above code is not giving me the output which I am looking for.
Kindly have a look and suggest correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
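Put together with the question's data, the dictionary lookup can be sketched as follows. Setting OFS to a tab is an addition here, made because the question asked for tab-separated output; the array name dict is likewise just illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\n' vincy alex robin > "$tmp/file1"
printf '%s\n' Allen Alex Aaron ralph robin > "$tmp/file2"

# First file (file2) fills the dictionary; second file (file1) is looked up.
flags=$(awk -v OFS='\t' '
    FNR == NR { dict[tolower($1)]; next }
    { print $1, ((tolower($1) in dict) ? 1 : 0) }
' "$tmp/file2" "$tmp/file1")
echo "$flags"
rm -rf "$tmp"
```

Because both sides go through tolower, alex in file1 matches Alex in file2, which is what the expected output in the question implies.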
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
grep -xF -f file1 file2 | sed $'s/$/\t1/'
grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
# comm -2 leaves lines unique to file1 unindented and tab-indents lines common to both
while read; do
  if [[ $REPLY = $'\t'* ]] ; then
    printf "%s\t1\n" "${REPLY#?}"
  else
    printf "%s\t0\n" "${REPLY}"
  fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have Python installed and are familiar with it; it only needs a bit of formatting.
#!/usr/bin/env python3
# Read both files as lists of lines without trailing newlines
with open('file1') as fh:
    f1 = fh.read().splitlines()
with open('file2') as fh:
    f2 = set(fh.read().splitlines())
# Print each file1 entry with 1 if it also appears in file2, else 0
for name in f1:
    print(name, int(name in f2), sep='\t')

Join lines based on pattern

I have the following file:
test
1
My
2
Hi
3
I need a way to use cat, grep or awk to give the following output:
test1
My2
Hi3
How can I achieve this in a single command? Something like
cat file.txt | grep ... | awk ...
Note that it's always a string followed by a number in the original text file.
sed 'N;s/\n//' file.txt
This should give the desired output when the content is in file.txt
paste -d "" - - < filename
This takes consecutive lines and pastes them together delimited by the empty string.
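As a quick check of that behaviour, the sketch below feeds the question's six lines through paste on standard input. It spells the empty delimiter as '\0', which is how POSIX describes an empty string in a paste delimiter list and which GNU paste also accepts; the variable name joined is just for the example.

```shell
# Each "-" consumes one line from stdin, so two of them pair up consecutive lines.
joined=$(printf '%s\n' test 1 My 2 Hi 3 | paste -d '\0' - -)
echo "$joined"
```

Adding a third "-" would join the lines in groups of three instead.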
awk '{printf("%s", $0);} !(NR%2){printf("\n");}' file.txt
EDIT: I just noticed that your question requires the use of cat and grep. Both of those programs are unnecessary to achieve your stated aims. If you have some reason for including them that you haven't mentioned, try this (uselessly inefficient) version of the line I wrote immediately above:
cat file.txt | grep '^' | awk '{printf("%s", $0);} !(NR%2){printf("\n");}'
It is possible that this command uses features not present in the original awk program. You may need to invoke the new awk program, nawk instead.
If your input file is always one string then one number, and you only want the strings, all you have to do is take every other line.
If you only want the odd lines, you can do awk 'NR % 2' file.txt
If you want the evens, this becomes awk 'NR % 2==0' file.txt
Here is the answer:
cat file.txt | awk 'BEGIN { lno = 0 } { val=$0; if (lno % 2 == 1) {printf "%s\n", $0} else {printf "%s", $0}; ++lno}'
