I have two files, and I need to print only the words (not the complete lines) that are in the first file but not in the second. I have tried wdiff, but it prints complete lines, so it is not useful here.
Sample of the file:
وكان مكنيل وقتها رئيس رابطة مؤرخي أمريكا ـ
كما فهمت - من شاهد الحادثة. ثم يصف كيف قدم
مكنيل الرجلين الخصمين, فكانت له صرامته, إذ
حدد عشرين دقيقة فقط لكل منهما أن يقدم رأيه
وحجته, ثم وقت للرد, ثم يجيبان عن أسئلة قليلة
من القاعة, والمناقشة في وقت محدد.
Make two files that contain each word on its own line, and sort them. Then use comm: the -23 flags suppress the words unique to the second file and the words common to both, leaving only the words unique to the first file:
$ cat fileA
وكان مكنيل وقتها رئيس رابطة مؤرخي أمريكا ـ
كما فهمت - من شاهد الحادثة. ثم يصف كيف قدم
$ cat fileB
وقتها رئيس رابطة أمريكا ـ
كما فهمت - من شاهد يصف كيف قدم
$ tr ' ' '\n' < fileA | sort > fileA-sorted
$ tr ' ' '\n' < fileB | sort > fileB-sorted
$ comm -23 fileA-sorted fileB-sorted
الحادثة.
ثم
مكنيل
مؤرخي
وكان
$
This can also be written on a single line in bash:
comm -23 <(tr ' ' '\n' < fileA | sort) <(tr ' ' '\n' < fileB | sort)
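If the text can contain runs of spaces, squeezing them with tr -s keeps empty lines out of the word lists; a small variant (an untested sketch):
comm -23 <(tr -s ' ' '\n' < fileA | sort) <(tr -s ' ' '\n' < fileB | sort)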
This is not an answer, but a comment too long to fit in a comment box. I'm sorry; I don't yet know the etiquette in this case, so please let me know if there's a better way to do this.
I thought both of the approaches given in the other answers were interesting, but I was concerned that the grep version would require m * n comparisons, where m and n are the numbers of words in the two files respectively.
I'm running bash on OSX and ran the following smoke test to compare:
Grab two random selections of 10K words from my dictionary (gsort is GNU sort from coreutils, which supports -R):
gsort -R /usr/share/dict/words | head -n 10000 > words1
gsort -R /usr/share/dict/words | head -n 10000 > words2
Compare the running time for each solution:
Using comm:
time comm -23 <(tr ' ' '\n' < words1 | sort) <(tr ' ' '\n' < words2 | sort)
Result:
real 0m0.143s
user 0m0.225s
sys 0m0.018s
Using grep:
time grep -wf <(tr ' ' '\n' < words1) <(tr ' ' '\n' < words2)
Result:
real 1m25.988s
user 1m25.925s
sys 0m0.063s
I'm not sure about the memory complexity. I'd be interested in any criticism of this analysis, or in commentary on how to evaluate which solution is better.
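One further check that might be worth timing (I haven't run it here): as far as I know, GNU grep matches -F patterns as fixed strings with an Aho-Corasick-style algorithm rather than compiling each one as a regex, so the same test with -F added should sidestep most of the m * n cost:
time grep -wFf <(tr ' ' '\n' < words1) <(tr ' ' '\n' < words2)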
You can avoid sorting (especially if the input files are pretty huge) by using grep. Note that to get the words of file1 that are not in file2, the words of file2 must act as the patterns, and the match must be inverted with -v:
grep -vwf <(tr ' ' '\n' < file2) <(tr ' ' '\n' < file1)
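Since each pattern here is a literal word on its own line, -F (fixed strings) together with -x (whole-line matches) should be both stricter and faster on large files; a sketch assuming GNU grep:
grep -vFxf <(tr ' ' '\n' < file2) <(tr ' ' '\n' < file1)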
Hey guys, so I have this sample data from uniq -c:
100 c.m milk
99 c.s milk
45 cat food
30 beef
desired output:
beef,30
c.m milk,100
c.s milk,99
cat food,45
The thing I have tried is:
awk -F " " '{print $2" " $3 " " $4 " " $5 "," $1}' stock.txt |sort>stock2.csv
I got:
beef ,30
cat food
,45
c.m milk
,100
c.s milk
,99
I think it's because some items don't have fields 2 through 5 and I still print them all with separators, and sort in Unix doesn't prioritise the dot first, unlike SQL. However, I'm not too sure how to fix it.
To obtain your desired output, you could first sort your current input and then swap the columns.
Using awk, please give this a try:
$ sort -k2 stock.txt | awk '{t=$1; sub($1 FS,""); print $0"," t}'
It will output:
beef,30
c.m milk,100
c.s milk,99
cat food,45
I think you can solve it in bash using a few easy commands, if the format of the file is as you posted it. Assuming prova.txt is your file, do:
cut -d" " -f2,3 prova.txt > first_col
cut -d" " -f1 prova.txt > second_col
paste -d "," first_col second_col | sort -u > output.csv
rm first_col second_col
In output.csv you have your desired output in CSV format!
EDIT:
After reading and applying PesaThe's comment, the code is way easier:
paste -d, <(cut -d' ' -f2- prova.txt) <(cut -d' ' -f1 prova.txt) | sort -u > output.csv
Combining additional information from this thread with awk, the following script is a possible solution:
awk ' { printf "%s", $2; if ($3) printf " %s", $3; printf ",%d\n", $1; } ' stock.txt | LC_ALL=C sort > stock2.csv
It works well in my case. Nevertheless, I would prefer nbari's solution because it is shorter.
$ awk '{$0=$0","$1; sub(/^[^[:space:]]+[[:space:]]+/,"")} 1' file | LC_ALL=C sort
beef,30
c.m milk,100
c.s milk,99
cat food,45
You can use sed + sort:
sed -E 's/^([^[:blank:]]+)[[:blank:]]+(.+)/\2,\1/' file | LC_ALL=C sort
beef,30
c.m milk,100
c.s milk,99
cat food,45
I have a file with two columns:
apple apple
ball cat
cat hat
dog delta
I need to extract the values that are common to both columns (i.e. occur in both columns), like
apple apple
cat cat
There is no ordering of the items within each column.
Could you please try the following and let me know if it helps you.
awk '
{
  col1[$1]++;
  col2[$2]++;
}
END{
  for(i in col1){
    if(col2[i]){
      while(++count<=(col1[i]+col2[i])){
        printf("%s%s",i,count==(col1[i]+col2[i])?ORS:OFS)
      }
      count=""
    }
  }
}' Input_file
NOTE: It prints each value found in both columns, repeated as many times as it occurs across the two columns combined.
$ awk '{a[$1];b[$2]} END{for(k in a) if(k in b) print k}' file
apple
cat
To print the values twice, change print k to print k, k.
with sort/join
$ join <(cut -d' ' -f1 file | sort) <(cut -d' ' -f2 file | sort)
apple
cat
Or, perhaps:
$ function f() { cut -d' ' -f"$1" file | sort; }; join <(f 1) <(f 2)
Assuming I can use unix commands:
cut -d' ' -f2 fil | egrep `cut -d' ' -f1 < fil | paste -sd'|'` -
Basically, what this does is the following:
The cut command inside the backticks collects all the words in the first column, and paste joins them with pipes (i.e. dog|cat|apple).
The first cut command extracts the second column of words and pipes it into the regexp-enabled egrep command, which uses the joined words as its pattern.
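Without anchors, egrep also matches substrings (the pattern cat would match catalog). A stricter variant, assuming each line holds exactly one word, asks for whole-line matches with -x:
cut -d' ' -f2 fil | grep -xE "$(cut -d' ' -f1 fil | paste -sd'|' -)"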
Here is the closest I could get; it only reports rows where the two columns already match. Maybe you could loop through the whole file and print whenever another occurrence is reached.
Code
cat file.txt | gawk '$1==$2 {print $1,"=",$2}'
or
gawk '$1==$2 {print $1,"=",$2}' file.txt
I have a file from which I get two columns: cut -d $'\t' -f 4,5 file.txt
Now I would like to get the difference in length between each element of column 1 and the corresponding element of column 2.
Input from the cut command:
A T
AA T
AC TC
A CT
What I would expect
0
1
0
-1
Using awk:
awk ' {print length($1) - length($2)} ' cutoutput.txt
Or, with awk on the original file, you can simply do:
awk ' {print length($4) - length($5)} ' file.txt
You can probably do this with awk alone, without cut; but since I don't have the original input file, I'll pipe your cut command into awk:
cut -d $'\t' -f 4,5 file.txt | \
awk '{for (i=1; i<NF; i++) print length($i) - length($NF)}'
The loop compares each field against the last one, so with two columns it prints length($1) - length($2).
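If the original file is strictly tab-delimited, setting awk's field separator explicitly skips cut entirely and also keeps fields that contain spaces intact (a sketch assuming that layout):
awk -F'\t' '{print length($4) - length($5)}' file.txt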
How can I extract the words that contain the pattern "arum" from the following line:
Agarum anoestrum alabastrum sun antirumor alarum antiserum ambulacrum antistrumatic Anatherum antistrumous androphorum antrum 4. foodstuff foody nonfood Aplectrum
So words like sun, 4., foody, nonfood should be removed.
You can use grep:
echo "Agarum anoestrum sun" | tr ' ' '\n' | grep "arum"
tr is used to split the input into one word per line, since grep operates on a per-line basis and would otherwise display the whole line.
If you want the output to be in one line again, use:
echo "Agarum anoestrum sun" | tr ' ' '\n' | grep "arum" | tr '\n' ' '
Using grep -Eo:
grep -Eo 'a[[:alnum:]]*rum' file
arum
anoestrum
alabastrum
antirum
alarum
antiserum
ambulacrum
antistrum
atherum
antistrum
androphorum
antrum
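Note that -o prints only the matched portion (antirum above comes from antirumor). To recover the whole containing words, the same regex can be combined with the tr split shown earlier; a sketch:
tr ' ' '\n' < file | grep -E 'a[[:alnum:]]*rum'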
Try:
echo Agarum anoestrum alabastrum sun antirumor alarum antiserum ambulacrum antistrumatic Anatherum antistrumous androphorum antrum 4. foodstuff foody nonfood Aplectrum | awk '{for (i=1;i<=NF;i++) { if (match($i, "[aA][[:alnum:]]*[rR][uU][mM]") != 0) { printf ("%s ", $i) } } print ""}'
I use uniq -c on some text file. Its output looks like this:
123(space)first word(tab)other things
2(space)second word(tab)other things
....
So I need to extract the total number (like 123 and 2 above), but I can't figure out how, because if I split the line by spaces it comes out like ['123', 'first', 'word(tab)other', 'things'].
I want to know why it doesn't output with a tab.
And how do I extract the count in shell? (I finally extracted it with Python, WTF)
Update: Sorry, I didn't describe my question correctly. I don't want to sum the numbers; I just want to replace the (space) after the count with a (tab), without affecting the spaces inside the words, because I still need the data after it. Just like this:
123(tab)first word(tab)other things
2(tab)second word(tab)other things
Try this:
uniq -c | sed -r 's/^( *[^ ]+) +/\1\t/'
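This keeps uniq -c's leading padding in front of the count; if the count should start the line, a variant that also drops the padding (assuming GNU sed for -r):
uniq -c | sed -r 's/^ *([^ ]+) +/\1\t/'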
Try:
uniq -c text.file | sed -e 's/ *//' -e 's/ /\t/'
That will remove the spaces prior to the line count, and then replace only the first space with a tab.
To replace all spaces with tabs, use tr:
uniq -c text.file | tr ' ' '\t'
To squeeze the resulting runs of tabs into a single tab, add -s:
uniq -c text.file | tr -s ' ' '\t'
You can sum all the numbers using awk:
awk '{s+=$1}END{print s}'
$ cat <file> | uniq -c | awk -F" " '{sum += $1} END {print sum}'
One possible solution to getting tabs after counts is to write a uniq -c-like script that formats exactly how you want. Here's a quick attempt (that seems to pass my minute or so of testing):
awk '
(NR == 1) || ($0 != lastLine) {
if (NR != 1) {
printf("%d\t%s\n", count, lastLine);
}
lastLine = $0;
count = 1;
next;
}
{
count++;
}
END {
printf("%d\t%s\n", count, lastLine);
}
' yourFile.txt
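Like uniq -c itself, this counts only adjacent duplicate lines, so unsorted input should be sorted first. For example (countAndTab.awk is a hypothetical file holding the script above):
sort yourFile.txt | awk -f countAndTab.awk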
Another solution. This is equivalent to the earlier sed solution, but it does use awk as requested / tagged!
uniq -c yourFile.txt \
| awk '{
match($0, /^ *[^ ]* /);
printf("%s\t%s\n", $1, substr($0, RLENGTH + 1));
}'
Based on William Pursell's answer: if you like Perl-compatible regular expressions (PCRE), a perhaps more elegant and modern way would be
perl -pe 's/ *(\d+) /$1\t/'
The options used are -e (evaluate the expression) and -p (loop over the input, printing each line).
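As with the sed answers, anchoring the pattern to the start of the line keeps it from touching digits later in the text; a usage sketch:
uniq -c text.file | perl -pe 's/^ *(\d+) /$1\t/'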