Search phrases and terms (stored in a file) in a text article - bash

I have two text files: one containing keywords and phrases (file1.txt) and a paragraph-based text file (file2.txt). I'm trying to find the keywords/phrases from file1.txt that appear in file2.txt.
Here's some sample data:
File 1 (file1.txt):
123 111 1111
ABC 000
A 999
B 000
C 111
Thank you
File 2 (file2.txt):
Hello!
The following order was completed: ABC 000
Item 1 (A 999)
Item 2 (X 412)
Item 3 (8 357)
We will call: 123 111 1111 if we encounter any issues
Thank you very much!
Desired output:
123 111 1111
ABC 000
A 999
Thank you
I've tried the grep command:
grep -Fxf file1.txt file2.txt > output.txt
And I'm getting a blank output.txt.
What suggestions do you have to get the right output?

Try:
grep -o -f file1.txt <file2.txt
-o : print only the matching part of each line
-f file1.txt : take the search patterns from file1.txt, one per line
<file2.txt : redirect file2.txt to standard input
Demo:
$cat file1.txt
123 111 1111
ABC 000
A 999
B 000
C 111
Thank you
$cat file2.txt
Hello!
The following order was completed: ABC 000
Item 1 (A 999)
Item 2 (X 412)
Item 3 (8 357)
We will call: 123 111 1111 if we encounter any issues
Thank you very much!
$grep -o -f file1.txt <file2.txt
ABC 000
A 999
123 111 1111
Thank you
$
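As a side note (not part of the original answer): the blank output.txt came from the -x flag in the original command, which makes grep match a pattern only against an entire line; since the phrases appear inside longer lines, nothing matched. It can also be worth keeping -F, so the patterns are treated as fixed strings rather than regular expressions:
grep -oFf file1.txt file2.txt > output.txt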

Related

how to combine more than two files into one new file with a specific name using bash

I have many files.
Here is the list of file names:
p004c01.txt
p004c05.txt
p006c01.txt
p006c02.txt
p007c01.txt
p007c03.txt
p007c04.txt
...
$cat p004c01.txt
#header
122.5 -0.256 547
123.6 NaN 325
$cat p004c05.txt
#header
122.1 2.054 247
122.2 -1.112 105
$cat p006c01.txt
#header
99 -0.200 333
121.4 -1.206 243
$cat p006c02.txt
#header
122.5 2.200 987
99 -1.335 556
I want the files to be like this:
file1
$cat p004.txt
122 -0.256 547
122 2.054 247
122 -1.112 105
file2
$cat p006.txt
122.5 2.200 987
121.4 -1.206 243
99 -1.335 556
99 -0.200 333
And the same for the other files: files that share the same p???? prefix in p????cxx.txt should end up in the same new file.
I tried it one file group at a time, like this:
cat p004* | sed '/#/d'| sort -k 1n | sed '/NaN/d' |awk '{print substr($1,2,3),$2,$3,$4,$5}' > p004.txt
Can anyone help me with a simple script for all the data?
Thank you :)
Perhaps this will work for you (note the -q, which stops GNU tail from printing ==> file <== headers when it is given more than one file):
for f in {001..999}; do tail -q -n +2 p"$f"c* > p"$f".txt; done 2>/dev/null
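If something closer to the original pipeline is wanted, here is a minimal sketch (my own expansion, not from the original answer) that also drops the NaN rows and sorts, and only creates output files for prefixes that actually exist; the OP's substr() trimming of the first column is left out:
#!/bin/bash
# list the distinct pNNN prefixes among the existing pNNNcXX.txt files
for prefix in $(ls p???c*.txt 2>/dev/null | cut -c1-4 | sort -u); do
    # -q suppresses tail's per-file headers; -n +2 drops each #header line
    tail -q -n +2 "$prefix"c*.txt | sed '/NaN/d' | sort -k 1n > "$prefix".txt
done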

loop through numeric text files in bash and add numbers row-wise

I have a set of text files in a folder, like so:
a.txt
1
2
3
4
5
b.txt
1000
1001
1002
1003
1004
.. and so on (assume a fixed number of rows, but an unknown number of text files). What I am looking for is a results file containing the summation across all rows:
result.txt
1001
1003
1005
1007
1009
How do I go about achieving this in bash, without using Python etc.?
Using awk
Try:
$ awk '{a[FNR]+=$0} END{for(i=1;i<=FNR;i++)print a[i]}' *.txt
1001
1003
1005
1007
1009
How it works:
a[FNR]+=$0
For every line read, we add the value of that line, $0, to the partial sum a[FNR], where a is an array and FNR is the line number in the current file.
END{for(i=1;i<=FNR;i++)print a[i]}
After all the files have been read in, this prints out the sum for each line number. (FNR in the END block still holds the number of lines in the last file read, which works here because every file has the same number of rows.)
Using paste and bc
$ paste -d+ *.txt | bc
1001
1003
1005
1007
1009
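To see why this works (an added illustration): paste -d+ joins the corresponding lines of all the files with a literal +, producing one arithmetic expression per row, which bc then evaluates:
$ paste -d+ a.txt b.txt
1+1000
2+1001
3+1002
4+1003
5+1004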

How can I compare two 2D-array files with bash?

I have two 2D-array files to read with bash.
What I want to do is extract the elements inside both files.
The two files have different numbers of rows and columns, such as:
file1.txt (nx7)
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
.
.
.
file2.txt (mx3)
DESC W S
AAA 100 100
CCC 135 135
EEE 789 789
.
.
.
Here is what I want to do:
Extract each element in the DESC column of file2.txt, then find the row with the corresponding element in file1.txt.
Extract the W and S elements from that row of file2.txt, then find the corresponding W and S elements in the matching row of file1.txt.
If [ W1 == W2 && S1 == S2 ]; then echo "${DESC[colindex]} ok"; else echo "${DESC[colindex]} NG"; fi
How can I read this kind of file as a 2D array in bash, or is there a more convenient way to do this?
bash does not support 2D arrays. You can simulate them by generating 1D array variables like array1, array2, and so on.
Assuming DESC is a key (i.e. has no duplicate values) and does not contain any spaces:
#!/bin/bash
# read each line of file1 into its own array: data0, data1, ...
idx=0
while read -a data$idx; do
    let idx++
done <file1.txt
# process data from file2
while read desc w2 s2; do
    for ((i=0; i<idx; i++)); do
        v="data$i[1]"    # ${!v} expands to the DESC field of row i (indirection)
        [ "$desc" = "${!v}" ] && {
            w1="data$i[4]"    # W field of row i
            s1="data$i[5]"    # S field of row i
            if [ "$w2" = "${!w1}" -a "$s2" = "${!s1}" ]; then
                echo "$desc ok"
            else
                echo "$desc NG"
            fi
            break
        }
    done
done <file2.txt
For brevity, optimizations such as taking advantage of sort order are left out.
If the files actually contain the header NO DESC ID TYPE ... then use tail -n +2 to discard it before processing.
A more elegant solution is also possible, which avoids reading the entire file in memory. This should only be relevant for really large files though.
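For example, here is a minimal sketch of such a variant (my own sketch, not from the original answer; it assumes bash 4+ for associative arrays and still holds file1's W and S columns in memory, but it replaces the inner loop with a hash lookup; unmatched DESC values from file2.txt are reported as NG):
#!/bin/bash
declare -A W S
# index file1.txt by its DESC column (skipping the header)
while read -r no desc id type w s grade; do
    W[$desc]=$w
    S[$desc]=$s
done < <(tail -n +2 file1.txt)
# check each row of file2.txt against the index
while read -r desc w2 s2; do
    if [ "${W[$desc]}" = "$w2" ] && [ "${S[$desc]}" = "$s2" ]; then
        echo "$desc ok"
    else
        echo "$desc NG"
    fi
done < <(tail -n +2 file2.txt)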
If the row order does not need to be preserved (the files can be sorted), maybe this is enough:
join -2 2 -o 1.1,1.2,1.3,2.5,2.6 <(tail -n +2 file2.txt|sort) <(tail -n +2 file1.txt|sort) |\
sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/' |\
sed '/ OK$/!s/\([^ ]*\) .*/\1 NG/'
For file1.txt
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
and file2.txt
DESC W S
AAA 000 100
CCC 135 135
EEE 789 000
FCK xxx 135
produces:
AAA NG
CCC OK
EEE NG
Explanation:
skip the header line in both files - tail -n +2
sort both files
join the needed columns from both files into one table; in the result only the lines with a common DESC field appear, like this:
AAA 000 100 100 100
CCC 135 135 135 135
EEE 789 000 789 789
in the lines where columns 2 and 4 and columns 3 and 5 hold the same values, replace everything but the 1st column with OK
in the remaining lines, replace everything but the 1st column with NG
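For comparison, the same check can be sketched in awk (my own sketch, not part of the original answers), indexing file1.txt by DESC and then testing each row of file2.txt; like the join pipeline, it skips DESC values that appear only in file2.txt:
awk 'NR==FNR { if (FNR > 1) { w[$2]=$5; s[$2]=$6 }; next }
     FNR > 1 && ($1 in w) {
         print $1, (($2 == w[$1] && $3 == s[$1]) ? "OK" : "NG")
     }' file1.txt file2.txt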

Counting the number of 10-digit numbers in a file

I need to count the total number of instances in which a 10-digit number appears within a file. All of the numbers have leading zeros, e.g.:
This is some text. 0000000001
Returns:
1
If the same number appears more than once, it is counted again, e.g.:
0000000001 This is some text.
0000000010 This is some more text.
0000000001 This is some other text.
Returns:
3
Sometimes there are no spaces between the numbers, but each continuous run of 10 digits should be counted:
00000000010000000010000000000100000000010000000001
Returns:
5
How can I determine the total number of 10-digit numbers appearing in a file?
Try this:
grep -o '[0-9]\{10\}' inputfilename | wc -l
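To illustrate (an added note): -o prints every non-overlapping match on its own line, so wc -l counts matches rather than input lines. On the run-together example:
$ echo 00000000010000000010000000000100000000010000000001 | grep -o '[0-9]\{10\}' | wc -l
5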
The last requirement - that you need to count multiple numbers per line - excludes grep; as far as I know, it can count only per line.
Edit: Obviously, I stand corrected by Nate :) grep's -o option is what I was looking for.
You can however do this easily with sed, like this (every non-digit is turned into a dot, every run of ten digits becomes the word "num ", and the leftover digits and dots are deleted, so only the num words remain for wc -w to count):
$ cat mkt.sh
sed -r -e 's/[^0-9]/./g' -e 's/[0-9]{10}/num /g' -e 's/[0-9.]//g' "$1"
$ for i in *.txt; do echo --- $i; cat $i; echo --- number count; ./mkt.sh $i|wc -w; done
--- 1.txt
This is some text. 0000000001
--- number count
1
--- 2.txt
0000000001 This is some text.
0000000010 This is some more text.
0000000001 This is some other text.
--- number count
3
--- 3.txt
00000000010000000010000000000100000000010000000001
--- number count
5
--- 4.txt
1 2 3 4 5 6 6 7 9 0
11 22 33 44 55 66 77 88 99 00
123456789 0
--- number count
0
--- 5.txt
1.2.3.4.123
1234567890.123-AbceCMA-5553///q/\1231231230
--- number count
2
$
This might work for you (existing X's are first transliterated to spaces, every run of ten digits is replaced by an X on its own line, lines without an X are deleted, and $= prints the final line count):
cat <<! >test.txt
0000000001 This is some text.
0000000010 This is some more text.
0000000001 This is some other text.
00000000010000000010000000000100000000010000000001
1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 0 j
12345 67890 12 34 56 78 90
!
sed 'y/X/ /;s/[0-9]\{10\}/\nX\n/g' test.txt | sed '/X/!d' | sed '$=;d'
8
"I need to count the total number of instances in which a 10-digit number appears within a file. All of the numbers have leading zeros"
So I think this might be more accurate:
$ grep -o '0[0-9]\{9\}' filename | wc -l

Extract numbers from filenames disregarding extensions

I'm making a script to rename some video files. Some are named XXX blah blah.ext and some are XXX - XXX blah blah.ext, where the X's are digits. Furthermore, some files are .avi and some are .mp4. What I'd like is to extract the numbers from these filenames, separated by a space if there is more than one, and to disregard the "4" in the ".mp4" extension.
My current implementation is egrep -o "[[:digit:]]*", and while this does separate the numbers into different outputs, it also captures the "4" of ".mp4".
Using sed I've not only been unable to produce a separate output for every number, but it also includes the "4". Note: I'm very new to sed, i.e. I began learning it for the purpose of writing this script.
How can I do this?
for file in *
do
    echo "$file" | sed 's/\..*$//' | egrep -o "[[:digit:]]+"
done
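For instance (an added illustration), on one of the sample names the sed strips everything from the first dot onward, so the "4" of ".mp4" never reaches egrep:
$ echo "901 - 234 blah blah.mp4" | sed 's/\..*$//' | egrep -o "[[:digit:]]+"
901
234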
You should find this to be pretty robust:
sed 's/^[^[:digit:]]*\([[:digit:]]\+\)[^[:digit:]]\+\( [[:digit:]]\+\)\?[^[:digit:]]\+[[:digit:]]\?$/\1\2/'
If your sed supports -r, you can eliminate the backslashes which are used for escaping:
sed -r 's/^[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+( [[:digit:]]+)?[^[:digit:]]+[[:digit:]]?$/\1\2/'
Demo:
$ echo '123 blah blah.avi
234 blah blah.mp4
345 - 678 blah blah.avi
901 - 234 blah blah.mp4' |
sed -r 's/^[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+( [[:digit:]]+)?[^[:digit:]]+[[:digit:]]?$/\1\2/'
123
234
345 678
901 234
This depends on there being a space in the filename before the second number (when there is one). If there are files that don't have that, then a simple modification can make it work.
This might work for you:
# echo '123 bla bla.avi
456 - 789 bla bla.avi
012bla bla.avi
345-678blabla.avi
901 bla bla.mp4
234 - 567 bla bla.mp4
890bla bla.mp4
123 - 456 - 789 bla bla.mp4' |
sed 's/[^0-9]*[0-9]$//;s/[^0-9]\+/ /g'
123
456 789
012
345 678
901
234 567
890
123 456 789