How to read the contents of a file into an array in Ruby

I want to read a file that has name/value pairs on a remote server. As per the requirement, I need to shell into the remote server, read the file, and then grep for the values. Example:
/domain/srvr/primary = ABC
/host/DEF/second = DEF
/host/XYZ/second = XYZ
/host/GHI/second = GHI
:
:
:
Now I want to read this file and make an array of all the secondary servers (e.g. DEF, XYZ, GHI), but I am getting a nil value.
primary = @ssh.exec!("cd /home/dir; grep 'srvr/primary' #{filename} | awk '{print $3}'")
secondary = @ssh.exec!("cd /home/dir; grep '\<host.*second\>' #{filename} | awk '{print $3}'")
It prints the primary server name properly but returns nil for the secondary servers. I tried to use split("\n"), but it errors out with undefined method 'split' for nil:NilClass.
I need help getting all the secondary servers into an array.

You can use something like this
file_contents.
split("\n").
map {|line| line.split(" = ") }.
find_all {|pair| pair[0] =~ /second$/ }.
map(&:last)
and probably get the file contents with cat, or by downloading the file through SSH. If you're on the same server, just use File.read.
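For example, a minimal sketch that fetches the file over SSH and then applies the same parsing chain (the host, credentials and file name here are hypothetical placeholders):
require 'net/ssh'

filename = 'servers.txt'   # hypothetical file name

# Hypothetical host/credentials; adjust to your environment.
Net::SSH.start('remote.example.com', 'user', password: 'secret') do |ssh|
  file_contents = ssh.exec!("cd /home/dir; cat #{filename}")

  secondary = file_contents.
    split("\n").
    map {|line| line.split(" = ") }.
    find_all {|pair| pair[0] =~ /second$/ }.
    map(&:last)

  p secondary   #=> ["DEF", "XYZ", "GHI"]
end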
If you can only use bash or just prefer to, you can use
grep -P "^\/host.*second" path/to/file | cut -d" " -f 3
The -P option enables Perl regexp syntax in grep, which gives you all the capabilities you need to search through the file. Then cut splits each line by the delimiter given with -d and picks the 1-based field given with -f; in this case the server name is in the third field.

Instead of trying to do the work on the remote host, you can simplify the task by only grabbing the data and processing it locally. You can use Net::SCP (docs), Net::FTP or Net::SFTP to easily retrieve the data.
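For the retrieval step, a minimal sketch using Net::SCP (host, user and path are hypothetical):
require 'net/scp'

# With no local path given, download! returns the file contents as a string.
data = Net::SCP.download!('remote.example.com', 'user',
                          '/home/dir/servers.txt', nil,
                          ssh: { password: 'secret' })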
I'd use something like this to grab the desired data once the text has been received:
data = <<EOT
/domain/srvr/primary = ABC
/host/DEF/second = DEF
/host/XYZ/second = XYZ
/host/GHI/second = GHI
EOT
data.split("\n").grep(/\bsecond\b/).map{ |l| l.split.last }
# => ["DEF", "XYZ", "GHI"]
or:
data.split("\n").grep(/\bsecond\b/).map{ |l| l[/= (\S+)/, 1] }
# => ["DEF", "XYZ", "GHI"]
or:
data.split("\n").grep(/\bsecond\b/).map{ |l| l.rstrip[/\S+$/] }
# => ["DEF", "XYZ", "GHI"]
Just to make it more interesting:
require 'fruity'
data = <<EOT
/domain/srvr/primary = ABC
/host/DEF/second = DEF
/host/XYZ/second = XYZ
/host/GHI/second = GHI
EOT
compare do
  p1 { data.split("\n").grep(/\bsecond\b/).map{ |l| l[/= (\S+)/, 1] } }
  p2 { data.split("\n").grep(/\bsecond\b/).map{ |l| l.rstrip[/\S+$/] } }
  p3 { data.split("\n").grep(/\bsecond\b/).map{ |l| l.rstrip[/\S+$/] } }
end
# >> Running each test 256 times. Test will take about 1 second.
# >> p1 is faster than p2 by 3.9x ± 0.01
# >> p2 is similar to p3

Related

BASH - Shuffle characters in strings from several rows

I have a file (filename.txt) with the following structure:
>line1
ABC
DEF
GHI
>line2
JKL
MNO
PQR
>line3
STU
VWX
YZ
I would like to shuffle the characters in the strings that do not start with >. The output would (for example) look like the following:
>line1
DGC
FEI
HBA
>line2
JRP
OKN
QML
>line3
SZV
YXT
UW
This is what I tried to shuffle the characters for each >line[number]: ruby -lpe '$_ = $_.chars.shuffle * "" if !/^>/' filename.txt. The command works (see my post BASH - Shuffle characters in strings from file), but it shuffles line by line. I was wondering how I could modify the command to shuffle characters across all strings of each >line[number]. Using Ruby is not a requirement.
First, we need to solve the problem: how to shuffle all characters in multiple lines:
echo -e 'ABC\nDEF\nGHI' |grep -o . |shuf |tr -d '\n'
GDAFHEIBC
We also need an array to record the length of each line in the original strings.
s=GDAFHEIBC
lens=(3 3 3)
start=0
for len in "${lens[@]}"; do
    echo ${s:${start}:${len}}
    ((start+=len))
done
GDA
FHE
IBC
So, the original multiple lines:
ABC
DEF
GHI
have been shuffled to:
GDA
FHE
IBC
Now, we can do our jobs:
lens=()
string=""
function shuffle_lines {
    local start=0
    local shuffled_string=$(grep -o . <<< ${string} |shuf |tr -d '\n')
    for len in "${lens[@]}"; do
        echo ${shuffled_string:${start}:${len}}
        ((start+=len))
    done
    lens=()
    string=""
}
while read -r line; do
    if [[ "${line}" =~ ^\> ]]; then
        shuffle_lines
        echo "${line}"
    else
        string+="${line}"
        lens+=(${#line})
    fi
done <filename.txt
shuffle_lines
Examples:
$ cat filename.txt
>line1
ABC
DEF
GHI
>line2
JKL
MNO
PQR
>line3
STU
VWX
YZ
>line4
0123
456
78
9
$ ./solution.sh
>line1
HFG
BED
AIC
>line2
JOP
KMQ
RLN
>line3
UVW
TYZ
XS
>line4
1963
245
08
7
#!/bin/bash
# echo > output.txt # uncomment to write in a file output.txt
mix(){
    {
        echo "$title"
        line="$( fold -w1 <<< "$line" | shuf )"
        echo "${line//$'\n'}" | fold -w3
    } # >> output.txt # uncomment to write in a file output.txt
    unset line
}
while read -r; do
    if [[ $REPLY =~ ^\> ]]; then
        mix
        title="$REPLY"
    else
        line+="$REPLY"
    fi
done < filename.txt
mix # final mix after the loop exits, otherwise line3 would not be mixed
exit
Edited following a comment from gniourf-gniourf.
First create a test file.
str =<<FINI
>line1
ABC
DEF
GHI
>line2
JKL
MNO
PQR
>line3
STU
VWX
YZ
FINI
File.write('test', str)
#=> 56
Now read the file and perform the desired operations.
result = File.read('test').split(/(>line\d+)/).map do |s|
  if s.match?(/\A(?:|>line\d+)\z/)
    s
  else
    a = s.scan(/\p{Lu}/).shuffle
    s.gsub(/\p{Lu}/) { a.shift }
  end
end.join
# ">line1\nECF\nHIA\nGBD\n>line2\nJNP\nKLR\nOQM\n>line3\nTXY\nUZV\nSW\n"
puts result
>line1
ECF
HIA
GBD
>line2
JNP
KLR
OQM
>line3
TXY
UZV
SW
To do this from the command line, convert the code to a string with statements separated by semicolons.
ruby -e "puts (File.read('test').split(/(>line\d+)/).map do |s|; if s.match?(/\A(?:|>line\d+)\z/); s; else; a = s.scan(/\p{Lu}/).shuffle; s.gsub(/\p{Lu}/) { a.shift }; end; end).join"
The steps are as follows.
a = File.read('test')
#=> ">line1\nABC\nDEF\nGHI\n>line2\nJKL\nMNO\nPQR\n>line3\nSTU\nVWX\nYZ\n"
b = a.split(/(>line\d+)/)
#=> ["", ">line1", "\nABC\nDEF\nGHI\n", ">line2", "\nJKL\nMNO\nPQR\n",
# ">line3", "\nSTU\nVWX\nYZ\n"]
Notice that the regular expression that is split's argument places >line\d+ within a capture group. Without that, ">line1", ">line2" and ">line3" would not be included in b.
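A quick illustration of that difference with String#split:
"x>line1y".split(/>line\d+/)    #=> ["x", "y"]
"x>line1y".split(/(>line\d+)/)  #=> ["x", ">line1", "y"]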
c = b.map do |s|
  if s.match?(/\A(?:|>line\d+)\z/)
    s
  else
    a = s.scan(/\p{Lu}/).shuffle
    s.gsub(/\p{Lu}/) { a.shift }
  end
end
#=> ["", ">line1", "\nEAC\nIHB\nDGF\n", ">line2", "\nKQJ\nROL\nMPN\n",
# ">line3", "\nSUY\nXTV\nZW\n"]
c.join
#=> ">line1\nEAC\nIHB\nDGF\n>line2\nKQJ\nROL\nMPN\n>line3\nSUY\nXTV\nZW\n"
Now consider more closely the calculation of c. The first element of b is passed to the block and the block variable s is set to its value:
s = ""
We then compute
s.match?(/\A(?:|>line\d+)\z/)
#=> true
so "" is returned from the block. The regular expression can be expressed as follows.
/
\A         # match the beginning of the string
(?:        # begin a non-capture group
           # match an empty string
|          # or
>line\d+   # match '>line' followed by one or more digits
)          # end non-capture group
\z         # match the end of the string
/x         # free-spacing regex definition mode
Within the non-capture group an empty string was matched.
The next element of b is then passed to the block.
s = ">line1"
Again
s.match?(/\A(?:|>line\d+)\z/)
#=> true
so s is returned from the block.
Now the third element of b is passed to the block. (Finally, something interesting.)
s = "\nABC\nDEF\nGHI\n"
d = s.scan(/\p{Lu}/)
#=> ["A", "B", "C", "D", "E", "F", "G", "H", "I"]
a = d.shuffle
#=> ["D", "C", "G", "H", "B", "F", "I", "E", "A"]
s.gsub(/\p{Lu}/) { a.shift }
#=> "\nDCG\nHBF\nIEA\n"
The remaining calculations are similar.

grep the input file with keyword, then generate new report

cat infile
abc 123 678
sda 234 345 321
xyz 234 456 678
I need to grep the file for the keyword sda and report the first and last columns:
sda has the value of 321
If you know bash scripting: I need the equivalent in Ruby of the awk script below:
awk '/sda/{print $1 " has the value of " $NF}' infile
How about something like this?
File.open("infile", "r").each_line do |line|
  next unless line =~ /^sda/ # don't process the line unless it starts with "sda"
  entries = line.split(" ")
  var1 = entries.first
  var2 = entries.last
  puts "#{var1} has the value of #{var2}"
end
I don't know where you are defining the "sda" matcher. If it's fixed, you can just put it in there.
If not, you might try grabbing it from the command-line arguments.
key, *_, value = line.split
next unless key == 'sda' # or "next if key != 'sda'"
puts your_string
Alternatively, you could use a regexp matcher in the beginning to see if the line starts with 'sda' or not.
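Putting those suggestions together, a small sketch that takes the keyword from the command line and checks it with a regexp at the start of each line (the script name report.rb and the default keyword are hypothetical):
# Usage: ruby report.rb sda
keyword = ARGV.fetch(0, 'sda')   # fall back to 'sda' if no argument is given

File.foreach('infile') do |line|
  next unless line =~ /\A#{Regexp.escape(keyword)}\b/   # line starts with the keyword
  key, *_, value = line.split
  puts "#{key} has the value of #{value}"
end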

How to get specific data from block of data based on condition

I have a file like this:
[group]
enable = 0
name = green
test = more
[group]
name = blue
test = home
[group]
value = 48
name = orange
test = out
There may be one or more spaces/tabs between the label, the =, and the value.
The number of lines may vary in every block.
I'd like to get the name, but only if enable = 0 is not present in the block.
So output should be:
blue
orange
Here is what I have managed to create:
awk -v RS="group" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
There are several faults with this:
I am not able to set RS to [group]; both RS="[group]" and RS="\[group\]" fail. Using RS="group" will also break if name or other labels contain group.
I prefer not to use an RS with multiple characters, since that is GNU awk only.
Does anyone have another suggestion? Preferably sed or awk, and not a long chain of commands.
If you know that groups are always separated by empty lines, set RS to the empty string:
$ awk -v RS="" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
@devnull explained in his answer that GNU awk also accepts regular expressions in RS, so you could split only at [group] if it is on its own line:
gawk -v RS='(^|\n)[[]group]($|\n)' '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
This makes sure we're not splitting at evil names like
[group]
enable = 0
name = [group]
name = evil
test = more
Your problem seems to be:
I am not able to set RS to [group]; both RS="[group]" and RS="\[group\]" fail.
Saying:
RS="[[]group[]]"
should yield the desired result.
In these situations, where there are clearly name = value statements within a record, I like to first populate an array with those mappings, e.g.:
map["<name>"] = <value>
and then just use the names to reference the values I want. In this case:
$ awk -v RS= -F'\n' '
{
    delete map
    for (i=1;i<=NF;i++) {
        split($i,tmp,/ *= */)
        map[tmp[1]] = tmp[2]
    }
}
map["enable"] !~ /^0$/ {
    print map["name"]
}
' file
blue
orange
If your version of awk doesn't support deleting a whole array then change delete map to split("",map).
Compared to using REs and/or sub()s, etc., it makes the solution much more robust and extensible in case you want to compare and/or print the values of other fields in the future.
Since your records are separated by blank lines, you should consider putting awk in paragraph mode. If you must test for the [group] identifier, simply add code to handle that. Here's some example code that should fulfill your requirements. Run like:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    RS=""
}
{
    for (i=2; i<=NF; i+=3) {
        if ($i == "enable" && $(i+2) == 0) {
            f = 1
        }
        if ($i == "name") {
            r = $(i+2)
        }
    }
}
!(f) && r {
    print r
}
{
    f = 0
    r = ""
}
Results:
blue
orange
This might work for you (GNU sed):
sed -n '/\[group\]/{:a;$!{N;/\n$/!ba};/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p;d}' file
Read the [group] block into the pattern space then substitute out the colour if the enable variable is not set to 0.
sed -n '...' sets sed to run in silent mode: no output unless specified, i.e. by a p or P command.
/\[group\]/{...} when we have a line which contains [group], do what is found inside the curly braces.
:a;$!{N;/\n$/!ba} to do a loop we need a place to loop to; :a is the place to loop to. $ is the end-of-file address and $! means not the end of file, so $!{...} means do what is found inside the curly braces when it is not the end of file. N means append a newline and the next line to the current line, and /\n$/!ba branches (b) back to a as long as the pattern space does not yet end with an empty line. So this collects all lines from a line that contains [group] to an empty line (or the end of the file).
/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p if the lines collected contain enable = 0 then do not substitute out the colour. Or to put it another way, if the lines collected so far do not contain enable = 0 do substitute out the colour.
If you don't want to use the record separator, you could use a dummy variable like this:
#!/usr/bin/awk -f
function endgroup() {
    if (e == 1) {
        print n
    }
}
$1 == "name" {
    n = $3
}
$1 == "enable" && $3 == 0 {
    e = 0;
}
$0 == "[group]" {
    endgroup();
    e = 1;
}
END {
    endgroup();
}
You could actually use Bash for this.
while read line; do
    if [[ $line == "enable = 0" ]]; then
        n=1
    else
        n=0
    fi
    if [ $n -eq 0 ] && [[ $line =~ name[[:space:]]+=[[:space:]]([a-z]+) ]]; then
        echo ${BASH_REMATCH[1]}
    fi
done < file
This will only work however if enable = 0 is always only one line above the line with name.

What is the fastest way to delete lines in a file which have no match in a second file?

I have two files, wordlist.txt and text.txt.
The first file, wordlist.txt, contains a huge list of words in Chinese, Japanese, and Korean, e.g.:
你
你们
我
The second file, text.txt, contains long passages, e.g.:
你们要去哪里?
卡拉OK好不好?
I want to create a new word list (wordsfound.txt), but it should only contain those lines from wordlist.txt which are found at least once within text.txt. The output file from the above should show this:
你
你们
"我" is not found in this list because it is never found in text.txt.
I want to find a very fast way to create this list which only contains lines from the first file that are found in the second.
I know a simple way in BASH to check each line in wordlist.txt and see if it is in text.txt using grep:
a=1
while read line
do
c=`grep -c $line text.txt`
if [ "$c" -ge 1 ]
then
echo $line >> wordsfound.txt
echo "Found" $a
fi
echo "Not found" $a
a=`expr $a + 1`
done < wordlist.txt
Unfortunately, as wordlist.txt is a very long list, this process takes many hours. There must be a faster solution. Here is one consideration:
As the files contain CJK letters, they can be thought of as using a giant alphabet with about 8,000 letters, so nearly every word shares characters. E.g.:
我
我们
Due to this fact, if "我" is never found within text.txt, then it is quite logical that "我们" never appears either. A faster script might perhaps check "我" first and, upon finding that it is not present, would avoid checking every subsequent word in wordlist.txt that also contains "我". If there are about 8,000 unique characters found in wordlist.txt, then the script should not need to check so many lines.
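A rough sketch of that pruning idea (not a complete solution, and it assumes the whole text fits in memory):
words = File.readlines('wordlist.txt').map(&:strip)
text  = File.read('text.txt')

absent = []
found  = []
# Sorting puts shorter words before the longer words that start with them.
words.sort.each do |word|
  # If a shorter word is already known to be absent, any word that starts
  # with it cannot occur in the text either, so skip the expensive search.
  next if absent.any? { |a| word.start_with?(a) }
  if text.include?(word)
    found << word
  else
    absent << word
  end
end

File.write('wordsfound.txt', found.join("\n"))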
What is the fastest way to create the list containing only those words from the first file that are also found somewhere within the second?
I grabbed the text of War and Peace from the Gutenberg project and wrote the following script. It prints all words in /usr/share/dict/words which are also in war_and_peace.txt. You can change that with:
perl findwords.pl --wordlist=/path/to/wordlist --text=/path/to/text > wordsfound.txt
On my computer, it takes just over a second to run.
use strict;
use warnings;
use utf8::all;
use Getopt::Long;
my $wordlist = '/usr/share/dict/words';
my $text = 'war_and_peace.txt';
GetOptions(
    "wordlist=s" => \$wordlist,
    "text=s"     => \$text,
);

open my $text_fh, '<', $text
    or die "Cannot open '$text' for reading: $!";

my %is_in_text;
while ( my $line = <$text_fh> ) {
    chomp($line);

    # you will want to customize this line
    my @words = grep { $_ } split /[[:punct:][:space:]]/ => $line;
    next unless @words;

    # This beasty uses the 'x' builtin in list context to assign
    # the value of 1 to all keys (the words)
    @is_in_text{@words} = (1) x @words;
}

open my $wordlist_fh, '<', $wordlist
    or die "Cannot open '$wordlist' for reading: $!";

while ( my $word = <$wordlist_fh> ) {
    chomp($word);
    if ( $is_in_text{$word} ) {
        print "$word\n";
    }
}
And here's my timing:
[ovid] $ wc -w war_and_peace.txt
565450 war_and_peace.txt
[ovid] $ time perl findwords.pl > wordsfound.txt
real 0m1.081s
user 0m1.076s
sys 0m0.000s
[ovid] $ wc -w wordsfound.txt
15277 wordsfound.txt
Just use comm
http://unstableme.blogspot.com/2009/08/linux-comm-command-brief-tutorial.html
comm -1 wordlist.txt text.txt
This might work for you:
tr '[:punct:]' ' ' < text.txt | tr -s ' ' '\n' |sort -u | grep -f - wordlist.txt
Basically, create a new word list from text.txt and grep it against wordlist.txt file.
N.B. You may want to use the software you used to build the original wordlist.txt. In which case all you need is:
yoursoftware < text.txt > newwordlist.txt
grep -f newwordlist.txt wordlist.txt
Use grep with fixed-strings (-F) semantics; this will be fastest. Similarly, if you want to write it in Perl, use the index function instead of a regex.
sort -u wordlist.txt > wordlist-unique.txt
grep -F -f wordlist-unique.txt text.txt
I'm surprised that there are already four answers, but no one posted this yet. People just don't know their toolbox anymore.
I would probably use Perl:
use strict;

my @aWordList = ();

open(WORDLIST, "< wordlist.txt") || die("Can't open wordlist.txt");

while(my $sWord = <WORDLIST>)
{
    chomp($sWord);
    push(@aWordList, $sWord);
}

close(WORDLIST);

open(TEXT, "< text.txt") || die("Can't open text.txt");

while(my $sText = <TEXT>)
{
    foreach my $sWord (@aWordList)
    {
        if($sText =~ /$sWord/)
        {
            print("$sWord\n");
        }
    }
}

close(TEXT);
This won't be too slow, but if you could let us know the size of the files you're dealing with, I could have a go at writing something much more clever with hash tables.
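For what it's worth, a very rough sketch of what such a hash-table approach could look like in Ruby (an illustration only: it indexes every substring of the text up to the longest word length, trading memory for lookup speed):
require 'set'

words   = File.readlines('wordlist.txt').map(&:strip)
text    = File.read('text.txt')
max_len = words.map(&:size).max

# Put every substring of the text, up to max_len characters, into a Set.
substrings = Set.new
(0...text.size).each do |i|
  (1..max_len).each { |len| substrings << text[i, len] }
end

# Now each word is a single membership test.
found = words.select { |w| substrings.include?(w) }
File.write('wordsfound.txt', found.join("\n"))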
Quite sure this is not the fastest solution, but at least a working one (I hope).
This solution needs Ruby 1.9; the text files are expected to be UTF-8.
#encoding: utf-8
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')

new_wordlist = []
$wordlist.each{|word|
  new_wordlist << word if $txt.include?(word)
}

#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
  f << new_wordlist.join("\n")
}
Can you provide a bigger example to make some benchmark on different methods? (Perhaps some test files to download?)
Below is a benchmark with four methods.
#encoding: utf-8
require 'benchmark'
N = 10_000 #Number of Test loops
#Get test data
$wordlist = File.readlines('wordlist.txt', :encoding => 'utf-8').map{|x| x.strip}
$txt = File.read('text.txt', :encoding => 'utf-8')
def solution_count
  new_wordlist = []
  $wordlist.each{|word|
    new_wordlist << word if $txt.count(word) > 0
  }
  new_wordlist.sort
end

#Faster than count, it can stop after the first hit
def solution_include
  new_wordlist = []
  $wordlist.each{|word|
    new_wordlist << word if $txt.include?(word)
  }
  new_wordlist.sort
end

def solution_combine()
  #get biggest word size
  max = 0
  $wordlist.each{|word| max = word.size if word.size > max }
  #Build list of all letter combinations from the text
  words_in_txt = []
  0.upto($txt.size){|i|
    1.upto(max){|l|
      words_in_txt << $txt[i,l]
    }
  }
  (words_in_txt & $wordlist).sort
end
#Idea behind:
#- remove string if found.
#- the next comparison is faster, the search text is shorter.
#
#This will not work with overlapping words.
#Example:
# abcdef contains def.
# if we check bcd first, the 'd' of def will be deleted, def is not detected.
def solution_gsub
  new_wordlist = []
  txt = $txt.dup #avoid manipulating the data source for the other methods
  #We must start with the big words.
  #If we start with the small ones, we destroy long words
  $wordlist.sort_by{|x| x.size }.reverse.each{|word|
    new_wordlist << word if txt.gsub!(word,'')
  }
  #Now we must add words which were already part of longer words
  new_wordlist.dup.each{|neww|
    $wordlist.each{|word|
      new_wordlist << word if word != neww and neww.include?(word)
    }
  }
  new_wordlist.sort
end
#Save the result
File.open('wordlist_new.txt', 'w:utf-8'){|f|
#~ f << solution_include.join("\n")
f << solution_combine.join("\n")
}
#Check the different results
if solution_count != solution_include
puts "Difference solution_count <> solution_include"
end
if solution_gsub != solution_include
puts "Difference solution_gsub <> solution_include"
end
if solution_combine != solution_include
puts "Difference solution_combine <> solution_include"
end
#Benchmark the solutions
Benchmark.bmbm(10) {|b|
  b.report('count')   { N.times { solution_count } }
  b.report('include') { N.times { solution_include } }
  b.report('gsub')    { N.times { solution_gsub } }    #wrong results
  b.report('combine') { N.times { solution_combine } }
} #Benchmark
I think the solution_gsub variant is not correct; see the comment in the method definition. If CJK text allows this solution, please give me feedback.
That variant is the slowest in my test, but perhaps it will tune up with bigger examples.
And perhaps it can be tuned a bit.
The combine variant is also very slow, but it would be interesting to see what happens with a bigger example.
First TXR Lisp solution ( http://www.nongnu.org/txr ):
(defvar tg-hash (hash)) ;; tg == "trigraph"

(unless (= (len *args*) 2)
  (put-line `arguments required: <wordfile> <textfile>`)
  (exit nil))

(defvar wordfile [*args* 0])
(defvar textfile [*args* 1])

(mapcar (lambda (line)
          (dotimes (i (len line))
            (push line [tg-hash [line i..(succ i)]])
            (push line [tg-hash [line i..(ssucc i)]])
            (push line [tg-hash [line i..(sssucc i)]])))
        (file-get-lines textfile))

(mapcar (lambda (word)
          (if (< (len word) 4)
            (if [tg-hash word]
              (put-line word))
            (if (find word [tg-hash [word 0..3]]
                      (op search-str @2 @1))
              (put-line word))))
        (file-get-lines wordfile))
The strategy here is to reduce the corpus of words to a hash table which is indexed on the individual characters, digraphs and trigraphs occurring in the lines, associating these fragments with the lines. Then when we process the word list, this reduces the search effort.
Firstly, if the word is short, three characters or less (probably common in Chinese words), we can try to get an instant match in the hash table. If there is no match, the word is not in the corpus.
If the word is longer than three characters, we can try to get a match for the first three characters. That gives us a list of lines which contain a match for the trigraph. We can search those lines exhaustively to see which ones of them match the word. I suspect that this will greatly reduce the number of lines that have to be searched.
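For readers who don't know TXR, here is a rough Ruby sketch of the same indexing strategy (an illustration only, not the author's code):
# Index every 1-, 2- and 3-character fragment of each text line.
tg_hash = Hash.new { |h, k| h[k] = [] }

File.foreach('text.txt') do |raw|
  line = raw.chomp
  (0...line.size).each do |i|
    (1..3).each { |len| tg_hash[line[i, len]] << line }
  end
end

File.foreach('wordlist.txt') do |raw|
  word = raw.chomp
  if word.size <= 3
    puts word if tg_hash.key?(word)          # instant hit or miss
  elsif tg_hash[word[0, 3]].any? { |line| line.include?(word) }
    puts word                                # search only the candidate lines
  end
end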
I would need your data, or something representative thereof, to be able to see what the behavior is like.
Sample run:
$ txr words.tl words.txt text.txt
water
fire
earth
the
$ cat words.txt
water
fire
earth
the
it
$ cat text.txt
Long ago people
believed that the four
elements were
just
water
fire
earth
(TXR reads UTF-8 and does all string manipulation in Unicode, so testing with ASCII characters is valid.)
The use of lazy lists means that we do not store the entire list of 300,000 words, for instance. Although we are using the Lisp mapcar function, the list is being generated on the fly and because we don't keep the reference to the head of the list, it is eligible for garbage collection.
Unfortunately we do have to keep the text corpus in memory because the hash table associates lines.
If that's a problem, the solution could be reversed. Scan all the words, and then process the text corpus lazily, tagging those words which occur. Then eliminate the rest. I will post such a solution also.
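Meanwhile, a rough sketch of that reversed approach (again in Ruby rather than TXR, purely as an illustration):
# Load the word list up front, then stream the text line by line and tag
# words as soon as they are seen, so the whole corpus never sits in memory.
pending = File.readlines('wordlist.txt').map(&:strip)
found   = []

File.foreach('text.txt') do |line|
  hits = pending.select { |w| line.include?(w) }
  found.concat(hits)
  pending -= hits
  break if pending.empty?   # stop early once every word has been found
end

File.write('wordsfound.txt', found.join("\n"))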
# Pseudocode turned into bash (I would use grep, if you're willing to use bash):
> newlist.txt
while read -r word; do
    if grep -q -- "$word" text.txt; then
        echo "$word" >> newlist.txt
    fi
done < wordlist.txt
Simplest way with bash script:
Preprocess first with "tr" and "sort" to format it to one word per line and remove duplicate lines.
Do this:
cat wordlist.txt | while read i; do grep -E "^$i$" text.txt; done;
That's the list of words you want...
Try this:
cat wordlist.txt | while read line
do
    if [[ $(grep -wc "$line" text.txt) -gt 0 ]]
    then
        echo $line
    fi
done
Whatever you do, if you use grep you must use -w to match a whole word. Otherwise, if you have foo in wordlist.txt and foobar in text.txt, you'll get a wrong match.
If the files are VERY big, and this loop takes too much time to run, you can convert text.txt to a list of words (easy with AWK), and use comm to find the words that are in both lists.
This solution is in Perl; it maintains your original semantics and uses the optimization you suggested.
#!/usr/bin/perl
@list = split("\n", `sort < ./wordlist.txt | uniq`);
$size = scalar(@list);
for ($i=0; $i<$size; ++$i) { $list[$i] = quotemeta($list[$i]); }
for ($i=0; $i<$size; ++$i) {
    my $j = $i+1;
    while ($list[$j] =~ /^$list[$i]/) {
        ++$j;
    }
    $skip[$i] = ($j-$i-1);
}
open IN, "<./text.txt" or die;
@text = (<IN>);
close IN;
foreach $c (@text) {
    for ($i=0; $i<$size; ++$i) {
        if ($c =~ /$list[$i]/) {
            $found{$list[$i]} = 1;
            last;
        }
        else {
            $i += $skip[$i];
        }
    }
}
open OUT, ">wordsfound.txt" or die;
while ( my ($key, $value) = each(%found) ) {
    print OUT "$key\n";
}
close OUT;
exit;
Use parallel processing to speed up the processing.
1) sort & uniq on wordlist.txt, then split it into several files (X)
Do some testing; X should equal the number of cores in your computer.
split -d -l wordlist.txt
2) use xargs -P X -n 1 script.sh x00 > output-x00.txt
to process the files in parallel
find ./splitted_files_dir -type f -name "x*" -print | xargs -P 20 -n 1 -I SPLITTED_FILE script.sh SPLITTED_FILE
3) cat output* > output.txt to concatenate the output files
This will speed up the processing enough, and you are able to use tools that you understand. This will ease the maintenance "cost".
The script is almost identical to the one you used in the first place.
script.sh
FILE=$1
OUTPUTFILE="output-${FILE}.txt"
WORDLIST="wordlist.txt"
a=1
while read line
do
    c=`grep -c $line ${FILE} `
    if [ "$c" -ge 1 ]
    then
        echo $line >> ${OUTPUTFILE}
        echo "Found" $a
    fi
    echo "Not found" $a
    a=`expr $a + 1`
done < ${WORDLIST}

shell script to search attribute and store value along with filename

I'm looking for a shell script which searches for an attribute (a string) in all the files in the current directory and stores the attribute values along with the file names.
e.g. File1.txt
abc xyz = "pqr"
File2.txt
abc xyz = "klm"
Here File1 and File2 contain the desired string "abc xyz" and have the values "pqr" and "klm".
I want the result to look something like this:
File1.txt:pqr
File2.txt:klm
Well, this depends on how you define a 'shell script'. Here are 3 one-line solutions:
Using grep/sed:
egrep -o 'abc xyz = ".*"' * | sed -e 's/abc xyz = "\(.*\)"/\1/'
Using awk:
awk '/abc xyz = "(.*)"/ { print FILENAME ":" gensub("abc xyz = \"(.*)\"", "\\1", 1) }' *
Using perl one-liner:
perl -ne 'if(s/abc xyz = "(.*)"/$ARGV:$1/) { print }' *
I personally would go with the last one.
Please don't use bash scripting for this.
There is much room for small improvements in the code,
but in 20 lines the damn thing does the job.
Note: the code assumes that "abc xyz" is at the beginning of the line.
#!/usr/bin/python

import os
import re

MYDIR = '/dir/you/want/to/search'

def search_file(fn):
    myregex = re.compile(r'abc xyz = \"([a-z]+)\"')
    f = open(fn, 'r')
    for line in f:
        m = myregex.match(line)
        if m:
            yield m.group(1)

for filename in os.listdir(MYDIR):
    if os.path.isfile(os.path.join(MYDIR, filename)):
        matches = search_file(os.path.join(MYDIR, filename))
        for match in matches:
            print filename + ':' + match,
Thanks to David Beazley, A.M. Kuchling, and Mark Pilgrim for sharing their vast knowledge.
I couldn't have done something like this without you guys leading the way.
