How do I change an array of numbers into corresponding letters of the alphabet - bash

I have an array called variable that contains the numbers 1-26. I am trying to use a for loop in bash to go through each number of the array and associate it with a letter of the alphabet, since tr only lets me translate the first few letters of the alphabet. An example of my code is:
Note: I am using bash.
#!/bin/bash
for p1 in "${variable[@]}"; do
if (( $p1 == 1 )); then
newvar+='a'
elif (( $p1 == 2 )); then
newvar+='b'
...... and so on down to z
I am trying to create the string newvar, which contains these translated letters. However, when I try to run this it only shows me a, which is the very first number translated. Why doesn't this work?

for p1 in "${variable[@]}"; do
chars+=( $((p1 + 96)) )
done
printf '%b' $(printf '\\%03o' "${chars[@]}")
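For instance (a sketch with hypothetical test data, not from the question): since 'a' is ASCII 97, adding 96 maps each number n to the character code of the n-th letter, and the inner printf turns each code into an octal escape that the outer printf '%b' then expands:
#!/bin/bash
# Hypothetical test data: 8 5 12 12 15 should spell "hello"
variable=(8 5 12 12 15)
chars=()
for p1 in "${variable[@]}"; do
    chars+=( $((p1 + 96)) )   # 'a' is ASCII 97, so n + 96 is the n-th letter's code
done
printf '%b\n' "$(printf '\\%03o' "${chars[@]}")"   # prints: hello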

Maybe:
# alphabet=(a b c d e f g h i j k l m n o p q r s t u v w x y z)
alphabet=({a..z})
letters=(8 5 12 12 15 23 15 18 12 4)
phrase=''
for i in "${letters[@]}"; do
phrase+="${alphabet[i-1]}"
done
echo $phrase
helloworld


ls command, default (alphabetical) sorting order

here a piece of code :
$> ls
` = _ ; ? ( ] # \ % 1 4 7 a B d E g H J l M o P r S u V x Y
^ > - : ' ) { $ & + 2 5 8 A c D f G i k L n O q R t U w X z
< | , ! " [ } * # 0 3 6 9 b C e F h I K m N p Q s T v W y Z
I'm printing all ASCII characters; each element is a folder, and I'm trying to understand the default sorting order of the ls command.
I understand that there is a case-insensitive comparison to sort alphabetic characters, with digits coming first.
I have some trouble understanding how special characters are sorted, and I'm not able to find anything clear. I thought it could be related to the ASCII table, but when you see how things are ordered it really makes no sense... Where does this order come from?
Thanks
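One thing worth checking: ls sorts names using the current locale's collation rules (LC_COLLATE), which typically compare letters case-insensitively and give punctuation characters low weight, rather than by ASCII code. Forcing the C locale falls back to plain byte (ASCII) order, so a quick experiment (a sketch, run in the same directory) makes the difference visible:
# Default: names are compared with the locale's collation table,
# so punctuation is largely ignored and case is folded.
ls
# C/POSIX locale: plain byte-value (ASCII) comparison instead.
LC_ALL=C ls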

Processing data swapped over files BASH

First, I would like to apologize for my extremely basic knowledge about coding, and I hope that I will be able to express myself correctly about my issue. Do not hesitate to ask for further clarifications or anything else...
I'm encountering trouble postprocessing data...
My goal is to recombine data that were swapped.
EDIT : here is a .rar archive containing my test example that works and the one that I am trying to make work... (do not be scared by the time it takes to process the data)
https://drive.google.com/file/d/1AEPUc8haT5_Z3LR3jnZZlpyfxhdDwwo6/view?usp=sharing
EDIT 2 : Here is what I expect on paper (it's my TestReorder3OK folder in my rar archive)
EDIT 3 : MINIMAL COMPLETE EXAMPLE
Script :
#!/bin/bash
# Define the number of replicas
NP=3
NP1=$[NP-1]
rm torder*
for repl in `seq 0 $NP1`
do
echo $repl
# paste column 2 of the .lammps file into a file rep_0, then in the second loop iteration column 3 into rep_1, etc.
awk -v rep=$repl '{r2=rep+2;print $r2}' < log.lammps > rep_$repl
i=0
j=0
# create a loop within the loop
for a in `cat rep_$repl`
do
i=$[i+1]
j=$[j+3]
head -$i screen.$repl.temp | tail -1 >> torder.$a
head -$j ccccd2_H_${repl}_col.bak2 | tail -3 >> ccccd2_H_${a}_temp_col.bak2
done
done
log.lammps file
1 0 1 2
2 1 0 2
3 1 2 0
Starting at column 2, this file contains the numbers associated with the inputs below. Here is an expanded explanation:
Column 2 has three values: 0, 1 and 1; the 0 is associated with the first three lines of the file ccccd2_H_0_col.bak2, the next three lines are associated with the 1, and the last three again with the value 1.
Column 3 also has three values: 1, 0 and 2; the 1 is associated with the first three lines of the file ccccd2_H_1_col.bak2, the next three lines are associated with the 0, and the last three with the value 2.
Same story for column 4.
Now what I want is for every set of three lines associated with the value 0 to go into a single file, every set of three lines associated with the value 1 to go into another single file, and the sets of three lines associated with the value 2 into a last file.
Inputs :
ccccd2_H_0_col.bak2
blank line
N a b c
C d e f
N g h i
C j k l
N m n o
C p q r
ccccd2_H_1_col.bak2
blank line
N s t u
C v w x
N y z a
C b c d
N e f g
C h i j
ccccd2_H_2_col.bak2
blank line
N k l m
C n o p
N q r s
C t u v
N w x y
C z a b
Outputs: These are the desired outputs, and the ones that I get for simple test files.
ccccd2_H_0_temp_col
blank line
N a b c
C d e f
N y z a
C b c d
N w x y
C z a b
ccccd2_H_1_temp_col
blank line
N g h i
C j k l
N m n o
C p q r
N s t u
C v w x
ccccd2_H_2_temp_col
blank line
N e f g
C h i j
N k l m
C n o p
N q r s
C t u v
This works fine on small test files (as shown here), but not on my real system. There, my log.lammps file contains 14 columns and 10,001 lines, and my input files contain 121,121 lines (so 10,001 * block of 121 lines). It creates files 10 times larger, with more data than they should have.
Can you enlighten me about my issue? I think it is linked to the difference in line count between my files containing a single column and the files containing Cartesian coordinates, but I really don't understand the link, nor how to solve it...
Thank you in advance...
I think I understand what you're trying to do now, and this GNU awk script (for ARGIND, ENDFILE and built-in open-file management) will do it:
$ cat ../tst.awk
# First file (log.lammps): for each line, record which output file the
# corresponding block of each subsequent input file belongs to.
ARGIND == 1 {
for (inFileNr=2; inFileNr<=NF; inFileNr++) {
outFileNrs[inFileNr,NR] = $inFileNr
}
next
}
# Once a file has been read, switch to paragraph mode so that each
# blank-line-separated block of the remaining files is one record.
ENDFILE { RS = "" }
# Write the block, re-prepending the blank line that paragraph mode
# consumed, to the output file recorded for it above.
{ print ORS $0 > ("ccccd2_H_" outFileNrs[ARGIND,FNR] "_temp_col") }
Look:
INPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
$ cat log.lammps
1 0 1 2
2 1 0 2
3 1 2 0
$ paste ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 | sed 's/\t/\t\t/g'
N a b c N s t u N k l m
C d e f C v w x C n o p
N g h i N y z a N q r s
C j k l C b c d C t u v
N m n o N e f g N w x y
C p q r C h i j C z a b
SCRIPT EXECUTION:
$ awk -f ../tst.awk log.lammps ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2
OUTPUT:
$ ls
ccccd2_H_0_col.bak2 ccccd2_H_1_col.bak2 ccccd2_H_2_col.bak2 log.lammps
ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col
$ paste ccccd2_H_0_temp_col ccccd2_H_1_temp_col ccccd2_H_2_temp_col | sed 's/\t/\t\t/g'
N a b c N g h i N e f g
C d e f C j k l C h i j
N y z a N m n o N k l m
C b c d C p q r C n o p
N w x y N s t u N q r s
C z a b C v w x C t u v

Shell script for getting limited Values from Delimited Values

I need a shell script for the below.
I have 14 variables as:
scenario 1
input:
a###b###c###d###e###f###g###e###f###g###h###i###j###k###l###m###n
scenario 2
input:
a###b###c###d###e###f###g###e###f###g###h###i###j###k###l###m###n###
I want output as
op1 = a
op2 = b
op3 = c
op4 = d
op5 = e
op6 = g
op7 = f
op8 = h
op9 = i
op10 = j
op11 = k
op12 = l
op13 = m
op14 = n
Here op1 to op14 are variables in which I have to store the values.
First replace ### with a single-character delimiter, e.g. #, then use read to split the string into an array. The elements of the array contain the values. This is shown below.
$ input="a###b###c###d###e###f###g###e###f###g###h###i###j###k"
$ IFS='#' read -a arr <<< "${input//###/#}"
$ echo ${arr[0]}
a
$ echo ${arr[1]}
b
$ echo ${arr[13]}
k
# print out the whole array
$ for (( i=0; i<${#arr[@]}; i++ ))
> do
> echo "Element at index $i is ${arr[i]}"
> done
Element at index 0 is a
Element at index 1 is b
Element at index 2 is c
Element at index 3 is d
Element at index 4 is e
Element at index 5 is f
Element at index 6 is g
Element at index 7 is e
Element at index 8 is f
Element at index 9 is g
Element at index 10 is h
Element at index 11 is i
Element at index 12 is j
Element at index 13 is k
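If the values really need to end up in the scalar variables op1 through op14 rather than staying in the array, a loop with printf -v can do the assignments (a sketch; it assumes the arr array from above holds at least 14 elements, and uses printf -v, which writes into the variable named by its first argument):
$ for (( i=0; i<14; i++ )); do printf -v "op$((i+1))" '%s' "${arr[i]}"; done
$ echo "$op1 $op2 $op14"
a b k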

How to convert CSV file in bash?

I have a file, each line of which is a list of comma separated values. For example,
1, a, b, c, d, e
2, x, y, z
Now I would like to convert it in bash as follows:
1 a
1 b
1 c
1 d
1 e
2 x
2 y
2 z
How to do it with a shell (bash) script?
awk -F, '{for(i=2;i<=NF;i++)print $1,$i}' temp
tested below:
> cat temp
1, a, b, c, d, e
2, x, y, z
> awk -F, '{for(i=2;i<=NF;i++)print $1,$i}' temp
1 a
1 b
1 c
1 d
1 e
2 x
2 y
2 z
You can split the line into tokens and put them in an array. The first element of the array will hold the number, in your case 1 or 2 and so on. Something like this may work:
while read -r line
do
arrIN=(${line//,/ })
## arrIN[0] holds the initial number; pair it with each remaining token
for (( i=1; i<${#arrIN[@]}; i++ ))
do
echo "${arrIN[0]} ${arrIN[i]}"
done
done < "$file"
# $file is the input file you are reading
I understand you need a shell script, but does it need to be bash? For example, I would normally use a higher-level scripting language and a CSV library, e.g. Perl and Text::CSV.

Code Golf: Quickly Build List of Keywords from Text, Including # of Instances

I've already worked out this solution for myself with PHP, but I'm curious how it could be done differently - better even. The two languages I'm primarily interested in are PHP and Javascript, but I'd be interested in seeing how quickly this could be done in any other major language today as well (mostly C#, Java, etc).
Return only words with an occurrence greater than X
Return only words with a length greater than Y
Ignore common terms like "and, is, the, etc"
Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John")
Return results in a collection/array
Extra Credit
Keep quoted statements together (i.e. "They were 'too good to be true' apparently"), where 'too good to be true' would be the actual statement
Extra-Extra Credit
Can your script determine words that should be kept together based upon their frequency of being found together? This being done without knowing the words beforehand. Example:
*"The fruit fly is a great thing when it comes to medical research. Much study has been done on the fruit fly in the past, and has lead to many breakthroughs. In the future, the fruit fly will continue to be studied, but our methods may change."*
Clearly the word here is "fruit fly," which is easy for us to find. Can your search'n'scrape script determine this too?
Source text: http://sampsonresume.com/labs/c.txt
Answer Format
It would be great to see the results of your code, output, in addition to how long the operation lasted.
GNU scripting
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | sort -nr
Results:
7 be
6 to
[...]
1 2.
1 -
With occurrence greater than X:
sed -e 's/ /\n/g' | grep -v '^ *$' | sort | uniq -c | awk '$1>X'
Return only words with a length greater than Y (put Y+1 dots in second grep):
sed -e 's/ /\n/g' | grep -v '^ *$' | grep .... | sort | uniq -c
Ignore common terms like "and, is, the, etc" (assuming that the common terms are in file 'ignored')
sed -e 's/ /\n/g' | grep -v '^ *$' | grep -vf ignored | sort | uniq -c
Feel free to strip punctuation prior to processing (i.e. "John's" becomes "John"):
sed -e "s/[,.:\"']//g;s/ /\n/g" | grep -v '^ *$' | sort | uniq -c
Return results in a collection/array: it is already like an array for shell: first column is count, second is word.
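For reference, the pieces can be chained into one pipeline (a sketch, not from the original answer: X=3 and Y=4 are example thresholds, and the stop words are assumed to live in the file 'ignored'):
# strip punctuation, one word per line, drop blanks and stop words,
# keep words longer than 4 chars, count, keep counts above 3, sort by count
sed -e "s/[,.:\"']//g;s/ /\n/g" < c.txt | grep -v '^ *$' | grep -vf ignored | grep ..... | sort | uniq -c | awk '$1>3' | sort -nr
And for the extra-extra credit, a cheap way to surface pairs such as "fruit fly" is to count adjacent word pairs (again a sketch; words.txt is just a scratch file, and the <(...) process substitution assumes bash):
# one lowercase word per line, then pair each word with its successor
# and count the pairs; unusually frequent pairs are collocation candidates
tr -cs "[:alnum:]'" '\n' < c.txt | tr '[:upper:]' '[:lower:]' > words.txt
paste -d' ' words.txt <(tail -n +2 words.txt) | sort | uniq -c | sort -nr | head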
Perl in only 43 characters.
perl -MYAML -anE'$_{$_}++for@F;say Dump\%_'
Here is an example of its use:
echo a a a b b c d e aa | perl -MYAML -anE'$_{$_}++for@F;say Dump \%_'
---
a: 3
aa: 1
b: 2
c: 1
d: 1
e: 1
If you need to list only the lowercase versions, it requires two more characters.
perl -MYAML -anE'$_{lc$_}++for@F;say Dump\%_'
For it to work on the specified text requires 58 characters.
curl http://sampsonresume.com/labs/c.txt |
perl -MYAML -F'\W+' -anE'$_{lc$_}++for@F;END{say Dump\%_}'
real 0m0.679s
user 0m0.304s
sys 0m0.084s
Here is the last example expanded a bit.
#! perl
use 5.010;
use YAML;
while( my $line = <> ){
for my $elem ( split '\W+', $line ){
$_{ lc $elem }++
}
END{
say Dump \%_;
}
}
F#: 304 chars
let f =
let bad = Set.of_seq ["and";"is";"the";"of";"are";"by";"it"]
fun length occurrence msg ->
System.Text.RegularExpressions.Regex.Split(msg, @"[^\w-']+")
|> Seq.countBy (fun a -> a)
|> Seq.choose (fun (a, b) -> if a.Length > length && b > occurrence && (not <| bad.Contains a) then Some a else None)
Ruby
When "minified", this implementation becomes 165 characters long. It uses array#inject to give a starting value (a Hash object with a default of 0) and then loop through the elements, which are then rolled into the hash; the result is then selected from the minimum frequency.
Note that I didn't count the size of the words to skip, that being an external constant. When the constant is counted too, the solution is 244 characters long.
Apostrophes and dashes aren't stripped, but included; their use modifies the word and therefore cannot be stripped simply without removal of all information beyond the symbol.
Implementation
CommonWords = %w(the a an but and is not or as of to in for by be may has can its it's)
def get_keywords(text, minFreq=0, minLen=2)
text.scan(/(?:\b)[a-z'-]{#{minLen},}(?=\b)/i).
inject(Hash.new(0)) do |result,w|
w.downcase!
result[w] += 1 unless CommonWords.include?(w)
result
end.select { |k,n| n >= minFreq }
end
Test Rig
require 'net/http'
keywords = get_keywords(Net::HTTP.get('www.sampsonresume.com','/labs/c.txt'), 3)
keywords.sort.each { |name,count| puts "#{name} x #{count} times" }
Test Results
code x 4 times
declarations x 4 times
each x 3 times
execution x 3 times
expression x 4 times
function x 5 times
keywords x 3 times
language x 3 times
languages x 3 times
new x 3 times
operators x 4 times
programming x 3 times
statement x 7 times
statements x 4 times
such x 3 times
types x 3 times
variables x 3 times
which x 4 times
C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
{
var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
"for", "by", "an", "be", "may", "has", "can", "its"};
var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
var occurrences = words.Distinct().Except(commonWords).Select(w =>
new { Word = w, Count = words.Count(s => s == w) });
return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
.ToDictionary(wo => wo.Word, wo => wo.Count);
}
This is, however, far from the most efficient method, being O(n^2) in the number of words rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient.
Here are the results of the function run on the sample text (min occurences: 3, min length: 2).
3 x such
4 x code
4 x which
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
7 x statement
3 x language
3 x expression
3 x execution
3 x programming
4 x operators
3 x variables
And my test program:
static void Main(string[] args)
{
string sampleText;
using (var client = new WebClient())
sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
var keywords = GetKeywords(sampleText, 3, 2);
foreach (var entry in keywords)
Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
Console.ReadKey(true);
}
#! perl
use strict;
use warnings;
my %words;
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
print "$word occurred $words{$word} times.\n";
}
That's the simple form. If you want sorting, filtering, etc.:
while (<>) {
for my $word (split) {
$words{$word}++;
}
}
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
print "$word occurred $words{$word} times.\n";
}
}
You can also sort the output pretty easily:
...
for my $word (keys %words) {
if ((length($word) >= $MINLEN) && ($words{$word} >= $MIN_OCCURRENCE)) {
push @output, "$word occurred $words{$word} times.\n";
}
}
my $re = qr/occurred (\d+) /;
print sort {
my ($an) = $a =~ $re;
my ($bn) = $b =~ $re;
$an <=> $bn
} @output;
A true Perl hacker will easily get these on one or two lines each, but I went for readability.
Edit: this is how I would rewrite this last example
...
for my $word (
sort { $words{$a} <=> $words{$b} } keys %words
){
next unless length($word) >= $MINLEN;
next unless $words{$word} >= $MIN_OCCURRENCE;
print "$word occurred $words{$word} times.\n";
}
Or if I needed it to run faster I might even write it like this:
for my $word_data (
sort {
$a->[1] <=> $b->[1] # numerical sort on count
} grep {
# remove values that are out of bounds
length($_->[0]) >= $MINLEN && # word length
$_->[1] >= $MIN_OCCURRENCE # count
} map {
# [ word, count ]
[ $_, $words{$_} ]
} keys %words
){
my( $word, $count ) = @$word_data;
print "$word occurred $count times.\n";
}
It uses map for efficiency,
grep to remove extra elements,
and sort to do the sorting, of course (it does so in that order).
This is a slight variant of the Schwartzian transform.
Another Python solution, at 247 chars. The actual code is a single, highly dense line of 134 chars that computes the whole thing in one expression.
x=3;y=2;W="and is the as of to or in for by an be may has can its".split()
from itertools import groupby as gb
d=dict((w,l)for w,l in((w,len(list(g)))for w,g in
gb(sorted(open("c.txt").read().lower().split())))
if l>x and len(w)>y and w not in W)
A much longer version with plenty of comments for your reading pleasure:
# Minimum occurrence count and minimum word length thresholds.
x = 3
y = 2
# Common words string split into a list by spaces.
Words = "and is the as of to or in for by an be may has can its".split()
# A special function that groups equal strings in a sorted list into
# (string, grouper) pairs. Grouper is a generator of occurrences (see below).
from itertools import groupby
# Reads the entire file, converts it to lower case and splits on whitespace
# to create a list of words
sortedWords = sorted(open("c.txt").read().lower().split())
# Using the groupby function, groups identical words together.
# Since grouper is a generator of occurrences we need to use len(list(grouper))
# to get the word count by first converting the generator to a list and then
# getting the length of the list.
wordCounts = ((word, len(list(grouper))) for word, grouper in groupby(sortedWords))
# Filters the words by number of occurrences and common words using yet another
# generator expression.
filteredWordCounts = ((word, count) for word, count in wordCounts if word not in Words and count > x and len(word) > y)
# Creates a dictionary from the list of tuples.
result = dict(filteredWordCounts)
print result
The main trick here is using the itertools.groupby function to count the occurrences in a sorted list. I don't know if it really saves characters, but it does allow all the processing to happen in a single expression.
Results:
{'function': 4, 'operators': 4, 'declarations': 4, 'which': 4, 'statement': 5}
C# code:
IEnumerable<KeyValuePair<String, Int32>> ProcessText(String text, int X, int Y)
{
// common words, that will be ignored
var exclude = new string[] { "and", "is", "the", "as", "of", "to", "or", "in", "for", "by", "an", "be", "may", "has", "can", "its" }.ToDictionary(word => word);
// regular expression to find quoted text
var regex = new Regex("\"[^\"]*\"", RegexOptions.Compiled);
return
// remove quoted text (it will be processed later)
regex.Replace(text, "")
// remove case dependency
.ToLower()
// split text by all these chars
.Split(".,'\\/[]{}()`~##$%^&*-=+?!;:<>| \n\r".ToCharArray())
// add quoted text
.Concat(regex.Matches(text).Cast<Match>().Select(match => match.Value))
// group words by the word and count them
.GroupBy(word => word, (word, words) => new KeyValuePair<String, Int32>(word, words.Count()))
// apply filter(min word count and word length) and remove common words
.Where(pair => pair.Value >= X && pair.Key.Length >= Y && !exclude.ContainsKey(pair.Key));
}
Output for ProcessText(text, 3, 2) call:
3 x languages
3 x such
4 x code
4 x which
3 x based
3 x each
4 x declarations
5 x function
4 x statements
3 x new
3 x types
3 x keywords
3 x variables
7 x statement
4 x expression
3 x execution
3 x programming
3 x operators
In C#:
Use LINQ, specifically GroupBy, then filter by group count, and return a flattened (SelectMany) list.
Use LINQ, filter by length.
Use LINQ, filter with 'badwords'.Contains.
REBOL
Verbose, perhaps, so definitely not a winner, but gets the job done.
min-length: 0
min-count: 0
common-words: [ "a" "an" "as" "and" "are" "by" "for" "from" "in" "is" "it" "its" "the" "of" "or" "to" "until" ]
add-word: func [
word [string!]
/local
count
letter
non-letter
temp
rules
match
][
; Strip out punctuation
temp: copy {}
letter: charset [ #"a" - #"z" #"A" - #"Z" #" " ]
non-letter: complement letter
rules: [
some [
copy match letter (append temp match)
|
non-letter
]
]
parse/all word rules
word: temp
; If we end up with nothing, bail
if 0 == length? word [
exit
]
; Check length
if min-length > length? word [
exit
]
; Ignore common words
ignore:
if find common-words word [
exit
]
; OK, it's good. Add it.
either found? count: select words word [
words/(word): count + 1
][
repend words [word 1]
]
]
rules: [
some [
{"}
copy word to {"} (add-word word)
{"}
|
copy word to { } (add-word word)
{ }
]
end
]
words: copy []
parse/all read %c.txt rules
result: copy []
foreach word words [
if string? word [
count: words/:word
if count >= min-count [
append result word
]
]
]
sort result
foreach word result [ print word ]
The output is:
act
actions
all
allows
also
any
appear
arbitrary
arguments
assign
assigned
based
be
because
been
before
below
between
braces
branches
break
builtin
but
C
C like any other language has its blemishes Some of the operators have the wrong precedence some parts of the syntax could be better
call
called
calls
can
care
case
char
code
columnbased
comma
Comments
common
compiler
conditional
consisting
contain
contains
continue
control
controlflow
criticized
Cs
curly brackets
declarations
define
definitions
degree
delimiters
designated
directly
dowhile
each
effect
effects
either
enclosed
enclosing
end
entry
enum
evaluated
evaluation
evaluations
even
example
executed
execution
exert
expression
expressionExpressions
expressions
familiarity
file
followed
following
format
FORTRAN
freeform
function
functions
goto
has
high
However
identified
ifelse
imperative
include
including
initialization
innermost
int
integer
interleaved
Introduction
iterative
Kernighan
keywords
label
language
languages
languagesAlthough
leave
limit
lineEach
loop
looping
many
may
mimicked
modify
more
most
name
needed
new
next
nonstructured
normal
object
obtain
occur
often
omitted
on
operands
operator
operators
optimization
order
other
perhaps
permits
points
programmers
programming
provides
rather
reinitialization
reliable
requires
reserve
reserved
restrictions
results
return
Ritchie
say
scope
Sections
see
selects
semicolon
separate
sequence
sequence point
sequential
several
side
single
skip
sometimes
source
specify
statement
statements
storage
struct
Structured
structuresAs
such
supported
switch
syntax
testing
textlinebased
than
There
This
turn
type
types
union
Unlike
unspecified
use
used
uses
using
usually
value
values
variable
variables
variety
which
while
whitespace
widespread
will
within
writing
Python (258 chars as is, including 66 chars for the first line and 30 chars for punctuation removal):
W="and is the as of to or in for by an be may has can its".split()
x=3;y=2;d={}
for l in open('c.txt') :
for w in l.lower().translate(None,',.;\'"!()[]{}').split() :
if w not in W: d[w]=d.get(w,0)+1
for w,n in d.items() :
if n>y and len(w)>x : print n,w
output :
4 code
3 keywords
3 languages
3 execution
3 each
3 language
4 expression
4 statements
3 variables
7 statement
5 function
4 operators
4 declarations
3 programming
4 which
3 such
3 types
Here is my variant, in PHP:
$str = implode(file('c.txt'));
$splitters = '\s.,\(\)\{\};?:'; // string splitters
$array = preg_split( "/[" . $splitters . "]*\\\"([^\\\"]+)\\\"[" . $splitters . "]*|[" . $splitters . "]+/", $str, 0, PREG_SPLIT_DELIM_CAPTURE );
foreach($array as $key) {
$res[$key] = isset($res[$key]) ? $res[$key] + 1 : 1;
}
unset($res['the']);
unset($res['and']);
unset($res['to']);
unset($res['of']);
unset($res['by']);
unset($res['a']);
unset($res['as']);
unset($res['is']);
unset($res['in']);
unset($res['']);
arsort($res);
//var_dump($res); // concordance
foreach ($res AS $word => $rarity)
echo $word . ' <b>x</b> ' . $rarity . '<br/>';
foreach ($array as $word) { // words longer than n (=5)
// if(strlen($word) > 5)echo $word.'<br/>';
}
And output:
statement x 7
be x 7
C x 5
may x 5
for x 5
or x 5
The x 5
as x 5
expression x 4
statements x 4
code x 4
function x 4
which x 4
an x 4
declarations x 3
new x 3
execution x 3
types x 3
such x 3
variables x 3
can x 3
languages x 3
operators x 3
end x 2
programming x 2
evaluated x 2
functions x 2
definitions x 2
keywords x 2
followed x 2
contain x 2
several x 2
side x 2
most x 2
has x 2
its x 2
called x 2
specify x 2
reinitialization x 2
use x 2
either x 2
each x 2
all x 2
built-in x 2
source x 2
are x 2
storage x 2
than x 2
effects x 1
including x 1
arguments x 1
order x 1
even x 1
unspecified x 1
evaluations x 1
operands x 1
interleaved x 1
However x 1
value x 1
branches x 1
goto x 1
directly x 1
designated x 1
label x 1
non-structured x 1
also x 1
enclosing x 1
innermost x 1
loop x 1
skip x 1
There x 1
within x 1
switch x 1
Expressions x 1
integer x 1
variety x 1
see x 1
below x 1
will x 1
on x 1
selects x 1
case x 1
executed x 1
based x 1
calls x 1
from x 1
because x 1
many x 1
widespread x 1
familiarity x 1
C's x 1
mimicked x 1
Although x 1
reliable x 1
obtain x 1
results x 1
needed x 1
other x 1
syntax x 1
often x 1
Introduction x 1
say x 1
Programming x 1
Language x 1
C, like any other language, has its blemishes. Some of the operators have the wrong precedence; some parts of the syntax could be better. x 1
Ritchie x 1
Kernighan x 1
been x 1
criticized x 1
For x 1
example x 1
care x 1
more x 1
leave x 1
return x 1
call x 1
&& x 1
|| x 1
entry x 1
include x 1
next x 1
before x 1
sequence point x 1
sequence x 1
points x 1
comma x 1
operator x 1
but x 1
compiler x 1
requires x 1
programmers x 1
exert x 1
optimization x 1
object x 1
This x 1
permits x 1
high x 1
degree x 1
occur x 1
Structured x 1
using x 1
struct x 1
union x 1
enum x 1
define x 1
Declarations x 1
file x 1
contains x 1
Function x 1
turn x 1
assign x 1
perhaps x 1
Keywords x 1
char x 1
int x 1
Sections x 1
name x 1
variable x 1
reserve x 1
usually x 1
writing x 1
type x 1
Each x 1
line x 1
format x 1
rather x 1
column-based x 1
text-line-based x 1
whitespace x 1
arbitrary x 1
FORTRAN x 1
77 x 1
free-form x 1
allows x 1
restrictions x 1
Comments x 1
C99 x 1
following x 1
// x 1
until x 1
*/ x 1
/* x 1
appear x 1
between x 1
delimiters x 1
enclosed x 1
braces x 1
supported x 1
if x 1
-else x 1
conditional x 1
Unlike x 1
reserved x 1
sequential x 1
provides x 1
control-flow x 1
identified x 1
do-while x 1
while x 1
any x 1
omitted x 1
break x 1
continue x 1
expressions x 1
testing x 1
iterative x 1
looping x 1
separate x 1
initialization x 1
normal x 1
modify x 1
control x 1
structures x 1
As x 1
imperative x 1
single x 1
act x 1
sometimes x 1
curly brackets x 1
limit x 1
scope x 1
language x 1
uses x 1
evaluation x 1
assigned x 1
values x 1
To x 1
effect x 1
semicolon x 1
actions x 1
common x 1
consisting x 1
used x 1
The var_dump statement simply displays the concordance. This variant preserves double-quoted expressions.
For the supplied file this code finishes in 0.047 seconds, though a larger file will consume lots of memory (because of the file function).
This is not going to win any golfing awards but it does keep quoted phrases together and takes into account stop words (and leverages CPAN modules Lingua::StopWords and Text::ParseWords).
In addition, I use to_S from Lingua::EN::Inflect::Number to count only the singular forms of words.
You might also want to look at Lingua::CollinsParser.
#!/usr/bin/perl
use strict; use warnings;
use Lingua::EN::Inflect::Number qw( to_S );
use Lingua::StopWords qw( getStopWords );
use Text::ParseWords;
my $stop = getStopWords('en');
my %words;
while ( my $line = <> ) {
chomp $line;
next unless $line =~ /\S/;
next unless my @words = parse_line(' ', 1, $line);
++ $words{to_S $_} for
grep { length and not $stop->{$_} }
map { s!^[[:punct:]]+!!; s![[:punct:]]+\z!!; lc }
@words;
}
print "=== only words appearing 4 or more times ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { $words{$_} > 3 } keys %words;
print "=== only words that are 12 characters or longer ===\n";
print "$_ : $words{$_}\n" for sort {
$words{$b} <=> $words{$a}
} grep { 11 < length } keys %words;
Output:
=== only words appearing 4 or more times ===
statement : 11
function : 7
expression : 6
may : 5
code : 4
variable : 4
operator : 4
declaration : 4
c : 4
type : 4
=== only words that are 12 characters or longer ===
reinitialization : 2
control-flow : 1
sequence point : 1
optimization : 1
curly brackets : 1
text-line-based : 1
non-structured : 1
column-based : 1
initialization : 1
