how to delete all characters starting from the nth position for every word using bash? - bash

I have a file containing 1,700,000 words. I want to do naive stemming of the words: if a word is longer than 6 characters, I delete every character after the 6th position. For example:
Input:
Everybody is around
Everyone keeps talking
Output:
Everyb is around
Everyo keeps talkin
I wrote the following script:
INPUT=train.txt
while read line; do
for word in $line; do
new="$(echo $word | awk '{print substr($0,1,6);exit}')"
echo -n $new >> train_stem_6.txt
echo -n ' ' >> train_stem_6.txt
done
echo ' ' >> train_stem_6.txt
done < "$INPUT"
This produces the correct output, but it is extremely slow: with 1,700,000 words it takes forever.
Is there a faster way to do this in a bash script?
Thanks a lot,

You can use GNU awk with a custom RS (RT, the text that matched RS, is a gawk extension):
awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file
Everyb is around
Everyo keeps talkin
Timings of 3 commands on 11 MB input file:
sed:
time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' file >/dev/null
real 0m2.913s
user 0m2.878s
sys 0m0.020s
awk command by @andlrc:
time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' file >/dev/null
real 0m1.191s
user 0m1.174s
sys 0m0.011s
My suggested awk command:
time awk -v RS='[[:space:]]' '{ORS=RT; print substr($0, 1, 6)}' file >/dev/null
real 0m1.926s
user 0m1.905s
sys 0m0.013s
So both awk commands finish within a second of each other on this input, while sed is the slowest of the three.
Timings of 3 commands on a 167 MB file:
$ time awk -v RS='[[:space:]]+' 'RT{ORS=RT} {$1=substr($1, 1, 6)} 1' test > /dev/null
real 0m29.070s
user 0m28.898s
sys 0m0.060s
$ time awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' test >/dev/null
real 0m13.897s
user 0m13.805s
sys 0m0.036s
$ time sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g' test > /dev/null
real 0m40.525s
user 0m40.323s
sys 0m0.064s

Have you considered using sed?
sed -r 's/([a-zA-Z]{6})[a-zA-Z]+/\1/g'

You can use awk for this:
awk '{for(i=1;i<=NF;i++){$i=substr($i, 1, 6)}}1' train.txt
Breakdown:
{
for(i=1;i<=NF;i++) { # Iterate over each word
$i = substr($i, 1, 6); # Shrink it to a maximum of 6 characters
}
}
1 # Print the row
Note, however, that this treats "Awesome," (with its trailing comma) as a single word, so the result is "Awesom" and the comma is lost.
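One way to keep trailing punctuation, sketched under the assumption that a word is the leading run of ASCII letters in each whitespace-separated field: truncate only the letters and re-attach whatever follows them. The function name stem6 is made up for this illustration. Like the command above, it rebuilds each line with single spaces, and fields that start with a non-letter are left untouched.

```shell
# stem6: hypothetical helper; reads text from the named files or from stdin.
stem6() {
  awk '{
    for (i = 1; i <= NF; i++) {
      # Only fields that start with a run of ASCII letters are touched.
      if (match($i, /^[A-Za-z]+/)) {
        letters = substr($i, 1, RLENGTH)   # the word itself
        rest    = substr($i, RLENGTH + 1)  # punctuation or anything else after it
        $i = substr(letters, 1, 6) rest
      }
    }
    print
  }' "$@"
}

printf 'Awesome, everybody agreed.\n' | stem6   # prints: Awesom, everyb agreed.
```

For the file in the question, this could be called as stem6 train.txt > train_stem_6.txt.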

Pure bash, (i.e. not POSIX), as a one-liner:
while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done < train.txt
...and the same code reformatted for clarity:
while read x ; do
set -- $x
for f in $* ; do
echo -n ${f:0:6}" "
done
echo
done < train.txt
Note: repeated whitespace becomes a single space.
Test run: first wrap the code above in a function that reads standard input:
len6() { while read x ; do set -- $x ; for f in $* ; do echo -n ${f:0:6}" " ; done ; echo ; done ; }
Invoke:
COLUMNS=90 man bash | tail | head -n 5 | len6
Output:
gracef when proces suspen is attemp When a proces is stoppe the
shell immedi execut the next comman in the sequen It suffic to
place the sequen of comman betwee parent to force it into a subshe
which may be stoppe as a unit.

Related

Counting all the 5 from a specific range in Bash

I want to count how many times the digit "5" appears in the range 1 to 4321. For example, the number 5 contributes 1, and the number 555 contributes 3.
Here is my code so far; however, it prints 0, and the result is supposed to be 1262.
#!/bin/bash
typeset -i count5=0
for n in {1..4321}; do
echo ${n}
done | \
while read -n1 digit ; do
if [ `echo "${digit}" | grep 5` ] ; then
count5=count5+1
fi
done | echo "${count5}"
P.s. I am looking to fix my code so it can print the right output. I do not want a completely different solution or a shortcut.
What about something like this
seq 4321 | tr -Cd 5 | wc -c
1262
This creates the sequence, deletes everything except the 5s, and counts the remaining characters.
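The main problem here is the one described in http://mywiki.wooledge.org/BashFAQ/024: each part of a pipeline runs in a subshell, so variables set there are lost when it exits. With minimal changes, your code could be refactored to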
#!/bin/bash
typeset -i count5=0
for n in {1..4321}; do
echo $n # braces around ${n} provide no benefit
done | # no backslash required here; fix weird indentation
while read -n1 digit ; do
# prefer modern command substitution syntax over backticks
if [ $(echo "${digit}" | grep 5) ] ; then
count5=count5+1
fi
echo "${count5}" # variable will not persist outside subprocess
done | tail -n 1 # so print it on every pass and keep only the last line
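With some common antipatterns removed, this reduces to
#!/bin/bash
printf '%s\n' {1..4321} |
grep -o 5 | # -o prints every match on its own line
wc -l
Note that the shorter grep -c 5 would not work here: it counts matching lines, not individual matches, so a number such as 555 would be counted only once.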
One primary issue:
Each command in a pipeline runs in its own subshell, and in bash any variable set in a subshell is lost when that subshell exits. The net result is that even if count5 is incremented correctly inside the subshell, you still end up with 0 (the starting value) once the subshell exits.
Making minimal changes to OP's current code:
typeset -i count5=0 # as in the original script; needed for count5=count5+1 to do arithmetic
while read -n1 digit ; do
if [ `echo "${digit}" | grep 5` ]; then
count5=count5+1
fi
done < <(for n in {1..4321}; do echo ${n}; done)
echo "${count5}"
NOTE: there are a couple of performance-related issues with this method of coding, but since OP has explicitly asked to a) 'fix' the current code and b) not provide any shortcuts, we'll leave the performance fixes for another day.
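Another small change that keeps a pipeline, sketched here with a made-up function name count_fives: group the loop and the final echo in braces, so both run in the same subshell and the total is printed before that subshell exits. This sketch reads whole numbers rather than single characters and starts one tr process per number, so it is not meant to be fast.

```shell
# count_fives LIMIT: hypothetical helper that prints how many times the
# digit 5 occurs in the numbers 1..LIMIT.
count_fives() {
  seq "$1" | {
    count5=0
    while read -r n; do
      # Strip every non-5 character; what remains is one "5" per occurrence.
      fives=$(printf '%s' "$n" | tr -cd 5)
      count5=$((count5 + ${#fives}))
    done
    # Still inside the braces, so this runs in the same subshell as the loop.
    echo "$count5"
  }
}

count_fives 100   # prints 20
```

Calling count_fives 4321 should give the 1262 quoted in the question.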
A simpler way to get the number for a certain n would be
nx=${n//[^5]/} # Remove all non-5 characters
count5=${#nx} # Calculate the length of what is left
A simpler method in pure bash could be:
printf -v seq '%s' {1..4321} # print the sequence into the variable seq
fives=${seq//[!5]} # delete all characters but 5s
count5=${#fives} # length of the string is the count of 5s
echo $count5 # print it
Or, using standard utilities tr and wc
printf '%s' {1..4321} | tr -dc 5 | wc -c
Or using awk:
awk 'BEGIN { for(i=1;i<=4321;i++) {$0=i; x=x+gsub("5",""); } print x} '

Unix bash script grep loop counter (for)

I am looping over a grep result. The result contains 10 lines (every line has different content), so the body of the loop gets executed 10 times.
I need to get the index, 0-9, on each iteration so I can perform actions based on the index.
ABC=(cat test.log | grep "stuff")
counter=0
for x in $ABC
do
echo $x
((counter++))
echo "COUNTER $counter"
done
Currently the counter won't really change.
Output:
51209
120049
148480
1211441
373948
0
0
0
728304
0
COUNTER: 1
If your requirement is only to print the counter (as in the samples shown), you could do this in a single awk command, if awk is acceptable to you. There is no need to store the grep output in a variable first: awk can perform both the search and the counting in one pass.
awk -v counter=0 '/stuff/{print "counter=" counter++}' Input_file
Replace stuff with the actual string you are looking for, and Input_file with your actual file name.
This should print something like:
counter=0
counter=1
........and so on
Your shell script contains what should be an obvious syntax error.
ABC=(cat test.log | grep "stuff")
This fails with
-bash: syntax error near unexpected token `|'
There is no need to save the output in a variable if you only want to process one at a time (and obviously no need for the useless cat).
grep "stuff" test.log | nl
gets you numbered lines, though the index will be 1-based, not zero-based.
If you absolutely need zero-based, refactoring to Awk should solve it easily:
awk '/stuff/ { print n++, $0 }' test.log
If you want to loop over this and do something more with this information,
awk '/stuff/ { print n++, $0 }' test.log |
while read -r index output; do
echo index is "$index"
echo output is "$output"
done
Because the while loop executes in a subshell the value of index will not be visible outside of the loop. (I guess that's what your real code did with the counter as well. I don't think that part of the code you posted will repro either.)
Do not store the result of grep in a scalar variable $ABC.
If a line of the log file contains whitespace, the unquoted variable is split on it by bash's word splitting,
so $x receives individual words rather than whole lines.
(BTW the statement ABC=(cat test.log | grep "stuff") causes a syntax error.)
Please try something like:
readarray -t abc < <(grep "stuff" test.log)
for x in "${abc[@]}"
do
echo "$x"
echo "COUNTER $((++counter))"
done
or
readarray -t abc < <(grep "stuff" test.log)
for i in "${!abc[@]}"
do
echo "${abc[i]}"
echo "COUNTER $((i + 1))"
done
You can increment the counter with the following statement:
counter=$(( $counter + 1));
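If the index is only needed inside the loop, the subshell created by the pipeline is not a problem. A minimal sketch, where index_matches is a hypothetical name and the case branches stand in for whatever per-index actions are needed:

```shell
# index_matches PATTERN FILE: hypothetical helper that walks the matching
# lines of FILE with a zero-based index.
index_matches() {
  index=0
  grep -- "$1" "$2" | while IFS= read -r line; do
    case $index in
      0) echo "first match: $line" ;;
      *) echo "match $index: $line" ;;
    esac
    index=$((index + 1))
  done
}
```

For the question's data this would be called as index_matches stuff test.log. Reading with IFS= read -r keeps each line whole, including any spaces.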

Counting number of delimiters of special character bash shell script Performance improvement

Hi, I have a script that counts the number of records in a file and finds the expected number of delimiters per record by dividing rs_count (the total delimiter count) by the total record count. It works fine but it is a little slow on large records. I was wondering if there is a way to improve performance. The RS is a special character, octal \246. I am using a bash shell script.
Some additional info:
A line is a record.
The file will always have the same number of delimiters.
The purpose of the script is to check if the file has the expected number of fields. After calculating it, the script just echos it out.
for file in $SOURCE; do
echo "executing File -"$file
if (( $total_record_count != 0 ));then
filename=$(basename "$file")
total_record_count=$(wc -l < $file)
rs_count=$(sed -n 'l' $file | grep -o $RS | wc -l)
Delimiter_per_record=$((rs_count/total_record_count))
fi
done
Counting the delimiters (not total records) in a file
On a file with 50,000 lines, I see roughly a tenfold speedup from replacing the sed, grep, and wc pipeline with a single awk process:
awk -v RS='Delimiter' 'END{print NR -1}' input_file
Dealing with wc when there's no trailing line breaks
If you count the instances of ^ (start of line), you will get a true count of lines. Using grep:
grep -co "^" input_file
(Thankfully, even though ^ is a regex, the performance of this is on par with wc)
Incorporating these two modifications into a trivial test based on your supplied code:
#!/usr/bin/env bash
SOURCE="$1"
RS=$'\246'
for file in $SOURCE; do
echo "executing File -"$file
if [[ $total_record_count != 0 ]];then
filename=$(basename "$file")
total_record_count=$(grep -oc "^" $file)
rs_count="$(awk -v RS=$'\246' 'END{print NR -1}' $file)"
Delimiter_per_record=$((rs_count/total_record_count))
fi
done
echo -e "\$rs_count:\t${rs_count}\n\$Delimiter_per_record:\t${Delimiter_per_record}\n\$total_record_count:\t${total_record_count}" | column -t
Running this on a file with 50,000 lines on my macbook:
time ./recordtest.sh /tmp/randshort
executing File -/tmp/randshort
$rs_count: 186885
$Delimiter_per_record: 3
$total_record_count: 50000
real 0m0.064s
user 0m0.038s
sys 0m0.012s
Unit test one-liner
(creates /tmp/recordtest, chmod +x's it, creates /tmp/testfile with 10 lines of random characters including octal \246, and then runs the script file on the testfile)
echo $'#!/usr/bin/env bash\n\nSOURCE="$1"\nRS=$\'\\246\'\n\nfor file in $SOURCE; do\n echo "executing File -"$file\n if [[ $total_record_count != 0 ]];then\n filename=$(basename "$file")\n total_record_count=$(grep -oc "^" $file)\n rs_count="$(awk -v RS=$\'\\246\' \'END{print NR -1}\' $file)"\n Delimiter_per_record=$((rs_count/total_record_count))\n fi\ndone\n\necho -e "\\$rs_count:\\t${rs_count}\\n\\$Delimiter_per_record:\\t${Delimiter_per_record}\\n\\$total_record_count:\\t${total_record_count}" | column -t' > /tmp/recordtest ; echo $'\246459ca4f23bafff1c8fc017864aa3930c4a7f2918b\246753f00e5a9278375b\nb\246a3\246fc074b0e415f960e7099651abf369\246a6f\246f70263973e176572\2467355\n1590f285e076797aa83b2ee537c7f99\24666990bb60419b8aa\246bb5b6b\2467053\n89b938a5\246560a54f2826250a2c026c320302529331229255\246ef79fbb52c2\n9042\246bb\246b942408a22f912268ffc78f08c\2462798b0c05a75439\246245be2ea5\n0ef03170413f90e\246e0\246b1b2515c4\2466bf0a1bb\246ee28b78ccce70432e6b\24653\n51229e7ab228b4518404360b31a\2463673261e3242985bf24e59bc657\246999a\n9964\246b08\24640e63fae788ea\246a1777\2460e94f89af8b571e\246e1b53e6332\246c3\246e\n90\246ae12895f\24689885e\246e736f942080f267a275132a348ec1e837b99efe94\n2895e91\246\246f506f\246c1b986a63444b4258\246bc1b39182\24630\24696be' > /tmp/testfile ; chmod +x /tmp/recordtest ; /tmp/./recordtest /tmp/testfile
Which produces this result:
$rs_count: 39
$Delimiter_per_record: 3
$total_record_count: 10
Though there are a number of solutions for counting instances of characters in files, quite a few come undone when trying to process special characters like octal \246.
awk seems to handle it reliably and quickly.
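Since the stated purpose is to check that the file has the expected number of fields, a per-line check may be useful as well as the average. The following is only a sketch: check_delims is a hypothetical name, and it assumes that the first line carries the expected delimiter count.

```shell
# check_delims FILE: hypothetical helper. Reports the per-record count of
# byte \246, or every line whose count differs from that of line 1.
check_delims() {
  # Keep only delimiter bytes and newlines, so the length of each line is
  # its delimiter count. LC_ALL=C makes both tools work on bytes.
  LC_ALL=C tr -cd '\246\n' < "$1" |
    LC_ALL=C awk '
      { n = length($0) }
      NR == 1 { expected = n }
      n != expected {
        printf "line %d: %d delimiters, expected %d\n", NR, n, expected
        bad = 1
      }
      END {
        if (!bad) printf "%d records, %d delimiters each\n", NR, expected
        exit bad + 0
      }
    '
}
```

The exit status is 0 when every record agrees and 1 otherwise, so the function can be used directly in an if statement.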

How to pad out values line by line while maintaining overall record length in a Unix Shell script ksh

IFS=$'\n'
while read -r line
do
# header/trailer record
if echo ${line} | grep -e '000000000000000' -e '999999999999999' >/dev/null 2>&1
then
echo ${line} >> outfile.01.DAT.sampleNEW
elif echo ${line} | grep '+0' >/dev/null 2>&1
then
echo ${line} | sed -e 's/+/+00000000/; s/ X/X/' >> outfile.01.DAT.sampleNEW
else
echo ${line} | sed -e 's/-/-00000000/; s/ X/X/' >> outfile.01.DAT.sampleNEW
fi
done < Inputfile.01.DAT
I have a large file in which I need to pad out the signed amount fields while retaining the overall record length, so I have to remove some filler spaces at the end (each line ends with X). The file has a header and a trailer that must not change. I have come up with a way to do it, but it is very slow on a large input file. I suspect the use of grep here is part of the problem.
Sample records, each ending with X; overall length 107 bytes:
000000000000000PPPPPPPPP Information INV TRANSACTION 0120160505201605052154HI203.SEQ 01 X
000000000000001PPPPP14PA 000YYYYYY488 -0001235.2520150319 X
000000000000002PPPMS PA 000RRRRR4539 +0008285.0020160301 X
000000000000003PPPP506 000TTTTTT605 -0000225.0020150608 X
9999999999999990000000000000439.940000000079802782.180000005 X
I suspect you want something like this, but it is very hard to tell given the way you have presented your question:
awk '
/000000000000000/ || /999999999999999/ {print;next}
/\+0/ {sub(/\+0/,"+00000000"); sub(/        X/,"X"); print; next}
/\-0/ {sub(/\-0/,"-00000000"); sub(/        X/,"X"); print; next}
' Inputfile.01.DAT
That says... "if the line contains a string of 15 zeroes or 15 nines, print it and move to the next line. If the line contains +0, replace it with +00000000 and remove 8 spaces before the final X, then print. Likewise for -0."
You could also maybe use Perl, and do something like this:
perl -nle '/0{15}|9{15}/ && print; s/([+-])0/${1}00000000/ && s/        X/X/ && print' Inputfile.01.DAT
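A variant that does not depend on counting spaces by hand is sketched below. It rests on several assumptions that are not confirmed by the question: header and trailer records start with 15 zeros or 15 nines, each data record has one signed amount, and there is enough filler before the final X. The function name pad_amounts is made up. It measures how much each line grew and removes exactly that many filler spaces.

```shell
# pad_amounts: hypothetical helper; reads records from files or stdin.
pad_amounts() {
  awk '
    # Header and trailer records pass through unchanged.
    /^000000000000000/ || /^999999999999999/ { print; next }
    {
      width = length($0)
      # Insert eight zeros right after the sign of the amount.
      if (match($0, /[+-][0-9]/))
        $0 = substr($0, 1, RSTART) "00000000" substr($0, RSTART + 1)
      # Drop as many filler spaces before the final X as characters were added.
      extra = length($0) - width
      if (extra > 0 && match($0, / +X$/) && RLENGTH - 1 >= extra)
        $0 = substr($0, 1, length($0) - 1 - extra) "X"
      print
    }
  ' "$@"
}
```

It could be run as pad_amounts Inputfile.01.DAT > outfile.01.DAT.sampleNEW. A record with too little filler is printed longer than the original, so checking the output line lengths afterwards is advisable.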

How to add multiple line of output one by one to a variable in Bash?

This might be a very basic question, but I was not able to find a solution. I have a script:
If I run w | awk '{print $1}' in command line in my server I get:
f931
smk591
sc271
bx972
gaw844
mbihk988
laid640
smk59
ycc951
Now I need to use this list in my bash script, processing the names one by one. I need to check each user's groups and print the users who are in a specific group. The command to check a user's groups is id username. How can I save the names, or iterate over them one by one in a loop?
What I have so far is
tmp=$(w | awk '{print $1}')
But it only returns the first record! I'd appreciate any help.
Populate an array with the output of the command:
$ tmp=( $(printf "a\nb\nc\n") )
$ echo "${tmp[0]}"
a
$ echo "${tmp[1]}"
b
$ echo "${tmp[2]}"
c
Replace the printf with your command (i.e. tmp=( $(w | awk '{print $1}') )) and see man bash for how to work with bash arrays.
For a lengthier, more robust and complete example:
$ cat ./tstarrays.sh
# saving multi-line awk output in a bash array, one element per line
# See http://www.thegeekstuff.com/2010/06/bash-array-tutorial/ for
# more operations you can perform on an array and its elements.
oSET="$-"; set -f # save original set flags and turn off globbing
oIFS="$IFS"; IFS=$'\n' # save original IFS and make IFS a newline
array=( $(
awk 'BEGIN{
print "the quick brown"
print " fox jumped\tover\tthe"
print "lazy dogs back "
}'
) )
IFS="$oIFS" # restore original IFS value
set +f -$oSET # restore original set flags
for (( i=0; i < ${#array[@]}; i++ ));
do
printf "array[%d] of length=%d: \"%s\"\n" "$i" "${#array[$i]}" "${array[$i]}"
done
printf -- "----------\n"
printf -- "array[@]=\n\"%s\"\n" "${array[@]}"
printf -- "----------\n"
printf -- "array[*]=\n\"%s\"\n" "${array[*]}"
.
$ ./tstarrays.sh
array[0] of length=22: "the quick brown"
array[1] of length=23: " fox jumped over the"
array[2] of length=21: "lazy dogs back "
----------
array[@]=
"the quick brown"
array[@]=
" fox jumped over the"
array[@]=
"lazy dogs back "
----------
array[*]=
"the quick brown fox jumped over the lazy dogs back "
A couple of non-obvious key points to make sure your array gets populated with exactly what your command outputs:
If your command output can contain globbing characters then you should disable globbing before the command (oSET="$-"; set -f) and re-enable it afterwards (set +f -$oSET).
If your command output can contain spaces then set IFS to a newline before the command (oIFS="$IFS"; IFS=$'\n') and set it back to its old value after the command (IFS="$oIFS").
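On bash 4 or newer, readarray can fill the array without changing IFS or the globbing flag, because it applies neither word splitting nor globbing. A minimal sketch, with printf standing in for the real command:

```shell
#!/usr/bin/env bash
# Requires bash 4 or newer. readarray (also spelled mapfile) stores each input
# line as one element; -t strips the trailing newline.
readarray -t lines < <(printf 'the quick  brown\n * fox\tjumped\n')

printf 'read %d lines\n' "${#lines[@]}"
printf '[%s]\n' "${lines[@]}"
```

The second test line contains both a glob character and a tab, and both arrive in the array unchanged.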
tmp=$(w | awk '{print $1}')
while read i
do
echo "$i"
done <<< "$tmp"
You can use a for loop, i.e.
for user in $(w | awk '{print $1}'); do echo $user; done
which in a script would look nicer as:
for user in $(w | awk '{print $1}')
do
echo $user
done
You can use the xargs command to do this:
w | awk '{print $1}' | xargs -I '{}' id '{}'
With the -I switch, xargs takes each line of its standard input separately, then constructs and executes a command line by replacing the specified string '{}' in the command line template with the input line.
I guess you should use who instead of w, since w also prints header lines that would end up in the list. Try this out:
who | awk '{print $1}' | xargs -n 1 id
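To go one step further and print only the users who belong to a specific group, as the question asks, something like the following sketch might do. The name users_in_group is hypothetical, and the sketch assumes that group names contain no spaces.

```shell
# users_in_group GROUP: hypothetical helper. Reads user names on stdin, one
# per line, and prints those that belong to GROUP.
users_in_group() {
  group=$1
  sort -u | while IFS= read -r user; do
    # id -nG lists the names of all groups of the user, separated by spaces.
    if id -nG "$user" 2>/dev/null | tr ' ' '\n' | grep -Fqx -- "$group"; then
      printf '%s\n' "$user"
    fi
  done
}
```

For the logged-in users it could be called as who | awk '{print $1}' | users_in_group staff, where staff is a placeholder group name. The sort -u removes users who are logged in more than once.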
