Bash script to process irregular text, count occurrences, cut at a threshold

I have a large sample of fairly irregular text. I'd like to tokenize it into single words, count the occurrences of each word, and output only the words whose occurrence count is greater than a threshold value.
if [ $# -ne 3 ]; then
echo 'Usage <file> <output_file> <threshold>'
exit 1
fi
clean_and_rank () {
tr -dc [:graph:][:cntrl:][:space:] < $1 \
| tr -d [:punct:] \
| tr -s ' ' \
| tr ' ' '\n' \
| tr '[A-Z]' '[a-z]' \
| grep -v '^$' \
| sort \
| uniq -c \
| sort -nr
}
cut_below_threshold () {
$THRESHOLD=$1
awk '$1 > '$THRESHOLD' { print $1, $2 }'
}
clean_and_rank $1 \
| cut_below_threshold $3
| sort -nr > $2
but for some reason I'm running into trouble with the cut_below_threshold() function --
Also, once I get this done, I want to be able to compare it to another sample (my data is two samples of several rows of labeled text snippets, and I want to score words independently for prevalence in sample A vs. sample B).
Is there a better way to go about this? Ultimately, I'm looking for insights along the lines of "$WORD appears in sample 1 1000 times out of 100000 total words, and in sample 2 100 times out of 10000 words".
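For what it's worth, the immediate problem in cut_below_threshold is the assignment: $THRESHOLD=$1 expands THRESHOLD before assigning, which bash rejects (and note the pipeline as shown is missing a line continuation after cut_below_threshold $3). A minimal corrected sketch, passing the threshold into awk with -v instead of splicing it into the program text:
cut_below_threshold () {
    local threshold=$1
    # -v makes the shell value available inside awk as t
    awk -v t="$threshold" '$1 > t { print $1, $2 }'
}

clean_and_rank "$1" \
    | cut_below_threshold "$3" \
    | sort -nr > "$2"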

I assume you are able to get statistics for two text files in the following format:
$ cat a.txt
5 word1
3 word2
1 word3
$ cat b.txt
4 word1
3 word2
1 word4
Then this script does the job of comparison:
#!/bin/sh
# the 1st argument passed to the script, the 1st file to compare (statistics for sample A)
STATA=$1
# the 2nd argument -- the 2nd file (statistics for sample B)
STATB=$2
# concatenate both files and pipe it to the next command
cat ${STATA} ${STATB} |
# call awk; -v is awk option to set a variable
# n1=$() variable n1 gets its value from the output of the command in ()
# wc -l <file counts number of lines in the file
# ' starts awk script
awk -v n1=$(wc -l <${STATA}) '
# (){} means when condition in () is true, execute statement in {}
# NR is number of records processed thus far (usually this is number of lines)
# (NR <= n1) essentially means 'reading statistics file for sample A'
# {1; 2} two statements
# wa += $1 add value of the first field to the wa variable
# each line is split by the field separator (space or tab by default) into several fields:
# $1 is the 1st field, $2 is the 2nd, $NF is the last one, $0 is a whole line
# $1 in this case is number of occurrences of a word
# awk variables have zero default value; no need to specify them explicitly
# cnta[] is an associative array -- index is a string (the word in this case)
# $2 in this case is the word
(NR <= n1){wa += $1; cnta[$2] = $1}
# the same for statistics for sample B
(NR > n1){wb += $1; cntb[$2] = $1}
# END{} to execute statements after there's no input left
END {
print "nof words in sample A = " wa;
print "nof words in sample B = " wb;
# standard printf to output a table header
printf "%-15s %5s %8s %5s %8s\n", "word", "cntA", "freqA", "cntB", "freqB";
# iterate over each element (the word) in the count array A
for (w in cnta){
# check that the word is present in the count array B
if (cntb[w] > 0) {
# output statistics in a table form
printf "%-15s %5d %8.6f %5d %8.6f\n", w, cnta[w], cnta[w] / wa, cntb[w], cntb[w]/wb
}
}
}
'
Test run:
$ ./compare.sh a.txt b.txt
nof words in sample A = 9
nof words in sample B = 8
word cntA freqA cntB freqB
word1 5 0.555556 4 0.500000
word2 3 0.333333 3 0.375000
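To produce those statistics files from raw text, the clean_and_rank function from the question can be reused (sourced or pasted into the shell first; sampleA.txt and sampleB.txt are hypothetical names):
clean_and_rank sampleA.txt > a.txt
clean_and_rank sampleB.txt > b.txt
./compare.sh a.txt b.txt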

Let bash do most of your work with associative arrays. This is not a rigorous example; finishing it is left as an exercise (a sketch of one way to finish it follows the loop below):
declare -A ct
exec 3< file
while IFS= read -u3 line ; do
set -- $line
for tkn ; do
cct=${ct[$tkn]}
ct[$tkn]=$(( ${cct:-0} + 1 ))
done
done
for tkn in ${!ct[*]}
do echo $tkn ${ct[$tkn]} ; done
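A hedged sketch of one way to finish the exercise, adding lower-casing, punctuation stripping, the threshold filter, and sorting by count (assuming bash 4+, with the input file and threshold passed as $1 and $2):
#!/bin/bash
declare -A ct
while IFS= read -r line; do
    line=${line,,}                          # lower-case the line
    line=${line//[![:alnum:][:space:]]/}    # strip anything that is not alphanumeric or whitespace
    for tkn in $line; do                    # rely on word splitting to tokenize
        ct[$tkn]=$(( ${ct[$tkn]:-0} + 1 ))
    done
done < "$1"

threshold=$2
for tkn in "${!ct[@]}"; do
    (( ct[$tkn] > threshold )) && printf '%s %s\n' "${ct[$tkn]}" "$tkn"
done | sort -nr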

Related

Bash code to count a-to-z characters in an input string and give output in a new string

Hi folks, I just started writing code in bash but I'm stuck on my own question. I want to count the repetitions of each a-z letter in a string, like:
read -p 'enter the string' str # SUPPOSE THE STRING THE USER ENTERS IS
input = AAASSSSFFRREEE
I want to get output like: A3S4F2R2E3
What can I do? I am trying with a for loop and this program is making a monkey out of me.
grep -o, --only-matching
Print only the matched (non-empty) parts of a matching line,
with each such part on a separate output line.
Using . as pattern for every character
grep -o . <<<"AASS"
A
A
S
S
uniq -c, --count
prefix lines by the number of occurrences
grep -o . <<<"AASS"|uniq -c
2 A
2 S
awk '{printf "%s%d", $2,$1}'
Reorder the output. Character '$2' first then number '$1'.
Print all in a single line
grep -o . <<<"AASS"|uniq -c|awk '{printf "%s%d", $2,$1}END{print ""}'
A2S2
Unsorted alphabet
$ string="AAASSSSFFRREEE"
$ grep -o . <<<"$string"|uniq -c|awk '{printf "%s%d", $2,$1}END{print ""}'
A3S4F2R2E3
$ string="AAASSSSFFRREEEAAAAFFFFFEE"
A3S4F2R2E3A4F5E2
Sorted alphabet
$ string="AAASSSSFFRREEE"
$ grep -o . <<<"$string"|sort|uniq -c|awk '{printf "%s%d", $2,$1}END{print ""}'
A3E3F2R2S4
$ string="AAASSSSFFRREEEAAAAFFFFFEE"
A7E5F7R2S4
One solution could be to loop over the string and strip away the first characters while counting them.
Example:
#!/bin/bash
IFS='' read -rp 'enter the string: ' str
res=''
# loop while the string is not empty
while [[ -n $str ]]
do
# get the first character
ch=${str:0:1}
# create a regex matching the start of the string,
# on one or more of the first character
regex="^($ch+)"
# perform the regex matching
[[ "$str" =~ $regex ]]
# the matching string
match="${BASH_REMATCH[1]}"
# concatenate the character + the count to the result
res="$res$ch${#match}"
# remove the matching characters from the string
str=${str#"$match"}
done
# print the result
printf "%s\n" "$res"
Examples:
Input: AAASSSSFFRREEE
Output: A3S4F2R2E3
Input: AAASSSSFFRREEEAAAFFFFFEE
Output: A3S4F2R2E3A3F5E2
Using any awk:
$ cat tst.awk
{
$0 = toupper($0)
while ( char = substr($0,1,1) ) {
printf "%s%d", char, gsub(char,"")
}
print ""
}
$ echo 'AAASSSSFFRREEE' | awk -f tst.awk
A3S4F2R2E3
The above assumes that the input doesn't contain any regexp metachars (you said it's to work on letters, so that's fine) and that you want the following output if the letters aren't all contiguous (you didn't answer @ArnaudValmary's question about that):
$ echo 'AAASSSSFFRREEEAAAFFFFFEE' | awk -f tst.awk
A6S4F7R2E5
Here's an awk-based solution that takes any unsorted input and generates sorted output:
echo 'ZZZCGGGGAAASSSWSFXXXRRRYYYUUUUYYFRREEEAAAAJBBBQQQAAQQQBFFFBFFPPEE' |
{m,n,g}awk '{
for(_= ( ___=(_^=_+=_^=_<_)^--_) + \
_^((____="")*(___+=_^_)); _<___; _++) {
__ = sprintf("%c",_) # if you're not using nawk, then
# combine this with sub() below
sub("$", (__) gsub(__,""), ____)
}
gsub("[A-Z]0","", ____)
print ____
}'
A9B5C1E5F7G4J1P2Q6R5S4U4W1X3Y5Z3

How can I iterate through text two lines at a time?

I have a text file that I would like to go through and count every occurrence of each succession of two words. For example, my desired output would look like this:
Sample input:
I am a man
desired output:
1 I am
1 am a
1 a man
This is how I thought about doing it:
cat $1 | sed "s/ /\n/g" | read word1 &&
while read word2;
do
echo "$word1 $word2";
word1=word2;
done
This results in an infinite loop though. Any help appreciated!
Call read twice in the while condition.
while read line1; read line2; do
echo "$line1 $line2"
done <<EOF
1
a
2
b
EOF
will output
1 a
2 b
The loop exits when the second read fails, even if the first succeeds. If you want to execute the body even then (with an empty line2), move read line2 into the body of the loop, as sketched below.
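A minimal sketch of that variant, which still prints line1 (with an empty line2) when the input has an odd number of lines:
while read line1; do
read line2
echo "$line1 $line2"
done <<EOF
1
a
2
EOF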
Assumptions:
counts are accumulated across the entire file (as opposed to restarting the counts for each new line)
word pairs can span lines, eg, one\nword is the same as one word
we're only interested in 2-word pairings, ie, no need to code for a dynamic number of words (eg, 3-words, 4-words)
Sample input data:
$ cat words.dat
I am a man
I am not a man I
am a man
One awk idea:
$ awk -v RS='' ' # treat file as one loooong single record
{ for (i=1;i<NF;i++) # loop through list of fields 1 - (NF-1)
count[$(i)" "$(i+1)]++ # use field i and i+1 as array index
}
END { for (i in count) # loop through array indices
print count[i],i
}
' words.dat
This generates:
2 am a
3 a man
1 am not
3 I am
1 not a
2 man I
NOTE: no sorting requirement was stated; otherwise we could pipe the result to sort, or, if using GNU awk, add an appropriate PROCINFO["sorted_in"] statement, for example:
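A sketch (GNU awk only), replacing the plain END block above so the pairs come out with the highest counts first:
END { PROCINFO["sorted_in"] = "@val_num_desc"   # traverse count[] by value, numerically, descending
      for (i in count)
          print count[i], i
    }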
OP's original input:
$ awk -v RS='' '{for (i=1;i<NF;i++) count[$(i)" "$(i+1)]++} END {for (i in count) print count[i],i}' <<< "I am a man"
1 am a
1 a man
1 I am
Removing the assumption that we only need 2-word pairings (i.e., supporting a dynamic number of words) ...
$ awk -v wcnt=2 -v RS='' ' # <word_count> = 2; treat file as one loooong single record
NF>=wcnt { for (i=1;i<=(NF-wcnt+1);i++) { # loop through list of fields 1 - (NF-<word_count>+1)
pfx=key=""
for (j=0;j<wcnt;j++) { # build count[] index from <word_count> fields
key=key pfx $(j+i)
pfx=" "
}
count[key]++
}
}
END { for (i in count) # loop through array indices
print count[i],i
}
' words.dat
With -v wcnt=2:
2 am a
3 a man
1 am not
3 I am
1 not a
2 man I
With -v wcnt=3:
1 not a man
2 I am a
1 I am not
2 man I am
2 am a man
2 a man I
1 am not a
With -v wcnt=5:
1 I am a man I
1 I am not a man
1 am not a man I
1 am a man I am
1 man I am a man
1 man I am not a
1 a man I am not
1 not a man I am
1 a man I am a
With -v wcnt=3 and awk '...' <<< "I am a man":
1 I am a
1 am a man
With -v wcnt=5 and awk '...' <<< "I am a man":
# no output since less than wcnt=5 words to work with
With bash:
set -f # for slurping in the words of the file, we want word splitting
# but not glob expansion
words=( $(< "$1") )
for ((i = 1; i < ${#words[@]}; i++)); do
printf "%s %s\n" "${words[i-1]}" "${words[i]}"
done
Given @chepner's input file, this outputs
1 a
a 2
2 b
A rewrite of your code: you need a grouping construct so that all the reads are reading from the same pipeline of data.
tr -s '[:space:]' '\n' < "$1" | {
IFS= read -r word1
while IFS= read -r word2; do
echo "$word1 $word2"
word1=$word2
done
}
For counting, the simplest method is to pipe the output into | sort | uniq -c
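For example, the complete pipeline might look like this (a sketch that simply appends the counting step to the rewrite above):
tr -s '[:space:]' '\n' < words.dat | {
IFS= read -r word1
while IFS= read -r word2; do
echo "$word1 $word2"
word1=$word2
done
} | sort | uniq -c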
With the words.dat file from @markp-fuso, output from both these solutions is
3 I am
3 a man
2 am a
1 am not
2 man I
1 not a
The counting can be done in bash using an associative array:
declare -A pairs
for ((i = 1; i < ${#words[@]}; i++)); do
key="${words[i-1]} ${words[i]}"
pairs[$key]=$(( pairs[$key] + 1 ))
done
for key in "${!pairs[@]}"; do
printf "%7d %s\n" "${pairs[$key]}" "$key"
done
1 not a
3 a man
1 am not
2 am a
3 I am
2 man I

Foreach command with table [duplicate]

I have a huge tab-separated file formatted like this
X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
I would like to transpose it in an efficient way using only bash commands (I could write a Perl script of ten or so lines to do that, but it would probably be slower to execute than native bash functions). So the output should look like
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
I thought of a solution like this
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do cut -f $i input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done
But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)
awk '
{
for (i=1; i<=NF; i++) {
a[NR,i] = $i
}
}
NF>p { p = NF }
END {
for(j=1; j<=p; j++) {
str=a[1,j]
for(i=2; i<=NR; i++){
str=str" "a[i,j];
}
print str
}
}' file
output
$ more file
0 1 2
3 4 5
6 7 8
9 10 11
$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11
Performance against the Perl solution by Jonathan on a 10,000-line file:
$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2
$ wc -l < file
10000
$ time perl test.pl file >/dev/null
real 0m0.480s
user 0m0.442s
sys 0m0.026s
$ time awk -f test.awk file >/dev/null
real 0m0.382s
user 0m0.367s
sys 0m0.011s
$ time perl test.pl file >/dev/null
real 0m0.481s
user 0m0.431s
sys 0m0.022s
$ time awk -f test.awk file >/dev/null
real 0m0.390s
user 0m0.370s
sys 0m0.010s
EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).
Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP had originally asked for, so it handles empty fields, and it coincidentally pretties up the output a bit for this particular case.
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
for (rowNr=1;rowNr<=NF;rowNr++) {
cell[rowNr,NR] = $rowNr
}
maxRows = (NF > maxRows ? NF : maxRows)
maxCols = NR
}
END {
for (rowNr=1;rowNr<=maxRows;rowNr++) {
for (colNr=1;colNr<=maxCols;colNr++) {
printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
The above solutions will work in any awk (except old, broken awk of course - there YMMV).
The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
print ""
if (ARGIND < NF) {
ARGV[ARGC] = FILENAME
ARGC++
}
}
$ awk -f tst.awk file
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
which uses almost no memory but reads the input file once per field on a line, so it will be much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on each line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END, as sketched below.
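An untested sketch of that portable variant (same assumption of a constant number of fields per line; the end of each pass is handled when FNR==1 fires on the next pass, instead of in ENDFILE):
BEGIN { FS=OFS="\t" }
FNR==1 {
    if (NR > 1) print ""        # finish the row written by the previous pass
    colNr++                     # the column this pass will print
    if (colNr < NF) {           # queue one more pass over the same file
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
{ printf "%s%s", (FNR>1 ? OFS : ""), $colNr }
END { print "" }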
awk
Gawk version which uses arrays of arrays:
tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}
Plain awk version which uses multidimensional arrays (this was about twice as slow in my benchmark):
tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}
macOS comes with a version of Brian Kernighan's nawk from 2007 which doesn't support arrays of arrays.
To use space as a separator without collapsing sequences of multiple spaces, use FS='[ ]'.
rs
rs is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.
Use sequences of spaces and tabs as column separator:
rs -T
Use tab as column separator:
rs -c -C -T
Use comma as column separator:
rs -c, -C, -T
-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.
Do not use -t instead of -T, because it automatically selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).
When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:
$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4
rs -T determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:
$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,
R
The t function transposes a matrix or dataframe:
Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'
If you replace Rscript -e with R -e, then it echoes the code that is being run to STDOUT, and it also results in the error ignoring SIGPIPE signal if the R command is followed by a command like head -n1 which exits before it has read the whole STDIN.
quote="" can be removed if the input doesn't contain double quotes or single quotes, and comment.char="" can be removed if the input doesn't contain lines that start with a hash character.
For a big input file, fread and fwrite from data.table are faster than read.table and write.table:
$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m1.061s
user 0m0.983s
sys 0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real 0m0.599s
user 0m0.535s
sys 0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>t/b
x being coerced from class: matrix to data.table
real 0m0.375s
user 0m0.296s
sys 0m0.073s
jq
tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}
jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.
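Usage example, with a comma as the separator (same input as the rs example above):
$ seq 4|paste -d, - -|tp ,
1,3
2,4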
Ruby
ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'
The -1 argument to split disables discarding empty fields at the end:
$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]
Function form:
$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4
The function above uses s==" "?/ /:s because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:
$ ruby -e'p" a \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]
A Python solution:
python -c "import sys; print('\n'.join(' '.join(c) for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip()))))" < input > output
The above is based on the following:
import sys
for c in zip(*(l.split() for l in sys.stdin.readlines() if l.strip())):
print(' '.join(c))
This code does assume that every line has the same number of columns (no padding is performed).
Have a look at GNU datamash which can be used like datamash transpose.
A future version will also support cross tabulation (pivot tables)
Here is how you would do it with space separated columns:
datamash transpose -t ' ' < file > transposed_file
The transpose project on SourceForge is a coreutils-like C program for exactly that.
gcc transpose.c -o transpose
./transpose -t input > output #works with stdin, too.
Pure BASH, no additional process. A nice exercise:
declare -a array=( ) # we build a 1-D-array
read -a line < "$1" # read the headline
COLS=${#line[@]} # save number of columns
index=0
while read -a line ; do
for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
array[$index]=${line[$COUNTER]}
((index++))
done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
printf "%s\t" ${array[$COUNTER]}
done
printf "\n"
done
GNU datamash is perfectly suited for this problem with only one line of code and potentially arbitrarily large filesize!
datamash -W transpose infile > outfile
There is a purpose built utility for this,
GNU datamash utility
apt install datamash
datamash transpose < yourfile
Taken from this site, https://www.gnu.org/software/datamash/ and http://www.thelinuxrain.com/articles/transposing-rows-and-columns-3-methods
Here is a moderately solid Perl script to do the job. There are many structural analogies with @ghostdog74's awk solution.
#!/bin/perl -w
#
# SO 1729824
use strict;
my(%data); # main storage
my($maxcol) = 0;
my($rownum) = 0;
while (<>)
{
my(@row) = split /\s+/;
my($colnum) = 0;
foreach my $val (@row)
{
$data{$rownum}{$colnum++} = $val;
}
$rownum++;
$maxcol = $colnum if $colnum > $maxcol;
}
my $maxrow = $rownum;
for (my $col = 0; $col < $maxcol; $col++)
{
for (my $row = 0; $row < $maxrow; $row++)
{
printf "%s%s", ($row == 0) ? "" : "\t",
defined $data{$row}{$col} ? $data{$row}{$col} : "";
}
print "\n";
}
With the sample data size, the performance difference between perl and awk was negligible (1 millisecond out of 7 total). With a larger data set (100x100 matrix, entries 6-8 characters each), perl slightly outperformed awk - 0.026s vs 0.042s. Neither is likely to be a problem.
Representative timings for Perl 5.10.1 (32-bit) vs awk (version 20040207 when given '-V') vs gawk 3.1.7 (32-bit) on MacOS X 10.5.8 on a file containing 10,000 lines with 5 columns per line:
Osiris JL: time gawk -f tr.awk xxx > /dev/null
real 0m0.367s
user 0m0.279s
sys 0m0.085s
Osiris JL: time perl -f transpose.pl xxx > /dev/null
real 0m0.138s
user 0m0.128s
sys 0m0.008s
Osiris JL: time awk -f tr.awk xxx > /dev/null
real 0m1.891s
user 0m0.924s
sys 0m0.961s
Osiris-2 JL:
Note that gawk is vastly faster than awk on this machine, but still slower than perl. Clearly, your mileage will vary.
Assuming all your rows have the same number of fields, this awk program solves the problem:
{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}
In words, as you loop over the rows, for every field f grow a ':'-separated string col[f] containing the elements of that field. After you are done with all the rows, print each one of those strings in a separate line. You can then substitute ':' for the separator you want (say, a space) by piping the output through tr ':' ' '.
Example:
$ echo "1 2 3\n4 5 6"
1 2 3
4 5 6
$ echo "1 2 3\n4 5 6" | awk '{for (f=1;f<=NF;f++) col[f] = col[f]":"$f} END {for (f=1;f<=NF;f++) print col[f]}' | tr ':' ' '
1 4
2 5
3 6
If you have sc installed, you can do:
psc -r < inputfile | sc -W% - > outputfile
I normally use this little awk snippet for this requirement:
awk '{for (i=1; i<=NF; i++) a[i,NR]=$i
max=(max<NF?NF:max)}
END {for (i=1; i<=max; i++)
{for (j=1; j<=NR; j++)
printf "%s%s", a[i,j], (j==NR?RS:FS)
}
}' file
This just loads all the data into a bidimensional array a[line,column] and then prints it back as a[column,line], so that it transposes the given input.
This needs to keep track of the maximum amount of columns the initial file has, so that it is used as the number of rows to print back.
A hackish Perl solution could look like this. It's nice because it doesn't load the whole file into memory; it writes intermediate temp files and then uses the all-wonderful paste
#!/usr/bin/perl
use warnings;
use strict;
my $counter;
open INPUT, "<$ARGV[0]" or die ("Unable to open input file!");
while (my $line = <INPUT>) {
chomp $line;
my @array = split ("\t",$line);
open OUTPUT, ">temp$." or die ("unable to open output file!");
print OUTPUT join ("\n",@array);
close OUTPUT;
$counter=$.;
}
close INPUT;
# paste files together
my $execute = "paste ";
foreach (1..$counter) {
$execute.="temp$counter ";
}
$execute.="> $ARGV[1]";
system $execute;
The only improvement I can see to your own example is using awk which will reduce the number of processes that are run and the amount of data that is piped between them:
/bin/rm output 2> /dev/null
cols=`head -n 1 input | wc -w`
for (( i=1; i <= $cols; i++))
do
awk '{printf ("%s%s", tab, $'$i'); tab="\t"} END {print ""}' input
done >> output
Some *nix standard util one-liners, no temp files needed. NB: the OP wanted an efficient fix (i.e. faster), and the top answers are usually faster than this answer. These one-liners are for those who like *nix software tools, for whatever reason. In rare cases (e.g. scarce IO & memory), these snippets can actually be faster than some of the top answers.
Call the input file foo.
If we know foo has four columns:
for f in 1 2 3 4 ; do cut -d ' ' -f $f foo | xargs echo ; done
If we don't know how many columns foo has:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n) ; do cut -d ' ' -f $f foo | xargs echo ; done
xargs has a size limit and would therefore do incomplete work on a long file. The size limit is system dependent, e.g.:
{ timeout '.01' xargs --show-limits ; } 2>&1 | grep Max
Maximum length of command we could actually use: 2088944
tr & echo:
for f in 1 2 3 4; do cut -d ' ' -f $f foo | tr '\n' ' ' ; echo; done
...or if the number of columns is unknown:
n=$(head -n 1 foo | wc -w)
for f in $(seq 1 $n); do
cut -d ' ' -f $f foo | tr '\n' ' ' ; echo
done
Using set, which like xargs, has similar command line size based limitations:
for f in 1 2 3 4 ; do set - $(cut -d ' ' -f $f foo) ; echo $@ ; done
I used fgm's solution (thanks fgm!), but needed to eliminate the tab characters at the end of each row, so modified the script thus:
#!/bin/bash
declare -a array=( ) # we build a 1-D-array
read -a line < "$1" # read the headline
COLS=${#line[@]} # save number of columns
index=0
while read -a line; do
for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
array[$index]=${line[$COUNTER]}
((index++))
done
done < "$1"
for (( ROW = 0; ROW < COLS; ROW++ )); do
for (( COUNTER = ROW; COUNTER < ${#array[@]}; COUNTER += COLS )); do
printf "%s" ${array[$COUNTER]}
if [ $COUNTER -lt $(( ${#array[@]} - $COLS )) ]
then
printf "\t"
fi
done
printf "\n"
done
I was just looking for a similar bash transpose, but with support for padding. Here is the script I wrote based on fgm's solution, which seems to work. I hope it can be of help...
#!/bin/bash
declare -a array=( ) # we build a 1-D-array
declare -a ncols=( ) # we build a 1-D-array containing number of elements of each row
SEPARATOR="\t";
PADDING="";
MAXROWS=0;
index=0
indexCol=0
while read -a line; do
ncols[$indexCol]=${#line[@]};
((indexCol++))
if [ ${#line[@]} -gt ${MAXROWS} ]
then
MAXROWS=${#line[@]}
fi
for (( COUNTER=0; COUNTER<${#line[@]}; COUNTER++ )); do
array[$index]=${line[$COUNTER]}
((index++))
done
done < "$1"
for (( ROW = 0; ROW < MAXROWS; ROW++ )); do
COUNTER=$ROW;
for (( indexCol=0; indexCol < ${#ncols[@]}; indexCol++ )); do
if [ $ROW -ge ${ncols[indexCol]} ]
then
printf $PADDING
else
printf "%s" ${array[$COUNTER]}
fi
if [ $((indexCol+1)) -lt ${#ncols[@]} ]
then
printf $SEPARATOR
fi
COUNTER=$(( COUNTER + ncols[indexCol] ))
done
printf "\n"
done
I was looking for a solution to transpose any kind of matrix (nxn or mxn) with any kind of data (numbers or data) and got the following solution:
Row2Trans=number1
Col2Trans=number2
for ((i=1; $i <= Row2Trans; i++));do
for ((j=1; $j <=Col2Trans ; j++));do
awk -v var1="$i" -v var2="$j" 'BEGIN { FS = "," } ; NR==var1 {print $((var2)) }' $ARCHIVO >> Column_$i
done
done
paste -d',' `ls -mv Column_* | sed 's/,//g'` >> $ARCHIVO
If you only want to grab a single (comma delimited) line $N out of a file and turn it into a column:
head -$N file | tail -1 | tr ',' '\n'
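For example, with a hypothetical 3-column file and N=2:
$ printf '1,2,3\n4,5,6\n' > file
$ head -2 file | tail -1 | tr ',' '\n'
4
5
6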
Not very elegant, but this "single-line" command solves the problem quickly:
cols=4; for((i=1;i<=$cols;i++)); do \
awk '{print $'$i'}' input | tr '\n' ' '; echo; \
done
Here cols is the number of columns, where you can replace 4 by head -n 1 input | wc -w.
Another awk solution, limited only by the amount of memory you have (it stores the whole input):
awk '{ for (i=1; i<=NF; i++) RtoC[i]= (RtoC[i]? RtoC[i] FS $i: $i) }
END{ for (i in RtoC) print RtoC[i] }' infile
This joins the entries at the same field position together, and in END prints the result, which puts the first row in the first column, the second row in the second column, etc.
Will output:
X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11
#!/bin/bash
aline="$(head -n 1 file.txt)"
set -- $aline
colNum=$#
#set -x
while read line; do
set -- $line
for i in $(seq $colNum); do
eval col$i="\"\$col$i \$$i\""
done
done < file.txt
for i in $(seq $colNum); do
eval echo \${col$i}
done
another version with set eval
Here is a Bash one-liner that is based on simply converting each line to a column and paste-ing them together:
echo '' > tmp1; \
cat m.txt | while read l ; \
do paste tmp1 <(echo $l | tr -s ' ' \\n) > tmp2; \
cp tmp2 tmp1; \
done; \
cat tmp1
m.txt:
0 1 2
4 5 6
7 8 9
10 11 12
creates tmp1 file so it's not empty.
reads each line and transforms it into a column using tr
pastes the new column to the tmp1 file
copies result back into tmp1.
PS: I really wanted to use io-descriptors but couldn't get them to work.
Another bash variant
$ cat file
XXXX col1 col2 col3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11
Script
#!/bin/bash
I=0
while read line; do
i=0
for item in $line; { printf -v A$I[$i] $item; ((i++)); }
((I++))
done < file
indexes=$(seq 0 $i)
for i in $indexes; {
J=0
while ((J<I)); do
arr="A$J[$i]"
printf "${!arr}\t"
((J++))
done
echo
}
Output
$ ./test
XXXX row1 row2 row3 row4
col1 0 3 6 9
col2 1 4 7 10
col3 2 5 8 11
I'm a little late to the game but how about this:
cat table.tsv | python -c "import pandas as pd, sys; pd.read_csv(sys.stdin, sep='\t').T.to_csv(sys.stdout, sep='\t')"
or zcat if it's gzipped.
This is assuming you have pandas installed in your version of python
Here's a Haskell solution. When compiled with -O2, it runs slightly faster than ghostdog's awk and slightly slower than Stephan's thinly wrapped C Python on my machine for repeated "Hello world" input lines. Unfortunately GHC's support for passing command line code is non-existent as far as I can tell, so you will have to write it to a file yourself. It will truncate the rows to the length of the shortest row.
transpose :: [[a]] -> [[a]]
transpose = foldr (zipWith (:)) (repeat [])
main :: IO ()
main = interact $ unlines . map unwords . transpose . map words . lines
An awk solution that stores the whole array in memory:
awk '$0!~/^$/{ i++;
split($0,arr,FS);
for (j in arr) {
out[i,j]=arr[j];
if (maxr<j){ maxr=j} # max number of output rows.
}
}
END {
maxc=i # max number of output columns.
for (j=1; j<=maxr; j++) {
for (i=1; i<=maxc; i++) {
printf( "%s:", out[i,j])
}
printf( "%s\n","" )
}
}' infile
But we may "walk" the file as many times as output rows are needed:
#!/bin/bash
maxf="$(awk '{if (mf<NF); mf=NF}; END{print mf}' infile)"
rowcount=maxf
for (( i=1; i<=rowcount; i++ )); do
awk -v i="$i" -F " " '{printf("%s\t ", $i)}' infile
echo
done
Which (for a low count of output rows) is faster than the previous code.
A oneliner using R...
cat file | Rscript -e "d <- read.table(file('stdin'), sep=' ', row.names=1, header=T); write.table(t(d), file=stdout(), quote=F, col.names=NA) "
I've used the two scripts below to do similar operations before. The first is in awk, which is a lot faster than the second, which is in "pure" bash. You might be able to adapt them to your own application.
awk '
{
for (i = 1; i <= NF; i++) {
s[i] = s[i]?s[i] FS $i:$i
}
}
END {
for (i in s) {
print s[i]
}
}' file.txt
declare -a arr
while IFS= read -r line
do
i=0
for word in $line
do
[[ ${arr[$i]} ]] && arr[$i]="${arr[$i]} $word" || arr[$i]=$word
((i++))
done
done < file.txt
for ((i=0; i < ${#arr[@]}; i++))
do
echo ${arr[i]}
done
A simple 4-line answer, keeping it readable:
col="$(head -1 file.txt | wc -w)"
for i in $(seq 1 $col); do
awk '{ print $'$i' }' file.txt | paste -s -d "\t"
done

How to find the difference between the values of two fields from two files and print only if the difference is >10, using shell

Let's say I have two files a.txt and b.txt. The content of a.txt and b.txt is as follows:
a.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|10|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
abc|def|ghi|jfkdh|dfgj|hbkjdsf|ndf|11|0|cjhk|00|098r|908re|
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
So let's say these files have various fields separated by "|" and can have any number of lines. Also, assume that both files are sorted, so that we can match lines exactly between the two files. Now, I want to find the difference between fields 8 and 9 of each row of one file and the corresponding row of the other; if either difference is greater than 10, print the lines, otherwise remove the lines from the files.
I.e., in the given example, for the first line I take field no. 8, |10-11|, which is 1 in absolute value, and similarly field no. 9, |0-0|, which is 0; both differences are <10, so we delete this line from the files.
For the second line, the difference is |11-22| = 11, so we print this line (no need to check |19-18|, since if either of the field (8, 9) differences exceeds 10, we print such lines).
So the output is
a.txt:
dfbk|dfdag|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
b.txt:
dfbk|dfdag|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00
You can do this with awk:
awk -F\| 'FNR==NR{x[FNR]=$0;eight[FNR]=$8;nine[FNR]=$9;next} {d1=eight[FNR]-$8;d2=nine[FNR]-$9;if(d1>10||d1<-10||d2>10||d2<-10){print x[FNR] >> "newa";print $0 >> "newb"}}' a.txt b.txt
Explanation
The -F sets the field separator to the pipe symbol. The stuff in curly braces after FNR==NR applies only to the processing of a.txt. It says to save the whole line in array x[] indexed by line number (FNR) and also to save the eighth field in array eight[] also indexed by line number. Likewise field 9 is saved in array nine[].
The second set of curly braces applies to processing file b. It calculates the differences d1 and d2. If either exceeds 10, the line is printed to each of the files newa and newb.
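With the sample a.txt and b.txt from the question, only the second pair of lines differs by more than 10 (field 8: |10-11| = 1 vs. |11-22| = 11), so the result would be:
$ cat newa
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|11|19|fdgvdf|xz00|00|00
$ cat newb
dfbk|sgvfd|ZD|zdf|2df|3w43f|ZZewd|22|18|fdgvdf|xz00|00|00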
You can write a bash shell script that does it:
while true; do
read -r lineA <&3 || break
read -r lineB <&4 || break
vara_8=$(echo "$lineA" | cut -f8 -d "|")
varb_8=$(echo "$lineB" | cut -f8 -d "|")
vara_9=$(echo "$lineA" | cut -f9 -d "|")
varb_9=$(echo "$lineB" | cut -f9 -d "|")
if (( vara_8-varb_8 > 10 || vara_8-varb_8 < -10
|| vara_9-varb_9 > 10 || vara_9-varb_9 < -10 )); then
echo "$lineA" >> newA.txt
echo "$lineB" >> newB.txt
fi
done 3<a.txt 4<b.txt
For short files
Use the method provided by Mark Setchell. Seen below in an expanded and slightly modified version:
parse.awk
FNR==NR {
x[FNR] = $0
m[FNR] = $8
n[FNR] = $9
next
}
{
if(abs(m[FNR] - $8) > 10 || abs(n[FNR] - $9) > 10) {
print x[FNR] >> "newa"
print $0 >> "newb"
}
}
function abs(x) { return x < 0 ? -x : x }   # awk has no built-in abs()
Run it like this:
awk -f parse.awk a.txt b.txt
For huge files
The method above reads a.txt into memory. If the file is very large, this becomes unfeasible and streamed parsing is called for.
It can be done in a single pass, but that requires careful handling of the multiplexed lines from a.txt and b.txt. A less error prone approach is to identify relevant line numbers, and then extract those into new files. An example of the last approach is shown below.
First you need to identify the matching lines:
# Extract fields 8 and 9 from a.txt and b.txt
paste <(awk -F'|' '{print $8, $9}' OFS='\t' a.txt) \
<(awk -F'|' '{print $8, $9}' OFS='\t' b.txt) |
# Check if the fields match the criteria and print the line number
awk '$1 - $3 > n || $3 - $1 > n || $2 - $4 > n || $4 - $2 > n { print NR }' n=10 > linesfile
Now we are ready to extract the lines from a.txt and b.txt, and as the numbers are sorted, we can use the extract.awk script proposed here (repeated for convenience below):
extract.awk
BEGIN {
getline n < linesfile
if(length(ERRNO)) {
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
NR == n {
print
if(!(getline n < linesfile)) {
if(length(ERRNO))
print "Unable to open linesfile '" linesfile "': " ERRNO > "/dev/stderr"
exit
}
}
Extract the lines (can be run in parallel):
awk -v linesfile=linesfile -f extract.awk a.txt > newa
awk -v linesfile=linesfile -f extract.awk b.txt > newb

Create an Excel file using a shell script

I have a bunch of text files in a directory, and I need to read them, extract information, and keep it in an Excel or text file.
name1_1.txt
count: 10
totalcount: 30
percentage:33
total no of a's: 20
total no of b's: 20
etc...
name2_2.txt
count: 20
totalcount: 40
percentage:50
total no of a's: 10
total no of b's: 30
etc...
etc...
output
name1 name2
count 10 20
totalcount 30 40
percentage 33 50
I want to keep the output in a file (called example.txt or .csv, say) in the same directory.
Can I get help with this?
Here is what I tried when writing a shell script, but I can't create the tab-separated output file I need:
#$ -S /bin/bash
for sample in *.txt; do
header=$(echo ${sample} | awk '{sub(/_/," ")}1'| awk '{print $1}')
echo -en $header"\t"
done
echo -e ' \t '
echo "count"
for sample in *.txt; do
grep "count:" $sample | awk -F: $'\t''{print $2}'
done
echo "totalcount"
for sample in *.txt; do
grep "totalcount:" $sample | awk -F: $'\t''{print $2}'
done
echo "percentage"
for sample in *.txt; do
grep "percentage:" $sample | awk -F: $'\t''{print $2}'
done
You can see if this does what you want:
awk -F":" 'BEGIN { DELIM="\t" } \
last_filename != FILENAME { \
split( FILENAME, farr, "_" ); header = header DELIM farr[1]; \
last_filename = FILENAME; i=0 } \
$1 ~ /count/ || $1 ~ /totalcount/ || $1 ~/percentage/ \
{ a[i++]= NR==FNR ? $1DELIM$2 : a[i]DELIM$2 } \
END { print header; for( j in a ) { print a[j] } }' name*.txt
where I've tried to break it up into multiple lines for "easier" reading. You can just remove the trailing "\" from each line and concatenate the lines to re-make it as a one-liner. If I edit this answer one more time, I'll just make it an executable awk file.
The awk script sets DELIM, the output delimiter, to a tab in the BEGIN block.
The FILENAME is cleaned up and appended to the header.
It takes the first-column names from the first file, as well as the data, and puts them into an array at index i. For each subsequent file, it just appends the data.
At the END, the header is output, and then the contents of the array are output.
I get the following output then:
name1 name2
count 10 20
totalcount 20 40
percentage 33 50
This will now only pick up the fields indicated in the data, provided $1 matches count, totalcount, or percentage.
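To keep the result in a file in the same directory, as asked, the output can simply be redirected; for example, if the awk program between the quotes above is saved to a file (hypothetical name summarize.awk), keeping -F":" on the command line:
awk -F":" -f summarize.awk name*.txt > example.txt
The columns are tab-separated, so example.txt (or .tsv) is a more accurate name than .csv here.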
