I have a CSV file, and I want to remove the columns that have fewer than 5 different values. For example:
a b c;
1 1 1;
1 2 2;
1 3 4;
2 4 5;
1 6 7;
then I want to remove column a, since it has only two different values (1, 2). How can I do this?
A solution using arrays:
infile="infile.txt"
different=5
rows=0
while read -a line ; do
data+=( ${line[@]/;/} ) # remove all semicolons
((rows++))
done < "$infile"
cols=$(( ${#data[@]}/rows )) # calculate number of columns
result=()
for (( CNTR1=0; CNTR1<cols; CNTR1+=1 )); do
cnt=()
save=( ${data[CNTR1]} ) # add column header
for (( CNTR2=cols; CNTR2<${#data[@]}; CNTR2+=cols )); do
cnt[${data[CNTR1+CNTR2]}]=1
save+=( ${data[CNTR1+CNTR2]} ) # add column data
done
if [ ${#cnt[@]} -ge $different ] ; then # keep columns with at least $different distinct values
result+=( ${save[@]} ) # add column to the result
fi
done
cols=$((${#result[@]}/rows)) # recalculate number of columns
for (( CNTR1=0; CNTR1<rows; CNTR1+=1 )); do
for (( CNTR2=0; CNTR2<${#result[@]}; CNTR2+=rows )); do
printf " %s" "${result[CNTR1+CNTR2]}"
done
printf ";\n"
done
The output:
b c;
1 1;
2 2;
3 4;
4 5;
6 7;
I think to resolve this problem you can read the file to get the data (the numbers, e.g. into an array), then find the columns you want to remove by counting the distinct values in each one, and finally write the result back to the file.
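For what it's worth, here is a minimal two-pass awk sketch of that idea; the filename infile.txt and the threshold of 5 are assumptions taken from the example above, and semicolons are stripped the same way as in the array solution:
awk -v min=5 '
NR == FNR {                      # first pass: count distinct values per column
    gsub(/;/, "")
    if (FNR > 1)                 # do not count the header row as a value
        for (i = 1; i <= NF; i++)
            if (!seen[i, $i]++) distinct[i]++
    next
}
{                                # second pass: print only the surviving columns
    gsub(/;/, "")
    out = ""
    for (i = 1; i <= NF; i++)
        if (distinct[i] >= min) out = out (out == "" ? "" : " ") $i
    print out ";"
}' infile.txt infile.txt
Reading the file twice avoids holding all of the data in memory at once.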
I want to sort 2 arrays at the same time. The arrays are the following: wordArray and numArray. Both are global.
These 2 arrays contain all the words (without duplicates) and the number of the appearances of each word from a text file.
Right now I am using Bubble Sort to sort both of them at the same time:
# Bubble Sort function
function bubble_sort {
local max=${#numArray[@]}
size=${#numArray[@]}
while ((max > 0))
do
local i=0
while ((i < max))
do
if [ "$i" != "$(($size-1))" ]
then
if [ ${numArray[$i]} \< ${numArray[$((i + 1))]} ]
then
local temp=${numArray[$i]}
numArray[$i]=${numArray[$((i + 1))]}
numArray[$((i + 1))]=$temp
local temp2=${wordArray[$i]}
wordArray[$i]=${wordArray[$((i + 1))]}
wordArray[$((i + 1))]=$temp2
fi
fi
((i += 1))
done
((max -= 1))
done
}
#Calling Bubble Sort function
bubble_sort "${numArray[@]}" "${wordArray[@]}"
But for some reason it won't sort them properly when the arrays are large.
Does anyone know what's wrong with it, or another approach to sort the words together with their corresponding number of appearances, with or without arrays?
This:
wordArray = (because, maybe, why, the)
numArray = (5, 12, 20, 13)
Must turn to this:
wordArray = (why, the, maybe, because)
numArray = (20, 13, 12, 5)
Someone recommended writing the two arrays side by side in a text file and sorting the file.
How will it work for this input:
1 Arthur
21 Zebra
to turn to this output:
21 Zebra
1 Arthur
Assuming the arrays do not contain tab or newline characters, how about:
#!/bin/bash
wordArray=(why the maybe because)
numArray=(20 13 12 5)
tmp1=$(mktemp tmp.XXXXXX) # file to be sorted
tmp2=$(mktemp tmp.XXXXXX) # sorted result
for (( i = 0; i < ${#wordArray[@]}; i++ )); do
echo "${numArray[i]}"$'\t'"${wordArray[i]}" # write the number and word delimited by a tab character
done > "$tmp1"
sort -nrk1,1 "$tmp1" > "$tmp2" # sort the file by number in descending order
while IFS=$'\t' read -r num word; do # read the lines splitting by the tab character
numArray_sorted+=("$num") # add the number to the array
wordArray_sorted+=("$word") # add the word to the array
done < "$tmp2"
rm -- "$tmp1" # unlink the temp file
rm -- "$tmp2" # same as above
echo "${wordArray_sorted[#]}" # same as above
echo "${numArray_sorted[#]}" # see the result
Output:
why the maybe because
20 13 12 5
If you prefer not to create temp files, here is a process substitution version, which runs faster because it avoids writing and reading temp files.
#!/bin/bash
wordArray=(why the maybe because)
numArray=(20 13 12 5)
while IFS=$'\t' read -r num word; do
numArray_sorted+=("$num")
wordArray_sorted+=("$word")
done < <(
sort -nrk1,1 < <(
for (( i = 0; i < ${#wordArray[@]}; i++ )); do
echo "${numArray[i]}"$'\t'"${wordArray[i]}"
done
)
)
echo "${wordArray_sorted[#]}"
echo "${numArray_sorted[#]}"
Or simpler (using the suggestion by KamilCuk):
#!/bin/bash
wordArray=(why the maybe because)
numArray=(20 13 12 5)
while IFS=$'\t' read -r num word; do
numArray_sorted+=("$num")
wordArray_sorted+=("$word")
done < <(
paste <(printf "%s\n" "${numArray[@]}") <(printf "%s\n" "${wordArray[@]}") | sort -nrk1,1
)
echo "${wordArray_sorted[#]}"
echo "${numArray_sorted[#]}"
You need numeric sort for the numbers. You can sort an array like this:
mapfile -t wordArray < <(printf '%s\n' "${wordArray[@]}" | sort -n)
But what you actually need is something like:
for num in "${numArray[#]}"; do
echo "$num: ${wordArray[j++]}"
done |
sort -n -k1,1
But, earlier in the process, you should have used only one array, where the word is the key and the frequency the value (or vice versa). Then they always have a direct relationship, and can be printed similarly to the for loop above.
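A minimal sketch of that single-array idea, assuming bash 4+ for associative arrays (the sample words are invented for illustration):
declare -A count
for word in why the maybe why the why; do
    (( count[$word]++ ))              # word -> frequency, in one array
done
for word in "${!count[@]}"; do        # print "frequency<TAB>word" pairs
    printf '%s\t%s\n' "${count[$word]}" "$word"
done | sort -nrk1,1
Because the word is the key, the frequency can never get out of step with it, which is exactly the problem two parallel arrays create.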
I have a file with 2 columns and many rows. I would like to calculate the mean of each column for odd and even lines independently, so that in the end I would have a file with 4 values: 2 columns with the odd and even means.
My file looks like this:
2 4
4 4
6 8
3 5
6 9
2 1
In the end I would like to obtain a file with the mean of 2,6,6 and 4,3,2 in the first column and the mean of 4,8,9 and 4,5,1 in the second column, that is:
4.66 7
3 3.33
If anyone could give me some advice I'd really appreciate it; for the moment I'm only able to calculate the mean of all rows (not odd and even separately). Thank you very much in advance!
This is a hardcoded awk example, but you get the point:
awk 'NR%2{o1+=$1;o2+=$2;c++;next}
{e1+=$1;e2+=$2;d++}
END{print o1/c"\t"o2/c"\n"e1/d"\t"e2/d}' your_file
4.66667 7
3 3.33333
A more generalized version of Juan Diego Godoy's answer. It handles any number of columns and relies on GNU awk for its arrays of arrays.
gawk '
{
parity = NR % 2 == 1 ? "odd" : "even"
for (i=1; i<=NF; i++) {
sum[parity][i] += $i
count[parity][i] += 1
}
}
function result(parity) {
for (i=1; i<=NF; i++)
printf "%g\t", sum[parity][i] / count[parity][i]
print ""
}
END { result("odd"); result("even") }
' your_file
This answer uses Bash and bc. It assumes that the input file consists of only integers and that there is an even number of lines.
#!/bin/bash
while read -r oddcol1 oddcol2; read -r evencol1 evencol2
do
(( oddcol1sum += oddcol1 ))
(( oddcol2sum += oddcol2 ))
(( evencol1sum += evencol1 ))
(( evencol2sum += evencol2 ))
(( count++ ))
done < inputfile
cat <<EOF | bc -l
scale=2
print "Odd Column 1 Mean: "; $oddcol1sum / $count
print "Odd Column 2 Mean: "; $oddcol2sum / $count
print "Even Column 1 Mean: "; $evencol1sum / $count
print "Even Column 2 Mean: "; $evencol2sum / $count
EOF
It could be modified to use arrays to make it more flexible.
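For example, a sketch of such an array version, still assuming integer input, an even number of lines, and a placeholder inputfile; it handles any number of columns:
#!/bin/bash
count=0
while read -r -a odd; read -r -a even; do
    for i in "${!odd[@]}";  do (( oddsum[i]  += odd[i]  )); done
    for i in "${!even[@]}"; do (( evensum[i] += even[i] )); done
    (( count++ ))
done < inputfile
# one line of odd-row means, then one line of even-row means
for i in "${!oddsum[@]}";  do printf '%s ' "$(bc -l <<< "scale=2; ${oddsum[i]} / $count")";  done; echo
for i in "${!evensum[@]}"; do printf '%s ' "$(bc -l <<< "scale=2; ${evensum[i]} / $count")"; done; echo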
I have a line that goes like:
string 2 2 3 3 1 4
where the 2nd, 4th and 6th columns represent IDs (assuming each ID number is unique) and the 3rd, 5th and 7th columns hold the data associated with the respective ID.
How can I re-arrange the line so that it will be sorted by the ID?
string 1 4 2 2 3 3
Note: a line may have any number of IDs, unlike the example.
Using shell script, I'm thinking something like
while read n
do
echo $(echo $n | sort -k (... stuck here) )
done < infile
Another bash alternative which does not rely on how many ids there are:
#!/usr/bin/env bash
x='string 2 2 3 3 1 4'
out="${x%% *}"
in=($x)
for (( i = 1; i < ${#in[*]}; i += 2 ))
do
new[${in[i]}]=${in[i+1]}
done
for i in ${!new[@]}
do
out="$out $i ${new[i]}"
done
echo $out
You can put a loop around the lot if you then want to read a file; see the sketch below.
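A sketch of that loop, assuming the same whitespace-separated format and a placeholder infile:
#!/usr/bin/env bash
while read -r first rest; do
    in=($rest)                       # the id/value pairs for this line
    unset new
    for (( i = 0; i < ${#in[*]}; i += 2 )); do
        new[${in[i]}]=${in[i+1]}     # index by id; bash iterates indices in ascending order
    done
    out="$first"
    for i in "${!new[@]}"; do
        out="$out $i ${new[i]}"
    done
    echo "$out"
done < infile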
I'll add a gawk solution to your long list of options.
This is a standalone script:
#!/usr/bin/env gawk -f
{
line=$1
# Collect the tuples into values of an array,
for (i=2;i<NF;i+=2) a[i]=$i FS $(i+1)
# This sorts the array "a" by value, numerically, ascending...
asort(a, a, "@val_num_asc")
# And this for loop gathers the result.
for (i=1; i<=length(a); i++) line=line FS a[i]
# Finally, print the line,
print line
# and clear the array for the next round.
delete a
}
This works by copying your tuples into an array, sorting the array, then reassembling the sorted tuples in a for loop that prints the array elements.
Note that it's gawk-only (not traditional awk) because of the use of asort().
$ cat infile
string 2 2 3 3 1 4
other 5 1 20 9 3 7
$ ./sorttuples infile
string 1 4 2 2 3 3
other 3 7 5 1 20 9
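If gawk's asort() is not available, roughly the same decorate-sort-reassemble idea works with the external sort; a minimal sketch, assuming the same infile:
#!/usr/bin/env bash
while read -r name rest; do
    printf '%s' "$name"
    # one "id value" pair per line, sorted numerically by id
    printf '%s %s\n' $rest | sort -n -k1,1 |
    while read -r id val; do
        printf ' %s %s' "$id" "$val"
    done
    printf '\n'
done < infile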
As a bash script this can be done with:
Code:
#!/usr/bin/env bash
# send field pairs as separate lines
function emit_line() {
while [ $# -gt 0 ] ; do
echo "$1" "$2"
shift; shift
done
}
# break the line into pieces and send to sort
function sort_line() {
echo $1
shift
emit_line $* | sort
}
# loop through the lines in the file and sort by key-value pairs
while read n; do
echo $(sort_line $n)
done < infile
File infile:
string 2 2 3 3 1 4
string 2 2 0 3 4 4 1 7
string 2 2 0 3 2 1
Output:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
string 0 3 2 1 2 2
Update:
Cribbing the sort from grail's version, to remove the (much slower) external sort:
function sort_line() {
line="$1"
shift
while [ $# -gt 0 ] ; do
data[$1]=$2
shift; shift
done
for i in ${!data[@]}; do
line="$line $i ${data[i]}"
done
unset data
echo $line
}
while read n; do
sort_line $n
done < infile
You can use Python for this. This function breaks the line up into a list of key/value tuples that can then be sorted. itertools.chain is then used to re-assemble the key/value pairs.
Code:
import itertools as it
def sort_line(line):
# split the line on white space
x = line.split()
# make a tuple of key value pairs
as_tuples = [tuple(x[i:i+2]) for i in range(1, len(x), 2)]
# sort the tuples numerically by ID, and flatten them with chain
sorted_kv = list(it.chain(*sorted(as_tuples, key=lambda kv: int(kv[0]))))
# join the results back into a string
return ' '.join([x[0]] + sorted_kv)
Test Code:
data = [
"string 2 2 3 3 1 4",
"string 2 2 0 3 4 4 1 7",
]
for line in data:
print(sort_line(line))
Results:
string 1 4 2 2 3 3
string 0 3 1 7 2 2 4 4
I want to manage subvariables in Bash. I can assign the subvariables, but I don't know how to use them:
#!/bin/bash
n=1
for lvl in 1 2;
do
export key$n="${RANDOM:0:2}"
let n=$n+1
done
for num in 1 2; do
echo $key$num
done
If I use echo $key$num, it prints the value of $num, not the random numbers.
Use arrays.
for n in 1 2; do
key[n]="${RANDOM:0:2}"
done
for num in 1 2; do
echo "${key[num]}"
done
See http://mywiki.wooledge.org/BashGuide/Arrays.
Also, in bash you'll generally do better counting from 0 instead of 1, and you don't need to export variables unless you want to run some other program that is going to look for them in its inherited environment.
You may use arrays (see @MarkReed's answer), or use declare:
for n in 1 2; do
declare -- key$n="${RANDOM:0:2}"
done
for n in 1 2; do
v=$(declare -p key$n) ; v="${v#*=}" ; echo "${v//\"/}"
done
The same using functions:
key_set () # n val
{
declare -g -- key$1=$2
}
key_get () # n
{
local v=$(declare -p key$1) ; v="${v#*=}" ; echo "${v//\"/}"
}
for n in 1 2; do
key_set $n "${RANDOM:0:2}"
done
for n in 1 2; do
key_get $n
done
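For the get side, a simpler sketch uses bash's ${!name} indirect expansion instead of parsing declare -p output (same assumptions as the question's loop):
#!/bin/bash
for n in 1 2; do
    declare "key$n=${RANDOM:0:2}"
done
for num in 1 2; do
    var=key$num
    echo "${!var}"                   # expands to the value of key1, then key2
done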
The input from file $2 looks like: 1 -> 2
while read -a line; do
if (( ${line[2]} > linesNumber )); then
echo "Graph does not match known sites4"
exit
fi
done < "$2"
For some reason, inside the if condition the value of ${line[2]} is not 2,
but if I print the value outside the if:
echo "${line[2]}"
2
What's linesNumber? Even if you put $linesNumber, where is it coming from?
If you are tracking the line number, you need to set it and increment it yourself. Here's a sample program and its data. It's inspired by your example and doesn't do exactly what you want, but it shows how to set up a variable that tracks the line number, how to increment it, and how to use it in an if statement:
foo.txt:
this 1
that 2
foo 4
barf 4
flux 5
The Program:
lineNum=0
while read -a line
do
((lineNum++))
if (( ${line[1]} > $lineNum ))
then
echo "Line Number Too High!"
fi
echo "Verb = ${line[0]} Number = ${line[1]}"
done < foo.txt
Output:
Verb = this Number = 1
Verb = that Number = 2
Line Number Too High!
Verb = foo Number = 4
Verb = barf Number = 4
Verb = flux Number = 5