Extract specific text out of bash variable containing TSV string - bash

I have the following tab-separated, multi-line string assigned to a variable in bash:
TAGS Product3 qwerty text Desc3
TAGS Product1 qwerty text Desc1
TAGS Product2 qwerty text Desc2
I would like to extract the last column into a new string, ordered according to my product input. For example, the input Product1,Product2,Product3 should produce the output: Desc1,Desc2,Desc3
What would be the best approach to accomplish this?

echo "$tsv_data" | awk '{print $2 " " $5}' | sort | awk '{print $2}' | paste -sd ',' -
This does the following steps in order:
Print the second and fifth field (Product and Description) with a space between them.
Sort the input with sort (use gnu-sort if it can contain numbers)
Print only the description (in each line)
Join the lines together with paste
which will produce the following output:
Desc1,Desc2,Desc3
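For a quick test, assuming the sample rows above are tab-separated and stored in tsv_data (the variable name taken from the command above):
tsv_data=$'TAGS\tProduct3\tqwerty\ttext\tDesc3\nTAGS\tProduct1\tqwerty\ttext\tDesc1\nTAGS\tProduct2\tqwerty\ttext\tDesc2'
echo "$tsv_data" | awk '{print $2 " " $5}' | sort | awk '{print $2}' | paste -sd ',' -
# Desc1,Desc2,Desc3
Note that the products end up in alphabetical order of their names, which in this example happens to match the Product1,Product2,Product3 query.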

Here's a function that I suppose should do it:
get_descriptions() {
    local tsvstring="$1"
    local prodnames="$2"
    local result=()
    # read the TSV line by line, splitting each line into variables
    # (the || clause also handles a last line without a trailing newline)
    while IFS=$'\t' read -r tags prodname val1 val2 desc || [[ -n ${prodname} && -n ${desc} ]]; do
        # check if the line's product appears in the query; if so, append its description
        if grep -iq "${prodname}" <<< "${prodnames}"; then
            result+=("${desc}")
        fi
    done <<< "${tsvstring}"
    # print the result array joined by commas (IFS is only changed inside the subshell)
    (IFS=,; echo "${result[*]}")
}
Then you can just use it like:
get_descriptions "${tsv_string_var}" "product1,product2"
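With the sample data from the question, that call should print:
Desc1,Desc2
(grep -i makes the match case-insensitive, and the descriptions come out in the order the products appear in the TSV, not in the order of the query.)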

echo "$var" | sort -k2 | cut -f5 | paste -sd,
Sort by the second field, keep only the fifth (the description), and join the lines with commas.

sort + awk + paste pipeline:
echo "$tsv" | sort -nk2 | awk '{print $5}' | paste -sd',' -
The output:
Desc1,Desc2,Desc3
sort -nk2 - sorts the input by the second column numerically
awk '{print $5}' - prints the fifth column of each line
paste -sd',' - merge lines with ,

Related

check if column has more than one value in unix [duplicate]

I have a text file with a large amount of data which is tab delimited. I want to have a look at the data such that I can see the unique values in a column. For example,
Red Ball 1 Sold
Blue Bat 5 OnSale
...............
So, it's like the first column has colors, and I want to know how many different unique values there are in that column, and I want to be able to do that for each column.
I need to do this in a Linux command line, so probably using some bash script, sed, awk or something.
What if I wanted a count of these unique values as well?
Update: I guess I didn't put the second part clearly enough. What I wanted is a count of "each" of these unique values, not just how many unique values there are. For instance, in the first column I want to know how many Red, Blue, Green, etc. coloured objects there are.
You can make use of cut, sort and uniq commands as follows:
cat input_file | cut -f 1 | sort | uniq
This gets the unique values in field 1; replacing 1 with 2 will give you the unique values in field 2.
Avoiding UUOC :)
cut -f 1 input_file | sort | uniq
EDIT:
To count the number of unique occurrences you can make use of the wc command in the chain as:
cut -f 1 input_file | sort | uniq | wc -l
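If instead you want a count of each unique value (the asker's update), uniq -c should give that directly:
cut -f 1 input_file | sort | uniq -c
#      1 Blue
#      1 Red
Each distinct value in column 1 is printed once, prefixed by the number of times it occurs.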
An awk one-liner can also print each unique value in the first (tab-separated) column together with its count:
awk -F '\t' '{ a[$1]++ } END { for (n in a) print n, a[n] } ' test.csv
You can use awk, sort & uniq to do this, for example to list all the unique values in the first column
awk < test.txt '{print $1}' | sort | uniq
As posted elsewhere, if you want to count the number of instances of something you can pipe the unique list into wc -l
Assuming the data file is actually Tab separated, not space aligned:
<test.tsv awk '{print $4}' | sort | uniq
Where $4 will be:
$1 - Red
$2 - Ball
$3 - 1
$4 - Sold
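With the two sample rows above, the command should give:
<test.tsv awk '{print $4}' | sort | uniq
OnSale
Sold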
# COLUMN is integer column number
# INPUT_FILE is input file name
cut -f ${COLUMN} < ${INPUT_FILE} | sort -u | wc -l
Here is a bash script that fully answers the (revised) original question. That is, given any .tsv file, it provides the synopsis for each of the columns in turn. Apart from bash itself, it only uses standard *ix/Mac tools: sed tr wc cut sort uniq.
#!/bin/bash
# Syntax: $0 filename
# The input is assumed to be a .tsv file
FILE="$1"
# number of columns = number of tabs in the first line + 1
cols=$(sed -n 1p "$FILE" | tr -cd '\t' | wc -c)
cols=$((cols + 1))
for ((i = 1; i <= cols; i++))
do
    echo "Column $i ::"
    cut -f "$i" < "$FILE" | sort | uniq -c
    echo
done
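As a rough illustration, saving this as colstats.sh (the name is just for this example) and running it on a file containing the two sample rows should print something like:
$ ./colstats.sh test.tsv
Column 1 ::
      1 Blue
      1 Red

Column 2 ::
      1 Ball
      1 Bat
...and so on for each column.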
This script outputs, for each column of a given file, every unique value together with its count. It assumes that the first line of the file is a header line. There is no need to define the number of fields. Simply save the script in a bash file (.sh) and provide the tab-delimited file as a parameter to the script.
Code
#!/bin/bash
# Note: arr[x][y] is an array of arrays, which requires GNU awk (gawk) 4.0+
awk '
(NR==1){
    # remember the header names
    for(fi=1; fi<=NF; fi++)
        fname[fi]=$fi;
}
(NR!=1){
    # count each value per column, keyed by the header name
    for(fi=1; fi<=NF; fi++)
        arr[fname[fi]][$fi]++;
}
END{
    for(fi=1; fi<=NF; fi++){
        out=fname[fi];
        for (item in arr[fname[fi]])
            out=out"\t"item"_"arr[fname[fi]][item];
        print(out);
    }
}
' "$1"
Execution Example:
bash> ./script.sh <path to tab-delimited file>
Output Example
isRef A_15 C_42 G_24 T_18
isCar YEA_10 NO_40 NA_50
isTv FALSE_33 TRUE_66

Getting last X fields from a specific line in a CSV file using bash

I'm trying to get the list of users from my CSV file into a bash variable. The problem is that the number of users per line varies and can be anywhere from 1 to 5.
Example CSV file:
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
I would like to get something like
list_of_users="cat file.csv | grep "record2_data2" | <something> "
echo $list_of_users
user1,user2,user3,user4
I'm trying this:
cat file.csv | grep "record2_data2" | awk -F, -v OFS=',' '{print $4,$5,$6,$7,$8 }' | sed 's/"//g'
My result is:
user2,user3,user4,,
Question:
How can I remove all the "," from the end of my result? Sometimes it is just one, but sometimes it can be user1,,,,
Can I do it in a better way? The users always start after the 3rd column in my file.
This will do what your code seems to be trying to do (print the users for a given string record2_data2 which only exists in the 2nd field):
$ awk -F',' '{gsub(/"/,"")} $2=="record2_data2"{sub(/([^,]*,){3}/,""); print}' file.csv
user1,user2,user3,user4
but I don't see how that's related to your question subject of Getting last X records from CSV file using bash so idk if it's what you really want or not.
Better to use a bash array, and join it into a CSV string when needed:
#!/usr/bin/env bash
readarray -t listofusers < <(cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u)
IFS=,
printf "%s\n" "${listofusers[*]}"
cut -d, -f4- file.csv | tr -d '"' | tr ',' $'\n' | sort -u is the important bit - it first only prints out the fourth and following fields of the CSV input file, removes quotes, turns commas into newlines, and then sorts the resulting usernames, removing duplicates. That output is then read into an array with the readarray builtin, and you can manipulate it and the individual elements however you need.
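For example, once the array is populated you can pick out individual users or re-join them on demand (a small usage sketch, not part of the original answer):
echo "${listofusers[0]}"              # first user
echo "${#listofusers[@]}"             # number of unique users
(IFS=,; echo "${listofusers[*]}")     # join with commas without changing the global IFS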
GNU sed solution, let file.csv content be
"record1_data1","record1_data2","record1_data3","user1","user2"
"record2_data1","record2_data2","record2_data3","user1","user2","user3","user4"
"record3_data1","record3_data2","record3_data3","user1"
then
sed -n -e 's/"//g' -e '/record2_data/ s/[^,]*,[^,]*,[^,]*,// p' file.csv
gives output
user1,user2,user3,user4
Explanation: -n turns off automatic printing. The expressions work as follows: the 1st substitutes " globally with an empty string, i.e. deletes them; the 2nd, for the line containing record2_data, substitutes (s) everything up to and including the 3rd , with an empty string, i.e. deletes it, and prints (p) the changed line.
(tested in GNU sed 4.2.2)
awk -F',' '
/record2_data2/{
    for(i=4;i<=NF;i++) o=sprintf("%s%s,",o,$i);
    gsub(/"|,$/,"",o);
    print o
}' file.csv
user1,user2,user3,user4
This might work for you (GNU sed):
sed -E '/record2_data/!d;s/"([^"]*)"(,)?/\1\2/4g;s///g' file
Delete all records except for that containing record2_data.
Remove double quotes from the fourth field onward.
Remove the remaining double quoted fields (the first three), together with their trailing commas (the empty pattern in s///g reuses the previous regex).

Get Field Separator from a File

I have several files and they are different between each other and they use ; and | as separators.
Is there a way to have my program read the file, detect whether ; or | is the separator, save it into a variable, and use it later on in my script?
Generally speaking the answer is 'yes' you can dynamically determine the delimiter and use it in later coding.
You haven't mentioned how you'll determine which delimiter is in use (eg, what happens if both characters exist in your data file?) so for the sake of discussion we'll assume the delimiter has been determined and stored in the del variable.
You also haven't stated how you'll use this delimiter in your code so, again for sake of discussion, we'll look at examples using cut and awk.
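If you do need to pick the delimiter automatically, one simple heuristic (purely an assumption on my part, since the question doesn't say how to decide) is to count which candidate occurs more often in the first line:
# choose whichever of ';' or '|' appears more often in the first line of the file
firstline=$(head -n 1 mydata)
semis=$(printf '%s' "$firstline" | tr -cd ';' | wc -c)
pipes=$(printf '%s' "$firstline" | tr -cd '|' | wc -c)
if (( semis >= pipes )); then del=';'; else del='|'; fi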
Let's assume our datafile (mydata) contains the following single line of data:
$ cat mydata
abc;def|ghi;jkl|mno;pqr|stu
We'll now switch our delimiter between ; and | and look at some simple cut and awk examples ...
############################
# set our delimiter to ';'
$ del=";"
##### display 1st field
$ cut -d"${del}" -f1 mydata
abc
$ awk -F"${del}" '{print $1}' mydata
abc
#### display 4th field
$ cut -d"${del}" -f4 mydata
pqr|stu
$ awk -F"${del}" '{print $4}' mydata
pqr|stu
############################
# set our delimiter to '|'
#
$ del="|"
##### display 1st field
$ cut -d"${del}" -f1 mydata
abc;def
$ awk -F"${del}" '{print $1}' mydata
abc;def
##### display 4th field
$ cut -d"${del}" -f4 mydata
stu
$ awk -F"${del}" '{print $4}' mydata
stu

shell sort command : How to sort by the last column (the number of columns is uncertain)?

If the data is like the following:
a,b,3
c,d,e,f,2
g,1
I want to sort by the last column. The result should be:
g,1
c,d,e,f,2
a,b,3
If the last field is a single digit:
$ rev file | sort | rev
you may need to add -t, -n to sort for numerical ordering, but with single digits it doesn't matter.
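With the sample input this should give:
$ rev file | sort | rev
g,1
c,d,e,f,2
a,b,3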
or, for the general case with awk
$ awk -F, '{a[$NF]=$0} END{n=asorti(a,d); for(k=1;k<=n;k++) print a[d[k]]}' file
g,1
c,d,e,f,2
a,b,3
This will fail if the last field is not unique. Using decorate/sort/undecorate idiom you can write instead (as you found yourself)
$ awk -F, '{print $NF FS $0}' file | sort -n | cut -d, -f2-
it's safer to use the field delimiter between the key and the record since you want to ensure the FS doesn't appear in the key itself.
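For reference, the decorated intermediate output (before sort orders it and cut strips the key off) should look like this:
$ awk -F, '{print $NF FS $0}' file
3,a,b,3
2,c,d,e,f,2
1,g,1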
I have a stupid but simple way to do it :)
# if the original data is in the file: ~/Desktop/1.log
$ awk -F, '{print $NF, $0}' ~/Desktop/1.log | sort -n | awk '{print $2}'
g,1
c,d,e,f,2
a,b,3
Here is my solution using a bash script -- I named it uncertain.sh.
# Set here the size of the largest item to sort.
# In our case it is c,d,e,f,2 which is size 5.
max_n=5
# This function 'pads' array with P's before last element
# to force it to grow to max_n size.
# For example, (a b 3) will be transformed into (a b P P 3).
pad () {
    local arr=("$@")
    local l=${#arr[@]}
    local diff_l=$((max_n-l))
    local padding=""
    # construct padding
    for i in `seq 1 $diff_l`; do
        padding+="P "
    done
    local l_minus=$((l-1))
    arr=(${arr[@]:0:$l_minus} "$padding"${arr[@]:$l_minus})
    echo "${arr[@]}"
}
################################################
# Provide A,B,C here to sort by last item
################################################
A="a,b,3"
B="c,d,e,f,2"
C="g,1"
A=$(echo "$A" | tr ',' ' ')
B=$(echo "$B" | tr ',' ' ')
C=$(echo "$C" | tr ',' ' ')
a=(`echo "$A"`)
b=(`echo "$B"`)
c=(`echo "$C"`)
# Get padded arrays.
a=$(pad "${a[@]}")
b=$(pad "${b[@]}")
c=$(pad "${c[@]}")
# Here, we sort by the last field (we can do this since
# padded arrays are all same size 5).
# Then we remove 'P's from strings.
feed=$(printf "%s\n" "$a" "$b" "$c" | sort -k5,5n | tr -d 'P')
# Lastly, we change spaces with commas ','.
while read line; do
echo "$line" | tr -s ' ' | tr ' ' ','
done < <(echo "$feed")
Here's the output
$ ./uncertain.sh
g,1
c,d,e,f,2
a,b,3
Here's how I did it:
We start with
a,b,3
c,d,e,f,2
g,1
We convert this to
a,b,P,P,3
c,d,e,f,2
g,P,P,P,1
Then we can sort by the 5th column since they are all of same size 5.
So this becomes
g,P,P,P,1
c,d,e,f,2
a,b,P,P,3
We can now remove P's.
g,1
c,d,e,f,2
a,b,3
Hope you found this useful.

Replace tip of newick file using reference list in bash

I have a collection of newick-formatted files containing gene IDs:
((gene1:1,gene2:1)100:1,gene3:1)100;
((gene4:1,gene5:1)100:1,gene6:1)100;
I have a list of equivalence between gene ID and species name:
speciesA=(gene1,gene4)
speciesB=(gene2,gene5)
speciesC=(gene3,gene6)
I would like to get the following output:
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
Any idea of how I could proceed? Ideally in bash would be awesome :)
Here's an awk one-liner that does what you want:
$ awk -F'[()=,]+' 'NR==FNR{a[$2]=a[$3]=$1;next}{for(i in a)gsub(i,a[i])}1' species gene
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
Go through the file containing the mappings between the species and genes, saving them as key-value pairs in the array a. NR==FNR targets the first file passed to awk, because there the total line number NR is equal to the line number in the current file FNR. next skips any further instructions. Then go through the second file and make the substitutions.
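With the mapping file above, the field separator -F'[()=,]+' splits a line like speciesA=(gene1,gene4) into the fields speciesA, gene1 and gene4, so after the first pass the lookup array is effectively:
a["gene1"] = a["gene4"] = "speciesA"
a["gene2"] = a["gene5"] = "speciesB"
a["gene3"] = a["gene6"] = "speciesC"
gsub(i,a[i]) then replaces every occurrence of each gene ID in the tree with its species name.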
input.txt
((gene1:1,gene2:1)100:1,gene3:1)100;
((gene4:1,gene5:1)100:1,gene6:1)100;
equivs.txt
speciesA=(gene1,gene4)
speciesB=(gene2,gene5)
speciesC=(gene3,gene6)
convert.sh
#!/bin/bash
function replace() {
    output=$1
    for line in $(cat equivs.txt) # this will fail if there is whitespace in your lines!
    do
        # get the replacement string (the part before '=')
        rep=$(echo "$line" | cut -d'=' -f1)
        # create a regex of all the possible matches we want to replace with $rep
        targets=$(echo "$line" | cut -d'(' -f2- | cut -d')' -f1)
        regex="($(echo "$targets" | sed -r 's/,/|/g'))"
        # do the replacements
        output=$(echo "$output" | sed -r "s/${regex}/${rep}/g")
    done
    echo "$output"
}

# step through the input file, calling the above function on each line,
# assuming all lines are formatted like the example!
for line in $(cat input.txt)
do
    replace "$line"
done
output:
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
((speciesA:1,speciesB:1)100:1,speciesC:1)100;
