Print nth column from a CSV file, where myScript.sh reads the table and the column argument is combined like: "col1,col2,col3" in one argument - bash

If I have a script which takes three arguments like so:
./myScript.sh path file col1,col3
and if file is like:
id,role,salary
05,engineer,45000
How would I split $3 into separate variables (note that this could be any number of variables, if I had a larger CSV file) in order to only print the corresponding columns to $3.
I've tried saving $3 to a variable, using Tr and array to possible equate the index of the array to the column header number. I failed to do this. What is the most simplistic amateurish approach to resolving this? It would be straight forward if the script took the columns as separate arguments, but when combined in one argument, it's complicating this quite a bit for me.
Expected output:
id,salary
05,45000

If the order of the columns must stay the same:
#!/bin/bash
path="$1"
fname="$2"
cols="$3"
header=($(head -1 "$fname" | sed 's/,/ /g'))
for i in "${!header[#]}"; do
cols=$(echo "$cols" | sed "s/${header[$i]}/$((i+1))/g")
done
cut -d',' -f$cols $fname
If you need more flexibility e.g. define the order of columns, just change the last part of the script with this:
for i in "${!header[#]}"; do
cols=$(echo "$cols" | sed -e "s/${header[$i]}/\$$((i+1))/g")
done
awk -F, "{print(${cols//,/\",\"})}" $fname
Output:
$ ./so.sh <path> input.txt id,salary
id,salary
05,45000
With the awk method, you can do stuff like
$ ./so.sh <path> input.txt id,salary,id,salary
id,salary,id,salary
05,45000,05,45000

Related

Making bash output a certain word from a .txt file

I have a question on Bash:
Like the title says, I require bash to output a certain word, depending on where it is in the file. In my explicit example I have a simple .txt file.
I already found out that you can count the number of words within a file with the command:
wc -w < myFile.txt
An output example would be:
78501
There certainly is also a way to make "cat" to only show word number x. Something like:
cat myFile.txt | wordno. 3125
desired-word
Notice, that I will welcome any command, that gets this done, not only cat.
Alternatively or in addition, I would be happy to know how you can make certain characters in a file show, based on their place in it. Something like:
cat myFile.txt | characterno. 2342
desired-character
I already know how you can achieve this with a variable:
a="hello, how are you"
echo ${a:9:1}
w
Only problem is a variable can only be so long. Is it as long as a whole .txt file, it won't work.
I look forward to your answers!
You could use awkfor this job it splits the string at spaces and prints the $wordnumber stringpart and tr is used to remove newlines
cat myFile.txt | tr -d '\n' | awk -v wordnumber=5 '{ print $wordnumber }'
And if you want the for example 5th. character you could do this like so
head -c 5 myFile.txt | tail -c 1
Since you have NOT shown samples of Input_file or expected output so couldn't test it. You could simply do this with awk as follows could be an example.
awk 'FNR==1{print substr($0,2342,1);next}' Input_file
Where we are telling awk to look for 1st line FNR==1 and in substr where we tell awk to take character 2342 and next 1 means from that position take only 1 character you could increase its value or keep it as per your need too.
With gawk:
awk 'BEGIN{RS="[[:space:]]+"} NR==12345' file
or
gawk 'NR==12345' RS="[[:space:]]+" file
I'm setting the record separator to a sequences of spaces which includes newlines and then print the 12345th record.
To improve the average performance you can exit the script once the match is found:
gawk 'BEGIN{RS="[[:space:]]+"}NR==12345{print;exit}' file

Bash: sort rows within a file by timestamp

I am new to bash scripting and I have written a script to match regex and output lines to print to a file.
However, each line contains multiple columns, one of which is the timestamp column, which appears in the form YYYYMMDDHHMMSSTTT (to millisecond) as shown below.
20180301050630663,ABC,,,,,,,,,,
20180301050630664,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630666,ABC,,,,,,,,,,
20180301050630667,ABC,,,,,,,,,,
20180301050630668,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630661,ABC,,,,,,,,,,
20180301050630662,ABC,,,,,,,,,,
My code is written as follow:
awk -F "," -v OFS=","'{if($2=="ABC"){print}}' < $i>> "$filename"
How can I modify my code such that it can sort the rows by timestamp (YYYYMMDDHHMMSSTTT) in ascending order before printing to file?
You can use a very simple sort command, e.g.
sort yourfile
If you want to insure sort only looks at the datestamp, you can tell sort to only use the first command separated field as your sorting criteria, e.g.
sort -t, -k1 yourfile
Example Use/Output
With your data save in a file named log, you could do:
$ sort -t, -k1 log
20180301050630661,ABC,,,,,,,,,,
20180301050630662,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630663,ABC,,,,,,,,,,
20180301050630664,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630665,ABC,,,,,,,,,,
20180301050630666,ABC,,,,,,,,,,
20180301050630667,ABC,,,,,,,,,,
20180301050630668,ABC,,,,,,,,,,
Let me know if you have any problems.
Just add a pipeline.
awk -F "," '$2=="ABC"' < "$i" |
sort -n >> "$filename"
In the general case, to sort on column 234. try sort -t, -k234,234n
Notice alse the quoting around "$i", like you already have around "$filename", and the simplifications of the Awk script.
If you are using gawk you can do:
$ awk -F "," -v OFS="," '$2=="ABC"{a[$1]=$0} # Filter lines that have "ABC"
END{ # set the sort method
PROCINFO["sorted_in"] = "#ind_num_asc"
for (e in a) print a[e] # traverse the array of lines
}' file
An alternative is to use sed and sort:
sed -n '/^[0-9]*,ABC,/p' file | sort -t, -k1 -n
Keep in mind that both of these methods are unrelated to the shell used. Bash is just executing the tools (sed, awk, sort, etc) that are otherwise part of the OS.
Bash itself could do the sort in pure Bash but it would be long and slow.

Concatenate files based on numeric sort of name substring in awk w/o header

I am interested in concatenate many files together based on the numeric number and also remove the first line.
e.g. chr1_smallfiles then chr2_smallfiles then chr3_smallfiles.... etc (each without the header)
Note that chr10_smallfiles needs to come after chr9_smallfiles -- that is, this needs to be numeric sort order.
When separate the two command awk and ls -v1, each does the job properly, but when put them together, it doesn't work. Please help thanks!
awk 'FNR>1' | ls -v1 chr*_smallfiles > bigfile
The issue is with the way that you're trying to pass the list of files to awk. At the moment, you're piping the output of awk to ls, which makes no sense.
Bear in mind that, as mentioned in the comments, ls is a tool for interactive use, and in general its output shouldn't be parsed.
If sorting weren't an issue, you could just use:
awk 'FNR > 1' chr*_smallfiles > bigfile
The shell will expand the glob chr*_smallfiles into a list of files, which are passed as arguments to awk. For each filename argument, all but the first line will be printed.
Since you want to sort the files, things aren't quite so simple. If you're sure the full range of files exist, just replace chr*_smallfiles with chr{1..99}_smallfiles in the original command.
Using some Bash-specific and GNU sort features, you can also achieve the sorting like this:
printf '%s\0' chr*_smallfiles | sort -z -n -k1.4 | xargs -0 awk 'FNR > 1' > bigfile
printf '%s\0' prints each filename followed by a null-byte
sort -z sorts records separated by null-bytes
-n -k1.4 does a numeric sort, starting from the 4th character (the numeric part of the filename)
xargs -0 passes the sorted, null-separated output as arguments to awk
Otherwise, if you want to go through the files in numerical order, and you're not sure whether all the files exist, then you can use a shell loop (although it'll be significantly slower than a single awk invocation):
for file in chr{1..99}_smallfiles; do # 99 is the maximum file number
[ -f "$file" ] || continue # skip missing files
awk 'FNR > 1' "$file"
done > bigfile
You can also use tail to concatenate all the files without header
tail -q -n+2 chr*_smallfiles > bigfile
In case you want to concatenate the files in a natural sort order as described in your quesition, you can pipe the result of ls -v1 to xargs using
ls -v1 chr*_smallfiles | xargs -d $'\n' tail -q -n+2 > bigfile
(Thanks to Charles Duffy) xargs -d $'\n' sets the delimiter to a newline \n in case the filename contains white spaces or quote characters
Using a bash 4 associative array to extract only the numeric substring of each filename; sort those individually; and then retrieve and concatenate the full names in the resulting order:
#!/usr/bin/env bash
case $BASH_VERSION in ''|[123].*) echo "Requires bash 4.0 or newer" >&2; exit 1;; esac
# when this is done, you'll have something like:
# files=( [1]=chr_smallfiles1.txt
# [10]=chr_smallfiles10.txt
# [9]=chr_smallfiles9.txt )
declare -A files=( )
for f in chr*_smallfiles.txt; do
files[${f//[![:digit:]]/}]=$f
done
# now, emit those indexes (1, 10, 9) to "sort -n -z" to sort them as numbers
# then read those numbers, look up the filenames associated, and pass to awk.
while read -r -d '' key; do
awk 'FNR > 1' <"${files[$key]}"
done < <(printf '%s\0' "${!files[#]}" | sort -n -z) >bigfile
You can do with a for loop like below, which is working for me:-
for file in chr*_smallfiles
do
tail +2 "$file" >> bigfile
done
How will it work? For loop read all the files from current directory with wild chard character * chr*_smallfiles and assign the file name to variable file and tail +2 $file will output all the lines of that file except the first line and append in file bigfile. So finally all files will be merged (accept the first line of each file) into one i.e. file bigfile.
Just for completeness, how about a sed solution?
for file in chr*_smallfiles
do
sed -n '2,$p' $file >> bigfile
done
Hope it helps!

How to capture first column values of a command?

I am new to shell scripting. I am trying to write a script that is suppose to run a command and use for loop to capture first column of the output and do further processing.
command: tst get files
output of this command is something like
NAME COUNT ADMIN
FileA.txt 30 adminA
FileB.txt 21 local
FileC.txt 9 local
FileD.txt 90 adminA
Here is what I have tried so far : UPDATED also want to run additional commands
#!/bin/bash
for f in $(tst get files)
do
echo "FILE :[${f}]"
tst setprimary ${f} && tst get dataload
done
the output I am seeing is something like
FILE :[NAME]
FILE :[COUNT]
FILE :[ADMIN]
FILE :[FileA.txt]
FILE :[30]
FILE :[adminA]
FILE :[FileB.txt]
FILE :[21]
FILE :[local]
FILE :[FileC.txt]
FILE :[9]
FILE :[local]
FILE :[FileD.txt]
FILE :[90]
FILE :[adminA]
I am looking for an output something like
FILE :[FileA.txt]
FILE :[FileB.txt]
FILE :[FileC.txt]
FILE :[FileD.txt]
What should I modify in the shell script to only capture NAME column values? Am I executing the tst get files command correctly in the for loop or is there a better way to execute a command and loop thru the results?
EDIT (Samuel Kirschner): you can do without the for loop entirely and just use awk to print the lines you're interested in
tst get files | awk 'NR > 1 {print "FILE :[" $1 "]"}'
If you want to keep the for loop for some reason and just extract the file name from the lines while skipping the header, you have a few choices. Awk is probably the easiest because of the NR builtin variable (which counts lines) and automatic field-splitting ($1 refers to the first field in the line, for instance), but you can use sed and cut as well.
You can use awk 'NR > 1 {print $1}' to get the first column (using any whitespace character as a delimiter while skipping the first line) or sed 1d | cut -d$'\t' -f1. Note that $'\t' is bash-specific syntax for a literal tab character, if your file is padded with spaces rather than using tabs to delimit fields, you can't use the sed ... | cut ... example.
i.e.
#!/bin/bash
for f in $(tst get files | awk 'NR > 1 {print $1}')
do
echo "FILE :[${f}]"
done
or
#!/bin/bash
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
echo "FILE :[${f}]"
done
to avoid unnecessary splitting in the for loop. It's best to set IFS to something specific outside the loop body to prevent 'a file with whitespace.txt' from being broken up.
OLD_IFS=IFS
IFS=$'\n\t'
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
echo "FILE :[${f}]"
done
You can just do:
tst get files | awk 'NR > 1 { printf "FILE :[%s]\n", $1 }'
Update: To answer extended problem as per comments below by OP:
while read -r file _; do
tst setprimary "$file" && tst get dataload
done < <(tst get files)
Or perl:
tst ... | perl -lanE 'say "File: [$F[0]]" if $.>1'
the variable $. contains the current line number

Shell cut delimiter before last

I`m trying to cut a string (Name of a file) where I have to get a variable in the name.
But the problem is, I have to put it in a shell variable, until now it is ok.
Here is the example of what i have to do.
NAME_OF_THE_FILE_VARIABLEiWANTtoGET_DATE
NAMEfile_VARIABLEiWANT_DATE
NAME_FILE_VARIABLEiWANT_DATE
The position of the variable I want always can change, but it will be always 1 before last. The delimiter is the "_".
Is there a way to count the size of the array to get size-1 or something like that?
OBS: when i cut strings I always use things like that:
VARIABLEiWANT=`echo "$FILENAME" | cut 1 -d "_"`
awk -F'_' '{print $(NF-1)}' file
or you have a string
awk -F'_' '{print $(NF-1)}' <<< "$FILENAME"
save the output of above oneliner into your variable.
IFS=_ read -a array <<< "$FILENAME"
variable_i_want=${array[${#array[#]}-2]}
It's a bit of a mess visually, but it's more efficient than starting a new process. ${#array[#]} is the number of elements read from FILENAME, so the indices for the array range from 0 to ${#array[#]}-1.
As of bash 4.3, though, you can use a negative index instead of computing it.
variable_i_want=${array[-2]}
If you need POSIX compatibility (no arrays), then
tmp=${FILENAME%_${FILENAME##*_}} # FILENAME with last field removed
variable_i_want=${tmp##*_} # last field of tmp
Just got it... I find someone using a cat function... I got to use it with the echo... and rev. didn't understand the rev thing, but I think it revert the order of the delimiter.
CODIGO=`echo "$ARQ_NAME" | rev | cut -d "_" -f 2 | rev `

Resources