Scripting username creation from text file? - bash

I'm really new at Bash and scripting in general.
I have to create usernames formed of first letter of first name followed by last name. To do it, I use a provided text file that looks like this:
doe,john
smith,mike
...
I declared the following variables:
fname=$(cut -d, -f2 "file.txt" | cut -c1)
lname=$(cut -d, -f1 "file.txt")
But how do I put the elements together to form the names jdoe and msmith? I tried the methods I know to concatenate strings and variables, but nothing works.
I think I found a method using awk that is supposed to work, but is there any other way to "concatenate" the elements of 2 lists?
Thank you

There are a million ways to do it; this is the simplest:
$ awk -F, '{print substr($2,1,1) $1}' file
jdoe
msmith

Ed Morton's awk-based answer is simplest (and probably fastest), but since you asked for a different solution:
#!/usr/bin/env bash
while IFS=, read -r last first _; do
username=${first:0:1}${last}
echo "username: $username"
done < file.txt
IFS=, read -r last first _ reads the first 2 ,-separated fields from each input line (_ is a dummy variable that receives the rest of the input line, if any; -r prevents interpretation of \ chars. in the input, which is usually what you want).
username=${first:0:1}${last} concatenates the 1st char. of variable $first's value with variable $last's value, simply by placing the two variable references next to each other.
${first:0:1} - extract 1 character from $first at position 0 - is an example of parameter expansion, specifically: substring expansion
< file.txt is an input redirection that sends file.txt's contents via stdin to the while loop.
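For the literal question of joining two derived lists element by element, paste is one more option. The following is a minimal sketch, assuming Bash (for process substitution) and a file of "last,first" lines; make_usernames is a hypothetical helper name, not something from the answers above.

```bash
#!/usr/bin/env bash
# make_usernames FILE (hypothetical helper): FILE holds "last,first" lines.
make_usernames() {
  local file=$1
  # First stream: first letter of each first name.
  # Second stream: each last name.
  # -d '\0' asks paste for an empty delimiter, so the columns are glued together.
  paste -d '\0' <(cut -d, -f2 "$file" | cut -c1) <(cut -d, -f1 "$file")
}

# Demo on a throwaway copy of the sample data from the question.
sample=$(mktemp)
printf 'doe,john\nsmith,mike\n' > "$sample"
make_usernames "$sample"    # prints jdoe, then msmith
rm -f "$sample"
```

This reads the file twice and forks several processes, so the awk one-liner or the while read loop is likely the better choice for large inputs.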

This looks a bit too much like homework, so I'll just drop some hints.
To read the lastname and firstname into separate variables for each line of the file, see BashFAQ 1. It should not involve cut.
To grab the first character of a variable, see BashFAQ 100.

Related

Utilising variables in tail command

I am trying to export characters from a reference file in which their byte position is known. To do this, I have a long list of numbers stored as a variable which have been used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable holds every line of your pos.txt file, including the newline characters (\n) between them. Those are invisible in many UIs and programs, but they certainly matter when you are working byte by byte. You need to feed the numbers to the command one at a time to make it work properly.
Also unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while IFS= read -r line
do
head -c "$line" ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason for your command failure is simple. The variable list contains a multi-line string read from pos.txt, newlines included. You cannot pass more than one integer value to the -c flag.
Your attempt can be fixed quite easily by dropping the call to cat and reading the positions one line at a time into a variable:
while IFS= read -r lineNo; do
tail -c +"$lineNo" ref.txt | head -c1
done < pos.txt
Note that if you want each character on its own line, head will not do that for you: for your input the loop just emits the string atga on a single line. Add an echo after the pipeline if you need one character per line.
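If forking head and tail for every position is too slow, the lookup can also be done with Bash substring expansion alone. This is a rough sketch, assuming the reference file holds a single line of sequence and the positions are 1-based; chars_at is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# chars_at REF_FILE POS_FILE (hypothetical helper):
# print the character at each 1-based position, one per line.
chars_at() {
  local ref pos
  ref=$(<"$1")   # read the file once; $(...) strips the trailing newline
  while IFS= read -r pos; do
    # Accept only positive integers without leading zeros.
    [[ $pos =~ ^[1-9][0-9]*$ ]] || continue
    # Substring expansion is 0-based, so position N lives at offset N-1.
    printf '%s\n' "${ref:pos-1:1}"
  done < "$2"
}

# Demo with the sample data from the question.
ref_file=$(mktemp)
pos_file=$(mktemp)
printf 'ggaaatgcattcaaacatgc\n' > "$ref_file"
printf '5\n10\n7\n15\n' > "$pos_file"
chars_at "$ref_file" "$pos_file"   # prints a, t, g, a on separate lines
rm -f "$ref_file" "$pos_file"
```

Unlike the cut-based approach, this keeps the output in the order the positions are listed.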
As Gordon mentions in one of the comments, for much more efficient processing of FASTA files you could use a single invocation of awk, which avoids the repeated forks to head and tail. Your sample input has no header lines to skip, so it is as straightforward as:
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
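Both print the characters in order of position rather than in the order listed in pos.txt (cut always emits characters in input order, and the iteration order of awk's for-in loop is unspecified):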
a
g
t
a

Bash Script: Grabbing First Item Per Line, Throwing Into Array

I'm fairly new to the world of writing Bash scripts and am needing some guidance. I've begun writing a script for work, and so far so good. However, I'm now at a part that needs to collect database names. The names are actually stored in a file, and I can grep them.
The command I was given is cat /etc/oratab which produces something like this:
# This file is used by ORACLE utilities. It is created by root.sh
# and updated by the Database Configuration Assistant when creating
# a database.
# A colon, ':', is used as the field terminator. A new line terminates
# the entry. Lines beginning with a pound sign, '#', are comments.
#
# The first and second fields are the system identifier and home
# directory of the database respectively. The third filed indicates
# to the dbstart utility that the database should , "Y", or should not,
# "N", be brought up at system boot time.
#
OEM:/software/oracle/agent/agent12c/core/12.1.0.3.0:N
*:/software/oracle/agent/agent11g:N
dev068:/software/oracle/ora-10.02.00.04.11:Y
dev299:/software/oracle/ora-10.02.00.04.11:Y
xtst036:/software/oracle/ora-10.02.00.04.11:Y
xtst161:/software/oracle/ora-10.02.00.04.11:Y
dev360:/software/oracle/ora-11.02.00.04.02:Y
dev361:/software/oracle/ora-11.02.00.04.02:Y
xtst215:/software/oracle/ora-11.02.00.04.02:Y
xtst216:/software/oracle/ora-11.02.00.04.02:Y
dev298:/software/oracle/ora-11.02.00.04.03:Y
xtst160:/software/oracle/ora-11.02.00.04.03:Y
I turned around and wrote grep ":/software/oracle/ora" /etc/oratab so it grabs everything I need, which is 10 databases. Not the most elegant way, but it gets what I need:
dev068:/software/oracle/ora-10.02.00.04.11:Y
dev299:/software/oracle/ora-10.02.00.04.11:Y
xtst036:/software/oracle/ora-10.02.00.04.11:Y
xtst161:/software/oracle/ora-10.02.00.04.11:Y
dev360:/software/oracle/ora-11.02.00.04.02:Y
dev361:/software/oracle/ora-11.02.00.04.02:Y
xtst215:/software/oracle/ora-11.02.00.04.02:Y
xtst216:/software/oracle/ora-11.02.00.04.02:Y
dev298:/software/oracle/ora-11.02.00.04.03:Y
xtst160:/software/oracle/ora-11.02.00.04.03:Y
So, if I want to grab the name, such as dev068 or xtst161, how do I? I think for what I need to do with this project moving forward, is storing them in an array. As mentioned in the documentation, a colon is the field terminator. How could I whip this together so I have an array, something like:
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
I feel like I may be asking for too much assistance here but I'm truly at a loss. I would be happy to clarify if need be.
It is much simpler using awk:
awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
To populate a BASH array with above output use:
mapfile -t arr < <(awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab)
To check output:
declare -p arr
declare -a arr='([0]="dev068" [1]="dev299" [2]="xtst036" [3]="xtst161" [4]="dev360" [5]="dev361" [6]="xtst215" [7]="xtst216" [8]="dev298" [9]="xtst160")'
We can pipe the output of grep to the cut utility to extract the first field, taking colon as the field separator.
Then, assuming there are no whitespace or glob characters in any of the names (which would be subject to word splitting and filename expansion), we can use a command substitution to run the pipeline, and capture the output in an array by assigning it within the parentheses.
names=($(grep ':/software/oracle/ora' /etc/oratab| cut -d: -f1;));
Note that the above command actually makes use of word splitting on the command substitution output to split the names into separate elements of the resulting array. That is why we must be sure that no whitespace occurs within any single database name, otherwise that name would be internally split into separate elements of the array. The only characters within the command substitution output that we want to be taken as word splitting delimiters are the line feeds that delimit each line of output coming off the cut utility.
You could also use awk for this:
awk -F: '!/^#/ && $2 ~ /^\/software\/oracle\/ora-/ {print $1}' /etc/oratab
The first pattern excludes any commented-out lines (starting with a #). The second pattern looks for your expected directory pattern in the second field. If both conditions are met it prints the first field, which is the Oracle SID. The -F: flag sets the field delimiter to a colon.
With your file that gets:
dev068
dev299
xtst036
xtst161
dev360
dev361
xtst215
xtst216
dev298
xtst160
Depending on what you're doing you could finesse it further and check that the last flag is set to Y; although that flag really indicates automatic start-up, it is sometimes used to indicate that a database isn't active at all.
And you can put the results into an array with:
declare -a DBS=(`awk -F: -v key='/software/oracle/ora' '$2 ~ key{print $1}' /etc/oratab`)
and then refer to ${DBS[1]} (which evaluates to dev299) etc.
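The extra check on the start-up flag could look something like this. It is only a sketch, assuming Bash 4 or later (for mapfile) and the oratab layout shown in the question; autostart_sids is a hypothetical helper, and the dev999 line is invented to show an entry being filtered out.

```bash
#!/usr/bin/env bash
# autostart_sids (hypothetical helper): reads oratab-format lines on stdin and
# prints the SID of each non-comment entry whose home matches and whose flag is Y.
autostart_sids() {
  awk -F: '!/^#/ && $2 ~ /^\/software\/oracle\/ora-/ && $3 == "Y" {print $1}'
}

# Demo on a few sample lines instead of the real /etc/oratab.
mapfile -t DBS < <(printf '%s\n' \
  '# comment line' \
  '*:/software/oracle/agent/agent11g:N' \
  'dev068:/software/oracle/ora-10.02.00.04.11:Y' \
  'dev999:/software/oracle/ora-10.02.00.04.11:N' \
  'xtst160:/software/oracle/ora-11.02.00.04.03:Y' | autostart_sids)

printf '%s\n' "${DBS[@]}"   # prints dev068, then xtst160
```

Against the real file the call would be autostart_sids < /etc/oratab. Because mapfile reads whole lines, no word splitting or filename expansion is applied to the names.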
If you'd like them into a Bash array:
$ cat > toarr.bash
#!/bin/bash
while read -r line
do
if [[ $line =~ .*Y$ ]] # they seem to end in a "Y"
then
arr[$((i++))]=${line%%:*}
fi
done < file
echo ${arr[*]} # here we print the array arr
$ bash toarr.bash
dev068 dev299 xtst036 xtst161 dev360 dev361 xtst215 xtst216 dev298 xtst160

Bash: Find and replace all variable characters up to a constant character with a constant string

I've seen many search and replace threads based on the assumption that 1. you either know what string or substring you are explicitly looking for or 2. you know the exact position it is at within the string or 3. both combined.
In my situation I have one csv file containing one column and 1M rows. e.g.
1,google.com
2,yahoo.com
3,twitter.com
4,xyz.com
For every row, I want to replace every character (the incrementing integer) up to and including the comma with the string http://www.
So far I have the following
HTTPSTRING="http://www."
cat X.csv << Will this ensure that the while block is executed on this file?
while IFS=, read line
do {$line/(.*?),/HTTPSTRING} << This is where I am having trouble
done
exit 0
and I would like a text file containing one URL per line, e.g.
http://www.google.com
...
http://www.${999,999_more_urls}
Thank you so much in advance
Lewis
This does a greedy match, which would be problematic if you ever have any commas other than the one that separates the initial integer from the characters you want to retain. But it works on your sample X.csv file, producing a Y.csv file that meets your output specification.
HTTPSTRING="http://www."
while IFS= read -r line
do
echo "${line/*,/$HTTPSTRING}"
done < X.csv > Y.csv
exit 0
For what it's worth, if you put this in a script, you can take the input/output redirection parts out of the code itself, and instead apply them when calling the script.
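The greedy-match caveat can be sidestepped with prefix removal, which strips only up to the first comma. A minimal sketch, reading from stdin; add_prefix is a hypothetical helper name and the second sample line is made up to show a value containing a comma.

```bash
#!/usr/bin/env bash
# add_prefix PREFIX (hypothetical helper): for each "N,rest" line on stdin,
# drop everything up to and including the FIRST comma and prepend PREFIX.
add_prefix() {
  local prefix=$1 line
  while IFS= read -r line; do
    # ${line#*,} removes the shortest leading match of "*,".
    printf '%s%s\n' "$prefix" "${line#*,}"
  done
}

printf '1,google.com\n2,a,b.com\n' | add_prefix 'http://www.'
# prints http://www.google.com, then http://www.a,b.com
```

In a script this would be called as add_prefix 'http://www.' < X.csv > Y.csv.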
If you're not strictly limited to bash itself, you might want to consider using sed. Either of these should do what you want, differing only in whether you prefer to escape the slashes in your string or use a non-standard delimiter:
sed 's/[0-9]*,/http:\/\/www./' X.csv > Y.csv
sed 's~[0-9]*,~http://www.~' X.csv > Y.csv
Your script is close. You can pipe the output of cat directly to the while loop, but it's better to use input redirection ( < X.csv). Using IFS=, before read will split the line into fields separated by a comma, but you are just missing a variable to hold the second field.
HTTPSTRING="http://www."
while IFS=, read number domain
do
echo "$HTTPSTRING$domain"
done < X.csv
You could use commands only, there is no need for an explicit Bash loop :
cut -d',' -f2 < X.csv | sed 's_^_http://www._' > Y.txt
Notice that the usual / delimiter after the s in sed is replaced by _ because / appears in the replacement string. ^ matches the start of the line.

Extract part of file name with multiple sections

I am trying to extract part of a file name to compare with other file names, as it is the only part that does not change. Here is the pattern and an example:
clearinghouse.doctype.payer.transID.processID.date.time
EMDEON.270.60054.1234567890123456789.70949996.20120925.014606403
All sections are the same length at all times, with the exception of clearinghouse and doctype, which can vary in character length.
The part of the filename that I need for comparison is the transID.
What would be the cleanest, shortest way to do this in a shell script?
Thanks
There are lots of ways to do this; the easiest tool for simple tasks is the cut command. Tell cut what character you want to use as a delimiter and which fields you want to print. Here is the command that does what you want:
file=EMDEON.270.60054.1234567890123456789.70949996.20120925.014606403
transitId=$(echo "$file" | cut -d. -f4)
Awk can do the same thing, and allows you to do much more complicated logic as well.
file=EMDEON.270.60054.1234567890123456789.70949996.20120925.014606403
transitId=$(echo "$file" | awk -F. '{print $4}')
You can split the filename apart using the read command with an appropriate value for IFS.
filename="EMDEON.270.60054.1234567890123456789.70949996.20120925.014606403"
IFS="." read clHouse doctype payer transID procID dt tm <<< "$filename"
echo $transID
Since you only want the transaction ID, it's overkill to assign every part to a specific variable. Use a single dummy variable for the other fields:
# You only need one variable after transID to swallow the rest of the input without
# splitting it up.
IFS="." read _ _ _ transID _ <<< "$filename"
or just read each part into a single array and access the proper element:
IFS="." read -a parts <<< "$filename"
transID="${parts[3]}"
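Since the goal is to compare file names by transID, the array approach can be wrapped in two small functions. This is a sketch only: trans_id and same_transaction are hypothetical names, and the second file name in the demo is made up.

```bash
#!/usr/bin/env bash
# trans_id NAME (hypothetical helper): print the 4th dot-separated field.
trans_id() {
  local -a parts
  IFS=. read -r -a parts <<< "$1"
  printf '%s\n' "${parts[3]}"     # arrays are 0-based, so index 3 is field 4
}

# same_transaction A B: succeed when both names carry the same transID.
same_transaction() {
  [[ "$(trans_id "$1")" == "$(trans_id "$2")" ]]
}

a=EMDEON.270.60054.1234567890123456789.70949996.20120925.014606403
b=OTHER.271.60054.1234567890123456789.70949997.20120925.014700000   # made-up name
if same_transaction "$a" "$b"; then
  echo "same transID"
fi
```

This relies on the transID always being the fourth field, which holds for the pattern given in the question.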
You can do this with a parameter expansion:
$ foo=EMDEON.270.60054.1234567890123456789.70949996.20120925.014606403
$ bar=${foo%.[0-9]*.[0-9]*.[0-9]*}
$ echo "${bar##*.}"
1234567890123456789
Using Perl's autosplit mode (assuming the name is stored in $file_name):
tranid=$(echo "$file_name" | perl -F'\.' -ane 'print $F[3]')

How to parse a CSV in a Bash script?

I am trying to parse a CSV containing potentially 100k+ lines. Here are the criteria I have:
The index of the identifier
The identifier value
I would like to retrieve all lines in the CSV that have the given value in the given index (delimited by commas).
Any ideas, taking in special consideration for performance?
As an alternative to cut- or awk-based one-liners, you could use the specialized csvtool aka ocaml-csv:
$ csvtool -t ',' col "$index" - < csvfile | grep "$value"
According to the docs, it handles escaping, quoting, etc.
See this youtube video: BASH scripting lesson 10 working with CSV files
CSV file:
Bob Brown;Manager;16581;Main
Sally Seaforth;Director;4678;HOME
Bash script:
#!/bin/bash
OLDIFS=$IFS
IFS=";"
while read user job uid location
do
echo -e "$user \
======================\n\
Role :\t $job\n\
ID :\t $uid\n\
SITE :\t $location\n"
done < "$1"
IFS=$OLDIFS
Output:
Bob Brown ======================
Role : Manager
ID : 16581
SITE : Main
Sally Seaforth ======================
Role : Director
ID : 4678
SITE : HOME
First prototype using plain old grep and cut:
grep "${VALUE}" inputfile.csv | cut -d, -f"${INDEX}"
If that's fast enough and gives the proper output, you're done.
CSV isn't quite that simple. Depending on the limits of the data you have, you might have to worry about quoted values (which may contain commas and newlines) and escaping quotes.
So if your data are restricted enough that you can get away with simple comma-splitting, a shell script can do that easily. If, on the other hand, you need to parse CSV ‘properly’, bash would not be my first choice. Instead I'd look at a higher-level scripting language, for example Python with a csv.reader.
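To see what goes wrong with naive splitting, here is a small demonstration on a made-up line with a quoted comma. naive_field is a hypothetical helper that simply wraps cut.

```bash
#!/usr/bin/env bash
# naive_field N (hypothetical helper): print the Nth comma-separated field
# of each stdin line, with no awareness of CSV quoting.
naive_field() {
  cut -d, -f"$1"
}

# The quoted name contains a comma, so cut sees three fields instead of two.
printf '%s\n' '"Woo, John",425-555-1212' | naive_field 2
# prints ' John"' (the tail of the name) rather than the phone number
```

Any tool that splits on bare commas, including awk -F, and IFS=, read, behaves the same way on such input.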
In a CSV file, each field is separated by a comma. The problem is, a field itself might have an embedded comma:
Name,Phone
"Woo, John",425-555-1212
You really need a library package that offers robust CSV support instead of relying on the comma as a field separator. I know that scripting languages such as Python have such support. However, I am comfortable with the Tcl scripting language, so that is what I use. Here is a simple Tcl script which does what you are asking for:
#!/usr/bin/env tclsh
package require csv
package require Tclx
# Parse the command line parameters
lassign $argv fileName columnNumber expectedValue
# Subtract 1 from columnNumber because Tcl's list index starts with a
# zero instead of a one
incr columnNumber -1
for_file line $fileName {
set columns [csv::split $line]
set columnValue [lindex $columns $columnNumber]
if {$columnValue == $expectedValue} {
puts $line
}
}
Save this script to a file called csv.tcl and invoke it as:
$ tclsh csv.tcl filename indexNumber expectedValue
Explanation
The script reads the CSV file line by line and stores each line in the variable $line, then splits the line into a list of columns (variable $columns). Next, it picks out the specified column and assigns it to the $columnValue variable. If there is a match, it prints out the original line.
Using awk:
export INDEX=2
export VALUE=bar
awk -F, '$'$INDEX' ~ /^'$VALUE'$/ {print}' inputfile.csv
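Edit: As per Dennis Williamson's excellent comment, this could be much more cleanly (and safely) written by defining awk variables using the -v switch. Note that index is the name of a built-in awk function and cannot be used as a variable name, so the column variable is called col here:
awk -F, -v col="$INDEX" -v value="$VALUE" '$col == value {print}' inputfile.csv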
Jeez...with variables, and everything, awk is almost a real programming language...
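Wrapped in a function, the -v approach might look like the sketch below. csv_rows_matching is a hypothetical name, and the sample rows are invented; it assumes simple CSV with no quoted commas.

```bash
#!/usr/bin/env bash
# csv_rows_matching COL VALUE (hypothetical helper): print stdin lines whose
# COL-th comma-separated field equals VALUE.
csv_rows_matching() {
  local col=$1 value=$2
  # The awk variable is named col, not index, because index is an awk built-in.
  awk -F, -v col="$col" -v value="$value" '$col == value'
}

printf 'a,bar,1\nb,baz,2\nc,bar,3\n' | csv_rows_matching 2 bar
# prints a,bar,1 and c,bar,3
```

Two caveats: awk interprets backslash escapes in -v values, and when both the field and the value look numeric the comparison is numeric, so 1.0 would match 1.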
For situations where the data does not contain any special characters, the solution suggested by Nate Kohl and ghostdog74 is good.
If the data contains commas or newlines inside the fields, awk may not properly count the field numbers and you'll get incorrect results.
You can still use awk, with some help from a program I wrote called csvquote (available at https://github.com/dbro/csvquote):
csvquote inputfile.csv | awk -F, -v col="$INDEX" -v value="$VALUE" '$col == value {print}' | csvquote -u
This program finds special characters inside quoted fields, and temporarily replaces them with nonprinting characters which won't confuse awk. Then they get restored after awk is done.
index=1
value=2
awk -F"," -v i=$index -v v=$value '$(i)==v' file
I was looking for an elegant solution that supports quoting and wouldn't require installing anything fancy on my VMware vMA appliance. It turns out this simple Python script does the trick! (I named the script csv2tsv.py, since it converts CSV into tab-separated values, TSV.)
#!/usr/bin/env python3
import csv
import sys
# Read CSV from stdin and print each row with its fields joined by tabs.
for row in csv.reader(sys.stdin):
    print('\t'.join(row))
Tab-separated values can be split easily with the cut command (no delimiter needs to be specified, tab is the default). Here's a sample usage/output:
> esxcli -h $VI_HOST --formatter=csv network vswitch standard list |csv2tsv.py|cut -f12
Uplinks
vmnic4,vmnic0,
vmnic5,vmnic1,
vmnic6,vmnic2,
In my scripts I'm actually going to parse tsv output line by line and use read or cut to get the fields I need.
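Reading the TSV output line by line with read could be sketched as follows. tsv_column is a hypothetical helper and the sample rows are invented.

```bash
#!/usr/bin/env bash
# tsv_column N (hypothetical helper): print the Nth tab-separated field
# of each stdin line.
tsv_column() {
  local n=$1
  local -a fields
  while IFS=$'\t' read -r -a fields; do
    printf '%s\n' "${fields[n-1]}"   # array indexes are 0-based
  done
}

printf 'Name\tUplinks\nvSwitch0\tvmnic4,vmnic0,\n' | tsv_column 2
# prints Uplinks, then vmnic4,vmnic0,
```

One design caveat: tab counts as IFS whitespace, so read collapses consecutive tabs and empty fields shift the column numbering. cut -f keeps empty fields and is the safer choice when they can occur.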
Parsing CSV with primitive text-processing tools will fail on many types of CSV input.
xsv is a lovely and fast tool for doing this properly. To search for all records that contain the string "foo" in the third column:
cat file.csv | xsv search -s 3 foo
A sed or awk solution would probably be shorter, but here's one for Perl:
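perl -F/,/ -lane 'print if $F[<INDEX>] eq "<VALUE>"' inputfile.csv
where <INDEX> is 0-based (0 for first column, 1 for 2nd column, etc.). The -l switch strips the trailing newline, so the last column compares correctly too.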
Awk (gawk) actually provides extensions, one of which being csv processing.
Assuming that extension is installed, you can use awk to show all lines where a specific csv field matches 123.
Assuming test.csv contains the following:
Name,Phone
"Woo, John",425-555-1212
"James T. Kirk",123
The following will print all lines where the Phone (aka the second field) is equal to 123:
gawk -l csv 'csvsplit($0,a) && a[2] == 123 {print $0}' test.csv
The output is:
"James T. Kirk",123
How does it work?
-l csv asks gawk to load the csv extension by looking for it in $AWKLIBPATH;
csvsplit($0, a) splits the current line, and stores each field into a new array named a
&& a[2] == 123 checks that the second field is 123
if both conditions are true, it { print $0 }, aka prints the full line as requested.
