Break down a string into two arrays using awk/sed/grep - bash

Create two different arrays in shell/bash script from content in a text file which has details about different files. How do I extract directories into one array and filenames to another array, using awk/sed/grep?
I have a text file as shown below:
2017-02-04 07:18 /temp/folder1/filename_20170204_something.txt
2017-03-04 07:18 /temp/folder2/filename_20170204_20170304.txt
2017-04-04 07:18 /temp/folder3/filename_20170404_.txt
directories_list= {folder1,folder2,folder3}
file_list = {filename_20170204.txt,filename_20170304.txt,filename_20170404.txt}

I would use awk to split the lines into columns, then print the column number for the folder and the column number for the file. You can tell awk what the delimiting character is with the -F option.
This script stores the folders in one array and the files in another.
#!/bin/bash
FOLDERS=() # declares FOLDERS as an array
FILES=() # declares FILES as an array
INPUT=input.txt # change to the path of your data file
while IFS= read -r LINE
do
    FOLDER=$(echo "$LINE" | awk -F / '{print $3}')
    FILE=$(echo "$LINE" | awk -F / '{print $4}')
    echo "Reading next line..."
    echo "FOLDER: $FOLDER"
    echo "FILE: $FILE"
    echo ""
    FOLDERS+=( "$FOLDER" ) # appends $FOLDER to the FOLDERS array
    FILES+=( "$FILE" ) # appends $FILE to the FILES array
done < "$INPUT"
# Now the FOLDERS array and FILES array have what you want
echo "FOLDERS array: ${FOLDERS[@]}"
echo "FILES array: ${FILES[@]}"
That's assuming you have the input.txt file in the same directory and it contains your sample data.
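The loop above starts two awk processes per input line. As a rough sketch of the same idea, the version below runs awk once and reads both columns back into the arrays. The function name split_listing is hypothetical, and it assumes bash (for arrays and process substitution) and that folder and file names contain no tab characters.

```shell
# Hypothetical helper: split_listing FILE fills the global arrays FOLDERS and FILES.
split_listing() {
    FOLDERS=() FILES=()
    local folder file
    # awk runs once for the whole file and prints "folder<TAB>file" per line;
    # $(NF-1) and $NF are the last two /-separated components of the path
    while IFS=$'\t' read -r folder file; do
        FOLDERS+=("$folder")
        FILES+=("$file")
    done < <(awk -F/ -v OFS='\t' 'NF >= 2 {print $(NF-1), $NF}' "$1")
}
```

Calling split_listing input.txt and then declare -p FOLDERS FILES shows the result. Using $(NF-1) and $NF instead of $3 and $4 means the path depth does not have to be exactly two directories.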

Read the file line by line, split each line with read and IFS, use basename and dirname to get the path components, and append them to the arrays:
cat <<EOF >file
2017-02-04 07:18 /temp/folder1/filename_20170204_something.txt
2017-03-04 07:18 /temp/folder2/filename_20170204_20170304.txt
2017-04-04 07:18 /temp/folder3/filename_20170404_.txt
EOF
dirs=() files=()
while IFS=' ' read -r _ _ path; do
dirs+=("$(basename "$(dirname "$path")")")
files+=("$(basename "$path")")
done <file
declare -p dirs files
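If starting basename and dirname for every line turns out to be slow, shell parameter expansion may be an alternative. This is only a sketch, assuming each path contains at least one directory component and has no trailing slash (basename and dirname handle those edge cases, the expansions below do not). The variable names are illustrative.

```shell
path=/temp/folder1/filename_20170204_something.txt
file=${path##*/}   # remove everything up to and including the last slash
dir=${path%/*}     # remove the last slash and everything after it
dir=${dir##*/}     # keep only the last remaining directory component
printf '%s %s\n' "$dir" "$file"
```

Inside the while loop, the same three expansions could replace the two command substitutions.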

How do you want to handle duplicate entries in the arrays, and is there a specific order in which you want to store the files?
If not, you can use the commands below (they remove duplicate entries and sort the names):
INPUT_FILE=input.txt # change to the path of your data file
# mapfile (bash 4+) stores one array element per output line
mapfile -t folders < <(awk '{print $NF}' "$INPUT_FILE" | awk -F'/' '{print $(NF-1)}' | sort -u)
mapfile -t files < <(awk '{print $NF}' "$INPUT_FILE" | awk -F'/' '{print $NF}' | sort -u)
Here is what the awk commands do:
awk '{print $NF}' "$INPUT_FILE" -> prints the last whitespace-separated field of each line (the full path)
awk -F'/' '{print $(NF-1)}' -> splits that path on / and prints the penultimate component (the folder); $NF prints the last component (the file name)
Hope this helps!
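If duplicates should go but the original order matters, sorting is not required. The sketch below uses the common awk idiom of a "seen" array to print each folder only the first time it appears. The function name unique_folders is hypothetical, and it assumes the path is the last whitespace-separated field on each line.

```shell
# Hypothetical helper: print each parent folder name once, in first-seen order.
unique_folders() {
    awk '{print $NF}' "$1" |
        awk -F/ '!seen[$(NF-1)]++ {print $(NF-1)}'   # true only on the first occurrence
}
# In bash 4+, the output could be loaded into an array with:
#   mapfile -t folders < <(unique_folders input.txt)
```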

Related

xargs and cut: getting `cut` fields of a csv to bash variable

I am using xargs in conjunction with cut, but I am unsure how to get the output of cut into variables that I can use for further processing.
So, I have a text file like so:
test.txt:
/some/path/to/dir,filename.jpg
/some/path/to/dir2,filename2.jpg
...
I do this:
cat test.txt | xargs -L1 | cut -d, -f 1,2
/some/path/to/dir,filename.jpg
but what I'd like to do is:
cat test.txt | xargs -L1 | cut -d, -f 1,2 | echo $1 $2
where $1 and $2 are /some/path/to/dir and filename.jpg
I am stumped; I cannot seem to achieve this.
You may want to say something like:
#!/bin/bash
while IFS=, read -r f1 f2; do
echo ./mypgm -i "$f1" -o "$f2"
done < test.txt
IFS=, read -r f1 f2 reads test.txt one line at a time, splits each line on the comma, and assigns the fields to the variables f1 and f2.
The echo line is for demonstration purposes only. Replace it with your desired command using $f1 and $f2.
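One detail worth knowing: read returns a non-zero status on a final line that has no trailing newline, so that line would be skipped by the loop. A small sketch of a guard against that is shown here. The function name show_fields and the output format are made up for illustration.

```shell
# Hypothetical helper: print "dir=<f1> file=<f2>" for each comma-separated line.
show_fields() {
    # "|| [ -n "$f1" ]" keeps the last line even if it lacks a trailing newline
    while IFS=, read -r f1 f2 || [ -n "$f1" ]; do
        printf 'dir=%s file=%s\n' "$f1" "$f2"
    done < "$1"
}
```

Because only two variable names are given, any extra commas on a line end up inside f2 rather than being dropped.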
Try this:
cat test.txt | awk -F, '{print $1, $2}'
From man xargs:
xargs [-L number] [utility [argument ...]]
-L number
Call utility for every number non-empty lines read.
From man awk:
Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile.
So you don't have to use xargs -L1 as you don't pass the utility to call.
Also from man awk:
The -F fs option defines the input field separator to be the regular expression fs.
So awk -F, can replace the cut -d, part.
The fields are denoted $1, $2, ..., while $0 refers to the entire line.
So $1 is for the first column, $2 is for the second one.
An action is a sequence of statements. A statement can be one of the following:
print [ expression-list ] [ > expression ]
An empty expression-list stands for $0.
The print statement prints its argument on the standard output (or on a file if > file or >> file is present or on a pipe if | cmd is present), separated by the current output field separator, and terminated by the output record separator.
Putting all of this together, cat test.txt | awk -F, '{print $1, $2}' achieves what you want.

How to get output of awk into a tab-delimited file merging two lines to a line every time?

I have multiple files in gz format and used this script which counts lines in each file and prints 1/4 of lines for each file:
for file in *.gz;
do echo $file;
gunzip -c $file | wc -l | awk '{print $1/4}';
done
STDOUT:
AB.gz
12
CD.gz
4
How can I pipe the output of awk into a tab-delimited file like this, merging two lines each time:
AB.gz 12
CD.gz 4
I tried paste by piping | paste -sd '\t' > output.txt in the script but it didn't work.
You can use a script like this:
for file in *.gz; do
gunzip -c "$file" | awk -v fn="$file" -v OFS='\t' 'END{print fn, int(NR/4)}'
done
Do not print a newline after the file name; print a tab instead:
for file in *.gz;
do
printf '%s\t' "${file}"
gunzip -c "$file" | wc -l | awk '{print $1/4}';
done
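The paste attempt from the question can also be made to work. With -s, paste joins all input lines into one line, whereas paste - - reads two lines per output row and joins them with a tab. A minimal sketch, with a hypothetical function name and a directory argument added so it can be tried outside the current directory:

```shell
# Hypothetical helper: print "<name><TAB><lines/4>" for every .gz file in directory $1.
quarter_counts() {
    for file in "$1"/*.gz; do
        [ -e "$file" ] || continue      # skip the literal pattern when nothing matches
        basename "$file"                # first line of the pair: the file name
        gunzip -c "$file" | wc -l | awk '{print $1/4}'   # second line: the count
    done | paste - -                    # join each pair of lines with a tab
}
```

Running quarter_counts . > output.txt would write the tab-delimited file for the current directory.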

One line command with variable, word count and zcat

I have many files on a server which contains many lines:
201701010530.contentState.csv.gz
201701020530.contentState.csv.gz
201701030530.contentState.csv.gz
201701040530.contentState.csv.gz
I would like with one line command this result:
170033|20170101
169865|20170102
170010|20170103
170715|20170104
The goal is to have the number of lines of each file, just by keeping the date which is already in the filename of the file.
I tried this but the result is not in one line but two...
for f in $(ls -1 2017*gz);do zcat $f | wc -l;echo $f | awk '{print substr($0,1,8)}';done
Thanks in advance guys.
Just use zcat file | wc -l to get the number of lines.
For the name, I understand it is enough to extract the first 8 characters:
$ t="201701030530.contentState.csv.gz"
$ echo "${t:0:8}"
20170103
All together:
for file in 2017*gz;
do
lines=$(zcat "$file" | wc -l)
printf "%s|%s\n" "$lines" "${file:0:8}"
done > myresult.csv
Note the usage of for file in 2017*gz; to go through the files matching the 2017*gz pattern: this suffices, no need to parse ls!
Use zgrep -c ^ file to count the lines, here encapsulated in awk:
$ awk 'FNR==1{ "zgrep -c ^ " FILENAME | getline s; print s "|" substr(FILENAME,1,8) }' *.gz
12|20170101
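It is cleaner to store the command string "zgrep -c ^ " FILENAME in a variable (for example cmd), run cmd | getline s, and then call close(cmd) so that pipes do not stay open across many files.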
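A possible variation on the awk approach is sketched below. It loops over ARGV in a BEGIN block, so awk never reads the compressed data itself, and it closes each command after use. The function name count_by_date is hypothetical, and the sketch assumes zgrep is installed and that file names contain no double quotes or other shell metacharacters.

```shell
# Hypothetical helper: print "<line count>|<first 8 chars of name>" for each file argument.
count_by_date() {
    awk 'BEGIN {
        for (i = 1; i < ARGC; i++) {
            cmd = "zgrep -c ^ \"" ARGV[i] "\""   # counts the lines of one archive
            cmd | getline n                      # read that count into n
            close(cmd)                           # release the pipe before the next file
            print n "|" substr(ARGV[i], 1, 8)
        }
    }' "$@"
}
```

It is meant to be run from the directory that holds the files, for example count_by_date 2017*gz, because substr takes the first 8 characters of whatever name is passed in.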

Bash script read specific value from files of an entire folder

I have a problem creating a script that reads specific values from all the files in a folder.
I have a number of email files in a directory, and I need to extract 2 specific values from each file.
After that I have to put them into a new file that looks like this:
--------------
To: value1
value2
--------------
This is what I want to do, but I don't know how to create the script:
# I am putting the name of the files into a temp file
`ls -l | awk '{print $9 }' >tmpfile`
# use for the name of a file
date=`date +"%T"`
# The first specific value from file (phone number)
var1=`cat tmpfile | grep "To: 0" | awk '{print $2 }' | cut -b -10 `
# The second specific value from file(subject)
var2=`cat file | grep Subject | awk '{print $2$3$4$5$6$7$8$9$10 }'`
# Put the first value in a new file on the first row
echo "To: 4"$var1"" > sms-$date
# Put the second value in the same file on the second row
echo ""$var2"" >>sms-$date
.......
and do the same for every file in the directory
I tried using while and for loops, but I couldn't finish the script.
Thank You
I've made a few changes to your script, hopefully they will be useful to you:
#!/bin/bash
for file in *; do
    var1=$(awk '/To: 0/ {print substr($2,1,10)}' "$file")
    var2=$(awk '/Subject/ {for (i=2; i<=10; ++i) s=s$i; print s}' "$file")
    date=$(date +"%T") # time stamp used in the output file name
    outfile="sms-$date"
    i=0
    # if the name is already taken, append an increasing numeric suffix
    while [ -f "$outfile" ]; do outfile="sms-$date-$((i++))"; done
    echo "To: 4$var1" > "$outfile"
    echo "$var2" >> "$outfile"
done
The for loop just goes through every file in the folder that you run the script from.
I have added an additional suffix $i to the end of the file name. If no file with the same time stamp already exists, the file is created without the suffix. Otherwise the value of $i keeps increasing until there is no file with the same name.
I'm using $( ) rather than backticks, this is just a personal preference but it can be clearer in my opinion, especially when there are other quotes about.
There's not usually any need to pipe the output of grep to awk. You can do the search in awk using the / / syntax.
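I have removed the cut -b -10 and replaced it with awk's substr on column 2, which prints the first 10 characters. Note that awk strings are indexed from 1, so the call should be substr($2, 1, 10); with a start position of 0 most awk implementations return only 9 characters.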
It's not much shorter but I used a loop rather than the $2$3..., I think it looks a bit neater.
There's no need for all the extra " in the two output lines.
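The two awk calls could also be folded into one pass over each file. The following is only a sketch under some assumptions that are not in the question: the headers start at the beginning of a line ("To: 0..." and "Subject:"), only the first match of each is wanted, and all subject words are joined rather than just fields 2 to 10. The function name sms_lines is hypothetical.

```shell
# Hypothetical helper: print the two output lines for one email file.
sms_lines() {
    awk '
        /^To: 0/    && !to   { to = substr($2, 1, 10) }                    # first 10 chars of the number
        /^Subject:/ && !subj { for (i = 2; i <= NF; i++) subj = subj $i }  # subject words, joined
        END { print "To: 4" to; print subj }
    ' "$1"
}
```

Inside the for loop this would be used as sms_lines "$file" > "$outfile".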
I suggest trying the following:
#!/bin/sh
RESULT_FILE=sms-`date +"%T"`
DIR=.
# list the files in $DIR that contain a matching To: line
grep -Fl 'To: 0' "$DIR"/* | while IFS= read -r FILE; do
var1=`grep -F 'To: 0' "$FILE" | awk '{print $2 }' | cut -b -10`
var2=`grep -F 'Subject' "$FILE" | awk '{print $2$3$4$5$6$7$8$9$10 }'`
echo "To: 4$var1" >>"$RESULT_FILE"
echo "$var2" >>"$RESULT_FILE"
done

using awk within loop to replace field

I have written a script finding the hash value from a dictionary and outputting it in the form "word:md5sum" for each word. I then have a file of names which I would like to use to place each name followed by every hash value i.e.
tom:word1hash
tom:word2hash
.
.
bob:word1hash
and so on. Everything works fine but I can not figure out the substitution. Here is my script.
#!/bin/bash
#/etc/dictionaries-common/words
cat words.txt | while read line; do echo -n "$line:" >> dbHashFile.txt
echo "$line" | md5sum | sed 's/[ ]-//g' >> dbHashFile.txt; done
cat users.txt | while read name
do
cat dbHashFile.txt >> nameHash.txt;
awk '{$1="$name"}' nameHash.txt;
cat nameHash.txt >> dbHash.txt;
done
the line
awk '{$1="$name"}' nameHash.txt;
is where I attempt to do the substitution.
thank you for your help
Try replacing the entire contents of the last loop (both cats and the awk) with:
awk -v name="$name" -F ':' '{ print name ":" $2 }' dbHashFile.txt >>dbHash.txt
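The original attempt did not work for two reasons: inside single quotes the shell does not expand $name, so awk saw the literal string "$name", and the awk program assigned to $1 but never printed anything. A minimal sketch of the whole second loop using -v is shown below. The function name name_hashes is hypothetical, and it assumes the hash file has the form word:hash with no colon inside the word.

```shell
# Hypothetical helper: for each name in file $1, print name:hash for every hash in file $2.
name_hashes() {
    while IFS= read -r name; do
        # -v copies the shell variable into an awk variable; $2 inside the
        # single quotes is the second colon-separated awk field (the hash)
        awk -v name="$name" -F: '{print name ":" $2}' "$2"
    done < "$1"
}
```

For the files in the question this would be called as name_hashes users.txt dbHashFile.txt > dbHash.txt.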
