Adding part of filename as column to csv files, then concatenate - bash

I have many csv files that look like this:
data/0.Raw/20190401_data.csv
(Only the date in the middle of the filename changes)
I want to concatenate all these files together, but add the date as a new column in the data to be able to distinguish between the different files after merging.
I wrote a bash script that adds the full path and filename as a column in each file and then merges them into a master csv. However, I am having trouble getting rid of the path and the extension so that only the date portion is kept.
The bash script
#!/bin/bash
mkdir -p data/1.merged
for i in "data/0.Raw/"*.csv; do
    awk -F, -v OFS=, 'NR==1 { sub(/_data\.csv$/, "", FILENAME) } NR>1 { $1=FILENAME } 1' "$i" |
        column -t > "data/1.merged/${i#data/0.Raw/}"
done
awk 'FNR > 1' data/1.merged/*.csv > data/1.merged/all_files
rm data/1.merged/*.csv
mv data/1.merged/all_files data/1.merged/all_files.csv
using "sub" I was able to remove the "_data.csv" part, but as a result the column gets added as "data/0.Raw/20190401" - that is, I am having trouble removing both the part before the date as well as the part after the date.
I tried replacing sub with gensub to regex match everything except the 8 digits in the middle but that does not seem to work either.
Any ideas on how to solve this?
Thanks!

You can process and concatenate all the files with a single awk call:
awk '
FNR == 1 {
    date = FILENAME
    gsub(/.*\/|_data\.csv$/, "", date)
    next
}
{ print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv
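For example, with two hypothetical sample files (names and contents are invented for illustration), the single call prepends the date column and drops each file's header line, so the merged file carries no header row; prepend one afterwards if you need it:

```shell
# Invented sample files to demonstrate the one-call approach
mkdir -p data/0.Raw
printf 'id,value\n1,10\n2,20\n' > data/0.Raw/20190401_data.csv
printf 'id,value\n3,30\n'       > data/0.Raw/20190402_data.csv

awk '
FNR == 1 {
    date = FILENAME
    gsub(/.*\/|_data\.csv$/, "", date)   # strip the leading path and trailing _data.csv
    next                                 # skip each file header line
}
{ print date "," $0 }
' data/0.Raw/*_data.csv > all_files.csv

cat all_files.csv
# 20190401,1,10
# 20190401,2,20
# 20190402,3,30
```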

However, I am having trouble getting rid of the path and the extension
to only keep the date portion
Then take a look at the basename command:
basename NAME [SUFFIX]
Print NAME with any leading directory components removed. If
specified, also remove a trailing SUFFIX.
Example
basename 'data/0.Raw/20190401_data.csv' _data.csv
gives output
20190401
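Applied to the loop from the question, that might look like the sketch below (the sample file and its contents are made up for illustration):

```shell
# Invented sample input, mirroring the question's layout
mkdir -p data/0.Raw data/1.merged
printf 'a,b\n1,2\n' > data/0.Raw/20190401_data.csv

for i in data/0.Raw/*_data.csv; do
    d=$(basename "$i" _data.csv)    # strips the path and the _data.csv suffix -> 20190401
    awk -F, -v OFS=, -v d="$d" 'NR > 1 { $1 = d } 1' "$i" \
        > "data/1.merged/$d.csv"
done

cat data/1.merged/20190401.csv
# a,b
# 20190401,2
```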

Related

The for loop overwrites or duplicates entries

Say, I have 250 files, and from which I need to extract certain information and store them in a text file.
I have tried for loop in the shell as following,
text= 'home/path/tothe/textfiles'
for sam in $(find ${text} -name \*_PG.tsv);do
#echo ${sam}
awk '{if($2=="ID") print FILENAME"\t""yes""\t""SAP""\t""LUFTA"}' ${sam}
done >> ${text}/metadata.txt
With the > operator the output file is overwritten, and with >> the output file collects duplicate entries.
I would like to know what I should change to get rid of these issues. Thanks for the suggestions!
I think you can do this with a single invocation of awk:
path=home/path/tothe/textfiles
awk -v OFS='\t' '$2 == "ID" {
    print FILENAME, "yes", "SAP", "LUFTA"
}' "$path"/*_PG.tsv > "$path"/metadata.txt
careful with your variable assignments, there should be no spaces around the =
use the shell to expand the list of files, without find
pass the full list of files as arguments to awk, instead of looping one by one
set the Output Field Separator OFS instead of writing \t to separate your fields
redirect the output to the metadata file
I assume that your awk script is behaving as you expect. I removed the redundant if, since awk scripts are written as condition { action }. I guess you only want one line of output per file; if so, add a nextfile statement (supported by GNU awk and mawk) inside the block to skip the rest of each file. Note that a plain exit would stop the whole run here, because all the files are now passed to a single awk call.
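A sketch of that one-line-per-file variant using nextfile (the file names and contents below are invented for the demo):

```shell
# Invented sample data: tab-separated files with "ID" in column 2
mkdir -p demo
printf 'x\tID\n1\tID\n' > demo/a_PG.tsv
printf 'x\tID\n'        > demo/b_PG.tsv

awk -v OFS='\t' '$2 == "ID" {
    print FILENAME, "yes", "SAP", "LUFTA"
    nextfile    # skip the rest of the current file (GNU awk / mawk)
}' demo/*_PG.tsv > demo/metadata.txt
```

Each input file contributes at most one line to metadata.txt.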

Using awk to try to find a variable in a CSV line

I am trying to go through two files: the first one line by line, while using awk to search the second for a line containing a string pulled from the first file.
while IFS=, read col1 col2 col3
do
echo $(awk -F, -v var="$col2" '$2==var || $2=="www."var {print $0}' searchFile.csv)
# do stuff with data from awk
done < origFile.csv
I am trying to find domain names in this file, and the awk currently is never returning matches. I have checked the files manually to make sure some that are not returning matches are in both, and they are.
I have tried using a nested loop, but bash was not wanting to open a second file to read and would not read the second file. I also tried using grep, but the files are too large and grep would run out of memory.
Sample input for searchFile.csv:
4915,google.com,oct
3532,domain.ca,nov
33451,yahoo.ca,nov
I have ensured there are no spaces in the data being input, and have verified that $col2 from origFile.csv matches data in the searchFile.csv
Does your data file have spaces? If so, your fields ($2) will keep those spaces and may not match exactly (because with awk -F, the split happens only on commas). Try matching with ~ instead of ==, or trim the whitespace before comparing.
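One way to make that concrete is to compare against a whitespace-trimmed copy of $2, which leaves $0 untouched. A sketch using the sample data from the question (the var value is illustrative; with ~ instead, remember the dots in a domain name are regex metacharacters):

```shell
printf '4915,google.com,oct\n3532,domain.ca,nov\n33451,yahoo.ca,nov\n' > searchFile.csv

awk -F, -v var="google.com" '
{ f = $2; gsub(/^[ \t]+|[ \t]+$/, "", f) }   # trim blanks around field 2 in a copy
f == var || f == "www." var                  # exact match against the trimmed copy
' searchFile.csv
# 4915,google.com,oct
```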

Mac Terminal Bash awk change multiple file names to $NF output

I have been working on this script to retrieve files from all the folders in my directory and trying to change their names to my desired output.
Before filename:
Folder\actors\character\hair\haircurly1.dds
After filename:
haircurly1.dds
I am working with over 12,000 textures with different names that I extracted from an archive. My extractor included the path to the folder where it extracted the files in each file name. For example, a file that should have been named haircurly1.dds was named Folder\actors\character\hair\haircurly1.dds during extraction.
cd ~/Desktop/MainFolder/Folder
find . -name '*\\*.dds' | awk -F\\ '{ print; print $NF; }'
This code retrieves every texture file whose name contains backslashes. (I have already fixed some files with folder-specific commands, but I want a single script that renames all 12,000+ texture files at once, rather than one written for every folder.)
I use print; and it sends me the file path:
./Folder\actors\character\hair\haircurly1.dds
I use print $NF; and it sends me the text after the awk separator:
\
haircurly1.dds
I would like every file name that this script runs through to be changed to the $NF output of the awk command. Anyone know how I can make my script change the file names to their $NF output?
Thank you
Your question isn't clear but it SOUNDS like all you want to do is:
for file in *\\*; do
    mv -- "$file" "${file##*\\}"
done
If that's not all you want then edit your question to clarify your requirements.
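A quick demonstration of that rename (the file name comes from the question; the demo directory is invented). The parameter expansion `${file##*\\}` strips the longest prefix ending in a backslash, keeping only the final component:

```shell
mkdir -p demo2
touch 'demo2/Folder\actors\character\hair\haircurly1.dds'

for file in demo2/*\\*; do
    mv -- "$file" "demo2/${file##*\\}"    # keep only the part after the last backslash
done

ls demo2
# haircurly1.dds
```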
Have your awk command format and print a "mv" command, and pipe the result to bash. The extra single-quoting ensures bash treats backslash as a normal char.
find . -name '*\\*.dds' | awk -F\\ '{print "mv '\''" $0 "'\'' " $NF}' | bash -x
hth

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[@]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
    header=$0
    next
}
!($7 in files) {
    files[$7] = sprintf("sport-%s.csv", $7)
    print header > files[$7]
}
{
    print > files[$7]
}
END {
    printf("declare -a sportlist=(")
    for (sport in files) {
        printf(" \"%s\"", sport)
    }
    printf(" )\n")
}
The idea here is that we store sport names in the array files[], and build filenames out of that array. (You can format the filename inside sprintf() as you see fit.) We step through the file, adding a header line whenever we get a new sport with no recorded filename. Then for non-headers, print to the file based on the sport name.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside a command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you then source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't hard-code the temp file name like this; create a real one with mktemp or the like.)
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
if the files might have newlines in them, too, then things get tricky...
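A minimal readarray sketch along those lines (invented sample rows; header handling in the output files is omitted for brevity):

```shell
# Invented sample data: 7 space-separated columns, sport in column 7
printf 'Date Loc Team1 Team2 Time Prize Sport\n2016 NY Raptors Gators 12pm x Soccer\n2017 LA Kings Ducks 1pm y Golf\n' > input_file.csv

readarray -t filesArray < <(
    awk 'NR == 1 { next }                    # skip the header in this sketch
         { out = $7 ".csv"; print > out }    # route each row to its sport file
         !seen[out]++ { print out }          # emit each new file name once
        ' input_file.csv
)

printf '%s\n' "${filesArray[@]}"
# Soccer.csv
# Golf.csv
```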
If your file is not large, you can run another script to get the unique $7 values, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values; you can adapt it to your file-name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
This can then be piped to your other process, here for example to wc:
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.

remove lines from file that does not have dot extension in bash

I have a file that contains lines like these:
/folder/share/folder1
/folder/share/folder1/file.gz
/folder/share/folder2/11072012
/folder/share/folder2/11072012/file1.rar
I am trying to remove these lines:
/folder/share/folder1
/folder/share/folder2/11072012
To get a final result the following:
/folder/share/folder2/11072012/file1.rar
/folder/share/folder1/file.gz
In other words, I am trying to keep only the path for files and not directories.
This:
awk -F/ '$NF~/\./{print}'
splits input records on the character "/" using the command-line switch -F
examines the last field of the input record, $NF (where NF is the number of fields in the record), to see if it DOES contain the character "." (the ~ operator)
if it matches, outputs the record.
Example
$ echo -e '/folder/share/folder.2/11072012
/folder/share/folder2/11072012/file1.rar' | mawk -F/ '$NF~/\./{print}'
/folder/share/folder2/11072012/file1.rar
$
NB: my microscript looks at . ONLY in the filename part of the full path.
Edit: in my first post I had the logic reversed, printing dotless files instead of dotted ones.
You could use the find command to get only the file list:
find <directory> -type f
With awk:
awk -F/ '$NF ~ /\./{print}' File
Set / as delimiter, check if last field ($NF) has . in it, if yes, print the line.
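Run against the sample lines from the question, this keeps only the two file paths:

```shell
printf '%s\n' \
    /folder/share/folder1 \
    /folder/share/folder1/file.gz \
    /folder/share/folder2/11072012 \
    /folder/share/folder2/11072012/file1.rar > paths.txt

awk -F/ '$NF ~ /\./' paths.txt
# /folder/share/folder1/file.gz
# /folder/share/folder2/11072012/file1.rar
```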
Text only result
sed -n 'H
$ {g
:cycle
s/\(\(\n\).*\)\(\(\2.*\)\{0,1\}\)\1/\3\1/g
t cycle
s/^\n//p
}' YourFile
Based on file names and folder names, assuming that:
lines that are prefixes of other lines are folders, and unique lines are files (this could be complemented by an OS file-existence check on the result)
lines are sorted (at least folders appear before the files inside them)
this is the POSIX version, so use --posix on GNU sed
