Prepending part of a filename to a .csv file using bash/sed - bash

I have a couple of files in a directory that are named like this:
1_38OE983729JKHKJV.csv
an integer followed by an ID (the integer and ID are both unique).
For each file in the folder, I need to prepend this ID to every line of the file to prepare the files for import to a database (and discard the integer part of the filename). The contents of a file look something like this:
BW;20015;11,45;0,49;41;174856;4103399
BA;25340;11,41;0,55;40;222161;4599779
BB;800;7,58;0,33;42;10559;239887
HE;6301;9,11;0,39;40;69191;1614302
.
.
.
Total;112613;9,33;0,43;40;1207387;25897426
The end result should look something like this:
38OE983729JKHKJV;BW;20015;11,45;0,49;41;174856;4103399
38OE983729JKHKJV;BA;25340;11,41;0,55;40;222161;4599779
38OE983729JKHKJV;BB;800;7,58;0,33;42;10559;239887
38OE983729JKHKJV;HE;6301;9,11;0,39;40;69191;1614302
.
.
.
38OE983729JKHKJV;Total;112613;9,33;0,43;40;1207387;25897426
Thanks for the help!
EDIT: Spelling and vocabulary for clarity

Loop over the files with for, and use parameter expansion to extract the ID.
#!/bin/bash
for csv in *.csv ; do
    prefix=${csv%_*}             # the integer part, e.g. "1" (not used further)
    id=${csv#*_}                 # strip everything up to the first underscore
    id=${id%.csv}                # strip the .csv extension
    sed -i~ "s/^/$id;/" "$csv"   # prepend "ID;" to every line, keeping a ~ backup
done
If the ID can contain underscores, you might need to be more careful with the expansion.
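For instance, with a hypothetical filename such as 12_AB_CD.csv (just to illustrate the expansions), the ID extraction above still works, because ${csv#*_} only strips through the first underscore; it is the prefix extraction that needs the greedy %% form:
# hypothetical filename with an underscore inside the ID
csv=12_AB_CD.csv
id=${csv#*_}       # strip through the FIRST underscore -> AB_CD.csv
id=${id%.csv}      #                                    -> AB_CD
prefix=${csv%%_*}  # longest suffix match, so prefix is -> 12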

With the awk tool:
for f in *.csv; do awk '{ fn=FILENAME; $0=substr(fn,index(fn,"_")+1,length(fn)-index(fn,"_")-4)";"$0 }1' "$f" > tmp && mv tmp "$f"; done
fn=FILENAME - awk's built-in variable holding the name of the current input file
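The substr() arithmetic is easier to follow spelled out (a sketch of the same logic, reformatted with comments):
awk '{
    fn = FILENAME                            # name of the current input file
    start = index(fn, "_") + 1               # position just after the underscore
    len = length(fn) - index(fn, "_") - 4    # drop the prefix, underscore and ".csv"
    $0 = substr(fn, start, len) ";" $0       # prepend "ID;" to the line
}1' "$f" > tmp && mv tmp "$f"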

Try the following single awk command too; it closes each input file once it is done with it, so we also avoid the "too many open files" error:
awk 'FNR==1{close(val);val=FILENAME;split(FILENAME,a,"_");sub(/\..*/,"",a[2])} {print a[2]";"$0}' *.csv

With GNU awk for inplace editing and gensub() all you need is:
awk -i inplace '{print gensub(/.*_(.*)\..*/,"\\1;",1,FILENAME) $0}' *.csv
No shell loops or anything else necessary, just that command.
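To see what the gensub() call produces for one of the sample filenames (a quick sanity check, not part of the solution):
$ awk 'BEGIN{print gensub(/.*_(.*)\..*/,"\\1;",1,"1_38OE983729JKHKJV.csv")}'
38OE983729JKHKJV;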

Related

batch rename matching files using 1st field to replace and 2nd as search criteria

I have a very large selection of files, e.g.:
foo_de.vtt, foo_en.vtt, foo_es.vtt, foo_fr.vtt, foo_pt.vtt, baa_de.vtt, baa_en.vtt, baa_es.vtt, baa_fr.vtt, baa_pt.vtt... etc.
I have created a tab-separated file, filenames.txt, containing the current string and the replacement string, e.g.:
foo 1000
baa 1016
...etc.
I want to rename all of the files to get the following:
1000_de.vtt, 1000_en.vtt, 1000_es.vtt, 1000_fr.vtt, 1000_pt.vtt, 1016_de.vtt, 1016_en.vtt, 1016_es.vtt, 1016_fr.vtt, 1016_pt.vtt
I know I can use a utility like rename to do it manually, term by term, e.g.:
rename 's/foo/1000/g' *.vtt
Could I chain this into an awk command so that it runs through filenames.txt? Or is there an easier way to do it just in awk? I know I can rename with awk, such as:
find . -type f | awk -v mvCmd='mv "%s" "%s"\n' \
'{ old=$0;
gsub(/foo/,"1000");
printf mvCmd,old,$0;
}' | sh
How can I get awk to process filenames.txt and do all of this in one go?
This question is similar but uses sed. I feel that, being tab-separated, this should be quite easy in awk?
First ever post so please be gentle!
Solution
Thanks for all your help. Ultimately I was able to solve by adapting your answers to the following:
while read old new; do
    rename "s/$old/$new/g" *.vtt
done < filenames.txt
I'm assuming that the strings in the TSV file are literals (not regexes nor globs) and that the part to be replaced can be located anywhere in the filenames.
With that said, you can use mv with shell globs and bash parameter expansion:
#!/bin/bash
while IFS=$'\t' read -r old new
do
    for f in *"$old"*.vtt
    do
        mv "$f" "${f/"$old"/$new}"
    done
done < file.tsv
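The ${f/"$old"/$new} expansion replaces the first occurrence of $old in the filename; quoting $old inside the expansion makes bash treat it as a literal string rather than a glob pattern. For example:
$ f=foo_de.vtt old=foo new=1000
$ echo "${f/"$old"/$new}"
1000_de.vtt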
Or with util-linux rename, which replaces a fixed string and is more performant:
while IFS=$'\t' read -r old new
do
    rename "$old" "$new" *"$old"*.vtt
done < file.tsv
This might work for you (GNU sed and rename):
sed -E 's#(.*)\t(.*)#rename -n '\''s/\1/\2/'\'' \1*#e' ../file
This builds a script which renames the files in the current directory, using the match/replacement pairs in ../file.
Once you are happy with the results, remove the -n and the renaming will be enacted.
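With the e flag removed, you can preview the script sed builds before executing anything; for the sample pairs it would print something like:
rename -n 's/foo/1000/' foo*
rename -n 's/baa/1016/' baa*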

Mac Terminal Bash awk change multiple file names to $NF output

I have been working on this script to retrieve files from all the folders in my directory and change their names to my desired output.
Before filename:
Folder\actors\character\hair\haircurly1.dds
After filename:
haircurly1.dds
I am working with over 12,000 textures with different names that I extracted from an archive. My extractor included the path to the folder where it extracted the files in each file name. For example, a file that should have been named haircurly1.dds was named Folder\actors\character\hair\haircurly1.dds during extraction.
cd ~/Desktop/MainFolder/Folder
find . -name '*\\*.dds' | awk -F\\ '{ print; print $NF; }'
This code retrieves every texture file I am looking at that contains backslashes. (I have already renamed some of the files with other commands, but I want one that changes all of the files at once, rather than writing a specific command for every folder across 12,000+ texture files.)
I use print; and it sends me the file path:
./Folder\actors\character\hair\haircurly1.dds
I use print $NF; and it sends me the text after the awk separator:
\
haircurly1.dds
I would like every file name that this script runs through to be changed to the $NF output of the awk command. Anyone know how I can make my script change the file names to their $NF output?
Thank you
Your question isn't clear but it SOUNDS like all you want to do is:
for file in *\\*; do
    mv -- "$file" "${file##*\\}"
done
If that's not all you want then edit your question to clarify your requirements.
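The ${file##*\\} expansion deletes the longest prefix ending in a backslash, leaving just the final path component:
$ file='Folder\actors\character\hair\haircurly1.dds'
$ echo "${file##*\\}"
haircurly1.dds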
Have your awk command format and print a "mv" command, and pipe the result to bash. The extra single-quoting ensures bash treats backslash as a normal char.
find . -name '*\\*.dds' | awk -F\\ '{print "mv '\''" $0 "'\'' " $NF}' | bash -x
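For the sample path above, the command this pipeline generates and hands to bash would look like:
mv './Folder\actors\character\hair\haircurly1.dds' haircurly1.dds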
hth

Removes duplicate lines from files recursively

I have a directory with a bunch of csv files. I want to remove the duplicate lines from all the files.
I have tried an awk solution, but it seems a bit tedious to run it for each and every file.
awk '!x[$0]++' file.csv
Even if I do
awk '!x[$0]++' *
I will lose the file names. Is there a way to remove duplicates from all the files using just one command or script?
Just to clarify:
If there are 3 files in the directory, then the output should contain 3 files, each deduplicated independently. After running the command or script, the same folder should contain 3 files, each with unique entries.
for f in dir/*; do
    awk '!a[$0]++' "$f" > "$f.uniq"
done
To overwrite the existing files, change the loop body to awk '!a[$0]++' "$f" > "$f.uniq" && mv "$f.uniq" "$f" (after testing!)
With GNU awk for "inplace" editing and automatic open/close management of output files:
awk -i inplace '!seen[FILENAME,$0]++' *.csv
Including FILENAME in the array key makes the deduplication per-file rather than global, so a line survives in one file even if it also appears in another.
This will create new files, with suffix .new, that have only unique lines:
gawk '!x[$0]++{print>(FILENAME".new")}' *.csv
How it works
!x[$0]++
This is a condition. It evaluates to true only if the current line, $0, has not been seen before.
print >(FILENAME".new")
If the condition evaluates to true, then this print statement is executed. It writes the current line to a file whose name is the name of the current file, FILENAME, followed by the string .new.
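The first-seen behaviour of the !x[$0]++ idiom is easy to check on a throwaway input:
$ printf 'a\nb\na\n' | awk '!x[$0]++'
a
b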

remove lines from file that does not have dot extension in bash

I have a file that contains lines like these:
/folder/share/folder1
/folder/share/folder1/file.gz
/folder/share/folder2/11072012
/folder/share/folder2/11072012/file1.rar
I am trying to remove these lines:
/folder/share/folder1
/folder/share/folder2/11072012
To get the following final result:
/folder/share/folder2/11072012/file1.rar
/folder/share/folder1/file.gz
In other words, I am trying to keep only the path for files and not directories.
This
awk -F/ '$NF~/\./{print}'
splits input records on the character "/" using the command-line switch -F,
examines the last field of the input record, $NF (where NF is the number of fields in the input record), to see if it DOES contain the character "." (the ~ match operator),
and, if it matches, outputs the record.
Example
$ echo -e '/folder/share/folder.2/11072012
/folder/share/folder2/11072012/file1.rar' | mawk -F/ '$NF~/\./{print}'
/folder/share/folder2/11072012/file1.rar
$
NB: my microscript looks for . ONLY in the filename part of the full path.
Edit: in my first post I had reversed the logic, printing dotless files instead of dotted ones.
You could use the find command to get only the list of files:
find <directory> -type f
With awk:
awk -F/ '$NF ~ /\./{print}' File
Set / as delimiter, check if last field ($NF) has . in it, if yes, print the line.
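Run against the sample input, this keeps only the file paths:
$ awk -F/ '$NF ~ /\./{print}' File
/folder/share/folder1/file.gz
/folder/share/folder2/11072012/file1.rar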
Text-only result:
sed -n 'H
$ {g
:cycle
s/\(\(\n\).*\)\(\(\2.*\)\{0,1\}\)\1/\3\1/g
t cycle
s/^\n//p
}' YourFile
This works on the file and folder names alone, assuming that:
lines that are contained in other lines are folders and unique lines are files (the result could be double-checked with a file-existence test against the OS);
lines are sorted (at least each folder appears before the files inside it).
This is the POSIX version, so use --posix on GNU sed.

How to extract a string at end of line after a specific word

I have different locations, but they all follow a pattern:
some_text/some_text/some_text/log/some_text.text
The locations don't all start with the same thing, and they don't have the same number of subdirectories, but I am only interested in what comes after log/. I would like to extract the .text extension.
Edited question:
I have a lot of locations:
/s/h/r/t/log/b.p
/t/j/u/f/e/log/k.h
/f/j/a/w/g/h/log/m.l
Just to show you that I don't know what they are; the user enters these locations, so I have no idea what the user will enter. The only thing I know is that it always contains log/ followed by the name of the file.
I would like to extract the type of the file, whatever string comes after the dot.
The only thing I know is that it always contains log/ followed by the name of the file.
I would like to extract the type of the file, whatever string comes after the dot.
Based on this requirement, this line works:
grep -o '[^.]*$' file
For your example, it outputs:
text
You can use bash built-in string operations. The example below will extract everything after the last dot from the input string.
$ var="some_text/some_text/some_text/log/some_text.text"
$ echo "${var##*.}"
text
Alternatively, use sed:
$ sed 's/.*\.//' <<< "$var"
text
Not the cleanest way, but this will work:
sed -e 's/.*log\///' -e 's/\..*//'
That is the sed pattern for it anyway; I'm not sure if you have the string in a variable, are reading from a file, etc.
You could also capture that text in sed's hold space for later substitution, etc. It all depends on exactly what you are trying to do.
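Reading from stdin, for one of the sample paths this yields the bare filename (note that it strips the extension rather than extracting it):
$ echo '/s/h/r/t/log/b.p' | sed -e 's/.*log\///' -e 's/\..*//'
b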
Using awk
awk -F'.' '{print $NF}' file
Using sed
sed 's/.*\.//' file
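With the three sample paths in file, both commands print just the extensions:
$ awk -F'.' '{print $NF}' file
p
h
l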
Running from the root of this structure:
/s/h/r/t/log/b.p
/t/j/u/f/e/log/k.h
/f/j/a/w/g/h/log/m.l
This seems to work; you can skip the echo command if you really just want the file types with no record of where they came from.
$ for DIR in *; do
> echo -n "$DIR "
> find $DIR -path "*/log/*" -exec basename {} \; | sed 's/.*\.//'
> done
f l
s p
t h
