Best way to group files based on filename in Bash?

I have a folder with the following files:
DA-001-car.jpg
DA-001-dog.jpg
DA-001-coffee.jpg
DA-002-house.jpg
DA-003-coffee.jpg
DA-003-cat.jpg
...
I want to generate this (CSV) output:
SKU, IMAGE
DA-001, "DA-001-car.jpg, DA-001-dog.jpg, DA-001-coffee.jpg"
DA-002, "DA-002-house.jpg"
DA-003, "DA-003-coffee.jpg, DA-003-cat.jpg"
I tried to program this in Bash:
#!/bin/bash
echo "SKU, FILE" >> tmp.csv
for file in /home/calvin/test/*.jpg
do
    SKU_NAME="${file##*/}"
    echo ${SKU_NAME:0:6}, \"inner for-loop?, ?, ?\" >> tmp.csv
done
uniq tmp.csv output.csv
As you can see, I'm a noob at programming :)
Please help me out, thanks in advance!

This will do the trick. It requires GNU awk to output the SKUs in ascending order; if you don't care about the order, you can use any old awk and remove the PROCINFO line.
#!/bin/bash
awk -F- '
BEGIN{
    print "SKU, IMAGE"
}
{
    sep = !a[$2] ? "" : ", "
    a[$2] = a[$2] sep $0
}
END{
    PROCINFO["sorted_in"] = "@ind_str_asc" # GNU-only feature
    for(i in a){print "DA-" i ", " "\"" a[i] "\""}
}' <(find /home/calvin/test -type f -name "*.jpg" -printf "%f\n") > ./tmp.csv
Example Output
$ cat ./tmp.csv
SKU, IMAGE
DA-001, "DA-001-coffee.jpg, DA-001-car.jpg, DA-001-dog.jpg"
DA-002, "DA-002-house.jpg"
DA-003, "DA-003-coffee.jpg, DA-003-cat.jpg"
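If you only have a POSIX awk, a variation (my sketch, not part of the answer above) is to drop the PROCINFO line and sort the data rows after the fact, printing the header outside the pipeline:
#!/bin/bash
{
    echo "SKU, IMAGE"
    awk -F- '
    {
        sep = !a[$2] ? "" : ", "
        a[$2] = a[$2] sep $0
    }
    END{
        for(i in a) print "DA-" i ", " "\"" a[i] "\""
    }' <(find /home/calvin/test -type f -name "*.jpg" -printf "%f\n") | sort
} > ./tmp.csv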

If the filenames don't contain spaces, you can use sed instead of an inner loop:
printf '%s\n' *.jpg \
| cut -f1,2 -d- \
| sort -u \
| while IFS= read -r sku ; do
    echo "$sku",\"$(echo "$sku"* | sed 's/ /, /g')\"
done
With the inner loop, you can switch from echo to printf. sed is used to remove the trailing comma.
printf '%s\n' *.jpg \
| cut -f1,2 -d- \
| sort -u \
| while IFS= read -r sku ; do
    printf %s "$sku, \""
    for f in "$sku"* ; do
        printf '%s, ' "$f"
    done | sed 's/, $//'
    printf '"\n'
done
If you don't want to parse the output of ls and run sort, you can store the prefixes in an associative array:
#!/bin/bash
declare -A prefix
for jpg in *.jpg ; do
    p1=${jpg%%-*}
    jpg=${jpg#*-}
    p2=${jpg%%-*}
    prefix[$p1-$p2]=1
done
for sku in "${!prefix[@]}" ; do
    printf '%s, "' "$sku"
    for f in "$sku"* ; do
        printf '%s, ' "$f"
    done | sed 's/, $//'
    printf '"\n'
done
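A variation on this (my sketch, not part of the answer): accumulate the grouped filenames in the associative array as you build it, which removes the inner glob loop and the sed call. Bash hash order is unspecified, so pipe the result through sort if you need sorted SKUs:
#!/bin/bash
declare -A group
for jpg in *.jpg ; do
    sku=${jpg%-*}                          # strip the last dash-separated part
    group[$sku]+="${group[$sku]:+, }$jpg"  # ", " only between entries
done
for sku in "${!group[@]}" ; do
    printf '%s, "%s"\n' "$sku" "${group[$sku]}"
done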

awk '
BEGIN {
    OFS = ", "
    print "SKU", "IMAGE"
    for (i=1; i<ARGC; i++) {
        curr = fname = ARGV[i]
        sub(/-[^-]+$/,"",curr)
        if ( curr != prev ) {
            if ( i > 1 ) {
                print prev, "\"" fnames "\""
            }
            prev = curr
            fnames = ""
        }
        fnames = (fnames == "" ? "" : fnames OFS) fname
    }
    print prev, "\"" fnames "\""
    exit
}
' /home/calvin/test/*.jpg
SKU, IMAGE
DA-001, "DA-001-car.jpg, DA-001-coffee.jpg, DA-001-dog.jpg"
DA-002, "DA-002-house.jpg"
DA-003, "DA-003-cat.jpg, DA-003-coffee.jpg"

As a result of all the replies and advice, I'm using this code to achieve the desired output:
#!/bin/bash
echo "SKU, IMAGES" >> output.csv
ls *.jpg | cut -f1,2 -d- | sort -u | while read SKU
do
    echo $SKU, \"$(echo "$SKU"* | sed 's/ /, /g')\" >> output.csv
done
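One caveat (my note, not from the thread): >> appends, so rerunning the script duplicates rows, and parsing ls is fragile. A variant that sidesteps both, under the same filename layout:
#!/bin/bash
echo "SKU, IMAGES" > output.csv    # ">" truncates, so a rerun starts fresh
for f in *.jpg; do echo "${f%-*}"; done | sort -u | while read -r SKU
do
    echo "$SKU", \"$(echo "$SKU"* | sed 's/ /, /g')\" >> output.csv
done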
Thanks all!

Related

Input from an array for awk to find duplicates

I tried to input data for awk from an array:
awk -v var="${A[*]}" 'BEGIN{split(var,list,"\n"); for (i=1;i<=length(list);i++) print list[i]}'
Also using awk to find duplicates between files:
filecnt=$(find "${pmdir}" -type f)
awk -v n=filecnt '{a[$0]++}END{for (i in a)if (a[i]>1){print i, a[i];}}' $filecnt >> ${outputfile}
But I had a hard time finding out how to do it if awk takes an array as its input.
something like:
awk -v var="${A[*]}" '{var[$0]++}END{for (i in var)if (var[i]>1){print i, var[i];}}'
A is a column data reading from a file:
for i in $( awk -F ',' '{ print $1; }' "${ifile}" )
do
A[$j]=$i
#echo "${A[$j]}"
j=$((j+1))
done
example of A is
0x10000
0x11000
0x01100
0x00010
0x11000
0x00010
0x00010
The output is wanted:
0x11000 2
0x00010 3
Thanks for your suggestions.
Is this what you want?
$ printf '%s\n' "${A[@]}" | sort | uniq -cd | awk '{print $2, $1}'
0x00010 3
0x11000 2
or if you prefer:
$ printf '%s\n' "${A[@]}" | awk '{cnt[$0]++} END{for (val in cnt) if (cnt[val]>1) print val, cnt[val]}'
0x11000 2
0x00010 3
or:
$ awk -v vals="${A[*]}" 'BEGIN{split(vals,tmp); for (i in tmp) cnt[tmp[i]]++; for (val in cnt) if (cnt[val]>1) print val, cnt[val]}'
0x11000 2
0x00010 3
Note that that last one relies on none of the values in A[] containing spaces or escape chars.
Your for loop isn't the right way to populate A[] in the first place, though. This is:
A=()
while IFS= read -r i; do
    A+=( "$i" )
done < <(cut -d',' -f1 "$ifile")
or:
A=()
while IFS=',' read -r i _; do
    A+=( "$i" )
done < "$ifile"
or:
readarray -t A < <(cut -d',' -f1 "$ifile")
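As a quick usage illustration (my addition, assuming the same $ifile and sample data as above), you can sanity-check the array before feeding it to any of the duplicate-counting commands:
readarray -t A < <(cut -d',' -f1 "$ifile")
echo "read ${#A[@]} values"                  # element count
printf '%s\n' "${A[@]}" | sort | uniq -cd    # duplicate counts, as in the first suggestion above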

Linux: Appending values to files, at the end of particular lines, and at the bottom of the file if there is no "key"

I have one file, file1, that has values like so:
key1|value1|
key2|value2|
key3|value3|
I have another file, file2, that has key based values I would like to add to add to file1:
key2 value4
key3 value5
key4 value6
I would like to add values to file1 on lines where the "key" matches, and if a "key" is not present in file1, simply add the new key & value at the bottom:
key1|value1|
key2|value2|value4|
key3|value3|value5|
key4|value6|
It seems like this is something that could be done with 2 calls to awk, but I am not familiar enough with it. I'm also open to using bash or shell commands.
UPDATE
I found this to work
awk 'NR==FNR {a[$1]=$2; next} {print $1,$2,a[$1];delete a[$1]}END{for(k in a) print k,a[k]}' file2 file1
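Note that, as written, this prints with awk's default space separators rather than file1's pipe-delimited layout; a variant (my sketch, not from the update above) that sets the separators per input file:
awk '
    NR==FNR { add[$1] = $2; next }                # file2: space-separated "key value"
    {
        if ($1 in add) { print $0 add[$1] "|"; delete add[$1] }
        else           { print }
    }
    END { for (k in add) print k "|" add[k] "|" } # keys seen only in file2
' file2 FS='|' file1
The bare FS='|' between the two filenames switches the field separator only for file1.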

The only deviation from the desired output is that keys from file1 that are not in file2 are not known ahead of time, so they are printed at the end to keep things semi-online:
awk -v first=data1.txt -f script.awk data2.txt
BEGIN {
    OLD = FS
    FS = "|"
    while (getline < first)
        table[$1] = $0
    OFS = FS
    FS = OLD
}
!($1 in table) {
    queue[$1] = $0
}
$1 in table {
    id = $1
    gsub(FS, OFS)
    sub(/[^|]*\|/, "")
    print table[id] $0 OFS
    delete table[id]
}
END {
    for (id in table)
        print table[id]
    for (id in queue) {
        gsub(FS, OFS, queue[id])
        print queue[id] OFS
    }
}
key2|value2|value4|
key3|value3|value5|
key1|value1|
key4|value6|

This is the LOL answer... ha ha. I basically loop over both files, keeping track of the keys I've already handled, and sort at the end. Silly-ish; probably not even something you would want to use bash for, perhaps...
declare -a checked
checked=()
file="/tmp/file.txt"
> "${file}"
while IFS= read -r line1 ;do
    key1=$(echo $line1 | cut -d'|' -f1)
    if ! grep -qi ${key1} "/tmp/file2.txt" ; then
        echo "$line1" >> "${file}"
        continue
    fi
    while IFS= read -r line2 ;do
        key2=$(echo $line2 | cut -d' ' -f1)
        if ! grep -qi ${key2} "/tmp/file1.txt" ; then
            if ! [[ "${checked[*]}" =~ $key2 ]] ;then
                echo "$(echo $line2| awk '{print $1"|"$2}')|" >> "${file}"
                checked+=(${key2})
                continue
            fi
        fi
        if [[ "$key2" == "$key1" ]] ;then
            echo "${line1}$(echo $line2 | cut -d' ' -f2-)|" >> "${file}"
            continue
        fi
    done < "/tmp/file2.txt"
done < "/tmp/file1.txt"
sort -k2 -n ${file}
[[ -f "${file}" ]] && rm -f "${file}"
Output:
key1|value1|
key2|value2|value4|
key3|value3|value5|
key4|value6|

Splitting out a large file

I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
    date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
    echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop.
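One possible refinement (my sketch, not part of the answer): opening and closing the output file for every single line costs a system call per record. If the log is roughly time-ordered, you can keep a file open until the hour changes:
awk -F'"' '{
    f = "split." strftime("%Y%m%d%H", $4)
    if (f != prev) { if (prev != "") close(prev); prev = f }
    print >> f      # ">>" so a revisited hour appends rather than truncates
}' "$1"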
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN{ FS="\"" }
(NR==1){ tmin = tmax = $4 }
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file = "split." strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
    print "Total lines processed: ", NR
    print "First date: " strftime("%Y%m%d%H",tmin)
    print "Last date: " strftime("%Y%m%d%H",tmax)
}
Which you then can run as:
awk -f <awk_file.awk> <jq-file>
Note: the usage of strftime indicates that you need to use GNU awk.
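If GNU awk is not available, one workaround (my sketch; it assumes captureTime is always plain epoch seconds in the fourth quote-delimited field, and it uses GNU date for the rename pass) is to bucket on the integer hour and rename the buckets afterwards:
# POSIX awk: bucket lines by epoch hour; file names are raw hour numbers
awk -F'"' '{ f = "split." int($4/3600); print >> f; close(f) }' "$1"

# second pass: rename each bucket to the %Y%m%d%H form
for f in split.*; do
    h=${f#split.}
    mv "$f" "split.$(date -d "@$((h*3600))" '+%Y%m%d%H')"
done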

You can start optimizing by replacing this
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppress automatic printing of pattern space
-E use extended regular expressions in the script

Extract information regarding size && time && row_count in a one-line shell script

Hey everyone! I am pretty new to shell scripting and I am stuck.
I need to extract information regarding file_name && size && time && row_count, and I want to do it in one command line. I tried this:
ls -l * && wc -l file.txt && du -ks file.txt | cut -f1| awk '{print $5" " $6 " " $7 " "$8 " " $9 " "$1 " "$2}'
but it is not working properly.
I also tried doing it in a loop, but I don't know how to extract the fields from there:
for file in `ls -ltr /export/home/oracle/dbascripts/scripts`
do
    [[ -f $file ]] && echo $file | awk '{print $3}'
done
Then I want to redirect the output to a file with >> for SQL*Loader purposes.
Thanks in advance!
This could be a start if you have GNU find and GNU coreutils (most Linux distribution will do):
for i in /my/path/*; do
    find "$i" ! -type d -printf '%p %TY-%Tm-%Td %TH:%TM:%TS %s '
    wc -l <"$i"
done
/my/path/* should be modified to reflect the files you want to probe.
Also keep in mind that this one-liner has a few major issues if any directories are specified. This should be safer in that regard:
for i in *; do
    if [[ -d "$i" ]]; then
        continue
    fi
    find "$i" -printf '%p %TY-%Tm-%Td %TH:%TM:%TS %s '
    wc -l <"$i"
done
You will want to see the manual page for GNU find to understand this better.
EDIT:
There is at least one other, faster way, using join and bash process substitution, but it's a bit ugly and somewhat harder to make safe and to work the kinks out of.
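Purely as an illustration of that idea (a rough sketch under assumptions not in the original answer: a flat directory, regular files only, and no spaces in filenames), it could look like:
# join per-file metadata from find with line counts from wc, matching on filename (field 1)
join <(find . -maxdepth 1 -type f -printf '%f %TY-%Tm-%Td %TH:%TM:%TS %s\n' | sort) \
     <(find . -maxdepth 1 -type f -exec wc -l {} + | awk '{sub(".*/","",$2); print $2, $1}' | sort)
join drops lines that appear in only one input, so the "total" line wc prints for multiple files falls out on its own.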

ExtractInformation()
{
    timesep="-"
    sep="|"
    dot=":"
    sec="00"
    lcount=`wc -l < $fname`
    modf_time=`ls -l $fname`
    f_size=`echo $modf_time | awk '{print $5}'`
    time_month=`echo $modf_time | awk '{print $6}'`
    time_day=`echo $modf_time | awk '{print $7}'`
    time_hrmin=`echo $modf_time | awk '{print $8}'`
    time_hr=`echo $time_hrmin | cut -d ':' -f1`
    time_min=`echo $time_hrmin | cut -d ':' -f2`
    time_year=`date '+%Y'`
    time_param="DD-MON-YYYY HH24:MI:SS"
    time_date=$time_day$timesep$time_month$timesep$time_year" "$time_hrmin$dot$sec
    result=$fname$sep$time_date$sep$f_size$sep$lcount$sep$time_param
    sqlresult=`echo $result | awk '{FS = "|" ;q=sprintf("%c", 39); print "INSERT INTO SIP_ICMS_FILE_T(f_name, f_date_time,f_size,f_row_count) VALUES (" q $1 q ", TO_DATE("q $2 q,q $5 q "),"$3","$4");";}'`
    echo $sqlresult>>data.sql
    echo "Reading data....."
}
UploadData()
{
    #ss=`sqlplus -s a/a@adb @data.sql
    #set serveroutput on
    #set feedback off
    #set echo off`
    echo "loading with sql Loader....."
}
f_data=data.sql
[[ -f $f_data ]] && rm data.sql
for fname in * ;
do
    if [[ -f $fname ]]; then
        ExtractInformation
    fi
    UploadData
    #Zipdata
done

Replacing a string with a space

This is the code:
for f in tmp_20100923*.xml
do
    str1=`more "$f"|grep count=`
    i=`echo $str1 | awk -F "." '{print($2)}'`
    j=`echo $i | awk -F " " '{print($2)}'` // output is `count="0"`
    sed 's/count=//g' $j > $k; echo $k;
done
I tried to get the value 0 from the above output using a sed filter, but with no success. Could you please advise how I can separate the 0 from the string count="0"?
You can have AWK do everything:
for f in tmp_20100923*.xml
do
    k=$(awk -F '.' '/count=/ {split($2,a," "); print gensub(/count=/,"","g",a[2])}' "$f")
done
Edit:
Based on your comment, you don't need to split on the decimal. You can also have AWK do the summation. So you don't need a shell loop.
awk '/count=/ { sub("count=","",$2); gsub("\042","",$2); sum += $2} END{print sum}' tmp_20100923*.xml

Remove all non-digits from $j:
echo ${j//[^0-9]/}
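For example, with the value from the question:
j='count="0"'
echo "${j//[^0-9]/}"    # prints: 0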

You are trying to run sed on a file whose name is $j.
Instead you can do:
echo $j | sed 's/count=//g'
You can use this sed regexp:
sed 's/count="\(.*\)"/\1/'
However your script has another problem:
j=`echo $i | awk -F " " '{print($2)}'` // output is `count="0"`
sed 's/count=//g' $j > $k; echo $k;
should be
j=`echo $i | awk -F " " '{print($2)}'` // output is `count="0"`
echo $j | sed 's/count=//g'
or better:
echo $i | awk -F " " '{print($2)}' | sed 's/count=//g'
sed accepts filenames as input; $j is a shell variable where you put the output of another program (awk).
Also, the ">" redirection puts things into a file. You wrote ">$k" and then "echo $k", as if >$k wrote the output of sed into the $k variable.
If you want to keep the output of sed in the $k variable, write instead:
j=`echo $i | awk -F " " '{print($2)}'` // output is `count="0"`
k=`echo $j | sed 's/count=//g'`
This should snag everything between the quotes.
sed -re 's/count="([^"]+)"/\1/g'
-r adds --regexp-extended to be able to do cool stuff with regular expressions, and the expression I've given you means:
search for count=",
then store ( any character that's not a " ), then
make sure it's followed by a ", then
replace everything with the stuff in the parentheses (\1 is the first register)
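Applied to the sample string from the question, it behaves like this:
$ echo 'count="0"' | sed -re 's/count="([^"]+)"/\1/g'
0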
