Read multiple columns from a CSV file with hash keys in bash

I am trying to read a CSV file vertically (column by column), as follows, to insert the values into a Graphite/Carbon DB.
"No.","time","00:00:00","00:00:01","00:00:02","00:00:03","00:00:04","00:00:05","00:00:06","00:00:07","00:00:08","00:00:09","00:00:0A"
"1","2021/09/12 02:16",235,610,345,997,446,130,129,94,555,274,4
"2","2021/09/12 02:17",364,210,371,341,294,87,179,106,425,262,3
"3","2021/09/12 02:18",297,343,860,216,275,81,73,113,566,274,3
"4","2021/09/12 02:19",305,243,448,262,387,64,63,119,633,249,3
"5","2021/09/12 02:20",276,151,164,263,315,86,92,175,591,291,1
"6","2021/09/12 02:21",264,343,287,542,312,83,72,122,630,273,4
"7","2021/09/12 02:22",373,157,266,446,246,90,173,90,442,273,2
"8","2021/09/12 02:23",265,112,241,307,329,64,71,82,515,260,3
"9","2021/09/12 02:24",285,247,240,372,176,92,67,83,609,620,1
"10","2021/09/12 02:25",289,964,277,476,356,84,74,104,560,294,1
"11","2021/09/12 02:26",279,747,227,573,569,82,77,99,589,229,5
"12","2021/09/12 02:27",338,370,315,439,653,85,165,346,367,281,2
"13","2021/09/12 02:28",269,135,372,262,307,73,86,93,512,283,4
"14","2021/09/12 02:29",281,207,688,322,233,75,69,85,663,276,2
...
I want to generate one command per column header 00:00:XX, combining the header with the timestamp in column 2 and the value at that position:
echo "perf.$type.$serial.$object.00:00:00.TOTAL_IOPS" "235" "epoch time (2021/09/12 02:16)" | nc "localhost" "2004"
echo "perf.$type.$serial.$object.00:00:00.TOTAL_IOPS" "364" "epoch time (2021/09/12 02:17)" | nc "localhost" "2004"
...
echo "perf.$type.$serial.$object.00:00:01.TOTAL_IOPS" "610" "epoch time (2021/09/12 02:16)" | nc "localhost" "2004"
echo "perf.$type.$serial.$object.00:00:01.TOTAL_IOPS" "210" "epoch time (2021/09/12 02:17)" | nc "localhost" "2004"
.. etc..
I don't know where to start; I tried with awk without success:
Trial1: awk -F "," 'BEGIN{FS=","}NR==1{for(i=1;i<=NF;i++) header[i]=$i}{for(i=1;i<=NF;i++) { print header[i] } }' file.csv
Trial2: awk '{time=$2; for(i=3;i<=NF;i++){time=time" "$i}; print time}' file.csv
Many thanks for any help.

In plain bash:
#!/bin/bash
{
    IFS=',' read -ra header                    # first line: column headers
    header=("${header[@]//\"}")                # strip double quotes
    nf=${#header[@]}
    row_nr=0
    while IFS=',' read -ra flds; do
        datetime[row_nr++]=$(date -d "${flds[1]//\"}" '+%s')   # column 2 -> epoch
        for ((i = 2; i < nf; ++i)); do
            col[i]+=" ${flds[i]}"              # collect each column's values
        done
    done
} < file
for ((i = 2; i < nf; ++i)); do
    v=(${col[i]})                              # split the collected values on whitespace
    for ((j = 0; j < row_nr; ++j)); do
        printf 'echo "perf.$type.$serial.$object.%s.TOTAL_IOPS" "%s" "epoch time (%s)" | nc "localhost" "2004"\n' \
            "${header[i]}" "${v[j]}" "${datetime[j]}"
    done
done
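Note that the script only prints the echo ... | nc commands. To actually send the metrics, the $type, $serial and $object placeholders must be defined when the generated commands are executed, for example (the values and the file name generate.sh below are only illustrative):

export type="storage" serial="12345" object="CTL0"   # hypothetical values
bash generate.sh > commands.sh    # generate.sh holds the script above
bash commands.sh                  # runs the echo ... | nc lines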

Would you please try the following:
awk -F, '
NR==1 {                            # process the header line
    for (i = 3; i <= NF; i++) {
        gsub(/"/, "", $i)          # remove double quotes
        tt[i-2] = $i               # assign time array
    }
    next
}
{                                  # process the body
    gsub(/"/, "", $0)
    dt[NR - 1] = $2                # assign datetime array
    for (i = 3; i <= NF; i++) {
        key[NR-1, i-2] = $i        # assign key values
    }
}
END {
    for (i = 1; i <= NF - 2; i++) {
        for (j = 1; j <= NR - 1; j++) {
            printf "echo \"perf.$type.$serial.$object.%s.TOTAL_IOPS\" \"%d\" \"epoch time (%s)\" | nc \"localhost\" \"2004\"\n", tt[i], key[j, i], dt[j]
        }
    }
}
' file.csv
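This prints the raw datetime string from column 2; Graphite expects epoch seconds there. If you are on GNU awk, a small helper based on mktime() could be added to the program and used in place of dt[j] in the final printf (a sketch, untested, interpreting the timestamps in the local timezone):

function to_epoch(s) {             # "2021/09/12 02:16" -> epoch seconds (GNU awk only)
    gsub(/[\/:]/, " ", s)          # -> "2021 09 12 02 16"
    return mktime(s " 00")         # mktime() wants "YYYY MM DD HH MM SS"
}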


Grab value from text & assign it to a variable

I am working on a script that reads battery information from a system file.
I simply need to grab the total battery capacity (3000) from the line MAX_IBAT(mA): 3000; and put it into a variable.
This is the content of the file I am reading from:
charging_source: NONE;
charging_enabled: 0;
overload: 0;
Percentage(%): 50;
Percentage_raw(%): 50;
gs_cable_impedance: 0
gs_R_cable_impedance: 0
gs_aicl_result: 0
batt_cycle_first_use: 2017/01/01/12:00:06
batt_cycle_level_raw: 26157;
batt_cycle_overheat(s): 0;
htc_extension: 0x0;
usb_overheat_state: 0;
USB_PWR_TEMP(degree): 304;
ISEN_VALUE_ADC: 228;
ISEN_VALUE: 0;
SOC(%): 27;
VBAT(mV): 3707;
IBAT(mA): 383;
IUSB(mA): 0;
MAX_IBAT(mA): 3000;
MAX_IUSB(mA): 0;
AICL_RESULT: 0
VBUS(uV): 0;
BATT_TEMP: 320;
HEALTH: 1;
BATT_PRESENT(bool): 1;
CHARGE_TYPE: 1;
CHARGE_DONE: 0;
USB_PRESENT: 0;
USB_ONLINE: 0;
CHARGER_TEMP: -1;
CHARGER_TEMP_MAX: 803;
CC_uAh: 889648;
USB_CMD_IL_REG: 0x00;
USBIN_CURRENT_LIMIT_CFG: 0x14;
USBIN_AICL_OPTIONS_CFG: 0xc4;
FAST_CHARGE_CURRENT_CFG: 0x78;
FG_BCL_LMH_STS1: 0x00;
What I have tried:
awk '/^ +MAX_IBAT(mA): && $NF!=0{print $NF} Input_file
This uses ": " and ";" as field separators:
max_ibat=$(awk -F ': |;' '$1=="MAX_IBAT(mA)" {print $2}' file)
echo "$max_ibat"
Output:
3000
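The field splitting can be checked in isolation (illustration only):

echo 'MAX_IBAT(mA): 3000;' | awk -F ': |;' '{ print NF; print $1; print $2 }'
# 3
# MAX_IBAT(mA)
# 3000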
grep solution:
-o prints only the matched text, not the whole line
-P enables Perl-compatible regular expressions
(?<=MAX_IBAT\(mA\):\s) is a lookbehind assertion, so only digits preceded by the string 'MAX_IBAT(mA): ' are printed
command:
max_ibat=$(grep -oP '(?<=MAX_IBAT\(mA\):\s)\d+' Input_file)
echo "$max_ibat"
Output:
3000
sed solution:
-n silent mode -> do not print by default
/^MAX_IBAT(mA):/ processes only lines that start with MAX_IBAT(mA):
s/[^0-9]//gp replaces every character that is not a digit with nothing (deletes it) and then prints with p.
command:
max_ibat=$(sed -n '/^MAX_IBAT(mA):/s/[^0-9]//gp' Input_file)
echo "$max_ibat"
Output:
3000

How to get n random "paragraphs" (groups of ordered lines) from a file

I have a file (originally compressed) with a known structure: every 4 lines, the first line starts with the character "#" and defines an ordered group of 4 lines. I want to randomly select n groups (half of them) in the most efficient way (preferably in bash or another Unix tool).
My suggestion in python is:
path = "origin.txt.gz"
unzipped_path = "origin_unzipped.txt"
new_path = "/home/labs/amit/diklag/subset.txt"
subprocess.getoutput("""gunzip -c %s > %s """ % (path, unzipped_path))
with open(unzipped_path) as f:
lines = f.readlines()
subset_size = round((len(lines)/4) * 0.5)
l = random.sample(list(range(0, len(lines), 4)),subset_size)
selected_lines = [line for i in l for line in list(range(i,i+4))]
new_lines = [lines[i] for i in selected_lines]
with open(new_path,'w+') as f2:
f2.writelines(new_lines)
Can you help me find another (and faster) way to do it?
Right now it takes ~10 seconds to run this code
The following scripts might be helpful. They are, however, untested as we do not have an example file:
attempt 1 (awk and shuf):
#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
nrec=$(gunzip -c "$path" | awk '/^#/{c++} END{print c}')
awk '(NR==FNR){a[$1]=1;next}
     !/^#/{next}
     ((++c) in a) { print; for (i = 1; i <= 3; i++) { getline; print } }' \
    <(shuf -i 1-"$nrec" -n "$count") <(gunzip -c "$path") > "$new_path"
attempt 2 (sed and shuf):
#!/usr/bin/env bash
count=30
path="origin.txt.gz"
new_path="subset.txt"
gunzip -c $path | sed ':a;N;$!ba;s/\n/__END_LINE__/g;s/__END_LINE__#/\n#/g' \
| shuf -n $count | sed 's/__END_LINE__/\n/g' > $new_path
In this example, the sed command first replaces every newline with the string __END_LINE__ and then restores a real newline before each #, so every 4-line group ends up on a single line. The shuf command then picks $count random groups from that list, and afterwards __END_LINE__ is turned back into \n.
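On a tiny two-group input, the transformation before shuf looks like this (illustration only):

printf '#g1\na\nb\nc\n#g2\nd\ne\nf\n' \
  | sed ':a;N;$!ba;s/\n/__END_LINE__/g;s/__END_LINE__#/\n#/g'
# #g1__END_LINE__a__END_LINE__b__END_LINE__c
# #g2__END_LINE__d__END_LINE__e__END_LINE__f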
attempt 3 (awk):
Create a file called subset.awk containing :
# Uniform(m) :: returns a random integer such that
# 1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }
# KnuthShuffle(m) :: creates a random permutation of the range [1,m]
function KnuthShuffle(m,    i, j, k) {
    for (i = 1; i <= m; i++) { permutation[i] = i }
    for (i = 1; i <= m-1; i++) {
        j = Uniform(i-1)
        k = permutation[i]
        permutation[i] = permutation[j]
        permutation[j] = k
    }
}
BEGIN { RS = "\n#"; srand() }
{ a[NR] = $0 }
END {
    KnuthShuffle(NR)
    sub("#", "", a[1])
    for (r = 1; r <= count; r++) {
        print "#" a[permutation[r]]
    }
}
And then you can run:
$ gunzip -c <file.gz> | awk -v count=30 -f subset.awk > <output.txt>
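The RS="\n#" setting relies on GNU awk accepting a multi-character (regex) record separator, so each 4-line group becomes one record; only the very first record keeps its leading #, which is what the sub() call compensates for. A quick illustration:

printf '#g1\na\nb\nc\n#g2\nd\ne\nf\n' | awk 'BEGIN{RS="\n#"} {print NR": "$1}'
# 1: #g1
# 2: g2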

Aggregating a CSV file in a bash script

I have a CSV file with multiple lines. Each line has the same number of columns. I need to group those lines by a few specified columns and aggregate the data from the other columns. Example input file:
proces1,pathA,5-May-2011,10-Sep-2017,5
proces2,pathB,6-Jun-2014,7-Jun-2015,2
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces1,pathA,11-Sep-2017,15-Oct-2017,2
For the above example I need to group lines by the first two columns. From the 3rd column I need the minimum value, from the 4th column the maximum value, and the 5th column should hold the sum. So for this input file I need the output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
I need to process it in bash (I can use awk or sed as well).
With bash and sort:
#!/bin/bash
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de   # date start and date end
declare -A -i sum    # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
    # if the associative array is still empty for this entry,
    # fill it with the current strings/value
    if [[ -z ${p[$p1,$p2]} ]]; then
        p[$p1,$p2]="$p1,$p2"
        ds[$p1,$p2]="$d1"
        de[$p1,$p2]="$d2"
        sum[$p1,$p2]="$s"
        continue
    fi
    # compare strings, set new strings and sum the value
    if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
        [[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
        [[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
        sum[$p1,$p2]=sum[$p1,$p2]+s
    fi
done < file
# print the content of all associative arrays, using the keys of associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2
See: help declare, help read and of course man bash
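As a quick sanity check of the date2num helper, if you paste the month2num array and the date2num definition into an interactive shell (values taken from the sample input):

date2num "5-May-2011"     # 20110505
date2num "15-Oct-2017"    # 20171015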
With awk + sort
awk -F',|-' '
BEGIN{
A["Jan"]="01"
A["Feb"]="02"
A["Mar"]="03"
A["Apr"]="04"
A["May"]="05"
A["Jun"]="06"
A["July"]="07"
A["Aug"]="08"
A["Sep"]="09"
A["Oct"]="10"
A["Nov"]="11"
A["Dec"]="12"
}
{
  B[$1","$2]=B[$1","$2]+$9
  z=sprintf("%.2d",$3)
  y=sprintf("%s",$5 A[$4] z)
  if(!start[$1$2])
  {
    end[$1$2]=0
    start[$1$2]=99999999
  }
  if (y < start[$1$2])
  {
    start[$1$2]=y
    C[$1","$2]=$3"-"$4"-"$5
  }
  x=sprintf("%.2d",$6)
  w=sprintf("%s",$8 A[$7] x)
  if(w > end[$1$2])
  {
    end[$1$2]=w
    D[$1","$2]=$6"-"$7"-"$8
  }
}
END{
for (i in B)print i "," C[i] "," D[i] "," B[i]
}
' infile | sort
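Because the field separator is a comma or a hyphen, each date is already split into separate day, month and year fields, e.g. (illustration):

echo 'proces1,pathA,5-May-2011,10-Sep-2017,5' | awk -F',|-' '{ print $3, $4, $5, "|", $6, $7, $8, "|", $9 }'
# 5 May 2011 | 10 Sep 2017 | 5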
Extended GNU awk solution:
awk -F, 'function parse_date(d_str) {
    split(d_str, d, "-");
    t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
    return t
}
BEGIN { m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
        m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
    k = $1 SUBSEP $2;
    if (k in a) {
        if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
        if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
    } else {
        a[k]["min"]=$3; a[k]["max"]=$4
    }
    a[k]["sum"] += $5
}
END {
    for (i in a) {
        split(i, j, SUBSEP);
        print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
    }
}' OFS=',' file
The output:
proces1,pathA,5-May-2011,15-Oct-2017,7
proces1,pathB,6-Jun-2017,7-Jun-2017,1
proces2,pathB,6-Jun-2014,7-Jun-2015,2

How to merge files line by line in bash

My files look like
file0 file1 file2
a 1 ##
a 1 ##
b 2 ##
b 2 ##
and I want to merge these files line by line, so the result should look like this:
merged file
a
a
1
1
##
##
b
b
2
2
##
##
I mean, take a few lines from each file in turn and merge them into one file.
I tried the bash script below.
touch ini.dat
n=2
linenum=$(wc -l < file0)
iter=$((linenum/n))
for i in $(seq 0 1 $iter)
do
for j in $(seq 0 1 2)
do
awk 'NR > '$(($i*$n))' && NR <= '$((($i+1)*$n))'' file"$j" > tmp
cat ini.dat tmp > tmpp
cp tmpp ini.dat
rm tmpp
done
done
It works fine, but it takes too much time. Is there a more efficient way?
Limiting Factors
Your script had two flaws which made it slow:
A lot of files were created and copied. Especially the ... > tmp; cat ini.dat tmp > tmpp; cp tmpp ini.dat could have been written as ... >> ini.dat.
To read the i-th line of a file, the script has to scan that file from the beginning until the i-th line is reached. Done repeatedly for i = 1, 2, 3, ..., n this takes O(n²). Reading the whole file once into an array (O(n)) and accessing the lines by index (O(1) each) takes only O(n) overall.
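Schematically, the difference looks like this (a sketch, not your script verbatim; it assumes a file named file and a line count n):

# quadratic: the file is re-read from the top for every requested line
for ((i = 1; i <= n; ++i)); do
    line=$(sed -n "${i}p" file)
done

# linear: read the file once into an array, then index it in O(1)
mapfile -t lines < file
for ((i = 0; i < ${#lines[@]}; ++i)); do
    line=${lines[i]}
done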
Pure Bash Solution
The following bash script does the job a bit faster. linesPerBlock corresponds to the parameter n from your script. The script will print as many blocks as possible. That is:
Once the shortest input file has been printed completely, the script terminates; remaining lines from longer files are not printed.
If the shortest input file's number of lines is not divisible by n, the last lines (fewer than n) will be omitted.
#! /bin/bash
files=(file{0..2})
linesPerBlock=2
starts=(0)
maxLines=9223372036854775807 # bash's max. number
for i in "${!files[@]}"; do
    lineCount="$(wc -l < "${files[i]}")"
    (( lineCount < maxLines )) && (( maxLines = lineCount ))
    (( starts[i+1] = starts[i] + maxLines ))
    mapfile -t -O "${starts[i]}" -n "$maxLines" lines < "${files[i]}"
done
for (( b = 0; b < maxLines / linesPerBlock; ++b )); do
    for f in "${!files[@]}"; do
        start="${starts[f]}"
        for (( i = 0; i < linesPerBlock; ++i )); do
            echo "${lines[start + b*linesPerBlock + i]}"
        done
    done
done > outputFile
This awk should do the job and will be much quicker than your shell script:
awk 'fn != FILENAME {
    fn = FILENAME
    n = 1
}
NF {
    a[FILENAME,n++] = $0
}
END {
    for(i=0; i<(n-1)/2; i++) {
        for(j=1; j<ARGC; j++)
            printf "%s\n%s\n", a[ARGV[j],i*2+1], a[ARGV[j],i*2+2];
        print ""
    }
}' file{0..2}
a
a
1
1
##
##
b
b
2
2
##
##
In a single line:
awk 'fn != FILENAME{fn=FILENAME; n=1} NF{a[FILENAME,n++]=$0} END{for(i=0; i<(n-1)/2; i++) { for(j=1; j<ARGC; j++) printf "%s\n%s\n", a[ARGV[j],i*2+1], a[ARGV[j],i*2+2]; print "" } }' file{0..2}
Here is another awk solution that does not cache all of the contents:
paste file{0..2} | awk -v n=2 '
function pr() {for(j=1;j<=NF;j++)
for(i=0;i<n;i++) print a[i,j]}
{for(j=1;j<=NF;j++) a[c+0,j]=$j; c++}
!(NR%n) {pr(); delete a; c=0}
END {pr()}'
If the number of lines is not divisible by n, it will pad the last block with empty lines.
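For reference, paste file{0..2} produces the interleaved rows that the script above consumes (shown for the sample files):

a	1	##
a	1	##
b	2	##
b	2	##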

How to use awk (or anything else) to count the number of shared x values between two y values in a CSV file with columns x and y?

Let me be specific. We have a CSV file consisting of two columns, x and y, like this:
x,y
1h,a2
2e,a2
4f,a2
7v,a2
1h,b6
4f,b6
4f,c9
7v,c9
...
We want to count how many x values each pair of y values shares, which means we want to get this:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
And b6,a2,2 should not show up. Does anyone know how to do this with awk, or anything else?
Thanks in advance!
Try this executable awk script:
#!/usr/bin/awk -f
BEGIN { FS = OFS = "," }
NR==1 { print "y1" OFS "y2" OFS "share" }
NR>1  { last = a[$1]; a[$1] = (last != "" ? last "," : "") $2 }   # collect y values per x
END {
    for (i in a) {
        cnt = split(a[i], arr, FS)
        if (cnt > 1) {
            for (k = 1; k < cnt; k++) {
                for (m = k + 1; m <= cnt; m++) {       # start at k+1 so each pair is counted once
                    if (arr[k] != arr[m]) {
                        key = arr[k] OFS arr[m]
                        if (out[key] == "") { order[++ocnt] = key }
                        out[key]++
                    }
                }
            }
        }
    }
    for (i = 1; i <= ocnt; i++) {
        print order[i] OFS out[order[i]]
    }
}
When put into a file called awko and made executable, running it like awko data yields:
y1,y2,share
a2,b6,2
a2,c9,2
b6,c9,1
I'm assuming the file is sorted by the y values in the second column, as in the question (after the header). If it works for you, I'll add some explanations tomorrow.
Additionally, for anyone who wants more test data, here's a silly executable awk script for generating data similar to what's in the question. It makes about 10K lines when run as ./gen.awk.
#!/usr/bin/awk -f
function randInt(max) {
    return( int(rand()*max)+1 )
}
BEGIN {
    a[1]="a";  a[2]="b";  a[3]="c";  a[4]="d";  a[5]="e";  a[6]="f"
    a[7]="g";  a[8]="h";  a[9]="i";  a[10]="j"; a[11]="k"; a[12]="l"
    a[13]="m"; a[14]="n"; a[15]="o"; a[16]="p"; a[17]="q"; a[18]="r"
    a[19]="s"; a[20]="t"; a[21]="u"; a[22]="v"; a[23]="w"; a[24]="x"
    a[25]="y"; a[26]="z"
    print "x,y"
    for(i=1;i<=26;i++) {
        amultiplier = randInt(1000) # vary this to change the output size
        r = randInt(amultiplier)
        anum = 1
        for(j=1;j<=amultiplier;j++) {
            if( j == r ) { anum++; r = randInt(amultiplier) }
            print a[randInt(26)] randInt(5) "," a[i] anum
        }
    }
}
I think if you can get the input into a form like this, it's easy:
1h a2 b6
2e a2
4f a2 b6 c9
7v a2 c9
In fact, you don't even need the x value. You can convert this:
a2 b6
a2
a2 b6 c9
a2 c9
Into this:
a2,b6
a2,b6
a2,c9
b6,c9
a2,c9
That output can be sorted and piped to uniq -c to get approximately the output you want, so the main work is getting from your input to the first and second forms above. Once we have those, the final step is easy.
Step one:
sort /tmp/values.csv \
| awk '
  BEGIN { FS="," }
  {
      if (x != $1) {
          if (x) print values
          x = $1
          values = $2
      } else {
          values = values " " $2
      }
  }
  END { print values }
'
Step two:
| awk '
  {
      for (i = 1; i < NF; ++i) {
          for (j = i+1; j <= NF; ++j) {
              print $i "," $j
          }
      }
  }
'
Step three:
| sort | awk '
  BEGIN {
      combination = $0
      print "y1,y2,share"
  }
  {
      if (combination == $0) {
          count = count + 1
      } else {
          if (count) print combination "," count
          count = 1
          combination = $0
      }
  }
  END { print combination "," count }
'
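Chained together, the whole thing becomes a single pipeline (a sketch, untested; it uses uniq -c for the counting step as mentioned above, and assumes the input lives in /tmp/values.csv as in step one):

sort /tmp/values.csv \
| awk 'BEGIN { FS="," }
       { if (x != $1) { if (x) print values; x = $1; values = $2 }
         else { values = values " " $2 } }
       END { print values }' \
| awk '{ for (i = 1; i < NF; ++i) for (j = i+1; j <= NF; ++j) print $i "," $j }' \
| sort | uniq -c \
| awk 'BEGIN { print "y1,y2,share" } { print $2 "," $1 }'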
This awk script does the job:
BEGIN { FS=OFS="," }
NR==1 { print "y1","y2","share" }
NR>1  { ++seen[$1,$2]; ++x[$1]; ++y[$2] }
END {
    for (y1 in y) {
        for (y2 in y) {
            if (y1 != y2 && !(y2 SUBSEP y1 in c)) {
                for (i in x) {
                    if (seen[i,y1] && seen[i,y2]) {
                        ++c[y1,y2]
                    }
                }
            }
        }
    }
    for (key in c) {
        split(key, a, SUBSEP)
        print a[1],a[2],c[key]
    }
}
Loop through the input, recording both the original elements and the combinations. Once the file has been processed, look at each pair of y values. The if statement does two things: it prevents equal y values from being compared and it saves looping through the x values twice for every pair. Shared values are stored in c.
Once the shared values have been aggregated, the final output is printed.
This bash script, built around sed, does the trick:
#!/bin/bash
echo y1,y2,share
x=$(wc -l < file)
b=$(echo "$x -2" | bc)
index=0
for i in $(eval echo "{2..$b}")
do
    var_x_1=$(sed -n ''"$i"p'' file | sed 's/,.*//')
    var_y_1=$(sed -n ''"$i"p'' file | sed 's/.*,//')
    a=$(echo "$i + 1" | bc)
    for j in $(eval echo "{$a..$x}")
    do
        var_x_2=$(sed -n ''"$j"p'' file | sed 's/,.*//')
        var_y_2=$(sed -n ''"$j"p'' file | sed 's/.*,//')
        if [ "$var_x_1" = "$var_x_2" ] ; then
            array[$index]=$var_y_1,$var_y_2
            index=$(echo "$index + 1" | bc)
        fi
    done
done
counter=1
for (( k=1; k<$index; k++ ))
do
    if [ ${array[k]} = ${array[k-1]} ] ; then
        counter=$(echo "$counter + 1" | bc)
    else
        echo ${array[k-1]},$counter
        counter=1
    fi
    if [ "$k" = $(echo "$index-1"|bc) ] && [ $counter = 1 ]; then
        echo ${array[k]},$counter
    fi
done
