Merge two files using paste, but starting after the Nth row - bash

I have two xlsx files; they are different, with only one thing in common: the date. I must convert them to csv and merge them together.
file1
01/01/2013;horse;penguin
02/01/2013;cat;dog
03/01/2013;frog;whale
04/01/2013;mouse;bird
[...]
continuing up to the present day, May 2017
No animals were hurt in writing this sample.
file2
14/02/2013;banana;cherry
15/02/2013;apple;mango
16/02/2013;orange;strawberry
[...]
continuing up to the present day, May 2017
This is the result I wish to achieve (in the real output the dates are epoch timestamps; I leave them human-readable here so you can read them):
01/01/2013;horse;penguin
02/01/2013;cat;dog
03/01/2013;frog;whale
04/01/2013;mouse;bird
[...]
13/02/2013;fish;elephant
14/02/2013;bear;owl;banana;cherry
15/02/2013;monkey;bat;apple;mango
[...]
The following is the script I wrote. Two notes:
1) the dates need to be converted to epoch timestamps
2) sheet2 does not contain the date column; the date printed in the final file for both sheets is taken from sheet1
#!/bin/bash
# VARS #
XLSX=$1
SHEET1="sheet1"
SHEET2="sheet2"
P_PATH=/tmp/extract
EXTRACTCSV=$P_PATH/extract.csv
TMP_CSV=$P_PATH/temp.csv
CSV_SPLIT=$P_PATH/processed.csv
CSV_FINAL=$P_PATH/${XLSX}.csv
# START #
[ -d $P_PATH ] || mkdir -p $P_PATH
rm -rfv $P_PATH/*
########################
# ssconvert on sheet 1 #
########################
ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'sheet='$SHEET1' separator=; format=automatic eol=unix' ${XLSX} ${EXTRACTCSV}"."${SHEET1}
if [ $? -gt 0 ]; then
echo "Ssconvert on $SHEET1 failed. Exiting."
exit 1
fi
########################
# ssconvert on sheet 2 #
########################
ssconvert --export-type=Gnumeric_stf:stf_assistant -O 'sheet='$SHEET2' separator=; format=automatic eol=unix' ${XLSX} ${EXTRACTCSV}"."${SHEET2}
if [ $? -gt 0 ]; then
echo "Ssconvert on $SHEET2 failed. Exiting."
exit 1
fi
######################
# Processing SHEET 1 #
######################
awk -F';' '{print $1";"$2";"$6}' ${EXTRACTCSV}.${SHEET1} > ${TMP_CSV}.${SHEET1}
# Modify to EPOCH #
while read -r line; do
colDate=$(echo "$line" | awk -F';' '{print $1}')
colB=$(echo "$line" | awk -F';' '{print $2}')
colF=$(echo "$line" | awk -F';' '{print $3}')
# Skip when date not set
if [ -z "${colDate}" ]; then
continue
fi
epoch_date=$(date +%s -ud "${colDate}")
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}
done < ${TMP_CSV}.${SHEET1}
######################
# Processing SHEET 2 #
######################
awk -F';' '{print $12";"$14";"$17}' ${EXTRACTCSV}.${SHEET2} > ${CSV_SPLIT}.${SHEET2}
##########################
# Merge the csv together #
##########################
paste -d ';' ${CSV_SPLIT}.${SHEET1} ${CSV_SPLIT}.${SHEET2} | column -t > ${CSV_FINAL}
My Request:
The final command, the one that merges the 2 files together:
paste -d ';' ${CSV_SPLIT}.${SHEET1} ${CSV_SPLIT}.${SHEET2} | column -t > ${CSV_FINAL}
works fine, but the second file's columns start on the row of 01/01/2013.
I don't know how to modify the logic of this script so that the second file begins being pasted at the row of 14/02/2013.
Can anyone help me?

Looks like you want to sort and merge the files by date.
File1:
sort -n -k3 -k2 -k1 -t '/' -o File1.sorted File1
File2:
sort -n -k3 -k2 -k1 -t '/' -o File2.sorted File2
Merge:
sort -n -m -k3 -k2 -k1 -t '/' -o result.sorted File1.sorted File2.sorted
Or as a single line using process substitution:
sort -n -m -k3 -k2 -k1 -t '/' <(sort -n -k3 -k2 -k1 -t '/' File1) <(sort -n -k3 -k2 -k1 -t '/' File2)
-n sorts the fields numerically instead of lexically
-m merges two already-sorted files
-k3 -k2 -k1 sorts by year, then month, then day (fields 3, 2 and 1 respectively)
-t sets the delimiter
EXAMPLE (run without -n here, so within the same year the records come out in lexical order of the rest of the line; add -n as above for true chronological order):
sort -m -k3 -k2 -k1 -t '/' <(sort -k3 -k2 -k1 -t '/' t2) <(sort -k3 -k2 -k1 -t '/' t1)
12/01/2012;banana;pear
15/02/2013;apple;mango
14/02/2013;banana;cherry
02/01/2013;cat;dog
03/01/2013;frog;whale
01/01/2013;horse;penguin
04/01/2013;mouse;bird
16/02/2013;orange;strawberry
13/03/2015;mango;papaya
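If the inputs have already been run through the processing steps from the question, the first ;-separated field is an epoch timestamp, so a single numeric key is enough. A minimal sketch, assuming two files of epoch;value;value lines that are already in chronological order (file1.epoch and file2.epoch are hypothetical names):
sort -t ';' -n -m -k1,1 -o result.csv file1.epoch file2.epoch
-m only merges, so it relies on each input already being sorted, which a chronological export is.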

This is how I solved it:
if [ $epoch_date -le 1360713600 ]; then
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}.part1
else
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}
fi
[...]
##########################
# Merge the csv together #
##########################
cat ${CSV_SPLIT}.${SHEET1}.part1 > ${CSV_FINAL}
paste -d ';' ${CSV_SPLIT}.${SHEET1} ${CSV_SPLIT}.${SHEET2} | column -t >> ${CSV_FINAL}
I split file1 into two parts while reading it: one part contains the dates and values before 14 Feb, the other part the rest.
And well... easy.
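A small optional refinement: the magic number 1360713600 is 13 Feb 2013 00:00 UTC, so the cutoff could also be derived from a readable date instead of being hardcoded. A minimal sketch, assuming GNU date as already used in the script (CUTOFF is just an illustrative variable name):
# Last date that still goes into part1 (everything up to and including 13 Feb 2013)
CUTOFF=$(date +%s -ud '2013-02-13')
if [ "$epoch_date" -le "$CUTOFF" ]; then
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}.part1
else
echo "${epoch_date};${colB};${colF}" >> ${CSV_SPLIT}.${SHEET1}
fi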

Related

Estimate number of lines in a file and insert that value as first line

I have many files for which I have to estimate the number of lines in each file and add that value as first line. To estimate that, I used something like this:
wc -l 000600.txt | awk '{ print $1 }'
However, I have had no success doing it for all files and then adding the corresponding value as the first line of each file.
An example:
a.txt b.txt c.txt
>>print a
15
>> print b
22
>>print c
56
Then 15, 22 and 56 should be added respectively to a.txt, b.txt and c.txt.
I appreciate the help.
You can add a placeholder pattern (for example LINENUM) as the first line of the file and then use the following script:
wc -l a.txt | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i 's/LINENUM/LINENUM:{}/' a.txt
or just use this script:
wc -l a.txt | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' a.txt
This way you can add the line count as the first line of all *.txt files in the current directory. Using the group command here is also faster than the in-place editing commands in the case of large files. Do not modify the spaces or semicolons inside the grouping.
for f in *.txt; do
{ wc -l < "$f"; cat "$f"; } > "${f}.tmp" && mv "${f}.tmp" "$f"
done
To iterate over all the files you can use this script:
for f in *; do if [ -f "$f" ]; then wc -l "$f" | awk 'BEGIN {FS =" ";} { print $1;}' | xargs -I {} sed -i '1s/^/LINENUM:{}\n/' "$f"; fi; done
This might work for you (GNU sed):
sed -i '1h;1!H;$!d;=;x' file1 file2 file3 etc ...
This stores each file in memory and inserts the last line's line number, i.e. the file's line count, as the first line.
Alternative (slurp the whole file into the pattern space, then = prints the final line number, i.e. the count, ahead of the content):
sed -i ':a;$!{N;ba};=' file?

Write to file from within a for loop in Bash

Let's say I have the following csv file:
A,1
A,2
B,3
C,4
C,5
And for each unique value i in the first column of the file I want to write a script that does some processing using this value. I go about doing it this way:
CSVFILE=path/to/csv
VALUES=$(cut -d, -f1 $CSVFILE | sort | uniq)
for i in $VALUES;
do
cat >> file_${i}.sh <<-!
#!/bin/bash
#
# script that takes value I
#
echo "Processing" $i
!
done
However, this creates empty files for all the values of i it loops over, and prints the intended file content to the console.
Is there a way to redirect the output to the files instead?
Simply
#!/bin/bash
FILE=/path/to/file
values=$(awk -F, '{print $1}' "$FILE" | sort -u | tr '\n' ' ')
for i in $values; do
echo "value of i is $i" >> file_$i.sh
done
Try using this:
#!/usr/bin/env bash
csv=/path/to/file
while IFS= read -r i; do
cat >> "file_$i.sh" <<-eof
#!/bin/bash
#
# Script that takes value $i ...
#
eof
done < <(cut -d, -f1 "$csv" | sort -u)
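For the sample csv above, this should create file_A.sh, file_B.sh and file_C.sh; each generated file would look roughly like this (shown for value A):
#!/bin/bash
#
# Script that takes value A ...
#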

sh to read a file and take particular value in shell

I need to read a json file, take values like 99XXXXXXXXXXXX0 and cccs, and write them to a csv having the columns BASE_No and Schedule.
Input file: classedFFDCD_5666_4888_45_2018_02112018012106.021.json
"bfgft":"99XXXXXXXXXXXX0","fp":"XXXXXX","cur_gt":225XXXXXXXX0,"cccs"
"bfgft":"21XXXXXXXXXXXX0","fp":"XXXXXX","cur_gt":225XXXXXXXX0,"nncs"
"bfgft":"56XXXXXXXXXXXX0","fp":"XXXXXX","cur_gt":225XXXXXXXX0,"fgbs"
"bfgft":"44XXXXXXXXXXXX0","fp":"XXXXXX","cur_gt":225XXXXXXXX0,"ddss"
"bfgft":"94XXXXXXXXXXXX0","fp":"XXXXXX","cur_gt":225XXXXXXXX0,"jjjs"
Expected output:
BASE_No,Schedule
99XXXXXXXXXXXX0,cccs
21XXXXXXXXXXXX0,nncs
56XXXXXXXXXXXX0,fgbs
44XXXXXXXXXXXX0,ddss
94XXXXXXXXXXXX0,jjjs
I am using the code below to read the file name and date, but I am unable to extract BASE_No and Schedule from the file.
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
for line in `ls -lrt *.json`; do
date=$(echo $line |awk -F ' ' '{print $6" "$7}');
file=$(echo $line |awk -F ' ' '{print $9}');
echo ''$file','$(date "+%Y/%m/%d %H.%M.%S")'' >> $File_Tracker
done
Assuming the structure of the json doesn't change from line to line, the sample code works through the file line by line to retrieve the particular values and concatenates them using printf. The output is then stored in a new output.txt file.
#!/bin/bash
input="/home/kj4458/winhome/Downloads/sample.json"
printf "Base,Schedule \n" > output.txt
while IFS= read -r var
do
printf "`echo "$var" | cut -d':' -f 2 | cut -d',' -f 1`,`echo "$var" | cut -d':' -f 4 | cut -d',' -f 2` \n" | sed 's/"//g' >> output.txt
done < "$input"
awk -F " \" " ' {print $4","$12 }' file
99XXXXXXXXXXXX0,cccs
21XXXXXXXXXXXX0,nncs
56XXXXXXXXXXXX0,fgbs
44XXXXXXXXXXXX0,ddss
94XXXXXXXXXXXX0,jjjs
I got that result!

Splitting out a large file

I would like to process a 200 GB file with lines like the following:
...
{"captureTime": "1534303617.738","ua": "..."}
...
The objective is to split this file into multiple files grouped by hours.
Here is my basic script:
#!/bin/sh
echo "Splitting files"
echo "Total lines"
sed -n '$=' $1
echo "First Date"
head -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
echo "Last Date"
tail -n1 $1 | jq '.captureTime' | xargs -i date -d '@{}' '+%Y%m%d%H'
while read p; do
date=$(echo "$p" | sed 's/{"captureTime": "//' | sed 's/","ua":.*//' | xargs -i date -d '@{}' '+%Y%m%d%H')
echo $p >> split.$date
done <$1
Some facts:
80 000 000 lines to process
jq doesn't work well since some JSON lines are invalid.
Could you help me to optimize this bash script?
Thank you
This awk solution might come to your rescue:
awk -F'"' '{file=strftime("%Y%m%d%H",$4); print >> file; close(file) }' $1
It essentially replaces your while-loop (note that this one-liner names the output files by the hour stamp alone, without the split. prefix; the full script below adds it).
Furthermore, you can replace the complete script with:
# Start AWK file
BEGIN{ FS='"' }
(NR==1){tmin=tmax=$4}
($4 > tmax) { tmax = $4 }
($4 < tmin) { tmin = $4 }
{ file="split."strftime("%Y%m%d%H",$4); print >> file; close(file) }
END {
print "Total lines processed: ", NR
print "First date: "strftime("%Y%m%d%H",tmin)
print "Last date: "strftime("%Y%m%d%H",tmax)
}
You can then run it as:
awk -f <awk_file.awk> <jq-file>
Note: the usage of strftime indicates that you need to use GNU awk.
You can start optimizing by replacing this:
sed 's/{"captureTime": "//' | sed 's/","ua":.*//'
with this:
sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p'
-n suppress automatic printing of pattern space
-E use extended regular expressions in the script
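Dropped into the original loop, that would look roughly like the sketch below (ts and d are just illustrative variable names). It still forks one sed and one date per line, so the awk version above will be far faster on 80 000 000 lines:
while read -r p; do
ts=$(printf '%s\n' "$p" | sed -nE 's/(\{"captureTime": ")([0-9\.]+)(.*)/\2/p')
d=$(date -d "@${ts%.*}" '+%Y%m%d%H')   # drop the fractional part before handing the epoch to date
echo "$p" >> "split.$d"
done < "$1"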

BASH Grep for Specific Email Address in CSV

I'm trying to compare two CSV files by reading the first one line by line and grepping the second file for a match. Using diff is not a viable solution. I seem to be having a problem when the email address is stored in a variable and I grep the second file with it.
#!/bin/bash
LANG=C
head -2 $1 | tail -1 | while read -r line; do
line=$( echo $line | sed 's/\n//g' )
echo $line
cat $2 | cut -d',' -f1 | grep -iF "$line"
done
Variable $line contains an email address that DOES exist in file $2, but I'm not getting any results.
What am I doing wrong?
File1
Email
email@verizon.net
email@gmail.com
email@yahoo.com
File2
email,,,,
email@verizon.net,,,,
email@gmail.com,,,,
email@yahoo.com,,,,
Given:
# csv_0.csv
email
me@me.com
you@me.com
fee@me.com
and
# csv_1.csv
email,foo,bar,baz,bim
bee@me.com,3,2,3,4
me@me.com,4,1,1,32
you@me.com,7,4,6,6
gee@me.com,1,2,2,6
me@me.com,5,7,2,34
you@me.com,22,3,2,33
I ran
$ pattern=$(head -2 csv_0.csv | tail -1 | sed s/,.*//g)
$ grep $pattern csv_1.csv
me@me.com,4,1,1,32
me@me.com,5,7,2,34
To do this for each line in csv_0.csv
#!/bin/bash
LANG=C
filename="$1"
{
read # don't read csv headers
while read line
do
pattern=$(echo $line | sed s/,.*//g)
grep $pattern $2
done
} <"$filename"
Then
$ ./csv_read.sh csv_0.csv csv_1.csv
me@me.com,4,1,1,32
me@me.com,5,7,2,34
you@me.com,7,4,6,6
you@me.com,22,3,2,33
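If looping grep once per address ever gets slow, the whole comparison can also be done with a single grep call by feeding every address as a fixed-string pattern. This is only a sketch of an alternative, not the approach above; it assumes the addresses sit in the first column of csv_0.csv:
tail -n +2 csv_0.csv | cut -d, -f1 | grep -F -f - csv_1.csv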
