I have a tricky question about how to keep only the latest log data, since my server posted some of it twice.
This is the result after I grep through my folder (I have tons of data; this is just a simplified example):
...
20150630-201427.csv:20150630,CFIIASU,233,96.21786,0.44644,
20150630-201427.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150630-201427.csv:20150630,CFIIASU_CN,68,102.19569,0.10692
20150630-201427.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150630-201427.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150630-201427.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
...
The data actually comes from many CSV files; I picked just two of them for this example. Some explanation:
The example comes from the two files 20150630-201427.csv and 20150701-151654.csv, and each line has five columns corresponding to date, dataname, data_column1, data_column2 and data_column3.
These lines share the same data date (20150630) and the same datanames (CFIIASU, CFIIASU_AU, etc.), but the numbers in the fourth and fifth columns (data_column2 and data_column3) differ.
How can I keep the data from 20150701-151654.csv, based on the file name and the data date, and apply that to my whole data set?
To make it clearer: I'd like to keep the lines from "the latest csv", and the latest csv can be identified by its file name, which in this example is 20150701-151654.csv. But my whole data set contains so many 20xxxxxx.csv files that I can't check them one by one.
For this example, the result should end up like this:
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
Thanks in advance.
Your question isn't clear but it sounds like this might be what you're trying to do (print all lines from the last csv mentioned in the input file):
$ tac file | awk -F':' 'NR>1 && $1!=prev{exit} {print; prev=$1}' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
or maybe this (print the last line seen for every 20150630,CFIIASU etc. pair in the input file):
$ tac file | awk -F'[:,]' '!seen[$2,$3]++' | tac
20150701-151654.csv:20150630,CFIIASU,233,96.21450,0.44294
20150701-151654.csv:20150630,CFIIASU_AU,65,90.71109,0.28569
20150701-151654.csv:20150630,CFIIASU_CN,68,102.16538,0.07723
20150701-151654.csv:20150630,CFIIASU_ID,37,98.02484,0.27775
20150701-151654.csv:20150630,CFIIASU_KR,39,98.42257,0.83055
20150701-151654.csv:20150630,CFIIASU_TH,24,99.94482,0.20743
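If, instead, the whole data set mixes several data dates and you want, for each data date, only the rows coming from the newest file that contains that date, a two-pass awk sketch could look like this (assuming the grep output is saved in file and that the YYYYMMDD-HHMMSS file names sort chronologically):

awk -F'[:,]' '
    NR==FNR { if ($1 > max[$2]) max[$2] = $1; next }   # pass 1: newest file per data date
    $1 == max[$2]                                      # pass 2: keep only rows from that file
' file file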
I have 600 CSV files of ~1 MB each, for a total of roughly 600 MB. I want to put all of them into a sqlite3 db, so my first step would be to merge them into one big CSV (of ~600 MB, right?) before importing it into the db.
However, when I run the following bash commands (to merge all files while keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv grows to 38 GB, at which point the process stops because I have no space left on the device.
So my question is: why is the merged file more than 50 times bigger than expected, and what can I do to load the data into a sqlite3 db at a reasonable size?
I guess my first question is: if you know how to write a for loop, why do you need to merge all the files into a single CSV at all? Can't you just load them one after the other?
But your actual problem is an infinite loop. Your wildcard (*.csv) includes the file you're writing to, so once the loop reaches file.csv it is reading from the very file it is appending to. Put your output file in a different directory, or make sure your glob does not match the output file (for f in file-*.csv, maybe).
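For example, a sketch of both routes, assuming all chunks match file-chunk*.csv; mytable and my.db are placeholder names:

# option 1: merge, but write the output somewhere the glob cannot see it
head -n1 file-chunk0001.csv > /tmp/merged.csv
for f in file-chunk*.csv; do
    tail -n +2 "$f" >> /tmp/merged.csv
done

# option 2: skip the merge and import each chunk directly; the first .import
# creates the table from the header row (sqlite3 does this when the table does
# not exist yet), later chunks are appended with their headers stripped
printf '.mode csv\n.import file-chunk0001.csv mytable\n' | sqlite3 my.db
for f in file-chunk*.csv; do
    [ "$f" = "file-chunk0001.csv" ] && continue
    tail -n +2 "$f" > body.tmp
    printf '.mode csv\n.import body.tmp mytable\n' | sqlite3 my.db
done
rm -f body.tmp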
I'm trying to use the DESCRIBE function in Hive to output the column descriptions of each of the tables to individual files. I've discovered the -f option, so I can just read from a file and write the output back out:
hive -f nameOfSqlQueryFile.sql > out.txt
However, if I open the output file, all the descriptions run together back to back, and it's unclear where the description of one table starts and where it ends.
So, I've tried making a batch file that uses -e to describe each of the tables individually and output to a file:
#!/bin/bash
nameArr=( $(hive -e 'show tables;') )
count=0
for i in "${nameArr[@]}"
do
    echo "Working on table ($count): $i"
    hive -e "describe $i" > "${i}_.txt"
    count=$((count+1))
done
However, because this needs to reconnect for each query, it's remarkably slow, taking hours to process several hundred queries.
Does anyone have an idea of how else I might run each of these DESCRIBE functions, and ideally output to separate files?
You can probably use one of these, depending on how you process the output:
Just use the OK line as a separator and search for it using a script.
Use DESCRIBE EXTENDED, which adds a line at the end with information about the table, including its location; that line can be used to extract the table name (with sed, for example).
If you're just using the output file as a manual reference, insert a SQL statement that prints a separator of your choice between each table, e.g.:
DESCRIBE table;
SELECT '-----------------' FROM table LIMIT 1;
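If the per-table reconnects are the real bottleneck, another option (a sketch, not tested against your setup) is to generate one script containing all the DESCRIBE statements plus a marker line per table, run it in a single hive session, and split the combined output afterwards. The '#### ' marker and the file names are hypothetical, and the SELECT-without-FROM form assumes a reasonably recent Hive (older versions need FROM <table> LIMIT 1):

hive -e 'show tables;' > tables.txt
while read -r t; do
    echo "SELECT '#### $t';"     # marker line before each description
    echo "DESCRIBE $t;"
done < tables.txt > describe_all.sql

hive -f describe_all.sql > all_describes.txt   # one session, one connection

# split the combined output into one <table>_.txt file per marker
awk '/^#### /{if (out) close(out); out=substr($0,6)"_.txt"; next}
     out != "" {print > out}' all_describes.txt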
I have to parse ASCII files, output the relevant data to a comma-delimited file, and load it into a database table.
The specs for the file format were recently updated, and one section is causing problems. This is the original layout for that section:
CSVHeaderAttr:PUIS,IdleImmediate,POH,Temp,WorstTemp
CSVValuesAttr:NO,NO,9814,31,56
I parse it with grep like this:
CSVAttributes=$(grep ^CSVValuesAttr: ${filename}|cut -d':' -f2)
[ -z "$CSVAttributes" ] && CSVAttributes="NA"
It works great, but now the section has new fields, and they are named differently:
CSVHeaderAttr:PUIS,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
CSVValuesAttr:NO,YES,YES,23861,31,51
Right now, I am grepping the files based on their layout (there is a field in the header that tells me the layout version) into two different comma-delimited files and loading them into two different tables. I would like to output both sections to the same file so the data scientist only has one table to use in his analysis.
Is there a way to use grep to produce output like this, substituting NA for the missing fields?
For one file type:
CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
CSVValuesAttr:NO,NO,NA,NA,9814,31,56
For the other file type:
CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
CSVValuesAttr:NO,NA,YES,YES,23861,31,51
Thanks for your input.
sed -n '/CSVHeaderAttr:/ c\
CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
/CSVValuesAttr:/ {
/\([^,]*,\)\{5\}/ s/\([^,]*,\)/&NA,/
t p
s/\(\([^,]*,\)\{2\}\)/\1NA,NA,/
# t p
: p
p
}' AllYourFiles > ConcatFile
This sed script tests how many columns a CSVValuesAttr line has (via the /\([^,]*,\)\{5\}/ address) before deciding how to pad it: lines from the new layout get a single NA inserted after the first value, while lines from the old layout get NA,NA inserted after the second value; every CSVHeaderAttr line is replaced with the unified header.
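For comparison, here is an awk sketch of the same transformation; it assumes the two layouts can be told apart purely by their field counts (5 fields for the old layout, 6 for the new one), and AllYourFiles / ConcatFile are the same placeholders as above:

awk -F',' -v OFS=',' '
/^CSVHeaderAttr:/ {
    print "CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp"
    next
}
/^CSVValuesAttr:/ {
    if (NF == 6)            # new layout: NA for the old IdleImmediate column
        print $1, "NA", $2, $3, $4, $5, $6
    else if (NF == 5)       # old layout: NA,NA for the two new Idle* columns
        print $1, $2, "NA", "NA", $3, $4, $5
}' AllYourFiles > ConcatFile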
How do I find the number of columns for a given row key in HBase (since a row can have many columns)?
I don't think there's a direct way to do that, as each row can have a different number of columns and they may be spread over several files.
If you don't want to bring the whole row to the client to perform the count there, you can write an endpoint coprocessor (HBase's version of a stored procedure, if you like) to perform the calculation on the region server side and return only the result. You can read a little about coprocessors here
There is a simple way:
Use the hbase shell to scan through the table and write the output to an intermediate text file. Because the hbase shell prints each column of a row on its own line, we can just count the lines in the text file (minus the first 6 lines of hbase shell boilerplate and the last 2 lines).
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow'}" | hbase shell > row.txt
wc -l row.txt
Make sure to select the appropriate row keys: the STARTROW is inclusive, but the ENDROW is not.
If you are only interested in specific columns (or column families), apply the corresponding filters in the hbase shell command above (e.g. FamilyFilter, ColumnRangeFilter, ...).
Thanks to @user3375803; actually, you don't need the intermediate txt file. Since I can't comment on your answer, I'm leaving mine below:
echo "scan 'mytable', {STARTROW=>'mystartrow', ENDROW=>'myendrow'}" | hbase shell | wc -l | awk '{print $1-8}'