How to split a large CSV file into multiple JSON files using the Miller command line tool? - miller

I am currently using this Miller command to convert a CSV file into a JSON array file:
mlr --icsv --ojson --jlistwrap cat sample.csv > sample.json
It works fine, but the JSON array is too large.
Can Miller split the output into many smaller JSON files of X rows each?
For example, if the original CSV has 100 rows, can I modify the command to output 10 JSON array files, with each JSON array holding 10 converted CSV rows?
Bonus points if each JSON array can also be wrapped like this:
{
"instances":
//JSON ARRAY GOES HERE
}

You could run this:
mlr --c2j --jlistwrap put -q '
  begin {
    @batch_size = 1000;
  }
  index = int(floor((NR-1) / @batch_size));
  label = fmtnum(index, "%04d");
  filename = "part-".label.".json";
  tee > filename, $*
' ./input.csv
You will get a file named part-00xx.json for every 1,000 records.
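That command does not produce the {"instances": ...} wrapper from the bonus part. One rough way to add it, assuming jq is available and each part file holds a plain JSON array, is to post-process the part files:

for f in part-*.json; do
  # wrap the array from each part file under an "instances" key
  jq '{instances: .}' "$f" > "${f%.json}.wrapped.json"
done

The .wrapped.json suffix is just a placeholder naming convention; adjust it to taste.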

Related

Copy columns of a file to specific location of another pipe delimited file

I have a file suppose xyz.dat which has data like below -
a1|b1|c1|d1|e1|f1|g1
a2|b2|c2|d2|e2|f2|g2
a3|b3|c3|d3|e3|f3|g3
Due to some requirement, I am making two new files (m.dat and o.dat) from the original xyz.dat.
M.dat contains columns 2|4|6 like below after running some logic on it -
b11|d11|f11
b22|d22|f22
b33|d33|f33
O.dat contains all the columns except 2|4|6 like below without any change in it -
a1|c1|e1|g1
a2|c2|e2|g2
a3|c3|e3|g3
Now I want to merge both M and O file to create back the original format xyz.dat file.
a1|b11|c1|d11|e1|f11|g1
a2|b22|c2|d22|e2|f22|g2
a3|b33|c3|d33|e3|f33|g3
Please note that the column positions can change for another file. I will be given the column positions (in the above example they are 2, 4 and 6), so I need either a generic command I can run in a loop to merge the new M and O files, or a single command to which I can pass the column positions and which will copy the columns from the M.dat file and paste them into the O.dat file.
I tried paste, sed and cut but was not able to put together a working command.
Please help.
To perform a column-wise merge of two files, it is better to use a scripting language (Python, Awk, Perl or even bash). Tools like paste, sed and cut do not have enough flexibility for this kind of task (join comes close, but requires extra work).
Consider the following awk-based script:
awk -F'|' -vOFS='|' '
{
    # Read the matching line from o.dat and split it on "|" into a
    getline s < "o.dat"
    n = split(s, a)
    # Print output; add a[n], or $n, ... as needed based on the actual number of fields.
    print a[1], $1, a[2], $2, a[3], $3, a[4]
}
' m.dat
The print line can be customized to generate whatever column order is needed.
Based on clarification from the OP, it looks like the goal is: given two input files and a list of columns whose data should be taken from the 2nd file, produce an output file that contains the merged data.
For example:
awk -f mergeCols COLS=2,4,6 M=b.dat a.dat
# If file is marked executable (chmod +x mergeCols)
mergeCols COLS=2,4,6 M=b.dat a.dat
This will insert the columns from b.dat into columns 2, 4 and 6, while the other columns will contain the data from a.dat.
Implementation, using awk (create a file named mergeCols):
#! /usr/bin/awk -f
BEGIN {
    FS=OFS="|"
}
NR==1 {
    # Set the column map
    nc = split(COLS, c, ",")
    for (i=1 ; i<=nc ; i++ ) {
        cmap[c[i]] = i
    }
}
{
    # Read one line from merged file, split into tokens in 'a'
    getline s < M
    n = split(s, a)
    # Merge columns using pre-set 'cmap'
    k = 0
    for (i=1 ; i<=NF+nc ; i++ ) {
        # Pick up a column
        v = cmap[i] ? a[cmap[i]] : $(++k)
        sep = (i<NF+nc) ? "|" : "\n"
        printf "%s%s", v, sep
    }
}
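Applied to the file names in the question (merging m.dat's columns back into positions 2, 4 and 6 of o.dat), the invocation would be along these lines, where xyz_merged.dat is just a placeholder name for the output:

awk -f mergeCols COLS=2,4,6 M=m.dat o.dat > xyz_merged.dat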

Extract 2 fields from string with search

I have a file with several lines of data. The fields are not always in the same position/column. I want to search for 2 strings and then show only the field and the data that follows. For example:
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
I would like to return the following:
"id":"1111","hwVersion":"4444"
"id":"5555","hwVersion":"7777"
I am struggling because the data isn't always in the same position, so I can't choose a column number. I feel I need to search for "id" and "hwVersion". Any help is GREATLY appreciated.
Totally agree with @KamilCuk. More specifically:
jq -c '{id: .id, hwVersion: .hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
Outputs:
{"id":"1111","hwVersion":"4444"}
Not quite the specified output, but valid JSON.
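Incidentally, jq also accepts an equivalent shorthand for building an object from fields of the same name:
jq -c '{id, hwVersion}' <<< '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'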
More to the point, your input should probably be processed record by record, and my guess is that a two column output with "id" and "hwVersion" would be even easier to parse:
cat << EOF | jq -j '"\(.id)\t\(.hwVersion)\n"'
{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}
{"id":"5555","name":"6666","hwVersion":"7777"}
EOF
Outputs:
1111 4444
5555 7777
Since the data looks like mapping objects in JSON format, something like this should do, if you don't mind using Python (which comes with JSON support):
import json

def get_id_hw(s):
    d = json.loads(s)
    return '"id":"{}","hwVersion":"{}"'.format(d["id"], d["hwVersion"])
We take a line of input string into s and parse it as JSON into a dictionary d. Then we return a formatted string with the double-quoted id and hwVersion keys, each followed by a colon and the double-quoted value of the corresponding key from the previously obtained dict.
We can try this with these test input strings and print the results:
# These will be our test inputs.
s1 = '{"id":"1111","name":"2222","versionCurrent":"3333","hwVersion":"4444"}'
s2 = '{"id":"5555","name":"6666","hwVersion":"7777"}'
# we pass and print them here
print(get_id_hw(s1))
print(get_id_hw(s2))
But we can just as well iterate over lines of any input.
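For instance, a minimal sketch, assuming the records live one per line in a file called input.txt (the file name is just a placeholder):

with open("input.txt") as f:
    for line in f:
        # each line is one JSON object; reuse the helper defined above
        print(get_id_hw(line))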
If you really wanted to use awk (GNU awk, since gensub is a gawk extension), you could, but it's not the most robust and suitable tool:
awk '{ i = gensub(/.*"id":"([0-9]+)".*/, "\\1", "g")
       h = gensub(/.*"hwVersion":"([0-9]+)".*/, "\\1", "g")
       printf("\"id\":\"%s\",\"hwVersion\":\"%s\"\n", i, h) }' /your/file
Since you mention the position is not known and assuming the fields can be in any order, we use one regex to extract id and another to get hwVersion, then print them in the given format. If the values could be something other than decimal digits as in your example, the [0-9]+ part would need to reflect that.
And for the fun of it (this preserves the order of the entries in the file), in sed:
sed -e 's#.*\("\(id\|hwVersion\)":"[0-9]\+"\).*\("\(id\|hwVersion\)":"[0-9]\+"\).*#\1,\3#' file
It looks for two groups of "id" or "hwVersion" followed by :"<DECIMAL_DIGITS>".

Use the smarter_csv gem and process the CSV in chunks - I need to delete rows from a large CSV (2 GB) by comparing the key/values with another CSV (1 GB)

The following is the code I have used. I am not able to delete the rows from Main.csv when the value of the "name" col in Main.csv equals the value of the "name" col in Sub.csv. Please help me with this. I know I am missing something. Thanks in advance.
require 'rubygems'
require 'smarter_csv'

main_csv = SmarterCSV.process('Main.csv', {:chunk_size => 100}) do |chunk|
  short_csv = SmarterCSV.process('Sub.csv', {:chunk_size => 100}) do |smaller_chunk|
    chunk.each do |each_ch|
      smaller_chunk.each do |small_each_ch|
        each_ch.delete_if{|k,v| v == small_each_ch[:name]}
      end
    end
  end
end
It's a bit of a non-standard scenario for smarter_csv.
Sub.csv has 2,000 rows, whereas Main.csv has around 1 million rows.
If all you need to decide is whether the name appears in both files, then you can do this (see the sketch below):
1) read the Sub.csv file first, and just store the values of name in an array sub_names
2) open an output file for the result.csv file
3) read the Main.csv file, with processing in chunks, and write the data for each row to the result.csv file if the name does not appear in the array sub_names
4) close the output file - et voilà!
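A rough, untested sketch of those four steps; the file names, the :name key and the chunk size are taken from the question:

require 'smarter_csv'
require 'csv'
require 'set'

# 1) collect the names that appear in Sub.csv
sub_names = Set.new
SmarterCSV.process('Sub.csv', {:chunk_size => 100}) do |chunk|
  chunk.each { |row| sub_names << row[:name] }
end

# 2) open the output file
CSV.open('result.csv', 'w') do |out|
  headers_written = false
  # 3) stream Main.csv in chunks, keeping only rows whose :name is not in sub_names
  SmarterCSV.process('Main.csv', {:chunk_size => 100}) do |chunk|
    chunk.each do |row|
      next if sub_names.include?(row[:name])
      unless headers_written
        out << row.keys
        headers_written = true
      end
      out << row.values
    end
  end
end
# 4) the CSV.open block closes result.csv automatically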

How do I trim a csv in Matlab, like I could in bash, in order to load the csv with Matlab's readtable?

The csv that needs to be analysed contains a useless label in the first row.
The headers are located in the second row.
There is other useless information from line 102 onward, totalling 147 lines of uselessness, which contain a different number of columns than the 100 rows above them.
The relevant rows contain numeric values as well as the occasional NaN.
When the csv is opened, it would resemble:
unnecessarily labeled
columnA columnB columnC columnD columnE
1 2 3 4 5
4 5 6 NaN 8
[...]
301 302 303 304 305
data that really belongs in a separate csv
csv sample
unnecessarily labeled,,,,,,,,
columnA,columnB,columnC,columnD,columnE,,,,
1,2,3,4,5,,,,
4,5,6,NaN,8,,,,
301,302,303,304,305,,,,
data,that,really,belongs,in,a,separate,csv,
If I were to pre-process the file in bash I would:
sed -e1,1d $f > $processedFilename #remove the top line
head -n -147 $processedFilename > tmp && mv tmp $processedFilename #remove the last 147 lines
Can I do a similar pre-processing in Matlab? Can this be done more directly with readtable? In other words, how can I load this csv data into a table, preferably with the benefit of the headers populating automagically and with only the relevant rows and columns? In other other words, is there a parallel to
T = readtable('patients.xls',...
'Range','C2:E6',...
for csv data?
There's no direct method in Matlab to skip lines at the end of a file, so really the most you can do is read all of the data, then delete the extraneous rows/columns.
We can specify the number of lines to skip at the beginning of the file by passing 'HeaderLines', though.
I imagine you want a method that works in the general case with files of this format; however, if you know at least how many columns of data there will be, and how many extraneous lines there will be at the end, then this should work:
x = readtable('file', 'HeaderLines', 1);
x = x(:, 1:num_columns);
headers = table2cell(x(1, :));
x.Properties.VariableNames = headers;
x = x(2:size(x, 1) - num_extraneous_rows, :);
First, we manually pick the number of columns to include.
Then, we set the headers of the table.
Finally, we exclude the extraneous rows (and the first row containing the headers).
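As a possible alternative, and only a sketch under the assumption that your Matlab release supports detectImportOptions with the DataLines range property (newer releases do; older ones spell these options slightly differently), you could describe the layout of the sample above (headers on row 2, data on rows 3 through 102, five real columns) up front:

opts = detectImportOptions('file.csv');   % let Matlab guess the format, then override it
opts.VariableNamesLine = 2;               % the headers live on the second row
opts.DataLines = [3 102];                 % keep only the 100 real data rows
T = readtable('file.csv', opts);
T = T(:, 1:5);                            % drop the trailing empty columns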

Search and output item if found in more than 90% of files

I have 100 txt files; each text file contains IDs in a single long column. I want to find each ID in all 100 files: if an ID appears in at least 90 out of the 100 files, the ID should be appended to an output file. The program will look for each ID in all files and output all IDs found in at least 90% of the files. I have an idea of what to do but I couldn't put it together in a shell script. For example, each file looks like this:
file_1.txt
BGIBMGA010657
BGIBMGA010658
BGIBMGA010659
BGIBMGA010664
BGIBMGA010666
BGIBMGA010671
BGIBMGA010673
BGIBMGA010674
BGIBMGA010676
BGIBMGA010685
BGIBMGA010687
BGIBMGA010699
BGIBMGA010714
BGIBMGA010723
The code would do something like this:
for line in file
for files in *.txt
if line found in at least 90 files
append line in a new file
I need to translate this into a shell script.
Thanks.
awk '
BEGIN { num_files = ARGC - 1 }
{ count[$1]++ }
END {
    for (id in count)
        if ( (count[id]/num_files) >= 0.9 )
            print id
}
' *.txt
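One caveat not covered above: if the same ID can occur more than once within a single file, its count is inflated and it could pass the 90% threshold spuriously. A variant that counts each ID at most once per file (the output file name is just a placeholder):

awk '
BEGIN { num_files = ARGC - 1 }
# count an ID only the first time it is seen in each file
!seen[FILENAME, $1]++ { count[$1]++ }
END {
    for (id in count)
        if ( (count[id]/num_files) >= 0.9 )
            print id
}
' *.txt > ids_in_90_percent.txt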
