What is the Miller command for separating emails into their own rows, while also copying down other column data? - miller

I have a very large csv file (213,265 rows) with many columns.
In one of those columns I have some emails separated by commas. A trimmed-down version of the csv file looks like this:
company,address,email
anna,123 fake,"anna@ciao.it,annac@gfail.com,a@box.net"
I would like to use Miller to separate out those emails into their own rows, but also copy down ALL the other columns in the spreadsheet (many of which aren't shown here in this simple example).
Following on with this example, I would like to end up with something like this, but keep in mind the real spreadsheet has many other columns before and after the email column:
company,address,email
anna,123 fake,anna@ciao.it
anna,123 fake,annac@gfail.com
anna,123 fake,a@box.net
Is that possible to do with Miller (or another similar tool)? What would the command look like?

The verb is nest. Starting from
company,address,email
anna,123 fake,"anna@ciao.it,annac@gfail.com,a@box.net"
and running
mlr --csv nest --explode --values --across-records --nested-fs "," -f email input.csv
you will have
+---------+----------+-----------------+
| company | address  | email           |
+---------+----------+-----------------+
| anna    | 123 fake | anna@ciao.it    |
| anna    | 123 fake | annac@gfail.com |
| anna    | 123 fake | a@box.net       |
+---------+----------+-----------------+
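For less typing, recent Miller releases also document an --evar shorthand for that combination of options (a sketch assuming a reasonably current mlr; check mlr nest --help on your version):
# --evar "," is shorthand for --explode --values --across-records --nested-fs ","
mlr --csv nest --evar "," -f email input.csv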
If you have a "bad" CSV you could run into problems, and you should try to clean it first. A generic cleaning command could be this one:
mlr --csv -N clean-whitespace then remove-empty-columns then skip-trivial-records then cat -n sample.csv | mlr --csv nest --explode --values --across-records --nested-fs "," -f Email >output.csv
It removes empty rows, empty columns and stray whitespace (adjust the -f Email field name and sample.csv to match your file).

Related

Sort a file based on a name and age, without displaying the name

In bash I need to sort continent;Country;Capital, so for example, Europe;France;Paris.
I only need the European countries and in the new file, I need to display them without the word Europe.
I tried cat ../world.capitals | sort -u > europe.capitals, but this only sorts the lines; it does not remove the word Europe and does not single out the European countries.
Select the lines beginning with "Europe" using grep, remove the first field using cut, and sort:
grep ^Europe world.capitals | cut -d ';' -f2- | sort > europe.capitals
You can combine the first two steps using sed in a single command:
sed -n 's/^Europe;//p' world.capitals | sort > europe.capitals
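As a quick sanity check, here is what the sed pipeline does with a small hypothetical world.capitals (the three sample lines below are made up for illustration):
$ printf 'Europe;France;Paris\nAsia;Japan;Tokyo\nEurope;Germany;Berlin\n' > world.capitals
$ sed -n 's/^Europe;//p' world.capitals | sort
France;Paris
Germany;Berlin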

How can I create a table in bash output that shows every item about a specific user?

I have a simple bash script
#!/bin/bash
grep home /etc/passwd | grep $1
that shows information about a user.
Example of output:
Vladislav:x:1000:1000:Vladislav,,,:/home/vladislav:/bin/bash
Is it possible to make it look like a table (as in the example screenshot)? I mean, with 2 columns and a simple separator between rows.
If all you want is columns, you can format it like:
grep home /etc/passwd | grep "$1" | sed 's/:/ /g' | awk '{NAME=$1;ENC=$2;AMT=$3;MIN=$4;$1=$2=$3=$4="";print "name\t"NAME"\nEncrypted password\t"ENC"\nAmount of days\t"AMT"\nMinimum count of days\t"MIN"\nOther item description\t"$0}'
This takes your original output, replaces each : with a literal space, then feeds it into awk and formats your "table".
The output I get from your sample input string is:
name Vladislav
Encrypted password x
Amount of days 1000
Minimum count of days 1000
Other item description Vladislav,,, /home/vladislav /bin/bash
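If you also want the columns lined up, one possible variant (a sketch, not from the original answer; it assumes column(1) from util-linux is available and uses neutral labels for the passwd fields) lets awk split on : directly and aligns the result with column:
#!/bin/bash
# Split /etc/passwd on ':' directly in awk, print "label<TAB>value" pairs,
# then align the two columns with column(1).
grep home /etc/passwd | grep "$1" \
  | awk -F: '{print "name\t"$1"\npassword\t"$2"\nUID\t"$3"\nGID\t"$4"\ncomment\t"$5"\nhome\t"$6"\nshell\t"$7}' \
  | column -t -s $'\t'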

Using first 4 characters of file name to create unique list

I primarily use Linux Mint 17.1 and I like to use command line to get things done.
At the moment, I am working on organising a whole lot of family pictures and making them easy to view via a browser.
I have a directory with lots of images.
As I was filling the directory I made sure to keep the first four letters of the filename unique to a specific topic, eg, car_, hse_, chl_ etc
The rest of the filename keeps it unique.
There are some 120 different prefixes and I would like to create a list of the unique prefixes.
I have tried 'ls -1 | uniq -d -w 4' and it works, but it gives me the first filename of each prefix.
I just want the prefixes.
Fyi, I will use this list to generate an HTML page as a kind of catalogue.
Summary:
Convert car_001,car_002,car_003,dog_001,dog_002
to
car_,dog_
try this
$ ls -1 | cut -c1-3 | sort -u
This uses the first 3 characters of the file names; use cut -c1-4 if you want to keep the trailing underscore (car_ rather than car).
Try something like
ls -1 | cut -d'_' -f1 | uniq | sort
where cut splits each name at _ and takes the first field. Note that uniq before sort only works here because ls already emits the names in sorted order, so equal prefixes are adjacent; like the previous answer, this drops the trailing underscore.
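Since the goal is a catalogue page, here is a small sketch that collects the unique four-character prefixes without parsing ls (assumptions: bash 4+ for associative arrays, and the images sit in the current directory):
#!/bin/bash
# Collect the unique 4-character prefixes (e.g. "car_", "hse_") of all files
# in the current directory, then print them sorted, one per line.
declare -A seen
for f in *; do
    [[ -f $f ]] || continue
    seen["${f:0:4}"]=1
done
printf '%s\n' "${!seen[@]}" | sort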

Append data to the end of a specific line in text file

I admit to being a novice at bash scripting, but I can't quite seem to figure out how to accomplish a key step in a script and couldn't quite find what I was looking for in other threads.
I am trying to extract some specific data (numerical values) from multiple .xml files and add those to a space or tab delimited text file. The files will be generated over time so I need a way to append a new dataset to the pre-existing text file.
For instance, I would like to extract values for 3 different categories, 1 per row or column, and the value for each category from multiple xml files. Basically, I want to build a continuous graph of the data from each of 3 categories over time.
I have the following code which will successfully extract the 3 numbers from the xml file and trim the unnecessary text:
#!/bin/sh
grep "<observation name=\"meanGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"meanBrightGhost\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"meanBrightGhost\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
grep "<observation name=\"std\" type=\"float\">" "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml" \
| sed 's/<observation name=\"std\" type=\"float\">//g' \
| sed 's/<\/observation>//g' >> $HOME/Desktop/testxml.txt
This gives the output:
1.12
0.33
134.1
I would like to then read in another xml file to get:
1.12 1.45
0.33 0.54
134.1 144.1
I would be grateful for any help with doing this! Thanks in advance.
Erik
It's much safer to use proper XML handling tools. For example, in xsh, you can write something like
$f1 := open /Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml ;
$f2 := open /path/to/the/second/file.xml ;
echo ($f1 | $f2)//observation[@name="meanGhost"] ;
echo ($f1 | $f2)//observation[@name="meanBrightGhost"] ;
echo ($f1 | $f2)//observation[@name="std"] ;
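If xsh is not available, a possible alternative (a sketch only: it assumes xmlstarlet is installed and reuses the file paths from the question) extracts the same three observations per file and pastes each new file's values on as an extra column:
#!/bin/sh
# Pull the three observation values out of one summaryQA.xml, one per line.
extract() {
    xmlstarlet sel -t \
        -v '//observation[@name="meanGhost"]'       -n \
        -v '//observation[@name="meanBrightGhost"]' -n \
        -v '//observation[@name="std"]'             -n \
        "$1"
}

out="$HOME/Desktop/testxml.txt"
new=$(extract "/Users/Erik/MRI/PHANTOM/2/phantom_qa/summaryQA.xml")

if [ -s "$out" ]; then
    # Existing data: append the new values as a further space-separated column.
    printf '%s\n' "$new" | paste -d' ' "$out" - > "$out.tmp" && mv "$out.tmp" "$out"
else
    # First run: just write the single column.
    printf '%s\n' "$new" > "$out"
fi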

how to join files in unix without sorting

I am trying to join 2 csv files based on a key in unix.
My files are really huge, 5 GB each, and sorting them is taking too long.
I want to repeat this procedure for 50 such joins.
Can someone tell me how to do the join quickly, without sorting?
Unfortunately there is no way around the sorting. But please take a look at some utility scripts I have written here: https://github.com/stefan-schroedl/tabulator. You can use them if you keep the header with the column names as the first line in each file. There is a script 'tbljoin' that will take care of the sorting and column counting for you. For example, say you have
Employee.csv:
employee_id|employee_name|department_id
4|John|10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
Dept.csv:
department_id|department_name
1|HR
2|Manufacturing
3|Engineering
4|Marketing
5|Sales
6|Information technology
7|Security
Then the command tbljoin Employee.csv Dept.csv produces
employee_id|employee_name|department_id|department_name
20|Peter|2|Manufacturing
21|David|3|Engineering
1|Monica|4|Marketing
12|Louis|5|Sales
13|Barbara|6|Information technology
tabulator contains many other useful features, e.g., for simple rearranging of columns.
Here is an example with two files whose data is delimited by pipes.
The employee file contains employee_id (the key), the employee name and department_id:
Employee.csv
4|John | 10
1|Monica|4
12|Louis|5
20|Peter|2
21|David|3
13|Barbara|6
The department file has department_id and the department name, delimited by pipes:
Dept.csv
1|HR
2| Manufacturing
3| Engineering
4 |Marketing
5| Sales
6| Information technology
7| Security
command (join needs both inputs sorted on the join field, so Employee_sort.csv here is Employee.csv sorted on its third column, e.g. with sort -t'|' -k3 Employee.csv > Employee_sort.csv):
join -t'|' -1 3 -2 1 Employee_sort.csv Dept.csv
-t'|' tells join that the files are pipe-delimited
-1 3 selects the third column of file 1, i.e. department_id from Employee_sort.csv
-2 1 selects the first column of file 2, i.e. department_id from Dept.csv
Using the above command, we get the following output:
2|20|Peter| Manufacturing
3|21|David| Engineering
4|1|Monica| Marketing
5|12|Louis| Sales
6|13|Barbara| Information technology
If you want everything from file 2 plus the corresponding entries in file 1, you can also use the -a and -v options. Try the following commands:
join -t'|' -1 3 -2 1 -v2 Employee_sort.csv Dept.csv
join -t'|' -1 3 -2 1 -a2 Employee_sort.csv Dept.csv
Here -a2 prints every line from file 2 (joined where a match exists), while -v2 prints only the lines from file 2 that have no match in file 1.
I think you could avoid using join (and thus sorting your files), but this is not a quick solution:
In both files, replace all pipes and all double spaces with single spaces:
sed -i 's/|/ /g;s/  / /g' Employee.csv Dept.csv
run these code lines as a bash script :
cat Employee.csv | while read a b c
do
    cat Dept.csv | while read d e
    do
        if [ "$c" -eq "$d" ] ; then
            echo -e "$a\t$b\t$c\t$e"
        fi
    done
done
Note that this nested looping takes a long time on large files.
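If the goal really is to skip the sorting, a standard trick (not shown in the answers above, so treat it as a sketch) is a hash join in awk: load the smaller file into an array keyed on the join column, then stream the big file once:
# Hash join without sorting: read Dept.csv (the small file) into memory,
# keyed on department_id, then stream Employee.csv once and append the name.
awk -F'|' 'NR==FNR { dept[$1] = $2; next }
           ($3 in dept) { print $0 "|" dept[$3] }' Dept.csv Employee.csv
This only needs the smaller file to fit in memory and reads each file exactly once.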
