How to remove repeated columns using ruby FasterCSV

I'm using Ruby 1.8 and FasterCSV.
The csv file I'm reading in has several repeated columns.
| acct_id | amount | acct_num | color | acct_id | acct_type | acct_num |
| 345 | 12.34 | 123 | red | 345 | 'savings' | 123 |
| 678 | 11.34 | 432 | green | 678 | 'savings' | 432 |
...etc
I'd like to condense it to:
| acct_id | amount | acct_num | color | acct_type |
| 345 | 12.34 | 123 | red | 'savings' |
| 678 | 11.34 | 432 | green | 'savings' |
Is there a general purpose way to do this?
Currently my solution is something like:
headers = FasterCSV.parse_line(file.readline) # header row
file.readline # discard the garbage line between headers and data
FasterCSV.filter(file, :headers => headers) do |row|
  row.delete(6) # delete second acct_num field
  row.delete(4) # delete second acct_id field
  # additional processing on the data
  row['color']     = color_to_number(row['color'])
  row['acct_type'] = acct_type_to_number(row['acct_type'])
end

Assuming you want to get rid of the hardcoded deletions,
row.delete(6) # delete second acct_num field
row.delete(4) # delete second acct_id field
can be replaced by
row = row.to_hash
This clobbers the duplicates, since hash keys are unique. The rest of the posted code will keep working.
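For example, a minimal sketch of that change inside the filter loop from the question:
FasterCSV.filter(file, :headers => headers) do |row|
  row = row.to_hash # duplicate headers collapse into single keys, last value wins
  # additional processing on the data works unchanged
  row['color']     = color_to_number(row['color'])
  row['acct_type'] = acct_type_to_number(row['acct_type'])
end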

Related

How to write multiple arrow/parquet files in chunks while reading in large quantities of data so that all written files are one dataset?

I'm working in R with the arrow package. I have multiple tsv files:
sample1.tsv
sample2.tsv
sample3.tsv
sample4.tsv
sample5.tsv
...
sample50.tsv
each of the form
| id | start| end | value|
| --- | -----|-----|------|
| id1 | 1 | 3 | 0.2 |
| id2 | 4 | 6 | 0.5 |
| id. | ... | ... | ... |
| id2 | 98 | 100 | 0.5 |
and an index file:
| id | start| end |
| --- | -----|-----|
| id1 | 1 | 3 |
| id2 | 4 | 6 |
| id. | ... | ... |
| id2 | 98 | 100 |
I use the index file to left join each sample on id, start and end to get a datatable like this:
| id | start| end | sample 1| sample 2| sample ...|
| --- | -----|-----|---------|---------|-----------|
| id1 | 1 | 3 | 0.2 | 0.1 | ... |
| id2 | 4 | 6 | 0.5 | 0.8 | ... |
| id. | ... | ... | ... | ... | ... |
| id2 | 98 | 100 | 0.5 | 0.6 | ... |
and so on for multiple samples. I'd like to read them in chunks (e.g. chunk_size = 5), and when I have a table of chunk_size samples read, write that joined datatable to disk as a parquet file.
Currently, I'm able to write each chunked datatable to disk, and I read them back with open_dataset(datadir). In a loop with i as the sample number:
# read and join
...
if (i %% chunk_size == 0) {
  write_parquet(joined_table, paste0("datadir/", "chunk", i / chunk_size, ".parquet"))
}
...
# clear the data table of samples
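For reference, a minimal sketch of the whole loop; the index table variable and the read_tsv_arrow/left_join calls are assumptions, not the original code:
library(arrow)
library(dplyr)

chunk_size <- 5
joined_table <- index # start each chunk from the index columns (assumed variable)

for (i in 1:50) {
  # read one sample and name its value column after the sample
  sample_i <- read_tsv_arrow(paste0("sample", i, ".tsv"))
  names(sample_i)[names(sample_i) == "value"] <- paste0("sample", i)
  joined_table <- left_join(joined_table, sample_i, by = c("id", "start", "end"))

  if (i %% chunk_size == 0) {
    write_parquet(joined_table, paste0("datadir/", "chunk", i / chunk_size, ".parquet"))
    joined_table <- index # clear the accumulated samples
  }
}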
However, even though the arrow package says it read as many files as were written, when I check the columns available, only the columns from the first chunk are found.
data <- arrow::open_dataset("datadir")
data
# FileSystemDataset with 10 Parquet files
# id: string
# start: int32
# end: int32
# sample1: double
# sample2: double
# sample3: double
# sample4: double
# sample5: double
Samples 6-50 are missing. Reading the parquet files individually shows that each contains the samples from its chunk.
data2 <- arrow::open_dataset("datadir/chunk2.parquet")
data2
# FileSystemDataset with 1 Parquet file
# id: string
# start: int32
# end: int32
# sample6: double
# sample7: double
# sample8: double
# sample9: double
# sample10: double
Are parquet files the right format for this task? I'm not sure what I'm missing to make a splintered set of files that all read back in as one dataset.

Concatenate two or more rows from a result into a single result with CI ActiveRecord

I have a situation like this: I want to get comma-delimited values from the database from more than one row, based on the month and year that I choose. For more detail, check this out...
My Schedule.sql :
+---+------------+-------------------------------------+
|ID |Activ_date | Do_skill |
+---+------------+-------------------------------------+
| 1 | 2020-10-01 | Accountant,Medical,Photograph |
| 2 | 2020-11-01 | Medical,Photograph,Doctor,Freelancer|
| 3 | 2020-12-01 | EO,Teach,Scientist |
| 4 | 2021-01-01 | Engineering, Freelancer |
+---+------------+-------------------------------------+
My skillqmount.sql :
+----+------------+------------+-------+
|ID |Date_skill |Skill |Price |
+----+------------+------------+-------+
| 1 | 2020-10-02 | Accountant | $ 5 |
| 2 | 2020-10-03 | Medical | $ 7 |
| 3 | 2020-10-11 | Photograph | $ 5 |
| 4 | 2020-10-12 | Doctor | $ 9 |
| 5 | 2020-10-01 | Freelancer | $ 7 |
| 6 | 2020-10-04 | EO | $ 4 |
| 7 | 2020-10-05 | Teach | $ 4 |
| 8 | 2020-11-02 | Accountant | $ 5 |
| 9 | 2020-11-03 | Medical | $ 7 |
| 10 | 2020-11-11 | Photograph | $ 5 |
| 11 | 2020-11-12 | Doctor | $ 9 |
| 12 | 2020-11-01 | Freelancer | $ 7 |
+----+------------+------------+-------+
On my website I want to make a calculation with those two tables. So if I want to see, starting from date 2020-10-01 until 2020-11-01, the total amount between those dates, I try to show it with this code:
Output example
+----+-----------+-----------+---------+
|No |Date Start |Date End |T.Amount |
+----+-------- --+-----------+---------+
|1 |2020-10-01 |2020-11-01 |$ 45 | <= this amount came from $5+$7+$5+$7+$5+$9+$7
+----+-------- --+-----------+---------+
Note :
Date Start : Input->post("A")
Date End : Input->post("B")
T.Amount : Total amount based on inputs A and B (by date)
I tried this code to get it :
<?php
$startd = $this->input->post('A');
$endd   = $this->input->post('B');

$chck = $this->db->select('Do_skill')
    ->where('Activ_date >=', $startd)
    ->where('Activ_date <', $endd)
    ->get('Schedule')
    ->row('Do_skill');

$dcek = $this->Check_model->comma_separated_to_array($chck);

$t_amount = $this->db->select_sum('price')
    ->where('Date_skill >=', $startd)
    ->where('Date_skill <', $endd)
    ->where_in('Skill', $dcek)
    ->get('skillqmount')
    ->row('price');

echo $t_amount; ?>
Check_model :
public function comma_separated_to_array($chck, $separator = ',')
{
    // Explode on the separator
    $vals = explode($separator, $chck);
    $val = array();
    // Trim whitespace from each value
    foreach ($vals as $v) {
        $val[] = trim($v);
    }
    return $val;
}
My problem is that the result in $t_amount is not $45. I think there's something missing in my code above; if there is any advice, I'd very much appreciate it... Thank you...
Your first query only returns one row of data.
I think you can do something like this for the first query.
$query1 = $this->db->query("SELECT Do_skill FROM schedule WHERE activ_date >= '$startd' AND activ_date < '$endd'");
$check = $query1->result_array();
$array = [];
foreach ($check as $ck) {
    $dats = explode(',', $ck['Do_skill']);
    $counter = count($dats);
    for ($i = 0; $i < $counter; $i++) {
        array_push($array, $dats[$i]);
    }
}
and you can use the array to do your next query :)
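For example, the merged array can feed straight into the original sum query (a sketch reusing the question's variable names):
// $array now holds the skills from every matching schedule row
$t_amount = $this->db->select_sum('price')
    ->where('Date_skill >=', $startd)
    ->where('Date_skill <', $endd)
    ->where_in('Skill', $array)
    ->get('skillqmount')
    ->row('price');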
The array $dcek has the values
Accountant,Medical,Photograph
The query from Codeigniter is
SELECT SUM(`price`) AS `price` FROM `skillqmount`
WHERE `Date_skill` >= '2020-10-01' AND
`Date_skill` < '2020-11-01' AND
`Skill` IN('Accountant', 'Medical', 'Photograph')
which returns 17 - this matches the first three entries in your data.
Your first query will only ever give one row, even if the date range would match multiple rows.

Redirect the table generated from Beeline to text file without the grid (Shell Script)

I am currently trying to find a way to redirect the standard output from the beeline shell to a text file without the grid. The biggest problem I am facing right now is that my columns have negative values, and when I use a regex to remove the '-' characters of the grid, it also affects the column values.
+-------------------+
| col |
+-------------------+
| -100 |
| 22 |
| -120 |
| -190 |
| -800 |
+-------------------+
Here's what I'm doing:
beeline -u jdbc:hive2://localhost:10000/default \
-e "SELECT * FROM $db.$tbl;" | sed 's/\+//g' | sed 's/\-//g' | sed 's/\|//g' > table.txt
I am trying to clean this file so I can read all the data into a variable.
Assuming all your data has the same pattern, where no significant '-' is wrapped in '+':
[root@machine]# cat boo
+-------------------+
| col |
+-------------------+
| -100 |
| 22 |
| -120 |
| -190 |
| -800 |
+-------------------+
[root@machine]# cat boo | sed 's/\+-*+//g' | sed 's/\--//g' | sed 's/|//g'
col
-100
22
-120
-190
-800
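Alternatively, if you can change the beeline invocation itself, beeline can skip the grid entirely with a different output format; a sketch, assuming a beeline version that supports the tsv2 format:
beeline -u jdbc:hive2://localhost:10000/default \
    --silent=true --outputformat=tsv2 --showHeader=false \
    -e "SELECT * FROM $db.$tbl;" > table.txt
This writes one tab-separated row per record, so the minus signs survive untouched and no sed pass is needed.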

How to split a row where there are 2 values in each cell separated by a carriage return?

Someone gives me a file that sometimes contains badly formed data.
The data should be like this:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
+---------+-----------+--------+
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
Sometimes it comes like this, though:
+---------+-----------+--------+
| Name | Initial | Age |
+---------+-----------+--------+
| Jack | J | 43 |
+---------+-----------+--------+
| Nicole | N | 12 |
| Mark | M | 22 |
+---------+-----------+--------+
| Karine | K | 25 |
+---------+-----------+--------+
As you can see, Nicole and Mark are put in the same row, but the data are separated by a carriage return.
I can split by row, but that multiplies the data:
+---------+-----------+--------+
| Nicole | N | 12 |
| | M | 22 |
+---------+-----------+--------+
| Mark | N | 12 |
| | M | 22 |
+---------+-----------+--------+
which makes me lose the fact that Mark is associated with the "2nd row" of data.
(The data here is purely an example.)
One way to do this is to transform each cell into a list by doing a Text.Split on the line feed / carriage return symbol.
TextSplit = Table.TransformColumns(Source,
    {
        {"Name",    each Text.Split(_, "#(lf)"), type text},
        {"Initial", each Text.Split(_, "#(lf)"), type text},
        {"Age",     each Text.Split(_, "#(lf)"), type text}
    }
)
Now each column is a list of lists which you can combine into one long list using List.Combine and you can glue these columns together to make table with Table.FromColumns.
= Table.FromColumns(
    {
        List.Combine(TextSplit[Name]),
        List.Combine(TextSplit[Initial]),
        List.Combine(TextSplit[Age])
    },
    {"Name", "Initial", "Age"}
)
Putting this together, the whole query looks like this:
let
    Source = <Your data source>,
    TextSplit = Table.TransformColumns(Source,{{"Name", each Text.Split(_,"#(lf)"), type text},{"Initial", each Text.Split(_,"#(lf)"), type text},{"Age", each Text.Split(_,"#(lf)"), type text}}),
    FromColumns = Table.FromColumns({List.Combine(TextSplit[Name]),List.Combine(TextSplit[Initial]),List.Combine(TextSplit[Age])},{"Name","Initial","Age"})
in
    FromColumns

How do I group my data-set by month and apply SUM aggregation using Pandas/Python

I have the following Data-set df:
| date | Revenue |
|-----------|---------|
| 6/1/2017 | 100 |
| 5/21/2017 | 200 |
| 5/20/2017 | 300 |
| 6/22/2017 | 400 |
| 6/20/2017 | 500 |
I need to group the above data by month and write Python code to get the following output:
| date | SUM(Revenue) |
|------|--------------|
| May | 500 |
| June | 1000 |
I tried with the following code but got an error:
import pandas as pd
files='C:\\Users\\Month.csv'
df = pd.read_csv(files, parse_dates=['date'])
df = df.convert_objects(convert_numeric=True)
pd.to_datetime(df['date'], format = '%Y%m%d')
df2=df.sort_values('date',ascending=True)
df2.groupby(pd.TimeGrouper(freq='M')).sum()
Specify which part is the month (%m) and which is the day (%d), then proceed with the sort:
pd.to_datetime(df['date'], format='%m/%d/%Y')
df.sort_values('date', ascending=False)
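To actually get the monthly sums asked for, a sketch building on that; pd.Grouper is the non-deprecated replacement for pd.TimeGrouper:
import pandas as pd

df = pd.DataFrame({
    'date': ['6/1/2017', '5/21/2017', '5/20/2017', '6/22/2017', '6/20/2017'],
    'Revenue': [100, 200, 300, 400, 500],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# group by calendar month and sum the revenue
out = df.groupby(pd.Grouper(key='date', freq='M'))['Revenue'].sum()
print(out)
# date
# 2017-05-31     500
# 2017-06-30    1000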
