Merge tab delimited files, sort by columns and eliminate duplicates - sorting

I'm trying to merge two large text files, eliminate duplicates, and create a new file using PowerShell. I'm not familiar with PowerShell and haven't been able to draft a script to accomplish this task, which is why I need help.
1. Merge two tab-delimited files with 35 columns.
2. Sort the merged file by three columns. Note that some column header names contain spaces.
   a. Column one is a text field that needs to be sorted ascending
   b. Column two is a text field that needs to be sorted descending
   c. Column three is a date field that needs to be sorted descending
3. Identify the first occurrence of each record using the sort order from step 2 and save all columns to a new tab-delimited file.
Current
Customer Name State Date of last purchase + 32 columns...............
ABC Company TX 12/30/2022 11:01:54
DEF Company FL 10/01/2022 09:15:35
ABC Company TX 10/15/2022 03:14:18
ABC Company TX 09/25/2022 08:29:37
DEF Company FL 08/31/2022 10:48:03
DEF Company FL 10/01/2022 02:11:58
Result
Customer Name State Date of last purchase + 32 columns................
ABC Company TX 12/30/2022 11:01:54
DEF Company FL 10/01/2022 09:15:35
I tried several approaches, but none have been successful.
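Here is a minimal sketch of one way to approach this in PowerShell. It is untested against your data and makes assumptions: both files share the same 35-column header, the date column parses with [datetime], the file names are placeholders, and the key/sort columns are taken from the sample ("Customer Name", "State", "Date of last purchase").
# Merge the two tab-delimited files (assumes identical headers in both).
$merged = @(Import-Csv -Path 'file1.txt' -Delimiter "`t") + @(Import-Csv -Path 'file2.txt' -Delimiter "`t")
# Sort: column one ascending, column two descending, column three (a date) descending.
$sortSpec = @(
    @{ Expression = 'Customer Name'; Ascending = $true },
    @{ Expression = 'State'; Descending = $true },
    @{ Expression = { [datetime]$_.'Date of last purchase' }; Descending = $true }
)
$sorted = $merged | Sort-Object -Property $sortSpec
# Keep only the first occurrence per key (here: Customer Name + State) and write the output.
$sorted |
    Group-Object -Property 'Customer Name', 'State' |
    ForEach-Object { $_.Group | Select-Object -First 1 } |
    Export-Csv -Path 'result.txt' -Delimiter "`t" -NoTypeInformation
Because the rows are grouped after sorting, the first row in each group is the one with the latest purchase date, which matches the expected result shown above.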

Related

How to restrict query result from multiple instances of overlapping date ranges in Django ORM

First off, I admit that I am not sure whether what I am trying to achieve is possible (or even logical). Still, I am putting forth this query (and if nothing else, I can at least be told that I need to redesign my table structure / business logic).
In a table (myValueTable) I have the following records:
Item  article  from_date   to_date     myStock
1     Paper    01/04/2021  31/12/9999  100
2     Tray     12/04/2021  31/12/9999  12
3     Paper    28/04/2021  31/12/9999  150
4     Paper    06/05/2021  31/12/9999  130
As part of the underlying process, I need to find out the value (of field myStock) as of a particular date, say 30/04/2021 (assuming no inward / outward stock movement in the interim).
To that end, I have the following values:
varRefDate = 30/04/2021
varArticle = "Paper"
And my query goes something like this:
get_value = myValueTable.objects.filter(from_date__lte=varRefDate, to_date__gte=varRefDate).get(article=varArticle).myStock
which should translate to:
get_value = SELECT myStock FROM myValueTable WHERE varRefDate BETWEEN from_date AND to_date
But with this I am coming up with more than one result (actually THREE!).
How do I restrict the query result to get ONLY the 3rd instance i.e. the one with value "150" (for article = "paper")?
NOTE: The upper limit of date range (to_date) is being kept constant at 31/12/9999.
Edit
Solved it, in a roundabout manner. Instead of .get, I resorted to generating a values_list with the fields from_date and myStock. For each of the returned rows, I appended to a list a tuple of the date difference between from_date and the reference date (30/04/2021) together with the value of field myStock, then sorted the list ascending. The first tuple in the sorted list has the smallest date difference and the corresponding myStock value, and that is the value I am searching for. Tested and works.
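A minimal sketch of that workaround, assuming varRefDate is a datetime.date and the field names shown above:
from datetime import date

varRefDate = date(2021, 4, 30)
varArticle = "Paper"

# Rank the candidate rows by how close their from_date is to the reference
# date and take the closest one (i.e. the latest from_date on or before it).
rows = myValueTable.objects.filter(
    article=varArticle,
    from_date__lte=varRefDate,
    to_date__gte=varRefDate,
).values_list("from_date", "myStock")

candidates = sorted((varRefDate - from_date, my_stock) for from_date, my_stock in rows)
get_value = candidates[0][1]
The same result can usually be obtained in a single query by replacing the values_list step with .order_by('-from_date').first(), which returns the row with the latest from_date that is not after the reference date.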

How to design querying multiple tags on analytics database

I would like to store custom tags on each user purchase transaction. For example, if a user bought shoes, the tags might be "SPORTS", "NIKE", "SHOES", "COLOUR_BLACK", "SIZE_12", ...
These are the tags the seller is interested in querying back to understand the sales.
My idea is that whenever a new tag comes in, I create a new code for it (something like a hash code, but sequential): codes start with the 26 letters "a"-"z", then continue with "aa", "ab", "ac", ... "zz", and so on. All the tags given for one transaction are then kept in a single varchar column called tag, separated by "|".
Let us assume mapping is (at application level)
"SPORTS" = a
"TENNIS" = b
"CRICKET" = c
...
...
"NIKE" = z //Brands company
"ADIDAS" = aa
"WOODLAND" = ab
...
...
SHOES = ay
...
...
COLOUR_BLACK = bc
COLOUR_RED = bd
COLOUR_BLUE = be
...
SIZE_12 = cq
...
So for the above purchase transaction, the stored value will be something like tag="|a|z|ay|bc|cq|", and the seller can count the number of SHOES sold by adding the condition WHERE tag LIKE '%|ay|%'. The problem is that I cannot use an index (sort key in Redshift) for a LIKE pattern that starts with %. How can I solve this, given that I might have 100 million records? I don't want a full table scan.
Is there any solution to fix this?
Update_1:
I have not followed the bridge table concept (cross-reference table), since I want to perform a GROUP BY on the results after searching for the specified tags. My solution gives only one row when two tags match in a single transaction, whereas a bridge table would give me two rows, and then my SUM() would be doubled.
I got a suggestion like the one below, to be added to the WHERE clause once for each tag (note: this assumes tr is an alias for the transaction table in the surrounding query):
EXISTS (SELECT 1 FROM transaction_tag WHERE tag_id = 'zz' AND trans_id = tr.trans_id)
I have not followed this either, since I have to perform AND and OR conditions on the tags, for example ("SPORTS" AND "ADIDAS") or "SHOES" AND ("NIKE" OR "ADIDAS").
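For illustration only, the suggested EXISTS approach can express such AND/OR combinations directly; the tag codes follow the mapping above ('ay' = SHOES, 'z' = NIKE, 'aa' = ADIDAS), and the transaction/transaction_tag names come from the quoted suggestion:
-- Count transactions tagged SHOES AND (NIKE OR ADIDAS); each matching
-- transaction is counted once, so aggregates are not doubled.
SELECT COUNT(*) AS matching_transactions
FROM transaction tr
WHERE EXISTS (SELECT 1 FROM transaction_tag tt
              WHERE tt.trans_id = tr.trans_id AND tt.tag_id = 'ay')          -- SHOES
  AND EXISTS (SELECT 1 FROM transaction_tag tt
              WHERE tt.trans_id = tr.trans_id AND tt.tag_id IN ('z', 'aa'));  -- NIKE OR ADIDAS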
Update_2:
I have not followed the bitfield approach, since I don't know whether Redshift supports it. Also, I am assuming my system will eventually have a minimum of 3,500 tags, and allocating one bit for each results in roughly 437 bytes per transaction, even though at most 5 tags can be given for a single transaction. Any optimisation here?
Solution_1:
I have thought of adding a min value (SMALLINT) and a max value (SMALLINT) alongside the tag column, and applying an index (sort key) on those.
So, something like this:
"SPORTS" = a = 1
"TENNIS" = b = 2
"CRICKET" = c = 3
...
...
"NIKE" = z = 26
"ADIDAS" = aa = 27
So my column values are
`tag="|a|z|ay|bc|cq|"` //sorted?
`minTag=1`
`maxTag=95` //for cq
And the query for searching for shoes (ay=51) is
maxTag <= 51 AND tag LIKE '%|ay|%'
And the query for searching for shoes (ay=51) AND SIZE_12 (cq=95) is
minTag >= 51 AND maxTag <= 95 AND tag LIKE '%|ay|%|cq|%'
Will this give any benefit? Kindly suggest any alternatives.
You can implement auto-tagging while the files get loaded to S3. Tagging at the DB level is too late in the process; it is tedious and involves a lot of hard-coding.
1. While loading to S3, tag the object using the AWS s3api, for example:
aws s3api put-object-tagging --bucket --key --tagging "TagSet=[{Key=Addidas,Value=AY}]"
(capture the tags dynamically by passing them in as parameters)
2. Load the tags into DynamoDB as a metadata store.
3. Load the data into Redshift using the S3 COPY command.
You can store the tags column as a varchar bit mask, i.e. a strictly defined sequence of 1s and 0s, so that if a purchase is marked by a tag there is a 1 at that tag's position and otherwise a 0. For every row you will have a sequence of 0s and 1s with the same length as the number of tags you have. This sequence is sortable; you would still need to look into the middle of the string, but you will know at which specific position to look, so you don't need LIKE, just SUBSTRING. For further optimization, you could convert this bit mask to integer values (it will be unique for each sequence) and match on that, but AFAIK Redshift doesn't support that out of the box yet, so you would have to define the rules yourself.
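For illustration, a lookup against such a bit mask might look roughly like this; the tags_mask column and the bit position are made up for the example (position 27 standing in for the ADIDAS bit):
-- tags_mask is a hypothetical varchar of '0'/'1' characters, one per tag.
SELECT COUNT(*)
FROM transaction
WHERE SUBSTRING(tags_mask, 27, 1) = '1';   -- transactions carrying the 27th tag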
UPD: Looks like the best option here is to keep tags in a separate table and create an ETL process that unwraps the tags into a tabular structure of order_id, tag_id, distributed by order_id and sorted by tag_id. Optionally, you can create a view that joins this table with the order table. Then lookups for orders with a particular tag, and further aggregations of those orders, should be efficient. There is no silver bullet for optimizing this in a flat table, at least none that I know of that would not bring a lot of unnecessary complexity compared to the "relational" solution.
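A rough sketch of that layout in Redshift; the table and column names here are illustrative, not taken from the question:
-- One row per (order, tag), co-located by order and sorted for tag lookups.
CREATE TABLE order_tag (
    order_id BIGINT      NOT NULL,
    tag_id   VARCHAR(8)  NOT NULL
)
DISTSTYLE KEY
DISTKEY (order_id)
SORTKEY (tag_id);

-- Orders carrying the SHOES tag ('ay'), joined back to a hypothetical orders table.
SELECT o.*
FROM orders o
JOIN order_tag t ON t.order_id = o.order_id
WHERE t.tag_id = 'ay';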

How to use spark for map-reduce flow to select N columns, top M rows of all csv files under a folder?

To be concrete, say we have a folder with 10k tab-delimited csv files in the following format (each csv file is about 10GB):
id name address city...
1 Matt add1 LA...
2 Will add2 LA...
3 Lucy add3 SF...
...
And we have a lookup table based on "name" above
name gender
Matt M
Lucy F
...
Now we are interested in outputting the top 100,000 rows of each csv file in the following format:
id name gender
1 Matt M
...
Can we use pyspark to efficiently handle this?
How to handle these 10k csv files in parallel?
You can do that in Python to exploit the first 1000 lines of your files:
top1000 = sc.textFile("YourFile.csv").map(lambda line: line.split("CsvSeparator")).take(1000)
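Going further, here is a minimal sketch of the whole flow with the DataFrame API. It assumes "top rows" means the first rows as read from each file, and the folder path, lookup path, and output path are placeholders; the tab separator and the 100,000 cutoff come from the question:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top-rows-per-file").getOrCreate()

# Read every tab-delimited file in the folder in one go; Spark distributes the
# 10k files across executors, so there is no need to loop over them by hand.
df = (spark.read
      .option("sep", "\t")
      .option("header", True)
      .csv("/path/to/folder/*.csv")
      .withColumn("src_file", F.input_file_name()))

lookup = (spark.read
          .option("sep", "\t")
          .option("header", True)
          .csv("/path/to/lookup.csv"))   # columns: name, gender

# Keep only the first 100,000 rows of each source file.
# monotonically_increasing_id reflects read order within a partition; for a
# strict file order, an explicit line-number column in the data is more robust.
w = Window.partitionBy("src_file").orderBy(F.monotonically_increasing_id())
top_rows = (df.withColumn("rn", F.row_number().over(w))
              .filter(F.col("rn") <= 100000))

result = (top_rows.join(lookup, on="name", how="inner")
                  .select("id", "name", "gender"))

result.write.option("sep", "\t").option("header", True).csv("/path/to/output")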

Querying a specific cell with known column and row

I am working with Ruby (not Rails) and PostgreSQL and have been banging my head for hours trying to figure out how to get the value of a field when you know the column and the row you are trying to cross-reference.
I have a database containing cities and linked distances, similar to:
Cities city1 city2 city3 city4
city1 0 17 13 6
city2 17 0 7 15
city3 . . .
city4 . . .
and I have tried playing around with the following code:
array.each { |city| # array is the array containing the sorted cities
  from_city = city
  query { |row|
    # row is a hash containing each city as key and distance as val
    # first row: {"Cities"=>"city1", "city1"=>"0", "city2"=>"17",...
    # I have tried doing a row.each and grabbing the value of the specified
    # city key, but that doesn't work..
  }
}
Is there a better way to go about doing this? Basically, all I need to know is how I can pull the distance value when I know which two cities I want to use, and assign it to the variable distance.
I'd change your database schema (SQL databases aren't really designed for access by column, and adding a new column every time you add a city is really painful to do):
origin | destination | distance
city1 | city2 | 17
... More rows ...
That way looking up the distance between 2 cities is just:
conn.exec_params('SELECT distance from cities where origin = $1 and destination = $2', [city1, city2])
which will return 1 row with 1 column which is the distance between the two.
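A minimal usage sketch with the pg gem; the connection parameters and the 'cities' table/column names are assumed to match the schema above:
require 'pg'

conn = PG.connect(dbname: 'routes') # placeholder connection details

res = conn.exec_params(
  'SELECT distance FROM cities WHERE origin = $1 AND destination = $2',
  ['city1', 'city2']
)

# One row, one column: the distance between the two cities.
distance = res.ntuples.zero? ? nil : res[0]['distance'].to_i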
Alternatively if your data set is small and doesn't change much there's nothing wrong with storing the data as a file and loading it into memory once at startup. Depends on your use case.
Also, if you're going to do lots of geometry computations, PostGIS might be what you want.

Ruby csv file related operations

I'm working on a file-related task. I have a CSV file with some rows and headers. I need to fetch the column with a particular header, create a new column at the end, and do some operations. How can I fetch the values of the column that has a particular header, and how can I create a new column at the end?
Assuming you have a CSV in the following format:
Zipcode Network ID Network Name Zone New Network? Display Name
64024 275 Kansas City 2 No Kansas City
64034 275 Kansas City 2 No Kansas City
You can use FasterCSV.
If you have headers in your CSV, you can specify :headers => true and then fetch the data row by row using FasterCSV, as given below:
FasterCSV.foreach(path_to_file, { :headers => true, :row_sep => :auto }) do |row|
end # each iteration yields one row
Each time you iterate the CSV you get one row from the file. You already know that the second column has the "Network ID" header and the third has the "Network Name" header, so you can easily do network_id = row[1] and network_name = row[2] (indexes are zero-based), or, since headers are enabled, network_id = row["Network ID"].
Hope this answers your question.
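A minimal sketch of the whole round trip (reading a column by header and appending a new column) using the standard CSV library that replaced FasterCSV from Ruby 1.9 onwards; the file names and the derived-column logic are placeholders:
require 'csv'

rows = CSV.read('input.csv', headers: true)

# Add a hypothetical new header and compute its value from existing columns.
out_headers = rows.headers + ['Region Label']
CSV.open('output.csv', 'w') do |csv|
  csv << out_headers
  rows.each do |row|
    csv << row.fields + ["#{row['Network Name']} (zone #{row['Zone']})"]
  end
end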
The first record in a CSV may be considered the header record, but there is no rule which dictates that this has to be the case.
Coming to your question: the short answer is that you have to write the logic (with or without an API) to fetch what you consider a "header".
I need to fetch column with particular header and create a new column at the end and do some operations
I don't think the API provides any implicit mechanism to fetch a particular header. AFAIK, CSV does not have a spec. You could use a FasterCSV-like API to parse through the CSV and get the work done.
