We go venture back into the nightmarish world of STEM in the sequel to Shinji Mikami's survival horror masterpiece.
The Evil Within 2 Review - True Survival Horror
We take the latest South Park joint for a jaunt on the toilet... will it flush?
South Park: The Fractured But Whole Review!
We chat with the game director and creative head of Machine Games about the upcoming title where killing Nazis is all the rage.
A Happy Ending Filled with Nazis – Wolfenstein II Interview!
Handling Large CSV Files
DoCZero
Melbourne, Victoria
204 posts
Hi Guys,

Need some advise here - i've got a CSV file which is 6.5gb large - financial data since 2008. I need to delete 2 columns and also change the format of one of the numbers so each line has 6 decimals (i.e 1.278400)

I've tried: Csved / Unicsved and ReCSVeditor - all cant handle that file.

Any suggestions?

Thanks in advance =)
11:07pm 06/06/13 Permalink
system
Internet
--
11:07pm 06/06/13 Permalink
3x0dus
Townsville, Queensland
1779 posts
Might be a long way around, but import to SQL/MYSQL/Access, perform manipulation via SQL, file export back to csv.

IMO Importing to swl will handle the large data better then trying to edit in one go, but just my opinion.
11:12pm 06/06/13 Permalink
TiT
Brisbane, Queensland
5659 posts
wow 6.5gb csv file that's pretty amazing!
11:20pm 06/06/13 Permalink
TufNuT
I like eel pie
Brisbane, Queensland
4216 posts
Might be a long way around, but import to SQL/MYSQL/Access, perform manipulation via SQL, file export back to csv.


This. working with this data in a SQL database will be much easier. thats a lot of data, exactly what SQL is designed for.
11:32pm 06/06/13 Permalink
Strik3r
Brisbane, Queensland
2152 posts
I wouldn't be trying to open it into any sort of graphical editor. They aren't going to play nice displaying that size of file. I'd be surprised if you couldn't do this in a bash or perl script. Not my forte so I won't volunteer one.

I don't know that dumping it all into an SQL database is necessary. Seems like an awkward fix, but I guess if you don't have any programmers around, its an easy-to-implement solution.

If you really get stuck, pm me the layout of the file and I can throw together a console app that will do it.
11:33pm 06/06/13 Permalink
Opec
Brisbane, Queensland
7804 posts
Import into SQL Server. You will need to use BULK import command.

So basically create a table that matches the current CSV's column with the data def that you wanted and _all_ the columns to begin with. I wouldn't bother creating any indexes on the table unless you really need it to perform the query etc as it'll slow down the import unncessarily.

Then run this:

BULK
INSERT Your_SQL_Server_Table
FROM 'c:\path_to_your_file.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO

That will handle massive data import very quickly. Depending on the server you're running on 6.5GB shouldn't take too long to BULK INSERT as it is very efficient. Then you can just drop the column:

ALTER TABLE my_table DROP COLUMN column_a
GO
ALTER TABLE my_table DROP COLUMN column_B
GO

Then export the data out using standard SQL Server import/export task into the CSV format.

For more info on bulk import see:

http://blog.sqlauthority.com/2008/02/06/sql-server-import-csv-file-into-sql-server-using-bulk-insert-load-comma-delimited-file-into-sql-server/
11:54pm 06/06/13 Permalink
trillion
Brisbane, Queensland
2910 posts
wolfram why you no in context

http://i.imgur.com/JVA5CkP.jpg
12:04am 07/06/13 Permalink
Whoop
Brisbane, Queensland
21939 posts
rofl
12:11am 07/06/13 Permalink
kos
Germany
2688 posts
One character is one byte of plain text, so it's about 6.5 billion characters. :P
01:11am 07/06/13 Permalink
Dazhel
Gold Coast, Queensland
6070 posts
I need to delete 2 columns and also change the format of one of the numbers

Opec's solution is probably the go, but if you have SQL Server handy then you could also do something like this in SSIS - there's a graphical ETL package builder and is designed to do stuff that you're asking.
It will laugh at your puny 6.5GB csv file and just stream the data in & out as necessary.

One character is one byte of plain text

Countries with non-latin based languages would like a word with you kos ;)
01:20am 07/06/13 Permalink
trillion
Brisbane, Queensland
2912 posts
[6] 979 321 856 billion bytes at 1024 per kilobyte

that seems like a lot of plain text to be managing outside of a database
01:24am 07/06/13 Permalink
kos
Germany
2689 posts
Countries with non-latin based languages would like a word with you kos ;)

Somehow I doubt that DoCZero is working in Melbourne with 5 years of financial data stored in a CSV written in kanji.. ;P
02:03am 07/06/13 Permalink
trillion
Brisbane, Queensland
2914 posts
there is an easier way, it would depend on how well formed the data fields are in the csv

eg: input.csv


somename,someothername,123467,female,niceperson,...
...recurring


and you want to reduce it to


someone,1234567,niceperson


run it through a simple cut filter


cut -d, -f1,3,5 < input.csv > out.txt


breaking it down it's


cut (the cut program)

-d, (fields delimited by ,)

-f1,3,5 (list of fields assumed to be separated in the file by a delimiter character)

< input.csv (the input file to filter against these syntax)

> out.txt (the output file to write the result to. this can have whatever extention it's .txt as an example but doesn't contain won't be of any particular encoding)


that should shave it down to possibly a managable size that you can load it into one of those other programs to reduce the decimal places of the number field that needs doing
03:32am 07/06/13 Permalink
Spook
Brisbane, Queensland
35797 posts
perl bros
06:41am 07/06/13 Permalink
Raven
Melbourne, Victoria
7888 posts
If you can wait until I get home, I can probably write a little java app to do it for you when I get home. I've already written a really good CSV parser that both streams and also handles encap'd tedt properly, so it would be easy as hell to wrap this in an app where you just tell it you ant to stream it to an output file with only particular columns.
07:12am 07/06/13 Permalink
Spook
Brisbane, Queensland
35799 posts
lols java
07:27am 07/06/13 Permalink
DoCZero
Melbourne, Victoria
205 posts
Hi Guys,

First of all - thanks for all the effort you put into answering! Man I love this community for giving ideas. (also big thanks to raven for offering to make something to do this!)

Regarding the content - it is every tick for the currency pair EURUSD from 2008 - (in 2011 the average turnover was $250,000,000,000 traded per day Ref: http://www.forexturtle.com/fxoptions/forextradingvolume.aspx)

I'll be trying SQL tonight - if I get stuck i'll definately be back for more advice! I've got a lot of this to do , so I figure i'll try learn myself =)

Thanks again!
07:32am 07/06/13 Permalink
Persay
Brisbane, Queensland
7629 posts
you pretty much got nothing from this thread, and rejected offers to actually help, well done!
08:15am 07/06/13 Permalink
DoCZero
Melbourne, Victoria
206 posts
Now thats not true - yesterday I was trying to use csv editing software - today I'll be trying SQL and Cut as per this thread. What I got from this thread was more direction that I had yesterday. Yes I got offered some great help and im greatful - really I want to learn myself and not be a burden on the kind ppl here. So i'll give it a good shot using the advise in this thread.
09:02am 07/06/13 Permalink
Dazhel
Gold Coast, Queensland
6072 posts
Somehow I doubt that DoCZero is working in Melbourne with 5 years of financial data stored in a CSV written in kanji.. ;P


hehe most likely correct, I was just being a smart arse.
Murphy's law practically guarantees that somewhere in that 6.5GB is a character that will mess up an ETL process though. :/
10:54am 07/06/13 Permalink
Thundercracker
Brisbane, Queensland
3315 posts
awk, sed, grep will all do it very nicely ;)

how are your linux skills?
12:20pm 07/06/13 Permalink
bepatient
Melbourne, Victoria
1205 posts
God damn thats a huge file. But yeah agree with what some of the other guys already said re: SQL
12:25pm 07/06/13 Permalink
3x0dus
Townsville, Queensland
1781 posts
least with SQL if you need to manipulate it or grab chunks its so much easier, and if you expect to get given the tasks of doing this again in the future setup some cron job to import the data daily to the SQL so you always having a running forex data to work with.

I am working on a project atm where normal people just throw around CSV files full of data from dozens of systems all in different formats and its like !@#@#@ DIAF

So we are Data warehousing feeds of info into SQL, and making it available to people easier, via either predefined reporting and charting(highcharts), analyst based SQL access etc.

One of the systems we grab data from is by AT&T and whoever the DBA was that came up with the structure is mad, the system spits out about 1-2gb a Day, where a similar system providing similar information is 1/10th the size.
04:40pm 07/06/13 Permalink
blahnana
Brisbane, Queensland
621 posts
Post a demo line for us here (strip anything confidential or just change it), and what you want done, and someone will drop the 1 minute's work into a command line you can run.

Just make sure you keep a backed up copy somewhere in case someone posts something that donkeys the data. I doubt that'll happen but why take the change for a few GB of disk space?
05:27pm 07/06/13 Permalink
blahnana
Brisbane, Queensland
622 posts
(Manipulating CSV on a line by line basis like this is ridiculously easy, regardless of the filesize). Importing into any kind of DB program is huge overkill.)
05:28pm 07/06/13 Permalink
trog
AGN Admin
Brisbane, Queensland
37066 posts
If this stupid forum had 'likes' I would like trillion's post! I didn't know cut could talk CSV so easily; I would have written a little PHP script to do it.

I would assume (without knowing anything about it) that cut would operate on a line-by-line basis so using it should use (basically) no memory and take just as long as it takes to read 6.5GB and write it back out again. I agree with blahnana that importing would be overkill unless you're planning to manipulate the data with SQL later.
05:45pm 07/06/13 Permalink
trillion
Brisbane, Queensland
2915 posts
if for some reason you don't have the will to install and directory link a linux vm to use cat in, there's also a nice simple way to do this with powershell

eg data1.csv is delimited and formatted like

"data1a","data2a","data3a","data4a","data5a","data6a","data7a"
"data1b","data2b","data3b","data4b","data5b","data6b","data7b"


running this in a powershell prompt should do the business

PS C:\Users\User> Import-Csv C:\Users\User\Data1.csv | Select data2a,data5a,data7a | Export-Csv -Path C:\Users\User\Data2.csv -NoTypeInformation

which as you can probably guess will write this to Data2.csv


"data2a","data5a","data7a"
"data2b","data5b","data7b"


there's probably a way to identify the number column as an object and delimit the decimal places to a specific value, but i'll leave that with you
07:35pm 07/06/13 Permalink
Dazhel
Gold Coast, Queensland
6075 posts
Powershell will most likely do the job, but I've found the performance of the Import-Csv cmdlet is pretty horrible on files even under 100 MB. I'd shudder to think how long that command would take to complete on a 6+GB file.
08:01pm 07/06/13 Permalink
HerbalLizard
Brisbane, Queensland
5805 posts
Could be worse I was handling something the other day where the export from some proprietary ISAM sack of s*** db was exported to a 87gb csv o_0 might be a while
08:03pm 07/06/13 Permalink
trillion
Brisbane, Queensland
2917 posts
sure you can do it in pure .NET and have it parse and write out as fast as your hardware i/o can give channel to filestream.io to tear through it

it's the only way to be sure
08:36pm 07/06/13 Permalink
DoCZero
Melbourne, Victoria
207 posts
So - job done, ended up using powershell - just let it go overnight - cmd:

Import-Csv eurusd.csv | Select time,ask,bid | Export-Csv -Path eurusd2.csv -NoTypeInformation

Trillion - cheers for the tip of using powershell. I tried to use something called csvfix - which seemed to work, then the file wasnt edited how I needed.

=)
10:28am 08/06/13 Permalink
blahnana
Brisbane, Queensland
623 posts
Teehee @ overnight
06:37pm 08/06/13 Permalink
trillion
Brisbane, Queensland
2918 posts
yeah the cmdlet throughput speed is pokey slow
06:42pm 08/06/13 Permalink
Dazhel
Gold Coast, Queensland
6079 posts
lol, Achievement Unlocked: Number Cruncher
09:13am 09/06/13 Permalink
trillion
Brisbane, Queensland
2922 posts
Dahzel don't tease. achievements were the best innovation trog has ever made happen.

that's just how i remember it

cheers
03:08pm 09/06/13 Permalink
system
Internet
--
03:08pm 09/06/13 Permalink
AusGamers Forums
Show: per page
1
This thread is archived and cannot be replied to.