The scenario: every day I receive a huge CSV file, dumped through FTP from some big JDE-ish system. The file is about 15MB and has around 60,000 lines in it.
What I needed to do was update a “main” transaction table. That means looking up each line in the CSV file, searching for the matching row in the transaction table, and updating that row if a match is found. But I figured it would be crazy to handle each line of the file directly and hit MySQL once per line.
The pseudo code would look something like this:
1) Open CSV File
2) Loop each line of the file
3) Use $row[0] + $row[1] + $row[2] in a WHERE statement to search the MySQL Database
4) If row is found, update the row in MySQL. If not found, then insert the row.
In case you didn’t notice, steps 2-4 would loop 60,000 times! And note that the MySQL table already had 300,000 records in it. Can you imagine how much memory and how many resources this script would eat up if I implemented the code above?
First, opening the big CSV file already consumes a lot of resources. On top of that, we have to loop through each line of the file and do a database update for each one. That would be fine if you were handling around 100 lines, but 60,000 would hurt a lot.
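Spelled out, the naive version would be roughly the sketch below. (The real script was PHP against MySQL; I use Python with sqlite3 here purely for illustration, and the column names are invented — `fieldA`-`fieldC` as the lookup key, `fieldD` as the payload.)

```python
import csv
import sqlite3

def sync_row_by_row(csv_path, conn):
    """Naive approach: one SELECT plus one UPDATE/INSERT per CSV line."""
    cur = conn.cursor()
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            # row[0..2] act as the lookup key, row[3] as the payload
            cur.execute(
                "SELECT 1 FROM transactions "
                "WHERE fieldA=? AND fieldB=? AND fieldC=?",
                row[:3],
            )
            if cur.fetchone():
                cur.execute(
                    "UPDATE transactions SET fieldD=? "
                    "WHERE fieldA=? AND fieldB=? AND fieldC=?",
                    [row[3]] + row[:3],
                )
            else:
                cur.execute(
                    "INSERT INTO transactions (fieldA, fieldB, fieldC, fieldD) "
                    "VALUES (?, ?, ?, ?)",
                    row,
                )
    conn.commit()
```

With 60,000 lines, that is up to 120,000 round trips to the database — which is exactly what I wanted to avoid.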
So what I did was let MySQL do most of the hard work. I created a temporary table in MySQL and wrote a script that imports the CSV file into it. After that, I used MySQL queries to compare the temporary table against the main transaction table. I used queries such as these:
$sql = "INSERT INTO `transactions`
            (`fieldA`, `fieldB`, /* ... more fields ... */ `fieldX`)
        SELECT daily.fieldA, daily.fieldB, /* ... more fields ... */ daily.fieldX
        FROM `".$table_name."` daily
        WHERE NOT EXISTS (
            SELECT 1
            FROM transactions t
            WHERE t.fieldA = daily.fieldA
              AND t.fieldB = daily.fieldB
              AND t.fieldC = daily.fieldC
              AND t.fieldD = daily.fieldD
        )";
And for the updates I used something like this:
$sql = "UPDATE `transactions` t, `".$table_name."` daily
        SET t.fieldA = daily.fieldA,
            t.fieldB = daily.fieldB,
            t.fieldC = daily.fieldC,
            t.fieldD = daily.fieldD /* more fields */
        WHERE t.fieldA = daily.fieldA
          AND t.fieldB = daily.fieldB
          AND t.fieldC = daily.fieldC
          AND t.fieldD = daily.fieldD";
So there you have it. Once you have the CSV file imported into a MySQL table, you can basically do anything with it and let MySQL do all the hard work for you.
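The whole flow — bulk-load the CSV into a temporary table, then run set-based UPDATE and INSERT queries — can be sketched end to end like this. (Again Python/sqlite3 standing in for PHP/MySQL; in real MySQL the import step would be a single LOAD DATA INFILE instead of the `executemany` loop, and the column names are invented.)

```python
import csv
import sqlite3

def sync_via_temp_table(csv_path, conn):
    """Load the CSV into a temp table, then let SQL do the comparison."""
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE daily (fieldA, fieldB, fieldC, fieldD)")
    # Bulk import: in MySQL this would be LOAD DATA INFILE
    with open(csv_path, newline="") as f:
        cur.executemany("INSERT INTO daily VALUES (?, ?, ?, ?)", csv.reader(f))
    # One set-based UPDATE for rows that already exist in the main table
    cur.execute(
        "UPDATE transactions SET fieldD = ("
        "   SELECT daily.fieldD FROM daily"
        "   WHERE daily.fieldA = transactions.fieldA"
        "     AND daily.fieldB = transactions.fieldB"
        "     AND daily.fieldC = transactions.fieldC)"
        " WHERE EXISTS ("
        "   SELECT 1 FROM daily"
        "   WHERE daily.fieldA = transactions.fieldA"
        "     AND daily.fieldB = transactions.fieldB"
        "     AND daily.fieldC = transactions.fieldC)"
    )
    # One set-based INSERT for rows that are new
    cur.execute(
        "INSERT INTO transactions (fieldA, fieldB, fieldC, fieldD) "
        "SELECT fieldA, fieldB, fieldC, fieldD FROM daily d "
        "WHERE NOT EXISTS ("
        "   SELECT 1 FROM transactions t "
        "   WHERE t.fieldA = d.fieldA"
        "     AND t.fieldB = d.fieldB"
        "     AND t.fieldC = d.fieldC)"
    )
    conn.commit()
```

Instead of 60,000 round trips, the database sees one bulk import and two queries.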
Have you looked into REPLACE INTO? Sounds like it would do basically everything you’re doing there in one statement…
Hi lance,
Awesome. I will look up REPLACE INTO. There are other updates I need to do, but I think the two statements above could be improved. I will update my working code when time permits.
Thanks lance!
For more info regarding Lance’s notes, please go to: http://dev.mysql.com/doc/refman/5.0/en/replace.html
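For anyone following along: REPLACE INTO needs a PRIMARY KEY or UNIQUE index to detect duplicates — when the new row collides on that key, the old row is deleted and the new one inserted. A minimal illustration (SQLite happens to accept the same REPLACE INTO syntax, so it is used here; the table is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# REPLACE INTO only works against a unique key, hence the PRIMARY KEY
conn.execute(
    "CREATE TABLE transactions ("
    "  fieldA TEXT, fieldB TEXT, fieldD TEXT,"
    "  PRIMARY KEY (fieldA, fieldB))"
)
conn.execute("INSERT INTO transactions VALUES ('a', 'b', 'old')")

# Same key -> the existing row is replaced; new key -> a plain insert
conn.execute("REPLACE INTO transactions VALUES ('a', 'b', 'new')")
conn.execute("REPLACE INTO transactions VALUES ('x', 'y', '1')")

print(sorted(conn.execute("SELECT * FROM transactions")))
# prints [('a', 'b', 'new'), ('x', 'y', '1')]
```

Note that MySQL's REPLACE is delete-then-insert, so columns not listed in the statement are reset rather than preserved — worth keeping in mind if the daily file only carries some of the columns.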
Just being curious: is this csv file freely available? I’m looking for huge csv files in order to push my own database system to its limits.
Kind regards,
Paul Sprangers
Hi paul,
You can create a PHP script that will generate a huge CSV file.
thanks,
Wenbert
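For what it's worth, a throwaway script in any language will do for generating test data. Here is a Python sketch — the file name, row count, and column layout are all arbitrary:

```python
import csv
import random

def write_fake_csv(path, n_rows):
    """Write n_rows of made-up transaction-like data."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for i in range(n_rows):
            w.writerow([
                f"ORD{i:06d}",                      # order number
                f"ITEM{random.randint(1, 999):03d}",  # item code
                random.randint(1, 50),               # quantity
                f"{random.uniform(1, 500):.2f}",     # price
            ])

write_fake_csv("fake_transactions.csv", 60000)  # ~60,000 lines, a few MB
```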
Dear Wenbert,
Thank you for replying.
Unfortunately, I can’t create PHP scripts (I don’t even know what they are), but fortunately, I found a lot of huge and downloadable databases and CSV files on the internet.
Kind regards,
Paul Sprangers