Importing from Huge CSV Files to MySQL Database and more…

Filed Under (General) by Wenbert on 18-06-2008

Tagged Under : , , , ,

The scenario: I am given a huge CSV File dumped thru FTP from some big JDE-ish system everyday. The CSV File is about 15MB or so. The file has around 60,000 lines in it.

What I needed to do is to update a “main” transaction table. Which means I have to lookup each line in the CSV File and then search each row in the transaction table, and then update the row if a match is found. But I figured that it would be crazy to directly handle each line in the file and then go to MySQL for each line.

The pseudo code would look something like this:

1) Open CSV File
2) Loop each line of the file
3) Use $row[0] + $row[1] + $row[2] in a WHERE statement to search the MySQL Database
4) If row is found, update the row in MySQL. If not found, then insert the row.

In case you didn’t notice, steps 2-4 would loop 60,000 times! And note that the mysql table I had already had 300,000 records in it. Can you imagine how much memory and resources this script would eat up if I implemented the code above?

First, opening the big csv file would already consume a lot of resources. On top of that, we have to loop thru each line of the file and do database updates. This would do just fine if you were handling like 100 lines, but 60,000 would hurt a lot.

So what I did was I just let MySQL do most of the hard work. I created a temporary table in MySQL. I made a script that imports the CSV file into the temporary mysql table. After that, I used MySQL queries to compare the temporary table and the main transaction table. I used queries such as these:

$sql = "
INSERT `transactions`
(`fieldA`,
`fieldB`,
`… this means more fields …`,
`… this means more fields …`,
`fieldX`)
SELECT
daily.fieldA,
daily.fieldB,
… this means more fields …,
… this means more fields …,
daily.FieldX
FROM "
.$table_name." daily
WHERE
NOT EXISTS
(SELECT
t.fieldA,
t.fieldB,
t.fieldC,
t.fieldD
FROM transactions t WHERE
t.fieldA = daily.fieldA AND
t.fieldB = daily.fieldB AND
t.fieldC = daily.fieldC AND
t.fieldD = daily.fieldD)
"
;

And for the updates I used something like this:

$sql = "
        UPDATE `transactions` t , `"
.$table_name."` daily
        SET
            t.fieldA  = daily.fieldA,
            t.fieldB  = daily.fieldB,
            t.fieldC  = daily.fieldC,
            t.fieldD  = daily.fieldD,
            /* more fields */

        WHERE
            t.fieldA  = daily.fieldA AND
            t.fieldB  = daily.fieldB AND
            t.fieldC  = daily.fieldC AND
            t.fieldD  = daily.fieldD
        ";
 

So there you have it. Once you have the CSV file imported into a MySQL table, you can basically do anything with it and let MySQL do all the hard work for you.

A Simple Recommendation System in MySQL

Filed Under (General) by Wenbert on 23-05-2008

Tagged Under : ,

I am posting this because I know I will need this soon or later.

I have built a simple recommendation for one of the websites I work with. It is only based on social relationships and does not take into account other parameters. The basic idea is if I like an item, select other items liked by people who also like this item and it can be used to build lists of similar items as seen on Amazon or YouTube. If we use as an example people who like movies (tables Person, Movie and many-to-many relationship PersonMovie), this can be expressed in SQL using several joins:

SELECT DISTINCT Movie.*, COUNT(Person.ID) AS PersonCount
FROM Movie
JOIN PersonMovie ON Movie.ID = PersonMovie.MovieID
JOIN Person ON PersonMovie.PersonID = Person.ID
JOIN PersonMovie AS PersonMovie1 ON Person.ID = PersonMovie1.PersonID
JOIN Movie AS Movie1 ON PersonMovie1.MovieID = Movie1.ID
WHERE Movie1.ID = 1 AND Movie.ID != Movie1.ID
GROUP BY Movie.ID
ORDER BY PersonCount DESC
 

First we join movies to people and then backwards to the specific movie with ID = 1 that we want to get similar movies for. PersonCount is the relevance factor — the higher it is, the more people like the movie.

How to do a Select statement inside the main select statement in MySQL

Filed Under (General) by Wenbert on 18-03-2008

Tagged Under :

A little bit slow but it gets the job done. I used this to generate a report to get the total open_amount for customers.

    SELECT
    s.parent_no,
    s.parent_name,
    t.parent_no,
    t.col_manager,
    t.invoice_date,
    t.due_date,
    t.over_7_date,
    t.over_7_by_eom,
    t.open_amount,

    (SELECT SUM(t2.open_amount) FROM transactions t2
        WHERE
        (t2.parent_no=t.parent_no)
        AND t2.trans_status = ‘open’) AS total
       
    FROM
    transactions t
    LEFT JOIN summary s ON s.parent_no = t.parent_no
    WHERE
    ( t.col_manager=‘aaa’ OR
    t.col_manager = ‘bbb’  OR
    t.col_manager = ‘ccc’  OR
    t.col_manager = ‘ddd’  OR
    t.col_manager = ‘eee’  OR
    t.col_manager = ‘fff’  )
    AND t.trans_status = ‘open’ ORDER BY s.parent_name ASC
 

How to filter database inserts - HTML, Usernames, etc. in Zend Framwork

Filed Under (General) by Wenbert on 15-02-2008

Tagged Under : , ,

I woke up today thinking about how to handle strings when inserting to MySQL. What if the string I am trying to save into the database contains HTML characters? Let’s say you are using FCKEditor, how would you “generally” handle strings to be inserted into your database? Because when I think of it, I don’t have a quick answer. I have to test it around until satisfied with the output.

If I am too strict of what to save — then I would have problems outputting the HTML into the browser. Tables would be messed up and form elements would not work.

So, how do you filter different kinds of data to be inserted into MySQL using Zend Framework? Kinds of data as in:

  • Strings with HTML characters (mostly from a CMS form where it needs to render the HTML again)
  • usernames (no special characters)
  • passwords
  • what about encoding?

Update rows in Table A from values of Table B in MySQL

Filed Under (General) by Wenbert on 14-02-2008

Tagged Under :

Sometimes, we need to update values of tables from the values from another table. Here is an example:

        UPDATE `transactions` t , `tran_daily` daily
        SET
            t.col_manager   = daily.Col_manager,
            t.invoice_date  = daily.Invoice_date,
            t.due_date      = daily.Due_date,
            t.po_no         = daily.Po_no,
            t.remark        = daily.Remark,
            t.gross_amount  = daily.Gross_amount,
            t.adjustments   = daily.Adjustments,
            t.open_amount   = daily.Open_amount,
            t.bill_to       = daily.Bill_to

        WHERE
            t.company       = daily.Company AND
            t.trans_type    = daily.Trans_type AND
            t.invoice_no    = daily.Invoice_no AND
            t.line_no       = daily.Line_no

What is the safest way to do database queries in Zend Framework?

Filed Under (General) by Wenbert on 12-02-2008

Tagged Under : ,

I have no idea. You tell me. All I do is this:

$sql = "SELECT * FROM Table1 WHERE id=’".$target_id."’";  /*That is $target_id inside single-quotes in the SQL Query.*/
 

Please share your thoughts below.

LAMP’s success is spelling its own doom?

Filed Under (General) by Wenbert on 01-02-2008

Tagged Under : , , ,

Here is a very interesting post regarding LAMP (Linux, Apache, MySQL and PHP). I agree with the post all the way.

So whats the problem here? Well I think one of the key issues is that LAMP focus on solving issues with a very non academic way. This implies that few people are exposed to LAMP during university. Even if they are, then they are forced to think in the traditional ways. Doing things the proper software engineering way etc. But this is not what helps you to become a top LAMP engineer. Here its important to solve real world issues in a good enough approach.

For example I bet you 99% of all professors would get a heart attack when I tell them that PHP is great and robust because any memory leaks get cleaned up after each request. So while developers try fix memory leaks, they can get away with some hanging around. This means that they can focus on adding new features instead. So lets hope that one by one universities realize that its worthwhile studying what makes LAMP so successful. Also to all students I can only say: If you are truly motivated, there are more chances in the LAMP space than in the Java/.Net world, since as this posts shows .. there is not enough talent in the pool .. meaning there are more opportunities for people that do join the LAMP talent pool. And there are more chances of sticking out from the crowd and becoming world famous (I am not kidding).

Click here for the entire post.

Convert Database to UTF-8

Filed Under (General) by Wenbert on 14-12-2007

Tagged Under :

I deal with a lot of database dumps. So this post is a must for me.

We seriously see a ton of customers coming in with the type of databases that are a nightmare to move over. When you’re dealing with special characters in a database, you have to make sure that the charset and collation are dumped *with* the database, so that when you move it to another server the tables and data create properly. The biggest annoyance so far is converting tables back to UTF-8, as when this is done through the MySQL shell or phpmyadmin is had to be done table-by-table. So, I wrote this simple PHP script to do it all at once…

The script is quite simple; I can just simply:

$sql = “ALTER TABLE $table DEFAULT CHARACTER SET utf8 COLLATE utf8_general_ci”;

Where $table = each table in the database.

Don’t use addslashes for database escapes

Filed Under (General) by Wenbert on 04-12-2007

Tagged Under : , ,

From jansch.nl:

This is not the best way to escape data. The most important reason is security. addslashes can lure you into a false sense of security. As Chris Shiflett points out, there are situations that addslashes doesn’t escape.

Use mysql_real_escape_string instead.

PHP 5: MySQL Singleton Example

Filed Under (General) by Wenbert on 23-11-2007

Tagged Under : , , ,

What is a Singleton?

In software engineering, the singleton pattern is a design pattern that is used to restrict instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Sometimes it is generalized to systems that operate more efficiently when only one or a few objects exist.

Here is a good example of using a Singleton.

Basically, after including your class, you can do something like this:

include_once("MysqlDb.php");
MysqlDB::getInstance()->select("SELECT * FROM `help_category` LIMIT 5");
Subscribe to Rss Feed : Rss