Map Reduce: Python, episode 2.

Now back to Python.

Here is a nice cheat sheet about some basic language features, commands, variables.

In my first post about Python I presented the IPython shell add-on. It's very useful if you are programming directly in the Python shell: you don't need to know all the functions of imported modules or objects, because tab completion quickly displays all available members of an object or a module.
For example, to find out the version of your Python, first import the "sys" module with:

import sys

Then type "sys." and press the TAB key, and you will get a list of all available functions and member variables of the "sys" module. If you type "v" and press TAB again, you will get two options: sys.version and sys.version_info. The first option is already typed in the command line, so you can press Enter to get the version information.
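
Outside IPython, roughly the same list can be obtained with the built-in dir() function. This is only a small sketch of that idea, written in Python 3 syntax; the variable name "version_members" is made up for the illustration.

```python
import sys

# dir() lists the member names of an object or module, much like the
# names that tab completion offers after typing "sys."
version_members = [name for name in dir(sys) if name.startswith('v')]
print(version_members)

# The same information that pressing Enter on sys.version displays
print(sys.version)
```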
In Python everything is an object, so every variable has some member functions. Take a string, for example:
mystring = 'My string'
Now if you type mystring.[TAB] you will get an impressive list of string processing functions.
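
To get a feeling for that list without the interactive shell, here is a minimal sketch in Python 3 syntax. The three methods called at the end are just a few picked for illustration; the "methods" variable is not part of the original program.

```python
mystring = 'My string'

# Public string methods (names that do not start with an underscore)
methods = [name for name in dir(mystring) if not name.startswith('_')]
print(len(methods), "methods, for example:", methods[:5])

print(mystring.upper())         # MY STRING
print(mystring.split(' '))      # ['My', 'string']
print(mystring.find('string'))  # 3
```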

A simple example to show the power of Python is a text processing program.
IMDB (The Internet Movie Database) has a big database of movies and related information. You can download the database as text files, which are not really easy to process. There is a separate text file for each type of movie-related information. To do some statistics and complex queries I had to load the data into memory in a searchable way.

Here is the program:

import os
import string
import time
import sys

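# Keep an already loaded Movies dictionary when the module is reloaded
# in the same session; create an empty one on the first import.
try:
    Movies
except NameError:
    Movies = {}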

def LoadMovies():
    global Movies
    filesize = os.path.getsize('movies.list')
    f=open('movies.list','rt')
    progress = [x * filesize / 100 for x in range(10,110,10)]
    start = False
    count = 0
    progressPos = 0
    lineNr = 0
    startTime = time.clock()
    print "0% ",
    for line in f:
        if lineNr%100 == 0 and f.tell() > progress[progressPos]:
            print str((progressPos + 1) * 10)+"% ",
            sys.stdout.flush()
            progressPos = progressPos + 1
        if start:
            ls = line.split('\t')
            if len(ls) > 1:
                moviename = unicode(string.strip(ls[0]),'latin_1')
                movieyear = string.strip(ls[-1])
                Movies[moviename]={'year':movieyear, 'genre':[]}
                count = count + 1
                if count == -1:
                    return
        else:
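            # str.find returns -1 (which counts as true) when the text is
            # missing, so test for membership instead
            if 'MOVIES LIST' in line: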
                start = True
        lineNr = lineNr + 1
    print "100%\nLoaded",count,"entries."
    print "Done in ",time.clock() - startTime,"seconds."

Now a quick description:
The file is opened for reading with the open command. The "movies.list" file contains movie titles and release years.
First I read the file size so that I can display the progress while reading the file. Almost 20% of the code above is this progress indicator, because I had to optimize for speed. In order not to read the position and calculate the percentage for every line, I pre-calculated the file position for every 10 percent step. Then on every 100th line I get the current position in the file and compare it with the "progressPos"th value in the table. The line where I calculate these values may look strange; in Python this is called a "list comprehension". It is an expression followed by a "for" clause and then zero or more further "for" and "if" clauses.

Example: [2**x for x in range(0,8)] calculates the powers of two for exponents 0 to 7, and the result is [1, 2, 4, 8, 16, 32, 64, 128].
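
A few more list comprehensions, as a sketch in Python 3 syntax. The file size of 1000 bytes is a made-up value, and "//" is used so that the offsets stay whole numbers on Python 3; the "even_powers" and "pairs" names are only for illustration.

```python
filesize = 1000  # hypothetical file size in bytes

# Same idea as the "progress" table in LoadMovies: one byte offset
# for every 10 percent step.
progress = [x * filesize // 100 for x in range(10, 110, 10)]
print(progress)  # [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]

# An "if" clause filters the items: powers of two for even exponents only
even_powers = [2**x for x in range(0, 8) if x % 2 == 0]
print(even_powers)  # [1, 4, 16, 64]

# Two "for" clauses behave like nested loops
pairs = [(a, b) for a in range(2) for b in 'xy']
print(pairs)  # [(0, 'x'), (0, 'y'), (1, 'x'), (1, 'y')]
```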

The "for" statement can be used for reading a file too: looping over the file object gives one line at a time.
Because the text files from IMDB contain some other text and details before the data, I had to skip lines until the actual list begins, for which I used the "start" variable.
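
The skipping logic on its own might look like the sketch below (Python 3 syntax). The sample text is a made-up miniature, not the real layout of movies.list, and io.StringIO only stands in for the opened file so that the example runs without downloading anything.

```python
import io

# Hypothetical miniature of the file: preamble, a marker line, then data
sample = io.StringIO(
    "Copyright notice\n"
    "MOVIES LIST\n"
    "===========\n"
    "\n"
    "Avatar (2009)\t\t\t2009\n"
    "Inception (2010)\t\t2010\n"
)

start = False
data_lines = []
for line in sample:          # a file-like object yields one line per step
    if start:
        if '\t' in line:     # separator and blank lines contain no TAB
            data_lines.append(line)
    elif 'MOVIES LIST' in line:
        start = True         # everything after the marker is data

print(len(data_lines))  # 2
```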
Then I split every line at the TAB character and take the last element with index -1, because there can be several TABs between the movie title and the movie year. This is another nice feature of Python list indexing: in other languages you have to use size-1 to get the last position, while here you can use negative indexes.
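
Here is what the split produces for one line, as a small sketch in Python 3 syntax. The line itself is invented for the example; it uses the strip() method of the string instead of the "string" module.

```python
# Hypothetical line with several TABs between title and year
line = 'Avatar (2009)\t\t\t2009\n'

ls = line.split('\t')
print(ls)  # ['Avatar (2009)', '', '', '2009\n']

moviename = ls[0].strip()
movieyear = ls[-1].strip()   # same element as ls[len(ls) - 1]
print(moviename, movieyear)
```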
I'm storing the title and year in a dictionary variable, "Movies". Dictionaries are known in other languages as "associative arrays" or "associative memories"; the C++ counterpart is "map" and in C# it is Hashtable. The key is the movie title and the value is another dictionary, with the key "year" holding the release year. I use a nested dictionary because I will store other information there later.
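
The resulting structure can be tried out by hand. This is a sketch in Python 3 syntax with two made-up entries in the same shape that LoadMovies builds; the 'Sci-Fi' genre value is only an illustration of adding information later.

```python
Movies = {}

# Hypothetical entries: title as key, a small dictionary as value
Movies['Avatar (2009)'] = {'year': '2009', 'genre': []}
Movies['Inception (2010)'] = {'year': '2010', 'genre': []}

print(Movies['Inception (2010)']['year'])  # 2010
print('Avatar (2009)' in Movies)           # True

# More information can be attached to an entry later
Movies['Avatar (2009)']['genre'].append('Sci-Fi')
print(Movies['Avatar (2009)'])
```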

Now, if you put this little program in a text file called imdb.py and save it in the same directory where you downloaded and unpacked the movies.list file, you can run the program by directly executing the ".py" file, if you have registered this extension to Python. Or you can start IPython and change the current directory to the one where the files reside with the "cd" command. From there, import the "imdb" module and call imdb.LoadMovies() to fill the dictionary that the queries below use.

Now sample queries:
- get all the movies for year 2010: y2010 = [k for k,v in imdb.Movies.iteritems() if v['year']=='2010']
- the number of movies: len(y2010)
- the first and last movie: y2010[0], y2010[-1]

To search by movie title, for example for movies with "Star Trek" in the title:
star_trek = [k for k in imdb.Movies if 'Star Trek' in k]
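
To try queries of this kind without loading the whole database, here is a minimal sketch in Python 3 syntax on a tiny hand-made dictionary. The three entries are invented; items() is used in place of iteritems(), which exists only in Python 2, and sorted() is added only to make the printed order predictable.

```python
# Hypothetical stand-in for imdb.Movies
Movies = {
    'Star Trek (2009)': {'year': '2009', 'genre': []},
    'Inception (2010)': {'year': '2010', 'genre': []},
    'Toy Story 3 (2010)': {'year': '2010', 'genre': []},
}

# All titles for the year 2010
y2010 = sorted(k for k, v in Movies.items() if v['year'] == '2010')
print(len(y2010))  # 2
print(y2010)       # ['Inception (2010)', 'Toy Story 3 (2010)']

# The "in" operator tests whether the text occurs anywhere in the title
star_trek = [k for k in Movies if 'Star Trek' in k]
print(star_trek)   # ['Star Trek (2009)']
```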

You will need at least 1 GB of RAM, because the movie database contains 1.6 million items.

In the next post I will present the "matplotlib" library for Python and make some nice graphics about movies by year, country, language, etc.

To be continued ...

Monday, May 24, 2010
