A tool for applying iptables safely: apply_firewall.

Have you ever done something stupid with an iptables command, like blocking yourself from accessing SSH? Yes, I have. Most administrators know that it is dangerous to change iptables rules remotely: with one little typo, you might have to restart the machine to get SSH access again. To apply iptables safely, I wrote a simple tool. It backs up the original iptables configuration before applying the new rules, and if you don't type “yes” within a specified time period, it rolls back to the original iptables configuration automatically.
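
Here is a rough sketch of the mechanism (illustrative code, not apply_firewall's actual source; the rules-script name and the timeout are made up):

import select
import subprocess
import sys

TIMEOUT = 30  # seconds to confirm before rolling back (arbitrary)

# back up the current rules with iptables-save
backup = subprocess.Popen(['iptables-save'],
                          stdout=subprocess.PIPE).communicate()[0]

# apply the new rules (hypothetical script name)
subprocess.call(['sh', 'new_rules.sh'])

print 'Type "yes" within %d seconds to keep the new rules' % TIMEOUT
ready, _, _ = select.select([sys.stdin], [], [], TIMEOUT)
if ready and sys.stdin.readline().strip() == 'yes':
    print 'New rules confirmed'
else:
    # no confirmation in time: restore the saved rules
    restore = subprocess.Popen(['iptables-restore'], stdin=subprocess.PIPE)
    restore.communicate(backup)
    print 'Rolled back to the original rules'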

It is written in Python; you can install it with

easy_install apply_firewall

Or you can download it here:

http://pypi.python.org/pypi/apply_firewall/

I hope this is helpful for Linux administrators 😀


Open the browser, and here comes the computing power

This is a really awesome idea: open the browser, and here comes the computing power. Imagine that you can open your browser, rent 10 machines, and get everything done on them. Deployment? Software installation? No, the only thing you need is a browser.

The website of the project: http://stackvm.com/


Now.in is under a botnet DDoS attack from Europe

Over the past two days, Now.in has been under a botnet DDoS attack coming from all over Europe; the source countries include Romania, Turkey, the Czech Republic, Germany, and France.

To keep the service running normally, I spent an evening writing a firewall, and since I put it in place every one of their attacks has been ineffective.

What I don't understand is why this kid keeps trying when the attack is clearly ineffective. Judging from the technique, he is probably just a script kiddie. At first every attack request looked like this:

GET /radio/musicfm
Host: en.now.in

then it became

GET /radio/musicfm
Content-Type: text/plain
Host: en.now.in

or

GET /radio/musicfm
Connection: keep-alive
Host: en.now.in
Is he really stupid enough to think that adding keep-alive will make my server stupid enough to keep the connection open so he can eat my resources for free? Or that changing the request content will get past my firewall's detection? Until he guesses the rules the firewall uses, his attacks are all ineffective. Honestly, I can't think of a reason boring enough to justify wasting time attacking someone else's website; does blasting requests nonstop really feel that good? If it worked, fine, but this is a feeble third-rate attack: I went through the logs, and his dozen or so zombies combined were only managing a bit more than 200 requests per second, not even enough to push the firewall past 1% CPU usage.

Then again, the Internet is full of lunatics, and if you run a website you will run into this kind of pointless attack sooner or later. I thank this kid for giving me the chance to write a firewall, and if more lunatics attack later, that counts as experience too. Since this firewall was born because of zombies, I'll name it Molotov.
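
The post doesn't show Molotov's internals, but the core idea of a rate-based blocker can be sketched in a few lines (a toy sketch, not Molotov's source; the window size and threshold are made up):

import time
from collections import defaultdict, deque

WINDOW = 10.0  # seconds of history to keep per source IP
LIMIT = 50     # max requests allowed per window

hits = defaultdict(deque)
blocked = set()

def allow(ip):
    """Return True if a request from this IP should be served"""
    if ip in blocked:
        return False
    now = time.time()
    queue = hits[ip]
    queue.append(now)
    # drop timestamps that fell out of the window
    while queue and now - queue[0] > WINDOW:
        queue.popleft()
    if len(queue) > LIMIT:
        blocked.add(ip)  # too many requests: ban this source
        return False
    return True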

A gentle introduction to coroutines and gevent

This article gives a rough introduction to coroutines and to gevent, a related Python library. Before introducing coroutines, let's set up some context. Since the most common applications of coroutines today are in network programs, coroutines will be easier to understand if we first build up a model of the different network server architectures.

Different network server architectures

As the Internet has grown, it has become an ever more important foundation of modern life, and the servers providing these services carry more and more connections, so the software and hardware architectures of network programs have kept evolving to support ever larger connection counts. Some people wanted a single machine to handle more than 10,000 concurrent connections, which was posed as the C10K problem, and quite a few new techniques have since appeared to reach that goal. Here we focus only on the software architecture, starting with the simplest one.

Blocking, single process

This kind of network program is very simple: a single loop in a single process, which finishes handling one request before moving on to the next. Naturally the performance is terrible, because no other connection can be served before the current one completes. You rarely see this architecture in modern servers, but its advantage is simplicity; if there aren't many concurrent connections to handle, it is actually quite sufficient.
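
As a minimal sketch of this model (a toy echo server; the port number is arbitrary):

import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('', 8000))
server.listen(5)

while True:
    conn, addr = server.accept()  # blocks until a client connects
    data = conn.recv(1024)        # blocks; nobody else is served meanwhile
    while data:
        conn.sendall(data)
        data = conn.recv(1024)
    conn.close()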

Blocking, multiple processes

Since a single process can handle only one request at a time, the obvious idea is to spawn one process per request, so more requests can be handled concurrently. This has drawbacks: if processes are copied with fork, the OS paging system keeps the cost reasonable, but a cost remains, and more connections mean more context switches. When the connection count grows large enough, most of the CPU time may be spent on context switching, which makes this architecture inefficient under heavy load. The advantage is that the server code is hardly different from the single-process blocking version: just as simple to write.
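
The same toy echo server in fork-per-connection form (again a sketch):

import os
import signal
import socket

# let the kernel reap finished children so they don't become zombies
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('', 8000))
server.listen(5)

while True:
    conn, addr = server.accept()
    if os.fork() == 0:
        # child process: serve this one client, then exit
        data = conn.recv(1024)
        while data:
            conn.sendall(data)
            data = conn.recv(1024)
        conn.close()
        os._exit(0)
    conn.close()  # parent closes its copy and goes back to accepting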

Blocking, multiple processes with multiple threads

Besides the blocking multi-process model, some programs spawn multiple threads across multiple processes to handle requests, in order to reduce the cost of process copying or for other considerations; it may also be a single process with multiple threads. Fundamentally these are not very different from the models above, but introducing threads brings extra problems such as deadlocks and race conditions: whenever different threads share something they must work on together, these classic concurrency problems appear, and the program has to be written much more carefully.
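
A thread-per-connection sketch of the same toy server:

import socket
import threading

def handle(conn):
    # any state shared between these handler threads would need locking
    data = conn.recv(1024)
    while data:
        conn.sendall(data)
        data = conn.recv(1024)
    conn.close()

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(('', 8000))
server.listen(5)

while True:
    conn, addr = server.accept()
    threading.Thread(target=handle, args=(conn,)).start()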

Non-blocking, event-driven

To solve the problems brought by multiple processes and threads described above, one approach is to have only a single main loop that checks whether any network IO events have occurred and then decides how to handle them. The benefits are that you save the costs of context switches, process copies and so on, and there are no deadlock or race condition problems; the drawback is that the program becomes more complex. When an event is triggered and your work is not finished yet, you have to record the current state, and when the next event is triggered, decide what to do next based on the saved state. It is not as intuitive as the linear execution of the models above. Twisted is a network framework of this kind.
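
A sketch of the event-driven model built on select() (real frameworks use epoll or kqueue and buffer their writes, which this toy version skips):

import select
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setblocking(0)
server.bind(('', 8000))
server.listen(5)

sockets = [server]
while True:
    # the single loop: wait for IO events, then dispatch on them
    readable, _, _ = select.select(sockets, [], [])
    for s in readable:
        if s is server:
            conn, addr = server.accept()
            conn.setblocking(0)
            sockets.append(conn)
        else:
            data = s.recv(1024)
            if data:
                s.send(data)
            else:
                sockets.remove(s)
                s.close()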

Non-blocking, coroutines

So you might wonder: is it possible to have the benefits of event-driven code together with the intuitiveness of blocking code? The answer may well be coroutines. In essence this model is still event-driven, with only a single loop checking for events, but the concept of coroutines is added on top. Gevent is a library of this kind.

Coroutine

Having said "coroutine" so many times, I believe most readers still have no idea what this thing actually is. For most programmers it is a rather unfamiliar name; to me it was a confusing term at first too, but once you understand it, you find that coroutines are not that hard. To put it in one simple sentence: a coroutine is a procedure that can be suspended and resumed later. Let's look at an example. Python actually has the most basic kind of coroutine built in: the generator.

def foo():
    for i in range(10):
        # yield a value and hand control back to the caller
        yield i
        print 'foo: control is back in my hands, hit me, you fool'

bar = foo()
# start the coroutine
print bar.next()
print 'main: control is in our hands now, do some chores'
print 'main: hello baby!'
# resume foo from where it was suspended
print bar.next()
print bar.next()

The output:

0
main: control is in our hands now, do some chores
main: hello baby!
foo: control is back in my hands, hit me, you fool
1
foo: control is back in my hands, hit me, you fool
2

See that? Our foo was suspended and resumed several times during its execution; that is a coroutine. You might think this is completely useless, and at first I thought so too: a thread's context switch can also pause execution and resume it later, so what is the benefit of this property? A few key points:

  • threads need context switches between them, which are expensive, while switching between coroutines is fast
  • coroutines are cheap, so you can easily create large numbers of them
  • everything happens inside the same thread, so there are no race conditions and similar problems (well, they are still possible)
  • we can control a thread's context switches to some degree, but much is still left to the OS to decide which thread runs first, whereas the execution of coroutines is controlled by ourselves

Next, let's use a diagram to explain the difference between coroutines and threads.

As you can see, all coroutines really do is switch between different coroutines within a single thread. In essence this is a lot like threads, which is why coroutines are sometimes called micro-threads, but the switching is usually triggered by us. Looking at the next picture: when you create several threads, context switches do not necessarily happen where you expect, and they are usually triggered by the OS. Besides that, if your dad is rich enough to buy the newest thousand-core Intel iCore 2000 CPU, then congratulations: your threads may be executed by different processors, so code in different threads really can run at the same time.
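
The explicit, do-it-yourself switching can be shown with the greenlet library (which gevent builds on); nothing in this sketch runs until someone calls switch():

from greenlet import greenlet

def ping():
    print 'ping'
    gr2.switch()    # hand control over to pong
    print 'ping again'

def pong():
    print 'pong'
    gr1.switch()    # hand control back to ping

gr1 = greenlet(ping)
gr2 = greenlet(pong)
gr1.switch()        # prints: ping, pong, ping again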

Wait! At this point you might grab the dinosaur OS textbook at hand, whack yourself on the head, and ask: hey, the dinosaur book covers scheduling algorithms like FCFS and RR; what about coroutines? The answer, as mentioned above, is that we decide ourselves, and that is one of the benefits. Then you might grab your hair and scream: "What about the network? The network?! After all this talk, all I've seen is coroutines switching back and forth; I can't find the slightest trace of networking in any of this, not even a gram."

Indeed, when I first read the material I was confused too: so they jump back and forth, so what? Where is the network application? Now let's get back to our topic.

Coroutines without asynchronous IO are like sashimi without wasabi

So far it has been hard to relate any of this to networking, but networking brings IO to mind, and in fact the strength of this model is that it is especially useful when there is a lot of IO. Consider handing control over to someone else whenever we run into an IO operation.

This is basically how Gevent operates, much like this picture. Its coroutines are implemented with greenlet, and every coroutine has a parent; the top-level coroutine is the main thread, or the current thread. Whenever a coroutine runs into IO, it hands control to the root coroutine, which looks at which coroutines' IO events have completed and hands control to one of them. That is really all there is to it. Code written this way looks exactly the same as an ordinary blocking server, yet it is genuinely asynchronous; that is the magic of it. Let's look at a practical example, taken directly from gevent's examples: the concurrent downloader.

#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.

"""Spawn multiple workers and wait for them to complete"""

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2

def print_head(url):
    print 'Starting %s' % url
    data = urllib2.urlopen(url).read()
    print '%s: %s bytes: %r' % (url, len(data), data[:50])

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)

This program is really simple, right? Yet it really achieves highly efficient concurrent downloading. Let's explain a little. The first confusing part is monkey.patch_all(). This line is needed because the IO functions in Python's built-in libraries, and blocking functions such as sleep, would block the whole program instead of using a selector/epoll-style mechanism. The monkey module is responsible for replacing the built-in library functions with gevent's asynchronous versions, so that when execution reaches one of those IO operations, it switches to the main thread's coroutine for scheduling rather than getting stuck waiting for the result. When the IO operation actually completes, gevent internally marks that coroutine as runnable, so it will be scheduled at the next opportunity. The spawn calls below create coroutines; since the coroutines here are in fact provided by the greenlet library, they are actually called greenlets.

Picture it in your head: spawn first creates three routines that run print_head. At joinall, control is handed to the first print_head; inside the function it runs into urllib2.urlopen, an IO operation, so it sets itself to the waiting state and hands control back to the main thread, which schedules the second print_head. That one likewise hits the urllib IO operation, and so does the third. With all three waiting, control returns once more to the main thread, which waits for one of gevent's IO events to complete and then hands control to the corresponding coroutine. By repeating this process, the concurrent download of the three pages is completed in between coroutine switches. When the three functions return, joinall finishes as well, and the whole program is done.

It is hard to grasp at first, but once you figure it out you will appreciate how simple and practical this style is. You can forget the machinery behind it and write your programs as if they were ordinary blocking network programs, with ease, and still get something lightweight and high-performance. That is about it; if you are interested, go play with gevent.
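
The same style works on the server side. A sketch with gevent's StreamServer, which runs one greenlet per connection: the handler below is written in plain blocking style yet serves many clients concurrently (the port and the echo protocol are again just a toy):

from gevent.server import StreamServer

def handle(socket, address):
    # looks like blocking code, but recv() yields to other greenlets
    data = socket.recv(1024)
    while data:
        socket.sendall(data)
        data = socket.recv(1024)

StreamServer(('', 8000), handle).serve_forever()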


Simple tool for rotating nginx log file

#!/usr/bin/env python
"""Simple tool for rotating nginx log file

@author: Victor Lin ([email protected]) blog: 
"""
import os
import shutil
import optparse
import datetime
import logging
import signal
import subprocess

log = logging.getLogger(__name__)

def main():
    parser = optparse.OptionParser()
    parser.add_option('-p', '--pid', dest='pidFile', metavar="FILE", help='/path/to/nginx.pid')
    parser.add_option('-l', '--log', dest='logFile', metavar="FILE", help='/path/to/logfile')
    parser.add_option('-f', '--format', dest='nameFormat',
        help='format of rotated log file name' )
    parser.add_option('-o', '--owner', dest='owner', help='the owner user of log file to set')
    (options, args) = parser.parse_args()

    if not os.path.exists(options.logFile):
        log.info('The log file %s does not exist', options.logFile)
        return

    # move the log file aside
    newName = datetime.date.today().strftime(options.nameFormat)
    log.info('Move log file %s to %s', options.logFile, newName)
    shutil.move(options.logFile, newName)

    if options.owner:
        log.info('Set owner of %s to %s', newName, options.owner)
        subprocess.check_call(['chown', options.owner, newName])

    # tell nginx to reopen the log file; nginx reopens its logs on SIGUSR1
    log.info('Reopen log file')
    pid = int(open(options.pidFile, 'rt').read())
    os.kill(pid, signal.SIGUSR1)

    log.info('done')

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)
    main()

Usage:

python nginx_log_rotate.py -p /usr/nginx/logs/nginx.pid -l "/usr/nginx/logs/YOURDOMAIN.log" -f "/home/USER/logs/YOURDOMAIN.%Y-%m-%d" -o OWNER_USER

For example, you can set up a crontab task for rotating the log file like this:

0 0 * * * python nginx_log_rotate.py -p /usr/nginx/logs/nginx.pid -l "/usr/nginx/logs/YOURDOMAIN.log" -f "/home/USER/logs/YOURDOMAIN.%Y-%m-%d" -o OWNER_USER


How hateful is it to develop a web application that has to run correctly in different browsers?

Like this:

You have to run different browsers to test the web page. What's more, you have to run those stupid IE browsers under VirtualBox or some other virtual system. Moreover, the differing details of JavaScript/CSS engine behavior drive you crazy, especially IE's. You can write CSS that works correctly in Chrome/Firefox/Opera, and unfortunately the layout gets messed up by IE! Likewise, you can write code that runs correctly in IE6/IE8/Chrome/Firefox/Opera, but not in IE7 (jQuery error when aborting an ajax call only in Internet Explorer).

Die! IE! Die!!!!!!


A simple workaround for the PIL installation problem under virtualenv

I ran into a problem when installing PIL under a virtualenv. I installed it with easy_install, and the output said it was installed, but I couldn't import it inside the virtual environment: I got an ImportError when I tried. I then inspected PIL's directory inside the site-packages directory and noticed that it contained no egg information. Also, the path to PIL in the .pth file is just "PIL", and the contents of PIL sit directly inside that directory. What does this mean? When you try to import PIL, Python looks at every path in sys.path and checks whether there is a package named PIL. But all Python can see inside that directory is something like this:

site-packages/
    PIL/
        __init__.py
        _imaging.pyd
        _imagingcms.pyd
        ... and other files of PIL

See? Python can only see __init__ and the other things that belong to PIL; it cannot find a package named PIL, and that is why the import fails. Solving this is simple: just create another PIL directory inside the original PIL directory and move everything in it into the new sub PIL directory. Then you get something like this:

site-packages/
    PIL/
        PIL/
            __init__.py
            other PIL stuff here …

That's it! Now Python can see and find the PIL package. Sure, the PIL release is broken, but fortunately it is not difficult to fix. I hope this article is helpful for people who have also run into this problem.
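
The shuffle can also be scripted; a small sketch, assuming a hypothetical virtualenv path (adjust it to your own environment):

import os
import shutil

# path to the broken PIL directory inside your virtualenv (example path)
pil_dir = 'env/lib/python2.6/site-packages/PIL'
inner = os.path.join(pil_dir, 'PIL')

os.mkdir(inner)
for name in os.listdir(pil_dir):
    if name != 'PIL':
        # move every file into the new sub PIL package directory
        shutil.move(os.path.join(pil_dir, name), inner)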


Automatic Traditional-to-Simplified Chinese conversion for PO files

I have been updating my website Now.in lately, and the most tedious part of the work is internationalization. I usually write the pages in English first and then use TurboGears2's i18n facilities to extract the message strings; the i18n workflow for my GUI programs is much the same: write English first, then translate into Traditional Chinese. One of the most annoying steps is cutting and pasting the translated Traditional Chinese strings from the po file into Google Translate to turn them into Simplified Chinese, then pasting the results into the Simplified Chinese po file. This is highly repetitive, mechanical work; it is bearable while there are only a few sentences, but as the strings pile up it becomes painful. Since the work is so repetitive, why should a human do it? So I wrote a small program that pulls the strings out of a po file, sends them to Google Translate, and writes the results into another po file.

# -*- coding: utf8 -*-
'''
Created on 2010/4/27

@author: Victor-mortal
'''

import urllib
import json
import logging
import optparse
import codecs
import htmllib

log = logging.getLogger(__name__)

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()

def translate(text, sourceLanguage, destLanguage):
    """Translate a text

    @param text: text to translate
    @param sourceLanguage: the language original text is in
    @param destLanguage: language to translate to
    """
    log.info('Translate %s from %s to %s', text, sourceLanguage, destLanguage)
    query = dict(v='1.0',
                 q=text.encode('utf8'),
                 langpair='%s|%s' % (sourceLanguage, destLanguage))
    file = urllib.urlopen(
        'http://ajax.googleapis.com/ajax/services/language/translate',
        data=urllib.urlencode(query)
    )
    result = file.read()
    file.close()
    jsonResult = json.loads(result)
    if not jsonResult['responseData']:
        return None
    return unescape(jsonResult['responseData']['translatedText'])

def main():
    logging.basicConfig(level=logging.INFO)

    parser = optparse.OptionParser(
        usage="usage: %prog sourcePoFile destPoFile sourceLanguage destLanguage"
    )
    (_, args) = parser.parse_args()

    sourceFilePath = args[0]
    destFilePath = args[1]
    sourceLang = args[2]
    destLang = args[3]

    log.info('Translate %s (in %s) to %s (in %s)',
             sourceFilePath, sourceLang, destFilePath, destLang)

    result = []
    sourceFile = codecs.open(sourceFilePath, 'rt', encoding='utf8')
    for line in sourceFile.readlines():
        if line.startswith('msgstr'):
            _, msg = line.split(' ', 1)
            msg = msg.strip()
            msg = msg[1:-1]
            translatedMsg = translate(msg, sourceLang, destLang)
            if not translatedMsg:
                translatedMsg = msg
            result.append('msgstr "%s"\n' % translatedMsg)
        else:
            result.append(line)
    sourceFile.close()

    destFile = codecs.open(destFilePath, 'wt', encoding='utf8')
    destFile.writelines(result)
    destFile.close()

if __name__ == '__main__':
    main()

Usage is simple:

python po_translate.py source.po dest.po source-language dest-language

The language codes are listed in the Google Translate API documentation. In fact it can do more than Traditional-to-Simplified conversion; translating into other languages works too, if you can live with the quality.


A list of useful Python libraries

Python has a wealth of third-party libraries and tools. I have long meant to put together a list, and today I found some time to organize the ones I know of or have used.

Desktop GUI development

  • wxPython: a cross-platform GUI library ported from wxWidgets; its distinguishing feature is native widgets
  • PyQt: another well-known cross-platform GUI library, ported from Qt; unfortunately it is under the GPL, so commercial use requires buying a separate license
  • PySide: because PyQt is GPL-licensed, another Qt port was started; it is under the LGPL, so it also suits commercial software
  • PyGTK: the Python port of GTK

Game development

  • pygame: a simple 2D game development library, based mainly on SDL
  • Python-Ogre: the Python port of Ogre, the famous open-source 3D game engine
  • pyglet: a cross-platform multimedia library
  • PyOpenGL: the Python port of OpenGL
  • Python-Hge: my Python port of the HGE 2D game engine; roughly usable, though some parts are not fully ported yet

Web development

  • Flask: a web framework based on Werkzeug whose main selling point is being lightweight; simple and easy to use with little overhead, a lovable framework that is easy to pick up
  • Werkzeug: not really a web framework; it provides the functionality you frequently need when developing WSGI applications. If you dislike frameworks where everything is prepared for you, but don't want to start from scratch, give it a try
  • Pyramid: the next-generation web framework that merged Pylons and BFG, known for complete documentation and a claimed 100% test coverage
  • TurboGears: an integrative web framework assembled from suitable ready-made components, so the coupling is loose and some components can be swapped out. The first version mainly used CherryPy as the web server, Kid as the default template engine, and SQLObject as the ORM; the second version is built on the lightweight Pylons framework, switches the ORM to SQLAlchemy and the default templates to Genshi, and conforms to the WSGI specification
  • Django: the famous Python web framework; unlike the integrative TurboGears, it is self-contained, with every component from the templates to the ORM developed on its own
  • Web2py: a rather interesting web framework that bundles a complete development environment; all development is done inside its own web application
  • Webpy: easy to confuse with Web2py, but actually a different web framework
  • BFG: it came out of the Zope and Plone communities and seems to use quite a few technologies extracted from Zope and Plone; in contrast to the bloat of Plone and Zope, its slogan is "pay only for what you eat"
  • Pylons: a lightweight web framework that emphasizes flexibility and rapid development; TurboGears2 is the framework built on top of Pylons
  • Paste: a project that provides WSGI-related infrastructure, such as a WSGI server, configuration, deployment, and so on
  • WebOb: a library that wraps WSGI into objects such as Request and Response
  • ToscaWidgets: a library that turns common web components, such as forms and tables, into objects to make generating pages convenient
  • FormEncode: a library providing Validator objects for form-validation logic
  • Plone: the famous CMS (content management system); it lets you quickly put up a professional site, but the drawback is that it is bloated; it is built on Zope
  • Zope: the platform underlying Plone

Web templating

  • Kid: a template engine based on XML streams; its distinguishing feature is that it can only produce well-formed markup. Development has stopped and Genshi succeeds it
  • Genshi: the template language succeeding Kid; it fixes Kid's shortcomings and adds more features, such as filters for processing the XML stream
  • Mako: a template language not based on XML streams
  • jinja: another non-XML-stream template language

Network programming

  • Twisted: the heavyweight network programming framework, built on the asynchronous reactor pattern; it already implements most of the common protocols, which makes it very convenient for rapid server development
  • Tornado: another asynchronous IO framework for network programs
  • Others: there are simply too many Python network frameworks to list, and no time to try them one by one; the article Asynchronous Servers in Python lists plenty of Python asynchronous server libraries worth a look
  • pypcap: a Python library for capturing network packets; on Windows it requires WinPcap
  • dpkt: a library for parsing network packets; it can be used together with pypcap for packet capture
  • Gevent: a network library combining libevent and greenlet; its biggest feature is using micro-threads to handle network connections

Databases

  • SQLAlchemy: an Object Relational Mapper library for databases; simply put, it maps database tables and relations to objects, so you can operate the database through convenient object manipulation
  • SQLObject: another database ORM
  • Elixir: an ORM built on top of SQLAlchemy, emphasizing features such as table inheritance and polymorphism

Server administration

  • Supervisor: a tool for managing and running daemon processes, providing an XML-RPC remote-control interface; the best choice for running server programs
  • Fabric: a tool that can issue commands to many hosts at once over SSH; very useful for administering a large number of servers

Miscellaneous

  • lxml: an extremely efficient and powerful XML/HTML parsing and processing library
  • py2exe: a tool that packages Python programs into executables; Windows-only
  • PyInstaller: another tool for packaging Python programs into executables; unlike py2exe it is not limited to Windows and works cross-platform
  • mapnik: a GIS library with Python support; it can draw beautiful maps and can even be used to build web pages like Google Maps
  • matplotlib: a powerful chart-plotting library; it can draw almost any chart you can think of, supports many output formats, and integrates with GUI toolkits
  • gluttony: a tool I wrote for discovering the dependency relations between Python packages; see the article Python套件依賴關係圖工具: Gluttony

Honorable mentions

I have written down everything I could think of in a short time, but quite a few libraries are still missing from the list. If you know of something that belongs here, or you find a mistake, please leave me a comment and I will add it or fix it when I have time.

Updates

  • 2011/06/01: added Flask, Werkzeug, Pyramid, Gevent, Supervisor, Fabric

Memory efficient Python with bytearray

The story

I was developing an audio broadcasting server, written with Twisted. It works fine, but there was still a big problem to solve: memory usage. My audio broadcasting server used far too much memory, as you can see from the following figure.

As you can see, memory usage goes up like crazy when more listeners are online; it is almost exponential growth, and it makes no sense for the server to take that much memory. So I started looking for the cause of the overuse. At first I thought it might be a memory leak in the C modules of Twisted or some other third-party package, but that is not reasonable: if it were a memory leak, why is there no corresponding rise in memory usage at every peak of radios/listeners?

Surprising result from guppy

I inspected the memory usage in detail with guppy and got a surprising result:


Partition of a set of 116280 objects. Total size = 9552004 bytes.
 Index  Count   %     Size    %  Cumulative   %  Type
     0  52874  45  4505404   47     4505404  47  str
     1   5927   5  2231096   23     6736500  71  dict
     2  29215  25  1099676   12     7836176  82  tuple
     3   7503   6   510204    5     8346380  87  types.CodeType
     4   7625   7   427000    4     8773380  92  function
     5    672   1   292968    3     9066348  95  type
     6    866   1    82176    1     9148524  96  list
     7   1796   2    71840    1     9220364  97  __builtin__.weakref
     8   1140   1    41040    0     9261404  97  __builtin__.wrapper_descriptor
     9   2603   2    31236    0     9292640  97  int

Even though the RSS reported by ps was almost 100MB, very little memory was actually used by Python objects. I then considered another cause: fragmentation.

Fragmentation

I was not sure at first; I hadn't done any memory leak detection on the server either, I was just guessing. I reviewed the design of the server. The following class is the core of the design.

class AudioStream(object):
    """Audio stream

    """

    def __init__(self, memoryLimit=128*1024):
        """

        @param memoryLimit: limit of memory usage in bytes
        """
        # limit of memory usage
        self.memoryLimit = memoryLimit
        # offset of audio stream (how many chunks)
        self.offset = 0
        # queue for audio data chunk
        self.chunks = []
        # total bytes of chunk in queue
        self.totalSize = 0

    def write(self, chunk):
        """Write audio data to audio stream

        @param chunk: audio data chunk to write
        """
        # append chunk to queue
        self.chunks.append(chunk)
        self.totalSize += len(chunk)

        # check the usage of memory, if exceeded, pop chunks
        while self.totalSize > self.memoryLimit:
            poppedSize = len(self.chunks.pop(0))
            self.totalSize -= poppedSize
            self.offset += 1
            log.debug('AudioStream pop chunk %d, remaining size %d, offset %d',
                poppedSize, self.totalSize, self.offset)

It is a simple idea: every radio has its own buffer, a list of strings, and when a listener needs more data we just pick up a chunk and send it to the peer. A list of strings? That might be the reason for the memory overuse, I wondered. So I tried to understand how Python manages memory, found the article Improving Python's Memory Allocator, and then understood how it works. It is not difficult: every object smaller than 256 bytes is allocated from the memory pool, otherwise it is allocated with malloc. Then I could imagine what happened behind the scenes. The chunks are not of fixed size; they might run from 200~300 bytes up to 1xxx bytes. The list of chunks is limited to a fixed total size; when the total size of the chunks gets too big, old chunks are popped off. And here lies the key to the fragmentation. You can imagine:

  1. A chunk of 987 bytes occupies a piece of memory at 0x000123
  2. When it is no longer needed, it is freed, leaving a free memory chunk of 987 bytes at 0x000123
  3. Another memory allocation request arrives with size 768; malloc finds the free chunk at 0x000123, occupies it, and returns
  4. Here is the fragment! 987 - 768 = 219: we are left with a small chunk of free memory, too small to be of much use

This happens again, again, and again while the server is serving, so more and more fragments pile up! I finally knew the reason for the memory overuse. Then came another question: how to fix it? A list of strings is an easy way to store buffered chunks, but it creates fragments. I figured that a Python C module with memory-chunk allocation and access functions might be a good idea.

The final solution: bytearray

I started to write my memory-chunk allocation C module, but while doing so I found a better solution that fits exactly what I need: the bytearray. I found it in the CPython source code. It is a single memory chunk; you can change its size, in which case it may be reallocated, but if you don't change its length, it stays the same memory buffer. At first I was curious why I couldn't find anything about bytearray in the Python 2.6 documents; then I noticed it is a back-port from Python 3, which is why there was no documentation for it.
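
A quick illustration of the property the new design relies on: assigning an equal-length slice into a bytearray keeps its length, and thus its buffer, unchanged, so no further allocations happen:

buf = bytearray(16)   # one 16-byte chunk, allocated once
buf[0:4] = 'abcd'     # overwrite in place, same length
buf[4:8] = 'efgh'
assert len(buf) == 16
print str(buf[0:8])   # prints 'abcdefgh'

Okay, now let's see the better solution with bytearray: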

class AudioStream(object):
    """Audio stream

    """

    def __init__(self, size=1024, count=128):
        """

        The bytes is a big memory chunk, it buffers all incoming audio data.
        There are blocks in the memory chunk, they are the basic unit to send to
        peer.

        <-------------- Memory chunk ------------------>
        <--Block1--><--Block2--><--Block3--><--Block4-->
        ^          ^          ^          ^
        L1         L2         L3         L4

        We map blocks to the real audio stream

        <------------------ Audio Stream -------------->  ---> time goes
        <--Block3--><--Block4--><--Block1--><--Block2-->

                          Map to

        <-------------- Memory chunk ------------------>
        <--Block1--><--Block2--><--Block3--><--Block4-->

        Every listener got their offset of whole audio stream, so that we can
        know which block he got.

        ------------<------------------ Audio Stream --------------> --->
                    <--Block3--><--Block4--><--Block1--><--Block2-->
        ^
        L5

        When there is a listener point to a out of buffer window place, we
        should move the pointer to the first current block.

        ------------<------------------ Audio Stream --------------> --->
                    <--Block3--><--Block4--><--Block1--><--Block2-->
                    ^
                    L5

        @param size: size of block
        @param count: count of blocks
        """
        self._size = size
        self._count = count
        self._bufferSize = size*count

        # offset of begin of buffer window in audio stream
        self._offset = 0
        # bytes array
        self._bytes = bytearray(self.bufferSize)
        # small chunks, they are not big enough to fit a block
        self._pieces = []
        # total size of pieces
        self._pieceSize = 0

    def _getSize(self):
        return self._size
    size = property(_getSize)

    def _getCount(self):
        return self._count
    count = property(_getCount)

    def _getOffset(self):
        return self._offset
    offset = property(_getOffset)

    def _getBufferSize(self):
        return self._bufferSize
    bufferSize = property(_getBufferSize)

    def write(self, chunk):
        """Write audio data to audio stream

        @param chunk: audio data chunk to write
        """
        # append chunk to pieces
        self._pieces.append(chunk)
        self._pieceSize += len(chunk)

        while self._pieceSize >= self.size:
            total = ''.join(self._pieces)
            block = total[:self.size]
            # there is still some remain piece
            if self._pieceSize - self.size > 0:
                self._pieces = [total[self.size:]]
                self._pieceSize = len(self._pieces[0])
            else:
                self._pieces = []
                self._pieceSize = 0

            # write the block to buffer
            begin = self.offset % self.bufferSize
            oldSize = len(self._bytes)
            self._bytes[begin:begin+self.size] = block
            assert len(self._bytes) == oldSize, "buffer size is changed"

            self._offset += len(block)

    def read(self, offset):
        """Read a block from audio stream

        @param offset: offset to read block
        @return: (block, new offset)
        """
        begin = offset % self.bufferSize
        assert begin >= 0
        assert begin < self.bufferSize
        block = str(self._bytes[begin:begin+self.size])
        offset += self.size
        return block, offset


With the new design there are no more allocations and deallocations, which saves a huge amount of memory. Here is the memory usage figure with the new design:

As you can see, with almost 800 listeners online, the memory usage is still low and stable. The original server used over 100MB with only 6x listeners online. That is a huge difference.
