The best choice to grab data from websites: Python + Twisted + lxml

This is a translation of my Chinese article 抓取網頁的最佳語言 : Python ("The best language for grabbing web pages: Python").

At first

At first, I used C/C++ to write programs for grabbing data from websites. I tried to write my own library for these tasks, but I realized that it is not easy to implement an HTTP client library. Then I used the cURL library for downloading pages, but even with cURL it was not productive. I had to modify the program frequently, and compile times were costly. At the time there was also no standard regular expression support for C/C++. I also had to deal with many annoying details like memory management and string handling.

Then

After that, I started to wonder whether C/C++ was a good choice for grabbing data from websites. Why did I have to handle so many details? Why not just use a scripting language or some other language? At first I worried about performance, but then I realized that the performance of the language is not the bottleneck. What’s more, a scripting language gives me many more benefits: it is easier to develop and debug. So I decided to find another solution for grabbing data from websites.

How about Perl?

A long time ago, I used Perl to write CGI programs such as guest books and website management systems. That said, Perl is a “write-once” language. Lots of Perl programs are filled with terse syntax and symbols, which makes them really difficult to read. It is not easy to modularize Perl programs, and it doesn’t support OO well. There has also been no major new version of Perl for a long time. A new Perl is under development, but it has taken so long that I consider it almost dead. For these reasons, and out of personal feeling, I don’t like Perl.

PHP

PHP is a popular programming language designed for websites, and I don’t think it is suitable for other situations. Although it is popular, it is a really badly designed language. It is also not easy to modularize PHP programs, and it doesn’t support OO well either. Namespaces are also a big problem: there are so many functions that look like mysql_xxxx, mysql_oooo. But even such a bad language has its advantage. That is: popular, popular and popular. Someone said:

PHP is the BASIC of the 21st century

Well, whatever, PHP is out.

Lua

Lua is a lightweight scripting language; almost everything about its design is for performance. I wanted to wrap C/C++ libraries for Lua, but Lua also has lots of weaknesses. It is not easy to modularize either. Because almost everything in Lua is designed for performance, its syntax is not so friendly. What’s more, there are few resources for Lua, so I might have to build everything I need. So Lua is not on the list.

Java

Java is a language that grew up with the Internet, so it is absolutely qualified. But I don’t like it because it is too verbose. What’s more, it is too fat! I want to throw my laptop, which has only 256MB of RAM, out the window when I am running Eclipse on it. I’m sorry, I don’t like Java. The guy I quoted in the PHP section also said:

Java is the COBOL of the 21st century

Python

Finally, I posted questions on PTT, and someone recommended Python. Well, Python? WTF? I had never heard of it before. I searched for it and asked some questions, and then I found that it is exactly what I want! It can be extended easily: if I need performance, I can write a module in C for Python. There are also so many resources to use. You can find almost any Python library that you can imagine. Those libraries are also easy to install; you can type “easy_install” to install almost everything you want. Most scripting languages are not suitable for big programs, but Python is not one of them: it is easy to modularize, and it supports OO well. What’s more, it is really easy to read and write. There are also lots of big names using Python, like Google, YouTube and so on. When I decided to learn Python, I bought a copy of Learning Python and started my journey with Python.

Fall in love with Python

It didn’t disappoint me. It is very productive to develop with Python. I rewrote almost everything that I had done in C/C++ before. But for grabbing data from websites, there was still lots of work to do.

Twisted

It is really a piece of cake for Python to get a web page. There are standard modules, urllib and urllib2, but they are not good enough. Then I found Twisted.

Twisted is an event-driven networking engine written in Python and licensed under the MIT license.

It is very powerful. It has a beautiful callback design, named Deferred, for handling async operations. You can write one line to grab a page:

getPage("http://www.google.com").addCallback(printPage)

You can also chain callbacks on a Deferred to process the data step by step; each callback receives the result returned by the previous one:

d = getPage("http://www.google.com")
d.addCallback(parseHtml)
d.addCallback(extractData)
d.addCallback(saveResult)

What’s more, I wrote an auto-retry function for Twisted that retries any async function automatically; you can read An auto-retry recipe for Twisted.
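The recipe itself is not reproduced here. As a rough illustration of the auto-retry idea only, here is a minimal sketch that uses asyncio from the standard library instead of Twisted. The names retry and flaky_fetch, and the attempts and delay parameters, are hypothetical and are not taken from the recipe.

```python
import asyncio


async def retry(func, *args, attempts=3, delay=0.0):
    """Await func(*args) until it succeeds or the attempts run out.

    Hypothetical helper for illustration; this is not the Twisted recipe.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await func(*args)
        except Exception as error:  # a real spider would catch narrower errors
            last_error = error
            if attempt < attempts:
                await asyncio.sleep(delay)
    # Every attempt failed: re-raise the last error that was seen.
    raise last_error


calls = []


async def flaky_fetch(url):
    """Stand-in for a page download: fails twice, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"


result = asyncio.run(retry(flaky_fetch, "http://example.com"))
print(result, "after", len(calls), "calls")
```

The caller only sees the final result or the last error, so the retry logic stays out of the parsing code.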

BeautifulSoup

It is not difficult to get a page from a website. Parsing HTML is a much more difficult job. Python has standard modules for it, but they are too simple. The biggest trouble with parsing HTML is that so many websites don’t follow the HTML or XHTML standards. You can see lots of syntax errors in those pages, which makes parsing difficult. So I needed an HTML parser that can deal with broken HTML well. Here comes BeautifulSoup: it is an HTML parser written in Python, and it handles broken HTML well. But there is a problem: it is not efficient. For example, when you want to find a specific tag, you write:

soup.find('div', dict(id='content'))

This is okay on a small page, but it is a big problem on a big page, because its tag-finding method is very, very slow. At first, I expected the bottleneck to be the network, but with BeautifulSoup the bottleneck is parsing and finding tags. You will notice that when you run your spider, CPU usage is at 100% all the time. I profiled my program, and most of the running time was spent in soup.find. For performance reasons, I had to find another solution.
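To check where the time goes in your own spider, the standard library's cProfile and pstats modules are enough. The sketch below is only an illustration: slow_find and crawl are hypothetical stand-ins for an expensive tag search and the parsing step that calls it, not my real spider code.

```python
import cProfile
import io
import pstats


def slow_find(items, target):
    """Stand-in for an expensive tag search: a plain linear scan."""
    for item in items:
        if item == target:
            return item
    return None


def crawl():
    """Stand-in for a parsing step that calls slow_find many times."""
    items = list(range(20000))
    return [slow_find(items, 19999) for _ in range(50)]


profiler = cProfile.Profile()
profiler.enable()
crawl()
profiler.disable()

# Sort by cumulative time and print the five most expensive entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The function that dominates the cumulative time column is the one worth replacing or optimizing first.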

lxml

Then I found a nice article, Python HTML Parser Performance, which compares the performance of different Python HTML parsers. The most impressive one is lxml. At first, I worried that it might be difficult to find target tags with lxml. Then I noticed that it provides XPath! It is much easier to write XPath than to use the find methods of BeautifulSoup, and it is also much more efficient to use lxml to parse and find target tags. Here are some real-life examples I wrote:

def getNextPageLink(self, tree):
    """Get next page link

    @param tree: tree to get link
    @return: Return url of next page, if there is no next page, return None
    """
    paging = tree.xpath("//span[@class='paging']")
    if paging:
        # Find the link whose text contains the localized "next" label
        links = paging[0].xpath("./a[contains(text(), '%s')]" % self.localText['next'])
        if links:
            return str(links[0].get('href'))
    return None

# Second example (a fragment from another method): the list price is the
# element that follows the price label
listPrice = tree.xpath("//*[@class='priceBlockLabel']/following-sibling::*")
if listPrice:
    detail['listPrice'] = self.stripMoney(listPrice[0].text)

With BeautifulSoup, I have to write logic in Python to find target tags. With lxml, I write almost all of that logic in XPath, which is much easier.
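To try these expressions without a live website, here is a small self-contained sketch. The HTML string is made up for illustration and only imitates the structure the snippets above expect; it assumes lxml is installed.

```python
from lxml import etree

# Made-up page fragment that imitates a paging bar and a price block.
HTML = """
<html><body>
  <span class="paging">
    <a href="/list?page=1">Previous</a>
    <a href="/list?page=3">Next</a>
  </span>
  <div><span class="priceBlockLabel">List Price:</span><span>$12.50</span></div>
</body></html>
"""

tree = etree.HTML(HTML)

# One XPath expression selects the href of the "Next" link directly.
next_links = tree.xpath("//span[@class='paging']/a[contains(text(), 'Next')]/@href")

# following-sibling::*[1] picks the element right after the price label.
prices = tree.xpath("//*[@class='priceBlockLabel']/following-sibling::*[1]/text()")

print(next_links)
print(prices)
```

Both lookups are single expressions; the Python code around them only has to check whether the returned list is empty.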

Useful Firefox tools

With XPath, it is not difficult to find target tags. But it would be wonderful if you could try XPath expressions out on the websites themselves, right? I found some Firefox plugins that are very useful for writing spiders. Here are some useful tools for analysis:

Screenshot of the Firefox XPath checker plugin

FireFox XPath checker

Inspecting page elements with FireBug

FireBug

Example

I wrote an example to show what it looks like.

# -*- coding: utf8 -*-
# Note: this example targets Python 2 (cStringIO, unicode)
import cStringIO as StringIO

from twisted.internet import reactor
from twisted.web.client import getPage
from twisted.python.util import println
from lxml import etree

def parseHtml(html):
    # Parse the downloaded page into an lxml tree
    parser = etree.HTMLParser(encoding='utf8')
    tree = etree.parse(StringIO.StringIO(html), parser)
    return tree

def extractTitle(tree):
    # Pick the text of the <title> element with XPath
    titleText = unicode(tree.xpath("//title/text()")[0])
    return titleText

d = getPage('http://www.google.com')
d.addCallback(parseHtml)
d.addCallback(extractTitle)
d.addBoth(println)
# Stop the reactor once the result (or error) has been printed
d.addBoth(lambda _: reactor.stop())

reactor.run()

This is a very simple program: it grabs the title of google.com and prints it out. Very elegant, isn’t it? 😀

Conclusion

One year has passed since I wrote this article in Chinese. Today, I still use Python + Twisted + lxml for grabbing data from websites. You might not agree with what I said, but for me they are the best tools for writing a spider (crawler, or whatever).


34 Responses to The best choice to grab data from websites: Python + Twisted + lxml

  1. Thomas says:

    Learn English.

  2. victor says:

    Sorry for my poor English. I have already fixed most of the wrong grammar and typos. If you find anything wrong, please let me know. Thanks for your advice.

  3. jon says:

    Fuck Thomas. I had seen javascript for the getNextPageLink function. Yours is a tribute to the elegance of Python and lxml.

    Bravo.

  4. Stefano says:

    Thanks for the article.
    Hey Thomas: vaffanculo!!! (learn italian)

  5. Robert says:

    ANY language is “write once”. That isn’t the languages fault it is the programmers. The same goes for Perl. Perl can be used as a “write once” language. That isn’t Perl’s fault. It is and always has been the programmers fault.

  6. Govind says:

    lxml is nice, thanks for sharing your knowledge.
    regards
    Govind

  7. crono5788 says:

    Nice article, and your English is pretty good!

  8. Ash says:

    Thanks for the article, your English is fine 🙂

    Thomas is a loser and he knows it 🙂

  9. Josh Narins says:

    Your knowledge of perl seems woefully outdated. It is trivial to make Perl modules, and I’d bet perl has the largest set of modules of any language out there (www.CPAN.org).

    It’s also possible to do perfectly reasonable OO, and if you want anal OO (positively stop anyone from doing anything outside the published interface) you can do that, too (See Conway’s Perl Best Practices and “inside out classes”).

    I do plenty of web page parsing in perl, and we’ve got stuff like Soup, more than once choice, in fact, for non-compliant HTML.

  10. victor says:

    Sorry for my out-of-date knowledge about Perl, but even so, I don’t like Perl’s design idea. It puts too many things into the syntax, which makes it a mess. You get tons of $$, $%, $&… and so on. There are so many dollar signs with different meanings. There are also tons of syntaxes for different tasks, e.g. reading a line from a file. How can you tell what the hell it is if you don’t have a manual, or you don’t remember what it is?

    Before I knew Python, I could already read some simple Python programs. Now that I know Python, I can read almost all Python programs that I can find. But with Perl, before I knew it, I couldn’t read anything written in Perl; it looked like a spell. Even knowing Perl, it is still difficult to read a Perl program; I have to consult the manual all the time when I encounter all that syntax. They just put too many things into the syntax. Why do you have to have syntax for everything? Why not just put those things into modules? Do you need a “Turn on the light of your kitchen” syntax, too? I don’t think so.

    Also, there are so many dirty ways to achieve the same task. I have seen so many Perl programs modify a global variable to make something work. Well, that is really, really bad practice: what if another guy also modifies the global variable in his module and expects it to work?

    It is interesting: lots of Perl guys hate Python, and lots of Python guys hate Perl. I think that’s because people believe in different design ideas. The idea of Python is “There should be one– and preferably only one –obvious way to do it.” And the idea of Perl is “There’s More Than One Way To Do It.”

    So, I am sorry, I hate Perl.

  11. Stu says:

    Keep at the python, your english is fine by the way !

  12. design says:

    good choice with lxml. Its fast, lightweight and pretty good generally. A lot of people tend to think twisted is the solution for everything network related, I strongly disagree.

  13. your english is fine?! says:

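    I wasn’t going to reply, but seeing so many people say things that are plainly untrue, I couldn’t help it. The meaning is barely understandable, and the first paragraph alone is full of problems.

    At first, I use C/C++ to write programs for grabbing data from websites.
    “At first” refers to the beginning, doesn’t it? Are you still living in your “at first” now? Why is this whole paragraph in the present tense? This is the most serious mistake in the article; the wrong tense makes it feel very strange to read. Many Americans stop reading when the first sentence looks wrong. Luckily this sentence of yours is fairly eye-catching, because nobody would use C++ to build this kind of small tool.

    And I try to write a library for these tasks, but it is not easy to implement a HTTP client library.
    Don’t start a sentence with “And”. Even though you used “And”, your previous sentence and this one have nothing to do with each other and don’t connect. Also, after reading the first two sentences of the paragraph, I still don’t know what the paragraph is trying to say.

    So I use cUrl library for downloading pages, but even so, it is not productive to write web spider in C/C++.
    Don’t start a sentence with “So”. And there are two “so”s in one sentence. What does the second “so” refer to? Written this way, the meaning is the same and no “so” is needed: Even with cUrl library, it was unproductive to write a web spider in C/C++. It still connects with the sentences before and after. With this sentence, the previous one isn’t needed either.

    When I am developing spiders, I need to modify program frequently, but the compiling time is costly.
    Leave out what doesn’t need saying. Isn’t “I need to modify program frequently” stating the obvious? Without it, the sentence becomes much more concise. Having two or more conjunctions in one sentence is odd.

    There is also no regular express for C/C++ (Now we have boost).
    regular “expression”. I only meant to say this once, but I can’t help it: the wrong tense changes the meaning. Translated into Chinese, your sentence says: C/C++ currently has no regex either (now we have boost). Isn’t that a contradiction? If you don’t know how to use parentheses, don’t use them. They shouldn’t really be used in an article anyway.

    With C/C++, I also have to deal with many annoying details like memory managing, string handling and e.t.c.
    memory “management”. Don’t use etc. “like memory management and string handling” is good enough. Everyone knows you are talking about C/C++; you have repeated it many times already, so “With C/C++” is unnecessary. This sentence also doesn’t read like the end of a paragraph. There should be at least one more sentence after it, something like: For all these reasons, I have long abandoned this approach to look for other solutions.

    Finally, this paragraph is only a few sentences long, and how many times did you use “C/C++” and “but”? ….. It is tiresome to read. Did you read your own article before posting it?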

  14. Steven says:

    Don’t apologize for your english, it is perfectly fine. The person that made that comment, Thomas, is what we call in the U.S., a dickhead.

  15. Steven says:

    I am American. Your English is better than his Chinese!

  16. Manny says:

    PHP doesn’t support OO? You’re an idiot. Go back to the drawing board.

  17. victor says:

    @your english is fine?! :

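    Thank you for your comments. I know very well what my level of English is. It is true that my English is poor, and I won’t get angry or anything because of that. On the contrary, I will check my article once again; there really are many places that need fixing.

    If you say I haven’t read my own article, that is not the case; I have revised it n times. Because the original is in Chinese, I roughly followed the Chinese sentences when writing, so under the influence of Chinese some very odd sentences or usages appear. Thank you for pointing this out; I have fixed the parts you mentioned as best I can.

    If I avoided using English because I feel my English is poor, it would stay that poor forever. Besides sharing, this article is also practice for my English. I am actually quite happy when someone points out the mistakes in my article. Once a mistake is fixed, as long as I remember it, I am less likely to make it again next time. Isn’t that what progress is?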

  18. victor says:

    @Manny:

    I am sorry, I didn’t say “PHP doesn’t support OO”; I said “it doesn’t support OO well”. They are quite different. At the time I wrote this article, PHP didn’t support OO well. As for now, I have no idea how PHP is doing. I haven’t written PHP for a long while. I use TurboGears2 to build web applications.

  19. SiliconChaos says:

    Victor,

    Thanks for the article. Ixml will help me with a project I’m working on and I’m going to look into Twisted.

    I’ve recently come back to Python after going over to ruby because of rails, but have found django to fit into my current projects.

    Hope to read more articles from you (to bad I can’t read Chinese).

    BTW, you’re English is not poor at all, I would like to see anybody who complains about your article speak and write in Chinese (or any other language).

  20. SiliconChaos says:

    Also, I will try to check out the rest of your site using google’s translate. I know it will not do the job right, but hopefully it can do a good enough job for me to get what you are trying to say.

    Best Regards

  21. your english is fine?! says:

    “BTW, you’re English is not poor at all”

    hahahaha. I rest my case.

  22. Manny says:

    PHP has had proper OO support for many years, what planet are you from? The date on your article isn’t that long ago (this month). Yikes, man.

  24. PHP sucks says:

    To Manny:

    It doesn’t matter if PHP has OO. It still sucks really bad compared to C# and Python.

    If you want something “up-to-date”, take a look at this:
    http://www.bitstorm.org/edwin/en/php/

    PHP is like a bicycle. You can attach a lot of bells and whistles (like OO) to it, but it will NEVER go faster than an airplane.

    The world has already moved so far ahead that PHP simply has no hope of catching up.

  25. Yi says:

    Cool stuff! btw, also checkout Feedity – http://feedity.com – I use it a lot these days for creating custom RSS feeds from various webpages. It is simple to use and gives great results. Hope it helps. Chao 🙂

  26. Ari Seida says:

    Have you tried Scrapy?. It’s a very powerful (and simple) web crawling/screen-scraping framework which is also built on Python and Twisted.

  27. jose says:

    Hi,

    First, thanks for sharing that useful information. It is kind of difficult to find info about this topic.
    At the moment, I’m choosing the technology/language to develop a project which scrapes several websites, and
    I have a couple of questions for you:

    1.- Have you tried sitescraper (a tool based on lxml)? How was it?

    2.- I have seen a great Java library called htmlunit for web scraping.
    Isn’t the time performance of Java MUCH better than lxml/Python,
    for scraping 10 or 50 websites at the same time?
    Don’t you think it is very time consuming to use lxml/Python instead of Java? And what about memory?
    This link points to some benchmarks written in pure Python:

    http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=java&lang2=iron

    However, I have read that with native libraries like lxml the time is significantly less.
    Thanks!

  28. victor says:

    @jose:

    1. I didn’t try it.

    2. lxml is based on a library written in C. It is very fast. You can refer to this article:

    http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

    It has the best overall performance among Python HTML parsers. As for the performance of network frameworks, you can read this article:

    http://nichol.as/asynchronous-servers-in-python

    Twisted is not the best, but it is good enough, and the most important thing is that it has a full stack of protocol implementations.

    Hope this is helpful for you.

  29. stefan says:

    @jose

    The libxml2 parser that lxml uses is actually faster than pretty much any parser that exists in the Java world. And for web scaping, your code will usually be limited by the network, not so much by the CPU.
