This is a translation of my article 抓取網頁的最佳語言 : Python, which I originally wrote in Chinese.
At first, I used C/C++ to write programs for grabbing data from websites. I tried to write a library for these tasks, but I realized that it is not easy to implement an HTTP client library. Then I used the cURL library for downloading pages, but even with cURL I was not productive. I had to modify the program frequently, and compiling took a lot of time. There was also no built-in regular expression support in C/C++. I also had to deal with many annoying details like memory management and string handling.
After that, I began to think that C/C++ is not a nice choice for grabbing data from websites. Why do I have to handle so many details? Why don't I just use a scripting language instead? At first I was worried about performance, but then I realized that the speed of the language is not the bottleneck. What's more, a scripting language brings many benefits: it is easier to develop and debug. So I decided to find another solution for grabbing data from websites.
How about Perl?
A long time ago, I used Perl to write CGI programs, like a guest book, a website management system and so on. That said, Perl is a "write-only" language. Lots of Perl programs are filled with terse syntax and symbols, so they are really difficult to read. It is not easy to modularize Perl programs, and Perl doesn't support OO well. There is also no new version of Perl: the new Perl is under construction, but it is taking so long that I think the language is almost dead. For these reasons, and out of personal feeling, I don't like Perl.
PHP is a popular programming language designed for websites, and I don't think it is suitable for other situations. Although it is popular, it is really a badly designed language. It is not easy to modularize PHP programs, and it doesn't support OO well either. Namespaces are also a big problem: there are so many functions that look like mysql_xxxx and mysql_oooo. But even such a bad language has its advantage: it is popular, popular and popular. Someone said that:
PHP is the BASIC of the 21st century
Well, whatever. PHP is out.
Lua is a lightweight scripting language, and almost everything about its design is for performance. I wanted to wrap C/C++ libraries for Lua, but Lua has lots of weaknesses too. It is not easy to modularize, either. And because almost everything in Lua is designed for performance, its syntax is not so friendly. What's more, there are few resources for Lua, so I might have to build everything I need. So Lua is not on the list.
Java is a language that grew up with the Internet, so it is absolutely qualified. But I don't like it because it is too verbose. What's more, it is too fat! I want to throw my laptop, which has only 256MB of RAM, out the window when I run Eclipse on it. I'm sorry, I don't like Java. The guy I quoted about PHP also said that:
Java is the COBOL of the 21st century
Finally, I posted questions on PTT, and someone recommended Python. Well, Python? WTF? I had never heard of it before. So I searched for it and asked some questions, and I found that it is exactly what I want! It can be extended easily: if I need performance, I can write a module in C for Python. There are also so many resources to use; you can find almost any Python library you can imagine. Those libraries are easy to install, too: you can run "easy_install" to install almost everything you want. Most scripting languages are not suitable for big programs, but Python is not one of them: it is easy to modularize, and it supports OO well. What's more, it is really easy to read and write. Lots of big names use Python, like Google, YouTube and so on. When I decided to learn Python, I bought a copy of Learning Python and started my journey with Python.
Fall in love with Python
It didn't disappoint me. Developing with Python is very productive. I rewrote in Python almost everything that I had done in C/C++ before. But for grabbing data from websites, there was still lots of work to do.
Getting a web page is really a piece of cake for Python. There are standard modules, urllib and urllib2, but they are not good enough. Then I found Twisted.
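To show how little code the standard library needs for a plain download, here is a minimal sketch. It assumes Python 3, where urllib2 was merged into urllib.request; the helper name fetch, the timeout value and the UTF-8 fallback are my own choices for illustration.

```python
import urllib.request


def fetch(url, timeout=10):
    """Download a URL and return its body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header when there is one.
        # Falling back to UTF-8 is an assumption of this sketch.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch("http://www.google.com") should return the page source as a string. The call blocks until the download finishes, which is one reason a blocking approach like this may not be enough for a spider that fetches many pages.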
Twisted is an event-driven networking engine written in Python and licensed under the MIT license.
It is very powerful. It has a beautiful callback design, named deferred, for handling async operations. Grabbing a page takes a single call to getPage, and you can chain callbacks on the deferred it returns to handle the data:

    d = getPage("http://www.google.com")
    d.addCallback(parseHtml)
    d.addCallback(extractData)
    d.addCallback(saveResult)
What's more, I wrote an auto-retry function for Twisted that retries any async function automatically; you can read about it in An auto-retry recipe for Twisted.
Getting a page from a website is not a difficult job; parsing HTML is much harder. There are standard modules in Python, but they are too simple. The biggest trouble with parsing HTML is that so many websites don't follow the HTML or XHTML standard, and you can see lots of syntax errors in their pages. That makes parsing difficult, so I needed an HTML parser that copes well with wrong HTML syntax. Here comes BeautifulSoup: it is an HTML parser written in Python, and it handles wrong HTML syntax well. But there is a problem: it is not efficient. For example, suppose you want to find a specific tag and call soup.find for it.
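A lookup of that kind might look like the following sketch. It assumes Python 3 and the current bs4 package rather than the BeautifulSoup release I used back then; the sample markup, the class name "price" and the helper find_price are made up for illustration.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup: the <b> and <p> tags are never closed.
BROKEN_HTML = """
<html><body>
<div class="price"><b>$12.50</div>
<p>unclosed paragraph
</body></html>
"""


def find_price(html):
    """Return the text of the first <div class="price">, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # find() returns the first matching tag, or None when nothing matches.
    tag = soup.find("div", {"class": "price"})
    return tag.get_text(strip=True) if tag is not None else None
```

The parser tolerates the unclosed tags, and the search itself is plain Python code written around find().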
That is okay on a small page, but it is a big problem on a big page, because its tag-finding method is very, very slow. At first, I expected the bottleneck to be the network, but with BeautifulSoup the bottleneck is parsing and finding tags. You can notice that when you run your spider, CPU usage stays at 100% all the time. I profiled my program, and most of the running time was spent in soup.find. For performance reasons, I had to find another solution.
Then I found a nice article, Python HTML Parser Performance, which compares the performance of different Python HTML parsers. The most impressive one is lxml. At first, I worried that it might be difficult to find target tags with lxml. Then I noticed that it provides XPath! It is much easier to write XPath than to use the find methods of BeautifulSoup, and lxml is also much more efficient at parsing and finding target tags. Here are some real-life examples I wrote:
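Finding a hot spot like that does not need special tools. Here is a minimal sketch using the standard cProfile and pstats modules, assuming Python 3. The functions slow_find and crawl are hypothetical stand-ins for a slow tag search and a spider loop; they are not my real spider code.

```python
import cProfile
import io
import pstats


def slow_find(items, target):
    """Hypothetical stand-in for a slow tag search: a linear scan."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1


def crawl():
    """Hypothetical stand-in for the spider's main loop."""
    items = list(range(20000))
    return [slow_find(items, 19999) for _ in range(50)]


def profile_report(func, limit=5):
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(limit)
    return stream.getvalue()
```

Printing profile_report(crawl) lists the functions with the largest cumulative time, which is how a single search function can stand out as the place where the CPU time goes.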
    def getNextPageLink(self, tree):
        """Get next page link

        @param tree: tree to get link
        @return: Return url of next page, if there is no next page, return None
        """
        paging = tree.xpath("//span[@class='paging']")
        if paging:
            # xpath() returns a list, so work on the first match.
            # Pick the link whose text contains the localized "next" label.
            links = paging[0].xpath("./a[contains(text(), '%s')]" % self.localText['next'])
            if links:
                return str(links[0].get('href'))
        return None

    # The element right after the price label holds the list price
    listPrice = tree.xpath("//*[@class='priceBlockLabel']/following-sibling::*")
    if listPrice:
        detail['listPrice'] = self.stripMoney(listPrice[0].text)
With BeautifulSoup, I have to write logic in Python to find target tags. With lxml, I write almost all of that logic in XPath, which is much easier.
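To make the difference concrete, here is a small self-contained sketch of XPath lookups on messy HTML. It assumes Python 3 with lxml installed; the sample markup and the helpers first, next_page_link and list_price are made up for illustration and only loosely follow the snippets above.

```python
from lxml import etree

# The paging <span> is never closed; the HTML parser recovers from that.
SAMPLE_HTML = """
<html><body>
<span class="priceBlockLabel">List Price:</span><span>$12.50</span>
<span class="paging"><a href="/page/1">Prev</a> <a href="/page/3">Next</a>
</body></html>
"""


def first(nodes):
    """xpath() always returns a list; take the first match or None."""
    return nodes[0] if nodes else None


def next_page_link(tree, label="Next"):
    """Return the href of the paging link whose text contains label."""
    # $label is an XPath variable, filled in from the keyword argument.
    return first(tree.xpath(
        "//span[@class='paging']/a[contains(text(), $label)]/@href",
        label=label))


def list_price(tree):
    """Return the text of the element right after the price label."""
    node = first(tree.xpath(
        "//*[@class='priceBlockLabel']/following-sibling::*"))
    return node.text if node is not None else None
```

After tree = etree.HTML(SAMPLE_HTML), both helpers are one XPath expression each. The selection logic sits in the expression, and the Python around it only handles the empty case.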
Useful Firefox tools
With XPath, finding target tags is not a difficult job. But it would be wonderful if you could try XPath expressions directly on websites, right? I found that some Firefox plugins are very useful for analyzing pages when writing spiders.
Here is an example I wrote to show what a spider built with Twisted and lxml looks like:
    # -*- coding: utf8 -*-
    # Python 2 code: cStringIO and unicode do not exist in Python 3
    import cStringIO as StringIO

    from twisted.internet import reactor
    from twisted.web.client import getPage
    from twisted.python.util import println
    from lxml import etree

    def parseHtml(html):
        parser = etree.HTMLParser(encoding='utf8')
        tree = etree.parse(StringIO.StringIO(html), parser)
        return tree

    def extractTitle(tree):
        # xpath() returns a list of matches; take the first one
        titleText = unicode(tree.xpath("//title/text()")[0])
        return titleText

    d = getPage('http://www.google.com')
    d.addCallback(parseHtml)
    d.addCallback(extractTitle)
    d.addBoth(println)
    # Stop the reactor once the result or the error has been printed
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

This is a very simple program: it grabs the title of google.com and prints it out. Very elegant, isn't it? 😀
One year has passed since I wrote this article in Chinese. Today, I still use Python + Twisted + lxml for grabbing data from websites. You might not agree with what I said, but for me they are the best tools for writing a spider (crawler or whatever).