Web Page Extraction with Perl + QtWebKit + VDOM

Yichun Zhang (agentzh)
Lightning talk
Language: Chinese
Tags: vdom webkit

You can find more material on the speaker's website:

Download slides: http://agentzh.org/misc/slides/taobao-fe.tar.gz

In this talk, I'll briefly show how to access, from Perl, the HTML DOM annotated with visual information that Apple's WebKit browser rendering engine produces. I'll then present several ways to use that post-rendering visual information for automatic webpage data extraction. Finally, I'll demonstrate some of our extractors that are already in production: the user comment extractor CommentHunter.pm, the webpage title extractor TitleHunter.pm, and the list page/detail page classifier ListHunter.pm.
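To give a rough feel for what "using visual information for extraction" can mean, here is a minimal sketch in Python (chosen only for brevity; the talk's own tools are Perl modules). It is not the algorithm used by TitleHunter.pm, and it does not reproduce the actual VDOM format: the `VisualNode` record, its field names, the `guess_title` helper, and the `max_y` threshold are all hypothetical. The sketch assumes each rendered text block comes with its position and font size, and guesses the title as the largest-font block near the top of the page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VisualNode:
    """One rendered text block. All field names here are hypothetical."""
    text: str
    x: int          # left edge in pixels
    y: int          # top edge in pixels
    width: int
    height: int
    font_size: int  # computed font size in pixels


def guess_title(nodes: List[VisualNode], max_y: int = 600) -> Optional[str]:
    """Return the text of the largest-font block in the top part of the page."""
    # Ignore empty blocks and anything rendered below the assumed cutoff.
    candidates = [n for n in nodes if n.text.strip() and n.y <= max_y]
    if not candidates:
        return None
    # Prefer bigger fonts; break ties by picking the block closer to the top.
    best = max(candidates, key=lambda n: (n.font_size, -n.y))
    return best.text.strip()


# Made-up sample data standing in for a rendered page.
page = [
    VisualNode("Home | News | About", 0, 10, 960, 20, 12),
    VisualNode("QtWebKit meets Perl", 40, 120, 700, 48, 32),
    VisualNode("Posted by someone", 40, 180, 300, 16, 11),
    VisualNode("Huge footer banner", 0, 2400, 960, 80, 40),
]

print(guess_title(page))
```

The point of the sketch is that the footer banner, despite having the largest font, is rejected purely because of where it was rendered, which is information a plain HTML parser without a rendering engine would not have.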