Web Page Extraction with Perl + QtWebKit + VDOM
By Yichun Zhang (agentzh)
Lightning talk
Language: Chinese
Tags: vdom webkit
You can find more information on the speaker's site:
Download slides: http://agentzh.org/misc/slides/taobao-fe.tar.gz
In this talk, I'll briefly show how to access, from Perl, the HTML DOM annotated with visual information produced by Apple's WebKit browser rendering engine, and present several ways to use that post-rendering visual information for automatic web page data extraction. Finally, I'll demonstrate some of our extractors already in production: the user comment extractor CommentHunter.pm, the web page title extractor TitleHunter.pm, and the list page/detail page classifier ListHunter.pm.