Give It a Try

A blog to push myself to produce output. Even the trial and error gets dumped here.

Extracting the main content from HTML (readability-lxml)

 It worked great. It looks better than extractcontent3.

Deliverables

Sources

Installation

pip3 install readability-lxml
pip3 install html2text

Log

Collecting readability-lxml
  Using cached https://www.piwheels.org/simple/readability-lxml/readability_lxml-0.7.1-py3-none-any.whl
Collecting lxml (from readability-lxml)
  Using cached https://www.piwheels.org/simple/lxml/lxml-4.4.1-cp35-cp35m-linux_armv7l.whl
Collecting cssselect (from readability-lxml)
  Using cached https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
Collecting chardet (from readability-lxml)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Installing collected packages: lxml, cssselect, chardet, readability-lxml
Successfully installed chardet-3.0.4 cssselect-1.1.0 lxml-4.4.1 readability-lxml-0.7.1
Collecting html2text
  Downloading https://files.pythonhosted.org/packages/f0/68/2bdc9ff2202b53d2fd6321be48d1effca0f679f0797701017df5be26bd82/html2text-2019.8.11-py2.py3-none-any.whl
Installing collected packages: html2text
Successfully installed html2text-2019.8.11

Code

 An excerpt of the main changes.

import extractcontent3
from readability.readability import Document
import html2text

class HtmlContentExtractor:
    def __init__(self, option=None):
        # Initialize every result field so the properties are safe to read
        # before extract() has been called.
        self.__title = None
        self.__html = None
        self.__text = None
        self.__md = None
        # extractcontent3 extractor; extract() in this excerpt does not use it.
        self.__extractor = extractcontent3.ExtractContent()
        if option is not None:
            self.__extractor.set_option(option) # option = {"threshold":50}
    @property
    def Title(self): return self.__title
    @property
    def Html(self): return self.__html
    @property
    def Markdown(self): return self.__md
    @property
    def Text(self): return self.__text
    def extract(self, html):
        # https://github.com/buriy/python-readability/blob/master/readability/readability.py
        doc = Document(html)
        self.__title = doc.title()
        # summary() returns the extracted main content as HTML.
        self.__html = doc.summary()
        self.__md = html2text.html2text(self.__html)
        # __format_to_text is my own helper; it is not included in this excerpt.
        self.__text = self.__format_to_text(self.__html)
        return self.__text

 It does not seem to offer a way to get the result as plain text, so I handle that part myself.
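
 For reference, here is a rough sketch of one way the plain-text step could be done. This is not the actual __format_to_text from my class (that one is not shown in this post). The names html_to_text and _TextCollector, and the list of tags treated as line breaks, are assumptions made up for this illustration. It only uses html.parser from the standard library, so nothing extra needs installing.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text and starts a new line at block-level tags."""

    # Tags treated as line breaks (an illustrative choice, not exhaustive).
    BLOCK_TAGS = {"p", "div", "br", "li", "tr",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    # Tags whose contents are never shown to the reader.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Trim each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html):
    """Hypothetical helper: reduce an HTML fragment to plain text."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()
```

 Passing the HTML string returned by doc.summary() to html_to_text would give one line of text per block element. Inline tags such as b or a are simply dropped, so link URLs are lost; if those matter, the Markdown output from html2text is probably the better choice.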

Target environment

$ uname -a
Linux raspberrypi 4.19.42-v7+ #1218 SMP Tue May 14 00:48:17 BST 2019 armv7l GNU/Linux