readability-lxml worked great. It looks better than extractcontent3.
Deliverable
Sources
- https://orangain.hatenablog.com/entry/content-extraction-from-html-in-python
- https://qiita.com/nolty/items/4cce0f17e27a812182d1
- https://github.com/buriy/python-readability/blob/master/readability/readability.py
Installation

    pip3 install readability-lxml
    pip3 install html2text
Log

    Collecting readability-lxml
      Using cached https://www.piwheels.org/simple/readability-lxml/readability_lxml-0.7.1-py3-none-any.whl
    Collecting lxml (from readability-lxml)
      Using cached https://www.piwheels.org/simple/lxml/lxml-4.4.1-cp35-cp35m-linux_armv7l.whl
    Collecting cssselect (from readability-lxml)
      Using cached https://files.pythonhosted.org/packages/3b/d4/3b5c17f00cce85b9a1e6f91096e1cc8e8ede2e1be8e96b87ce1ed09e92c5/cssselect-1.1.0-py2.py3-none-any.whl
    Collecting chardet (from readability-lxml)
      Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
    Installing collected packages: lxml, cssselect, chardet, readability-lxml
    Successfully installed chardet-3.0.4 cssselect-1.1.0 lxml-4.4.1 readability-lxml-0.7.1

    Collecting html2text
      Downloading https://files.pythonhosted.org/packages/f0/68/2bdc9ff2202b53d2fd6321be48d1effca0f679f0797701017df5be26bd82/html2text-2019.8.11-py2.py3-none-any.whl
    Installing collected packages: html2text
    Successfully installed html2text-2019.8.11
Code
An excerpt of the main changes.
    import extractcontent3
    from readability.readability import Document
    import html2text


    class HtmlContentExtractor:
        def __init__(self, option=None):
            self.__title = None  # set here so Title can be read before extract()
            self.__html = None
            self.__text = None
            self.__md = None
            self.__extractor = extractcontent3.ExtractContent()
            if option is not None:
                self.__extractor.set_option(option)  # option = {"threshold":50}

        @property
        def Title(self):
            return self.__title

        @property
        def Html(self):
            return self.__html

        @property
        def Markdown(self):
            return self.__md

        @property
        def Text(self):
            return self.__text

        def extract(self, html):
            # https://github.com/buriy/python-readability/blob/master/readability/readability.py
            doc = Document(html)
            self.__title = doc.title()
            self.__html = doc.summary()  # main content, still as HTML
            self.__md = html2text.html2text(self.__html)
            # __format_to_text is a method of this class that is not part of this excerpt
            self.__text = self.__format_to_text(self.__html)
            return self.__text
readability does not seem to offer a way to get the result as plain text, so that part is handled by my own code.
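The plain-text step is not shown in the excerpt, so here is a rough sketch of one way it could be done using only the Python standard library. The function name `format_to_text`, the class `_TextCollector`, and the tag sets are assumptions made for illustration; this is not the helper actually used in the class above.

```python
from html.parser import HTMLParser

# Assumption: closing one of these tags ends the current line of text.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6",
              "blockquote", "pre", "section", "article"}
# Tags whose contents are not reader-visible text.
SKIP_TAGS = {"script", "style"}


class _TextCollector(HTMLParser):
    """Collects visible text and inserts line breaks at block boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "br":
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Trim each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def format_to_text(html):
    """Hypothetical helper: turn an HTML fragment into plain text."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()
```

It could be called on the HTML returned by `doc.summary()`, for example `format_to_text(doc.summary())`. Sticking to the standard library avoids an extra dependency, though a real page may need more tags in `BLOCK_TAGS` to get sensible line breaks.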
Target environment
- Raspberry Pi 3 Model B+
- Raspbian stretch 9.0 2018-11-13 ※
- bash 4.4.12(1)-release ※
- Python 3.5.3
- SQLite 3.29.0 ※
- MeCab 0.996 (user dictionary)
    $ uname -a
    Linux raspberrypi 4.19.42-v7+ #1218 SMP Tue May 14 00:48:17 BST 2019 armv7l GNU/Linux