Chur

`_::

  $ docker pull ghcr.io/weblyzard/inscriptis:latest
  $ docker run --name inscriptis ghcr.io/weblyzard/inscriptis:latest

Run as Kubernetes Deployment
----------------------------

The helm chart for deployment on a Kubernetes cluster is located in the `inscriptis-helm repository `_.

Use the Web Service
-------------------

The Web service receives the HTML file in the request body and returns the corresponding text. The file's encoding needs to be specified in the ``Content-Type`` header (``UTF-8`` in the example below)::

  $ curl -X POST -H "Content-Type: text/html; encoding=UTF8" \
         --data-binary @test.html http://localhost:5000/get_text

The service also supports a version call::

  $ curl http://localhost:5000/version

Example annotation profiles
===========================

The following section provides a number of example annotation profiles that illustrate the use of Inscriptis' annotation support. Each example presents the annotation rules used and an image that highlights a snippet of the annotated text on the converted web page. The images have been created with the HTML postprocessor described in Section `annotation postprocessors <#annotation-postprocessors>`_.

Wikipedia tables and table metadata
-----------------------------------

The following annotation rules extract tables from Wikipedia pages and annotate table header cells (``th``), which typically indicate column or row headings.

.. code-block:: json

   {
      "table": ["table"],
      "th": ["tableheading"],
      "caption": ["caption"]
   }

The figure below shows an example table from Wikipedia that has been annotated using these rules.

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-table-annotation.png
   :alt: Table and table metadata annotations extracted from the Wikipedia entry for Chur.

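
As a small illustration of how annotations such as ``table``, ``tableheading`` and ``caption`` might be consumed downstream, the following sketch groups the annotated text spans by label. It assumes that the annotated output is a dictionary with a ``text`` string and a ``label`` list of ``[start, end, label]`` triples with an exclusive end offset. The helper ``spans_by_label`` and the sample data are hypothetical and not part of Inscriptis.

```python
from typing import Dict, List, Sequence


def spans_by_label(text: str, labels: Sequence[Sequence]) -> Dict[str, List[str]]:
    """Group the annotated surface forms by their annotation label.

    Assumes each entry in `labels` is a [start, end, label] triple whose
    end offset is exclusive.
    """
    grouped: Dict[str, List[str]] = {}
    for start, end, label in labels:
        grouped.setdefault(label, []).append(text[start:end])
    return grouped


# Hypothetical annotated output for a tiny table (not real Wikipedia data).
sample = {
    "text": "Demo table\nYear  Value\n2020  42",
    "label": [
        [0, 10, "caption"],
        [11, 15, "tableheading"],
        [17, 22, "tableheading"],
        [0, 31, "table"],
    ],
}

grouped = spans_by_label(sample["text"], sample["label"])
print(grouped["caption"])
print(grouped["tableheading"])
```

Grouping by label keeps the table headings and the caption available as separate lists, so a downstream component can process them independently of the full table text.
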

References to entities, missing entities and citations from Wikipedia
---------------------------------------------------------------------

This profile extracts references to Wikipedia entities, missing entities and citations. Please note that the profile isn't perfect, since it also annotates ``[ edit ]`` links.

.. code-block:: json

   {
      "a#title": ["entity"],
      "a#class=new": ["missing"],
      "class=reference": ["citation"]
   }

The figure shows entities and citations that have been identified on a Wikipedia page using these rules.

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/wikipedia-chur-entry-annotation.png
   :alt: Metadata on entries, missing entries and citations extracted from the Wikipedia entry for Chur.

Posts and post metadata from the XDA developer forum
----------------------------------------------------

The annotation rules below extract posts from the XDA developer forum, together with metadata on each post's time, user and the user's job title.

.. code-block:: json

   {
      "article#class=message-body": ["article"],
      "li#class=u-concealed": ["time"],
      "#itemprop=name": ["user-name"],
      "#itemprop=jobTitle": ["user-title"]
   }

The figure illustrates the annotated metadata on posts from the XDA developer forum.

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/xda-posts-annotation.png
   :alt: Posts and post metadata extracted from the XDA developer forum.

Code and metadata from Stackoverflow pages
------------------------------------------

The rules below extract code and metadata on users and comments from Stackoverflow pages.

.. code-block:: json

   {
      "code": ["code"],
      "#itemprop=dateCreated": ["creation-date"],
      "#class=user-details": ["user"],
      "#class=reputation-score": ["reputation"],
      "#class=comment-date": ["comment-date"],
      "#class=comment-copy": ["comment-comment"]
   }

Applying these rules to a Stackoverflow page on text extraction from HTML yields the following snippet:

.. figure:: https://github.com/weblyzard/inscriptis/raw/master/docs/images/stackoverflow-code-annotation.png
   :alt: Code and metadata from Stackoverflow pages.

Advanced topics
===============

Annotated text
--------------

Inscriptis can provide annotations alongside the extracted text, which allows downstream components to draw upon semantics that would otherwise be available only in the original HTML file. The extracted text and annotations can be exported in different formats, including the popular JSONL format which is used by `doccano `_.

Example output:

.. code-block:: json

   {"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
             of the Grisons and lies in the Grisonian Rhine Valley.",
    "label": [[0, 4, "heading"], [0, 4, "h1"], [6, 10, "emphasis"]]}

Inscriptis produces the output above when it is run with the following annotation rules:

.. code-block:: json

   {
      "h1": ["heading", "h1"],
      "b": ["emphasis"]
   }

The code below demonstrates how inscriptis' annotation capabilities can be used within a program:

.. code-block:: python

   import urllib.request

   from inscriptis import get_annotated_text
   from inscriptis.model.config import ParserConfig

   url = "https://www.fhgr.ch"
   html = urllib.request.urlopen(url).read().decode('utf-8')

   # map HTML tags to the annotation labels they should produce
   rules = {'h1': ['heading', 'h1'],
            'h2': ['heading', 'h2'],
            'b': ['emphasis'],
            'table': ['table']
           }

   output = get_annotated_text(html, ParserConfig(annotation_rules=rules))

   print("Text:", output['text'])
   print("Annotations:", output['label'])

Fine tuning
-----------

The following options are available for fine tuning inscriptis' HTML rendering:

1. **More rigorous indentation:** call ``inscriptis.get_text()`` with the parameter ``indentation='extended'`` to also use indentation for tags such as ``<div>`` and ``<span>`` that do not provide indentation in their standard definition. This strategy is the default in ``inscript`` and many other tools such as Lynx. If you do not want extended indentation, use the parameter ``indentation='standard'`` instead.

2. **Overwriting the default CSS definition:** inscriptis uses CSS definitions that are maintained in ``inscriptis.css.CSS`` for rendering HTML tags. You can override these definitions (and therefore change the rendering) as outlined below:

   .. code-block:: python

      from lxml.html import fromstring

      from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
      from inscriptis.html_engine import Inscriptis
      from inscriptis.html_properties import Display
      from inscriptis.model.config import ParserConfig

      # create a custom CSS based on the default style sheet and change the
      # rendering of `div` and `span` elements
      css = CSS_PROFILES['strict'].copy()
      css['div'] = HtmlElement(display=Display.block, padding=2)
      css['span'] = HtmlElement(prefix=' ', suffix=' ')

      # example input; replace with the HTML document to convert
      html = '<div>first</div><span>second</span>'
      html_tree = fromstring(html)

      # create a parser using a custom css
      config = ParserConfig(css=css)
      parser = Inscriptis(html_tree, config)
      text = parser.get_text()

Custom HTML tag handling
------------------------

If the fine-tuning options discussed above are not sufficient, you may even override Inscriptis' handling of start and end tags as outlined below:

.. code-block:: python

   from inscriptis import ParserConfig
   from inscriptis.html_engine import Inscriptis
   from inscriptis.model.tag import CustomHtmlTagHandlerMapping

   # `my_handle_start_a`, `my_handle_end_a` and `html_tree` need to be
   # defined before running this snippet
   my_mapping = CustomHtmlTagHandlerMapping(
       start_tag_mapping={'a': my_handle_start_a},
       end_tag_mapping={'a': my_handle_end_a}
   )
   parser = Inscriptis(html_tree,
                       ParserConfig(custom_html_tag_handler_mapping=my_mapping))
   text = parser.get_text()

In this example, the standard HTML handlers for the ``a`` tag are overwritten with custom versions (i.e., ``my_handle_start_a`` and ``my_handle_end_a``).
You may define custom handlers for any tag, regardless of whether it already exists in the standard mapping.

Please refer to `custom-html-handling.py `_ for a working example. The standard HTML tag handlers can be found in the `inscriptis.model.tag `_ package.

Optimizing memory consumption
-----------------------------

Inscriptis uses the Python lxml library, which prefers to reuse memory rather than release it to the operating system. This behavior might lead to increased memory consumption if you use inscriptis within a Web service that parses very complex HTML pages.

The following code mitigates this problem on Unix systems that provide glibc (``libc.so.6``) by manually forcing the release of the allocated memory:

.. code-block:: python

   import ctypes


   def trim_memory() -> int:
       # ask the C library to return freed memory to the operating system
       libc = ctypes.CDLL("libc.so.6")
       return libc.malloc_trim(0)

Examples
========

Strict indentation handling
---------------------------

The following example demonstrates modifying ``ParserConfig`` for strict indentation handling.

.. code-block:: python

   from inscriptis import get_text
   from inscriptis.css_profiles import CSS_PROFILES
   from inscriptis.model.config import ParserConfig

   # use the strict CSS profile instead of the default one
   config = ParserConfig(css=CSS_PROFILES['strict'].copy())
   text = get_text('first', config)
   print(text)

Ignore elements during parsing
------------------------------

Overwriting the default CSS profile also allows changing the rendering of selected elements. The snippet below, for example, removes forms from the parsed text by setting the definition of the ``form`` tag to ``Display.none``.

.. code-block:: python

   from inscriptis import get_text
   from inscriptis.css_profiles import CSS_PROFILES, HtmlElement
   from inscriptis.html_properties import Display
   from inscriptis.model.config import ParserConfig

   # create a custom CSS based on the strict style sheet and hide
   # `form` elements
   css = CSS_PROFILES['strict'].copy()
   css['form'] = HtmlElement(display=Display.none)

   # create a parser configuration using a custom css
   html = """First line. """
   config = ParserConfig(css=css)
   text = get_text(html, config)
   print(text)

Citation
========

There is a `Journal of Open Source Software `_ `paper `_ you can cite for Inscriptis:

.. code-block:: bibtex

   @article{Weichselbraun2021,
     doi = {10.21105/joss.03557},
     url = {https://doi.org/10.21105/joss.03557},
     year = {2021},
     publisher = {The Open Journal},
     volume = {6},
     number = {66},
     pages = {3557},
     author = {Albert Weichselbraun},
     title = {Inscriptis - A Python-based HTML to text conversion library optimized for knowledge extraction from the Web},
     journal = {Journal of Open Source Software}
   }

Changelog
=========

A full list of changes can be found in the `release notes `_.