Inscriptis module documentation¶
Parse HTML content and convert it into a text representation.
Inscriptis provides support for
nested HTML tables
basic Cascading Style Sheets
annotations
The following example provides the text representation of
https://www.fhgr.ch using the method inscriptis.get_text().
import urllib.request
from inscriptis import get_text
url = 'https://www.fhgr.ch'
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
Use the method inscriptis.get_annotated_text() to obtain text and
annotations. The method requires annotation rules as described in annotations.
import urllib.request
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig
url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
# annotation rules specify the HTML elements and attributes to annotate.
rules = {'h1': ['heading'],
         'h2': ['heading'],
         '#class=FactBox': ['fact-box'],
         'i': ['emphasis']}
# wrap the rules in a ParserConfig and pass it as the second argument.
output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])
The method returns a dictionary with two keys:
- text, which contains the page’s plain text, and
- label, which contains the annotations in the JSONL format used by annotators such as doccano.
Annotations in the label field are returned as a list of triples with
start index, end index and label as indicated below:
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
of the Grisons and lies in the Grisonian Rhine Valley.",
"label": [[0, 4, "heading"], [6, 10, "emphasis"]]}
- inscriptis.get_annotated_text(html_content: str, config: ParserConfig | None = None) dict[str, Any][source]¶
Return a dictionary of the extracted text and annotations.
Notes
the text is stored under the key ‘text’.
annotations are provided under the key ‘label’ which contains a list of Annotations.
Examples
- {"text": "EU rejects German call to boycott British lamb.",
"label": [[0, 2, "strong"], …]}
- {"text": "Peter Blackburn",
"label": [[0, 15, "heading"]]}
- Returns:
A dictionary of text (key: ‘text’) and annotations (key: ‘label’).
- Return type:
dict[str, Any]
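As a rough illustration of the text/label layout described above, the following sketch serializes one record as a JSONL line and reads the labelled spans back. It uses only the standard library and does not call Inscriptis; the helper names to_jsonl_line and labelled_spans are hypothetical, and the end index is assumed to be exclusive, which matches the Chur example.

```python
import json

# Sample record in the text/label layout (values taken from the Chur example).
record = {
    "text": "Chur\n\nChur is the capital and largest town of the Swiss canton "
            "of the Grisons and lies in the Grisonian Rhine Valley.",
    "label": [[0, 4, "heading"], [6, 10, "emphasis"]],
}


def to_jsonl_line(rec: dict) -> str:
    """Serialize one record as a single JSONL line (hypothetical helper)."""
    return json.dumps(rec, ensure_ascii=False)


def labelled_spans(rec: dict) -> list[tuple[str, str]]:
    """Return (label, covered text) pairs, assuming exclusive end indices."""
    return [(label, rec["text"][start:end]) for start, end, label in rec["label"]]


line = to_jsonl_line(record)
restored = json.loads(line)
print(labelled_spans(restored))
```

Slicing with text[start:end] is a quick way to check that annotation offsets still line up with the extracted text after any post-processing.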
- inscriptis.get_text(html_content: str, config: ParserConfig | None = None) str[source]¶
Provide a text representation of the given HTML content.
- Parameters:
html_content (str) – The HTML content to convert.
config – An optional ParserConfig object.
- Returns:
The text representation of the HTML content.
Inscriptis model¶
Inscriptis HTML engine¶
The HTML Engine is responsible for converting HTML to text.
- class inscriptis.html_engine.Inscriptis(html_tree: lxml.html.HtmlElement, config: ParserConfig = None)[source]¶
Translate an lxml HTML tree to the corresponding text representation.
- Parameters:
html_tree – the lxml HTML tree to convert.
config – an optional ParserConfig configuration object.
Example:
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
html_content = "<html><body><h1>Test</h1></body></html>"
# create an HTML tree from the HTML content.
html_tree = fromstring(html_content)
# transform the HTML tree to text.
parser = Inscriptis(html_tree)
text = parser.get_text()
- get_annotations() list[Annotation][source]¶
Return the annotations extracted from the HTML page.
Inscriptis HTML properties¶
Provide properties used for rendering HTML pages.
Supported attributes:
Display properties.
WhiteSpace properties.
HorizontalAlignment properties.
VerticalAlignment properties.
- class inscriptis.html_properties.Display(*values)[source]¶
Specify whether content will be rendered as inline, block or none.
Note
A display attribute of none indicates that the content should not be rendered at all.
- class inscriptis.html_properties.HorizontalAlignment(*values)[source]¶
Specify the content’s horizontal alignment.
- center = '^'¶
Center the block’s content.
- left = '<'¶
Left alignment of the block’s content.
- right = '>'¶
Right alignment of the block’s content.
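The values '<', '^' and '>' coincide with the alignment characters of Python's format specification mini-language. The sketch below shows how such a value could be used to pad a line to a given width; align_line is a hypothetical helper, and whether Inscriptis formats lines this way internally is an assumption.

```python
def align_line(content: str, width: int, align: str) -> str:
    """Pad content to width using a format-spec alignment character."""
    if align not in ("<", "^", ">"):
        raise ValueError(f"unsupported alignment: {align!r}")
    # The nested replacement fields build a spec such as "^6".
    return f"{content:{align}{width}}"


for symbol in ("<", "^", ">"):
    print(repr(align_line("ab", 6, symbol)))
```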
- class inscriptis.html_properties.VerticalAlignment(*values)[source]¶
Specify the content’s vertical alignment.
- bottom = 3¶
Align all content at the bottom.
- middle = 2¶
Align all content in the middle.
- top = 1¶
Align all content at the top.
- class inscriptis.html_properties.WhiteSpace(*values)[source]¶
Specify the HTML element’s whitespace handling.
Inscriptis supports the following handling strategies outlined in the Cascading Style Sheets specification.
- normal = 1¶
Collapse multiple whitespaces into a single one.
- pre = 3¶
Preserve sequences of whitespaces.
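The difference between the two strategies can be sketched with a regular expression. WhiteSpaceMode and apply_whitespace are hypothetical stand-ins written for this illustration; they mirror the member names and values listed above but are not the library's implementation.

```python
import re
from enum import Enum


class WhiteSpaceMode(Enum):
    """Stand-in for the two strategies listed above (hypothetical name)."""
    normal = 1
    pre = 3


def apply_whitespace(text: str, mode: WhiteSpaceMode) -> str:
    """Collapse whitespace runs for normal; leave text untouched for pre."""
    if mode is WhiteSpaceMode.pre:
        return text
    return re.sub(r"\s+", " ", text)


sample = "a  \n\t b"
print(repr(apply_whitespace(sample, WhiteSpaceMode.normal)))
print(repr(apply_whitespace(sample, WhiteSpaceMode.pre)))
```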
Inscriptis ParserConfig¶
Configure Inscriptis HTML rendering.
- class inscriptis.model.config.ParserConfig(css: dict[str, HtmlElement] | None = None, display_images: bool = False, deduplicate_captions: bool = False, display_links: bool = False, display_anchors: bool = False, annotation_rules: dict[str, list[str]] | None = None, table_cell_separator: str = ' ', custom_html_tag_handler_mapping: CustomHtmlTagHandlerMapping = None)[source]¶
The ParserConfig class allows fine-tuning the HTML rendering. It encapsulates:
CSS definitions (from inscriptis.css_profiles or custom definitions).
configuration options for handling images, captions, links, etc.
annotation rules, if Inscriptis is used for annotating text.
custom HTML tag handlers.
- css¶
An optional custom CSS definition.
- display_images¶
Whether to include image titles/alt texts.
- deduplicate_captions¶
Whether to deduplicate captions such as image titles (many newspapers include images and video previews with identical titles).
- display_links¶
Whether to display link targets (e.g. [Python](https://www.python.org)).
- display_anchors¶
Whether to display anchors (e.g. [here](#here)).
- annotation_rules¶
An optional dictionary of annotation rules which specify the tags and attributes to annotate.
- table_cell_separator¶
Separator to use between table cells.
- custom_html_tag_handler_mapping¶
An optional CustomHtmlTagHandlerMapping.
The following example demonstrates how ParserConfig is used to
enable the strict CSS profile and
prevent links from being shown.
from inscriptis import get_text
from inscriptis.css_profiles import CSS_PROFILES
from inscriptis.model.config import ParserConfig
css_profile = CSS_PROFILES['strict'].copy()
config = ParserConfig(css=css_profile, display_links=False)
text = get_text('fi<span>r</span>st <a href="/first">link</a>', config)
print(text)
Inscriptis CSS model¶
Implement basic CSS support for inscriptis.
The HtmlElement class encapsulates all CSS properties of a single HTML element.
CssParse parses CSS specifications and translates them into the corresponding HtmlElements used by Inscriptis for rendering HTML pages.
- class inscriptis.model.css.CssParse[source]¶
Parse CSS specifications and apply them to HtmlElements.
The attribute display: none, for instance, is translated to HtmlElement.display=Display.none.
- static attr_display(value: str, html_element: HtmlElement)[source]¶
Apply the given display value.
- static attr_horizontal_align(value: str, html_element: HtmlElement)[source]¶
Apply the provided horizontal alignment.
- static attr_margin_after(value: str, html_element: HtmlElement)¶
Apply the provided bottom margin.
- static attr_margin_before(value: str, html_element: HtmlElement)¶
Apply the given top margin.
- static attr_margin_bottom(value: str, html_element: HtmlElement)[source]¶
Apply the provided bottom margin.
- static attr_margin_top(value: str, html_element: HtmlElement)[source]¶
Apply the given top margin.
- static attr_padding_left(value: str, html_element: HtmlElement)[source]¶
Apply the given left padding.
- static attr_padding_start(value: str, html_element: HtmlElement)¶
Apply the given left padding.
- static attr_style(style_attribute: str, html_element: HtmlElement)[source]¶
Apply the provided style attributes to the given HtmlElement.
- Parameters:
style_attribute – The attribute value of the given style sheet. Example: display: none
html_element – The HtmlElement to which the given style is applied.
- static attr_vertical_align(value: str, html_element: HtmlElement)[source]¶
Apply the given vertical alignment.
- static attr_white_space(value: str, html_element: HtmlElement)[source]¶
Apply the given white-space value.
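The attr_* methods above suggest that each CSS property maps to a handler whose name is derived from the property name. The following sketch splits a style attribute into directives and derives such handler names. parse_style and handler_names are hypothetical helpers, and the dispatch scheme is an assumption based only on the method names listed here.

```python
def parse_style(style_attribute: str) -> dict[str, str]:
    """Split a style attribute into {handler_suffix: value} pairs."""
    directives = {}
    for directive in style_attribute.lower().split(";"):
        if ":" not in directive:
            continue  # skip empty or malformed fragments
        key, value = (part.strip() for part in directive.split(":", 1))
        # CSS uses hyphens; Python method names use underscores.
        directives[key.replace("-", "_")] = value
    return directives


def handler_names(style_attribute: str) -> list[str]:
    """Derive the attr_* method name that would handle each directive."""
    return ["attr_" + key for key in parse_style(style_attribute)]


print(parse_style("display: none; white-space: pre;"))
print(handler_names("vertical-align: top"))
```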
Inscriptis canvas model¶
Classes used for rendering (parts of) the canvas.
Every parsed HtmlElement writes its
textual content to the canvas which is managed by the following three classes:
- class inscriptis.model.canvas.Canvas[source]¶
The text Canvas on which Inscriptis writes the HTML page.
- margin¶
the current margin to the previous block (this is required to ensure that the margin_after and margin_before constraints of HTML block elements are met).
- blocks¶
a list of strings containing the completed blocks (i.e., text lines). Each block spans at least one line.
- annotations¶
the list of recorded Annotations.
- _open_annotations¶
a map of open tags that contain annotations.
- close_block(tag: HtmlElement) None[source]¶
Close the given HtmlElement by writing its bottom margin.
- Parameters:
tag – the HTML Block element to close
- close_tag(tag: HtmlElement) None[source]¶
Register that the given tag is closed.
- Parameters:
tag – the tag to close.
- flush_inline() bool[source]¶
Attempt to flush the content in self.current_block into a new block.
Notes
If self.current_block does not contain any content (or only whitespace), no changes are made.
Otherwise the content of current_block is added to blocks and a new current_block is initialized.
- Returns:
True if the attempt was successful, False otherwise.
- property left_margin: int¶
Return the length of the current line’s left margin.
- open_block(tag: HtmlElement) None[source]¶
Open an HTML block element.
- open_tag(tag: HtmlElement) None[source]¶
Register that a tag is opened.
- Parameters:
tag – the tag to open.
- write(tag: HtmlElement, text: str, whitespace: WhiteSpace = None) None[source]¶
Write the given text to the current block.
Representation of a text block within the HTML canvas.
- class inscriptis.model.canvas.block.Block(idx: int, prefix: Prefix)[source]¶
The current block of text.
A block usually refers to one line of output text.
Note
If pre-formatted content is merged with a block, it may also contain multiple lines.
- Parameters:
idx – the current block’s start index.
prefix – prefix used within the current block.
- merge(text: str, whitespace: WhiteSpace) None[source]¶
Merge the given text with the current block.
- Parameters:
text – the text to merge.
whitespace – whitespace handling.
- merge_normal_text(text: str) None[source]¶
Merge the given text with the current block.
- Parameters:
text – the text to merge
Note
If the previous text ended with a whitespace and text starts with one, both will automatically collapse into a single whitespace.
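The collapsing behaviour described in the note can be sketched as follows. merge_collapsing is a hypothetical helper that works on plain strings rather than on a Block object, so it only illustrates the rule and not the library's code.

```python
import re


def merge_collapsing(buffer: str, text: str) -> str:
    """Append text to buffer, collapsing whitespace within and at the seam."""
    normalized = re.sub(r"\s+", " ", text)
    # A trailing space in the buffer and a leading space in the new text
    # would otherwise produce two consecutive spaces.
    if buffer.endswith(" "):
        normalized = normalized.lstrip(" ")
    return buffer + normalized


print(repr(merge_collapsing("Hello ", "  world")))
```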
Manage the horizontal prefix (left-indentation, bullets) of canvas lines.
- class inscriptis.model.canvas.prefix.Prefix[source]¶
Class Prefix manages paddings and bullets that prefix an HTML block.
- current_padding¶
the number of characters used for the current left-indentation.
- paddings¶
the list of paddings for the current and all previous tags.
- bullets¶
the list of bullets in the current and all previous tags.
- consumed¶
whether the current bullet has already been consumed.
- property first: str¶
Return the prefix used at the beginning of a tag.
Note
A new block needs to be prefixed by the current padding and bullet. Once this has happened (i.e., consumed is set to True) no further prefixes should be used for a line.
- register_prefix(padding_inline: int, bullet: str) None[source]¶
Register the given prefix.
- Parameters:
padding_inline – the number of characters used for padding_inline
bullet – an optional bullet.
- property rest: str¶
Return the prefix used for new lines within a block.
This prefix is used for pre-text that contains newlines. The lines need to be prefixed with the right padding to preserve the indentation.
- property unconsumed_bullet: str¶
Yield any yet unconsumed bullet.
Note
This function yields the previous element’s bullets, if they have not been consumed yet.
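The interplay between the first and rest prefixes can be illustrated with a small sketch: the first line of a block receives the padding together with the bullet, while continuation lines receive the padding only. prefix_lines is a hypothetical helper, and placing the bullet inside the padding (rather than after it) is an assumption made for this example.

```python
def prefix_lines(text: str, padding: int, bullet: str = "") -> str:
    """Prefix the first line with padding plus bullet, later lines with padding."""
    # Assumption: the bullet occupies the last characters of the padding.
    first = " " * max(padding - len(bullet), 0) + bullet
    rest = " " * padding
    lines = text.split("\n")
    return "\n".join([first + lines[0]] + [rest + line for line in lines[1:]])


print(prefix_lines("one\ntwo", padding=4, bullet="* "))
```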
Inscriptis HTML Element¶
The HtmlElement class controls how Inscriptis interprets HTML Elements.
The module inscriptis.css_profiles contains CSS profiles which assign to each standard HTML tag the corresponding HtmlElement.
As in standard GUI browsers, CSS definitions within the parsed HTML modify the HtmlElement and its interpretation.
- class inscriptis.model.html_element.HtmlElement(tag: str = 'default', prefix: str = '', suffix: str = '', display: Display = Display.inline, margin_before: int = 0, margin_after: int = 0, padding_inline: int = 0, list_bullet: str = '', whitespace: WhiteSpace = WhiteSpace.normal, limit_whitespace_affixes: bool = False, align: HorizontalAlignment = HorizontalAlignment.left, valign: VerticalAlignment = VerticalAlignment.middle, annotation: tuple[str] = ())[source]¶
The HtmlElement class stores properties and metadata of HTML elements.
- canvas¶
the canvas to which the HtmlElement writes its content.
- tag¶
tag name of the given HtmlElement.
- prefix¶
a prefix to insert before the tag’s content.
- suffix¶
a suffix to append after the tag’s content.
- display¶
the Display strategy used for the content.
- margin_before¶
vertical margin before the tag’s content.
- margin_after¶
vertical margin after the tag’s content.
- padding_inline¶
horizontal padding_inline before the tag’s content.
- whitespace¶
the WhiteSpace handling strategy.
- limit_whitespace_affixes¶
limit printing of whitespace affixes to elements with normal whitespace handling.
- align¶
the element’s horizontal alignment.
- valign¶
the element’s vertical alignment.
- previous_margin_after¶
the margin after of the previous HtmlElement.
- annotation¶
annotations associated with the HtmlElement.
- get_refined_html_element(new: HtmlElement) HtmlElement[source]¶
Compute the new HTML element based on the previous one.
- Adaptations:
margin_top: additional margin required when considering margin_bottom of the previous element.
- Parameters:
new – The new HtmlElement to be applied to the current context.
- Returns:
The refined element with the context applied.
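The margin adaptation mentioned above can be read as follows: the previous element's bottom margin already provides some vertical space, so only the remainder of the new element's top margin has to be added. The sketch below illustrates that reading with two hypothetical helpers; it is an interpretation of the description, not the library's code.

```python
def additional_margin(margin_before: int, previous_margin_after: int) -> int:
    """Blank lines still required once the previous bottom margin is counted."""
    return max(margin_before - previous_margin_after, 0)


def join_blocks(first: str, second: str, margin_after: int, margin_before: int) -> str:
    """Join two blocks, separating them by the larger of the two margins."""
    gap = margin_after + additional_margin(margin_before, margin_after)
    # One newline ends the first block; each further newline is a blank line.
    return first + "\n" * (gap + 1) + second


print(repr(join_blocks("A", "B", margin_after=1, margin_before=1)))
print(repr(join_blocks("A", "B", margin_after=1, margin_before=2)))
```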
Inscriptis table model¶
Classes used for representing Tables, TableRows and TableCells.
- class inscriptis.model.table.Table(left_margin_len: int, cell_separator: str)[source]¶
An HTML table.
- rows¶
the table’s rows.
- left_margin_len¶
length of the left margin before the table.
- cell_separator¶
string used for separating cells from each other.
- add_cell(table_cell: TableCell)[source]¶
Add a new TableCell to the table’s last row.
Note
If no row exists yet, a new row is created.
- get_annotations(idx: int, left_margin_len: int) list[Annotation][source]¶
Return all annotations in the given table.
- Parameters:
idx – the table’s start index.
left_margin_len – len of the left margin (required for adapting the position of annotations).
- Returns:
A list of all Annotations present in the table.
- class inscriptis.model.table.TableCell(align: HorizontalAlignment, valign: VerticalAlignment)[source]¶
A table cell.
- line_width¶
the original line widths per line (required to adjust annotations after a reformatting)
- vertical_padding¶
vertical padding that has been introduced due to vertical formatting rules.
- get_annotations(idx: int, row_width: int) list[Annotation][source]¶
Return a list of all annotations within the TableCell.
- Returns:
A list of annotations that have been adjusted to the cell’s position.
- property height: int¶
Compute the table cell’s height.
- Returns:
The cell’s current height.
- normalize_blocks() int[source]¶
Split multi-line blocks into multiple one-line blocks.
- Returns:
The height of the normalized cell.
- property width: int¶
Compute the table cell’s width.
- Returns:
The cell’s current width.
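The relation between normalized blocks, height and width can be sketched on plain lists of strings. The helpers split_blocks, cell_height and cell_width are hypothetical and operate on a list instead of a TableCell, so they only illustrate the idea of treating every output line as its own block.

```python
def split_blocks(blocks: list[str]) -> list[str]:
    """Split multi-line blocks into one-line blocks."""
    return [line for block in blocks for line in block.split("\n")]


def cell_height(blocks: list[str]) -> int:
    """Number of lines the cell occupies after splitting."""
    return len(split_blocks(blocks))


def cell_width(blocks: list[str]) -> int:
    """Length of the longest line; 0 for an empty cell."""
    return max((len(line) for line in split_blocks(blocks)), default=0)


cell = ["a\nbb", "ccc"]
print(split_blocks(cell), cell_height(cell), cell_width(cell))
```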
Inscriptis annotations¶
The model used for saving annotations.
- class inscriptis.annotation.Annotation(start: int, end: int, metadata: str)[source]¶
An Inscriptis annotation which provides metadata on the extracted text.
The start and end indices indicate the span of the text to which the metadata refers, and the attribute metadata contains the tuple of tags describing this span.
Example:
Annotation(0, 10, ('heading', ))
The annotation above indicates that the text span between the 1st (index 0) and 11th (index 10) character of the extracted text contains a heading.
- end: int¶
the annotation’s end index within the text output.
- metadata: str¶
the tag to be attached to the annotation.
- start: int¶
the annotation’s start index within the text output.
- inscriptis.annotation.horizontal_shift(annotations: list[Annotation], content_width: int, line_width: int, align: HorizontalAlignment, shift: int = 0) list[Annotation][source]¶
Shift annotations based on the given line’s formatting.
Adjusts the start and end indices of annotations based on the line’s formatting and width.
- Parameters:
annotations – a list of Annotations.
content_width – the width of the actual content
line_width – the width of the line in which the content is placed.
align – the horizontal alignment (left, right, center) to assume for the adjustment
shift – an optional additional shift
- Returns:
A list of Annotations with the adjusted start and end positions.
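When content is padded to fit a wider line, annotation offsets have to move by the amount of padding inserted on the left. The following sketch shows one plausible way to compute that offset for the three alignments. Span and shift_spans are hypothetical stand-ins for this illustration, and the floor division for centered content is an assumption chosen to match Python's '^' format alignment.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Minimal stand-in for an annotation (hypothetical type)."""
    start: int
    end: int
    metadata: str


def shift_spans(spans, content_width, line_width, align, shift=0):
    """Shift spans by the left padding implied by the alignment character."""
    free = line_width - content_width
    if align == "<":
        offset = shift
    elif align == ">":
        offset = shift + free
    elif align == "^":
        offset = shift + free // 2
    else:
        raise ValueError(f"unsupported alignment: {align!r}")
    return [Span(s.start + offset, s.end + offset, s.metadata) for s in spans]


content = "ab"
line = f"{content:^6}"  # pad the content to a line width of 6
(shifted,) = shift_spans([Span(0, 2, "emphasis")], len(content), len(line), "^")
print(shifted, repr(line[shifted.start:shifted.end]))
```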
Annotation processors¶
AnnotationProcessors transform annotations to an output format.
All AnnotationProcessors implement the AnnotationProcessor interface
by overriding the class’s AnnotationProcessor.__call__() method.
Note
The AnnotationExtractor class must be put into a package with the extractor’s name (e.g., inscriptis.annotation.output.*package*) and be named *PackageExtractor* (see the examples below).
The overridden __call__() method may either extend the original dictionary which contains the extracted text and annotations (e.g., SurfaceExtractor) or may replace it with a custom output (e.g., HtmlExtractor and XmlExtractor).
Currently, Inscriptis supports the following built-in AnnotationProcessors:
HtmlExtractor provides an annotated HTML output format.
XmlExtractor yields an output which marks annotations with XML tags.
SurfaceExtractor adds the key surface to the result dictionary which contains the surface forms of the extracted annotations.
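To illustrate the idea of a processor that replaces the result dictionary with a custom output, the sketch below defines a callable class that wraps each annotated span in a tag. TagMarkerSketch is a hypothetical class that does not inherit from AnnotationProcessor and does not reproduce the output of XmlExtractor; it assumes non-overlapping spans with exclusive end indices and performs no escaping.

```python
class TagMarkerSketch:
    """Hypothetical processor that returns tagged text instead of a dict."""

    def __call__(self, annotated_text: dict) -> str:
        text = annotated_text["text"]
        # Work from the last span to the first so earlier offsets stay valid.
        for start, end, label in sorted(annotated_text["label"], reverse=True):
            text = f"{text[:start]}<{label}>{text[start:end]}</{label}>{text[end:]}"
        return text


sample = {"text": "Chur is a town", "label": [[0, 4, "heading"], [10, 14, "emphasis"]]}
print(TagMarkerSketch()(sample))
```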
Inscriptis CSS profiles¶
Standard CSS profiles shipped with inscriptis.
CSS profiles are used together with
inscriptis.model.config.ParserConfig to customize
the HTML to text conversion.
- inscriptis.css_profiles.RELAXED_CSS_PROFILE¶
A relaxed CSS profile optimized for content extraction and text analytics.
- inscriptis.css_profiles.STRICT_CSS_PROFILE¶
A CSS profile that corresponds to the defaults used by the Firefox browser.