Inscriptis module documentation
Parses HTML content and converts it into a text representation.
Inscriptis provides support for
- nested HTML tables
- basic Cascading Style Sheets
- annotations
The following example provides the text representation of https://www.fhgr.ch.
import urllib.request
from inscriptis import get_text
url = 'https://www.fhgr.ch'
html = urllib.request.urlopen(url).read().decode('utf-8')
text = get_text(html)
print(text)
Use the method get_annotated_text() to obtain text and annotations. The method requires annotation rules as described in annotations.
import urllib.request
from inscriptis import get_annotated_text
from inscriptis.model.config import ParserConfig
url = "https://www.fhgr.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')
# annotation rules specify the HTML elements and attributes to annotate.
rules = {'h1': ['heading'],
         'h2': ['heading'],
         '#class=FactBox': ['fact-box'],
         'i': ['emphasis']}
# wrap the rules in a ParserConfig and pass it to get_annotated_text.
output = get_annotated_text(html, ParserConfig(annotation_rules=rules))
print("Text:", output['text'])
print("Annotations:", output['label'])
The method returns a dictionary with two keys:
- text, which contains the page’s plain text, and
- label, which contains the annotations in the JSONL format used by annotators such as doccano.
Annotations in the label field are returned as a list of triples with start index, end index and label as indicated below:
{"text": "Chur\n\nChur is the capital and largest town of the Swiss canton
of the Grisons and lies in the Grisonian Rhine Valley.",
"label": [[0, 4, "heading"], [6, 10, "emphasis"]]}
- inscriptis.get_annotated_text(html_content: str, config: ParserConfig = None) → Dict[str, Any]
Return a dictionary of the extracted text and annotations.
Notes
- the text is stored under the key 'text'.
- annotations are provided under the key 'label', which contains a list of Annotations.
Examples
- {"text": "EU rejects German call to boycott British lamb.", "label": [[0, 2, "strong"], …]}
- {"text": "Peter Blackburn", "label": [[0, 15, "heading"]]}
- Returns:
A dictionary of text (key: 'text') and annotations (key: 'label').
- Return type:
Dict[str, Any]
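Because the result is a plain dictionary, writing one JSON object per line yields a JSONL layout. The following is a minimal sketch that uses only the standard library. The helper name to_jsonl and the sample documents are illustrative assumptions and not part of Inscriptis.

```python
import json
from typing import Any, Dict, Iterable


def to_jsonl(documents: Iterable[Dict[str, Any]]) -> str:
    """Serialize annotated documents as JSONL (one JSON object per line)."""
    return "\n".join(json.dumps(doc, ensure_ascii=False) for doc in documents)


# Sample results shaped like the dictionary returned by get_annotated_text().
documents = [
    {"text": "Chur\n\nChur is the capital of the Grisons.",
     "label": [[0, 4, "heading"], [6, 10, "emphasis"]]},
    {"text": "Peter Blackburn", "label": [[0, 15, "heading"]]},
]

jsonl = to_jsonl(documents)

# Read the lines back and print the text span behind every annotation.
for line in jsonl.splitlines():
    record = json.loads(line)
    for start, end, label in record["label"]:
        print(label, repr(record["text"][start:end]))
```

Newlines inside the text are escaped by json.dumps, so every document stays on a single line.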
- inscriptis.get_text(html_content: str, config: ParserConfig = None) → str
Provide a text representation of the given HTML content.
- Parameters:
html_content (str) – The HTML content to convert.
config – An optional ParserConfig object.
- Returns:
The text representation of the HTML content.
Inscriptis model
Inscriptis HTML engine
The HTML Engine is responsible for converting HTML to text.
- class inscriptis.html_engine.Inscriptis(html_tree: HtmlElement, config: ParserConfig = None)[source]¶
Translate an lxml HTML tree to the corresponding text representation.
- Parameters:
html_tree – the lxml HTML tree to convert.
config – an optional ParserConfig configuration object.
Example:
from lxml.html import fromstring
from inscriptis.html_engine import Inscriptis
html_content = "<html><body><h1>Test</h1></body></html>"
# create an HTML tree from the HTML content.
html_tree = fromstring(html_content)
# transform the HTML tree to text.
parser = Inscriptis(html_tree)
text = parser.get_text()
- get_annotations() List[Annotation] [source]¶
Return the annotations extracted from the HTML page.
Inscriptis HTML properties
Provide properties used for rendering HTML pages.
Supported attributes:
- Display properties.
- WhiteSpace properties.
- HorizontalAlignment properties.
- VerticalAlignment properties.
- class inscriptis.html_properties.Display(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Specify whether content will be rendered as inline, block or none.
Note
A display attribute of none indicates that the content should not be rendered at all.
- class inscriptis.html_properties.HorizontalAlignment(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Specify the content’s horizontal alignment.
- center = '^'¶
Center the block’s content.
- left = '<'¶
Left alignment of the block’s content.
- right = '>'¶
Right alignment of the block’s content.
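The three values coincide with the alignment characters of Python's format specification mini-language. The sketch below mirrors the enumeration locally and shows how such values could be used to pad a line. The helper align_line is hypothetical, and this is not claimed to be how Inscriptis pads text internally.

```python
from enum import Enum


class HorizontalAlignment(Enum):
    """Local mirror of the documented values (illustration only)."""
    left = "<"
    right = ">"
    center = "^"


def align_line(text: str, width: int, align: HorizontalAlignment) -> str:
    """Pad text to the given width using the alignment's format character."""
    return f"{text:{align.value}{width}}"


for alignment in HorizontalAlignment:
    print(repr(align_line("ab", 6, alignment)))
```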
- class inscriptis.html_properties.VerticalAlignment(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Specify the content’s vertical alignment.
- bottom = 3¶
Align all content at the bottom.
- middle = 2¶
Align all content in the middle.
- top = 1¶
Align all content at the top.
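As an illustration of what the three vertical alignments mean for a block of lines, the following sketch pads a list of lines with empty lines until it reaches a target height. The enumeration is mirrored locally and pad_vertically is a hypothetical helper. The placement of an odd leftover line in the middle case is an assumption of this sketch.

```python
from enum import Enum
from typing import List


class VerticalAlignment(Enum):
    """Local mirror of the documented values (illustration only)."""
    top = 1
    middle = 2
    bottom = 3


def pad_vertically(lines: List[str], height: int,
                   valign: VerticalAlignment) -> List[str]:
    """Add empty lines so that the content occupies `height` lines."""
    missing = max(height - len(lines), 0)
    if valign is VerticalAlignment.top:
        return lines + [""] * missing
    if valign is VerticalAlignment.bottom:
        return [""] * missing + lines
    # middle: split the empty lines; a leftover line goes below the content.
    above = missing // 2
    return [""] * above + lines + [""] * (missing - above)


print(pad_vertically(["cell"], 3, VerticalAlignment.middle))
```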
- class inscriptis.html_properties.WhiteSpace(value, names=None, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Specify the HTML element’s whitespace handling.
Inscriptis supports the following handling strategies outlined in the Cascading Style Sheets specification.
- normal = 1¶
Collapse multiple whitespaces into a single one.
- pre = 3¶
Preserve sequences of whitespaces.
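The difference between the two strategies can be sketched with a regular expression. The enumeration is mirrored locally and apply_whitespace is a hypothetical helper. It only illustrates the two behaviours described above and is not the library's implementation.

```python
import re
from enum import Enum


class WhiteSpace(Enum):
    """Local mirror of the documented values (illustration only)."""
    normal = 1
    pre = 3


def apply_whitespace(text: str, whitespace: WhiteSpace) -> str:
    """Collapse whitespace runs for `normal`; keep the text as-is for `pre`."""
    if whitespace is WhiteSpace.pre:
        return text
    return re.sub(r"\s+", " ", text)


sample = "first  line\n    second line"
print(repr(apply_whitespace(sample, WhiteSpace.normal)))
print(repr(apply_whitespace(sample, WhiteSpace.pre)))
```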
Inscriptis CSS model
Implement basic CSS support for inscriptis.
- The HtmlElement class encapsulates all CSS properties of a single HTML element.
- CssParse parses CSS specifications and translates them into the corresponding HtmlElements used by Inscriptis for rendering HTML pages.
- class inscriptis.model.css.CssParse[source]¶
Parse CSS specifications and apply them to HtmlElements.
The attribute display: none, for instance, is translated to HtmlElement.display=Display.none.
- static attr_horizontal_align(value: str, html_element: HtmlElement)
Apply the provided horizontal alignment.
- static attr_margin_after(value: str, html_element: HtmlElement)¶
Apply the provided bottom margin.
- static attr_margin_before(value: str, html_element: HtmlElement)¶
Apply the given top margin.
- static attr_margin_bottom(value: str, html_element: HtmlElement)[source]¶
Apply the provided bottom margin.
- static attr_padding_left(value: str, html_element: HtmlElement)[source]¶
Apply the given left padding.
- static attr_padding_start(value: str, html_element: HtmlElement)
Apply the given left padding.
- static attr_style(style_attribute: str, html_element: HtmlElement)[source]¶
Apply the provided style attributes to the given HtmlElement.
- Parameters:
style_attribute – The attribute value of the given style sheet. Example: display: none
html_element – The HtmlElement to which the given style is applied.
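To illustrate the kind of input attr_style receives, the sketch below splits a style attribute value into property and value pairs. The function parse_style is a hypothetical helper written for this example. How CssParse actually tokenizes and dispatches the declarations is not shown here.

```python
from typing import Dict


def parse_style(style_attribute: str) -> Dict[str, str]:
    """Split 'key: value; key: value' declarations into a dictionary."""
    properties: Dict[str, str] = {}
    for declaration in style_attribute.split(";"):
        if ":" not in declaration:
            continue  # skip empty or malformed declarations
        key, value = declaration.split(":", 1)
        properties[key.strip().lower()] = value.strip().lower()
    return properties


print(parse_style("display: none; white-space: pre;"))
```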
Inscriptis canvas model
Classes used for rendering (parts of) the canvas.
Every parsed HtmlElement writes its textual content to the canvas, which is managed by the following three classes: Canvas, Block and Prefix.
- class inscriptis.model.canvas.Canvas[source]¶
The text Canvas on which Inscriptis writes the HTML page.
- margin¶
the current margin to the previous block (this is required to ensure that the margin_after and margin_before constraints of HTML block elements are met).
- blocks¶
a list of strings containing the completed blocks (i.e., text lines). Each block spans at least one line.
- annotations¶
the list of recorded Annotations.
- _open_annotations¶
a map of open tags that contain annotations.
- close_block(tag: HtmlElement) None [source]¶
Close the given HtmlElement by writing its bottom margin.
- Parameters:
tag – the HTML Block element to close
- close_tag(tag: HtmlElement) None [source]¶
Register that the given tag is closed.
- Parameters:
tag – the tag to close.
- flush_inline() bool [source]¶
Attempt to flush the content in self.current_block into a new block.
Notes
If self.current_block does not contain any content (or only whitespaces) no changes are made.
Otherwise the content of current_block is added to blocks and a new current_block is initialized.
- Returns:
True if the attempt was successful, False otherwise.
- property left_margin: int¶
Return the length of the current line’s left margin.
- open_tag(tag: HtmlElement) None [source]¶
Register that a tag is opened.
- Parameters:
tag – the tag to open.
- write(tag: HtmlElement, text: str, whitespace: WhiteSpace = None) None [source]¶
Write the given text to the current block.
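The flush_inline() contract described above (no change for empty or whitespace-only content, otherwise move the content to blocks and start a new current block) can be sketched with a heavily simplified stand-in. MiniCanvas is a hypothetical class that stores the current block as a plain string. The real Canvas also tracks margins, prefixes and annotations.

```python
from typing import List


class MiniCanvas:
    """Simplified stand-in for the canvas (illustration only)."""

    def __init__(self) -> None:
        self.blocks: List[str] = []
        self.current_block = ""

    def write(self, text: str) -> None:
        """Append text to the current block."""
        self.current_block += text

    def flush_inline(self) -> bool:
        """Move the current block to `blocks` if it has visible content."""
        if not self.current_block.strip():
            return False  # nothing to flush, state stays unchanged
        self.blocks.append(self.current_block)
        self.current_block = ""
        return True


canvas = MiniCanvas()
canvas.write("   ")
print(canvas.flush_inline(), canvas.blocks)
canvas.write("Hello")
print(canvas.flush_inline(), canvas.blocks)
```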
Representation of a text block within the HTML canvas.
- class inscriptis.model.canvas.block.Block(idx: int, prefix: Prefix)[source]¶
The current block of text.
A block usually refers to one line of output text.
Note
If pre-formatted content is merged with a block, it may also contain multiple lines.
- Parameters:
idx – the current block’s start index.
prefix – prefix used within the current block.
- merge(text: str, whitespace: WhiteSpace) None [source]¶
Merge the given text with the current block.
- Parameters:
text – the text to merge.
whitespace – whitespace handling.
- merge_normal_text(text: str) None [source]¶
Merge the given text with the current block.
- Parameters:
text – the text to merge
Note
If the previous text ended with a whitespace and text starts with one, both will automatically collapse into a single whitespace.
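The collapsing rule at the boundary between the existing content and the new text can be sketched as follows. The function works on plain strings instead of a Block instance and is a hypothetical helper, so it only illustrates the rule in the note above.

```python
import re


def merge_normal_text(current: str, text: str) -> str:
    """Append text to current, collapsing whitespace runs and the boundary."""
    normalized = re.sub(r"\s+", " ", text)
    # Avoid a double space where the old content ends and the new text begins.
    if current.endswith(" ") and normalized.startswith(" "):
        normalized = normalized[1:]
    return current + normalized


print(repr(merge_normal_text("Hello ", " world")))
```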
Manage the horizontal prefix (left-indentation, bullets) of canvas lines.
- class inscriptis.model.canvas.prefix.Prefix[source]¶
Class Prefix manages paddings and bullets that prefix an HTML block.
- current_padding¶
the number of characters used for the current left-indentation.
- paddings¶
the list of paddings for the current and all previous tags.
- bullets¶
the list of bullets in the current and all previous tags.
- consumed¶
whether the current bullet has already been consumed.
- property first: str¶
Return the prefix used at the beginning of a tag.
Note
A new block needs to be prefixed by the current padding and bullet. Once this has happened (i.e., consumed is set to True), no further prefixes should be used for a line.
- register_prefix(padding_inline: int, bullet: str) None [source]¶
Register the given prefix.
- Parameters:
padding_inline – the number of characters used for padding
bullet – an optional bullet.
- property rest: str¶
Return the prefix used for new lines within a block.
This prefix is used for pre-text that contains newlines. The lines need to be prefixed with the right padding to preserve the indentation.
- property unconsumed_bullet: str¶
Yield any yet unconsumed bullet.
Note
This function yields the previous element’s bullets, if they have not been consumed yet.
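The interplay of first, rest and consumed can be sketched with a simplified model. MiniPrefix is a hypothetical class that reuses the attribute names documented above. Placing the bullet inside the padding is an assumption of this sketch, and the handling of closed tags and unconsumed bullets is omitted.

```python
from typing import List


class MiniPrefix:
    """Simplified model of padding and bullet handling (illustration only)."""

    def __init__(self) -> None:
        self.current_padding = 0
        self.paddings: List[int] = []
        self.bullets: List[str] = []
        self.consumed = False

    def register_prefix(self, padding_inline: int, bullet: str = "") -> None:
        """Register the padding and bullet of a newly opened tag."""
        self.current_padding += padding_inline
        self.paddings.append(padding_inline)
        self.bullets.append(bullet)
        self.consumed = False

    @property
    def first(self) -> str:
        """Prefix for the first line of a block; used only once."""
        if self.consumed:
            return ""
        self.consumed = True
        bullet = self.bullets[-1] if self.bullets else ""
        return " " * (self.current_padding - len(bullet)) + bullet

    @property
    def rest(self) -> str:
        """Prefix for further lines of the block: padding without bullet."""
        return " " * self.current_padding


prefix = MiniPrefix()
prefix.register_prefix(4, "* ")
print(repr(prefix.first), repr(prefix.rest))
```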
Inscriptis table model
Classes used for representing Tables, TableRows and TableCells.
- class inscriptis.model.table.Table(left_margin_len: int, cell_separator: str)[source]¶
An HTML table.
- rows¶
the table’s rows.
- left_margin_len¶
length of the left margin before the table.
- cell_separator¶
string used for separating cells from each other.
- add_cell(table_cell: TableCell)[source]¶
Add a new TableCell to the table’s last row.
Note
If no row exists yet, a new row is created.
- get_annotations(idx: int, left_margin_len: int) List[Annotation] [source]¶
Return all annotations in the given table.
- Parameters:
idx – the table’s start index.
left_margin_len – len of the left margin (required for adapting the position of annotations).
- Returns:
A list of all Annotations present in the table.
- class inscriptis.model.table.TableCell(align: HorizontalAlignment, valign: VerticalAlignment)[source]¶
A table cell.
- line_width¶
the original line widths per line (required to adjust annotations after a reformatting)
- vertical_padding¶
vertical padding that has been introduced due to vertical formatting rules.
- get_annotations(idx: int, row_width: int) List[Annotation] [source]¶
Return a list of all annotations within the TableCell.
- Returns:
A list of annotations that have been adjusted to the cell’s position.
- property height: int¶
Compute the table cell’s height.
- Returns:
The cell’s current height.
- normalize_blocks() int [source]¶
Split multi-line blocks into multiple one-line blocks.
- Returns:
The height of the normalized cell.
- property width: int¶
Compute the table cell’s width.
- Returns:
The cell’s current width.
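The relation between normalize_blocks(), height and width can be sketched on a plain list of strings. The functions below are hypothetical helpers and assume that a cell's blocks are strings which may contain newlines. The real TableCell additionally keeps track of line widths and annotations.

```python
from typing import List


def normalize_blocks(blocks: List[str]) -> List[str]:
    """Split multi-line blocks so that every block holds exactly one line."""
    return [line for block in blocks for line in block.split("\n")]


def cell_height(blocks: List[str]) -> int:
    """Number of lines the cell occupies after normalization."""
    return len(normalize_blocks(blocks))


def cell_width(blocks: List[str]) -> int:
    """Length of the longest line in the cell (0 for an empty cell)."""
    return max((len(line) for line in normalize_blocks(blocks)), default=0)


blocks = ["a\nbb", "ccc"]
print(normalize_blocks(blocks), cell_height(blocks), cell_width(blocks))
```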
Inscriptis annotations
The model used for saving annotations.
- class inscriptis.annotation.Annotation(start: int, end: int, metadata: str)[source]¶
An Inscriptis annotation which provides metadata on the extracted text.
The start and end indices indicate the span of the text to which the metadata refers, and the attribute metadata contains the tuple of tags describing this span.
Example:
Annotation(0, 10, ('heading', ))
The annotation above indicates that the text span between the 1st (index 0) and 11th (index 10) character of the extracted text contains a heading.
- end: int¶
the annotation’s end index within the text output.
- metadata: str¶
the tag to be attached to the annotation.
- start: int¶
the annotation’s start index within the text output.
- inscriptis.annotation.horizontal_shift(annotations: List[Annotation], content_width: int, line_width: int, align: HorizontalAlignment, shift: int = 0) List[Annotation] [source]¶
Shift annotations based on the given line’s formatting.
Adjusts the start and end indices of annotations based on the line’s formatting and width.
- Parameters:
annotations – a list of Annotations.
content_width – the width of the actual content
line_width – the width of the line in which the content is placed.
align – the horizontal alignment (left, right, center) to assume for the adjustment
shift – an optional additional shift
- Returns:
A list of Annotations with the adjusted start and end positions.
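The idea behind shifting annotations can be sketched as follows: when content is padded to a wider line, every span moves right by the amount of padding inserted on its left. The function shift_annotations is a hypothetical re-implementation that works on (start, end, label) tuples and alignment characters. The rounding used for centered content is an assumption chosen to match Python's own '^' formatting.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)


def shift_annotations(annotations: List[Span], content_width: int,
                      line_width: int, align: str, shift: int = 0) -> List[Span]:
    """Move spans by the left padding implied by the alignment."""
    if align == "<":
        offset = shift
    elif align == ">":
        offset = shift + line_width - content_width
    else:  # "^": centered content
        offset = shift + (line_width - content_width) // 2
    return [(start + offset, end + offset, label)
            for start, end, label in annotations]


content = "Chur"
line = f"{content:^10}"  # centered within a line of 10 characters
shifted = shift_annotations([(0, len(content), "heading")],
                            len(content), len(line), "^")
start, end, label = shifted[0]
print(label, repr(line[start:end]))
```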
Annotation processors
AnnotationProcessors transform annotations to an output format.
All AnnotationProcessors implement the AnnotationProcessor interface by overriding the class’s AnnotationProcessor.__call__() method.
Note
The AnnotationExtractor class must be put into a package with the extractor’s name (e.g., inscriptis.annotation.output.*package*) and be named *PackageExtractor* (see the examples below).
The overridden __call__() method may either extend the original dictionary which contains the extracted text and annotations (e.g., SurfaceExtractor) or replace it with a custom output (e.g., HtmlExtractor and XmlExtractor).
Currently, Inscriptis supports the following built-in AnnotationProcessors:
- HtmlExtractor provides an annotated HTML output format.
- XmlExtractor yields an output which marks annotations with XML tags.
- SurfaceExtractor adds the key surface to the result dictionary, which contains the surface forms of the extracted annotations.
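A processor that extends the result dictionary, in the spirit of the surface-form output described above, might look like the following sketch. SurfaceSketch is a hypothetical class that does not inherit from the library's AnnotationProcessor. The exact shape of the real surface entry is not documented here, so the (label, text) pairs are an assumption.

```python
from typing import Any, Dict


class SurfaceSketch:
    """Hypothetical processor that adds surface forms to the result."""

    def __call__(self, annotated_text: Dict[str, Any]) -> Dict[str, Any]:
        text = annotated_text["text"]
        # Look up the text covered by every (start, end, label) triple.
        annotated_text["surface"] = [
            (label, text[start:end])
            for start, end, label in annotated_text["label"]
        ]
        return annotated_text


result = SurfaceSketch()({
    "text": "Chur\n\nChur is the capital of the Grisons.",
    "label": [[0, 4, "heading"], [6, 10, "emphasis"]],
})
print(result["surface"])
```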