====================================
Testing, benchmarking and evaluation
====================================

Unit tests
==========

In addition to the standard unit tests located in the project's `tests` directory, Inscriptis also contains test cases that focus solely on the HTML to text conversion. These are located in the `tests/html` directory and consist of two files:

1. `test-name.html` and
2. `test-name.txt`

The `.txt` file contains the reference text output for the given HTML file.

Since Inscriptis 2.0 there may also be a third file named `test-name.json` in the `tests/html` directory, which contains a JSON dictionary with the keys

1. `annotation_rules`, containing the annotation rules for extracting metadata from the corresponding HTML file, and
2. `result`, which stores the surface forms of the extracted metadata.

Example::

   {"annotation_rules": {
       "h1": ["heading"],
       "b": ["emphasis"]
    },
    "result": [
       ["heading", "The first"],
       ["heading", "The second"],
       ["heading", "Subheading"]
    ]
   }

Text conversion output comparison and benchmarking
==================================================

The inscriptis project contains a benchmarking script that compares different HTML to text conversion approaches. The script runs each approach on a list of URLs, `url_list.txt`, and saves the text output into a time-stamped folder in `benchmarking/benchmarking_results` for manual comparison. In addition, the processing speed of every approach is measured per URL and saved in a text file called `speed_comparisons.txt` in the respective time-stamped folder.

To run the benchmarking script, execute `run_benchmarking.py` from within the folder `benchmarking`.
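
The workflow described above (convert every page with every approach, time each conversion, and store the output in a time-stamped folder) can be sketched roughly as follows. This is only an illustration and not the code of `run_benchmarking.py`: the names `strip_tags` and `benchmark`, the folder name format, and the layout of the lines in `speed_comparisons.txt` are assumptions made for this example, and the pages are passed in as strings rather than downloaded.

```python
import re
import time
from datetime import datetime
from pathlib import Path


def strip_tags(html: str) -> str:
    """Hypothetical stand-in converter: drop tags, collapse whitespace."""
    return " ".join(re.sub(r"<[^>]+>", " ", html).split())


def benchmark(pages, converters, result_root):
    """Run every converter on every page and record the elapsed time.

    pages: dict mapping a page name to its HTML source.
    converters: dict mapping a converter name to a callable html -> text.
    Returns the run folder and a list of (page, converter, seconds) tuples.
    """
    # One time-stamped folder per run (the name format is an assumption).
    run_dir = Path(result_root) / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True)

    timings = []
    for page_name, html in pages.items():
        for conv_name, convert in converters.items():
            start = time.perf_counter()
            text = convert(html)
            elapsed = time.perf_counter() - start
            # Save the text output for manual comparison.
            out_file = run_dir / f"{page_name}_{conv_name}.txt"
            out_file.write_text(text, encoding="utf-8")
            timings.append((page_name, conv_name, elapsed))

    # Collect all speed measurements in a single text file.
    lines = [f"{p}\t{c}\t{t:.6f}s" for p, c, t in timings]
    speed_file = run_dir / "speed_comparisons.txt"
    speed_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return run_dir, timings
```

Any callable that maps an HTML string to text can be passed in the `converters` dictionary, for instance `inscriptis.get_text` when Inscriptis is installed.
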
In `def pipeline()`, select which HTML -> Text algorithms are executed by modifying::

   run_lynx = True
   run_justext = True
   run_html2text = True
   run_beautifulsoup = True
   run_inscriptis = True

The URLs to be parsed are specified in `url_list.txt`, one per line with no additional formatting. URLs need to be complete (including http:// or https://), e.g.::

   http://www.informationscience.ch
   https://en.wikipedia.org/wiki/Information_science
   ...
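
Because incomplete URLs are easy to overlook, it can help to check the list before starting a benchmark run. The following is a minimal sketch of such a check; the helper `read_url_list` is hypothetical and not part of the benchmarking script, and skipping blank lines is an assumption of this example.

```python
from urllib.parse import urlparse


def read_url_list(lines):
    """Return the complete URLs from an iterable of lines.

    Blank lines are skipped (an assumption of this sketch). A line
    without an http:// or https:// scheme, or without a host name,
    raises ValueError.
    """
    urls = []
    for number, raw in enumerate(lines, start=1):
        url = raw.strip()
        if not url:
            continue
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"line {number}: incomplete URL {url!r}")
        urls.append(url)
    return urls
```

An open file object can be passed directly, e.g. `read_url_list(open("url_list.txt", encoding="utf-8"))`, since iterating over a file yields its lines.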