User manual

Modules

openiti.helper

openiti.helper.ara

openiti.helper.ara.ar_ch_cnt(text)

Count the number of Arabic characters in a string

Parameters:text – text
Returns:number of the Arabic characters in the text

Examples

>>> a = "ابجد ابجد اَبًجٌدُ"
>>> ar_ch_cnt(a)
16
openiti.helper.ara.ar_cnt_file(fp, mode='token', incl_editor_sections=True)

Count the number of Arabic characters/tokens in a text, given its path

Parameters:
  • fp (str) – url / path to a file
  • mode (str) – either “char” for count of Arabic characters, or “token” for count of Arabic tokens
  • incl_editor_sections (bool) – if False, the sections marked as editorial (### |EDITOR|) will be left out of the token/character count. Default: True (editorial sections will be counted)
Returns:

Arabic character/token count

Return type:

(int)

openiti.helper.ara.ar_tok_cnt(text)

Count the number of Arabic tokens in a string

Parameters:text – text
Returns:number of Arabic tokens in the text

Examples

>>> a = "ابجد ابجد اَبًجٌدُ"
>>> ar_tok_cnt(a)
3
openiti.helper.ara.deNoise(text)

Remove non-consonantal characters from Arabic text.

Examples

>>> denoise("وَالَّذِينَ يُؤْمِنُونَ بِمَا أُنْزِلَ إِلَيْكَ وَمَا أُنْزِلَ مِنْ قَبْلِكَ وَبِالْآخِرَةِ هُمْ يُوقِنُونَ")
'والذين يؤمنون بما أنزل إليك وما أنزل من قبلك وبالآخرة هم يوقنون'
>>> denoise(" ْ ً ٌ ٍ َ ُ ِ ّ ۡ ࣰ ࣱ ࣲ ٰ ")
'              '
openiti.helper.ara.decode_unicode_name(s)

Convert unicode names into the characters they refer to.

Parameters:s (str) – input string
Returns:str

Examples

>>> decode_unicode_name("ARABIC LETTER ALEF_ARABIC LETTER YEH")
'اي'
>>> decode_unicode_name("ARABIC LETTER ALEF_*")
'ا*'
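
The behavior shown in these examples can be approximated with the standard library's unicodedata.lookup. The following is only a sketch, not the library's code: the helper name names_to_chars and the fallback for pieces that are not Unicode names are assumptions inferred from the examples above.

```python
import unicodedata


def names_to_chars(s, separator="_"):
    """Replace each Unicode name in s by its character (hypothetical helper)."""
    chars = []
    for name in s.split(separator):
        try:
            chars.append(unicodedata.lookup(name))
        except KeyError:
            # Not a known Unicode name: keep the piece unchanged
            # (assumed behavior, based on the "ALEF_*" example).
            chars.append(name)
    return "".join(chars)


decoded = names_to_chars("ARABIC LETTER ALEF_ARABIC LETTER YEH")
print([f"U+{ord(c):04X}" for c in decoded])
```
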
openiti.helper.ara.denoise(text)

Remove non-consonantal characters from Arabic text.

Examples

>>> denoise("وَالَّذِينَ يُؤْمِنُونَ بِمَا أُنْزِلَ إِلَيْكَ وَمَا أُنْزِلَ مِنْ قَبْلِكَ وَبِالْآخِرَةِ هُمْ يُوقِنُونَ")
'والذين يؤمنون بما أنزل إليك وما أنزل من قبلك وبالآخرة هم يوقنون'
>>> denoise(" ْ ً ٌ ٍ َ ُ ِ ّ ۡ ࣰ ࣱ ࣲ ٰ ")
'              '
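
As a rough illustration of what removing vowel marks involves, the sketch below strips a small set of combining marks with a regular expression. The character range and the name strip_diacritics are assumptions made for this example; the pattern used by denoise itself covers more characters.

```python
import re

# Assumption: "noise" is limited here to U+064B-U+0652 (tanwin, short
# vowels, shadda, sukun) and the dagger alif U+0670.
DIACRITICS = re.compile("[\u064b-\u0652\u0670]")


def strip_diacritics(text):
    """Remove the combining marks listed above (hypothetical helper)."""
    return DIACRITICS.sub("", text)


vocalized = "\u0648\u064e\u0645\u064e\u0627"  # waw, fatha, mim, fatha, alif
print(len(vocalized), len(strip_diacritics(vocalized)))
```
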
openiti.helper.ara.denormalize(text)

Replace complex characters with a regex covering all variants.

Examples

>>> denormalize("يحيى")
'يحي[يى]'
>>> denormalize("هوية")
'هوي[هة]'
>>> denormalize("مقرئ")
'مقر(?:[ؤئ]|[وي]ء)'
>>> denormalize("فيء")
'في(?:[ؤئ]|[وي]ء)'
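
The idea behind denormalization can be sketched for two of the cases above: a word-final ya / alif maqsura and a word-final ha / ta marbuta. The sketch is a simplification under stated assumptions (the name denormalize_finals and the two variant sets are hypothetical, and hamza variants are not handled); it is not the library's implementation.

```python
import re

# Assumption: only these two groups of word-final variants are covered.
VARIANT_SETS = [
    "\u064a\u0649",  # ya, alif maqsura
    "\u0647\u0629",  # ha, ta marbuta
]


def denormalize_finals(text):
    """Replace word-final variant letters by a regex character class."""
    for variants in VARIANT_SETS:
        char_class = "[" + variants + "]"
        # \b restricts the replacement to the end of a word.
        text = re.sub(char_class + r"\b", char_class, text)
    return text


pattern = denormalize_finals("\u064a\u062d\u064a\u0649")  # the name Yahya
# The resulting pattern matches both spellings of the final letter.
print(bool(re.fullmatch(pattern, "\u064a\u062d\u064a\u064a")))
```
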
openiti.helper.ara.normalize(text, replacement_tuples=[])

Normalize Arabic text by replacing complex characters by simple ones. The function is used internally to do batch replacements. It can also be called externally to run custom replacements with a list of tuples of (character/regex, replacement).

Parameters:
  • text (str) – the string that needs to be normalized
  • replacement_tuples (list of tuple pairs) – (character/regex, replacement)

Examples

>>> normalize('AlphaBet', [("A", "a"), ("B", "b")])
'alphabet'
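
A batch replacement of this kind can be sketched as a loop over re.sub calls. The function name apply_replacements is hypothetical, and whether the library applies the tuples strictly in list order is an assumption.

```python
import re


def apply_replacements(text, replacement_tuples):
    """Apply each (character/regex, replacement) pair in order."""
    for pattern, replacement in replacement_tuples:
        text = re.sub(pattern, replacement, text)
    return text


print(apply_replacements("AlphaBet", [("A", "a"), ("B", "b")]))
```

Because the pairs are applied one after another, a later pair also sees the output of earlier ones, so the order of the list matters.
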
openiti.helper.ara.normalize_ara_heavy(text)

Normalize Arabic text by simplifying complex characters: alifs, alif maqsura, hamzas, ta marbutas

Examples

>>> normalize_ara_heavy("ألف الف إلف آلف ٱلف")
'الف الف الف الف الف'
>>> normalize_ara_heavy("يحيى")
'يحيي'
>>> normalize_ara_heavy("مقرئ فيء")
'مقر في'
>>> normalize_ara_heavy("قهوة")
'قهوه'
openiti.helper.ara.normalize_ara_light(text)

Lightly normalize Arabic strings: fixing only Alifs, Alif Maqsuras; replacing hamzas on carriers with standalone hamzas

Parameters:text (str) – the string that needs to be normalized

Examples

>>> normalize_ara_light("ألف الف إلف آلف ٱلف")
'الف الف الف الف الف'
>>> normalize_ara_light("يحيى")
'يحيي'
>>> normalize_ara_light("مقرئ فيء")
'مقرء فء'
>>> normalize_ara_light("قهوة")
'قهوة'
openiti.helper.ara.normalize_composites(text, method='NFKC')

Normalize composite characters and ligatures using unicode normalization methods.

Composite characters are characters that consist of a combination of a letter and a diacritic (e.g., ؤ “U+0624 : ARABIC LETTER WAW WITH HAMZA ABOVE”, آ “U+0622 : ARABIC LETTER ALEF WITH MADDA ABOVE”). Some normalization methods (NFD, NFKD) decompose these composite characters into their constituent characters, others (NFC, NFKC) compose these characters from their constituent characters.

Ligatures are another type of composite character: one unicode point represents one or more letters (e.g., ﷲ “U+FDF2 : ARABIC LIGATURE ALLAH ISOLATED FORM”, ﰋ “U+FC0B : ARABIC LIGATURE TEH WITH JEEM ISOLATED FORM”). Such ligatures can only be decomposed (NFKC, NFKD) or kept as they are (NFC, NFD); there are no methods that compose them from their constituent characters.

Finally, Unicode also contains code points for the different contextual forms of a letter (isolated, initial, medial, final), in addition to the code point for the general letter. E.g., for the letter ba’:

  • general: 0628 ب
  • isolated: FE8F ﺏ
  • final: FE90 ﺐ
  • medial: FE92 ﺒ
  • initial: FE91 ﺑ

Some methods (NFKC, NFKD) replace those contextual form code points by the equivalent general code point. The other methods (NFC, NFD) keep the contextual code points as they are. There are no methods that turn general letter code points into their contextual code points.

method   composites   ligatures   contextual forms
NFC      join         keep        keep
NFD      split        keep        keep
NFKC     join         decompose   generalize
NFKD     split        decompose   generalize

For more details about Unicode normalization methods, see https://unicode.org/reports/tr15/

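Parameters:
  • text (str) – the string to be normalized
  • method (str) – Unicode normalization method: “NFC”, “NFD”, “NFKC” or “NFKD”. Default: “NFKC”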

Examples

>>> len("ﷲ") # U+FDF2: ARABIC LIGATURE ALLAH ISOLATED FORM
1
>>> len(normalize_composites("ﷲ"))
4
>>> [char for char in normalize_composites("ﷲ")]
['ا', 'ل', 'ل', 'ه']
>>> len("ﻹ") # U+FEF9: ARABIC LIGATURE LAM WITH ALEF WITH HAMZA BELOW ISOLATED FORM
1
>>> len(normalize_composites("ﻹ"))
2

alif+hamza written with 2 unicode characters: U+0627 (ARABIC LETTER ALEF) + U+0654 (ARABIC HAMZA ABOVE):

>>> a = "أ"
>>> len(a)
2
>>> len(normalize_composites(a))
1
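
The four normalization methods compared above are available in Python's standard library as unicodedata.normalize. The sketch below calls that function directly rather than normalize_composites (whose internals are not shown here) and prints the code points each method produces for one composite, one ligature and one contextual form.

```python
import unicodedata

samples = {
    "composite (waw with hamza above)": "\u0624",
    "ligature (Allah, isolated form)": "\ufdf2",
    "contextual form (isolated ba')": "\ufe8f",
}

for label, char in samples.items():
    for method in ("NFC", "NFD", "NFKC", "NFKD"):
        result = unicodedata.normalize(method, char)
        code_points = " ".join(f"U+{ord(c):04X}" for c in result)
        print(f"{label:34} {method:5} {code_points}")
```
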
openiti.helper.ara.normalize_per(text)

Normalize Persian strings by converting Arabic characters to the related Persian Unicode characters: fixing alifs, alif maqsuras, hamzas, ta marbutas, kaf, ya, fathatan and kasra.

Parameters:text (str) – user input string to be normalized

Examples

>>> normalize_per("سياسي")
'سیاسی'
>>> normalize_per("مدينة")
'مدینه'
>>> normalize_per("درِ باز")
'در باز'
>>> normalize_per("حتماً")
'حتما'
>>> normalize_per("مدرك")
'مدرک'
>>> normalize_per("أسماء")
'اسما'
>>> normalize_per("دربارۀ")
'درباره'
openiti.helper.ara.tokenize(text, token_regex=re.compile('[ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيًٌٍَُِّْ٠١٢٣٤٥٦٧٨٩ٮٰٹپچژکگیے۱۲۳۴۵۶۷۸۹۰]+'))

Tokenize a text into tokens defined by token_regex

NB: make sure to remove the OpenITI header from the text

Parameters:
  • text (str) – full text with OpenITI header removed, cleaned of order marks (”‪”, “‫”, “‬”)
  • token_regex (str) – regex that defines a token
Returns:

a tuple of three lists: tokens (list): list of all tokens in the text; tokenStarts (list): list of start index of each token; tokenEnds (list): list of end index of each token

Return type:

(tuple)

Examples

>>> a = "ابجد ابجد اَبًجٌدُ"
>>> tokens, tokenStarts, tokenEnds = tokenize(a)
>>> tokens
['ابجد', 'ابجد', 'اَبًجٌدُ']
>>> tokenStarts
[0, 5, 10]
>>> tokenEnds
[4, 9, 18]
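
The three returned lists can be reproduced with re.finditer. In the sketch below, the token pattern is a simplified assumption (U+0621 to U+0652 only, far narrower than the default token_regex above) and the function name tokenize_with_offsets is hypothetical.

```python
import re

# Assumption: a simplified token pattern, for illustration only.
ARABIC_TOKEN = re.compile("[\u0621-\u0652]+")


def tokenize_with_offsets(text, token_regex=ARABIC_TOKEN):
    """Return the tokens with their start and end offsets."""
    tokens, starts, ends = [], [], []
    for match in token_regex.finditer(text):
        tokens.append(match.group())
        starts.append(match.start())
        ends.append(match.end())  # end offset is exclusive
    return tokens, starts, ends


word = "\u0627\u0628\u062c\u062f"  # alif, ba, jim, dal
vocalized = "\u0627\u064e\u0628\u064b\u062c\u064c\u062f\u064f"  # with vowel marks
tokens, starts, ends = tokenize_with_offsets(f"{word} {word} {vocalized}")
print(len(tokens), starts, ends)
```

The end offsets are exclusive, so text[start:end] returns the token itself.
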

openiti.helper.funcs

openiti.helper.funcs.find_section_title(loc, section_titles, section_starts)

Find the section title(s) for a character offset

Parameters:
  • loc (int) – character offset for which the section title is wanted
  • section_titles (list) – a list of all section titles in the document
  • section_starts (list) – a list of character offsets of the starts of all sections in the text
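
One way to implement such a lookup is a binary search over the section start offsets. The sketch below assumes that section_starts is sorted in ascending order and runs parallel to section_titles; the function name section_title_at is hypothetical, and the library may handle section hierarchies differently.

```python
from bisect import bisect_right


def section_title_at(loc, section_titles, section_starts):
    """Return the title of the section containing offset loc, or None."""
    index = bisect_right(section_starts, loc) - 1
    if index < 0:
        return None  # loc lies before the first section
    return section_titles[index]


titles = ["Introduction", "Chapter 1", "Chapter 2"]
starts = [0, 120, 480]
print(section_title_at(130, titles, starts))
```
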
openiti.helper.funcs.get_all_characters_in_folder(start_folder, verbose=False, exclude_folders=[], exclude_files=[])

Get a set of all characters used in all OpenITI text files in a folder and its subfolders.

Parameters:
  • start_folder (str) – path to the root directory. All files and folders in it, except if they are in the exclude lists, will be processed.
  • verbose (bool) – if True, filenames and current number of characters in the set will be printed.
  • exclude_folders (list) – list of folder names to be excluded from the process.
  • exclude_files (list) – list of file names to be excluded.
Returns:

a set of all characters used in the folder.

Return type:

(set)

openiti.helper.funcs.get_all_characters_in_text(fp)

Get a set of all characters used in a text.

Parameters:fp (str) – path to a text file.
Returns:a set of all characters used in the text.
Return type:(set)
openiti.helper.funcs.get_all_text_files_in_folder(start_folder, excluded_folders=['OpenITI.github.io', 'Annotation', 'maintenance', 'i.mech00', 'i.mech01', 'i.mech02', 'i.mech03', 'i.mech04', 'i.mech05', 'i.mech06', 'i.mech07', 'i.mech', 'i.mech_Temp', 'i.mech08', 'i.mech09', 'i.logic', 'i.cex', 'i.cex_Temp', '.git'], exclude_files=['README.md', '.DS_Store', '.gitignore', 'text_questionnaire.md'])

A generator that yields the file path for all OpenITI text files in a folder and its subfolders.

OpenITI text files are defined here as files that have a language identifier (-ara1, -ara2, -per1, etc.) and have either no extension or .mARkdown, .completed, or .inProgress.

The script creates a generator over which you can iterate. It yields the full path to each of the text files.

Parameters:
  • start_folder (str) – path to the folder containing the text files
  • excluded_folders (list) – list of folder names that should be excluded (default: the list of excluded folders defined in this module)
  • exclude_files (list) – list of file names that should be excluded (default: the list of excluded file names defined in this module)

Examples

>>> folder = r"D:LondonOpenITIY_repos"
>>> for fp in get_all_text_files_in_folder(folder):
...     print(fp)

>>> folder = r"D:LondonOpenITIY_repos5AH"
>>> AH0025_file_list = [fp for fp in get_all_text_files_in_folder(folder)]
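
A generator of this kind can be sketched with os.walk. The file name pattern below is an assumption derived from the definition of OpenITI text files given above, and iter_text_files is a hypothetical name, not the library function.

```python
import os
import re

# Assumed pattern: a language identifier such as -ara1, optionally
# followed by one of the three extensions named above.
TEXT_FILE = re.compile(r"-[a-z]{3}\d+(\.(mARkdown|completed|inProgress))?$")


def iter_text_files(start_folder, excluded_folders=(), exclude_files=()):
    """Yield the path of every matching file below start_folder."""
    for root, dirs, files in os.walk(start_folder):
        # Prune in place so os.walk does not descend into excluded folders.
        dirs[:] = [d for d in dirs if d not in excluded_folders]
        for name in files:
            if name not in exclude_files and TEXT_FILE.search(name):
                yield os.path.join(root, name)


for fp in iter_text_files(".", excluded_folders=[".git"]):
    print(fp)
```

Since the function is a generator, paths are produced lazily: nothing is listed until the loop (or a list comprehension) consumes it.
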

openiti.helper.funcs.get_all_yml_files_in_folder(start_folder, yml_types, excluded_folders=['OpenITI.github.io', 'Annotation', 'maintenance', 'i.mech00', 'i.mech01', 'i.mech02', 'i.mech03', 'i.mech04', 'i.mech05', 'i.mech06', 'i.mech07', 'i.mech', 'i.mech_Temp', 'i.mech08', 'i.mech09', 'i.logic', 'i.cex', 'i.cex_Temp', '.git'], exclude_files=['README.md', '.DS_Store', '.gitignore', 'text_questionnaire.md'])

A generator that yields the file path for all yml files of a specific type in a folder and its subfolders.

OpenITI yml files exist for authors, books and versions.

The script creates a generator over which you can iterate. It yields the full path to each of the yml files.

Parameters:
  • start_folder (str) – path to the folder containing the text files
  • yml_types (list) – list of desired yml file types: one or more of “author”, “book”, or “version”
  • excluded_folders (list) – list of folder names that should be excluded (default: the list of excluded folders defined in this module)
  • exclude_files (list) – list of file names that should be excluded (default: the list of excluded file names defined in this module)

Examples

>>> folder = r"D:LondonOpenITIY_repos"
>>> for fp in get_all_yml_files_in_folder(folder, ["author", "book", "version"]):
...     print(fp)

>>> folder = r"D:LondonOpenITIY_repos5AH"
>>> AH0025_file_list = [fp for fp in get_all_yml_files_in_folder(folder, ["version"])]

openiti.helper.funcs.get_character_names(characters, verbose=False)

Print the unicode name of a list/set/string of characters.

Parameters:
  • characters (list/set/string) – a list, string or set of characters.
  • verbose (bool) – if set to True, the output will be printed
Returns:

a dictionary of characters and their names.

Return type:

(dict)

Examples

>>> char_dict = {"١": "ARABIC-INDIC DIGIT ONE", "٢": "ARABIC-INDIC DIGIT TWO"}
>>> char_dict == get_character_names("١٢")
True
>>> char_dict == get_character_names(["١", "٢"])
True
>>> char_dict == get_character_names({"١", "٢"})
True
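
A dictionary like this can be built with unicodedata.name from the standard library. The sketch below is an illustration under assumptions: the helper name character_names and the "UNNAMED" fallback for characters without a Unicode name are not taken from the library.

```python
import unicodedata


def character_names(characters):
    """Map each character to its Unicode name (hypothetical helper)."""
    # The "UNNAMED" fallback is an assumption for characters that have
    # no name in the Unicode database (e.g. control characters).
    return {c: unicodedata.name(c, "UNNAMED") for c in characters}


digits = "\u0661\u0662"  # Arabic-Indic digits one and two
for char, name in character_names(digits).items():
    print(f"U+{ord(char):04X} {name}")
```
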
openiti.helper.funcs.get_page_number(page_numbers, pos)

Get the page number of a token at index position pos in a string based on a dictionary page_numbers that contains the index positions of the page numbers in that string.

Parameters:
  • page_numbers (dict) – key: index of the last character of the page number in the string; value: page number
  • pos (int) – the index position of the start of a token in the string
openiti.helper.funcs.get_sections(text, section_header_regex='### .+', include_hierarchy=True)

Get the section titles and start offsets for all sections in the text

Parameters:
  • text (str) – the text containing the sections
  • section_header_regex (str) – regular expression pattern for section headers
  • include_hierarchy (bool) – if False, only the title of the section will be returned; if True, a list of titles of all parent sections will be returned
Returns:

list (section_titles, section_starts)
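
Collecting titles and start offsets is essentially one pass of re.finditer over the text. The sketch below covers only the flat case (comparable to include_hierarchy=False); the name flat_sections is hypothetical and the parent-title logic of the library is not reproduced.

```python
import re


def flat_sections(text, section_header_regex="### .+"):
    """Return (section_titles, section_starts) without hierarchy."""
    titles, starts = [], []
    for match in re.finditer(section_header_regex, text):
        titles.append(match.group())
        starts.append(match.start())
    return titles, starts


sample = "### | Book one\nsome text\n### || Chapter one\nmore text\n"
titles, starts = flat_sections(sample)
print(titles, starts)
```
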

openiti.helper.funcs.get_semantic_tag_elements(tag_name, text, include_tag=False, include_prefix=False, include_offsets=False, max_tokens=99, normalize_spaces=False)

Extract semantic tags (the likes of @TOP\d\d+) from OpenITI texts

Parameters:
  • tag_name (str) – the tag you want to extract (e.g., @TOP, @PER, …)
  • text (str) – the string from which the tags are to be extracted
  • include_tag (bool) – if False, only the content of the tag will be returned. If True, both tag+content (default: False)
  • include_prefix (bool) – if False, the prefix (that is, the number of characters defined by the first digit after the tag) will be stripped off from the result. Only if include_tag is set to False. Default: False.
  • include_offsets (bool) – if True, the start and end offsets of each element will be included (as a dictionary: with keys “match”, “start”, “end”)
  • max_tokens (int) – the maximum number of tokens inside a tag. Default: 99.
  • normalize_spaces (bool) – if True, new lines, page numbers etc. will be removed from the returned tokens.
Returns:

list

openiti.helper.funcs.natural_sort(obj)

Sort a list containing letters and numbers in its natural order (1,2,3,4,5,6,7,8,9,10, … instead of 1,10,2,3,4,5,6,7,8,9, …)

based on https://stackoverflow.com/a/16090640/4045481
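
A common way to get this ordering is a sort key that splits each string into digit and non-digit runs and compares the digit runs as integers. The sketch below shows that general technique; the names natural_key and natural_sorted are hypothetical, and the code is not copied from the library or from the linked answer.

```python
import re


def natural_key(item):
    """Split into digit and non-digit runs; digit runs become integers."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", item)]


def natural_sorted(items):
    return sorted(items, key=natural_key)


print(natural_sorted(["page10", "page2", "page1"]))
```
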

openiti.helper.funcs.read_header(pth, lines=300, header_splitter='#META#Header#End#', encoding='utf-8-sig')

Read only the OpenITI header of a file without opening the entire file.

Parameters:
  • pth (str) – path to local text file / URL of remote text file
  • lines (int) – number of lines at the top of the file to be read
  • header_splitter (str) – string that separates the header from the body text
  • encoding (str) – text encoding to use. Default: “utf-8-sig” (Unicode utf-8, strips BOM at start of file)
Returns:

the metadata header of the text file

Return type:

(str)
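
Reading only the header amounts to consuming lines until the splitter is found or the line limit is reached. The sketch below works on any iterable of lines (an open file, for instance); the name header_from_lines is hypothetical, and returning the splitter line as part of the header is an assumption.

```python
import io
from itertools import islice


def header_from_lines(line_iter, lines=300,
                      header_splitter="#META#Header#End#"):
    """Collect lines up to and including the splitter line."""
    header = []
    for line in islice(line_iter, lines):
        header.append(line)
        if header_splitter in line:
            break  # stop reading: the body is never loaded
    return "".join(header)


sample = io.StringIO(
    "#META# title\n#META#Header#End#\nbody line 1\nbody line 2\n"
)
print(header_from_lines(sample))
```
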

openiti.helper.funcs.read_text(pth, max_header_lines=300, split_header=False, remove_header=False, encoding='utf-8-sig', header_splitter='#META#Header#End#')

Read a text from a file or from a URL.

The parameters allow you to choose what is returned:

  • the full text file content: metadata header + text in a single string
  • only the text, without the header, in a single string (remove_header=True)
  • header and text, separated, in a tuple of strings (split_header=True)

Parameters:
  • pth (str) – path to local text file / URL of remote text file
  • max_header_lines (int) – number of lines at the top of the file to be read to find the header
  • split_header (bool) – if True, the header and main body of the text will be returned as separate strings
  • remove_header (bool) – if True, only the main body of the text will be returned
  • encoding (str) – text encoding to use. Defaults to “utf-8-sig” (Unicode utf-8, strips BOM at start of file)
  • header_splitter (str) – string that separates the header from the body text. Defaults to “#META#Header#End#” (end of the standard OpenITI metadata header)
Returns:

str or tuple

openiti.helper.funcs.report_missing_numbers(fp, no_regex='### \\$ \\((\\d+)', report_repeated_numbers=True)

Use a regular expression to check whether numbers (of books, pages, etc.) are in sequence and no numbers are missing.

Parameters:
  • fp (str) – path to the text file
  • no_regex (str) – regular expression pattern describing the number for which the sequence should be checked. NB: the numbers should be in the first/only capture group
Use cases:
  • Page numbers: use regex PageV\d+P(\d+)
  • numbered sections: e.g., ### \$ \(?(\d+) for dictionary items, ### \|{2} (\d+) for second-level sections, …
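
The check can be sketched on a string instead of a file: extract the numbers from the first capture group, then look for gaps and repetitions. The name missing_and_repeated is hypothetical, and defining "missing" as any gap between the smallest and largest number found is an assumption; the library may report out-of-order numbers differently.

```python
import re
from collections import Counter


def missing_and_repeated(text, no_regex=r"### \$ \((\d+)"):
    """Return (missing numbers, repeated numbers) for the given pattern."""
    numbers = [int(n) for n in re.findall(no_regex, text)]
    if not numbers:
        return [], []
    counts = Counter(numbers)
    missing = [n for n in range(min(numbers), max(numbers) + 1)
               if n not in counts]
    repeated = sorted(n for n, count in counts.items() if count > 1)
    return missing, repeated


sample = "### $ (1) alif\n### $ (2) ba\n### $ (2) ta\n### $ (5) jim\n"
print(missing_and_repeated(sample))
```
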
openiti.helper.funcs.text_cleaner(text)

Clean text by normalizing Arabic characters and removing all non-Arabic characters

Parameters:text (str) – the string to be cleaned
Returns:the cleaned string
Return type:(str)

openiti.helper.templates

Templates for OpenITI yml files, readme files, etc.

Templates:

  • MAGIC_VALUE
  • HEADER_SPLITTER
  • HTML_HEADER
  • HTML_FOOTER
  • author_yml_template
  • book_yml_template
  • version_yml_template
  • readme_template
  • text_questionnaire_template

openiti.helper.uri

Classes and functions to work with OpenITI URIs

Todo

  • make the print output look nicer
  • reflow texts + insert milestones?
  • compare with Maxim’s script: _add_new_text_from_folder.py in maintenance repo

The Module contains a URI class that represents an OpenITI URI as an object. The URI class’s methods allow

  • Checking whether all components of the URI are valid
  • Accessing and changing components of the URI
  • Getting the URI’s current uri_type (“author”, “book”, “version”, None)
  • Building different versions of the URI (“author”, “book”, etc.)
  • Building paths based on the URI

In addition to the URI class, the module contains a number of functions for implementing URI changes in the OpenITI corpus

Examples

By calling the URI class, you create an instance of the URI class:

>>> from openiti.helper.uri import URI
>>> instance1 = URI("0255Jahiz.Bayan")
>>> instance2 = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")

The URI instance inherits all the URI class’s methods and properties:

  • Making string representations of a URI instance: print() and repr():
>>> t = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")
>>> print(repr(t))
uri(date:0255, author:Jahiz, title:Hayawan, version:Sham19Y0023775, language:ara, edition_no:1, extension:completed)
>>> print(t)
0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed
  • Accessing components of the URI:
>>> t.author
'Jahiz'
>>> t.date
'0255'
  • Changing components of the URI by setting properties to a new value:
>>> u = URI("0255Jahiz.Hayawan")
>>> u.author = "JahizBasri"
>>> print(u)
0255JahizBasri.Hayawan
  • Validity tests: setting invalid values for part of a uri raises an exception (implemented for instantiation of URI objects + setting URI components):

    # >>> URI("255Jahiz")
    Exception: Date Error: URI must start with a date of 4 digits (255 has 3!)
    
    # >>> URI("0255Jāḥiẓ")
    Exception: Author name Error: Author name (Jāḥiẓ) should not contain
    digits or non-ASCII characters(culprits: ['ā', 'ḥ', 'ẓ'])
    
    # >>> t.author = "Jāḥiẓ"
    Exception: Author name Error: Author name (Jāḥiẓ) should not contain
    digits or non-ASCII characters(culprits: ['ā', 'ḥ', 'ẓ'])
    
    # >>> t.author = "0255Jahiz"
    Exception: Author name Error: Author name (0255Jahiz) should not contain
    digits or non-ASCII characters(culprits: ['0', '2', '5', '5'])
    
    # >>> URI("0255Jahiz.Al-Hayawan")
    Exception: Book title Error: Book title (Al-Hayawan) should not contain
    non-ASCII characters(culprits: ['-'])
    
    # >>> URI("0255Jahiz.Hayawan.Shāmila00123545-ara1")
    Exception: Version string Error: Version string (Shāmila00123545)
    should not contain non-ASCII characters(culprits: ['ā'])
    
    # >>> URI("0255Jahiz.Hayawan.Shamela00123545-arab1")
    Exception: Language code (arab) should be an ISO 639-2 language code,
    consisting of 3 characters
    
    # >>> t.extension = "markdown"
    Exception: Extension (markdown) is not among the allowed extensions
    (['inProgress', 'completed', 'mARkdown', 'yml', ''])
    
    # >>> URI("0255Jahiz.Hayawan.Shamela00123545-ara1.markdown")
    Exception: Extension (markdown) is not among the allowed extensions
    (['inProgress', 'completed', 'mARkdown', 'yml', ''])
    
  • Getting the URI’s current uri_type (“author”, “book”, “version”, None), i.e., the longest URI that can be built from the object’s components:

>>> t.uri_type
'version'
>>> t.language = ""
>>> t.uri_type
'book'
>>> t.date = ""
>>> t.uri_type == None
True
  • Building different versions of the URI (uri_types: “author”, “author_yml”, “book”, “book_yml”, “version”, “version_yml”, “version_file”):
>>> t = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")
>>> t.build_uri(uri_type="author")
'0255Jahiz'
>>> t.build_uri("book")
'0255Jahiz.Hayawan'
>>> t.build_uri("book_yml")
'0255Jahiz.Hayawan.yml'
>>> t.build_uri("version")
'0255Jahiz.Hayawan.Sham19Y0023775-ara1'
>>> t.build_uri("version_file")
'0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'

Simply calling the URI object (i.e., writing parentheses after the variable name) works as an alias for the build_uri function:

>>> t = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")
>>> t(uri_type="author")
'0255Jahiz'
>>> t("book")
'0255Jahiz.Hayawan'
>>> t("book_yml")
'0255Jahiz.Hayawan.yml'
>>> t("version")
'0255Jahiz.Hayawan.Sham19Y0023775-ara1'
>>> t("version_file")
'0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'
  • Building paths based on the URI:
>>> t.build_pth(uri_type="version", base_pth=r"D:\test")
'D:/test/0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
>>> t.build_pth(uri_type="version_file", base_pth=r"D:\test")
'D:/test/0275AH/data/0255Jahiz/0255Jahiz.Hayawan/0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'
>>> t.build_pth("version")
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
>>> t.build_pth(uri_type="book_yml")
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan/0255Jahiz.Hayawan.yml'

Without uri_type argument, build_pth() builds the fullest path it can:

>>> t.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan/0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'
>>> t.language=""  # (removing the language property makes it impossible to build a version uri)
>>> t.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'

NB: by default, build_pth() takes the OpenITI folder structure into account, in which authors are grouped in 25-year batches by their death date. If you do not want to use this feature, set the URI class’s data_in_25_year_repos attribute to False:

>>> t.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
>>> t.data_in_25_year_repos = False
>>> t.build_pth()
'./0255Jahiz/0255Jahiz.Hayawan'
>>> t.data_in_25_year_repos = True
>>> URI.data_in_25_year_repos = False
>>> u = URI("0255Jahiz.Hayawan")
>>> u.build_pth()
'./0255Jahiz/0255Jahiz.Hayawan'
>>> URI.data_in_25_year_repos = True
>>> u.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
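
The 25-year grouping can be illustrated with a little arithmetic: the death date is rounded up to the next multiple of 25 to get the repository name. The sketch below reproduces the paths shown above for 0255Jahiz; the names batch_folder and book_folder are hypothetical, and the handling of dates that are exact multiples of 25 is an assumption.

```python
import math


def batch_folder(date, batch_size=25):
    """Name of the 25-year repository for a death date, e.g. '0275AH'."""
    # Assumption: a date that is an exact multiple of 25 stays in its
    # own batch (0250 -> 0250AH).
    upper = math.ceil(int(date) / batch_size) * batch_size
    return f"{upper:04d}AH"


def book_folder(author_uri, book_uri, base_pth=".", in_25_year_repos=True):
    """Build the path to a book folder (hypothetical helper)."""
    parts = [base_pth]
    if in_25_year_repos:
        parts += [batch_folder(author_uri[:4]), "data"]
    parts += [author_uri, book_uri]
    return "/".join(parts)


print(book_folder("0255Jahiz", "0255Jahiz.Hayawan"))
print(book_folder("0255Jahiz", "0255Jahiz.Hayawan", in_25_year_repos=False))
```
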

In addition to the URI class, the module contains a function for implementing URI changes in the OpenITI corpus: change_uri.

The function has an execute flag. If set to False, the function will not immediately be executed but first show all changes it will make, and then ask the user whether to carry out the changes or not.

If a version URI changes:

  • new author and book folders are made if necessary (including the creation of new author and book yml files)
  • all text files related to that version should be moved
  • the yml file of that version should be updated and moved

If a book uri changes:

  • new author and book folders are made if necessary
  • the yml file of that book should be updated and moved
  • all text files of all versions of the book should be moved
  • all yml files of versions of that book should be updated and moved
  • the original book folder itself should be (re)moved

If an author uri changes:

  • new author and book folders are made if necessary
  • the yml file of that author should be updated and moved
  • all book yml files of that author should be updated and moved
  • all annotation text files of all versions of all books should be moved
  • all yml files of versions of all books should be updated and moved
  • the original book folders should be (re)moved
  • the original author folder itself should be (re)moved

Examples:

change_uri("0255Jahiz", "0256Jahiz")
change_uri("0255Jahiz", "0255JahizBasri")
change_uri("0255Jahiz.Hayawan", "0255Jahiz.KitabHayawan")
change_uri("0255Jahiz.Hayawan.Shamela002526-ara1",
           "0255Jahiz.Hayawan.Shamela002526-ara2")
change_uri("0255Jahiz.Hayawan.Shamela002526-ara1.completed",
           "0255Jahiz.Hayawan.Shamela002526-ara1.mARkdown")
class openiti.helper.uri.URI(uri_string=None)

A class that represents the OpenITI URI as a Python object.

OpenITI URIs consist of the following elements (example: 0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1.mARkdown):

  • VersionURI: consists of
    • EditionURI: consists of
      • Work URI: consists of
        • AuthorID: consists of
          • author’s death date (self.date): 0768
          • shuhra of the author (self.author): IbnMuhammadTaqiDinBaclabakki
        • BookID (self.title): Hadith: short title of the book
      • Version URI: consists of
        • VersionID (self.version): Shamela0009426: ID of the collection/contributor from which we got the book + number of the book in that collection
        • Lang:
          • self.language: ara: ISO 639-2 language code
          • self.edition_no: 1: edition version number (different digitizations of the same edition get the same edition_no)
  • self.extension = mARkdown (can be inProgress, mARkdown, completed, “”)
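
The component structure listed above can be illustrated with a single regular expression with named groups. This is only a sketch: the pattern, the name split_uri and the ValueError are assumptions made for the example, and the URI class performs its own, more detailed validation.

```python
import re

# Assumed pattern: book, version and extension parts are optional and nested.
URI_PATTERN = re.compile(
    r"(?P<date>\d{4})(?P<author>[A-Za-z]+)"
    r"(?:\.(?P<title>[A-Za-z0-9]+)"
    r"(?:\.(?P<version>[A-Za-z0-9]+)-(?P<language>[a-z]{3})(?P<edition_no>\d+)"
    r"(?:\.(?P<extension>[A-Za-z]+))?)?)?"
)


def split_uri(uri_string):
    """Split a URI string into its components (hypothetical helper)."""
    match = URI_PATTERN.fullmatch(uri_string)
    if match is None:
        raise ValueError(f"not a well-formed URI: {uri_string!r}")
    # Components that are absent are returned as empty strings.
    return {key: value or "" for key, value in match.groupdict().items()}


parts = split_uri(
    "0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1.mARkdown"
)
for key, value in parts.items():
    print(f"{key}: {value}")
```
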

Examples

>>> from openiti.helper.uri import URI
>>> t = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")

Representations of a URI object: print() and repr():

>>> print(repr(t))
uri(date:0255, author:Jahiz, title:Hayawan, version:Sham19Y0023775, language:ara, edition_no:1, extension:completed)
>>> print(t)
0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed

Accessing components of the URI:

>>> t.author
'Jahiz'
>>> t.date
'0255'

Getting URI’s current uri_type (“author”, “book”, “version”, None), i.e., the longest URI that can be built from the object’s components:

>>> t.uri_type
'version'
>>> t.language = ""
>>> t.uri_type
'book'
>>> t.date = ""
>>> t.uri_type == None
True

Building different versions of the URI (uri_types: “author”, “author_yml”, “book”, “book_yml”, “version”, “version_yml”, “version_file”):

>>> t = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")
>>> t.build_uri(uri_type="author")
'0255Jahiz'
>>> t.build_uri("book")
'0255Jahiz.Hayawan'
>>> t.build_uri("book_yml")
'0255Jahiz.Hayawan.yml'
>>> t.build_uri("version")
'0255Jahiz.Hayawan.Sham19Y0023775-ara1'
>>> t.build_uri("version_file")
'0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'

Building paths based on the URI:

>>> t.build_pth(uri_type="version", base_pth=r"D:\test")
'D:/test/0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
>>> t.build_pth(uri_type="version_file", base_pth=r"D:\test")
'D:/test/0275AH/data/0255Jahiz/0255Jahiz.Hayawan/0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'
>>> t.build_pth("version")
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
>>> t.build_pth(uri_type="book_yml")
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan/0255Jahiz.Hayawan.yml'

Without uri_type argument, build_pth() builds the fullest path it can:

>>> t.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan/0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed'
>>> t.language=""  # (removing the language property makes it impossible to build a version uri)
>>> t.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'

NB: by default, build_pth() takes the OpenITI folder structure into account, in which authors are grouped in 25-year batches by their death date. If you do not want to use this feature, set the URI class’s data_in_25_year_repos attribute to False:

>>> t.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'
>>> t.data_in_25_year_repos = False
>>> t.build_pth()
'./0255Jahiz/0255Jahiz.Hayawan'
>>> t.language="ara"
>>> t.build_pth()
'./0255Jahiz/0255Jahiz.Hayawan'
>>> t.data_in_25_year_repos = True
>>> URI.data_in_25_year_repos = False
>>> u = URI("0255Jahiz.Hayawan")
>>> u.build_pth()
'./0255Jahiz/0255Jahiz.Hayawan'
>>> URI.data_in_25_year_repos = True
>>> u.build_pth()
'./0275AH/data/0255Jahiz/0255Jahiz.Hayawan'

Validity tests: setting invalid values for part of a uri raises an exception:

# >>> URI("255Jahiz")
Exception: Date Error: URI must start with a date of 4 digits (255 has 3!)

# >>> URI("0255Jāḥiẓ")
Exception: Author name Error: Author name (Jāḥiẓ) should not contain
digits or non-ASCII characters(culprits: ['ā', 'ḥ', 'ẓ'])

# >>> t.author = "Jāḥiẓ"
Exception: Author name Error: Author name (Jāḥiẓ) should not contain
digits or non-ASCII characters(culprits: ['ā', 'ḥ', 'ẓ'])

# >>> t.author = "0255Jahiz"
Exception: Author name Error: Author name (0255Jahiz) should not contain
digits or non-ASCII characters(culprits: ['0', '2', '5', '5'])

# >>> URI("0255Jahiz.Al-Hayawan")
Exception: Book title Error: Book title (Al-Hayawan) should not contain
non-ASCII characters(culprits: ['-'])

# >>> URI("0255Jahiz.Hayawan.Shāmila00123545-ara1")
Exception: Version string Error: Version string (Shāmila00123545)
should not contain non-ASCII characters(culprits: ['ā'])

# >>> URI("0255Jahiz.Hayawan.Shamela00123545-arab1")
Exception: Language code (arab) should be an ISO 639-2 language code,
consisting of 3 characters

# >>> t.extension = "markdown"
Exception: Extension (markdown) is not among the allowed extensions
(['inProgress', 'completed', 'mARkdown', 'yml', ''])

# >>> URI("0255Jahiz.Hayawan.Shamela00123545-ara1.markdown")
Exception: Extension (markdown) is not among the allowed extensions
(['inProgress', 'completed', 'mARkdown', 'yml', ''])
__call__(uri_type=None, ext=None)

Call the self.build_uri() method of the URI instance.

Examples

>>> my_uri = URI("0768IbnMuhammadTaqiDinBaclabakki")
>>> my_uri.title = "Hadith"
>>> my_uri.version = "Shamela0009426"
>>> my_uri.language = "ara"
>>> my_uri.edition_no = "1"
>>> my_uri()
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1'
>>> my_uri("date")
'0768'
>>> my_uri("author")
'0768IbnMuhammadTaqiDinBaclabakki'
>>> my_uri("author_yml")
'0768IbnMuhammadTaqiDinBaclabakki.yml'
>>> my_uri("book")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith'
>>> my_uri("book_yml")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.yml'
>>> my_uri("version")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1'
>>> my_uri("version_yml")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1.yml'
>>> my_uri("version_file", ext="completed")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1.completed'
__iter__()

Enable iteration over a URI object.

Returns:an iterator containing the components of the URI
Return type:(iterator)

Examples

>>> my_uri = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.inProgress")
>>> for component in my_uri: print(component)
0255
Jahiz
Hayawan
Sham19Y0023775
ara
1
inProgress
>>> my_uri = URI()
>>> for component in my_uri: print(component)
__repr__()

Return a representation of the components of the URI.

Returns:a representation of the components of the URI.
Return type:(str)

Examples

>>> my_uri = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1")
>>> repr(my_uri)
'uri(date:0255, author:Jahiz, title:Hayawan, version:Sham19Y0023775, language:ara, edition_no:1, extension:)'
>>> my_uri = URI()
>>> repr(my_uri)
'uri(date:, author:, title:, version:, language:, edition_no:, extension:)'
__str__(*args, **kwargs)

Return the reassembled URI.

Returns:the reassembled URI
Return type:(str)

Examples

>>> my_uri = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.inProgress")
>>> my_uri.extension = "completed"
>>> print(my_uri)
0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed
author

Get the URI’s author property

base_pth

Get the URI’s base_pth property (the path to be prepended to the URIs) (usually the folder in which the OpenITI 25-years repos reside)

build_uri(uri_type=None, ext=None)

Build an OpenITI URI string from its components.

Parameters:
  • uri_type (str) – the uri type to be returned (defaults to None):
    • “date”: only the date (format: 0000)
    • “author”: authorUri (format: 0255Jahiz)
    • “author_yml”: filename of the author yml file (format: 0255Jahiz.yml)
    • “book”: BookUri (format: 0255Jahiz.Hayawan)
    • “book_yml”: filename of the book yml file (format: 0255Jahiz.Hayawan.yml)
    • “version”: versionURI (format: 0255Jahiz.Hayawan.Shamela000245-ara1)
    • “version_yml”: filename of the version yml file (format: 0255Jahiz.Hayawan.Shamela000245-ara1.yml)
    • “version_file”: filename of the version text file (format: 0255Jahiz.Hayawan.Shamela000245-ara1.completed)
  • ext (str) – extension for the version_file uri string (can be “completed”, “inProgress”, “mARkdown”, “” or None).
Returns:

OpenITI URI as a string, e.g.,

0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1

Return type:

(str)

Examples

>>> my_uri = URI("0768IbnMuhammadTaqiDinBaclabakki")
>>> my_uri.title = "Hadith"
>>> my_uri.version = "Shamela0009426"
>>> my_uri.language = "ara"
>>> my_uri.edition_no = "1"
>>> my_uri.build_uri()
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1'
>>> my_uri.build_uri("date")
'0768'
>>> my_uri.build_uri("author")
'0768IbnMuhammadTaqiDinBaclabakki'
>>> my_uri.build_uri("author_yml")
'0768IbnMuhammadTaqiDinBaclabakki.yml'
>>> my_uri.build_uri("book")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith'
>>> my_uri.build_uri("book_yml")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.yml'
>>> my_uri.build_uri("version")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1'
>>> my_uri.build_uri("version_yml")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1.yml'
>>> my_uri.build_uri("version_file", ext="completed")
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1.completed'
check_ASCII(test_string, string_type)

Check whether the test_string only contains ASCII letters and digits.

check_ASCII_letters(test_string, string_type)

Check whether the test_string only contains ASCII letters.

check_date(date)

Check if date is valid (i.e., 4-digit number or empty string)
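
The rule stated here (a 4-digit number or an empty string) can be sketched in a few lines. This is an illustration only, not the library's implementation: the name is_valid_date is hypothetical, and returning a bool (rather than, say, raising an exception) is an assumption.

```python
import re


def is_valid_date(date):
    """Return True if date is an empty string or exactly four ASCII digits.

    Hypothetical helper; [0-9] is used instead of \\d so that non-ASCII
    digits (which \\d would also match) are rejected.
    """
    return date == "" or re.fullmatch(r"[0-9]{4}", date) is not None


print(is_valid_date("0255"))
print(is_valid_date("255"))
```

Padding with leading zeros ("0255", not "255") keeps every date four characters wide, which is what the URI examples in this manual show.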

check_extension(extension)

Check whether the proposed extension is allowed.

check_language_code(language)

Check whether language is a valid ISO 639-2 language code.

date

Get the URI’s date property

edition_no

Get the URI’s edition_no property (i.e., the last digit of the URI)

extension

Get the URI’s extension property.

from_folder(folder)

Create a URI from a folder path without a file name.

get_author_uri()

Returns the author uri.

Returns:
OpenITI URI as a string, e.g.,
0768IbnMuhammadTaqiDinBaclabakki
Return type:uri_string (str)

Example

>>> my_uri = URI('0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1')
>>> my_uri.get_author_uri()
'0768IbnMuhammadTaqiDinBaclabakki'
get_book_uri()

Returns the book uri.

Returns:
OpenITI URI as a string, e.g.,
0768IbnMuhammadTaqiDinBaclabakki.Hadith
Return type:uri_string (str)

Example

>>> my_uri = URI('0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1')
>>> my_uri.get_book_uri()
'0768IbnMuhammadTaqiDinBaclabakki.Hadith'
get_version_uri()

Returns the version uri.

Returns:
OpenITI URI as a string, e.g.,
0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1
Return type:(str)

Example

>>> my_uri = URI('0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1')
>>> my_uri.extension = "completed"
>>> my_uri.get_version_uri()
'0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1'
language

Get the URI’s language property (an ISO 639-2 language code).

normpath()

Replace backslashes with forward slashes, also on Windows. This is necessary to make the doctests behave the same way on Windows, Mac and Unix systems.

split_uri(uri_string=None)

Split an OpenITI URI string into its components and check if components are valid.

Parameters:uri_string (str) – OpenITI URI, e.g., 0768IbnMuhammadTaqiDinBaclabakki.Hadith.Shamela0009426-ara1
Returns:list of uri components
Return type:(list)

Examples

>>> my_uri = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed")
>>> my_uri.split_uri()
['0255', 'Jahiz', 'Hayawan', 'Sham19Y0023775', 'ara', '1', 'completed']
>>> my_uri.extension=""
>>> my_uri.language=""
>>> my_uri.split_uri()
['0255', 'Jahiz', 'Hayawan']
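
To make the structure of a URI string concrete, here is a minimal sketch that splits one using plain string operations. It is not the library's code: split_uri_string is a hypothetical name, no validity checks are performed, and the sketch assumes a 4-digit date and a 3-letter language code, as in the examples above.

```python
def split_uri_string(uri_string):
    """Split an OpenITI-style URI string into its components (no validation)."""
    parts = uri_string.split(".")
    # First part: the 4-digit date followed by the author name.
    components = [parts[0][:4], parts[0][4:]]
    # Second part (optional): the book title.
    if len(parts) > 1:
        components.append(parts[1])
    # Third part (optional): version id, hyphen, language code, edition number.
    if len(parts) > 2:
        version, lang_and_edition = parts[2].split("-")
        components.extend([version, lang_and_edition[:3], lang_and_edition[3:]])
    # Fourth part (optional): the file extension.
    if len(parts) > 3:
        components.append(parts[3])
    return components


print(split_uri_string("0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed"))
# ['0255', 'Jahiz', 'Hayawan', 'Sham19Y0023775', 'ara', '1', 'completed']
```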
title

Get the URI’s title property.

uri_type

Get the URI object’s current uri_type (“author”, “book”, “version”, None) based on its defined components.

NB: uri_type does not have a setter method, making it read-only!

Examples

>>> my_uri = URI("0255Jahiz.Hayawan.Sham19Y0023775-ara1.inProgress")
>>> my_uri.uri_type
'version'
>>> my_uri.language = ""
>>> my_uri.uri_type
'book'
>>> my_uri = URI()
>>> my_uri.uri_type is None
True
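
The way the type follows from the defined components can be sketched as below. The exact conditions are an assumption inferred from the examples above (clearing the language turns a version URI into a book URI); infer_uri_type is a hypothetical stand-alone function, not the property's actual code.

```python
def infer_uri_type(date="", author="", title="", version="",
                   language="", edition_no=""):
    """Guess the URI type from which components are non-empty (assumed rules)."""
    if not (date and author):
        return None
    if not title:
        return "author"
    # A version URI is assumed to need all three version-level components.
    if version and language and edition_no:
        return "version"
    return "book"


print(infer_uri_type("0255", "Jahiz", "Hayawan", "Sham19Y0023775", "ara", "1"))
print(infer_uri_type("0255", "Jahiz", "Hayawan", "Sham19Y0023775", "", "1"))
```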
version

Get the URI’s version property.

openiti.helper.uri.add_character_count(tok_count, char_count, tar_uri, execute=False)

Add the character and token counts to the new version yml file

Parameters:
  • tok_count (int) – number of Arabic tokens in a text
  • char_count (int) – number of Arabic characters in a text
  • tar_uri (URI object) – uri of the target text
  • execute (bool) – if True, the function will do its work silently. If False, it will only print a description of the action.
Returns:

None

openiti.helper.uri.add_readme(target_folder)

Add default README.md file to target_folder.

Returns:None
openiti.helper.uri.add_text_questionnaire(target_folder)

Add default text_questionnaire.md file to target_folder.

Returns:None
openiti.helper.uri.change_uri(old, new, old_base_pth=None, new_base_pth=None, execute=False, book_relations_fp='https://github.com/OpenITI/kitab-metadata-automation/raw/master/output/OpenITI_Github_clone_book_relations.json')

Change a uri and put all files in the correct folder.

If a version URI changes:

  • all text files of that version should be moved
  • the yml file of that version should be updated and moved

If a book uri changes:

  • the yml file of that book should be updated and moved
  • all annotation text files of all versions of the book should be moved
  • all yml files of versions of that book should be updated and moved
  • the original book folder itself should be (re)moved
  • (optionally) all references to the book in book yml files of other books should be updated

If an author uri changes:

  • the yml file of that author should be updated and moved
  • all book yml files of that author should be updated and moved
  • all annotation text files of all versions of all books should be moved
  • all yml files of versions of all books should be updated and moved
  • the original book folders should be (re)moved
  • the original author folder itself should be (re)moved

Examples:

change_uri("0255Jahiz", "0256Jahiz")
change_uri("0255Jahiz", "0255JahizBasri")
change_uri("0255Jahiz.Hayawan", "0255Jahiz.KitabHayawan")
change_uri("0255Jahiz.Hayawan.Shamela002526-ara1", "0255Jahiz.Hayawan.Shamela002526-ara2")
change_uri("0255Jahiz.Hayawan.Shamela002526-ara1.completed", "0255Jahiz.Hayawan.Shamela002526-ara1.mARkdown")
Parameters:
  • old (str) – URI string to be changed
  • new (str) – URI string to which the old URI should be changed.
  • old_base_pth (str) – path to the folder containing the OpenITI 25-year repos, related to the old uri
  • new_base_pth (str) – path to the folder containing the OpenITI 25-year repos, related to the new uri
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

None
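
Putting files "in the correct folder" means deriving a path from the URI. The sketch below is a guess at that mapping, not the library's code. It assumes the folder layout visible in the example path given for readYML in this manual (…/0450AH/data/0429AbuMansurThacalibi/0429AbuMansurThacalibi.AhsanMaSamictu/…) and assumes that a 25-years repo is named after the date rounded up to the next multiple of 25. The names repo_folder and book_folder are hypothetical.

```python
import posixpath


def repo_folder(date):
    """Name of the 25-years repo for a 4-digit date string (assumed rule)."""
    n = int(date)
    upper = -(-n // 25) * 25  # round up to the next multiple of 25
    return f"{upper:04d}AH"


def book_folder(book_uri, base_pth=""):
    """Folder of a book URI below base_pth, using forward slashes."""
    author_uri = book_uri.split(".")[0]
    return posixpath.join(base_pth, repo_folder(author_uri[:4]),
                          "data", author_uri, book_uri)


print(book_folder("0429AbuMansurThacalibi.AhsanMaSamictu", "25Y_repos"))
# 25Y_repos/0450AH/data/0429AbuMansurThacalibi/0429AbuMansurThacalibi.AhsanMaSamictu
```

Under these assumptions, changing a date from 0255 to 0256 stays within the same repo (0275AH), whereas a change from 0275 to 0276 would move the author folder to a different repo.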

openiti.helper.uri.check_token_count(version_uri, ymlD, version_fp='', find_latest=True)

Check whether the token count in the version yml file agrees with the actual token count of the text file.

Parameters:
  • version_uri (URI object) – version uri of the target text
  • ymlD (dict) – dictionary containing the data from the relevant yml file
  • version_fp (str) – file path to the target text
  • find_latest (bool) – if False, the version_fp will be used as is; if set to True, the script will find the most developed version of the text file, based on its extension (mARkdown > completed > inProgress)
Returns:

Tuple containing 2 values (or None):

  • tok_count (int): number of Arabic tokens in the target text
  • char_count (int): number of Arabic characters in the target text

Return type:

(tuple)
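
The find_latest rule (mARkdown > completed > inProgress) amounts to a priority lookup. A minimal sketch, working on a list of filenames rather than on the file system; most_developed is a hypothetical name, and the fallback to a file without extension is an assumption.

```python
# Most developed first, in the order given for find_latest.
EXTENSION_PRIORITY = ["mARkdown", "completed", "inProgress"]


def most_developed(version_uri, filenames):
    """Return the most developed text file name for version_uri, or None."""
    for ext in EXTENSION_PRIORITY:
        candidate = f"{version_uri}.{ext}"
        if candidate in filenames:
            return candidate
    # Assumption: a file without extension is the least developed variant.
    if version_uri in filenames:
        return version_uri
    return None


files = [
    "0255Jahiz.Hayawan.Sham19Y0023775-ara1.inProgress",
    "0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed",
    "0255Jahiz.Hayawan.Sham19Y0023775-ara1.yml",
]
print(most_developed("0255Jahiz.Hayawan.Sham19Y0023775-ara1", files))
# 0255Jahiz.Hayawan.Sham19Y0023775-ara1.completed
```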

openiti.helper.uri.check_yml_file(yml_fp, yml_type, version_fp=None, execute=False, check_token_counts=True)

Check whether a yml file exists, is valid, and contains no foreign keys.

Parameters:
  • yml_fp (str) – path to the yml file
  • yml_type (str) – either “author”, “book”, or “version”
  • version_fp (str) – path to the text file of the version (only relevant for version yml files; default = None)
  • execute (bool) – if False, the user will be prompted before any changes are made to the yml file
  • check_token_counts (bool) – if True, the script will check the number of tokens (and characters) in the text
Returns:

None or yml_fp

openiti.helper.uri.check_yml_files(start_folder, exclude=[], execute=False, check_token_counts=True, flat_folder=False)

Check whether yml files are missing or have faulty data in them.

Parameters:
  • start_folder (str) – path to the parent folder of the folders that need to be checked.
  • exclude (list) – a list of directory names that should be excluded.
  • execute (bool) – if execute is set to False, the script will only show which changes it would undertake if set to True. After it has looped through all files and folders, it will give the user the option to execute the proposed changes.
Returns:

None

openiti.helper.uri.make_folder(new_folder, new_uri, execute=False)

Check if folder exists; if not, make folder (and, if needed, parents)

Parameters:
  • new_folder (str) – path to new folder
  • new_uri (OpenITI uri object) – uri of the text
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

None

openiti.helper.uri.move_to_new_uri_pth(old_fp, new_uri, execute=False)

Move file to its new location.

Parameters:
  • old_fp (filepath) – path to the old file
  • new_uri (URI object) – URI of the new file
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

path to the new file

Return type:

(str)

openiti.helper.uri.move_yml(yml_fp, new_uri, uri_type, execute=False)

Replace the URI in the yml file and save the yml file in its new location.

Parameters:
  • yml_fp (str) – path to the original yml file
  • new_uri (URI object) – the new uri
  • uri_type (str) – uri type (author, book, version)
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

filepath of the new yml file

Return type:

(str)

openiti.helper.uri.new_yml(tar_yfp, yml_type, execute=False)

Create a new yml file from template.

Parameters:
  • tar_yfp (str) – filepath to the new yml file
  • yml_type (str) – type of yml file (either “version_yml”, “book_yml”, or “author_yml”)
Returns:

None

openiti.helper.uri.replace_tok_counts(missing_tok_count)

Replace the token counts in the relevant yml files.

Parameters:missing_tok_count (list) – a list of tuples, each containing:
  • uri (OpenITI URI object)
  • version_fp (str)
  • token_count (int): the number of Arabic tokens in the text file
  • char_count (int): the number of Arabic characters in the text file
Returns:None

openiti.helper.yml

Functions to read and write yaml files.

OpenITI metadata is stored in yaml files. (yaml originally stood for “yet another markup language”; its current expansion is “YAML Ain’t Markup Language”.)

NB: In correctly formatted OpenITI yml files,
  • keys (lemmata) should always:
    • contain at least one hash (#)
    • end with a colon (:)
    • be free of any other non-letter/numeric characters
  • Values:
    • may contain any character, including colons and new line characters (only something that looks like a yml key (i.e., a combination of letters and hashes ending with a colon) should be avoided at the beginning of a line)
    • multiline values should be indented with 4 spaces

The ymlToDic and dicToYML functions will retain double new lines and new lines before bullet lists (in which bullets are * or -)
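
The three rules for keys listed above can be checked with a short function. This is a sketch for illustration, not the check the library performs; looks_like_yml_key is a hypothetical name, and restricting "letters" to ASCII letters is an assumption.

```python
import re


def looks_like_yml_key(s):
    """True if s has at least one hash, ends with a colon, and otherwise
    consists only of ASCII letters and digits."""
    return "#" in s and re.fullmatch(r"[A-Za-z0-9#]+:", s) is not None


print(looks_like_yml_key("00#BOOK#URI######:"))
print(looks_like_yml_key("title:"))
```

The second call fails the check because the key contains no hash, which is why ordinary text such as "presence of colons: not a problem" is safe inside a value.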

openiti.helper.yml.check_yml_completeness(fp, exclude_keys={'00#AUTH#URI######:', '00#BOOK#URI######:', '00#VERS#CLENGTH##:', '00#VERS#LENGTH###:', '00#VERS#URI######:'}, templates=['00#AUTH#URI######: \n10#AUTH#ISM####AR: Fulān\n10#AUTH#KUNYA##AR: Abū Fulān, Abū Fulānaŧ\n10#AUTH#LAQAB##AR: Fulān al-dīn, Fulān al-dawlaŧ\n10#AUTH#NASAB##AR: b. Fulān b. Fulān b. Fulān b. Fulān\n10#AUTH#NISBA##AR: al-Fulānī, al-Fāʿil, al-Fulānī, al-Mufaʿʿil\n10#AUTH#SHUHRA#AR: Ibn Fulān al-Fulānī\n20#AUTH#BORN#####: URIs from Althurayya, comma separated\n20#AUTH#DIED#####: URIs from Althurayya, comma separated\n20#AUTH#RESIDED##: URIs from Althurayya, comma separated\n20#AUTH#VISITED##: URIs from Althurayya, comma separated\n30#AUTH#BORN###AH: YEAR-MON-DA (X+ for unknown)\n30#AUTH#DIED###AH: YEAR-MON-DA (X+ for unknown)\n40#AUTH#STUDENTS#: AUTH_URI from OpenITI, comma separated\n40#AUTH#TEACHERS#: AUTH_URI from OpenITI, comma separated\n80#AUTH#BIBLIO###: src@id, src@id, src@id, src@id, src@id\n90#AUTH#COMMENT##: a free running comment here; you can add as many\n lines as you see fit; the main goal of this comment section is to have a\n place to record valuable information, which is difficult to formalize\n into the above given categories.', "00#BOOK#URI######: \n10#BOOK#GENRES###: src@keyword, src@keyword, src@keyword\n10#BOOK#TITLEA#AR: Kitāb al-Muʾallif\n10#BOOK#TITLEB#AR: Risālaŧ al-Muʾallif\n20#BOOK#WROTE####: URIs from Althurayya, comma separated\n30#BOOK#WROTE##AH: YEAR-MON-DA (X+ for unknown)\n40#BOOK#RELATED##: URI of a book from OpenITI, or [Author's Title],\n followed by abbreviation for relation type between brackets (see\n book_relations repo). Only include relations with older books. 
Separate\n related books with semicolon.\n80#BOOK#EDITIONS#: permalink, permalink, permalink\n80#BOOK#LINKS####: permalink, permalink, permalink\n80#BOOK#MSS######: permalink, permalink, permalink\n80#BOOK#STUDIES##: permalink, permalink, permalink\n80#BOOK#TRANSLAT#: permalink, permalink, permalink\n90#BOOK#COMMENT##: a free running comment here; you can add as many\n lines as you see fit; the main goal of this comment section is to have a\n place to record valuable information, which is difficult to formalize\n into the above given categories.", '00#VERS#LENGTH###:\n00#VERS#CLENGTH##:\n00#VERS#URI######: \n80#VERS#BASED####: permalink, permalink, permalink\n80#VERS#COLLATED#: permalink, permalink, permalink\n80#VERS#LINKS####: all@id, vol1@id, vol2@id, vol3@id, volX@id\n90#VERS#ANNOTATOR: the name of the annotator (latin characters; please\n use consistently)\n90#VERS#COMMENT##: a free running comment here; you can add as many\n lines as you see fit; the main goal of this comment section is to have a\n place to record valuable information, which is difficult to formalize\n into the above given categories.\n90#VERS#DATE#####: YYYY-MM-DD\n90#VERS#ISSUES###: formalized issues, separated with commas'])

Check how many of a yml file’s fields have been filled in.

Returns a list of all keys in the yml file whose values are neither empty nor the default value, and a list of all relevant keys.

NB: some fields are filled automatically (e.g., the URI field, token count, etc.), so you can choose to exclude such fields from the check.

Use this function if you are interested in which fields exactly are not filled in; if you are only interested in the percentage, use the check_yml_completeness_pct function instead.

Parameters:
  • fp (str) – path to the yml file
  • exclude_keys (set) – do not take these keys into account when checking completeness
  • templates (list) – list of templates from which the default values are taken
Returns:

tuple (list of keys that contain non-default values, list of relevant keys)

openiti.helper.yml.check_yml_completeness_pct(fp, exclude_keys={'00#AUTH#URI######:', '00#BOOK#URI######:', '00#VERS#CLENGTH##:', '00#VERS#LENGTH###:', '00#VERS#URI######:'}, templates=['00#AUTH#URI######: \n10#AUTH#ISM####AR: Fulān\n10#AUTH#KUNYA##AR: Abū Fulān, Abū Fulānaŧ\n10#AUTH#LAQAB##AR: Fulān al-dīn, Fulān al-dawlaŧ\n10#AUTH#NASAB##AR: b. Fulān b. Fulān b. Fulān b. Fulān\n10#AUTH#NISBA##AR: al-Fulānī, al-Fāʿil, al-Fulānī, al-Mufaʿʿil\n10#AUTH#SHUHRA#AR: Ibn Fulān al-Fulānī\n20#AUTH#BORN#####: URIs from Althurayya, comma separated\n20#AUTH#DIED#####: URIs from Althurayya, comma separated\n20#AUTH#RESIDED##: URIs from Althurayya, comma separated\n20#AUTH#VISITED##: URIs from Althurayya, comma separated\n30#AUTH#BORN###AH: YEAR-MON-DA (X+ for unknown)\n30#AUTH#DIED###AH: YEAR-MON-DA (X+ for unknown)\n40#AUTH#STUDENTS#: AUTH_URI from OpenITI, comma separated\n40#AUTH#TEACHERS#: AUTH_URI from OpenITI, comma separated\n80#AUTH#BIBLIO###: src@id, src@id, src@id, src@id, src@id\n90#AUTH#COMMENT##: a free running comment here; you can add as many\n lines as you see fit; the main goal of this comment section is to have a\n place to record valuable information, which is difficult to formalize\n into the above given categories.', "00#BOOK#URI######: \n10#BOOK#GENRES###: src@keyword, src@keyword, src@keyword\n10#BOOK#TITLEA#AR: Kitāb al-Muʾallif\n10#BOOK#TITLEB#AR: Risālaŧ al-Muʾallif\n20#BOOK#WROTE####: URIs from Althurayya, comma separated\n30#BOOK#WROTE##AH: YEAR-MON-DA (X+ for unknown)\n40#BOOK#RELATED##: URI of a book from OpenITI, or [Author's Title],\n followed by abbreviation for relation type between brackets (see\n book_relations repo). Only include relations with older books. 
Separate\n related books with semicolon.\n80#BOOK#EDITIONS#: permalink, permalink, permalink\n80#BOOK#LINKS####: permalink, permalink, permalink\n80#BOOK#MSS######: permalink, permalink, permalink\n80#BOOK#STUDIES##: permalink, permalink, permalink\n80#BOOK#TRANSLAT#: permalink, permalink, permalink\n90#BOOK#COMMENT##: a free running comment here; you can add as many\n lines as you see fit; the main goal of this comment section is to have a\n place to record valuable information, which is difficult to formalize\n into the above given categories.", '00#VERS#LENGTH###:\n00#VERS#CLENGTH##:\n00#VERS#URI######: \n80#VERS#BASED####: permalink, permalink, permalink\n80#VERS#COLLATED#: permalink, permalink, permalink\n80#VERS#LINKS####: all@id, vol1@id, vol2@id, vol3@id, volX@id\n90#VERS#ANNOTATOR: the name of the annotator (latin characters; please\n use consistently)\n90#VERS#COMMENT##: a free running comment here; you can add as many\n lines as you see fit; the main goal of this comment section is to have a\n place to record valuable information, which is difficult to formalize\n into the above given categories.\n90#VERS#DATE#####: YYYY-MM-DD\n90#VERS#ISSUES###: formalized issues, separated with commas'])

Check which proportion of the relevant fields in a yml file have been filled.

NB: some fields are filled automatically (e.g., the URI field, token count, etc.), so you can choose to exclude such fields from the check.

Use this function if you are only interested in the percentage of fields filled in; if you are interested in which fields exactly are not filled in, use the check_yml_completeness function instead.

Parameters:
  • fp (str) – path to the yml file
  • exclude_keys (set) – do not take these keys into account when checking completeness
  • templates (list) – list of templates from which the default values are taken
Returns:

float (percentage of the fields filled in)
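
The idea behind the completeness check can be sketched with two dictionaries: the values read from a yml file and the default values taken from a template. This is not the library's implementation. The name completeness_pct is hypothetical, the 0–100 scale is an assumption (the function may equally return a fraction), and the filled-in title below is invented for illustration.

```python
def completeness_pct(values, defaults, exclude_keys=frozenset()):
    """Percentage (0-100) of relevant keys whose value is neither empty
    nor identical to the template default."""
    relevant = [k for k in defaults if k not in exclude_keys]
    if not relevant:
        return 0.0
    filled = [
        k for k in relevant
        if values.get(k, "").strip() not in ("", defaults[k].strip())
    ]
    return 100 * len(filled) / len(relevant)


defaults = {
    "00#BOOK#URI######:": "",
    "10#BOOK#TITLEA#AR:": "Kitāb al-Muʾallif",
    "10#BOOK#TITLEB#AR:": "Risālaŧ al-Muʾallif",
}
values = {
    "00#BOOK#URI######:": "0845Maqrizi.Muqaffa",
    "10#BOOK#TITLEA#AR:": "Kitāb al-Muqaffā",     # filled in (invented value)
    "10#BOOK#TITLEB#AR:": "Risālaŧ al-Muʾallif",  # still the template default
}
print(completeness_pct(values, defaults, exclude_keys={"00#BOOK#URI######:"}))
# 50.0
```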

openiti.helper.yml.dicToYML(dic, max_length=80, reflow=True, break_long_words=False)

Convert a dictionary into a yml string.

NB: use the pilcrow (¶) to force a line break within dictionary values.

Parameters:
  • dic (dict) – a dictionary of key-value pairs.
  • max_length (int) – the maximum number of characters a line should contain.
  • reflow (bool) – if set to False, the original layout (line endings, indentation) of the yml string will be preserved (useful for files containing bullet lists etc.); if set to True, the indentation and line length will be standardized.
  • break_long_words (bool) – if False, long words will be kept on one line
Returns:

yml string representation of the dic’s key-value pairs

Return type:

(str)

Examples

>>> yml_dic = {'00#BOOK#URI######:': '0845Maqrizi.Muqaffa',                       '90#BOOK#COMMENT##:': 'multiline value; presence of colons: not a problem¶¶    * bullet point 1¶    * bullet point 2'}
>>> yml_str = '        00#BOOK#URI######: 0845Maqrizi.Muqaffa\n        90#BOOK#COMMENT##: multiline\n            value; presence of colons: not\n            a problem\n        \n            * bullet point 1\n            * bullet point 2        '.replace("        ", "") # remove Python indentation for doctest
>>> dicToYML(yml_dic, max_length=30, reflow=True) == yml_str
True
>>> yml_str = '        00#BOOK#URI######: 0845Maqrizi.Muqaffa\n        90#BOOK#COMMENT##: multiline value; presence of colons: not a problem\n            \n            * bullet point 1\n            * bullet point 2        '.replace("        ", "") # remove Python indentation for doctest
>>> dicToYML(yml_dic, max_length=30, reflow=False) == yml_str
True
openiti.helper.yml.fix_broken_yml(fp, execute=True)

Fix a yml file that is broken because (1) a line does not start with a valid key or space or (2) the colon after the key is absent

Parameters:
  • fp (str) – path to the broken yml file
  • execute (bool) – if False, user’s judgment about the fix will be asked before the fix is implemented
Returns:

None or yml_d

openiti.helper.yml.readYML(fp, reflow=False)

Read a yml file and convert it into a dictionary.

Parameters:
  • fp (str) – path to the yml file.
  • reflow (bool) – if set to False, the original layout (line endings, indentation) of the yml file will be preserved (useful for files containing bullet lists etc.); in the output string, new line characters will be replaced with ¶. If set to True, new line characters will be removed (except double line breaks and line breaks in bullet lists) and the indentation and line length will be standardized.
Returns:
(dict): dictionary representation of the yml key-value pairs

Examples:

## >>> fp = "D:/London/OpenITI/25Y_repos/0450AH/data/0429AbuMansurThacalibi/0429AbuMansurThacalibi.AhsanMaSamictu/0429AbuMansurThacalibi.AhsanMaSamictu.Shamela0025011-ara1.yml"
## >>> readYML(fp)
## {}

openiti.helper.yml.ymlToDic(yml_str, reflow=False, yml_fp='')

Convert a yml string into a dictionary.

NB: in order to be read correctly, OpenITI yml keys (lemmata) should always:
  • contain at least one hash (#)
  • end with a colon (:)
  • be free of any other non-letter/numeric characters

Values may contain any character, including colons.

In multiline values, every new line should be indented with 4 spaces; multiline values may use double new lines and bullet lists (using * or - for items) for clarity.

Parameters:
  • yml_str (str) – a yml string.
  • reflow (bool) – if set to False, the original layout (line endings, indentation) of the yml file will be preserved (useful for files containing bullet lists etc.); in the output string, new line characters will be replaced with ¶. If set to True, new line characters will be removed (except double line breaks and line breaks in bullet lists) and the indentation and line length will be standardized.
Returns:

dictionary representation of the yml key-value pairs

Return type:

(dict)

Examples

>>> from yml import ymlToDic
>>> yml_str = "00#BOOK#URI######: 0845Maqrizi.Muqaffa\n90#BOOK#COMMENT##: multiline value; presence\n    of colons: not a problem\n\n\n".replace("        ", "") # remove Python indentation for doctest
>>> yml_dic = {'00#BOOK#URI######:': '0845Maqrizi.Muqaffa',                       '90#BOOK#COMMENT##:': 'multiline value; presence of colons: not a problem'}
>>> ymlToDic(yml_str, reflow=True) == yml_dic
True
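
How the key rules make such a string parseable can be shown with a simplified stand-alone parser. This is a sketch, not ymlToDic itself: parse_yml and KEY_RE are hypothetical names, only ASCII letters and digits are accepted in keys, and, unlike the real function, the sketch does not preserve double new lines or bullet lists.

```python
import re

# A key: letters/digits/hashes with at least one hash, followed by a colon,
# at the very start of the line. Indented lines can therefore never be keys.
KEY_RE = re.compile(r"^([A-Za-z0-9]*#[A-Za-z0-9#]*:)\s*(.*)$")


def parse_yml(yml_str):
    """Parse an OpenITI-style yml string into a dict, reflowing
    multiline values into a single line."""
    result = {}
    current_key = None
    for line in yml_str.splitlines():
        match = KEY_RE.match(line)
        if match:
            current_key, value = match.groups()
            result[current_key] = value.strip()
        elif current_key is not None and line.strip():
            # Continuation line: append it to the current value.
            joined = result[current_key] + " " + line.strip()
            result[current_key] = joined.strip()
    return result


yml_str = ("00#BOOK#URI######: 0845Maqrizi.Muqaffa\n"
           "90#BOOK#COMMENT##: multiline value; presence\n"
           "    of colons: not a problem\n\n\n")
print(parse_yml(yml_str))
```

The colon inside "of colons: not a problem" does no harm because that line starts with spaces and contains no hash before the colon.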

openiti.git

openiti.git.clone_OpenITI

Clone github repositories to your local machine.

Examples

Clone all OpenITI 25-year repos:

# >>> repo_list = get_repo_urls(group="orgs", name="OpenITI",
#                                path_pattern=r"\d{4}AH$")
# >>> clone_repos(repo_list, r"D:\OpenITI")

Clone specific repos:

# >>> base = "https://github.com/OpenITI/"
# >>> repo_list = [base+"mARkdown_scheme", base+"RELEASE"]
# >>> clone_repos(repo_list, r"D:\OpenITI")
Command line usage:

To clone all the OpenITI organization’s 25-years repos:

python clone_OpenITI.py ["orgs" or "users"] [user or org name] [path_pattern] [[dest_folder]]

dest_folder$ python pth/to/clone_OpenITI.py orgs OpenITI \d+AH$

other_folder$ python pth/to/clone_OpenITI.py orgs OpenITI \d+AH$ path/to/dest_folder

If fewer than 3 arguments are given, the program will prompt you for the arguments:

$ python clone_OpenITI.py
Enter the organization/user name: openiti
Organization or User? (orgs/users): orgs
Enter the regex pattern to match desired repo URLs: \d{4}AH
Enter the destination folder for the clone: openiti_clone
openiti.git.clone_OpenITI.clone_repos(all_repos, clone_dir='.')

Clone the list of repo urls all_repos to the clone_dir

Parameters:
  • all_repos (list) – a list of repo urls.
  • clone_dir (str) – path to the local folder where the repos are to be cloned. Defaults to the current active directory.
Returns:

None

openiti.git.clone_OpenITI.get_repo_urls(group='orgs', name='OpenITI', path_pattern='\\d{4}AH$')

Get a list of all repos owned by organisation/user name that match the regex path_pattern

Parameters:
  • group (str) – either “users” or “orgs”. Defaults to “orgs”
  • name (str) – GitHub name of the user/organization. Defaults to “OpenITI”
  • path_pattern (str) – regex pattern that matches the desired repository names. If none is defined, all repos will be cloned. Defaults to r"\d{4}AH$".
Returns:

a list of repo urls that match the path_pattern regex.

Return type:

(list)
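
The effect of path_pattern can be illustrated without any network access. The sketch below covers only the filtering step and makes no claim about how the function retrieves the repository list; filter_repo_urls is a hypothetical name, the URL list is a small hand-written sample, and using re.search (rather than, say, re.match) is an assumption.

```python
import re


def filter_repo_urls(repo_urls, path_pattern=r"\d{4}AH$"):
    """Keep only the urls in which path_pattern is found."""
    return [url for url in repo_urls if re.search(path_pattern, url)]


sample = [
    "https://github.com/OpenITI/0025AH",
    "https://github.com/OpenITI/0050AH",
    "https://github.com/OpenITI/mARkdown_scheme",
    "https://github.com/OpenITI/RELEASE",
]
print(filter_repo_urls(sample))
# ['https://github.com/OpenITI/0025AH', 'https://github.com/OpenITI/0050AH']
```

The trailing $ anchors the pattern to the end of the url, so only repositories whose name ends in four digits plus "AH" are kept.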

openiti.git.collect_txt_files

Moves all text files from all subdirectories into a single directory.

Todo

if more than one version of the text file is in the folder,
the script will not automatically copy the best version!

Examples

Command line usage:

$ collect_txt_files D:\OpenITI\25Y_folders D:\OpenITI\release\texts

If the source and target directories were not given, the user will be prompted for them:

$ collect_txt_files
Enter the source directory:
Enter the target directory:
openiti.git.collect_txt_files.collect_text_files(source_dir, dest_dir, exclude_folders=[], extensions=['mARkdown', 'inProgress', 'completed', ''], version_id_only=True)

Copy all text files in source_dir (+ its subfolders) to dest_dir.

Parameters:
  • source_dir (str) – path to the directory that contains (subfolders containing) the text files
  • dest_dir (str) – path to a directory to which text files will be copied.
  • exclude_folders (list) – directories to be excluded from the collection process
  • extensions (list) – list of extensions; only text files with these extensions will be copied. Defaults to [“mARkdown”, “inProgress”, “completed”, “”]
  • version_id_only (bool) – if True, the filename will be shortened to the last part of the URI, i.e., the version id + language id (e.g., Shamela001185-ara1). Defaults to True.
Returns:

number of files copied

Return type:

(int)
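
The combined effect of the extensions and version_id_only parameters can be sketched on a list of filenames. This is not the function's actual code: select_text_files is a hypothetical name, whether the shortened filename keeps its extension is not stated in this manual (the sketch drops it, following the example Shamela001185-ara1), and the sample filenames are invented.

```python
def select_text_files(filenames,
                      extensions=("mARkdown", "inProgress", "completed", ""),
                      version_id_only=True):
    """Map each selected source filename to its (assumed) target filename."""
    selected = {}
    for fn in filenames:
        parts = fn.split(".")
        # A version-level file has 3 parts (no extension) or 4 parts,
        # and its third part contains the hyphen before the language code.
        if len(parts) not in (3, 4) or "-" not in parts[2]:
            continue
        ext = parts[3] if len(parts) == 4 else ""
        if ext not in extensions:
            continue
        selected[fn] = parts[2] if version_id_only else fn
    return selected


names = [
    "0255Jahiz.Hayawan.Shamela001185-ara1.completed",
    "0255Jahiz.Hayawan.Shamela001185-ara1.yml",
    "0255Jahiz.Hayawan.yml",
    "README.md",
]
print(select_text_files(names))
# {'0255Jahiz.Hayawan.Shamela001185-ara1.completed': 'Shamela001185-ara1'}
```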

openiti.git.collect_txt_files.move_ara_files(source_dir, dest_dir)

Old name of the collect_text_files function. The function was renamed because the old name did not reflect what the function does; the old name is kept for backward compatibility.

openiti.git.get_issues

Get selected annotation issues from GitHub; optionally, print them or save them as a tsv file.

Example:

issues = get_issues("OpenITI/Annotation",
                    issue_labels=["in progress"],
                    state="all"
                    )
issues = define_text_uris(issues)
uri_dict = sort_issues_by_uri(issues)
print_issues_by_uri(uri_dict, "test.tsv")
openiti.git.get_issues.define_text_uris(issues, verbose=False)

Define which text uri the issue pertains to. Store the uri in the issue object (issue.uri).

Parameters:
  • issues (list) – a list of github issue objects.
  • verbose (bool) – if verbose, print issues for which no uris were found.
Returns:

the list of updated github issue objects

Return type:

(list)

openiti.git.get_issues.get_issues(repo_name, access_token=None, issue_labels=None, state='open')

Get all issues connected to a specific github repository.

Parameters:
  • repo_name (str) – the name of the Github repository
  • access_token (str) – GitHub access token. Defaults to None.
  • issue_labels (list; default: None) – a list of github issue label names; only the issues with an issue label name in this list will be downloaded; if set to None, all issues will be downloaded
  • state (str; default: “open”) – only the issues with this state (open/closed/all) will be downloaded.
Returns:

a list of github issues.

Return type:

(list)

openiti.git.get_issues.print_issues_by_uri(uri_dict, save_fp='')

Print a tab-delimited list of uris with the issues connected to them. The columns are: URI, issue number, issue label.

Parameters:uri_dict (dict) – key: uri, value: list of github issues related to this uri
Returns:None
openiti.git.get_issues.sort_issues_by_uri(issues)

Create a dictionary with the uris as keys.

Parameters:issues (list) – a list of github issue objects
Returns:
a dictionary with following key-value pairs:
key: uri; value: list of github issue objects related to this uri
Return type:(dict)
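
The grouping step can be sketched with a defaultdict. The Issue tuple below is a hypothetical stand-in for the github issue objects (only the uri attribute mentioned above is relied on), and group_by_uri is not the library's function.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a github issue object with a uri attribute.
Issue = namedtuple("Issue", ["number", "uri", "labels"])


def group_by_uri(issues):
    """Return a dict mapping each uri to the list of issues that carry it."""
    uri_dict = defaultdict(list)
    for issue in issues:
        uri_dict[issue.uri].append(issue)
    return dict(uri_dict)


issues = [
    Issue(1, "0255Jahiz.Hayawan", ["in progress"]),
    Issue(2, "0255Jahiz.Hayawan", ["URI change suggested"]),
    Issue(3, "0845Maqrizi.Muqaffa", ["in progress"]),
]
for uri, related in group_by_uri(issues).items():
    print(uri, [issue.number for issue in related])
```

Issue numbers and labels in the sample are invented for the example.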

openiti.instantiations

openiti.instantiations.generate_imech_instantiation

openiti.instantiations.generate_istylo_instantiation

openiti.new_books

openiti.new_books.add.add_books

Scripts to add books to the correct OpenITI repositories.

  • initialize_texts_from_CSV: use a csv file that contains, for every file that needs to be added to the corpus:
    • the path to its current location
    • the URI it needs to get
  • initialize_new_texts_in_folder: initialize all text files in a folder (in order for this to work, all files need to have valid URIs)
  • download_texts_from_CSV: download texts from the internet and directly put them into the correct OpenITI folder
openiti.new_books.add.add_books.download_texts_from_CSV(csv_fp, base_url='', new_base_pth='')

Use a CSV file (filename, URI) to download a list of texts to the relevant OpenITI folder.

The CSV file (which should not contain a heading) can contain full urls for the original files, or only filenames; in the latter case, the url of the website where these files are located should be passed to the function as the base_url argument. Similarly, the URI column can contain full OpenITI URI filepaths or only the URIs; in the latter case, the path to the folder containing the OpenITI 25-years folders should be passed to the function as the new_base_pth argument.

Parameters:
  • csv_fp (str) – path to a csv file (no headings!) that contains the following columns:
    0. filepath to (or filename of) the text file
    1. full version uri of the text file
  • base_url (str) – url of the website where the files are located (needed if the CSV file contains only filenames instead of full urls). Defaults to “”.
  • new_base_pth (str) – path to the folder containing the OpenITI 25-years repos. Defaults to “”.
Returns:

None

openiti.new_books.add.add_books.initialize_new_text(origin_fp, target_base_pth, execute=False)

Move a new text file to its OpenITI repo, creating yml files if necessary (or copying them from the same folder if present).

The function also checks whether the new text adheres to OpenITI text format.

Parameters:
  • origin_fp (str) – filepath of the text file (filename must be in OpenITI uri format)
  • target_base_pth (str) – path to the folder that contains the 25-years-repos
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

None

Example

# >>> origin_folder = r"D:\OpenITI\barzakh"
# >>> fn = "0375IkhwanSafa.Rasail.Hindawi95926405Vols-ara1.completed"
# >>> origin_fp = os.path.join(origin_folder, fn)
# >>> target_base_pth = r"D:\OpenITI\25Yrepos"
# >>> initialize_new_text(origin_fp, target_base_pth, execute=False)
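
The effect of the execute flag can be illustrated with a small dry-run sketch. It is not the library's implementation: apply_moves is a hypothetical helper, the file names are placeholders, and the prompt that offers to execute the proposed changes at the end is left out.

```python
import os
import shutil
import tempfile


def apply_moves(moves, execute=False):
    """Print proposed moves; carry them out only if execute is True."""
    done = []
    for src, dst in moves:
        print("{} -> {}".format(src, dst))
        if execute:
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            done.append(dst)
    return done


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "barzakh", "new_text.txt")
    dst = os.path.join(tmp, "25Yrepos", "new_text.txt")
    os.makedirs(os.path.dirname(src))
    with open(src, "w", encoding="utf-8") as f:
        f.write("text")
    dry_run = apply_moves([(src, dst)], execute=False)  # only prints
    still_there = os.path.exists(src)
    executed = apply_moves([(src, dst)], execute=True)  # moves the file
    moved = os.path.exists(dst) and not os.path.exists(src)
```

Running with execute=False first makes it possible to check the proposed target paths before anything is changed on disk.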

openiti.new_books.add.add_books.initialize_new_texts_in_folder(folder, target_base_pth, execute=False)

Move all new texts in folder to their OpenITI repo, creating yml files if necessary (or copying them from the same folder if present).

Parameters:
  • folder (str) – path to the folder that contains new text files (with OpenITI uri filenames) and perhaps yml files
  • target_base_pth (str) – path to the folder containing the 25-years repos
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

None

Examples:

# >>> folder = r"D:\OpenITI\barzakh"
# >>> target_base_pth = r"D:\OpenITI\25Yrepos"
# >>> initialize_new_texts_in_folder(folder, target_base_pth,
#                                    execute=False)
openiti.new_books.add.add_books.initialize_texts_from_CSV(csv_fp, old_base_pth='', new_base_pth='', execute=False)

Use a CSV file (filename, URI) to move a list of texts to the relevant OpenITI folder.

The CSV file (which should not contain a heading) can contain full filepaths to the original files, or only filenames; in the latter case, the path to the folder where these files are located should be passed to the function as the old_base_pth argument. Similarly, the URI column can contain full OpenITI URI filepaths or only the URIs; in the latter case, the path to the folder containing the OpenITI 25-years folders should be passed to the function as the new_base_pth argument.

Parameters:
  • csv_fp (str) – path to a csv file that contains the following columns: 0. filepath to (or filename of) the text file 1. full version uri of the text file (no headings!)
  • old_base_pth (str) – path to the folder containing the files that need to be initialized
  • new_base_pth (str) – path to the folder containing the OpenITI 25-years repos
  • execute (bool) – if False, the proposed changes will only be printed (the user will still be given the option to execute all proposed changes at the end); if True, all changes will be executed immediately.
Returns:

None

openiti.new_books.convert.generic_converter

A generic converter for texts into OpenITI format.

This generic converter contains the basic procedure for converting texts into OpenITI format.

These are the main methods of the GenericConverter class:

  • convert_file(source_fp): basic procedure for converting an html file.
  • convert_files_in_folder(source_folder): convert all files in the
    source_folder (calls convert_file)

The procedure can be used as a template for other converters. Subclass the GenericConverter to make a converter for a new source of scraped books:

class GenericHtmlConverter(GenericConverter):
    def new_function(self, text):
        "This function adds functionality to GenericConverter"

Usually, it suffices to overwrite a small number of methods of the GenericConverter class in the sub-class, and/or add a number of methods needed for the conversion of a specific group of documents: e.g.:

import re

class GenericHtmlConverter(GenericConverter):
    def get_data(self, source_fp):
        # code written here overwrites the GenericConverter's get_data method
        pass  # placeholder body

    def post_process(self, text):
        # append new actions by appending them to
        # the superclass's post_process method:
        text = super().post_process(text)
        text = re.sub("\n\n+", "\n\n", text)
        return text

    def new_function(self, text):
        "This function adds functionality to GenericConverter"

The GenericConverter contains a number of dummy (placeholder) functions that must be overwritten in the subclass because they are dependent on the data format.
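
The pattern described above (a fixed procedure in the base class, with dummy steps that subclasses overwrite or extend) can be tried out with a self-contained sketch. MiniConverter and MiniHtmlConverter are simplified stand-ins invented for this illustration; they are not the real GenericConverter classes, and their steps are reduced to three.

```python
import re


class MiniConverter:
    """Simplified stand-in for a generic converter (not the real class)."""

    def convert_text(self, text):
        # Fixed procedure: each step can be overwritten in a subclass.
        text = self.pre_process(text)
        text = self.add_structural_annotations(text)
        text = self.post_process(text)
        return text

    def pre_process(self, text):
        return text.strip()

    def add_structural_annotations(self, text):
        # Dummy (placeholder): depends on the data format.
        return text

    def post_process(self, text):
        return text + "\n"


class MiniHtmlConverter(MiniConverter):
    def add_structural_annotations(self, text):
        # Overwrite the dummy method: turn <h1> headings into "### | " lines.
        return re.sub(r"<h1>(.*?)</h1>", r"### | \1", text)

    def post_process(self, text):
        # Extend the superclass's method instead of replacing it.
        text = super().post_process(text)
        return re.sub(r"\n\n+", "\n\n", text)


converted = MiniHtmlConverter().convert_text(" <h1>abc</h1>\n\n\n\ndef ")
```

Because convert_text only calls the step methods, the subclass changes the result without touching the procedure itself.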

For example, this is the inheritance schema of the EShiaHtmlConverter, which is a subclass of the GenericHtmlConverter: (methods of GenericConverter are inherited by GenericHtmlConverter; and methods of GenericHtmlConverter are inherited by EShiaHtmlConverter. Methods of the superclass with the same name in the subclass are overwritten by the latter)

==================================  ==========================  ==========================
GenericConverter                    GenericHtmlConverter        EShiaHtmlConverter
==================================  ==========================  ==========================
__init__                            __init__                    (inherited)
convert_files_in_folder             (inherited)                 (inherited)
convert_file                        (inherited)                 (inherited)
make_dest_fp                        (inherited - generic!)      (inherited - generic!)
get_metadata (dummy)                (inherited - dummy!)        get_metadata
get_data                            get_data                    (inherited)
pre_process                         (inherited)                 (inherited)
add_page_numbers (dummy)            (inherited - dummy!)        add_page_numbers
add_structural_annotations (dummy)  add_structural_annotations  add_structural_annotations
remove_notes (dummy)                remove_notes                remove_notes
reflow                              (inherited)                 (inherited)
add_milestones (dummy)              (inherited - dummy!)        (inherited - dummy!)
post_process                        (inherited - generic!)      post_process
compose                             (inherited)                 (inherited)
save_file                           (inherited)                 (inherited)
                                    inspect_tags_in_html        (inherited)
                                    inspect_tags_in_folder      (inherited)
                                    find_example_of_tag         (inherited)
==================================  ==========================  ==========================

Examples

>>> from generic_converter import GenericConverter
>>> conv = GenericConverter()
>>> conv.VERBOSE = False
>>> folder = r"test/"
>>> conv.convert_file(folder+"86596.html")
>>> conv.convert_files_in_folder(folder, ["html"])

openiti.new_books.convert.epub_converter_generic

Generic converter that converts an Epub to OpenITI mARkdown.

Examples

Generic epub conversion:

>>> from epub_converter_generic import GenericEpubConverter
>>> gen_converter = GenericEpubConverter(dest_folder="test/converted")
>>> gen_converter.VERBOSE = False
>>> folder = r"test"
>>> fn = r"Houellebecq 2019 - serotonine.epub"
>>> gen_converter.convert_file(os.path.join(folder, fn))
>>> gen_converter.convert_files_in_folder(folder, ["epub",])

Sub-class to create a converter specific to epubs from the Hindawi library:

# >>> HindawiConverter = GenericEpubConverter("test/converted")
# >>> HindawiConverter.VERBOSE = False
# >>> from helper import html2md_hindawi
# >>> HindawiConverter.convert_html2md = html2md_hindawi.markdownify  # (1)
# >>> HindawiConverter.toc_fn = "nav.xhtml"  # (2)
# >>> folder = r"test"
# >>> fn = r"26362727.epub"
# >>> HindawiConverter.convert_file(os.path.join(folder, fn))

# (1) overwrite the convert_html2md function
# (2) specify the filename of the table of contents in Hindawi epub files

An Epub file is in fact a zipped archive. The most important elements of the archive for conversion purposes are the folders that contain the html files with the text. Some Epubs have a table of contents that defines the order in which these html files should be read.
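
Since an Epub is a zip archive, its html files can be listed with the standard library. The sketch below is an illustration only: list_html_files is a hypothetical helper (not a method of GenericEpubConverter), the archive is built in memory to stand in for a real epub, and sorting by name is just a fallback for the case where no table of contents is used.

```python
import io
import zipfile


def list_html_files(epub_file):
    """Return the (x)html member names of an epub, sorted by name."""
    with zipfile.ZipFile(epub_file) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.lower().endswith((".html", ".xhtml", ".htm"))
        )


# Build a tiny in-memory archive to stand in for a real epub file:
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
    archive.writestr("OEBPS/nav.xhtml", "<html></html>")
    archive.writestr("OEBPS/ch2.xhtml", "<html></html>")
    archive.writestr("OEBPS/ch1.xhtml", "<html></html>")

html_files = list_html_files(buffer)
```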

The GenericEpubConverter is a subclass of the GenericConverter from the generic_converter module:

GenericConverter
    \_ GenericEpubConverter

GenericEpubConverter’s main methods are inherited from the GenericConverter:

  • convert_file(source_fp): basic procedure for converting an epub file.
  • convert_files_in_folder(source_folder): convert all epub files in the folder
    (calls convert_file)

Methods of both classes: (methods of GenericConverter are inherited by GenericEpubConverter; methods of GenericConverter with the same name in GenericEpubConverter are overwritten by the latter)

==================================  ==========================
GenericConverter                    GenericEpubConverter
==================================  ==========================
__init__                            __init__
convert_files_in_folder             (inherited)
convert_file                        (inherited)
make_dest_fp                        (inherited - generic!)
get_metadata                        (inherited - generic!)
get_data                            get_data
pre_process                         (inherited)
add_page_numbers                    (inherited - generic!)
add_structural_annotations          (inherited - generic!)
remove_notes                        remove_notes
reflow                              (inherited)
add_milestones                      (inherited)
post_process                        (inherited - generic!)
compose                             (inherited)
save_file                           (inherited)
                                    inspect_epub
                                    sort_html_files_by_toc
                                    add_unique_tags
==================================  ==========================

To create a converter for a specific type of epubs, subclass the GenericEpubConverter and overwrite some of its methods:

GenericConverter
    \_ GenericEpubConverter
            \_ HindawiEpubConverter
            \_ ShamelaEpubConverter
            \_ …

openiti.new_books.convert.epub_converter_hindawi

Convert Epubs from the Hindawi library to OpenITI mARkdown.

The converter has two main functions:

  • convert_file: convert a single epub file.
  • convert_files_in_folder: convert all epub files in a given folder

Usage examples:
>>> folder = r"test/hindawi/"
>>> meta_fp = folder+"hindawi_metadata_man.yml"
>>> from epub_converter_hindawi import convert_file, convert_files_in_folder
>>> src_fp = folder+"26362727.epub"
>>> convert_file(src_fp, meta_fp, dest_fp=folder+"converted/26362727")
>>> convert_files_in_folder(folder, meta_fp, dest_folder=folder+"converted")
Converting all files in folder test/hindawi/ with extensions ['epub']

Both functions use the HindawiEpubConverter class to do the heavy lifting. The HindawiEpubConverter is a subclass of the GenericEpubConverter, which in turn is a subclass of the GenericConverter from the generic_converter module:

GenericConverter
    \_ GenericEpubConverter
            \_ HindawiEpubConverter

Methods of these classes:

(methods of GenericConverter are inherited by GenericEpubConverter, and methods of GenericEpubConverter are inherited by HindawiEpubConverter; methods of the superclass with the same name in the subclass are overwritten by the latter)

==================================  ==========================  ==========================
generic_converter                   epub_converter_generic      epub_converter_hindawi
==================================  ==========================  ==========================
__init__                            __init__                    __init__
convert_files_in_folder             (inherited)                 (inherited)
convert_file                        (inherited)                 (inherited)
make_dest_fp                        (inherited - generic!)      (inherited - generic!)
get_metadata                        (inherited - generic!)      get_metadata
get_data                            get_data                    (inherited)
pre_process                         (inherited)                 (inherited)
add_page_numbers                    (inherited - generic!)      (inherited - generic!)
add_structural_annotations          (inherited - generic!)      (inherited - generic!)
remove_notes                        remove_notes                (inherited)
reflow                              (inherited)                 (inherited)
add_milestones                      (inherited)                 (inherited)
post_process                        (inherited - generic!)      (inherited - generic!)
compose                             (inherited)                 (inherited)
save_file                           (inherited)                 (inherited)
                                    convert_html2md             convert_html2md
                                    inspect_epub                (inherited)
                                    sort_html_files_by_toc      (inherited)
                                    add_unique_tags             (inherited)
==================================  ==========================  ==========================

Examples

>>> from epub_converter_hindawi import HindawiEpubConverter
>>> from helper.yml2json import yml2json
>>> folder = "test/"
>>> fn = "26362727.epub"
>>> hc = HindawiEpubConverter(dest_folder="test/converted")
>>> hc.VERBOSE = False
>>> meta_fp = "test/hindawi/hindawi_metadata_man.yml"
>>> hc.metadata_dic = yml2json(meta_fp, container = {})
>>> hc.metadata_file = meta_fp
>>> hc.convert_file(folder+fn)

#>>> hc.convert_files_in_folder(folder)

openiti.new_books.convert.epub_converter_hindawi.convert_file(fp, meta_fp, dest_fp=None, verbose=False)

Convert one file to OpenITI format.

Parameters:
  • fp (str) – path to the file that must be converted.
  • meta_fp (str) – path to the yml file containing the Hindawi metadata
  • dest_fp (str) – path to the converted file.
Returns:

None

openiti.new_books.convert.epub_converter_hindawi.convert_files_in_folder(src_folder, meta_fp, dest_folder=None, verbose=False, extensions=['epub'], exclude_extensions=['yml'], fn_regex=None)

Convert all files in a folder to OpenITI format. Use the extensions and exclude_extensions lists to filter the files to be converted.

Parameters:
  • src_folder (str) – path to the folder that contains the files that must be converted.
  • meta_fp (str) – path to the yml file containing the Hindawi metadata
  • dest_folder (str) – path to the folder where converted files will be stored.
  • extensions (list) – list of extensions; if this list is not empty, only files with an extension in the list should be converted.
  • exclude_extensions (list) – list of extensions; if this list is not empty, only files whose extension is not in the list will be converted.
  • fn_regex (str) – regular expression defining the filename pattern, e.g., “-(ara|per)\d”. If fn_regex is defined, only files whose filename matches the pattern will be converted.
Returns:

None
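
How the three filters might combine can be sketched as below. This is an assumption-laden illustration, not the library's code: should_convert is a hypothetical helper, extensions are compared without their leading dot, and the regular expression is applied with re.search.

```python
import os
import re


def should_convert(fn, extensions=None, exclude_extensions=None, fn_regex=None):
    """Decide whether a file name passes the three filters (sketch)."""
    ext = os.path.splitext(fn)[1].lstrip(".")
    extensions = [e.lstrip(".") for e in (extensions or [])]
    exclude_extensions = [e.lstrip(".") for e in (exclude_extensions or [])]
    if extensions and ext not in extensions:
        return False  # not in the list of wanted extensions
    if exclude_extensions and ext in exclude_extensions:
        return False  # explicitly excluded extension
    if fn_regex and not re.search(fn_regex, fn):
        return False  # file name does not match the pattern
    return True


files = [
    "26362727.epub",
    "hindawi_metadata_man.yml",
    "0375IkhwanSafa.Rasail.Hindawi95926405Vols-ara1",
]
selected = [fn for fn in files if should_convert(fn, ["epub"], ["yml"])]
by_regex = [fn for fn in files if should_convert(fn, fn_regex=r"-(ara|per)\d")]
```

With the default arguments of convert_files_in_folder (extensions=['epub'], exclude_extensions=['yml']), only the epub file in this list would be selected.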

openiti.new_books.convert.html_converter_generic

Generic converter that converts HTML files to OpenITI mARkdown.

This generic converter forms the basis for specific converters tailored to libraries whose texts are in html format.

Examples

Generic html conversion:

>>> from html_converter_generic import GenericHtmlConverter
>>> gen_converter = GenericHtmlConverter(dest_folder="test/converted")
>>> gen_converter.VERBOSE = False
>>> folder = r"test/eShia"
>>> fn = "86596.html"
>>> gen_converter.convert_file(os.path.join(folder, fn))
>>> gen_converter.convert_files_in_folder(folder, extensions=[".html"])

Sub-class to create a converter specific to htmls from the eShia library:

>>> eShiaConv = GenericHtmlConverter("test/converted")
>>> eShiaConv.VERBOSE = False
>>> from helper import html2md_eShia
>>> eShiaConv.add_structural_annotations = html2md_eShia.markdownify  # (1)
>>> folder = r"test/eShia"
>>> fn = "86596.html"
>>> eShiaConv.convert_file(os.path.join(folder, fn))

# (1) overwrite the add_structural_annotations method with another function

The GenericHtmlConverter is a subclass of the GenericConverter from the generic_converter module:

GenericConverter
    \_ GenericEpubConverter
    \_ GenericHtmlConverter

GenericHtmlConverter’s main methods are inherited from the GenericConverter:

  • convert_file(source_fp): basic procedure for converting an html file.
  • convert_files_in_folder(source_folder): convert all html files in the folder
    (calls convert_file)

Methods of both classes: (methods of GenericConverter are inherited by GenericHtmlConverter; methods of GenericConverter with the same name in GenericHtmlConverter are overwritten by the latter)

==================================  ==========================
GenericConverter                    GenericHtmlConverter
==================================  ==========================
__init__                            __init__
convert_files_in_folder             (inherited)
convert_file                        (inherited)
make_dest_fp                        (inherited - generic!)
get_metadata (dummy)                (inherited - dummy!)
get_data                            get_data
pre_process                         (inherited)
add_page_numbers (dummy)            (inherited - dummy!)
add_structural_annotations (dummy)  add_structural_annotations
remove_notes (dummy)                remove_notes
reflow                              (inherited)
add_milestones (dummy)              (inherited - dummy!)
post_process                        (inherited - generic!)
compose                             (inherited)
save_file                           (inherited)
                                    inspect_tags_in_html
                                    inspect_tags_in_folder
                                    find_example_of_tag
==================================  ==========================

The main difference between the two converters is the add_structural_annotations method: GenericHtmlConverter uses the html2md converter there, which converts or strips html tags.

To create a converter for a specific type of html files, subclass the GenericHtmlConverter and overwrite some of its methods that are dependent on the structure of the data, esp.:

  • get_metadata
  • add_page_numbers
  • remove_notes
  • add_structural_annotations (subclass the generic html2md converter for this, and create an html2md converter specific to the files you need to convert)

GenericConverter
    \_ GenericHtmlConverter
            \_ eShiaHtmlConverter
            \_ NoorlibHtmlConverter
            \_ …

In addition to the conversion methods, the GenericHtmlConverter contains a number of useful methods that help with deciding which html tags need to be converted:

  • inspect_tags_in_html
  • inspect_tags_in_folder
  • find_example_of_tag

openiti.new_books.convert.html_converter_eShia

Converter that converts HTML files from the eShia library to OpenITI mARkdown.

The converter has two main functions:

  • convert_file: convert a single html file.
  • convert_files_in_folder: convert all html files in a given folder

Usage examples:
>>> from html_converter_eShia import convert_file
>>> folder = r"test/eShia/"
>>> convert_file(folder+"86596.html", dest_fp=folder+"converted/86596")
>>> from html_converter_eShia import convert_files_in_folder
>>> convert_files_in_folder(folder, dest_folder=folder+"converted")

Both functions use the EShiaHtmlConverter class to do the heavy lifting. The EShiaHtmlConverter is a subclass of GenericHtmlConverter, which in its turn inherits many functions from the GenericConverter.

GenericConverter
    \_ GenericHtmlConverter
            \_ EShiaHtmlConverter
            \_ NoorlibHtmlConverter
            \_ …

Overview of the methods of these classes: (methods of GenericConverter are inherited by GenericHtmlConverter; and methods of GenericHtmlConverter are inherited by EShiaHtmlConverter. Methods of the superclass with the same name in the subclass are overwritten by the latter)

==================================  ==========================  ==========================
GenericConverter                    GenericHtmlConverter        EShiaHtmlConverter
==================================  ==========================  ==========================
__init__                            __init__                    (inherited)
convert_files_in_folder             (inherited)                 (inherited)
convert_file                        (inherited)                 (inherited)
make_dest_fp                        (inherited - generic!)      (inherited - generic!)
get_metadata (dummy)                (inherited - dummy!)        get_metadata
get_data                            (inherited)                 (inherited)
pre_process                         (inherited)                 (inherited)
add_page_numbers (dummy)            (inherited - dummy!)        add_page_numbers
add_structural_annotations (dummy)  add_structural_annotations  add_structural_annotations
remove_notes (dummy)                remove_notes                remove_notes
reflow                              (inherited)                 (inherited)
add_milestones (dummy)              (inherited - dummy!)        (inherited - dummy!)
post_process                        (inherited - generic!)      post_process
compose                             (inherited)                 (inherited)
save_file                           (inherited)                 (inherited)
                                    inspect_tags_in_html        (inherited)
                                    inspect_tags_in_folder      (inherited)
                                    find_example_of_tag         (inherited)
==================================  ==========================  ==========================

The EShiaHtmlConverter’s add_structural_annotations method uses html2md_eShia, an adaptation of the generic html2md (based on markdownify) to convert the html tags to OpenITI annotations.

Examples

>>> from html_converter_eShia import EShiaHtmlConverter
>>> conv = EShiaHtmlConverter()
>>> conv.VERBOSE = False
>>> folder = r"test/eShia/"
>>> conv.convert_file(folder+"86596.html")
>>> conv.convert_files_in_folder(folder, extensions=["html"])
openiti.new_books.convert.html_converter_eShia.convert_file(fp, dest_fp=None, verbose=False)

Convert one file to OpenITI format.

Parameters:
  • fp (str) – path to the file that must be converted.
  • dest_fp (str) – path to the converted file.
Returns:

None

openiti.new_books.convert.html_converter_eShia.convert_files_in_folder(src_folder, dest_folder=None, extensions=['html'], exclude_extensions=['yml'], fn_regex=None, verbose=False)

Convert all files in a folder to OpenITI format. Use the extensions and exclude_extensions lists to filter the files to be converted.

Parameters:
  • src_folder (str) – path to the folder that contains the files that must be converted.
  • dest_folder (str) – path to the folder where converted files will be stored.
  • extensions (list) – list of extensions; if this list is not empty, only files with an extension in the list should be converted.
  • exclude_extensions (list) – list of extensions; if this list is not empty, only files whose extension is not in the list will be converted.
  • fn_regex (str) – regular expression defining the filename pattern, e.g., “-(ara|per)\d”. If fn_regex is defined, only files whose filename matches the pattern will be converted.
Returns:

None

openiti.new_books.convert.shamela_converter

openiti.new_books.convert.tei_converter_generic

A generic converter for converting tei xml files into OpenITI format.

Subclass the TeiConverter to make a converter for a new source of scraped books in TEI xml format.

Examples

Generic tei conversion:

>>> from tei_converter_generic import TeiConverter
>>> conv = TeiConverter(dest_folder="test/converted")
>>> conv.VERBOSE = False
>>> folder = r"test"
>>> fn = r"GRAR000070"
>>> conv.convert_file(os.path.join(folder, fn))
>>> conv.convert_files_in_folder(folder, extensions = [""])

Sub-class to create a converter specific to GRAR collection tei files:

#>>> class GRARConverter(TeiConverter):
#        def post_process(self, text):
#            text = super().post_process(text)
#            vols = re.findall("#META#.+ vol. (\d+) pp", text)
#            if len(vols) == 1:
#                text = re.sub("PageV00", "PageV{:02d}".format(int(vols[0])), text)
#            elif len(vols) == 0:
#                text = re.sub("PageV00", "PageV01", text)
#            return text
#>>> conv = GRARConverter("test/converted")
#>>> conv.VERBOSE = False
#>>> folder = r"test"
#>>> conv.convert_files_in_folder(folder, exclude_ext = ["py", "yml", "txt", "epub"])

Methods of both classes:

==================================  ==========================
GenericConverter                    TeiConverter
==================================  ==========================
__init__                            __init__ (appended)
convert_files_in_folder             (inherited)
convert_file                        (inherited)
make_dest_fp                        (inherited)
get_metadata (dummy)                get_metadata
get_data                            (inherited)
pre_process                         pre_process
add_page_numbers (dummy)            (inherited - not used)
add_structural_annotations (dummy)  add_structural_annotations
remove_notes (dummy)                (inherited - generic!)
reflow                              (inherited)
add_milestones (dummy)              (inherited - dummy)
post_process                        post_process
compose                             (inherited)
save_file                           (inherited)
                                    preprocess_page_numbers
                                    preprocess_wrapped_lines
==================================  ==========================

openiti.new_books.convert.tei_converter_GRAR

A converter for converting GRAR tei xml files into OpenITI format.

The converter has two main functions:

  • convert_file: convert a single file (tei xml or html).
  • convert_files_in_folder: convert all files in a given folder

Usage examples:
>>> from tei_converter_GRAR import convert_file, convert_files_in_folder
>>> folder = r"test/GRAR/"
>>> convert_file(folder+"GRAR000070.xml", dest_fp=folder+"converted/GRAR000070")
>>> convert_files_in_folder(folder, dest_folder=folder+"converted")

Both functions use the GRARConverter class to do the heavy lifting.

The Graeco-Arabic studies website (graeco-arabic-studies.org) contains 78 texts transcribed in TEI xml, and 21 additional texts available only in html.

The XML texts were downloaded as is. For each html text, every page was downloaded separately, and a compound html file was created (using the combine_files_in_folder function) that contains the metadata of that text and the div containing the text of each page.

The GRARConverter (which is sub-classed from tei_converter_generic.TeiConverter) converts both the tei xml files and the html files. It uses the generic TeiConverter’s tei2md.TeiConverter for the xml files, and for the html files the html2md_GRAR.GRARHtmlConverter (which is a modified version (sub-class) of the html2md.MarkdownConverter).

Schema representing the method inheritance in the GRARConverter:

==================================  ==========================  ==========================
GenericConverter                    TeiConverter                GRARConverter
==================================  ==========================  ==========================
__init__                            __init__ (appended)         (inherited)
convert_files_in_folder             (inherited)                 convert_files_in_folder
convert_file                        convert_file                convert_file
make_dest_fp                        (inherited)                 (inherited)
get_metadata (dummy)                get_metadata                (inherited - for tei xml)
get_data                            (inherited)                 (inherited - for tei xml)
pre_process                         pre_process                 pre_process (appended)
add_page_numbers (dummy)            (inherited - not used)      (inherited - not used)
add_structural_annotations          add_structural_annotations  (inherited - for tei xml)
remove_notes (dummy)                (inherited - generic!)      (inherited - generic!)
reflow                              (inherited)                 (inherited)
add_milestones (dummy)              (inherited - dummy)         (inherited - not used)
post_process                        post_process                post_process (appended)
compose                             (inherited)                 (inherited)
save_file                           (inherited)                 (inherited)
                                    preprocess_page_numbers     (inherited)
                                    preprocess_wrapped_lines    (inherited)
                                                                get_html_data
                                                                get_html_metadata
                                                                format_html_metadata
==================================  ==========================  ==========================

## Examples:
## >>> from tei_converter_GRAR import GRARConverter
## >>> conv = GRARConverter(dest_folder="test/GRAR/converted")
## >>> conv.VERBOSE = False
## >>> folder = r"test/GRAR"
## >>> fn = r"GRAR000070.xml"
## >>> conv.convert_file(os.path.join(folder, fn))

openiti.new_books.convert.tei_converter_GRAR.combine_files_in_folder(folder, output_fp=None)

Combine separate html files into one master html file

openiti.new_books.convert.tei_converter_GRAR.convert_file(fp, dest_fp=None)

Convert one file to OpenITI format.

Parameters:
  • fp (str) – path to the file that must be converted.
  • dest_fp (str) – path to the converted file. Defaults to None (in which case, the converted file will be put in a folder named “converted” in the same folder as the source file)
Returns:

None

openiti.new_books.convert.tei_converter_GRAR.convert_files_in_folder(src_folder, dest_folder=None, extensions=[], exclude_extensions=['yml'], fn_regex=None)

Convert all files in a folder to OpenITI format. Use the extensions and exclude_extensions lists to filter the files to be converted.

Parameters:
  • src_folder (str) – path to the folder that contains the files that must be converted.
  • dest_folder (str) – path to the folder where converted files will be stored.
  • extensions (list) – list of extensions; if this list is not empty, only files with an extension in the list should be converted.
  • exclude_extensions (list) – list of extensions; if this list is not empty, only files whose extension is not in the list will be converted.
  • fn_regex (str) – regular expression defining the filename pattern, e.g., “-(ara|per)\d”. If fn_regex is defined, only files whose filename matches the pattern will be converted.
Returns:

None

openiti.new_books.convert.tei_converter_GRAR.list_all_tags(folder, header_end_tag='</teiHeader>')

Extracts a list of all tags used in the texts in a folder:

For GRAR:

<body>
<div1 type="" n="" (name="")(/)>      # book subdivision level 1
<div2 type="" n="" (name="")(/)>      # book subdivision level 2
<head>                                # title
<lb(/)>                               # start of new line
<milestone unit="" n=""/>             #
<p>                                   # paragraph
<pb (type="") n=""/>                  # start of new page
<quote type="" (author="") (id="")>   # quotation of a source
<text lang="" id="">                  # metadata

# tables:
<table>
<tr>
<td>

# line groups (e.g., for poetry):
<lg>                                  # line group
<l>                                   # line in line group

div types:

========  ========
div1      div2
========  ========
book
books
chapter   chapter
folio
sentence  sentence
          aphorism
========  ========

pb types: primary, secondary

quote types: lemma, commentary

milestone units: book, ed1chapter, ed1page, ms1folio
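
A tag inventory of this kind can be approximated with a regular expression. The sketch below is a hedged illustration rather than the actual list_all_tags code: count_tags is a hypothetical helper that works on a string instead of a folder, and the sample document is invented. Only the text after header_end_tag is inspected, so that tags from the header are not counted.

```python
import re
from collections import Counter


def count_tags(xml_text, header_end_tag="</teiHeader>"):
    """Count opening and self-closing tag names after the header (sketch)."""
    # Keep only what follows the header; if the header end tag is absent,
    # the whole text is inspected.
    body = xml_text.split(header_end_tag, 1)[-1]
    # Closing tags start with "</" and are therefore not matched.
    return Counter(re.findall(r"<([A-Za-z][\w.-]*)", body))


sample = (
    '<TEI><teiHeader><title>t</title></teiHeader>'
    '<text lang="ar" id="x"><body><div1 type="book" n="1">'
    '<p>abc<lb/>def</p><p>ghi</p></div1></body></text></TEI>'
)
tags = count_tags(sample)
```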

openiti.new_books.convert.helper

openiti.new_books.convert.helper.bok
openiti.new_books.convert.helper.html2md

Convert html to Markdown.

This program is an adaptation of python-markdownify (https://github.com/matthewwithanm/python-markdownify) to output OpenITI mARkdown. It also adds methods for tables and images, and a post-processing method.

The main component of this script is the MarkdownConverter class, which contains a basic procedure for converting html, tag by tag (MarkdownConverter.convert), and methods for converting specific html tags to mARkdown (MarkdownConverter.convert_a, MarkdownConverter.convert_img, …).

The easiest way to use the MarkdownConverter is to use the markdownify function, which calls the convert method of the MarkdownConverter class.
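
The tag-by-tag procedure with one convert_<tag> method per tag can be illustrated with a much simplified stand-in. TinyConverter below is not the MarkdownConverter: it is a hypothetical class written for this illustration, it uses the standard library's XML parser (so the input must be a well-formed fragment), and it supports only a handful of tags.

```python
import xml.etree.ElementTree as ET


class TinyConverter:
    """Simplified stand-in for a tag-by-tag converter (not html2md itself)."""

    def convert(self, html):
        # Wrap the fragment so that it has a single root element.
        root = ET.fromstring("<root>" + html + "</root>")
        return self.process_tag(root)

    def process_tag(self, el):
        # Convert the children first, then hand the collected text to the
        # convert_<tag> method for this element, if one is defined.
        text = el.text or ""
        for child in el:
            text += self.process_tag(child)
            text += child.tail or ""
        convert_fn = getattr(self, "convert_" + el.tag, None)
        if convert_fn is None:
            return text  # unsupported tags are stripped, their text is kept
        return convert_fn(el, text)

    def convert_em(self, el, text):
        return "*" + text + "*"

    convert_i = convert_em  # <i> is treated like <em>

    def convert_strong(self, el, text):
        return "**" + text + "**"

    convert_b = convert_strong  # <b> is treated like <strong>

    def convert_a(self, el, text):
        return "[{}]({})".format(text, el.get("href", ""))


result = TinyConverter().convert(
    'abc <b>def</b> <span>ghi</span> <a href="a/b/c">jkl</a>'
)
```

In a design like this, supporting a new tag only requires adding a convert_<tag> method in a subclass.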

You can use this class as a base class and subclass it to add methods, adapt the post-processing method etc.

E.g.:

class Hindawi_converter(html2md.MarkdownConverter):
    def post_process_md(self, text):
        text = super().post_process_md(text)
        # remove blank lines marked with "DELETE_PREVIOUS_BLANKLINES" tag
        text = re.sub(r"\n+DELETE_PREVIOUS_BLANKLINES", "", text)
        # replace placeholders for spaces in tables:
        text = re.sub("ç", " ", text)
        return text

Examples (doctests):

Headings: h1

>>> import html2md
>>> h = '<h1>abc</h1>'
>>> html2md.markdownify(h)
'\n\n### | abc\n\n'
NB: heading style is OpenITI mARkdown style by default,
but can be set to other styles as well:
>>> h = '<h1>abc</h1>'
>>> html2md.markdownify(h, md_style=UNDERLINED)
'\n\nabc\n===\n\n'
>>> h = '<h1>abc</h1>'
>>> html2md.markdownify(h, md_style=ATX)
'\n\n# abc\n\n'

Paragraphs (<p>):

>>> h = "<p>abc</p>"
>>> html2md.markdownify(h)
'\n\n# abc\n\n'
>>> h = "<p>abc</p>"
>>> html2md.markdownify(h, md_style=ATX)
'\n\nabc\n\n'

Divs without class or with an unsupported class are stripped:

>>> h = 'abc             <div>def</div>             ghi'
>>> html2md.markdownify(h)
'abc def ghi'
>>> h = 'abc             <div class="unknown_div_class">def</div>             ghi'
>>> html2md.markdownify(h)
'abc def ghi'

Spans without class or with an unsupported class are stripped:

>>> h = 'abc <span>def</span> ghi'
>>> html2md.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="unknown_span_class">def</span> ghi'
>>> html2md.markdownify(h)
'abc def ghi'

Links:

>>> h = '<a href="a/b/c">abc</a>'
>>> html2md.markdownify(h)
'[abc](a/b/c)'

Unordered lists:

>>> h = '<ul><li>item1</li><li>item2</li></ul>'
>>> html2md.markdownify(h)
'\n* item1\n* item2\n\n'

Ordered lists:

>>> h = '<ol><li>item1</li><li>item2</li></ol>'
>>> html2md.markdownify(h)
'\n1. item1\n2. item2\n\n'

Nested lists:

>>> h = '<ol><li>item1</li><li>item2:<ul><li>item3</li><li>item4</li></ul></li></ol>'
>>> html2md.markdownify(h)
'\n1. item1\n2. item2:\n\n\t* item3\n\t* item4\n\t\n\n'

Italics (<i> and <em> tags):

>>> h = 'abc <em>def</em> ghi'
>>> html2md.markdownify(h)
'abc *def* ghi'
>>> h = 'abc <i>def</i> ghi'
>>> html2md.markdownify(h)
'abc *def* ghi'

Bold (<b> and <strong> tags):

>>> h = 'abc <b>def</b> ghi'
>>> html2md.markdownify(h)
'abc **def** ghi'
>>> h = 'abc <strong>def</strong> ghi'
>>> html2md.markdownify(h)
'abc **def** ghi'

Tables:

>>> h = '    <table>      <tr>        <th>th1aaa</th><th>th2</th>      </tr>      <tr>        <td>td1</td><td>td2</td>      </tr>    </table>'
>>> html2md.markdownify(h)
'\n\n| th1aaa | th2 |\n| ------ | --- |\n| td1    | td2 |\n\n'

# i.e.,
# | th1aaa | th2 |
# | td1    | td2 |

openiti.new_books.convert.helper.html2md.markdownify(html, **options)

Convert html to markdown.

Calls the convert() method of the MarkdownConverter class.

class openiti.new_books.convert.helper.html2md.MarkdownConverter(**options)
convert(html)

Convert html to markdown.

We want to take advantage of the html5 parsing, but we don’t actually want a full document. Therefore, we’ll mark our fragment with an id, create the document, and extract the element with the id.

convert_a(el, text)

Convert html links to markdown-style links.

Example

>>> import html2md
>>> h = '<a href="a/b/c">abc</a>'
>>> html2md.markdownify(h)
'[abc](a/b/c)'
convert_b(el, text)

Convert <b> tags into markdown formatting

convert_blockquote(el, text)

Convert <blockquote> tags into markdown formatting

convert_br(el, text)

Convert <br/> tags into newline characters.

convert_em(el, text)

Convert <em> (italics) tags into markdown formatting.

convert_hn(n, el, text)

Convert html headings (<h1>, <h2>, etc.) into markdown formatting.

convert_i(el, text)

Convert <i> (italics) tags into markdown formatting.

convert_img(el, text)

Convert <img> tags into markdown-style links to image files.

Examples

>>> import html2md
>>> h = '<div><img class="figure" src="../Images/figure1.png" /></div>'
>>> html2md.markdownify(h)
'![](../Images/figure1.png)'
>>> html2md.markdownify(h, image_link_regex="../Images", image_folder="img")
'![](img/figure1.png)'
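
One way the image_link_regex and image_folder options could produce the output shown above is a plain regular-expression substitution on the src attribute. This is a guess at the mechanism, offered as a sketch: image_markdown is a hypothetical function, not the convert_img method.

```python
import re


def image_markdown(src, image_link_regex=None, image_folder=None):
    """Build a markdown image link, optionally rewriting the folder (sketch)."""
    if image_link_regex and image_folder is not None:
        # Replace the part of the path matched by the regex with the
        # name of the local image folder.
        src = re.sub(image_link_regex, image_folder, src)
    return "![]({})".format(src)


default_link = image_markdown("../Images/figure1.png")
rewritten = image_markdown(
    "../Images/figure1.png",
    image_link_regex="../Images",
    image_folder="img",
)
```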
convert_li(el, text)

Convert list element tags <li>.

convert_list(el, text)

Convert ordered and unordered html lists (<ul> and <ol> tags).

Examples

# unordered lists:

>>> import html2md
>>> h = '<ul><li>item1</li><li>item2</li></ul>'
>>> html2md.markdownify(h)
'\n* item1\n* item2\n\n'

# ordered lists:

>>> import html2md
>>> h = '<ol><li>item1</li><li>item2</li></ol>'
>>> html2md.markdownify(h)
'\n1. item1\n2. item2\n\n'

# nested lists:

###### TEST FAILS FOR UNKNOWN REASONS
##>>> import html2md
##>>> h = '<ol><li>item1</li><li>item2:<ul><li>item3</li><li>item4</li></ul></li></ol>'
##>>> html2md.markdownify(h)
##'\n1. item1\n2. item2:\n\n\t* item3\n\t* item4\n\t\n\n'
convert_ol(el, text)

Convert ordered and unordered html lists (<ul> and <ol> tags).

Examples

# unordered lists:

>>> import html2md
>>> h = '<ul><li>item1</li><li>item2</li></ul>'
>>> html2md.markdownify(h)
'\n* item1\n* item2\n\n'

# ordered lists:

>>> import html2md
>>> h = '<ol><li>item1</li><li>item2</li></ol>'
>>> html2md.markdownify(h)
'\n1. item1\n2. item2\n\n'

# nested lists:

###### TEST FAILS FOR UNKNOWN REASONS
##>>> import html2md
##>>> h = '<ol><li>item1</li><li>item2:<ul><li>item3</li><li>item4</li></ul></li></ol>'
##>>> html2md.markdownify(h)
##'\n1. item1\n2. item2:\n\n\t* item3\n\t* item4\n\t\n\n'
convert_p(el, text)

Converts <p> tags.

Examples

>>> import html2md
>>> h = "<p>abc</p>"
>>> html2md.markdownify(h)
'\n\n# abc\n\n'
>>> h = "<p>abc</p>"
>>> html2md.markdownify(h, md_style=ATX)
'\n\nabc\n\n'
>>> h = "<p></p>"
>>> html2md.markdownify(h, md_style=ATX)
''
convert_strong(el, text)

Convert <b> and <strong> tags.

NB: convert_b refers to this same function

Examples

>>> import html2md
>>> h = 'abc <strong>def</strong> ghi'
>>> html2md.markdownify(h)
'abc **def** ghi'
>>> import html2md
>>> h = 'abc <b>def</b> ghi'
>>> html2md.markdownify(h)
'abc **def** ghi'
convert_table(el, text)

Wrap tables between double new lines.

NB: conversion of the tables takes place on the tr level.

convert_tr(el, text)

Convert table rows.

NB: rows are processed before the table tag is. Spaces to fill out columns are added in post-processing!

Examples

>>> import html2md
>>> h = '            <table>              <tr>                <th>th1aaa</th><th>th2</th>              </tr>              <tr>                <td>td1</td><td>td2</td>              </tr>            </table>'
>>> html2md.markdownify(h)
'\n\n| th1aaa | th2 |\n| ------ | --- |\n| td1    | td2 |\n\n'
i.e.:
| th1aaa | th2 |
| ------ | --- |
| td1    | td2 |
convert_ul(el, text)

Convert ordered and unordered html lists (<ul> and <ol> tags).

Examples

# unordered lists:

>>> import html2md
>>> h = '<ul><li>item1</li><li>item2</li></ul>'
>>> html2md.markdownify(h)
'\n* item1\n* item2\n\n'

# ordered lists:

>>> import html2md
>>> h = '<ol><li>item1</li><li>item2</li></ol>'
>>> html2md.markdownify(h)
'\n1. item1\n2. item2\n\n'

# nested lists:

###### TEST FAILS FOR UNKNOWN REASONS
##>>> import html2md
##>>> h = '<ol><li>item1</li><li>item2:<ul><li>item3</li><li>item4</li></ul></li></ol>'
##>>> html2md.markdownify(h)
##'\n1. item1\n2. item2:\n\n\t* item3\n\t* item4\n\t\n\n'
create_underline_line(text, pad_char)

Create a sequence of pad_char characters the same length as text.

fill_out_columns(match)

Find the longest cell in a column; add spaces to shorter columns.
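
The column-filling step can be illustrated on a table that is already split into cells. This is a minimal sketch of the idea (pad each cell to the width of the longest cell in its column), assuming the rows arrive as lists of strings; the real method works on a regex match, and the name fill_out_table is hypothetical.

```python
def fill_out_table(rows):
    """Render rows (lists of cell strings) as an aligned markdown table.

    The first row is treated as the header. Hypothetical helper, not part
    of html2md.
    """
    # Width of each column = length of its longest cell.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for i, row in enumerate(rows):
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:
            # Separator line under the header, as wide as each column.
            lines.append("| " + " | ".join("-" * w for w in widths) + " |")
    return "\n".join(lines)


print(fill_out_table([["th1aaa", "th2"], ["td1", "td2"]]))
```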

indent(text, level)

Add tab indentation before text.

post_process_md(text)

Post-processing operation to improve formatting of converted text.

post_process_named_entities(match)

Reformat named entity matches to mARkdown named entity standard.

Named entities should be marked with @TAG@ tags in the converter (3 capital letters between at signs), and end with a new line. This post-processing step then converts these temporary tags into the OpenITI mARkdown format @TAGdd+:

  • The first number after the @QUR tag refers to the number of letters following the tag that do not belong to the named entity (in this automatic step, this number will always be set to 0);
  • the following number(s) refer(s) to the length of the entity in tokens

Examples

>>> import html2md
>>> conv = html2md.MarkdownConverter()
>>> conv.post_process_md("abc @QUR@ def ghi\njkl")
'abc @QUR02 def ghi jkl'
>>> conv.post_process_md("abc @QUR@ def ghi\n~~jkl\nmno")
'abc @QUR03 def ghi\n~~jkl mno'
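
The two doctests above can be reproduced with a short regex substitution. The sketch below is only one way to obtain that behaviour, under the assumption that an entity runs up to the first newline that is not followed by a "~~" line continuation; the names tag_named_entities and _reformat are hypothetical and the pattern is not taken from the library.

```python
import re


def _reformat(match):
    """Turn a temporary @TAG@ match into the @TAGdd+ form."""
    tag, entity = match.group(1), match.group(2)
    # Count whitespace-separated tokens, ignoring "~~" continuation marks.
    n_tokens = len(entity.replace("~~", " ").split())
    # First digit: letters after the tag that are not part of the entity (0).
    return "@{}0{} {} ".format(tag, n_tokens, entity)


def tag_named_entities(text):
    # The entity ends at the first newline not followed by "~~".
    pattern = r"@([A-Z]{3})@ ?((?:.|\n~~)+?)\n(?!~~)"
    return re.sub(pattern, _reformat, text)


print(repr(tag_named_entities("abc @QUR@ def ghi\njkl")))
print(repr(tag_named_entities("abc @QUR@ def ghi\n~~jkl\nmno")))
```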
process_tag(node, children_only=False)

Process each tag and its children.

should_convert_tag(tag)

Check whether a tag should be converted or simply stripped

underline(text, pad_char)

Underline text with pad_char characters (-, =, or +).

Parameters:
  • text (str) – the text within a tag, to be underlined
  • pad_char (str) – the character used for the line (-, =, or +)
Returns:

an underlined line of text

Example

>>> import html2md
>>> html2md.MarkdownConverter().underline("123456789", "=")
'123456789\n========='
>>> html2md.MarkdownConverter().underline("123456789  ", "=")
'123456789\n========='
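
A minimal sketch of how underline and create_underline_line fit together, written as plain functions rather than methods. It assumes that trailing whitespace is stripped before measuring the text, which is what the second doctest suggests; it is not the library's implementation.

```python
def create_underline_line(text, pad_char):
    """Return a run of pad_char characters as long as text."""
    return pad_char * len(text)


def underline(text, pad_char):
    """Underline text with pad_char, ignoring trailing whitespace."""
    text = text.rstrip()
    if not text:
        return ""
    return "{}\n{}".format(text, create_underline_line(text, pad_char))


print(repr(underline("123456789  ", "=")))
```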
openiti.new_books.convert.helper.html2md_eShia

Convert eShia library html to OpenITI mARkdown.

This script subclasses the generic MarkdownConverter class from the html2md module (based on python-markdownify, https://github.com/matthewwithanm/python-markdownify), which uses BeautifulSoup to create a flexible converter.

The subclass in this module, EShiaHtmlConverter, adds methods specifically for the conversion of books from the eShia library to OpenITI mARkdown:

  • Span, div and p conversion: the span, div and p classes that need to be converted
    are defined in self.class_dict.

Inheritance schema of the EShiaHtmlConverter:

MarkdownConverter EShiaHtmlConverter
Options (inherited)
DefaultOptions (inherited)
__init__ (inherited)
__getattr__ (inherited)
convert (inherited)
process_tag (inherited)
process_text (inherited)
fill_out_columns (inherited)
post_process_md (inherited)
should_convert_tag (inherited)
indent (inherited)
underline (inherited)
create_underline_line (inherited)
convert_a (inherited)
convert_b (inherited)
convert_blockquote (inherited)
convert_br (inherited)
convert_em (inherited)
convert_hn (inherited)
convert_i (inherited)
convert_img (inherited)
convert_list (inherited)
convert_li (inherited)
convert_ol (inherited)
convert_p convert_p
convert_table (inherited)
convert_tr (inherited)
convert_ul (inherited)
convert_strong (inherited)
(none) convert_span
(none) convert_div
class openiti.new_books.convert.helper.html2md_eShia.EShiaHtmlConverter(**options)

Convert EShia library html to OpenITI mARkdown.

Examples

>>> import html2md_eShia
>>> h = '<img class="libimages" src="/images/books/86596/01/cover.jpg">'
>>> html2md_eShia.markdownify(h)
'![](img/86596/01/cover.jpg)'
>>> import html2md_eShia
>>> h = 'abc <a href="www.example.com">def</a> ghi'
>>> html2md_eShia.markdownify(h)
'abc def ghi'
convert_a(el, text)

Convert html links to markdown-style links.

Example

>>> import html2md
>>> h = '<a href="a/b/c">abc</a>'
>>> html2md.markdownify(h)
'[abc](a/b/c)'
convert_div(el, text)

Converts html <div> tags, depending on their class attribute.

Supported div classes should be stored in self.class_dict (key: div class (str); value: formatting string)

Example

>>> import html2md_eShia
>>> h = 'abc <div>def</div> ghi'
>>> html2md_eShia.markdownify(h)
'abc def ghi'
>>> h = 'abc <div class="unknown_div_class">def</div> ghi'
>>> html2md_eShia.markdownify(h)
'abc def ghi'
>>> h = 'abc <div class="list3">def  ghi</div> jkl'
>>> html2md_eShia.markdownify(h)
'abc \tdef ghi jkl'
convert_p(el, text)

Converts <p> tags according to their class.

Supported p classes should be stored in self.class_dict (key: p class (str); value: formatting string), e.g., {"quran": "@QUR@ {}\n"}

<p> tags without class attribute, or unsupported class, will be converted according to the markdown style as defined in the self.options[“md_style”] value (from super().DefaultOptions)

Examples

>>> import html2md_eShia
>>> h = "<p>abc</p>"
>>> html2md_eShia.markdownify(h)
'\n\n# abc\n\n'
>>> h = "<p>abc</p>"
>>> html2md_eShia.markdownify(h, md_style=ATX)
'\n\nabc\n\n'
>>> h = "<p></p>"
>>> html2md_eShia.markdownify(h, md_style=ATX)
''
>>> h = '<p class="KalamateKhas">abc</p>'
>>> html2md_eShia.markdownify(h)
'\n\n### ||| abc\n\n'
convert_span(el, text)

Converts html <span> tags, depending on their class attribute.

Supported span classes should be stored in self.class_dict (key: span class (str); value: formatting string), e.g., {"quran": "@QUR@ {}\n"}

Example

>>> import html2md_eShia
>>> h = 'abc <span>def</span> ghi'
>>> html2md_eShia.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="unknown_span_class">def</span> ghi'
>>> html2md_eShia.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="Aye">def  ghi</span> jkl'
>>> html2md_eShia.markdownify(h)
'abc @QUR02 def ghi jkl'
>>> h = 'abc <span class="TextsStyles1">def  ghi</span> jkl'
>>> html2md_eShia.markdownify(h)
'abc @QUR02 def ghi jkl'

# The @QUR@ example outputs are a result of post-processing;
# the function itself will produce: 'abc @QUR@ def ghi\njkl'

>>> h = 'abc <span class="Titr3">def</span> ghi'
>>> html2md_eShia.markdownify(h)
'abc\n\n### ||| def\n\nghi'
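
The class_dict lookup used for span, div and p tags can be sketched as a dictionary of format strings keyed by class name. The dictionary contents below are guesses based on the doctests above (the real self.class_dict may differ), and format_by_class is a hypothetical name.

```python
# Assumed mapping, inferred from the doctests; not copied from the module.
CLASS_DICT = {
    "Aye": "@QUR@ {}\n",
    "Titr3": "\n\n### ||| {}\n\n",
}


def format_by_class(css_class, text, class_dict=CLASS_DICT):
    """Wrap text in the format string registered for css_class.

    Tags without a class, or with an unsupported class, are simply
    stripped: the text is returned unchanged.
    """
    template = class_dict.get(css_class)
    if template is None:
        return text
    return template.format(text)


print(repr(format_by_class("Aye", "def ghi")))
print(repr(format_by_class("unknown_span_class", "def")))
```

The temporary @QUR@ tag produced here would still have to go through the post-processing step to become @QUR02.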
openiti.new_books.convert.helper.html2md_eShia.markdownify(html, **options)

Shortcut to the convert method of the EShiaHtmlConverter class.

openiti.new_books.convert.helper.html2md_GRAR

Convert GRAR library html to OpenITI mARkdown.

This script subclasses the generic MarkdownConverter class from the html2md module (based on python-markdownify, https://github.com/matthewwithanm/python-markdownify), which uses BeautifulSoup to create a flexible converter. The subclass in this module, GRARHtmlConverter, adds methods specifically for the conversion of books from the GRAR library to OpenITI mARkdown:

  • span conversion: the GRAR html seems to be a conversion of tei xml;
    the tei data is often embedded inside the id of a span.

Inheritance schema of the GRARHtmlConverter:

MarkdownConverter GRARHtmlConverter
Options (inherited)
DefaultOptions (inherited)
__init__ (inherited)
__getattr__ (inherited)
convert (inherited)
process_tag (inherited)
process_text (inherited)
fill_out_columns (inherited)
post_process_md post_process_md (appended)
should_convert_tag (inherited)
indent (inherited)
underline (inherited)
create_underline_line (inherited)
convert_a (inherited)
convert_b (inherited)
convert_blockquote convert_blockquote
convert_br (inherited)
convert_em (inherited)
convert_hn (inherited)
convert_i (inherited)
convert_img (inherited)
convert_list (inherited)
convert_li (inherited)
convert_ol (inherited)
convert_p (inherited)
convert_table (inherited)
convert_tr (inherited)
convert_ul (inherited)
convert_strong (inherited)
(none) convert_span
class openiti.new_books.convert.helper.html2md_GRAR.GRARHtmlConverter(**options)

Convert GRAR library html to OpenITI mARkdown.

convert_blockquote(el, text)

Convert blockquote tags to mARkdown

NB: the @QUOTE@ tag is a temporary tag that will be removed in the post-processing step.

Examples

>>> import html2md_GRAR
>>> h = 'abc <blockquote>def</blockquote> ghi'
>>> html2md_GRAR.markdownify(h)
'abc def ghi'
>>> h = 'abc <span id="pb-21"/><blockquote>def</blockquote> ghi jkl'
>>> html2md_GRAR.markdownify(h)
'abc\n\nPageV00P020\n\n# def ghi jkl'
convert_span(el, text)

Converts html <span> tags, depending on their id attribute.

Example

>>> import html2md_GRAR
>>> h = 'abc <span>def</span> ghi'
>>> html2md_GRAR.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="unknown_span_class">def</span> ghi'
>>> html2md_GRAR.markdownify(h)
'abc def ghi'

Page numbers (NB: mARkdown uses page end instead of page beginning)

>>> h = 'abc <span id="pb-21"/>def  ghi jkl'
>>> html2md_GRAR.markdownify(h)
'abc PageV00P020 def ghi jkl'

Sections:

>>> h = 'abc <span class="book" id="part-2 div1-2"/>def  ghi jkl'
>>> html2md_GRAR.markdownify(h)
'abc\n\n### | [book 2]\n\ndef ghi jkl'
>>> h = 'abc <span class="chapter" id="part-2 div2-1"/>def  ghi jkl'
>>> html2md_GRAR.markdownify(h)
'abc\n\n### || [chapter 1]\n\ndef ghi jkl'
>>> h = 'abc <span class="chapter" id="part-2 div2-1" title="Intro"/>def  ghi jkl'
>>> html2md_GRAR.markdownify(h)
'abc\n\n### || [chapter 1: Intro]\n\ndef ghi jkl'
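
How the span id might be decoded can be sketched with two regular expressions. This is an illustration built from the doctests above, under several assumptions: the volume number is always 00, a "pb-N" id marks the beginning of page N (so the mARkdown page end is N-1), and the number after "div" gives the heading level. The function name span_id_to_markdown is hypothetical.

```python
import re


def span_id_to_markdown(span_id, span_class=None, title=None):
    """Translate a GRAR-style span id into a mARkdown tag (sketch only)."""
    page = re.search(r"pb-(\d+)", span_id)
    if page:
        # TEI marks the page beginning; mARkdown marks the page end.
        return "PageV00P{:03d}".format(int(page.group(1)) - 1)
    section = re.search(r"div(\d+)-(\d+)", span_id)
    if section:
        level, number = int(section.group(1)), section.group(2)
        label = "{} {}".format(span_class, number)
        if title:
            label += ": " + title
        return "\n\n### {} [{}]\n\n".format("|" * level, label)
    return ""


print(span_id_to_markdown("pb-21"))
print(repr(span_id_to_markdown("part-2 div2-1", "chapter", "Intro")))
```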
post_process_md(text)

Appends to the MarkdownConverter.post_process_md() method.

openiti.new_books.convert.helper.html2md_GRAR.markdownify(html, **options)

Shortcut to the convert method of the GRARHtmlConverter class.

openiti.new_books.convert.helper.html2md_hindawi

Convert Hindawi library html to OpenITI mARkdown.

This script subclasses the generic MarkdownConverter class from the html2md module (based on python-markdownify, https://github.com/matthewwithanm/python-markdownify), which uses BeautifulSoup to create a flexible converter. The subclass in this module, HindawiConverter, adds methods specifically for the conversion of books from the Hindawi library to OpenITI mARkdown:

  • special treatment of <h4> heading tags
  • div classes “poetry_container”, “section”, “footnote” and “subtitle”
  • span class “quran”

The easiest way to use this is to simply feed the html (as string) to the markdownify() function, which will create an instance of the HindawiConverter class and return the converted string.

Examples (doctests):

Headings: h1 (from superclass html2md.MarkdownConverter)

>>> import html2md_hindawi
>>> h = '<h1>abc</h1>'
>>> html2md_hindawi.markdownify(h)
'\n\n### | abc\n\n'
NB: heading style is OpenITI mARkdown style by default,
but can be set to other styles as well:
>>> h = '<h1>abc</h1>'
>>> html2md_hindawi.markdownify(h, md_style=UNDERLINED)
'\n\nabc\n===\n\n'
>>> h = '<h1>abc</h1>'
>>> html2md_hindawi.markdownify(h, md_style=ATX)
'\n\n# abc\n\n'

Headings: <h4>

NB: in the Hindawi library, <h4> tag is used for section headings
on all levels but the highest one. The section level must be derived from the id of the parent div.
>>> h = '<div class="section" id="sect2_4"><h4>abc</h4></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n### ||| abc\n\n'
>>> h = '<div class="section" id="sect5_2"><h4>abc</h4></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n### |||||| abc\n\n'

Poetry div, single-line:

>>> h = '    <div class="poetry_container line">      <div>        <div>hemistich1</div>        <div>hemistich2</div>      </div>    </div>'
>>> html2md_hindawi.markdownify(h)
'\n# hemistich1 %~% hemistich2\n'

Poetry div, multiple line:

>>> h = '    abc    <div class="poetry_container">      <div>        <div>hemistich1</div>        <div>hemistich2</div>      </div>      <div>        <div>hemistich3</div>        <div>hemistich4</div>      </div>    </div>    def'
>>> html2md_hindawi.markdownify(h)
' abc\n# hemistich1 %~% hemistich2\n# hemistich3 %~% hemistich4\ndef'

Section div without heading:

>>> h = 'abc             <div class="section" id="sect2_9">def</div>             ghi'
>>> html2md_hindawi.markdownify(h)
'abc\n\n### |||\ndef\n\nghi'

Section div with heading (handled by h4):

>>> h = 'abc             <div class="section" id="sect2_9">               <h4>title</h4>               <p>def</p>             </div>             ghi'
>>> html2md_hindawi.markdownify(h)
'abc\n\n### ||| title\n\n# def\n\nghi'

Footnote divs:

>>> h = '<div class="footnote"><sup>1 </sup>footnotetext</div>'
>>> html2md_hindawi.markdownify(h)
'\n\nFOOTNOTE1 footnotetext\n\n'
NB: FOOTNOTE is a tag that will be used to extract all footnotes
in a next step.

Subtitle divs:

>>> h = '<h1>Title text</h1><div class="subtitle">Subtitle text</div>'
>>> html2md_hindawi.markdownify(h)
'\n\n### | Title text Subtitle text\n\n'

Divs without class or with an unsupported class are simply stripped:

>>> h = 'abc             <div>def</div>             ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'
>>> h = 'abc             <div class="unknown_div_class">def</div>             ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'

Spans with class “quran”:

>>> h = 'abc <span class="quran">def ghi</span> jkl'
>>> html2md_hindawi.markdownify(h)
'abc @QUR02 def ghi jkl'

# The latter is a result of post-processing;
# the function itself will produce: 'abc @QUR@ def ghi\njkl'

Spans without class or with an unsupported class are stripped:

>>> h = 'abc <span>def</span> ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="unknown_span_class">def</span> ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'

Links:

>>> h = '<a href="a/b/c">abc</a>'
>>> html2md_hindawi.markdownify(h)
'[abc](a/b/c)'

Unordered lists:

>>> h = '<ul><li>item1</li><li>item2</li></ul>'
>>> html2md_hindawi.markdownify(h)
'\n* item1\n* item2\n\n'

Ordered lists:

>>> h = '<ol><li>item1</li><li>item2</li></ol>'
>>> html2md_hindawi.markdownify(h)
'\n1. item1\n2. item2\n\n'

Nested lists:

>>> h = '<ol><li>item1</li><li>item2:<ul><li>item3</li><li>item4</li></ul></li></ol>'
>>> html2md_hindawi.markdownify(h)
'\n1. item1\n2. item2:\n\n\t* item3\n\t* item4\n\t\n\n'

Italics (<i> and <em> tags):

>>> h = 'abc <em>def</em> ghi'
>>> html2md_hindawi.markdownify(h)
'abc *def* ghi'
>>> h = 'abc <i>def</i> ghi'
>>> html2md_hindawi.markdownify(h)
'abc *def* ghi'

Bold (<b> and <strong> tags):

>>> h = 'abc <b>def</b> ghi'
>>> html2md_hindawi.markdownify(h)
'abc **def** ghi'
>>> h = 'abc <strong>def</strong> ghi'
>>> html2md_hindawi.markdownify(h)
'abc **def** ghi'

Tables:

>>> h = '    <table>      <tr>        <th>th1aaa</th><th>th2</th>      </tr>      <tr>        <td>td1</td><td>td2</td>      </tr>    </table>'
>>> html2md_hindawi.markdownify(h)
'\n\n| th1aaa | th2 |\n| ------ | --- |\n| td1    | td2 |\n\n'
class openiti.new_books.convert.helper.html2md_hindawi.HindawiConverter(**options)

Convert Hindawi library html to OpenITI mARkdown.

convert_a(el, text)

Converts html links.

Overrides the MarkdownConverter.convert_a() method. Introduces an exception for links between footnote markers and footnotes.

Example

>>> import html2md_hindawi
>>> h = '<a href="a/b/c">abc</a>'
>>> html2md_hindawi.markdownify(h)
'[abc](a/b/c)'
>>> import html2md_hindawi
>>> h = 'abc <a href="ftn1">1</a>'
>>> html2md_hindawi.markdownify(h)
'abc [1]'
convert_div(el, text)

Converts html <div> tags, depending on their class.

In the MarkdownConverter class, div tags are simply stripped away.

Examples

# no div class: tags are stripped off

>>> import html2md_hindawi
>>> h = 'abc                     <div>def</div>                     ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'

# unknown div class: tags are stripped off

>>> import html2md_hindawi
>>> h = 'abc                     <div class="unknown_div_class">def</div>                     ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'

# poetry single-line:

>>> h = '            <div class="poetry_container line">              <div>                <div>hemistich1</div>                <div>hemistich2</div>              </div>            </div>'
>>> html2md_hindawi.markdownify(h)
'\n# hemistich1 %~% hemistich2\n'

# poetry multiple line:

>>> h = '            abc            <div class="poetry_container">              <div>                <div>hemistich1</div>                <div>hemistich2</div>              </div>              <div>                <div>hemistich3</div>                <div>hemistich4</div>              </div>            </div>            def'
>>> html2md_hindawi.markdownify(h)
' abc\n# hemistich1 %~% hemistich2\n# hemistich3 %~% hemistich4\ndef'

# section without heading:

>>> h = 'abc                     <div class="section" id="sect2_9">def</div>                     ghi'
>>> html2md_hindawi.markdownify(h)
'abc\n\n### |||\ndef\n\nghi'

# section with heading (handled by h4):

>>> h = 'abc                     <div class="section" id="sect2_9">                       <h4>title</h4>                       <p>def</p>                     </div>                     ghi'
>>> html2md_hindawi.markdownify(h)
'abc\n\n### ||| title\n\n# def\n\nghi'

# footnote:

>>> h = '<div class="footnote"><sup>1 </sup>footnotetext</div>'
>>> html2md_hindawi.markdownify(h)
'\n\nFOOTNOTE1 footnotetext\n\n'

Paragraph block (similar to <p>):

>>> h = '<div class="paragraph-block">abc def ghi</div>'
>>> html2md_hindawi.markdownify(h)
'\n\n# abc def ghi\n\n'
>>> h = '<div class="paragraph-block"><p>abc def ghi</p></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n# abc def ghi\n\n'
>>> h = '<div class="paragraph-block"></div>'
>>> html2md_hindawi.markdownify(h)
''
>>> h = '<div class="paragraph-block"><div class="paragraph-block">abc</div></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n# abc\n\n'
convert_h4(el, text)

Converts <h4> header tags.

In the Hindawi library, <h4> tags are used for subsections on any level. The section level must be taken from the id of the parent div. NB: the ### | level in Hindawi is 0.

Example

>>> import html2md_hindawi
>>> h = '<div class="section" id="sect2_4"><h4>abc</h4></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n### ||| abc\n\n'
>>> h = '<div class="section" id="sect5_2"><h4>abc</h4></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n### |||||| abc\n\n'
convert_hn(n, el, text)

Converts headings in the usual way except h4 headings.

In the Hindawi library, <h4> tags are used for subsections on any level. The section level must be derived from the id of the parent div.

Example

>>> import html2md_hindawi
>>> h = '<h1>abc</h1>'
>>> html2md_hindawi.markdownify(h)
'\n\n### | abc\n\n'
>>> h = '<h3>abc</h3>'
>>> html2md_hindawi.markdownify(h)
'\n\n### ||| abc\n\n'
>>> h = '<div class="section" id="sect5_2"><h4>abc</h4></div>'
>>> html2md_hindawi.markdownify(h)
'\n\n### |||||| abc\n\n'
convert_span(el, text)

Converts html <span> tags, depending on their class attribute.

Supported span classes should be stored in self.span_dict (key: span class (str); value: formatting string)

Example

>>> import html2md_hindawi
>>> h = 'abc <span>def</span> ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="unknown_span_class">def</span> ghi'
>>> html2md_hindawi.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="quran">def  ghi</span> jkl'
>>> html2md_hindawi.markdownify(h)
'abc @QUR02 def ghi jkl'

# The latter is a result of post-processing;
# the function itself will produce: 'abc @QUR@ def ghi\njkl'

get_section_level(el)

Gets the level of the current section (or its parent).
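
The relation between the section div id and the heading level can be sketched as follows. It assumes that ids have the form "sect<level>_<number>" and that the Hindawi level is zero-based (level 0 corresponds to "### |"), as the doctests for convert_h4 suggest. The function names are hypothetical and operate on the id string rather than on a BeautifulSoup element.

```python
import re


def section_level_from_id(div_id):
    """Return the mARkdown heading level encoded in a section div id."""
    match = re.match(r"sect(\d+)_", div_id)
    if match is None:
        return None
    # Hindawi levels start at 0; mARkdown levels start at 1.
    return int(match.group(1)) + 1


def section_heading(div_id, title):
    """Format a mARkdown section heading for the given div id."""
    level = section_level_from_id(div_id)
    return "\n\n### {} {}\n\n".format("|" * level, title)


print(repr(section_heading("sect2_4", "abc")))
```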

openiti.new_books.convert.helper.html2md_hindawi.markdownify(html, **options)

Shortcut to the convert method of the HindawiConverter class.

openiti.new_books.convert.helper.md2html

Replace markdown tags by html tags.

TO DO:

  • replace ### |+ headers not by <hx> but by <div><hx>title</hx>content</div>?

openiti.new_books.convert.helper.md2html.dict_units(text)

Replace dictionary units mARkdown tags with html tags.

Examples

>>> import md2html
>>> md2html.dict_units("### $DIC_NIS$ Name of the entry")
'<div class="entry descr-name"><span class="entry-title">Name of the entry</span>\n</div>\n'
>>> md2html.dict_units("### $DIC_TOP$ Name of the place")
'<div class="entry toponym"><span class="entry-title">Name of the place</span>\n</div>\n'
>>> md2html.dict_units("### $DIC_LEX$ Word")
'<div class="entry lexical"><span class="entry-title">Word</span>\n</div>\n'
>>> md2html.dict_units("### $DIC_BIB$ Book title")
'<div class="entry book"><span class="entry-title">Book title</span>\n</div>\n'
>>> md2html.dict_units("### $BIO_MAN$ Name of the man")
'<div class="entry man"><span class="entry-title">Name of the man</span>\n</div>\n'
>>> md2html.dict_units("### $BIO_WOM$ Name of the woman")
'<div class="entry woman"><span class="entry-title">Name of the woman</span>\n</div>\n'
>>> md2html.dict_units("### $BIO_REF$ Cross-reference to a person")
'<div class="entry cross-ref"><span class="entry-title">Cross-reference to a person</span>\n</div>\n'
>>> md2html.dict_units("### $BIO_NLI$ List of names")
'<div class="entry name-list"><span class="entry-title">List of names</span>\n</div>\n'
>>> md2html.dict_units("### $CHR_EVE$ Event description")
'<div class="entry event"><span class="entry-title">Event description</span>\n</div>\n'
>>> md2html.dict_units("### $CHR_RAW$ Events description")
'<div class="entry events-batch"><span class="entry-title">Events description</span>\n</div>\n'
>>> md2html.dict_units("### $ Name of the man")
'<div class="entry man"><span class="entry-title">Name of the man</span>\n</div>\n'
>>> md2html.dict_units("### $$ Name of the woman")
'<div class="entry woman"><span class="entry-title">Name of the woman</span>\n</div>\n'
>>> md2html.dict_units("### $$$ Cross-reference to a person")
'<div class="entry cross-ref"><span class="entry-title">Cross-reference to a person</span>\n</div>\n'
>>> md2html.dict_units("### $$$$ List of names")
'<div class="entry name-list"><span class="entry-title">List of names</span>\n</div>\n'
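
The tag-to-class mapping shown in these doctests can be sketched as a dictionary lookup inside a regex substitution. The sketch below handles the heading line only and covers just a few of the tags; it is an illustration of the mapping, not the module's implementation, and the names ENTRY_CLASSES and dict_units_sketch are hypothetical.

```python
import re

# Partial mapping, copied from the doctest outputs above.
ENTRY_CLASSES = {
    "$DIC_NIS$": "descr-name",
    "$DIC_TOP$": "toponym",
    "$BIO_MAN$": "man",
    "$BIO_WOM$": "woman",
    "$": "man",
    "$$": "woman",
    "$$$": "cross-ref",
    "$$$$": "name-list",
}


def dict_units_sketch(text):
    """Replace '### $TAG$ title' lines with an html entry div."""
    def repl(match):
        css = ENTRY_CLASSES.get(match.group(1))
        if css is None:
            return match.group(0)  # unknown tag: leave the line as it is
        return ('<div class="entry {}"><span class="entry-title">{}</span>'
                '\n</div>\n').format(css, match.group(2))

    # Either a named tag ($ABC_DEF$) or a run of dollar signs.
    return re.sub(r"^### (\$[A-Z_]*\$|\$+) (.*)$", repl, text, flags=re.M)


print(repr(dict_units_sketch("### $$ Name of the woman")))
```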
openiti.new_books.convert.helper.md2html.tables(text)

Replace markdown table tags with html table tags.

Examples

>>> text = "# first paragraph\n| table header 1 | header 2 |\n|:---------------|----------|\n| table row 1 col 1 | table row 1 col 2|\n| table row 2 col 1 | table row 2 col 2|"
>>> import md2html
>>> md2html.tables(text)
'# first paragraph\n<table>\n<tr>\n<th>table header 1</th><th>header 2</th>\n</tr>\n<tr>\n<td>table row 1 col 1</td><td>table row 1 col 2</td>\n</tr>\n<tr>\n<td>table row 2 col 1</td><td>table row 2 col 2</td>\n</tr>\n</table>\n'
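
The conversion of a single markdown table can be sketched line by line. The sketch assumes the input contains only the table itself, with the header in the first line and the delimiter row in the second; locating tables inside a longer text, as md2html.tables does, is left out. The name table_to_html is hypothetical.

```python
def table_to_html(md_table):
    """Convert one markdown pipe table into html table markup (sketch)."""
    html = ["<table>"]
    for i, line in enumerate(md_table.strip().splitlines()):
        if i == 1:
            continue  # skip the |:---|---| delimiter row
        tag = "th" if i == 0 else "td"
        # Drop the outer pipes, then split the row into trimmed cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        html.append("<tr>")
        html.append("".join("<{0}>{1}</{0}>".format(tag, c) for c in cells))
        html.append("</tr>")
    html.append("</table>")
    return "\n".join(html) + "\n"


print(table_to_html("| header 1 | header 2 |\n|:---|---|\n| a | b|"))
```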
openiti.new_books.convert.helper.tei2md

Convert TEI xml to Markdown.

This program contains a sub-class of html2md.MarkdownConverter, which in turn is an adaptation of python-markdownify (https://github.com/matthewwithanm/python-markdownify) to output OpenITI mARkdown.

IMPORTANT: since TEI indicates page beginning (<pb/>) and OpenITI mARkdown page numbers are at the bottom of a page, page numbers should be pre-processed in a text before feeding it to the markdownify function. The tei2md.preprocess_page_numbers function can be used for this.

You can use the tei2md.TeiConverter class as a base class and subclass it to add methods, adapt the post-processing method etc.

E.g.:

    import re
    import tei2md

    class GRAR_converter(tei2md.TeiConverter):
        def post_process_md(self, text):
            text = super().post_process_md(text)
            # remove blank lines marked with "DELETE_PREVIOUS_BLANKLINES" tag
            text = re.sub("\n+DELETE_PREVIOUS_BLANKLINES", "", text)
            # replace placeholders for spaces in tables:
            text = re.sub("ç", " ", text)
            return text

This table shows the methods and classes the TeiConverter inherits from html2md.MarkdownConverter, which methods it overwrites, and which methods it adds:

html2md.MarkdownConverter tei2md.TeiConverter
class DefaultOptions (inherited)
class Options (inherited)
__init__ (inherited)
convert convert
process_tag (inherited)
process_text (inherited)
fill_out_columns (inherited)
post_process_md (inherited)
__getattr__ (inherited)
should_convert_tag (inherited)
indent (inherited)
create_underline_line (inherited)
underline (inherited)
convert_a (inherited)
convert_b (inherited)
convert_blockquote (inherited)
convert_br (inherited)
convert_em (inherited)
convert_hn (inherited)
convert_i (inherited)
convert_img (inherited)
convert_list (inherited)
convert_li (inherited)
convert_p (inherited)
convert_strong (inherited)
convert_table (inherited)
convert_tr (inherited)
(none) convert_div1
(none) convert_div2
(none) convert_div3
(none) convert_div
(none) convert_head
(none) convert_lg
(none) find_heading_level (dummy)

Examples (doctests):

Specific tei-related tags:

>>> import tei2md
>>> h = 'abc             <div1 type="book" n="0" name="Preface">def</div1>             ghi'
>>> tei2md.markdownify(h)
'abc\n\n### | [book 0: Preface]\n\ndef\n\nghi'
>>> h = 'abc             <div2 type="section" n="1">def</div2>             ghi'
>>> tei2md.markdownify(h)
'abc\n\n### || [section 1]\n\ndef\n\nghi'
>>> h = 'abc             <div3 type="Aphorism">def</div3>             ghi'
>>> tei2md.markdownify(h)
'abc\n\n### ||| [Aphorism]\n\ndef\n\nghi'

Divs without type are stripped:

>>> h = 'abc             <div>def</div>             ghi'
>>> tei2md.markdownify(h)
'abc def ghi'

<head> tags are converted to level-3 mARkdown headers by default:

>>> h = 'abc             <head>def</head>             ghi'
>>> tei2md.markdownify(h)
'abc\n\n### ||| def\n\nghi'
>>> h = 'abc             <lb/>def             <lb/>ghi'
>>> tei2md.markdownify(h)
'abc\n~~def\n~~ghi'
>>> h = '    abc    <lg>      <l>line1</l>      <l>line2</l>      <l>line3</l>      <l>line4</l>    </lg>    def'
>>> tei2md.markdownify(h)
' abc\n# line1\n# line2\n# line3\n# line4\n\ndef'

In addition to these TEI tags, the converter also inherits methods from html2md.MarkdownConverter that deal with more standard html tags:

Headings: h1

>>> import tei2md
>>> h = '<h1>abc</h1>'
>>> tei2md.markdownify(h)
'\n\n### | abc\n\n'
NB: heading style is OpenITI mARkdown style by default,
but can be set to other styles as well:
>>> h = '<h1>abc</h1>'
>>> tei2md.markdownify(h, md_style=UNDERLINED)
'\n\nabc\n===\n\n'
>>> h = '<h1>abc</h1>'
>>> tei2md.markdownify(h, md_style=ATX)
'\n\n# abc\n\n'

Paragraphs (<p>):

>>> h = "<p>abc</p>"
>>> tei2md.markdownify(h)
'\n\n# abc\n\n'
>>> h = "<p>abc</p>"
>>> tei2md.markdownify(h, md_style=ATX)
'\n\nabc\n\n'

Spans without class or with an unsupported class are stripped:

>>> h = 'abc <span>def</span> ghi'
>>> tei2md.markdownify(h)
'abc def ghi'
>>> h = 'abc <span class="unknown_span_class">def</span> ghi'
>>> tei2md.markdownify(h)
'abc def ghi'

Links:

>>> h = '<a href="a/b/c">abc</a>'
>>> tei2md.markdownify(h)
'[abc](a/b/c)'

Unordered lists:

>>> h = '<ul><li>item1</li><li>item2</li></ul>'
>>> tei2md.markdownify(h)
'\n* item1\n* item2\n\n'

Ordered lists:

>>> h = '<ol><li>item1</li><li>item2</li></ol>'
>>> tei2md.markdownify(h)
'\n1. item1\n2. item2\n\n'

Nested lists:

>>> h = '<ol><li>item1</li><li>item2:<ul><li>item3</li><li>item4</li></ul></li></ol>'
>>> tei2md.markdownify(h)
'\n1. item1\n2. item2:\n\n\t* item3\n\t* item4\n\t\n\n'

Italics (<i> and <em> tags):

>>> h = 'abc <em>def</em> ghi'
>>> tei2md.markdownify(h)
'abc *def* ghi'
>>> h = 'abc <i>def</i> ghi'
>>> tei2md.markdownify(h)
'abc *def* ghi'

Bold (<b> and <strong> tags):

>>> h = 'abc <b>def</b> ghi'
>>> tei2md.markdownify(h)
'abc **def** ghi'
>>> h = 'abc <strong>def</strong> ghi'
>>> tei2md.markdownify(h)
'abc **def** ghi'

Tables:

>>> h = '    <table>      <tr>        <th>th1aaa</th><th>th2</th>      </tr>      <tr>        <td>td1</td><td>td2</td>      </tr>    </table>'
>>> tei2md.markdownify(h)
'\n\n| th1aaa | th2 |\n| ------ | --- |\n| td1    | td2 |\n\n'
i.e.:
| th1aaa | th2 |
| ------ | --- |
| td1    | td2 |
openiti.new_books.convert.helper.tei2md.preprocess_page_numbers(s)

Turn page beginnings into page endings.

TEI xml indicates the beginning of a page, while OpenITI mARkdown indicates the end of a page.
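
The idea behind the page number pre-processing can be sketched as relabelling each page-break marker with the number of the page that ends there. The sketch assumes page breaks of the form <pb n="N"/> with consecutive integer page numbers; the actual function may handle other cases, and the name shift_page_breaks is hypothetical.

```python
import re


def shift_page_breaks(tei):
    """Relabel <pb n="N"/> (start of page N) as the end of page N-1."""
    def repl(match):
        return '<pb n="{}"/>'.format(int(match.group(1)) - 1)

    return re.sub(r'<pb\s+n="(\d+)"\s*/>', repl, tei)


print(shift_page_breaks('end of page 20 <pb n="21"/> start of page 21'))
```

With this approach the last page of a text has no closing marker, since no page break follows it.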

openiti.new_books.convert.helper.yml2json

Convert a YML file to json.

The yml2json function can be used to convert a yml metadata file to

  • a list of dictionaries: useful for
    • conserving the order
    • conserving comments between the records
  • a dictionary of dictionaries (key: record id, val: metadata dictionary), useful for easy lookup by record id
openiti.new_books.convert.helper.yml2json.yml2json(yml_fp, container=[], rec_start='##RECORD', id_key='01#BookID###', key_val_splitter=':::: ')

Convert a YML file to json.

Parameters:
  • yml_fp (str) – the filepath to the YML file.
  • container (obj) – an empty list or dict into which the metadata dictionaries will be stored. If the container is a list, the order of the records will be stored as well as comment lines between the records.
  • rec_start (str) – the starting characters of a record;
  • id_key (str) – the yml key that indicates the id of a record
  • key_val_splitter (str) – the separator used to separate keys and values.
Returns:

the container filled with the metadata dictionaries.

Return type:

container (obj)
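
The difference between the two container types can be sketched on a list of lines instead of a file. This is a simplified illustration that uses the default rec_start, id_key and key_val_splitter values from the signature above; it ignores comment lines and multi-line values, and the function name records_to_container and the sample keys other than the id key are hypothetical.

```python
def records_to_container(lines, container, rec_start="##RECORD",
                         id_key="01#BookID###", key_val_splitter=":::: "):
    """Collect metadata records from YML lines into a list or a dict."""
    record = None

    def store(rec):
        if isinstance(container, dict):
            container[rec.get(id_key)] = rec  # lookup by record id
        else:
            container.append(rec)  # a list keeps the original order

    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(rec_start):
            if record is not None:
                store(record)
            record = {}
        elif record is not None and key_val_splitter in line:
            key, val = line.split(key_val_splitter, 1)
            record[key] = val.strip()
    if record is not None:
        store(record)
    return container


sample = ["##RECORD 1", "01#BookID###:::: B1", "02#Title###:::: First",
          "##RECORD 2", "01#BookID###:::: B2"]
print(records_to_container(sample, []))
print(records_to_container(sample, {}))
```

Passing a fresh list or dict on every call, as done here, avoids sharing one container object between calls.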

openiti.new_books.scrape

openiti.release

openiti.release.collect_openITI_version

openiti.release.collect_release_stats