Mobileread
Post your Useful Plugin Code Fragments Here
#1  KevinH 12-14-2015, 02:49 PM
Please reserve this thread for plugin developers and others to share their code fragments useful for Sigil plugins. Any questions about them should be directed to the Plugin Development "sticky" thread.

Thanks!

KevinH

#2  KevinH 12-15-2015, 10:54 AM
Code
# Example of using the provided stream based QuickParser
# to parse metadataxml (to look for cover id)
# Also rebuilds the metadata xml in res

ps = bk.qp
ps.setContent(bk.getmetadataxml())
res = []
coverid = None
# parse the metadataxml, store away cover_id and rebuild it
for text, tagprefix, tagname, tagtype, tagattr in ps.parse_iter():
    if text is not None:
        # print(text)
        res.append(text)
    else:
        # print(tagprefix, tagname, tagtype, tagattr)
        if tagname == "meta" and tagattr.get("name", '') == "cover":
            coverid = tagattr["content"]
        res.append(ps.tag_info_to_xml(tagname, tagtype, tagattr))
original_metadata = "".join(res)

#3  rubeus 12-15-2015, 02:21 PM
You need:

a Python 3 interpreter with the PIL (Pillow) library installed

or

the bundled Python interpreter that ships with Sigil 0.9.0 and up.

Code
from PIL import Image
from io import BytesIO
Code
for (id, href, mime) in bk.image_iter():
    im = Image.open(BytesIO(bk.readfile(id)))
    (width, height) = im.size
    print('id={} href={} mime={} width={} height={}'.format(id, href, mime, width, height))

#4  DiapDealer 01-02-2016, 02:51 PM
Creating self-deleting temp folders with Python's contextmanager:

Code
from contextlib import contextmanager

@contextmanager
def make_temp_directory():
    import tempfile
    import shutil
    temp_dir = tempfile.mkdtemp()
    yield temp_dir
    shutil.rmtree(temp_dir)
Then in your plugin, you can simply do something like:
Code
with make_temp_directory() as temp_dir:
    ...  # do stuff with things in the temp_dir
It's not perfect, but barring any untrapped errors (or platform-specific permission problems), "temp_dir" will delete itself after completion of the with statement.
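To cover the "untrapped errors" case as well, the cleanup can be moved into a try/finally block. This is only a sketch of one possible variant, not a replacement for the snippet above; the demo() helper and the simulated error are made up for illustration.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def make_temp_directory():
    """Yield a fresh temp folder and remove it even if the with-body raises."""
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    finally:
        # ignore_errors keeps a locked file from raising a second exception
        shutil.rmtree(temp_dir, ignore_errors=True)


def demo():
    """Hypothetical helper: run a failing with-body and report what happened."""
    seen = {}
    try:
        with make_temp_directory() as temp_dir:
            seen["path"] = temp_dir
            seen["inside"] = os.path.isdir(temp_dir)
            raise RuntimeError("simulated plugin error")
    except RuntimeError:
        pass
    # (folder path, existed inside the with, still exists afterwards)
    return seen["path"], seen["inside"], os.path.isdir(seen["path"])
```

On Python 3.2 and later, the standard library's tempfile.TemporaryDirectory() can be used in a with statement to much the same effect.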

#5  slowsmile 12-17-2016, 06:12 AM
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:


Code
import os
try:
    from sigil_bs4 import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

def fixHTML(work_dir, file):
    output = os.path.join(work_dir, 'clean_html.htm')
    with open(file, 'rt', encoding='utf-8') as infp:
        html = infp.read()
    soup = BeautifulSoup(html, 'html.parser')
    # remove all unwanted proprietary attributes from the html file
    search_tags = ['p', 'span', 'div', 'body', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'br']
    search_attribs = ['dir', 'name', 'title', 'link', 'id', 'text', 'lang', 'clear']
    for tag in soup.find_all(search_tags):
        for attribute in search_attribs:
            del tag[attribute]
    # write the cleaned markup to a temp file, then swap it in for the original
    with open(output, 'wt', encoding='utf-8') as outfp:
        outfp.write(str(soup))
    os.remove(file)
    os.rename(output, file)
    return file

#6  DiapDealer 12-17-2016, 09:29 AM
Quote slowsmile
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:
Nice example of deleting attributes from tags with bs4, but why would "id" or "lang" attributes be considered garbage (or proprietary)? Removing "id", for instance, could break a whole bunch of links in files (html toc and ncx included). Seems a very odd attribute to want to nuke ("name" should probably be converted to "id" to prevent any possible link breakage, as well).

#7  slowsmile 12-17-2016, 07:06 PM
The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean up process. My plugin app then regenerates a new TOC on conversion to epub. And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seem to be completely superfluous. I've never seen 'lang' used in epubs within the html code.

I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now.

#8  DiapDealer 12-17-2016, 08:18 PM
Quote slowsmile
The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean up process. My plugin app then regenerates a new TOC on conversion to epub.
No problem. As I said, it's a very useful snippet for deleting attributes with bs4. I was just nervous about folks treating the "id" attribute as garbage or proprietary.

Quote slowsmile
And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seems to be completely superfluous. I've never seen 'lang' used in epubs within the html code.
Multi-language epubs (or epubs that just display other languages) can make use of it extensively. It's why Sigil's spellchecking is being enhanced to parse the lang attribute in the html. You might not ever encounter it, but it's not really that rare.

Quote slowsmile
I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now.
It is deprecated, but it will often still "work." That's why converting "names" to "id" can be beneficial when working with cluttered/proprietary/old html.
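As a rough illustration of that conversion, here is a minimal bs4 sketch. The function name names_to_ids and the sample markup are made up for this example, and it assumes that an existing id should win when an anchor carries both attributes (in which case links that targeted the old name would still need fixing by hand). Inside a Sigil plugin the import would come from sigil_bs4 instead.

```python
from bs4 import BeautifulSoup  # in a Sigil plugin: from sigil_bs4 import BeautifulSoup


def names_to_ids(html):
    """Move each anchor's 'name' value to 'id' (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # attrs={'name': True} matches anchors that have a name attribute at all
    for tag in soup.find_all('a', attrs={'name': True}):
        name = tag.attrs.pop('name')
        # assumption: keep an id that is already present rather than overwrite it
        if not tag.has_attr('id'):
            tag['id'] = name
    return str(soup)


sample = '<p><a name="ch1"></a>Chapter 1 <a href="#ch1">back</a></p>'
converted = names_to_ids(sample)
print(converted)
```

Links such as href="#ch1" keep working because the fragment identifier now resolves against the id.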

#9  slowsmile 12-17-2016, 10:16 PM
@DiapDealer...Thanks for the info. I was unaware that 'lang' was used that much in epubs so I guess I've learned something. I know that the html text is in utf-8 whereas I think the tag text is more or less ascii. So I'm slightly surprised that you need the 'lang' attribute everywhere in the html because I thought that utf-8 could be defined regionally for different languages within the epub html with the help of python. I guess that utf-8 isn't used like that when you use python in an html app.

Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5. I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.

Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.

#10  Doitsu 12-18-2016, 05:21 AM
Quote slowsmile
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:
BTW, bs4 returns the attributes as an attrs dictionary and if you're absolutely sure that you don't need any of them you could delete them all at once by assigning an empty dictionary to attrs.

Here's a minimalist proof-of-concept example:

Code
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
from sigil_bs4 import BeautifulSoup

def run(bk):
    # get all (X)HTML files
    for (html_id, href) in bk.text_iter():
        html = bk.readfile(html_id)
        soup = BeautifulSoup(html, 'html.parser')
        orig_soup = str(soup)
        # strip every attribute from all tags except the ones listed below
        for tag in soup.find_all(True):
            if tag.name not in ['style', 'a', 'nav', 'link', 'html', 'svg', 'image', 'meta'] and tag.attrs != {}:
                tag.attrs = {}
        # only write the file back if something actually changed
        if str(soup) != orig_soup:
            bk.writefile(html_id, str(soup))
            print(bk.id_to_href(html_id) + ' updated.')
    return 0

def main():
    print('I reached main when I should not have\n')
    return -1

if __name__ == "__main__":
    sys.exit(main())


Quote slowsmile
So I'm slightly surprised that you need the 'lang' attribute everywhere in the html [...]
You don't need lang attributes unless you're creating a multilingual epub book. If you do use them, however, the IDPF recommends using both the lang and xml:lang attributes.
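For what it's worth, here is a small bs4 sketch of keeping the two attributes in sync. The function name mirror_lang and the sample markup are invented for this example, and it assumes an xml:lang that is already present should be left untouched. In a Sigil plugin the import would come from sigil_bs4.

```python
from bs4 import BeautifulSoup  # in a Sigil plugin: from sigil_bs4 import BeautifulSoup


def mirror_lang(html):
    """Copy each 'lang' value to 'xml:lang' where it is missing (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # attrs={'lang': True} matches any tag that has a lang attribute
    for tag in soup.find_all(attrs={'lang': True}):
        # assumption: never overwrite an xml:lang the author already set
        tag.attrs.setdefault('xml:lang', tag['lang'])
    return str(soup)


sample = '<p>He said <span lang="fr">bonjour</span>.</p>'
result = mirror_lang(sample)
print(result)
```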

Quote slowsmile
Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5.
The epub 2.0.1 standard is based on XHTML 1.1, and XHTML 1.1 no longer allows the use of name attributes as fragment identifiers.

Quote slowsmile
I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.
Just because MS Word doesn't generate XHTML 1.1 compliant output doesn't mean it's OK to use it as is, even though many epub apps can handle name attributes as fragment identifiers.

Quote slowsmile
Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.
Amazon does indeed support the upload of ebooks with MS Word generated html files. IMHO, however, that doesn't mean that they officially condone the use of the name attribute. IIRC, the Kindle Publishing Guidelines recommend using only well-formed (X)HTML files.
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files.