I love curly quotation marks. They're so round and inviting. I also love free e-books, and so have been delighted by Tor's current free-e-book-each-week program. Perhaps by Tor my loves may be joined? But alas, no: the HTML versions Tor provides have ASCII quotation marks, and when I asked if this could be rectified I was told, "I'm afraid the quotation-mark conversion has to stay."
So for Robert Charles Wilson's Spin I rolled up my crazy-sleeves, pulled out my regexps, and fixed them myself. Every last one. And modified the CSS and some of the markup to much more closely resemble the formatting in the PDF version. Then wrapped it up as a valid .epub book. Then converted/tweaked to produce a great-looking Sony Reader BBeB book.
And they're all for only me! Nope, can't give them to you. The power of copyright compels me! I can add those curly quotes myself because I have the source HTML to start with. If I start handing people my curly-quoted version I have no means to stop it from falling into new hands which didn't already have the straight-from-Tor edition.
Or do I?
I could provide you with a grid of just the byte offsets of the various curly quotes. Some extreme variant of diff/patch in which nothing of the original copyrighted text persists. It would contain just my curly quotes, owned by me under copyright law and free to give you as I wish. You provide the straight-from-Tor e-book, mix in my curly quotes and, poof, you have a be-curled edition of Spin. But this doesn't work for format-shifting over compression, encoding changes, etc., where "put a curly quote here" ceases to make sense.
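As a rough illustration of the offset-grid idea, here is a minimal Python sketch. It is not part of Obelisk; the function names `make_patch` and `apply_patch` are made up for this example, and it works on character offsets in same-length strings rather than on raw byte offsets.

```python
def make_patch(original: str, curled: str) -> list[tuple[int, str]]:
    """Record (offset, replacement) for every position where the texts differ."""
    if len(original) != len(curled):
        raise ValueError("this sketch only handles texts of equal length")
    return [(i, c) for i, (o, c) in enumerate(zip(original, curled)) if o != c]


def apply_patch(original: str, patch: list[tuple[int, str]]) -> str:
    """Rebuild the curled text from the original plus the patch."""
    chars = list(original)
    for offset, replacement in patch:
        chars[offset] = replacement
    return "".join(chars)


straight = 'He said "hello" and left.'
curly = "He said \u201chello\u201d and left."

patch = make_patch(straight, curly)  # holds only offsets and curly quotes
print(patch)
print(apply_patch(straight, patch) == curly)
```

The patch holds nothing but positions and quotation marks, which is the appeal. It is also exactly why it breaks as soon as the target is compressed or re-encoded: the offsets no longer point at anything meaningful.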
Unless we distill the idea down to the lowest level: what is XOR but the difference between two bits?
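To make the XOR idea concrete, here is a small Python sketch. It is an assumption-laden toy, not the actual obelisk.py: the helper name `xor_with_key` is hypothetical, and simply repeating the original file as the key is a simplification chosen to keep the example short.

```python
from itertools import cycle


def xor_with_key(data: bytes, key: bytes) -> bytes:
    """XOR each byte of data with the key, repeating the key as needed."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


# Stand-ins for the straight-quoted original and the curly-quoted derivative.
original = b'<p>"Spin" is a novel.</p>'
derived = "<p>\u201cSpin\u201d is a novel.</p>".encode("utf-8")

diff = xor_with_key(derived, original)   # what would be handed out
rebuilt = xor_with_key(diff, original)   # what a holder of the original computes
print(rebuilt == derived)
```

Because XOR is its own inverse, the same function produces the difference and reverses it, and it does not care whether the derived file is HTML, a ZIP archive, or anything else.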
Let's try an experiment, which I'm calling Obelisk[1]. Download the following files:
Then get your copy of WilsonSpin_HTML.zip handy, pop open your favorite shell, and run:
Code
python obelisk.py Mohm5pei WilsonSpin_HTML.zip Mohm5pei#WilsonSpin_HTML.zip#Spin.epub.obelisk Spin.epub
python obelisk.py AhZe5shu WilsonSpin_HTML.zip AhZe5shu#WilsonSpin_HTML.zip#Spin.lrf.obelisk Spin.lrf
The results should be curly-quoted .epub and BBeB versions of Spin, seamlessly merging Tor's bits with mine into unified wholes.
Let me know what you think.
[1] Obelisk is similar to and inspired by a project called Monolith, although with rather different goals.
Assuming the source file has an even number of quotes, shouldn't replacing them with curly quotes be as simple as
Code
intag = False
inquote = False
chars = list(data)  # data is the HTML source as a str; strings are immutable
for i, ch in enumerate(chars):
    if ch == '<':
        intag = True
    elif ch == '>':
        intag = False
    elif not intag and ch == '"':
        # Alternate between opening and closing curly quotes
        chars[i] = '\u201d' if inquote else '\u201c'
        inquote = not inquote
data = ''.join(chars)
Or is there something about curly quotes I'm missing?
Quote kovidgoyal
Assuming the source file has an even number of quotes, shouldn't replacing them with curly quotes be as simple as
It's mostly mechanizable, but not quite that simply. For example:
“This quotation-marked bit goes on for more than one paragraph. It doesn’t end with a double quote.
“And here I have some examples of single quotes. I’ve got several of ’em. The examples’ quotation marks point in all kinds of directions.
“And here ends the quote.”
So pretty much the rules are:
Code
<ws>" == “
"<ws> == ”
\w'\w == ’
'<ws> == ’
<ws>' == ‘
Where <ws> is whitespace plus ( ) [ ] - .
But then you have to manually check all the instances of <ws>, and should probably start by looking for any quotation marks with whitespace on both sides (usually found when doing "something like 'this' ").
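The rules above can be sketched in Python roughly as follows. This is an illustration rather than the regexps actually used for Spin: the name `curl_quotes` is made up, `\w` is approximated with `str.isalnum()`, the start and end of the text are treated as whitespace, and any quote with <ws> on both sides (or on neither) is left straight and reported for manual review.

```python
# <ws> per the rules above: whitespace plus ( ) [ ] - and the period.
WS = set(" \t\r\n()[]-.")
LEFT = {'"': "\u201c", "'": "\u2018"}
RIGHT = {'"': "\u201d", "'": "\u2019"}


def curl_quotes(text):
    """Return (converted_text, offsets_of_quotes_needing_manual_review)."""
    out = list(text)
    review = []
    for i, ch in enumerate(text):
        if ch not in LEFT:
            continue
        # Treat the edges of the text as whitespace (an assumption).
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if ch == "'" and prev.isalnum() and nxt.isalnum():
            out[i] = RIGHT["'"]      # \w'\w: apostrophe inside a word
        elif prev in WS and nxt not in WS:
            out[i] = LEFT[ch]        # <ws>" or <ws>': opening quote
        elif nxt in WS and prev not in WS:
            out[i] = RIGHT[ch]       # "<ws> or '<ws>: closing quote
        else:
            review.append(i)         # ambiguous: leave straight, flag it
    return "".join(out), review


converted, todo = curl_quotes("He said \"it's fine\" (or 'so' he claimed).")
print(converted)
print(todo)
```

Note that a leading apostrophe, as in 'em, still comes out as an opening single quote under these rules, and a closing quote right after a period gets flagged rather than converted. Those are the kinds of cases that need checking by hand.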
So anyway. Mostly mechanizable, but still some manual labor to get it perfect. And can't automate improving the CSS. :-)
Ah, I see. Well, let's see if Tor starts beating on your door in the middle of the night.
Um... it won't work because it never came zipped. And how do we know the filename to use in the ZIP file or even if we have the exact same contents?
Quote JSWolf
Um... it won't work because it never came zipped. And how do we know the filename to use in the ZIP file or even if we have the exact same contents?
The e-mails actually contain links to two separate HTML versions. One is the HTML content served directly, the other is a ZIP archive which contains the images used in the book, a (broken) OPF file, etc.
Quote llasram
The e-mails actually contain links to two separate HTML versions. One is the HTML content served directly, the other is a ZIP archive which contains the images used in the book, a (broken) OPF file, etc.
Yes, you are correct. My apologies. I'll give your script another go and see how it works out.
How do I use your script to generate a diff file for other content? I'd love to do one for Mistborn based on the PDF to make the LRF from it.
I've taken the EPUB edition and built an LRF to my specification. Looks nice. Now all I need to do is build a proper ToC and I'll be all set.
Quote JSWolf
How do I use your script to generate a diff file for other content? I'd love to do one for Mistborn based on the PDF to make the LRF from it.
It's symmetric, so:
Code
python obelisk.py SALT KEYFILE INFILE OUTFILE
For both decryption and encryption. The SALT parameter is some string of your choosing but should not be reused for a particular KEYFILE. For example:
Code
python obelisk.py sai3sahS 9780765350381.zip Mistborn.lrf sai3sahS#9780765350381.zip#Mistborn.lrf.obelisk
HOWEVER, I am not a lawyer. This certainly seems reasonable given that one needs the original file to reconstitute the derived file, but I don't really know if Tor and/or your nation's legal system will see it that way. This is an experiment; use at your own risk.
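On the SALT and KEYFILE parameters: the internals of obelisk.py are not shown in this thread, so the following is only a guess at the general shape of such a tool, written as a self-contained Python sketch. The functions `keystream` and `transform`, and the choice of SHA-256 in a counter construction, are assumptions made for illustration. It shows why one call can serve for both directions, and one plausible reason a salt should not be reused with the same key file.

```python
import hashlib


def keystream(salt: bytes, key: bytes, length: int) -> bytes:
    """Expand salt + key material into `length` pseudo-random bytes."""
    seed = hashlib.sha256(salt + b"#" + hashlib.sha256(key).digest()).digest()
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def transform(salt: str, key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; applying it twice restores the input."""
    stream = keystream(salt.encode("utf-8"), key, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))


key = b"contents of the original e-book"   # stand-in for KEYFILE
a = b"first derived file"
b = b"other derived file"

enc_a = transform("sai3sahS", key, a)
print(transform("sai3sahS", key, enc_a) == a)   # same call reverses itself

# Reusing the salt with the same key gives the same keystream, so XORing the
# two outputs cancels it and exposes the XOR of the two inputs.
enc_b = transform("sai3sahS", key, b)
leak = bytes(x ^ y for x, y in zip(enc_a, enc_b))
print(leak == bytes(x ^ y for x, y in zip(a, b)))
```

With a fresh salt per output file, the keystreams differ and that cancellation no longer happens.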