WashPost recipe with sign-in
#1  stuartweinstein 11-13-2010, 10:13 AM
I have a paid subscription to the physical print edition of the Washington Post. However, I would also like to read it on my Kindle. The Post offers full access to its electronic print edition with a free registration. I wrote a recipe using the needs_subscription=True option, which works for the first few articles. Afterward, though, the article URL returns a sign-in page (again?). Any ideas? I've attached my recipe. I've tried various things (such as going single-threaded), but nothing makes a difference. Uncommenting the code that logs the response to the sign-in submit shows commands to set cookies with my login, so the sign-in seemed to work. The commented-out section in preprocess_html that signs in again didn't seem to help. I'd appreciate any suggestions. Thanks! Stuart.
[txt] washpostprint.recipe.txt (3.7 KB, 107 views)

#2  kovidgoyal 11-13-2010, 02:32 PM
Try using the get_obfuscated_article method; it gives you more control.
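
For readers following along, here is a minimal sketch of what that suggestion could look like. When a recipe sets articles_are_obfuscated = True, calibre calls get_obfuscated_article(url) for each article and expects back the path of a local file containing the article HTML. The sketch is written as a mixin so it runs without calibre installed; in a real recipe the class would also inherit from BasicNewsRecipe, which supplies get_browser(). The helpers looks_like_sign_in() and sign_in(), and the 'Sign In' marker string, are hypothetical and are not taken from the attached recipe.

```python
import tempfile


class SignInAwareMixin:
    # Tells calibre to call get_obfuscated_article() for every article URL.
    articles_are_obfuscated = True

    # Hypothetical marker; inspect the real sign-in page to choose one.
    SIGN_IN_MARKER = b'Sign In'

    def looks_like_sign_in(self, html):
        # Hypothetical check: does the fetched page contain the marker?
        return self.SIGN_IN_MARKER in html

    def sign_in(self, br):
        # Hypothetical hook: submit the site's login form using br.
        raise NotImplementedError

    def get_obfuscated_article(self, url):
        br = self.get_browser()
        html = br.open(url).read()
        if self.looks_like_sign_in(html):
            # We were bounced to the sign-in page: log in, then retry once.
            self.sign_in(br)
            html = br.open(url).read()
        # calibre expects the path of a file holding the article HTML.
        with tempfile.NamedTemporaryFile(suffix='.html', delete=False) as f:
            f.write(html)
        return f.name
```

Because the method receives the URL directly, the retry after signing in needs no extra bookkeeping. The cost is that the recipe has to write the page to a file itself.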

#3  stuartweinstein 11-18-2010, 07:00 AM
Using get_obfuscated_article is a bit of overkill, I think. I've been using self.log(soup.prettify()) in preprocess_html() to see the page contents. The problem is that I need the article URL in order to re-fetch the page after signing in. The advantage of get_obfuscated_article is that it is passed the URL, but I didn't want to deal with the output file. Instead, I overrode fetch_article() to hold onto the URL so I could grab it inside preprocess_html(). While I imagine this forces me onto a single thread, the performance is fine (since it is a daily download at 4am). I'm attaching my solution, but I'll continue to tweak it. As for access to the URL and other article attributes, I'm going to start another thread to ask about that. Thanks for the help so far.
[txt] washpostprint.recipe.txt (4.0 KB, 101 views)
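
The approach described in this post might look roughly like the following. This is a sketch under assumptions, not the contents of the attached recipe. fetch_article() is an internal BasicNewsRecipe method, so its remaining arguments are passed through untouched with *args and **kwargs rather than named. is_sign_in_page() and sign_in_and_refetch() are hypothetical helpers. The class is written as a mixin so that it runs without calibre; in a real recipe it would be listed before BasicNewsRecipe in the class bases.

```python
class RememberUrlMixin:
    # One download at a time, so current_url always refers to the page
    # that preprocess_html() is currently looking at.
    simultaneous_downloads = 1
    current_url = None

    def fetch_article(self, url, *args, **kwargs):
        # Hold onto the URL before the base class downloads the article.
        self.current_url = url
        return super().fetch_article(url, *args, **kwargs)

    def is_sign_in_page(self, soup):
        # Hypothetical check for the sign-in page.
        raise NotImplementedError

    def sign_in_and_refetch(self, url):
        # Hypothetical helper: sign in again and return the parsed article.
        raise NotImplementedError

    def preprocess_html(self, soup):
        if self.is_sign_in_page(soup):
            # Use the remembered URL to fetch the real article.
            return self.sign_in_and_refetch(self.current_url)
        return soup
```

Storing the URL in a plain instance attribute is only safe when articles are downloaded one at a time, which is why the sketch sets simultaneous_downloads to 1.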

