Mobileread
libprs500 - title/author matching regex
#1  Megatron-UK 03-31-2008, 01:08 PM
I've just started playing with libprs500 (0.4.46) in preparation for a Sony PRS505 I have on the way, and I'm having a spot of bother getting the standard regex to correctly identify the author and title from the filename.

The standard syntax I believe is: (?P<author>.+) - (?P<title>[^_]+)

If I paste the string "H.P. Lovecraft - At the Mountains of Madness.txt" into the test box, it correctly reports the following:

Title: "At the Mountains of Madness"
Author: "H.P. Lovecraft"
Series: "No Match"
Series Index: "No Match"

However, actually importing that same file into the library displays the following:

Title: "H.P. Lovecraft - At the Mountains of Madness"
Author: "H.P. Lovecraft"
(all other columns are blank as expected)

Is this standard behaviour or a bug?
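
For reference, the pattern quoted above can be tried outside libprs500. This is a minimal sketch, not libprs500's own code: the helper name `parse_filename` is hypothetical, and it assumes the extension is stripped before the pattern is applied, which is what the test-box output suggests.

```python
import os
import re

# Pattern quoted in the post; it uses Python's named-group syntax.
FILENAME_PATTERN = re.compile(r"(?P<author>.+) - (?P<title>[^_]+)")


def parse_filename(path):
    """Return (author, title) parsed from a file name, or None if no match."""
    # Assumption: the extension is removed before matching.
    stem = os.path.splitext(os.path.basename(path))[0]
    match = FILENAME_PATTERN.search(stem)
    if match is None:
        return None
    return match.group("author"), match.group("title")


print(parse_filename("H.P. Lovecraft - At the Mountains of Madness.txt"))
```

With a single " - " separator in the name, the author group takes everything before it and the title group everything after it.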

#2  Megatron-UK 03-31-2008, 02:07 PM
Upon further investigation it only seems to do this with PDF documents; the author and title fields seem to map correctly against html, zip and text based files.

So if I rename a pdf, an html file, a text file and a zip all to the same name:

wibble - wobble.[pdf|zip|txt|html]

...then the html, text and zip version of the file will all correctly display as title="wobble", author="wibble".

However the pdf file will show as title="wibble - wobble" and author="wibble".

#3  kovidgoyal 03-31-2008, 02:29 PM
libprs500 tries to read metadata from the file itself first. Only if that fails does it use the filename.
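
The order of precedence described here could be sketched roughly as follows. This is an illustration only, not libprs500's implementation: `resolve_metadata` and the `embedded` dict are hypothetical names, and the filename pattern is the one quoted earlier in the thread.

```python
import os
import re

FILENAME_PATTERN = re.compile(r"(?P<author>.+) - (?P<title>[^_]+)")


def resolve_metadata(embedded, path):
    """Prefer embedded metadata; use the filename only when none is found.

    `embedded` is a hypothetical dict such as {"title": ..., "author": ...},
    or None / empty when the file yields no metadata.
    """
    if embedded and embedded.get("title"):
        return {
            "title": embedded["title"],
            "author": embedded.get("author") or "Unknown",
        }
    # Fallback: apply the filename pattern to the name without its extension.
    stem = os.path.splitext(os.path.basename(path))[0]
    match = FILENAME_PATTERN.search(stem)
    if match:
        return {"title": match.group("title"), "author": match.group("author")}
    return {"title": stem, "author": "Unknown"}
```

Under this scheme the filename pattern is never consulted for a file whose reader reports any title at all, even an unhelpful one.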

#4  Megatron-UK 03-31-2008, 02:38 PM
Is this right though? I've attached an example of the difference in behaviour with the same filename for three different file types. There is no metadata set in the PDF file.
Clipboard01.jpg Clipboard02.jpg 

#5  Megatron-UK 03-31-2008, 03:17 PM
OK, after digging a bit: would I be correct in thinking that pdf-meta.exe is used to determine the author and title of PDF documents?

Running pdf-meta on my renamed document I get the following:

pdf-meta.exe author\ -\ title.pdf
Title : author - title
Author : Unknown
Publisher: None
Category : None
Comments : None
ISBN : None

It looks like libprs500 is taking the Title as shown by pdf-meta and not running the regex to split it based on the filename. I have a whole load of PDF docs in varying states of correct/incorrect metadata, and I'd rather load them into libprs500 using the filenames to determine author and title.

Other than using pdftk and writing a script to recurse through all of my files to insert metadata based on the filename, can we force libprs500 to use the filename instead, even for PDFs?
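
One way to spot the case shown in the pdf-meta output above, where the reported title is only the file's own name, is a simple comparison. This is a hedged sketch of a possible check, not a libprs500 feature; `title_is_filename_placeholder` is a hypothetical helper.

```python
import os


def title_is_filename_placeholder(title, path):
    """True when a reported title is just the file's name, with or without
    its final extension."""
    base = os.path.basename(path)
    stem = os.path.splitext(base)[0]
    return title.strip() in (base, stem)


# Title reported by pdf-meta for the renamed document in this post.
print(title_is_filename_placeholder("author - title", "author - title.pdf"))
```

A title flagged this way could then be treated as missing, so that the filename regex gets a chance to run.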

#6  kovidgoyal 03-31-2008, 03:43 PM
Open a ticket for a config option to customize this behavior.

#7  Megatron-UK 03-31-2008, 04:27 PM
I've recursed through all of my PDF documents and run the following script:

Quote
#!/bin/bash
# Set PDF Author/Title metadata from filenames of the form
# "AUTHOR - TITLE.pdf" or "AUTHOR - SERIES - TITLE.pdf".

find . -name "*.pdf" -print | while read -r PDFPATH
do
    DIR=$(dirname "$PDFPATH")
    FILE=$(basename "$PDFPATH")
    # Split the name on "-" and trim the spaces around each field.
    AUTHOR=$(basename "$FILE" .pdf | awk -F- '{print $1}' | sed 's/^ *//; s/ *$//')
    VAR2=$(basename "$FILE" .pdf | awk -F- '{print $2}' | sed 's/^ *//; s/ *$//')
    VAR3=$(basename "$FILE" .pdf | awk -F- '{print $3}' | sed 's/^ *//; s/ *$//')
    if [ -z "$VAR3" ]
    then
        TITLE=$VAR2
        SERIES=""
    else
        TITLE=$VAR3
        SERIES=$VAR2
    fi

    # SERIES is parsed but not written; only Author and Title are set.
    printf 'InfoKey: Author\nInfoValue: %s\nInfoKey: Title\nInfoValue: %s\n' \
        "$AUTHOR" "$TITLE" > ./metadata

    # Write the updated copy next to the original as FILE.new.
    pdftk "$DIR/$FILE" update_info ./metadata output "$DIR/$FILE.new"
done
This correctly sets the PDF metadata, based on my known-good filename format of:

AUTHOR - SERIES - TITLE.pdf

or

AUTHOR - TITLE.pdf
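
The same naming convention can be parsed in Python. This is a small sketch under stated assumptions: `split_book_filename` and `pdftk_info` are hypothetical helpers, the info text copies the InfoKey/InfoValue layout used in the script above, and the name is split on " - " (with surrounding spaces) rather than on a bare "-" so that hyphenated names survive.

```python
import os


def split_book_filename(path):
    """Split 'AUTHOR - TITLE.pdf' or 'AUTHOR - SERIES - TITLE.pdf' into fields."""
    stem = os.path.splitext(os.path.basename(path))[0]
    parts = [part.strip() for part in stem.split(" - ")]
    if len(parts) == 2:
        author, title = parts
        series = None
    elif len(parts) == 3:
        author, series, title = parts
    else:
        raise ValueError("unexpected filename format: %r" % path)
    return {"author": author, "series": series, "title": title}


def pdftk_info(meta):
    """Build the Author/Title info text in the layout used by the script."""
    return (
        "InfoKey: Author\nInfoValue: {author}\n"
        "InfoKey: Title\nInfoValue: {title}\n"
    ).format(**meta)


print(pdftk_info(split_book_filename("wibble - wobble.pdf")))
```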

However... libprs500 is still displaying the title of the PDF files whose metadata I have corrected as "author - title". It is almost as if it is ignoring both the metadata *and* the filename regex pattern matching altogether and simply using the filename, minus the pdf extension.

#8  kovidgoyal 03-31-2008, 04:28 PM
What does pdf-meta give you on the corrected PDF files?

#9  Megatron-UK 03-31-2008, 04:38 PM
pdf-meta now shows the correct author, but the title is still the filename minus the extension. e.g.

Quote
megatron@elderthing:/cygdrive/y/resources/Books/pdf books $ pdf-meta.exe author\ -\ title.pdf
Title : author - title
Author : Unknown
Publisher: None
Category : None
Comments : None
ISBN : None

megatron@elderthing:/cygdrive/y/resources/Books/pdf books $ pdf-meta.exe author\ -\ title.pdf.new
Title : author - title.pdf
Author : author
Publisher: None
Category : None
Comments : None
ISBN : None
On the corrected PDF file, it looks suspiciously like pdf-meta is silently dropping the extension and treating the basename as the title - the metadata certainly doesn't show title as being "author - title.pdf" when I view it in Acrobat.

#10  kovidgoyal 03-31-2008, 05:18 PM
Attach one of these PDF files here
