While reading Geoff Nunberg's...ah...crisp assessment of the current state of GoogleBooks metadata--with which I thoroughly agree--I discovered Hathi Trust Digital Library, which "was conceived as a collaboration of the thirteen universities of the Committee on Institutional Cooperation and the University of California system to establish a repository for these universities to archive and share their digitized collections." Unfortunately, searching across texts (currently described as "experimental") is limited to out-of-copyright (OOC) works. Right now, quite a lot of OOC works are limited to search-only; a quick test using "tale reformation" as the keywords turned up unviewable editions of very OOC novels like From Dawn to Dark in Italy. I'm spying some familiar problems with the metadata--Tales of the Persecuted, an undated collection published in Philadelphia, turns up with a date of 1800 (which is clearly wrong). And, as in GoogleBooks, there are orphaned volumes (e.g., Catherine Sinclair's triple-decker Cross Purposes). In books identified as search-only, the results indicate that there was a hit...but not what the hit was. Which, I have to say, is even less useful than GoogleBooks' much-loathed "snippet view." That being said, I prefer the screen layout. Books can be viewed as PDFs, text (somewhat cleaner than GoogleBooks' or the Internet Archive's plain text), or images.
It would be nice if the catalog search function would let users search by library.
It's too early to say much about the content, although there should be an incredible resource here once everything is uploaded and made available.
The problem, as always, is in the original metadata. Though people like to blame Google, a lot of it has to do with retrospective conversion (retrocon) projects undertaken as libraries moved from card catalogs to OPACs, especially in the case of undated works.
Posted by: Chris | September 01, 2009 at 09:45 PM
Other issues are associated with series of monographs, reprints, etc., which were problematic in card catalogs.
Posted by: Mr Punch | September 02, 2009 at 06:00 PM
Copyright determination is a challenging issue generally, and not less in HathiTrust. John Wilkin addresses our strategy for opening access to orphan works that are actually in the public domain in a comment to Geoff's blog post (http://languagelog.ldc.upenn.edu/nll/?p=1701#comment-41811). The task of opening access to works that we know are in the public domain can be equally arduous at times (requiring manual review) because of the number of volumes being ingested into the repository (approximately 400,000 and 300,000 in the last two months respectively) and the quality of original metadata.
For instance, the volume listed above, From Dawn to Dark in Italy, was originally cataloged with no date [n.d.] in the published date field. When volumes are ingested into HathiTrust, an automated process uses existing metadata to make preliminary copyright determinations on those volumes. To protect against copyright infringement, this process is necessarily conservative, and if the copyright status of a volume cannot be definitively ascertained (as in the example above), it is given a "search only" status in HathiTrust. It will remain unavailable to most users except for searching purposes until it turns up in our manual review process, or users let us know (which we encourage! - you can let us know by clicking the feedback link on any item). We are reviewing this volume now.
It should be noted that while search functionality in "search only" books is limited, it does provide a general level of access to these (mostly in copyright or orphaned) works. As John mentioned in his post as well, the fact that these volumes are being preserved in HathiTrust allows us to make them available fully to users with print disabilities, and for purposes of computational research as well.
Regarding the incorrect dating of Tales of the Persecuted, the publication date is listed as [18--] in the catalog record. That it is being represented as 1800 in the search results is an issue of date normalization for search and faceting purposes that we are looking into (thanks!). Again, we welcome feedback through links on the site for any information that appears confusing or incorrect.
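To illustrate the kind of normalization issue described above, here is a minimal Python sketch. It is not HathiTrust's actual code: the function name `normalize_catalog_date` and the choice to return a range of years are assumptions made for illustration. The idea is that a partial date such as [18--] could be kept as a span (1800 to 1899) instead of being collapsed to its first year.

```python
import re
from typing import Optional, Tuple


def normalize_catalog_date(raw: str) -> Optional[Tuple[int, int]]:
    """Return (earliest, latest) year for a catalog date string.

    Hypothetical helper: handles bracketed dates with dashes standing in
    for unknown digits, e.g. "[18--]" or "[185-?]". Returns None when no
    four-character year pattern can be recovered (e.g. "[n.d.]").
    """
    # Drop surrounding whitespace, cataloger's brackets, and a trailing "?"
    text = raw.strip().strip("[]").strip().rstrip("?").strip()

    # Known leading digits followed by dashes for the unknown digits
    match = re.fullmatch(r"(\d{1,4})(-{0,3})", text)
    if not match:
        return None

    known, unknown = match.groups()
    if len(known) + len(unknown) != 4:
        return None

    # Fill unknown digits with 0s for the lower bound and 9s for the upper
    earliest = int(known + "0" * len(unknown))
    latest = int(known + "9" * len(unknown))
    return (earliest, latest)
```

With this approach, a search or facet could treat "[18--]" as "sometime in the nineteenth century" rather than displaying 1800, while a fully undated record stays explicitly undated.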
As for search, we are targeting early October for the release of full-text search functionality over all volumes (public domain and in copyright) in HathiTrust. We expect to be approaching 5 million volumes at that time. When you do a search in our bibliographic catalog (http://catalog.hathitrust.org), at the bottom of the faceting list there is an option to restrict results to volumes from a contributing institution.
Thank you for the compliments on our interface, and stay tuned: we are working on a new release that will make it easier to read and browse books.
Posted by: -- Jeremy York, Project Librarian, HathiTrust | September 03, 2009 at 01:28 PM