Last modified April 11

Step 1: Get the data

The data we have and the data we want are all listed on the how you can help page. We already have a fair amount of it awaiting processing.

Here's where it is on our servers (note: this doesn't include the other data listed on the how you can help page):

  • Powells crawl: wiki-beta:/1/pharos/crawl/powells (aaronsw, phr)
  • OCA full text: fill in (phr)
  • Indian Million Books Project: in progress (anand)
  • random Princeton University Press files: wiki-beta:/1/princeton (aaronsw; wget mirror)
  • random Bibliothèque nationale de France files: wiki-beta:/1/ttk/bnf (bill)
  • book covers: wiki-beta:/1/pharos/covers and wiki-beta:/2/pharos/bookcovers (aaronsw)
  • bookmooch data: wiki-beta:/1/pharos/covers/bookmooch (aaronsw)
  • librarything data: wiki-beta:/1/pharos/covers/librarything (aaronsw)
  • librarything data: wiki-beta:/1/pharos/crawl/librarything (aaronsw; wget mirror)
  • fred20 data: wiki-beta:/1/pharos/crawl/fred20 (aaronsw)
  • Access Copyright data: wiki-beta:/1/pharos/onix/accesscopyright (aaronsw, rejon)
  • ONIX data: wiki-beta:/1/pharos/onix/originals (aaronsw, alexis)
  • recently-uploaded ONIX data: wiki-beta:/tmp/incoming (aaronsw, alexis)
  • PSU data: wiki-beta:/1/pharos/onix/psu (aaronsw)
  • Stanford copyright data: wiki-beta:/1/pharos/onix/copyright (aaronsw)
  • bulk.resource.org data: wiki-beta:/2/pharos/crawl/bulk.resource.org (aaronsw; wget mirror)
  • random Powells data: wiki-beta:/2/pharos/crawl/ftp.powells.com (aaronsw; wget mirror)
  • partial isbndb data: wiki-beta:/2/pharos/crawl/isbndb.com (aaronsw; wget mirror)
  • onlinebooks project data: wiki-beta:/2/pharos/crawl/onlinebooks.library.upenn.edu (aaronsw; wget mirror)
  • random book text: wiki-beta:/2/pharos/crawl/www.??????.us (aaronsw; wget mirror)
  • ISBNs: wiki-beta:/2/pharos/isbns and more recently apollonius:/0/pharos/crawl/more_isbn_project (aaronsw; workspace)
  • worldlibrary.net: wiki-beta:/2/pharos/pdfgrab (aaronsw; wget mirror)
  • wikipedia dumps: wiki-beta:/2/wikipedia (aaronsw; wget mirror)
  • wikipedia dumps: apollonius:/2/feisty/x-home/phr/wiki (phr)
  • dbpedia dumps: zenodotus:/2/pharos/crawl/dbpedia/ (aaronsw; wget mirror)
  • wikia dumps: zenodotus:/2/pharos/crawl/wikia/ (aaronsw; wget mirror)
  • amzn data: apollonius:/3/aaronsw/isbn and wiki-beta:/2/pharos/isbns/crawl (aaronsw)
  • MBL/WHOI catalog: wiki-beta:/1/pharos/marc (aaronsw)
  • enron emails: apollonius:/3/aaronsw/crawl (aaronsw; wget)
  • elsevier covers: wiki-beta:/1/pharos/onix/elsevier (aaronsw; wget)
  • LC additional data: apollonius:/0/pharos/crawl/lc_catdir (aaronsw; wget)
  • undemocracy: zenodotus:/2/pharos/crawl/dbpedia/undemocracy (aaronsw; wget mirror)
  • bowker: apollonius:/3/aaronsw/crawl/bowker (aaronsw)
  • nyt front pages: apollonius:/0/pharos/crawl/nytimes_front_page (aaronsw; wget mirror)
  • drexel catalog: apollonius:/0/pharos/crawl/drexel (aaronsw; wget)
  • laurentian catalog: apollonius:/3/aaronsw/crawl/laurentian (aaronsw; wget)
  • orbiscascade identifier mappings: wiki-beta:/2/pharos/crawl/orbiscascade (aaronsw; wget; hasn't been used on amzn yet)

Here's stuff that's been archived in the petabox:

in progress

  • lc content (incl. books): apollonius:/1/pharos/crawl/lc (aaronsw; wget mirror)
  • stanford books: apollonius:/0/pharos/crawl/stanford (aaronsw; in progress)

Step 2: Process it

Aside from some special cases (e.g. lists of ISBNs, book covers, holdings data), we take each data source, write a processor for it, and output Python dictionaries.

Right now the code we have for this is in our repository in the catalog directory. There is code to process MARC and ONIX files.
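
To make the "processor" idea concrete, here is a minimal sketch of what one could look like for a heavily simplified ONIX-like file. It is not the code in the catalog directory: the function name process_onix, the output keys (source, isbn, title, authors), and the sample record (including its placeholder ISBN) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified ONIX-like input; real ONIX files carry far more fields.
SAMPLE = """<ONIXMessage>
  <Product>
    <ProductIdentifier>
      <ProductIDType>15</ProductIDType>
      <IDValue>9780000000002</IDValue>
    </ProductIdentifier>
    <Title><TitleText>The Adventures of Tom Sawyer</TitleText></Title>
    <Contributor><PersonName>Mark Twain</PersonName></Contributor>
  </Product>
</ONIXMessage>"""


def process_onix(xml_text, source):
    """Yield one plain Python dictionary per <Product> element.

    The dictionary keys are an assumed schema, not the project's actual one.
    """
    root = ET.fromstring(xml_text)
    for product in root.iter("Product"):
        yield {
            "source": source,  # which data source the record came from
            "isbn": product.findtext("ProductIdentifier/IDValue"),
            "title": product.findtext("Title/TitleText"),
            "authors": [c.findtext("PersonName") for c in product.iter("Contributor")],
        }


records = list(process_onix(SAMPLE, "onix-sample"))
```

Keeping the output as plain dictionaries means the later steps do not need to know which format a record started in.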

Step 3: Merge it

We don't do any merging right now, but our own Karen Coyle has developed some merging algorithms for us to implement. The idea is to go through all the dictionaries, find duplicate records, and combine the data from each set of duplicates into one item.

This is not implemented yet.
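
As a rough illustration of the find-duplicates-and-combine idea, here is a naive sketch that treats records sharing an ISBN as duplicates and keeps the first non-empty value seen for each field. This is not Karen Coyle's algorithm (which is not described on this page); the function name merge_records, the choice of ISBN as the match key, and the first-value-wins rule are all assumptions.

```python
def merge_records(records, key="isbn"):
    """Combine dictionaries that share the same value for `key`.

    Assumed policy: the first non-empty value seen for a field wins.
    Records without a usable key are passed through unmerged.
    """
    by_key = {}
    unkeyed = []
    for rec in records:
        k = rec.get(key)
        if not k:
            unkeyed.append(dict(rec))
            continue
        target = by_key.setdefault(k, {})
        for field, value in rec.items():
            # Skip empty values so a later record can fill the gap.
            if value not in (None, "", []):
                target.setdefault(field, value)
    return list(by_key.values()) + unkeyed


# Two hypothetical records for the same book from different sources.
a = {"isbn": "X1", "title": "Tom Sawyer", "publisher": None}
b = {"isbn": "X1", "title": "The Adventures of Tom Sawyer", "publisher": "Example Press"}
merged = merge_records([a, b])
```

A real merge would need fuzzier matching (records without ISBNs, conflicting values), which is exactly the hard part left to the algorithms mentioned above.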

Step 4: FRBRize it

Then we need to go through and detect relationships between works (example: all of these editions of Tom Sawyer are editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works).

This is not implemented yet.
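
One simple way to picture this step is to group edition records under a normalized title and author and create a new "work" object per group. The sketch below does only that; the names work_key and frbrize, the record fields, and the normalization rule are assumptions, and real FRBRization would need much more than exact matching on cleaned-up strings.

```python
import re
from collections import defaultdict


def _norm(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", (text or "").lower()).strip()


def work_key(edition):
    """Assumed grouping key: normalized (title, author)."""
    return (_norm(edition.get("title")), _norm(edition.get("author")))


def frbrize(editions):
    """Return one hypothetical 'work' object per group of matching editions.

    Each work lists the indexes of its editions in the input list.
    """
    groups = defaultdict(list)
    for index, edition in enumerate(editions):
        groups[work_key(edition)].append(index)
    return [
        {"type": "work", "title": title, "author": author, "editions": indexes}
        for (title, author), indexes in groups.items()
    ]


editions = [
    {"title": "Tom Sawyer", "author": "Mark Twain"},
    {"title": "TOM SAWYER.", "author": "Mark Twain"},
    {"title": "Huckleberry Finn", "author": "Mark Twain"},
]
works = frbrize(editions)
```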

Step 5: Import it

Once we have all this, we need to start giving it identifiers and importing it into ThingDB.

Right now there is some simple code for this in the catalog directory with some silly identifier schemes.
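
For illustration only, here is a sketch of assigning identifiers and saving records, using an in-memory stand-in rather than ThingDB itself. The FakeStore class, its methods, and the kind/number identifier scheme are all made up for this example and say nothing about the real ThingDB API or the schemes used in the catalog directory.

```python
import itertools


class FakeStore:
    """In-memory stand-in for a database; NOT the ThingDB API."""

    def __init__(self):
        self.things = {}
        self._counter = itertools.count(1)

    def new_id(self, kind):
        # Assumed scheme: "<kind>/<sequential number>".
        return "%s/%d" % (kind, next(self._counter))

    def save(self, kind, data):
        """Assign an identifier, store a copy of the record, and return the identifier."""
        ident = self.new_id(kind)
        self.things[ident] = dict(data, id=ident)
        return ident


store = FakeStore()
ids = [
    store.save("edition", {"title": "Tom Sawyer"}),
    store.save("edition", {"title": "Huckleberry Finn"}),
]
```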