The data we have and the data we want are both listed on the how you can help page. We have a fair amount of it on hand right now, awaiting processing.
Here's where it is on our servers (note: this doesn't include the other data listed on the how you can help page):
wiki-beta:/1/pharos/crawl/powells (aaronsw, phr)
wiki-beta:/1/princeton (aaronsw; wget mirror)
wiki-beta:/1/ttk/bnf (bill)
wiki-beta:/1/pharos/covers and wiki-beta:/2/pharos/bookcovers (aaronsw)
wiki-beta:/1/pharos/covers/bookmooch (aaronsw)
wiki-beta:/1/pharos/covers/librarything (aaronsw)
wiki-beta:/1/pharos/crawl/librarything (aaronsw; wget mirror)
wiki-beta:/1/pharos/crawl/fred20 (aaronsw)
wiki-beta:/1/pharos/onix/accesscopyright (aaronsw, rejon)
wiki-beta:/1/pharos/onix/originals (aaronsw, alexis)
wiki-beta:/tmp/incoming (aaronsw, alexis)
wiki-beta:/1/pharos/onix/psu (aaronsw)
wiki-beta:/1/pharos/onix/copyright (aaronsw)
wiki-beta:/2/pharos/crawl/bulk.resource.org (aaronsw; wget mirror)
wiki-beta:/2/pharos/crawl/ftp.powells.com (aaronsw; wget mirror)
wiki-beta:/2/pharos/crawl/isbndb.com (aaronsw; wget mirror)
wiki-beta:/2/pharos/crawl/onlinebooks.library.upenn.edu (aaronsw; wget mirror)
wiki-beta:/2/pharos/crawl/www.??????.us (aaronsw; wget mirror)
wiki-beta:/2/pharos/isbns and more recently apollonius:/0/pharos/crawl/more_isbn_project (aaronsw; workspace)
wiki-beta:/2/pharos/pdfgrab (aaronsw; wget mirror)
wiki-beta:/2/wikipedia (aaronsw; wget mirror)
apollonius:/2/feisty/x-home/phr/wiki (phr)
zenodotus:/2/pharos/crawl/dbpedia/ (aaronsw; wget mirror)
zenodotus:/2/pharos/crawl/wikia/ (aaronsw; wget mirror)
apollonius:/3/aaronsw/isbn and wiki-beta:/2/pharos/isbns/crawl (aaronsw)
wiki-beta:/1/pharos/marc (aaronsw)
apollonius:/3/aaronsw/crawl (aaronsw; wget)
wiki-beta:/1/pharos/onix/elsevier (aaronsw; wget)
apollonius:/0/pharos/crawl/lc_catdir (aaronsw; wget)
zenodotus:/2/pharos/crawl/dbpedia/undemocracy (aaronsw; wget mirror)
apollonius:/3/aaronsw/crawl/bowker (aaronsw)
apollonius:/0/pharos/crawl/nytimes_front_page (aaronsw; wget mirror)
apollonius:/0/pharos/crawl/drexel (aaronsw; wget)
apollonius:/3/aaronsw/crawl/laurentian (aaronsw; wget)
wiki-beta:/2/pharos/crawl/orbiscascade (aaronsw; wget; hasn't been used on amzn yet)
Here's stuff that's been archived in the petabox:
apollonius:/0/pharos/crawl/unc (aaronsw)
apollonius:/0/pharos/crawl/mit (aaronsw)
apollonius:/0/pharos/crawl/umich_pdscan_records (aaronsw; wget)
apollonius:/0/pharos/crawl/amazon_similarity_graph (aaronsw; awscrawl.py)
apollonius:/3/aaronsw/crawl/oregon/marc_oregon_summit_records (aaronsw; flash drive)
wiki-beta:/2/pharos/crawl/uoft (aaronsw; wget)
apollonius:/3/aaronsw/crawl/muohio (aaronsw; wget)
apollonius:/3/aaronsw/crawl/wwu/marc_western_washington_univ (aaronsw; cdrom)
wiki-beta:/2/pharos/crawl/bostoncollege/ (aaronsw; ftp)
apollonius:/1/pharos/crawl/lc (aaronsw; wget mirror)
apollonius:/0/pharos/crawl/stanford (aaronsw; in progress)
Aside from some special cases (e.g. lists of ISBNs, book covers, holdings data), we take each data source, write a processor for it, and output Python dictionaries.
Right now the code we have for this is in our repository in the catalog directory. There is code to process MARC and ONIX files.
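As a rough illustration of the "one processor per source, dictionaries out" idea, here is a minimal sketch. It is not the code in the catalog directory: the pipe-delimited input format, the function name parse_records, and the field names are all hypothetical stand-ins for a real MARC or ONIX parser.

```python
from typing import Dict, Iterable, Iterator


def parse_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per 'isbn|title|author; author' line.

    The input format is hypothetical; a real processor would read
    MARC or ONIX instead, but would emit dictionaries the same way.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        isbn, title, authors = line.split("|", 2)
        yield {
            "isbn": isbn.strip(),
            "title": title.strip(),
            "authors": [a.strip() for a in authors.split(";") if a.strip()],
        }
```

Because every processor emits the same kind of dictionary, the later steps do not need to know which source a record came from.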
We don't do any merging right now, but our own Karen Coyle has developed some merging algorithms for us to implement. The idea is to go through all the dictionaries, find duplicates, and combine the data from the duplicate records into one item.
This is not implemented yet.
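To make the merging step concrete, here is a minimal sketch of one possible approach. It is not Karen Coyle's algorithm: it assumes, purely for illustration, that two records are duplicates when they share the same value for a key field (here "isbn"), and it keeps the first non-empty value seen for each field. The function name merge_records is hypothetical.

```python
from collections import defaultdict


def merge_records(records, key="isbn"):
    """Combine records sharing the same `key` value into one dictionary.

    Assumption: an exact match on `key` means "duplicate". Records
    without the key cannot be matched and are passed through unchanged.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record.get(key)].append(record)

    merged = []
    for value, group in groups.items():
        if value is None:
            merged.extend(dict(r) for r in group)
            continue
        combined = {}
        for record in group:
            for field, data in record.items():
                # Keep the first non-empty value seen for each field.
                if combined.get(field) in (None, "", []):
                    combined[field] = data
        merged.append(combined)
    return merged
```

A real implementation would need fuzzier matching (records without ISBNs, variant titles) and a policy for conflicting values, which this sketch ignores.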
Then we need to go through and detect relationships between works (example: all of these editions of Tom Sawyer are editions of the same conceptual work). From this we can add relationships to each object and create new objects (like works).
This is not implemented yet.
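One way to picture the work-detection step is the sketch below. It assumes, as a simplification, that editions belong to the same work when their normalized title and author match exactly; the field names ("title", "author", "work", "editions") and both function names are hypothetical.

```python
import re
from collections import defaultdict


def work_key(edition):
    """Normalize title and author so trivially different editions match."""
    def norm(text):
        return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    return (norm(edition.get("title", "")), norm(edition.get("author", "")))


def group_into_works(editions):
    """Create one work object per key and link each edition to its work.

    Editions are referenced by their index in the input list; each
    edition dictionary gets a "work" entry pointing at its work's index.
    """
    by_key = defaultdict(list)
    for index, edition in enumerate(editions):
        by_key[work_key(edition)].append(index)

    works = []
    for (title, author), indexes in by_key.items():
        work_index = len(works)
        works.append({"title": title, "author": author, "editions": indexes})
        for i in indexes:
            editions[i]["work"] = work_index
    return works
```

Exact matching on normalized strings will miss translations, subtitles, and variant author names, so this only shows the shape of the output: new work objects plus a relationship added to each edition.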
Once we have all this, we need to start giving it identifiers and importing it into ThingDB.
Right now there is some simple code for this in the catalog directory with some silly identifier schemes.
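For illustration only, a sequential identifier scheme could look like the sketch below. This is not the scheme used in the catalog directory, and it does not call ThingDB: the "b/" prefix and the function name assign_identifiers are hypothetical, and a plain dictionary stands in for the store.

```python
import itertools


def assign_identifiers(records, prefix="b/"):
    """Give each record a sequential identifier and key the store by it.

    The returned dictionary is a stand-in for ThingDB; a real import
    would write each record to the database instead.
    """
    counter = itertools.count(1)
    store = {}
    for record in records:
        identifier = f"{prefix}{next(counter)}"
        # Copy the record so the caller's dictionaries are left untouched.
        store[identifier] = dict(record, id=identifier)
    return store
```

Sequential numbers are simple but only stable if records are always imported in the same order, which is one reason a scheme like this counts as "silly".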