Adding Metadata and Uploading to IA


Digitizing Material, Adding Metadata, and Uploading to Internet Archive for the Media History Digital Library, Lantern, and Arclight

Derek Long (drlong@wisc.edu)

February 2016 Edition


This document is intended to help contributors digitize media history material and add it to the Internet Archive (“IA”; archive.org) for open access. It also covers the essential metadata fields you must add to each item on IA so that it can be indexed in the Media History Digital Library (mediahistoryproject.org) and made searchable on Lantern (lantern.mediahist.org) and Arclight (search.projectarclight.org).


Some basic things to understand

IA is the MHDL’s scanning repository. It hosts the actual digitized collections indexed by MHDL and Lantern, and it also generates certain derivative files automatically—most importantly the Optical Character Recognition (OCR) files that append machine-readable text to page images. The following includes instructions specific to the MHDL, but for basic FAQs about uploading material to IA, see:

http://archive.org/about/faqs.php#Uploading_Content

and

http://archive.org/about/faqs.php#Books_and_Texts


Preparing files for upload

Before you even upload your item to IA, it’s important to assess the quality of what you have. You may even need to do some post-processing to make your item ready.

Most users will already have a set of scans that they are ready to upload, while some users will not yet have scanned their item(s). In either case, there are a few things you will want to make sure of:

1) First, ensure that you have an individual file for each (recto or verso) page of your item. If your scans are 2-up (two facing pages per file), or if the whole item is a single file, you’ll need to do some post-processing to separate it into individual page files. Most image batch software (e.g. Adobe Acrobat, ScanTailor) has tools for splitting 2-up scans into individual pages; we frequently use ScanTailor (http://scantailor.org/), and the splits are relatively simple to do. If you have not yet scanned your item, try to scan it so that one file corresponds to one page; otherwise, check your software’s documentation for the best way to create individual page files. You will also want to make sure that each image file is cropped fairly consistently, leaving out gutters and scanned areas outside the page, since this improves both OCR accuracy and the online presentation of your item.

2) Second, you want to be sure that your individual page files are of high enough quality for IA to create its OCR and other derivative files. The process by which IA takes raw files and generates an OCR-ed digital book interface (which IA calls the BookReader) is relatively robust: it will take almost any file format you give it. However, the higher the quality of the original files, the better the OCR will be. Ideally, the books or magazines you wish to contribute should be scanned at full size and in high resolution (600 dpi), in color, from physical originals; this will produce the highest-quality OCR. From experience, we know that scans from microfilm can be hit-or-miss (although any format is better than not having access to the material at all). That said, the IA’s mission is one of access, not preservation, so whatever original format you have is generally going to be fine. It is better to upload what you have, in the quality it came to you in, than to needlessly convert, compress, or process it.

3) We have found that scanning items as .tiff files provides an ideal mix of image size and quality. But again, if all you have is digital files, it’s better to upload in the highest quality of whatever original format you have (be it .pdf, .jpeg, or .tiff) than to try to change formats. The only obvious exception is proprietary raw camera formats (e.g. Canon’s .CR2 and the like): these files are usually quite large, and converting them to .jpeg or .tiff is advised, especially for items that have more than a handful of pages.

4) For scans from microfilm or non-color scans, we have found that, all other things being equal, bitonal (black-and-white) scans tend to produce slightly better OCR. They also create files that are smaller and more manageable, which may be a consideration for items with a large number of pages. However, the quality of a bitonal scan depends directly on the original source material; microfilm has the potential to introduce a lot of speckling to a bitonal scan, and there may be cases in which grayscale scans are preferable (non-color items consisting primarily of images, for example). If you are scanning from an original microfilm source, it’s a good idea to produce two versions (one grayscale and one bitonal) where feasible. If this is not feasible, generate the grayscale version, since it can be converted to bitonal later if necessary.


Naming Conventions to use in preparing items for IA

Most users will want to display their items in the BookReader. To prepare your item for the BookReader, you need to produce a folder (with a particular naming format described below), full of sequentially-named images of each individual page of your item.

We have found that it’s best to start by choosing a unique identifier for your item, according to a format that will allow you to easily sort related items chronologically. For periodicals, we use this format:

[magazineName][VolumeNumber]-[YEAR]-[MONTH]

…so for the March 1927 volume of Variety, this would be:

variety86-1927-03

For books, something with a title, author (or sponsoring entity), and year of publication is usually sufficient:

[Author]_[TitleOfBook]_[YearOfPub]
Ex: Powdermaker_HollywoodTheDreamFactory_1950
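
The two identifier patterns above can be generated with a short script, which helps keep IDs consistent across many items. This is only a sketch: the function names are hypothetical, and the lowercasing of magazine names and the capitalization of each word in book titles are assumptions drawn from the two examples above, not rules stated by IA or the MHDL.

```python
import re


def periodical_id(magazine_name, volume, year, month):
    """Build an ID in the [magazineName][VolumeNumber]-[YEAR]-[MONTH] pattern."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be between 1 and 12, got {month}")
    # Assumption: keep only letters and digits, lowercased, as in "variety86-1927-03".
    name = re.sub(r"[^A-Za-z0-9]", "", magazine_name).lower()
    # Zero-pad the month so that IDs sort chronologically as plain text.
    return f"{name}{volume}-{year:04d}-{month:02d}"


def book_id(author, title, year):
    """Build an ID in the [Author]_[TitleOfBook]_[YearOfPub] pattern."""
    def squash(text):
        # Capitalize the first letter of each word; drop spaces and punctuation.
        words = re.findall(r"[A-Za-z0-9]+", text)
        return "".join(word[0].upper() + word[1:] for word in words)

    return f"{squash(author)}_{squash(title)}_{year}"


print(periodical_id("Variety", 86, 1927, 3))
print(book_id("Powdermaker", "Hollywood the Dream Factory", 1950))
```

Run as-is, this prints the two example IDs given above.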


Once you’ve chosen an ID, create a folder (on your desktop or another convenient working directory) with the name “YourUniqueID_images” (ex.: variety86-1927-03_images). This “_images” suffix is very important: it signals to IA that the files inside should be displayed as a digital book using the BookReader, rather than simply as a list of downloadable files.

Once you’ve created your _images folder, put all of your page files inside it, named so that they fall in sequential order (as presented in the book/volume/magazine). Whatever image batch software you use to process the files (e.g. Adobe Acrobat, ScanTailor) will likely name them sequentially for you, or at least have an option to do so; consult your software’s documentation. If not, you can use Automator (in OS X) or some other simple scripting tool to rename lots of files at once.

The best way to ensure that your individual page files remain in the correct order is to create them using a consistent file naming scheme. Each individual page of your item should look something like this:

YourUniqueID_000001.tiff
YourUniqueID_000002.tiff
YourUniqueID_000003.tiff
YourUniqueID_000004.tiff
YourUniqueID_000005.tiff
YourUniqueID_000006.tiff
YourUniqueID_000007.tiff
YourUniqueID_000008.tiff
YourUniqueID_000009.tiff
YourUniqueID_000010.tiff
YourUniqueID_000011.tiff
{etc.}
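
If your software does not produce names like these, a short script can do the renaming. The sketch below is one possible approach, not an MHDL or IA tool: the function names are hypothetical, the six-digit width simply mirrors the listing above, and it assumes you pass the files in their correct reading order (the script cannot work that order out for you).

```python
from pathlib import Path


def sequential_names(unique_id, count, extension=".tiff", width=6):
    """Return zero-padded page filenames such as YourUniqueID_000001.tiff."""
    return [f"{unique_id}_{n:0{width}d}{extension}" for n in range(1, count + 1)]


def rename_pages(ordered_files, unique_id, width=6):
    """Rename files, given in reading order, to the zero-padded scheme.

    Each file keeps its own extension. The renaming happens in two passes so
    that a new name can never overwrite a file that has not been renamed yet.
    """
    temporary = []
    for index, path in enumerate(Path(p) for p in ordered_files):
        # Assumption: no existing file in the folder is already named __tmp_NNNNNN.
        temp = path.with_name(f"__tmp_{index:06d}{path.suffix}")
        path.rename(temp)
        temporary.append(temp)

    renamed = []
    for number, temp in enumerate(temporary, start=1):
        target = temp.with_name(f"{unique_id}_{number:0{width}d}{temp.suffix}")
        temp.rename(target)
        renamed.append(target)
    return renamed


print(sequential_names("variety86-1927-03", 3))
```

To rename real files, call rename_pages() with your own ordered list of paths and your item’s ID. Work on a copy of the folder first, since the renaming is done in place.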

Again, your files needn’t take this exact form, and your image processor will likely name each file automatically. The only thing that’s important is that your naming scheme keep each page in its correct order. HOWEVER…

(VERY IMPORTANT) - Note here that the page numbers in each filename (000009, 000010, etc.) take up a consistent number of digits, padded with leading zeros. This is crucial to the pages appearing in the correct order in the BookReader – otherwise IA will automatically sort “YourUniqueID_10.tiff” before “YourUniqueID_9.tiff”, even though your computer’s operating system might not. DO NOT FORGET TO CHECK THIS – YOU MUST ENSURE THAT THE PAGE NUMBERS TAKE UP A CONSISTENT NUMBER OF DIGITS. Most consumer operating systems are programmed to sort numbers in a way that is intuitive to humans, but IA’s system is stricter. We have had items appear out of order in the past as a result of inconsistent page number padding.
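
The ordering problem described above is easy to reproduce and to check for before uploading. The sketch below uses Python’s plain character-by-character string sort as a stand-in for the stricter sorting described here (how IA sorts internally is an assumption), and the helper names are hypothetical.

```python
import re

# The trailing page number: the run of digits just before the file extension.
PAGE_NUMBER = re.compile(r"(\d+)\.[^.]+$")


def page_number_widths(filenames):
    """Return the set of digit counts used for page numbers in the filenames."""
    widths = set()
    for name in filenames:
        match = PAGE_NUMBER.search(name)
        if match:  # files with no trailing number (e.g. zzz_credits.tiff) are skipped
            widths.add(len(match.group(1)))
    return widths


def has_consistent_padding(filenames):
    """True if every numbered file uses the same number of digits."""
    return len(page_number_widths(filenames)) <= 1


unpadded = ["YourUniqueID_9.tiff", "YourUniqueID_10.tiff"]
padded = ["YourUniqueID_000009.tiff", "YourUniqueID_000010.tiff"]

# A plain text sort puts "_10" ahead of "_9", because "1" sorts before "9" ...
print(sorted(unpadded))
# ... while zero-padded names keep their page order under the same sort.
print(sorted(padded))
print(has_consistent_padding(unpadded), has_consistent_padding(padded))
```

Running has_consistent_padding() over the filenames in your _images folder before compressing it is a quick way to catch this problem.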

For more detailed information about naming conventions for IA upload, see http://blog.archive.org/2012/05/24/uploading-images-for-text-items/


Finally, we typically include a credits page as the last file in each item’s _images folder, named with “zzz_” at the beginning so that it appears as the very last page in the BookReader:

zzz_credits.tiff

Note: this credits page must be in the same file format as the rest of the pages in order to appear in the BookReader (i.e. if your individual page files are JPEGs, your credits page should also be a JPEG). Otherwise, the file will be included on IA but not displayed in the BookReader along with the other pages.


Final Pre-IA-Upload Steps: Compression and Metadata

Now that you have your YourUniqueID_images folder full of correctly-named, high-quality individual page files, you are almost ready to start uploading it to IA. However, there are still a few steps left to take.

First, you must compress the images folder for upload. IA is able to accept quite large uploads, but in general it’s best to keep the size under 10 GB (and preferably lower). To compress your images folder on a Mac, right click on it and select “Compress ‘YourUniqueID_images.’” This will turn it into a .zip archive, YourUniqueID_images.zip. This .zip archive is what you actually upload to IA, and they will automatically un-compress it on their end.
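
If you are not on a Mac, or have many folders to compress, the same step can be scripted. The following is a minimal sketch using Python’s standard library; the function names are hypothetical, and it assumes that a .zip containing the _images folder itself (as the right-click method produces) is what you want to upload.

```python
import shutil
from pathlib import Path

# The "under 10 GB" guideline from the text, taking 1 GB as 1024**3 bytes.
SIZE_GUIDELINE_BYTES = 10 * 1024**3


def compress_images_folder(images_folder):
    """Create YourUniqueID_images.zip next to the YourUniqueID_images folder."""
    folder = Path(images_folder).resolve()
    if not folder.is_dir():
        raise NotADirectoryError(str(folder))
    if not folder.name.endswith("_images"):
        raise ValueError(f"folder name should end in '_images': {folder.name}")
    archive = shutil.make_archive(
        base_name=str(folder),        # archive path without the .zip extension
        format="zip",
        root_dir=str(folder.parent),  # paths inside the archive are relative to this
        base_dir=folder.name,         # so the archive contains the folder itself
    )
    return Path(archive)


def over_size_guideline(archive):
    """True if the archive is larger than the 10 GB guideline."""
    return Path(archive).stat().st_size > SIZE_GUIDELINE_BYTES
```

For example, compress_images_folder("variety86-1927-03_images") would write variety86-1927-03_images.zip alongside the folder and return its path, which can then be passed to over_size_guideline().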

Second, you must prepare metadata for your item. This allows IA to associate your item with the MHDL and to perform other necessary functions to display it in the BookReader, and it also enables your item to be searchable in Lantern and Arclight.

It’s a good idea to prepare metadata ahead of time, so that you can simply copy and paste it when it comes time to enter it for IA upload. We usually do this in a plain text editor (a program like Notepad or TextWrangler, http://www.barebones.com/products/textwrangler/), not in Microsoft Word, which introduces extraneous formatting.

There are two methods for preparing metadata for upload; the one described here is the simpler method and will suffice for the vast majority of users. For users wishing to contribute lots of material, however, there is a more scale-efficient method using IA’s API – the description of that method can be found at the end of this document.


Entering metadata

You are now ready to upload, and will enter metadata as part of that process. You will need an Internet Archive account to upload items. When you are logged in and ready to upload, click the “upload” button at the top right of the page. You will be asked to select your item—drag-and-drop or choose YourUniqueID_images.zip, and IA will proceed to the metadata entry form:

[Screenshot: IA’s metadata entry form]

IA should automatically populate some fields, like Page Title and Page URL. These are populated based on your item’s unique ID — ensure that the URL field in particular takes that ID correctly (such that your item’s URL is https://archive.org/details/YourUniqueID - you will probably need to remove the “_images” part of the Page Title and Page URL). While unlikely, there’s a chance that the unique ID you chose for your item has already been taken—in that case, IA will suggest an alternative.

***It is very important that you confirm that your Unique ID is correct here, because once you submit the item, that ID will be reserved as part of the URL to access it and cannot be changed except by an IA administrator.***

Page title should generally be more descriptive but still concise; something like “Variety (April 1917)”.

You will also notice a “Test Item” field – if you want to test a particular version of your scans, selecting “Yes” here will make the item temporary, and IA will automatically remove it after 30 days. This way, you can run through the process without creating a permanent page for your item. If you do this, you will probably want to use an alternate unique ID, perhaps with “test” appended somewhere in the name.

Finally, the current version of IA’s uploader (as of February 2016) includes a “reCache” field. You will need to remove this before submitting, or you will get an error message.

Here are some basic metadata items that the uploader will ask for, with the April 1917 issues of Variety as an example (note that the colons listed below separate the field name from its value and do not go in the input boxes). Not every field is essential, but the more metadata you can give, the more useful and accessible it will be.

description: Four issues of VARIETY from April 1917. [A brief description of the item.]

subject tags: Motion Pictures, Film Industry Trade Magazine, Vaudeville, Theater [Keywords, separated by commas. These should be at a relatively high level of description.]

creator: New York, NY: Variety Publishing Company [The creator of the item – generally, its author or publisher. Alternatively, this can be entered as Publisher under “More Options”]

date: 1917 [The date the item was published. A year is usually sufficient.]

collection: mediahistory [For your item to be indexed in the MHDL and Lantern, this should always be “mediahistory.” You may need to add this under “More Options” as an additional field, if it is not an option for you in the drop-down tab.]

language: English [The full English name of the language of the item’s text. This is an important field, because it tells the OCR software which language to look for, so don’t forget it.]


In addition to the standard metadata fields, several additional fields are necessary for your item to be correctly indexed in the MHDL, Lantern, and Arclight. Add these under the More Options fields (with the field name in the first text box and the field value in the second – hit “Add additional metadata” for each new field). You should add the following fields:

journal-title: Variety [the title of the journal or book]

year: 1917 [year of publication - this must be a single integer]

year-end: 1917 [year-end will normally be the same value as 'year,' but for any volumes that span multiple years, 'year-end' designates the last year covered.]

date-start: 1917-04-03T23:23:59Z

date-end: 1917-04-24T23:23:59Z [date-start and date-end mark the temporal span of your item, and must be expressed in the full timestamp format shown here (date, then “T”, then a time, then a trailing “Z” for UTC) for Lantern’s faceting to work properly – we have found that it is helpful to have this general format ready to cut and paste, such that you only have to change the year, month, and day. Exact precision with the day is not strictly necessary, but it will help to improve the MHDL and Lantern’s functionality in the future, so exact dates are preferable.]

date-string: April 1917 [date-string is the date of your item as it will be displayed to the user. A balance between concision and readability is ideal here.] Here are some of the other formatting conventions we have been using:

Jan-Jun 1925

Nov 1917 - Apr 1918

16 Mar 1931 - 25 May 1932

sponsor: Media History Digital Library [the entity that sponsored the scanning of your item]

contributor: The Museum of Modern Art Library, New York [the collection it came from]

page-count: 354 [For items with larger page counts, the easiest way to find this on a Mac is to right click your original _images folder (not the .zip archive), click “get info,” then consult the “size” entry for your item – the number of items in the folder will be your page-count.]

format: Periodicals [“Books”, “Annuals”, and “Catalogs” are the other standard entries for format – use whatever’s appropriate for your item.]
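
Several of the fields above (year, year-end, date-start, date-end, and page-count) can be derived mechanically rather than typed by hand. The sketch below shows one way to do so. The function names are hypothetical, the time-of-day suffix is copied verbatim from the date-start and date-end examples above, and skipping hidden files (such as .DS_Store on a Mac) when counting pages is an assumption.

```python
from datetime import date
from pathlib import Path

# Time-of-day suffix copied verbatim from the date-start/date-end examples above.
TIME_SUFFIX = "T23:23:59Z"


def span_fields(start, end):
    """Return year, year-end, date-start and date-end for an item's date span."""
    if end < start:
        raise ValueError("end date is earlier than start date")
    return {
        "year": str(start.year),
        "year-end": str(end.year),
        "date-start": start.isoformat() + TIME_SUFFIX,
        "date-end": end.isoformat() + TIME_SUFFIX,
    }


def page_count(images_folder):
    """Count the files in the _images folder, skipping hidden files."""
    return sum(
        1
        for path in Path(images_folder).iterdir()
        if path.is_file() and not path.name.startswith(".")
    )


# The April 1917 Variety example: issues dated 3 April through 24 April.
for name, value in span_fields(date(1917, 4, 3), date(1917, 4, 24)).items():
    print(f"{name}: {value}")
```

The printed lines can be pasted into the text file where you are preparing your metadata; page_count() takes the path to your original _images folder, not the .zip archive.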


Finally, there are some other fields you may wish to include to enrich your item’s metadata:

coordinator: Media History Digital Library [the entity that coordinated putting your item up on IA]

sub-collection: Theatre and Vaudeville [The MHDL category you feel your item might best belong to – feel free to add multiple sub-collection fields]

publisher: New York, NY: Variety Publishing Company [an alternative for creator]

volume: 11

source: microfilm [the original source of your item]

notes: Missing April 10 issue [Any additional information you want to use to describe the item]

Also note that you can add custom metadata fields to your item if you deem it necessary – more metadata is almost never a bad thing! The only restriction is that your metadata field’s name can’t have a capital letter in it, and it’s best not to have numbers in the name either. Having multiple metadata fields with the same name is usually preferable to having a single field with multiple entries (subject tags are an exception).
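
The naming restrictions for custom fields can be checked before you type them into the uploader. This is a small sketch with a hypothetical helper name; the second sub-collection value and the “Reel2” field are invented purely for illustration. The list of pairs also shows the recommended shape for repeated fields: one entry per value, rather than one field holding several values.

```python
def field_name_problems(name):
    """Return a list of problems with a custom metadata field name."""
    problems = []
    if any(ch.isupper() for ch in name):
        problems.append("contains a capital letter (not allowed)")
    if any(ch.isdigit() for ch in name):
        problems.append("contains a number (best avoided)")
    return problems


# Repeated field names are kept as separate (name, value) pairs.
extra_fields = [
    ("sub-collection", "Theatre and Vaudeville"),
    ("sub-collection", "A Second Hypothetical Category"),
    ("Reel2", "a hypothetical badly named field"),
]

for name, value in extra_fields:
    problems = field_name_problems(name)
    status = "; ".join(problems) if problems else "ok"
    print(f"{name}: {status}")
```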


For convenience, here are all the metadata fields you should create entries for:

[IA defaults – required for proper display and IA indexing]
page title
page URL [*remember – this one is crucial to get right the first time]
description
subject tags
creator
date
collection [*always mediahistory]
language [*don’t forget it or you’ll have to rederive]
[add under “More Options” – required for MHDL indexing]
journal-title
year
year-end
date-start
date-end
date-string
sponsor
contributor
page-count
format
[optional but recommended]
coordinator
sub-collection
publisher
volume
source
notes
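
If you prepare your metadata in a file ahead of time, the checklist above can be turned into a simple completeness check. This is a sketch under stated assumptions: the keys are the labels used in this document (IA’s internal field names may differ), the function name is hypothetical, “YourUniqueID” is a placeholder, and the example values are the April 1917 Variety values given earlier.

```python
IA_DEFAULT_FIELDS = [
    "page title", "page URL", "description", "subject tags",
    "creator", "date", "collection", "language",
]
MHDL_FIELDS = [
    "journal-title", "year", "year-end", "date-start", "date-end",
    "date-string", "sponsor", "contributor", "page-count", "format",
]


def check_metadata(metadata):
    """Return a list of problems found in a dict of prepared metadata."""
    problems = []
    for field in IA_DEFAULT_FIELDS + MHDL_FIELDS:
        if not str(metadata.get(field, "")).strip():
            problems.append(f"missing: {field}")
    collection = str(metadata.get("collection", "")).strip()
    if collection and collection != "mediahistory":
        problems.append("collection should be 'mediahistory'")
    for field in ("year", "year-end"):
        value = str(metadata.get(field, "")).strip()
        if value and not value.isdigit():
            problems.append(f"{field} must be a single integer")
    return problems


example = {
    "page title": "Variety (April 1917)",
    "page URL": "YourUniqueID",  # placeholder for the item's unique ID
    "description": "Four issues of VARIETY from April 1917.",
    "subject tags": "Motion Pictures, Film Industry Trade Magazine, Vaudeville, Theater",
    "creator": "New York, NY: Variety Publishing Company",
    "date": "1917",
    "collection": "mediahistory",
    "language": "English",
    "journal-title": "Variety",
    "year": "1917",
    "year-end": "1917",
    "date-start": "1917-04-03T23:23:59Z",
    "date-end": "1917-04-24T23:23:59Z",
    "date-string": "April 1917",
    "sponsor": "Media History Digital Library",
    "contributor": "The Museum of Modern Art Library, New York",
    "page-count": "354",
    "format": "Periodicals",
}

print(check_metadata(example) or "all required fields present")
```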


Once you have entered all of the required and desired metadata fields – and double-checked that your item’s unique ID is correctly entered in the Page URL field - you are ready to hit “upload.”

Depending on the size of your file, the speed of your connection, and the number of uploads queued ahead of yours, your upload may take quite a while – be patient! Once your item has been uploaded, IA will give you a screen confirming that the upload was successful. It generally takes some time for IA to index the item and generate derivative files. While a web page for your item is usually generated relatively quickly, deriving takes significantly more time (you will see a message along the lines of “This item is being updated with a ‘derive.php’ task” as the derivation happens). In any case, your item should be finished in a day or two if all went well.

If your item is not showing up in BookReader form on its page, you may have forgotten a metadata field (such as language) that IA needs for its derivation to proceed. Adding this field and forcing a rederive (under the item’s “history” page, click on the little box next to the “derive.php” task and hit “rerun”) should correct this.

If you need to change or correct anything else, you can generally do so once the item has finished deriving. Minor metadata corrections (that is, for fields not required for the derivative files) happen quite quickly, while changing the actual files of your item or important derivation fields (like language) usually requires a rederive (which IA will sometimes do automatically, but you should check). Again, the only things that you cannot easily change are your unique ID and the URL reserved for your item – be sure you double-check them before hitting “upload”!


If all goes well, your item should be available for viewing on IA and indexing in Lantern within a day or two. If you are uploading many items, it helps the indexing process if you keep a list of the exact unique IDs of the items. When your items are properly displayed and include all necessary metadata, contact Eric Hoyt (ehoyt@wisc.edu) to have your items formally displayed on the MHDL and indexed on Lantern and Arclight.


For additional information, consult IA’s various FAQ pages:

https://archive.org/about/faqs.php

https://archive.org/about/faqs.php#Uploading_Content

https://archive.org/about/faqs.php#Books_and_Texts

http://blog.archive.org/2012/05/24/uploading-images-for-text-items/

https://blog.archive.org/2013/02/08/presetting-metadata-with-the-new-beta-uploader/