The Internet Archive

The Internet Archive is one of the largest digital archives on the internet. It is a nonprofit project headquartered in San Francisco [1]. The project was founded in 1996 by Brewster Kahle to preserve digital data for the long term in a freely available form. The collection includes mementos of websites, music, movies, books, software and video games; a memento is a resource that encapsulates the original state of a source at a defined point in time. The size of the collection reached 18.5 petabytes in August 2014 [2]. While being an online archive, the project is at the same time an activist organization that promotes universal access to a free and open internet [3].

Most of the data is collected automatically by web crawlers, but the public can also upload digital material to the archive's data clusters. The crawlers preserve as much of the public web as possible as mementos and add them to an archive called the Wayback Machine, which includes over 400 billion such captures [4]. Additionally, the archive supervises one of the largest book digitization projects and includes nearly three million public-domain books. The project has about 200 employees, but anyone can volunteer after applying via e-mail. The data centers, apart from the headquarters, are located in three Californian cities. The complete collection is mirrored at the Bibliotheca Alexandrina, a major library and cultural center in Egypt.
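To make the idea of a memento more concrete, here is a minimal sketch of how a capture in the Wayback Machine is commonly addressed: the original URL is combined with a 14-digit timestamp (year, month, day, hour, minute, second). The function name memento_url is hypothetical and only serves as an illustration. As far as I know, the Wayback Machine redirects such an address to the capture closest to the requested time, but that behavior is an assumption here and is not tested by the code.

```python
from datetime import datetime

# Public prefix under which Wayback Machine captures are addressed.
WAYBACK_PREFIX = "https://web.archive.org/web"


def memento_url(original_url, moment):
    """Build the address of a capture of original_url near the given moment.

    memento_url is a hypothetical helper name; it only formats a string
    and does not contact the archive.
    """
    # Timestamps are written as YYYYMMDDhhmmss.
    timestamp = moment.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    print(memento_url("http://example.com/", datetime(2014, 12, 29, 21, 40)))
```

Opening the printed address in a browser should show the page as it was captured around that time, provided a capture exists.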

The initial archive was a for-profit web crawling company called Alexa Internet, founded by Kahle in 1996. His motivation was to preserve the World Wide Web after the unexpected bankruptcy of the host of his hobby macramé website, which resulted in the loss of the complete content. The archive itself became publicly available in 2001, when the Wayback Machine went online. Apart from the web archive, the Internet Archive includes the Prelinger Archives, a collection of films relating to U.S. cultural and social history; the NASA Images Archive; the Open Library, a wiki-editable library catalog and book information site; and the contract crawling service Archive-It.

The most important sub-collections include a text collection, which contains over seven million texts in various languages from 1800 to the present day. 12,000 books can be borrowed as ebooks via the Internet Archive Lending Library, and over 23 million catalog records of books, of which 250,000 are part of an international lending program, can be accessed through the Open Library. The media collection hosts digital media attested to be in the public domain and holds nearly four thousand films and 350,000 news programs from three years of national U.S. networks and stations in San Francisco and Washington, D.C. The number of archived movies is over 1.8 million. The audio collection carries over two million recordings, of which 200,000 are free digital recordings. The recordings include music, audio books, news broadcasts and old-time radio shows, but also poetry, podcasts and concert recordings. Moreover, the archive comprises a software collection covering 50 years of computer history through books, journals, computer magazines, shareware discs, FTP websites and video games. Since 2013 some abandonware video games have been playable in the browser via emulation; abandonware is software that is ignored by its owner and manufacturer and for which no product support is available, and such titles are also known as orphan works. The archive also offers free, uncensored hosting and protested against the SOPA and PIPA bills with a twelve-hour blackout.

This archive can contribute to scholarship in various ways, since it gives access not only to historical resources but also to supplemental information through related content. For research on the last decade, numerous websites can serve as contemporary witnesses, and supplementing older data with other digital media such as audio and video opens the possibility of a deeper analysis than would be possible with printed text alone. As the emulation of old video games shows, the archive can provide more than a bare replica of a resource: it can reproduce the whole functionality of a digital medium, for example a complete website at a specific point in time, including user comments. The name and the archive's homepage clearly reveal its purpose to the user: an internet archive with “universal access to all knowledge.” This motto indicates that it is addressed to everybody doing online research on any subject. The overall design of the page is simple and visually restrained; because the archive contains many different media, a straightforward layout keeps the numerous sub-collections easily accessible. Each collection can be browsed by theme or sub-collection. Alternatively, terms can be searched within the corresponding media collection or via a detailed advanced search. The media type is then indicated by a symbol in the search results.
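Searching within one media collection can also be sketched programmatically. The example below builds a query address for the archive's advanced search page; it assumes that the page accepts the parameters q, fl[], rows and output, and that media types are named like texts, audio or movies. These details are assumptions on my part and should be checked against the archive's own documentation. The function name advanced_search_url is hypothetical.

```python
from urllib.parse import urlencode

# Address of the advanced search page (assumed to accept GET parameters).
SEARCH_ENDPOINT = "https://archive.org/advancedsearch.php"


def advanced_search_url(terms, mediatype, rows=10):
    """Build a search address restricted to one media type.

    advanced_search_url is a hypothetical helper; it only formats a
    string and sends no request.
    """
    # Combine the search terms with a media type restriction.
    query = f"({terms}) AND mediatype:{mediatype}"
    params = [
        ("q", query),
        ("fl[]", "identifier"),  # fields to return for each result
        ("fl[]", "title"),
        ("rows", rows),          # maximum number of results
        ("output", "json"),      # ask for a machine-readable answer
    ]
    return SEARCH_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(advanced_search_url("macrame", "texts"))
```

Restricting the query by media type is one way to work around the coarse filters of the web interface, since the result list then contains only one kind of material.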

The great extent of available content is the archive's biggest strength, but may be its weakness at the same time. Whenever a large amount of research material is available, filtering is essential. Some filters can be applied, but they are still very coarse and based on global features of the results. Nevertheless, the wealth of searchable content and its non-commercial presentation make it an indispensable independent research tool for historical web content and a very useful tool for digitally available media.


[1] Internet Archive Frequently Asked Questions


[3] The Internet Archive

[4] Internet Archive: Projects

29.12.14 21:40

