The Internet Archive

The Internet Archive is one of the biggest archives of digitized material on the internet. It is a nonprofit organization headquartered in San Francisco [1]. The project was founded in 1996 by Brewster Kahle to archive digital data for the long term in a freely available form. The collection includes mementos (resources that capture the original state of a source at a defined point in time) of websites, music, movies, books, software and video games. The size of the collection reached 18.5 petabytes in August 2014 [2]. While being an online archive, the project is at the same time an activist organization that promotes universal access to a free and open internet [3].

Most of the data is collected automatically by web crawlers, but the public can also upload digital material to the archive's data clusters. The crawlers serve to preserve as much of the public web as possible in mementos and add it to an archive called the Wayback Machine, which includes over 400 billion such captures [4]. Additionally, the archive supervises one of the largest book digitization projects and includes nearly three million public-domain books. The project has about 200 employees, but anyone can volunteer by applying via e-mail. The data centers, apart from the headquarters, are located in three Californian cities. The complete collection is mirrored at the Bibliotheca Alexandrina, a major library and cultural center in Egypt.
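The Wayback Machine's captures can also be queried programmatically through its public availability endpoint at archive.org/wayback/available. The sketch below is a minimal, offline example: it builds such a query URL and parses a response in the documented JSON shape (the sample response values are illustrative, not real captures).

```python
import json
from urllib.parse import urlencode

API = "https://archive.org/wayback/available"

def memento_query(url, timestamp=None):
    """Build an availability query for `url`, optionally asking for
    the capture closest to `timestamp` (YYYYMMDDhhmmss, prefixes ok)."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urlencode(params)

def closest_snapshot(response_text):
    """Return the closest snapshot URL from an availability response,
    or None if the page was never captured."""
    data = json.loads(response_text)
    snap = data.get("archived_snapshots", {}).get("closest")
    return snap["url"] if snap and snap.get("available") else None

# Sample response in the documented shape (illustrative values):
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20060101064348/http://example.com/",
            "timestamp": "20060101064348",
            "status": "200",
        }
    }
})

print(memento_query("example.com", "20060101"))
print(closest_snapshot(sample))
```

Fetching the built URL (e.g. with `urllib.request.urlopen`) and feeding the body to `closest_snapshot` yields the memento nearest to the requested date.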

The initial archive was a for-profit web crawling company called Alexa Internet, founded by Kahle in 1996. His motivation was to preserve the World Wide Web after the unexpected bankruptcy of the host of his hobby macramé website, which resulted in the loss of its complete content. The archive itself became publicly available in 2001, when the Wayback Machine went online. Apart from this archive, the project includes the Prelinger Archives, a collection of films relating to U.S. cultural and social history; the NASA Images archive; the Open Library, a wiki-editable library catalog and book information site; and the contract crawling service Archive-It.

The most important sub-collections include a text collection, which contains over seven million texts in various languages dating from 1800 to the present day. 12,000 books can be borrowed as ebooks via the Internet Archive Lending Library, and over 23 million catalog records of books, of which 250,000 are part of an international lending program, can be accessed through the Open Library. The media collection hosts digital media attested to be in the public domain and harbors nearly four thousand films and 350,000 news programs from three years of national U.S. networks and stations in San Francisco and Washington, D.C. The number of archived movies exceeds 1.8 million. The audio collection carries over two million recordings, of which 200,000 are free digital recordings. The types of recordings include music, audiobooks, news broadcasts and old-time radio shows, but also poetry, podcasts and concert recordings. Moreover, the archive comprises a software collection spanning 50 years of computer history through books, journals, computer magazines, shareware discs, FTP sites and video games. Since 2013, some abandonware video games (software ignored by its owner and manufacturer, for which no product support is available, also known as orphan works) have been browser-playable via emulation. The archive also offers free uncensored hosting and protested against the SOPA and PIPA bills with a twelve-hour blackout.

This archive can make various contributions to scholarship, since not only can historical resources be accessed, but supplemental information is also available through related content. For research on the last decade, information from numerous websites can be obtained as contemporary witnesses, and the extension of older data with other digital media such as audio and video opens the possibility for a deeper analysis than a printed text alone would allow. As shown by the emulation of old video games, the archive can provide more than a stark replica of a resource: the whole functionality of a digital medium, for example a complete website at a specific point in time including user comments. The name and the archive's homepage clearly reveal its purpose to the user: an internet archive with "universal access to all knowledge." This motto indicates that it is addressed to everybody doing online research on any subject. The overall design of the page is simple and not visually eccentric; because it contains many different media, a straightforward layout makes the numerous sub-collections effortlessly accessible. Each collection can be browsed by theme or sub-collection. Alternatively, terms can be searched within the corresponding media collections or via a detailed advanced search. The media type is then indicated by a symbol in the search results.

The great extent of available content is the archive's biggest strength, but it may be its weakness at the same time. Whenever a lot of material is available for research, filtering is essential. Some filters are applicable, but they are still very rough and oriented toward global features of the results. Nevertheless, the rich amount of searchable content and its non-commercial presentation make the archive an indispensable autonomous research tool for historic online content and a very useful tool for digitally available media.


[1] Internet Archive Frequently Asked Questions

[3] The Internet Archive

[4] Internet Archive: Projects

29.12.14 21:40


What about hypertext?

Ideas about hypertext are relatively old, especially considering the development of computer science in general. What happened to hypertext and its initial concept? When discussing hypertexts today, they are exceptionally important to our quotidian life, especially considering their most frequent occurrence, markup languages. Do they satisfy the initial concept, and where is the development heading?

The initial idea of hypertext, or links, is based on Vannevar Bush's 1945 vision of the Memex (memory extender) and its trails, in which documents and sources are linked through persistent addresses. His vision was revolutionary because it provided inspiration for new ideas among scientists for many years to come. His motivation was the rapidly expanding knowledge swelling from the manifold research areas. The individual is confronted with many results and conclusions and has trouble memorizing them. According to Bush, specialization of the individual is inevitable for progress, but with it comes the difficulty of interdisciplinary understanding. This hindrance could not be adequately overcome by traditional means of instruction or information delivery. Therefore, a machine to display, link and visualize these links is necessary to serve and search information.

This vision is certainly prodigious, since computers at this time were mainly used for calculation, in particular ballistic calculation and scientific research. They were by no means all-purpose machines, nor accessible or operable by an individual. During the sixties and the early seventies, Theodor H. Nelson took up Bush's vision from As We May Think [1] and expanded it into his new concept of hypertext, where the term is mentioned for the first time. Namely, the paper As We Will Think [2], his Project Xanadu [3] and his book Computer Lib/Dream Machines [4] promoted, advanced and to some extent concretized Bush's description. Nelson does not come from the applied or natural sciences, but is a trained philosopher. His primary intention was to make movies, like his father. This is why he regarded the computer as the medium for the realization of his foresight, hypertext: a medium that can transcend the constraints of paper. In addition, computer screens closely resemble television or projection screens.

His approach is a generalized form of Bush's trails, best expressed as a textual structure that cannot be conventionally printed. It serves to spread information among professionals and laymen in a decentralized network with multiple bidirectional associative links. The human mind is his model, since humans think associatively, with memories connected in various ways. Furthermore, the sequential and bounded nature of texts on paper was a thorn in his side. Consequently, his conceptualization of hypertext abstractly drafts a personal multi-purpose computer that delivers information independently of its content in a parallel and easily humanly accessible manner.

Interestingly, shortly after his brainchild, machines we could call personal computers, together with graphical user interfaces (GUIs), emerged. Some implementations were realized with Nelson, but all were inspired by his approach. The paradigms of windows and electronic mail were inspired by his ideas. The standard keyboard, direct symbol manipulation, digital libraries and information networks are only some of the concepts he worked on with colleagues like Douglas Engelbart, the inventor of the computer mouse. Many other forms of hypertext stem from this approach, such as SGML (Standard Generalized Markup Language), the WWW (World Wide Web), HTML (Hypertext Markup Language), XML (Extensible Markup Language) and wikis. But all of them satisfy his approach only to some degree. He expresses his discontent about the evolution of the implementations of his creation in many publications and on his YouTube channel [5].

Nelson's own implementation, Project Xanadu, just like the Memex, has never been fully realized to the present day. Since he himself is a "media guy", he relies on techies to turn his ideas into machine code. Fortunately, his more than fifty-year-old project has been realized in some prototypes and is still under development (browser demo). His stubbornness about the original implementation and his obsession with his own thoughts may have let him be overtaken by the improvements of modern technology. The modern realization of the internet and other computer technologies surely points in the right direction. For example, the latest web technologies such as HTML5, XML and CSS strongly rely on the separation of content and representation. On the other hand, it is necessary to have a visionary to push development with his extreme, yet abstract and interpretable, objectives.


[1] As We May Think; Vannevar Bush; July 1, 1945

[2] As We Will Think; Theodor H. Nelson; September 4-7, 1972 in: From Memex to hypertext

[3] Project Xanadu

[4] Computer Lib/Dream Machines

[5] TheTedNelson

22.11.14 22:19