XML Download of the Electronic Transcription of Codex Sinaiticus
Introduction to the Download
The text of Codex Sinaiticus on this website is generated from an electronic transcription encoded in XML (eXtensible Markup Language). On this page, it is possible to download the file containing the full XML transcription for further analysis. The data is made available under a Creative Commons licence: it may not be used for commercial purposes, attribution must be made to the original creators (the Codex Sinaiticus Project, www.codexsinaiticus.org), and any derivatives must also be made freely available under the same terms as this original data. (For further information on the licence, see: http://creativecommons.org/licenses/by-nc-sa/3.0/ )
More information on the structure of the file may be found below. The file, like the transcription on the website, uses Unicode character encoding and should be readable by most modern text-editing software packages, as well as certain web-browsers and specialist XML editors. However, please note that the full transcription is extremely large: it contains almost 28 million characters, with a file size of 28.8 megabytes, which may take some time to open and process, depending on the capacity of your computer. The download file is a compressed file of 4.4 megabytes, which will be automatically expanded by most operating systems to the full file when it is opened for the first time.
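Because of the size of the file, it can be more practical to process it as a stream rather than opening it in an editor. The following is a minimal Python sketch, not part of the project's own tooling, which reads the XML directly from the compressed download and counts how often each element occurs. The function name count_elements and the rule of picking the first .xml file in the archive are assumptions made for illustration; the actual file name inside the download is not stated here.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_elements(zip_path, member=None):
    """Stream an XML file stored in a zip archive and count elements by local name."""
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: if no member is named, use the first .xml file in the archive.
        if member is None:
            member = next(n for n in archive.namelist() if n.lower().endswith(".xml"))
        with archive.open(member) as stream:
            # iterparse reads the file incrementally instead of loading it whole.
            for _event, elem in ET.iterparse(stream, events=("end",)):
                # Drop any namespace prefix of the form {uri}name.
                local = elem.tag.rsplit("}", 1)[-1]
                counts[local] += 1
                # Discard the element's contents once counted to limit memory use.
                elem.clear()
    return counts
```

Calling count_elements with the path to the downloaded zip file returns a Counter, whose most_common() method lists the elements in order of frequency.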
Please note that this transcription file is offered for the use of those who are already familiar with XML. If you are unsure how to proceed or unable to understand the file, we recommend that you consult the manuscript online instead ( http://codexsinaiticus.org/en/manuscript.aspx ).
The transcription file is periodically updated with minor alterations, as detailed in the file header. The current version is 1.02b (27.1.2012).
DOWNLOAD THE FILE HERE
(4.4MB zip file)
About the XML
XML encoding is a way of marking up electronic texts, in which the text itself is supplemented by information enclosed in angle brackets. The XML for the Codex Sinaiticus transcription builds on the standards developed by the Text Encoding Initiative (TEI; http://www.tei-c.org/ ). However, Codex Sinaiticus is not simply a text, but a document. In order to represent every detail as it appears on the page, it was necessary for the project to add features which were not available in the TEI guidelines when the transcription was published (July 2009). Some of these may be incorporated in subsequent development of the TEI guidelines; others may remain incompatible with the TEI and require a workaround to be used with TEI-based tools.
The principal challenge was to encode material which appears in margins. On most pages of Codex Sinaiticus, there are potentially nineteen margins: three page margins (along the top, bottom and outer edge of each page), up to eight column margins (at the top and bottom of each column) and up to eight line margins (on the left and right side of each column). In order to delineate each margin correctly, two elements are used to mark each page, column and line: a starting element, identified by an id beginning with "S-", and a corresponding ending element, identified by an id beginning with "E-". When material is located in a margin, the break is left open (that is, the ending element is not yet given), and an additional <margin> element is inserted, with information about the type and layout of the margin. The text itself is then provided; in the case of corrections transcribed in the flow of the text, a link to the correction is given instead in order to make it display at both points. After the text, the <margin> element is closed, the line, column or page break is closed with its ending element, and the transcription of the body of the manuscript continues.
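The pairing of "S-" and "E-" ids can be illustrated with a short Python sketch. The fragment below is invented for illustration only: the element name lb, the attribute names type and place, and the id values are assumptions, and the real names should be taken from the specification document. Only the "S-"/"E-" prefixes and the <margin> and <w> elements come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a line break left open around a marginal reading.
SAMPLE = """<ab>
  <lb id="S-line1"/>
  <margin type="line" place="left"><w>α</w></margin>
  <lb id="E-line1"/>
  <w>β</w>
  <lb id="S-line2"/><lb id="E-line2"/>
  <w>γ</w>
</ab>"""


def margins_by_break(root):
    """Map each break name (id without its S-/E- prefix) to the margin texts inside it."""
    open_breaks = []  # breaks whose S- element has been seen but not yet the E- element
    found = {}
    for elem in root.iter():  # iter() visits elements in document order
        ident = elem.get("id", "")
        if ident.startswith("S-"):
            open_breaks.append(ident[2:])
        elif ident.startswith("E-"):
            name = ident[2:]
            if name in open_breaks:
                open_breaks.remove(name)
        elif elem.tag == "margin" and open_breaks:
            # Attribute the margin to the most recently opened break.
            text = "".join(elem.itertext()).strip()
            found.setdefault(open_breaks[-1], []).append(text)
    return found


print(margins_by_break(ET.fromstring(SAMPLE)))  # {'line1': ['α']}
```

In this sketch, the second line break is opened and closed immediately, so it contributes no marginal material, while the words β and γ belong to the body of the text.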
Paratextual information, such as page numbers, quire numbers, section numbers and titles, running titles, glosses, lectionary indications and colophons, is all encoded as sub-types of the <note> element. All words and punctuation are enclosed within <w> elements, which was necessary in order to generate the database for aligning the text with the images. If a word is split across a line, column or page break, this is indicated by the element <hi rend="kwhyphen"/>. Certain non-standard symbols, such as the binding mark, coronis and staurogram, as well as the variety of dots and lines used to indicate paragraphs and quotations or otherwise draw attention to the text, are also transcribed as <w> elements, using Unicode characters to achieve a stylised representation of the symbol. In the case of the two library stamps, links are provided to graphics, using the <g> element.
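As an illustration of how <w> elements might be read, the following Python sketch collects the text of each word and notes whether it was split across a break. The fragment is invented, and it assumes that the <hi rend="kwhyphen"/> marker and the break elements sit inside the <w> element at the point of the split; the element name lb and the id values are likewise hypothetical. The specification document should be consulted for the actual arrangement.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the second word is split across a line break.
SAMPLE = """<ab>
  <w>εν</w>
  <w>αρ<hi rend="kwhyphen"/><lb id="S-l2"/><lb id="E-l2"/>χη</w>
  <w>ην</w>
</ab>"""


def words(root):
    """Return (text, was_split) for every <w> element in document order."""
    result = []
    for w in root.iter("w"):
        # itertext() joins the text before and after any child elements,
        # so a split word is returned whole.
        text = "".join(w.itertext())
        was_split = any(h.get("rend") == "kwhyphen" for h in w.iter("hi"))
        result.append((text, was_split))
    return result


print(words(ET.fromstring(SAMPLE)))
# [('εν', False), ('αρχη', True), ('ην', False)]
```

If the full file declares an XML namespace, the tag names passed to iter() would need to include it.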
A PDF document detailing the specifications of the XML is available here.
A commented XSD schema developed at an earlier stage in the project is available for reference here.