This chapter describes the core GEDCOM data representation language.
The generic data representation language defined in this chapter may be used to represent any form
of structured information, not just genealogical data, using a sequential stream of characters.
Concepts
A GEDCOM transmission represents a database in the form of a sequential stream of related
records. A record is represented as a sequence of tagged, variable-length lines, arranged in a
hierarchy. A line always contains a hierarchical level number, a tag, and an optional value. A line
may also contain a cross-reference identifier or a pointer. The GEDCOM line is terminated by a
carriage return, a line feed character, or any combination of these.
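As a rough illustration of the line anatomy described above, the following Python sketch splits one GEDCOM line into a level, an optional cross-reference identifier, a tag, and an optional value. The names `GedcomLine` and `parse_line` are hypothetical, and the regular expression is a simplification for illustration, not the normative grammar.

```python
import re
from typing import NamedTuple, Optional


class GedcomLine(NamedTuple):
    level: int
    xref_id: Optional[str]
    tag: str
    value: Optional[str]


# Simplified pattern (an assumption, not the formal grammar):
# level, optional @xref@, tag, optional value.
_LINE_RE = re.compile(
    r"^\s*(0|[1-9][0-9]?)"   # level without leading zeroes
    r"(?: (@[^@ ]+@))?"      # optional cross-reference identifier
    r" ([A-Za-z0-9_]+)"      # tag
    r"(?: (.*))?$"           # optional value, which may contain spaces
)


def parse_line(text: str) -> GedcomLine:
    """Split one GEDCOM line into its parts, ignoring the terminator."""
    match = _LINE_RE.match(text.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a GEDCOM line: {text!r}")
    level, xref_id, tag, value = match.groups()
    return GedcomLine(int(level), xref_id, tag, value or None)
```

For instance, `parse_line("2 DATE 12 MAY 1920")` yields level 2, no cross-reference identifier, the tag `DATE`, and the value `12 MAY 1920`. A pointer, when present, simply appears as the value.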
The tag in the GEDCOM line, taken in its hierarchical context, identifies the information contained
in the line, in the same sense that a field-name identifies a field in a database record. This means
that the data is self-defining. Tags allow a field to occur any number of times within a record,
including zero times. They also allow the use of different or new fields to be included in the
GEDCOM data without introducing incompatibility, because the receiving system will ignore data
which it does not understand and process only the data that it does understand.
The hierarchical relationships are indicated by a level number. Subordinate lines have a higher level
number. The hierarchy allows a line to have sub-lines, which in turn may have their own sub-lines,
and so forth. A line and its sub-lines constitute a context or enclosure, that is, a cluster of
information pertaining directly to the same thing. This hierarchical arrangement corresponds with
the natural hierarchy found in most structured information.
A series of one or more lines constitutes a record. The beginning of a new record is indicated by a
line whose level number is 0 (zero).
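The record boundary rule above can be sketched in a few lines of Python. The function name `split_records` is hypothetical, and the sketch assumes each input string is already one GEDCOM line that begins with its level number.

```python
from typing import Iterable, List


def split_records(lines: Iterable[str]) -> List[List[str]]:
    """Group GEDCOM lines into records; a level-0 line starts a new record."""
    records: List[List[str]] = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines between GEDCOM lines
        level = int(stripped.split(" ", 1)[0])
        if level == 0 or not records:
            records.append([])
        records[-1].append(stripped)
    return records
```

Each inner list then holds one record: its level-0 line followed by all of its subordinate lines.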
In addition to hierarchical relationships, GEDCOM defines the inter-record relationships that allow a
record to be logically related to other records, without introducing redundancy. These relationships
are represented by two additional, but optional, parts of a line: a cross-reference pointer and a
cross-reference identifier. The cross-reference pointer "points at" a related record, which is
identified by a required, matching unique cross-reference identifier. The cross-reference identifier is
analogous to a primary key in relational database terminology.
Grammar
This chapter defines the grammar for the GEDCOM format. The grammar is a set of rules that
specify the character sequences that are valid for creating the expression of the GEDCOM line. The
character sequences are described in terms of various combinations of elements (variables and/or
constants). Elements may be described in terms of a set of other elements, some of which are
selected from a set of alternative elements. Each element in the definition is separated by a plus sign
(+) signifying that both elements are required. When there is a choice of different elements that can
be used, the set of alternatives is listed between opening and closing square brackets ([]), with each
choice separated by a vertical bar ([alternative_1 | alternative_2]). The user can read the grammar
components of the selected element by substituting any sub-elements until all sub-elements have been
resolved.
A GEDCOM transmission consists of a sequence of logical records, each of which consists of a
sequence of gedcom_lines, all contained in a sequential file or stream of characters. The following
rules pertain to the gedcom_line:
Long values can be broken into shorter GEDCOM lines by using a subordinate CONC or
CONT tag. The CONC tag assumes that the accompanying subordinate value is concatenated to
the previous line value without saving the carriage return prior to the line terminator. The
CONT assumes that the subordinate line value is concatenated to the previous line, saving the
carriage return.
The beginning of a new logical record is designated by a line whose level number is 0 (zero).
Each new level number must be no higher than the level number of the previous line plus 1.
Logical GEDCOM record sizes should be constrained so that they will fit in a memory buffer of
less than 32K. GEDCOM files with record sizes greater than 32K run the risk of not being
loadable in some programs. Use of pointers to records, particularly NOTE records, should
ensure that this limit will be sufficient. The size of embedded multimedia records can be
controlled through chaining MULTIMEDIA_RECORDS (see the multimedia record format).
Any length constraints are given in characters, not bytes. When wide characters (characters
wider than 8 bits) are used, byte buffer lengths should be adjusted accordingly.
Level numbers must be between 0 and 99 and must not contain leading zeroes; for example, level
one must be 1, not 01.
The cross-reference ID has a maximum of 22 characters, including the enclosing at signs (@),
and it must be unique within the GEDCOM transmission.
Pointers to records imply that the record pointed to actually exists within the transmission.
Future pointer structures may allow pointing to records within a publicly accessible database as an
alternative.
The length of the GEDCOM TAG is a maximum of 31 characters, with the first 15 characters
being unique.
The total length of a GEDCOM line, including leading white space, level number, cross-reference number, tag, value, delimiters, and terminator, must not exceed 255 (wide)
characters.
Leading white space (tabs, spaces, and extra line terminators) preceding a GEDCOM line
should be ignored by the reading system. Systems generating GEDCOM should not place any
white space in front of the GEDCOM line.
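Several of the rules above are simple numeric limits that a reader or writer can check mechanically. The following Python sketch checks a few of them for a single line; the function `check_line`, the constant names, and the wording of the messages are all hypothetical, and the sketch deliberately leaves out the 32K record size guideline and the tag uniqueness rule.

```python
from typing import List, Optional

MAX_LINE_CHARS = 255   # whole line, counted in characters
MAX_XREF_CHARS = 22    # includes the enclosing at signs
MAX_TAG_CHARS = 31
MAX_LEVEL = 99


def check_line(line: str, previous_level: Optional[int]) -> List[str]:
    """Return the rule violations found in one line (empty if none)."""
    problems: List[str] = []
    if len(line) > MAX_LINE_CHARS:
        problems.append("line longer than 255 characters")
    parts = line.rstrip("\r\n").lstrip().split(" ")
    level_text = parts[0]
    if not (level_text.isascii() and level_text.isdigit()):
        return problems + ["level is not a number"]
    if len(level_text) > 1 and level_text.startswith("0"):
        problems.append("level has a leading zero")
    level = int(level_text)
    if level > MAX_LEVEL:
        problems.append("level above 99")
    if previous_level is not None and level > previous_level + 1:
        problems.append("level jumps by more than 1")
    rest = parts[1:]
    # A cross-reference ID, if present, sits between the level and the tag.
    if rest and len(rest[0]) > 2 and rest[0].startswith("@") and rest[0].endswith("@"):
        if len(rest[0]) > MAX_XREF_CHARS:
            problems.append("cross-reference ID longer than 22 characters")
        rest = rest[1:]
    if not rest or not rest[0]:
        problems.append("missing tag")
    elif len(rest[0]) > MAX_TAG_CHARS:
        problems.append("tag longer than 31 characters")
    return problems
```

Pass `None` as `previous_level` for the first line of a transmission, and the level of the preceding line otherwise.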
The components used in the pattern above are defined below in alphabetical order. Some of the
components are defined in terms of other primitive patterns. The spaces used in the patterns below
are only to set them apart and are not a part of the resulting pattern. Character constants are
specified in the hex form (0x20) which is the ASCII hex value of a space character. Character
constants that are separated by a (-) dash represent any character within that range from the first
constant shown to and including the second constant shown.
alpha:=
[(0x41)-(0x5A) | (0x61)-(0x7A) | (0x5F) ]
where:
(0x41)-(0x5A)=A to Z
(0x61)-(0x7A)=a to z
(0x5F)=(_) underscore
escape_text:=
[any_char | escape_text + any_char ]
The escape_text is coded to meet the rules of a particular GEDCOM form.
level:=
[digit | level + digit ]
(Do not use non-significant leading zeroes such as 02.)
Any 8-bit ASCII character except control characters (0x00-0x1F), alphanum, space ( ), number
sign (#), at sign (@), _ underscore, and the DEL character (0x7F).
alpha:=
The alpha characters include the underscore, which is used to link word pieces together in
forming tag names or tag labels.
any_char:=
Any 8-bit ASCII character except the control characters found in the range of 0x00-0x1F and
0x7F. If an @ is desired as part of the line_value, it must be written in GEDCOM as a double
@; that is, "3 doz. @ $20.00" must be stored as "3 doz. @@ $20.00".
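The at-sign doubling rule can be sketched as a pair of Python helpers. The names `encode_at_signs` and `decode_at_signs` are hypothetical, and the sketch assumes it is only ever given plain text: values that are pointers or that contain escape sequences use single at signs on purpose and would need separate handling.

```python
def encode_at_signs(text: str) -> str:
    """Double every at sign so plain text can be stored as a line value."""
    return text.replace("@", "@@")


def decode_at_signs(value: str) -> str:
    """Collapse doubled at signs back into single ones when reading."""
    return value.replace("@@", "@")
```

With these helpers, `encode_at_signs("3 doz. @ $20.00")` gives `3 doz. @@ $20.00`, and decoding that value restores the original text.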
delim:=
The delim (delimiter), a single space character, terminates both the variable-length level number
and the variable-length tag. Note that space characters may also be present in a value.
escape:=
The escape is a character sequence in the grammar used to specify special processing, such as
for switching character sets or for indicating an inclusion of a non-GEDCOM data form into the
GEDCOM structure. The form of the escape sequence is:
@ + # + escape_text + @ + non_at
Receiving systems should discard any space character which follows the escape sequence's
closing at-sign (@). If the character following the escape sequence's closing at-sign (@) is not a
space character then it should be kept as a part of the text following the escape. Systems
writing escape sequences should always output a space character following the escape sequence.
The specific format of the escape sequence is defined for the specific GEDCOM form being
defined.
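The reading and writing rules for the escape sequence can be illustrated with a small Python sketch. The function names and the escape text `XYZ` used in the usage note are hypothetical; the sketch only handles an escape at the very start of a value and makes no attempt to interpret the escape_text, which depends on the GEDCOM form in use.

```python
import re
from typing import Optional, Tuple

# "@#" + escape_text + "@", followed by at most one space to discard.
_ESCAPE_RE = re.compile(r"^@#([^@]*)@ ?")


def split_leading_escape(value: str) -> Tuple[Optional[str], str]:
    """Return (escape_text, remaining text); escape_text is None if absent."""
    match = _ESCAPE_RE.match(value)
    if match is None:
        return None, value
    return match.group(1), value[match.end():]


def prefix_escape(escape_text: str, text: str) -> str:
    """Write an escape sequence, always followed by one space."""
    return f"@#{escape_text}@ {text}"
```

Reading `@#XYZ@ text` and `@#XYZ@text` both yield the escape text `XYZ` and the remainder `text`: the single space after the closing at sign is dropped, while any other character is kept.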
escape_text:=
The escape_text is defined to meet the requirements of a particular GEDCOM form.
level:=
The level number works the same way as the level of indentation in an indented outline, where
indented lines provide detail about the item under which they are indented. A line at any level L
is enclosed by and pertains directly to the nearest preceding line at level L-1. From one line to
the next, the level may increase by at most 1. Level numbers must not contain leading zeroes; for
example, level one must be (1), not (01).
The enclosed subordinate lines at level L are said to be in the context of the enclosing superior
line at level L-1. The interpretation of a tag must be in the context of the tags of the enclosing
line(s) rather than just the tag by itself. Take the following record about an individual's birth
and death dates, for example:
0 INDI
   1 BIRT
      2 DATE 12 MAY 1920
   1 DEAT
      2 DATE 1960
In this example, the expression DATE 12 MAY 1920 is interpreted within the INDI (individual)
BIRT (birth) context, representing the individual's birth date. The second DATE is in the
INDI.DEAT (individual's death) context. The complete meaning of DATE depends on the
context.
Note: The above example is indented according to the level numbers to make the concept
more obvious. In the actual GEDCOM data, the level numbers are lined up vertically,
meaning they are the first character(s) of the GEDCOM line.
Some systems output indented GEDCOM data for better readability by putting space or tab
characters between the terminator and the level number of the next line to visibly show the
hierarchy. Also, some people have suggested allowing extra blank lines to visibly separate
physical records. GEDCOM files produced with these features are not to be used when
transmitting GEDCOM to other systems.
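The way a tag takes its meaning from the enclosing lines can be sketched in Python by keeping a stack of the tags currently in scope. The function name `with_context` is hypothetical, and the sketch assumes well-formed lines without cross-reference identifiers, as in the INDI example of this section.

```python
from typing import Iterator, List, Tuple


def with_context(lines: List[str]) -> Iterator[Tuple[str, str]]:
    """Yield (tag path, value) pairs such as ('INDI.BIRT.DATE', '12 MAY 1920')."""
    stack: List[str] = []
    for line in lines:
        # strip() tolerates indented input, which readers should ignore.
        parts = line.strip().split(" ", 2)
        level, tag = int(parts[0]), parts[1]
        value = parts[2] if len(parts) > 2 else ""
        del stack[level:]   # leave the contexts at this level or deeper
        stack.append(tag)
        yield ".".join(stack), value
```

Applied to the birth and death example, the two DATE lines come out under the distinct paths `INDI.BIRT.DATE` and `INDI.DEAT.DATE`, which is the distinction the context rule is meant to preserve.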
line_value:=
The line_value identifies an object within the domain of possible values allowed in the context
of the tag. The combination of the tag, the line_value, and the hierarchical context of the
supporting gedcom_lines provides the understanding of the enclosed values. This domain is
defined by a specific grammar for representing a given GEDCOM form. (See Chapter 2
for Lineage-Linked GEDCOM Form grammar.)
Values whose source information contains illegible parts of the value should be indicated by
replacing the illegible part with an ellipsis (...).
Values are generally not encoded in binary or other abbreviation schemes for reducing space
requirements, and they are generally constrained to be understandable by a typical user without
decoding. This is intended to reduce the decoding burden on the receiving software. A
GEDCOM-optimized data compression standard will be defined in the future to reduce space
requirements. Meanwhile, users may agree to compress and decompress GEDCOM files using
any compression system available to both sender and receiver.
The line_value within the context of a tag hierarchy of gedcom_lines represents one piece of
information and corresponds to one field in traditional database or file terminology.
otherchar:=
Any 8-bit ASCII character except control characters (0x00-0x1F), alphanum, space ( ),
number-sign (#), the at sign (@), and the DEL character (0x7F).
pointer:=
A pointer stands in the place of the context identified by the matching xref_id. Theoretically, a
receiving system should be prepared to follow a pointer to find any needed value in a manner
that is transparent to the logic of the subsystem that is looking for specific tags. This highly
flexible facility will probably be used more in the future. For the time being, however, the use
of pointers is explicitly defined within the GEDCOM form (such as the Lineage-Linked
GEDCOM Form defined in Chapter 2).
The pointer represents the association between two objects that usually reside in different
records. Objects within a logical record can also be associated. If this need exists, the pointer record
composition contains an exclamation point (!) that separates the parent record's cross-reference
ID from the specific substructure's cross-reference ID, which is at some subordinate level to the
logical record at level zero. The cross-reference ID of the substructure subordinate to a zero
level record, for inter-record associations, is always composed of the Record ID number and the
Substructure ID number, such as @I132!1@. Including the Record ID number in the pointer
that associates objects within a record will allow the GEDCOM processors to build the index
only at the record level and then search sequentially for the appropriate substructure
cross-reference ID. The parent record ID is assumed when the cross-reference ID begins with an
exclamation point (!), signifying an intra-record association.
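One possible reading of the exclamation point convention is sketched below in Python. The function name `split_pointer` is hypothetical, and the sketch assumes that at most one exclamation point appears between the at signs; the text above does not say how other cases should be treated.

```python
from typing import Optional, Tuple


def split_pointer(pointer: str) -> Tuple[Optional[str], Optional[str]]:
    """Split '@I132!1@' into (record ID, substructure ID).

    A missing record ID (as in '@!1@') is returned as None, meaning the
    parent record is assumed; a missing substructure ID is also None.
    """
    if len(pointer) < 3 or not (pointer.startswith("@") and pointer.endswith("@")):
        raise ValueError(f"not a pointer: {pointer!r}")
    body = pointer[1:-1]
    record_id, separator, substructure_id = body.partition("!")
    return (record_id or None, substructure_id if separator else None)
```

A processor following the indexing approach described above could look up the record ID in a record-level index and then scan that record sequentially for the substructure ID.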
Complex logical record structures are divided into small physical records to accommodate
memory constraints, many-to-many relationships, and independent record creation and deletion.
The pointer must match a corresponding unique xref_id within the transmission, unless the
colon (:) character is present (which will be used in the future as a network reference to a
permanent file record). A pointer is given instead of duplicating an object, though the logical
result is equivalent. An expanded traversal of a record tree includes following the pointer to
related records to some depth, and splicing those records (logically) into the resultant expanded
tree. Pointers may refer either to records which have not yet appeared in the transmission
(forward reference) or to records that have already appeared earlier in the transmission
(backward reference). This arrangement usually requires a preliminary pass to construct a lookup
table to support random access by xref_id during subsequent passes.
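The two-pass arrangement can be sketched as follows in Python. The function names, the regular expressions, and the choice to map each xref_id to a line position are assumptions made for illustration; the FAM tag in the usage note is borrowed from the Lineage-Linked form only as sample data. Pointers containing a colon are skipped because the text reserves them for future use.

```python
import re
from typing import Dict, List

# Simplified line pattern: level, optional @xref@, tag, optional value.
_LINE = re.compile(r"^\s*(\d+) (?:(@[^@ ]+@) )?(\w+)(?: (.*))?$")
# A value that is a pointer: @...@, not an escape (@#...) and with no inner @.
_POINTER = re.compile(r"^@[^@# ][^@ ]*@$")


def build_xref_index(lines: List[str]) -> Dict[str, int]:
    """Preliminary pass: map every xref_id to the position of its line."""
    index: Dict[str, int] = {}
    for position, line in enumerate(lines):
        match = _LINE.match(line)
        if match and match.group(2):
            if match.group(2) in index:
                raise ValueError(f"duplicate xref_id {match.group(2)}")
            index[match.group(2)] = position
    return index


def find_dangling_pointers(lines: List[str], index: Dict[str, int]) -> List[str]:
    """Later pass: list pointers that match no xref_id in the transmission."""
    dangling: List[str] = []
    for line in lines:
        match = _LINE.match(line)
        value = match.group(4) if match else None
        if value and _POINTER.match(value) and ":" not in value and value not in index:
            dangling.append(value)
    return dangling
```

Because the index is complete before any pointer is followed, a forward reference such as `1 CHIL @I1@` appearing before `0 @I1@ INDI` resolves in the same way as a backward reference.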
tag:=
A tag consists of a variable length sequence of alphanum characters. All user-defined tags (tags
that have not been defined in the GEDCOM standard) must begin with an underscore
character (0x5F).
The tag represents the meaning of the line_value within the context of the enclosing lines, and
contributes to the meaning of enclosed subordinate lines. Specific tags are defined in
Appendix A. The presence of a tag together with a value represents an assertion
which the submitter wishes to communicate to a receiver. A tag with no value does not
represent an assertion. If a tag is absent, no assertion is made; that is, no information is
submitted. Information of a negative nature (such as knowing positively an event did not occur)
is handled through the semantic definition of a tag and accompanying values that assert the
information explicitly. It is not represented by absence of a tag.
Although formally defined tags are only three or four characters long, systems should prepare to
handle user tags of greater length. Tags will be unique within the first 15 characters.
Valid combinations of specific tags, line_values, xref_ids, and pointers are constrained by the
GEDCOM form defined for representing a given kind of information. (See Chapter 2 for the
Lineage-Linked GEDCOM Form grammar.)
terminator:=
The terminator delimits the variable-length line_value and signals the end of the gedcom_line.
The valid terminator characters are:
[carriage_return |
line_feed |
carriage_return line_feed |
line_feed carriage_return ]
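A reader that accepts all four terminator forms can be sketched in Python with one regular expression. The function name `split_lines` is hypothetical. The two-character forms are listed first so that a carriage return and line feed pair is consumed as a single terminator, and empty results are dropped on the assumption that extra terminators are to be ignored by the reading system.

```python
import re
from typing import List

# Try the two-character terminators before the single-character ones.
_TERMINATOR = re.compile(r"\r\n|\n\r|\r|\n")


def split_lines(stream: str) -> List[str]:
    """Split a character stream into GEDCOM lines on any valid terminator."""
    return [line for line in _TERMINATOR.split(stream) if line.strip()]
```

A stream that mixes terminator styles, for example one assembled from files written on different systems, then splits into the same lines as one that uses a single style throughout.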
xref_id:=
(See pointer)
The xref_id is formed by any arbitrary combination of characters from the pointer_char set.
The first character must be an alpha or a digit. The xref_id is not retained in the receiving
system, and it may therefore be formed from any convenient combination of identifiers from the
sending system. No meaning is attributed by the receiver to any part of the xref_id, other than
its unique association with the associated record. The use of the colon (:) character is also
reserved.
Examples:
The following are examples of valid but unrelated GEDCOM lines:
0 @1234@ INDI
. . .
1 AGE 13y
. . .
1 CHIL @1234@
. . .
1 NOTE This is a note field that is
2 CONT continued on the next line.
The first line has a level number 0, an xref_id of @1234@, an INDI tag, and no value.
The second line has a level number 1, no xref_id, an AGE tag, and a value of 13y.
The third line has a level number 1, no xref_id, a CHIL tag, and a value of a pointer to an xref_id named @1234@.
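The NOTE and CONT pair in the examples shows a long value broken across lines. A minimal Python sketch of how a reader might reassemble such a value is given below. The function name `join_continuations` is hypothetical, the sketch assumes lines without cross-reference identifiers, and representing the saved carriage return as "\n" in the rebuilt string is a choice made here, not something the text prescribes.

```python
from typing import List


def join_continuations(lines: List[str]) -> str:
    """Rebuild one logical value from a line and its CONC/CONT sub-lines."""
    first = lines[0].split(" ", 2)
    text = first[2] if len(first) > 2 else ""
    for line in lines[1:]:
        parts = line.split(" ", 2)
        tag = parts[1]
        piece = parts[2] if len(parts) > 2 else ""
        if tag == "CONT":
            text += "\n" + piece   # keep the line break
        elif tag == "CONC":
            text += piece          # join with no line break
        else:
            raise ValueError(f"unexpected tag {tag!r} in continuation")
    return text
```

For the example above, the result is the two-line text "This is a note field that is" followed by "continued on the next line.", whereas a CONC sub-line would have been appended to the first line directly.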