Syntactic Normalization of the BTC2012 Dataset
[UPDATE: There was a bug in the code causing too many exceptions to be thrown for bad UTF-8 encoding. See the comment below for details.]
I recently normalized the 2012-09-02 version of the BTC2012 dataset for later experimentation (what “normalized” means is explained below). In the process, I gathered some statistics that may be useful to others.
At a high level, the normalization process consists of the following steps:
- Strip out the quad position of the BTC2012 N-quads files to generate N-triples files. Note that there is no removal of duplicate triples.
- For each line in the N-triples files:
  - Parse it, throwing an exception if an error is encountered.
  - Normalize each part in order of subject, predicate, object. If an error is encountered, throw an exception, ignoring remaining parts.
  - Output the normalized triple.
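The per-line loop described above can be sketched in Python as follows. This is only an illustration of the control flow, not the code used for the experiment: `TripleError`, `parse_line`, and `normalize_term` are hypothetical names, the parser is deliberately naive, and the only check performed is the delimiter check for IRIs. Stripping the quad position is not covered.

```python
from collections import Counter


class TripleError(Exception):
    """Raised when a line cannot be parsed or a term cannot be normalized."""

    def __init__(self, position, message):
        super().__init__(message)
        # One of "subject", "predicate", "object", or "unknown".
        self.position = position


def parse_line(line):
    """Hypothetical, naive parser: split a line into three raw terms."""
    body = line.strip()
    if not body.endswith("."):
        raise TripleError("unknown", "missing terminating '.'")
    parts = body[:-1].strip().split(None, 2)
    if len(parts) != 3:
        raise TripleError("unknown", "expected three terms")
    return parts


def normalize_term(term, position):
    """Hypothetical stand-in for real per-term normalization."""
    if term.startswith("<") and not term.endswith(">"):
        raise TripleError(position, "bad delimiters")
    return term


def normalize_lines(lines):
    """Return (normalized triples, error counts keyed by position)."""
    output, errors = [], Counter()
    for line in lines:
        try:
            terms = parse_line(line)
            normalized = []
            # Subject, predicate, object in order; the first failure aborts
            # the triple, so later positions are never checked.
            for term, position in zip(terms, ("subject", "predicate", "object")):
                normalized.append(normalize_term(term, position))
        except TripleError as err:
            errors[err.position] += 1
            continue
        output.append(" ".join(normalized) + " .")
    return output, errors
```

Because the first failing term raises immediately, each error triple is attributed to exactly one position, which matches how the error sources are counted.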
The base code (not including the main function) is available under the permissive Apache 2.0 license here, except that for this particular experiment I updated the files in ucs/scripts and lang/scripts to the latest UCS 6.2.0 and language tag repository, respectively. I wrote this code because I was tired of dealing with library dependencies on supercomputers with non-standard operating systems.
Table 1 shows (non-unique) triple counts. Input triples are those resulting from merely stripping out the quad position. Error triples are those for which problems occurred during parsing or normalization. Output triples are those remaining and considered part of the normalized BTC2012 dataset. Table 2 lists the sources of the errors in the error triples, that is, whether errors were found in the subject, predicate, object, or possibly an unknown position. Recall that once an error is found, the rest of the triple is ignored. That is, if an error is found in the subject, then the predicate and object are not checked. Similarly, if an error is found in the predicate, then the object is not checked.
[UPDATE: Removing duplicates from the output triples, there are 1,056,169,984 unique triples.]
Figure 1 breaks down the errors by cause. Errors from subjects and predicates were always malformed IRIs. The sources of all of the UTF-8 errors and some (86) of the invalid codepoint errors were unknown. The rest of the errors came from the object position of triples. An explanation of the legend in figure 1 follows.
- “BAD UTF8” simply means that data in the triple which was assumed to be UTF-8 encoded was not validly encoded.
- “MALFORMED IRI” means that an IRI was found (either an RDF term that is an IRI or a datatype IRI in a typed literal) that was not a valid IRI according to RFC 3987.
- “RELATIVE IRI” merely means that the IRI is relative, which is not allowed in N-quads or N-triples formats.
- “BAD CODEPOINT” means that a codepoint was encountered that was not defined in UCS 6.2.0.
- “BAD DELIMITERS” means that RDF terms were encountered that were not properly delimited. For example, an IRI starting with ‘<’ but not ending with ‘>’.
- “BAD LANGTAG” accounts for malformed language tags according to RFC 5646.
- “HEXTUPLE” accounts for the odd condition in which two triples were placed on the same line.
The complete list of malformed IRIs can be found here, and the top 10 are listed in table 3. Most of the top 10 are malformed because they contain square brackets in the query portion of the IRI. The 5th and 10th ones are malformed because they contain spaces. The 4th one is malformed because the fragment contains a ‘#’.
The complete list of relative IRIs can be found here, and they are also listed in table 4.
All 18 of the malformed language tags are the same: fr_1793. The apparent problem is the use of an underscore rather than a hyphen. That particular statistic along with the “hextuples” can be found here (with hextuples reported like bad language tags). The full list of invalid codepoints can be found here, and the full list of invalid UTF-8 encodings can be found here. (Each line in the latter has the format “N B:H” where N is the number of occurrences, B is the byte position of the UTF-8-encoded character, and H is the value of that byte in hexadecimal format.)
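For anyone who wants to post-process the list of invalid UTF-8 encodings, here is a small Python sketch that reads one “N B:H” line. The names `Utf8ErrorStat` and `parse_stat_line` are made up for this example, and it assumes that N and B are written in decimal and that exactly one space separates N from the rest.

```python
from typing import NamedTuple


class Utf8ErrorStat(NamedTuple):
    count: int          # N: number of occurrences
    byte_position: int  # B: byte position
    byte_value: int     # H: offending byte, parsed from hexadecimal


def parse_stat_line(line):
    """Parse one 'N B:H' line into a Utf8ErrorStat."""
    count, _, rest = line.strip().partition(" ")
    position, _, hex_value = rest.partition(":")
    return Utf8ErrorStat(int(count), int(position), int(hex_value, 16))
```

For a hypothetical line such as `3 1:c3`, this yields a count of 3, a byte position of 1, and a byte value of 0xC3.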
Figure 2 shows the breakdown of RDF terms. The astute observer will notice that there are more terms than could possibly be used in the normalized triples. This is due to the normalization process: if an error occurs in the predicate position of a triple, statistics are still collected for the valid subject, and if an error occurs in the object position, statistics are still collected for the valid subject and predicate. Not surprisingly, IRIs account for about 76% of the RDF terms, followed by blank nodes (about 11%), language-tagged literals (about 7%), typed literals (about 3%), and simple literals (about 3%).
One syntactic problem was not reported as an error above: invalid blank node labels. According to the latest standardization of N-triples, blank node labels must be alphanumeric. However, 1,696 blank node labels were not alphanumeric, although they were all NFC-normalized UTF-8-encoded Unicode strings. (These statistics apply only to blank node labels without any other errors.)
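A check of this kind might look like the following Python sketch. It assumes that “alphanumeric” means ASCII letters and digits only; the function name `has_alphanumeric_label` is invented for the example.

```python
def has_alphanumeric_label(blank_node):
    """Return True if the label after '_:' is non-empty ASCII alphanumeric."""
    if not blank_node.startswith("_:"):
        raise ValueError("not a blank node: %r" % blank_node)
    label = blank_node[2:]
    # str.isalnum() alone would accept non-ASCII letters, so also require
    # ASCII (an assumption about what the specification intends).
    return label.isascii() and label.isalnum()
```

Under that assumption, `_:b1` passes while `_:node-1` and labels with non-ASCII letters do not.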
579,710,373 (99.998%) of the literals had lexical representations that were NFC normalized.
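Checking whether a lexical representation is NFC normalized is straightforward with Python's standard `unicodedata` module. The helper names below are made up for this sketch, which compares each string with its NFC form.

```python
import unicodedata


def is_nfc(text):
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def nfc_share(literals):
    """Return the fraction of literals that are already NFC normalized."""
    literals = list(literals)
    if not literals:
        return 0.0
    return sum(is_nfc(text) for text in literals) / len(literals)
```

A precomposed “é” (U+00E9) is NFC normalized, whereas “e” followed by a combining acute accent (U+0301) is not.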
Now we consider IRIs, not only as RDF terms, but also as datatype IRIs in RDF typed literals. Collectively, there are 3,401,695,106 of them accounted for in the statistics. Figure 3 gives the breakdown into hash, slash, and other IRIs. An IRI was classified as follows.
- If the scheme is neither http nor https, then classify as other.
- If there is a fragment, then classify as hash.
- If the path contains a slash, then classify as slash.
- Otherwise, classify as other.
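The classification rules above translate almost directly into code. The following Python sketch uses `urllib.parse.urlsplit` from the standard library and assumes the input is already a valid absolute IRI; `classify_iri` is a name chosen for this example.

```python
from urllib.parse import urlsplit


def classify_iri(iri):
    """Classify a valid absolute IRI as 'hash', 'slash', or 'other'."""
    parts = urlsplit(iri)
    if parts.scheme.lower() not in ("http", "https"):
        return "other"
    # In a valid IRI, '#' can only introduce the fragment, so its presence
    # means there is a fragment (possibly an empty one).
    if "#" in iri:
        return "hash"
    if "/" in parts.path:
        return "slash"
    return "other"
```

Note that the rules are applied in order, so an http IRI with both a fragment and slashes in its path counts as hash.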
About 65% of IRIs were slash, 35% of IRIs were hash, and nearly 0% were “other”. (Note that these statistics reflect nothing about dereferencing behavior but only syntactic appearance.)
The presence of the different parts is accounted for in table 5. Since every valid IRI in an RDF dataset must be absolute, every IRI has a scheme and a (possibly zero-length) path. Table 5 reports the number of IRIs having each optional part.
IRIs were normalized by doing the following:
- Percent-decoding any unnecessarily percent-encoded characters.
- Performing NFC normalization.
- Resolving/normalizing the path (e.g., /./a becomes /a).
- Lower-casing the scheme and host.
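To make the four steps concrete, here is a rough Python sketch. It is not the code used for this experiment and it makes simplifying assumptions: only percent-encoded ASCII unreserved characters (letters, digits, `-`, `.`, `_`, `~`) are decoded, the steps run in the order listed, the IRI is split with the generic regular expression from RFC 3986 Appendix B, and dot segments are removed with the algorithm from RFC 3986 section 5.2.4. All function names are invented for the example.

```python
import re
import string
import unicodedata

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
# Generic component splitter adapted from RFC 3986, Appendix B.
IRI_PARTS = re.compile(
    r"^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?$",
    re.DOTALL,
)


def decode_unreserved(text):
    """Decode %XX escapes that encode ASCII unreserved characters."""
    def replace(match):
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    return re.sub(r"%([0-9A-Fa-f]{2})", replace, text)


def remove_dot_segments(path):
    """Remove '.' and '..' segments (RFC 3986, section 5.2.4)."""
    output = []
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./") or path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment, with any leading '/', to the output.
            end = path.find("/", 1)
            end = len(path) if end == -1 else end
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


def normalize_iri(iri):
    """Apply the four normalization steps to an absolute IRI."""
    iri = unicodedata.normalize("NFC", decode_unreserved(iri))
    scheme, authority, path, query, fragment = IRI_PARTS.match(iri).groups()
    result = ""
    if scheme is not None:
        result += scheme.lower() + ":"
    if authority is not None:
        # Lower-case host (and port, which is harmless) but not userinfo.
        userinfo, at, hostport = authority.rpartition("@")
        result += "//" + userinfo + at + hostport.lower()
    result += remove_dot_segments(path)
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

With this sketch, `HTTP://Example.ORG/./a/%7Euser/../b?x=1#Frag` becomes `http://example.org/a/b?x=1#Frag`; the query and fragment are left untouched apart from decoding and NFC normalization.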
An IRI was considered “urified” (turned into a normalized URI) if it was normalized and then had all its iprivate and ucschars percent-encoded. (See RFC 3987 for details.) It is sometimes the case that an IRI has the same normal form and urified form, but not always. Table 6 gives the count of IRIs that were already normalized, urified, or both.
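The urification step can be sketched in Python as below. It assumes its input is a valid, already normalized IRI, in which case every non-ASCII character must be a ucschar or an iprivate character, so it is enough to percent-encode the UTF-8 bytes of every non-ASCII character. The name `urify` simply mirrors the term used here.

```python
def urify(normalized_iri):
    """Percent-encode all non-ASCII characters of a normalized IRI."""
    out = []
    for char in normalized_iri:
        if ord(char) < 0x80:
            out.append(char)
        else:
            # Encode as UTF-8, then percent-encode each byte.
            out.extend("%{:02X}".format(byte) for byte in char.encode("utf-8"))
    return "".join(out)
```

An IRI containing only ASCII characters is returned unchanged, which is the case where the normal form and the urified form coincide.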
Table 7 gives the counts for language tags containing particular optional parts. Every language tag must either contain a primary tag, be a purely private-use tag, or be a grandfathered tag (either regular or irregular). (See RFC 5646 for details.) No grandfathered tags were found. In gathering statistics, no check was made for primary tags in order to distinguish between typical tags and purely private-use tags. Thus, in table 7, the count for tags containing a private-use portion includes counts for any purely private-use tags.
Language tags were normalized by the following process (again, see RFC 5646 for details):
- Sort extensions by singletons.
- Replace redundant and grandfathered tags with preferred values.
- Replace subtags with preferred values.
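As a rough illustration of these steps, consider the Python sketch below. It is a simplification, not the actual implementation: the two lookup tables hold only a handful of entries taken from the IANA Language Subtag Registry, preferred values are applied only to whole tags and to the primary language subtag, and the tag is lower-cased first (language tags are case-insensitive). The names are invented for the example.

```python
# Tiny illustrative excerpts; a real implementation would load the full
# IANA Language Subtag Registry.
PREFERRED_TAGS = {"i-klingon": "tlh", "art-lojban": "jbo"}
PREFERRED_LANGUAGES = {"iw": "he", "in": "id", "ji": "yi"}


def normalize_langtag(tag):
    """Sort extensions by singleton and apply a few preferred values."""
    tag = tag.lower()
    tag = PREFERRED_TAGS.get(tag, tag)
    subtags = tag.split("-")
    # Everything before the first one-character subtag is the
    # language/script/region/variant part.
    first = next((i for i, s in enumerate(subtags) if len(s) == 1), len(subtags))
    prefix, rest = subtags[:first], subtags[first:]
    if prefix:
        # Only the primary language subtag is replaced, so that a region
        # such as "in" (India) is not mistaken for a language.
        prefix[0] = PREFERRED_LANGUAGES.get(prefix[0], prefix[0])
    extensions, private_use = [], []
    i = 0
    while i < len(rest):
        if rest[i] == "x":
            # Private use runs to the end of the tag and is not reordered.
            private_use = rest[i:]
            break
        j = i + 1
        while j < len(rest) and len(rest[j]) > 1:
            j += 1
        extensions.append(rest[i:j])
        i = j
    extensions.sort(key=lambda group: group[0])
    parts = prefix + [s for group in extensions for s in group] + private_use
    return "-".join(parts)
```

For example, `en-u-co-phonebk-a-foo` is reordered to `en-a-foo-u-co-phonebk`, and the deprecated `iw` is replaced by `he`.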
In addition to having a normal form, language tags also have an “extlang” form. Sometimes the normal form and the extlang form are the same, but not always. Table 8 presents the relevant counts.
The aggregated raw statistics can be found here, and the raw statistics for each BTC2012 data file along with printed exceptions can be found here. This blog post has provided some additional interpretation of the raw statistics that is not readily apparent. If you happen to come across any exceptions that you think are unfounded, please let me know so that I can do the appropriate debugging.