$$%% examples \newcommand{\exGraph}{\graph_{\mathrm{ex}}} \newcommand{\exOnto}{\onto_{\mathrm{ex}}} \newcommand{\exMappings}{\mappings_{\mathrm{ex}}} \newcommand{\exExtensions}{\extensions_{\mathrm{ex}}} \newcommand{\exRule}{r_{\mathrm{ex}}} \newcommand{\RDFSrules}{\rules_{\mathrm{RDFS}}} %% RDF \newcommand{\triple}[3]{(#1, #2, #3)} \newcommand{\tuple}[1]{\langle #1 \rangle} \newcommand{\subject}{\mathtt{s}} \newcommand{\prop}{\mathtt{p}} \newcommand{\object}{\mathtt{o}} \newcommand{\blank}{\_{:}b} \newcommand{\blankn}[1]{\_{:}#1} \newcommand{\irin}[1]{{:}\mathrm{#1}} \newcommand{\class}{\mathtt{c}} \newcommand{\nsrdf}{\mathrm{rdf{:}}} \newcommand{\nsrdfs}{\mathrm{rdfs{:}}} \newcommand{\rdftype}{\mathrm{rdf{:}type}} \newcommand{\rdfLiteral}{\mathrm{rdf{:}Literal}} \newcommand{\rdfssubClassOf}{\mathrm{rdfs{:}subClassOf}} \newcommand{\rdfssubPropertyOf}{\mathrm{rdfs{:}subPropertyOf}} \newcommand{\rdfsdomain}{\mathrm{rdfs{:}domain}} \newcommand{\rdfsrange}{\mathrm{rdfs{:}range}} \newcommand{\rdfsClass}{\mathrm{rdfs{:}Class}} \newcommand{\rdfProperty}{\mathrm{rdf{:}Property}} \newcommand{\xsdint}{\mathrm{xsd{:}int}} %% \newcommand{\type}{\tau} \newcommand{\subclass}{\prec_{sc}} \newcommand{\subproperty}{\prec_{sp}} \newcommand{\domain}{\hookleftarrow_{d}} \newcommand{\range}{\hookrightarrow_{r}} \newcommand{\rdfentailment}{\vdash_{^\mathrm{RDF}}} \newcommand{\RDFS}[1]{\mathrm{RDFS}(#1)} \newcommand{\aka}{a.k.a.~} \newcommand{\etc}{etc} \newcommand{\wrt}{w.r.t.~} \newcommand{\st}{s.t.~} \newcommand{\ie}{i.e.,~} \newcommand{\eg}{e.g.,~} \newcommand{\graph}{G} \newcommand{\rules}{\mathcal{R}} \newcommand{\sources}{\mathcal{S}} \newcommand{\views}{\mathcal{V}} \newcommand{\extensions}{\mathcal{E}} \newcommand{\onto}{\mathcal{O}} \newcommand{\mappings}{\mathcal{M}} \newcommand{\modelsrdf}{\models_\rules} \newcommand{\bgp}{P} \newcommand{\Bl}[1]{\mathrm{Bl}(#1)} \newcommand{\Val}[1]{\mathrm{Val}(#1)} \newcommand{\Var}[1]{\mathrm{Var(#1)}} \newcommand{\ext}[1]{\mathrm{ext}(#1)} \newcommand{\cert}{\mathrm{cert}} \newcommand{\ans}{\mathrm{ans}} \newcommand{\query}{\leftarrow} \newcommand{\body}[1]{\textrm{body}(#1)} \newcommand{\head}[1]{\textrm{head}(#1)} \newcommand{\cs}{\mathrm{cs}} \newcommand{\lcs}{\mathrm{lcs}} \newcommand{\cl}{\mathrm{cl}} \newcommand{\lua}{\mathrm{lua}} \newcommand{\lur}{\mathrm{lur}} \newtheorem{lemma}{Lemma} \newtheorem{definition}{Definition} \newtheorem{problem}{Problem} \newtheorem{property}{Property} \newtheorem{corollary}{Corollary} \newtheorem{example}{Example} \newtheorem{theorem}{Theorem} \newcommand{\URIs}{\mathscr U} \newcommand{\IRIs}{\mathscr I} \newcommand{\BNodes}{\mathscr B} \newcommand{\Literals}{\mathscr L} \newcommand{\Variables}{\mathscr V} % DB \newcommand{\CQ}{\ensuremath{\mathtt{CQ}}\xspace} \newcommand{\UCQ}{\ensuremath{\mathtt{UCQ}}\xspace} \newcommand{\SQL}{\ensuremath{\mathtt{SQL}}\xspace} \newcommand{\rel}[1]{\mathsf{#1}} % Cost model \newcommand{\cans}[1]{|#1|_t} \newcommand{\cref}[1]{|#1|_r} \newcommand{\db}{\mathtt{db}} % DL \newcommand{\cn}{\ensuremath{N_{C}}\xspace} \newcommand{\rn}{\ensuremath{N_{R}}\xspace} \newcommand{\inds}{\ensuremath{N_{I}}\xspace} \newcommand{\ainds}{\ensuremath{\mathrm{Ind}}\xspace} \newcommand{\funct}{\mathit{funct} \ } \newcommand{\KB}{\mathcal{K}\xspace} \newcommand{\dlr}{DL-Lite$_{\mathcal{R}}$\xspace} % Logics \newcommand{\FOL}{\ensuremath{\mathtt{FOL}}\xspace} \newcommand{\datalog}{\ensuremath{\mathtt{Datalog}}\xspace} \newcommand{\dllite}{DL-Lite\xspace} \newcommand{\true}{\mathrm{true}} \newcommand{\false}{\mathrm{false}} \newcommand{\dis}{\mathtt{dis}} \newcommand{\vars}[1]{\ensuremath{\mathrm{vars}(#1)}} %\newcommand{\terms}[1]{\ensuremath{\mathrm{terms}(#1)}} %math \renewcommand{\phi}{\varphi} \newcommand\eqdef{\stackrel{\mathclap{\normalfont\mbox{def}}}{=}} \newcommand\restr[2]{#1_{|#2}} \newcommand{\ontoBody}[1]{\mathrm{body}_\onto(#1)} %proof of the rewriting theorem \newcommand{\rdfGraph}{\graph^{\mappings}_{\extensions}} \newcommand\systemGraph{\graph^{\mappings \cup \mappings^{\text{STD}}_\onto}_{\extensions \cup \extensions_\onto}} \newcommand\viewsGraph{\graph^{\mappings^{\rules,\onto} \cup \mappings^{\text{STD}}_\onto}_{\extensions \cup \extensions_\onto}} \newcommand{\standMappings}{\mappings^{\text{STD}}_\onto} \newcommand{\reminder}[1]{[\vadjust{\vbox to0pt{\vss\hbox to0pt{\hss{\Large $\Longrightarrow$}}}}{{\textsf{\small #1}}}]} %\newcommand{\FG}[1]{\textcolor{blue}{\reminder{FG:~#1}}} \newcommand{\extVersion}{false} \newcommand{\printIfExtVersion}[2] { \ifthenelse{\equal{\extVersion}{true}}{#1}{} \ifthenelse{\equal{\extVersion}{false}}{#2}{} } \newcommand{\bda}{\true} \newcommand{\ifBDA}[2]% {% \ifthenelse{\equal{\bda}{true}}{#1}{}% \ifthenelse{\equal{\bda}{false}}{#2}{}% } %%% Local Variables: %%% TeX-master: "paper" %%% End: $$

Star Wars RDF Integration Demo

This post describes the data loading of the RDF Integration System (RIS) about Star Wars used in Obi-Wan. The RIS is build on top of three complementary data sources in which only Star Wars data are been kept:

  1. IMDB partially loads in a relational database (Postgres),
  2. Wikidata partially loads in RDF graph database (Jena TDB),
  3. Star Wars API documents loads in JSON documents database (MongoDB).

IMDB

We downloaded the IMDB dumps at 12/03/2020 from https://www.imdb.com/interfaces/#plain. The imdb lacks of character identifiers since they decide to remove it from the database, but they kept the character names. In the demo, we use the IMDB data about Star Wars movies, their writers, directors, producer, actors with their character names.

Data Cleaning

We generate clean data tuple files for the loading:

  • we create a new table for the genre
  • we denormalized the person table, containing a array of character names for each title and actor identifiers, by introducing a tuple for each character name (the pair title,actor is not a key anymore)

    import csv
    
    with open('title.basics.tsv', newline='') as tsvin, open('genre.tsv', 'w', newline='') as genreout, open('title.tsv', 'w', newline='') as titleout:
        tsvin = csv.reader(tsvin, delimiter='\t', quoting=csv.QUOTE_NONE)
        genreout = csv.writer(genreout, delimiter='\t')
        titleout = csv.writer(titleout, delimiter='\t')
    
        iterrow = iter(tsvin)
        next(iterrow)
        for row in iterrow:
            ins = row.copy()
            ins.insert(0, ins[0][2:])
            ins = ins[:-1]
            titleout.writerow(ins)
            if row[8] != '\\N':
                title_id = row[0]
                genres = row[8].split(',')
                for genre in genres:
                    genreout.writerow([title_id, genre])
    
    with open('title.principals.tsv', newline='') as tsvin, open('principals.tsv', 'w', newline='') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t', quoting=csv.QUOTE_NONE)
        csvout = csv.writer(csvout, delimiter='\t')
    
        iterrow = iter(tsvin)
        next(iterrow)
        for row in iterrow:
            if row[5] != '\\N':
                names = row[5][1:-1].split(',')
                for name in names:
                    ins = row.copy()
                    ins[5] = name[1:-1]
                    csvout.writerow(ins)
            else:
                csvout.writerow(row)
    
    with open('name.basics.tsv', newline='') as tsvin, open('name.tsv', 'w', newline='') as csvout:
        tsvin = csv.reader(tsvin, delimiter='\t', quoting=csv.QUOTE_NONE)
        csvout = csv.writer(csvout, delimiter='\t')
    
        iterrow = iter(tsvin)
        next(iterrow)
        for row in iterrow:
            csvout.writerow(row[:-2])
    

Loading

CREATE DATABASE imdb;

DROP TABLE IF EXISTS title;
CREATE TABLE title (
id integer primary key,
tconst varchar(100) default NULL,
titleType varchar(100) default NULL,
primaryTitle varchar(2000) default NULL,
originalTitle varchar(2000) default NULL,
isAdult integer,
startYear integer,
endYear integer,
runtimeMinutes integer);

COPY title FROM '/home/mburon/téléchargements/data/imdb/title.tsv' WITH DELIMITER E'\t';

CREATE INDEX on title (tconst);

DROP TABLE IF EXISTS genre;
CREATE TABLE genre (
tconst varchar(100),
genre varchar(100) default NULL);

COPY genre FROM '/home/mburon/téléchargements/data/imdb/genre.tsv' WITH DELIMITER E'\t';

CREATE INDEX on genre (tconst);

DROP TABLE IF EXISTS casting;
CREATE TABLE casting (
tconst varchar(100),
ordering integer default NULL,
nconst varchar(100),
category varchar(100) default NULL,
job varchar(2000) default NULL,
character varchar(2000) default NULL);

COPY casting FROM '/home/mburon/téléchargements/data/imdb/principals.tsv' WITH DELIMITER E'\t';

CREATE INDEX on casting (tconst);
CREATE INDEX on casting (nconst);

DROP TABLE IF EXISTS person;
CREATE TABLE person (
nconst varchar(100),
primaryName varchar(1000) default NULL,
birthYear integer,
deathYear integer);

COPY person FROM '/home/mburon/téléchargements/data/imdb/name.tsv' WITH DELIMITER E'\t';
CREATE INDEX on person (nconst);

Star Wars Tables

select * from title where tconst in ('tt0076759', 'tt0080684', 'tt0086190', 'tt0120915', 'tt0121765', 'tt0121766', 'tt2488496', 'tt2527336', 'tt2527338', 'tt3748528');
id tconst titletype primarytitle originaltitle isadult startyear endyear runtimeminutes
86190 tt0086190 movie Star Wars: Episode VI - Return of the Jedi Star Wars: Episode VI - Return of the Jedi 0 1983   131
76759 tt0076759 movie Star Wars: Episode IV - A New Hope Star Wars 0 1977   121
80684 tt0080684 movie Star Wars: Episode V - The Empire Strikes Back Star Wars: Episode V - The Empire Strikes Back 0 1980   124
120915 tt0120915 movie Star Wars: Episode I - The Phantom Menace Star Wars: Episode I - The Phantom Menace 0 1999   136
121765 tt0121765 movie Star Wars: Episode II - Attack of the Clones Star Wars: Episode II - Attack of the Clones 0 2002   142
121766 tt0121766 movie Star Wars: Episode III - Revenge of the Sith Star Wars: Episode III - Revenge of the Sith 0 2005   140
2488496 tt2488496 movie Star Wars: Episode VII - The Force Awakens Star Wars: Episode VII - The Force Awakens 0 2015   138
2527336 tt2527336 movie Star Wars: Episode VIII - The Last Jedi Star Wars: Episode VIII - The Last Jedi 0 2017   152
2527338 tt2527338 movie Star Wars: Episode IX - The Rise of Skywalker Star Wars: Episode IX - The Rise of Skywalker 0 2019   142
3748528 tt3748528 movie Rogue One: A Star Wars Story Rogue One 0 2016   133
CREATE TABLE swTitle AS 
select * from title where tconst in ('tt0076759', 'tt0080684', 'tt0086190', 'tt0120915', 'tt0121765', 'tt0121766', 'tt2488496', 'tt2527336', 'tt2527338', 'tt3748528');
CREATE TABLE swGenre AS 
select genre.tconst, genre.genre from genre, swTitle where genre.tconst = swTitle.tconst;
DROP TABLE swCasting;
CREATE TABLE swCasting AS 
select distinct casting.tconst, casting.ordering, casting.nconst, casting.category, casting.job, casting.character from casting, swTitle where casting.tconst = swTitle.tconst;

insert missing values in the casting

INSERT INTO swCasting 
VALUES ('tt0086190', 11, 'nm0000469', 'actor', NULL, 'Darth Vader');
INSERT INTO swCasting 
VALUES ('tt0076759', 11, 'nm0000469', 'actor', NULL, 'Darth Vader');
INSERT INTO swCasting 
VALUES ('tt0080684', 11, 'nm0000469', 'actor', NULL, 'Darth Vader');
INSERT INTO swCasting 
VALUES ('tt3748528', 11, 'nm1872855', 'actor', NULL, 'Darth Vader');
Drop table swPerson;
CREATE TABLE swPerson AS 
select distinct person.nconst, person.primaryname, person.birthyear, person.deathyear from person, swCasting where swCasting.nconst = person.nconst;
select * from swCasting;

Wikidata

We downloaded some triples from Wikidata about Star Wars characters (e.g. Obi-Wan Kenobi), like their performers, their organisation memberships, their relation with others characters (student of, enemy of, child, spouse).

>sw.full.nt

while read i; do
    if [[ -n $i ]] && [[ ${i:0:1} != "#" ]]
    then
       curl https://www.wikidata.org/wiki/Special:EntityData/${i}.nt >> sw.full.nt
    fi
done <wikidata-sw-entity

We keep only certain interesting properties.

Star Wars API

We downloaded also the JSON documents from a Star Wars API, describing character species, spaceships and their pilots etc.

Download

for i in {1..75};
do 
curl https://swapi.co/api/starships/$i/ >> starships.json
done

for i in {1..75};
do 
curl https://swapi.co/api/vehicles/$i/ >> vehicles.json
done

for i in {1..87};
do
curl https://swapi.co/api/people/$i/ >> people.json
done

for i in {1..37};
do
curl https://swapi.co/api/species/ >> species.json
done

Insertion of People IRI

We insert formated name of people to create IRI of the form https://www.starwars.com/databank/luke-skywalker

import json
from json import JSONDecoder, JSONDecodeError
import re

NOT_WHITESPACE = re.compile(r'[^\s]')

def decode_stacked(document, pos=0, decoder=JSONDecoder()):
    while True:
        match = NOT_WHITESPACE.search(document, pos)
        if not match:
            return
        pos = match.start()

        try:
            obj, pos = decoder.raw_decode(document, pos)
        except JSONDecodeError:
            # do something sensible if there's some error
            raise
        yield obj

with open('people.json', newline='') as f, open('people-iri.json', 'w', newline='') as out:
    for obj in decode_stacked(f.read()):
        if 'name' in obj:
            formated_name = obj['name'].lower().replace(" ", "-")
            obj['formatedName'] = formated_name
            json.dump(obj, out)
            out.write('\n')

Loading in MongoDB

We insert 37 starships and 86 people

mongoimport --drop -c=people -d=swapi --mode=merge --file=people.json
mongoimport --drop -c=starships -d=swapi --mode=merge --file=starships.json
mongoimport --drop -c=vehicles -d=swapi --mode=merge --file=vehicles.json
mongoimport --drop -c=species -d=swapi --mode=merge --file=species.json