== Contextualizing knowledge graphs ==
=== Key Question: Research connections between knowledge base provenance and NLP ===
==== Background ====
Knowledge bases or graphs (KBs/KGs) are typically stored as subject-predicate-object triples, often using formats like RDF (Resource Description Framework) and OWL (Web Ontology Language). By default, these triple-based formats do not include provenance data, which covers information such as the source a triple was extracted from, the time of extraction, modification history, and evidence for the triple, nor any other additional context. But contextual information can improve the overall utility of a KG/KB, and is particularly helpful to capture when scholars are synthesizing literature into such structures (either individually or collectively).
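As a point of reference, here is a minimal sketch (assuming the Python rdflib library, with hypothetical example entities) of how a bare triple looks when no provenance is attached:
<syntaxhighlight lang="python">
# A bare subject-predicate-object triple as stored in a typical RDF KG.
# Nothing in the triple itself records where the fact came from, when it
# was extracted, or what evidence supports it.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace for illustration
g = Graph()

# "Metformin treats type 2 diabetes": the triple alone has no provenance slot.
g.add((EX.Metformin, EX.treats, EX.Type2Diabetes))

print(g.serialize(format="turtle"))
</syntaxhighlight>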
==== What can additional context help with? ====
* Authority or trustworthiness of information
* Transparency of the process leading to the conclusion
* Validation and review of experiments conducted to attain a result
Property graphs (also called attribute graphs), in which both entities and relations can carry attributes, are one representation that supports this kind of context natively; a rough sketch follows.
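A sketch of the property-graph idea (assuming the Python networkx library; the entities, source, and attribute names are made up for illustration):
<syntaxhighlight lang="python">
# In a property graph, both nodes and edges carry arbitrary key-value
# attributes, so provenance can live directly on the relation itself.
import networkx as nx

G = nx.MultiDiGraph()
G.add_node("Metformin", type="Drug", added_by="curator_1")
G.add_node("Type2Diabetes", type="Disease")

# The edge holds provenance attributes alongside the relation label.
G.add_edge(
    "Metformin", "Type2Diabetes",
    relation="treats",
    source="doi:10.1000/example",  # hypothetical source identifier
    extracted_on="2022-11-13",
    confidence=0.92,
)

for u, v, attrs in G.edges(data=True):
    print(u, attrs["relation"], v, attrs)
</syntaxhighlight>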
==== Ways to insert additional context into a KG ====
These can broadly be distilled into the following strategies: increasing the number of nodes in a triple, allowing "nodes" to be triples themselves, adding attributes or descriptions to nodes, and adding attributes to triples (attributes may or may not be treated the same as predicates). Concrete mechanisms include:
* RDF reification: add additional RDF triples that link a subject, predicate, or object entity with its source (or any other provenance attribute of interest). This leads to bloating: roughly 4x the number of triples if you want to store information about the subject, predicate, object, and overall triple (see the sketch after this list).
* N-ary relations: instead of binary relations, allow more entities (beyond the subject and object) to participate in the same relation. This can complicate traversal of the knowledge graph.
* Adding metadata property-value pairs to RDF triples.
* Forming RDF molecules, where subjects/objects/predicates are themselves RDF triples (allowing recursion).
* Using two named graphs: a default/assertion graph stores the data, while another graph records the provenance.
* Nanopublications, which seem very similar in structure to the discourse graph format.
Most of these representations are text-only; what about multimodal provenance (images, toy datasets, audio and video recordings)?
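A minimal sketch of the RDF reification strategy above (assuming rdflib; the entities, source, and date are hypothetical). Note how a single asserted triple needs four reification triples plus the provenance statements themselves, which is the bloat mentioned above:
<syntaxhighlight lang="python">
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

# The original assertion.
g.add((EX.Metformin, EX.treats, EX.Type2Diabetes))

# A reified statement node that points back at subject, predicate and object.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.Metformin))
g.add((stmt, RDF.predicate, EX.treats))
g.add((stmt, RDF.object, EX.Type2Diabetes))

# Provenance can now hang off the statement node.
g.add((stmt, EX.extractedFrom, EX.SomePaper))
g.add((stmt, EX.extractedOn, Literal("2022-11-13")))

print(g.serialize(format="turtle"))
</syntaxhighlight>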
==== At what levels can we have additional context? ====
[[/link.springer.com/content/pdf/10.1007/s41019-020-00118-0.pdf|[1]]] provides a review (with a slight focus on the cybersecurity domain) and suggests the following granularities at which provenance/context information can be recorded:
* Element-level provenance: how certain entity or predicate types were defined
* Triple-level provenance: the most common level of provenance
* Molecule-level and graph-level provenance: provenance information for composites or collections of RDF triples, but not for individual triples (see the named-graph sketch below)
* Document-level and corpus-level provenance: provenance information only for the documents or corpora from which triples are identified
Scalability is a concern when deciding the level and extent to which provenance data is retained.
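As an illustration of graph-level provenance, here is a sketch of the named-graph pattern (assuming rdflib's Dataset; the graph names and entities are hypothetical): one named graph holds the domain triples, and a second graph describes the first one as a whole:
<syntaxhighlight lang="python">
from rdflib import Dataset, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")
ASSERTION = URIRef("http://example.org/graphs/assertion1")
ds = Dataset()

# Named assertion graph containing the domain facts.
assertion = ds.graph(ASSERTION)
assertion.add((EX.Metformin, EX.treats, EX.Type2Diabetes))

# Named provenance graph describing the assertion graph as a whole.
provenance = ds.graph(URIRef("http://example.org/graphs/provenance1"))
provenance.add((ASSERTION, EX.derivedFrom, EX.SomePaper))
provenance.add((ASSERTION, EX.extractedOn, Literal("2022-11-13")))

print(ds.serialize(format="trig"))
</syntaxhighlight>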
==== What kinds of information does provenance cover? ====
Domain-agnostic information types:
* Creation of the data
* Publication of the data
* Distribution of the data
* Copyright information
* Modification and revision history
* Current validity
Domain-specific provenance ontologies have been defined for the BBC, the US Government, and scientific experiments.
DeFacto tool for deep fact validation on the web: [[/github.com/DeFacto/DeFacto|https://github.com/DeFacto/DeFacto]]
==== What additional context beyond provenance could be useful? ====
* Concept drift over time (more generally, entity/relation evolution over time)
* Discourse links across papers don't seem to fit neatly under the provenance definition; are they an aspect we want to capture?
* Are there other aspects that provenance doesn’t cover?
==== Links to NLP research ====
The biggest open questions seem to be scalability vs. expressivity, multiple modalities, and keeping provenance updated.
Scoring functions in KRL: in representation learning for KGs, models are trained to embed entities and relations in a common space and then score (subject, predicate, object) triples according to their likelihood. Can these scores be used as additional "plausibility" or "validity" metadata from a provenance perspective? However, these scores are usually produced by models that learn to replicate the structure of the knowledge graph, so they cannot link back to any textual evidence to support their judgements.
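For concreteness, a minimal numpy sketch of one common KRL scoring function (a TransE-style translation score). The embeddings here are random placeholders rather than a trained model, so the score only illustrates the mechanism, not an actual plausibility judgement:
<syntaxhighlight lang="python">
# TransE-style scoring: a (subject, predicate, object) triple is scored by how
# well subject + relation approximates object in the shared embedding space.
import numpy as np

rng = np.random.default_rng(0)
dim = 50

# Toy lookup tables mapping entity/relation ids to (random) embedding vectors.
entity_emb = {e: rng.normal(size=dim) for e in ["metformin", "type2_diabetes"]}
relation_emb = {r: rng.normal(size=dim) for r in ["treats"]}

def transe_score(subj, pred, obj):
    """Higher (less negative) score = more plausible triple under the model."""
    s, p, o = entity_emb[subj], relation_emb[pred], entity_emb[obj]
    return -float(np.linalg.norm(s + p - o))

print(transe_score("metformin", "treats", "type2_diabetes"))
</syntaxhighlight>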
Auxiliary knowledge in graphs: [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9416312 <nowiki>[2]</nowiki>] lists some work that stores textual descriptions alongside entities, entity images (image-embodied IKRL) and uncertainty score information (ProBase, NELL and ConceptNet).
Temporal knowledge graphs: [2] is a survey of knowledge graphs from an NLP perspective and highlights temporal knowledge graphs as a key area of research. Temporality is one kind of provenance information. Current directions being explored include:
* Temporal embeddings: extending existing KRL methods to quadruples instead of triples, where time is the fourth element (a small sketch follows this list)
* Entity dynamics: changes in entity states over time
* Temporal relational dependence (e.g., a person must be born before dying)
* Temporal logic reasoning: graphs in which temporal information is uncertain.
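The sketch below extends the TransE-style score above from triples to temporal quadruples (subject, predicate, object, time). Adding a timestamp embedding to the translation is one way this is done in the literature; treat the exact formulation here as an assumption, and the embeddings are again random placeholders:
<syntaxhighlight lang="python">
# Temporal quadruple scoring: the timestamp gets its own embedding and shifts
# the translation, so the same fact can score differently at different times.
import numpy as np

rng = np.random.default_rng(0)
dim = 50

entity_emb = {e: rng.normal(size=dim) for e in ["obama", "usa"]}
relation_emb = {r: rng.normal(size=dim) for r in ["president_of"]}
time_emb = {t: rng.normal(size=dim) for t in ["2009", "2021"]}

def temporal_score(subj, pred, obj, time):
    """Score of an (s, p, o, t) quadruple under a translation-style model."""
    s, o = entity_emb[subj], entity_emb[obj]
    p, t = relation_emb[pred], time_emb[time]
    return -float(np.linalg.norm(s + p + t - o))

print(temporal_score("obama", "president_of", "usa", "2009"))
print(temporal_score("obama", "president_of", "usa", "2021"))
</syntaxhighlight>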
Knowledge acquisition: some work has explored jointly extracting entities/relations from unstructured text and performing knowledge graph completion. This could be a useful way to jointly model unstructured text from previous papers that have no associated graphs alongside papers for which discourse graphs have been annotated. In general, joint graph+text embedding techniques seem useful. Additionally, distant supervision, a technique often used to train relation extraction models, could probably be leveraged to detect and add provenance information post hoc. The idea behind distant supervision is to scan text for sentences that mention both entities involved in a relation and treat such sentences as potential evidence. But this is a noisy heuristic (a sentence mentioning two connected entities may not express the relation between them) and will likely require some curation.
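A toy sketch of the distant-supervision heuristic just described: sentences that mention both entities of a known relation are collected as candidate evidence (and hence candidate provenance). The naive substring matching here is exactly the kind of noise noted above; the document and entities are made up:
<syntaxhighlight lang="python">
import re

def candidate_evidence(text, subject, obj):
    """Return sentences mentioning both entities of a (subject, _, object) triple."""
    # Crude sentence splitting; a real pipeline would use a proper tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [
        s for s in sentences
        if subject.lower() in s.lower() and obj.lower() in s.lower()
    ]

doc = ("Metformin is widely prescribed. Several trials report that metformin "
       "lowers blood glucose in type 2 diabetes. Diet also matters.")
print(candidate_evidence(doc, "metformin", "type 2 diabetes"))
</syntaxhighlight>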
Interpretability or transparency: Provenance information can be used to make KGs more transparent, because we have sources that provide evidence for why this fact/relation holds true.
Commonsense angle: Another broader application where storing context might be helpful could be for commonsense knowledge graphs?
[[/link.springer.com/chapter/10.1007/978-3-642-13818-8_32|Application to finding evidence for clinical KBs]]
==== Important contexts to be captured for synthesis of scientific literature ====
[[/dl.acm.org/doi/pdf/10.1145/3227609.3227689|[3]]] proposes a framework to construct a knowledge graph for science. Problems, methods and results are key semantic resources, in addition to people, publications, institutions, grants, and datasets. Four modes of input:
# Bridge existing KGs, ontologies, metadata and information models
# Enable scientists to add their research possibly with some intelligent interface support
# Use automated methods for information extraction and linking
# Support curation and quality assurance by domain experts and librarians
The underlying data model that supports this endeavor can adopt RDF or Linked Data as a scaffold but must add comprehensive provenance, evolution, and discourse information. Other requirements include the ability to store and query graphs at scale, UI widgets that support collaborative authoring and curation, and integration with semi-automated methods for extraction, search, and recommendation.
Top-level ontology: a research contribution communicates one or more results in an attempt to address a problem using a set of methods (a small sketch of this structure follows). Top-down design of more specific elements is needed from domain experts across a range of domains.
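A small sketch of that top-level structure as RDF triples (assuming rdflib; the class and property names are hypothetical stand-ins, not the framework's actual vocabulary):
<syntaxhighlight lang="python">
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

SCI = Namespace("http://example.org/sci#")
g = Graph()

# One research contribution linking a problem, a method, and a result.
contribution = SCI.Contribution_42
g.add((contribution, RDF.type, SCI.ResearchContribution))
g.add((contribution, SCI.addressesProblem, SCI.Problem_KG_Provenance))
g.add((contribution, SCI.usesMethod, SCI.Method_RDF_Reification))
g.add((contribution, SCI.communicatesResult, SCI.Result_TripleBloat))

print(g.serialize(format="turtle"))
</syntaxhighlight>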
Infrastructure: KG can be queried and explored by anyone, but registration is required to contribute to the KG (a way to ensure proper authorization and domain expertise?)
KGs need to capture fuzziness: competing evidence, differing definitions and conceptualizations
Interesting question: could contributions to the scientific KG be scored directly, instead of relying on document-centric metrics like h-indices (potentially a more accurate assessment of contributions)?
[https://www.researchgate.net/profile/Soeren-Auer/publication/330751750_Open_Research_Knowledge_Graph_Towards_Machine_Actionability_in_Scholarly_Communication/links/5c5e9bc7a6fdccb608b28f6f/Open-Research-Knowledge-Graph-Towards-Machine-Actionability-in-Scholarly-Communication.pdf <nowiki>[4]</nowiki>] has a really nice analogy: when phone books and maps were digitally transformed, we didn't just create PDF versions of them; instead, we developed new means to organize and access the information. Why not push for such a reorganization in scholarly communication too? Building on [3], their ontology also contains: problem statement, approach, evaluation, and conclusions. They use RDF with a minor difference: everything can be modeled as an entity with a unique ID. A limited subset of OWL (subclass inference) is supported. They ran a small user study with this tool, but it doesn't seem like Dafna's questions about the utility of a KG constructed by someone else, or about consensus across people, were measured.
Note: Both papers actually reference a fair number of other works in this direction, which could be good follow-up reading
=== Papers I didn’t get to ===
* [https://pdfs.semanticscholar.org/713b/b398b85b034f2139e08b4ca0f7791fd545bc.pdf?_ga=2.19474043.2123037593.1668226180-648530244.1660769576 Empirical study of RDF reification]
* [[/dl.acm.org/doi/pdf/10.1145/3459637.3482330|Maintaining provenance in uncertain KGs]], [[/arxiv.org/pdf/2007.14864.pdf|maintaining provenance in dynamic KGs]]
* [[/kaixu.files.wordpress.com/2011/09/provtrack-ickde-submission.pdf|System for provenance exploration]]
* [[/www.semanticscholar.org/paper/iKNOW:-A-platform-for-knowledge-graph-construction-Babalou-Costa/3a492a7b2d04221941919f80ca259506866490b3|iKNOW: KG construction for biodiversity]]
* [[/eprints.soton.ac.uk/412923/1/WD sources iswc 7 .pdf|Provenance information in a collaborative environment: evaluation of Wikipedia external references]]
* [[/arxiv.org/pdf/2210.14846.pdf|ProVe: Automated provenance verification of KGs against text]]


== Understanding knowledge graph transfer ==