
Computable Graphs

(notes from day 2)
DeFacto tool for deep fact validation on the web: [https://github.com/DeFacto/DeFacto https://github.com/DeFacto/DeFacto]


==== [[question]] What additional context beyond provenance could be useful? ====


* Concept drift over time (in general entity/relation evolution over time)


==== Links to NLP research ====


[[question]] What are the points of overlap between subproblems around scholarly knowledge graph provenance and existing NLP research?


(Probably a rich opportunity for a position paper - could be interesting to NLP venues, e.g., the [[NeuRIPS AI for science workshop]]; possible contact: Kexin Huang)


* Biggest open questions seem to be:
** scalability vs. expressivity, multiple modalities, and keeping graphs updated
** Interpretability or transparency: Provenance information can be used to make KGs more transparent, because we have sources that provide evidence for why this fact/relation holds true.
** Commonsense angle: storing context might also be helpful for commonsense knowledge graphs?
*** Might have an analog in terms of "scholarly" (un)common sense, esp. across disciplinary boundaries (but this also has analogs in cross-cultural barriers to communication)
* Subproblems that map to existing bodies of work:
** Scoring functions in KRL: In representation learning for KGs, models are trained to embed entities and relations in a common space and then score (subject, predicate, object) triples according to likelihood - can these scores be used as additional “plausibility” or “validity” metadata from a provenance perspective? (A rough sketch of such a scoring function appears just after this list.) However, these scores are usually produced by models that are learning to replicate the structure of the knowledge graph, so they can’t link back to any textual evidence to support the judgements.
** Auxiliary knowledge in graphs: [https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9416312 <nowiki>[2]</nowiki>] lists some work that stores textual descriptions alongside entities, entity images (image-embodied IKRL) and uncertainty score information (ProBase, NELL and ConceptNet).
** Temporal knowledge graphs: [2] is a survey of knowledge graphs from an NLP perspective and highlights temporal knowledge graphs as a key area of research. Temporality is one kind of provenance information. Current directions being explored include
*** Temporal embeddings: extending existing KRL methods for quadruples instead of triples, where time is the fourth entity
*** Entity dynamics: changes in entity states over time
*** Temporal relational dependence (e.g., a person must be born before dying)
*** Temporal logic reasoning: graphs in which temporal information is uncertain.
** Knowledge acquisition: Some work has explored jointly extracting entities/relations from unstructured text and performing knowledge graph completion. This can be a useful way to jointly model unstructured text from previous papers that do not have associated graphs alongside papers for which discourse graphs have been annotated. In general, joint graph+text embedding techniques seem useful. Additionally, distant supervision, a technique often used to train models to extract relations from text, can probably be leveraged to detect and add provenance information post hoc. The idea behind distant supervision is to scan text and find sentences that mention both entities involved in the relation, treating such sentences as potential evidence sentences (see the second sketch below this list). But this is a noisy heuristic - a sentence mentioning two connected entities may not be expressing the relation between them - and will likely require some curation.
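
Rough sketch of what such a scoring function looks like - a DistMult-style bilinear score over (subject, predicate, object), plus a quadruple/temporal variant along the lines of the temporal-embeddings bullet above. Everything here (toy vocabulary, dimensions, untrained random embeddings, the way time modulates the relation) is illustrative only and not taken from any of the papers above; the point is just that the resulting number could be attached to an edge as plausibility metadata, without any link back to textual evidence.
<syntaxhighlight lang="python">
import numpy as np

# DistMult-style scoring sketch: entities and relations live in a shared
# vector space; a (subject, predicate, object) triple is scored with a
# bilinear product. Higher score = "more plausible" according to the model.
DIM = 64  # embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Hypothetical toy vocabularies; a real KG has thousands of entities/relations.
entities = ["creatine", "cognition", "working memory"]
relations = ["improves", "measured_by"]
ent_emb = {e: rng.normal(size=DIM) for e in entities}
rel_emb = {r: rng.normal(size=DIM) for r in relations}

def score_triple(s, p, o):
    """DistMult score: sum_i e_s[i] * w_p[i] * e_o[i]."""
    return float(np.sum(ent_emb[s] * rel_emb[p] * ent_emb[o]))

def score_quadruple(s, p, o, t, time_emb):
    """Temporal sketch: time as a fourth element, folded in by modulating the
    relation embedding (one of several options explored in temporal-KG work)."""
    return float(np.sum(ent_emb[s] * (rel_emb[p] * time_emb[t]) * ent_emb[o]))

# Attach the (untrained, hence meaningless) score as edge metadata.
edge = {"s": "creatine", "p": "improves", "o": "cognition"}
edge["plausibility"] = score_triple(edge["s"], edge["p"], edge["o"])
time_emb = {2022: rng.normal(size=DIM)}
edge["plausibility_2022"] = score_quadruple(edge["s"], edge["p"], edge["o"], 2022, time_emb)
print(edge)
</syntaxhighlight>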


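And a rough sketch of the distant-supervision heuristic from the knowledge-acquisition bullet: scan a document for sentences that mention both entities of a triple and record them as ''candidate'' evidence/provenance to be curated. The sentence splitting and plain string matching are deliberate simplifications assumed here for illustration; a real pipeline would use proper NER and entity linking.
<syntaxhighlight lang="python">
import re

def split_sentences(text):
    # Naive sentence splitter; a real pipeline would use an NLP library.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def candidate_evidence(triple, text):
    """Return sentences mentioning both the subject and object of the triple.
    Noisy by design: co-mention does not imply the relation is expressed."""
    subject, _predicate, obj = triple
    return [
        sent for sent in split_sentences(text)
        if subject.lower() in sent.lower() and obj.lower() in sent.lower()
    ]

doc = ("Creatine supplementation improved working memory in the treatment group. "
       "The control group received a creatine-free placebo. "
       "Cognition was assessed with a digit-span task.")

# Hypothetical triple we want provenance sentences for; only the first
# sentence co-mentions both entities, so only it comes back as a candidate.
print(candidate_evidence(("creatine", "improves", "working memory"), doc))
</syntaxhighlight>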


[https://link.springer.com/chapter/10.1007/978-3-642-13818-8_32 Application to finding evidence for clinical KBs]
Interesting question around scoring contributions to the scientific KG instead of using document-centric metrics like h-indices (potentially a more accurate assessment of contributions)
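Very rough sketch of what a graph-centric contribution score might look like, as opposed to counting documents: credit a contributor for the nodes they added to the scientific KG, weighted by how often other people's nodes link to them (a crude proxy for reuse). Node types, attribute names, and the weighting scheme are all made up for illustration, not from any existing system.
<syntaxhighlight lang="python">
import networkx as nx

# Toy scientific KG with per-node provenance: who contributed each node.
G = nx.DiGraph()
G.add_node("claim:1", contributor="alice")
G.add_node("evidence:1", contributor="alice")
G.add_node("claim:2", contributor="bob")
G.add_edge("evidence:1", "claim:1", relation="supports")
G.add_edge("claim:2", "claim:1", relation="builds_on")

def contribution_score(graph, contributor):
    score = 0.0
    for node, data in graph.nodes(data=True):
        if data.get("contributor") != contributor:
            continue
        # 1 point for contributing the node, plus 1 per incoming link from a
        # node contributed by someone else (i.e., reuse by others).
        reuse = sum(
            1 for pred in graph.predecessors(node)
            if graph.nodes[pred].get("contributor") != contributor
        )
        score += 1 + reuse
    return score

print(contribution_score(G, "alice"))  # 2 nodes + 1 cross-contributor link = 3.0
</syntaxhighlight>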


[https://www.researchgate.net/profile/Soeren-Auer/publication/330751750_Open_Research_Knowledge_Graph_Towards_Machine_Actionability_in_Scholarly_Communication/links/5c5e9bc7a6fdccb608b28f6f/Open-Research-Knowledge-Graph-Towards-Machine-Actionability-in-Scholarly-Communication.pdf <nowiki>[4]</nowiki>] has a really nice analogy: when phone books and maps were digitally transformed, we didn’t just create PDF versions of them; instead we developed new means to organize and access the information. Why not push for such a re-organization in scholarly communication too? Building on [3], their ontology also contains: problem statement, approach, evaluation and conclusions. They use RDF with a minor difference: everything can be modeled as an entity with a unique ID. A limited subset of OWL (subclass inference) is supported. There was a small user study with this tool, but it doesn’t seem like Dafna’s questions about the utility of a KG constructed by someone else, or about consensus across people, were measured. They do, however, speak to Dafna's question about how painful it is to construct KGs.
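Minimal sketch (in rdflib) of the "everything is an entity with a unique ID" modeling style described in [4]: the problem statement, approach, evaluation and conclusion of a paper are all resources rather than plain strings. The namespace and predicate names below are invented for illustration and are not the actual ORKG vocabulary.
<syntaxhighlight lang="python">
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("https://example.org/scholarly-kg/")  # hypothetical namespace

g = Graph()
g.bind("ex", EX)

paper = EX["paper/123"]
problem = EX["problem/kg-provenance"]          # problem statement as an entity
approach = EX["approach/reified-statements"]   # approach as an entity
evaluation = EX["evaluation/user-study-1"]     # evaluation as an entity
conclusion = EX["conclusion/123-1"]            # conclusion as an entity

g.add((paper, RDF.type, EX.Paper))
g.add((paper, EX.addressesProblem, problem))
g.add((paper, EX.usesApproach, approach))
g.add((paper, EX.hasEvaluation, evaluation))
g.add((paper, EX.hasConclusion, conclusion))
g.add((conclusion, RDFS.label,
       Literal("Reified statements make provenance explicit at some storage cost")))

print(g.serialize(format="turtle"))
</syntaxhighlight>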
 
So a possible good open [[question]]: how might KGs constructed by one set of scientists actually be useful (or not) for other scientists, and what factors influence whether this is the case?
 
* [[experiment]] idea: even just compiling any successful instances of KG adoption that span multiple groups would be a good starting point; see [[Peter Murray-Rust]], and maybe [https://www.nlm.nih.gov/research/umls/index.html UMLS] also


Note: Both papers actually reference a fair number of other works in this direction, which could be good follow-up reading


=== Papers I didn’t get to ===
(numbering here does *not* map to numbering in sections above)
[1] [https://pdfs.semanticscholar.org/713b/b398b85b034f2139e08b4ca0f7791fd545bc.pdf?_ga=2.19474043.2123037593.1668226180-648530244.1660769576 Empirical study of RDF reification]


* reuse of records/data/stuff in general in CSCW across a bunch of work domains/settings, like aircraft safety, healthcare, etc. (generally talks about why it's hard to reuse stuff from others, need context)
* annotations / social reading (generally negative/mixed)
MG: can we search the literature based on what tool they studied things with?
No, not right now - there's a lack of good tools to "refactor" the literature to match those queries.
Maybe something like Elicit (https://elicit.org/search?q=What+is+the+impact+of+creatine+on+cognition%3F&token=01GHRSBKG6P1R9343NBP2ZWZ5B&c-ps=intervention&c-ps=outcome-measured&c-s=What+tool+did+they+study), but for everything??


== Compiling graphs to manuscripts ==
* the majority of empirical research papers in biology have a similar structure (question/ motivation/ evidence (fig.1a)/ claim for each paragraph & figure panel)
* multiple researchers (or students) asked to highlight the questions/claims/evidence text from a paper will highlight similar/consensus text (part of the NLP-to-highlights project)
Notes/questions:
* may need more than just the sentences, but also mapping to intra-project/paper elements, like figures and tables (see the sketch below the screenshot)
[[File:Screenshot example of GPT-3 task for extracting components from narrative.png|none|thumb]]
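One possible target structure for this kind of extraction (e.g., as the output of a GPT-3 prompt like the one in the screenshot above): a record per paragraph / figure panel holding the question, motivation, evidence and claim spans, plus pointers back to the paragraph and panels they came from. Field names here are assumptions, not an existing schema.
<syntaxhighlight lang="python">
from dataclasses import dataclass, field

@dataclass
class DiscourseUnit:
    question: str
    motivation: str
    evidence: str                 # highlighted evidence text
    claim: str
    paragraph_id: str             # paragraph the spans were highlighted in
    figure_panels: list = field(default_factory=list)  # e.g. ["Fig. 1a"]

unit = DiscourseUnit(
    question="Does creatine affect working memory?",
    motivation="Prior results on cognition are mixed.",
    evidence="Digit-span scores improved in the treatment group (Fig. 1a).",
    claim="Creatine supplementation improves working memory.",
    paragraph_id="results-p3",
    figure_panels=["Fig. 1a"],
)
print(unit)
</syntaxhighlight>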


== Next Steps ==