FLUID: A common model for semantic structural graph summaries based on equivalence relations
Till Blume, David Richerby, Ansgar Scherp
Summarization is a widespread method for handling very large graphs. The task of structural graph summarization is to compute a concise but meaningful synopsis of the key structural information of a graph. As summaries may be used for many different purposes, there is no single concept or model of graph summaries. We have studied existing structural graph summaries for large-scale (semantic) graphs. Despite their different concepts and purposes, we found commonalities in the graph structures they capture. We use these commonalities to provide for the first time a formally defined common model, FLUID (FLexible graph sUmmarIes for Data graphs), that allows us to flexibly define structural graph summaries. FLUID allows graph summaries to be quickly defined, adapted, and compared for different purposes and datasets. To this end, FLUID provides features of structural summarization based on equivalence relations such as distinction of types and properties, direction of edges, bisimulation, and inference. We conduct a detailed complexity analysis of the features provided by FLUID. We show that graph summaries defined with FLUID can be computed in the worst case in time O(n^2) w.r.t. n, the number of edges in the data graph. An empirical analysis of large-scale web graphs with billions of edges indicates a typical running time of O(n). Based on the formal FLUID model, one can quickly define and modify various structural graph summaries from the literature and beyond.
Published in Theoretical Computer Science (TCS) https://doi.org/10.1016/j.tcs.2020.12.019
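As a rough illustration of summarization by an equivalence relation, the following Python sketch groups vertices that share the same set of types and the same set of outgoing property labels. The triple format, the `rdf:type` marker, and the function name `summarize` are assumptions made for this example; they are not taken from the FLUID paper, which supports many other equivalence relations.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"  # assumed label that marks type edges


def summarize(triples):
    """Partition subjects into equivalence classes.

    Two subjects are treated as equivalent when they have the same set of
    types and the same set of outgoing property labels. This is one simple,
    assumed equivalence relation chosen for illustration.
    """
    types = defaultdict(set)
    props = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            props[s].add(p)

    classes = defaultdict(set)
    for s in set(types) | set(props):
        # The class key is the pair (type set, property set).
        key = (frozenset(types[s]), frozenset(props[s]))
        classes[key].add(s)
    return dict(classes)


triples = [
    ("alice", RDF_TYPE, "Person"),
    ("alice", "knows", "bob"),
    ("bob", RDF_TYPE, "Person"),
    ("bob", "knows", "carol"),
    ("carol", RDF_TYPE, "Person"),
]
summary = summarize(triples)
```

In this toy graph, alice and bob fall into one class (type Person, property knows), while carol forms a class of her own because she has no outgoing property.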
Aspect-based Document Similarity for Research Papers
Malte Ostendorff, Terry Ruas, Till Blume, Bela Gipp, Georg Rehm
Traditional document similarity measures provide a coarse-grained distinction between similar and dissimilar documents. Typically, they do not consider in what aspects two documents are similar. This limits the granularity of applications like recommender systems that rely on document similarity. In this paper, we extend similarity with aspect information by performing a pairwise document classification task. We evaluate our aspect-based document similarity approach for research papers. Paper citations indicate the aspect-based similarity, i.e., the title of a section in which a citation occurs acts as a label for the pair of citing and cited paper. We apply a series of Transformer models such as RoBERTa, ELECTRA, XLNet, and BERT variations and compare them to an LSTM baseline. We perform our experiments on two newly constructed datasets of 172,073 research paper pairs from the ACL Anthology and CORD-19 corpus. According to our results, SciBERT is the best performing system with F1-scores of up to 0.83. A qualitative analysis validates our quantitative results and indicates that aspect-based document similarity indeed leads to more fine-grained recommendations.
Published in the Proceedings of the International Conference on Computational Linguistics (COLING) http://dx.doi.org/10.18653/v1/2020.coling-main.545
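The labeling scheme described in the abstract can be sketched as follows: each citation yields a (citing, cited) pair, labeled with the title of the section in which the citation occurs. The input format, the normalization of section titles, and the decision to collect several labels per pair are assumptions for illustration and do not reflect the authors' actual data pipeline.

```python
from collections import defaultdict


def build_pairs(citations):
    """Map each (citing, cited) pair to a set of section-title labels.

    `citations` is an iterable of (citing_id, cited_id, section_title)
    tuples; this input format is hypothetical.
    """
    labels = defaultdict(set)
    for citing, cited, section in citations:
        # Normalize titles so that "Related Work" and "related work " match.
        labels[(citing, cited)].add(section.strip().lower())
    return dict(labels)


citations = [
    ("paperA", "paperB", "Introduction"),
    ("paperA", "paperB", "Related Work"),
    ("paperA", "paperC", "Experiments "),
]
pairs = build_pairs(citations)
```

The resulting pairs and label sets could then serve as training examples for a pairwise document classifier.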
Incremental and Parallel Computation of Structural Graph Summaries for Evolving Graphs
Till Blume, David Richerby, Ansgar Scherp
Graph summarization is the task of finding condensed representations of graphs such that a chosen set of (structural) subgraph features in the graph summary are equivalent to the input graph. Existing graph summarization algorithms are tailored to specific graph summary models, only support one-time batch computation, are designed and implemented for a specific task, or evaluated using static graphs. Our novel, incremental, parallel algorithm addresses all these shortcomings. We support various structural graph summary models defined in our formal language FLUID. All graph summaries defined with FLUID can be updated in time O(Δ · d^k), where Δ is the number of additions, deletions, and modifications to the input graph, d is its maximum degree, and k is the maximum distance in the subgraphs considered. We empirically evaluate the performance of our algorithm on benchmark and real-world datasets. Our experiments show that, for commonly used summary models and datasets, the incremental summarization algorithm almost always outperforms its batch counterpart, even when about 50% of the graph database changes. The source code and the experimental results are openly available for reproducibility and extensibility.
Published in the Proceedings of the International Conference on Information & Knowledge Management (CIKM) https://doi.org/10.1145/3340531.3411878
Source Code available on GitHub https://github.com/t-blume/fluid-spark
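To give an intuition for incremental updates, here is a minimal Python sketch that is not the algorithm from the paper or the fluid-spark code. It assumes a purely local equivalence relation (vertices are grouped by their set of outgoing edge labels), so only the vertices touched by a change need to be reclassified. The class and method names are hypothetical. For summary models that consider subgraphs up to distance k, vertices within distance k of a change would presumably have to be revisited as well, which is where a factor depending on d and k would enter.

```python
from collections import defaultdict


class IncrementalSummary:
    """Toy incremental summary: vertices are grouped by their set of
    outgoing edge labels (an assumed, purely local equivalence relation)."""

    def __init__(self):
        self.out_edges = defaultdict(set)  # vertex -> {(label, target)}
        self.key_of = {}                   # vertex -> current class key
        self.members = defaultdict(set)    # class key -> vertices

    def _reclassify(self, v):
        # Remove v from its old class and drop the class if it is now empty.
        old = self.key_of.pop(v, None)
        if old is not None:
            self.members[old].discard(v)
            if not self.members[old]:
                del self.members[old]
        # A vertex without outgoing edges is no longer summarized.
        if self.out_edges[v]:
            new = frozenset(label for label, _ in self.out_edges[v])
            self.key_of[v] = new
            self.members[new].add(v)
        else:
            del self.out_edges[v]

    def apply(self, added=(), removed=()):
        """Apply a batch of edge changes; only touched vertices are updated."""
        touched = set()
        for s, p, o in added:
            self.out_edges[s].add((p, o))
            touched.add(s)
        for s, p, o in removed:
            self.out_edges[s].discard((p, o))
            touched.add(s)
        for v in touched:
            self._reclassify(v)


summary = IncrementalSummary()
summary.apply(added=[("a", "knows", "b"), ("b", "knows", "c"), ("c", "likes", "a")])
# Second batch: vertex c swaps its "likes" edge for a "knows" edge.
summary.apply(added=[("c", "knows", "a")], removed=[("c", "likes", "a")])
```

After the second batch, only vertex c is reclassified: it joins the class of a and b, and the now-empty "likes" class is removed.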
Indexing Data on the Web: A Comparison of Schema-Level Indices for Data Search
Till Blume, Ansgar Scherp
Indexing the Web of Data offers many opportunities, in particular, to find and explore data sources. One major design decision when indexing the Web of Data is to find a suitable index model, i.e., how to index and summarize data. Various efforts have been conducted to develop specific index models for a given task. With each index model designed, implemented, and evaluated independently, it remains difficult to judge whether an approach generalizes well to another task, set of queries, or dataset. In this work, we empirically evaluate six representative index models with unique feature combinations. Among them is a new index model incorporating inferencing over RDFS and owl:sameAs. We implement all index models for the first time into a single, stream-based framework. We evaluate variations of the index models considering sub-graphs of sizes 0, 1, and 2 hops on two large, real-world datasets. We evaluate the quality of the indices regarding the compression ratio, summarization ratio, and F1-score denoting the approximation quality of the stream-based index computation. The experiments reveal huge variations in compression ratio, summarization ratio, and approximation quality for different index models, queries, and datasets. However, we observe meaningful correlations in the results that help to determine the right index model for a given task, type of query, and dataset.
Published in the Proceedings of the International Conference on Database and Expert Systems Applications (DEXA), https://doi.org/10.1007/978-3-030-59051-2_18
Source Code available on GitHub https://github.com/t-blume/fluid-framework
Towards an Open Platform for Legal Information
Malte Ostendorff, Till Blume, Saskia Ostendorff
Recent advances in the area of legal information systems have led to a variety of applications that promise support in processing and accessing legal documents. Unfortunately, these applications have various limitations, e.g., regarding scope or extensibility. Furthermore, we do not observe a trend towards open access in digital libraries in the legal domain as we observe in other domains, e.g., economics or computer science. To improve open access in the legal domain, we present our approach for an open source platform to transparently process and access Legal Open Data. This enables the sustainable development of legal applications by offering a single technology stack. Moreover, the approach facilitates the development and deployment of new technologies. As proof of concept, we implemented six technologies and generated metadata for more than 250,000 German laws and court decisions. Thus, we can provide users of our platform not only access to legal documents but also the contained information.
Published in the Proceedings of the Joint Conference on Digital Libraries (JCDL), https://doi.org/10.1145/3383583.3398616
Source Code available on GitHub https://github.com/openlegaldata/oldp and prototype online https://openlegaldata.io/
Open Innovation in the Big Data Era With the MOVING Platform
Iacopo Vagliano, Franziska Günther, Matthias Heinz, Aitor Apaolaza, Irina Bienia, Gert Breitfuss, Till Blume, Chrysa Collyda, Angela Fessl, Sebastian Gottfried, Peter Hasitschka, Jasmin Kellermann, Thomas Köhler, Annalouise Maas, Vasileios Mezaris, Ahmed Saleh, Andrzej M. J. Skulimowski, Stefan Thalmann, Markel Vigo, Alfred Wertner, Michael Wiese, Ansgar Scherp
The MOVING platform enables its users to improve their information literacy by training how to exploit data mining methods in their daily research tasks. Its novel integrated working and training environment supports the education of data-savvy information professionals and allows them to address the big data and open innovation challenges.
Published in IEEE MultiMedia https://doi.org/10.1109/MMUL.2018.2873495
Towards an Incremental Schema-level Index for Distributed Linked Open Data Graphs
Till Blume, Ansgar Scherp
Semi-structured, schema-free data formats are used in many applications because their flexibility enables simple data exchange. Especially graph data formats like RDF have become well established in the Web of Data. For the Web of Data, it is known that data instances are not only added, changed, and removed regularly, but that their schemas are also subject to enormous changes over time. Unfortunately, the collection, indexing, and analysis of the evolution of data schemas on the web is still in its infancy. To enable a detailed analysis of the evolution of Linked Open Data, we lay the foundation for the implementation of incremental schema-level indices for the Web of Data. Unlike existing schema-level indices, incremental schema-level indices have an efficient update mechanism to avoid costly recomputations of the entire index. This enables us to monitor changes to data instances at schema-level, trace changes, and ultimately provide an always up-to-date schema-level index for the Web of Data. In this paper, we analyze in detail the challenges of updating arbitrary schema-level indices for the Web of Data. To this end, we extend our previously developed meta model FLuID. In addition, we outline an algorithm for performing the updates.
Published in the Proceedings of the Conference "Lernen, Wissen, Daten, Analysen" (LWDA), http://ceur-ws.org/Vol-2191/paper7.pdf
Towards Flexible Indices for Distributed Graph Data: The Formal Schema-level Index Model FLuID
Till Blume, Ansgar Scherp
Graph indices are key to managing huge amounts of distributed graph data. Instance-level indices are available that focus on the fast retrieval of nodes. Furthermore, there are so-called schema-level indices focusing on summarizing nodes sharing common characteristics, i.e., the combination of attached types and used property-labels. We argue that there is not a one-size-fits-all schema-level index. Rather, a parameterized, formal model is needed that allows one to quickly design, tailor, and compare different schema-level indices. We abstract from related works and provide the formal model FLuID using basic building blocks to flexibly define different schema-level indices. The FLuID model provides parameterized simple and complex schema elements together with four parameters. We show that all indices modeled in FLuID can be computed in O(n). Thus, FLuID enables us to efficiently implement, compare, and validate variants of schema-level indices tailored for specific application scenarios.
Published in the Proceedings of the Workshop on Foundations of Databases (GvDB), http://ceur-ws.org/Vol-2126/paper3.pdf