ERef Bayreuth

Anmelden

Literatur vom gleichen Autor/der gleichen Autor*in

bei Google Scholar

Bibliografische Daten exportieren

Optimization and Parallelization of RegEx Based Information Extraction

Titelangaben

Doleschal, Johannes:
Optimization and Parallelization of RegEx Based Information Extraction.
Bayreuth , 2021 . - XVII, 176 S.
( Dissertation, 2021 , Universität Bayreuth, Bayreuther Graduiertenschule für Mathematik und Naturwissenschaften - BayNAT)
DOI: https://doi.org/10.15495/EPub_UBT_00005737

Volltext

Link zum Volltext (externe URL):

Angaben zu Projekten

Projektfinanzierung:	German-Israeli Foundation for Scientific Researchand Development (GIF), grant I-1502-407.6/2019. Deutsche Forschungsgemeinschaft (DFG), grant MA 4938/4-1. Special Research Fund (BOF) of Hasselt University.

Abstract

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document’s spans (intervals identified by their start and end indices). For instance, the regular document spanners are the regular expressions with capture variables, closed under the relational algebra. The expressive power of the regular spanners is precisely captured by the class of vset-automata—an extension of finite state automata, that mark the endpoints of selected spans. In this thesis, we embark the investigation of multiple different aspects of document spanners. Namely, parallel evaluation, weight annotation, and aggregation of document spanners.

Parallel Evaluation: Programs for extracting structured information from text, namely information extractors, often operate separately on document segments obtained from a generic splitting operation such as sentences, paragraphs, k-grams, HTTP requests, and so on. An automated detection of this behavior of extractors, which we refer to as split-correctness, would allow text analysis systems to devise query plans with parallel evaluation on segments for accelerating the processing of large documents. Other applications include the incremental evaluation on dynamic content, where re-evaluation of information extractors can be restricted to revised segments, and debugging, where developers of information extractors are informed about potential boundary crossing of different semantic components. We propose and study a new formal framework for split-correctness within the formalism of document spanners. Our analysis studies the complexity of split-correctness over regular spanners. We also discuss different variants of split-correctness, for instance, in the presence of black-box extractors with split constraints.

Weight Annotation: Concerning weight annotation, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green, Karvounarakis, and Tannen, where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the positive relational Algebra operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every document into an annotated relation over the spans. As a specific instantiation, we explore weighted vset-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive relational Algebra, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of ordered semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.

Aggregation: Lastly we investigate the computational complexity of querying text by aggregate functions — such as sum, average, and quantile — on top of regular document spanners. To this end, we formally define aggregate functions over document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).

Abstract in weiterer Sprache

Das System der Document Spanner (frei übersetzt: Abschnittsanfragen) abstrahiert die Aufgabe der Informationsextraktion aus Texten als eine Funktion, die Dokumente (Zeichenketten) in Relationen über Abschnitte des Dokuments (identifiziert durch die Start- und Endposition in der Zeichenkette) übersetzt. Reguläre Abschnittsanfragen zum Beispiel sind reguläre Ausdrücke mit Variablen, abgeschlossen unter der relationalen Algebra. Die Ausdrucksstärke von regulären Abschnittsanfragen umfasst exakt die Klasse der sogenannten vset-Automaten — eine Erweiterung von endlichen Automaten, welche die Endpunkte der selektierten Abschnitte markieren. In dieser Arbeit studieren wir mehrere verschiedene Aspekte von Abschnittsanfragen und befassen uns mit der parallelen Auswertung, Annotation von Gewichten und der Aggregation von Abschnittsanfragen.

Parallele Auswertung: Programme, die strukturierte Informationen aus Texten extrahieren, auch informations Extraktoren genannt, arbeiten oft unabhängig voneinander an mehreren Teilen eines Dokuments, die durch eine generische Aufteilung des Dokuments in Sätze, Absätze, HTTP Anfragen oder Ähnliches erzeugt werden. Ein automatisches Erkennen eines solchen Verhaltens, das wir als Aufteilungskorrektheit bezeichnen, würde es informations Extraktoren erlauben, Ausführungspläne zur parallelen Auswertung zu erzeugen und dadurch die Verarbeitung von großen Dokumenten zu beschleunigen. Andere Anwendungen sind die inkrementelle Auswertung von dynamischen Inhalten, für welche die erneute Auswertung von informations Extraktoren auf überarbeitete Bereiche beschränkt werden kann und die Suche von Fehlern, wobei Entwickler von informations Extraktoren über die mögliche Überschreitung von semantischen Inhaltsgrenzen informiert werden können. In dieser Arbeit definieren und studieren wir ein neues formales System für Aufteilung Korrektheit im Formalismus von Abschnittsanfragen. Unsere Analyse studiert die Komplexität der Aufteilungskorrektheit für reguläre Abschnittsanfragen und wir untersuchen verschiedene Varianten von Aufteilungskorrektheit, zum Beispiel im Zusammenhang mit Black Box Extraktoren unter Aufteilungsbedingungen.

Gewichts Annotation: Bezüglich der gewichteten Annotation untersuchen wir Abschnittsanfragen, welche die extrahierten Daten mit zusätzlichen Informationen, wie zum Beispiel der Irrtumswahrscheinlichkeit, dem Support oder Informationen zur Vertraulichkeit anreichern. Dazu passen wir die Abstraktion sogenannter “provenance semirings” (frei übersetzt: Ursprungs Semiringe) von Green, Karvounarakis und Tannen an. In diesen werden Tupel einer Relation mit Elementen eines kommutativen Semirings annotiert und Informationen zum Ursprung mittels der Semiring Operatoren durch positive relationale Algebra Operatoren propatiert. Die vorgeschlagene Erweiterung von Abschnittsanfragen, welche wir als Annotatoren bezeichnen, bildet Dokumente auf annotierte Relationen über Abschnitte ab. Als eine spezifische Darstellungsmöglichkeit untersuchen wir gewichtete vset-Automaten welche ähnlich zu gewichteten endlichen Automaten und Transduktoren Semiring Elemente zu Zustandsübergängen hinzufügt. Wir untersuchen Schüsseleigenschaften der Ausdrucksstärke, wie zum Beispiel den Abschluss unter positiver relationaler Algebra und Schlüsseleigenschaften der Komplexität, wie zum Beispiel der Aufzählung von annotierten Antworten und deren geordnete Aufzählung im Fall von geordneten Semiringen. Für eine Vielzahl dieser Probleme sind fundamentale Eigenschaften des verwendeten Semirings, beispielsweise Positivität, für effiziente Algorithmen essenziell.

Aggregation: Zuletzt studieren wir die Komplexität der Berechnung von Aggregatfunktionen — wie zum Beispiel Summe, Durchschnitt oder Quantil — über Resultate von regulären Abschnittsanfragen. Dazu definieren wir Aggregatfunktionen über Abschnittsanfragen formal und analysieren die Komplexität der exakten und näherungsweisen Auswertung. Genauer zeigen wir das in einem eingeschränkten Fall, alle betrachteten Aggregatfunktionen in polynomieller Zeit ausgewertet werden können. Allgemein können manche Aggregatfunktionen noch immer durch “fully polynomial-time randomized approximation schemes” (frei übersetzt: randomisierte echt polynomielle Approximationsschemata) angenähert werden, auch wenn eine exakte Berechnung nach allgemein geglaubten Hypothesen nicht in polynomieller Laufzeit möglich ist.

Weitere Angaben

Publikationsform:	Dissertation
Keywords:	Information extraction; document spanners; regular document spanners; weighted automata; provenance semirings;
Institutionen der Universität:	Fakultäten > Fakultät für Mathematik, Physik und Informatik > Institut für Informatik > Professur Angewandte Informatik VII > Professur Angewandte Informatik VII - Univ.-Prof. Dr. Wim Martens Graduierteneinrichtungen > University of Bayreuth Graduate School Graduierteneinrichtungen > Bayreuther Graduiertenschule für Mathematik und Naturwissenschaften - BayNAT Fakultäten Fakultäten > Fakultät für Mathematik, Physik und Informatik Fakultäten > Fakultät für Mathematik, Physik und Informatik > Institut für Informatik Fakultäten > Fakultät für Mathematik, Physik und Informatik > Institut für Informatik > Professur Angewandte Informatik VII Graduierteneinrichtungen
Titel an der UBT entstanden:	Ja
Themengebiete aus DDC:	000 Informatik,Informationswissenschaft, allgemeine Werke > 004 Informatik
Eingestellt am:	04 Sep 2021 21:00
Letzte Änderung:	06 Sep 2021 05:51
URI:	https://eref.uni-bayreuth.de/id/eprint/66919