Lennard Ong

The Joyful Tinkerer

Building an Article Summarizer for Enhancing Comprehension (1/2)

tl;dr:

An exploratory approach to article summarization that pairs knowledge graphs & large transformer models

Motivation

Sometimes we come across articles or documents that are difficult to understand, especially if they're outside our area of expertise. This common experience sparked the idea for this project.

My goal with this project was twofold: First, to deepen my understanding of transformer models and natural language processing (NLP) as part of my computing coursework. Second, to tackle a problem I'm sure many of us have encountered - making sense of complex, domain-specific articles.

This project, in essence, is about creating a way to simplify scientific or industry articles into something more digestible. It is meant as a tool to bridge the gap between expert knowledge and layperson understanding.

Hypotheses

This is a variant of abstractive summarization. The project takes a novel approach, treating abstractive summarization as two distinct components: knowledge extraction and text generation. This strategy allows us to harness the capabilities of LLMs while maintaining a level of control and interpretability over the output.

The crux of our method revolves around mining essential information in the form of Subject-Verb-Object (SVO) triplets from the corpus. Following this, we employ 3 different transformer models, specifically T5, BERT, and GPT, to generate a coherent narrative using these triplets.

The model’s modular nature allows us to harness the strengths of non-neural and neural methods, facilitating the training of specialized modules and providing a framework to interpret and explain the results.

Dataset

This project uses the BBC News Summary dataset as the corpus for our study because it is well formatted and simple, and it provides handcrafted model summaries to evaluate against. The dataset was created by re-purposing the BBC News dataset originally built for classification benchmarks.
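
For reference, here is a minimal loading sketch. It assumes a local copy of the dataset in the common Kaggle layout, with a News Articles/ folder and a Summaries/ folder split by category; the folder names, the "bbc-news-summary" path, and the latin-1 encoding are assumptions rather than details from this project.

from pathlib import Path

def load_bbc_pairs(root):
    """Yield (article_text, summary_text) pairs from a local copy of the
    BBC News Summary dataset, assuming the Kaggle folder layout:
    <root>/News Articles/<category>/<id>.txt and <root>/Summaries/<category>/<id>.txt."""
    articles_dir = Path(root) / "News Articles"
    summaries_dir = Path(root) / "Summaries"
    for article_path in sorted(articles_dir.glob("*/*.txt")):
        summary_path = summaries_dir / article_path.parent.name / article_path.name
        if summary_path.exists():
            # latin-1 is a permissive fallback assumption for non-UTF-8 files
            yield (article_path.read_text(encoding="latin-1"),
                   summary_path.read_text(encoding="latin-1"))

articles_and_summaries = list(load_bbc_pairs("bbc-news-summary"))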

Text Distillation

The core hypothesis of this first part is that the semantic essence of a text can be represented through a series of interconnected and interdependent subject-verb-object triplets, thus creating a graph of the information contained within the piece of text. This knowledge graph is to serve as “ground truth” for the downstream task of text summarization.
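
As a quick sketch of what such a graph looks like in code, SVO triplets map naturally onto labelled edges. The use of networkx and the example triplets below are illustrative assumptions, not part of the project pipeline.

import networkx as nx

def build_knowledge_graph(triplets):
    """Build a directed multigraph from (subject, verb, object) triplets:
    subjects and objects become nodes, verbs become labelled edges."""
    graph = nx.MultiDiGraph()
    for subject, verb, obj in triplets:
        graph.add_edge(subject, obj, label=verb)
    return graph

# hypothetical triplets for illustration
kg = build_knowledge_graph([
    ("TimeWarner", "post", "quarterly_profit"),
    ("quarterly_profit", "beat", "forecast"),
])
print(list(kg.edges(data=True)))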

The text preparation stage of our pipeline involves several standard Natural Language Processing (NLP) tasks, including tokenization, stopword removal, stemming, and lemmatization. These procedures distil the corpus, reduce language variability, and make information extraction more efficient. Additionally, co-reference resolution is performed to facilitate accurate entity linking across the corpus and to retain the narrative structure within individual sentences, enhancing the fidelity of the extracted SVO triplets.
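
A minimal sketch of the token-level distillation in spaCy is shown below. Co-reference resolution requires an add-on pipeline component and is omitted here, and the exact preprocessing used in this project may differ.

import spacy

nlp = spacy.load("en_core_web_lg")

def distill(text):
    """Tokenize, drop stopwords/punctuation/whitespace, and lemmatize."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not (tok.is_stop or tok.is_punct or tok.is_space)]

print(distill("TimeWarner said fourth quarter sales rose 2% to $11.1bn."))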

In an attempt to incorporate relevant context into the knowledge extraction phase, we posit that the title of an article could serve as a representative set of important terms. Consequently, keywords from the article title are extracted and incorporated into the entity recognition process. This step aims to improve the contextual relevance of named entities and aids in better aligning the knowledge extraction with the overall topic of the text.
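
One plausible way to build the two keyword groups that the later code expects (entity-like phrases in keywords[0], other content words in keywords[1]) is sketched below; the actual keyword extraction used in this project may differ.

import spacy

nlp = spacy.load("en_core_web_lg")

def extract_title_keywords(title):
    """Split the title into entity-like keyword phrases and other content words."""
    doc = nlp(title)
    entity_keywords = [ent.text for ent in doc.ents]
    covered = {tok.i for ent in doc.ents for tok in ent}
    other_keywords = [tok.text for tok in doc
                      if tok.i not in covered
                      and not tok.is_stop and not tok.is_punct
                      and tok.pos_ in {"NOUN", "PROPN", "VERB", "ADJ"}]
    return [entity_keywords, other_keywords]

print(extract_title_keywords("TimeWarner profit boosted by sales of AOL"))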

Knowledge Graph Extraction

Our process of generating triplets follows a structure inspired by the WebNLG corpus, leveraging the RDF schema for knowledge representation. This structure provides us with a rich source of training data from the WebNLG corpus, which we later utilize for fine-tuning our language generation models.
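
For context, a WebNLG-style training record pairs a small set of triples with the sentence that verbalizes them; the appendix code mirrors this by emitting "subject | verb | object <==> sentence" strings. A hypothetical, heavily simplified record might look like this:

# hypothetical, simplified WebNLG-style record (the real corpus uses an XML
# schema with richer metadata); the separator format matches the appendix code
record = {
    "triples": ["TimeWarner | post | quarterly_profit"],
    "text": "TimeWarner posted a quarterly profit.",
}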

We follow a procedure that parses the dependency structure of the text and applies a series of heuristics to extract Subject-Verb-Object (SVO) triplets. Here is an overview of the considerations (a simplified sketch follows the list):

  1. Sentence Parsing: We parse the input text on a sentence level.

  2. Verb-Subject-Object Initialization: We create a dictionary where verbs serve as keys, and sets of associated subjects and objects as their corresponding values.

  3. Token Analysis: Each token in the sentences is analyzed based on its dependency label and part-of-speech tag.

  4. Role Assignment: Depending on the dependency label, a token is assigned as a subject or an object and accordingly added to the Verb-SVO dictionary. Subjects can be nouns or subordinate clauses, while objects can be nouns, entities introduced by a preposition, or subordinate clauses.

  5. Role Expansion: We expand subjects and objects to include related tokens such as compound nouns, conjuncts, or tokens within a clause, utilizing the subtree of each token.

  6. Verb Conjuncts: We account for cases where multiple verbs are linked via conjunction (e.g., the sentence “I read and wrote this book”), and update the dictionary accordingly, considering that such verbs share the same subjects and objects.

  7. SVO Triplet Extraction: The extracted triplets are retrieved from the Verb-SVO dictionary, each triplet represented as a tuple in the order of subject-verb-object.
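
Below is a simplified sketch of steps 2 to 7 using spaCy dependency labels: it keys a dictionary on the governing verb, collects subjects and objects, and emits their cross-product as triplets. The full implementation, including role expansion, verb conjuncts, and keyword handling, is in the appendix.

from collections import defaultdict
import spacy

nlp = spacy.load("en_core_web_lg")

def extract_svo_triplets(text):
    """Simplified illustration of the verb-keyed extraction (steps 2-7)."""
    doc = nlp(text)
    verb_map = defaultdict(lambda: {"verb": None, "subjects": [], "objects": []})
    for tok in doc:
        if tok.dep_ in ("nsubj", "nsubjpass"):                 # step 4: subjects
            verb_map[tok.head.i]["verb"] = tok.head
            verb_map[tok.head.i]["subjects"].append(tok)
        elif tok.dep_ in ("dobj", "attr"):                     # step 4: direct objects
            verb_map[tok.head.i]["verb"] = tok.head
            verb_map[tok.head.i]["objects"].append(tok)
        elif tok.dep_ == "pobj" and tok.head.dep_ == "prep":   # objects via a preposition
            governor = tok.head.head
            verb_map[governor.i]["verb"] = governor
            verb_map[governor.i]["objects"].append(tok)
    triplets = []                                              # step 7: emit triplets
    for entry in verb_map.values():
        for subj in entry["subjects"]:
            for obj in entry["objects"]:
                triplets.append((subj.text, entry["verb"].lemma_, obj.text))
    return triplets

print(extract_svo_triplets("TimeWarner posted a large profit in 2004."))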

In the process of triplet extraction, we have devised two variations - 'greedy' and 'strict'. These variations differ in steps 5 and 7.

In the 'greedy' implementation, we liberally follow token chains during subject and object expansion (step 5), which results in longer and more contextually rich SVO triplets. We also relax the criteria in step 7 to permit non-entity and non-verb tokens in the subject and verb positions.

Conversely, the 'strict' implementation focuses on the extraction of 'essential' triples. We cap token chains to a maximum of five tokens and exclude auxiliary verbs. In step 7, we restrict non-entity and non-verb tokens, with the exception for subjects and objects that originated from the title. All non-entities are then subjected to lemmatization.
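
As a concrete illustration of the 'strict' constraints, here is a sketch of a filter over a candidate triple of spaCy token lists. The five-token cap and the entity-or-noun requirement also appear as rejection checks in the appendix code; the title-keyword exception is omitted for brevity.

def passes_strict_filter(triple):
    """Illustrative 'strict' check: subject, verb, and object must each have at
    most five tokens, and the subject must contain an entity or a noun.
    (Auxiliary verbs are dropped earlier rather than rejected here.)"""
    subject, verb, obj = triple
    if any(len(part) > 5 for part in (subject, verb, obj)):
        return False
    if not any(tok.ent_type_ or tok.pos_ == "NOUN" for tok in subject):
        return False
    return True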

Final Thoughts 

The knowledge extraction stage of this project was challenging in a fun way. While it did convert complex articles into structured SVO triplets, it also revealed some gaps and areas for enhancement:

  • Context & Nuance - The current reliance on syntactic structures and heuristics failed to capture some nuanced meanings, such as those embedded in context, metaphor, or sarcasm. The process also struggled with texts where important information is not easily reducible to simple SVO triplets.

  • Semantic Boundaries - What is the boundary of relationships in a text? In the above process, triplets are generated from relationships within a sentence. While co-reference resolution will help somewhat, there will still be meaningful relationships that will be lost.

  • Sensitivity - The 'greedy' and 'strict' variations of triplet extraction present a trade-off between capturing rich context and maintaining precision, raising the opportunity to find the optimal balance.

Future improvements might include maintaining a “master graph” for the knowledge domain so it can be enriched with useful facts… but we have enough for now to continue.

The next part of this project will focus on text generation, building upon the structured data from the knowledge extraction phase. The varying outcomes of different transformer architectures will offer a perspective on the capabilities of modern NLP technology.

Appendix: Code Snippets

Here are some code snippets for the various procedures. The code is written in Python and uses spaCy, with textacy for the baseline SVO extraction.
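
The snippets assume the following imports, which are not shown in the original notebook; textacy provides the baseline SVO extractor used in the second block.

# assumed imports for the snippets below
import spacy
import spacy.matcher     # spacy.matcher.PhraseMatcher
import spacy.tokens      # spacy.tokens.Span
import spacy.displacy    # spacy.displacy.render
import textacy.extract   # textacy.extract.subject_verb_object_triples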

Tokenizer and Tagger Functions

##########################################################################################################################################
# TOKENIZER AND TAGGER
# helper functions
##########################################################################################################################################

def tag_and_tokenize_keywords(doc):
    # Build phrase matchers for the two keyword groups
    # (relies on the module-level `nlp` pipeline and `keywords` list)
    matcher_ent = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LEMMA")
    matcher_other = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
    patterns_ent = [nlp(keyword) for keyword in keywords[0]]
    patterns_other = [nlp.make_doc(keyword) for keyword in keywords[1]]
    matcher_ent.add("KEYWORD_ENT", patterns_ent)
    matcher_other.add("KEYWORD_OTHER", patterns_other)

    # Create a lemmatized version of the input document for lemma-based matching
    lemmatized_doc = nlp(" ".join([token.lemma_ for token in doc]))

    # Retokenize the document to merge multi-token keyword spans
    with doc.retokenize() as retokenizer:
        # Iterate through the matchers (matcher_ent, matcher_other) and their target documents (lemmatized_doc, doc)
        for matcher, target_doc in zip([matcher_ent, matcher_other], [lemmatized_doc, doc]):
            # Find matches in the target document using the current matcher
            for match_id, start, end in matcher(target_doc):
                # Create a span in the original document corresponding to the matched keywords
                span = doc[start:end]
                # Merge the span into a single token
                retokenizer.merge(span)

                # Set the 'is_keyword' and 'custom_tag' attributes for the merged token
                span[0].set_extension("is_keyword", default=False, force=True)
                span[0]._.is_keyword = True
                span[0].set_extension("custom_tag", default=None, force=True)
                span[0]._.custom_tag = nlp.vocab.strings[match_id]

    return doc

def add_keywords_to_ents(doc):
    new_ents = []
    for token in doc:
        if token._.is_keyword:
            new_ent = spacy.tokens.Span(doc, token.i, token.i + 1, label=token._.custom_tag)
            new_ents.append(new_ent)

    # Filter out existing entities that overlap with the new keyword entities
    filtered_ents = []
    for ent in doc.ents:
        overlaps = [ent.start <= new_ent.start < ent.end or new_ent.start <= ent.start < new_ent.end for new_ent in new_ents]
        if not any(overlaps):
            filtered_ents.append(ent)

    doc.set_ents(filtered_ents + new_ents)
    return doc

def convert_entities_to_tokens(doc):
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            # Create a span for the current entity
            span = doc[ent.start:ent.end]
            # Merge the span into a single token
            retokenizer.merge(span)

    return doc

def merge_noun_chunks(doc):
    with doc.retokenize() as retokenizer:
        for np in list(doc.noun_chunks):
            retokenizer.merge(np)
    return doc

def merge_symb2num(doc):
    """
    Merge adjacent currency symbol and number tokens.
    """
    i = 1
    while i < len(doc):
        if doc[i].is_digit and doc[i - 1].is_currency:
            span = doc[doc[i - 1].i: doc[i].i + 1]
            with doc.retokenize() as retokenizer:
                retokenizer.merge(span)
        else:
            i += 1
    return doc


# Displacy Formatting 
col_highlight1 = "magenta"
col_highlight2 = "yellow"
col_others = "lightblue"
options_ent = {
    "ents": ["KEYWORD_ENT", "KEYWORD_OTHER",
             "ORG", "PRODUCT", "GPE", "LOC", "PERSON", "FAC", "NORP", "DATE", "TIME", "PERCENT", "MONEY", "QUANTITY", "ORDINAL", "CARDINAL", "LANGUAGE", "EVENT", "LAW", "WORK_OF_ART"],
    "colors": {"KEYWORD_ENT": col_highlight1, 
               "KEYWORD_OTHER": col_highlight2, 
               "ORG": col_others, 
               "PRODUCT": col_others, 
               "GPE": col_others, 
               "LOC": col_others, 
               "PERSON": col_others, 
               "FAC": col_others, 
               "NORP": col_others, 
               "DATE": col_others, 
               "TIME": col_others, 
               "PERCENT": col_others, 
               "MONEY": col_others, 
               "QUANTITY": col_others, 
               "ORDINAL": col_others, 
               "CARDINAL": col_others, 
               "LANGUAGE": col_others, 
               "EVENT": col_others, 
               "LAW": col_others, 
               "WORK_OF_ART": col_others}  
}

options_dep = {'compact': True, 'color': 'black', 'bg': 'white', 'offset': 100, 'distance': 100, 'font': 'Arial'}

##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

# Generate Sample
doc_idx = 368
title, body = seperate_title_and_body(articles[doc_idx])
keywords = article_keywords[doc_idx]
print("title: ", title)
print("title keywords: ", keywords)
print("body: ", body)

# Create new NLP instances, incl matchers
nlp = spacy.load("en_core_web_lg")

matcher_ent = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LEMMA")  
matcher_other = spacy.matcher.PhraseMatcher(nlp.vocab, attr="LOWER")
patterns_ent = [nlp(keyword) for keyword in keywords[0]]
patterns_other = [nlp.make_doc(keyword) for keyword in keywords[1]]
matcher_ent.add("KEYWORD_ENT", patterns_ent)
matcher_other.add("KEYWORD_OTHER", patterns_other)

# Pipeline the inputs
doc = nlp(body)
doc = tag_and_tokenize_keywords(doc)
doc = add_keywords_to_ents(doc)
# doc = convert_entities_to_tokens(doc)
# doc = merge_noun_chunks(doc)
# doc = merge_symb2num(doc)

# spacy.displacy.render(doc, style = "dep")
spacy.displacy.render(doc, style = "dep", options = options_dep)
spacy.displacy.render(doc, style = "ent", options = options_ent)

Sample Output

Text corpus with extracted keywords. Colour-coded highlights indicate the keyword tag.

Subject-Verb-Object Extraction Functions

##########################################################################################################################################
# SUBJECT-VERB-OBJECT EXTRACTION
# Helper Functions
##########################################################################################################################################
# The following generates triples regardless of keywords. 
# See ref: https://textacy.readthedocs.io/en/latest/api_reference/extract.html#triples


def generate_triplet2sentence_pairs(doc, arr_keyword, return_type = 'std', keyword_filter = False, VERBOSE = False):
    '''
    Returns a list of tuple-pairs: 
    [(triple, sentence), (triple, sentence), ...]

    Parameters:
    -----------
    doc : spacy.tokens.Doc
        A SpaCy document containing the text to extract SVO triplets from.
    arr_keyword : List[List[str]]
        A nested list of keywords (strings), where each sublist represents a set of related keywords.
    return_type : str, optional (default='std')
        If 'std', return (triplet, sentence) pairs; if 'llm', return combined
        "triplet <==> sentence" strings.
    keyword_filter : bool, optional (default=False)
        If True, only include triplets containing at least one keyword from the given keyword list.
    VERBOSE : bool, optional (default=False)
        If True, print additional information, such as the list of flattened keywords.

    Returns:
    --------
    pairs : List[Tuple[str, str]]
        A list of tuple pairs, where each tuple contains a formatted triplet (as a string) and
        the sentence text.

    Example usage:
    --------------
    keywords = [['Time Warner'], ['sales', 'boost', 'profit']]
    svo_pairs = generate_triplet2sentence_pairs(doc, arr_keyword=keywords, keyword_filter=True)
    
    '''

    pairs = []
    combined_data = []
    
    # flatten the keyword groups for optional filtering
    flat_keywords = [item for sublist in arr_keyword for item in sublist]
    if VERBOSE: print("flat keywords: ", flat_keywords)

    # Iterate through sentences in the doc
    for sent in doc.sents:
        # Extract SVO triples for each sentence
        triples = list(textacy.extract.subject_verb_object_triples(sent))
        formatted_triples = []
        
        for triple in triples:


            ###########################
            # visualizer 
            ###########################
            if VERBOSE:
                options = {'compact': True, 'color': 'black', 'bg': 'white', 'offset': 100, 'distance': 100, 'font': 'Arial'}
                spacy.displacy.render(sent, style='dep', jupyter=True, options=options)

                sent_start = sent[0].idx
                entities_s = [{'start': t.idx - sent_start, 'end': t.idx - sent_start + len(t), 'label': 'S'} for t in triple[0]]
                entities_v = [{'start': t.idx - sent_start, 'end': t.idx - sent_start + len(t), 'label': 'V'} for t in triple[1]]
                entities_o = [{'start': t.idx - sent_start, 'end': t.idx - sent_start + len(t), 'label': 'O'} for t in triple[2]]
                all_entities = entities_s + entities_v + entities_o
                displacy_ex = [{'text': sent.text, 'ents': all_entities, 'title': None}]

                spacy.displacy.render(displacy_ex, style='ent', jupyter=True, manual=True, options={'colors': {'S': 'yellow', 'V': 'magenta', 'O': 'orange'}})
            

            ###########################
            # modify triplets 
            ###########################
            # for s, v, o, print their entity types 
            if VERBOSE: 
                for svo in triple:
                    for tok in svo:
                        if hasattr(tok, "ent_id"):
                            print(tok, " : ", tok.ent_type_)
            
            for svo in triple:
                    
                for idx, tok in enumerate(svo[:-1]):

                    # drop auxiliary verbs
                    if tok.pos_ == "AUX":
                        svo.remove(tok)

                    # drop particles
                    if tok.pos_ == "PART":
                        svo.remove(tok)

                    # if the token is a MONEY entity preceded by a currency symbol,
                    # insert the symbol directly in front of the amount
                    if tok.i > 0 and tok.ent_type_ == "MONEY" and tok.nbor(-1).pos_ == "SYM":
                        # skip if the symbol is already in the list
                        if idx > 0 and svo[idx - 1] == tok.nbor(-1):
                            continue
                        svo.insert(idx, tok.nbor(-1))
                    
                    
                    # FIXME - if token's child is linked to it by a possesive modifier, replace the token with the child
                    # has_poss_child = any(child.dep_ == "poss" for child in tok.children)
                    # if has_poss_child:
                    #     print("POSS FOUND: ", tok, " : ", tok.children[0])
                    #     svo[idx] = tok.children[0]


            ###########################
            # reject triples
            ###########################
            
            # Final check: only proceed if subject, verb, and object each have at most 5 tokens
            if len(triple[0]) > 5 or len(triple[1]) > 5 or len(triple[2]) > 5:
                if VERBOSE: print("REJECT: Triple is too long. Passing")
                continue

            # reject the subject if it is not an entity or a noun
            if (not any([tok.ent_type_ for tok in triple[0]]) and not any([tok.pos_ == "NOUN" for tok in triple[0]])):
                if VERBOSE: print("REJECT: Subject does not have entity type or is not a noun. Passing")
                continue
            
            # reject the object if it is not an entity or a noun (currently disabled)
            # if (not any([tok.ent_type_ for tok in triple[2]]) and not any([tok.pos_ == "NOUN" for tok in triple[2]])):
            #     if VERBOSE: print("REJECT: Object does not have entity type or is not a noun. Passing")
            #     continue
            
            
            if VERBOSE: 
                print(f"processed: {[tok.text for tok in triple[0]], [tok.text for tok in triple[1]], [tok.text for tok in triple[2]]}")
            ###########################
            # final formatting 
            ###########################

            subject = []
            verb = []
            obj = []

            for tok in triple[0]:
                if tok.ent_type_ and tok.ent_type_ != "KEYWORD_OTHER":
                    subject.append(tok.text)
                else:
                    subject.append(tok.lemma_)
            
            for tok in triple[1]:
                if tok.ent_type_:
                    verb.append(tok.text)
                else:
                    verb.append(tok.lemma_)
            
            for tok in triple[2]:
                if tok.ent_type_ and tok.ent_type_ != "KEYWORD_OTHER":
                    obj.append(tok.text)
                else:
                    obj.append(tok.lemma_)

            if keyword_filter:
                # Check if any keyword is present in the subject, verb, or object
                if not any(keyword in subject + verb + obj for keyword in flat_keywords):
                    continue
            
            formatted_triples = f"{'_'.join(subject)} | {'_'.join(verb)} | {'_'.join(obj)}"

            # store to main lists 
            pairs.append((formatted_triples, sent.text))
            combined_data.append(formatted_triples + " <==> " + sent.text)
            
            if VERBOSE: print("added: ", formatted_triples)

    if return_type == "std":
        return pairs
    if return_type == "llm":
        return combined_data

##########################################################################################################################################
# Process Sampling
##########################################################################################################################################

keywords = article_keywords[doc_idx]
# print(keywords)
pairs = generate_triplet2sentence_pairs(doc, arr_keyword = keywords, keyword_filter = False, VERBOSE= True)
print ([a for a,b in pairs])

Sample Output

Tagged sentence with extracted triplet.


Github Repo: https://github.com/lennardong/article_summarizer/tree/main
