Lennard Ong

The Joyful Tinkerer

Building an Article Summarizer for Enhancing Comprehension (2/2)

tl;dr:

Encoder-decoder vs. decoder-only architectures for word simplification and sentence generation.

Motivation

Sometimes we come across articles or documents that are difficult to understand, especially if they're outside our area of expertise. This common experience sparked the idea for this project.

My goal with this project was twofold: First, to deepen my understanding of transformer models and natural language processing (NLP) as part of my computing coursework. Second, to tackle a problem I'm sure many of us have encountered - making sense of complex, domain-specific articles.

This project, in essence, is about creating a way to simplify scientific or industry articles into something more digestible. It is meant as a tool to bridge the gap between expert knowledge and layperson understanding.

Hypotheses

This is a variant of abstractive summarization. The project takes a novel approach and treats abstractive summarization as two distinct components: knowledge extraction and text generation. This strategy allows us to harness the capabilities of LLMs while maintaining a level of control and interpretability over the output.

The crux of our method revolves around mining essential information in the form of Subject-Verb-Object (SVO) triplets from the corpus. Following this, we employ 3 different transformer models, specifically T5, BERT, and GPT, to generate a coherent narrative using these triplets.
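For context, here is a minimal sketch of what SVO mining can look like with a dependency parser. It is only an illustration: the spaCy model and the dependency labels used here are assumptions, not the project's exact extraction rules.

# Illustrative SVO extraction with spaCy (assumed, not the project's exact rules)
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_svo(sentence):
    """Return 'Subject | Verb | Object' strings found in a sentence."""
    triples = []
    for token in nlp(sentence):
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "pobj")]
            for s in subjects:
                for o in objects:
                    triples.append(f"{s.text} | {token.lemma_} | {o.text}")
    return triples

print(extract_svo("TimeWarner posted a profit."))  # e.g. ['TimeWarner | post | profit']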

The model’s modular nature allows us to harness the strengths of non-neural and neural methods, facilitating the training of specialized modules and providing a framework to interpret and explain the results.

This post covers the neural-network (NN) portion of the project. It documents the process of model selection, fine-tuning, and evaluation against benchmarks.

Dataset

This project uses the BBC News Summary dataset as the corpus for our study due to its well-formatted and simple nature, with handcrafted reference summaries to evaluate against. The dataset was created by re-purposing the BBC News dataset, which was originally built for classification benchmarks.

Triplet-to-Text Generation

Two foundational language models were explored for this task and later benchmarked against a BERT model.

The first model explored is the Text-to-Text Transfer Transformer (T5). With both an encoder and a decoder, it excels at reshaping input text into output text. For this purpose, the T5 model is fine-tuned on the WebNLG corpus to enhance its capability to convert sets of related triplets into coherent sentences.

To facilitate the generation of compound sentences, triplets were grouped based on their subject. By employing a probabilistic function, the model was trained to generate a varying range of triplet sets. This methodology was designed to capture the intricate relationships between entities in a structured yet fluid manner.

Here is an example showing how two grouped triplets translate into a single compound sentence:

Input Pair:
"Tom | Is A | American"
"Tom | Works As | Engineer"

Compound Output:
"Tom is an American who works as an engineer."

The second model explored is the Generative Pre-trained Transformer (GPT). GPT's architecture is decoder-only, making it particularly skilled at predicting subsequent words based on its training. While this architecture excels in generating coherent sequences, it doesn't have the explicit input-to-output mapping that encoder-decoder models like T5 possess. This makes it excellent for free-form text generation, but it also means that for tasks requiring a strict correspondence between input and output, additional strategies or fine-tuning might be necessary to achieve optimal results.
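A minimal sketch of this candidate-generation step with a decoder-only model is shown below. The distilgpt2 checkpoint, the prompt format, and the sampling settings are assumptions for illustration, not the project's exact configuration.

# Illustrative triplet-to-sentence sampling with a decoder-only model
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

def triplet_to_candidates(triplet, n=5):
    """Sample n candidate sentences for an 'S | V | O' triplet."""
    prompt = f"Facts: {triplet}\nSentence:"
    outputs = generator(
        prompt,
        max_new_tokens=30,
        num_return_sequences=n,
        do_sample=True,
        top_p=0.9,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    # keep only the text generated after the prompt, first line only
    return [o["generated_text"][len(prompt):].strip().split("\n")[0] for o in outputs]

candidates = triplet_to_candidates("Tom | Loves | Reading")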

To ensure the generated content from GPT remains contextually relevant and accurate, a filtering step was introduced. This involved creating candidate sentences for each triplet and selecting the one with the highest semantic closeness (measured using cosine similarity) to the article's title vector. This step was hypothesized to enhance the generation of summaries that are contextually aligned with the article's title, minimizing the introduction of unrelated details.

An example:

Title: "Tom's Passion for Books"
Triplet from Sentence Body: "Tom | Loves | Reading" 

Candidates w Cosine Similarity: 
1. "Tom has a deep affection for reading" (0.85) 
2. "Tom spends his time loving books" (0.81)
3. "Reading is what Tom loves the most" (0.88) <- selected
4. "Tom's favorite hobby is playing soccer" (0.40)
5. "Tom often finds solace in reading"


The third sentence, "Reading is what Tom loves the most", has the highest similarity score of 0.88. Thus, it's selected as the most contextually relevant sentence to the article's title. This process ensures that the generated sentence not only stems from the provided triplet but also aligns well with the broader context of the article.
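Below is a minimal sketch of that filtering step using Sentence-BERT embeddings. The all-MiniLM-L6-v2 encoder is an assumption (any sentence encoder would do), and real scores will differ from the illustrative numbers above.

# Illustrative title-similarity filter over candidate sentences
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_candidate(title, candidates):
    """Pick the candidate sentence closest (cosine similarity) to the title."""
    title_emb = encoder.encode(title, convert_to_tensor=True)
    cand_embs = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(title_emb, cand_embs)[0]  # one score per candidate
    return candidates[int(scores.argmax())]

best = select_candidate(
    "Tom's Passion for Books",
    ["Tom has a deep affection for reading",
     "Tom spends his time loving books",
     "Reading is what Tom loves the most"],
)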

Baselines

We set our models against a standard unsupervised summarizer using LexRank. In simple terms, we use Sentence-BERT to turn sentences into numerical values, or "embeddings". We then find how similar these sentences are to each other and pick the most central ones as our summary. This gives us a concise version of the original text.

Next, we aim to make this summary even simpler to read.

Our first approach identifies "complex" words, or words that aren't used often, using a scale called the Zipf scale. We then swap these complex words with simpler, more common ones using a database called WordNet.
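A minimal sketch of this baseline is shown below, assuming the wordfreq package for Zipf frequencies and NLTK's WordNet for synonyms; the Zipf threshold of 3.0 is illustrative.

# Illustrative rare-word substitution via Zipf frequency + WordNet
# requires: nltk.download("wordnet")
from nltk.corpus import wordnet
from wordfreq import zipf_frequency

def simplify_word(word, threshold=3.0):
    """Replace a rare word with its most common WordNet synonym."""
    if zipf_frequency(word, "en") >= threshold:
        return word  # already common enough
    synonyms = {
        lemma.name().replace("_", " ")
        for syn in wordnet.synsets(word)
        for lemma in syn.lemmas()
    }
    synonyms.discard(word)
    if not synonyms:
        return word
    # pick the most frequent (simplest) alternative
    return max(synonyms, key=lambda w: zipf_frequency(w, "en"))

print(simplify_word("ameliorate"))  # e.g. "better" or "improve"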

Our second simplification method, LSBert, also identifies complex words. But instead of WordNet, we use BERT, another model, to suggest replacements. We then pick the simplest word from BERT's suggestions using the Zipf scale. We compared our models against reference summaries. We also looked into how well our models simplify text by comparing their outputs' complexity.
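Here is a minimal sketch of the LSBert-style step: a masked language model proposes in-context replacements, and the Zipf scale picks the simplest. The bert-base-uncased checkpoint and top_k value are assumptions, and the real LSBert pipeline uses additional ranking features.

# Illustrative LSBert-style substitution: BERT proposes, Zipf frequency selects
from transformers import pipeline
from wordfreq import zipf_frequency

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def lsbert_substitute(sentence, complex_word):
    """Suggest a simpler in-context replacement for complex_word."""
    masked = sentence.replace(complex_word, fill_mask.tokenizer.mask_token, 1)
    suggestions = [s["token_str"].strip() for s in fill_mask(masked, top_k=10)]
    suggestions = [s for s in suggestions if s.isalpha() and s.lower() != complex_word.lower()]
    if not suggestions:
        return complex_word
    return max(suggestions, key=lambda w: zipf_frequency(w, "en"))

print(lsbert_substitute("The committee will convene tomorrow.", "convene"))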

Evaluation

Evaluating summaries is tricky because language is flexible. Our evaluation methods assess whether the models produce summaries that are both a) accurate and b) easy to read.

We mainly used ROUGE, which checks how much content our summaries share with reference summaries. For example, ROUGE-1 checks word overlap, while ROUGE-L checks for longer matching sequences, ensuring our summaries make sense. We also used BERTScore, which checks how similar the meaning of our summaries is to the reference, without needing exact word matches. While BLEU is a popular metric, we skipped it because it can be misleading for our purpose.
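For reference, here is a minimal sketch of computing both metrics with the rouge-score and bert-score packages on a single generated/reference pair; aggregation over the whole dataset is omitted.

# Illustrative metric computation for one summary pair
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = "TimeWarner profits jumped 76% in the fourth quarter."
reference = "Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn."

# ROUGE-1 (unigram overlap) and ROUGE-L (longest common subsequence)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, generated)
print(rouge["rouge1"].fmeasure, rouge["rougeL"].fmeasure)

# BERTScore: semantic similarity without requiring exact word matches
P, R, F1 = bert_score([generated], [reference], lang="en")
print(F1.mean().item())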

We wanted our summaries to be easy to read. So, we created a "simplicity index" that checks three things: how long the words and sentences are, how many uncommon words are used, and overall readability. The goal? Higher scores mean easier reading. We compared our models' scores to the original text and reference summaries.
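A minimal sketch of such an index is shown below; the use of textstat and wordfreq, and the specific weights, are assumptions made only to make the three components concrete.

# Illustrative simplicity index: word/sentence length, rare-word ratio, readability
import textstat
from wordfreq import zipf_frequency

def simplicity_index(text):
    """Higher means easier to read (illustrative weighting only)."""
    words = [w.strip(".,!?\"'()") for w in text.split() if w.strip(".,!?\"'()")]
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / max(textstat.sentence_count(text), 1)
    rare_ratio = sum(zipf_frequency(w.lower(), "en") < 3.0 for w in words) / len(words)
    readability = textstat.flesch_reading_ease(text)  # higher = easier
    return readability - avg_word_len - 0.5 * avg_sent_len - 20 * rare_ratio

print(simplicity_index("Tom is an American who works as an engineer."))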

Final Thoughts

Was this project a success? From a growth perspective, most definitely. It provided a hands-on deep-dive into the intricacies of NLP and the application of cutting-edge language models.

Performance-wise, GPT's results were a double-edged sword. Its proficiency was commendable, but its effectiveness sometimes felt like it bypassed the need for more carefully engineered solutions, making it almost too good.

Here are some of the other takeaways:

  1. Text Simplification Insights: T5 produced the most readable summaries. Our baseline models struggled to simplify text effectively, but LSBert showed promise by offering more word replacement options. Still, reference summaries remained the most concise and to-the-point.

  2. Transfer Learning Hurdles: Fine-tuning models like T5 and DistilGPT on specific datasets didn't yield the results we hoped for. GPT-3.5, however, showed promise, coming close to LexRank in BERTScore. But, models like GPT-2 and T5 sometimes added made-up facts to their summaries, which isn't ideal.

To reflect from a 10,000ft perspective, this project demonstrated the continued relevance of non-neural approaches for some tasks. In the era of neural networks, "traditional" methods like LexRank can be overshadowed. Yet, this project highlighted their value. Non-neural techniques offer simplicity, transparency, and efficiency. They need less data and computational power, provide consistent results, and can be tailored with domain knowledge.

In essence, while neural models push the boundaries of what's possible in NLP, non-neural methods remain a robust and often more interpretable alternative, proving that sometimes, the traditional ways still have value.

Code Snippets

Fine Tuning T5

This snippet prepares and trains the T5 model using the Adafactor optimizer. After training, the model is saved for later use.

#################################
# Training
#################################

import shutil

import torch
from transformers import Adafactor, T5ForConditionalGeneration, T5Tokenizer

# select device and load the pretrained T5 base checkpoint
dev = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
tokenizer = T5Tokenizer.from_pretrained('t5-base', model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('t5-base', return_dict=True)
model.to(dev)

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    eps=(1e-30, 1e-3),
    clip_threshold=1.0,
    decay_rate=-0.8,
    beta1=None,
    weight_decay=0.0,
    relative_step=False,
    scale_parameter=False,
    warmup_init=False,
)

# fine-tune, then save and zip the checkpoint for later use
# (train_model, save_model, and train_df are defined elsewhere in the project)
train_model(train_df, model, tokenizer, optimizer, dev)
save_model(model, tokenizer, "trained_model")
shutil.make_archive("trained_model", 'zip', "trained_model")


Generating a Summary

This code processes documents, groups related information, and uses the trained T5 model to generate a coherent summary for each document.

#################################
# Generate a summary 
#################################

import os
import random

from tqdm import tqdm

VERBOSE = False
FOLDER = "./results/Generated_T5_16epoch/"
if not os.path.exists(FOLDER):
    os.makedirs(FOLDER)

# all_triples and titles come from the triplet-extraction stage
summaries = []
for id, doc_triples in tqdm(enumerate(all_triples)):
    
    # init 
    summary = []
    idx = 0

    # sort for subject match
    doc_triples = sorted(doc_triples)

    # add title to summary 
    summary.append(titles[id])
    summary.append("\n")

    # iterate through triples
    while idx < len(doc_triples):
        triple = doc_triples[idx]

        # look ahead (None when there is no next triple)
        triple_n1 = doc_triples[idx + 1] if idx + 1 < len(doc_triples) else None
        triple_n2 = doc_triples[idx + 2] if idx + 2 < len(doc_triples) else None

        # base triple
        predict = triple

        # concatenate triples (rule + probabilistic)
        # rand2/rand3 are drawn for the probabilistic grouping; only the
        # subject-match rule is applied below
        rand2 = random.randrange(0, 10)
        rand3 = random.randrange(0, 10)
        if triple_n1 is not None and triple.split(" | ")[0] == triple_n1.split(" | ")[0]:
            # same subject: merge the two triples into one compound input
            predict = predict + " && " + triple_n1
            idx += 1

        idx += 1

        # generate a sentence from the (possibly merged) triple string
        generated_sentence = generate_sentence(predict, model, tokenizer)
        
        if VERBOSE: 
            print(f"predict: {predict}")
            print(f"generated: {generated_sentence}")


        summary.append(generated_sentence)
    
    # convert summary to string
    summary = ' '.join(summary)
    summaries.append(summary)

    # save summary as txt file
    filename = f"{id + 1:03d}.txt"
    path = os.path.join(FOLDER, filename)

    with open(path, "w") as f:
        f.write(summary)

Benchmark Summarizer

This snippet tokenizes an input document into sentences, computes embeddings for each sentence, and selects the most central sentences to form a concise summary.

import nltk
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer, util
import numpy as np
import re

# degree_centrality_scores is the LexRank helper from the sentence-transformers
# text-summarization example, saved locally as lexrank.py
from lexrank import degree_centrality_scores

nltk.download('punkt')

model = SentenceTransformer('all-MiniLM-L6-v2')

# Our input document we want to summarize

document = """
TimeWarner said fourth quarter sales rose 2% to $11.1bn from $10.9bn.For the full-year, TimeWarner posted a profit of $3.36bn, up 27% from its 2003 performance, while revenues grew 6.4% to $42.09bn.Quarterly profits at US media giant TimeWarner jumped 76% to $1.13bn (£600m) for the three months to December, from $639m year-earlier.However, the company said AOL's underlying profit before exceptional items rose 8% on the back of stronger internet advertising revenues.Its profits were buoyed by one-off gains which offset a profit dip at Warner Bros, and less users for AOL.For 2005, TimeWarner is projecting operating earnings growth of around 5%, and also expects higher revenue and wider profit margins.It lost 464,000 subscribers in the fourth quarter profits were lower than in the preceding three quarters.Time Warner's fourth quarter profits were slightly better than analysts' expectations.
"""
document = re.sub(r"(?<!\d)\.(?!\d)", r". ", document)

#Split the document into sentences
sentences = sent_tokenize(document)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i+1}: {sentence}\n")

print("Num sentences:", len(sentences))

#Compute the sentence embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)

#Compute the pair-wise cosine similarities
cos_scores = util.cos_sim(embeddings, embeddings).cpu().numpy()

#Compute the centrality for each sentence
centrality_scores = degree_centrality_scores(cos_scores, threshold=None)

#We argsort so that the first element is the sentence with the highest score
most_central_sentence_indices = np.argsort(-centrality_scores)


#Print the 5 sentences with the highest scores
print("\n\nSummary:")
for idx in most_central_sentence_indices[0:5]:
    print(sentences[idx].strip())


GitHub repo: https://github.com/lennardong/article_summarizer/tree/main

