Text Summarization

Text summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content.

There are two general approaches to automatic summarization: extraction and abstraction.

Extraction-based summarization

Here, content is extracted from the original data, but the extracted content is not modified in any way. Examples of extracted content include key-phrases that can be used to "tag" or index a text document.

Abstraction-based summarization

Abstractive methods build an internal semantic representation of the original content, and then use this representation to create a summary that is closer to what a human might express. Abstraction may transform the extracted content by paraphrasing sections of the source document, to condense a text more strongly than extraction. Such transformation, however, is computationally much more challenging than extraction, involving both natural language processing and often a deep understanding of the domain of the original text in cases where the original document relates to a special field of knowledge. (Source: Wikipedia)

Libraries Compared: spaCy and Sumy

spaCy

Both summarization approaches compared here follow the same general strategy: rank sentences by the significance of the words in each sentence, then use the top N sentences as the summary. In this comparison, spaCy is used mostly to clean the text and remove stopwords.
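The ranking idea described above can be sketched in plain Python. This is a minimal illustration and not the code used in this comparison: the regex sentence splitter, the tiny STOPWORDS set, and the function names are all assumptions standing in for the cleaning and stopword removal that spaCy would normally do.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch);
# spaCy ships a much larger, language-specific list.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    # Lowercase the sentence, keep alphabetic tokens, drop stopwords.
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def summarize(text, n=2):
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))
    if not freq:
        return []
    top = max(freq.values())
    # Sentence score: sum of the normalized frequencies of its content words.
    scores = [sum(freq[w] / top for w in content_words(s)) for s in sentences]
    # Pick the n highest-scoring sentences, then restore document order.
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    sample = (
        "Cats chase mice. Cats and mice live in houses. "
        "The weather is nice today."
    )
    for sentence in summarize(sample, n=2):
        print(sentence)
```

Summing word scores favors longer sentences; dividing each score by the sentence length is one common adjustment if that bias is unwanted.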

Sumy

Sumy is a summarization-specific NLP library that provides implementations of several approaches to text summarization. This comparison uses its LexRank approach, which is the closest to the approach used in the spaCy summarization.
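To give a feel for how LexRank ranks sentences, here is a simplified sketch of the idea in plain Python. It is not Sumy's implementation: it assumes raw term counts instead of the IDF-weighted cosine similarity of the original algorithm, and the function names, threshold, damping factor, and iteration count are illustrative choices. Sentences become nodes in a graph, edges connect sentences that are similar enough, and a PageRank-style iteration scores each node.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]
    # Unweighted graph: an edge joins two sentences whose similarity
    # reaches the threshold.
    adj = [
        [1.0 if i != j and cosine(bags[i], bags[j]) >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the sentence graph.
    # Isolated sentences keep only the baseline (1 - damping) / n score.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                adj[j][i] * scores[j] / degree[j] for j in range(n) if degree[j]
            )
            for i in range(n)
        ]
    return scores


def lexrank_summary(text, n=2):
    sentences = split_sentences(text)
    scores = lexrank_scores(sentences)
    # Pick the n highest-scoring sentences, then restore document order.
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    sample = (
        "The cat sat on the mat. The cat lay on the mat. "
        "Stocks fell sharply today."
    )
    for sentence in lexrank_summary(sample, n=2):
        print(sentence)
```

Unlike the word-frequency ranking, a sentence scores highly here when it resembles many other sentences, so an off-topic sentence with rare words tends to rank low.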