Top Ranked Phrases in a Corpus

This is the project webpage for the Copan group, with general information and links. The goal of the project was to implement a parallel solution in C and MPI that lists the top R ranked phrases whose length is between M and N. The program extracts these phrases from a corpus stored in an input folder. It is released under the BSD license.
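To make the task concrete, here is a minimal serial sketch in C of the counting and ranking step. It is not the project's implementation. It assumes that phrase length is measured in words and that rank means raw frequency. The names `PhraseCount`, `top_phrases`, and `MAX_PHRASE` are hypothetical. The sketch uses sorting rather than a hash table to keep the code short. A parallel version would presumably have each MPI process count its share of the input files and then merge the counts, which is not shown here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PHRASE 256 /* hypothetical cap on phrase size in bytes */

typedef struct {
    char text[MAX_PHRASE];
    int count;
} PhraseCount;

/* Alphabetical order, so that equal phrases end up adjacent. */
static int by_text(const void *a, const void *b) {
    const PhraseCount *x = a, *y = b;
    return strcmp(x->text, y->text);
}

/* Descending count; ties are broken alphabetically. */
static int by_count_desc(const void *a, const void *b) {
    const PhraseCount *x = a, *y = b;
    if (x->count != y->count) return (y->count > x->count) ? 1 : -1;
    return strcmp(x->text, y->text);
}

/* Writes the r most frequent phrases of min_len..max_len words into out
 * (which must have room for r entries) and returns how many were written. */
size_t top_phrases(const char *const *words, size_t n_words,
                   size_t min_len, size_t max_len, size_t r,
                   PhraseCount *out) {
    assert(min_len >= 1 && min_len <= max_len);
    size_t cap = n_words * (max_len - min_len + 1); /* upper bound */
    if (cap == 0) return 0;
    PhraseCount *all = malloc(cap * sizeof *all);
    if (!all) return 0;

    /* Step 1: emit every phrase of each allowed length. */
    size_t n = 0;
    for (size_t len = min_len; len <= max_len; len++) {
        for (size_t i = 0; i + len <= n_words; i++) {
            size_t used = 0;
            int ok = 1;
            for (size_t k = 0; k < len; k++) {
                int w = snprintf(all[n].text + used, MAX_PHRASE - used,
                                 "%s%s", k ? " " : "", words[i + k]);
                if (w < 0 || (size_t)w >= MAX_PHRASE - used) { ok = 0; break; }
                used += (size_t)w;
            }
            if (ok) { all[n].count = 1; n++; } /* skip over-long phrases */
        }
    }

    /* Step 2: sort, then merge runs of identical phrases into counts. */
    qsort(all, n, sizeof *all, by_text);
    size_t u = 0;
    for (size_t i = 0; i < n; i++) {
        if (u > 0 && strcmp(all[u - 1].text, all[i].text) == 0)
            all[u - 1].count++;
        else
            all[u++] = all[i];
    }

    /* Step 3: rank by frequency and keep the top r. */
    qsort(all, u, sizeof *all, by_count_desc);
    size_t k = u < r ? u : r;
    memcpy(out, all, k * sizeof *all);
    free(all);
    return k;
}
```

For the word sequence "the cat sat the cat ran the dog" with M = 1, N = 2, and R = 3, this sketch returns "the" (3 occurrences), "cat" (2), and "the cat" (2). Sorting every phrase occurrence uses far more memory than a hash table would, so this approach is only practical for small inputs, not for a multi-gigabyte corpus.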

TRPC

The input is a large archive of newspaper articles called the “GigaWord-English” corpus. The uncompressed corpus is about 9.7 GB and contains about 1.8 billion words. For more information on the corpus, see GigaWord-English.