Software Maintenance
ISIA

Mini-Project

This document describes the mini-project for the Tests and Maintenance course in ISIA4. The work is to be done in pairs and must be delivered entirely in a GitLab repository, preferably on the university’s instance or on gitlab.com. In line with the semester’s evaluation rules, the project counts for both a group grade and an individual grade.

Context: Word Frequency Counter

Create a program that reads a text file and calculates the frequency of each word in the file. The result must be a list of words and their corresponding frequency, sorted by frequency.

Program to Implement

Main Features

  • Case insensitivity: The program must count words regardless of their case (for example, “Hello” and “hello” are considered the same word).
  • Punctuation handling: Handle punctuation correctly. Words adjacent to punctuation must be counted without punctuation marks (for example, “hello,” and “hello” are identical).
  • Frequency sorting: The result must list words in descending order of frequency. In case of equal frequency, words are sorted alphabetically.
  • File input/output: Read the source text from a file and write the frequency counts to an output file.
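The case, punctuation, and sorting rules above could be sketched in C as follows. This is only an illustration under stated assumptions: the names `word_count`, `normalize_token`, and `compare_word_counts` and the 64-byte word limit are hypothetical, the input is assumed to be ASCII, and punctuation is stripped only at the edges of a token (so an interior apostrophe is kept), which is one possible reading of the requirement.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record: one distinct word and its number of occurrences. */
typedef struct {
    char word[64];
    size_t count;
} word_count;

/*
 * Lowercases `token` in place and strips leading and trailing
 * non-alphanumeric bytes, so "Hello," and "hello" become the same string.
 * Returns the new length (0 if nothing is left). Assumes ASCII text.
 */
size_t normalize_token(char *token)
{
    assert(token != NULL);
    size_t len = strlen(token);
    size_t start = 0;

    while (start < len && !isalnum((unsigned char)token[start]))
        start++;
    while (len > start && !isalnum((unsigned char)token[len - 1]))
        len--;

    size_t out = 0;
    for (size_t i = start; i < len; i++)
        token[out++] = (char)tolower((unsigned char)token[i]);
    token[out] = '\0';
    return out;
}

/* qsort comparator: descending count, ties broken alphabetically. */
int compare_word_counts(const void *lhs, const void *rhs)
{
    const word_count *a = lhs;
    const word_count *b = rhs;

    if (a->count != b->count)
        return (a->count < b->count) ? 1 : -1;
    return strcmp(a->word, b->word);
}
```

Once the counts are collected in an array, `qsort(counts, n, sizeof counts[0], compare_word_counts)` puts them in the required output order.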

Advanced Features

  • Command line arguments: Allow the user to specify input and output file paths via command line arguments.
  • Stop words exclusion: Implement a feature to exclude common stop words (such as “the”, “is”, “at”, “which”, etc.) from the count. A default list is provided (see appendix), but the user can also specify a custom stop words file.
  • N-gram analysis option: In addition to individual words, implement an option to analyze the frequency of n-grams (sequences of n consecutive words), where n is an integer defined by the user (for example, 2 for bigrams, 3 for trigrams, etc.).
  • Real-time progress update: For large files, display a progress indicator showing the portion of the file that has been processed.
  • Interactive mode: Add an interactive mode in which the user can enter text directly and obtain the frequency analysis in real time.
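For the n-gram option, one possible approach is to slide a window of n words over the normalized word sequence and count each joined window like a single word. The sketch below shows only the joining step. The function name `build_ngram`, its parameters, and the choice of a single space as separator are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Joins the n words starting at words[start] into `out`, separated by
 * single spaces. Returns 0 on success, or -1 if fewer than n words remain
 * or if `out` (of size out_size) is too small to hold the result.
 */
int build_ngram(const char *const *words, size_t num_words, size_t start,
                size_t n, char *out, size_t out_size)
{
    assert(words != NULL && out != NULL);
    if (n == 0 || start >= num_words || n > num_words - start)
        return -1;

    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        const char *word = words[start + i];
        size_t len = strlen(word);
        size_t sep = (i == 0) ? 0 : 1;

        /* Reserve room for the separator, the word, and the final '\0'. */
        if (used + sep + len + 1 > out_size)
            return -1;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, word, len);
        used += len;
    }
    out[used] = '\0';
    return 0;
}
```

With the words "the", "quick", "brown", "fox" and n = 2, calling the function for start = 0, 1, 2 yields "the quick", "quick brown", and "brown fox".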

Additional Requirements

  • Robust error handling: The program must gracefully handle potential errors such as missing files, invalid inputs, or empty files.
  • Code efficiency: Write code that efficiently processes text, keeping in mind memory usage and execution time, especially for large files.
  • Unit tests: Each basic feature must be accompanied by corresponding unit tests to validate its correctness.
  • Documentation: Provide clear documentation, particularly on the following points:
    • How to compile and run the program.
    • Description of the program’s features.
    • Command line usage examples. Include in your repository example text files of varying length (from about 10 lines to 10,000 lines).
    • How to run the unit tests.
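For the error-handling requirement, one way to treat missing and empty files gracefully is to report them through a status code instead of crashing later. The sketch below is an illustration only: the `input_status` enum and the `open_input` function are hypothetical names, and the three-way distinction is an assumption about what "graceful" could mean here.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical status codes for opening the input file. */
typedef enum {
    INPUT_OK,          /* file opened and contains at least one byte */
    INPUT_UNREADABLE,  /* file missing, not permitted, or read error */
    INPUT_EMPTY        /* file opened but contains no data */
} input_status;

/*
 * Opens `path` for reading. On INPUT_OK, *out holds the open stream,
 * positioned at the start of the file. Otherwise *out is NULL and
 * nothing is left open.
 */
input_status open_input(const char *path, FILE **out)
{
    assert(path != NULL && out != NULL);
    *out = NULL;

    FILE *file = fopen(path, "r");
    if (file == NULL)
        return INPUT_UNREADABLE;

    /* Peek at the first byte to detect an empty file. */
    int first = fgetc(file);
    if (first == EOF) {
        input_status status = ferror(file) ? INPUT_UNREADABLE : INPUT_EMPTY;
        fclose(file);
        return status;
    }
    ungetc(first, file);

    *out = file;
    return INPUT_OK;
}
```

The caller can then print a clear message on `stderr` and exit with a non-zero status for anything other than `INPUT_OK`.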

Methodology

Beyond implementing the program, you will need to apply the various methods seen in class (TDD, Tests, Clean Code…). More specifically, the evaluation will take into account the following elements:

  • setting up unit tests, integration tests, performance tests
  • code coverage
  • automatic compilation / build
  • adherence to clean code principles (variable / function / class names, DRY, KISS principles…)
  • code indentation …
  • setting up a continuous integration pipeline on GitLab
  • (optional) setting up commit validation via pre-commit
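As an illustration of the unit-testing item above, a test in C does not require a framework. The sketch below uses a small `CHECK` macro that records failures instead of aborting. Everything in it is hypothetical: `count_tokens` stands in for whatever function is under test, and `CHECK` and `run_count_tokens_tests` are illustrative names, not part of the course material.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical function under test: counts whitespace-separated tokens. */
size_t count_tokens(const char *text)
{
    assert(text != NULL);
    size_t count = 0;
    int in_token = 0;

    for (; *text != '\0'; text++) {
        if (isspace((unsigned char)*text)) {
            in_token = 0;
        } else if (!in_token) {
            in_token = 1;
            count++;
        }
    }
    return count;
}

/* Records a failure instead of aborting, so one run reports every broken case. */
#define CHECK(cond)                                                         \
    do {                                                                    \
        if (!(cond)) {                                                      \
            fprintf(stderr, "FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond); \
            failures++;                                                     \
        }                                                                   \
    } while (0)

/* Returns the number of failed checks (0 means all tests passed). */
int run_count_tokens_tests(void)
{
    int failures = 0;

    CHECK(count_tokens("") == 0);
    CHECK(count_tokens("hello") == 1);
    CHECK(count_tokens("  hello   world \n") == 2);
    CHECK(count_tokens("a\tb\nc") == 3);
    return failures;
}
```

A test runner's `main` can return a non-zero exit status when the failure count is not zero, which is what lets a continuous integration job fail on a broken test.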

This project is to be implemented in C, building on the practical session from the first class.

Deliverables

You must provide access to your Git repository, which must contain the main program code, the tests, the means to compile your program, the configuration files (.gitlab-ci.yml, .pre-commit-config.yaml…), and a README.md file that explains what you have implemented. The submission deadline is Sunday, April 9 at 23. Make sure your instructor has access to your Git repository.

Appendix

Stop Words List

i
me
my
myself
we
our
ours
ourselves
you
your
yours
yourself
yourselves
he
him
his
himself
she
her
hers
herself
it
its
itself
they
them
their
theirs
themselves
what
which
who
whom
this
that
these
those
am
is
are
was
were
be
been
being
have
has
had
having
do
does
did
doing
a
an
the
and
but
if
or
because
as
until
while
of
at
by
for
with
about
against
between
into
through
during
before
after
above
below
to
from
up
down
in
out
on
off
over
under
again
further
then
once
here
there
when
where
why
how
all
any
both
each
few
more
most
other
some
such
no
nor
not
only
own
same
so
than
too
very
s
t
can
will
just
don
should
now
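The list above is not in alphabetical order, so one possible way to use it for fast lookups is to sort it once at startup and then search it with the standard `bsearch` function. The sketch below assumes the stop words have already been loaded into an array of strings. The names `sort_stop_words` and `is_stop_word` are hypothetical, and a hash table would be an equally valid choice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Comparator for arrays of C strings, usable with both qsort and bsearch. */
static int compare_strings(const void *lhs, const void *rhs)
{
    const char *const *a = lhs;
    const char *const *b = rhs;
    return strcmp(*a, *b);
}

/* Sorts the stop-word array alphabetically so it can be binary-searched. */
void sort_stop_words(const char **words, size_t count)
{
    assert(words != NULL);
    qsort(words, count, sizeof words[0], compare_strings);
}

/*
 * Returns 1 if `word` is in `sorted_words`, 0 otherwise.
 * `word` must already be lowercased; `sorted_words` must be sorted.
 */
int is_stop_word(const char *const *sorted_words, size_t count,
                 const char *word)
{
    assert(sorted_words != NULL && word != NULL);
    return bsearch(&word, sorted_words, count, sizeof sorted_words[0],
                   compare_strings) != NULL;
}
```

Each lookup then costs O(log n) string comparisons, which is small for a list of this size.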