Thursday, July 26, 2012

Character Frequency and Histogram in ANSI C

To understand this program you need to know how to work with text files and command-line arguments.
1. Computing the character frequency
/*
 * Description:
 *  Computes for each character the number of occurrences in the file
 *  specified by @stream.
 * Parameters:
 *  stream - a pointer to file
 *  statistics - a pointer to a vector of UCHAR_MAX + 1 counters, one for
 *      each possible character value.
 *  Returns:
 *   Nothing
 */
void DoStatistics(FILE* stream, long* statistics)
{
   /*fgetc returns an int so that EOF can be told apart from every
    valid character value; storing it in a char would lose that*/
   int temp;
   int i;
   /*Initializes the statistics vector*/
   for(i = 0; i<=UCHAR_MAX; i++)
   {
      statistics[i] = 0L;
   }
   /*Computes the number of occurrences for each character. The loop
    stops at EOF before indexing, so EOF itself is never counted*/
   while((temp = fgetc(stream))!=EOF)
   {
      statistics[temp]++;
   }
}
/*
 * Description:
 *  Outputs to console the character statistics in the format:
 *  character <number of occurrences>. Only printable characters will
 *  be taken into consideration.
 * Parameters:
 *  statistics - a pointer to a vector containing the number of occurrences
 *      of each character.
 *  printAll - if true, all characters will appear. Otherwise only character
 *       with at least one occurrence will appear.
 *  Returns:
 *   Nothing
 */
void PrintStatistics(long* statistics, bool printAll)
{
   int i;
   for(i = 0; i<=UCHAR_MAX; i++)
   {
      if(isprint(i) && ( (statistics[i]!=0) || (printAll==true) ) )
      {
         printf("%c <%ld>\n",(char)(i),statistics[i]);
      }
   }
}
To compute the character frequency of a file we use the two functions above. The first reads the file character by character and populates the statistics vector. The vector must hold UCHAR_MAX + 1 elements, one for each possible character value from 0 to UCHAR_MAX. The second function prints the character frequency statistics and receives as a parameter a pointer to the vector populated by the first function.
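One detail worth isolating is how a character is turned into an index of the statistics vector. The following is a minimal sketch, assuming a hypothetical helper named CountString (it is not part of the program in this article) that counts the characters of an in-memory string instead of a file. Each char is converted to unsigned char before indexing, so that bytes above 127 never produce a negative index on platforms where char is signed.

```c
#include <limits.h>

/*
 * Hypothetical helper, not part of the original program: counts the
 * characters of a NUL-terminated string. The vector must have room for
 * UCHAR_MAX + 1 counters, one for every possible character value.
 */
void CountString(const char* text, long* statistics)
{
   int i;
   const char* p;
   /*Initializes the statistics vector*/
   for(i = 0; i<=UCHAR_MAX; i++)
   {
      statistics[i] = 0L;
   }
   for(p = text; *p!='\0'; p++)
   {
      /*Converts to unsigned char first: a plain char may be negative*/
      statistics[(unsigned char)(*p)]++;
   }
}
```

A caller would declare the vector as long statistics[UCHAR_MAX + 1] and pass it together with the string to be counted.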

2. Printing the Statistics as a Histogram
/*
 * Description:
 *  Outputs to console the character statistics as a histogram.
 *  Only printable characters will be taken into consideration.
 * Parameters:
 *  statistics - a pointer to a vector containing the number of occurrences
 *      of each character.
 *  maxScaleValue - the maximum number of asterisks that could appear in
 *      the histogram. The frequency of a character in the
 *      statistics vector is scaled according to this value.
 *  printAll - if true, all characters will appear in the histogram.
 *       Otherwise only character with at least one occurrence will
 *       appear.
 *  Returns:
 *   Nothing
 */
void PrintHorizontalHistogram(long* statistics, int maxScaleValue, bool printAll)
{
   long asterisks = 0;
   long maxValue = 0;
   int i, j;
   /*Finds the printable character that has the maximum number of
    occurrences. This value will be used for scaling the statistics values*/
   for(i = 0; i<=UCHAR_MAX; i++)
   {
      if(isprint(i))
      {
         if(statistics[i]>maxValue)
         {
            maxValue = statistics[i];
         }
      }
   }
   /*Checks if the statistics vector is not empty*/
   if(maxValue!=0)
   {
      for(i = 0; i<=UCHAR_MAX; i++)
      {
         /*Will output information only for characters with at least
         one occurrence if the printAll option is not set. Otherwise
         will output information for all printable characters*/
         if(isprint(i) && ( (statistics[i]!=0) || (printAll==true) ) )
         {
            /*Computes the number of asterisks that will be printed.*/
            asterisks = (statistics[i]*maxScaleValue)/ maxValue;
            /*Prints the character*/
            putchar(i);
            putchar(' ');
            /*Prints a number of asterisks proportional to the number
            of occurrences*/
            for(j = 0; j<asterisks; j++)
            {
               putchar('*');
            }
            putchar('\n');
         }
      }  
   }
   else
   {
      puts("The statistics vector is empty");
   }
}
This function will print the character frequency statistics as a horizontal histogram.
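The scaling step is the heart of the histogram, so it may help to look at it in isolation. The sketch below assumes a hypothetical helper named ScaleToAsterisks (the article's function performs the same computation inline) and uses the same integer arithmetic: the most frequent printable character receives maxScaleValue asterisks and every other character receives a proportional, truncated share.

```c
/*
 * Hypothetical helper, not part of the original program: returns the
 * number of asterisks for one histogram row.
 */
long ScaleToAsterisks(long count, long maxValue, int maxScaleValue)
{
   /*An empty statistics vector has nothing to draw*/
   if(maxValue<=0)
   {
      return 0;
   }
   /*Integer division truncates, so rare characters may get no asterisk*/
   return (count*maxScaleValue)/maxValue;
}
```

With a maximum of 174 occurrences and a scale of 70, a character that occurs 85 times gets (85*70)/174 = 34 asterisks, while a character that occurs only once gets (1*70)/174 = 0 and appears as an empty row.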

3. Example
#include <stdio.h>
#include <stdbool.h>
#include <limits.h>
#include <ctype.h>

#define NR_ARGS        2
#define FILE_ARG_INDEX 1

void DoStatistics(FILE* stream, long* statistics);
void PrintStatistics(long* statistics, bool printAll);
void PrintHorizontalHistogram(long* statistics, int maxScaleValue, bool printAll);

/*
 * Description:
 *  The program will output the number of occurrences of all printable
 * characters in file.
 */
int main(int argc, char** argv)
{
    FILE* stream = NULL;
    /*One counter for each character value from 0 to UCHAR_MAX*/
    long statistics[UCHAR_MAX + 1];
    if(argc==NR_ARGS)
    {
       stream = fopen(argv[FILE_ARG_INDEX],"r");
       if(stream!=NULL)
       {
          DoStatistics(stream,statistics);
          /*The file is no longer needed once the statistics are computed*/
          fclose(stream);
          PrintStatistics(statistics,false);
          PrintHorizontalHistogram(statistics,70,false);
       }
       else
       {
          perror("Could not open file");
       }
    }
    else
    {
       /*perror would append an unrelated errno message here*/
       fputs("Incorrect number of arguments\n",stderr);
    }
    return 0;
}

The example program above will open a text file and store the character frequency statistics in the statistics vector. After that it will print the character frequency values and build a histogram.

If we consider a file with the following text:

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. Lorem Ipsum comes from sections 1.10.32 and 1.10.33 of "de Finibus Bonorum et Malorum" (The Extremes of Good and Evil) by Cicero, written in 45 BC. This book is a treatise on the theory of ethics, very popular during the Renaissance. The first line of Lorem Ipsum, "Lorem ipsum dolor sit amet..", comes from a line in section 1.10.32. The standard chunk of Lorem Ipsum used since the 1500s is reproduced below for those interested. Sections 1.10.32 and 1.10.33 from "de Finibus Bonorum et Malorum" by Cicero are also reproduced in their exact original form, accompanied by English versions from the 1914 translation by H. Rackham.

The output of the program for a file containing this text will be:
  <174>
" <6>
( <1>
) <1>
, <13>
- <1>
. <21>
0 <10>
1 <13>
2 <4>
3 <7>
4 <3>
5 <3>
9 <1>
B <4>
C <7>
E <3>
F <2>
G <1>
H <2>
I <6>
L <9>
M <3>
R <3>
S <2>
T <4>
V <1>
a <50>
b <12>
c <31>
d <29>
e <85>
f <19>
g <10>
h <23>
i <59>
k <6>
l <27>
m <34>
n <49>
o <79>
p <19>
r <63>
s <54>
t <53>
u <28>
v <5>
w <4>
x <3>
y <11>
  **********************************************************************
" **
(
)
, *****
-
. ********
0 ****
1 *****
2 *
3 **
4 *
5 *
9
B *
C **
E *
F
G
H
I **
L ***
M *
R *
S
T *
V
a ********************
b ****
c ************
d ***********
e **********************************
f *******
g ****
h *********
i ***********************
k **
l **********
m *************
n *******************
o *******************************
p *******
r *************************
s *********************
t *********************
u ***********
v **
w *
x *
y ****
Feel free to experiment by counting only the occurrences of digits, letters or punctuation. This is easily done by replacing the isprint condition with another <ctype.h> test such as isdigit, isalpha or ispunct.
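One way to try several character classes without editing the condition each time is to pass the test in as a function pointer. This is only a sketch under that assumption: PrintFilteredStatistics is a hypothetical variant of PrintStatistics, not part of the original program, and it expects a predicate with the same signature as the <ctype.h> functions.

```c
#include <stdio.h>
#include <limits.h>
#include <ctype.h>

/*
 * Hypothetical variant of PrintStatistics: the character class is passed
 * in as a <ctype.h>-style predicate (isdigit, isalpha, ispunct, ...).
 * Only characters with at least one occurrence are printed.
 * Returns the number of lines printed.
 */
int PrintFilteredStatistics(const long* statistics, int (*filter)(int))
{
   int i;
   int printed = 0;
   for(i = 0; i<=UCHAR_MAX; i++)
   {
      if(filter(i) && (statistics[i]!=0))
      {
         printf("%c <%ld>\n",i,statistics[i]);
         printed++;
      }
   }
   return printed;
}
```

Calling PrintFilteredStatistics(statistics, isdigit) would then list only the digits, and PrintFilteredStatistics(statistics, ispunct) only the punctuation.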
