Solving a classic data science problem in C99 and C++11.

While languages like Python and R are becoming more and more popular for data science, C and C++ can be a strong choice for solving data science problems efficiently. In this article, we will use C99 and C++11 to write a program that works with Anscombe's quartet, which I will describe below.

I wrote about my motivation for constantly learning languages in an article on Python and GNU Octave, which is worth reading. All programs are designed for the command line rather than a graphical user interface (GUI). Full examples are available in the polyglot_fit repository.

Programming task


The program you will write in this series:

  • Reads data from a CSV file
  • Fits the data with a straight line (i.e., f(x) = m ⋅ x + q)
  • Writes the result to an image file

This is a common task that many data scientists face. The example data is the first set of Anscombe's quartet, shown in the table below. It is a set of artificially constructed data that gives the same results when fitted with a straight line, although the plots look very different. The data file is a text file that uses tabs as column separators and starts with a few header lines. This task uses only the first set (i.e., the first two columns).

Anscombe's quartet

[image]

Solving it in C


C is a general-purpose programming language and one of the most popular languages in use today (according to the TIOBE Index, RedMonk Programming Language Rankings, Popularity of Programming Language Index, and a GitHub study). It is a fairly old language (created around 1973), and many successful programs have been written in it (for example, the Linux kernel and Git). It is also about as close as you can get to the inner workings of the computer, since it lets you manage memory directly. It is a compiled language, so the source code must be translated into machine code by a compiler. Its standard library is small and lightweight, so other libraries have been developed to provide the missing functionality.

This is the language I use most for number crunching, mainly because of its performance. I find it rather tedious to use, since it requires a lot of boilerplate code, but it is well supported in a wide range of environments. The C99 standard is a relatively recent revision that adds some nifty features and is well supported by compilers.

I will cover the necessary background for programming in C and C++ along the way, so that both beginners and advanced users can follow.

Installation


To develop in C99, you need a compiler. I usually use Clang, but GCC, another full-fledged open source compiler, will do just as well. To fit the data, I decided to use the GNU Scientific Library (GSL). I could not find any reasonable plotting library, so this program relies on an external program: Gnuplot. The example also uses a dynamic data structure for storing data, which comes from the Berkeley Software Distribution (BSD).

Installing everything on Fedora is very easy:

sudo dnf install clang gnuplot gsl gsl-devel 

Code Comments


In C99, a comment starts with //, and the compiler discards everything from there to the end of the line. Anything between /* and */ is discarded as well.

// The compiler will ignore this comment.
/* And it will ignore this one too */

Required Libraries


Libraries consist of two parts:

  • Header file containing a description of the functions
  • Source file containing function definitions

Header files are included in the source code, and the compiled code of the libraries is linked into the executable. The following header files are needed for this example:

// Input/output utilities
#include <stdio.h>
// The standard library
#include <stdlib.h>
// String manipulation utilities
#include <string.h>
// The BSD queue data structures
#include <sys/queue.h>
// GSL scientific utilities
#include <gsl/gsl_fit.h>
#include <gsl/gsl_statistics_double.h>

Main Function


In C, the program must be inside a special function called main():

int main(void) { ... } 

This differs from Python, covered in the previous article, which executes whatever code it finds in the source files.

Variable Definition


In C, variables must be declared before they are used, and they must be associated with a type. Whenever you want to use a variable, you have to decide what data to store in it. You can also indicate whether you intend to use the variable as a constant value, which is not necessary, but the compiler can benefit from this information. Example from the fitting_C99.c program located in the repository:

const char *input_file_name = "anscombe.csv";
const char *delimiter = "\t";
const unsigned int skip_header = 3;
const unsigned int column_x = 0;
const unsigned int column_y = 1;
const char *output_file_name = "fit_C99.csv";
const unsigned int N = 100;

Arrays in C are not dynamic in the sense that their length must be determined in advance (i.e. before compilation):

int data_array[1024]; 

Since you usually don't know how many data points are in a file, use a singly linked list. This is a dynamic data structure that can grow indefinitely. Fortunately, BSD provides singly linked lists. Here is an example definition:

struct data_point {
    double x;
    double y;

    SLIST_ENTRY(data_point) entries;
};

SLIST_HEAD(data_list, data_point) head = SLIST_HEAD_INITIALIZER(head);
SLIST_INIT(&head);

This example defines a list of data_point structures, each holding an x value and a y value. The syntax is rather complicated but intuitive, and describing it in detail would take too long.
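To make the list macros a little more concrete, here is a minimal sketch of the full life cycle of such a list: pushing points, walking the list, and releasing it. The helper names push_point(), count_points() and free_points() are made up for this illustration and do not come from the repository; the sketch assumes a system that ships <sys/queue.h> (the BSDs, macOS, and Linux with glibc).

```c
#include <stdlib.h>
#include <sys/queue.h>

/* One node of the list: the payload plus the link to the next node. */
struct data_point {
    double x;
    double y;
    SLIST_ENTRY(data_point) entries;
};

/* Declares "struct data_list", the type of the list head. */
SLIST_HEAD(data_list, data_point);

/* Hypothetical helper: allocate a point and push it at the front.
   Returns 0 on success, -1 if malloc() fails. */
int push_point(struct data_list *head, double x, double y)
{
    struct data_point *datum = malloc(sizeof *datum);
    if (datum == NULL) {
        return -1;
    }
    datum->x = x;
    datum->y = y;
    SLIST_INSERT_HEAD(head, datum, entries);
    return 0;
}

/* Hypothetical helper: count the nodes by walking the list. */
size_t count_points(struct data_list *head)
{
    size_t n = 0;
    struct data_point *datum;
    SLIST_FOREACH(datum, head, entries) {
        n += 1;
    }
    return n;
}

/* Hypothetical helper: unlink and free every node. */
void free_points(struct data_list *head)
{
    while (!SLIST_EMPTY(head)) {
        struct data_point *datum = SLIST_FIRST(head);
        SLIST_REMOVE_HEAD(head, entries);
        free(datum);
    }
}
```

Because SLIST_INSERT_HEAD() always pushes at the front, walking the list visits the points in the reverse of their insertion order. That does not matter for a linear fit, but it is worth keeping in mind if the order of the rows is important.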

Printout


To print to the terminal, you can use the printf() function, which works like the printf() function in Octave (described in the first article):

printf("#### Anscombe's first set with C99 ####\n");

The printf() function does not automatically add a newline at the end of the printed string, so you need to add it yourself. The first argument is a string that can contain format information for the other arguments passed to the function, for example:

printf("Slope: %f\n", slope); 
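The format string decides how a number is rendered, which is easy to overlook. As a small illustration (the helper names format_fixed() and format_general() are invented here), the sketch below formats the same value with %f, which always prints six digits after the decimal point, and with %g, which keeps six significant digits and drops trailing zeros. It uses snprintf(), the standard sibling of printf() that writes into a buffer instead of the terminal.

```c
#include <stdio.h>

/* Hypothetical helper: fixed notation, six digits after the decimal point. */
int format_fixed(char *buffer, size_t size, double value)
{
    return snprintf(buffer, size, "%f", value);
}

/* Hypothetical helper: general notation, six significant digits. */
int format_general(char *buffer, size_t size, double value)
{
    return snprintf(buffer, size, "%g", value);
}
```

With a value of 3.000091, format_fixed() produces 3.000091 and format_general() produces 3.00009. The second form is also what C++ output streams use by default, which explains the small difference between the intercepts printed by the C99 and C++11 programs.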

Reading data


Now comes the hard part... There are several libraries for parsing CSV files in C, but none of them seemed stable or popular enough to be in the Fedora package repository. Instead of adding a dependency for this tutorial, I decided to write this part myself. Again, going into details would take too long, so I will only explain the general idea. Some lines of the source code are omitted for brevity, but you can find the complete example in the repository.

First open the input file:

FILE* input_file=fopen(input_file_name, "r"); 

Then read the file line by line until an error occurs or until the file ends:

while (!ferror(input_file) && !feof(input_file)) {
    size_t buffer_size = 0;
    char *buffer = NULL;

    // getline() allocates the buffer and reads one whole line into it
    getline(&buffer, &buffer_size, input_file);

    ...
}
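An alternative way to drive the same loop, shown here only as a sketch and not as the repository's code, is to test the return value of getline(), which is -1 at end of file or on error. The helper name count_lines() is hypothetical. The feature-test macro on the first line is an assumption about the build: with a strict -std=c99 setting, glibc hides POSIX functions such as getline() unless it is defined before any header is included.

```c
#define _POSIX_C_SOURCE 200809L /* exposes getline() under -std=c99 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: count the lines of an already opened stream. */
long count_lines(FILE *stream)
{
    char *buffer = NULL;    /* getline() allocates and grows this buffer */
    size_t buffer_size = 0;
    long lines = 0;

    assert(stream != NULL);

    /* getline() returns -1 at end of file or on a read error. */
    while (getline(&buffer, &buffer_size, stream) != -1) {
        lines += 1;
    }

    free(buffer);           /* the buffer is reused, so one free() is enough */
    return lines;
}
```

Declaring the buffer outside the loop lets getline() reuse it for every line, so it is allocated a handful of times at most and freed exactly once.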

The getline() function is a nice, fairly recent addition from the POSIX.1-2008 standard. It reads a whole line from a file and takes care of allocating the necessary memory. Each line is then split into tokens with the strtok() function. Looping over the tokens, select the columns you need:

char *token = strtok(buffer, delimiter);

while (token != NULL) {
    double value;
    sscanf(token, "%lf", &value);

    // keep only the two columns of interest
    if (column == column_x) {
        x = value;
    } else if (column == column_y) {
        y = value;
    }

    column += 1;
    token = strtok(NULL, delimiter);
}
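The token loop can be packaged into a small function that also reports whether both columns were actually found, which helps when skipping header lines. This is a sketch under a few assumptions: the name parse_line() is invented, and it swaps sscanf() for strtod(), because strtod() makes it easy to detect a token that is not a number at all.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: pick two columns out of one delimited line.
   The line is modified in place, because strtok() overwrites each
   delimiter it finds with a '\0' byte.
   Returns 0 if both columns were found and parsed, -1 otherwise. */
int parse_line(char *line, const char *delimiter,
               unsigned int column_x, unsigned int column_y,
               double *x, double *y)
{
    unsigned int column = 0;
    int found = 0;  /* bit 0: x was parsed, bit 1: y was parsed */

    for (char *token = strtok(line, delimiter);
         token != NULL;
         token = strtok(NULL, delimiter)) {
        char *end = NULL;
        const double value = strtod(token, &end);

        if (end != token) {  /* at least one character was converted */
            if (column == column_x) { *x = value; found |= 1; }
            if (column == column_y) { *y = value; found |= 2; }
        }
        column += 1;
    }

    return (found == 3) ? 0 : -1;
}
```

Note that strtok() treats a run of consecutive delimiters as a single one, so empty fields are silently skipped and later columns shift to the left. For a file without missing values, such as the one used here, that is not a problem.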

Finally, when the x and y values are selected, add a new point to the list:

struct data_point *datum = malloc(sizeof(struct data_point));
datum->x = x;
datum->y = y;

SLIST_INSERT_HEAD(&head, datum, entries);

The malloc() function dynamically allocates (reserves) memory for the new point; that memory stays reserved until it is explicitly freed.

Data fitting


The GSL linear fitting function gsl_fit_linear() expects plain arrays as input. Since you cannot know the size of these arrays in advance, you must allocate their memory manually:

const size_t entries_number = row - skip_header - 1;

double *x = malloc(sizeof(double) * entries_number);
double *y = malloc(sizeof(double) * entries_number);

Then walk through the list and copy the data into the arrays:

SLIST_FOREACH(datum, &head, entries) {
    const double current_x = datum->x;
    const double current_y = datum->y;

    x[i] = current_x;
    y[i] = current_y;

    i += 1;
}

Now that you are done with the list, clean up. Always free memory that has been allocated manually to prevent a memory leak. Memory leaks are bad, bad, bad. Every time memory is not freed, a garden gnome loses its head:

while (!SLIST_EMPTY(&head)) {
    struct data_point *datum = SLIST_FIRST(&head);

    SLIST_REMOVE_HEAD(&head, entries);

    free(datum);
}

Finally, finally(!), you can fit your data:

gsl_fit_linear(x, 1, y, 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x, 1, y, 1, entries_number);

printf("Slope: %f\n", slope);
printf("Intercept: %f\n", intercept);
printf("Correlation coefficient: %f\n", r_value);
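To see what the fit computes, it may help to write the unweighted least-squares formulas out by hand: the slope is the sum of (x - mean_x)(y - mean_y) divided by the sum of (x - mean_x)^2, and the intercept is mean_y - slope * mean_x. The sketch below is a plain C illustration of those formulas, not a replacement for GSL: the name fit_line() is made up, and unlike gsl_fit_linear() it does not return the covariance matrix or the sum of squared residuals.

```c
#include <stddef.h>

/* Hypothetical helper: unweighted least-squares fit of
   y = intercept + slope * x.
   Returns 0 on success, -1 if the fit is undefined
   (fewer than two points, or all x values identical). */
int fit_line(const double *x, const double *y, size_t n,
             double *intercept, double *slope)
{
    if (n < 2) {
        return -1;
    }

    /* First pass: the means of x and y. */
    double mean_x = 0.0;
    double mean_y = 0.0;
    for (size_t i = 0; i < n; i += 1) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= (double)n;
    mean_y /= (double)n;

    /* Second pass: sums of squared and cross deviations. */
    double sxx = 0.0;
    double sxy = 0.0;
    for (size_t i = 0; i < n; i += 1) {
        const double dx = x[i] - mean_x;
        sxx += dx * dx;
        sxy += dx * (y[i] - mean_y);
    }
    if (sxx == 0.0) {
        return -1;
    }

    *slope = sxy / sxx;
    *intercept = mean_y - *slope * mean_x;
    return 0;
}
```

Calling fit_line(x, y, entries_number, &intercept, &slope) on the two arrays built above should give a slope and an intercept that agree with the GSL result up to rounding.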

Graphing


To plot, you must use an external program. Therefore, save the fit function in an external file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    fprintf(output_file, "%f\t%f\n", current_x, current_y);
}
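The excerpt uses min_x and max_x without showing where they come from. One plausible way to obtain them, offered as an assumption rather than as the repository's code, is a single pass over the x array. The sketch below pairs that with the sampling loop wrapped in a function; both data_range() and write_fit() are hypothetical names.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: find the smallest and largest value in one pass.
   Returns 0 on success, -1 for an empty array. */
int data_range(const double *values, size_t n, double *min, double *max)
{
    if (n == 0) {
        return -1;
    }
    *min = values[0];
    *max = values[0];
    for (size_t i = 1; i < n; i += 1) {
        if (values[i] < *min) { *min = values[i]; }
        if (values[i] > *max) { *max = values[i]; }
    }
    return 0;
}

/* Hypothetical helper: sample the fitted line at n evenly spaced points,
   starting one unit to the left of min_x, and write "x<TAB>y" rows.
   Returns 0 on success, -1 on a bad argument or a write error. */
int write_fit(FILE *output_file, double intercept, double slope,
              double min_x, double max_x, unsigned int n)
{
    if (output_file == NULL || n == 0) {
        return -1;
    }

    const double start_x = min_x - 1;
    const double step_x = ((max_x + 1) - start_x) / n;

    for (unsigned int i = 0; i < n; i += 1) {
        const double current_x = start_x + step_x * i;
        const double current_y = intercept + slope * current_x;

        if (fprintf(output_file, "%f\t%f\n", current_x, current_y) < 0) {
            return -1;
        }
    }
    return 0;
}
```

Padding the range by one unit on each side makes the plotted line extend slightly beyond the outermost data points.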

The Gnuplot command for plotting both files is:

plot 'fit_C99.csv' using 1:2 with lines title 'Fit', 'anscombe.csv' using 1:2 with points pointtype 7 title 'Data' 

Results


Before starting the program, you must compile it:

clang -std=c99 -I/usr/include/ fitting_C99.c -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_C99


This command tells the compiler to use the C99 standard, read the fitting_C99.c file, link against the gsl and gslcblas libraries, and save the result in fitting_C99. The resulting output on the command line:

#### Anscombe's first set with C99 ####
Slope: 0.500091
Intercept: 3.000091
Correlation coefficient: 0.816421

[image]
Here is the resulting image generated using Gnuplot.

Solving it in C++11


C++ is a general-purpose programming language that is also one of the most popular languages in use today. It was created as a successor to C (in 1983) with an emphasis on object-oriented programming (OOP). C++ is commonly regarded as a superset of C, so a C program should compile with a C++ compiler. This is not always true, as there are some corner cases where the two behave differently. In my experience, C++ needs less boilerplate than C, but its syntax is more difficult if you want to develop objects. The C++11 standard is a relatively recent revision that adds some nifty features and is more or less supported by compilers.

Since C++ is largely compatible with C, I will only highlight the differences between the two. If I do not cover a section in this part, it is the same as in C.

Installation


The dependencies for C++ are the same as for the C example. On Fedora, run the following command:

sudo dnf install clang gnuplot gsl gsl-devel 

Required Libraries


Libraries work the same as they do in C, but include directives are slightly different:

#include <cstdlib>
#include <cstring>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <algorithm>

extern "C" {
#include <gsl/gsl_fit.h>
#include <gsl/gsl_statistics_double.h>
}

Since the GSL libraries are written in C, you must tell the compiler about it, which is what the extern "C" block does.
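Seen from the library's side, a common pattern is for a C header to wrap its own declarations in an extern "C" guard that only a C++ compiler sees, so the same header works in both languages. The sketch below shows that pattern on a made-up, one-function library; line_value() is hypothetical and is not part of GSL.

```c
/* The __cplusplus macro is defined only when a C++ compiler reads this
   file, so a C compiler skips the extern "C" lines entirely. */
#ifdef __cplusplus
extern "C" {
#endif

/* Hypothetical library function: evaluate the line at a given x. */
double line_value(double intercept, double slope, double x);

#ifdef __cplusplus
}
#endif

/* The definition would normally live in the library's .c file. */
double line_value(double intercept, double slope, double x)
{
    return intercept + slope * x;
}
```

The guard tells a C++ compiler to use C linkage for these names, that is, not to mangle them, so that the linker can match them with the symbols produced by the C compiler.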

Variable Definition


C++ supports more types (classes) of data than C, for example, a string type that has much more capabilities than its C counterpart. Update the definition of variables accordingly:

const std::string input_file_name("anscombe.csv"); 

For structured objects, such as strings, you can define a variable without using the = sign.

Printout


You can use the printf() function, but it is more common to use cout. Use the << operator to specify the string (or objects) that you want to print with cout:

std::cout << "#### Anscombe's first set with C++11 ####" << std::endl;

...

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;


Reading data


The scheme is the same as before. A file is opened and read line by line, but with a different syntax:

std::ifstream input_file(input_file_name);

while (input_file.good()) {
    std::string line;

    getline(input_file, line);

    ...
}


The line tokens are extracted with the same function as in the C99 example. Instead of standard C arrays, use two vectors. Vectors are an extension of C arrays in the C++ standard library that manage memory dynamically without explicit calls to malloc():

std::vector<double> x;
std::vector<double> y;

// Add elements to x and y
x.emplace_back(value);
y.emplace_back(value);

Data fitting


To fit the data in C++, you do not have to worry about lists, since vectors are guaranteed to store their elements in contiguous memory. You can pass pointers to the vectors' buffers directly to the fitting function:

gsl_fit_linear(x.data(), 1, y.data(), 1, entries_number,
               &intercept, &slope,
               &cov00, &cov01, &cov11, &chi_squared);
const double r_value = gsl_stats_correlation(x.data(), 1, y.data(), 1, entries_number);

std::cout << "Slope: " << slope << std::endl;
std::cout << "Intercept: " << intercept << std::endl;
std::cout << "Correlation coefficient: " << r_value << std::endl;

Graphing


Plotting is done the same way as before. Write to file:

const double step_x = ((max_x + 1) - (min_x - 1)) / N;

for (unsigned int i = 0; i < N; i += 1) {
    const double current_x = (min_x - 1) + step_x * i;
    const double current_y = intercept + slope * current_x;

    output_file << current_x << "\t" << current_y << std::endl;
}

output_file.close();

And then use Gnuplot to plot.

Results


Before starting the program, it must be compiled with a similar command:

clang++ -std=c++11 -I/usr/include/ fitting_Cpp11.cpp -L/usr/lib/ -L/usr/lib64/ -lgsl -lgslcblas -o fitting_Cpp11

The resulting output on the command line:

#### Anscombe's first set with C++11 ####
Slope: 0.500091
Intercept: 3.00009
Correlation coefficient: 0.816421

And here is the resulting image generated using Gnuplot.

[image]

Conclusion


This article provides examples of data fitting and plotting in C99 and C++11. Since C++ is largely compatible with C, the second example builds on the similarities between the two. In some respects, C++ is easier to use because it partially relieves the burden of explicit memory management, but its syntax is more complex because it introduces the ability to write classes for OOP. However, you can still write C in an OOP style, since OOP is a style of programming that can be used in any language. There are some great examples of OOP in C, such as the GObject and Jansson libraries.

For number crunching, I prefer C99 because of its simpler syntax and widespread support. Until recently, C++11 was not as widely supported, and I tended to avoid the rough edges of the earlier versions. For more complex software, C++ may be a good choice.

Are you using C or C++ for Data Science? Share your experience in the comments.
