Creating Data Sets

This tutorial demonstrates how to create the files used by initialiseDataSetFromFile to initialise dataSet structures.

The dataSet structures are used to pass training data into the CGP-Library when using the default supervised learning fitness function.  The dataSet structures can also be used by custom fitness functions as demonstrated in the Custom Fitness Function tutorial.

Summary
Creating Data SetsThis tutorial demonstrates how to create the files used by initialiseDataSetFromFile to initialise dataSet structures.
The ProgramA simple program which demonstrates how to create custom dataSet files by producing the symbolic regression dataSet file used in the Getting Started tutorial.
Stepping Through the CodeFirst standard header files are used along with cgp.h.

The Program

A simple program which demonstrates how to create custom dataSet files by producing the symbolic regression dataSet file used in the Getting Started tutorial.

The program below is provided in the CGP-Library download within /examples/createDataSet.c and is included in the code::blocks project.

#include <stdio.h>
#include <math.h>
#include "../src/cgp.h"

#define NUMINPUTS 1
#define NUMOUTPUTS 1
#define NUMSAMPLES 101
#define INPUTRANGE 10.0

/*
    Returns x^6 - 2x^4 + x^2
*/
double symbolicEq1(double x){
    return pow(x,6) - 2*pow(x,4) + pow(x,2);
}

int main(void){

    int i;

    struct dataSet *data = NULL;

    double inputs[NUMSAMPLES][NUMINPUTS];
    double outputs[NUMSAMPLES][NUMOUTPUTS];

    double inputTemp;
    double outputTemp;

    for(i=0; i<NUMSAMPLES; i++){

        inputTemp = (i * (INPUTRANGE/(NUMSAMPLES-1))) - INPUTRANGE/2;
        outputTemp = symbolicEq1(inputTemp);

        inputs[i][0] = inputTemp;
        outputs[i][0] = outputTemp;
    }

    data = initialiseDataSetFromArrays(NUMINPUTS, NUMOUTPUTS, NUMSAMPLES, inputs[0], outputs[0]);

    saveDataSet(data, "symbolic.data");

    freeDataSet(data);

    return 0;
}

Stepping Through the Code

First standard header files are used along with cgp.h.

#include <stdio.h>
#include <math.h>
#include "../src/cgp.h"

A number of #defines are used to keep the main function neat (no memory management).  The NUMINPUTS and NUMOUTPUTS describe how many inputs and outputs each sample in the dataSet will have.  The NUMSAMPLES give the total number of samples in the created dataSet.  The INPUTRANGE specifies the range of the values the inputs will take centred around zero (+/- 5).

#define NUMINPUTS 1
#define NUMOUTPUTS 1
#define NUMSAMPLES 101
#define INPUTRANGE 10.0

The data stored in a dataSet structures can come from any source but in this tutorial they shall be generated using the following function.  The function returns x^6 - 2x^4 + x^2 for a given input x.

double symbolicEq1(double x){
    return pow(x,6) - 2*pow(x,4) + pow(x,2);
}

Within the main function a few variables are used including a dataSet structure.  This dataSet structure will be populated with input-output pairs (samples) before being save to a file.

The arrays “inputs” and “outputs” are used to store the input output pairs to be stored in the dataSet structure with inputTemp and outputTemp used to store intermediate values.

struct dataSet *data = NULL;

double inputs[NUMSAMPLES][NUMINPUTS];
double outputs[NUMSAMPLES][NUMOUTPUTS];

double inputTemp;
double outputTemp;

The following for loop calculates 101 input output pairs evenly sampled over the range +/- 5.  The output value for each input is calculated using the previously defined function.  Each sample’s input and output value is then stored in the “inputs” and “outputs” arrays.  Note it is important that the first dimension of the two arrays indexes the sample and the second dimension the respective input or output.

for(i=0; i<NUMSAMPLES; i++){

    inputTemp = (i * (INPUTRANGE/(NUMSAMPLES-1))) - INPUTRANGE/2;
    outputTemp = symbolicEq1(inputTemp);

    inputs[i][0] = inputTemp;
    outputs[i][0] = outputTemp;
}

Using the arrays which contain containing each samples input(s) and output(s) we can initialise a dataSet using initialiseDataSetFromArrays.  Where initialiseDataSetFromArrays takes the number of inputs, outputs and samples along with a pointer to the first element in the “inputs” and “outputs” arrays.  Note again the structure of the arrays, they must be in this form.  Once the dataSet has been initialised it can be printed to the terminal for inspection using printDataSet.

data = initialiseDataSetFromArrays(NUMINPUTS, NUMOUTPUTS, NUMSAMPLES, inputs[0], outputs[0]);

Now we have an initialised dataSet it can be saved to a file using saveDataSet.  Where saveDataSet takes the dataSet to be saved and the location to save the file.  Once saved the dataSet can be later used to initialise a new dataSet using initialiseDataSetFromFile as seen in the Getting Started tutorial.

saveDataSet(data, "symbolic.data");

Finally the dataSet structure used should be free’d.

freeDataSet(data);

And that’s it, dataSet structures can be initialised from arrays using initialiseDataSetFromArrays (provided the arrays are in the right form) and then saved to a file using saveDataSet to be read in again later.

DLL_EXPORT struct dataSet *initialiseDataSetFromFile(char const *file)
Initialises a dataSet structures using the given file.
struct dataSet
Stores a data set which can be used by fitness functions when calculating a chromosomes fitness.
This tutorial introduces and steps through a simple program which uses the CGP-Library to solve a symbolic regression problem.
This tutorial introduces how to use a custom fitness functions with the CGP-Library.
DLL_EXPORT struct dataSet *initialiseDataSetFromArrays(int numInputs,
int numOutputs,
int numSamples,
double *inputs,
double *outputs)
Initialises a dataSet structures using the given input output pairs.
DLL_EXPORT void printDataSet(struct dataSet *data)
Prints the input output pairs held by a dataSet structure to the terminal.
DLL_EXPORT void saveDataSet(struct dataSet *data,
char const *fileName)
Saves the given dataSet to a file which can be read using initialiseDataSetFromFile.
Close