
Overview

The Prediction_Data_validation class handles all validation for raw prediction data, ensuring files meet schema requirements before processing. Invalid files are automatically separated for review.

Schema Validation

Validation is based on the prediction schema defined in schema_prediction.json:
{
  "SampleFileName": "fraudDetection_021119920_010222.csv",
  "LengthOfDateStampInFile": 9,
  "LengthOfTimeStampInFile": 6,
  "NumberofColumns": 38,
  "ColName": {
    "months_as_customer": "Integer",
    "age": "Integer",
    "policy_number": "Integer",
    // ... 35 more columns
  }
}

Schema Extraction

The valuesFromSchema method extracts validation parameters from predictionDataValidation.py:30-77:
def valuesFromSchema(self):
    """
    Method Name: valuesFromSchema
    Description: This method extracts all the relevant information from the 
                 pre-defined "Schema" file.
    Output: LengthOfDateStampInFile, LengthOfTimeStampInFile, 
            column_names, Number of Columns
    On Failure: Raise ValueError, KeyError, Exception
    """
    try:
        with open(self.schema_path, 'r') as f:
            dic = json.load(f)
        
        pattern = dic['SampleFileName']
        LengthOfDateStampInFile = dic['LengthOfDateStampInFile']
        LengthOfTimeStampInFile = dic['LengthOfTimeStampInFile']
        column_names = dic['ColName']
        NumberofColumns = dic['NumberofColumns']

        file = open("Prediction_Logs/valuesfromSchemaValidationLog.txt", 'a+')
        message = ("LengthOfDateStampInFile:: %s" % LengthOfDateStampInFile + 
                   "\t" + "LengthOfTimeStampInFile:: %s" % LengthOfTimeStampInFile + 
                   "\t " + "NumberofColumns:: %s" % NumberofColumns + "\n")
        self.logger.log(file, message)
        file.close()

    except ValueError:
        file = open("Prediction_Logs/valuesfromSchemaValidationLog.txt", 'a+')
        self.logger.log(file, "ValueError: Value not found inside schema_prediction.json")
        file.close()
        raise ValueError

    except KeyError:
        file = open("Prediction_Logs/valuesfromSchemaValidationLog.txt", 'a+')
        self.logger.log(file, "KeyError: Incorrect key passed to schema dictionary")
        file.close()
        raise KeyError

    except Exception as e:
        file = open("Prediction_Logs/valuesfromSchemaValidationLog.txt", 'a+')
        self.logger.log(file, str(e))
        file.close()
        raise e

    return LengthOfDateStampInFile, LengthOfTimeStampInFile, column_names, NumberofColumns

File Format Requirements

File Naming

Must match regex: ['fraudDetection']+['_']+[\d_]+[\d]+\.csv
Example: fraudDetection_021119920_010222.csv

Date Stamp

Must be exactly 9 characters long
Example: 021119920

Time Stamp

Must be exactly 6 characters long
Example: 010222

Column Count

Must contain exactly 38 columns as defined in schema
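
The naming and stamp-length requirements above can be checked directly by splitting a filename on underscores. A minimal sketch; the helper name `check_filename_parts` is illustrative and not part of the class:

```python
import os

def check_filename_parts(filename, date_len=9, time_len=6):
    """Check the prefix, date-stamp length, and time-stamp length."""
    base, ext = os.path.splitext(filename)
    if ext != ".csv":
        return False
    parts = base.split("_")
    if len(parts) != 3 or parts[0] != "fraudDetection":
        return False
    return (parts[1].isdigit() and len(parts[1]) == date_len
            and parts[2].isdigit() and len(parts[2]) == time_len)

print(check_filename_parts("fraudDetection_021119920_010222.csv"))  # True
print(check_filename_parts("fraudDetection_02111992_010222.csv"))   # False (8-char date stamp)
```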

Filename Validation

The system validates filenames using regex from predictionDataValidation.py:80-95:
def manualRegexCreation(self):
    """
    Method Name: manualRegexCreation
    Description: This method contains a manually defined regex based on the 
                 "FileName" given in "Schema" file. This Regex is used to 
                 validate the filename of the prediction data.
    Output: Regex pattern
    On Failure: None
    """
    regex = "['fraudDetection']+['_']+[\\d_]+[\\d]+\\.csv"
    return regex
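
The returned pattern can be applied with `re.match`. Note that the bracketed groups are character classes, so the pattern is permissive (letter order inside the brackets is not enforced), but it accepts the schema's sample filename and rejects names with a different prefix:

```python
import re

# Regex as returned by manualRegexCreation
regex = "['fraudDetection']+['_']+[\\d_]+[\\d]+\\.csv"

print(re.match(regex, "fraudDetection_021119920_010222.csv") is not None)  # True
print(re.match(regex, "report_021119920_010222.csv") is not None)          # False
```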

Validation Process

The validation workflow includes multiple checks:
1. Directory Setup

Creates Good_Raw/ and Bad_Raw/ directories for file sorting.
path = os.path.join("Prediction_Raw_Files_Validated/", "Good_Raw/")
if not os.path.isdir(path):
    os.makedirs(path)
path = os.path.join("Prediction_Raw_Files_Validated/", "Bad_Raw/")
if not os.path.isdir(path):
    os.makedirs(path)
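
The two create-if-missing calls above can be collapsed, since `os.makedirs(..., exist_ok=True)` is idempotent. A sketch; a temporary base directory is used here only to keep the example self-contained:

```python
import os
import tempfile

base = os.path.join(tempfile.mkdtemp(), "Prediction_Raw_Files_Validated")
for sub in ("Good_Raw", "Bad_Raw"):
    # exist_ok=True: no error if the directory already exists
    os.makedirs(os.path.join(base, sub), exist_ok=True)

print(sorted(os.listdir(base)))  # ['Bad_Raw', 'Good_Raw']
```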
2. Filename Validation

Validates each file against the regex pattern and timestamp requirements from predictionDataValidation.py:228-274. Files matching all criteria are copied to Good_Raw/; all others go to Bad_Raw/.
3. Column Count Validation

Verifies that each good file has exactly 38 columns from predictionDataValidation.py:279-318:
def validateColumnLength(self, NumberofColumns):
    f = open("Prediction_Logs/columnValidationLog.txt", 'a+')
    try:
        self.logger.log(f, "Column Length Validation Started!!")

        for file in listdir('Prediction_Raw_Files_Validated/Good_Raw/'):
            csv = pd.read_csv("Prediction_Raw_Files_Validated/Good_Raw/" + file)
            if csv.shape[1] == NumberofColumns:
                csv.to_csv("Prediction_Raw_Files_Validated/Good_Raw/" + file,
                          index=None, header=True)
            else:
                shutil.move("Prediction_Raw_Files_Validated/Good_Raw/" + file,
                           "Prediction_Raw_Files_Validated/Bad_Raw")
                self.logger.log(f,
                    "Invalid Column Length for the file!! File moved to Bad Raw Folder :: %s" % file)

        self.logger.log(f, "Column Length Validation Completed!!")
    except Exception as e:
        self.logger.log(f, "Error Occurred:: %s" % e)
        raise e
    finally:
        f.close()
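
The core of this check is a single `shape[1]` comparison. A small demonstration on an in-memory CSV (pandas assumed available, as in the project):

```python
import io
import pandas as pd

csv_text = "months_as_customer,age,policy_number\n328,48,521585\n"
df = pd.read_csv(io.StringIO(csv_text))

NumberofColumns = 38  # value extracted from schema_prediction.json
print(df.shape[1], df.shape[1] == NumberofColumns)  # 3 False
```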
4. Missing Values Check

Validates that no column has all values missing from predictionDataValidation.py:325-364:
def validateMissingValuesInWholeColumn(self):
    f = open("Prediction_Logs/missingValuesInColumn.txt", 'a+')
    try:
        self.logger.log(f, "Missing Values Validation Started!!")

        for file in listdir('Prediction_Raw_Files_Validated/Good_Raw/'):
            csv = pd.read_csv("Prediction_Raw_Files_Validated/Good_Raw/" + file)
            count = 0
            for columns in csv:
                # A column is entirely missing when its non-null count is zero.
                if (len(csv[columns]) - csv[columns].count()) == len(csv[columns]):
                    count += 1
                    shutil.move("Prediction_Raw_Files_Validated/Good_Raw/" + file,
                               "Prediction_Raw_Files_Validated/Bad_Raw")
                    self.logger.log(f,
                        "All values missing in a column!! File moved to Bad Raw Folder :: %s" % file)
                    break
            if count == 0:
                csv.to_csv("Prediction_Raw_Files_Validated/Good_Raw/" + file,
                          index=None, header=True)
    except Exception as e:
        self.logger.log(f, "Error Occurred:: %s" % e)
        raise e
    finally:
        f.close()
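
The per-column test used above (`len(col) - col.count() == len(col)`) is true exactly when every value in the column is NaN, because `count()` excludes missing values. A small demonstration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 45, 29],
                   "policy_number": [np.nan, np.nan, np.nan]})

# count() excludes NaN, so the difference equals len(col) only when
# every value in that column is missing.
all_missing = [c for c in df if len(df[c]) - df[c].count() == len(df[c])]
print(all_missing)  # ['policy_number']
```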
5. Archive Bad Files

Moves invalid files to archive with timestamp for review.

Error Handling

The validation system includes comprehensive error handling:
ValueError
Raised when required values are not found in the schema file.
Logged to: Prediction_Logs/valuesfromSchemaValidationLog.txt

KeyError
Raised when incorrect keys are used to access schema data.
Logged to: Prediction_Logs/valuesfromSchemaValidationLog.txt

OSError
Raised during file operations (create/move/delete directories).
Logged to: Prediction_Logs/GeneralLog.txt

Exception
All other exceptions are caught, logged, and re-raised with context.
Logged to: the appropriate log file for the validation step

Validation Logs

Validation activities are logged to separate files:
Prediction_Logs/nameValidationLog.txt: Filename validation results
Prediction_Logs/columnValidationLog.txt: Column count validation
Prediction_Logs/missingValuesInColumn.txt: Missing value checks
Prediction_Logs/valuesfromSchemaValidationLog.txt: Schema extraction
Prediction_Logs/GeneralLog.txt: Directory operations and general errors

Bad File Archival

Files that fail validation are archived from predictionDataValidation.py:181-223:
def moveBadFilesToArchiveBad(self):
    now = datetime.now()
    date = now.date()
    time = now.strftime("%H%M%S")
    try:
        path = "PredictionArchivedBadData"
        if not os.path.isdir(path):
            os.makedirs(path)
        
        source = 'Prediction_Raw_Files_Validated/Bad_Raw/'
        dest = 'PredictionArchivedBadData/BadData_' + str(date) + "_" + str(time)
        
        if not os.path.isdir(dest):
            os.makedirs(dest)
        
        files = os.listdir(source)
        for f in files:
            if f not in os.listdir(dest):
                shutil.move(source + f, dest)
        
        file = open("Prediction_Logs/GeneralLog.txt", 'a+')
        self.logger.log(file, "Bad files moved to archive")
        
        path = 'Prediction_Raw_Files_Validated/'
        if os.path.isdir(path + 'Bad_Raw/'):
            shutil.rmtree(path + 'Bad_Raw/')
        
        self.logger.log(file, "Bad Raw Data Folder Deleted successfully!!")
        file.close()
    except OSError as e:
        file = open("Prediction_Logs/GeneralLog.txt", 'a+')
        self.logger.log(file, "Error while moving bad files to archive:: %s" % e)
        file.close()
        raise e
Bad files are archived with a timestamp (e.g., PredictionArchivedBadData/BadData_2026-03-04_143022/) for later review and resubmission after correction.
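
The archive folder name is built from the current date plus an HHMMSS time string, exactly as in the method above:

```python
from datetime import datetime

now = datetime.now()
dest = ("PredictionArchivedBadData/BadData_"
        + str(now.date()) + "_" + now.strftime("%H%M%S"))
print(dest)  # e.g. PredictionArchivedBadData/BadData_2026-03-04_143022
```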

Next Steps

Prediction Overview

Understand the complete prediction workflow

Batch Prediction

Learn how validated files are processed
