classifier
Class WekaSVM

java.lang.Object
  extended by classifier.WekaSVM
All Implemented Interfaces:
Classifier

public class WekaSVM
extends java.lang.Object
implements Classifier

This class implements the Classifier interface for use with the SMO (SVM) implementation of the WEKA(TM) machine learning library.

Author:
Michiel Van Bel

Nested Class Summary
 
Nested classes/interfaces inherited from interface classifier.Classifier
Classifier.DATA_TYPE
 
Constructor Summary
WekaSVM(org.apache.log4j.Logger logger, ClassificationAction ca)
          Constructor; only initializes the necessary variables.
 
Method Summary
 void applyAttributeFilter(java.util.List<java.lang.Integer> attributeFilter, int maxNumFeatures, java.io.File toBeFilteredFile)
          After feature selection has produced a filter, this filter can be applied to the feature files in order to optimize the SVMs.
 boolean buildClassifier()
          This method builds an SVM model file from a file with training examples.
 java.lang.Double classify_single_instance_fast(double[] features)
          Use the trained classifier to classify a single instance of data in a very fast way, without having to resort to string parsing procedures (recommended method for doing these classifications).
 java.lang.String classify_single_instance(java.lang.String instance)
          Use the trained classifier to classify a single instance of data, defined by the instance parameter.
 void classify(java.lang.String testFile, java.lang.String outputFile)
          Use the SVM (trained, or untrained provided the model file has been set beforehand) to classify data, and write the output to an output file.
 CrossValidationResult crossValidate(int n, int maxPosTrain, int maxNegTrain)
          Performs a cross-validation of the training file.
 java.lang.String generateFeatureString(java.util.List<java.lang.Double> data, Classifier.DATA_TYPE dataType)
          Creates a string of features for only one featurevector
 java.lang.String getFileExtension()
           
 java.lang.String getModelFile()
           
 int[] getPosNegExamplesInFile(java.io.File file)
          Returns the number of positive and negative examples in a training file.
 double getSigmoid_A()
           
 double getSigmoid_B()
           
 weka.core.Instances getTrainingFileInstances()
          Returns the instances used for training the SMO.
 boolean loadClassifier()
          Sets the model file; depending on the implementation, there may be an attempt to build the SVM from this model file.
 boolean loadClassifier(java.lang.String svmFile)
          Sets the model file; depending on the implementation, there may be an attempt to build the SVM from this model file.
 SMO loadModel(java.lang.String fileIn)
          This method loads the SMO java-object from a file.
 java.io.File mergeFeatureFiles(java.io.File tempFilePositive, java.io.File tempFileNegative)
          Merges the featurefiles (one with positive training features, one with negative training features), in order to make the actual training file.
 java.util.List<ValPosCombination> performAttributeEvaluation(boolean sort, weka.attributeSelection.AttributeEvaluator evaluator)
          Performs feature selection by evaluating different attributes.
 java.lang.String[] prepareCrossvalidationCommand(int fold, java.lang.String fileIn, java.lang.String fileOut)
          Creates an array with string values, to be parsed by the implementation of the classifier.
 java.lang.String[] prepareTrainingCommand(java.lang.String fileIn, java.lang.String fileOut)
          Creates an array with string values, to be parsed by the classifier.
 void saveModel(SMO smo, java.lang.String fileOut)
          This method saves the SMO java-object to a file in the filesystem.
 void setModelFile(java.lang.String svmModelFile)
          Changes the model for the classifier by changing the name of the modelfile.
 void setOptions(ClassifierOptions options)
          Changes the various options of this classifier.
 void setSigmoid_A(double sigmoid_A)
          Changes the sigmoid variable A (see documentation about restructuring the output by use of sigmoid curves)
 void setSigmoid_B(double sigmoid_B)
          Changes the sigmoid variable B (see documentation about restructuring the output by use of sigmoid curves)
 java.lang.String to_genomeview_output(int id, java.lang.Double distance, int funsite_start, int funsite_stop, java.lang.String classification_name)
          Method which produces a string that can be used by the GenomeView program.
 java.lang.String to_splice_machine_output(java.lang.Double distance, int funsite, int increase, java.lang.String classification_name)
          Method which produces a string that is similar to the output provided by Splicemachine (with the provided results).
 java.lang.String to_splice_machine_output(java.lang.String classification_result, int funsite, int increase, java.lang.String classification_name)
          Method which produces a string that is similar to that of the Splicemachine program, according to the provided results.
 java.io.File writeTemporaryFeatureData(java.lang.String tempFileName, boolean forward_strand, java.util.List<java.util.List<java.lang.Double>> data, Classifier.DATA_TYPE dataType)
          This method writes the temporary featuredata (being all the features extracted from 1 sequence, each feature in a different list) to a file.
 java.io.File writeTemporaryFeatureData(java.lang.String tempFileName, java.util.List<java.util.List<java.lang.Double>> data, Classifier.DATA_TYPE dataType)
          This method writes the temporary featuredata (being all the features extracted from 1 sequence, each feature in a different list) to a file.
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WekaSVM

public WekaSVM(org.apache.log4j.Logger logger,
               ClassificationAction ca)
Constructor; only initializes the necessary variables.

Parameters:
logger - The logging facility
ca - The ClassificationAction to which this classifier belongs (every classifier belongs to a classification action).
Method Detail

getFileExtension

public java.lang.String getFileExtension()
Specified by:
getFileExtension in interface Classifier
Returns:
The file extension to be used by files containing the features that will be used for either building the classification model or for evaluation.

crossValidate

public CrossValidationResult crossValidate(int n,
                                           int maxPosTrain,
                                           int maxNegTrain)
Performs a cross-validation of the training file and returns the results (number of true positives, false positives, true negatives, false negatives, and derived statistics). The normal cross-validation procedure can be followed. However, the notion is extended so that different numbers of positive and negative examples can be used during the training phase of each cross-validation step. The number of positive and negative examples for the testing phase of each cross-validation step remains the same (being (n-1)*(total_amount)). For further information, see the Crossvalidation.txt document in the /doc subdirectory.

Specified by:
crossValidate in interface Classifier
Parameters:
n - The number of folds of the cross-validation. Common values are 2, 5 and 10.
maxPosTrain - The maximum number of positive training examples during the training phase of the cross-validation.
maxNegTrain - The maximum number of negative training examples during the training phase of the cross-validation.
Returns:
The CrossValidationResult object, which contains the results of the cross-validation and the derived statistics.
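
The capped training phase described above can be illustrated with a minimal sketch. It is not the WekaSVM implementation: the class name CappedFoldSketch, the round-robin fold assignment, and the truncation used to apply the cap are all assumptions made for illustration. In practice the same split would be applied separately to the positive and the negative examples, each with its own cap (maxPosTrain, maxNegTrain).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; not part of the classifier package. */
class CappedFoldSketch {

    /** Indexes of the examples assigned to fold k (round-robin assignment, an assumption). */
    static List<Integer> testIndexes(int total, int n, int k) {
        List<Integer> out = new ArrayList<>();
        for (int i = k; i < total; i += n) {
            out.add(i);
        }
        return out;
    }

    /** Indexes outside fold k, truncated to at most maxTrain entries. */
    static List<Integer> trainIndexes(int total, int n, int k, int maxTrain) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < total && out.size() < maxTrain; i++) {
            if (i % n != k) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int n = 5;
        for (int k = 0; k < n; k++) {
            System.out.println("fold " + k
                    + " test=" + testIndexes(20, n, k)
                    + " train (capped at 8)=" + trainIndexes(20, n, k, 8));
        }
    }
}
```

The test part of each fold is never capped in this sketch; only the training part shrinks when the cap is below the number of available examples.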

prepareCrossvalidationCommand

public java.lang.String[] prepareCrossvalidationCommand(int fold,
                                                        java.lang.String fileIn,
                                                        java.lang.String fileOut)
Creates an array with string values, to be parsed by the implementation of the classifier. This method had to be placed in the Classifier interface to reduce implementation issues.

Specified by:
prepareCrossvalidationCommand in interface Classifier
Parameters:
fold - The fold of the crossvalidation
fileIn - The file containing the training features.
fileOut - The file for output (if applicable).
Returns:
Commandline array with options

prepareTrainingCommand

public java.lang.String[] prepareTrainingCommand(java.lang.String fileIn,
                                                 java.lang.String fileOut)
Creates an array with string values, to be parsed by the classifier.

Specified by:
prepareTrainingCommand in interface Classifier
Parameters:
fileIn - The name of the file containing the extracted features.
fileOut - The name of the file to which the output should be written (if applicable).
Returns:
Commandline array with options
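
As a rough illustration of what such command-line arrays can look like, the sketch below uses WEKA's general classifier options -t (training file), -x (number of cross-validation folds) and -d (file to which the model is written). Whether WekaSVM builds exactly these arrays is not stated here; the class name WekaCommandSketch and both helper methods are hypothetical.

```java
/** Hypothetical illustration; the arrays WekaSVM actually builds may differ. */
class WekaCommandSketch {

    /** Options for training on fileIn and writing the model to fileOut. */
    static String[] trainingCommand(String fileIn, String fileOut) {
        return new String[] {"-t", fileIn, "-d", fileOut};
    }

    /** Options for a cross-validation with the given number of folds on fileIn. */
    static String[] crossValidationCommand(int fold, String fileIn) {
        return new String[] {"-t", fileIn, "-x", String.valueOf(fold)};
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", trainingCommand("train.arff", "model.bin")));
        System.out.println(String.join(" ", crossValidationCommand(10, "train.arff")));
    }
}
```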

buildClassifier

public boolean buildClassifier()
This method builds an SVM model file from a file with training examples. The name of the file containing the extracted features should have been defined prior to this.

Specified by:
buildClassifier in interface Classifier

saveModel

public void saveModel(SMO smo,
                      java.lang.String fileOut)
               throws java.lang.Exception
This method saves the SMO Java object to a file in the filesystem. The file is zipped so that it does not take up too much space.

Parameters:
smo - The SMO to be stored
fileOut - The name of the file in which the SMO should be stored.
Throws:
java.lang.Exception - Exceptions might occur, but the caller should handle them

loadModel

public SMO loadModel(java.lang.String fileIn)
              throws java.lang.Exception
This method loads the SMO java-object from a file.

Parameters:
fileIn - The name of the file containing the SMO object
Returns:
The SMO that is extracted
Throws:
java.lang.Exception - should be handled by the caller
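
Saving and loading a compressed Java object can be sketched with standard Java serialization. This is only an illustration of the general technique: it assumes GZIP compression (the exact compression format used by saveModel is not stated) and works on any Serializable object rather than on SMO specifically. The class name GzipModelStoreSketch is hypothetical.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration of storing a serializable model in a compressed file. */
class GzipModelStoreSketch {

    /** Serializes the model and writes it GZIP-compressed to fileOut. */
    static void save(Serializable model, String fileOut) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new GZIPOutputStream(new FileOutputStream(fileOut)))) {
            out.writeObject(model);
        }
    }

    /** Reads a GZIP-compressed serialized object back from fileIn. */
    static Object load(String fileIn) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new GZIPInputStream(new FileInputStream(fileIn)))) {
            return in.readObject();
        }
    }
}
```

The caller casts the result of load to the expected type and handles the checked exceptions, in line with the Throws clauses above.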

mergeFeatureFiles

public java.io.File mergeFeatureFiles(java.io.File tempFilePositive,
                                      java.io.File tempFileNegative)
Merges the feature files (one with positive training features, one with negative training features) in order to make the actual training file. The type (donor/acceptor) is set by the constructor of the SVMs, so this method can take the necessary information from there.

Specified by:
mergeFeatureFiles in interface Classifier
Parameters:
tempFilePositive - The name of the file with features for positive training
tempFileNegative - The name of the file with features for negative training
Returns:
The resulting training file.
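
A merge of this kind can be sketched as a line-by-line concatenation, which avoids loading either file fully into memory. The sketch assumes plain text files with one example per line and no header; a real WEKA training file (for example ARFF) needs a single header, which this sketch does not handle. The class name MergeFilesSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration: concatenates two feature files line by line. */
class MergeFilesSketch {

    /** Writes all lines of positive, then all lines of negative, to merged. */
    static Path merge(Path positive, Path negative, Path merged) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(merged)) {
            for (Path part : new Path[] {positive, negative}) {
                try (BufferedReader in = Files.newBufferedReader(part)) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        out.write(line);
                        out.newLine();
                    }
                }
            }
        }
        return merged;
    }
}
```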

writeTemporaryFeatureData

public java.io.File writeTemporaryFeatureData(java.lang.String tempFileName,
                                              boolean forward_strand,
                                              java.util.List<java.util.List<java.lang.Double>> data,
                                              Classifier.DATA_TYPE dataType)
This method writes the temporary feature data (all the features extracted from one sequence, each feature in a different list) to a file. The type of data matters, because the data must be labelled (positive, negative, unknown) accordingly. Originally all data (all features of all sequences) was kept in memory, but for large datasets this very quickly resulted in heap overflow problems.

Specified by:
writeTemporaryFeatureData in interface Classifier
Parameters:
tempFileName - The name of the file to which the data should be written.
forward_strand - Indicates whether or not the data is located on the forward strand.
data - The featuredata, put in a nested linked list.
dataType - The type of data (see enum in this interface)
Returns:
The file with the temporary data.

writeTemporaryFeatureData

public java.io.File writeTemporaryFeatureData(java.lang.String tempFileName,
                                              java.util.List<java.util.List<java.lang.Double>> data,
                                              Classifier.DATA_TYPE dataType)
This method writes the temporary feature data (all the features extracted from one sequence, each feature in a different list) to a file. The type of data matters, because the data must be labelled (positive, negative, unknown) accordingly. Originally all data (all features of all sequences) was kept in memory, but for large datasets this very quickly resulted in heap overflow problems.

Specified by:
writeTemporaryFeatureData in interface Classifier
Parameters:
tempFileName - The name of the file to which the data should be written.
data - The featuredata
dataType - The type of data (see enum in this interface)
Returns:
The file with the temporary data.

generateFeatureString

public java.lang.String generateFeatureString(java.util.List<java.lang.Double> data,
                                              Classifier.DATA_TYPE dataType)
Creates a string of features for only one featurevector

Specified by:
generateFeatureString in interface Classifier
Parameters:
data - The featuredata
dataType - The datatype (positive,negative,unclassified)
Returns:
The featurestring
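
The exact layout of the feature string is not documented here. As a sketch under stated assumptions, the example below joins the values with commas and appends a class label: "1" for positive, "-1" for negative, and "?" (the ARFF marker for a missing value) for unclassified data. The enum DataType is a hypothetical stand-in for Classifier.DATA_TYPE, whose constant names are not shown in this page.

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

/** Hypothetical illustration; the real feature string format may differ. */
class FeatureStringSketch {

    /** Stand-in for Classifier.DATA_TYPE (assumed constant names). */
    enum DataType { POSITIVE, NEGATIVE, UNCLASSIFIED }

    /** Joins the feature values with commas and appends an assumed class label. */
    static String featureString(List<Double> data, DataType type) {
        StringJoiner joiner = new StringJoiner(",");
        for (double value : data) {
            joiner.add(Double.toString(value));
        }
        String label;
        switch (type) {
            case POSITIVE: label = "1"; break;
            case NEGATIVE: label = "-1"; break;
            default: label = "?"; break;
        }
        joiner.add(label);
        return joiner.toString();
    }

    public static void main(String[] args) {
        System.out.println(featureString(Arrays.asList(0.5, 2.0), DataType.POSITIVE));
    }
}
```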

loadClassifier

public boolean loadClassifier()
Sets the model file; depending on the implementation, there may be an attempt to build the SVM from this model file.

Specified by:
loadClassifier in interface Classifier
Returns:
Whether or not loading the classifier succeeded.

loadClassifier

public boolean loadClassifier(java.lang.String svmFile)
Sets the model file; depending on the implementation, there may be an attempt to build the SVM from this model file.

Specified by:
loadClassifier in interface Classifier
Parameters:
svmFile - The name of the modelfile
Returns:
Whether or not loading the classifier succeeded.

classify

public void classify(java.lang.String testFile,
                     java.lang.String outputFile)
Use the SVM (trained, or untrained provided the model file has been set beforehand) to classify data, and write the output to an output file.

Specified by:
classify in interface Classifier
Parameters:
testFile - The name of the file that contains the extracted features; the output directory is expected to be part of the file name.
outputFile - The name of the output file; the output directory is expected to be part of the file name.

classify_single_instance

public java.lang.String classify_single_instance(java.lang.String instance)
Use the trained classifier to classify a single instance of data, defined by the instance parameter.

Specified by:
classify_single_instance in interface Classifier
Parameters:
instance - The instance (consisting of extracted features) to be classified.
Returns:
The string indicating the result of the classification.

classify_single_instance_fast

public java.lang.Double classify_single_instance_fast(double[] features)
Use the trained classifier to classify a single instance of data in a very fast way, without having to resort to string parsing procedures (recommended method for doing these classifications).

Specified by:
classify_single_instance_fast in interface Classifier
Parameters:
features - The features that make up the instance that needs to be classified.
Returns:
A value given by the classifier. This is NOT a simple -1/+1 value: when using SVMs, this value indicates the distance to the hyperplane.
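
To make the difference between a -1/+1 label and a distance concrete, the sketch below computes the decision value of a linear SVM, w·x + b, and the corresponding geometric distance (w·x + b)/||w||. It assumes a linear kernel and explicit weights; how WEKA's SMO computes its output internally, and whether the returned value is normalised, is not stated here. The class name DecisionValueSketch is hypothetical.

```java
/** Hypothetical illustration of a linear SVM decision value. */
class DecisionValueSketch {

    /** Returns w.x + b; its sign gives the class, its magnitude the confidence. */
    static double decisionValue(double[] weights, double bias, double[] features) {
        if (weights.length != features.length) {
            throw new IllegalArgumentException("weights and features differ in length");
        }
        double sum = bias;
        for (int i = 0; i < weights.length; i++) {
            sum += weights[i] * features[i];
        }
        return sum;
    }

    /** Returns the signed geometric distance to the hyperplane, (w.x + b) / ||w||. */
    static double signedDistance(double[] weights, double bias, double[] features) {
        double squaredNorm = 0.0;
        for (double w : weights) {
            squaredNorm += w * w;
        }
        return decisionValue(weights, bias, features) / Math.sqrt(squaredNorm);
    }

    /** Collapses a decision value into a -1/+1 label. */
    static int label(double decisionValue) {
        return decisionValue >= 0 ? 1 : -1;
    }
}
```

Keeping the raw value instead of the label allows candidate sites to be ranked or thresholded afterwards.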

to_splice_machine_output

public java.lang.String to_splice_machine_output(java.lang.String classification_result,
                                                 int funsite,
                                                 int increase,
                                                 java.lang.String classification_name)
Method which produces a string that is similar to that of the Splicemachine program, according to the provided results.

Specified by:
to_splice_machine_output in interface Classifier
Parameters:
classification_result - The result of the classification, in string format.
funsite - The location of the functional site in the sequence.
increase - An extra increase for the output (see documentation).
classification_name - The name for this type of functional site.
Returns:
The string in Splicemachine format.

to_splice_machine_output

public java.lang.String to_splice_machine_output(java.lang.Double distance,
                                                 int funsite,
                                                 int increase,
                                                 java.lang.String classification_name)
Method which produces a string that is similar to the output provided by Splicemachine (with the provided results).

Specified by:
to_splice_machine_output in interface Classifier
Parameters:
distance - A value (distance to hyperplane for SVMs) that is used to give a score to a certain functional site.
funsite - The location of the functional site in the sequence.
increase - An extra increase for the location of the functional site in the output (see documentation).
classification_name - The name for this type of evaluated functional site.
Returns:
The string in Splicemachine format.

to_genomeview_output

public java.lang.String to_genomeview_output(int id,
                                             java.lang.Double distance,
                                             int funsite_start,
                                             int funsite_stop,
                                             java.lang.String classification_name)
Method which produces a string that can be used by the GenomeView program.

Specified by:
to_genomeview_output in interface Classifier
Parameters:
id - A unique id for the functional site in the sequence
distance - A value (distance to hyperplane for SVMs) that is used to give a score to a certain functional site.
funsite_start - The start of the functional site in the sequence
funsite_stop - The stop of the functional site in the sequence
classification_name - The name for this type of evaluated functional site.
Returns:
The string in GenomeView format.

getPosNegExamplesInFile

public int[] getPosNegExamplesInFile(java.io.File file)
Returns the number of positive and negative examples in a training file.

Specified by:
getPosNegExamplesInFile in interface Classifier
Parameters:
file - The trainingfile
Returns:
An array of size 2, with the first number being the number of positive training examples and the second the number of negative training examples.
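
Counting the examples can be sketched as a single pass over the file. The format is an assumption: one example per line, with the class label as the last comma-separated token ("1" for positive, "-1" for negative). Lines with any other label, including empty lines, are ignored. The class name LabelCountSketch is hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration; the real training file format may differ. */
class LabelCountSketch {

    /** Returns {positives, negatives} counted from the last token of each line. */
    static int[] countPosNeg(Path file) throws IOException {
        int[] counts = new int[2];
        try (BufferedReader in = Files.newBufferedReader(file)) {
            String line;
            while ((line = in.readLine()) != null) {
                String label = line.substring(line.lastIndexOf(',') + 1).trim();
                if (label.equals("1")) {
                    counts[0]++;
                } else if (label.equals("-1")) {
                    counts[1]++;
                }
            }
        }
        return counts;
    }
}
```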

getModelFile

public java.lang.String getModelFile()
Specified by:
getModelFile in interface Classifier
Returns:
The name of the file containing the model for the classifier.

setModelFile

public void setModelFile(java.lang.String svmModelFile)
Changes the model for the classifier by changing the name of the modelfile.

Specified by:
setModelFile in interface Classifier
Parameters:
svmModelFile - The name of the file containing the new model.

setOptions

public void setOptions(ClassifierOptions options)
Changes the various options of this classifier.

Specified by:
setOptions in interface Classifier
Parameters:
options - The new set of options for this classifier.

performAttributeEvaluation

public java.util.List<ValPosCombination> performAttributeEvaluation(boolean sort,
                                                                    weka.attributeSelection.AttributeEvaluator evaluator)
Performs feature selection by evaluating different attributes.

Specified by:
performAttributeEvaluation in interface Classifier
Parameters:
sort - Whether to sort the resulting valposcombinations according to their values
evaluator - The evaluator used for performing the evaluation of the attributes
Returns:
A list with the values and the original positions of those values in order to be able to locate the classificationfeature this attribute belonged to.
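
The pairing of a score with its original position can be sketched as follows. The class ValPos is a hypothetical stand-in for ValPosCombination, the scores are passed in directly instead of being computed by a WEKA AttributeEvaluator, and descending order is an assumption (the sort order is not stated above).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration of keeping attribute positions while sorting by score. */
class ValPosSketch {

    /** Stand-in for ValPosCombination: a score plus the attribute's original index. */
    static final class ValPos {
        final double value;
        final int position;

        ValPos(double value, int position) {
            this.value = value;
            this.position = position;
        }
    }

    /** Pairs each score with its index and optionally sorts by score, highest first. */
    static List<ValPos> rank(double[] scores, boolean sort) {
        List<ValPos> out = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            out.add(new ValPos(scores[i], i));
        }
        if (sort) {
            out.sort(Comparator.comparingDouble((ValPos vp) -> vp.value).reversed());
        }
        return out;
    }
}
```

Because each entry carries its original position, the caller can still map a highly ranked value back to the feature it came from after sorting.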

getTrainingFileInstances

public weka.core.Instances getTrainingFileInstances()
Returns the instances used for training the SMO.

Specified by:
getTrainingFileInstances in interface Classifier
Returns:
The instances that are used for training the SMO.

applyAttributeFilter

public void applyAttributeFilter(java.util.List<java.lang.Integer> attributeFilter,
                                 int maxNumFeatures,
                                 java.io.File toBeFilteredFile)
After feature selection has produced a filter, this filter can be applied to the feature files in order to optimize the SVMs.

Specified by:
applyAttributeFilter in interface Classifier
Parameters:
attributeFilter - The filter: a list with the numbers of the attributes that MUST be preserved.
maxNumFeatures - The maximum number of features to be used by the classifier.
toBeFilteredFile - The file containing the various features (set in a classifier-dependent way) which should be filtered by the given attribute filter.
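
Applied to a single feature vector, such a filter can be sketched as below. The sketch assumes that the filter lists attribute indexes in order of priority and that only the first maxNumFeatures of them are retained; how WekaSVM combines the two parameters, and how it rewrites the file, is not stated here. The class name AttributeFilterSketch is hypothetical.

```java
import java.util.List;

/** Hypothetical illustration of applying an attribute filter to one feature vector. */
class AttributeFilterSketch {

    /** Keeps the features at the listed indexes, using at most maxNumFeatures of them. */
    static double[] filter(double[] features, List<Integer> keep, int maxNumFeatures) {
        int size = Math.min(keep.size(), maxNumFeatures);
        double[] out = new double[size];
        for (int i = 0; i < size; i++) {
            out[i] = features[keep.get(i)];
        }
        return out;
    }
}
```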

getSigmoid_A

public double getSigmoid_A()
Specified by:
getSigmoid_A in interface Classifier
Returns:
the sigmoid variable A (see documentation about restructuring the output by use of sigmoid curves)

setSigmoid_A

public void setSigmoid_A(double sigmoid_A)
Changes the sigmoid variable A (see documentation about restructuring the output by use of sigmoid curves)

Specified by:
setSigmoid_A in interface Classifier
Parameters:
sigmoid_A - The new sigmoid variable A

getSigmoid_B

public double getSigmoid_B()
Specified by:
getSigmoid_B in interface Classifier
Returns:
the sigmoid variable B (see documentation about restructuring the output by use of sigmoid curves)

setSigmoid_B

public void setSigmoid_B(double sigmoid_B)
Changes the sigmoid variable B (see documentation about restructuring the output by use of sigmoid curves)

Specified by:
setSigmoid_B in interface Classifier
Parameters:
sigmoid_B - The new sigmoid variable B
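
The role of the two sigmoid variables can be illustrated with the form commonly used for calibrating SVM outputs (Platt scaling), P = 1 / (1 + exp(A*d + B)), where d is the distance to the hyperplane. That WekaSVM uses exactly this form is an assumption; the referenced documentation is authoritative. The class name SigmoidSketch is hypothetical.

```java
/** Hypothetical illustration of mapping an SVM distance to a probability-like score. */
class SigmoidSketch {

    /** Returns 1 / (1 + exp(a * distance + b)); with a < 0, larger distances score higher. */
    static double toProbability(double distance, double a, double b) {
        return 1.0 / (1.0 + Math.exp(a * distance + b));
    }

    public static void main(String[] args) {
        double a = -2.0;  // example values only, not defaults of WekaSVM
        double b = 0.0;
        for (double d : new double[] {-2.0, -0.5, 0.0, 0.5, 2.0}) {
            System.out.println("distance " + d + " -> " + toProbability(d, a, b));
        }
    }
}
```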