GENHmm
=========

GENHmm is an HMM-based artificial data generator that produces synthetic symbolic sequence data sets for evaluating anomaly detection algorithms. The program generates three HMMs, which can be used to generate three sets of sequences: normal, anomalous_1 (completely different from the normal sequences), and anomalous_2 (slightly different from the normal sequences). When the data are used to evaluate anomaly detection algorithms, detecting anomalous_1 sequences is easier than detecting anomalous_2 sequences.
For more details about the underlying algorithm, please refer to:
1. Varun Chandola, Varun Mithal, and Vipin Kumar, Understanding Anomaly Detection Techniques for Symbolic Sequences. CS Technical Report 09-001, January 2009, Computer Science Department, University of Minnesota.
2. Varun Chandola, Varun Mithal, and Vipin Kumar, A Comparative Evaluation of Anomaly Detection Techniques for Sequence Data. In Proceedings of the 2008 IEEE International Conference on Data Mining (ICDM), December 2008, Pisa, Italy.

The three HMMs generated by GENHmm share an identical set of states. The states are divided into two groups, each arranged as a cycle. Each state emits one symbol of the alphabet with high probability and every other symbol with a lower probability (parameter mu). From each state, a high-probability transition (parameter d) leads to the next state in its cycle, and low-probability transitions lead to all other states in the same cycle. For one state in a cycle (selected by parameter s), a transition to a state in the other cycle is also possible, with probability d2.
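
The two-cycle structure described above can be sketched in Python. This is only an illustration and not the GENHmm source: the function name build_two_cycle_hmm is hypothetical, and several details are assumptions the description leaves open. The sketch assumes that s is a 0-based position within a cycle, that the cross-cycle jump lands on the state at the same position in the other cycle, that the leftover transition mass is split equally among the remaining states of the same cycle (excluding the state itself), and that mu is the total probability shared by the non-preferred symbols.

```python
def build_two_cycle_hmm(n, s, d, d2, mu):
    """Return (transition, emission) as row-stochastic nested lists.

    Illustrative sketch only; the exact GENHmm construction may differ.
    """
    if n <= 4 or n % 2 != 0:
        raise ValueError("n must be even and greater than 4")
    half = n // 2  # states per cycle, also the alphabet size
    if not 0 <= s < half:
        raise ValueError("s must be a 0-based position inside a cycle")
    if not (0.0 <= d and 0.0 <= d2 and d + d2 <= 1.0 and 0.0 <= mu <= 1.0):
        raise ValueError("probabilities out of range")

    transition = [[0.0] * n for _ in range(n)]
    emission = [[0.0] * half for _ in range(n)]

    for i in range(n):
        base = 0 if i < half else half   # first state of this state's cycle
        pos = i - base                   # position within the cycle
        nxt = base + (pos + 1) % half    # next state in the same cycle
        jump = d2 if pos == s else 0.0   # only position s may leave the cycle

        transition[i][nxt] = d
        if jump:
            other_base = half - base     # first state of the other cycle
            transition[i][other_base + pos] = jump  # assumed landing state

        # Assumption: leftover mass is shared equally by the remaining
        # states of the same cycle (the state itself is excluded).
        rest = [base + k for k in range(half) if base + k not in (i, nxt)]
        for j in rest:
            transition[i][j] = (1.0 - d - jump) / len(rest)

        # Assumption: the preferred symbol gets 1 - mu, the others share mu.
        for sym in range(half):
            emission[i][sym] = 1.0 - mu if sym == pos else mu / (half - 1)

    return transition, emission


transition, emission = build_two_cycle_hmm(n=6, s=0, d=0.9, d2=0.05, mu=0.1)
```

With n=6, each cycle has three states and the alphabet has three symbols; only states 0 and 3 (position 0 in each cycle) carry a nonzero probability of crossing to the other cycle.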

Usage:
	./GENHmm -n numstates -o hmmfileprefix -s shortcircuit -d delta -d2 delta2 -m mu
	   -n  [4.. ] -> number of states in HMMs
	   -o         -> prefix for hmm model files.
	   -s  [1..n] -> shortcircuit parameter
	   -d  [0..1] -> delta transition probability within a cluster
	   -d2 [0..1] -> delta transition probability from one cluster to another
	   -m         -> mu or observation probability


The number of states must be even and greater than 4. The number of alphabet symbols emitted by the HMMs is numstates/2.
The program outputs three HMM model files, where pref is the prefix given with -o: pref.norm.hmm, pref.anom1.hmm, and pref.anom2.hmm.
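
As a convenience, the argument rules and output names above can be checked from a small Python helper before calling the program. This is a sketch under stated assumptions: the helper name genhmm_invocation is hypothetical, the executable is assumed to live at ./GENHmm as in the usage line, and the helper only builds the argument list and the expected file names without running anything.

```python
def genhmm_invocation(numstates, prefix, shortcircuit, delta, delta2, mu,
                      exe="./GENHmm"):
    """Build the GENHmm argument list and the file names it should produce.

    Hypothetical helper: it mirrors the usage text and does not run GENHmm.
    """
    # Rule taken from the README text: even and greater than 4.
    if numstates <= 4 or numstates % 2 != 0:
        raise ValueError("numstates must be even and greater than 4")

    argv = [
        exe,
        "-n", str(numstates),
        "-o", prefix,
        "-s", str(shortcircuit),
        "-d", str(delta),
        "-d2", str(delta2),
        "-m", str(mu),
    ]
    outputs = [f"{prefix}.{kind}.hmm" for kind in ("norm", "anom1", "anom2")]
    alphabet_size = numstates // 2
    return argv, outputs, alphabet_size


argv, outputs, alphabet_size = genhmm_invocation(8, "demo", 1, 0.9, 0.05, 0.1)
```

The returned argv list could be passed to subprocess.run(argv, check=True) once the GENHmm binary is built; the parameter values shown here are arbitrary examples, not recommended settings.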

Discrete sequences can then be generated from these models with the HMMGenSeq program.