Project Description

Given a set of feature vectors where each vector represents features found in a given column of data, we clustered these columns into a set of discrete domains based on the similarity of the feature vectors. We used Latent Dirichlet Allocation as the clustering technique.

We refined the domain classification algorithm used in step one by considering adjacent domains found in one or more domain graphs. We computed the probability that a column having values ranging from 300 to 800 is a FICO score, given the column is part of a graph that commonly includes a FICO score data element. In other words, P (C1 = FICO-DOMAIN | C1 is part of DG1) is the probability, given that column C1 is part of domain graph DG1, and that DG1 has some probability of including a FICO-DOMAIN.

We designed a method using TensorFlow to combine independently derived domain-graph models, such that the domain and domain graph representations eventually converge on a common set of domains and domain-graphs.