Everything You Need to Know About the Process of Data Profiling

Data profiling, also known as data archaeology, is the process of statistically analyzing and assessing the values within a data set for consistency, uniqueness, and logic. The insight gained from this process can be used to determine how difficult it will be to reuse existing data for other purposes. It also yields important metrics for assessing data quality and helps determine whether the metadata accurately describes the source data.

Uses of Data Profiling 

Data profiling is an extremely useful process that can be utilized for a number of different purposes, some of which are as follows.

  • It helps determine whether existing data can easily be reused for other purposes 
  • It improves the ability to search for data by tagging it with pertinent keywords and descriptions, or by assigning it to a category 
  • It assesses data quality, including whether the data conforms to particular standards or patterns as required 
  • It assesses the risks and challenges involved in integrating data into new applications 
  • It can be used to check the metadata of the source database, including value patterns and distributions, functional dependencies, and key and foreign key candidates 
  • It is important in assessing whether the known metadata accurately describes the actual values in the source database 
  • It helps expose data challenges early in any data-intensive project, so that they can be tackled before they cause trouble; data problems discovered late in a project lead to delays, cost overruns, or both 
  • It gives an enterprise an overall view of all its data, for initiatives such as master data management and data governance that rely on key data to improve data quality

Process of Data Profiling 

The process of data profiling makes use of descriptive statistics such as minimum, maximum, mean, mode, median, percentile, standard deviation, variance, frequency, and aggregates such as count and sum. It also collects additional metadata during profiling, such as length, data type, typical string patterns, discrete values, uniqueness, occurrence of null values, and abstract type recognition. This metadata can then be used to discover problems in the data such as misspellings, missing values, illegal values, inconsistent value representations, and duplicates.
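As a minimal illustration, the sketch below computes a few of these per-column statistics in Python with pandas; the sample DataFrame and its column names are hypothetical, and real profiling tools capture many more metrics than this.

```python
# A minimal per-column profiling sketch; the sample data is made up.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103, 103, 105],
    "email": ["a@x.com", "b@x.com", None, "b@x.com", "e@x.com"],
    "age": [34, 28, 45, 45, None],
})

profile = {}
for col in df.columns:
    s = df[col]
    stats = {
        "dtype": str(s.dtype),
        "null_count": int(s.isna().sum()),
        "distinct_values": int(s.nunique(dropna=True)),
        "is_unique": bool(s.dropna().is_unique),
    }
    # Descriptive statistics apply only to numeric columns.
    if pd.api.types.is_numeric_dtype(s):
        stats.update(min=s.min(), max=s.max(), mean=s.mean(),
                     median=s.median(), std=s.std())
    profile[col] = stats

for col, stats in profile.items():
    print(col, stats)
```

Even this tiny profile is enough to flag the duplicate customer_id and the missing email value before the data is reused elsewhere.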

Companies usually perform different kinds of analyses at different structural levels. For example, a single column can be profiled individually to obtain a thorough understanding of the frequency distribution of its values, its type, and how it is used. Dependencies between embedded values, on the other hand, can be exposed with a cross-column analysis. Similarly, overlapping value sets, which may represent foreign key relationships between entities, can be explored through an inter-table analysis.
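The sketch below illustrates these three levels on two hypothetical tables: a single-column candidate-key check, a cross-column functional-dependency check, and an inter-table inclusion check that flags a possible foreign key.

```python
# Column, cross-column, and inter-table profiling checks in pandas;
# the "customers" and "orders" tables are hypothetical examples.
import pandas as pd

customers = pd.DataFrame({"customer_id": [101, 102, 103],
                          "country": ["DE", "DE", "US"],
                          "currency": ["EUR", "EUR", "USD"]})
orders = pd.DataFrame({"order_id": [1, 2, 3, 4],
                       "customer_id": [101, 101, 103, 102]})

# Single-column: a candidate key must be non-null and unique.
def candidate_keys(df):
    return [c for c in df.columns if df[c].notna().all() and df[c].is_unique]

# Cross-column: the functional dependency a -> b holds if every value
# of a maps to exactly one value of b.
def fd_holds(df, a, b):
    return bool((df.groupby(a)[b].nunique(dropna=False) <= 1).all())

# Inter-table: orders.customer_id is a foreign-key candidate if its
# value set is contained in customers.customer_id (inclusion dependency).
fk_candidate = set(orders["customer_id"]) <= set(customers["customer_id"])

print(candidate_keys(customers))                    # ['customer_id']
print(fd_holds(customers, "country", "currency"))   # True
print(fk_candidate)                                 # True
```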

Data profiling takes the process of data mapping to the next level. Applications contain underlying metadata that describes individual data objects, attributes, and fields, along with the business or semantic rules governing how this data is persisted in its repository. Therefore, if a data record from another application needs to be added to or updated in, say, an Accounts data object, the data mapping that has to be created is greatly facilitated by data profiling.
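As one hypothetical illustration, the sketch below compares source and target column profiles to suggest a mapping into an Accounts-style table; the matching rule (same data type plus a high value overlap) and all table and column names are assumptions made for the example, not a standard algorithm.

```python
# Suggest source-to-target column mappings from simple profile signals;
# the tables, column names, and matching rule are illustrative only.
import pandas as pd

source = pd.DataFrame({"acct_no": [1, 2, 3], "acct_name": ["A", "B", "C"]})
target = pd.DataFrame({"account_id": [1, 2], "account_name": ["A", "B"]})

def suggest_mappings(src, tgt, min_overlap=0.5):
    suggestions = []
    for s_col in src.columns:
        for t_col in tgt.columns:
            # Require matching data types before comparing value sets.
            if src[s_col].dtype != tgt[t_col].dtype:
                continue
            s_vals, t_vals = set(src[s_col].dropna()), set(tgt[t_col].dropna())
            overlap = len(s_vals & t_vals) / max(len(s_vals), 1)
            if overlap >= min_overlap:
                suggestions.append((s_col, t_col, round(overlap, 2)))
    return suggestions

print(suggest_mappings(source, target))
# [('acct_no', 'account_id', 0.67), ('acct_name', 'account_name', 0.67)]
```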