Wednesday, April 3, 2019

Data Pre-processing Tool

Chapter 2

Real-life data rarely comply with the requirements of the various data mining tools. They are usually inconsistent and noisy, and may contain redundant attributes, unsuitable formats, and so on. Hence data have to be prepared carefully before the data mining actually starts. It is a well-known fact that the success of a data mining algorithm is very much dependent on the quality of data processing. Data processing is one of the most important tasks in data mining. In this context it is natural that data pre-processing is a complicated task involving large data sets. Sometimes data pre-processing takes more than 50% of the total time spent in solving the data mining problem. It is crucial for data miners to choose an efficient data pre-processing technique for a specific data set, one which can not only save processing time but also retain the quality of the data for the data mining process.

A data pre-processing tool should help miners with many data mining activities. For example, data may be provided in different formats, as discussed in the previous chapter (flat files, database files, etc.). Data files may also have different formats of values and may require calculation of derived attributes, data filters, joined data sets, etc. The data mining process generally starts with understanding the data, and in this stage pre-processing tools may help with data exploration and data discovery tasks. Data processing includes a lot of tedious work, and data pre-processing generally consists of:
- Data cleaning
- Data integration
- Data transformation
- Data reduction
In this chapter we will study all these data pre-processing activities.

2.1 Data Understanding
In the data understanding phase the first task is to collect initial data and then proceed with activities that help us become familiar with the data, discover data quality problems, gain first insights into the data, or identify interesting subsets to form hypotheses about hidden information. The data understanding phase according to the CRISP model can be shown in the following figure.

2.1.1 Collect Initial Data
The initial collection of data includes loading of data if this is required for data understanding. For instance, if a specific tool is applied for data understanding, it makes great sense to load your data into this tool. This attempt possibly leads to initial data preparation steps. However, if data is obtained from multiple data sources, then integration is an additional issue.

2.1.2 Describe Data
Here the gross or surface properties of the gathered data are examined.

2.1.3 Explore Data
This task is required to handle the data mining questions, which may be addressed using querying, visualization and reporting. These include:
- Distribution of key attributes, for instance the goal attribute of a prediction task
- Relations between pairs or small numbers of attributes
- Results of simple aggregations
- Properties of significant sub-populations
- Simple statistical analyses

2.1.4 Verify Data Quality
In this step the quality of the data is examined. It answers questions such as:
- Is the data complete (does it cover all the cases required)?
- Is it accurate, or does it contain errors, and if there are errors how common are they?
- Are there missing values in the data?
- If so, how are they represented, where do they occur and how common are they?
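The data quality questions above can be partly automated. Below is a minimal sketch, assuming pandas and a small made-up customer table (the column names customer_id, age and income are illustrative assumptions, not part of the original text), of how completeness, uniqueness and reasonableness might be checked.

import numpy as np
import pandas as pd

# Made-up table for illustration only.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 104, 105],
    "age":         [34, 29, 29, np.nan, 212],
    "income":      [52000, 48000, 48000, 61000, np.nan],
})

# Completeness: how many values are missing, per attribute and per row?
print(df.isna().sum())
print("rows with at least one missing value:", int(df.isna().any(axis=1).sum()))

# Uniqueness: exact duplicate rows and duplicated key values.
print("duplicate rows:", int(df.duplicated().sum()))
print("duplicate customer_id values:", int(df["customer_id"].duplicated().sum()))

# Accuracy / reasonableness: values outside a plausible range.
print("implausible ages:", int(((df["age"] < 0) | (df["age"] > 120)).sum()))

Such a summary answers the "how common are they?" questions of this phase before any cleaning is attempted.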
2.2 Data Preprocessing
The data preprocessing phase focuses on the pre-processing steps that produce the data to be mined. Data preparation, or preprocessing, is one of the most important steps in data mining. Industrial practice indicates that when data is well prepared, the mined results are much more accurate. This means this step is also very critical for the success of a data mining method. Among others, data preparation mainly involves data cleaning, data integration, data transformation, and data reduction.

2.2.1 Data Cleaning
Data cleaning is also known as data cleansing or scrubbing. It deals with detecting and removing inconsistencies and errors from data in order to get better quality data. When using a single data source such as flat files or databases, data quality problems arise due to misspellings during data entry, missing information or other invalid data. When the data is taken from the integration of multiple data sources such as data warehouses, federated database systems or global web-based information systems, the requirement for data cleaning increases significantly. This is because the multiple sources may contain redundant data in different formats. Consolidation of different data formats and elimination of redundant information become necessary in order to provide access to accurate and consistent data. Good quality data requires passing a set of quality criteria. Those criteria include:
- Accuracy: an aggregated value over the criteria of integrity, consistency and density.
- Integrity: an aggregated value over the criteria of completeness and validity.
- Completeness: achieved by correcting data containing anomalies.
- Validity: approximated by the amount of data satisfying integrity constraints.
- Consistency: concerns contradictions and syntactical anomalies in the data.
- Uniformity: directly related to irregularities in the data.
- Density: the quotient of missing values in the data and the number of total values that ought to be known.
- Uniqueness: related to the number of duplicates present in the data.

2.2.1.1 Terms Related to Data Cleaning
- Data cleaning: the process of detecting, diagnosing, and editing damaged data.
- Data editing: changing the value of data which are incorrect.
- Data flow: the passing of recorded information through succeeding information carriers.
- Inliers: data values falling inside the projected range.
- Outliers: data values falling outside the projected range.
- Robust estimation: estimation of statistical parameters using methods that are less sensitive to the effect of outliers than more conventional methods.
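As a small illustration of the last term, the sketch below (assuming NumPy and made-up values) contrasts the mean, which a single outlier can drag far from the typical value, with the median and the median absolute deviation, which are robust estimates and barely move.

import numpy as np

# Made-up measurements with one gross outlier (980).
values = np.array([21.0, 22.0, 20.0, 23.0, 21.0, 980.0])

print("mean  :", values.mean())       # heavily influenced by the outlier
print("median:", np.median(values))   # robust estimate of the typical value

# A common robust spread estimate: the median absolute deviation (MAD).
mad = np.median(np.abs(values - np.median(values)))
print("MAD   :", mad)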
2.2.1.2 Definition: Data Cleaning
Data cleaning is a process used to identify imprecise, incomplete, or irrational data and then to improve the quality through correction of detected errors and omissions. This process may include the following checks (a short sketch appears at the end of this section):
- Format checks
- Completeness checks
- Reasonableness checks
- Limit checks
- Review of the data to identify outliers or other errors
- Assessment of data by subject area experts (e.g. taxonomic specialists)
By this process suspected records are flagged, documented and checked subsequently, and finally these suspected records can be corrected. Sometimes validation checks also involve checking for compliance against applicable standards, rules, and conventions.
The general framework for data cleaning is given as:
1. Define and determine error types
2. Search and identify error instances
3. Correct the errors
4. Document error instances and error types
5. Modify data entry procedures to reduce future errors
The data cleaning process is referred to by different people by a number of terms, and it is a matter of preference which one is used. These terms include Error Checking, Error Detection, Data Validation, Data Cleaning, Data Cleansing, Data Scrubbing and Error Correction. We use Data Cleaning to encompass three sub-processes, viz.:
- Data checking and error detection
- Data validation
- Error correction
A fourth sub-process, improvement through error prevention, could perhaps be added.
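A minimal sketch of how such checks might be expressed, assuming pandas and a small made-up table; the column names and the YYYY-MM-DD date format are illustrative assumptions, not part of the original text.

import pandas as pd

# Made-up records for illustration only.
df = pd.DataFrame({
    "name":       ["John Smith", None, "Alice Ray"],
    "age":        [34, 212, 29],
    "birth_date": ["1985-03-12", "12/03/1985", "1990-07-01"],
})

flags = pd.DataFrame({
    "name_missing":     df["name"].isna(),                                    # completeness check
    "age_out_of_range": ~df["age"].between(0, 120),                           # limit check
    "bad_date_format":  ~df["birth_date"].str.match(r"^\d{4}-\d{2}-\d{2}$"),  # format check
})

# Suspect records are only flagged and documented here;
# correcting them belongs to a later phase (see section 2.2.1.4).
print(df[flags.any(axis=1)])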
2.2.1.3 Problems with Data
Here we note some key problems with data.
Missing data: this problem occurs because of two main reasons:
- Data are absent from the source where they are expected to be present.
- Sometimes data are present but not available in an appropriate form.
Detecting missing data is usually straightforward.
Erroneous data: this problem occurs when a wrong value is recorded for a real-world value. Detection of erroneous data can be quite difficult (for instance, the incorrect spelling of a name).
Duplicated data: this problem occurs because of two reasons:
- Repeated entry of the same real-world entity with somewhat different values.
- Sometimes a real-world entity may have different identifications.
Repeated records are common and frequently easy to detect. The different identification of the same real-world entities can be a very hard problem to identify and solve (a small sketch of this appears at the end of this section).
Heterogeneities: when data from different sources are brought together in one analysis, the problem of heterogeneity may occur. Heterogeneity could be:
- Structural heterogeneity, which arises when the data structures reflect different business usage.
- Semantic heterogeneity, which arises when the meaning of data is different in each system being combined.
Heterogeneities are usually very difficult to resolve because they usually involve a lot of contextual data that is not well defined as metadata.
Information dependencies in the relationships between the different sets of attributes are commonly present, and wrong cleaning mechanisms can further damage the information in the data. Various analysis tools handle these problems in different ways. Commercial offerings are available that assist the cleaning process, but these are often problem specific. Uncertainty in information systems is a well-recognized hard problem. A very simple example of missing and erroneous data is shown in the following figure.
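The hard case mentioned above, the same real-world entity entered under different identifications, is often attacked with string-similarity heuristics. The sketch below is a minimal illustration using Python's standard difflib module on made-up names; the 0.75 threshold is an arbitrary assumption, and real record-linkage tools are considerably more sophisticated.

import difflib

# Made-up name strings; exact comparison would miss all of these pairs.
records = ["John Smith", "J. Smith", "Jon Smith", "Mary Jones"]

threshold = 0.75
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = difflib.SequenceMatcher(None, records[i].lower(), records[j].lower()).ratio()
        if score >= threshold:
            print(f"possible duplicate: {records[i]!r} ~ {records[j]!r} (similarity {score:.2f})")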
A large number of tools of varying functionality are available to support these tasks, but often a significant portion of the cleaning and transformation work has to be done manually or by low-level programs that are difficult to write and maintain.A data cleaning method should assure followingIt should identify and eliminate all major errors and inconsistencies in an individual data sources and also when integrating multiple sources.Data cleaning should be supported by tools to bound manual examination and programming effort and it should be extensible so that can cover additional sources.It should be performed in association with schema related data transformations based on metadata.Data cleaning map ping functions should be specified in a declarative way and be reusable for other data sources.2.2.1.4 Data Cleaning Phases1. Analysis To identify errors and inconsistencies in the database there is a need of detailed analysis, which involves both manual inspection and automated analysis programs. This reveals where (most of) the problems are present.2. Defining Transformation and Mapping Rules After discovering the problems, this phase are related with defining the manner by which we are going to automate the solutions to clean the data. We will find various problems that translate to a list of activities as a result of analysis phase.Example Remove all entries for J. Smith because they are duplicates of John Smith Find entries with bule in colour field and change these to blue. Find all records where the Phone number field does not match the pattern (NNNNN NNNNNN). Further steps for cleaning this data are then applied. Etc 3. Verification In this phase we check and assess the tran sformation plans made in phase- 2. Without this step, we may end up making the data dirtier rather than cleaner. Since data transformation is the main step that actually changes the data itself so there is a need to be sure that the applied transformations will do it correctly. Therefore test and examine the transformation plans very carefully.Example Let we have a very thick C++ book where it says strict in all the places where it should say struct4. Transformation Now if it is sure that cleaning will be done correctly, then apply the transformation verified in last step. For large database, this task is supported by a variety of toolsBackflow of Cleaned Data In a data mining the main objective is to convert and move clean data into target system. This asks for a requirement to purify legacy data. Cleansing can be a complicated process depending on the technique chosen and has to be designed carefully to achieve the objective of removal of dirty data. Some methods to accomplish th e task of data cleansing of legacy system includen Automated data cleansingn Manual data cleansingn The combined cleansing process2.2.1.5 Missing ValuesData cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values is one important problem to be addressed. Missing value problem occurs because many tuples may have no record for several attributes. For Example there is a customer sales database consisting of a whole bunch of records (lets say around 100,000) where some of the records have certain fields missing. Lets say customer income in sales data may be missing. Goal here is to find a way to predict what the missing data values should be (so that these can be filled) based on the existing data. 
2.2.1.5 Missing Values
Data cleaning addresses a variety of data quality problems, including noise and outliers, inconsistent data, duplicate data, and missing values. Missing values are one important problem to be addressed. The missing value problem occurs because many tuples may have no recorded value for several attributes. For example, consider a customer sales database consisting of a large number of records (say around 100,000) where some of the records have certain fields missing; say, customer income may be missing in the sales data. The goal here is to find a way to predict what the missing data values should be (so that these can be filled in) based on the existing data. Missing data may be due to the following reasons:
- Equipment malfunction
- Inconsistency with other recorded data, leading to deletion
- Data not entered due to misunderstanding
- Certain data not being considered important at the time of entry
- No recorded history or changes of the data
How to Handle Missing Values?
Dealing with missing values is a regular question that has to do with the actual meaning of the data. There are various methods for handling missing entries (a short sketch follows this list).
1. Ignore the data row. One solution is to just ignore the entire data row. This is generally done when the class label is missing (assuming the data mining goal is classification), or when many attributes are missing from the row (not just one). But if the percentage of such rows is high, we will definitely get poor performance.
2. Use a global constant to fill in for missing values. We can fill in a global constant such as "unknown", "N/A" or minus infinity. This is done because at times it just does not make sense to try to predict the missing value. For example, if the office address is missing for some records in the customer sales database, filling it in does not make much sense. This method is simple but not foolproof.
3. Use the attribute mean. Say the average income of a family is X; we can use that value to replace missing income values in the customer sales database.
4. Use the attribute mean for all samples belonging to the same class. Say we have a car pricing database that, among other things, classifies cars into "Luxury" and "Low budget", and we are dealing with missing values in the cost field. Replacing the missing cost of a luxury car with the average cost of all luxury cars is probably more accurate than the value we would get if we factored in the low-budget cars.
5. Use a data mining algorithm to predict the value. The value can be determined using regression, inference-based tools using a Bayesian formalism, decision trees, clustering algorithms, etc.
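A minimal sketch of methods 3 and 4, assuming pandas and a made-up car-pricing table; the "category" and "cost" column names are illustrative assumptions.

import numpy as np
import pandas as pd

# Made-up car pricing data with two missing costs.
df = pd.DataFrame({
    "category": ["Luxury", "Luxury", "Low budget", "Low budget", "Luxury"],
    "cost":     [80000.0, np.nan, 12000.0, np.nan, 95000.0],
})

# Method 3: replace missing values with the overall attribute mean.
filled_overall = df["cost"].fillna(df["cost"].mean())

# Method 4: replace missing values with the mean of the same class ("category").
filled_by_class = df["cost"].fillna(df.groupby("category")["cost"].transform("mean"))

print(filled_overall)
print(filled_by_class)

The per-class fill gives the missing luxury car a luxury-level cost, which is usually closer to the truth than the global mean.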
2.2.1.6 Noisy Data
Noise can be defined as a random error or variance in a measured variable. Due to this randomness it is very difficult to follow a strategy for noise removal from the data. Real-world data are not always faultless; they can suffer from corruption, which may affect the interpretations of the data, the models created from the data, and the decisions made based on the data. Incorrect attribute values could be present because of the following reasons:
- Faulty data collection instruments
- Data entry problems
- Duplicate records
- Incomplete data
- Inconsistent data
- Incorrect processing
- Data transmission problems
- Technology limitations
- Inconsistency in naming conventions
- Outliers
How to Handle Noisy Data?
The methods for removing noise from data are as follows.
1. Binning: this approach first sorts the data and partitions it into (equal-frequency) bins; then one can smooth it using bin means, bin medians, bin boundaries, etc.
2. Regression: in this method smoothing is done by fitting the data to regression functions.
3. Clustering: clustering detects and removes outliers from the data.
4. Combined computer and human inspection: in this approach the computer detects suspicious values which are then checked by human experts (e.g. this approach can deal with possible outliers).
These methods are explained in detail as follows.
Binning: a data preparation activity that converts continuous data to discrete data by replacing a value from a continuous range with a bin identifier, where each bin represents a range of values. For instance, age can be changed to bins such as "20 or under", "21-40", "41-65" and "over 65". Binning methods smooth a sorted data set by consulting the values around each value; this is therefore called local smoothing. Consider the binning example below (a code sketch of the same example appears at the end of this section).
Binning Methods
- Equal-width (distance) partitioning: divides the range into N intervals of equal size (a uniform grid). If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N. This is the most straightforward approach, but outliers may dominate the result and skewed data is not handled well.
- Equal-depth (frequency) partitioning: divides the range (of values of a given attribute) into N intervals, each containing approximately the same number of samples (elements). It gives good data scaling, but managing categorical attributes can be tricky.
- Smoothing by bin means: each bin value is replaced by the mean of the values in the bin.
- Smoothing by bin medians: each bin value is replaced by the median of the values in the bin.
- Smoothing by bin boundaries: each bin value is replaced by the closest boundary value.
Example
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-frequency (equi-depth) bins:
  Bin 1: 4, 8, 9, 15
  Bin 2: 21, 21, 24, 25
  Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  Bin 1: 9, 9, 9, 9 (for example, the mean of 4, 8, 9, 15 is 9)
  Bin 2: 23, 23, 23, 23
  Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  Bin 1: 4, 4, 4, 15
  Bin 2: 21, 21, 25, 25
  Bin 3: 26, 26, 26, 34
Regression: regression is a data mining technique used to fit an equation to a dataset. The simplest form of regression is linear regression, which uses the formula of a straight line (y = b + wx) and determines the suitable values of b and w to predict the value of y based on a given value of x. More sophisticated techniques, such as multiple regression, permit the use of more than one input variable and allow the fitting of more complex models, such as a quadratic equation. Regression is further described in a subsequent chapter when discussing prediction.
Clustering: clustering is a method of grouping data into different groups, so that the data in each group share similar trends and patterns. Clustering constitutes a major class of data mining algorithms. These algorithms automatically partition the data space into a set of regions or clusters. The goal of the process is to find all sets of similar examples in the data, in some optimal fashion. The following figure shows three clusters; values that fall outside the clusters are outliers.
Combined computer and human inspection: these methods find the suspicious values using computer programs, which are then verified by human experts. By this process all outliers are checked.
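A minimal sketch, assuming NumPy, that reproduces the worked binning example above: equal-frequency partitioning of the twelve prices followed by smoothing by bin means and by bin boundaries.

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

n_bins = 3
bins = np.array_split(np.sort(prices), n_bins)   # equal-depth (frequency) partitioning

for i, b in enumerate(bins, start=1):
    # Smoothing by bin means: every value becomes the (rounded) bin mean.
    by_means = np.full(len(b), int(b.mean().round()))
    # Smoothing by bin boundaries: every value becomes the nearer of min/max.
    by_bounds = np.where(b - b.min() <= b.max() - b, b.min(), b.max())
    print(f"Bin {i}: means -> {by_means.tolist()}, boundaries -> {by_bounds.tolist()}")

Running this prints the same smoothed bins as the example (9/23/29 for the means, and 4,4,4,15 etc. for the boundaries).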
2.2.1.7 Data Cleaning as a Process
Data cleaning is the process of detecting, diagnosing, and editing data. It is a three-stage method involving a repeated cycle of screening, diagnosing, and editing of suspected data abnormalities. Many data errors are detected incidentally during study activities; however, it is more efficient to discover inconsistencies by actively searching for them in a planned manner. It is not always immediately clear whether a data point is erroneous, and many times it requires careful examination. Likewise, missing values require additional checks. Therefore, predefined rules for dealing with errors and with true missing and extreme values are part of good practice. One can monitor for suspect features in survey questionnaires, databases, or analysis data. In small studies, with the examiner intimately involved at all stages, there may be little or no difference between a database and an analysis dataset.
During as well as after treatment, the diagnostic and treatment phases of cleaning need insight into the sources and types of errors at all stages of the study. The data flow concept is therefore crucial in this respect. After measurement, the research data go through repeated steps: they are entered into information carriers, extracted, transferred to other carriers, edited, selected, transformed, summarized, and presented. It is essential to understand that errors can occur at any stage of the data flow, including during data cleaning itself. Most of these problems are due to human error.
Inaccuracy of a single data point or measurement may be tolerable and related to the inherent technical error of the measurement device. Therefore the process of data cleaning must concentrate on those errors that are beyond small technical variations and that form a major shift within or beyond the population distribution. In turn, it must be based on an understanding of technical errors and the expected ranges of normal values.
Some errors are worthy of higher priority, but which ones are most significant is highly study-specific. For instance, in most medical epidemiological studies, errors that need to be cleaned at all costs include missing gender, gender misspecification, birth date or examination date errors, duplications or merging of records, and biologically impossible results. Another example comes from nutrition studies, where date errors lead to age errors, which in turn lead to errors in weight-for-age scoring and, further, to misclassification of subjects as under- or overweight. Errors of sex and date are particularly important because they contaminate derived variables. Prioritization is essential if the study is under time pressure or if resources for data cleaning are limited.

2.2.2 Data Integration
This is the process of taking data from one or more sources and mapping it, field by field, onto a new data structure. The idea is to combine data from multiple sources into a coherent form. Various data mining projects require data from multiple sources because:
- Data may be distributed over different databases or data warehouses (for example, an epidemiological study that needs information about both hospital admissions and car accidents).
- Sometimes data may be required from different geographic distributions, or there may be a need for historical data (e.g. integrating historical data into a new data warehouse).
- There may be a need to enhance the data with additional (external) data, for improving data mining precision.

2.2.2.1 Data Integration Issues
There are a number of issues in data integration. Imagine two database tables:
Database Table 1
Database Table 2
In the integration of these two tables a variety of issues is involved, such as:
1. The same attribute may have different names (for example, in the above tables "Name" and "Given Name" are the same attribute with different names).
2. An attribute may be derived from another (for example, the attribute "Age" is derived from the attribute "DOB").
3. Attributes might be redundant (for example, the attribute "PID" is redundant).
4. Values in attributes might be different (for example, for PID 4791 the values in the second and third fields differ between the two tables).
5. There may be duplicate records under different keys (there is a possibility of replication of the same record with different key values).
Therefore schema integration and object matching can be tricky. The question here is: how are equivalent entities from different sources matched? This problem is known as the entity identification problem. Conflicts have to be detected and resolved. Integration becomes easier if unique entity keys are available in all the data sets (or tables) to be linked. Metadata can help in schema integration (metadata for each attribute includes, for example, the name, meaning, data type and range of values permitted for the attribute). A short sketch of such an integration follows.
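A minimal sketch of resolving the issues above, assuming pandas and two made-up tables whose columns mirror the example (PID, Name / Given Name, DOB, Age); the reference date used to derive Age is an arbitrary assumption.

import pandas as pd

# Made-up versions of the two tables discussed above.
t1 = pd.DataFrame({"PID": [4791, 4802], "Name": ["John Smith", "Mary Jones"],
                   "DOB": ["1985-03-12", "1990-07-01"]})
t2 = pd.DataFrame({"PID": [4791, 4811], "Given Name": ["J. Smith", "Alice Ray"],
                   "Age": [34, 29]})

# Issue 1: the same attribute under different names -> rename before combining.
t2 = t2.rename(columns={"Given Name": "Name"})

# Issue 2: a derived attribute -> recompute Age from DOB so both tables agree.
t1["Age"] = (pd.Timestamp("2019-04-03") - pd.to_datetime(t1["DOB"])).dt.days // 365

# Issues 3-5: merge on the shared key and inspect conflicting values per PID.
merged = pd.merge(t1, t2, on="PID", how="outer", suffixes=("_t1", "_t2"))
conflicts = merged[merged["Name_t1"].notna() & merged["Name_t2"].notna()
                   & (merged["Name_t1"] != merged["Name_t2"])]
print(merged)
print(conflicts)

The conflicting "Name" values for PID 4791 are exactly the kind of entity identification problem that has to be resolved, either with metadata or with the matching heuristics discussed earlier.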
2.2.2.2 Redundancy
Redundancy is another important issue in data integration. Two given attributes (such as DOB and Age in the tables above) may be redundant if one is derived from the other attribute or from a set of attributes. Inconsistencies in attribute or dimension naming can also lead to redundancies in the given data sets.
Handling Redundant Data
We can handle data redundancy problems in the following ways:
- Use correlation analysis.
- Consider different codings or representations (e.g. metric versus imperial measures).
- Careful (manual) integration of the data can reduce or prevent redundancies (and inconsistencies).
- De-duplication (also called internal data linkage): if no unique entity keys are available, analyse the values in the attributes to find duplicates.
- Process redundant and inconsistent data (easy if the values are the same): delete one of the values, average the values (only for numerical attributes), or take the majority value (if there are more than two duplicates and some values are the same).
Correlation analysis is explained in detail here.
Correlation analysis (Pearson's product moment coefficient): some redundancies can be detected by using correlation analysis. Given two attributes, such analysis can measure how strongly one attribute implies the other. For numerical attributes we can compute the correlation coefficient of two attributes A and B to evaluate the correlation between them. This is given by

r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\,\sigma_A\,\sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{n\,\sigma_A\,\sigma_B}

where
- n is the number of tuples,
- \bar{A} and \bar{B} are the respective means of A and B,
- \sigma_A and \sigma_B are the respective standard deviations of A and B, and
- \sum (a_i b_i) is the sum of the AB cross-product.
a. Note that -1 \le r_{A,B} \le +1. If r_{A,B} is greater than zero, then A and B are positively correlated, meaning that the values of A increase as the values of B increase; the higher the value, the stronger the correlation.
b. If r_{A,B} is equal to zero, A and B are independent of each other and there is no correlation between them.
c. If r_{A,B} is less than zero, then A and B are negatively correlated: the values of one attribute increase as the values of the other attribute decrease. This means that each attribute discourages the other.
It is important to note that correlation does not imply causality. That is, if A and B are correlated, this does not necessarily mean that A causes B or that B causes A. For example, in analyzing a demographic database we may find that the attributes representing the number of accidents and the number of car thefts in a region are correlated. This does not mean that one causes the other; both may be related to a third attribute, namely population.
For discrete (categorical) data, a correlation relationship between two attributes can be discovered by a \chi^2 (chi-square) test. Let A have c distinct values a_1, a_2, \ldots, a_c and B have r distinct values b_1, b_2, \ldots, b_r. The data tuples described by A and B can be shown as a contingency table, with the c values of A making up the columns and the r values of B making up the rows. The \chi^2 value is computed as

\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}

where
- o_{ij} is the observed frequency (i.e. the actual count) of the joint event (A_i, B_j), and
- e_{ij} is the expected frequency, which can be computed as

e_{ij} = \frac{count(A = a_i)\,\times\,count(B = b_j)}{N}

where N is the number of data tuples, count(A = a_i) is the number of tuples having value a_i for A, and count(B = b_j) is the number of tuples having value b_j for B.
The larger the \chi^2 value, the more likely the variables are related. The cells that contribute the most to the \chi^2 value are those whose actual count is very different from the expected count.
Chi-Square Calculation: An Example
Suppose a group of 1,500 people were surveyed. The gender of each person was noted, and each person was polled on whether their preferred type of reading material is fiction or non-fiction. The observed frequency of each possible joint event is summarized in the following contingency table (the numbers in parentheses are the expected frequencies). Calculate chi-square.

                           Male        Female       Sum (row)
Like science fiction       250 (90)    200 (360)    450
Not like science fiction   50 (210)    1000 (840)   1050
Sum (col.)                 300         1200         1500

For example, e_{11} = count(male) x count(fiction) / N = 300 x 450 / 1500 = 90, and so on.
For this table the degrees of freedom are (2 - 1)(2 - 1) = 1, since the table is 2 x 2. For 1 degree of freedom, the \chi^2 value needed to reject the hypothesis of independence at the 0.001 significance level is 10.828 (taken from the table of upper percentage points of the \chi^2 distribution, typically available in any statistics textbook). Since the computed value is above this, we can reject the hypothesis that gender and preferred reading are independent and conclude that the two attributes are strongly correlated for the given group (a short sketch of this computation appears at the end of this section).
Duplication must also be detected at the tuple level. The use of denormalized tables is also a source of redundancy. Redundancies may further lead to data inconsistencies (due to updating some occurrences but not others).
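A minimal sketch, assuming NumPy, that reproduces the chi-square computation for the contingency table above; the expected frequencies and the final value follow directly from the formulas in this section.

import numpy as np

# Observed counts; rows: like / not like science fiction, columns: male / female.
observed = np.array([[250.0, 200.0],
                     [50.0, 1000.0]])

N = observed.sum()
row_sums = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)   # 300, 1200

expected = row_sums @ col_sums / N               # 90, 360, 210, 840
chi_square = ((observed - expected) ** 2 / expected).sum()

print(expected)
print(chi_square)   # roughly 507.9, far above the 10.828 threshold for 1 degree of freedom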
2.2.2.3 Detection and Resolution of Data Value Conflicts
Another significant issue in data integration is the detection and resolution of data value conflicts: for the same entity, attribute values from different sources may differ. For example, weight may be stored in metric units in one source and in British imperial units in another source. For instance, for a hotel cha
