PREDICTIVE MODELING USING LOGISTIC REGRESSION SAS COURSE NOTES DOWNLOAD


Predictive Modeling Using Logistic Regression: Course Notes [SAS Institute]. The course notes were developed for students who have experience building statistical models using SAS software.



Author: CARMELA BEKKER
Language: English, Portuguese, Hindi
Country: Nepal
Genre: Science & Research
Pages: 487
Published (Last): 08.12.2015
ISBN: 907-9-52971-370-4
ePub File Size: 22.87 MB
PDF File Size: 8.71 MB
Distribution: Free* [*Sign up for free]
Downloads: 39948
Uploaded by: JOSEPHINE

Author: Deepanshu Bhalla | Category: predictive modeling, SAS. Multinomial or ordinal logistic regression can have a dependent variable with more than two levels. Proc Logistic Data = training outest=coeff descending; ... options source notes; The data set comes from the UC Irvine Machine Learning repository and can be downloaded at the link here. The data does not appear to be available for public download, but there is a page where people who purchased the course notes can request a download of the course data. This course covers predictive modeling using SAS/STAT software with emphasis on the LOGISTIC procedure. This course also discusses selecting variables.
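
As a hedged illustration of the PROC LOGISTIC fragment quoted above, a self-contained sketch might look like the following; WORK.TRAINING, the binary target RESPONSE, and the inputs AGE and INCOME are assumed names, not taken from the course data.

/* Sketch only: fit a binary logistic model, save coefficients, and score the data. */
proc logistic data=work.training outest=coeff descending;
   model response = age income;        /* DESCENDING makes the model predict P(response=1) */
   output out=scored p=p_response;     /* predicted probabilities, e.g. for ranking prospects */
run;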

It is important to note that data sources are not the actual training data, but instead are the metadata that defines the source data.

The source data itself must reside in an allocated library. You have already allocated a libname to the donor data source as part of the start-up code for the project. Click OK and then select the Next button. The Data Table Properties sheet opens. There are 50 variables and 19, observations. In this course, the unknown quantity is called a target and the supplementary facts are called inputs. Variables in a data set assume one of these two model roles. The inputs and target typically represent measurements of an observable phenomenon.

The measurements found in the input and target variables are recorded on one of several measurement scales. SAS Enterprise Miner recognizes the following measurement scales for the purposes of model construction: Interval measurements are quantitative values permitting certain simple arithmetic or logarithmic transformations (for example, monetary amounts).

Ordinal measurements are qualitative attributes having an inherent order (for example, income group). Nominal measurements are qualitative attributes lacking an inherent order (for example, state or province). Binary measurements are qualitative attributes with only two levels (for example, gender). To solve the fundamental problem in prediction, a mathematical relationship between the inputs and the target is constructed. This mathematical relation is known as a predictive model.
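
Because this course centers on logistic regression, one generic form that such a predictive model takes for a binary target can be written as follows; the symbols are illustrative and not tied to the donor variables:

P(Y = 1 \mid x_1, \ldots, x_k) = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)\}}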

After it is established, the predictive model can be used to produce an estimate of an unknown target value given a set of input measurements. The model role and measurement scale are examples of metadata.


Some metadata, such as field names, are stored with the data. Other metadata, such as how a particular variable in a data set should be used in a predictive model, must be manually specified. Defining modeling metadata is the process of establishing relevant facts about the data set prior to model construction. Click the Next button to apply advisor options. Two options are available:

a. Basic - Use the Basic option when you already know the variable roles and measurement levels. The initial role and level are based on the variable type and format values.

b. Advanced - Use the Advanced option to have SAS Enterprise Miner assign the initial roles and levels automatically. Automatic initial roles and level values are based on the variable type, the variable format, and the number of distinct values contained in the variable.

Select Advanced. Select the Customize button to view additional variable rules that you can impose. For example, Missing Percentage Threshold specifies the percentage of missing values required for a variable's modeling role to be set to Rejected.
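
As an aside, the kind of check that the Missing Percentage Threshold rule performs can be sketched in ordinary SAS code; WORK.DONOR is an assumed data set name, and the format-based trick simply tabulates missing versus present values for every variable.

/* Sketch only: count missing vs. present values for each variable. */
proc format;
   value  nm_fmt   .   = 'Missing' other = 'Present';   /* numeric variables   */
   value $cm_fmt  ' '  = 'Missing' other = 'Present';   /* character variables */
run;

proc freq data=work.donor;
   tables _all_ / missing;
   format _numeric_ nm_fmt. _character_ $cm_fmt.;
run;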

Select OK to use the defaults for this example. Select Next in the Apply Advisor window to generate the metadata and open the columns metadata.

Click on the Names column header to sort the variables alphabetically. Select the Level column header to sort the variables by level. This is especially useful when you want to apply a metadata rule to several variables. By default, 10, observations are used to generate exploratory plots. In the Sample Properties window, set the fetch size to Max and then click Apply.

The plot is now generated from all 19, observations in the donor data source.

Select the bar for the donors (the 1s). The donors are highlighted in the data table. To display a tool tip indicating the number of donors, place your cursor over this bar. Close the Explore window. You now finalize your metadata assignments. Select Next to open the Decision Processing window. For now, forego decision processing. Select Next to open the Data Source Attributes window. The data source can be used in other diagrams.

You can also define global data sources that can be used across multiple projects. See the online documentation for instructions on how to do this. Expand the Data Sources folder.

Previously Observed Cases

Construction of predictive models requires training data, a set of previously observed input and target measurements, or cases.

Given a set of input measurements, you need only to scan the training data for identical measurements and note the corresponding target measurement. Often in a real set of training data, a particular set of inputs corresponds to a range of target measurements. Because of this noise, predictive models usually provide the expected average value of the target for a given set of input measurements.

With a qualitative target (ordinal, nominal, or binary), the expected target value may be interpreted as the probability of each qualitative level. Both situations suggest that there are limits to the accuracy achievable by any predictive model. Usually, a given set of input measurements does not yield an exact match in the training data. How you compensate for this fact distinguishes various predictive modeling methods. Perhaps the most intuitive way to predict cases lacking an exact match in the training data is to look for a nearly matching case and note the corresponding target measurement.

This is the philosophy behind nearest-neighbor prediction and other local smoothing methods. These methods assume that cases with similar input measurements have similar target values and that the training data contain a close match for any new case; the failure of either assumption results in poor predictive performance. Although formal nearest-neighbor methods are relatively modern, you could argue that their philosophical roots date back at least to the taxonomists of the 19th century. In its simplest form, the predicted target value equals the target value of the nearest training data case.

You can envision this process as partitioning the input space, the set of all possible input measurements, into cells of distinct target values. The edge of these cells, where the predicted value changes, is known as the decision boundary.

A nearest neighbor model has a very complex decision boundary. While nearest neighbor prediction perfectly predicts training data cases, performance on new cases (validation data) can be substantially worse.

This is especially apparent when the data are noisy (every small region of the input space contains cases with several distinct target values). In the slide above, the true value of a validation data case is indicated by dot color.

Any case whose nearest neighbor has a different color is incorrectly predicted, indicated by a red circle surrounding the case. One way to tune a nearest neighbor model is to change the number of training data cases used to make a prediction. Instead of using the target value of the single nearest training case, the predicted target is taken to be the average of the target values of the k nearest training cases. This interpolation makes the model much less sensitive to noise and typically improves generalization.
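
Outside SAS Enterprise Miner, a k-nearest-neighbor classifier of this kind can be sketched with the nonparametric method of PROC DISCRIM; the data set and variable names below are assumptions rather than part of the course project.

/* Sketch only: score validation cases with a k = 5 nearest neighbor classifier. */
proc discrim data=work.train testdata=work.valid testout=work.knn_scored
             method=npar k=5;           /* nonparametric discrimination, 5 nearest neighbors */
   class target_b;                      /* binary target */
   var input1 input2;                   /* interval inputs */
run;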

In general, models are tuned to match the specific signal and noise characteristics of a given prediction problem.

When there is a strong signal and little noise, highly sensitive models can be built with complex decision boundaries. Where there is a weak signal and high noise, less sensitive models with simple decision boundaries are appropriate. In SAS Enterprise Miner, monitoring model performance on validation data usually determines the appropriate tuning value.
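
A hedged sketch of creating such a validation holdout outside Enterprise Miner, assuming a data set named WORK.DONOR and a two-thirds training fraction:

/* Sketch only: flag roughly 67% of cases for training; the rest serve as validation data. */
proc surveyselect data=work.donor out=work.donor_split outall
                  samprate=0.67 seed=27513;
run;
/* In WORK.DONOR_SPLIT, Selected = 1 marks training cases and Selected = 0 marks validation cases. */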

The choice of inputs is critical. Including extraneous inputs (that is, inputs unrelated to the target) can devastate model performance. This phenomenon, known as the curse of dimensionality, is the general observation that the complexity of a data set increases with dimension. Cases that are nearest neighbors in two dimensions need not be nearest neighbors in three dimensions. When only two of the three dimensions are related to the target, this can degrade the performance of the nearest neighbor model.

(Slide: Nearest Neighbors with Extraneous Inputs, Training Data)

As the number of extraneous inputs increases, the problem becomes worse. Indeed, in high dimensions, the concept of nearest becomes quite distorted.

Suppose there are cases scattered randomly but uniformly on the range 0 to 1 of 10 independent inputs.

Now take any pair of inputs. How many of the cases are in the center half of both inputs? If the inputs are independent, as assumed, the answer is about 25%. For three inputs, it is about 12.5%, and so on. Perhaps surprisingly, with 10 inputs, only about 1 case in 1,000 is simultaneously in the center half of all inputs. Put another way, more than 99.9% of the cases lie outside this central region. To maintain some sense of nearness in high dimensions requires a tremendous increase in the number of training cases. Matters are worse in a typical prediction problem: for every relevant input there may be dozens of extraneous ones.
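
The arithmetic behind these figures is simply (1/2) raised to the number of inputs; a small SAS sketch (the data set name is illustrative) reproduces it.

/* Sketch only: fraction of uniformly scattered cases falling in the center half of all d inputs. */
data center_fraction;
   do d = 1 to 10;
      fraction = 0.5 ** d;   /* about 0.25 for d = 2, 0.125 for d = 3, roughly 0.001 for d = 10 */
      output;
   end;
run;

proc print data=center_fraction noobs;
run;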

This sparsity devastates the performance of nearest neighbor methods and raises the question of how to proceed. To make progress, the focus must shift from individual cases in the training data to the general pattern they create. Two approaches are widely used to overcome the curse of dimensionality. Predictive algorithms employ simple heuristic rules to reduce dimension. Parametric models are constrained to limit overgeneralization.

While this classification is used to group predictive models for the purposes of this course, the distinction is somewhat artificial. Predictive algorithms often utilize predictive models; predictive models often employ predictive algorithms.

In the example above, a single partition of the input space can lead to a surprisingly accurate prediction. This partition takes advantage of the clustering of solid cases on the right half of the original input space. It isolates cases with like-valued targets in each part of the partition. The common element of these techniques is the recursive partitioning of the input space. Partitions of the training data, based on the values of a single input, are considered.

The worth of a partition is measured by how well it isolates distinct groups of target values. The process continues by further subdividing each resulting split group.

Ultimately, the satisfaction of certain stopping conditions terminates the process. The number of times the partitioning process repeats can be thought of as a tuning parameter for the model. Each iteration subdivides the training data further and increases training data accuracy. However, increasing the training data accuracy often diminishes generalization. As with nearest neighbor models, validation data can be used to pick the optimal tuning value. Recursive partitioning techniques resist the curse of dimensionality by ignoring inputs not associated with the target.
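
Outside Enterprise Miner, recursive partitioning of this kind corresponds to a decision tree, which can be sketched with PROC HPSPLIT; the data set and input names below are hypothetical stand-ins, not the course variables.

/* Sketch only: grow a classification tree for a binary target and prune it back. */
proc hpsplit data=work.donor seed=27513;
   class target_b;
   model target_b = input1 input2 input3;   /* hypothetical interval inputs    */
   grow entropy;                            /* measure of partition worth      */
   prune costcomplexity;                    /* prune to improve generalization */
run;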


If every partition involving a particular input results in partition groups with similar average target values, the calculated worth of these partitions will be small. The particular input is not selected to partition the data, and it is effectively disregarded.

Because they can quickly identify inputs with strong target associations, recursive partitioning methods are ideally suited to the role of initial predictive modeling methodology. The task that motivates predictive modeling in this course has been outlined in Section 1. Lapsing donors have been identified by basic business rules. Some of these donors will be subsequently ignored; some will continue to be solicited for donation.

A data set describing the donation response to a mailing identified as 97NK will be used to make this decision. The simplest approach to this problem involves estimating donation propensity from the 97NK data. Individuals with the highest probability of response are selected for continued solicitation.

Those with the lowest probability of response are ignored in the future. For now, the amount of response enters into the solicitation decision after the propensity to donate is estimated. Other variables in the training data provide supplemental facts about each individual. Not all of these inputs will be needed to build a successful predictive model. To build any predictive model, however, you must first create an analysis diagram. Create a Diagram 1.

Predictive Modeling Using Logistic Regression

Expand the diagram folder to see the open diagram. Select the diagram icon. Curiously, the derived values and the provided values do not always agree. Because it is impossible to determine which are correct, these supplied values were also included in the final analysis data.

They describe the response behavior to the 97NK campaign. The models to be built will attempt to predict their value in the presence of all the other information in the analysis data set. SAS Enterprise Miner 5. The interface is divided into six components:

Toolbar - The toolbar in SAS Enterprise Miner is a graphical set of node icons and tools that you use to build process flow diagrams in the Diagram Workspace. To display the text name of any node or tool icon, position your mouse pointer over the icon.

Project panel - Use the Project panel to manage and view data sources, diagrams, results, and project users.

Properties panel - Use the Properties panel to view and edit the settings of data sources, diagrams, nodes, results, and users.

Diagram Workspace - Use the Diagram Workspace to build, edit, run, and save process flow diagrams. This is where you graphically build, order, and sequence the nodes that you use to mine your data and generate reports.

Help panel - The Help panel displays a short description of the property that you select in the Properties panel. Extended help can be found in the Help Topics selection from the Help main menu.

Status bar - The status bar is a single pane at the bottom of the window that indicates the execution status of an Enterprise Miner task.

The SAS analytic server is configured in advance to access predefined data sources. Unlike the 5. The Start Enterprise Miner window opens. Select Personal Workstation when you are using SAS services that run on your personal computer or laptop computer. This course assumes the Personal Workstation configuration.

Select Start in the Start Enterprise Miner window. After a brief pause, the Welcome to Enterprise Miner startup page opens. This window enables you to create a new project or open an existing project. As an alternative, you can select File New Project from the main menu. The Create New Project window opens. Specify the project name in the Name field. The administrator determines the paths to which you can have access. If a default project is not provided, type in the name and path location where you want to store the project.

For example, type PVA for the name. Select the OK button to create the project. You are encouraged to read through the help topics, which cover many of the remaining tasks in more detail.

Many of the sample data sources used in the online help can be created by selecting Generate Sample Data Sources from the Help menu. The metadata for these tables is already predefined. Select Preferences from the Options drop-down menu item to set the GUI appearance and specify model results package options.

Autonomous SAS processes, however, can hide inefficiencies that are created when they repeatedly vie for data set access and are forced to wait for each other.

HTML reports demonstrate unsuccessful and delayed lock attempts, elucidating to developers where potential bottlenecks exist and where potential efficiencies can be gained.

This presentation identifies and explores the areas that are hot and not-so-hot in the world of the professional SAS user. Introductory statistics courses can leave students and managers with an overly simplistic view of how informed statistical decisions are made in practice. This paper focuses on the more recent pedagogical ideas of exposing students to underlying likelihood methods and treating specific t- and F-tests as special cases embedded in this larger structure. This approach enables students and managers to pose and examine more meaningful queries.

For example, the techniques discussed here allow practitioners to focus on the estimation of important model parameters in the presence of serially correlated errors rather than on the detection of the exact time-series error structure. Numerous additional practical examples of the applicability of likelihood methods are provided and discussed; specifically, the provided illustrations include novel approaches useful in statistical modelling, drug synergy and relative potency.
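
As one hedged illustration of the likelihood-based estimation mentioned above (regression parameters in the presence of serially correlated errors), PROC AUTOREG in SAS/ETS can fit a model with an AR(1) error structure; the data set and variable names are assumptions.

/* Sketch only: regression with AR(1) errors estimated by maximum likelihood. */
proc autoreg data=work.series;
   model y = x1 x2 / nlag=1 method=ml;   /* one autoregressive error lag, ML estimation */
run;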

Examples of successful decisions and tips will be provided. Come, listen, and learn from our most successful leaders on how to enhance your career opportunities and prepare for the future. You can ask questions about working in corporations, in academia, independently, and in other settings. This paper also provides several ways that you can find longitude and latitude coordinates. The information shown in the dialogue box can be any multimedia information, such as plain text, images, videos, URLs, or email links.

The report generated by this macro retains all the functionalities of the Google map, allowing you to zoom in, zoom out, or move the map in the report, show the map in satellite mode, etc.

The macro also has capability for you to display different styles of pin icons on the map. The SAS user does not need prior knowledge or expertise in any website programming language to use this macro. The SAS user only needs to prepare the input data, call the macro, and the Google map report will be generated. The zip code boundary data files from U. This paper demonstrates the use of all these datasets with SAS to exhibit sales force alignment and target locations on the Google like maps with cities, highways, roads, bodies of water and forests in the background.

Each alignment-defined area (territory) has its own color. Each territory ID is labelled at its 'center' location. The boundary of each zip code in a territory is displayed. Each zip code and the number of targets in the zip code are labelled. Different targets could be at the same location. Each target location is dotted with a color that reflects the number range of targets at that address.

This paper will demonstrate how to present business insights using PROC GMAP with real-life examples and show how additional map features can be added to SAS maps to make them visually stimulating by using annotated data sets. Sample code and a macro for map cosmetics will be provided.
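
A minimal, hedged PROC GMAP sketch of the kind of map the paper describes; it assumes a response data set WORK.STATE_SALES whose STATE variable matches the MAPS.US map data set shipped with SAS/GRAPH.

/* Sketch only: a five-level choropleth of a SALES measure by state. */
proc gmap data=work.state_sales map=maps.us;
   id state;
   choro sales / levels=5;
run;
quit;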

Some statistical capabilities with graphics were introduced with selected procedures in an earlier version. Adoption of these capabilities is further complicated by the lack of simple demonstrations.

Most graphs in training materials and publications are rather complicated and, while useful, are not good teaching examples. This paper contains many examples of very simple ways to get very simple things accomplished. The examples use data sets, such as SASHELP.CARS, that are included with SAS. In addition, the paper addresses those situations where the user must alternatively use a combination of Proc Template with Proc SGRender to accomplish the task. The emphasis of this paper is simplicity in the learning process.
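
In the same spirit of very simple demonstrations, the following sketch uses a data set shipped with SAS (SASHELP.CARS) and a single ODS Graphics procedure; it is an illustration, not code from the paper itself.

/* Sketch only: a basic scatter plot grouped by a classification variable. */
proc sgplot data=sashelp.cars;
   scatter x=horsepower y=mpg_city / group=origin;
run;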

Users will be able to take the included code and run it immediately on their personal machines as the data is included with SAS installation. Monte Carlo simulations are created as teaching devices for live demonstrations of statistical concepts. These results are then coded into data visualizations to aid in intuitive understanding by the audience.

Coding is done in both SAS and R, and a comparison is made to show how both platforms are capable of performing the required tasks. Both code and output plots are presented. Over the years there has been an increase in studies that focus on assessing associations between biomarkers and a disease of interest. Many of the biomarkers are measured as continuous variables. Investigators seek to identify a possible cutpoint to classify patients as high risk versus low risk based on the value of the biomarker.

Several data-oriented techniques, such as the median and upper quartile, and outcome-oriented techniques based on score, Wald, and likelihood ratio tests are commonly used in the literature. Contal and O'Quigley presented a technique that uses the log-rank test statistic to estimate the cutpoint. Their method was computationally intensive and hence was overlooked due to the unavailability of built-in options in standard statistical software.
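
For contrast with the log-rank cutpoint search, the simplest data-oriented approach mentioned above (a median split) can be sketched as follows; WORK.PATIENTS, BIOMARKER, SURVTIME, and STATUS are assumed names, with STATUS = 0 taken to indicate censoring.

/* Sketch only: dichotomize a biomarker at its median and compare survival with a log-rank test. */
proc means data=work.patients noprint;
   var biomarker;
   output out=work.cutstats median=med;
run;

data work.grouped;
   if _n_ = 1 then set work.cutstats(keep=med);   /* bring the median onto every observation */
   set work.patients;
   highrisk = (biomarker > med);                  /* 1 = above the median cutpoint */
run;

proc lifetest data=work.grouped;
   time survtime*status(0);
   strata highrisk;                               /* log-rank test across the two groups */
run;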


New and updated features will include: results presented in a much cleaner report format, user specified cut points, macro parameter error checking, temporary data set clean-up, preserving current option settings, and increased processing speed.

In addition, we will critically compare this method with some of the existing methods and discuss the use and misuse of categorizing a continuous covariate. The author then examines whether clinical characteristics predict membership in the different statuses and transitions between latent statuses over time using both SAS and Mplus.

Mplus programming code is provided to compute standard errors of the parameter estimates. This paper is suited to students who are beginning their study of the social and behavioral health sciences and to professors and research professionals who are conducting research in epidemiology, clinical psychology, or health services research.

The basic idea is that a model built from training data with three or four times as many common cases as rare cases is just as predictive as a model built from training data with 30 to 40 times as many common cases.
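
A hedged sketch of this kind of oversampling, assuming a data set WORK.DONOR with a binary target TARGET_B (1 = responder) and an illustrative 15% sampling rate for the common cases:

/* Sketch only: keep every rare case and a random fraction of the common cases. */
data work.oversample;
   set work.donor;
   if target_b = 1 then output;                  /* keep all responders           */
   else if ranuni(27513) < 0.15 then output;     /* sample ~15% of non-responders */
run;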

Before modeling begins, data must be assembled, often from a variety of sources, and arranged in a format suitable for model building. The number of months since origin is a field derived from the first donation date.
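
Such a derived field might be computed with the INTCK function; the variable names and the reference date below are illustrative assumptions.

/* Sketch only: months elapsed between the first donation date and a reference date. */
data work.donor_derived;
   set work.donor;
   months_since_origin = intck('month', first_gift_date, '01JUN1997'd);
run;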

Of course the user needs to be able to control such things as the age groups, color selection and order, and number of desired ranks.
