There is no set rule as to the number of variables to include in a prediction model as it often depends on several factors. The ‘one in ten rule’, a rule that stipulates how many variables/parameters can be estimated from a data set, is quite popular in traditional clinical prediction modeling strategies (e.g., logistic regression and survival models). According to this rule, one variable can be considered in a model for every 10 events.
Problem: how can we collect data?
Acquisition of a new dataset
Data collection is usually a costly process.
Investment of time and money for:
Programming or learning to use an acquisition tool
Handling of large amounts of data
Testing to find any bugs that could compromise the success of the acquisition
Unfortunately, we often notice them at the end of the process
Acquire new hardware for data collection and storage
It is necessary to carefully consider whether it is appropriate to acquire a new dataset.
Considerations not only in engineering but also in management and economic aspects
Future needs must be foreseen in advance
Common Problems in Data Collection
Sometimes companies have a lot of “old” data in their databases or information systems:
They don’t know what to do with it
Data re-collection on existing data (since data must be clean or something similar)
In many business processes, it is unclear:
Which data is possible to collect (also due to privacy issues)
Which data is (really) useful for the business
Public Datasets
We can get data mainly in two ways.
By using publicly available data (datasets or databases)
Someone collected them for us!
They can be free or for a fee
The quality of the data made available must be checked
By acquiring a new set of data, but why?
It is not certain that public data well represents the problem we want to solve
We want to acquire specific data and thus generate specific expertise for the company (know-how)
We are forced to acquire data that, due to their sensitive nature, would not otherwise be available (privacy issues)
The company we work for already has a data collection process that we can use
Public Datasets
Many universities publicly release their datasets:
There are no requirements related to profit or non-disclosure agreement (NDA)
It is the basis of the scientific method, in particular for the reproducibility of the results obtained
I release my data so that others can conduct my experiments and verify my results
The “survivors” get studied, while the failures are excluded, leading to potentially flawed conclusions.
Start-up Success Stories
People often hear stories of wildly successful companies like Apple, Amazon, or Tesla and assume that hard work and a good idea are enough to succeed. This overlooks the countless failed start-ups that had hard-working teams and great ideas but didn’t survive due to market conditions, competition, or other factors.
Fitness and Weight Loss Programs
Testimonials for fitness programs often highlight people who achieved dramatic results. These “success stories” ignore the many individuals who followed the same program but didn’t achieve noticeable results, either due to differences in genetics, lifestyle, or other factors.
Investment Portfolios
Financial advice often highlights top-performing stocks or mutual funds as examples of great investments. These examples focus on the “survivors” in the market, ignoring the many investments that failed or underperformed, which can lead to overestimating the likelihood of similar success in the future.
Survivorship bias
https://xkcd.com/1827/
Selection bias
https://xkcd.com/2618/
… and more!
Confirmation Bias
A person who believes in a specific political ideology might seek out news articles or social media posts that reinforce their beliefs, while ignoring or dismissing information that contradicts their views. This selective exposure to information strengthens their pre-existing opinions.
Anchoring Bias
When shopping for a new car, a person may see a car priced at $30,000, and then a second car priced at $25,000. Even if the second car is not objectively better or a great deal, the first price “anchors” their perception of the value of the second car, leading them to think it’s a better deal simply because it’s cheaper than the first option.
Availability Bias
After watching several news reports about airplane crashes, a person might overestimate the risk of flying. They might avoid flying, despite it being statistically safer than driving, because the images and stories of crashes are more readily available in their memory.
Biases and fairness
Fairness refers to the various attempts to correct algorithmic bias in automated decision processes based on ML models.
Decisions made by such models after a learning process may be considered unfair if they were based on variables considered sensitive
E.g., gender, ethnicity, sexual orientation, or disability
Batini, Carlo, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. “Methodologies for Data Quality Assessment and Improvement.”ACM Computing Surveys (CSUR) 41 (3): 1–52.
Chowdhury, Mohammad Ziaul Islam, and Tanvir C Turin. 2020. “Variable Selection Strategies and Its Importance in Clinical Prediction Modelling.”Family Medicine and Community Health 8 (1).
Fisher, Ronald A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.”Annals of Eugenics 7 (2): 179–88.
Sidi, Fatimah, Payam Hassany Shariat Panahy, Lilly Suriani Affendey, Marzanah A Jabar, Hamidah Ibrahim, and Aida Mustapha. 2012. “Data Quality: A Survey of Data Quality Dimensions.”2012 International Conference on Information Retrieval & Knowledge Management, 300–304.