Beware of overfitting and underfitting!
Two common problems can appear during training and reduce an algorithm’s ability to predict on new data.
Overfitting occurs when the algorithm memorizes the training data instead of learning patterns that generalize: prediction errors are very low on the training data but high on new observations (or test data). Known ways to mitigate overfitting include cross-validation to select model parameter values, bagging and regularization.
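To make this concrete, here is a minimal sketch of overfitting and of cross-validation used to pick a parameter value. Everything in it is an assumption chosen for illustration, not a recommended setup: the data is a synthetic noisy sine curve, the model is a hand-written k-nearest-neighbours regressor, and names such as knn_predict and cross_val_mse are hypothetical. With k = 1 the model simply recalls each training point, so its training error is zero while its error on new data is not.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    """Mean squared error of a k-NN model built on `train`, scored on `points`."""
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in points]
    return sum(errors) / len(errors)

def cross_val_mse(data, k, folds=5):
    """Average validation error over `folds` interleaved train/validation splits."""
    scores = []
    for f in range(folds):
        held_out = data[f::folds]
        kept = [p for i, p in enumerate(data) if i % folds != f]
        scores.append(mse(kept, held_out, k))
    return sum(scores) / folds

rng = random.Random(0)  # fixed seed so the run is repeatable

def noisy_sine(xs):
    """Synthetic data (an assumption): a sine curve plus Gaussian noise."""
    return [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in xs]

train = noisy_sine([2 * math.pi * i / 100 for i in range(100)])
test = noisy_sine([rng.uniform(0, 2 * math.pi) for _ in range(100)])

candidates = [1, 3, 5, 9, 15, 25]
train_error = {k: mse(train, train, k) for k in candidates}
test_error = {k: mse(train, test, k) for k in candidates}

# Cross-validation only looks at the training data when choosing k.
best_k = min(candidates, key=lambda k: cross_val_mse(train, k))

for k in candidates:
    print(f"k={k:2d}  train MSE={train_error[k]:.3f}  test MSE={test_error[k]:.3f}")
print("k selected by cross-validation:", best_k)
```

The gap between the training and test columns is the symptom to watch for; the value of k chosen by cross-validation is selected without ever touching the test data.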
Underfitting occurs when the algorithm does not learn enough from the training data, resulting in high prediction errors on both the training and test data. This can be caused by poor model selection (e.g., an overly simple linear model). Solutions for underfitting include adding more features to the initial model, increasing its complexity and training it for longer. Adding more data of the same kind generally helps less when the model is too simple to capture the pattern.
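The following sketch illustrates underfitting under stated assumptions: the data is synthetic (a noisy parabola), the models are ordinary least-squares fits written by hand, and helper names such as solve_linear_system and fit are hypothetical. A straight line is too simple for this data, so its error is high on both the training and test sets; adding the feature x² lowers both.

```python
import random

def solve_linear_system(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def fit(rows, y):
    """Least-squares weights from the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def mse(rows, y, w):
    preds = [sum(wi * xi for wi, xi in zip(w, r)) for r in rows]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)

rng = random.Random(0)
xs = [i / 10 for i in range(-30, 31)]
ys = [x * x + rng.gauss(0, 0.5) for x in xs]  # synthetic noisy parabola

# Even-indexed points train the model, odd-indexed points test it.
x_train, y_train = xs[0::2], ys[0::2]
x_test, y_test = xs[1::2], ys[1::2]

feature_sets = {
    "linear (1, x)": lambda x: [1.0, x],
    "quadratic (1, x, x^2)": lambda x: [1.0, x, x * x],
}

results = {}
for name, features in feature_sets.items():
    w = fit([features(x) for x in x_train], y_train)
    results[name] = (
        mse([features(x) for x in x_train], y_train, w),
        mse([features(x) for x in x_test], y_test, w),
    )
    print(f"{name:22s} train MSE={results[name][0]:.2f}  test MSE={results[name][1]:.2f}")
```

Unlike overfitting, there is no gap between training and test error here: both are poor for the simpler model, which is the signature of underfitting.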
Preparing your data well
In a data science project, collecting and preparing data often takes most of the time; a figure of 80% is commonly cited. A successful data science project begins with good data preparation, which should not be underestimated.
Several types of data can be collected for your projects, such as text, images and videos. The data can be internal (e.g., readings from vibration sensors, electrical currents, motor temperatures) or external (e.g., weather data and other public data relevant to the problem). Raw data from such varied sources is often unstructured.
After the data is collected, the next step is to structure it and centralize it in a single repository available across the organization. Structuring your data involves cleaning it (e.g., removing irrelevant data and duplicates), formatting it (e.g., converting types, imputing missing values, fixing syntax errors, standardizing and scaling values) and assessing its quality (completeness, consistency and uniformity).
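As a rough illustration of these preparation steps, the sketch below removes duplicates, converts text to numbers, measures completeness, imputes missing values and scales each column. The records, the field names (sensor, vibration, motor_temp) and the choices of mean imputation and min-max scaling are all assumptions made for the example; a real project would pick them based on the data at hand.

```python
from statistics import mean

# Hypothetical raw sensor export: every value arrives as text.
raw = [
    {"sensor": "M1", "vibration": "0.42", "motor_temp": "71.5"},
    {"sensor": "M1", "vibration": "0.42", "motor_temp": "71.5"},  # exact duplicate
    {"sensor": "M2", "vibration": "0.57", "motor_temp": ""},      # missing value
    {"sensor": "M3", "vibration": "n/a", "motor_temp": "69.0"},   # unparseable value
    {"sensor": "M4", "vibration": "0.61", "motor_temp": "74.0"},
]
NUMERIC = ("vibration", "motor_temp")

def to_float(text):
    """Type conversion: return a float, or None when the text is not a number."""
    try:
        return float(text)
    except ValueError:
        return None

# Cleaning: drop exact duplicates while keeping the original order.
seen, rows = set(), []
for rec in raw:
    key = tuple(sorted(rec.items()))
    if key not in seen:
        seen.add(key)
        rows.append({**rec, **{c: to_float(rec[c]) for c in NUMERIC}})

# Quality assessment: share of usable values per column, before imputation.
completeness = {
    c: sum(r[c] is not None for r in rows) / len(rows) for c in NUMERIC
}

for c in NUMERIC:
    # Imputation: replace missing values with the column mean.
    fill = mean(r[c] for r in rows if r[c] is not None)
    for r in rows:
        if r[c] is None:
            r[c] = fill
    # Scaling: min-max to the range [0, 1] (assumes the column is not constant).
    lo, hi = min(r[c] for r in rows), max(r[c] for r in rows)
    for r in rows:
        r[c] = (r[c] - lo) / (hi - lo)

print("completeness:", completeness)
for r in rows:
    print(r)
```

Measuring completeness before imputing matters: once the gaps are filled, the data looks complete and the original quality problem is no longer visible.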
Select your model
After properly preparing your data, the next step in your data science project is naturally to select your machine learning model. Depending on the type of learning you need (supervised, unsupervised) and the type of task you want to solve (classification, regression, clustering), there are several options (e.g., decision trees, neural networks, support vector machines, K-Means, DBSCAN). But which model should you choose?
It is generally difficult to guess which model is the most appropriate, because it depends on data size and quality as well as the type of problem you want to solve. Each model has its pros and cons. It is recommended to test several machine learning models and compare their results on the same data. The performance of a model also depends on its parameter values: well-chosen values improve performance, and poorly chosen ones degrade it. Finally, be aware that the performance of your selected model is not guaranteed forever, especially if your data tends to change over time. So, it is important to re-train your model on a regular basis to keep it up to date, and to compare it with other models to ensure it still best meets your needs.
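One possible way to run such a comparison is sketched below. It is only an illustration under assumptions: the data is two synthetic clusters, the three model families (a majority-class baseline, a nearest-centroid classifier and k-nearest neighbours) are hand-written stand-ins for whatever candidates a real project would consider, and all function names are hypothetical. Every candidate is trained on the same training set and scored on the same validation set, and k-nearest neighbours is tried with several parameter values.

```python
import math
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable

def make_blobs(n_per_class):
    """Two noisy 2-D clusters labelled 0 and 1 (synthetic, for illustration)."""
    centers = {0: (0.0, 0.0), 1: (2.0, 2.0)}
    data = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            data.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(data)
    return data

def majority_model(train):
    """Baseline: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def centroid_model(train):
    """Predict the label whose class mean is closest to x."""
    cents = {}
    for label in {y for _, y in train}:
        pts = [x for x, y in train if y == label]
        cents[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return lambda x: min(cents, key=lambda lab: math.dist(x, cents[lab]))

def knn_model(train, k):
    """Predict by majority vote among the k closest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: math.dist(x, p[0]))[:k]
        return Counter(y for _, y in nearest).most_common(1)[0][0]
    return predict

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

data = make_blobs(150)
train, valid = data[:200], data[200:]  # same split for every candidate

candidates = {
    "majority baseline": majority_model(train),
    "nearest centroid": centroid_model(train),
}
for k in (1, 5, 15):  # same model family, different parameter values
    candidates[f"kNN (k={k})"] = knn_model(train, k)

scores = {name: accuracy(model, valid) for name, model in candidates.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name:18s} validation accuracy = {score:.2f}")
print("selected:", best)
```

Re-running the same comparison whenever a fresh batch of data arrives is one simple way to check that the selected model still holds up against its alternatives.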
How to start your AI project?
The best way to start a project in a little-known field is to be supported by a partner who has the relevant expertise. Our experts in AI, advanced data analysis and industrial IT are here to guide you in your data science project and to advise you in making the right choices, whether for preparing your data or selecting your learning algorithm.
Try our new AI jump start program to quickly launch your first AI project and make it a success. This three- to four-week program will help you find out where your organization stands in terms of AI and then clearly identify the steps you need to prioritize to maximize your investments as quickly as possible. A team of BBA experts will help you define your AI project and guide you in developing a short- and long-term AI vision for your organization.
BBA also offers machine learning training that lets you discover this rapidly evolving field and learn about the latest market trends.
Contact us to find out more and start your AI adventure!