Data Collection Strategies for Machine Learning

Training Data = I’ll Worry About It Later

The tendency to undervalue training data isn’t a baseless decision that developers and data scientists consciously make. It happens when several factors come together:

  • Mindset: You entered your field to solve problems. This is creative work that is highly engaging to technically-minded people.
  • Advances: In the AI literature, most of the exciting advancements are focused on machine learning methods.
  • Chores: When compared with the creative work, the idea of overseeing, curating, and labelling training data can feel like a chore.

These factors can have a serious impact on our personal and organizational priorities. However, organizations may lose valuable time if they don’t take training data seriously from the start. The quality of your training data will affect the performance of any machine learning model, no matter how revolutionary it may seem to be.

Given how much the data matters in the ML project pipeline, this post discusses best practices, strategies, approaches, and quality assurance methods for data collection for machine learning.

Cyclic Data Annotation

Annotating the ML datasets inside iterative cycles.

The MAMA (Model-Annotate-Model-Annotate) Cycle

Modeling denotes modeling the phenomenon under consideration, by providing a description that forms:

  • The basis for the annotation.
  • The basis of the features used in a development cycle for training and testing the algorithm.

Once you have an initial model for the phenomenon, you have the first specification for the annotation. This is the document from which you will create the blueprint for how to annotate your corpus, which is called the annotation guidelines. You need good annotation guidelines. “I know it when I see it” is unacceptable. The guidelines need to give examples with corresponding explanations.

Once you have a guideline written up you need to put it to the test. Give a few people a set of examples and ask them to annotate the text based on the guideline. Did they agree on how to annotate the text? (evaluation) If yes, great, then the guideline is probably useful. If no, revise and refine the guideline.
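One way to put a number on "did they agree?" is a chance-corrected agreement score. The sketch below uses Cohen's kappa for two annotators, which is a common choice but not one this post prescribes. The label lists and the function name are made up for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled at random
    # with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pilot: two people annotate the same six texts.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A score close to 1 suggests the guideline is being read the same way by both people. What counts as "good enough" depends on the task, so any cut-off you pick is your own assumption.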

The MATTER Cycle

The MATTER (Model-Annotate-Train-Test-Evaluate-Revise) cycle also starts with modeling the phenomenon. Once you have an initial model for the phenomenon, you have the first specification for the annotation and for the features used for training and testing the algorithm. After the data is annotated, an algorithm is trained and tested on it using the features specified by the model. Finally, based on an analysis and evaluation of the system's performance, the model of the phenomenon may be revised, and the algorithm retrained and retested, to make the annotation more robust and reliable.

Data Quality-Assurance Methods

Three standard methods are commonly used to ensure accuracy and consistency:

  • Gold sets, or benchmarks, measure accuracy by comparing annotations (or annotators) to a “gold set” or vetted example. This helps to measure how well a set of annotations from a group or individual matches the benchmark.
  • Consensus, or overlap, measures consistency and agreement amongst a group, and does so by dividing the sum of agreeing data annotations by the total number of annotations.
  • Auditing measures both accuracy and consistency by having an expert review the labels, either by spot-checking or reviewing them all.
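The first two measures reduce to simple ratios. Below is a minimal sketch with invented item ids and labels. It assumes that an annotation counts as "agreeing" when it matches the majority label for its item, which is one possible reading of the consensus definition above.

```python
from collections import Counter


def gold_accuracy(annotations, gold):
    """Share of gold-set items whose annotation matches the vetted label."""
    scored = [item for item in gold if item in annotations]
    if not scored:
        raise ValueError("annotations do not cover any gold-set item")
    hits = sum(annotations[item] == gold[item] for item in scored)
    return hits / len(scored)


def consensus(labels_per_item):
    """Agreeing annotations divided by the total number of annotations.

    Assumption: an annotation agrees if it equals the item's majority label.
    """
    agreeing = 0
    total = 0
    for labels in labels_per_item.values():
        majority_count = Counter(labels).most_common(1)[0][1]
        agreeing += majority_count
        total += len(labels)
    return agreeing / total


# Hypothetical data: a vetted gold set and one annotator's work.
gold = {"doc1": "spam", "doc2": "ham", "doc3": "spam"}
annotations = {"doc1": "spam", "doc2": "spam", "doc3": "spam", "doc4": "ham"}

# Hypothetical overlap: three annotators labelled each item.
labels_per_item = {
    "doc1": ["spam", "spam", "ham"],
    "doc2": ["ham", "ham", "ham"],
    "doc3": ["spam", "ham", "ham"],
}

print(f"gold-set accuracy: {gold_accuracy(annotations, gold):.2f}")
print(f"consensus: {consensus(labels_per_item):.2f}")
```

Items outside the gold set (here `doc4`) are ignored by the accuracy score, so the gold set should be mixed into the regular workload for the number to be representative.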

Data Annotation and Learning Protocols

Facilitating the annotation process using learning protocols.

Active Learning

With active learning, objects are intelligently sampled, thus minimizing the total number of objects to be annotated before some level of predictive accuracy is achieved. Rather than annotating all objects independently and simultaneously, the learning protocol in active learning follows a set of well-defined criteria to select the next object to annotate at each step.

For example, you can try to maximize the diversity of the data samples to be annotated by computing the distances between them prior to annotation. At each step, the sample that is most dissimilar to all samples annotated so far is selected. Alternatively, you can select objects that lie on the border between two classes and therefore carry the most information, so that the algorithm learns to discriminate between objects of different classes.
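Both selection criteria can be sketched in a few lines. The code below is an illustration under assumptions, not a full active learning loop: "dissimilar" is taken to mean Euclidean distance to the nearest annotated sample, "on the border" is taken to mean a small gap between the two highest class probabilities from some existing model, and all points and probabilities are invented.

```python
import math


def most_dissimilar(unlabelled, annotated):
    """Index of the unlabelled point farthest from its nearest annotated point."""
    def distance_to_nearest(point):
        return min(math.dist(point, other) for other in annotated)

    return max(range(len(unlabelled)),
               key=lambda i: distance_to_nearest(unlabelled[i]))


def closest_to_boundary(class_probabilities):
    """Index of the sample with the smallest gap between its two likeliest classes."""
    def margin(probabilities):
        top, second = sorted(probabilities, reverse=True)[:2]
        return top - second

    return min(range(len(class_probabilities)),
               key=lambda i: margin(class_probabilities[i]))


# Hypothetical 2-D feature vectors.
annotated = [(0.0, 0.0), (1.0, 0.0)]
unlabelled = [(4.0, 4.0), (0.5, 0.1), (1.0, 1.0)]
diverse_pick = most_dissimilar(unlabelled, annotated)

# Hypothetical class probabilities predicted for three unlabelled samples.
predicted = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
uncertain_pick = closest_to_boundary(predicted)

print("annotate next (diversity):", unlabelled[diverse_pick])
print("annotate next (uncertainty): sample", uncertain_pick)
```

In practice the chosen sample is annotated, moved into the annotated set, and the selection is repeated. The uncertainty variant also needs the model to be retrained between steps so that its probabilities stay current.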

Weakly Supervised Learning

Another way to minimize the annotation cost is weakly supervised learning, an umbrella term covering several approaches that attempt to build predictive models by learning from weak supervision. One such approach injects domain expertise through functions that label data programmatically, and the labels they produce become new training data.
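To make the idea of functions that label data concrete, here is a minimal sketch. The three heuristics, the spam/ham task, and the majority-vote rule for combining them are all assumptions chosen for illustration. Real systems typically weigh the functions by estimated reliability rather than counting votes.

```python
from collections import Counter

ABSTAIN = None  # a labelling function returns this when its rule does not apply


def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 3 else ABSTAIN


def lf_greeting(text):
    return "ham" if text.lower().startswith(("hi", "hello")) else ABSTAIN


# Hypothetical set of heuristics written by a domain expert.
LABELLING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greeting]


def weak_label(text):
    """Combine the labelling functions by majority vote; abstain on ties."""
    votes = [vote for vote in (lf(text) for lf in LABELLING_FUNCTIONS)
             if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    (top_label, top_count), *rest = Counter(votes).most_common(2)
    if rest and rest[0][1] == top_count:
        return ABSTAIN  # the heuristics disagree evenly
    return top_label


for message in ["FREE prize!!! Click now",
                "Hello, lunch tomorrow?",
                "Meeting moved to 3pm",
                "Hi, free tickets"]:
    print(repr(message), "->", weak_label(message))
```

The labels produced this way are noisy by design: each rule is cheap to write and sometimes wrong, which is exactly the inaccurate supervision setting.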

There are three typical types of weak supervision: incomplete supervision, when only a subset of the training data is labelled; inexact supervision, when the training data come with labels that are not as exact as desired; and inaccurate supervision, when some of the labels in the training data contain mistakes.

One example is using a transductive support vector machine (TSVM). While an SVM looks for a decision boundary (a hyperplane) with a maximal margin over the labelled data, a TSVM also labels the unlabelled instances, looking for a decision boundary with a maximal margin over all of the data.

Annotation Approaches Comparison

The choice of an approach depends on the complexity of a problem and training data, the size of a data science team and the financial and time resources a company can allocate to implement a project.

| Approach | Description | Advantages | Disadvantages |
|---|---|---|---|
| Internal labelling | Assigning the task to an in-house data science team. | Predictable results; high accuracy of labelled data; the ability to track progress. | It takes much time. |
| Outsourcing | Recruiting temporary employees on freelance platforms. | The ability to evaluate applicants’ skills. | The need to organize workflow. |
| Crowdsourcing | Cooperation with freelancers from crowdsourcing platforms. | Cost-saving; fast results. | Quality of work can suffer. |
| Specialized outsourcing companies | Hiring an external team for a specific project. | Assured quality. | Higher price compared to crowdsourcing. |
| Synthetic labelling | Generating data with the same attributes as real data. | Fewer constraints for using sensitive and regulated data; training data without mismatches and gaps; cost and time effectiveness. | High computational power required. |
| Data programming | Using scripts that programmatically label data to avoid manual work. | Automation; fast results. | Lower quality dataset. |


In this post, we focused on data collection for machine learning, which is a basic building block of any machine learning project. We introduced developing ML datasets inside iterative cycles, presented common data quality assurance measures, listed annotation approaches with their pros and cons, and discussed the use of learning protocols in the data annotation process.

Do you know that we use all this and other AI technologies in our app? Look at what you’re reading now applied in action. Try our Almeta News app. You can download it from Google Play or Apple’s App Store
