Cristina Ferrero Castaño
Top 10 Terms You Need to Know as a Data Scientist
Your guide to understanding basic data science lingo
Towards Data Science
Sara A. Metwalli
Dec 13, 2020
Data science is one of the fields that can be very overwhelming for new people joining in. The term “data science” is broad and used as an umbrella term to cover many subfields. Machine learning is data science, artificial intelligence is data science, natural language processing is data science, and mining data is also considered data science.
All of this terminology can be, and is, extremely confusing and sometimes discouraging for a newbie. When you decide to join the field, you need to know what the field actually is, what it includes, and the basic terminology.
But gathering this information is not easy, especially if you don’t have the knowledge required to navigate the web and extract the correct information.
When I first joined the field, I felt I had to juggle many things: learning the techniques, keeping up to date with research and advances in the field, and trying to understand the terminology, or as I called it, “the lingo.”
So, a couple of years in, I thought I would write an article to help others who are joining the field and don’t know where to start in getting familiar with data science lingo. I gathered 10 terms that every data scientist needs to know to build or develop any data science project.
Nº 1: Model
One of the most important terms in data science that you will hear quite often is “model”: model training, improving model efficiency, model behavior, and so on. But what is a model?
Mathematically speaking, a model is a specification of some probabilistic relationship between different variables. In layman’s terms, a model is a way of describing how two or more variables behave together.
Since the term “modeling” can be vague, “statistical modeling” is often used to describe modeling done by data scientists more accurately.
Nº 2: Regression
Regression is a machine learning term. In fact, regression is one of the most basic and simple supervised machine learning approaches. In regression problems, you have two kinds of values: a target value (also called the criterion variable) and one or more other values known as the predictors.
An example of that is the job market; how easy or difficult getting a job is (the criterion variable) depends on the demand for the position and the supply for it (the predictors).
There are different types of regression to match different applications; the easiest ones are linear and logistic regression.
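As a rough illustration of linear regression with a single predictor, here is a minimal sketch using ordinary least squares in plain Python. The function name fit_line and the job-market numbers are made up for this example; they are assumptions, not data from a real study.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y is approximated by slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Covariance-like and variance-like sums (the common 1/n factor cancels out).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical data: demand for a position (predictor) vs. weeks needed to get hired (criterion).
demand = [1.0, 2.0, 3.0, 4.0, 5.0]
weeks_to_hire = [11.0, 9.0, 7.0, 5.0, 3.0]

slope, intercept = fit_line(demand, weeks_to_hire)
print(f"weeks_to_hire = {slope:.1f} * demand + {intercept:.1f}")

# Use the fitted line to predict the criterion for an unseen predictor value.
predicted = slope * 6.0 + intercept
print(f"predicted weeks at demand 6.0: {predicted:.1f}")
```

The slope and the intercept are the two numbers the model learns from the data; everything else in the sketch is bookkeeping.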
Nº 3: Parameter
This is one of the terms that can be quite confusing because it has slightly different meanings depending on the context in which you use it. For example, in statistics, a parameter describes a property of a probability distribution, e.g., its shape or scale.
In data science or machine learning, the term parameter is often used, more precisely, to refer to the components of the system that are learned from the data. In machine learning, there are two types of models: parametric models and nonparametric models.
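To make the statistical meaning concrete, the sketch below estimates the two parameters of a normal distribution (its mean and standard deviation) from a small sample, using NormalDist from Python's standard statistics module. The sample values are hypothetical, and assuming the data is normally distributed is itself a modeling choice.

```python
from statistics import NormalDist

# Hypothetical sample of heights in centimeters.
heights_cm = [158.0, 162.0, 165.0, 170.0, 175.0]

# Estimate the distribution's parameters from the sample.
dist = NormalDist.from_samples(heights_cm)
print(f"mean (location parameter): {dist.mean:.2f}")
print(f"stdev (scale parameter):   {dist.stdev:.2f}")

# Once the parameters are known, the distribution is fully specified,
# so we can ask it questions such as "what share of values fall below 160 cm?"
share_below_160 = dist.cdf(160.0)
print(f"estimated share below 160 cm: {share_below_160:.2f}")
```

This is also the idea behind a parametric model: a fixed, small set of numbers summarizes what was learned from the data.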
Nº 4: Bias
When you hear or read the term bias, your brain often associates it with something negative. However, that’s not always true. In data science, bias is often used to refer to an error in the data.
Bias occurs in the data as a result of sampling and estimation. When we choose some data to analyze, we often sample from a bigger data pool. The sample you select could be biased, as in, it could be an inaccurate representation of the pool.
Since the model we are training only knows the data we give it, it will learn only what it can see. That’s why data scientists need to be fully aware of this fact to create unbiased models.
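The effect of a biased sample can be sketched with simulated data. Below, a hypothetical data pool is generated with a fixed random seed, and then two samples are drawn from it: one at random and one that deliberately keeps only the largest values. All names and numbers are assumptions chosen for illustration.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical "data pool": 10,000 values centered around 50.
population = [random.gauss(50, 10) for _ in range(10_000)]

# A random sample tends to represent the pool well.
random_sample = random.sample(population, 500)

# A biased sample: we only kept the 500 largest values,
# for example because small values were harder to collect.
biased_sample = sorted(population)[-500:]

pool_mean = mean(population)
random_mean = mean(random_sample)
biased_mean = mean(biased_sample)

print(f"pool mean:          {pool_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
```

A model trained only on the biased sample would never see the lower values and would learn a distorted picture of the pool.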
Nº 5: Correlation
In general, we use correlation to refer to the degree to which two or more events occur together. For example, if depression cases increase in cold weather areas, there might be some correlation between cold weather and depression.
Often, things correlate to different degrees. For example, following a recipe and having a delicious dish have a higher correlation than depression and cold weather. This degree of correlation is measured by the correlation coefficient.
When the correlation coefficient is 1, the two events in question are perfectly correlated, whereas if it is 0.2, the events are only weakly correlated. The coefficient can also be negative, down to -1. In this case, the relation between the events is inverse: as one increases, the other decreases. For example, if you eat well, your chances of getting ill will decrease.
Finally, you must always remember correlation doesn’t mean causation.
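One common correlation coefficient is Pearson's r, which can be computed by hand as in the sketch below. The helper name pearson and the toy data are assumptions made for this example. In Python 3.10 and later, statistics.correlation from the standard library computes the same quantity.

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data: y doubles with x, so the two move together perfectly.
r_positive = pearson([1, 2, 3, 4], [2, 4, 6, 8])

# Toy data: y falls as x rises, so the relation is perfectly inverse.
r_negative = pearson([1, 2, 3, 4], [8, 6, 4, 2])

# Toy data: mostly increasing, but not perfectly.
r_partial = pearson([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])

print(round(r_positive, 3), round(r_negative, 3), round(r_partial, 3))
```

Note that a high coefficient only says that the values move together; it says nothing about which one, if either, causes the other.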
Nº 6 & 7: Overfitting/Underfitting
We already said that a model is a relationship between variables. We also mentioned parametric and nonparametric models. Another way to describe models is by how well they fit the data they are being applied to.
Overfitting happens when your model takes in too much information about the training data, including its noise. So, you end up with a model that is overly complex and difficult to apply to data other than the training data.
The opposite of overfitting is underfitting. Underfitting happens when the model doesn’t capture enough information about the data. In this case, you end up with a poorly fitted model.
One of the skills you will need to learn as a data scientist is how to find the middle ground between overfitting and underfitting.
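The trade-off can be sketched with a k-nearest-neighbors regressor, a technique chosen here purely for illustration (it is not discussed elsewhere in this article). With k = 1 the model memorizes the training data (overfitting); with k equal to the size of the training set it always predicts the overall average (underfitting). The data is simulated with a fixed seed, and all names are assumptions.

```python
import math
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

def noisy_sample(n):
    """Simulated data: y follows sin(x) plus random noise."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    return [(x, math.sin(x) + random.gauss(0, 0.3)) for x in xs]

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return mean(y for _, y in nearest)

def mse(train, data, k):
    """Mean squared error of the k-NN model on `data`."""
    return mean((knn_predict(train, x, k) - y) ** 2 for x, y in data)

train = noisy_sample(40)
test = noisy_sample(40)

for k in (1, 5, 40):
    train_error = mse(train, train, k)
    test_error = mse(train, test, k)
    print(f"k={k:>2}  train error={train_error:.3f}  test error={test_error:.3f}")
```

The thing to look for in the output is the gap between the two columns: a near-zero training error paired with a clearly larger test error suggests overfitting, while high errors on both suggest underfitting.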
Nº 8: Cross-validation
Cross-validation is a way to evaluate the model’s behavior when it is applied to a dataset different from the training data used to build it. This is a big concern for data scientists because your model will often have good results on the training data but perform much worse when applied to noisy real-life data.
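One widely used form is k-fold cross-validation: the data is split into k parts, and each part takes a turn as the held-out validation set while the model is fit on the rest. The sketch below assumes a simple straight-line model and made-up data; the function names are hypothetical. In practice the data is usually shuffled before splitting.

```python
from statistics import mean

def k_fold_splits(n_items, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_items))
    for fold in range(k):
        validation = indices[fold::k]  # every k-th item, starting at `fold`
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, validation

def fit_line(xs, ys):
    """Least-squares line through the points; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

def cross_validate(xs, ys, k=5):
    """Return the validation mean squared error of each fold."""
    scores = []
    for training, validation in k_fold_splits(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in training],
                                    [ys[i] for i in training])
        errors = [(slope * xs[i] + intercept - ys[i]) ** 2 for i in validation]
        scores.append(mean(errors))
    return scores

# Hypothetical data: a straight line with a small alternating disturbance.
xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, x in enumerate(xs)]

scores = cross_validate(xs, ys, k=5)
print("per-fold validation error:", [round(s, 3) for s in scores])
print("average validation error: ", round(mean(scores), 3))
```

Averaging the per-fold errors gives an estimate of how the model behaves on data it was not trained on, which a single training-set score cannot provide.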
Nº 9: Hypothesis
A hypothesis, in general, is a proposed explanation for some event. Often, hypotheses are made based on previous data and observations. A valid hypothesis is one that can be tested, with the results showing it to be either true or false.
In statistics, a hypothesis must be falsifiable. That means we must be able to test the hypothesis to determine whether it holds or not. In machine learning, the term hypothesis refers to candidate models that can be used to map the model’s inputs to the correct and valid output.
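As one possible way of testing a statistical hypothesis, the sketch below runs a permutation test. The hypothesis being tested is that two groups come from the same distribution, so that any difference in their averages is due to chance. The group values, the function name, and the number of permutations are all assumptions for illustration.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Estimate how often a random regrouping gives a difference in means
    at least as large as the one observed between groups a and b."""
    rng = random.Random(seed)  # local generator with a fixed seed
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations

# Hypothetical measurements from two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
group_b = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8]

p_value = permutation_test(group_a, group_b)
print(f"estimated p-value: {p_value:.4f}")

# A small p-value means the observed difference would be rare
# if the "same distribution" hypothesis were true.
if p_value < 0.05:
    print("The data gives us reason to reject the hypothesis.")
else:
    print("The data does not give us reason to reject the hypothesis.")
```

The 0.05 threshold used here is a common convention, not a rule; the appropriate cut-off depends on the application.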
Nº 10: Outlier
Outlier is a term used in data science and statistics to refer to an observation that lies an unusual distance from other values in the dataset. The first thing every data scientist should do when given a dataset is decide what counts as a usual distance and what counts as unusual.
An outlier can represent different things in the data; it could be noise that occurred during the collection of the data or a way to spot rare events and unique patterns. That’s why outliers shouldn’t be deleted right away; rather, they should be understood and investigated.
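One common rule of thumb for deciding what counts as an unusual distance is the interquartile range (IQR) rule: values more than 1.5 times the IQR below the first quartile or above the third quartile are flagged. The sketch below uses statistics.quantiles from the standard library; the readings are hypothetical, and the 1.5 multiplier is a convention rather than a law.

```python
from statistics import quantiles

def flag_outliers(values, multiplier=1.5):
    """Return the values that fall outside the IQR fences.
    Values are flagged for investigation, not removed."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower_fence = q1 - multiplier * iqr
    upper_fence = q3 + multiplier * iqr
    return [v for v in values if v < lower_fence or v > upper_fence]

# Hypothetical sensor readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 95]

outliers = flag_outliers(readings)
print("flagged for investigation:", outliers)
```

The function returns the flagged values instead of dropping them, so you can look into whether each one is a collection error or a genuinely rare event.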