Learning

Let’s Learn Data Engineering

What Does a Data Engineer Do?

Data engineers are responsible for designing, building, and maintaining systems that collect, process, and store data. They ensure that data is accessible, reliable, and prepared for analysis by data scientists and analysts. Key responsibilities include:

  • Developing and managing data pipelines
  • Implementing ETL (Extract, Transform, Load) processes
  • Ensuring data quality and integrity
  • Collaborating with cross-functional teams to understand data needs

Steps to Become a Data Engineer

Educational Background

While a bachelor’s degree in computer science, software engineering, or information technology is common, it’s not mandatory. Many professionals transition into data engineering through self-study, bootcamps, or specialized courses. The focus should be on acquiring relevant skills and practical experience.

Develop Essential Skills

Key skills for data engineers include:

  • Programming Languages: Proficiency in languages like Python, Java, or Scala is crucial.
  • Database Management: Strong understanding of SQL and familiarity with NoSQL databases.
  • Data Warehousing Solutions: Experience with tools like Amazon Redshift, Google BigQuery, or Snowflake.
  • ETL Tools: Knowledge of ETL frameworks such as Apache Airflow or Apache NiFi.
  • Cloud Platforms: Familiarity with cloud services like AWS, Azure, or Google Cloud Platform.
  • Big Data Technologies: Understanding of Hadoop, Spark, and Kafka.

Continuous learning is vital, as the data engineering landscape evolves rapidly.

Hands-On Experience

Practical experience is invaluable. Engage in projects that involve building data pipelines, setting up databases, or processing large datasets. Platforms like GitHub can showcase your work to potential employers.

Lets Learn SQL as an essential step!

Data Definition Language (DDL)


CREATE TABLE

Create a new table with specified columns:


CREATE TABLE employees (
emp_id INT,
name VARCHAR(100),
salary DECIMAL(10,2),
dept VARCHAR(50)
);

ALTER TABLE

Modify table structure, e.g. add/drop column:

ALTER TABLE employees
ADD hire_date DATE;

DROP TABLE

Delete a table and its data:

DROP TABLE employees;

Data Manipulation Language (DML)

INSERT

INSERT INTO employees (emp_id, name, salary, dept)
VALUES (1, ‘Alice’, 70000, ‘HR’);

*For multiple rows:

INSERT INTO employees (emp_id, name, salary)
VALUES (2,’Bob’,80000), (3,’Carol’,90000);

UPDATE

UPDATE employees
SET salary = salary * 1.1
WHERE dept = ‘Engineering’;

*To delete all rows:

DELETE FROM employees;

Data Query Language (DQL)

SELECT (basic retrieval)

SELECT * FROM employees;


SELECT name, salary FROM employees;

WHERE (filter rows)

SELECT * FROM employees
WHERE salary > 60000 AND dept = ‘HR’;

DISTINCT (unique values)

SELECT DISTINCT dept FROM employees;

ORDER BY (sort results)

SELECT name, salary
FROM employees
ORDER BY salary DESC;

LIMIT / TOP / OFFSET-FETCH (limit rows)

SELECT * FROM orders LIMIT 10; — MySQL/PostgreSQL


SELECT TOP 5 * FROM orders; — SQL Server


SELECT * FROM orders ORDER BY order_date


OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY; — Pagination


LIKE, IN, BETWEEN (pattern & range filtering)

SELECT name FROM employees WHERE name LIKE ‘A%’;


SELECT * FROM products WHERE category IN (‘Books’,’Toys’);


SELECT * FROM orders WHERE order_date BETWEEN ‘2024-01-01’ AND ‘2024-12-31’;

Aggregation & Grouping

Aggregate functions: COUNT, SUM, AVG, MIN, MAX

SELECT COUNT(*) FROM employees WHERE dept = ‘Sales’;


SELECT SUM(salary), AVG(salary) FROM employees;

GROUP BY + HAVING

SELECT dept, AVG(salary) AS avg_salary
FROM employees
GROUP BY dept
HAVING AVG(salary) > 50000;


Joins & Subqueries

JOIN (combine tables)

SELECT c.customer_name, o.order_date
FROM customers c
INNER JOIN orders o
ON c.customer_id = o.customer_id;

Subqueries (nested queries)

SELECT * FROM employees
WHERE salary > (
SELECT AVG(salary) FROM employees
);


Correlated subquery example:

SELECT emp_id, name FROM employees e
WHERE salary > (
SELECT AVG(salary) FROM employees
WHERE dept = e.dept
);

Let’s Learn Excel

FunctionSyntax
UNIQUEUNIQUE(array, [by_col], [exactly_once])
XLOOKUPXLOOKUP(lookup_value,lookup_array,return_array,[match_mode],[search_mode])
Logical
ANDAND(logical1,logical2)
IFIF(logical_test,value_if_true,value_if_false)
IFERRORIFERROR(VALUE,value_if_error)
NOTNOT(logical)
OROR(logical1,logical2)
XORXOR(logical1,logical2)
Lookup & Reference
ADDRESSADDRESS(row_num,column_num,abs_num,C1,sheet_text)
AREASAREAS(reference)
CHOOSECHOOSE(index_num,value1,value2)
COLUMNCOLUMN(reference)
COLUMNSCOLUMNS(array)
HLOOKUPHLOOKUP(lookup_value,table_array,row_index_num,range_lookup)
HYPERLINKHYPERLINK(link_location,friendly_name)
INDEXINDEX(array,row_num,column_num)
INDIRECTINDIRECT(ref_text,C1)
LOOKUPLOOKUP(lookup_value,lookup_vector,result_vector)
MATCHMATCH(lookup_value,lookup_array,match_type)
OFFSETOFFSET(reference,rows,cols,height,width)
ROWROW(reference)
ROWSROWS(array)
TRANSPOSETRANSPOSE(array)
VLOOKUPVLOOKUP(lookup_value,table_array,col_index_num,range_lookup)
Date & Time
DATEDATE(year,month,day)
DATEDIFDATEDIF
DATEVALUEDATEVALUE(date_text)
DAYDAY(serial_number)
DAYSDAYS(end_date,start_date)
DAYS360DAYS360(start_date,end_date,method)
EDATEEDATE(start_date,months)
EOMONTHEOMONTH(start_date,months)
HOURHOUR(serial_number)
MINUTEMINUTE(serial_number)
MONTHMONTH(serial_number)
NETWORKDAYSNETWORKDAYS(start_date,end_date,holidays)
NETWORKDAYS.INTLNETWORKDAYS.INTL(start_date,end_date,weekend,holidays)
NOWNOW()
SECONDSECOND(serial_number)
TIMETIME(hour,minute,second)
TIMEVALUETIMEVALUE(time_text)
TODAYTODAY()
WEEKDAYWEEKDAY(serial_number,return_type)
WEEKNUMWEEKNUM(serial_number,return_type)
WORKDAYWORKDAY(start_date,days,holidays)
WORKDAY.INTLWORKDAY.INTL(start_date,days,weekend,holidays)
YEARYEAR(serial_number)
YEARFRACYEARFRAC(start_date,end_date,basis)
Engineering
CONVERTCONVERT(number,from_unit,to_unit)
Financial
FVFV(rate,nper,pmt,pv,type)
PVPV(rate,nper,pmt,fv,type)
NPERNPER(rate,pmt,pv,fv,type)
PMTPMT(rate,nper,pv,fv,type)
RATERATE(nper,pmt,pv,fv,type,guess)
NPVNPV(rate,value1,value2)
IRRIRR(values,guess)
XIRRXIRR(values,dates,guess)
PRICEPRICE(settlement,maturity,rate,yld,redemption,FREQUENCY,basis)
YIELDYIELD(settlement,maturity,rate,pr,redemption,FREQUENCY,basis)
INTRATEINTRATE(settlement,maturity,investment,redemption,basis)
Information
CELLCELL(info_type,reference)
ERROR.TYPEERROR.TYPE(error_val)
ISBLANKISBLANK(VALUE)
ISERRISERR(VALUE)
ISERRORISERROR(VALUE)
ISEVENISEVEN(number)
ISFORMULAISFORMULA(reference)
ISLOGICALISLOGICAL(VALUE)
ISNAISNA(VALUE)
ISNONTEXTISNONTEXT(VALUE)
ISNUMBERISNUMBER(VALUE)
ISODDISODD(number)
ISREFISREF(VALUE)
ISTEXTISTEXT(VALUE)
NN(VALUE)
NANA()
TYPETYPE(VALUE)
Math
ABSABS(number)
AGGREGATEAGGREGATE(function_num,options,array,k)
CEILINGCEILING(number,significance)
COSCOS(number)
DEGREESDEGREES(angle)
DSUMDSUM(database,field,criteria)
EVENEVEN(number)
EXPEXP(number)
FACTFACT(number)
FLOORFLOOR(number,significance)
GCDGCD(number1,number2)
INTINT(number)
LCMLCM(number1,number2)
LNLN(number)
LOGLOG(number,base)
LOG10LOG10(number)
MODMOD(number,divisor)
MROUNDMROUND(number,multiple)
ODDODD(number)
PIPI()
POWERPOWER(number,power)
PRODUCTPRODUCT(number1,number2)
QUOTIENTQUOTIENT(numerator,denominator)
RADIANSRADIANS(angle)
RANDRAND()
RANDBETWEENRANDBETWEEN(bottom,top)
ROUNDROUND(number,num_digits)
ROUNDDOWNROUNDDOWN(number,num_digits)
ROUNDUPROUNDUP(number,num_digits)
SIGNSIGN(number)
SINSIN(number)
SQRTSQRT(number)
SUBTOTALSUBTOTAL(function_num,REH1)
SUMSUM(number1,number2)
SUMIFSUMIF(range,criteria,sum_range)
SUMIFSSUMIFS(sum_range,criteria_range,criteria)
SUMPRODUCTSUMPRODUCT(array1,array2,array3)
TANTAN(number)
TRUNCTRUNC(number,num_digits)
Stats
AVERAGEAVERAGE(number1,number2)
AVERAGEAAVERAGEA(value1,value2)
AVERAGEIFAVERAGEIF(range,criteria,average_range)
AVERAGEIFSAVERAGEIFS(average_range,criteria_range,criteria)
CORRELCORREL(array1,array2)
COUNTCOUNT(value1,value2)
COUNTACOUNTA(value1,value2)
COUNTBLANKCOUNTBLANK(range)
COUNTIFCOUNTIF(range,criteria)
COUNTIFSCOUNTIFS(criteria_range,criteria)
FORECASTFORECAST(x,known_y’s,known_x’s)
FREQUENCYFREQUENCY(data_array,bins_array)
GROWTHGROWTH(known_ys,known_x,new_x,const)
INTERCEPTINTERCEPT(known_ys,known_xs)
LARGELARGE(array,k)
LINESTLINEST(known_ys,known_xs,const,stats)
MAXMAX(number1,number2)
MEDIANMEDIAN(number1,number2)
MINMIN(number1,number2)
MODEMODE(number1,number2)
PERCENTILEPERCENTILE(array,k)
PERCENTILE.INCPERCENTILE.INC(array,k)
PERCENTILE.EXCPERCENTILE.EXC(array,k)
QUARTILEQUARTILE(array,quart)
QUARTILE.INCQUARTILE.INC(array,quart)
QUARTILE.EXCQUARTILE.EXC(array,quart)
RANKRANK(number,ref,order)
RANK.AVGRANK.AVG(number,ref,order)
RANK.EQRANK.EQ(number,ref,order)
SLOPESLOPE(known_ys,known_xs)
SMALLSMALL(array,k)
STDEVSTDEV(number1,number2)
STDEV.PSTDEV.P(number1,number2)
STDEV.SSTDEV.S(number1,number2)
STDEVPSTDEVP(number1,number2)
TRENDTREND(known_ys,known_xs,new_xs,const)
Text
CHARCHAR(number)
CLEANCLEAN(text)
CODECODE(text)
CONCATENATECONCATENATE(text1,text2)
DOLLARDOLLAR(number,decimals)
EXACTEXACT(text1,text2)
FINDFIND(find_text,within_text,start_num)
LEFTLEFT(text,num_chars)
LENLEN(text)
LOWERLOWER(text)
MIDMID(text,start_num,num_chars)
PROPERPROPER(text)
REPLACEREPLACE(old_text,start_num,num_chars,new_text)
REPTREPT(text,number_times)
RIGHTRIGHT(text,num_chars)
SEARCHSEARCH(find_text,within_text,start_num)
SUBSTITUTESUBSTITUTE(text,old_text,new_text,instance_num)
TEXTTEXT(VALUE,format_text)
TRIMTRIM(text)
UPPERUPPER(text)
VALUEVALUE(text)