I’m a serial unsubscriber, absolutely ruthless when it comes to keeping my inbox in order. If I get a new ad or newsletter in my inbox, I immediately scroll to the end of it to click on the tiny “unsubscribe” link. I admit I take great pleasure in doing this without even looking at the ad.
Tag: lang:en
Posts in English.
Industrializing Machine Learning
I’ve been doing Machine Learning Industrialization for more than 2 years and I’m thrilled to see it featured by McKinsey as top 2 in its 2023 tech trends!
My time on the IBM Linux Impact Team, and legacy
In this extensive article, Jon “MadDog” delves into the behind-the-scenes narrative of how Linux and Open Source gained acceptance within the corporate sphere, eventually establishing itself as the dominant platform in today’s enterprise information technology. It has become the operating system powering contemporary cloud infrastructure and, most notably, has transformed into the primary methodology for driving software innovation.
Importance of Machine Learning Engineers, again
This is often the outcome when AI projects lack Machine Learning Engineers and rely solely on Data Scientists.
Apple did it again with Vision Pro
Apple has once again made waves with their latest release, the Vision Pro. The mere fact that this new device ensures legible text for its users sets it years ahead of competitors like HTC and Meta. Not to mention the array of groundbreaking sensors, user-friendly interface, and independence from a computer. It’s important to note that this is just the initial version, with much more to come.
Story of the first digital computers
The story of the first digital computers, from a phenomenon observed in Edison’s light bulb, through 2-bit logic operations, triodes and vacuum tubes, up to the ENIAC, capable of an astonishing 500 math operations per second and of running without failure for a maximum of 116 hours.
For comparison, your smartphone can do almost one trillion math operations per second with just a tiny fraction of the energy. That makes it 2 billion times faster than the first large and expensive commercial digital computers of the 1940s.
Veritasium nailed it again.
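The 2 billion figure is plain division, sketched here in Python. Both inputs are the rounded numbers quoted above; the variable names are only illustrative.

```python
# Rounded figures quoted in the post.
eniac_ops_per_second = 500
smartphone_ops_per_second = 1_000_000_000_000  # "almost one trillion"

# How many times faster the smartphone is.
speedup = smartphone_ops_per_second / eniac_ops_per_second
print(f"Speedup: {speedup:,.0f}x")  # Speedup: 2,000,000,000x
```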

Web scraping and site mirroring
Need to mirror an entire website? Use the httrack command, available in most Linux distributions. If the site requires authentication, give httrack a cookies.txt file exported from your browser.

MariaDB backups in one line
First allow the Unix user that will make backups (root, in my case) to access MariaDB without a password (this works only when connecting from the same host the server is running on):
GRANT ALL PRIVILEGES ON *.* TO `root`@`localhost` IDENTIFIED VIA unix_socket WITH GRANT OPTION;

Hybrid cloud is the way to go
It is about time for companies that consume these public cloud services to use them in a way that they can exit/leave/migrate easily.
LinkedIn Inferences About You

Export all your LinkedIn data (on a computer, select Me ➔ Settings & Privacy ➔ Data Privacy ➔ Get a copy of your data ➔ Larger data archive) and then check the Inferences_about_you.csv file.
As the file name says, it is how LinkedIn AI models see you. Do you have career stability? Are you in the early stages of your career? Are you a people or senior leader? Business owner?
These classifications are certainly used by recruiters to search for people. And you should use them to check if there are things you must change in your profile.
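If you have the file, a few lines of Python are enough to skim it. This is only a sketch: it assumes the export is a regular CSV with a header row and makes no assumption about which columns LinkedIn includes; the function names are mine.

```python
import csv

def read_inferences(path):
    """Return the rows of the exported CSV as a list of dicts keyed by header."""
    # utf-8-sig also copes with a byte-order mark at the start of the file.
    with open(path, newline="", encoding="utf-8-sig") as f:
        return list(csv.DictReader(f))

def format_row(row):
    """Render one row as 'column: value' pairs, whatever the columns are."""
    return " | ".join(f"{key}: {value}" for key, value in row.items())
```

Usage would be something like: for row in read_inferences("Inferences_about_you.csv"): print(format_row(row)).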
UPDATE: LinkedIn apparently isn’t providing this information anymore. It was being provided until a few days before my post.

Importance of Machine Learning Engineering
This diagram highlights the importance of Machine Learning Engineering for Data/AI projects and the community. And it doesn’t even show one of my favorite topics: software design patterns, an outrageously important subject that helps with code maintainability, extensibility, standards, organization and beauty, which in turn helps with (much) higher productivity of Data professionals.
Data Scientists should develop their software engineering skills
Yes, Data Scientists should develop their software engineering skills. Let me react to a LinkedIn post by Neil Leiser.
But Data Scientists can’t do it alone, or by themselves. Read on.
I see that software engineering and IT architecture are a touchy subject even amongst the best data scientists, usually because they came from other knowledge domains such as economics, statistics, pure math, physics, biology etc. This is a normal evolution. Data Science demands a broad skill set, sometimes too broad. Data Scientists need to handle Docker and HTTP APIs along with outliers, RMSE, ROC curves and Gaussian distributions. Go figure…
ML engineers — usually folks that have more software engineering background — should help here.
But the most important thing ➔ it is the mission of the CDO, tech lead or CTO with strategic vision to clearly detect these gaps and design a roadmap to handle them, not just with conventional training but also by encouraging mixed squads whose members will exchange skills and knowledge, leveraging multidisciplinary environments where everybody grows together.
GPT me

This is what GPT “knows” about me. More precisely, this is the sequence of words GPT generates when asked with that specific prompt.
First paragraph is 100% correct.
The second is kind of 50% (in)correct and outdated. I use Fedora, not Debian or Ubuntu; I’ve contributed to several FOSS projects, but never to Apache HTTPD; and I did work for IBM, but never for Red Hat.
In the third paragraph, it completely confused me with one of my relatives who has the same last name but a different first name.
Also, I think GPT would have a different perspective about me if posts on social media, such as Facebook, were part of its training dataset. But they can’t be, because Meta won’t allow open access to its platform even if I post openly there.
Clouds are super expensive
While clouds are the natural go-to choice for an early-stage startup, staying 100% in clouds with substantial infrastructure may sink a company as it and its infrastructure grow.
This study shows that the monthly infrastructure cost of clouds would be more than 10 times higher than colocation with self-designed infrastructure. Not to mention the tailor-made possibilities.
Your CTOs and tech leaders must provide clever ways to use public clouds, avoiding their typical lock-ins, so you can leave [and reduce vast amounts of infrastructure costs] whenever you may need.
Benefits of public clouds are flexibility and agility, not costs.
7 Habits of Highly Effective People by Stephen Covey, summary by getAbstract
I read the summary of this book on getAbstract. There is also an audio version of the summary on their page. Here is my personal copy.

Recommendation
In this updated edition of the late Stephen R. Covey’s bestseller, Sean Covey draws on ancient wisdom, modern psychology and 20th century science and wraps the mix in a distinctively American can-do program of easy-looking steps calling mostly for self-discipline. This classic – now in a new 30th anniversary edition with a foreword by Jim Collins – is a popular, trusted manual for self-improvement, although you still may find some prescriptions easier to agree with than to act upon.
iPadOS external display support
With the release of iPadOS 16.2 last December, M1-powered devices can now be used as beefed-up terminals, complete with an external physical keyboard, mouse/trackpad and an extended screen that can display content and apps different from the main iPad screen (as the photo shows).

The minimum device that supports this is the iPad Air 5th generation (2022), which already features a USB-C port instead of Lightning. On this port you can plug a dongle with HDMI output, power source and more USB ports to connect your human interaction devices. Or connect them through Bluetooth.
This opens the possibility for road warriors to have an even lighter and less expensive terminal with the iPad, instead of a regular (and problematic) laptop. Then, when at home or at the office, they can dock it to a KVM (keyboard, video, mouse) setup to get a more productive workstation.
And yes, I know Android phones have been able to do similar things for a long time. But it doesn’t get widespread or even real until this feature lands on the popular iPad.

Command Line in Windows
Command line on Windows (10+) nowadays doesn’t have to be only PuTTY to a remote Linux machine. In fact, many Linux concepts were incorporated into Windows.
Windows Subsystem for Linux
First, activate WSL. Since I enjoy using Fedora, and not Ubuntu, this guide by Jonathan Bowman helped me set up WSL exactly as I like.
Windows native SSH clients
Yes, it has tools from OpenSSH, such as the plain ssh client, ssh-agent and others. No need for PuTTY.
This guide by Chris Hastie explains how to activate SSH Agent with your private key. I’m not sure it is complete, since I haven’t tested yet whether it adds your key at session startup for a fully password-less experience. I’m still trying.
Windows Terminal
The old command prompt is very limited, as we know, and obsolete. Luckily, Microsoft has released a new, much improved, Terminal application that can be installed from the Store.

It allows defining sessions with custom commands such as ‘wsl’ (to get into the Fedora WSL container installed above), ‘cmd’ and ‘ssh’. I use tmux on all Linux computers that I connect to, so my default access command is:
ssh -l USERNAME -A -t HOSTNAME "tmux new-session -s default -n default -P -A -D"
The Windows Terminal app is highly customizable, with colors and icons. This repo by Mark Badolato contains a great number of terminal color schemes. Select a few from the windowsterminal folder and paste their JSON snippets into the file %USERPROFILE%\AppData\Local\Packages\Microsoft.WindowsTerminal_8wekyb3d8bbwe\LocalState\settings.json.

Data Scientist × Data Analyst
Analysts inform, explain and visualize DATA THAT EXISTS in order to help business executives make strategic decisions. Thus, data analysts live in business meetings, talk to a lot of people and create data visualizations to help others understand what is going on. Tools: SQL, BI, spreadsheets, PowerPoint.
Scientists infer and calculate INFORMATION THAT STILL DOESN’T EXIST, such as the future, usually in order to optimize each and every business transaction. Example: if you like this one product, you might also like that other product. Example: according to data from the surroundings, this house price should be around $X. Example: I learned what cars look like, so there is a 98% chance there is a car in this photo. Thus, they create or improve digital products using machine learning and applied statistics. To create such improved user experiences, data scientists first use advanced exploratory data analysis techniques and create data visualizations only for themselves, to better understand what is going on. Tools: SQL, Pandas, math and statistics, git, programming, containers, Linux.
Data analysts tend to have a more glamorous job, while the data scientist’s job is more oriented to hard skills. Both need to work with large amounts of information, such as tables with millions or billions of data points.
There is also the Data Engineer role, which is as important as these other data professions, and focused on data availability, consistency and performance.
Inspired by Gerson Lerner’s post, I thought I should give my take on the subject too.
New products featuring old USB connectors
22 years into the 21st century, new products still feature connectors from the previous century. From 1996, precisely, when this very old USB connector was released.
Product designers, please upgrade to USB-C, which is already 8 years old. It’s about time!

5G Download Speed
5G download speed at home in São Paulo today: 420 megabits per second (Mbps), equivalent to about 52 megabytes per second.
It means that it takes about 10 seconds to download 1 hour of hi-fi music without any compression. But since compression is everywhere, just 2 seconds will be enough.
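The back-of-the-envelope math can be checked with a short Python sketch. It assumes that “hi-fi without compression” means CD-quality PCM audio (44.1 kHz, 16 bits, stereo); that assumption is mine, not a measurement from the post.

```python
# Measured download speed from the post.
download_mbps = 420                     # megabits per second
download_mb_per_s = download_mbps / 8   # megabytes per second

# Assumption: uncompressed hi-fi = CD-quality PCM audio.
sample_rate_hz = 44_100
bits_per_sample = 16
channels = 2
audio_bits_per_s = sample_rate_hz * bits_per_sample * channels

one_hour_bits = audio_bits_per_s * 3600
one_hour_megabytes = one_hour_bits / 8 / 1_000_000
seconds_to_download = one_hour_bits / (download_mbps * 1_000_000)

print(f"{download_mb_per_s} MB/s")                      # 52.5 MB/s
print(f"{one_hour_megabytes:.0f} MB per hour of audio")  # 635 MB per hour of audio
print(f"{seconds_to_download:.1f} s to download")        # 12.1 s to download
```

Under that assumption the download takes roughly 12 seconds, in the same ballpark as the 10 seconds quoted above.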
Upload speed is 10 Mbps. Pretty good, though we know this probably won’t last long.
What 4G, 5G speeds do you get and where?
State of the Windows Laptop Market
The Windows-based laptop market is a bad joke of confusing, overlapping offerings. It operates almost like a scam on underskilled consumers because manufacturers try hard to increase their profit around a purely commodity product. The results are “creative” but quite useless features such as detachable keyboards, pens and tablet PCs. If you have one of those, think about the rare situations in which you actually used them in a comfortable way.
For a general use laptop, a $1000 MacBook Air has all the features you need, in order of importance: great high density screen (a.k.a. Retina display, most important feature, always), light and small and elegant, fast internal storage, outstanding global customer service, enough RAM (8GB minimum, 16GB recommended), modern connectivity with USB-C. Oh, and a good CPU too.
Don’t go for less than that, and be aware that a similar feature set in the Windows universe will have the same price, if not more. But it will be hidden under a pile of confusing, overlapping and oversized configurations.
This post was written for your private life laptop consumer self, to help you buy your next good laptop. Not for your corporate self.
“Free Market” is a myth

Insightful tweet by Robert Reich, Public Policy professor at UC Berkeley and Harvard:
The naturally occurring “free market” is a myth. The market is a set of rules organized and maintained by governments.
The real question isn’t free market or government — it’s whether the current rules favor the many or the few.
Java is the New Cobol
Java 18 was recently released and I can’t help reminding you that Java is the new Cobol: everybody has heard about it, many even have some legacy in production, it needs to be supported and it is important. But please don’t ask me to start any new project with Java, because there are much better things I can use today.

Passwordless Sign-in

Get ready to say goodbye to password managers or even to all your passwords. Thanks to FIDO, the industry is shifting to open-standard password-less authentication everywhere.
Those who have been using macOS and iOS credential management, integration and synchronization already have an idea of how it works across devices, apps and websites. But now the experience will be improved, extended and made even easier.
Power solution to rule them all
The one single power and connectivity kit needed in your laptop backpack.
① One USB-C power charger of 65W or more
② One USB-C 2m/6ft cable with Power Delivery
③ One USB-C kit of adapters to old USB and Micro USB
④ One USB-C adapter to Apple Lightning
This kit: Powers your modern laptop through USB-C. Charges your phone through Lightning or USB-C. Charges any other devices through their old USB ports. Connects all devices to one another.
Portable batteries are obsolete. Instead, use your large and powerful laptop battery to charge your phone on the road.
Caution with Streamlit
Streamlit (streamlit.io) is a lovely Python module that helps data scientists build interactive dataviz apps.
Use it when a BI platform is overkill, like this Streamlit dashboard that I wrote to manage my personal investments, or where there is no BI, such as in very small companies. Or where there are no interactive app developers to create a native app.
Streamlit proliferation in mid to large size companies might however be a bad sign of several things:
1️⃣ Application and/or integration developer’s job wrongly assigned to Data Scientists
2️⃣ Lack of a solid BI platform and practice
3️⃣ Siloed data that isn’t flowing due to lack of data streaming or API architecture
4️⃣ All the above.
Use Streamlit with caution; we don’t want it to become the new, data science-era spreadsheet for corporate reporting, with all the burden that spreadsheet proliferation has caused.
A Data Scientist’s time is best spent getting insights from Exploratory Data Analysis and then using them to model outstanding estimators and predictors. Definitely not writing nice-looking apps.
Impressions about Open Data Science Conference Boston 2022
Open Data Science Conference 2022 took place in Boston this week. The conference featured panels, workshops, presentations and a vendor expo. I attended all 3 days and here are some impressions.
Prefer Safari over Chrome
I can’t stand Mac users who use Google Chrome while they already have the Safari browser.
Safari is lighter, more concerned about privacy, better integrated with the platform and their other devices (iPhone etc), and smarter in password management.
I don’t even have Google Chrome installed on my Mac.
Good luck to Kyndryl
To all friends that I’ve worked with at IBM and that are now moving to Kyndryl, I wish you success and good luck. The Cloud and IT services opportunity will continue to be huge forever. The countdown you have promoted here was warm and vibrant.
For the still-at-IBM friends, please keep on building such a great company, one that always was and continues to be a brilliant reference to the world, not just to IT. IBM is an unforgettable school for me and for anybody else who has spent even just a minute working there.
Business worldwide, as we know it, is shaped by companies such as IBM, even if you’ve never heard about it (well, that’s quite impossible).
How programmers should record time
We, the data people, immediately identify a poorly designed system when we see it handling date and time as plain local time instead of the number of seconds since January 1st, 1970, 00:00 UTC.
- This post was published on 1,626,425,523 (UTC, always UTC).
- Jesus was born -62,399,513,432.
- Man visited the moon between -14,552,880 and 93,172,200.
- And so on…
Just your daily dose of nerdy facts…
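For the curious, here is a minimal Python sketch of the conversion in both directions, using only the standard library. The helper names to_epoch and from_epoch are mine; the two timestamps are taken from the list above.

```python
from datetime import datetime, timezone

def to_epoch(moment):
    """Seconds since 1970-01-01 00:00:00 UTC for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("naive datetime: attach a time zone before converting")
    return int(moment.timestamp())

def from_epoch(seconds):
    """Timezone-aware UTC datetime for a non-negative epoch value."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# The publication timestamp quoted above.
published = from_epoch(1_626_425_523)
print(published.isoformat())  # 2021-07-16T08:52:03+00:00
print(to_epoch(published))    # 1626425523

# Dates before 1970 simply become negative numbers. 13:32 UTC on
# July 16th, 1969 gives the first moon figure in the list.
print(to_epoch(datetime(1969, 7, 16, 13, 32, tzinfo=timezone.utc)))  # -14552880
```

Refusing naive datetimes is a deliberate choice in this sketch: a timestamp without a time zone is exactly the “plain local time” problem described above.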
Die, e-mail, die, die
Nobody here reads e-mails. Avoid sending e-mails. If you need to send an e-mail to someone, notify them on Slack so that they actually read it.
First week at a startup.
Die, e-mail, die, die. Finally!
What it means to be Driven by Data
I’ve seen companies saying they have Big Data because they implemented Hadoop or a data lake and maybe Spark.
That’s just wrong.
Big Data, or more precisely, to be Data Driven, is a state where the data a company produces can be reused, as soon as possible, to optimize itself. And there are many ways to reuse data: all meetings and decisions happen with abundance of data, or recently generated data instantly feeds machine learning algorithms to optimize transactions, just to name a few situations.
To be Driven by Data is part culture and part infrastructure. On the infrastructure side, IT teams still struggle with limited visions about how data should flow pervasively and how access should be granted. They worry about security and performance when they should fear missing out on the data opportunity.
Data Streaming is a recent breakthrough technology that is here to help with more fluent data access. For an agile and effective data architecture, Data Streaming is much more strategic and important than just a bigger data warehouse because it is the component that can unleash your data and finally make it useful.
What is Apache Spark
Apache Spark is like Python’s Pandas and is like SQL databases. It can manipulate datasets, filter, integrate, transform.
But Spark was designed from scratch with horizontal scalability and parallelism in mind, which makes it capable of handling datasets with billions or even unknown number of rows — even if a bit less flexible than Pandas.
This is not new in the industry. Enterprise editions of commercial SQL databases have been parallel and scalable for a very long time, while also being very expensive at all levels of the stack: service/support, software and hardware.
But Spark is free software. And it can use Hadoop, also free software, as scalable and highly available storage on cheap commodity hardware. In addition, it has a vibrant community and a democratic ecosystem of services and support.
As with all Open Source, Apache Spark changes the economic landscape of the massive data processing systems market, taking money out of a few proprietary HW and SW vendors and spreading it locally on people and support.

Design Patterns
Programming is the art of creating flexible engines that can be easily extended as new features are needed over time.

Experienced programmers use Design Patterns to help make an engine’s functions, features and structure (materialized as code) easily and clearly extensible.
Young programmers must learn and use Design Patterns, and Refactoring Guru is a very nice starting point.
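To make “easily extensible” concrete, here is a small Python sketch of one classic pattern, Strategy, combined with a registry. The export example and every name in it are hypothetical, chosen only for illustration.

```python
from typing import Callable, Dict

# Registry of export strategies: the "engine" only knows this mapping.
EXPORTERS: Dict[str, Callable[[dict], str]] = {}

def exporter(name):
    """Decorator that registers a function as an export strategy."""
    def register(func):
        EXPORTERS[name] = func
        return func
    return register

@exporter("csv")
def to_csv(record):
    return ",".join(str(value) for value in record.values())

@exporter("kv")
def to_key_value(record):
    return ";".join(f"{key}={value}" for key, value in record.items())

def export(record, fmt):
    """The engine: picks a strategy by name and stays untouched as formats are added."""
    try:
        return EXPORTERS[fmt](record)
    except KeyError:
        raise ValueError(f"unknown format: {fmt}") from None

print(export({"id": 1, "name": "Ada"}, "csv"))  # 1,Ada
print(export({"id": 1, "name": "Ada"}, "kv"))   # id=1;name=Ada
```

Adding a new output format means writing one more decorated function; the export function itself never needs to change.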
List of Hard Skills for Data Professionals
2020 list of desired hard skills for data professionals. From the most essential to the more difficult ones.
- The English language
- SQL
- Spreadsheets
- Descriptive Statistics (median, variance, correlation etc)
- Notions of Data visualization
- Notions of Time Series
- Handling computer files and folders (this one entered the list because we observed many people simply don’t have it)
- Notions of digital information storage (numbers and their limits, time, time zones, text, Unicode, compression)
- Probability
- Probability Distributions
- Linear and Logistic Regressions
- Python libraries ecosystem, pip, PyPi
- Python’s Pandas, DataFrame and Series wrangling
- Linux and the computer command line
- NoSQL, JSON, YAML, XML, SVG, APIs, HTTP, protocols and data representation
- Cloud and infrastructure as code
- Notions of symmetric and asymmetric cryptography, digital signatures and applications
- “Big data” systems (Hadoop, Spark)
- Software Engineering (classes, modularisation, versioning, containerisation, packaging, DevOps)
- Inferential Statistics (confidence intervals, hypothesis testing)
- Machine Learning algorithms for regression and classification
- Calculus and Numerical Calculus (integrals, derivatives)
- Natural Language Processing
- Computer vision
- Neural Networks
Please remember this list has only hard skills. Ethics, domain and industry knowledge, communication are very important soft skills that won’t fit in this list.
Generally speaking, the beginning of the list is where Data Analysts are (up to ≈ item 11). Data Engineers get up to the middle of the list (up to ≈ item 18). And Scientists cover the whole list.
There is also the following graph that I’ve produced:
Jupyter and Data Science on a Mac (without Anaconda)
macOS Catalina doesn’t ship with Python 3, only 2. But you can still get 3 from Apple, updated regularly through the system’s official update methods. You don’t need to get the awful Anaconda on your Mac to play with Python.
Python 3 is shipped by Xcode Command Line Tools. To get it installed (without the heavy Xcode GUI), type this in your terminal:
xcode-select --install
This way, every time Apple releases an update, you’ll get it.
A settings window will pop up; wait about 5 minutes for the installation to finish.
If you already have the complete Xcode installed, this step is unnecessary (you already have Python 3 installed) and you can continue to the next section of the tutorial.
Clean Old Python Modules
In case you already have Python installed under your user and modules downloaded with pip, remove them:
rm -rf ${HOME}/Caches/com.apple.python/${HOME}/Library/Python \
  ${HOME}/Library/Python/ \
  ${HOME}/Library/Caches/pip
Install Python Modules
Now that you have a useful Python 3 installation, use pip3 to install the Python modules that you’ll need. Don’t forget to use --user to get things installed in your home folder so you won’t pollute your overall system. For my personal use, I need the complete machine learning, data wrangling and Jupyter suite:
pip3 install --user sqlalchemy
pip3 install --user matplotlib
pip3 install --user pandas
pip3 install --user jupyterlab
pip3 install --user PyMySQL
pip3 install --user configobj
pip3 install --user requests
pip3 install --user seaborn
pip3 install --user bs4
pip3 install --user xgboost
pip3 install --user scikit_learn
But you might need other things, such as Django or other sqlalchemy drivers. Make yourself at home and install them with pip3.
For modules that require compilation and special libraries, say crypto, do it like this:
CFLAGS="-I/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/include" \
LDFLAGS="-L/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.7/lib" \
pip3 install --user pycrypto
Use Correct Python 3 Binary
For some reason, Apple installs many different Python 3 binaries in different places of the system. The one that gets installed at /usr/bin/python3 has problems loading some libraries, and instrumentation with install_name_tool would be required. So let’s just use the binary that works better:
export PATH=/Library/Developer/CommandLineTools/usr/bin:$PATH
Run Jupyter Lab on your Mac
Commands installed by pip3 will be available in the ~/Library/Python/3.7/bin/ folder, so just add it to your PATH:
export PATH=$PATH:~/Library/Python/3.7/bin/
Now I can simply type jupyter-lab anywhere in the terminal or command line to make it fire my browser and get a Jupyter environment.
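As a quick sanity check of the whole setup, a short script along these lines (a sketch, not part of the original procedure) shows which interpreter is actually running and which of the modules installed above cannot be found. The list uses import names, which differ from the pip names in two cases.

```python
import importlib.util
import sys

# Import names for the packages installed above. Note that PyMySQL
# imports as "pymysql" and scikit_learn imports as "sklearn".
MODULES = ["sqlalchemy", "matplotlib", "pandas", "jupyterlab", "pymysql",
           "configobj", "requests", "seaborn", "bs4", "xgboost", "sklearn"]

def missing_modules(names):
    """Return the names that the current interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print("Interpreter:", sys.executable)
print("Missing modules:", missing_modules(MODULES) or "none")
```

If the interpreter path is not under /Library/Developer/CommandLineTools, the PATH export above probably did not take effect in the current shell.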
More about Xcode Command Line Tools
Xcode Command Line Tools will get you a whole set of other useful developer tools, such as git, subversion, GCC and LLVM compilers and linkers, make, m4 and a complete Python 3 distribution. You can see most of its installation in the /Library/Developer/CommandLineTools folder.
For production and high-end processing I’ll still use Python on Linux with my preferred distribution’s default packages (no Anaconda). But this method of getting Python on macOS is the fastest and cleanest way to get you going on your own data scientist laptop without a VM or a container.